Skip to content

ahmedanees-m/pen-stack

Repository files navigation

PEN-STACK

The verification & grounding substrate for genome-writing AI — matured into a co-scientist

The foundation models generate; PEN-STACK checks. It tells you where in the genome you can safely and durably write, which enzyme can write it there, and how to design the write end-to-end — then verifies every design against rule-grounded mechanism, reports calibrated confidence, cites its reasoning, and says "out of scope" rather than guess. Every number comes from a validated tool; nothing is fabricated.

PyPI CI Publish codecov License: MIT Python 3.11+ Version Status Tests Lint: ruff Runtime: Docker Validation: pre-registered Genome-Writing Bench v0.3

Built on five prior, separately published repositories:

genome-atlas mech-class pen-score pen-assemble pen-compare


What is PEN-STACK?

PEN-STACK is a single, installable, pre-registered computational stack that builds the reference and design layer the genome-writing era lacks. It consolidates five earlier research projects into one citable package, then adds the two reference maps and the design engine the field was missing.

Genome editing changes a base or short stretch in place. Genome writing installs new information - inserting genes, flipping or excising kilobases, placing programmable landing pads. Writing is the harder, less-tooled, and more clinically transformative modality, and it is gated by questions that today have no canonical answer.

The problem, and the gaps PEN-STACK closes

Two questions gate every genome-writing project, and before PEN-STACK no resource answered them together:

Gap The problem today What PEN-STACK provides
Where can you write? Each lab re-derives an ad-hoc "safe harbour" shortlist from inconsistent criteria; published lists range from ~2,000 sites to 25, none predict expression durability from a learned model, none are writer-aware, most cover one cell type. The Writable Genome - a learned, cell-type-aware, writer-aware atlas scoring every locus for safety (genotoxicity risk) x durability (will the cassette stay expressed) x reachability (which enzyme can engage it).
What can write there, and how well? Enzyme capabilities are scattered across papers; no catalogue places all genome-writing families on common, measured axes with their targeting requirements. The Writer Atlas - 33,370 enzyme systems across 8 families on common measured axes, joined to the Writable Genome by a bidirectional cross-link.
How do I design the actual write? Destination, enzyme, cargo and delivery are interdependent and goal-dependent; no tool optimises them jointly. The Write Planner - inverse design that, given a goal and an edit_intent, returns ranked, traceable site x writer x cargo x delivery plans.
Where might my bridge-recombinase design go off-target? Bridge recombinases are the most programmable writers, but had no genome-wide off-target screening tool (a "CRISPOR" equivalent); their developers list this as future work. The bridge off-target engine (pen-bridge) - measured-data-validated screening that nominates and ranks candidate off-target locations (a screen, not a per-site risk calculator).

Everything is built on bulk-downloadable public data, runs on a single GPU, and is validated blind against a pre-registered, honest baseline before release.

🎓 v6.0.0 — "1.0 — First Stable"

The Closed-Loop arc is complete (7/7) and PEN-STACK has graduated to 1.0 — First Stable. The public API exercised across every surface — verify, safety, generative design, the twin, active experiment design, the build interface, the closed loop, the co_scientist, and the Genome-Writing Challenge — is documented and frozen with a deprecation policy (docs/STABILITY.md). Development Status :: 5 - Production/Stable. "First Stable" is earned, not declared — cut only after the closed loop was demonstrated (v5.12) and the benchmark went public (v5.13).

"1.0 — First Stable" is a commitment to API stability, not a claim of solving genetic engineering. The unknown funnel remains — made legible (scope flags, known-unknowns, honest baselines, no fabrication), not hidden.

What is new in v6.4 — Live Oracles (the foundation models actually execute)

The foundation-model oracles now run for real, not just defer: ViennaRNA (in-process), AlphaGenome (free DeepMind API — variant effect + regulatory tracks), Evo2-40B (NVIDIA hosted — DNA generation), ProteinMPNN / ESM3-open / RFdiffusion (local GPU model servers). Each is opt-in via PEN_STACK_ORACLE_NET=1; with the flag off, every oracle behaves exactly as before — and the no-fabrication invariant is untouched (generated outputs stay candidates, OOD inputs are flagged, a down backend defers).

A new execution + latency surface (GET /oracles, configs/oracles/execution.yaml) tells you the cost up front: instant / seconds / slow (~1–2 min, warn) / long_job. The cloud-A100 structure models (AlphaFold3 · Boltz-2 · Chai-1 · Protenix) are held on purpose — they run separately on a rented A100, never on the 16 GB VM and never in the request path. Arc STATE's perturbation outcome stays honestly deferred (it needs the State-Transition model + a reference scRNA pipeline). See docs/live_oracles.md.

What is new in v6.3 — The Hybrid Co-Scientist (grounded engine + general intelligence)

The chat now does three things at once without ever blurring them: it runs the engine for genome-writing requests, explains what the numbers mean, and answers general biology questions — each in its own lane with its own provenance, so a general-knowledge fact can never be mistaken for a PEN-STACK result.

Lane Trigger Source Provenance
🔬 Design "insert FIX at AAVS1 with AAV in hepatocytes" the engine (verify/safety/immune/twin) — guard ON "PEN-STACK · grounded"
📖 Explain "what does that 0.55 mean?" the metric guide (scale, direction, reference range, how computed) + the prior dossier "PEN-STACK · metric guide"
⚙️ Meta "how many enzymes? how is immunogenicity computed? how accurate?" the live capability facts (33,370 systems / 8 families / 5 immune axes …) — guard ON "PEN-STACK · about the engine"
🧠 General "hi", "what is AAV", "how does AAV work" the LLM's trained knowledge "General knowledge — not PEN-STACK-verified", + a pointer to what PEN-STACK can compute
  • The one rule that keeps the core intact: a number presented as a PEN-STACK result is engine-grounded (the guard runs on the first three lanes); a number from general knowledge is visibly labelled and, wherever PEN-STACK could compute it, redirected to the engine. We added a lane; we did not loosen the grounded one.
  • Numbers now come with meaning — every value is explained with its scale, what's good/bad, its reference range, and how it was computed (configs/metric_guide.yaml).
  • Conversation memory — follow-ups resolve against the prior answer (in-session until you refresh).
  • Workstream WS-HYBRID (pen_stack/web/{router,guide}.py + the rewritten llm.py); prereg/ws_hybrid.yaml.

What is new in v6.2 — The Web Platform (the human surface)

Post-1.0, the adoption surface for bench scientists: a complete, friendly web application — a grounded co-scientist chat plus structured feature pages — over the same typed v6.1 API the AI surface uses. The LLM narrates and routes but never sources a number; every quantitative answer renders with its confidence band, its provenance, and an explicit ledger of what PEN-STACK can't tell you.

Workstream What it adds Result
BACKEND pen_stack/web/server.py one FastAPI gateway that mounts the v6.1 engine surface (under /api) + adds the grounded /chat (+ SSE /chat/stream) + CORS + serves the built frontend
CHAT (the hard gate) pen_stack/web/{tools,llm}.py the grounded co-scientist: the engine computes every number (run_tools), the LLM narrates over the tool results (Ollama → Nemotron → deterministic), and a grounding guard strikes any number the model can't trace to a tool result (asserted in tests/unit/test_ws_chat.py)
FRONTEND web/ (React/Vite + Tailwind) the honest-UX libraryConfidenceBand · ProvenanceChip · ScopeLedger · SafetyBadge · ImmuneProfileCard — so a number is never shown without its uncertainty + provenance
PAGES 11 feature pages Co-Scientist · Site Finder · Writer Atlas · Designer · Verify · Delivery & Immunity · Digital Twin · Guardian · Experiments · Challenge · Scope & About
DEPLOY docker/web.Dockerfile + compose one-command self-host: a node:20 stage builds the frontend, a slim Python serves UI + API from one origin — docker compose up web ollama, open http://localhost:8000

The LLM never sources a number — it explains, compares, and routes over the engine's tool outputs, and the science runs (deterministic mode) with no LLM at all. See web/README.md and prereg/ws_{chat,frontend}.yaml. (Lowers the usability barrier; a real-data validation + a first lab user remain the standing bottleneck — no new science.)

What is new in v6.1 — The AI Integration Surface

Post-1.0, the adoption surface for AI builders: PEN-STACK is now self-describing — an external agent can ask, programmatically, "what can you do, and what do you refuse to answer?" and get a typed, honest answer it can route on.

Workstream What it adds Result
MANIFEST (the differentiator) pen_stack/api/manifest.py capability_manifest() (tools, all fabricates=False) + scope_manifest() — the known-unknowns + oracle scope cards as machine-readable data (what PEN-STACK refuses to answer)
OPENAPI server/api.py GET /capabilities + GET /scope + tool routes (/verify /safety /immune /generate /predict /suggest /session); auto OpenAPI 3.1 at /openapi.json
MCP agent/mcp_server.py resources pen-stack://capabilities + pen-stack://scope; engine tools; a hazardous design returns a structured refusal
EXAMPLE examples/ golden-path external_agent.py (REST), mcp_client.py (MCP), agent_tools.py (framework specs built from the live manifest + dispatcher)

Scope is data, not a disclaimer — the honesty machinery becomes an API, the thing that makes trustworthy autonomy something another system can build on. See Integrate PEN-STACK in your AI and prereg/ws_{manifest,openapi,mcp}.yaml. (Lowers the adoption barrier; outreach + a real result remain the standing bottleneck — no new science.)

What is new in v5.13 — The Standard (Genome-Writing Challenge + Co-Scientist II)

v5.13 (Closed-Loop arc, Cycle 7 of 7) makes PEN-STACK the field's reference and its most useful face: the accumulated bench tasks become the Genome-Writing Challenge — an open, recurring, held-out, reproducible benchmark others build to — while a co-scientist drives the whole loop for a working scientist, every output safe, legal, calibrated, cited, scope-ledgered, and immune-profiled.

Workstream What it adds Result
CHALLENGE benchmarks/genome_writing_challenge/ held-out leaderboard; evaluate(Submission, round) scores an external predict_fn without label leakage; no circular labels; no-fabrication audit; immune-risk task (v5.6)
COSCI2 agent/co_scientist.py::co_scientist_session drives generate→predict→decide→protocols; returns Pareto strategies + calibrated outcomes + immune profiles (first-class) + experiments + citations + scope ledger + safety
ADOPT MCP + submission API + worked example (docs/integrations.md) the integration surface is shipped; ≥1 external integration/submission depends on outreach (honest bottleneck)

python benchmarks/genome_writing_challenge/run.py scores the reference. See docs/challenge.md, docs/co_scientist_loop.md, docs/integrations.md, and prereg/ws_{challenge,cosci2}.yaml. (v6.0.0 "1.0 — First Stable" follows.)

What is new in v5.12 — The Closed Loop (autonomy Level 3)

v5.12 (Closed-Loop arc, Cycle 6 of 7) integrates every prior cycle into one continual design→build→test→learn loop — humans/lab in control, no fabrication, drift-aware — reaching autonomy Level 3 (the program's honest ceiling, not Level 5).

Workstream What it adds Result
LOOP loop/cycle.py run_loop orchestrates generate(v5.8)→decide(v5.10)→safety-gated export(v5.7+v5.11)→sim/real run(v5.11)→ingest(v4.5)→drift→learn; gates await the approver; autonomy_level=3, human_in_control=True
DRIFT loop/drift.py predicted (twin) vs observed → growing miscalibration widens uncertainty (inflate, don't over-trust)
CONTINUAL loop/continual.py recalibrate on admitted outcomes only; versioned + reversible (rollback_to); immune proxy → validated needs a CI'd measurement
DEMO+AUTONOMY loop_converges_faster_than_random · docs/autonomy.md convergence reported with CI; Level-3 criteria asserted (closed · human-gated · anomaly-flagging · no-fabrication); Levels 4/5 not claimed
BENCH bench v0.3.8 closed_loop hard gate loop integrity vs an ungated autopilot (no gates, no drift, no versioned beliefs)

The loop is Level 3 — closed, but with a human in control at every gate, NOT autonomous. See docs/closed_loop.md, docs/autonomy.md, and prereg/ws_{loop,continual,drift}.yaml.

What is new in v5.11 — The Build Interface (digital→physical bridge)

v5.11 (Closed-Loop arc, Cycle 5 of 7) makes designs executable and results ingestible — loop-ready, lab-optional, safety-gated, with the immune-risk profile attached as protocol metadata. PEN-STACK emits protocols and ingests results; it does not run experiments.

Workstream What it adds Result
PROTO build/protocol.py export_protocol runs verify() first; a safety-refused/illegal design raises ProtocolExportError; a cleared design → a DRAFT (Opentrons/PyLabRobot/cloud-lab) carrying the v5.6 immune profile
INGEST build/ingest.py a result → a quarantined measured Candidate; the only path into the curated world-model is the v4.5 gate (checks + human approval) — no auto-edit
SIMLAB build/simlab.py run_simulated samples from the v5.9 twin (+ noise), labelled SIMULATED; the loop export → sim → ingest runs without hardware
BENCH bench v0.3.7 protocol_safety hard gate cleared exports w/ immune metadata · hazard+illegal blocked · sim loop completes; an ungated exporter fails by construction

Protocols are drafts for human/lab review, never auto-executed; export is hard-blocked for anything the safety gate flags. See docs/build_interface.md and prereg/ws_{proto,ingest,simlab}.yaml.

What is new in v5.10 — The Experiment Designer (active learning / EIG)

v5.10 (Closed-Loop arc, Cycle 4 of 7) is the Learn brain of a self-driving lab: it turns "I'm uncertain" into "run this experiment next." It reads the calibrated v5.9 twin's uncertainty and the v5.6 immune labels, scores each candidate experiment by the information it is expected to yield, assembles a diverse batch, and proves on held-out data — with confidence intervals — that this learns faster than random/greedy.

Workstream What it adds Result
ACQ active/acquire.py EIG from the twin (≥0, monotone in uncertainty); immune-VOI rewards experiments that would validate a v5.6 proxy axis; deterministic, traceable
DESIGN active/design.py select_batch — diverse batch (acquisition − redundancy penalty), not k copies of the most-uncertain point; each carries expected info gain
VALIDATE active/validate.py retrospective active vs random/greedy learning curves with reps + bootstrap CI on the curve-area gap; beats random only if CI excludes 0, else reports not-yet-useful
BENCH bench v0.3.6 experiment_design hard gate gate = the Learn engine's honesty + falsifiability; a random selector (no acquisition, no falsifiable curve) fails by construction

Falsifiable by construction and lab-optional — it chooses informative experiments but does not run them (prospective benefit awaits a lab; no autonomy claim). See docs/experiment_design.md and prereg/ws_{acq,aldesign,alvalidate}.yaml.

What is new in v5.9 — The Digital Twin (calibrated outcome prediction)

v5.9 (Closed-Loop arc, Cycle 3 of 7) adds the missing layer — what does the cell do after the write? — predicted with calibrated honesty. The twin computes what mechanism allows, adds an in-distribution virtual-cell estimate (OOD-gated), screens immune outcome from the v5.6 profile, and is explicit about its boundary at phenotype. A hypothesis engine, not an oracle of truth.

Workstream What it adds Result
VCELL oracles/vcell.py + scope cards state/scgpt Arc STATE / scGPT under the OracleResult contract; perturbation prediction is a candidate, OOD-gated, deferred value never fabricated
MECH twin/mechanistic.py cassette_expression = promoter × copy × accessibility (closed form); physics where computable, NOT a phenotype
OUTCOME twin/outcome.py fuses mechanism + in-dist VC + v5.6 immune; interval widens under OOD; in-vivo durability conditioned on the grounded NAb axis; phenotype/in-vivo-magnitude scope-flagged
CAL twin/calibrate.py calibration reported two-sided (coverage + MAE-gap-vs-naive bootstrap CI); beats naive only if CI excludes 0, else the negative is reported
BENCH bench v0.3.5 outcome_prediction hard gate gate = the twin's honesty properties (two-sided cal + OOD widening + immune dim + phenotype out-of-scope); an overconfident predictor fails by construction

The interval is a heuristic band, not a trained conformal interval (no public perturbation-outcome calibration set; Arc's Virtual Cell Challenge shows models don't yet consistently beat naive baselines). See docs/digital_twin.md and prereg/ws_{vcell,mech,outcome,twincal}.yaml.

What is new in v5.8 — The Live Agent & Generative Designer

v5.8 (Closed-Loop arc, Cycle 2 of 7) turns PEN-STACK from a checker into a grounded designer: it generates candidate end-to-end writing systems, keeps only those that pass safety + legality + calibration (verifier-as-discriminator), and returns the Pareto frontier of real tradeoffs — with immunogenicity-risk now a grounded axis sourced from the v5.6 profile, not a placeholder.

Workstream What it adds Result
GEN design/{space,generate}.py generate → verify() discriminates; hazardous + illegal candidates discarded, never returned; survivors are calibrated, immune-profiled, output_kind="candidate"
PARETO design/pareto.py non-dominated frontier over efficiency/durability/safety/deliverability/neg_immune_risk/neg_cost; immune axis = worst-case per-axis from v5.6 (uncertainty band carried, in-vivo magnitude flagged)
ORCH agent/orchestrator_live.py plan → generate → oracle critique (cache-first) → verify() → refine; every number tool-sourced; seed-locked replay reproduces the trace
BENCH bench v0.3.4 generative_design hard gate grounded designer keeps only legal+safe+calibrated+immune survivors on a grounded-immune Pareto frontier; an ungrounded generator ships hazardous/illegal designs and fails by construction

A generated output is a candidate, never a claim; novelty is bounded by the oracles' validity and the rules' legality. See docs/generative_design.md and prereg/ws_{gen,pareto,orch}.yaml.

What is new in v5.7 — The Guardian (biosecurity / dual-use safety gate)

v5.7 opens the Closed-Loop arc (Cycle 1 of 7) by making PEN-STACK safe by construction: every design submitted to verify() first passes a biosecurity / dual-use screening gate. A design matching a select-agent, pandemic-pathogen, or controlled-toxin signature is refused or escalated; legitimate therapeutic designs pass untouched. This is orthogonal to the v5.1–v5.6 immune-risk profile — the Guardian asks "is this design itself hazardous/dual-use?", the immune profile asks "will the patient react?" — and both attach to every Verdict.

Workstream What it adds Result
SCREEN safety/{registry,screen}.py + configs/safety/hazard_registry.yaml version-pinned HazardRegistry; function_flag / taxon_flag / chimera_context / sequence_homology screens; the function screen catches AI-homologs (low identity, hazardous function) homology alone misses
POLICY safety/{policy,gate,audit}.py + configs/safety/policy.yaml SafetyVerdict {clear/flag/refuse/escalate}; ambiguous dual-use → escalate (human review); tamper-evident hash-chained audit; re-framing can't flip refuse→clear
INTEGRATE Verdict.safety; verify(design, actor=…) the gate runs first; a refuse short-circuits (design not scored further); no-fabrication holds
REDTEAM safety/redteam.py adversarial probes (AI-homolog, split-hazard, reframing, chimera) caught; reframing-stable
BENCH bench v0.3.3 safety_screening hard gate benign 0 false refusals · hazards refused/escalated · evasions never clear; beats a no-safety baseline (1.0 vs 0.33); 17/17 tasks, planner beats naive 13/13

Signatures are function/family/taxon-level only (public Pfam accessions + public control-list references: 42 CFR 73 / 7 CFR 331 / 9 CFR 121 / Australia Group / HHS P3CO/DURC) — no hazard sequences, no synthesis detail. All Pfam accessions were independently verified against EBI InterPro before reliance (one error, PF01375, caught and corrected). The gate is a defensive safeguard, not a guarantee, and not a substitute for institutional biosafety / IBC review. See docs/responsible_use.md, docs/biosecurity.md, and prereg/ws_{screen,policy,redteam}.yaml.

What is new in v5.6 — Immunology completion & calibration (anti-PEG · proxy honesty · unified profile)

v5.6 finishes the delivery-immunology arc and tells the truth about it. It adds the missing anti-PEG axis (gates LNP re-dosing), calibrates every proxy two-sided, and exposes a unified per-design immune-risk profile in which each axis keeps its own uncertainty and none is ever fused into one overconfident number.

Workstream What it adds Result
PEG planner/antipeg_oracle.py + configs/antipeg.yaml anti-PEG prevalence (25–72%) → preexisting_antipeg_score; gates re-dosing; abstains for non-PEG vehicles; range surfaced as uncertainty
CALIB validate/immune_calibration.py each axis labelled outcome-validated (ρ + CI excluding 0) or mechanistic/population proxy; with no public paired-outcome data, all axes are honestly labelled proxies (the label travels with the profile)
PROFILE planner/immune_profile.py + Verdict.immune_profile per-design vector of all axes, each with value + uncertainty + scope + validation label; collapsed_score is None (never fused); known-unknowns listed
EXT route/immune-privilege modifier + new known-unknowns eye/CNS immune-privilege as a documented qualitative modifier (no magnitude); CD4/MHC-II, pre-existing capsid T-cell, complement/CARPA registered as known-unknowns

The in-vivo magnitude and patient-specific titer stay declared known-unknowns. See docs/delivery_immunology.md and prereg/ws_{peg,calib,profile}.yaml.

What is new in v5.5 — Anti-vector seroprevalence oracle (the last immune axis, from data)

This completes the computable delivery-immunology axes. Pre-existing humoral immunity (B-cell / NAb) to a viral capsid is the one axis that cannot be computed from sequence — it is a population prevalence from natural exposure — so v5.5 grounds it in published serosurvey data (AAV: Calcedo 2009 / Boutin 2010; adenovirus: Mast 2010; HSV-1: Looker 2015). preexisting_score = 1 − midpoint(seroprevalence)/100, with the literature range surfaced as native uncertainty.

serotype (vehicle) NAb seroprevalence pre-existing score
Ad5 → HDAd 40–90% 0.35
AAV (aggregate) → AAV 30–60% 0.55
HSV-1 → HSV 50–70% 0.40
VSV → lentivirus 0–5% 0.975

Folded into the pre-existing axis for in-vivo vehicles (muted for ex-vivo, where serum NAb can't reach ex-vivo cells); non-viral → 1.0 by mechanism. It is a population prevalence — not a given patient's NAb titer (a known-unknown). See pen_stack/planner/seroprevalence_oracle.py, configs/seroprevalence.yaml, prereg/ws_seroprev.yaml, and the seroprevalence scope card.

With v5.5, four of the five delivery-immunology axes are grounded in data or sequence — genotoxicity (VISDB×COSMIC), adaptive/CD8 (MHCflurry), innate (CpG/dsRNA), pre-existing/NAb (serosurveys) — each abstaining rather than fabricating, with the in-vivo magnitude always a declared known-unknown.

What is new in v5.4 — Computed innate-sensing scorer (completes the computable immune axes)

The third computed delivery-immunology signal (after v5.2 genotoxicity and v5.3 capsid epitope load). Innate sensing of a delivered nucleic acid is computed directly from the cargo sequence, covering every cargo form. It is a sequence-intrinsic motif-load signal; the realized in-vivo innate response stays a known-unknown, and the mRNA score is honestly partial (the dominant lever — nucleoside modification — isn't derivable from sequence).

Cargo form Pathway Computed from sequence Score
DNA (AAV / HDAd / HSV / plasmid) TLR9 / cGAS CpG observed/expected ratio max(0, 1 − CpG_O/E) — vertebrate genome ~0.2 tolerated, non-depleted DNA → 1 stimulatory
mRNA (LNP-mRNA / electroporation) TLR7/8 + RIG-I/MDA5/PKR U-fraction + ViennaRNA dsRNA pairing partial / extrapolating (nucleoside modification out of scope)
RNP (eVLP / electroporation) minimal (transient gRNA) ~0.9 by mechanism

verify() surfaces it as a cargo_innate_sensing flag whenever a cargo_seq is supplied. See pen_stack/planner/innate_sensing.py, prereg/ws_innate.yaml, and the innate_sensing scope card.

What is new in v5.3 — Computed capsid epitope-load oracle (covers all vectors)

v5.2 computed genotoxicity only touches integrating vectors. v5.3 brings the NetMHC-style calculation to the adaptive (CD8 T-cell) axis: the fraction of a viral vector's capsid/envelope presentable across a frequent HLA-I panel (MHCflurry), so the computed immune signal now covers all 8 vehicles — 5 viral computed, 3 non-viral by mechanism. It is a population-level, sequence-intrinsic presentation signal; the realized patient-HLA-specific T-cell response stays a known-unknown (and it is CD8/MHC-I only, not antibody).

Workstream What it adds Result
EPITOPE build scripts/p53_build_epitope_oracle.py → committed configs/capsid_epitope_oracle.yaml per viral capsid: epitope_fraction_strong over 9-mers × 12 HLA-I alleles (MHCflurry %rank ≤ 0.5), from UniProt-verified sequences (AAV2 VP1, Ad5 hexon, VSV-G, HSV gD/gB); MHCflurry stays on the VM, only the summary ships
EPITOPE oracle planner/capsid_epitope_oracle.py (OracleResult) capsid_immune_score = 1 − epitope_fraction_strong; non-viral → 1.0 by mechanism; abstains when no sequence
wired into the adaptive axis folded only for in-vivo vehicles AAV2 least epitope-dense (0.72), Ad5 hexon among the most (0.82) — documented adaptive ordering reproduced from sequence; ex-vivo lentivirus's intrinsic VSV-G load is reported but muted (host barely sees it ex vivo)

See prereg/ws_epitope.yaml and the capsid_epitope scope card.

What is new in v5.2 — Computed genotoxicity oracle (data, not a documented tier)

v5.1 scored genotoxicity as a documented low/moderate/high tier. v5.2 makes it computed from data for integrating vectors: the observed enrichment of a vector class's integration sites near COSMIC oncogenes (VISDB integration catalogues × the Phase-1 COSMIC-CGC oncogene annotation), surfaced through the v4.0 OracleResult contract. The in-vivo clonal / leukemogenesis outcome stays a known-unknown — this is a relative integration-preference signal, not a per-patient oncogenesis probability.

Workstream What it adds Result
GENOTOX build scripts/p52_build_genotox_oracle.py → committed configs/genotoxicity_oracle.yaml per integrating class: P(site within 50 kb of a COSMIC oncogene), enrichment vs background, CI, n — from VISDB × COSMIC CGC v104 (raw data stays on the VM; only the auditable summary ships)
GENOTOX oracle planner/genotoxicity_oracle.py (OracleResult, output_kind="baseline") genotox_score = min(1, 1/enrichment); non-integrating → 1.0 by mechanism; abstains when it has no computed class (never fabricates); small-n classes flagged extrapolating
wired into v5.1 balance safety_efficacy_profile() prefers computed genotox, falls back to the documented tier lentiviral 2.08× vs gammaretroviral 5.65× oncogene-proximity enrichment — reproduces the lentivirus-safer-than-gammaretrovirus ordering from data; computed LV score (0.48) validates the v5.1 documented tier (0.5)

See prereg/ws_genotox.yaml and the delivery_genotoxicity scope card.

What is new in v5.1 — Delivery immunology (the safety↔efficacy balance)

v5.1 makes the delivery palette's safety↔efficacy tradeoff legible and user-weightable. Every vehicle now carries a documented, cited, qualitative immune + safety + efficacy profile — so you can ask for a balance (AAV is safe by integration but neutralizing-antibody/pre-existing-immunity limited; lentivirus is a highly efficacious integrator but its genotoxicity is the dominant concern). Crucially, the in-vivo immune magnitude stays a declared known-unknown — v5.1 surfaces documented priors, it does not predict a patient-specific immune response.

The full delivery-immunology story (v5.1 → v5.5), with every axis, method, and outcome, is in docs/delivery_immunology.md. By v5.5, four of the five immune/safety axes are computed from data or sequence (genotoxicity, adaptive/CD8, innate, pre-existing/NAb) rather than hand-typed tiers.

Workstream What it adds Result
IMMUNE config immune_safety block on all 8 vehicles in configs/delivery_vehicles.yaml documented ordinal (low/moderate/high) priors for pre-existing immunity, neutralizing antibody, innate/adaptive immune, genotoxicity, efficacy — every immune_doi Crossref-verified and in the curated-DOI set
IMMUNE planner planner/delivery_immunology.pysafety_efficacy_profile() / recommend_delivery() two separate safety sub-axes (immune_score reversible vs genotox_score permanent), never collapsed; headline safety_score = min(...) (worst-axis); ranks the palette along the safety↔efficacy frontier by a user weight
IMMUNE verify Verdict.delivery_profile + delivery_immune_profile scope flag verify() surfaces the documented tradeoff for a chosen vehicle, always attaching the in_vivo_immunogenicity known-unknown flag — never adding confidence, never predicting a magnitude

See prereg/ws_immune.yaml.

What is new in v5.0 — the Co-Scientist (smart because it is grounded)

v5.0 matures the reasoning layer on top of everything beneath it. Given a goal and an intent, PEN-STACK returns a small set of materially distinct, ranked, fully-traceable strategies — each verified, calibrated, cited, and scope-ledgered — while the no-fabrication guarantee holds by construction: the reasoning layer proposes and critiques, but every number still comes from a validated tool or oracle. Intelligence rises while groundedness never falls.

Workstream What it adds Result
PLAN + MULTI agent/co_scientist.pypropose_strategies() / deliberate() 2–3 materially-distinct strategies (≥2 design axes differ — measured, not reworded), each independently legal + confidence-tagged; deliberative planner benchmarked vs the deterministic baseline
CRIT + SCOPE2 self-critique/revise loop + scope ledger the critic only flags + swaps (never invents a number); revisions are re-verified and falsifiable (improve flawed plans illegal→legal, never touch clean ones); every recommendation carries a complete scope ledger itemising the known-unknowns
CITE + GEN agent/cite.py — cited rationale + scoped generalisation citations are drawn from the curated world-model (resolve by construction); a guard rejects any hallucinated DOI; adjacent tasks are grounded-or-refused
central gate co_scientist_grounded bench (v0.3.2) grounded rate 1.0 vs ungrounded 0.0; no-fabrication holds across the full reasoning stack (asserted)

See docs/co_scientist.md and prereg/ws_{plan,crit,cite}.yaml.

What is new in v4.5 — the Living World-Model (a knowledge graph that keeps itself current)

v4.5 promotes the flat atlas/WT-KB/crosslink tables into a queryable knowledge graph: writers, loci, cargo, delivery vehicles, cell types, write types and measured outcomes are typed nodes joined by typed edges, each carrying its provenance, its uncertainty, and the scope within which it holds. An agent answers a multi-hop design question in one grounded traversal, and the graph stays current through a gated loop — new literature evidence is proposed as candidate edges and admitted only through a validation/human gate, never auto-merged.

Workstream What it adds Result
G — knowledge graph pen_stack/graph/{schema,build,query} — typed nodes + provenance/uncertainty/scope-tagged edges, built from the v4.0 curated tables; REST POST /graph/query + MCP graph_query multi-hop design queries return fully provenanced paths (the answer is the path); deliverable_by edges reproduce the v3.3 verifier with 0 parity mismatches
MON — gated living loop pen_stack/graph/ingest.py — PEN-MONITOR emits candidate edges; quarantined; admitted only via gate_admit(approved) with a versioned record no process auto-edits the curated truth (Principle 1, asserted); back-test admits the recent ISPpu10 bridge system only through the gate
CT — cell-type expansion Tier-A cell types (iPSC/ESC, primary T cells, hepatocytes) as nodes with coverage cards + Tier-B roadmap partial coverage degrades gracefully (confidence capped, raw reported); cross-cell-type queries OOD-labelled (v3.2 finding); Tier-B documented, never silently extrapolated
BA — graph reasoning bench graph_multihop_reasoning (bench v0.3.1) graph reasoning accuracy 1.0 vs ungrounded 0.0; every answer grounded by a provenanced path; no-fabrication holds

See docs/world_model.md and prereg/ws_{graph,mon,ct,ba_v45}.yaml.

What is new in v4.0 — the Oracle Mesh (sitting on top of the foundation models)

v4.0 makes PEN-STACK the composition + verification layer over the biomolecular foundation models. It wraps AlphaGenome, Evo2, AlphaFold3, Boltz-2, Chai-1, Protenix, ESM3, RFdiffusion and ProteinMPNN under one contract that carries each model's provenance, native uncertainty, and a scope card stating what it is valid for — then routes their outputs through the rule-grounded verifier and the calibrated trust layer. A generated sequence or structure is always a candidate to be checked, never a claim. For the writer enzyme itself, v4.0 builds verification, not invention: proposed/variant writers are scored against measured DMS data and predicted structure, recovering known enhanced variants blind and refusing to assert activity for anything unsupported.

Workstream What it adds Result
O — the oracle mesh pen_stack/oracles/OracleResult{value, provenance(model+version), native_uncertainty, scope_card, output_kind}; adapters for genome / structure / protein-design / RNA / energetics; deterministic version-pinned cache one contract; generative output = candidate (as_claim() raises — the pen-assemble lesson in code); AlphaGenome OOD-gated; cross-oracle disagreement widens the interval; ViennaRNA + energetics real
WV — writer verification atlas/writer_verify.py — DMS- + structure-grounded variant scoring; candidate critique wired into verify() recovers the known enhancers (N322P / H50K / R278M) above measured-worse controls; unmeasured variants flagged, not claimable; a generated writer is critiqued (fold/active-site/deliverable/reachable), never returned as a working pen
ATLAS — mesh + delivery oracle wgenome/mesh_features.py (OOD-gated feature hook + honest blind re-validation) + a computable AAV packaging-margin delivery rule atlas re-validation reports parity vs v3.x when oracles are deferred (delta 0.0, never hidden); titre-margin flag fires near the AAV capsid limit; immunogenicity magnitude stays a scope flag

See docs/oracles.md, docs/writer_verification.md, and prereg/ws_{o,wv,atlas}.yaml.

What is new in v3.4 — the Environment (a place to train and grade genome-writing AI)

v3.4 makes PEN-STACK the surface an AI agent can be trained and graded in, the counterpart to v3.3's verifier (the surface for checking): a Gymnasium environment whose every action is checked by the rule-grounded verifier and whose reward is the legal, calibrated plan score; Genome-Writing Bench v0.3 with multi-write-type and adversarial robustness probes; and a demonstration of whether plan-confidence actually predicts documented outcomes. The environment is an interface + evaluation harness (near-one-shot decision) — no claim that a learned policy beats the deterministic planner.

Workstream What it adds Result
ENV — the environment full gymnasium.Env: 5-stage MDP (write_type → site → writer → cargo → delivery), verifier-driven step validity, reward = legality gate × L4 calibrated plan score, a reserved abstain action for justified refusal; env/policies.py (random + greedy-planner) passes check_env; greedy(planner) ≥ random and greedy-legal on the frozen seed set (sanity, not a learning claim)
BENCH — Bench v0.3 multi_write_type_legality (route + judge legality across all 6 non-insertion write types) + adversarial_robustness (T13–T16: out-of-scope-in-disguise, contradictory constraints, prompt-injection, distribution-shift) multi-write-type accuracy 1.0 vs ungrounded 0.0; verifier-backed agent passes 4/4 adversarial probes vs an over-confident baseline 0/4; no-fabrication holds even under prompt injection
CAL — outcome-calibration validate/outcome_calibration.py: plan-level reliability diagram + ECE + bootstrap-CI selective prediction on the DOI writer panel honest result — useful for ranking (high-confidence 0.30 vs low-confidence 0.0 documented-choice recovery, gap CI95 [0.17, 0.43], monotone) but poorly calibrated in absolute terms (ECE 0.71): high confidence narrows the feasible field, it does not uniquely identify the documented choice

See docs/environment.md, the v0.3 benchmarks/genome_writing_bench/LEADERBOARD.md, and prereg/ws_{env,bench,cal}.yaml.

What is new in v3.3 — the Verifier (a type checker for genome writes)

v3.3 lifts the laws of genome writing out of code into a versioned, machine-readable rule base and exposes a single verify(design) → Verdict call: submit a proposed write and get back legal / illegal + the named violated rule + a calibrated confidence + a scope flag — over Python, REST (POST /verify), and an MCP tool (verify_write) any AI agent can submit to. PEN-STACK becomes the layer that checks what the foundation models generate.

Workstream What it adds Result
R — rule base + solver the laws lifted into configs/rules/*.yaml (9 rules: reachability, fold, payload, multiplex, delivery), each id/kind/mechanism/param/citation/test; a solver returning legality + named reasons a parity test proves the rules reproduce the prior in-code decisions (relocation, not behaviour change); positives legal, negatives rejected by the correct named rule
D — delivery palette the AAV-only assumption replaced by an 8-vehicle palette (AAV single/dual, lentivirus, HDAd ~35 kb, HSV amplicon >100 kb, LNP-mRNA, eVLP, electroporation) with capacity/integration/cargo-form/DOIs hard rejects (cargo>capacity, RNP-into-DNA-only-vehicle, non-integrating-goal+integrating-vehicle); immunogenicity magnitude declared out-of-scope, never predicted
ROUTE — write-type router the fixed insertion chain becomes one sub-graph of a router over insertion / excision / inversion / replacement / regulatory-rewrite / landing-pad / multiplex each type routes to its rule sub-graph; unsupported/ambiguous types defer, never guess
V — verification service verify(design) → Verdict over Python/REST/MCP; legality (rules) + confidence (v3.2 L4) + scope, kept as distinct axes every Verdict carries legality + (confidence ∨ abstention) + scope; no fabrication (every number tool-sourced)
BA — bench + agent Bench v0.2.1 adds T12 rule-grounded legality-with-explanation; the agent submits its own plan to the verifier verifier verdict+reason accuracy 1.0; an ungrounded judge cannot cite a rule (0.0) — the verifier uniquely supplies grounded reasons; no-fabrication intact

See docs/verify.md, docs/rules.md, docs/delivery.md.

What is new in v3.2 — a calibrated, self-aware co-scientist

v3.2 makes the genome-writing funnel trustworthy: every value the funnel returns now carries a calibrated confidence, an extrapolation flag, and — where the biology is beyond any tool here — an explicit "out of scope." The LLM may plan, but ideas pass through computable filters, and the system says how much to trust each number and where the edge of its knowledge is. Every workstream is pre-registered (prereg/ws_{uq,ep,mc,ba}.yaml, SHA-locked) and reports its honest negatives.

Workstream What it adds Honest headline result
UQ — calibrated uncertainty + OOD conformal prediction intervals / sets over the existing heads (no retraining), an out-of-distribution detector, and selective prediction calibrated UQ is useful on the expression axis: the durability expression interval covers 0.895 vs 0.90 nominal on held-out chromosomes (within tolerance) and risk-coverage accuracy rises 0.739→0.930 under abstention. On the silenced axis it is informative-in-name-only at this N — the set covers 0.996 with mean size 1.93 of 2 (≈ the full label set), because the head is weak (we say so plainly). OOD fires strongly on a real chromatin-state shift (euchromatin→heterochromatin AUROC 0.98) but is weak across biological context — K562→HSPC 0.72, K562→HepG2 0.65, even cross-species mESC→human 0.56 — because chromatin-mark distributions barely move across cell types/species; reported as a heuristic feature-space-novelty signal, not a guarantee
EP — epistemic scope a three-tier status (grounded-confident / grounded-extrapolating / not-computable) on every output, plus a known-unknowns registry + scope matcher out-of-scope probes deferred 1.0, in-scope false-defer 0.0 (zero fabrication); the no-fabrication hard gate still holds. The unknown funnel (structure→phenotype, in-vivo immunogenicity, long-term durability, epistasis, polygenic, germline) is made legible, not closed
MC — mechanistic filters a hard target-site/PAM/att-site reachability reject, vehicle-specific delivery-sequence penalties, and an off-target energetics model positive+negative target-site controls 9/9 (a physically impossible writer–site pairing is rejected); off-target energetics beats the 0.77 baseline at AUROC 0.88 on the comparable (core-disrupted) construction and ships as the default ranker — but a reviewer-driven re-run shows that gap is mostly the core-penalisation artifact: with the core held matched, the non-core substitution-identity gain is real but modest (Δ≈0.04: 0.687 vs 0.646); both AUROCs carry a favourable-negative-set caveat
BA — bench v0.2 + uncertainty-aware agent four trust tasks (T8 calibration, T9 selective prediction, T10 OOD honesty, T11 out-of-scope) + the agent emits confidence + epistemic status + abstains the uncertainty-aware agent beats an over-confident baseline 4/4 on the trust tasks; the leaderboard now separates trustworthy agents, not just grounded ones

Optional: a thin Gymnasium environment interface (pen_stack/env/, [env] extra) for agent-developer interoperability — interface only, no RL superiority claimed. See docs/uncertainty.md, docs/scope.md, docs/mechanistic_constraints.md.

What is new in v3.1

v3.1 hardens the honesty of the planning benchmark, surrounds the models with strong baselines, adds a predicted-structure safety axis, and ships the first benchmark and grounded agent for the genome-writing side of the field. Every workstream is pre-registered (prereg/ws_*.yaml, SHA-locked) and reports its honest negatives, not just its wins.

Workstream What it adds Honest headline result
A - De-circularized benchmark (gate) retires the circular targeted-intent recovery@k; the headline is now blind safe-harbour discovery, on a gold set scaled from 5 to 16 loci blind GSH discovery on 16 curated loci: AUROC 0.68 (95% CI 0.53-0.82); validated-only (N=8) 0.70 (CI 0.48-0.91, underpowered) vs safety-only 0.51 - a weak, honestly-bounded signal (the 0.92-on-5 was fragile). The full Pellenz-2019 35-site set is also included as a separate exploratory tier and scores near chance (0.54) - the model does not over-rank weak computational candidates
B - Strong baselines + safety metric switch endogenous-expression baseline, multi-mark ablation, published GSH rule-set; safe-harbour discrimination is the primary safety metric headline is the learned model's absolute discrimination: writability AUROC 0.68 (95% CI 0.53-0.82, N=16). The published distance rule is reported as a qualitative failure case, not a delta to beat - it scores at/below chance (curated 0.51; validated-8 0.48) because validated harbours are intragenic (AAVS1/PPP1R12C, CCR5), so a "far-from-genes" prior mis-ranks them; the learned-minus-rule delta is kept only as a non-significant diagnostic. The circular genotoxic_cis AUROC is demoted to a labeled diagnostic
C - AlphaGenome integration predicted sequence tracks + a predicted 3D structural-risk axis (Hi-C contact-map deltas) via the hosted AlphaGenome API per-track transfers well (HepG2 ATAC 0.91), but the composite score degrades from predicted tracks, so the measured atlas stays the backbone (flagged)
D - Cargo Polish scores the insert for silencing/instability triggers (CpG islands, GC, cryptic splice, MFE, silencers) directional: high-CpG bacterial cassette 0.75 vs CpG-depleted 0.0, every flag carries a fix
E - Genome-Writing Bench v0.1 + PEN-Agent the first benchmark for the writing side, plus a grounded agent that cannot fabricate planner beats the naive baseline 3/3; the grounded agent reaches the planner's numbers only by grounding (0 fabricated). T7 ungrounded contrast: the same models with no tools fabricate 100% of tool-only values under a naive prompt (qwen2.5:7b, Nemotron) - so the bench separates grounded from ungrounded agents, not just "did it call the tool"
F - Local recalibration / private-data adaptation recalibrate or fine-tune the released models on your own assays, in-container, behind a validation gate the adapted model activates only if it beats the released model AND a no-skill baseline; the released model is provably unchanged
G - Multiplex + guide QC a pairwise translocation-risk screen for multi-edit plans, and a bridge-RNA guide ranker DSB-free recombinase plans carry ~zero translocation risk by construction; the guide-QC ranker is validated by a synthetic positive-control unit test (hand-constructed guides each tripping one failure mode rank below a clean control) - this tests the ranking logic, not real guide outcomes

The Genome-Writing Bench (workstream E) is v3.1's adoption vehicle: a one-command, SHA-locked, leaderboard benchmark with deterministic scorers and no circular labels. See benchmarks/genome_writing_bench/ and docs/positioning.md.

Architecture

At 1.0, PEN-STACK is a closed design→build→test→learn loop standing on two reference layers, the oracle mesh, and the data foundation — with the verifier at the centre as the discriminator: nothing is generated, built, or learned unless it passes safe + legal + calibrated, and no number is fabricated. The loop stops deliberately at autonomy Level 3 (a human in control at every gate).

                  +-----------------------------------------------------------------------+
                  |  THE FACES (v5.13 / 1.0)                                               |
                  |  Co-Scientist (drives the loop) . Genome-Writing Challenge (held-out)  |
                  |  . MCP server . REST API . Streamlit web app                           |
                  +-----------------------------------^-----------------------------------+
                                                      |
 +----------------------------------------------------+---------------------------------------------------+
 |  THE CLOSED LOOP  (v5.12 - autonomy Level 3; a human in control at safety / build / belief-admission)  |
 |                                                                                                        |
 |   GENERATE ----> PREDICT ----> DECIDE ----> [ SAFETY ] ----> BUILD ----> INGEST ----> LEARN            |
 |   v5.8           v5.9          v5.10         v5.7            v5.11       v4.5 gate    v5.12             |
 |   generative     digital       experiment    GUARDIAN        protocol    (no auto-    continual         |
 |   designer       twin          designer       gate           export      edit)        + drift          |
 |   (Pareto)       (OOD-gated)   (EIG+immVOI)   (refuse=stop)  (DRAFT)                  (versioned)       |
 |                                                                                                        |
 |   Every candidate is DISCARDED unless it passes THE VERIFIER below.  No number is fabricated.          |
 +----------------------------------------------------+---------------------------------------------------+
                                                      |
                  +-----------------------------------v-----------------------------------+
                  |  THE VERIFIER   verify(design) -> Verdict        (the discriminator)   |
                  |  legality (v3.3 rules)  .  SAFETY gate (v5.7, runs first)  .            |
                  |  calibrated confidence (v3.2)  .  per-axis immune-risk profile (v5.6)   |
                  +-----------------------------------^-----------------------------------+
                                                      |
 +--------------------------------+-------------------+----------------+--------------------------------+
 |  WRITABLE GENOME  (Paper 1)    |  WRITER ATLAS  (Paper 2)           |  ORACLE MESH  (v4.0)            |
 |  learned safety x durability   |  33,370 systems on measured axes   |  AF3 . Boltz-2 . Chai-1 . ESM3  |
 |  x reachability                |  + Writer-Targeting KB (8 families)|  . Evo2 . AlphaGenome . RFdiff. |
 |  -> writability profile        |      <----- reachability ----->    |  ProteinMPNN . v5.9 STATE/scGPT |
 |  + WRITE PLANNER (Paper 3)     |  + BRIDGE OFF-TARGET ENGINE (P4)   |  one OracleResult contract,     |
 |    inverse design (edit_intent)|                                    |  OOD-gated, generative=candidate|
 |  + DELIVERY IMMUNOLOGY (v5.1-6) per-axis immune-risk profile (each axis validated-or-proxy, never collapsed) |
 |  + WORLD-MODEL GRAPH (v4.5)     typed provenanced edges, gated living loop (propose-only)              |
 +--------------------------------+------------------------------------+--------------------------------+
 +----------------------------------------------------------------------------------------------------+
 |  DATA FOUNDATION  (bulk-downloadable, public)                                                       |
 |  hg38 . ENCODE/Roadmap chromatin . Hi-C/LADs . TRIP position effects . RID/VISDB/MLV integration .   |
 |  clinical genotoxic CIS . COSMIC . DepMap . gnomAD . GTEx . UniProt . Pfam/InterPro .                |
 |  bridge off-target + DMS (Perry 2025) . anti-vector + anti-PEG serosurveys (immunology)              |
 +----------------------------------------------------------------------------------------------------+

Reading the loop. A goal enters at the top. The generative designer (v5.8) proposes candidate writing systems; the twin (v5.9) predicts their calibrated outcomes; the experiment designer (v5.10) chooses the most informative ones to test; the Guardian (v5.7) refuses anything hazardous; the build interface (v5.11) exports a safety-gated protocol DRAFT and ingests results through the v4.5 gate; and continual learning (v5.12) recalibrates, drift-aware, versioned and reversible. Each stage is gated by the verifier — the single discriminator that keeps the whole loop honest. Two faces sit on top: a co-scientist that drives the loop for a working scientist, and the Genome-Writing Challenge, an open held-out benchmark others build to.

How it works

PEN-STACK is organised as the data + reference layers, the oracle mesh, the verifier (the discriminator), the closed loop, and the faces — each layer below feeding the one above, with the verifier keeping the whole loop honest.

Component Module Role Status
Writable Genome (flagship) pen_stack.wgenome learned per-locus safety x durability x reachability Paper 1
Writer Atlas (companion) pen_stack.atlas, .mech, .score cross-family enzyme catalogue + Writer-Targeting KB Paper 2
Cross-link pen_stack.atlas.crosslink bidirectional writer to locus queries Paper 2
Write Planner (engine) pen_stack.planner inverse design, edit_intent-conditioned Paper 3
Delivery immunology (v5.1-5.6) pen_stack.planner.delivery_immunology + {genotoxicity,capsid_epitope,seroprevalence,antipeg}_oracle, innate_sensing, immune_profile; validate.immune_calibration safety↔efficacy balance over the 8-vehicle palette; immune axes computed/grounded from data+sequence + anti-PEG (gates LNP re-dosing) → a unified per-axis Verdict.immune_profile (each axis validated-or-proxy, never collapsed); magnitude + patient titer stay known-unknowns (docs) M2
Agentic platform pen_stack.agent goal to cited, auditable plan; MCP server; one-command deploy Paper 3
Bridge off-target engine pen_stack.bridge "CRISPOR for bridge recombinases" + guide QC (v3.1) Paper 4
Oracle Mesh (v4.0) pen_stack.oracles one OracleResult contract over AF3/Boltz-2/Chai-1/ESM3/Evo2/AlphaGenome/STATE/scGPT; OOD-gated; generative=candidate v4.0
World-Model Graph (v4.5) pen_stack.graph living knowledge graph; typed provenanced edges; gated propose-only loop v4.5
The Verifier (v3.3+) pen_stack.verify verify(design) -> Verdict: legality + safety + calibrated confidence + immune profile — the discriminator v3.3
The Guardian (v5.7) pen_stack.safety biosecurity / dual-use gate; runs first in verify(); refuse short-circuits; tamper-evident audit v5.7
Generative Designer (v5.8) pen_stack.design generate → verifier-as-discriminator (hazardous/illegal discarded); Pareto with grounded immune axis v5.8
Digital Twin (v5.9) pen_stack.twin calibrated, OOD-gated outcome prediction; immune-outcome from v5.6; phenotype-bounded v5.9
Experiment Designer (v5.10) pen_stack.active active learning (EIG + immune-VOI); retrospective active-vs-random (reps+CI) v5.10
Build Interface (v5.11) pen_stack.build safety-gated protocol export (DRAFT + immune metadata); gated ingestion; sim-lab v5.11
The Closed Loop (v5.12) pen_stack.loop one gated DBTL command; autonomy Level 3; drift-aware; versioned/reversible continual learning v5.12
Co-Scientist + Challenge (v5.13) pen_stack.agent.co_scientist, benchmarks/genome_writing_challenge the co-scientist drives the loop; the public held-out benchmark others build to v5.13
Genome-Writing Bench (v3.1) benchmarks/, bench/run.py first writing-side benchmark; deterministic scorers, leaderboard M2
PEN-Agent (v3.1) pen_stack.agent.pen_agent grounded write-planning state machine; zero fabrication M2
3D structural risk (v3.1) pen_stack.wgenome.structure3d AlphaGenome contact-map deltas as a safety axis M1
Cargo Polish (v3.1) pen_stack.planner.cargo_polish cargo-sequence silencing-risk scan M1
Local adaptation (v3.1) pen_stack.adapt gated recalibration / fine-tuning on private data M1
Multiplex risk (v3.1) pen_stack.planner.multiplex pairwise translocation-risk screen for multi-edit plans M3
Platform services monitor, rag, ui, server living database, grounded RAG, web app, REST API -

Headline results (all blind / pre-registered)

  • Paper 1 (Writable Genome): a genome-wide atlas of 3,031,030 loci x 3 cell types (K562, HepG2, CD34+ HSPC) recovers validated safe harbours as highly writable and clinical genotoxic loci as non-writable, blind. Durability transfers mouse to human (Spearman rho = 0.42).
  • Paper 2 (Writer Atlas): 33,370 enzyme systems across 8 families on common measured axes; mechanism classifier agrees with the audited labels on the curated core (1.00); cross-link validated on AAVS1.
  • Paper 3 / v3.1 (Write Planner + de-circularized benchmark): the honest headline is blind safe-harbour site discovery - run genome-wide (so no on-target identity term fires), the planner's writability is tested for whether it ranks held-out safe harbours above matched-context controls. On a gold set scaled from 5 to 16 independent loci (8 functionally validated + 8 computationally-defined universal-GSH, classic harbours + Lin et al. 2024) this is a weak signal, honestly bounded: all-loci AUROC 0.68 (95% CI 0.53-0.82), validated-only 0.70 (95% CI 0.48-0.91, underpowered at N=8) vs a safety-only baseline 0.51. The earlier 0.92-on-5 was an over-estimate from tiny N; the AUROC is always cited with its CI and N. Writer-family recovery@1 = 0.86 vs prevalence 0.29 across 4 families (14 documented writes, including honest misses where labs chose a non-minimal-capacity writer - see Limitations). The earlier "recovery@10 = 1.00, McNemar p" for targeted intents was definitional, not predictive (an on-target identity term dominates), so it is reported only as a specification-compliance table - see docs/benchmark_circularity.md. A tool-using agent never fabricates a number.
  • Paper 4 (Bridge off-target engine): to our knowledge the first measured-data-validated tool that nominates and ranks candidate off-target locations for bridge recombinases. On the measured Perry 2025 data (6,856 real off-targets) the per-position profile confirms the central core (positions 7-9) is the specificity determinant, and the model ranks real off-targets above core-disrupted decoys at AUROC 0.77 vs 0.62 for Hamming. Stated plainly: it is a screening tool, not a quantitative safety calculator, it does not quantify how much recombination occurs at each site (sequence-risk vs measured magnitude, rho approximately 0.30). A first-of-its-kind beachhead for a genuinely unoccupied gap, not a Nature-tier breakthrough; the Writable Genome (Paper 1) remains the flagship novelty.

The Genome-Writing Bench (v0.2.1, M2)

The first benchmark for the writing side of genome engineering - where to write, what writer to use, how to design the cargo, and what off-target / structural risk a write carries - complementing the many editing-side (Cas9 / base / prime) benchmarks. Six tasks, each with a deterministic scorer and a documented ground-truth source; no task is scored against a circular label (it inherits the de-circularization gate).

python bench/run.py --agent          # one command -> out/bench_results.json + a leaderboard
docker compose run --rm bench python bench/run.py --agent   # same, on the clean image
Solver Beats naive on No-fabrication Note
deterministic planner 3/3 grounded tasks n/a the validated planning tools (reference)
naive baseline - n/a safety-only / prevalence / Hamming
grounded LLM agent (PEN-Agent) = planner (grounded) PASS a real LLM drives the tools; reaches the planner only by grounding every value, 0 fabricated

Ungrounded-LLM contrast (T7) - the benchmark separates agents, not just "did it call the tool": the same models with no tools fabricate tool-only values. Under a naive prompt, qwen2.5:7b and Nemotron both fabricate 100% of planning fields (and invent in-human clinical numbers no tool could produce - qwen 100%, Nemotron 67% on ungroundable goals). Even coached to refuse, qwen still slips (4%) while Nemotron refuses fully - but the grounded agent is 0.0 under every prompt and model, by construction. Grounding, not prompting, is what removes fabrication. (Transcripts cached under data/llm_bench_cache/ for offline replay; bench/run.py --ungrounded-live repopulates them on the VM.)

Per-task (planner vs naive): site selection 0.70 vs 0.51 (validated GSH, N=8; all-16-loci 0.68, CI 0.53-0.82), writer recovery 0.86 vs 0.29 (N=14 writes), off-target 0.77 vs 0.62, intent 7/7, no-fabrication PASS (a hard gate). The gold sets were scaled in v3.1.1 and every metric is reported with its N and CI - see Limitations. PEN-Agent (pen_stack.agent) is a grounded write-planning state machine - goal to site to writer to cargo (with Cargo Polish) to off-target to 3D structural risk to report - that copies every number from a validated tool with provenance and refuses or degrades rather than invent. See benchmarks/genome_writing_bench/, docs/agent.md, and the leaderboard submission guide.

How PEN-STACK connects to the prior repositories

PEN-STACK v3.0 consolidates and re-grounds five earlier projects. Their genuinely reusable assets are imported here; the originals are archived read-only for provenance and DOI stability. This is what makes PEN-STACK "the thing you cite instead of rebuilding the pipeline."

  genome-atlas  --+  18-family InterPro-audited Pfam whitelist (v1.2.1)  -->  WT-KB + mechanism classifier
  mech-class  ----+  multi-source mechanism classifier                   -->  family / mechanism calls
  pen-score  -----+- 9 scoring axes (dsb/cargo/deliv/immuno/prog/...)     -->  re-grounded therapeutic axes
  pen-assemble  --+  IS110 ortholog / design set                         -->  part of the 1,058-entity universe
  pen-compare  ---+  unified_editor_universe.parquet (1,058) + scorecard  -->  canonical universe + scorecard
Prior repo Pinned version What v3.0 reuses What changed
genome-atlas v0.7.2 the audited 18-family Pfam backbone - spine of the WT-KB and the at-scale mechanism classifier GraphSAGE link-prediction framing retired
mech-class v0.5.4 the mechanism classifier (Pfam + RHEA + CRISPRcasdb + UniProt) reused as the family/mechanism caller
pen-score v0.1.3 the scoring axes (deliv / immuno / cargo, ...) prog/cargo re-grounded; hand-set overrides removed
pen-assemble v0.5.2 the ortholog sequence set de-novo chimera generation retired -> DMS-grounded point-variant proposal
pen-compare v0.1.0 the 1,058-entity universe + scorecard scaffold + tests circular 5-gate "certification" -> descriptive scorecard with blind concordance

One canonical assembly path (pen_stack/atlas/universe.py::assemble) feeds the classifier, the scorer, and the scorecard identical metadata, so the cross-module inconsistency in the prior pipelines cannot recur.

Repository structure

pen-stack/
├── pen_stack/                        the installable package
│   ├── wgenome/                      Writable Genome (Paper 1)
│   │   ├── features.py               unified feature matrix (accessibility + histones + safety + integration)
│   │   ├── safety.py                 calibrated genotoxicity-risk model (chrom-block CV + baseline)
│   │   ├── durability.py             conditional chromatin->expression model (TRIP-trained, transferable)
│   │   ├── writability.py            decomposable safety x durability x reachability integration
│   │   ├── uncertainty.py            v3.2 conformal intervals/sets over the heads (no retraining)
│   │   ├── ood.py                    v3.2 out-of-distribution / extrapolation detector
│   │   ├── structure3d.py            3D structural-risk axis (AlphaGenome contact-map deltas, 11 hijack loci)
│   │   └── export_tracks.py          BigWig / BED atlas export
│   ├── atlas/                        Writer Atlas + WT-KB + cross-link (Papers 1-2)
│   │   ├── schema.py                 pydantic WriterEntry (enforces >=1 DOI per row)
│   │   ├── build_wtkb.py             Writer-Targeting Knowledge Base builder (8 families, tiered)
│   │   ├── expand.py                 ortholog ingestion -> atlas.parquet (33,370 systems)
│   │   ├── crosslink.py              writers_for_locus / loci_for_writer / loci_for_gene
│   │   ├── variant_propose.py        DMS-grounded point-mutation proposal (no chimeras)
│   │   ├── universe.py               THE canonical universe assembly (1,058 entities)
│   │   └── scorecard.py              descriptive scorecard + blind concordance
│   ├── mech/                         mechanism classification at scale (audited 18-family whitelist v1.2.1)
│   ├── score/                        re-grounded axes + therapeutic-readiness scoring
│   ├── planner/                      Write Planner (Paper 3): optimize / cargo / cargo_polish / multiplex / pipeline
│   │                                   + v3.2 target_site (hard PAM/att/core reject) / delivery_constraints
│   │                                   + v3.3 router (write-type dispatch) / delivery_vehicles (8-vehicle palette)
│   │                                   + v5.1-5.6 delivery_immunology (safety<->efficacy balance) and the five
│   │                                     immune-axis oracles: genotoxicity_oracle (VISDB x COSMIC) /
│   │                                     capsid_epitope_oracle (MHCflurry) / innate_sensing (CpG-O/E + dsRNA) /
│   │                                     seroprevalence_oracle (anti-vector NAb serosurveys) /
│   │                                     antipeg_oracle (anti-PEG, gates LNP re-dosing) [v5.6]
│   │                                   + v5.6 immune_profile (unified per-axis immune-risk vector; never collapsed)
│   ├── bridge/                       bridge off-target engine (Paper 4): offtarget / fold_qc / guide_qc / pipeline / cli
│   │                                   + v3.2 offtarget_energetics (position x substitution; held-out 0.88, ships)
│   ├── agent/                        agentic platform: tools / orchestrator / pen_agent / mcp_server / guardrails; v5.0 co_scientist + cite (multi-strategy, self-critique, cited rationale, scope ledger); v5.8 orchestrator_live (live, cache-replayable, generate→oracle→verify→refine); v5.13 co_scientist_session (drives the full loop, immune-risk first-class)
│   │                                   + v3.2 epistemic (3-tier status) / scope (known-unknowns matcher)
│   ├── graph/                        v4.5 living world-model knowledge graph (schema/build/query/ingest/cell_types); typed provenanced edges; gated living loop (propose-only)
│   ├── oracles/                      v4.0 L1 oracle mesh: OracleResult contract + adapters (genome/structure/protein_design/rna/energetics) over the foundation models; version-pinned cache; v5.2-5.6 delivery-immunology scope cards (delivery_genotoxicity/capsid_epitope/innate_sensing/seroprevalence/antipeg); v5.9 vcell (Arc STATE/scGPT, OOD-gated, output_kind=candidate)
│   ├── rules/                        v3.3 machine-readable rules engine (schema/evaluators/loader/solver) over configs/rules/*.yaml
│   ├── verify/                       v3.3 verification service: verify(design) -> Verdict (legal+reasons+confidence+scope; v4.0 writer_critique; v5.1 delivery_profile; v5.6 immune_profile per-axis vector; v5.7 safety SafetyVerdict)
│   ├── safety/                       v5.7 the Guardian: biosecurity/dual-use gate (registry/screen/policy/gate/audit/redteam); runs first in verify(); refuse short-circuits; tamper-evident audit
│   ├── design/                       v5.8 generative designer: space (candidate_space) / generate (verifier-as-discriminator; hazardous+illegal discarded) / pareto (frontier w/ grounded v5.6 immune axis)
│   ├── twin/                         v5.9 digital twin: mechanistic (cassette expression, closed-form) / outcome (fuse mech+vcell+v5.6 immune; OOD widens interval; phenotype-bounded) / calibrate (honest two-sided)
│   ├── active/                       v5.10 experiment designer: acquire (EIG/immune-VOI over the v5.9 twin) / design (diverse batch) / validate (retrospective active-vs-random, reps+CI, falsifiable)
│   ├── build/                        v5.11 build interface: protocol (safety-gated export, DRAFT + v5.6 immune metadata) / ingest (typed gated -> v4.5 world-model, no auto-edit) / simlab (export->sim->ingest, SIMULATED)
│   ├── loop/                         v5.12 closed loop (autonomy L3): cycle (run_loop, gated DBTL) / drift (predicted-vs-observed, inflate) / continual (admitted-only, versioned+reversible recalibration)
│   ├── adapt/                        local recalibration / private-data adaptation behind a gate (v3.1, WS-F)
│   ├── env/                          v3.4 full Gymnasium environment over router+verifier (genome_writing_env + policies; [env] extra)
│   ├── monitor/                      PEN-MONITOR living database (Europe PMC)
│   ├── rag/                          grounded, cited Q&A (hybrid LLM: Ollama primary, Nemotron fallback)
│   ├── validate/                     benchmarks: blind_gsh_discovery / durability_baselines / writer_recovery /
│   │                                   within_locus_ranking / agent_eval / ungrounded_baseline (T7) / adapt_demo /
│   │                                   v3.2 selective_prediction / uncertainty_eval / bench_trust_tasks (T8-T11) /
│   │                                   out_of_scope_refusal / target_site_controls / offtarget_energetics_eval /
│   │                                   v3.3 bench_rule_tasks (T12) / v3.4 bench_writetype_tasks + bench_adversarial_tasks (T13-16) + outcome_calibration /
│   │                                   v5.6 immune_calibration (proxy-vs-observed; labels each axis validated-or-proxy, two-sided) /
│   │                                   v5.7 safety_screening (the Guardian hard-gate: benign 0-false-refusal · hazards refused/escalated · evasions never clear) /
│   │                                   v5.8 generative_design (verifier-as-discriminator hard-gate: hazardous+illegal discarded; survivors calibrated+immune; grounded-immune Pareto) /
│   │                                   v5.9 outcome_prediction (digital-twin hard-gate: two-sided calibration + OOD widening + immune dim + phenotype out-of-scope) /
│   │                                   v5.10 experiment_design (active-learning hard-gate: EIG monotone + immune-VOI + diverse batch + retrospective active-vs-random reps+CI) /
│   │                                   v5.11 protocol_safety (build-interface hard-gate: cleared exports w/ immune metadata · hazard+illegal blocked · sim loop completes) /
│   │                                   v5.12 closed_loop (loop-integrity hard-gate: gated end-to-end run · Level-3 human-in-control · drift detection · versioned/reversible continual learning)
│   ├── data/                         ingestion (genome, chromatin, integration, TRIP, safety annotations)
│   ├── api/                          v6.1 AI integration surface: manifest (capability_manifest + scope_manifest = machine-readable known-unknowns + oracle scope cards)
│   ├── web/                          v6.2 Web Platform backend: tools (deterministic engine dossier) / llm (grounded co-scientist + grounding-guard) / server (FastAPI gateway + /chat + SSE) + v6.3 hybrid: router (4-lane design/explain/meta/general classifier) / guide (metric-interpretation cards + live capability facts)
│   ├── server/api.py                 FastAPI REST (atlas, crosslink, writable, plan, bridge, ask; v3.3 verify; v6.1 /capabilities /scope /safety /immune /generate /predict /suggest /session + openapi.json 3.1; v5.13 /challenge/{tasks,leaderboard})
│   ├── ui/app.py                     Streamlit web app (16 pages; v3.2 PEN-Agent shows confidence + epistemic status)
│   └── cli.py                        unified CLI
├── benchmarks/genome_writing_bench/  Genome-Writing Bench v0.3.8 (T1-T16 + co_scientist + safety_screening + generative_design + outcome_prediction + experiment_design + protocol_safety + closed_loop; tasks / harness / solvers / LEADERBOARD / SHAs)
├── benchmarks/genome_writing_challenge/ v5.13 the public Genome-Writing Challenge (held-out rounds + immune-risk task + submission API; harness / run / README / SUBMISSIONS)
├── bench/run.py                      one-command bench entrypoint (--agent, --verify)
├── examples/                         v6.1 runnable golden path: external_agent.py (REST) / mcp_client.py (MCP) / agent_tools.py (framework-agnostic tool specs from the live manifest + dispatcher)
├── web/                              v6.2 the Web Platform frontend (React/Vite + Tailwind): honest-UX library (ConfidenceBand/ProvenanceChip/ScopeLedger/SafetyBadge/ImmuneProfileCard) + 11 feature pages + the grounded co-scientist chat; built by docker/web.Dockerfile (node:20 stage)
├── scripts/                          reproducible pipeline drivers (p1_*, p2_*, p4_*, p52/p53 delivery-immunology oracle builds, ws_*_report)
├── configs/                          pinned datasets + thresholds + curation (YAML); v3.2 known_unknowns /
│                                       target_sites / delivery_constraints; v5.1-5.6 delivery_vehicles immune_safety /
│                                       genotoxicity_oracle / capsid_epitope_oracle + capsid_sequences.fasta /
│                                       seroprevalence / antipeg + oracles/scope_cards (+ v5.6 known_unknowns:
│                                       cd4_mhcii_help / preexisting_capsid_tcell / complement_carpa);
│                                       v5.7 safety/{hazard_registry,policy,probes} (Guardian; function/family/taxon-level only)
├── prereg/                           SHA-locked success criteria (paper1..4 + ws_a..ws_h + v3.2-v6.0 ws_{uq,ep,mc,ba,
│                                       r,v,route,env,bench,cal,o,wv,atlas,graph,mon,ct,plan,crit,cite,immune,
│                                       genotox,epitope,innate,seroprev,peg,calib,profile,screen,policy,redteam,
│                                       gen,pareto,orch,vcell,mech,outcome,twincal,acq,aldesign,alvalidate,
│                                       proto,ingest,simlab,loop,continual,drift,challenge,cosci2,
│                                       manifest,openapi,mcp,chat,frontend,hybrid} + SHA256 locks)
├── data/curated/                     small committed tables (universe, gene coords, measured bridge profile,
│                                       v3.2 bridge_offtarget_energetics.json)
├── data/llm_bench_cache/             28 cached ungrounded-LLM transcripts (T7, offline/CI replay)
├── data/alphagenome_cache/           cached AlphaGenome predictions (tracks + contact maps; offline reproducibility)
├── tests/unit/                       unit + regression + blind-validation suite
├── docs/                             mkdocs site (cards, tutorials, INFRA, DEPLOY, MCP);
│                                       v3.2: uncertainty.md / scope.md / mechanistic_constraints.md / BACKLOG.md;
│                                       v4.0-4.5: oracles.md / writer_verification.md / world_model.md;
│                                       v5.0-5.6: co_scientist.md / delivery_immunology.md;
│                                       v5.7-v6.0 (the closed loop): responsible_use.md / biosecurity.md /
│                                       generative_design.md / digital_twin.md / experiment_design.md /
│                                       build_interface.md / closed_loop.md / autonomy.md / challenge.md /
│                                       co_scientist_loop.md / integrations.md / STABILITY.md (1.0 API freeze)
├── docker/                           CUDA image + UI image + v6.2 web.Dockerfile (multi-stage: node:20 frontend build → slim Python gateway) + pinned requirements
├── tools/penctl.py                   laptop<->VM orchestrator (paramiko SSH/SFTP, Docker-only)
├── docker-compose.yml                one-command self-hostable platform
└── pyproject.toml  CITATION.cff  CHANGELOG.md  LICENSE

Data policy. Large artifacts (3 M-row atlases, BigWig tracks, models) and any third-party copyrighted data are not committed - they are released via Zenodo (DOI) or fetched from the original source, and are reproducible by re-running the scripts. Only small curated tables and derived products live in git.

Installation and quick start

From PyPI (the library, CLI, agent, and pure-logic tools):

pip install pen-stack            # core
pip install "pen-stack[models,bio,bridge,server,services]"   # full stack

The wheel ships the importable package and the command-line tools. The full data pipeline (the 3 M-row atlases, BigWig tracks, and curated configs) is distributed via the cloned repo + Zenodo, per the data policy below; point an installed copy at a checkout with export PEN_STACK_HOME=/path/to/pen-stack to use the config-driven features. Most users who want the whole pipeline clone the repo:

git clone https://github.com/ahmedanees-m/pen-stack.git && cd pen-stack
pip install -e ".[dev]"                                   # core + tests
pip install -e ".[models,bio,bridge,server,services]"     # full stack
pytest -q                                                 # 115 tests
pen-stack info                                            # stack status
python bench/run.py --agent                               # run the Genome-Writing Bench (under 5 min)

A five-minute quickstart that runs a bench task end-to-end is in docs/quickstart.md.

Query the stack:

pen-stack atlas --coverage                                # Writer Atlas coverage (33,370 systems x 8 families)
pen-stack writable --gene CCR5 --ct k562                  # rank writable loci near a gene
pen-stack crosslink --chrom chr19 --bin 55090             # which writers reach AAVS1
pen-stack plan --gene TRAC --intent knock_in_with_disruption --cargo-bp 2000   # inverse-design plans
pen-bridge design --target ACGTGTCTACGTGA --donor TTGCATCTAGGCAC               # bridge design + off-target + QC
pen-stack monitor --back-test                             # PEN-MONITOR living-database scan

Self-host the whole platform (API + web app + agent + MCP + LLM), one command:

docker compose up -d
docker compose exec ollama ollama pull qwen2.5:7b-instruct   # first run only (local fallback model)
# Web app :8501  .  API :8000 (/plan, /bridge/design, /ask)  .  MCP :8765   (see docs/DEPLOY.md)

LLM backend (hybrid, non-load-bearing). Services (agent, RAG, PEN-MONITOR) use one switch in configs/llm.yaml. On the compute tier (the GPU VM) the default is the local Ollama model (qwen2.5:7b-instruct, free, private, tool-calling verified) with automatic fallback to the hosted NVIDIA Nemotron (free, no local resources), then to a deterministic no-LLM path. A cooldown cache and bounded timeouts mean an absent or slow provider degrades in seconds rather than stalling. The LLM is non-load-bearing - every number and citation comes from a validated tool - so the choice never affects scientific reproducibility, only orchestration quality. Set NVIDIA_API_KEY (or a gitignored configs/nvidia_api_key.txt) for the hosted fallback; a low-RAM laptop with no GPU uses it automatically. The core scientific compute uses no LLM at all.

The web platform

pen_stack/ui/app.py is a single Streamlit app over the whole stack (11 pages):

  • Writable Genome - Overview, Forward query (gene to writability/safety/durability), Site finder (inverse), Atlas browser, Validation dashboard, Cross-cell-type transfer.
  • Writer Atlas - family coverage and measured-axis comparison.
  • Write Planner - goal + edit_intent to ranked, traceable plans.
  • Bridge design - design a bridge RNA, fold/cross-loop QC, genome-wide off-target scan.
  • Ask - grounded, cited Q&A (numbers from validated tools).
  • Agent - a goal to a cited, auditable end-to-end plan.

Data sources (all public)

hg38 (UCSC); ENCODE / Roadmap chromatin (ATAC/DNase + histone marks; K562, HepG2, CD34+ progenitor, mouse ES-Bruce4); GENCODE v46; COSMIC Cancer Gene Census v104; DepMap Public 26Q1; LaFave 2014 (NHGRI GeIST) MLV integrations; VISDB; TRIP / Akhtar 2013 (GEO GSE49806/49807); UniProt orthologs; Pfam/InterPro; Europe PMC; Addgene; Perry 2025 bridge-recombinase off-target + DMS data (Science adz0276; copyrighted - kept local, only derived products released). Every accession and DOI is pinned in configs/datasets.yaml and independently verified.

Validation philosophy

  • Pre-register before training. Success criteria, baselines and held-out sets are SHA-locked in prereg/ (paper1..4) before any model sees test data.
  • Always report an honest baseline (oncogene-distance for safety; H3K9me3/LAD for durability; intent-blind ranking for the Planner; Hamming for the bridge engine).
  • Blind external concordance - recover validated safe harbours, clinical genotoxic loci, documented writes, and measured off-targets the model never trained on.
  • Report failure honestly - cross-cell-type degradation, small benchmark N, and the limits of sequence-only off-target magnitude prediction are quantified results, not footnotes.
  • Every estimate carries its N and CI; statistical power is a stated limitation. The validated gold sets are small: blind GSH discovery rests on 8 functionally-validated harbours (16 curated loci; +35 Pellenz-2019 exploratory candidates reported separately, near chance), writer recovery on 14 documented writes, within-locus on 5 loci, the 3D structural sanity on 11 hijacking loci, and the LLM-agent bench on a few goals. Headline AUROCs are bootstrap-CI'd and the CIs are wide - e.g. blind GSH discovery is 0.68 (95% CI 0.53-0.82), not a precise 0.92. Scaling these gold sets (the literature has dozens of candidate harbours and many documented large-cargo integrase/CAST/PASTE writes) is the top priority for turning this from a proof of concept into an adopted resource; v3.1.1 began that scaling (5 -> 16 GSH loci).
  • Grounded services - every quantitative answer comes from a validated tool call (never a language model); the living database never auto-edits the atlas; clinical directives are refused.

Papers and phases

# Title Phase Status
1 (flagship) The Writable Genome: a predictive, writer-aware atlas of safe & durable insertion sites 1 complete
2 (platform) PEN-STACK: unified open infrastructure for non-destructive genome writing 2 complete
3 (capstone) The Write Planner: end-to-end inverse design of genomic writes 3 complete
4 (beachhead) Genome-wide off-target prediction for RNA-guided bridge recombinases 1.5 complete
M1 (v3.1) Writable Genome hardened: strong baselines, AlphaGenome sequence + 3D structural-risk axis v3.1 B,C,D,F complete
M2 (v3.1) The Genome-Writing Bench + PEN-Agent: the writing-side benchmark and a grounded agent v3.1 E complete
M3 (v3.1) Multiplex translocation-risk + bridge-RNA guide QC v3.1 G complete

The v3.1 cycle (workstreams A-H) is recorded in CHANGELOG.md, docs/positioning.md, and the SHA-locked prereg/ws_*.yaml; preprint drafts are in manuscripts/.

Per-phase build records, execution summaries, and Zenodo deposit packages are kept alongside the program plan. Data releases are deposited on Zenodo (one per paper).

Citation

@software{penstack2026,
  author  = {Mahaboob Ali, Anees Ahmed},
  title   = {PEN-STACK: open infrastructure for genome writing (The Writable Genome)},
  year    = {2026},
  version = {3.3.0},
  url     = {https://github.com/ahmedanees-m/pen-stack}
}

Author: Anees Ahmed Mahaboob Ali, VIT University, Vellore. MIT licensed.

Decision-support, not a clinical directive - every score is traceable to public data and a pre-registered model.

About

Open infrastructure for genome writing: the Writable Genome (learned safe+durable insertion-site atlas), the Writer Atlas of 33k genome-writing enzymes, an inverse-design Write Planner, and a bridge-recombinase off-target engine. Consolidates 5 prior repos. Pre-registered, blind-validated, single install.

Topics

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors