Built with Claude Opus 4.7 ·
claude-opus-4-7 · vision + reasoning + Managed Agents · no downgrades.
Feed it a robot recording — ROS1/ROS2 bag, telemetry, multi-camera video, LiDAR/IMU, plus repo context — and get back a grounded post-mortem: ranked root-cause hypotheses, timestamped multimodal evidence, an NTSB-style PDF, and a scoped code patch.
When a robot crashes, the flight data recorder tells you what happened. Black Box tells you why, and hands you the diff.
- Demo
- Hero case — operator said "tunnel," telemetry said otherwise
- Why this is hard
- What we shipped
- How it works
- Grounding gate
- Bug taxonomy — frozen 7
- Memory stack
- Verification ledger — honest forensics
- Token discipline
- Benchmark
- UI
- Modes and trust tags
- Capability matrix
- Package layout
- Demo asset catalog
- Bonus — NAO6 humanoid adapter
- Docs
- License
▶ Watch the 3-min demo — full walkthrough of the hero case, the grounding gate refutation, and the scoped patch hand-off.
Try it locally in 60 seconds:
python -m venv .venv && source .venv/bin/activate
pip install -e .
export ANTHROPIC_API_KEY=...
python scripts/run_opus_bench.py --budget-usd 20 # writes data/bench_runs/opus47_<UTC>.jsonReference run committed for audit: data/bench_runs/opus47_20260423T140758Z.json — 2 of 3 non-skeleton cases match on Opus 4.7 at $0.46 total spend. Every benchmark number in this README traces back to that file.
Offline plumbing check (no API key, zero spend):
python -m black_box.eval.runner --tier 3 --case-dir black-box-bench/casesHero-case telemetry-only one-shot (no frames, no vision):
python scripts/run_rtk_heading_case.py # requires ANTHROPIC_API_KEYFull clean-clone reproducibility steps for judges: docs/SMOKE_TEST.md.
sanfer_sanisidro RTK-heading break. Real operator recording. The operator tagged the bag "tunnel caused the anomaly." Black Box's grounding gate cross-checked telemetry, video, and source — and refuted the operator: RTK carr_soln=none was already present 43 minutes pre-tunnel, drive-by-wire never engaged, so the tunnel could not plausibly be the cause. The refutation ships as a ranked hypothesis with its own confidence and patch hint, not as agreement.
- Refutation narrative:
demo_assets/grounding_gate/README.md(tag:replay) - Regenerate live:
python scripts/run_rtk_heading_case.py - Scope boundaries for the judged beat:
SCOPE_FREEZE.md
Robotics failures are rarely obvious from one log or one operator's account. The hard part isn't summarizing logs — it's connecting partial, contradictory evidence (telemetry timestamps, camera frames, missing topics, source paths, system assumptions) into a defensible conclusion. Black Box is designed around that:
- Every hypothesis must anchor to at least two independent sources.
- Weakly-supported claims are dropped at the gate, not laundered into the report.
- The operator's narrative gets no special weight — telemetry can refute it.
- Long-horizon agentic investigation built on Claude Managed Agents — ingest, sample windows, densify suspicious frames cross-camera, refute weak hypotheses, write the report, propose a scoped patch.
- Deterministic grounding gate (
src/black_box/analysis/grounding.py) — every hypothesis needs ≥2 evidence rows from ≥2 distinct sources. Two visible exits: refutation, or"nothing anomalous detected". - Frozen 7-class bug taxonomy enforced at parse time by a Pydantic
Literal— no silent label invention. - 4-layer append-only memory stack (case → platform → taxonomy → eval) plus Managed Agents native memory stores with a human-gated promotion ledger.
- HITL approve/reject gate before any patch is applied, plus a writable verification ledger so wrong calls can be challenged in place without rewriting history.
- FastAPI + HTMX UI with streaming reasoning, live operator steering (
POST /steer/{job_id}), and time-travel rollback (POST /checkpoints/{id}/rollback). - Token discipline — prompt caching on system + taxonomy + few-shot block, adaptive resolution budgeter, per-call cost ledger in
data/costs.jsonl.
Platform-agnostic by design: the analysis layer sees a normalized session (telemetry series, multi-view frames, source snapshots) regardless of source robot or recording format.
flowchart LR
U[Operator] -->|upload recording| UI[FastAPI + HTMX]
UI --> ING[ingestion<br/>platform adapters]
ING --> NORM[normalized session<br/>telemetry · frames · source]
NORM --> MA[ForensicAgent<br/>Managed Agents SDK]
MA --> CL[Claude Opus 4.7<br/>vision + reasoning]
CL --> MA
MA --> GG{grounding gate<br/>min_evidence · telemetry check}
GG -->|kept| REP[reporting<br/>PDF + side-by-side diff]
GG -->|dropped| REP
MA --> MEM[(4-layer memory<br/>L1..L4 JSONL)]
REP --> OUT[case report + patch]
SYN[synthesis<br/>injected bugs] -.-> ING
BENCH[(bench cases)] -.-> EVAL[eval runner]
EVAL --> MA
The three modes share one agent loop. The prompt template and the grounding gate change per mode; the memory writes are uniform.
sequenceDiagram
autonumber
participant UI as FastAPI/UI
participant Ing as Ingestion
participant Ag as ForensicAgent
participant Cl as Claude Opus 4.7
participant Gr as Grounding Gate
participant Mem as MemoryStack
participant Rep as Reporting
UI->>Ing: ingest(recording)
Ing-->>UI: telemetry + frames + source index
UI->>Ag: start_session(case_key, mode)
Ag->>Cl: system + taxonomy (cached) + mode prompt
Cl-->>Ag: hypotheses (pydantic)
Ag->>Gr: validate against telemetry windows
alt evidence meets threshold
Gr-->>Ag: kept + evidence refs
Ag->>Cl: densify suspicious windows (cross-view)
Cl-->>Ag: root cause + scoped patch
else clean recording
Gr-->>Ag: "nothing anomalous detected"
end
Ag->>Mem: log L1 case + L3 taxonomy counts
Ag->>Rep: build PDF + HTML diff
Rep-->>UI: artifacts ready
The credibility floor. Every hypothesis Claude emits runs through a deterministic post-filter before it reaches the PDF — at least two evidence rows from two distinct sources, confidence ≥ 0.4. The gate has two visible exits, both shipped as in-tree demo assets.
flowchart LR
CL[Claude hypotheses] --> G{Grounding gate}
G -->|conf ≥ 0.4<br/>≥2 evidence rows<br/>≥2 distinct sources| KEEP[ship report]
G -->|all hypotheses fail| NONE[ship<br/>"nothing anomalous detected"]
G -->|telemetry refutes operator| REF[ship refutation<br/>as ranked hypothesis]
- Refutation exit —
demo_assets/grounding_gate/README.md. Sanfer hero case: operator narrative refuted, gate promoted the refutation to a ranked hypothesis with its own confidence and patch hint. Regenerate viascripts/run_rtk_heading_case.py. - Silence exit —
demo_assets/grounding_gate/clean_recording/README.md. Clean recording in, model produced four plausible-but-under-evidenced hypotheses, gate dropped all four (one per rule), shipped"No anomaly detected with sufficient evidence to support a scoped fix."Regenerate viapython scripts/build_grounding_gate_demo.py.
Rules and thresholds live in src/black_box/analysis/grounding.py :: GroundingThresholds.
The taxonomy is frozen at exactly 7 labels. Schema enforcement lives in src/black_box/analysis/schemas.py as a Pydantic Literal; anything outside the set raises ValidationError at parse time. No silent coercion, no catch-all bucket. CLAUDE.md, the cached prompt block in analysis/prompts.py, and the benchmark scorer all mirror these strings verbatim.
pid_saturation
sensor_timeout
state_machine_deadlock
bad_gain_tuning
missing_null_check
calibration_drift
latency_spike
classDiagram
class BugClass {
<<closed set, exactly 7>>
pid_saturation
sensor_timeout
state_machine_deadlock
bad_gain_tuning
missing_null_check
calibration_drift
latency_spike
}
class Patch {
<<scoped fix shape>>
clamp
timeout
null_check
gain_adjust
}
class Hypothesis {
+bug_class: BugClass
+summary: str
+evidence_refs: list
+confidence: 0..1
}
Hypothesis --> BugClass
BugClass ..> Patch : shapes the patch kind
- A hypothesis scores iff
predicted == ground_truthagainst one of the 7 labels. - Patch shape stays one of the scoped primitives (clamp / timeout / null check / gain adjust).
- New failure modes are added via a deliberate schema bump (this block, the
Literal, the cached prompt, and the scorer in the same PR), not by the model inventing a label at runtime.
Black Box writes an append-only 4-layer JSONL store every run (no vector DB, no RAG). This is the substrate; the closed-loop policy that consumes L2 priors + L3 frequencies + L4 accuracy to steer the agent between runs is on the roadmap.
flowchart TB
subgraph L1[L1 · Case]
C1[hypothesis<br/>evidence<br/>steering]
end
subgraph L2[L2 · Platform]
P1[priors per robot<br/>signature -> bug_class<br/>confidence, hits]
end
subgraph L3[L3 · Taxonomy]
T1[rolling bug-class<br/>+ signature counts]
end
subgraph L4[L4 · Eval]
E1[predicted vs ground truth<br/>accuracy by case/class]
end
FA[ForensicAgent.finalize] --> L1
FA --> L3
BENCH[eval runner] --> L4
L2 -. roadmap: prime prompt .-> FA
L3 -. roadmap: tie-break .-> FA
L4 -. roadmap: regression alarm .-> FA
Shipped: stack wiring, Pydantic records, four independent stores, MemoryStack.open(), accuracy roll-ups by case and bug class, taxonomy counts on every finalize.
Managed Agents native memory stores are wired alongside the local stack. A shared read-only bb-platform-priors store mounts under /mnt/memory/; case-isolated read-write stores hold per-investigation state. Cross-case promotion is gated by a human verification ledger — agents propose, humans diff and approve. Wiring details: docs/MANAGED_AGENTS_MEMORY.md. Smoke harness: python scripts/managed_memory_smoke.py --help.
Memory lifecycle CLI (blackbox-memory, installed by pyproject.toml):
| Command | What it does |
|---|---|
audit-native --store NAME |
paths, last-modified, version count, sha256 |
export-native-versions --store NAME |
dumps version history to data/memory_exports/<store_id>/ |
redact-native-version --version ID --reason TXT |
SDK redaction with required reason |
propose-promotion / diff-promotion / approve-promotion / reject-promotion |
human-gated promotion of agent-emitted candidates into bb-platform-priors |
Companion ops scripts under scripts/: list_managed_memory_stores.py, archive_old_case_memory_stores.py (dry-run by default), delete_case_memory_store.py (hard guard against bb-platform-priors), export_memory_versions.py.
Roadmap: the policy loop that reads L2 priors to bias the system prompt, uses L3 frequency as a tie-breaker on low-confidence hypotheses, and raises a regression alarm when L4 accuracy on a previously-solved case class drops below threshold. Calling that "self-improving" would be overclaim until the loop is visible between runs.
When the agent is wrong, history must not be silently rewritten. Every analysis carries a writable, append-only verification_note.md next to its L1 record where an operator records "the agent concluded X, real cause was Y." The original L1 entry is never edited; corrections are themselves new appended entries.
| Surface | Path |
|---|---|
| Per-analysis ledger | data/reports/<job_id>/verification_note.md |
| Cross-run structured ledger | data/memory/verification.jsonl |
| UI affordance | POST /verify/{job_id} (operator_id, agent_conclusion, real_cause, severity ∈ `dispute |
| Re-surfacing | PolicyAdvisor.dispute_caveat_block() raises the evidence bar on disputed classes |
| Tamper-evidence | module exposes no edit/delete API; tests/test_verification_ledger.py asserts append-only public surface |
This is the differentiator vs opaque automation: the agent's conclusions are auditable, and the audit is writable by the human in the loop.
Image resolution is a budget, not a fixed dial. Every Claude call is logged to data/costs.jsonl (cached/uncached/creation tokens, USD, wall time, prompt kind).
flowchart LR
SP[system + taxonomy + few-shot] -->|cache_control| CACHE[(Anthropic prompt cache)]
CACHE --> CALL[Opus 4.7 call]
TEL[telemetry timeline] -->|pick windows| WIN[suspicious windows]
WIN --> BUD{adaptive resolution<br/>saliency · ambiguity · $ budget}
BUD -->|low signal| THUMB[thumbnail grid]
BUD -->|high signal| FULL[full-res crops]
THUMB --> CALL
FULL --> CALL
CALL --> LOG[data/costs.jsonl<br/>cached_in · uncached_in · out · USD]
- System + taxonomy + few-shot block is
cache_control-tagged on every call. - Default tier is a thumbnail grid across selected views in one cross-view prompt — never one call per camera.
- Escalation to full-resolution crops only when the analysis step explicitly asks; saliency, ambiguity, and remaining $ budget all gate the decision.
- Reporting:
python scripts/cost_report.pysummarizes the ledger;--csvexports CSV;--chart docs/assets/cost_curve.pngregenerates the cumulative-spend curve.
The benchmark lives in a sibling repo (black-box-bench/). Seven cases are present. Scoring requires exact match on bug_class.
Reference run (committed, live-regenerable): data/bench_runs/opus47_20260423T140758Z.json. Claude Opus 4.7, budget cap $20, actual spend $0.46, 2 of 3 non-skeleton cases match — bad_gain_01 ✓, pid_saturation_01 ✓, sensor_timeout_01 ✗ (predicted bad_gain_tuning). Regenerate with scripts/run_opus_bench.py.
| Path | Cases | Offline stub | Real Opus 4.7 | Notes |
|---|---|---|---|---|
run_tier3(use_claude=False) |
7 | runs (live) |
— | deterministic plumbing check; does not call the model |
scripts/run_opus_bench.py |
3 non-skeleton | — | 2/3 match · $0.46 (live) |
reference run above |
| Tier-1 forensic batch runner | — | skeleton | skeleton | single-case path works end-to-end; batch CLI not yet wired |
| Tier-2 scenario-mining batch runner | — | skeleton | skeleton | agent loop exists; bench integration pending |
Public-data path (eval.public_data) |
— | stub | — | downloader + adapter mapping stubbed |
The published reference run in black-box-bench/runs/sample/ is a hand-written reference (sample), not model output.
FastAPI + HTMX. NTSB aesthetic — no gradients, monospace reasoning stream, explicit job IDs.

Upload — pick a recording, pick a mode, hand off to the worker.

Progress — staged reasoning stream (ingesting / analyzing / synthesizing / reporting). HTMX polls /status/{job_id} once per second.

Report — root cause, download link, and the "View proposed fix" side-by-side diff.
The canonical worker behind the UI is live (ingestion → ForensicAgent session → PDF render). It runs whenever ANTHROPIC_API_KEY is set; no BLACKBOX_REAL_PIPELINE opt-in required. Source-mode selection at upload time (form field source, default auto):
source= |
Behavior |
|---|---|
auto |
Live when an API key is set; stub otherwise. |
live |
Force the real worker. 503 if the key is missing. |
stub |
Force the offline scripted walkthrough — used in the demo video for deterministic playback. |
If the live pipeline fails mid-run, the failure surfaces as a failed job with the error message — there is no silent stub fallback. BLACKBOX_REAL_PIPELINE=0 remains as the kill-switch for offline-only environments.
Reproduce the screenshots: python scripts/capture_screenshots.py (requires playwright + playwright install chromium).
Three orthogonal axes describe any run:
| Axis | Values | Meaning |
|---|---|---|
| Mode | forensic post-mortem · scenario-mining · synthetic-QA | What question the pipeline is answering. |
| Trust tag | live · replay · sample |
How an asset was produced. |
| Tier | 1 · 2 · 3 | Benchmark slice — known crashes (T1), clean bags (T2), injected bugs (T3). |
Modes:
- Forensic post-mortem — known-crash recording in, root cause + patch out.
- Scenario mining — clean recording in, 3–5 moments of interest out. Conservative: if nothing is found, the answer is
"nothing anomalous detected." - Synthetic QA — injected-bug recording in, hypothesis + self-eval vs ground truth out.
Trust tags:
live— regenerated every run from committed code against committed inputs. No pre-baked outputs.replay— pre-computed artifact committed in-tree so the demo video is deterministic. Regeneration path is committed alongside.sample— static reference material authored by hand (not model output).
The trust tag and the mode are independent: a live run can be in any mode, a replay artifact can illustrate any mode.
Every claim in this README ties to one of three states.
| State | Meaning |
|---|---|
| ✅ Shipped | Reachable from the canonical demo path, exercised by tests. |
| 🟡 Partial | Code path exists, gated behind BLACKBOX_REAL_PIPELINE=1 env or a single-case CLI; not the canonical UI flow yet. |
| 🛣 Roadmap | Tracked as an open issue. Not on the judged beat. |
Full capability table (click to expand)
| Capability | State | File pointer | Tracking |
|---|---|---|---|
| Session discovery (folder → bag bundles) | ✅ | src/black_box/ingestion/session.py::discover_session_assets |
— |
rosbags-based ROS1+ROS2 reader |
✅ | src/black_box/ingestion/ |
— |
Telemetry-anchored frame sampling (from_timeline + sample_frames) |
✅ | analysis/windows.py, ingestion/frame_sampler.py |
— |
ClaudeClient with prompt caching, cost ledger |
✅ | analysis/client.py, data/costs.jsonl |
#89 |
prompts_v2 / prompts_generic / prompts_boat templates |
✅ | src/black_box/analysis/prompts*.py |
— |
ForensicAgent over Managed Agents SDK |
✅ | analysis/managed_agent.py |
— |
| Grounding gate (refute / silence) | ✅ | analysis/grounding.py |
#77 |
| PDF + diff HTML reporting | ✅ | reporting/ |
— |
| Memory L1–L4 substrate (append-only JSONL) | ✅ | memory/ |
— |
| Memory self-improving loop demo | ✅ | scripts/memory_loop_demo.py, memory/ |
— |
| Memory pruning + compaction (L1–L3) | ✅ | memory/maintenance.py |
— |
| Verification ledger + decisions / patch-not-applied | ✅ | memory/verification.py, memory/decisions.py |
— |
| FastAPI + HTMX UI (upload → progress → diff) | ✅ | src/black_box/ui/ |
— |
| Real pipeline as canonical UI worker | ✅ | live default; stub via ?source=stub |
— |
| HITL approve/reject persistence + no-auto-apply gate | ✅ | memory/decisions.py, UI banner |
— |
Live steering (POST /steer/{job_id}) |
✅ | UI button + JSONL audit | — |
| Async/long-running batch worker | ✅ | scripts/overnight_batch.py (resume + cost cap) |
— |
| Time-travel rollback UI (checkpoints + fork) | ✅ | POST /checkpoints/{id}/rollback |
— |
| Glass-box evidence trace (citations, replayable) | ✅ | GET /trace/{job_id} |
— |
| Tier-3 case runner | ✅ | eval/runner.py |
— |
| Tier-1 / Tier-2 batch runners + markdown table | ✅ | eval/runner.py, scripts/overnight_batch.py |
— |
| Public-data downloader path | ✅ | eval/public_data.py |
— |
visual_mining_v2 enabled for hero cases |
✅ | per-tier mapping | — |
| Asciinema of unattended live batch | ✅ | docs/recordings/offline_batch.cast |
— |
Network-isolated sandbox default (network=none) |
✅ | security/sandbox.py, SECURITY.md |
— |
| Credential vault (capabilities, not secrets) | ✅ | security/vault.py |
— |
| HTTP-Basic auth gate on mutating routes (off by default) | ✅ | security/auth.py |
— |
| Prompt-injection role segregation | ✅ | adversarial regression in tests/ |
— |
| Visual PII redaction + path-traversal sandbox | ✅ | security/redact.py, security/sandbox.py |
— |
| Context hygiene (tool_search, programmatic calls, context editing) | ✅ | analysis/context_hygiene.py |
— |
scripts/ taxonomy split (eval/demo/ops/dev) |
✅ | scripts/README.md |
— |
| pytest-cov gates + reproducible release packaging | ✅ | .github/workflows/ci.yml, release tag flow |
— |
| NAO6 platform adapter (synthetic fixture only) | ✅ | src/black_box/platforms/nao6/ |
— |
classDiagram
class ingestion {
+ingest(recording) Session
+sync_frames(streams) FrameIndex
+render_plots(series) PNG
}
class analysis {
+ClaudeClient
+ForensicAgent
+prompts (post_mortem/mining/synthetic_qa)
+schemas (pydantic)
}
class memory {
+MemoryStack
+CaseMemory (L1)
+PlatformMemory (L2)
+TaxonomyMemory (L3)
+EvalMemory (L4)
}
class platforms {
+Adapter (abstract)
+nao6.NAO6Adapter
+nao6.NAO6_TAXONOMY
}
class synthesis {
+inject_bug(kind) Recording
+emit_video_prompt() str
}
class reporting {
+build_pdf(case) PDF
+side_by_side_html(diff) HTML
+parse_patch_proposal() Tuple
}
class ui {
+FastAPI app
+HTMX progress poll
}
class eval {
+run_tier3() Summary
+self_eval vs ground truth
}
ui --> ingestion
ui --> analysis
analysis --> memory
analysis --> platforms
analysis --> reporting
eval --> analysis
eval --> synthesis
synthesis ..> ingestion : replays as recording
| Module | Responsibility |
|---|---|
ingestion/ |
Recording parser (rosbags for ROS1+ROS2, pure Python — no ROS runtime), frame sync, plot rendering. |
analysis/ |
ClaudeClient with aggressive prompt caching, three prompt templates, Pydantic schemas, ForensicAgent over Managed Agents SDK. |
memory/ |
4-layer append-only JSONL stack (case / platform / taxonomy / eval) + Managed Agents native stores. |
platforms/ |
Robot-specific adapters + taxonomies. |
synthesis/ |
Injected-bug recordings + text video prompts. Video generation is operator-driven on your own GPU; nothing is auto-installed. |
reporting/ |
reportlab PDF (NTSB-style), unified diff + HTML side-by-side. |
ui/ |
FastAPI + HTMX progress polling. Live worker is canonical; ?source=stub opt-in for the offline demo. |
eval/ |
Tier-3 runner + offline stub path; Tier-1/Tier-2 batch runners pending. |
scripts/ |
Runners, demo asset builders, ops, dev utilities. Categorized in scripts/README.md. |
Primary mapping: demo_assets/INDEX.md. Tags per beat:
demo_assets/streaming/replay_sanfer_tunnel.mp4—replay(regen viascripts/record_replay.py+scripts/record_replay_raw.py)demo_assets/pdfs/sanfer_tunnel.pdf+pdfs/sanfer_tunnel/page-*.png—replay(regen viascripts/run_session.py→scripts/regen_reports_md.py)demo_assets/pdfs/boat_lidar.pdf,demo_assets/pdfs/car_1.pdf—replaydemo_assets/analyses/{sanfer_tunnel,boat_lidar,car_1}.json—replay(committed model output; hero-case regen viascripts/run_rtk_heading_case.pyislive)demo_assets/analyses/TOP_FINDINGS.md—sample(hand-written overview table)demo_assets/grounding_gate/README.md—replay(refutation narrative; underlying analysis is regenerablelive)demo_assets/grounding_gate/clean_recording/—replay(regen viascripts/build_grounding_gate_demo.pyislive)demo_assets/diff_viewer/moving_base_rover.{html,png}—replay(regen viascripts/render_rtk_diff.py)demo_assets/memory_snapshot/L{1,3}*—replay(captured from a real run; store islive-appended by everyForensicAgent.finalize)demo_assets/streams/*.jsonl—replay(telemetry event streams from real ingestion)demo_assets/bag_footage/—replay(camera frames extracted from real bags;scripts/extract_*)bench/cases.yaml+bench/fixtures/—sample(hand-authored fixtures for the offline plumbing path)black-box-bench/cases/—live(real telemetry inputs for the budgeted Opus 4.7 pass)black-box-bench/runs/sample/—sample(hand-written reference run, explicitly labeled)
Not on the judged demo critical path. The hero case is rover/marine; NAO6 ships as a bonus adapter to prove the platform-adapter shape generalizes. See
SCOPE_FREEZE.md.
platforms/nao6/ includes:
- an ingestion adapter for NAO6 (SoftBank Aldebaran) humanoid recordings,
- a synthetic fall fixture (
sample) for end-to-end smoke testing, - a platform-specific taxonomy that maps onto the global closed-set
BugClass, - controller snapshots for Tier-3 injected-bug reproduction.
Regeneration and capture helpers: scripts/capture_nao6.py, scripts/NAO6_CAPTURE_GUIDE.md.
- Project rules (
CLAUDE.md) — hackathon hard rules, project shape, token discipline. - Smoke test for judges — clean-clone reproducibility steps.
- Memory stack composition + cost-delta proof — L1–L4 stack, verification_note, grounding gate, caching math.
- Managed Agents memory wiring — native stores + promotion ledger.
- Demo script — 3-min video beat sheet.
- Pitch — one-liner, elevator, positioning.
- Build journal — narrative, novelty, findings.
- Overnight batch — unattended bench runner + budget-gated driver. Dry-run log:
docs/assets/overnight_batch_dryrun.txt. Asciicast:docs/recordings/offline_batch.cast.
MIT.
