This document defines the bounded local-trial surface for supervised model trials on abyss-stack.
It is narrower than a proof layer and narrower than a benchmark-only surface:
- runtime truth stays local to abyss-stack
- per-case trial packets stay explicit and reviewable
- durable human+AI-readable summaries may be mirrored elsewhere
- no new HTTP APIs are introduced for the trial surface
Canonical local-worker path:
qwen-local-pilot-v1 -> w5-langgraph-llamacpp-v1 -> w6-bounded-autonomy-llamacpp-v1
Canonical runtime posture:
- preset: intel-full
- runtime path: http://127.0.0.1:5403/run
- backend: llama.cpp
- local Qwen posture: AOA_LLAMACPP_THREADS=4, AOA_LLAMACPP_BATCH_SIZE=512, AOA_LLAMACPP_CTX_SIZE=4096, AOA_LLAMACPP_CACHE_TYPE_K=f16, AOA_LLAMACPP_CACHE_TYPE_V=f16
- orchestration: LangGraph for W5, W6, and the current bounded local-worker posture
Explicit Intel 285H candidate overlays live under compose/tuning/ and stay pilot-only until measured runtime packets promote one of them.
Durable program roots now in use:
- qwen-local-pilot-v1
- langgraph-sidecar-pilot-v1
- qwen-llamacpp-pilot-v1
- w5-langgraph-llamacpp-v1
- w6-bounded-autonomy-llamacpp-v1
Runtime truth root family:
${AOA_STACK_ROOT}/Logs/local-ai-trials/<program-id>/
Durable human+AI-readable mirror family:
/srv/Dionysus/reports/local-ai-trials/<program-id>/
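The two path families above can be sketched as a small helper. The directory layout comes from this document; the function name and the environment-variable fallback are illustrative assumptions, not part of any shipped helper:

```python
import os
from pathlib import PurePosixPath

# Hypothetical sketch: resolve both path families for one program id.
# Runtime truth stays under abyss-stack; the mirror lives in Dionysus.
def trial_roots(program_id, stack_root=None):
    # AOA_STACK_ROOT fallback is an assumption for illustration only
    stack_root = stack_root or os.environ.get("AOA_STACK_ROOT", "/srv/abyss-stack")
    return {
        "runtime_truth": str(PurePosixPath(stack_root) / "Logs/local-ai-trials" / program_id),
        "mirror": str(PurePosixPath("/srv/Dionysus/reports/local-ai-trials") / program_id),
    }

roots = trial_roots("qwen-local-pilot-v1", stack_root="/srv/abyss-stack")
```

The split stays explicit: the helper only derives paths and never copies runtime truth toward the mirror.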
Keep the split explicit:
- abyss-stack owns machine-readable trial truth and runtime-local artifacts
- Dionysus may mirror curated Markdown reports and wave digests
- do not move raw runtime truth into Dionysus
- do not let the mirror become a shadow owner of runtime behavior
Each executed case must own one packet with:
- case.spec.json
- run.manifest.json
- result.summary.json
- report.md
Each wave must own:
- wave-index.json
- wave-index.md
- W*-closeout.json
- W*-closeout.md
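A minimal completeness check over the required per-case files might look like the sketch below. The four file names are from this document; the validator itself is hypothetical and not part of aoa-local-ai-trials:

```python
import tempfile
from pathlib import Path

# The four files every executed case packet must own (from this document)
REQUIRED_CASE_FILES = ("case.spec.json", "run.manifest.json",
                       "result.summary.json", "report.md")

def missing_case_files(case_dir):
    """Return the required packet files absent from case_dir."""
    return [name for name in REQUIRED_CASE_FILES
            if not (Path(case_dir) / name).is_file()]

# Demo against a temporary packet directory holding only report.md
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "report.md").write_text("stub", encoding="utf-8")
    missing = missing_case_files(tmp)
```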
When a wave reaches a terminal gate result (pass or fail), the runner also
attempts one bounded reviewed closeout handoff into aoa-sdk and records the
machine-readable result locally as:
W*-closeout.submit.json
The same closeout step also publishes one owner-local runtime receipt to the canonical stats intake log:
/srv/abyss-stack/.aoa/live_receipts/runtime-wave-closeouts.jsonl
Each closeout submit result also gets a sibling artifact for direct inspection:
W*-closeout.submit.receipt.json
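The receipt publication above is an append of one JSON object per line. The record type runtime_wave_closeout_receipt is from this document; the remaining field names in this sketch are illustrative assumptions:

```python
import json
import tempfile
from pathlib import Path

def append_receipt(log_path, wave, gate_result):
    # "runtime_wave_closeout_receipt" is the documented record type;
    # "wave" and "gate_result" are assumed field names for illustration
    record = {"type": "runtime_wave_closeout_receipt",
              "wave": wave, "gate_result": gate_result}
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")  # one JSON object per JSONL line
    return record

# Demo against a throwaway log instead of the canonical intake path
with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "runtime-wave-closeouts.jsonl"
    append_receipt(log, "W5", "pass")
    lines = log.read_text(encoding="utf-8").splitlines()
    first = json.loads(lines[0])
```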
The fixed report sections are:
- Goal
- Inputs
- Expected Result
- Actual Result
- Evidence
- Boundary Check
- Verdict
- Failures
- Follow-up
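A report skeleton over those fixed sections could be generated as below. The nine section names are fixed by this document; rendering each as a "## " heading is an assumption about report.md layout, not a documented contract:

```python
# Fixed report sections, in order (from this document)
REPORT_SECTIONS = ("Goal", "Inputs", "Expected Result", "Actual Result",
                   "Evidence", "Boundary Check", "Verdict", "Failures",
                   "Follow-up")

def report_skeleton():
    # Assumed layout: one markdown heading per section, body left blank
    return "\n".join(f"## {name}\n" for name in REPORT_SECTIONS)

skeleton = report_skeleton()
```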
Use the runtime helper:
scripts/aoa-local-ai-trials materialize
scripts/aoa-local-ai-trials run-wave W0
scripts/aoa-local-ai-trials run-wave W1
scripts/aoa-local-ai-trials run-wave W2
scripts/aoa-local-ai-trials run-wave W3
scripts/aoa-local-ai-trials prepare-wave W4 --lane docs
scripts/aoa-local-ai-trials apply-case W4 <case-id>
Optional backend/program overrides:
scripts/aoa-local-ai-trials --url http://127.0.0.1:5403/run --program-id qwen-llamacpp-pilot-v1 run-wave W0
What the helper does now:
- materializes contracts and frozen case specs for W0 through W4
- writes planned wave indexes for later waves
- executes W0 on the intended local runtime path
- executes W1 through grounded local snippets on the same langchain-api /run path
- executes W2 through supervised read-only grounding on the same langchain-api /run path
- executes W3 through grounded exact-only selection on the same langchain-api /run path
- prepares W4 proposals through a staged supervised-edit flow
- applies approved W4 cases only after isolated worktree validation
- runs one phase-aware aoa skills dispatch --phase ingress pass at run-wave start
- runs one phase-aware aoa skills dispatch --phase pre-mutation pass before any W4 apply-case mutation attempt
- restores the baseline after the parity sample
- writes stable W*-closeout.{json,md} aliases for wave-level handoff surfaces
- attempts one audit-only reviewed closeout submission into aoa-sdk when a wave reaches a terminal gate result
- appends one runtime_wave_closeout_receipt to the owner-local live receipt log for derived stats
What it does not do:
- it does not introduce a new serving API
- it does not upgrade runtime success into portable proof wording
- it does not collapse W4 into a silent monolithic mutator
The original comparison layer still exists:
scripts/aoa-langgraph-pilot materialize
scripts/aoa-langgraph-pilot run-case 8dionysus-profile-routing-clarity --until approval
scripts/aoa-langgraph-pilot resume-case 8dionysus-profile-routing-clarity
The same runner can also be pointed at an alternate backend/program root:
scripts/aoa-langgraph-pilot --url http://127.0.0.1:5403/run --program-id langgraph-sidecar-llamacpp-v1 run-case fixture-docs-wording-alignment --until approval
Use LANGGRAPH_PILOT for the sidecar contract.
That sidecar surface established the now-adopted execution posture:
- aoa-local-ai-trials remains the historical baseline for W0 through W4
- LangGraph is now the primary orchestration layer for W5, W6, and the current bounded local-worker path
- aoa-langgraph-pilot remains the W4-shaped comparison and fixture surface rather than the full execution baseline
The next bounded scenario layer lives beside the earlier waves:
scripts/aoa-w5-pilot materialize
scripts/aoa-w5-pilot run-scenario <scenario-id> --until milestone
scripts/aoa-w5-pilot resume-scenario <scenario-id>
scripts/aoa-w5-pilot status --all
Use W5_PILOT for the full W5 contract.
The W5 runner:
- defaults to http://127.0.0.1:5403/run
- treats the canonical llama.cpp path as the primary substrate
- keeps LangGraph as the primary orchestration layer
- uses milestone gates instead of a monolithic run-wave W5
- supports read_only_summary, qwen_patch, script_refresh, and implementation_patch
- reuses approval.status.json at plan_freeze, first_mutation, and landing
- keeps mutation scenarios worktree-first and explicitly approved before landing
- records one local checkpoint commit per successful mutation scenario when a tracked diff is present
- feeds a wave-local summary, not the canonical deployed autonomy verdict
The autonomy-focused layer lives beside W5 and keeps the same promoted substrate:
scripts/aoa-w6-pilot materialize
scripts/aoa-w6-pilot run-scenario <scenario-id> --until milestone
scripts/aoa-w6-pilot resume-scenario <scenario-id>
scripts/aoa-w6-pilot status --all
Use W6_PILOT for the full W6 contract.
The W6 runner:
- defaults to http://127.0.0.1:5403/run
- keeps LangGraph as the primary orchestration layer
- reduces approvals to plan_freeze and landing
- removes first_mutation from the normal mutation path
- keeps mutation scenarios worktree-first and explicitly approved before landing
- supports one bounded autonomous_repair_loop after post_change_validation_failure
- tracks novel_implementation_passes, preexisting_noop_count, repair_attempted_count, and repair_success_count
- still relies on scripts/aoa-status --autonomy for the deployed control-loop verdict
Use TRUTH_SURFACES when reading or publishing trial outcomes.
Trial summaries should keep these fields separate:
- source_authored
- deployed
- trial_proven
- live_available
In particular:
- trial_proven is not the same thing as live_available
- a source-authored helper is not a live runtime surface until the deployed Configs copy is updated
- mirror Markdown in Dionysus may carry additive truth-status corrections without becoming the owner of runtime truth
- the deployed operator verdict for the promoted lane lives at scripts/aoa-status --autonomy
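A summary that keeps the four truth-status fields separate might look like this. The field names come from this document; the example subject and the surrounding structure are illustrative assumptions:

```python
# Hypothetical trial summary fragment: each truth-status field stays
# independent, so trial_proven is never collapsed into live_available.
summary = {
    "subject": "example-helper",   # illustrative, not a real surface
    "source_authored": True,       # the helper exists in source
    "deployed": False,             # the deployed Configs copy is not updated
    "trial_proven": True,          # a local trial packet passed
    "live_available": False,       # therefore not a live runtime surface
}
```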
When you need the current control-loop status instead of a wave-local summary, use:
scripts/aoa-status --autonomy
scripts/aoa-status --autonomy --json
W5 and W6 remain pilot evidence.
The first governed mutation lane now lives at scripts/aoa-governed-run.
The canonical runtime contract for that lane is documented in GOVERNED_EXECUTION.
Use:
scripts/aoa-governed-run prepare-request --write /tmp/governed-request.json
scripts/aoa-governed-run prepare-canary docs-truth-wording-alignment --write /tmp/governed-request.json
scripts/aoa-governed-run materialize-canaries --write-dir /tmp/governed-canaries
scripts/aoa-governed-run run --request-file /tmp/governed-request.json --until done
scripts/aoa-governed-run resume <run-id>
scripts/aoa-governed-run status --all --explain
This lane:
- still fails closed on aoa-status --autonomy --json
- resolves playbook and memo context through the existing advisory seams
- writes approval.status.json at plan_freeze and landing
- validates mutations inside an isolated git worktree before landing
- records landing.diff and worktree.manifest.json before main-checkout apply
- writes rollback.status.json if post-apply validation fails
- keeps runtime execution permissions in config-templates/Configs/agent-api/governed-execution-policy.yaml
- may seed bounded real-task requests from config-templates/Configs/agent-api/governed-canary-catalog.json
- now records trust evidence and operator triage instead of treating governed runs as opaque packets
Use:
scripts/aoa-qwen-run --prompt-file /tmp/example.prompt.txt --json
The W1 runner:
- reads only local text source_refs
- stores bounded grounded excerpt capture in grounding.txt
- builds prompt.txt from compact prompt slices derived from the same local refs
- calls aoa-qwen-run with temperature=0
- scores exact repo ownership and boundary confusion cases without introducing new HTTP APIs
The W2 runner:
- requires a green W1 gate before execution
- captures local refs, HTTP GET evidence, and declared read-only command outcomes before prompting Qwen
- stores grounding.txt, prompt.txt, judge.prompt.txt, and evidence.summary.json per case
- uses a compact JSON answer contract instead of free-form prose
- runs a second bounded judge pass through aoa-qwen-run
- allows honest non-zero read-only command outcomes when the model reports them accurately and preserves boundaries
- treats fabricated refs, paths, URLs, or commands as hard failures across the whole wave
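The fabrication rule can be sketched as a set difference over captured evidence: anything the model cites that is not already in the evidence set is fabricated. The function and its inputs are hypothetical, not the actual W2 scorer:

```python
def fabricated_items(cited, evidence):
    """Return cited refs/paths/URLs/commands absent from captured evidence."""
    return set(cited) - set(evidence)

# Illustrative inputs: two grounded items plus one invented path
cited = {"/srv/abyss-stack/Logs", "http://127.0.0.1:5403/run", "made-up/path.txt"}
evidence = {"/srv/abyss-stack/Logs", "http://127.0.0.1:5403/run"}
bad = fabricated_items(cited, evidence)  # non-empty => hard wave failure
```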
The W3 runner:
- requires a green W2 gate before execution
- captures local file refs and live HTTP source refs into grounding.txt, prompt.txt, and evidence.summary.json
- uses aoa-qwen-run with temperature=0, max_tokens=48, and an exact-only plain-text answer contract
- scores deterministically without a judge pass
- treats silent widening as a case failure
- treats unsafe-case mismatches or silent widening as wave-critical selection errors
The W4 runner uses staged commands instead of run-wave W4.
Use:
scripts/aoa-local-ai-trials prepare-wave W4 --lane docs
scripts/aoa-local-ai-trials prepare-wave W4 --lane generated
scripts/aoa-local-ai-trials apply-case W4 <case-id>
The W4 flow:
- requires a green W3 gate before proposal preparation or apply
- keeps docs-only and generated-refresh cases in separate lanes
- prepares one proposal packet per case without mutating the target repo
- keeps the public prepare-wave W4 and apply-case W4 interface stable while using a smaller staged internal docs flow
- runs docs-lane qwen_patch preparation in four internal steps: target-selection, alignment-plan, edit-spec exact, and edit-spec anchor fallback
- trims applicable root and nested AGENTS.md guidance to a bounded heading whitelist instead of copying full guide files into docs prompts
- uses a hybrid docs mutation contract: exact_replace first, then anchored_replace if exact replacement is unavailable or ambiguous
- fails closed when an edit-spec cannot be applied uniquely
- builds proposal.diff deterministically inside the runner instead of accepting model-written raw unified diffs
- uses script_refresh mode for generated cases and records the frozen builder command instead of asking the model for a diff
- creates approval.status.json per case and requires explicit approved status before any mutation
- logs one pre-mutation.dispatch.json artifact per case so the operator can see must_confirm risk gates before mutation
- runs every mutation first in an isolated git worktree
- validates touched files against the frozen allowed-file scope before landing
- reruns acceptance checks in the main repo only after the worktree passes
- blocks generated-lane apply until docs lane has at least 5/6 passes and zero critical failures
- continues docs-lane preparation across all cases even if one proposal is invalid
W4-specific artifacts include:
- proposal.target.json
- proposal.plan.json
- proposal.edit-spec.json
- proposal.prompt.txt
- proposal.retry.prompt.txt
- proposal.diff
- proposal.summary.json
- approval.status.json
- worktree.manifest.json
W4 critical failures remain:
- unauthorized_scope_expansion
- post_change_validation_failure
aoa-qwen-bench remains a bounded runtime benchmark helper.
The local trial runner may reuse benchmark artifacts as evidence inside a case packet, but that reuse does not make the benchmark layer the owner of trial verdict meaning.
Keep these boundaries:
- runtime bench evidence is local machine truth
- local trial packets are curated bounded case records
- portable proof belongs in aoa-evals, not here