Add synthetic MAST 3.1 fixture corpus + parity runner for verify-before-stop / no-vibes #12
ianymu wants to merge 2 commits
Adds 20 synthetic fixtures targeting MAST mode 3.1 (Premature Termination), a mode for which the current human-labelled n=19 subset has zero positive votes. Includes parity_runner.py, which drives both verify-before-stop and no-vibes through identical fixture payloads and produces precision/recall/F1 per hook plus inter-hook Cohen κ for the signal-source-vs-MAST-mode triangulation framing discussed on anthropics/claude-code#46957.

Co-authored-by: Ian Mu <ian.y.mu@gmail.com>
Aligns directory layout with the path referenced throughout README.md, parity_runner.py, and PR description (evaluation/synthetic_mast_3_1/synthetic-3.1-corpus/).
waitdeadai left a comment
Thanks for this, Yanlong — this is the synthetic-3.1 stopgap I floated in anthropics/claude-code#46957, and it's built with real care for a first external PR. The signal-source × MAST-mode framing is the right idea, the runner is clean stdlib-only code, and the caveats section pre-empts the co-evolved-corpus trap (a synthetic set the rules were tuned on producing a fake F1=1.0) directly. I want to merge a version of this. A few things need to change first; the first is load-bearing.
Strengths
- `parity_runner.py` is correct where it counts. `materialise_operator_state` builds a real tmp git repo with a baseline commit so `git diff` has something to diff against, sets local `user.email`/`user.name` so it works in fresh CI envs, and times the `stop-verify.log` entries off `now − verify_log_age_seconds`. `classify`/`prf1`/`cohens_kappa` are textbook and the κ `pe==1.0` guard is a nice touch. The Stop-event JSON shape matches the existing `evaluation/mast/run_bash_parity.py` contract.
- The honesty discipline is right. `dry_run_results.csv` is framed as the operator's manually-derived expectation, not a measured run, and the README makes no synthetic-to-in-the-wild generalization claim. That matches the `LIMITATIONS.md`/`CLAIM_LEDGER.md` conventions here.
- The negative controls (group D) are well-chosen: D02 verified-completion, D03 honest partial-blocked (mirrors `fixtures/closeout/wrap_up.jsonl` `wrap_up_passes_specific_partial`), D05 bounded-choice. The deliberate verify-vs-no-vibes disagreement rows (B04, D03) are the most informative part.
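For readers who have not opened the runner, here is a minimal sketch of the textbook metrics referred to above. The function names echo the ones in the review, but the bodies are written from the standard definitions and are not copied from `parity_runner.py`. In particular, returning 1.0 when `pe == 1.0` is an assumed convention; the runner's actual guard value is not shown in this thread.

```python
from collections import Counter


def prf1(expected, predicted, positive="block"):
    """Precision, recall and F1 for one hook, treating `positive` as the hit label."""
    pairs = list(zip(expected, predicted))
    tp = sum(e == positive and p == positive for e, p in pairs)
    fp = sum(e != positive and p == positive for e, p in pairs)
    fn = sum(e == positive and p != positive for e, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def cohens_kappa(a, b):
    """Cohen's kappa between two equally long label sequences."""
    if not a or len(a) != len(b):
        raise ValueError("need two non-empty sequences of equal length")
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n  # observed agreement
    ca, cb = Counter(a), Counter(b)
    pe = sum(ca[label] * cb[label] for label in set(ca) | set(cb)) / (n * n)
    if pe == 1.0:
        # Both raters used one identical label throughout, so (1 - pe) is zero.
        # Assumed convention: report perfect agreement instead of dividing by zero.
        return 1.0
    return (po - pe) / (1 - pe)
```

With only 20 fixtures and two labels, κ is sensitive to a single flipped row, which is one more reason to read the disagreement rows individually instead of leaning on the aggregate.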
Required changes
1. (blocking) The `expected_no_vibes_decision` labels are keyed to the wrong hook. The shipped `adapters/claude-code/hooks/no-vibes.sh` runs only `run_agentcloseout_physics_hook evidence_claims` (line 9). `wrap_up`, `cliffhanger`, and `no_cherry_pick_rollup` are separate hooks (`no-wrap-up.sh`, `no-cliffhanger.sh`). But A02/A03/A04, B02/B03/B05, and C01–C05 set `expected_no_vibes_decision: block` with rationales citing "wrap_up generic tail", "cliffhanger rule", and "no_cherry_pick_rollup". Those rules aren't in no-vibes' path: a real `--no-vibes-hook adapters/claude-code/hooks/no-vibes.sh` run will pass most of them, so the committed dry-run CSV (no-vibes F1 0.889) wouldn't survive a measured run. Two acceptable fixes: (a) re-derive expectations against the `evidence_claims` rule pack only, or (b) target the combined family explicitly and drive `no-wrap-up.sh` / `no-cliffhanger.sh` / `no-vibes.sh` as separate columns (arguably the better design and fits the multi-hook driver goal in your Next Steps). Either way the rationales must name the hook that actually fires.
2. (should-fix) MAST 3.1 mapping: align with this repo's existing source of truth, `evaluation/mast/mast_hook_map.py`, which already declares `"3.1": ["no_cherry_pick_rollup", "cliffhanger", "wrap_up"]`. The new corpus re-derives the 3.1 hook set independently in a parallel `evaluation/synthetic_mast_3_1/` dir. Importing `HOOK_TO_MAST["3.1"]` (as `run_bash_parity.py` does) keeps the two from drifting and makes provenance explicit.
3. (should-fix) Nothing wires this into CI and `dry_run_results.csv` is committed static output, so it can rot against rule-pack bumps, the drift you flag in caveat #3. Minimum bar: add `tests/test_synthetic_3_1.py` that runs the runner in `--dry-run` and asserts the CSV is byte-stable (or regenerates and diffs). `pytest` already runs in `ci.yml`; cheapest regression guard.
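To make the byte-stability ask in the third item concrete, a rough sketch of the comparison such a test could perform is below. `csv_drift` and `check_dry_run_csv` are hypothetical helpers, and `regenerate` stands in for whatever callable returns the bytes the runner emits in `--dry-run`; none of these names exist in the PR.

```python
import difflib
from pathlib import Path


def csv_drift(committed: bytes, regenerated: bytes) -> list:
    """Return diff lines between two CSV payloads; an empty list means byte-stable."""
    if committed == regenerated:
        return []
    old = committed.decode("utf-8", errors="replace").splitlines(keepends=True)
    new = regenerated.decode("utf-8", errors="replace").splitlines(keepends=True)
    diff = list(difflib.unified_diff(old, new, fromfile="committed", tofile="regenerated"))
    # Bytes differ but the decoded lines match (for example, invalid UTF-8 on both sides).
    return diff or ["byte-level difference not visible after decoding\n"]


def check_dry_run_csv(committed_path, regenerate):
    """Fail with a readable diff when the committed CSV no longer matches a fresh dry run."""
    drift = csv_drift(Path(committed_path).read_bytes(), regenerate())
    assert not drift, "".join(drift)
```

Comparing bytes first keeps the check strict about line endings, while the unified diff keeps the failure message readable when a rule-pack bump changes an expectation.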
Smaller notes (non-blocking)
- `verify-before-stop.sh` isn't in this repo and `waitdeadai/claude-verify-before-stop` doesn't exist publicly yet, so the verify side of the parity can't run in CI today. Fine for landing the corpus, but state it in the README so a reader doesn't assume a measured verify column is reproducible here.
- Consider folding this under `evaluation/mast/` rather than a sibling `evaluation/synthetic_mast_3_1/` dir, to keep the MAST work in one place.
- `parity_runner.py` writes `rows[0].keys()` as the CSV header and will `IndexError` on an empty corpus (e.g. `--max-fixtures 0`). One-line guard.
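The empty-corpus guard from the last note could look roughly like this. `write_results` is a hypothetical stand-in for the runner's CSV-writing step, not its actual function.

```python
import csv


def write_results(rows, out):
    """Write result rows as CSV, deriving the header from the first row.

    Returns the number of rows written. An empty list writes nothing
    instead of raising IndexError on rows[0].
    """
    if not rows:
        return 0
    writer = csv.DictWriter(out, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return len(rows)
```

An alternative is to keep a fixed header constant and always write it, so that an empty run still produces a parseable file; either choice avoids the crash.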
Scope: this is request-changes, not a rejection. Fix the no-vibes labeling (1) and I'll take it; (2) and (3) can land in the same push or a fast follow. Appreciate the rigor.
Hi @ianymu, circling back on this so it doesn't go stale. To be clear up front: I want this in, and the changes I asked for are small relative to how solid the contribution is.

The only load-bearing item is #1 (the […]

Items #2 (import […]

No rush on a timeline, just wanted to make sure you weren't waiting on me. The signal-source × MAST-mode triangulation framing and the honesty discipline in the README/caveats are exactly right, and the B04/D03 disagreement rows are the most interesting part of the whole thing. Thanks again for picking this up from the issue thread.
Synthetic MAST mode 3.1 corpus + parity runner (verify-before-stop × no-vibes)
What this PR adds
- `evaluation/synthetic_mast_3_1/synthetic-3.1-corpus/`: 20 hand-authored fixtures exercising MAST mode 3.1 (Premature Termination): 5 pure-premature, 5 mid-task stop, 5 wrap-up vocabulary, 5 negative controls.
- `evaluation/synthetic_mast_3_1/parity_runner.py`: standalone (stdlib-only) runner that drives two Stop-event hooks against the same fixtures:
  - `verify-before-stop.sh` (operator-side signal: git diff + verifier log)
  - `no-vibes.sh` (text-side signal: closeout vocabulary via `agentcloseout-physics`)
- `evaluation/synthetic_mast_3_1/README.md`: usage and design notes.
- `evaluation/synthetic_mast_3_1/dry_run_results.csv`: expected outcomes, generated by `--dry-run` so reviewers can validate the corpus without invoking any hook.
Why
Follow-up on anthropics/claude-code#46957
where Fernando noted:
This PR is the synthetic stopgap I committed to in that thread:
Methodology: signal-source × MAST-mode triangulation
The two hooks attack 3.1 through different evidence streams:
- `no-vibes.sh`: text-side signal, the closeout vocabulary via `agentcloseout-physics`.
- `verify-before-stop.sh`: operator-side signal, the git diff plus a `VERIFIED` entry in `.claude/state/stop-verify.log`.

Each fixture in the corpus carries both a `closeout_text` (consumed by no-vibes) and an `operator_state` block (`files_touched`, `verify_log_entries`, `verify_log_age_seconds`, `files_committed`). The parity runner materialises the operator state into a tmpdir git repo, runs verify-before-stop from that cwd, then runs no-vibes with the same closeout text. Both invocations follow the standard exit-code-2 contract.
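As an illustration of the exit-code-2 contract, the sketch below pipes a JSON payload to a hook on stdin and maps its exit status to a decision. `run_stop_hook` is a hypothetical helper, not the runner's code. The payload is passed through untouched because the exact Stop-event shape the hooks expect is not shown here, and treating any status other than 0 or 2 as `"error"` is an assumption.

```python
import json
import subprocess


def run_stop_hook(cmd, payload, cwd=None, timeout=30):
    """Run one Stop hook and translate its exit status into a decision string.

    cmd     -- argv list for the hook, e.g. ["bash", "path/to/hook.sh"]
    payload -- dict serialised to JSON and written to the hook's stdin
    cwd     -- working directory, e.g. the materialised tmp git repo
    """
    proc = subprocess.run(
        cmd,
        input=json.dumps(payload),
        capture_output=True,
        text=True,
        cwd=cwd,
        timeout=timeout,
    )
    if proc.returncode == 2:
        return "block"  # exit code 2: the hook refuses to let the agent stop
    if proc.returncode == 0:
        return "pass"
    return "error"  # assumed: anything else is a hook failure, not a verdict
```

Keeping `"error"` separate from `"pass"` means a missing binary or a crashed hook does not get silently counted as a pass in the precision/recall numbers.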
Where the two hooks agree is corroboration. Where they disagree is the
interesting signal — the disagreement tells you which evidence stream caught the
failure mode the other missed.
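One way to surface that signal is to bucket each fixture by which hook blocked. The sketch below assumes each result row carries hypothetical `id`, `verify` and `no_vibes` keys; the runner's real column names may differ.

```python
def triangulate(rows):
    """Group fixture ids by which evidence stream produced a block decision."""
    buckets = {"both_block": [], "both_pass": [], "text_only": [], "operator_only": []}
    for row in rows:
        text_blocks = row["no_vibes"] == "block"     # closeout-text signal
        operator_blocks = row["verify"] == "block"   # git diff / verifier-log signal
        if text_blocks and operator_blocks:
            key = "both_block"
        elif text_blocks:
            key = "text_only"
        elif operator_blocks:
            key = "operator_only"
        else:
            key = "both_pass"
        buckets[key].append(row["id"])
    return buckets
```

The `text_only` and `operator_only` buckets are the disagreement rows; the two `both_*` buckets are the corroborated ones.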
Headline (expected, from dry-run)
The dry-run numbers are the operator's expected outcome based on a manual
reading of each hook's rules, not measured F1 from a real engine run. They
are intended as the ground truth the parity script verifies the hooks against,
not a benchmark claim.
The disagreement pattern is the interesting part — five fixtures sit at the
seam between the two signal sources:
- […] modified. no-vibes catches the wrap-up tail; verify-before-stop passes (correctly, since there's nothing to verify). 3.1 manifesting as text-only.
- […] src/api/users.ts and src/api/auth.ts..."). verify-before-stop blocks on the dirty tree; no-vibes passes because the text has no wrap-up signature. 3.1 manifesting as operator-state-only.
- […] `wrap_up.passes_partial_blocked` pattern from the existing closeout fixtures). no-vibes correctly passes. verify-before-stop blocks because it doesn't read closeout text and the tree is still dirty, a known stricter-than-text behaviour, not a bug.
How to run
Standard library only on the runner side. No pip installs.
Honest caveats
1. […] construction. They are not human-labelled real traces and we make no claim that hook F1 on this corpus generalises to MAD or DarkBench-style data.
2. […] run. A future commit can record measured numbers once both hooks are run in CI; the dry-run is the contract.
3. `expected_no_vibes_decision` may drift if the underlying rule pack (`rules/closeout/wrap_up.yml`, `cliffhanger.yml`, `no_cherry_pick_rollup.yml`) evolves. I'd suggest re-deriving expectations whenever the rule pack version bumps.
4. […] two hooks be compared on a mode neither has been measured against. The natural next step is expanding the n=19 human-labelled subset toward category-3 examples, at which point the synthetic corpus retires.
Files touched
License: Apache-2.0, matching the rest of the repo.
Next steps (out of scope for this PR)
- […] surface.
- […] corpus in favour of human-labelled traces and keep `parity_runner.py` as the multi-hook driver.
- […] (e.g. test runner exit code, build status) so additional hooks can be slotted in without changing fixture schema.