Add synthetic MAST 3.1 fixture corpus + parity runner for verify-before-stop / no-vibes by ianymu · Pull Request #12 · waitdeadai/agent-closeout-bench

ianymu · 2026-05-21T06:41:21Z

Synthetic MAST mode 3.1 corpus + parity runner (verify-before-stop × no-vibes)

What this PR adds

- evaluation/synthetic_mast_3_1/synthetic-3.1-corpus/: 20 hand-authored fixtures
  exercising MAST mode 3.1 (Premature Termination): 5 pure-premature, 5 mid-task
  stop, 5 wrap-up vocabulary, 5 negative controls.
- evaluation/synthetic_mast_3_1/parity_runner.py: standalone (stdlib-only)
  runner that drives two Stop-event hooks against the same fixtures:
  - verify-before-stop.sh (operator-side signal: git diff + verifier log)
  - no-vibes.sh (text-side signal: closeout vocabulary via agentcloseout-physics)
- evaluation/synthetic_mast_3_1/README.md: usage and design notes.
- evaluation/synthetic_mast_3_1/dry_run_results.csv: expected outcomes,
  generated by --dry-run so reviewers can validate the corpus without invoking
  any hook.

Why

Follow-up on anthropics/claude-code#46957
where Fernando noted:

"Our n=19 human-labelled subset has zero positive votes for MAST mode 3.1 —
we can't measure verify-before-stop against 3.1 yet."

This PR is the synthetic stopgap I committed to in that thread:

"I can put together a synthetic-3.1 corpus from the verify-before-stop log
format (filenames touched × VERIFIED-entry presence as operator-side ground
truth) and open a PR against agent-closeout-bench with a parity script that
runs verify-before-stop and no-vibes against the same traces."

Methodology: signal-source × MAST-mode triangulation

The two hooks attack 3.1 through different evidence streams:

| Hook | Signal source | Fires when |
| --- | --- | --- |
| `no-vibes.sh` | Closeout text content | wrap-up vocabulary, cliffhanger framing, unsupported rollup |
| `verify-before-stop.sh` | Operator-side state | files modified on disk + no recent `VERIFIED` entry in `.claude/state/stop-verify.log` |

Each fixture in the corpus carries both a closeout_text (consumed by no-vibes)
and an operator_state block (files_touched, verify_log_entries,
verify_log_age_seconds, files_committed). The parity runner materialises the
operator state into a tmpdir git repo, runs verify-before-stop from that cwd,
then runs no-vibes with the same closeout text. Both invocations follow the
standard exit-code-2 contract.
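
As a rough illustration of the exit-code-2 contract described above, the sketch below feeds a Stop-event payload to a hook on stdin and maps the exit code to a decision. It is not the code in parity_runner.py. The fixture values, the payload keys, the `run_hook` helper, and the choice to treat every exit code other than 2 as a pass are all assumptions made for the example; only the fixture field names come from this description. Two stand-in Python one-liners replace the real hook scripts so the sketch runs anywhere.

```python
import json
import subprocess
import sys

# Hypothetical fixture: field names follow the PR description, values are made up.
fixture = {
    "closeout_text": "All done, let me know if you need anything else.",
    "operator_state": {
        "files_touched": ["src/api/users.ts"],
        "verify_log_entries": [],
        "verify_log_age_seconds": None,
        "files_committed": [],
    },
}


def run_hook(cmd, payload, cwd="."):
    """Send a JSON payload to a hook on stdin and map its exit code to a decision.

    Assumption: exit code 2 means "block"; anything else is treated as "pass".
    """
    proc = subprocess.run(
        cmd,
        input=json.dumps(payload),
        text=True,
        capture_output=True,
        cwd=cwd,
    )
    return "block" if proc.returncode == 2 else "pass"


# Stand-in hooks (NOT the real scripts): one always blocks, one always passes.
blocking_hook = [sys.executable, "-c", "import sys; sys.stdin.read(); sys.exit(2)"]
passing_hook = [sys.executable, "-c", "import sys; sys.stdin.read(); sys.exit(0)"]

# Hypothetical payload keys; the real Stop-event shape is defined by the hooks.
payload = {"hook_event_name": "Stop", "closeout_text": fixture["closeout_text"]}

print(run_hook(blocking_hook, payload))
print(run_hook(passing_hook, payload))
```

In a real run the command list would point at verify-before-stop.sh or no-vibes.sh, and `cwd` would be the temporary git repository built from the fixture's operator_state.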

Where the two hooks agree is corroboration. Where they disagree is the
interesting signal — the disagreement tells you which evidence stream caught the
failure mode the other missed.

Headline (expected, from dry-run)

fixtures: 20 (positive=15, negative=5)
verify-before-stop:  TP=10 FP=1 FN=5 TN=4  P=0.9091  R=0.6667  F1=0.7692
no-vibes:            TP=12 FP=0 FN=3 TN=5  P=1.0     R=0.8     F1=0.8889
inter-hook agreement (Cohen κ): 0.4898
disagreements: 5/20

The dry-run numbers are the operator's expected outcome based on a manual
reading of each hook's rules, not measured F1 from a real engine run. They
are intended as the ground truth the parity script verifies the hooks against,
not a benchmark claim.
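
The headline precision, recall, and F1 values can be re-derived from the confusion counts alone. The sketch below is a minimal reimplementation for checking the arithmetic, not the runner's own code; the function name `prf1` is borrowed from the runner, but the body and the rounding to four places are assumptions.

```python
def prf1(tp, fp, fn):
    """Return (precision, recall, F1) rounded to 4 places, guarding zero denominators."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return round(precision, 4), round(recall, 4), round(f1, 4)


# Counts taken from the headline block above.
print("verify-before-stop:", prf1(tp=10, fp=1, fn=5))  # (0.9091, 0.6667, 0.7692)
print("no-vibes:          ", prf1(tp=12, fp=0, fn=3))  # (1.0, 0.8, 0.8889)
```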

The disagreement pattern is the interesting part — five fixtures sit at the
seam between the two signal sources:

- A02 / A03 / A04: closeout text uses wrap-up vocabulary, but no files were
  modified. no-vibes catches the wrap-up tail; verify-before-stop passes
  (correctly, since there's nothing to verify). 3.1 manifesting as text-only.
- B04: 3 files modified with a terse specific closeout ("Modified
  src/api/users.ts and src/api/auth.ts..."). verify-before-stop blocks on the
  dirty tree; no-vibes passes because the text has no wrap-up signature.
  3.1 manifesting as operator-state-only.
- D03: honest partial-completion closeout (the wrap_up.passes_partial_blocked
  pattern from the existing closeout fixtures). no-vibes correctly passes.
  verify-before-stop blocks because it doesn't read closeout text and the tree
  is still dirty: a known stricter-than-text behaviour, not a bug.
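
The inter-hook κ follows from the same counts. verify-before-stop blocks 11 fixtures (TP + FP) and no-vibes blocks 12; the five disagreements split into two verify-only blocks (B04, D03) and three no-vibes-only blocks (A02, A03, A04), which leaves 9 fixtures blocked by both and 6 passed by both. The sketch below recomputes κ from that 2×2 table. It is an illustrative reimplementation, and returning 1.0 when expected agreement is exactly 1.0 is an assumption about how the degenerate case should be handled.

```python
def cohens_kappa(both_block, a_only, b_only, both_pass):
    """Cohen's kappa for two binary raters, given the four cells of their 2x2 table."""
    n = both_block + a_only + b_only + both_pass
    observed = (both_block + both_pass) / n
    a_block = (both_block + a_only) / n  # rater A's block rate
    b_block = (both_block + b_only) / n  # rater B's block rate
    expected = a_block * b_block + (1 - a_block) * (1 - b_block)
    if expected == 1.0:
        # Degenerate case: both raters are constant and identical (assumed handling).
        return 1.0
    return (observed - expected) / (1 - expected)


# A = verify-before-stop, B = no-vibes; cells derived from the headline numbers.
kappa = cohens_kappa(both_block=9, a_only=2, b_only=3, both_pass=6)
print(round(kappa, 4))  # 0.4898
```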

How to run

# dry-run (no hooks invoked, validates corpus + expected outcomes)
python3 evaluation/synthetic_mast_3_1/parity_runner.py \
    --corpus evaluation/synthetic_mast_3_1/synthetic-3.1-corpus/ \
    --dry-run \
    --output evaluation/synthetic_mast_3_1/dry_run_results.csv

# real run (requires both hook scripts; no-vibes also needs the Rust engine
# built — same as your existing run_bash_parity.py)
python3 evaluation/synthetic_mast_3_1/parity_runner.py \
    --corpus evaluation/synthetic_mast_3_1/synthetic-3.1-corpus/ \
    --verify-hook /path/to/claude-verify-before-stop/verify-before-stop.sh \
    --no-vibes-hook adapters/claude-code/hooks/no-vibes.sh \
    --output evaluation/synthetic_mast_3_1/measured_results.csv

Standard library only on the runner side. No pip installs.

Honest caveats

1. Synthetic ≠ in-the-wild. These fixtures are operator-side ground truth by
   construction. They are not human-labelled real traces and we make no claim
   that hook F1 on this corpus generalises to MAD or DarkBench-style data.
2. Expected decisions are derived from hook rules, not from a parity engine
   run. A future commit can record measured numbers once both hooks are run
   in CI; the dry-run is the contract.
3. expected_no_vibes_decision may drift if the underlying rule pack
   (rules/closeout/wrap_up.yml, cliffhanger.yml, no_cherry_pick_rollup.yml)
   evolves. I'd suggest re-deriving expectations whenever the rule pack version
   bumps.
4. The corpus is 20 fixtures, not 200. It's the smallest thing that lets the
   two hooks be compared on a mode neither has been measured against. The
   natural next step is expanding the n=19 human-labelled subset toward
   category-3 examples, at which point the synthetic corpus retires.

Files touched

evaluation/synthetic_mast_3_1/
├── README.md
├── parity_runner.py
├── dry_run_results.csv
└── synthetic-3.1-corpus/
    ├── README.md
    ├── A01_pure_premature_done.json
    ├── A02_task_complete_no_work.json
    ├── A03_cliffhanger_no_files.json
    ├── A04_premature_hope_helps.json
    ├── A05_silent_handoff.json
    ├── B01_mid_task_implementation_complete.json
    ├── B02_mid_task_all_done.json
    ├── B03_mid_task_stale_verify.json
    ├── B04_mid_task_quiet_closeout.json
    ├── B05_mid_task_only_verify_action.json
    ├── C01_summarize_dirty.json
    ├── C02_in_conclusion_dirty.json
    ├── C03_overall_dirty.json
    ├── C04_let_me_know_dirty.json
    ├── C05_summarize_one_dirty.json
    ├── D01_read_only_session.json
    ├── D02_verified_completion.json
    ├── D03_partial_blocked.json
    ├── D04_clean_tree_specific_answer.json
    └── D05_bounded_choice.json

License: Apache-2.0, matching the rest of the repo.

Next steps (out of scope for this PR)

- Wire the runner into CI on a smoke fixture so regressions in either hook
  surface.
- Once the n=19 subset is expanded toward 3.1 positives, retire the synthetic
  corpus in favour of human-labelled traces and keep parity_runner.py as the
  multi-hook driver.
- Extend the operator-state schema to capture additional evidence streams
  (e.g. test runner exit code, build status) so additional hooks can be slotted
  in without changing fixture schema.

Adds 20 synthetic fixtures targeting MAST mode 3.1 (Premature Termination) that the current human-labelled n=19 subset has zero positive votes for. Includes parity_runner.py that drives both verify-before-stop and no-vibes through identical fixture payloads — produces precision/recall/F1 per hook plus inter-hook Cohen κ for the signal-source-vs-MAST-mode triangulation framing discussed on anthropics/claude-code#46957. Co-authored-by: Ian Mu <ian.y.mu@gmail.com>

waitdeadai

Thanks for this, Yanlong — this is the synthetic-3.1 stopgap I floated in anthropics/claude-code#46957, and it's built with real care for a first external PR. The signal-source × MAST-mode framing is the right idea, the runner is clean stdlib-only code, and the caveats section pre-empts the co-evolved-corpus trap (a synthetic set the rules were tuned on producing a fake F1=1.0) directly. I want to merge a version of this. A few things need to change first; the first is load-bearing.

Strengths

parity_runner.py is correct where it counts. materialise_operator_state builds a real tmp git repo with a baseline commit so git diff has something to diff against, sets local user.email/user.name so it works in fresh CI envs, and times the stop-verify.log entries off now − verify_log_age_seconds. classify/prf1/cohens_kappa are textbook and the κ pe==1.0 guard is a nice touch. The Stop-event JSON shape matches the existing evaluation/mast/run_bash_parity.py contract.
The honesty discipline is right. dry_run_results.csv is framed as the operator's manually-derived expectation, not a measured run, and the README makes no synthetic-to-in-the-wild generalization claim. That matches the LIMITATIONS.md / CLAIM_LEDGER.md conventions here.
The negative controls (group D) are well-chosen — D02 verified-completion, D03 honest partial-blocked (mirrors fixtures/closeout/wrap_up.jsonl wrap_up_passes_specific_partial), D05 bounded-choice. The deliberate verify-vs-no-vibes disagreement rows (B04, D03) are the most informative part.

Required changes

1. (blocking) The expected_no_vibes_decision labels are keyed to the wrong hook. The shipped adapters/claude-code/hooks/no-vibes.sh runs only run_agentcloseout_physics_hook evidence_claims (line 9). wrap_up, cliffhanger, and no_cherry_pick_rollup are separate hooks (no-wrap-up.sh, no-cliffhanger.sh). But A02/A03/A04, B02/B03/B05, and C01–C05 set expected_no_vibes_decision: block with rationales citing "wrap_up generic tail", "cliffhanger rule", and "no_cherry_pick_rollup". Those rules aren't in no-vibes' path. A real --no-vibes-hook adapters/claude-code/hooks/no-vibes.sh run will pass most of them, so the committed dry-run CSV (no-vibes F1 0.889) wouldn't survive a measured run. Two acceptable fixes: (a) re-derive expectations against the evidence_claims rule pack only, or (b) target the combined family explicitly and drive no-wrap-up.sh / no-cliffhanger.sh / no-vibes.sh as separate columns (arguably the better design and fits the multi-hook driver goal in your Next Steps). Either way the rationales must name the hook that actually fires.
2. (should-fix) MAST 3.1 mapping: align with this repo's existing source of truth, evaluation/mast/mast_hook_map.py, which already declares "3.1": ["no_cherry_pick_rollup", "cliffhanger", "wrap_up"]. The new corpus re-derives the 3.1 hook set independently in a parallel evaluation/synthetic_mast_3_1/ dir. Importing HOOK_TO_MAST["3.1"] (as run_bash_parity.py does) keeps the two from drifting and makes provenance explicit.
3. (should-fix) Nothing wires this into CI and dry_run_results.csv is committed static output, so it can rot against rule-pack bumps, which is the drift you flag in caveat #3. Minimum bar: add tests/test_synthetic_3_1.py that runs the runner in --dry-run and asserts the CSV is byte-stable (or regenerates and diffs). pytest already runs in ci.yml; cheapest regression guard.
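
A byte-stability test along the lines of the last item might look like the sketch below. It assumes pytest is invoked from the repository root and uses only the --corpus, --dry-run, and --output flags shown in the PR description; the helper names `dry_run_command` and `same_bytes` are hypothetical.

```python
import subprocess
import sys
from pathlib import Path

# Paths from the PR description, relative to the repo root (assumed working directory).
BASE = Path("evaluation/synthetic_mast_3_1")


def dry_run_command(output):
    """Build the dry-run invocation that writes its CSV to `output`."""
    return [
        sys.executable,
        str(BASE / "parity_runner.py"),
        "--corpus", str(BASE / "synthetic-3.1-corpus"),
        "--dry-run",
        "--output", str(output),
    ]


def same_bytes(a, b):
    """True when both files have identical contents, byte for byte."""
    return Path(a).read_bytes() == Path(b).read_bytes()


def test_dry_run_csv_is_byte_stable(tmp_path):
    # Regenerate into a temp dir so the committed CSV is never overwritten.
    regenerated = tmp_path / "dry_run_results.csv"
    subprocess.run(dry_run_command(regenerated), check=True)
    assert same_bytes(regenerated, BASE / "dry_run_results.csv")
```

Comparing bytes rather than parsed rows also catches changes in column order or line endings, which is stricter than a semantic diff and may or may not be what the maintainers want.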

Smaller notes (non-blocking)

verify-before-stop.sh isn't in this repo and waitdeadai/claude-verify-before-stop doesn't exist publicly yet, so the verify side of the parity can't run in CI today. Fine for landing the corpus, but state it in the README so a reader doesn't assume a measured verify column is reproducible here.
Consider folding this under evaluation/mast/ rather than a sibling evaluation/synthetic_mast_3_1/ dir, to keep the MAST work in one place.
parity_runner.py writes rows[0].keys() as the CSV header and will IndexError on an empty corpus (e.g. --max-fixtures 0). One-line guard.
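
The empty-corpus guard from the last note could be as small as the sketch below. The `write_rows` helper is hypothetical and only illustrates the shape of the fix; whether an empty corpus should produce an empty file, a header-only file, or an error is a design choice left to the author.

```python
import csv
import io


def write_rows(rows, handle):
    """Write dict rows as CSV; return the row count, writing nothing when rows is empty."""
    if not rows:
        # Guard: rows[0] would raise IndexError on an empty corpus.
        return 0
    writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return len(rows)


# Illustrative rows; the column names here are made up for the example.
buffer = io.StringIO()
write_rows([{"fixture": "A01", "decision": "block"}], buffer)
print(buffer.getvalue())

empty = io.StringIO()
print(write_rows([], empty))  # 0, and nothing is written
```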

Scope: this is request-changes, not a rejection. Fix the no-vibes labeling (1) and I'll take it; (2) and (3) can land in the same push or a fast follow. Appreciate the rigor.

waitdeadai · 2026-06-01T20:22:47Z

Hi @ianymu — circling back on this so it doesn't go stale. To be clear up front: I want this in, and the changes I asked for are small relative to how solid the contribution is.

The only load-bearing item is #1 (the no-vibes labeling). As shipped, adapters/claude-code/hooks/no-vibes.sh runs just run_agentcloseout_physics_hook evidence_claims — the wrap_up, cliffhanger, and no_cherry_pick_rollup rules your A02/A03/A04, B02/B03/B05 and C01–C05 rationales cite live in no-wrap-up.sh / no-cliffhanger.sh. So a real --no-vibes-hook adapters/claude-code/hooks/no-vibes.sh run would pass most of those rows and the committed no-vibes F1 (0.889) wouldn't reproduce. Either re-derive the no-vibes expectations against evidence_claims only, or — which I think is the better design and fits your "multi-hook driver" Next Step — add no-wrap-up.sh / no-cliffhanger.sh as their own columns and point each rationale at the hook that actually fires.

Items #2 (import HOOK_TO_MAST["3.1"] from evaluation/mast/mast_hook_map.py instead of re-deriving the 3.1 hook set) and #3 (a tiny tests/test_synthetic_3_1.py that runs --dry-run and asserts the CSV is byte-stable, since pytest already runs in CI) are happy to land as a fast follow if you'd rather not block on them — your call.
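
For item #2 combined with the separate-columns option, one possible shape is to derive the expected-decision columns from the shared mapping instead of hard-coding them. In the sketch below the dict is a stand-in copy of the "3.1" entry quoted in the review, used so the example runs outside the repo; the `expected_<rule>_decision` column naming and both helper functions are hypothetical.

```python
# Stand-in for importing HOOK_TO_MAST from evaluation/mast/mast_hook_map.py.
# Only the "3.1" entry is reproduced, as quoted in the review; inside the repo
# the real mapping would be imported rather than copied.
HOOK_TO_MAST = {"3.1": ["no_cherry_pick_rollup", "cliffhanger", "wrap_up"]}


def expected_columns(mode):
    """One expected-decision column per rule mapped to the given MAST mode."""
    return [f"expected_{rule}_decision" for rule in HOOK_TO_MAST[mode]]


def missing_expectations(fixture, mode="3.1"):
    """List the expected-decision columns a fixture has not filled in yet."""
    return [col for col in expected_columns(mode) if col not in fixture]


print(expected_columns("3.1"))
print(missing_expectations({"expected_wrap_up_decision": "block"}))
```

Deriving the columns this way means a change to the mapping surfaces as missing fixture fields rather than as silent drift between the two directories.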

No rush on a timeline, just wanted to make sure you weren't waiting on me. The signal-source × MAST-mode triangulation framing and the honesty discipline in the README/caveats are exactly right, and the B04/D03 disagreement rows are the most interesting part of the whole thing. Thanks again for picking this up from the issue thread.

Yanlong Mu and others added 2 commits May 21, 2026 14:40

Move fixture JSONs into synthetic-3.1-corpus/ subdirectory

92170af

Aligns directory layout with the path referenced throughout README.md, parity_runner.py, and PR description (evaluation/synthetic_mast_3_1/synthetic-3.1-corpus/).

This was referenced May 21, 2026

Claude fabricates comparison tables and repeatedly lies about verification results (3rd incident) anthropics/claude-code#46957

Open

Opus 4.7 produces structurally correct code that silently discards user input anthropics/claude-code#61107

Open

waitdeadai requested changes May 25, 2026

ianymu wants to merge 2 commits into waitdeadai:main from ianymu:add-synthetic-3.1-corpus
