Add synthetic MAST 3.1 fixture corpus + parity runner for verify-before-stop / no-vibes#12

Open
ianymu wants to merge 2 commits into waitdeadai:main from ianymu:add-synthetic-3.1-corpus

@ianymu ianymu commented May 21, 2026

Synthetic MAST mode 3.1 corpus + parity runner (verify-before-stop × no-vibes)

What this PR adds

  • evaluation/synthetic_mast_3_1/synthetic-3.1-corpus/ — 20 hand-authored fixtures
    exercising MAST mode 3.1 (Premature Termination): 5 pure-premature, 5 mid-task
    stop, 5 wrap-up vocabulary, 5 negative controls.
  • evaluation/synthetic_mast_3_1/parity_runner.py — standalone (stdlib-only)
    runner that drives two Stop-event hooks against the same fixtures:
    • verify-before-stop.sh (operator-side signal: git diff + verifier log)
    • no-vibes.sh (text-side signal: closeout vocabulary via agentcloseout-physics)
  • evaluation/synthetic_mast_3_1/README.md — usage and design notes.
  • evaluation/synthetic_mast_3_1/dry_run_results.csv — expected outcomes,
    generated by --dry-run so reviewers can validate the corpus without invoking
    any hook.

Why

Follow-up on anthropics/claude-code#46957, where Fernando noted:

"Our n=19 human-labelled subset has zero positive votes for MAST mode 3.1 —
we can't measure verify-before-stop against 3.1 yet."

This PR is the synthetic stopgap I committed to in that thread:

"I can put together a synthetic-3.1 corpus from the verify-before-stop log
format (filenames touched × VERIFIED-entry presence as operator-side ground
truth) and open a PR against agent-closeout-bench with a parity script that
runs verify-before-stop and no-vibes against the same traces."

Methodology: signal-source × MAST-mode triangulation

The two hooks attack 3.1 through different evidence streams:

| Hook | Signal source | Fires when |
| --- | --- | --- |
| no-vibes.sh | Closeout text content | wrap-up vocabulary, cliffhanger framing, unsupported rollup |
| verify-before-stop.sh | Operator-side state | files modified on disk + no recent VERIFIED entry in .claude/state/stop-verify.log |

Each fixture in the corpus carries both a closeout_text (consumed by no-vibes)
and an operator_state block (files_touched, verify_log_entries,
verify_log_age_seconds, files_committed). The parity runner materialises the
operator state into a tmpdir git repo, runs verify-before-stop from that cwd,
then runs no-vibes with the same closeout text. Both invocations follow the
standard exit-code-2 contract.
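
As a rough illustration of the exit-code-2 contract described above, here is a minimal sketch of driving one hook with a Stop-event payload on stdin. It is not the code in parity_runner.py. The helper name `run_stop_hook`, the payload field names, and the choice to treat every exit code other than 2 as a pass are assumptions made for the example; the stand-in hooks are tiny Python one-liners so the sketch runs without either real hook script.

```python
import json
import subprocess
import sys


def run_stop_hook(cmd, payload, cwd=None, timeout=30):
    """Send a JSON payload to a hook on stdin and map its exit code to a decision.

    Assumption: exit code 2 means "block"; anything else is treated as "pass".
    """
    proc = subprocess.run(
        cmd,
        input=json.dumps(payload),
        capture_output=True,
        text=True,
        cwd=cwd,          # the tmpdir git repo when driving an operator-side hook
        timeout=timeout,
    )
    return "block" if proc.returncode == 2 else "pass"


# Hypothetical payload: these field names are illustrative, not the real schema.
payload = {"hook_event_name": "Stop", "closeout_text": "All done, let me know!"}

# Stand-in hooks that read stdin and exit with a fixed code.
blocking_hook = [sys.executable, "-c", "import sys; sys.stdin.read(); sys.exit(2)"]
passing_hook = [sys.executable, "-c", "import sys; sys.stdin.read(); sys.exit(0)"]

decisions = {
    "blocking": run_stop_hook(blocking_hook, payload),
    "passing": run_stop_hook(passing_hook, payload),
}
print(decisions)
```

Swapping a stand-in for a real hook path would only change `cmd` and `cwd`; the decision mapping stays the same.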

Where the two hooks agree is corroboration. Where they disagree is the
interesting signal — the disagreement tells you which evidence stream caught the
failure mode the other missed.

Headline (expected, from dry-run)

fixtures: 20 (positive=15, negative=5)
verify-before-stop:  TP=10 FP=1 FN=5 TN=4  P=0.9091  R=0.6667  F1=0.7692
no-vibes:            TP=12 FP=0 FN=3 TN=5  P=1.0     R=0.8     F1=0.8889
inter-hook agreement (Cohen κ): 0.4898
disagreements: 5/20

The dry-run numbers are the operator's expected outcome based on a manual
reading of each hook's rules, not measured F1 from a real engine run. They
are intended as the ground truth the parity script verifies the hooks against,
not a benchmark claim.
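
For reviewers who want to check the headline arithmetic by hand, the sketch below recomputes precision, recall, F1, and Cohen's κ from the counts above. The function names are illustrative and need not match those in parity_runner.py. The joint counts (both block = 9, both pass = 6, only verify-before-stop blocks = 2, only no-vibes blocks = 3) are derived from the confusion counts and the 5/20 disagreements, not copied from the CSV, and returning 1.0 when chance agreement is 1.0 is an assumed convention.

```python
def prf1(tp, fp, fn):
    """Precision, recall, and F1 from confusion counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


def cohens_kappa(both_block, both_pass, only_a_blocks, only_b_blocks):
    """Cohen's kappa for two binary raters from their joint decision counts."""
    n = both_block + both_pass + only_a_blocks + only_b_blocks
    po = (both_block + both_pass) / n                    # observed agreement
    a_block = both_block + only_a_blocks
    b_block = both_block + only_b_blocks
    pe = (a_block * b_block + (n - a_block) * (n - b_block)) / n ** 2  # chance agreement
    if pe == 1.0:                                        # assumed convention for the degenerate case
        return 1.0
    return (po - pe) / (1 - pe)


verify = tuple(round(x, 4) for x in prf1(tp=10, fp=1, fn=5))
no_vibes = tuple(round(x, 4) for x in prf1(tp=12, fp=0, fn=3))
kappa = round(cohens_kappa(both_block=9, both_pass=6, only_a_blocks=2, only_b_blocks=3), 4)

print(verify)    # (0.9091, 0.6667, 0.7692)
print(no_vibes)  # (1.0, 0.8, 0.8889)
print(kappa)     # 0.4898
```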

The disagreement pattern is the interesting part — five fixtures sit at the
seam between the two signal sources:

  • A02 / A03 / A04: closeout text uses wrap-up vocabulary, but no files were
    modified. no-vibes catches the wrap-up tail; verify-before-stop passes
    (correctly — there's nothing to verify). 3.1 manifesting as text-only.
  • B04: 3 files modified with a terse specific closeout ("Modified
    src/api/users.ts and src/api/auth.ts..."). verify-before-stop blocks on the
    dirty tree; no-vibes passes because the text has no wrap-up signature.
    3.1 manifesting as operator-state-only.
  • D03: honest partial-completion closeout (the wrap_up.passes_partial_blocked
    pattern from the existing closeout fixtures). no-vibes correctly passes.
    verify-before-stop blocks because it doesn't read closeout text and the tree
    is still dirty — a known stricter-than-text behaviour, not a bug.

How to run

# dry-run (no hooks invoked, validates corpus + expected outcomes)
python3 evaluation/synthetic_mast_3_1/parity_runner.py \
    --corpus evaluation/synthetic_mast_3_1/synthetic-3.1-corpus/ \
    --dry-run \
    --output evaluation/synthetic_mast_3_1/dry_run_results.csv

# real run (requires both hook scripts; no-vibes also needs the Rust engine
# built — same as your existing run_bash_parity.py)
python3 evaluation/synthetic_mast_3_1/parity_runner.py \
    --corpus evaluation/synthetic_mast_3_1/synthetic-3.1-corpus/ \
    --verify-hook /path/to/claude-verify-before-stop/verify-before-stop.sh \
    --no-vibes-hook adapters/claude-code/hooks/no-vibes.sh \
    --output evaluation/synthetic_mast_3_1/measured_results.csv

Standard library only on the runner side. No pip installs.

Honest caveats

  1. Synthetic ≠ in-the-wild. These fixtures are operator-side ground truth by
    construction. They are not human-labelled real traces and we make no claim
    that hook F1 on this corpus generalises to MAD or DarkBench-style data.
  2. Expected decisions are derived from hook rules, not from a parity engine
    run. A future commit can record measured numbers once both hooks are run
    in CI; the dry-run is the contract.
  3. expected_no_vibes_decision may drift if the underlying rule pack
    (rules/closeout/wrap_up.yml, cliffhanger.yml, no_cherry_pick_rollup.yml)
    evolves. I'd suggest re-deriving expectations whenever the rule pack version
    bumps.
  4. The corpus is 20 fixtures, not 200. It's the smallest thing that lets the
    two hooks be compared on a mode neither has been measured against. The
    natural next step is expanding the n=19 human-labelled subset toward
    category-3 examples — at which point the synthetic corpus retires.

Files touched

evaluation/synthetic_mast_3_1/
├── README.md
├── parity_runner.py
├── dry_run_results.csv
└── synthetic-3.1-corpus/
    ├── README.md
    ├── A01_pure_premature_done.json
    ├── A02_task_complete_no_work.json
    ├── A03_cliffhanger_no_files.json
    ├── A04_premature_hope_helps.json
    ├── A05_silent_handoff.json
    ├── B01_mid_task_implementation_complete.json
    ├── B02_mid_task_all_done.json
    ├── B03_mid_task_stale_verify.json
    ├── B04_mid_task_quiet_closeout.json
    ├── B05_mid_task_only_verify_action.json
    ├── C01_summarize_dirty.json
    ├── C02_in_conclusion_dirty.json
    ├── C03_overall_dirty.json
    ├── C04_let_me_know_dirty.json
    ├── C05_summarize_one_dirty.json
    ├── D01_read_only_session.json
    ├── D02_verified_completion.json
    ├── D03_partial_blocked.json
    ├── D04_clean_tree_specific_answer.json
    └── D05_bounded_choice.json

License: Apache-2.0, matching the rest of the repo.

Next steps (out of scope for this PR)

  • Wire the runner into CI on a smoke fixture so regressions in either hook
    surface.
  • Once the n=19 subset is expanded toward 3.1 positives, retire the synthetic
    corpus in favour of human-labelled traces and keep parity_runner.py as the
    multi-hook driver.
  • Extend the operator-state schema to capture additional evidence streams
    (e.g. test runner exit code, build status) so additional hooks can be slotted
    in without changing fixture schema.

Yanlong Mu and others added 2 commits May 21, 2026 14:40
Adds 20 synthetic fixtures targeting MAST mode 3.1 (Premature Termination)
that the current human-labelled n=19 subset has zero positive votes for.

Includes parity_runner.py that drives both verify-before-stop and no-vibes
through identical fixture payloads — produces precision/recall/F1 per hook
plus inter-hook Cohen κ for the signal-source-vs-MAST-mode triangulation
framing discussed on anthropics/claude-code#46957.

Co-authored-by: Ian Mu <ian.y.mu@gmail.com>

Aligns directory layout with the path referenced throughout README.md,
parity_runner.py, and PR description (evaluation/synthetic_mast_3_1/synthetic-3.1-corpus/).

@waitdeadai waitdeadai left a comment

Owner

Thanks for this, Yanlong — this is the synthetic-3.1 stopgap I floated in anthropics/claude-code#46957, and it's built with real care for a first external PR. The signal-source × MAST-mode framing is the right idea, the runner is clean stdlib-only code, and the caveats section pre-empts the co-evolved-corpus trap (a synthetic set the rules were tuned on producing a fake F1=1.0) directly. I want to merge a version of this. A few things need to change first; the first is load-bearing.

Strengths

  • parity_runner.py is correct where it counts. materialise_operator_state builds a real tmp git repo with a baseline commit so git diff has something to diff against, sets local user.email/user.name so it works in fresh CI envs, and times the stop-verify.log entries off now − verify_log_age_seconds. classify/prf1/cohens_kappa are textbook and the κ pe==1.0 guard is a nice touch. The Stop-event JSON shape matches the existing evaluation/mast/run_bash_parity.py contract.
  • The honesty discipline is right. dry_run_results.csv is framed as the operator's manually-derived expectation, not a measured run, and the README makes no synthetic-to-in-the-wild generalization claim. That matches the LIMITATIONS.md / CLAIM_LEDGER.md conventions here.
  • The negative controls (group D) are well-chosen — D02 verified-completion, D03 honest partial-blocked (mirrors fixtures/closeout/wrap_up.jsonl wrap_up_passes_specific_partial), D05 bounded-choice. The deliberate verify-vs-no-vibes disagreement rows (B04, D03) are the most informative part.

Required changes

  1. (blocking) The expected_no_vibes_decision labels are keyed to the wrong hook. The shipped adapters/claude-code/hooks/no-vibes.sh runs only run_agentcloseout_physics_hook evidence_claims (line 9). wrap_up, cliffhanger, and no_cherry_pick_rollup are separate hooks (no-wrap-up.sh, no-cliffhanger.sh). But A02/A03/A04, B02/B03/B05, and C01–C05 set expected_no_vibes_decision: block with rationales citing "wrap_up generic tail", "cliffhanger rule", and "no_cherry_pick_rollup". Those rules aren't in no-vibes' path — a real --no-vibes-hook adapters/claude-code/hooks/no-vibes.sh run will pass most of them, so the committed dry-run CSV (no-vibes F1 0.889) wouldn't survive a measured run. Two acceptable fixes: (a) re-derive expectations against the evidence_claims rule pack only, or (b) target the combined family explicitly and drive no-wrap-up.sh / no-cliffhanger.sh / no-vibes.sh as separate columns (arguably the better design and fits the multi-hook driver goal in your Next Steps). Either way the rationales must name the hook that actually fires.

  2. (should-fix) MAST 3.1 mapping: align with this repo's existing source of truth, evaluation/mast/mast_hook_map.py, which already declares "3.1": ["no_cherry_pick_rollup", "cliffhanger", "wrap_up"]. The new corpus re-derives the 3.1 hook set independently in a parallel evaluation/synthetic_mast_3_1/ dir. Importing HOOK_TO_MAST["3.1"] (as run_bash_parity.py does) keeps the two from drifting and makes provenance explicit.

  3. (should-fix) Nothing wires this into CI and dry_run_results.csv is committed static output, so it can rot against rule-pack bumps — the drift you flag in caveat #3. Minimum bar: add tests/test_synthetic_3_1.py that runs the runner in --dry-run and asserts the CSV is byte-stable (or regenerates and diffs). pytest already runs in ci.yml; cheapest regression guard.
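
One possible shape for the regression guard in item 3, as a sketch rather than a finished test file. The `--corpus`, `--dry-run`, and `--output` flags and the three paths are taken from the PR's "How to run" section; the helper names `regenerate_dry_run`, `is_byte_stable`, and `check_committed_csv` are hypothetical, and the sketch assumes pytest is invoked from the repository root.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

RUNNER = Path("evaluation/synthetic_mast_3_1/parity_runner.py")
CORPUS = Path("evaluation/synthetic_mast_3_1/synthetic-3.1-corpus/")
COMMITTED = Path("evaluation/synthetic_mast_3_1/dry_run_results.csv")


def regenerate_dry_run(output: Path) -> None:
    """Re-run the parity runner in dry-run mode, writing a fresh CSV to `output`."""
    subprocess.run(
        [sys.executable, str(RUNNER), "--corpus", str(CORPUS),
         "--dry-run", "--output", str(output)],
        check=True,
    )


def is_byte_stable(committed: Path, regenerate) -> bool:
    """True when a freshly regenerated CSV is byte-identical to the committed one."""
    with tempfile.TemporaryDirectory() as tmp:
        fresh = Path(tmp) / "dry_run_results.csv"
        regenerate(fresh)
        return fresh.read_bytes() == committed.read_bytes()


def check_committed_csv() -> None:
    """Body for a pytest test in tests/test_synthetic_3_1.py (hypothetical wiring)."""
    assert is_byte_stable(COMMITTED, regenerate_dry_run), (
        "dry_run_results.csv is stale; regenerate it with --dry-run and commit"
    )
```

A test function named, say, `test_dry_run_csv_is_byte_stable` could simply call `check_committed_csv()`. Passing the regeneration step in as a callable keeps the comparison itself testable without the runner being present.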

Smaller notes (non-blocking)

  • verify-before-stop.sh isn't in this repo and waitdeadai/claude-verify-before-stop doesn't exist publicly yet, so the verify side of the parity can't run in CI today. Fine for landing the corpus, but state it in the README so a reader doesn't assume a measured verify column is reproducible here.
  • Consider folding this under evaluation/mast/ rather than a sibling evaluation/synthetic_mast_3_1/ dir, to keep the MAST work in one place.
  • parity_runner.py writes rows[0].keys() as the CSV header and will IndexError on an empty corpus (e.g. --max-fixtures 0). One-line guard.
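
For the empty-corpus note, a guard along these lines would avoid the IndexError. This is a sketch only: `write_rows` is a hypothetical helper, not the function in parity_runner.py, and writing nothing at all for an empty corpus (rather than a header-only file) is one possible choice among several.

```python
import csv
import io


def write_rows(rows, fh):
    """Write dict rows as CSV; return the number of data rows written."""
    if not rows:  # guard: rows[0] would raise IndexError on an empty corpus
        return 0
    writer = csv.DictWriter(fh, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return len(rows)


empty_buf = io.StringIO()
empty_count = write_rows([], empty_buf)

full_buf = io.StringIO()
full_count = write_rows([{"fixture": "A01", "decision": "block"}], full_buf)
```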

Scope: this is request-changes, not a rejection. Fix the no-vibes labeling (1) and I'll take it; (2) and (3) can land in the same push or a fast follow. Appreciate the rigor.

@waitdeadai

Owner

Hi @ianymu — circling back on this so it doesn't go stale. To be clear up front: I want this in, and the changes I asked for are small relative to how solid the contribution is.

The only load-bearing item is #1 (the no-vibes labeling). As shipped, adapters/claude-code/hooks/no-vibes.sh runs just run_agentcloseout_physics_hook evidence_claims — the wrap_up, cliffhanger, and no_cherry_pick_rollup rules your A02/A03/A04, B02/B03/B05 and C01–C05 rationales cite live in no-wrap-up.sh / no-cliffhanger.sh. So a real --no-vibes-hook adapters/claude-code/hooks/no-vibes.sh run would pass most of those rows and the committed no-vibes F1 (0.889) wouldn't reproduce. Either re-derive the no-vibes expectations against evidence_claims only, or — which I think is the better design and fits your "multi-hook driver" Next Step — add no-wrap-up.sh / no-cliffhanger.sh as their own columns and point each rationale at the hook that actually fires.

Items #2 (import HOOK_TO_MAST["3.1"] from evaluation/mast/mast_hook_map.py instead of re-deriving the 3.1 hook set) and #3 (a tiny tests/test_synthetic_3_1.py that runs --dry-run and asserts the CSV is byte-stable, since pytest already runs in CI) are happy to land as a fast follow if you'd rather not block on them — your call.

No rush on a timeline, just wanted to make sure you weren't waiting on me. The signal-source × MAST-mode triangulation framing and the honesty discipline in the README/caveats are exactly right, and the B04/D03 disagreement rows are the most interesting part of the whole thing. Thanks again for picking this up from the issue thread.
