waitdeadai · May 25, 2026
diff --git a/.gitignore b/.gitignore
@@ -8,3 +8,5 @@
 # Stress runner output — regenerate locally with `bash tests/stress/run.sh`.
 # CI uploads it as a workflow artifact; no need to commit.
 tests/stress/STRESS-REPORT.md
+__pycache__/
+*.pyc
diff --git a/README.md b/README.md
@@ -94,6 +94,7 @@ The following 9 hooks conceptually target their MAST mode but did not produce me
 | `no-cliffhanger` | 1.5 Unaware of Termination Conditions, 3.1 Premature Termination | `zone: tail` (last 520 chars) is the trajectory tail, not a closeout sentence |
 | `no-aggregator-hallucination`, `no-fake-stats` | 2.6 Action-Reasoning Mismatch | Tuned for supervisor closeouts; synthesis claim buried in trajectory chatter |
 | `no-cherry-pick-rollup`, `no-silent-worker-success`, `no-sandbagging-disguise` | 3.1 / 3.2 Verification failures | Calibrated for supervisor reports, not multi-turn collaboration text |
+| `no-count-drift` | 3.2 No or Incomplete Verification (self-consistency) | Stated count vs the message's own enumeration/arithmetic; deterministic, abstain-on-ambiguity. Proposed by @beq00000 on `recognition-without-arrest-corpus#9` |
 
 The methodology gap is structural: hooks are tuned for individual Claude Code closeout messages; MAD's text is full multi-agent trajectory. Per-message scanning is the planned next experiment ([`MAST-RESULTS.md` §"Next steps"](evaluation/MAST-RESULTS.md)).
 
@@ -114,7 +115,7 @@ Also outside MAD's text-only scope but conceptually a Stage 3 (non-gating) failu
 The active catalog is organized in six branches by mechanism:
 
 - **Interaction-style** (8): catch *how* the model talks. `no-vibes`, `time-anchor`, `no-curfew`, `no-sycophancy`, `no-cliffhanger`, `no-wrap-up`, `no-tldr-bait`, `honest-eta`.
-- **Fact-fabrication** (5): catch *what* the model claims. `no-fake-recall`, `no-fake-stats`, `no-fake-cite`, `no-phantom-tool-call`, `no-rollback-claim-without-evidence`.
+- **Fact-fabrication** (6): catch *what* the model claims. `no-fake-recall`, `no-fake-stats`, `no-fake-cite`, `no-phantom-tool-call`, `no-rollback-claim-without-evidence`, `no-count-drift` (self-consistency: a stated count vs the message's own enumeration — orthogonal to `no-fake-stats`, which is citation-presence).
 - **Continuity** (1): counter context loss rather than block dishonest output. `no-amnesia`.
 - **Multi-agent orchestration** (5): catch supervisor / +N-parallel-instance failure modes. `no-aggregator-hallucination`, `no-silent-worker-success`, `no-cherry-pick-rollup`, `no-ownership-violation`, `no-handoff-loop`.
 - **Agentic safety** (3): catch credential leak, sandbagging disguise, approval-sneak surfaces. `no-credential-leak-in-handoff`, `no-sandbagging-disguise`, `no-approval-sneak`.

diff --git a/evaluation/v6/RESULTS.md b/evaluation/v6/RESULTS.md
@@ -0,0 +1,33 @@
+# v6 count-drift — RESULTS
+
+Scorer: `evaluation/v6/score_count_drift.py` over `fixtures.jsonl` (28 fixtures: 9 positive / 19 adversarial negative).
+
+| metric | value |
+|---|---|
+| precision | 1.000 |
+| recall | 1.000 |
+| F1 | 1.000 |
+| F1 95% CI (bootstrap, n=1000, seed=42) | [1.000, 1.000] |
+| true positives | 9 |
+| **false positives** | **0** |
+| misses | 0 |
+
+SC1 (zero false positives on the adversarial negative set): PASS
+
+## Independent evaluation (non-circular)
+
+The detector was run over corpora it was NOT authored against: real LLM `model_response`/`prompt_text` from `evaluation/raw_results.jsonl` and the stress fixtures authored for the *other* hooks. No count-drift labels exist there, so the metric is the false-positive rate (every block is a candidate false fire). Reproduce: `python3 evaluation/v6/independent_eval.py`.
+
+| corpus | texts | blocks |
+|---|---|---|
+| MAD raw_results | 660 | 0 |
+| stress fixtures (other hooks) | 328 | 0 |
+| **total** | **988** | **0** |
+
+False-positive rate on independent text: **0.0000**. This is the load-bearing, non-circular precision evidence, distinct from the hand-authored F1 reported above. (Two real false positives found during development, a too-loose lead-in and a missing word-boundary on number words, were fixed and locked in as regression negatives.)
+
+## Honesty caveat (read before citing F1)
+
+This corpus is **hand-authored**: the same author wrote the detector and the fixtures. An F1 of 1.0 here is therefore **not** a wild-generalization claim; it is a co-evolved-corpus number and would overstate performance if cited as a field result. What the number legitimately shows: the detector behaves to spec on the designed cases, **including the adversarial negatives authored to break it** (nested-colon lead-ins, section-index numbers, label words, approximation markers, ambiguous multi-list scope, nested-list depth). The load-bearing, generalizable metric is **precision / zero-false-positives on those adversarial negatives**, the property a blocking gate must hold.
+
+Recall is reported, not gated. Per the statcheck precedent (deterministic internal-consistency check: ~96–100% specificity but only ~61% recall in the wild), real-world recall here is expected to fall well below 1.0, bounded by structural extraction coverage. That trade is intentional: abstain rather than false-fire.
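
Review note (not part of the patch): the F1 and bootstrap CI rows in the table above can be reproduced in spirit with a percentile bootstrap. This is a minimal sketch, not the code in `evaluation/v6/score_count_drift.py`; the names `f1_score` and `bootstrap_f1_ci`, the `(expected, predicted)` pair layout, and the convention of returning 0.0 when a resample has no true positives are all assumptions made for illustration.

```python
import random

def f1_score(pairs):
    """pairs: list of (expected_block, predicted_block) booleans."""
    tp = sum(1 for e, p in pairs if e and p)
    fp = sum(1 for e, p in pairs if not e and p)
    fn = sum(1 for e, p in pairs if e and not p)
    if tp == 0:
        return 0.0  # assumed convention for the degenerate case
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def bootstrap_f1_ci(pairs, samples=1000, seed=42, alpha=0.05):
    """Percentile bootstrap CI for F1; a fixed seed keeps it deterministic."""
    rng = random.Random(seed)
    n = len(pairs)
    scores = sorted(
        f1_score([pairs[rng.randrange(n)] for _ in range(n)])
        for _ in range(samples)
    )
    lo = scores[int((alpha / 2) * samples)]
    hi = scores[min(samples - 1, int((1 - alpha / 2) * samples))]
    return lo, hi

# Shape of the reported run: 9 positives all blocked, 19 negatives all passed.
pairs = [(True, True)] * 9 + [(False, False)] * 19
print(f1_score(pairs), bootstrap_f1_ci(pairs))
```

With a perfect confusion matrix almost every resample also scores 1.0, so a degenerate interval of [1.000, 1.000] is what a percentile bootstrap would be expected to give here; it says nothing about performance on text outside the fixture set.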
diff --git a/evaluation/v6/SPEC.md b/evaluation/v6/SPEC.md
@@ -0,0 +1,196 @@
+# SPEC — v6: `no-count-drift` Stop hook (count-vs-enumeration self-consistency gate)
+
+Status: ACTIVE (pre-implementation). Author: waitdeadai. Date: 2026-05-25.
+Origin: proposed by @beq00000 (Brendan Quinn) on `recognition-without-arrest-corpus#9` —
+"a final-pass diff between every count-claim in prose and its enumeration or table
+source," a verification gate that lives *outside* the writing agent's recall.
+
+## 1. Problem Statement
+
+LLM agents state a count in prose ("fourteen instances", "six instances", "5 of 7",
+"2/35 = 5.7%") that contradicts the artifact's own enumeration or arithmetic — a
+self-consistency (faithfulness) failure, not a missing-citation (factuality) failure.
+No existing hook in the suite catches it: `no-fake-stats` checks citation presence and
+deliberately ignores small integers.
+
+## 2. Success Criteria (measurable)
+
+- **SC1 (precision floor, hard gate):** On the v6 fixture set, the deterministic core
+  produces **zero false positives on the negative set** (precision = 1.000). A blocking
+  gate must not false-fire. The negative set MUST be **adversarial and authored
+  independently of the detector patterns** (correct counts that resemble the positives;
+  ambiguous/multi-enumeration scope; nested lists a naive counter would miscount; vague
+  cardinality) — precision claimed only against that adversarial set, to avoid the
+  co-evolved-corpus fake-precision trap. Verified by `tests/test-count-drift.sh` exiting 0.
+- **SC2 (seeded true positives caught):** Each seed below is present as a fixture and yields
+  `decision=block`: (a) "fourteen" prose vs 15 enumerated bullets (R3); (b) "six
+  instances" headline vs five enumerated (R3); (c) a genuinely wrong fraction-to-percent,
+  e.g. "9/10 = 80%" (R1; 9/10 is 90%). NOTE: the #9 cross-section case ("2/35=5.7%" in §2
+  vs "2/42=4.8%" in §3) is the same-quantity-different-denominator linking case and is
+  explicitly OUT of scope (§3, advisory only); "2/35 = 5.7%" is itself arithmetically
+  CORRECT (5.71% rounds to 5.7%) and MUST pass — it belongs in the adversarial negatives.
+- **SC3 (abstention on ambiguity):** Inputs with 0 or ≥2 candidate enumerations in
+  scope, vague cardinality ("a few"), or prose-only counts yield `decision=pass`
+  (abstain). Verified by negative fixtures.
+- **SC4 (fail-open):** Missing `jq` OR missing `python3` → hook exits 0 (never breaks a
+  session). Verified by `tests/test-count-drift.sh` env-stubbed cases.
+- **SC5 (determinism):** Identical input → identical verdict across two runs, zero delta.
+  Verified by running the fixture scorer twice and diffing.
+- **SC6 (no regression):** `bash scripts/*hook-smoke*` (or the repo's hook smoke) still
+  passes with `no-count-drift` wired into `hooks/hooks.json`.
+- **SC7 (reported, not gated):** F1 + bootstrap 95% CI on the full fixture set, reported
+  in `evaluation/v6/RESULTS.md`. Recall is reported, NOT required high (bounded by
+  structural extraction, per the statcheck precedent: ~61% recall at 96–100% specificity).
+
+Non-criteria (explicitly): high recall is NOT a success criterion. LLM-judge accuracy is
+out of scope for v1.
+
+## 3. Scope
+
+**In scope:**
+- `hooks/no-count-drift.sh` — bash wrapper (reads `.last_assistant_message`, fail-open,
+  calls the Python core, exit 2 on block) matching the suite's hook conventions.
+- `lib/count_drift.py` — pure-stdlib Python core (no pip deps). Three TIER-1 deterministic
+  detectors with strict abstention.
+- `evaluation/v6/fixtures.jsonl` — hand-authored positives + negatives (incl. the SC2
+  seeds and SC3 abstention cases).
+- `evaluation/v6/score_count_drift.py` — scorer (precision/recall/F1 + bootstrap CI).
+- `tests/test-count-drift.sh` — bash harness (assert pattern from `test-pack-loader.sh`).
+- `hooks/hooks.json` — wire `no-count-drift.sh` into `Stop` and `SubagentStop`.
+- `evaluation/v6/RESULTS.md` + this `SPEC.md`.
+
+**Out of scope (this PR):**
+- LLM-judge advisory tier (optional follow-up `no-count-drift-warn.sh`, env-gated, never
+  exits 2 — mirrors existing `no-sycophancy-warn.sh`). Reason: keep v1 deterministic and
+  high-precision; the literature ("Too Consistent to Detect", 2025-05) shows LLM judges
+  miss self-consistent count errors.
+- Cross-section semantic same-quantity linking (prose number vs a table cell for the
+  "same" labeled metric) — the KPI-Check problem, ~73% F1; too fuzzy to block on.
+- A Rust YAML rule pack in `agent-closeout-bench` — the engine is regex-match-only and
+  cannot count items or recompute fractions, so the computation must live in Python. N/A.
+
+## 4. Detector design (TIER-1 deterministic, abstain-on-ambiguity)
+
+Input: the assistant message text. Output JSON: `{decision: block|pass, rule, evidence}`.
+
+- **R1 — fraction/percentage arithmetic self-check (safest, ~100% precision, no linking):**
+  Find `A/B = P%` / `A/B (P%)` patterns; recompute `A/B*100`; `block` if
+  `|computed − P| > tol` (tol = max(0.5pp, one-ulp of stated precision)). No enumeration
+  needed; pure arithmetic.
+- **R2 — "N of M" bound check:** parse `N of M <noun>` (word or digit); `block` if `N > M`
+  (impossible). If exactly one enumeration of the noun is in scope with a counted size,
+  also check N against it; else just the N>M bound.
+- **R3 — headline count vs single enumeration:** a count-claim (`<num> <noun>` as a
+  heading/lead-in, **count ≥ 2**, colon **adjacent to the noun phrase** with no
+  intervening punctuation/number) immediately followed, before the next heading, by
+  **exactly one markdown LIST** whose **top-level** item count ≠ the claimed number.
+  Tables are excluded (a 2×2 matrix has 4 cells but 2 rows — a poor count proxy). Number
+  words require a leading word boundary (so "of-**ten**" / "writ-**ten**" do not parse as
+  "ten"). **Abstain (pass)** on count < 2, 0 or ≥2 candidate lists, depth ambiguity,
+  label/section indices, second-number lead-ins, or vague cardinality. (R3's loose
+  lead-in and word-boundary gaps were caught by the independent MAD eval — §7 — not the
+  hand-authored fixtures.)
+
+Number parsing: built-in word→int lexicon (one..nineteen, tens, hundred, thousand,
+ordinals, "a dozen"=12); no `text2num`/pip dep. Markdown counting: count top-level list
+items (min-indent of the contiguous block) and table data rows (exclude header+separator)
+via stdlib line scanning. Every detector **abstains** rather than guesses.
+
+## 5. Agent-Native Estimate
+
+- Estimate type: agent-native wall-clock.
+- Execution topology: local (the precision-critical parser/abstention logic is one tightly
+  coupled reasoning loop; do not split). Fixtures are a small optional sidecar.
+- Capacity evidence: capacity is NOT the binding constraint — this is local single-loop
+  implementation, not parallelizable dense work; `parallel-capacity.sh` not consulted
+  because lanes don't reduce the critical path here.
+- Effective lanes: 1 (optionally 2 if fixtures are delegated).
+- Critical path: SPEC → /specqa → /introspect → `count_drift.py` core → fixtures →
+  scorer/tests → /verify.
+- Agent wall-clock: optimistic ~6 build/verify cycles, likely ~10, pessimistic ~16
+  (pessimistic if hitting SC1 zero-false-positive needs a precision-tuning iteration on R3).
+- Agent-hours: low (single-file core + fixtures + harness).
+- Human touch time: design direction (given), review of the diff, merge/PR decision.
+  No external credentials.
+- Calendar blockers: none (isolated worktree, feature branch, no deploy). Pushing the
+  branch later would need the `workflow` scope only if `hooks/hooks.json` counted as a
+  workflow file; it is NOT under `.github/workflows`, so there is no scope blocker.
+- Confidence: medium — downgrade reason: R3 referent-linking abstention may need one
+  tuning pass to guarantee zero false positives (SC1) without collapsing R3 recall to zero.
+- Human-equivalent baseline (secondary only): ~half a day for a developer to write the
+  parser, fixtures, and tests carefully.
+
+## 6. Implementation Plan
+
+### Task 1: Python core `lib/count_drift.py`
+Definition of Done:
+- [ ] Reads message text from stdin or `--text`/`--file`; emits decision JSON.
+- [ ] R1 fraction/percentage recompute with tolerance.
+- [ ] R2 "N of M" bound + optional in-scope list check.
+- [ ] R3 headline-count vs single-enumeration with strict abstention.
+- [ ] Word→int lexicon; top-level markdown list + table-row counting (stdlib only).
+- [ ] Abstains (pass) on every ambiguous case enumerated in §4.
+
+### Task 2: Bash hook `hooks/no-count-drift.sh`
+Definition of Done:
+- [ ] Reads stdin JSON, extracts `.last_assistant_message` via jq.
+- [ ] Fail-open (exit 0) if jq or python3 absent, or input not JSON.
+- [ ] Calls `lib/count_drift.py`; on `decision=block` exit 2 with
+      `BLOCKED:` + `Matched rule:` + `Evidence:` + `Repair guidance:` (suite format).
+- [ ] `stop_hook_active=true` → exit 0 (no re-entrancy), matching siblings.
+
+### Task 3: Fixtures + scorer + tests
+Definition of Done:
+- [ ] `evaluation/v6/fixtures.jsonl` with the SC2 positives, SC3 abstentions, and a
+      balanced negative set (correct counts that must pass).
+- [ ] `evaluation/v6/score_count_drift.py` → precision/recall/F1 + bootstrap 95% CI (seed
+      fixed, samples=1000) writing `evaluation/v6/RESULTS.md`.
+- [ ] `tests/test-count-drift.sh` asserts SC1 (0 FP), SC2 (seeds blocked), SC3 (abstain),
+      SC4 (fail-open via PATH stub), SC5 (determinism).
+
+### Task 4: Wiring + docs
+Definition of Done:
+- [ ] `hooks/hooks.json` adds `no-count-drift.sh` to `Stop` and `SubagentStop`.
+- [ ] `evaluation/v6/RESULTS.md` populated from a real scorer run.
+- [ ] README hook table / METHODOLOGY updated to list the hook + its MAST FM-3.2 mapping.
+
+## 7. Verification
+
+- SC1/SC2/SC3/SC4/SC5 → `bash tests/test-count-drift.sh` exits 0 (each assertion).
+- SC6 → repo hook-smoke passes with the new wiring.
+- SC7 → `python3 evaluation/v6/score_count_drift.py` writes RESULTS.md with F1 + CI;
+  run twice, diff = empty (also covers SC5).
+- Hook-level smoke: pipe a positive `{"hook_event_name":"Stop","last_assistant_message":...}`
+  → exit 2; negative → exit 0; `PATH` without jq → exit 0.
+- **Independent (non-circular) precision** → `python3 evaluation/v6/independent_eval.py`
+  runs the detector over corpora it was NOT authored against (real LLM responses in
+  `evaluation/raw_results.jsonl` + the other hooks' stress fixtures) and must report
+  **0 blocks** (zero false positives). Result: 0 / 988 texts. This is the precision
+  evidence that the hand-authored F1 cannot give (co-evolved-corpus); it caught two real
+  R3 false-positive classes that the fixtures missed.
+
+## 8. Rollback Plan
+
+1. The work is isolated on branch `feature/v6-count-drift` in a worktree; nothing is on
+   `main` until an explicit merge.
+2. To unwire without removing files: delete the `no-count-drift.sh` entries from
+   `hooks/hooks.json` (each hook is independent by design).
+3. Full revert: `git branch -D feature/v6-count-drift` (pre-merge) or `git revert <sha>`
+   (post-merge); remove `hooks/no-count-drift.sh`, `lib/count_drift.py`, `evaluation/v6/`.
+4. Verify rollback: repo hook-smoke passes and `grep -c no-count-drift hooks/hooks.json` = 0.
+
+## Source ledger (deepresearch, accessed 2026-05-25)
+
+- statcheck (CRAN; Nuijten validity study 2017): deterministic recompute-and-compare,
+  specificity 96–100%, recall ~61% — high-precision/narrow-recall precedent.
+- ContraDoc (NAACL 2024, arXiv:2311.09182): "Numeric" is a named intra-document
+  self-contradiction type, and the EASIEST of them.
+- "Sequential Enumeration in LLMs" (arXiv:2512.04727) + "Too Consistent to Detect"
+  (arXiv:2505.17656): counting is a rule-based-symbolic strength; LLM/self-consistency
+  judges miss self-consistent count errors → counting belongs in deterministic code.
+- HalluLens (ACL 2025, arXiv:2504.17550): this is intrinsic/faithfulness hallucination,
+  orthogonal to factuality.
+- MAST (NeurIPS 2025 D&B, arXiv:2503.13657): maps to FM-3.2 "No or incomplete
+  verification" (+ FM-2.6). DarkBench (ICLR 2025, arXiv:2503.10728) does not cover it.
+- Landscape check: no existing hook in `llm-dark-patterns` / `agent-closeout-bench` /
+  `cc-safe-setup` does in-message count-vs-enumeration — confirmed gap, not duplication.
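
Review note (not part of the patch): the R1 arithmetic self-check and the top-level list counting that R3 relies on (SPEC §4) can be sketched in stdlib Python as below. This is an illustration under stated assumptions, not the contents of `lib/count_drift.py`: the regexes, the function names `check_fraction_percent` and `count_top_level_items`, and the reading of "one-ulp of stated precision" as one unit in the last stated decimal place are all hypothetical. The headline-claim parsing and the abstention rules of R3 are deliberately left out.

```python
import re

# Hypothetical pattern for "A/B = P%" and "A/B (P%)"; not the PR's regex.
FRACTION_PCT = re.compile(
    r"(?<![\d./])(\d+)\s*/\s*(\d+)\s*(?:=\s*|\(\s*)(\d+(?:\.(\d+))?)\s*%"
)

# Hypothetical bullet matcher: "-", "*", "+", "1." or "1)" followed by text.
BULLET = re.compile(r"^(\s*)(?:[-*+]|\d+[.)])\s+\S")

def check_fraction_percent(text):
    """Block on the first 'A/B = P%' whose stated P is out of tolerance."""
    for m in FRACTION_PCT.finditer(text):
        a, b = int(m.group(1)), int(m.group(2))
        if b == 0:
            continue  # abstain: ratio is undefined
        stated = float(m.group(3))
        decimals = len(m.group(4) or "")
        # Assumed tolerance: at least 0.5pp, or one unit in the last
        # stated decimal place, whichever is larger.
        tol = max(0.5, 10 ** -decimals)
        computed = a / b * 100
        if abs(computed - stated) > tol:
            return {
                "decision": "block",
                "rule": "R1",
                "evidence": f"{m.group(0)} (computed {computed:.2f}%)",
            }
    return {"decision": "pass", "rule": None, "evidence": None}

def count_top_level_items(block_lines):
    """Count list items at the minimum indent of one contiguous list block."""
    indents = [
        len(m.group(1)) for line in block_lines if (m := BULLET.match(line))
    ]
    if not indents:
        return None  # abstain: no list found
    top = min(indents)
    return sum(1 for indent in indents if indent == top)

print(check_fraction_percent("9/10 = 80%")["decision"])    # block
print(check_fraction_percent("2/35 = 5.7%")["decision"])   # pass
print(count_top_level_items(["- a", "  - a1", "- b"]))     # 2
```

The two SC2 arithmetic cases behave as the spec requires under these assumptions: "9/10 = 80%" is off by 10pp and blocks, while "2/35 = 5.7%" is within rounding and passes. Counting only at the minimum indent is what keeps nested sub-bullets from inflating the item count.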