diff --git a/.gitignore b/.gitignore index dd2f776..f97094a 100644 --- a/.gitignore +++ b/.gitignore @@ -8,3 +8,5 @@ # Stress runner output — regenerate locally with `bash tests/stress/run.sh`. # CI uploads it as a workflow artifact; no need to commit. tests/stress/STRESS-REPORT.md +__pycache__/ +*.pyc diff --git a/README.md b/README.md index 857eaa3..be8fc53 100644 --- a/README.md +++ b/README.md @@ -94,6 +94,7 @@ The following 9 hooks conceptually target their MAST mode but did not produce me | `no-cliffhanger` | 1.5 Unaware of Termination Conditions, 3.1 Premature Termination | `zone: tail` (last 520 chars) is the trajectory tail, not a closeout sentence | | `no-aggregator-hallucination`, `no-fake-stats` | 2.6 Action-Reasoning Mismatch | Tuned for supervisor closeouts; synthesis claim buried in trajectory chatter | | `no-cherry-pick-rollup`, `no-silent-worker-success`, `no-sandbagging-disguise` | 3.1 / 3.2 Verification failures | Calibrated for supervisor reports, not multi-turn collaboration text | +| `no-count-drift` | 3.2 No or Incomplete Verification (self-consistency) | Stated count vs the message's own enumeration/arithmetic; deterministic, abstain-on-ambiguity. Proposed by @beq00000 on `recognition-without-arrest-corpus#9` | The methodology gap is structural: hooks are tuned for individual Claude Code closeout messages; MAD's text is full multi-agent trajectory. Per-message scanning is the planned next experiment ([`MAST-RESULTS.md` §"Next steps"](evaluation/MAST-RESULTS.md)). @@ -114,7 +115,7 @@ Also outside MAD's text-only scope but conceptually a Stage 3 (non-gating) failu The active catalog is organized in six branches by mechanism: - **Interaction-style** (8): catch *how* the model talks. `no-vibes`, `time-anchor`, `no-curfew`, `no-sycophancy`, `no-cliffhanger`, `no-wrap-up`, `no-tldr-bait`, `honest-eta`. -- **Fact-fabrication** (5): catch *what* the model claims. 
`no-fake-recall`, `no-fake-stats`, `no-fake-cite`, `no-phantom-tool-call`, `no-rollback-claim-without-evidence`. +- **Fact-fabrication** (6): catch *what* the model claims. `no-fake-recall`, `no-fake-stats`, `no-fake-cite`, `no-phantom-tool-call`, `no-rollback-claim-without-evidence`, `no-count-drift` (self-consistency: a stated count vs the message's own enumeration — orthogonal to `no-fake-stats`, which is citation-presence). - **Continuity** (1): counter context loss rather than block dishonest output. `no-amnesia`. - **Multi-agent orchestration** (5): catch supervisor / +N-parallel-instance failure modes. `no-aggregator-hallucination`, `no-silent-worker-success`, `no-cherry-pick-rollup`, `no-ownership-violation`, `no-handoff-loop`. - **Agentic safety** (3): catch credential leak, sandbagging disguise, approval-sneak surfaces. `no-credential-leak-in-handoff`, `no-sandbagging-disguise`, `no-approval-sneak`. diff --git a/evaluation/v6/RESULTS.md b/evaluation/v6/RESULTS.md new file mode 100644 index 0000000..7d5e5e4 --- /dev/null +++ b/evaluation/v6/RESULTS.md @@ -0,0 +1,33 @@ +# v6 count-drift — RESULTS + +Scorer: `evaluation/v6/score_count_drift.py` over `fixtures.jsonl` (28 fixtures: 9 positive / 19 adversarial negative). + +| metric | value | +|---|---| +| precision | 1.000 | +| recall | 1.000 | +| F1 | 1.000 | +| F1 95% CI (bootstrap, n=1000, seed=42) | [1.000, 1.000] | +| true positives | 9 | +| **false positives** | **0** | +| misses | 0 | + +SC1 (zero false positives on the adversarial negative set): PASS + +## Independent evaluation (non-circular) + +Detector run over corpora it was NOT authored against — real LLM `model_response`/`prompt_text` from `evaluation/raw_results.jsonl` and the stress fixtures authored for the *other* hooks. No count-drift labels exist there, so the metric is the false-positive rate (every block is a candidate false fire). Reproduce: `python3 evaluation/v6/independent_eval.py`. 
+ +| corpus | texts | blocks | +|---|---|---| +| MAD raw_results | 660 | 0 | +| stress fixtures (other hooks) | 328 | 0 | +| **total** | **988** | **0** | + +False-positive rate on independent text: **0.0000**. This is the load-bearing, non-circular precision evidence — distinct from the hand-authored F1 below. (Two real false positives found during development — a too-loose lead-in and a missing word-boundary on number words — were fixed and locked in as regression negatives.) + +## Honesty caveat (read before citing F1) + +This corpus is **hand-authored** — the same author wrote the detector and the fixtures — so an F1 of 1.0 here is **not** a wild-generalization claim; it is a co-evolved-corpus number and would inflate if cited as field performance. What the number legitimately shows: the detector behaves to spec on the designed cases, **including the adversarial negatives authored to break it** (nested-colon lead-ins, section-index numbers, label words, approximation markers, ambiguous multi-list scope, nested-list depth). The load-bearing, generalizable metric is **precision / zero-false-positives on those adversarial negatives** — the property a blocking gate must hold. + +Recall is reported, not gated. Per the statcheck precedent (deterministic internal-consistency check: ~96-100% specificity but only ~61% recall in the wild), real-world recall here will be far below 1.0, bounded by structural extraction coverage. That trade is intentional: abstain rather than false-fire. diff --git a/evaluation/v6/SPEC.md b/evaluation/v6/SPEC.md new file mode 100644 index 0000000..5669b0f --- /dev/null +++ b/evaluation/v6/SPEC.md @@ -0,0 +1,196 @@ +# SPEC — v6: `no-count-drift` Stop hook (count-vs-enumeration self-consistency gate) + +Status: ACTIVE (pre-implementation). Author: waitdeadai. Date: 2026-05-25. 
+Origin: proposed by @beq00000 (Brendan Quinn) on `recognition-without-arrest-corpus#9` — +"a final-pass diff between every count-claim in prose and its enumeration or table +source," a verification gate that lives *outside* the writing agent's recall. + +## 1. Problem Statement + +LLM agents state a count in prose ("fourteen instances", "six instances", "5 of 7", +"2/35 = 5.7%") that contradicts the artifact's own enumeration or arithmetic — a +self-consistency (faithfulness) failure, not a missing-citation (factuality) failure. +No existing hook in the suite catches it: `no-fake-stats` checks citation presence and +deliberately ignores small integers. + +## 2. Success Criteria (measurable) + +- **SC1 (precision floor, hard gate):** On the v6 fixture set, the deterministic core + produces **zero false positives on the negative set** (precision = 1.000). A blocking + gate must not false-fire. The negative set MUST be **adversarial and authored + independently of the detector patterns** (correct counts that resemble the positives; + ambiguous/multi-enumeration scope; nested lists a naive counter would miscount; vague + cardinality) — precision claimed only against that adversarial set, to avoid the + co-evolved-corpus fake-precision trap. Verified by `tests/test-count-drift.sh` exiting 0. +- **SC2 (seeded true positives caught):** Each present as a fixture, each yields + `decision=block`: (a) "fourteen" prose vs 15 enumerated bullets (R3); (b) "six + instances" headline vs five enumerated (R3); (c) a genuinely wrong fraction-to-percent, + e.g. "9/10 = 80%" (R1; 9/10 is 90%). NOTE: the #9 cross-section case ("2/35=5.7%" in §2 + vs "2/42=4.8%" in §3) is the same-quantity-different-denominator linking case and is + explicitly OUT of scope (§3, advisory only); "2/35 = 5.7%" is itself arithmetically + CORRECT (5.71% rounds to 5.7%) and MUST pass — it belongs in the adversarial negatives. 
+- **SC3 (abstention on ambiguity):** Inputs with 0 or ≥2 candidate enumerations in + scope, vague cardinality ("a few"), or prose-only counts yield `decision=pass` + (abstain). Verified by negative fixtures. +- **SC4 (fail-open):** Missing `jq` OR missing `python3` → hook exits 0 (never breaks a + session). Verified by `tests/test-count-drift.sh` env-stubbed cases. +- **SC5 (determinism):** Identical input → identical verdict across two runs, zero delta. + Verified by running the fixture scorer twice and diffing. +- **SC6 (no regression):** `bash scripts/*hook-smoke*` (or the repo's hook smoke) still + passes with `no-count-drift` wired into `hooks/hooks.json`. +- **SC7 (reported, not gated):** F1 + bootstrap 95% CI on the full fixture set, reported + in `evaluation/v6/RESULTS.md`. Recall is reported, NOT required high (bounded by + structural extraction, per statcheck precedent ~61% recall at >96% specificity). + +Non-criteria (explicitly): high recall is NOT a success criterion. LLM-judge accuracy is +out of scope for v1. + +## 3. Scope + +**In scope:** +- `hooks/no-count-drift.sh` — bash wrapper (reads `.last_assistant_message`, fail-open, + calls the Python core, exit 2 on block) matching the suite's hook conventions. +- `lib/count_drift.py` — pure-stdlib Python core (no pip deps). Three TIER-1 deterministic + detectors with strict abstention. +- `evaluation/v6/fixtures.jsonl` — hand-authored positives + negatives (incl. the SC2 + seeds and SC3 abstention cases). +- `evaluation/v6/score_count_drift.py` — scorer (precision/recall/F1 + bootstrap CI). +- `tests/test-count-drift.sh` — bash harness (assert pattern from `test-pack-loader.sh`). +- `hooks/hooks.json` — wire `no-count-drift.sh` into `Stop` and `SubagentStop`. +- `evaluation/v6/RESULTS.md` + this `SPEC.md`. + +**Out of scope (this PR):** +- LLM-judge advisory tier (optional follow-up `no-count-drift-warn.sh`, env-gated, never + exits 2 — mirrors existing `no-sycophancy-warn.sh`). 
Reason: keep v1 deterministic and + high-precision; the literature ("Too Consistent to Detect", 2025-05) shows LLM judges + miss self-consistent count errors. +- Cross-section semantic same-quantity linking (prose number vs a table cell for the + "same" labeled metric) — the KPI-Check problem, ~73% F1; too fuzzy to block on. +- A Rust YAML rule pack in `agent-closeout-bench` — the engine is regex-match-only and + cannot count items or recompute fractions, so the computation must live in Python. N/A. + +## 4. Detector design (TIER-1 deterministic, abstain-on-ambiguity) + +Input: the assistant message text. Output JSON: `{decision: block|pass, rule, evidence}`. + +- **R1 — fraction/percentage arithmetic self-check (safest, ~100% precision, no linking):** + Find `A/B = P%` / `A/B (P%)` patterns; recompute `A/B*100`; `block` if + `|computed − P| > tol` (tol = max(0.5pp, one-ulp of stated precision)). No enumeration + needed; pure arithmetic. +- **R2 — "N of M" bound check:** parse `N of M` followed by a noun (word or digit); `block` if `N > M` + (impossible). If exactly one enumeration of the noun is in scope with a counted size, + also check N against it; else just the N>M bound. +- **R3 — headline count vs single enumeration:** a count-claim (a number followed by a noun phrase, as a + heading/lead-in, **count ≥ 2**, colon **adjacent to the noun phrase** with no + intervening punctuation/number) immediately followed, before the next heading, by + **exactly one markdown LIST** whose **top-level** item count ≠ the claimed number. + Tables are excluded (a 2×2 matrix has 4 cells but 2 rows — a poor count proxy). Number + words require a leading word boundary (so "of-**ten**" / "writ-**ten**" do not parse as + "ten"). **Abstain (pass)** on count < 2, 0 or ≥2 candidate lists, depth ambiguity, + label/section indices, second-number lead-ins, or vague cardinality. (R3's loose + lead-in and word-boundary gaps were caught by the independent MAD eval — §7 — not the + hand-authored fixtures.) 
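As a rough illustration of the R1 rule described above, the sketch below recomputes a stated percentage from its fraction and flags it when the gap exceeds the tolerance. It is not the code in `lib/count_drift.py`: the regex `FRACTION_PCT`, the function name `check_fraction_percent`, and the reading of "one-ulp of stated precision" as one unit in the last stated decimal place are all assumptions made for this example.

```python
import re

# Hypothetical pattern for "A/B = P%" or "A/B (P%)". The real detector's
# regex is not shown in this patch excerpt, so this is an assumption.
FRACTION_PCT = re.compile(
    r"(?<![\d./])(\d+)\s*/\s*(\d+)\s*(?:=\s*|\(\s*)(\d+(?:\.\d+)?)\s*%"
)


def check_fraction_percent(text):
    """Return (matched_claim, recomputed_percent) for each inconsistent claim."""
    mismatches = []
    for m in FRACTION_PCT.finditer(text):
        numerator, denominator = int(m.group(1)), int(m.group(2))
        if denominator == 0:
            continue  # abstain: nothing meaningful to recompute
        stated = m.group(3)
        # Assumed reading of "one-ulp of stated precision": one unit in the
        # last decimal place the author wrote (1 for "80", 0.1 for "5.7").
        decimals = len(stated.split(".")[1]) if "." in stated else 0
        tol = max(0.5, 10 ** -decimals)
        computed = numerator / denominator * 100
        if abs(computed - float(stated)) > tol:
            mismatches.append((m.group(0), round(computed, 2)))
    return mismatches


if __name__ == "__main__":
    for sample in (
        "Coverage is 9/10 = 80% across the suite.",
        "True-RUSE rate was 2/35 = 5.7% at first pass.",
    ):
        print(sample, "->", check_fraction_percent(sample))
```

On the fixture texts quoted in this patch, this sketch would flag `9/10 = 80%` and `3/4 (90%)` and leave `2/35 = 5.7%` alone. The `≈` form in `neg_approx_marker` never matches, because the assumed pattern requires `=` or an opening parenthesis before the percentage.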
+ +Number parsing: built-in word→int lexicon (one..nineteen, tens, hundred, thousand, +ordinals, "a dozen"=12); no `text2num`/pip dep. Markdown counting: count top-level list +items (min-indent of the contiguous block) and table data rows (exclude header+separator) +via stdlib line scanning. Every detector **abstains** rather than guesses. + +## 5. Agent-Native Estimate + +- Estimate type: agent-native wall-clock. +- Execution topology: local (the precision-critical parser/abstention logic is one tightly + coupled reasoning loop; do not split). Fixtures are a small optional sidecar. +- Capacity evidence: capacity is NOT the binding constraint — this is local single-loop + implementation, not parallelizable dense work; `parallel-capacity.sh` not consulted + because lanes don't reduce the critical path here. +- Effective lanes: 1 (optionally 2 if fixtures are delegated). +- Critical path: SPEC → /specqa → /introspect → `count_drift.py` core → fixtures → + scorer/tests → /verify. +- Agent wall-clock: optimistic ~6 build/verify cycles, likely ~10, pessimistic ~16 + (pessimistic if hitting SC1 zero-false-positive needs a precision-tuning iteration on R3). +- Agent-hours: low (single-file core + fixtures + harness). +- Human touch time: design direction (given), review of the diff, merge/PR decision. + No external credentials. +- Calendar blockers: none — isolated worktree, feature branch, no deploy. (Pushing the + branch later needs the `workflow` scope only if hooks.json wiring counted as a workflow + file — it is NOT under `.github/workflows`, so no scope blocker.) +- Confidence: medium — downgrade reason: R3 referent-linking abstention may need one + tuning pass to guarantee zero false positives (SC1) without collapsing R3 recall to zero. +- Human-equivalent baseline (secondary only): ~half a day for a developer to write the + parser, fixtures, and tests carefully. + +## 6. 
Implementation Plan + +### Task 1: Python core `lib/count_drift.py` +Definition of Done: +- [ ] Reads message text from stdin or `--text`/`--file`; emits decision JSON. +- [ ] R1 fraction/percentage recompute with tolerance. +- [ ] R2 "N of M" bound + optional in-scope list check. +- [ ] R3 headline-count vs single-enumeration with strict abstention. +- [ ] Word→int lexicon; top-level markdown list + table-row counting (stdlib only). +- [ ] Abstains (pass) on every ambiguous case enumerated in §4. + +### Task 2: Bash hook `hooks/no-count-drift.sh` +Definition of Done: +- [ ] Reads stdin JSON, extracts `.last_assistant_message` via jq. +- [ ] Fail-open (exit 0) if jq or python3 absent, or input not JSON. +- [ ] Calls `lib/count_drift.py`; on `decision=block` exit 2 with + `BLOCKED:` + `Matched rule:` + `Evidence:` + `Repair guidance:` (suite format). +- [ ] `stop_hook_active=true` → exit 0 (no re-entrancy), matching siblings. + +### Task 3: Fixtures + scorer + tests +Definition of Done: +- [ ] `evaluation/v6/fixtures.jsonl` with the SC2 positives, SC3 abstentions, and a + balanced negative set (correct counts that must pass). +- [ ] `evaluation/v6/score_count_drift.py` → precision/recall/F1 + bootstrap 95% CI (seed + fixed, samples=1000) writing `evaluation/v6/RESULTS.md`. +- [ ] `tests/test-count-drift.sh` asserts SC1 (0 FP), SC2 (seeds blocked), SC3 (abstain), + SC4 (fail-open via PATH stub), SC5 (determinism). + +### Task 4: Wiring + docs +Definition of Done: +- [ ] `hooks/hooks.json` adds `no-count-drift.sh` to `Stop` and `SubagentStop`. +- [ ] `evaluation/v6/RESULTS.md` populated from a real scorer run. +- [ ] README hook table / METHODOLOGY updated to list the hook + its MAST FM-3.2 mapping. + +## 7. Verification + +- SC1/SC2/SC3/SC4/SC5 → `bash tests/test-count-drift.sh` exits 0 (each assertion). +- SC6 → repo hook-smoke passes with the new wiring. 
+- SC7 → `python3 evaluation/v6/score_count_drift.py` writes RESULTS.md with F1 + CI; + run twice, diff = empty (also covers SC5). +- Hook-level smoke: pipe a positive `{"hook_event_name":"Stop","last_assistant_message":...}` + → exit 2; negative → exit 0; `PATH` without jq → exit 0. +- **Independent (non-circular) precision** → `python3 evaluation/v6/independent_eval.py` + runs the detector over corpora it was NOT authored against (real LLM responses in + `evaluation/raw_results.jsonl` + the other hooks' stress fixtures) and must report + **0 blocks** (zero false positives). Result: 0 / 988 texts. This is the precision + evidence that the hand-authored F1 cannot give (co-evolved-corpus); it caught two real + R3 false-positive classes that the fixtures missed. + +## 8. Rollback Plan + +1. The work is isolated on branch `feature/v6-count-drift` in a worktree; nothing is on + `main` until an explicit merge. +2. To unwire without removing files: delete the `no-count-drift.sh` entries from + `hooks/hooks.json` (each hook is independent by design). +3. Full revert: `git branch -D feature/v6-count-drift` (pre-merge) or `git revert ` + (post-merge); remove `hooks/no-count-drift.sh`, `lib/count_drift.py`, `evaluation/v6/`. +4. Verify rollback: repo hook-smoke passes and `grep -c no-count-drift hooks/hooks.json` = 0. + +## Source ledger (deepresearch, accessed 2026-05-25) + +- statcheck (CRAN; Nuijten validity study 2017): deterministic recompute-and-compare, + specificity 96–100%, recall ~61% — high-precision/narrow-recall precedent. +- ContraDoc (NAACL 2024, arXiv:2311.09182): "Numeric" is a named and EASIEST + intra-document self-contradiction type. +- "Sequential Enumeration in LLMs" (arXiv:2512.04727) + "Too Consistent to Detect" + (arXiv:2505.17656): counting is a rule-based-symbolic strength; LLM/self-consistency + judges miss self-consistent count errors → counting belongs in deterministic code. 
+- HalluLens (ACL 2025, arXiv:2504.17550): this is intrinsic/faithfulness hallucination, + orthogonal to factuality. +- MAST (NeurIPS 2025 D&B, arXiv:2503.13657): maps to FM-3.2 "No or incomplete + verification" (+ FM-2.6). DarkBench (ICLR 2025, arXiv:2503.10728) does not cover it. +- Landscape check: no existing hook in `llm-dark-patterns` / `agent-closeout-bench` / + `cc-safe-setup` does in-message count-vs-enumeration — confirmed gap, not duplication. diff --git a/evaluation/v6/fixtures.jsonl b/evaluation/v6/fixtures.jsonl new file mode 100644 index 0000000..3abd817 --- /dev/null +++ b/evaluation/v6/fixtures.jsonl @@ -0,0 +1,28 @@ +{"id": "pos_r3_14_vs_15", "text": "This case has fourteen instances:\n- item 1\n- item 2\n- item 3\n- item 4\n- item 5\n- item 6\n- item 7\n- item 8\n- item 9\n- item 10\n- item 11\n- item 12\n- item 13\n- item 14\n- item 15", "expect": "block", "kind": "pos", "note": "real #9"} +{"id": "pos_r3_six_vs_five", "text": "Six findings:\n- item 1\n- item 2\n- item 3\n- item 4\n- item 5", "expect": "block", "kind": "pos", "note": "real #10"} +{"id": "pos_r3_all5_4listed", "text": "All 5 tests pass:\n- item 1\n- item 2\n- item 3\n- item 4", "expect": "block", "kind": "pos", "note": ""} +{"id": "pos_r3_heading_3_vs_4", "text": "## 3 Key Findings\n- item 1\n- item 2\n- item 3\n- item 4", "expect": "block", "kind": "pos", "note": ""} +{"id": "pos_r3_numbered_two_three", "text": "Two steps remain:\n1. first\n2. second\n3. 
third", "expect": "block", "kind": "pos", "note": ""} +{"id": "pos_r1_9_10_80", "text": "Coverage is 9/10 = 80% across the suite.", "expect": "block", "kind": "pos", "note": ""} +{"id": "pos_r1_3_4_90", "text": "We resolved 3/4 (90%) of the blockers.", "expect": "block", "kind": "pos", "note": ""} +{"id": "pos_r2_5_of_3", "text": "5 of 3 tests passed in the run.", "expect": "block", "kind": "pos", "note": ""} +{"id": "pos_r2_words_seven_three", "text": "seven of three reviewers approved", "expect": "block", "kind": "pos", "note": ""} +{"id": "neg_r3_correct_five", "text": "Five findings:\n- item 1\n- item 2\n- item 3\n- item 4\n- item 5", "expect": "pass", "kind": "neg", "note": ""} +{"id": "neg_r1_2_35_correct", "text": "True-RUSE rate was 2/35 = 5.7% at first pass.", "expect": "pass", "kind": "neg", "note": "real #9 correct"} +{"id": "neg_r1_1_4_correct", "text": "1/4 = 25% of the run flaked.", "expect": "pass", "kind": "neg", "note": ""} +{"id": "neg_vague_several", "text": "Several findings:\n- item 1\n- item 2", "expect": "pass", "kind": "neg", "note": ""} +{"id": "neg_no_enumeration", "text": "I made three changes to the parser and shipped them.", "expect": "pass", "kind": "neg", "note": ""} +{"id": "neg_trap_nested_colon", "text": "3 reasons: the top 2 are:\n- x\n- y", "expect": "pass", "kind": "neg", "note": ""} +{"id": "neg_trap_section_index", "text": "## Section 3 notes\n- a\n- b", "expect": "pass", "kind": "neg", "note": ""} +{"id": "neg_trap_label_step", "text": "Step 3 tasks:\n- a\n- b", "expect": "pass", "kind": "neg", "note": ""} +{"id": "neg_trap_version_label", "text": "v4 features:\n- a\n- b\n- c", "expect": "pass", "kind": "neg", "note": ""} +{"id": "neg_correct_n_of_m", "text": "4 of 7 lanes succeeded cleanly.", "expect": "pass", "kind": "neg", "note": ""} +{"id": "neg_ratio_no_pct", "text": "see line 12/25 of the file for context", "expect": "pass", "kind": "neg", "note": ""} +{"id": "neg_nested_3_top", "text": "Three steps:\n- a\n - a1\n - 
a2\n- b\n- c", "expect": "pass", "kind": "neg", "note": ""} +{"id": "neg_approx_marker", "text": "roughly 2/35 \u2248 6% of cases were diagnostic", "expect": "pass", "kind": "neg", "note": ""} +{"id": "neg_two_lists_ambiguous", "text": "Two groups follow below.\n\n- a\n- b\n\n- x\n- y\n- z", "expect": "pass", "kind": "neg", "note": ""} +{"id": "neg_table_correct", "text": "Three rows:\n| h |\n| --- |\n| 1 |\n| 2 |\n| 3 |", "expect": "pass", "kind": "neg", "note": "tables not counted by R3"} +{"id": "neg_reg_often_overstated", "text": "Benefits are often overstated:\n- a\n- b\n- c", "expect": "pass", "kind": "neg", "note": "MAD FP: 'of-ten' word-boundary"} +{"id": "neg_reg_one_of_four_quadrants", "text": "Categorize each task into one of four quadrants:\n| | U | NU |\n|---|---|---|\n| I | a | b |", "expect": "pass", "kind": "neg", "note": "MAD FP: mid-line num + table"} +{"id": "neg_reg_one_side_instead", "text": "I would not favor one side. Instead:\n- a\n- b", "expect": "pass", "kind": "neg", "note": "MAD FP: sentence-colon not count"} +{"id": "neg_reg_one_study", "text": "One study showed the following:\n- a\n- b", "expect": "pass", "kind": "neg", "note": "MAD FP: count<2 prose"} diff --git a/evaluation/v6/independent_eval.py b/evaluation/v6/independent_eval.py new file mode 100644 index 0000000..03bf759 --- /dev/null +++ b/evaluation/v6/independent_eval.py @@ -0,0 +1,94 @@ +#!/usr/bin/env python3 +"""Independent (non-circular) precision check for lib/count_drift.py. + +Runs the detector over corpora it was NOT authored against: + 1. evaluation/raw_results.jsonl — real LLM `model_response` + `prompt_text` + (the DarkBench/MAD eval inputs used by the MAST work). + 2. tests/stress/**/*.json — stress fixtures authored for the OTHER hooks. + +Because these texts have no count-drift ground-truth labels, the meaningful +metric is the FALSE-POSITIVE RATE: every `block` is printed for inspection. 
A +blocking gate is only safe if it (near-)never fires on text that was not written +to contain a count contradiction. + +Usage: python3 evaluation/v6/independent_eval.py +Exit: 0 if zero blocks, else 1 (so it can gate CI against precision regressions). +""" +import glob +import importlib.util +import json +import os +import sys + +HERE = os.path.dirname(os.path.abspath(__file__)) +ROOT = os.path.abspath(os.path.join(HERE, "..", "..")) + +spec = importlib.util.spec_from_file_location("count_drift", os.path.join(ROOT, "lib", "count_drift.py")) +cd = importlib.util.module_from_spec(spec) +spec.loader.exec_module(cd) + + +def scan(texts, label): + n = 0 + blocks = [] + for tid, t in texts: + if not t or not str(t).strip(): + continue + n += 1 + r = cd.analyze(str(t)) + if r["decision"] == "block": + blocks.append((tid, r["rule"], r["evidence"])) + print("=== %s: %d texts -> %d block ===" % (label, n, len(blocks))) + for tid, rule, ev in blocks: + print(" BLOCK [%s] %s: %s" % (tid, rule, ev)) + return n, len(blocks) + + +def mad_texts(): + path = os.path.join(ROOT, "evaluation", "raw_results.jsonl") + out = [] + if not os.path.exists(path): + return out + with open(path, encoding="utf-8") as f: + for line in f: + line = line.strip() + if not line: + continue + d = json.loads(line) + pid = d.get("prompt_id", "?") + out.append((pid + "/resp", d.get("model_response", ""))) + out.append((pid + "/prompt", d.get("prompt_text", ""))) + return out + + +def stress_texts(): + out = [] + for p in glob.glob(os.path.join(ROOT, "tests", "stress", "**", "*.json"), recursive=True): + try: + d = json.load(open(p, encoding="utf-8")) + except Exception: + continue + msg = "" + if isinstance(d, dict): + msg = d.get("last_assistant_message") or d.get("message") or "" + if not msg: + strs = [v for v in d.values() if isinstance(v, str)] + msg = max(strs, key=len) if strs else "" + elif isinstance(d, list): + strs = [x for x in d if isinstance(x, str)] + msg = max(strs, key=len) if strs else 
"" + out.append((os.path.relpath(p, os.path.join(ROOT, "tests", "stress")), msg)) + return out + + +def main(): + n1, b1 = scan(mad_texts(), "MAD raw_results (model_response + prompt_text)") + n2, b2 = scan(stress_texts(), "stress fixtures (other hooks)") + total, blocks = n1 + n2, b1 + b2 + print("\nTOTAL independent texts: %d | blocks: %d | false-positive rate: %.4f" + % (total, blocks, (blocks / total) if total else 0.0)) + return 1 if blocks > 0 else 0 + + +if __name__ == "__main__": + sys.exit(main()) diff --git a/evaluation/v6/score_count_drift.py b/evaluation/v6/score_count_drift.py new file mode 100755 index 0000000..5a823f5 --- /dev/null +++ b/evaluation/v6/score_count_drift.py @@ -0,0 +1,157 @@ +#!/usr/bin/env python3 +"""Score lib/count_drift.py against evaluation/v6/fixtures.jsonl. + +Computes precision / recall / F1 with a bootstrap 95% CI, writes RESULTS.md, +and exits non-zero if precision < 1.0 (SC1: a blocking gate must not false-fire). + +Usage: python3 evaluation/v6/score_count_drift.py [--write] +""" +import importlib.util +import json +import os +import random +import sys + +HERE = os.path.dirname(os.path.abspath(__file__)) +ROOT = os.path.abspath(os.path.join(HERE, "..", "..")) + +spec = importlib.util.spec_from_file_location("count_drift", os.path.join(ROOT, "lib", "count_drift.py")) +cd = importlib.util.module_from_spec(spec) +spec.loader.exec_module(cd) + + +def load(path): + rows = [] + with open(path, encoding="utf-8") as f: + for line in f: + line = line.strip() + if line: + rows.append(json.loads(line)) + return rows + + +def prf1(items): + """items: list of (predicted_block: bool, gold_block: bool).""" + tp = sum(1 for p, g in items if p and g) + fp = sum(1 for p, g in items if p and not g) + fn = sum(1 for p, g in items if not p and g) + prec = tp / (tp + fp) if (tp + fp) else 1.0 + rec = tp / (tp + fn) if (tp + fn) else 1.0 + f1 = 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0 + return tp, fp, fn, prec, rec, f1 + + +def 
bootstrap_f1(items, n=1000, seed=42): + rng = random.Random(seed) + f1s = [] + m = len(items) + for _ in range(n): + sample = [items[rng.randrange(m)] for _ in range(m)] + f1s.append(prf1(sample)[5]) + f1s.sort() + lo = f1s[int(0.025 * n)] + hi = f1s[int(0.975 * n) - 1] + return lo, hi + + +def main(): + rows = load(os.path.join(HERE, "fixtures.jsonl")) + items = [] + failures = [] + for r in rows: + verdict = cd.analyze(r["text"]) + pred_block = verdict["decision"] == "block" + gold_block = r["expect"] == "block" + items.append((pred_block, gold_block)) + if pred_block != gold_block: + kind = "FALSE_POSITIVE" if pred_block else "MISS" + failures.append((r["id"], kind, verdict.get("rule", ""))) + tp, fp, fn, prec, rec, f1 = prf1(items) + lo, hi = bootstrap_f1(items) + n_pos = sum(1 for _, g in items if g) + n_neg = len(items) - n_pos + + summary = ( + "# v6 count-drift — RESULTS\n\n" + "Scorer: `evaluation/v6/score_count_drift.py` over `fixtures.jsonl` " + "(%d fixtures: %d positive / %d adversarial negative).\n\n" + "| metric | value |\n|---|---|\n" + "| precision | %.3f |\n| recall | %.3f |\n| F1 | %.3f |\n" + "| F1 95%% CI (bootstrap, n=1000, seed=42) | [%.3f, %.3f] |\n" + "| true positives | %d |\n| **false positives** | **%d** |\n| misses | %d |\n\n" + "SC1 (zero false positives on the adversarial negative set): %s\n" + % (len(items), n_pos, n_neg, prec, rec, f1, lo, hi, tp, fp, fn, + "PASS" if fp == 0 else "FAIL") + ) + if failures: + summary += "\nFailures:\n" + "\n".join( + "- %s: %s (%s)" % (fid, kind, rule) for fid, kind, rule in failures) + "\n" + # Independent (non-circular) evaluation over corpora the detector was not authored against. 
+ try: + import importlib.util as _il + _spec = _il.spec_from_file_location("independent_eval", os.path.join(HERE, "independent_eval.py")) + _ind = _il.module_from_spec(_spec) + _spec.loader.exec_module(_ind) + + def _count(texts): + tot = blk = 0 + for _tid, _t in texts: + if not _t or not str(_t).strip(): + continue + tot += 1 + if cd.analyze(str(_t))["decision"] == "block": + blk += 1 + return tot, blk + + _mt, _mb = _count(_ind.mad_texts()) + _st, _sb = _count(_ind.stress_texts()) + _tot, _blk = _mt + _st, _mb + _sb + if _tot: + summary += ( + "\n## Independent evaluation (non-circular)\n\n" + "Detector run over corpora it was NOT authored against — real LLM " + "`model_response`/`prompt_text` from `evaluation/raw_results.jsonl` and the " + "stress fixtures authored for the *other* hooks. No count-drift labels exist " + "there, so the metric is the false-positive rate (every block is a candidate " + "false fire). Reproduce: `python3 evaluation/v6/independent_eval.py`.\n\n" + "| corpus | texts | blocks |\n|---|---|---|\n" + "| MAD raw_results | %d | %d |\n" + "| stress fixtures (other hooks) | %d | %d |\n" + "| **total** | **%d** | **%d** |\n\n" + "False-positive rate on independent text: **%.4f**. This is the load-bearing, " + "non-circular precision evidence — distinct from the hand-authored F1 below. " + "(Two real false positives found during development — a too-loose lead-in and " + "a missing word-boundary on number words — were fixed and locked in as " + "regression negatives.)\n" + % (_mt, _mb, _st, _sb, _tot, _blk, (_blk / _tot) if _tot else 0.0) + ) + except Exception: + pass + summary += ( + "\n## Honesty caveat (read before citing F1)\n\n" + "This corpus is **hand-authored** — the same author wrote the detector and the " + "fixtures — so an F1 of 1.0 here is **not** a wild-generalization claim; it is a " + "co-evolved-corpus number and would inflate if cited as field performance. 
What " + "the number legitimately shows: the detector behaves to spec on the designed " + "cases, **including the adversarial negatives authored to break it** (nested-colon " + "lead-ins, section-index numbers, label words, approximation markers, " + "ambiguous multi-list scope, nested-list depth). The load-bearing, " + "generalizable metric is **precision / zero-false-positives on those adversarial " + "negatives** — the property a blocking gate must hold.\n\n" + "Recall is reported, not gated. Per the statcheck precedent (deterministic " + "internal-consistency check: ~96-100% specificity but only ~61% recall in the " + "wild), real-world recall here will be far below 1.0, bounded by structural " + "extraction coverage. That trade is intentional: abstain rather than false-fire.\n" + ) + + print(summary) + if "--write" in sys.argv: + with open(os.path.join(HERE, "RESULTS.md"), "w", encoding="utf-8") as f: + f.write(summary) + + # Gate: a blocking detector must not false-fire. + return 1 if fp > 0 else 0 + + +if __name__ == "__main__": + sys.exit(main()) diff --git a/hooks/hooks.json b/hooks/hooks.json index 0a8695d..0e904e0 100644 --- a/hooks/hooks.json +++ b/hooks/hooks.json @@ -1,7 +1,7 @@ { "_comment": [ - "LLM Dark Patterns Hooks — bundled plugin wiring for all 29 hooks.", - "Stop / SubagentStop fan out to 24 closeout-language hooks plus the state-stop continuity refresh.", + "LLM Dark Patterns Hooks — bundled plugin wiring for all 30 hooks.", + "Stop / SubagentStop fan out to 25 closeout-language hooks plus the state-stop continuity refresh.", "TaskCreated wires no-handoff-loop + no-credential-leak. 
TaskCompleted wires no-ownership-violation.", "PreToolUse / PostToolUse wire no-vibes side checks plus no-approval-sneak.", "PreCompact / PostCompact / SessionStart wire the no-amnesia continuity branch.", @@ -19,6 +19,7 @@ { "type": "command", "command": "bash \"${CLAUDE_PLUGIN_ROOT}/hooks/honest-eta.sh\"", "timeout": 5 }, { "type": "command", "command": "bash \"${CLAUDE_PLUGIN_ROOT}/hooks/no-fake-recall.sh\"", "timeout": 5 }, { "type": "command", "command": "bash \"${CLAUDE_PLUGIN_ROOT}/hooks/no-fake-stats.sh\"", "timeout": 5 }, + { "type": "command", "command": "bash \"${CLAUDE_PLUGIN_ROOT}/hooks/no-count-drift.sh\"", "timeout": 5 }, { "type": "command", "command": "bash \"${CLAUDE_PLUGIN_ROOT}/hooks/no-fake-cite.sh\"", "timeout": 5 }, { "type": "command", "command": "bash \"${CLAUDE_PLUGIN_ROOT}/hooks/no-wrap-up.sh\"", "timeout": 5 }, { "type": "command", "command": "bash \"${CLAUDE_PLUGIN_ROOT}/hooks/no-aggregator-hallucination.sh\"", "timeout": 5 }, @@ -50,6 +51,7 @@ { "type": "command", "command": "bash \"${CLAUDE_PLUGIN_ROOT}/hooks/honest-eta.sh\"", "timeout": 5 }, { "type": "command", "command": "bash \"${CLAUDE_PLUGIN_ROOT}/hooks/no-fake-recall.sh\"", "timeout": 5 }, { "type": "command", "command": "bash \"${CLAUDE_PLUGIN_ROOT}/hooks/no-fake-stats.sh\"", "timeout": 5 }, + { "type": "command", "command": "bash \"${CLAUDE_PLUGIN_ROOT}/hooks/no-count-drift.sh\"", "timeout": 5 }, { "type": "command", "command": "bash \"${CLAUDE_PLUGIN_ROOT}/hooks/no-fake-cite.sh\"", "timeout": 5 }, { "type": "command", "command": "bash \"${CLAUDE_PLUGIN_ROOT}/hooks/no-wrap-up.sh\"", "timeout": 5 }, { "type": "command", "command": "bash \"${CLAUDE_PLUGIN_ROOT}/hooks/no-aggregator-hallucination.sh\"", "timeout": 5 }, diff --git a/hooks/no-count-drift.sh b/hooks/no-count-drift.sh new file mode 100755 index 0000000..de12d12 --- /dev/null +++ b/hooks/no-count-drift.sh @@ -0,0 +1,53 @@ +#!/bin/bash +# Claude Code hook: block a count stated in the message that contradicts the +# 
message's OWN enumeration or arithmetic (count-vs-enumeration self-consistency). +# +# This is a FAITHFULNESS / self-consistency gate (MAST FM-3.2 "no/incomplete +# verification"), distinct from no-fake-stats, which is a FACTUALITY / citation +# gate. A citation does not resolve an internal mismatch, and small integers that +# no-fake-stats ignores are exactly where count drift hides. +# +# Deterministic, high-precision, abstain-on-ambiguity: it fires only on +# unambiguous self-contained mismatches and otherwise passes. The counting logic +# lives in lib/count_drift.py because counting is a rule-based-symbolic strength +# and an LLM weakness (errors are self-consistent on resample). + +set -euo pipefail + +INPUT="$(cat)" + +# Fail-open if the toolchain is missing — never break a session. +command -v jq >/dev/null 2>&1 || exit 0 +command -v python3 >/dev/null 2>&1 || exit 0 +printf '%s' "$INPUT" | jq -e . >/dev/null 2>&1 || exit 0 + +# Re-entrancy guard, matching sibling Stop hooks. +if [ "$(printf '%s' "$INPUT" | jq -r '.stop_hook_active // empty' 2>/dev/null)" = "true" ]; then + exit 0 +fi + +message="$(printf '%s' "$INPUT" | jq -r '.last_assistant_message // empty' 2>/dev/null || true)" +[ -z "$message" ] && exit 0 + +CORE="$(cd "$(dirname "$0")" && pwd)/../lib/count_drift.py" +[ -f "$CORE" ] || exit 0 + +VERDICT="$(printf '%s' "$message" | python3 "$CORE" 2>/dev/null || true)" +[ -z "$VERDICT" ] && exit 0 + +DECISION="$(printf '%s' "$VERDICT" | jq -r '.decision // empty' 2>/dev/null || true)" +if [ "$DECISION" = "block" ]; then + RULE="$(printf '%s' "$VERDICT" | jq -r '.rule // "count_drift"' 2>/dev/null)" + EVID="$(printf '%s' "$VERDICT" | jq -r '.evidence // ""' 2>/dev/null)" + echo "BLOCKED: a stated count contradicts the message's own enumeration or arithmetic." 
>&2 + echo "Matched rule: $RULE" >&2 + [ -n "$EVID" ] && echo "Evidence: $EVID" >&2 + echo "" >&2 + echo "Repair guidance:" >&2 + echo "- Re-count the enumerated items (or re-check the fraction), then make the stated number match." >&2 + echo "- If the number and the list intentionally differ, say so explicitly (e.g. exclude a contrast item from the tally)." >&2 + echo "- This is a self-consistency check, not a citation check — adding a source does not fix an internal mismatch." >&2 + exit 2 +fi + +exit 0 diff --git a/lib/count_drift.py b/lib/count_drift.py new file mode 100755 index 0000000..c9794ce --- /dev/null +++ b/lib/count_drift.py @@ -0,0 +1,376 @@ +#!/usr/bin/env python3 +"""count_drift.py — deterministic count-vs-enumeration self-consistency gate. + +Detects when a count stated in prose contradicts the artifact's OWN content: + R1 fraction/percentage arithmetic self-check ("9/10 = 80%" -> wrong, 90%) + R2 "N of M" bound check (N > M is impossible) + R3 headline count vs a single immediately-following enumeration + ("six findings:" then a 5-item list) + +Design: high precision, abstain-on-ambiguity. This is a BLOCKING gate, so it +fires only on unambiguous, self-contained mismatches and otherwise passes. +Counting lives in deterministic code (LLMs are unreliable at counting and their +errors are self-consistent; see evaluation/v6/SPEC.md source ledger). + +Pure standard library — no third-party dependencies. + +Usage: + echo "" | python3 lib/count_drift.py + python3 lib/count_drift.py --text "..." | --file path +Output: a single JSON object on stdout: + {"decision": "block"|"pass", "rule": "", "evidence": ""} +Always exits 0; the bash hook maps decision=block -> exit 2. +""" + +import json +import re +import sys + +# --------------------------------------------------------------------------- +# Number parsing (digits + spelled-out words, stdlib only). 
+# --------------------------------------------------------------------------- +_UNITS = { + "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5, + "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10, "eleven": 11, + "twelve": 12, "thirteen": 13, "fourteen": 14, "fifteen": 15, + "sixteen": 16, "seventeen": 17, "eighteen": 18, "nineteen": 19, +} +_TENS = { + "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50, "sixty": 60, + "seventy": 70, "eighty": 80, "ninety": 90, +} +_ORDINALS = { + "first": 1, "second": 2, "third": 3, "fourth": 4, "fifth": 5, + "sixth": 6, "seventh": 7, "eighth": 8, "ninth": 9, "tenth": 10, + "eleventh": 11, "twelfth": 12, "thirteenth": 13, "fourteenth": 14, + "fifteenth": 15, +} +# Deliberately small special-case lexicon. "a dozen" is unambiguous; the vague +# words map to None so callers ABSTAIN rather than guess a cardinality. +_SPECIAL = {"dozen": 12} +_VAGUE = {"a few", "several", "a couple", "a handful", "some", "many", "various"} + + +def word_to_int(phrase): + """Return an int for a spelled-out cardinal/ordinal phrase, else None. + + Handles 0-99, "N hundred", "N thousand", hyphenated tens ("twenty-five"), + "a dozen", and ordinals. Returns None for anything ambiguous/unsupported so + the caller abstains. 
+ """ + s = phrase.strip().lower().replace("-", " ") + if s in _VAGUE: + return None + if s in _ORDINALS: + return _ORDINALS[s] + if s in ("a dozen", "one dozen", "dozen"): + return 12 + tokens = [t for t in re.split(r"\s+", s) if t and t not in ("and", "a", "an")] + if not tokens: + return None + total = 0 + current = 0 + saw = False + for tok in tokens: + if tok in _UNITS: + current += _UNITS[tok] + saw = True + elif tok in _TENS: + current += _TENS[tok] + saw = True + elif tok == "hundred": + current = (current or 1) * 100 + saw = True + elif tok == "thousand": + total += (current or 1) * 1000 + current = 0 + saw = True + elif tok in _SPECIAL: + current += _SPECIAL[tok] + saw = True + else: + return None # unknown token -> abstain + if not saw: + return None + return total + current + + +def parse_count_token(tok): + """Parse a single count token (digits or one spelled word/phrase) -> int|None.""" + tok = tok.strip() + if re.fullmatch(r"\d{1,4}", tok): + return int(tok) + return word_to_int(tok) + + +# A regex alternation matching a single number word (incl. hyphenated tens) or digits. +_NUMWORD = ( + r"(?:\d{1,4}|" + r"(?:twenty|thirty|forty|fifty|sixty|seventy|eighty|ninety)(?:[ -](?:one|two|three|four|five|six|seven|eight|nine))?|" + r"zero|one|two|three|four|five|six|seven|eight|nine|ten|eleven|twelve|thirteen|" + r"fourteen|fifteen|sixteen|seventeen|eighteen|nineteen|" + r"a dozen|dozen)" +) + +_APPROX = re.compile(r"(~|≈|≅|\bapprox(?:imately)?\b|\babout\b|\broughly\b|\bor so\b)", re.I) + + +# --------------------------------------------------------------------------- +# R1 — fraction / percentage arithmetic self-check. 
+# --------------------------------------------------------------------------- +_FRAC_PCT = re.compile( + r"(?P<num>\d{1,6})\s*/\s*(?P<den>\d{1,6})\s*" + r"(?:=|\(|\bis\b|,|\s)\s*~?\s*(?P<pct>\d{1,3}(?:\.\d+)?)\s*%" +) + + +def check_r1(text): + """Flag 'A/B = P%' when P does not match A/B within rounding tolerance.""" + for m in _FRAC_PCT.finditer(text): + num = int(m.group("num")) + den = int(m.group("den")) + if den == 0: + continue + pct_str = m.group("pct") + stated = float(pct_str) + # Abstain on explicit approximation markers right before the percent. + head = text[max(0, m.start()): m.start("pct")] + if _APPROX.search(head): + continue + computed = num / den * 100.0 + # Half-ULP rounding tolerance at the stated decimal precision, +epsilon. + decimals = len(pct_str.split(".")[1]) if "." in pct_str else 0 + tol = 0.5 * (10 ** (-decimals)) + 1e-9 + if abs(computed - stated) > tol: + return { + "decision": "block", + "rule": "count_drift.fraction_percent_mismatch", + "evidence": "%s = %s%% but %d/%d = %.2f%%" % ( + m.group("num") + "/" + m.group("den"), pct_str, num, den, computed), + } + return None + + +# --------------------------------------------------------------------------- +# R2 — "N of M" bound check. +# --------------------------------------------------------------------------- +_N_OF_M = re.compile( + r"\b(?P<n>%s)\s+of\s+(?:the\s+|those\s+|these\s+|all\s+)?(?P<m>%s)\b" % (_NUMWORD, _NUMWORD), + re.I, +) + + +def check_r2(text): + """Flag 'N of M' where N > M (impossible).""" + for m in _N_OF_M.finditer(text): + n = parse_count_token(m.group("n")) + mm = parse_count_token(m.group("m")) + if n is None or mm is None: + continue + if n > mm: + return { + "decision": "block", + "rule": "count_drift.n_of_m_exceeds", + "evidence": "'%s of %s' — %d exceeds %d" % ( + m.group("n"), m.group("m"), n, mm), + } + return None + + +# --------------------------------------------------------------------------- +# Enumeration parsing (markdown lists + tables), depth-aware, stdlib only. 
+# --------------------------------------------------------------------------- +_LIST_RE = re.compile(r"^(?P<indent>[ \t]*)(?:[-*+]|\d{1,3}[.)])\s+\S") +_HEADING_RE = re.compile(r"^\s{0,3}#{1,6}\s") + +# Label words: when a number directly follows one of these it is an index/ID, +# not a count (e.g. "Section 3", "Step 2", "v4", "Figure 1"). Abstain. +_LABEL_BEFORE = re.compile( + r"(?:section|step|part|phase|chapter|figure|fig|table|appendix|item|version|" + r"v|level|tier|round|pass|day|group|page|line|note|task|issue|pr|#)\s*$", + re.I, +) + + +def _is_table_sep(line): + """A markdown table separator row, for any column count (1+).""" + cells = [c.strip() for c in line.strip().strip("|").split("|")] + cells = [c for c in cells if c != ""] + return bool(cells) and all(re.fullmatch(r":?-{2,}:?", c) for c in cells) + + +def find_enumerations(lines): + """Return contiguous enumeration blocks as dicts: + {kind, count, start, end} (count = TOP-LEVEL items / table data rows). + Conservative: a blank line or a heading ends a block. 
+ """ + blocks = [] + i = 0 + n = len(lines) + while i < n: + line = lines[i] + m = _LIST_RE.match(line) + if m: + start = i + indents = [] + j = i + while j < n: + lm = _LIST_RE.match(lines[j]) + if lm: + indents.append(len(lm.group("indent").replace("\t", " "))) + j += 1 + elif lines[j].strip() == "": + break + elif _HEADING_RE.match(lines[j]): + break + else: + # continuation / lazy line within the list: keep going + j += 1 + base = min(indents) if indents else 0 + top = 0 + k = start + while k < j: + lm = _LIST_RE.match(lines[k]) + if lm and len(lm.group("indent").replace("\t", " ")) == base: + top += 1 + k += 1 + blocks.append({"kind": "list", "count": top, "start": start, "end": j}) + i = j + continue + # table: contiguous lines containing a pipe, with a separator row + if "|" in line and line.strip(): + start = i + j = i + while j < n and "|" in lines[j] and lines[j].strip(): + j += 1 + tbl = lines[start:j] + sep_idx = next((idx for idx, l in enumerate(tbl) if _is_table_sep(l)), None) + if sep_idx is not None and sep_idx >= 1: + data_rows = len(tbl) - (sep_idx + 1) + if data_rows >= 1: + blocks.append({"kind": "table", "count": data_rows, + "start": start, "end": j}) + i = max(j, i + 1) + continue + i += 1 + return blocks + + +# A count lead-in: "<num> <noun> [ up to 3 plain words]:" where the colon is +# adjacent to the noun phrase with NO intervening punctuation, number, or +# sentence break. This rejects prose where a number and a sentence-colon merely +# co-occur on a line ("...favor one side. Instead:" / "...one of four quadrants:"). +_LEADIN_RE = re.compile( + r"\b(?P<num>%s)\s+(?P<noun>[A-Za-z][A-Za-z-]{2,30})" + r"(?:[ \t]+[A-Za-z][A-Za-z-]+){0,3}[ \t]*:\s*$" % _NUMWORD, + re.I, +) +# Number must be the FIRST token of the heading content ("## 3 Key Findings"), +# not buried after a label ("## Section 3 notes" -> abstain). 
+_HEADING_COUNT_RE = re.compile( + r"^\s{0,3}#{1,6}\s+(?P<num>%s)\s+(?P<noun>[A-Za-z][A-Za-z-]{2,30})\b" % _NUMWORD, + re.I, +) + + +def check_r3(text, lines, enumerations): + """Flag a lead-in count claim immediately followed by exactly one enumeration + whose top-level count differs. Abstain on any ambiguity.""" + for idx, line in enumerate(lines): + claim = None + is_heading = False + m = _LEADIN_RE.search(line) + if m: + claim = m + else: + hm = _HEADING_COUNT_RE.match(line) + if hm: + claim = hm + is_heading = True + if not claim: + continue + # Lead-in-only guards (a heading already requires the number to be the + # first content token, so no label or second-number can precede it). + if not is_heading: + # Abstain if the number is an index/ID after a label word ("Step 3 tasks:"). + if _LABEL_BEFORE.search(line[:claim.start("num")]): + continue + # Abstain if a SECOND number sits between the noun and the lead-in colon + # ("3 reasons: the top 2 are:" — the real enumerand is 2, not 3). + if re.search(_NUMWORD, line[claim.end("noun"):claim.end()], re.I): + continue + stated = parse_count_token(claim.group("num")) + if stated is None or stated < 2: + continue # abstain on vague / count < 2 (a "one X:" lead-in is almost always prose) + # Scope: from just after this line to the next heading/claim boundary. + scope_end = len(lines) + for k in range(idx + 1, len(lines)): + if _HEADING_RE.match(lines[k]): + scope_end = k + break + # Candidate enumerations: LISTS only (table rows are a poor proxy for a + # claimed count — e.g. a 2x2 matrix has 4 cells but 2 rows), that START + # within (idx, scope_end) and within a small adjacency gap (<=2 non-empty + # lines before the block). + cands = [] + for b in enumerations: + if b["kind"] == "list" and idx < b["start"] < scope_end: + gap_lines = [l for l in lines[idx + 1:b["start"]] if l.strip()] + if len(gap_lines) <= 2: + cands.append(b) + # ABSTAIN unless exactly one adjacent candidate enumeration. 
+ if len(cands) != 1: + continue + actual = cands[0]["count"] + if actual >= 1 and actual != stated: + return { + "decision": "block", + "rule": "count_drift.headline_enumeration_mismatch", + "evidence": "claim '%s %s' but the %s lists %d top-level item(s)" % ( + claim.group("num"), claim.group("noun"), + cands[0]["kind"], actual), + } + return None + + +def analyze(text): + lines = text.splitlines() + enums = find_enumerations(lines) + for check in (lambda: check_r1(text), + lambda: check_r2(text), + lambda: check_r3(text, lines, enums)): + res = check() + if res: + return res + return {"decision": "pass", "rule": "", "evidence": ""} + + +def _read_input(argv): + if "--text" in argv: + return argv[argv.index("--text") + 1] + if "--file" in argv: + with open(argv[argv.index("--file") + 1], "r", encoding="utf-8", errors="replace") as f: + return f.read() + return sys.stdin.read() + + +def main(): + try: + text = _read_input(sys.argv[1:]) + except Exception: + print(json.dumps({"decision": "pass", "rule": "", "evidence": ""})) + return 0 + if not text or not text.strip(): + print(json.dumps({"decision": "pass", "rule": "", "evidence": ""})) + return 0 + try: + result = analyze(text) + except Exception: + # Fail-open: never break a session on a parser bug. + result = {"decision": "pass", "rule": "", "evidence": ""} + print(json.dumps(result)) + return 0 + + +if __name__ == "__main__": + sys.exit(main()) diff --git a/tests/test-count-drift.sh b/tests/test-count-drift.sh new file mode 100755 index 0000000..18356c3 --- /dev/null +++ b/tests/test-count-drift.sh @@ -0,0 +1,69 @@ +#!/usr/bin/env bash +# Tests for the no-count-drift hook + lib/count_drift.py core. +# Run: bash tests/test-count-drift.sh Exit: 0 on success, 1 on any failure. +set -uo pipefail + +ROOT="$(cd "$(dirname "$0")/.." 
&& pwd)" +HOOK="$ROOT/hooks/no-count-drift.sh" +SCORER="$ROOT/evaluation/v6/score_count_drift.py" + +PASS=0; FAIL=0; FAILS=() +assert_exit() { # desc expected actual + if [ "$2" = "$3" ]; then + PASS=$((PASS + 1)); printf ' PASS %s\n' "$1" + else + FAIL=$((FAIL + 1)); FAILS+=("$1") + printf ' FAIL %s (want exit %s, got %s)\n' "$1" "$2" "$3" + fi +} +run_hook() { # message -> sets RC + local msg="$1" + printf '%s' "$(jq -n --arg m "$msg" \ + '{hook_event_name:"Stop",stop_hook_active:false,last_assistant_message:$m}')" \ + | bash "$HOOK" >/dev/null 2>&1 + RC=$? +} + +# SC1 + SC2 + SC3: scorer exits 0 only when precision == 1.0 (zero FP) over the +# adversarial fixture set, with the seeded positives blocked. +python3 "$SCORER" >/dev/null 2>&1 +assert_exit "scorer: 0 false positives (SC1), seeds blocked (SC2), abstain (SC3)" 0 "$?" + +# Hook end-to-end. +run_hook "$(printf 'Six findings:\n- a\n- b\n- c\n- d\n- e')" +assert_exit "hook blocks headline-vs-list mismatch (exit 2)" 2 "$RC" + +run_hook "$(printf 'Five findings:\n- a\n- b\n- c\n- d\n- e')" +assert_exit "hook passes a correct count (exit 0)" 0 "$RC" + +run_hook "Coverage is 9/10 = 80% overall." +assert_exit "hook blocks wrong fraction-percent (exit 2)" 2 "$RC" + +run_hook "$(printf '3 reasons: the top 2 are:\n- x\n- y')" +assert_exit "hook abstains on nested-colon trap (exit 0)" 0 "$RC" + +# SC4 fail-open paths. +printf 'not json at all' | bash "$HOOK" >/dev/null 2>&1 +assert_exit "fail-open on non-JSON input (SC4)" 0 "$?" + +printf '%s' "$(jq -n '{hook_event_name:"Stop",last_assistant_message:""}')" | bash "$HOOK" >/dev/null 2>&1 +assert_exit "fail-open on empty message (SC4)" 0 "$?" + +printf '%s' "$(jq -n --arg m "$(printf 'Six findings:\n- a\n- b')" \ + '{hook_event_name:"Stop",stop_hook_active:true,last_assistant_message:$m}')" \ + | bash "$HOOK" >/dev/null 2>&1 +assert_exit "re-entrancy guard: stop_hook_active=true never blocks" 0 "$?" + +# SC5 determinism: identical scorer output across two runs. 
+A="$(python3 "$SCORER" 2>/dev/null)" +Bb="$(python3 "$SCORER" 2>/dev/null)" +[ "$A" = "$Bb" ] +assert_exit "determinism: identical scorer output twice (SC5)" 0 "$?" + +echo "" +echo "PASS=$PASS FAIL=$FAIL" +if [ "$FAIL" -ne 0 ]; then + printf 'FAILURES: %s\n' "${FAILS[*]}" + exit 1 +fi +echo "ALL TESTS PASSED"
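
Review note on the R1 rounding tolerance in `check_r1` (`lib/count_drift.py`): the rule accepts a stated percentage when it lies within half a unit of its last printed decimal place of the computed value. The following is a minimal standalone sketch of that tolerance only, not part of the patch. The helper name `percent_is_consistent` is hypothetical, and the regex extraction and approximation-marker abstention are deliberately left out.

```python
def percent_is_consistent(num: int, den: int, pct_str: str) -> bool:
    """Return True when pct_str is an acceptable rounding of num/den as a percent.

    Hypothetical helper for illustration; it mirrors only the tolerance
    arithmetic of check_r1, not its regex matching or abstention guards.
    """
    if den == 0:
        # check_r1 skips zero denominators, so no mismatch is asserted here.
        return True
    stated = float(pct_str)
    computed = num / den * 100.0
    # Tolerance is half a unit of the last stated decimal place, plus epsilon.
    decimals = len(pct_str.split(".")[1]) if "." in pct_str else 0
    tol = 0.5 * (10 ** (-decimals)) + 1e-9
    return abs(computed - stated) <= tol


if __name__ == "__main__":
    print(percent_is_consistent(9, 10, "80"))    # False: 9/10 is 90%
    print(percent_is_consistent(2, 3, "67"))     # True: 66.67 is within 0.5 of 67
    print(percent_is_consistent(2, 3, "66.7"))   # True: within 0.05 of 66.7
    print(percent_is_consistent(2, 3, "66.0"))   # False: off by about 0.67
```

The tolerance scales with the precision the author chose to print, so "67%" and "66.7%" are both accepted for 2/3 while "66.0%" is not. That seems consistent with the abstain-rather-than-false-fire design stated in the module docstring.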