
eval(no-count-drift): committed recall probe (in-scope coverage 25/25) #28

Merged: waitdeadai merged 1 commit into main from feature/v6-count-drift-recall on May 25, 2026

@waitdeadai

Owner

Follow-up to #27. Adds a committed recall probe (evaluation/v6/recall_probe.jsonl, 25 genuine count-drift positives spanning phrasing variety) and folds recall into the scorer/RESULTS. Result: 25/25 in-scope recall alongside the 0/988 independent false positives from #27. Data-backed conclusion: no LLM-judge tier or engine is warranted — the deterministic tier covers the common phrasing space at full precision; out-of-scope forms stay abstained by design. No hook logic changed.
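For readers who want to see how a recall figure like 25/25 can be computed over a JSONL probe, here is a minimal sketch. It is not the code in score_count_drift.py: the function name `recall_on_probe`, the `text` field, and the `detector` callable are all hypothetical, and the actual schema of recall_probe.jsonl may differ.

```python
import json
from typing import Callable, Iterable, Tuple


def recall_on_probe(
    lines: Iterable[str],
    detector: Callable[[str], bool],
    text_key: str = "text",  # assumed field name, not confirmed by the PR
) -> Tuple[int, int]:
    """Return (hits, total) for a probe in which every record is a genuine positive."""
    hits = 0
    total = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the JSONL file
        record = json.loads(line)
        total += 1
        if detector(record[text_key]):
            hits += 1
    return hits, total
```

Because every probe record is a positive, recall is simply hits / total. Any iterable of lines works as input, including an open file handle.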

🤖 Generated with Claude Code

… 25/25)

Answers the open question from the v6 merge: precision was proven (0/988
independent FP) but recall was unmeasured because the MAD corpus barely contains
the target pattern. Adds evaluation/v6/recall_probe.jsonl — 25 genuine
count-drift positives authored to span phrasing variety (digit/word lead-ins,
number-first headings, prose prefixes, "all N passed", "there/here are N",
numbered lists, "a dozen", N-of-M, fraction/percent) — and folds recall into
score_count_drift.py / RESULTS.md.

Result: 25/25 in-scope recall, alongside 0/988 independent FP. Conclusion
(data-backed): no LLM-judge tier or Rust engine is warranted — the deterministic
tier already covers the common phrasing space at full precision. Out-of-scope
forms remain abstained by design. No hook logic changed.
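
To make "deterministic tier" and "abstained by design" concrete, here is a toy sketch of a count-versus-enumeration check. It is an illustration only, not the hook logic in this repository: the function `count_drift`, both regexes, and the three-way return value are assumptions, and it handles just one phrasing (a digit in the first line followed by a bulleted or numbered list).

```python
import re
from typing import Optional

# Hypothetical patterns: a digit claim in the lead-in line, and list-item lines.
CLAIM = re.compile(r"\b(\d+)\b")
ITEM = re.compile(r"^\s*(?:\d+[.)]|[-*])\s+\S")


def count_drift(text: str) -> Optional[bool]:
    """True if the claimed count differs from the enumerated items,
    False if they agree, None to abstain when the form is not recognised."""
    lines = text.splitlines()
    if not lines:
        return None
    claim = CLAIM.search(lines[0])
    items = [ln for ln in lines[1:] if ITEM.match(ln)]
    if claim is None or not items:
        return None  # out-of-scope form: abstain rather than guess
    return int(claim.group(1)) != len(items)
```

Returning None for unrecognised forms is what keeps a check like this from producing false positives on text it cannot parse, at the cost of recall outside its scope.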

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@waitdeadai waitdeadai merged commit a275d3a into main May 25, 2026
2 checks passed
@waitdeadai waitdeadai deleted the feature/v6-count-drift-recall branch May 25, 2026 21:36
waitdeadai pushed a commit that referenced this pull request May 25, 2026
Bumps the plugin version and description to include no-count-drift, the
count-vs-enumeration self-consistency gate (MAST FM-3.2) merged in #27/#28.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>