Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
8 changes: 8 additions & 0 deletions evaluation/v6/RESULTS.md
Original file line number Diff line number Diff line change
Expand Up @@ -26,6 +26,14 @@ Detector run over corpora it was NOT authored against — real LLM `model_respon

False-positive rate on independent text: **0.0000**. This is the load-bearing, non-circular precision evidence — distinct from the hand-authored F1 below. (Two real false positives found during development — a too-loose lead-in and a missing word-boundary on number words — were fixed and locked in as regression negatives.)

## Recall probe (in-scope phrasing coverage)

25 genuine count-drift positives authored to span phrasing variety (digit/word lead-ins, number-first headings, prose prefixes, 'all N passed', 'there/here are N', numbered lists, 'a dozen', N-of-M, fraction/percent). Reproduce: `python3 evaluation/v6/score_count_drift.py`.

Recall on in-scope positives: **25/25 = 1.00**.

Caveat: hand-authored, so this measures coverage across the phrasing space the author could enumerate, not wild recall. Out-of-scope forms (a count with no adjacent enumeration, flowing-prose counts, table/semantic matches) are abstained by design; extending to them needs an LLM-judge advisory tier, deliberately deferred (it never blocks, and self-consistent count errors are exactly what LLM judges miss).

## Honesty caveat (read before citing F1)

This corpus is **hand-authored** — the same author wrote the detector and the fixtures — so an F1 of 1.0 here is **not** a wild-generalization claim; it is a co-evolved-corpus number and would inflate if cited as field performance. What the number legitimately shows: the detector behaves to spec on the designed cases, **including the adversarial negatives authored to break it** (nested-colon lead-ins, section-index numbers, label words, approximation markers, ambiguous multi-list scope, nested-list depth). The load-bearing, generalizable metric is **precision / zero-false-positives on those adversarial negatives** — the property a blocking gate must hold.
Expand Down
Empty file modified evaluation/v6/independent_eval.py
100644 → 100755
Empty file.
25 changes: 25 additions & 0 deletions evaluation/v6/recall_probe.jsonl
Original file line number Diff line number Diff line change
@@ -0,0 +1,25 @@
{"id": "leadin_colon_digit", "text": "I found 3 issues:\n- thing 1\n- thing 2", "expect": "block", "note": "genuine count-drift; in-scope phrasing-coverage probe"}
{"id": "leadin_colon_word", "text": "Three reasons:\n- thing 1\n- thing 2\n- thing 3\n- thing 4", "expect": "block", "note": "genuine count-drift; in-scope phrasing-coverage probe"}
{"id": "leadin_following_phrase", "text": "The following 5 steps:\n- thing 1\n- thing 2\n- thing 3\n- thing 4", "expect": "block", "note": "genuine count-drift; in-scope phrasing-coverage probe"}
{"id": "heading_numfirst", "text": "## 4 Findings\n- thing 1\n- thing 2\n- thing 3", "expect": "block", "note": "genuine count-drift; in-scope phrasing-coverage probe"}
{"id": "prose_prefix_we_shipped", "text": "We shipped four fixes:\n- thing 1\n- thing 2\n- thing 3", "expect": "block", "note": "genuine count-drift; in-scope phrasing-coverage probe"}
{"id": "prose_prefix_produced", "text": "This produced six cases:\n- thing 1\n- thing 2\n- thing 3\n- thing 4\n- thing 5", "expect": "block", "note": "genuine count-drift; in-scope phrasing-coverage probe"}
{"id": "n_of_m_wrong", "text": "5 of 4 tests passed.", "expect": "block", "note": "genuine count-drift; in-scope phrasing-coverage probe"}
{"id": "n_of_m_words", "text": "all seven of five lanes are green", "expect": "block", "note": "genuine count-drift; in-scope phrasing-coverage probe"}
{"id": "frac_pct_wrong", "text": "Pass rate 7/10 = 60% this run.", "expect": "block", "note": "genuine count-drift; in-scope phrasing-coverage probe"}
{"id": "frac_paren_wrong", "text": "Resolved 3/4 (50%) of items.", "expect": "block", "note": "genuine count-drift; in-scope phrasing-coverage probe"}
{"id": "numbered_two_three", "text": "Two options:\n1. thing 1\n2. thing 2\n3. thing 3", "expect": "block", "note": "genuine count-drift; in-scope phrasing-coverage probe"}
{"id": "dozen_vs_11", "text": "a dozen items:\n- thing 1\n- thing 2\n- thing 3\n- thing 4\n- thing 5\n- thing 6\n- thing 7\n- thing 8\n- thing 9\n- thing 10\n- thing 11", "expect": "block", "note": "genuine count-drift; in-scope phrasing-coverage probe"}
{"id": "fifteen_vs_14", "text": "fifteen entries:\n- thing 1\n- thing 2\n- thing 3\n- thing 4\n- thing 5\n- thing 6\n- thing 7\n- thing 8\n- thing 9\n- thing 10\n- thing 11\n- thing 12\n- thing 13\n- thing 14", "expect": "block", "note": "genuine count-drift; in-scope phrasing-coverage probe"}
{"id": "all_n_passed", "text": "All 6 checks passed:\n- thing 1\n- thing 2\n- thing 3\n- thing 4", "expect": "block", "note": "genuine count-drift; in-scope phrasing-coverage probe"}
{"id": "there_are_n", "text": "There are three blockers:\n- thing 1\n- thing 2", "expect": "block", "note": "genuine count-drift; in-scope phrasing-coverage probe"}
{"id": "here_are_n", "text": "Here are the four options:\n- thing 1\n- thing 2\n- thing 3\n- thing 4\n- thing 5", "expect": "block", "note": "genuine count-drift; in-scope phrasing-coverage probe"}
{"id": "count_changes", "text": "We made five changes:\n- thing 1\n- thing 2\n- thing 3", "expect": "block", "note": "genuine count-drift; in-scope phrasing-coverage probe"}
{"id": "findings_plural", "text": "Seven findings surfaced:\n- thing 1\n- thing 2\n- thing 3\n- thing 4\n- thing 5\n- thing 6", "expect": "block", "note": "genuine count-drift; in-scope phrasing-coverage probe"}
{"id": "steps_numbered_wrong", "text": "Three steps:\n1. thing 1\n2. thing 2", "expect": "block", "note": "genuine count-drift; in-scope phrasing-coverage probe"}
{"id": "issues_caps", "text": "TWO issues remain:\n- thing 1\n- thing 2\n- thing 3", "expect": "block", "note": "genuine count-drift; in-scope phrasing-coverage probe"}
{"id": "nine_items", "text": "Nine items:\n- thing 1\n- thing 2\n- thing 3\n- thing 4\n- thing 5\n- thing 6\n- thing 7", "expect": "block", "note": "genuine count-drift; in-scope phrasing-coverage probe"}
{"id": "ten_reasons", "text": "Ten reasons below:\n- thing 1\n- thing 2\n- thing 3\n- thing 4\n- thing 5\n- thing 6\n- thing 7\n- thing 8", "expect": "block", "note": "genuine count-drift; in-scope phrasing-coverage probe"}
{"id": "eight_tasks", "text": "Eight tasks left:\n- thing 1\n- thing 2\n- thing 3\n- thing 4\n- thing 5\n- thing 6", "expect": "block", "note": "genuine count-drift; in-scope phrasing-coverage probe"}
{"id": "frac_half_wrong", "text": "covered 1/2 = 70% of cases", "expect": "block", "note": "genuine count-drift; in-scope phrasing-coverage probe"}
{"id": "n_of_m_the", "text": "6 of the 4 modules compiled", "expect": "block", "note": "genuine count-drift; in-scope phrasing-coverage probe"}
27 changes: 27 additions & 0 deletions evaluation/v6/score_count_drift.py
Original file line number Diff line number Diff line change
Expand Up @@ -127,6 +127,33 @@ def _count(texts):
)
except Exception:
pass

# Recall probe: in-scope phrasing-coverage over genuine count-drift positives.
try:
probe = load(os.path.join(HERE, "recall_probe.jsonl"))
verdicts = [(r["id"], cd.analyze(r["text"])["decision"]) for r in probe]
pcaught = sum(1 for _, d in verdicts if d == "block")
pmiss = [pid for pid, d in verdicts if d != "block"]
if probe:
summary += (
"\n## Recall probe (in-scope phrasing coverage)\n\n"
"%d genuine count-drift positives authored to span phrasing variety "
"(digit/word lead-ins, number-first headings, prose prefixes, 'all N "
"passed', 'there/here are N', numbered lists, 'a dozen', N-of-M, "
"fraction/percent). Reproduce: `python3 evaluation/v6/score_count_drift.py`.\n\n"
"Recall on in-scope positives: **%d/%d = %.2f**.%s\n\n"
"Caveat: hand-authored, so this measures coverage across the phrasing space "
"the author could enumerate, not wild recall. Out-of-scope forms (a count "
"with no adjacent enumeration, flowing-prose counts, table/semantic matches) "
"are abstained by design; extending to them needs an LLM-judge advisory tier, "
"deliberately deferred (it never blocks, and self-consistent count errors are "
"exactly what LLM judges miss).\n"
% (len(probe), pcaught, len(probe), pcaught / len(probe),
"" if not pmiss else " Misses: " + ", ".join(pmiss) + ".")
)
except Exception:
pass

summary += (
"\n## Honesty caveat (read before citing F1)\n\n"
"This corpus is **hand-authored** — the same author wrote the detector and the "
Expand Down
Loading