Add CLI-level regression ship-gate corpus (evals/)#15
Merged
evals.json covers skill-level LLM behavior; there was no deterministic, machine-checkable regression for the check-file engine itself. This adds a labeled corpus + runner so unit/source/matcher changes can be gated without an LLM in the loop.

Why: a usability test on a materials-science reference set surfaced that the tool accepted 0/8 real claims. Root causes split into a matcher/units layer (#7/#8 landed; #10/#11/#14 + an up-to comparator gap open) and an abstract reachability layer (#12/#13). This corpus pins both as invariants vs. progress.

- cli_regression.jsonl: 12 rows spanning the publisher matrix (crossref / openalex-only / none) and claim types (numeric value+unit, relational, fabricated, dead DOI), each labeled with expected_verdict, must_accept / must_not_accept invariants, and gated_on issues.
- run_cli_regression.py: stdlib runner. SAFETY invariants (no fabricated / relational / unreachable / over-accepting claim ever ACCEPTs; the one clean supported claim stays ACCEPT) exit non-zero; gated rows report as PENDING.
- cli_regression.md: documents the two invariant classes and how to run.

Current snapshot flags B3 (>220 evidence accepts exact-220 claim, #11) as the one live SAFETY failure; ACCEPT labels for A2/A3/B2 were grounded against live CrossRef/OpenAlex abstracts.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Re-ran the gate after #11/#13/#14 merged to main (commits 329cde9, a41e92f):

- A3 (1.7 eV) flipped PARTIAL -> ACCEPT (#14 condition handling)
- B3 over-acceptance flipped ACCEPT -> PARTIAL (#11)
- A2 now reachable via OpenAlex (#13) but still PARTIAL (residual trailing scope qualifier) -> remains PENDING

SAFETY is now 12/12 (gate green).

Runner: control rows (must_not_accept) now PASS on any non-ACCEPT verdict instead of pinning a single one, since UNVERIFIABLE vs PARTIAL can vary with Semantic-Scholar availability (e.g. D1 Amiour). The invariant is "never ACCEPT".

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
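The relaxed control-row rule from this commit can be illustrated with a small sketch. The function names `pinned_check` and `relaxed_check` are hypothetical and are not taken from `run_cli_regression.py`; only the verdict labels (ACCEPT, PARTIAL, UNVERIFIABLE) and the "never ACCEPT" invariant come from the commit message.

```python
def pinned_check(expected_verdict, actual_verdict):
    """Old behaviour (assumed): a control row passes only on one exact verdict."""
    return actual_verdict == expected_verdict


def relaxed_check(actual_verdict):
    """New behaviour: a must_not_accept row passes on any non-ACCEPT verdict."""
    return actual_verdict != "ACCEPT"


# A control row labelled UNVERIFIABLE may come back PARTIAL when an upstream
# source happens to be reachable. The pinned check flakes, the relaxed one holds.
flaky_under_pinning = not pinned_check("UNVERIFIABLE", "PARTIAL")
stable_under_relaxed = relaxed_check("PARTIAL") and relaxed_check("UNVERIFIABLE")
print(flaky_under_pinning, stable_under_relaxed)
```

The trade-off is that a control row no longer detects drift between non-ACCEPT verdicts; it only guards the safety promise.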
This was referenced Jun 8, 2026
Closed
What
A deterministic, machine-checkable regression set for the `check-file` engine, complementing the existing `evals/evals.json` (which covers skill-level LLM behavior). No LLM in the loop: it pins verdicts.

- `evals/cli_regression.jsonl`: 12 labeled rows across the publisher matrix (crossref / openalex-only / none) and claim types (numeric value+unit, relational, fabricated, dead DOI). Each row has `expected_verdict`, `must_accept` / `must_not_accept` invariants, `gated_on` issues, and `reachable_via`.
- `evals/run_cli_regression.py`: stdlib runner. Splits results into SAFETY (release-blocking invariants) and PROGRESS (gated rows that flip to PASS as issues land).
- `evals/cli_regression.md`: documents the two invariant classes and how to run.

Why
A usability test on a materials-science reference set found the tool accepted 0/8 real claims. Root causes split into two layers — a matcher/units layer (#7/#8 landed; #10/#11/#14 + an "up to" comparator gap still open) and an abstract-reachability layer (#12/#13). This corpus encodes both as a regression gate so progress is measurable and the safety promise can't silently regress.
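To make the row schema concrete, here is a sketch of what one corpus row and a minimal JSONL loader could look like. Only the field names `expected_verdict`, `must_accept`, `must_not_accept`, `gated_on`, and `reachable_via` come from the description; the `id` and `claim` keys, the value types, and every value are assumptions made for illustration and are not copied from `cli_regression.jsonl`.

```python
import json

# Hypothetical row. Values and the "id"/"claim" keys are illustrative only.
SAMPLE_LINE = json.dumps({
    "id": "X1",
    "claim": "placeholder claim text",
    "expected_verdict": "ACCEPT",
    "must_accept": False,
    "must_not_accept": False,
    "gated_on": [14],
    "reachable_via": "openalex-only",
})


def parse_rows(text):
    """Parse JSONL text: one JSON object per non-blank line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def is_gated(row):
    """Treat a row as gated when it lists at least one open issue (assumed)."""
    return bool(row.get("gated_on"))


rows = parse_rows(SAMPLE_LINE + "\n\n")
print(rows[0]["id"], is_gated(rows[0]))
```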
Run
Non-zero exit iff a SAFETY invariant breaks.
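The exit-code contract can be sketched as follows. This is not the code in `run_cli_regression.py`: the function names `classify` and `exit_code`, and the assumption that `must_accept` / `must_not_accept` are booleans on each row, are illustrative. What it keeps from the description is that only SAFETY failures block, while unmet gated rows report as PENDING.

```python
def classify(row, actual_verdict):
    """Return (invariant_class, status) for one row under assumed semantics."""
    if row.get("must_not_accept"):
        ok = actual_verdict != "ACCEPT"
        return "SAFETY", ("PASS" if ok else "FAIL")
    if row.get("must_accept"):
        ok = actual_verdict == "ACCEPT"
        return "SAFETY", ("PASS" if ok else "FAIL")
    # Remaining rows track progress: they never block a release.
    if actual_verdict == row.get("expected_verdict"):
        return "PROGRESS", "PASS"
    return "PROGRESS", "PENDING"


def exit_code(results):
    """Non-zero iff at least one SAFETY invariant failed."""
    broken = any(cls == "SAFETY" and status == "FAIL" for cls, status in results)
    return 1 if broken else 0


results = [
    classify({"must_accept": True}, "ACCEPT"),            # clean supported claim
    classify({"must_not_accept": True}, "UNVERIFIABLE"),  # control row
    classify({"expected_verdict": "ACCEPT"}, "PARTIAL"),  # gated, not yet fixed
]
print(results, exit_code(results))
```

A CI job would pass the returned code to `sys.exit`, so a PENDING row leaves the build green while any SAFETY FAIL turns it red.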
Current snapshot (latest `main`)

- `A1` (clean supported claim) ACCEPTs and must stay green.
- The must-not-accept controls (`B1`, relational `C1`/`C2`, unreachable `D1`/`D2`, dead DOI `E1`) all hold.
- Gated rows stay PENDING on #14 (`in/at/for [condition]`, residual behind #9) and the up-to comparator.
- `B3` is a live SAFETY failure: `>220 °C` evidence currently ACCEPTs an exact-`220 °C` claim (#11, "[bug] `>X` evidence wrongly entails an `exact X` claim (over-acceptance)"). The gate flags it as a release blocker until #11 lands.

ACCEPT labels for `A2`/`A3`/`B2` were grounded by fetching the live CrossRef/OpenAlex abstracts and confirming the value appears verbatim; no label asserts support absent from a fetched abstract.

Related: #10, #11, #13, #14.
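The `B3` over-acceptance can be stated as an entailment rule: evidence of the form `> X` must not support a claim of exactly `X`. The sketch below is a hypothetical comparator, not the project's matcher; the function name `evidence_entails` and the `(operator, value)` representation are assumptions, and unit handling and numeric tolerance are left out.

```python
def evidence_entails(evidence_op, evidence_value, claim_op, claim_value):
    """Return True only if the evidence logically implies the claim.

    Operators are "=" (exact value) and ">" (strict lower bound).
    """
    if claim_op == "=":
        # An exact claim needs exact evidence; a bound never pins one value.
        return evidence_op == "=" and evidence_value == claim_value
    if claim_op == ">":
        if evidence_op == ">":
            return evidence_value >= claim_value  # x > a implies x > b when a >= b
        if evidence_op == "=":
            return evidence_value > claim_value   # x = a implies x > b when a > b
    return False


# B3-style case: ">220" evidence against an exact-220 claim must not be accepted.
print(evidence_entails(">", 220, "=", 220))
```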
🤖 Generated with Claude Code