Add CLI-level regression ship-gate corpus (evals/) by moonweave · Pull Request #15 · Moonweave-Research/ref-verify

moonweave · 2026-06-08T13:25:57Z

What

A deterministic, machine-checkable regression set for the check-file engine, complementing the existing evals/evals.json (which covers skill-level LLM behavior). No LLM in the loop — it pins verdicts.

- evals/cli_regression.jsonl: 12 labeled rows across the publisher matrix (crossref / openalex-only / none) and claim types (numeric value+unit, relational, fabricated, dead DOI). Each row has expected_verdict, must_accept / must_not_accept invariants, gated_on issues, and reachable_via.
- evals/run_cli_regression.py: stdlib runner. Splits results into SAFETY (release-blocking invariants) and PROGRESS (gated rows that flip to PASS as issues land).
- evals/cli_regression.md: documents the two invariant classes and how to run.
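
As a rough illustration of the row shape described above, here is a minimal sketch of loading such a JSONL corpus with the standard library. The field names expected_verdict, must_accept, must_not_accept, gated_on, and reachable_via come from this description; the `id` field, the `Row` class, the `load_rows` helper, and every sample value are assumptions made for the example, not the actual contents of cli_regression.jsonl.

```python
import json
from dataclasses import dataclass, field
from typing import List


@dataclass
class Row:
    """One labeled corpus row (hypothetical schema, see lead-in)."""
    id: str                      # assumed field name for the row label
    expected_verdict: str        # e.g. "ACCEPT", "PARTIAL", "UNVERIFIABLE"
    must_accept: bool = False
    must_not_accept: bool = False
    gated_on: List[int] = field(default_factory=list)  # issue numbers
    reachable_via: str = "none"  # crossref / openalex-only / none


def load_rows(text: str) -> List[Row]:
    """Parse JSONL text: one JSON object per non-empty line."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        raw = json.loads(line)
        rows.append(Row(
            id=raw["id"],
            expected_verdict=raw["expected_verdict"],
            must_accept=bool(raw.get("must_accept", False)),
            must_not_accept=bool(raw.get("must_not_accept", False)),
            gated_on=list(raw.get("gated_on", [])),
            reachable_via=raw.get("reachable_via", "none"),
        ))
    return rows


# Illustrative rows only; these are not copied from the real corpus.
SAMPLE = "\n".join([
    json.dumps({"id": "demo-accept", "expected_verdict": "ACCEPT",
                "must_accept": True, "reachable_via": "crossref"}),
    json.dumps({"id": "demo-control", "expected_verdict": "UNVERIFIABLE",
                "must_not_accept": True}),
])

rows = load_rows(SAMPLE)
for row in rows:
    print(row.id, row.expected_verdict, row.reachable_via)
```

Keeping the invariants as explicit booleans on each row, rather than inferring them from the expected verdict, lets a runner decide which failures block a release without re-reading the label.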

Why

A usability test on a materials-science reference set found the tool accepted 0/8 real claims. Root causes split into two layers — a matcher/units layer (#7/#8 landed; #10/#11/#14 + an "up to" comparator gap still open) and an abstract-reachability layer (#12/#13). This corpus encodes both as a regression gate so progress is measurable and the safety promise can't silently regress.

Run

PYTHONPATH=src python3 evals/run_cli_regression.py

Non-zero exit iff a SAFETY invariant breaks.
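
The exit-code rule can be sketched as follows. This is an illustration under stated assumptions, not the code in run_cli_regression.py: the `summarize` function and the per-row result keys (`id`, `safety_ok`, `pending`) are hypothetical, and the report layout simply imitates the snapshot format shown in this description.

```python
from typing import Dict, List, Tuple


def summarize(results: List[Dict]) -> Tuple[str, int]:
    """Build a report and an exit code from per-row results.

    Each result is assumed to carry:
      id        - row label
      safety_ok - False when a release-blocking invariant broke
      pending   - True when the row is gated on an open issue
    """
    safety_failures = [r["id"] for r in results if not r["safety_ok"]]
    pending = [r["id"] for r in results if r["pending"]]
    ok = len(results) - len(safety_failures)

    lines = [f"SAFETY: {ok}/{len(results)} ok  |  PROGRESS pending: {len(pending)}"]
    if safety_failures:
        lines.append("SAFETY FAILURES (release blockers): " + ", ".join(safety_failures))

    # Only SAFETY failures affect the exit code; pending rows never do.
    exit_code = 1 if safety_failures else 0
    return "\n".join(lines), exit_code


demo = [
    {"id": "row-ok", "safety_ok": True, "pending": False},
    {"id": "row-pending", "safety_ok": True, "pending": True},
    {"id": "row-overaccept", "safety_ok": False, "pending": False},
]
report, code = summarize(demo)
print(report)
print("exit code:", code)
```

A real entry point would pass the returned code to `sys.exit`, so that CI fails only on a broken SAFETY invariant and not on rows that are still pending.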

Current snapshot (latest `main`)

SAFETY: 11/12 ok  |  PROGRESS pending: 5
SAFETY FAILURES (release blockers): B3-diez-overaccept

- A1 (clean supported claim) ACCEPTs and must stay green.
- 5 never-accept controls (fabricated B1, relational C1/C2, unreachable D1/D2, dead DOI E1) all hold.
- 5 PENDING gated false-negatives track #10 ([bug] Subject/number split across comma-clauses yields false PARTIAL), #13 ([enhancement] Add OpenAlex as an abstract source, covering IEEE/APS/ECS gaps that CrossRef+S2 miss), #14 ([bug] Scope/comparative suffix over-blocks definite measurements qualified by in/at/for [condition], residual behind #9), and the up-to comparator gap.
- B3 is a live SAFETY failure: >220 °C evidence currently ACCEPTs an exact-220 °C claim (#11, [bug] >X evidence wrongly entails an exact X claim (over-acceptance)). The gate flags it as a release blocker until #11 lands.
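
To make the B3 over-acceptance concrete, here is one way to model comparator entailment: treat each quantity as an interval and accept only when the evidence interval lies inside the claim interval. This is a sketch of the general idea, not the matcher in ref-verify or the fix for #11; `Interval`, `to_interval`, and `entails` are hypothetical names, and both values are assumed to be already normalized to the same unit.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    lo: float
    hi: float
    lo_open: bool  # True when the lower bound itself is excluded
    hi_open: bool  # True when the upper bound itself is excluded


def to_interval(comparator: str, value: float) -> Interval:
    """Map a comparator and a value to the set of values it allows."""
    if comparator == "=":
        return Interval(value, value, False, False)
    if comparator == ">":
        return Interval(value, math.inf, True, True)
    if comparator == ">=":
        return Interval(value, math.inf, False, True)
    if comparator == "<":
        return Interval(-math.inf, value, True, True)
    if comparator == "<=":
        return Interval(-math.inf, value, True, False)
    raise ValueError(f"unknown comparator: {comparator!r}")


def entails(evidence: Interval, claim: Interval) -> bool:
    """True when every value allowed by the evidence is allowed by the claim."""
    lo_ok = evidence.lo > claim.lo or (
        evidence.lo == claim.lo and (evidence.lo_open or not claim.lo_open)
    )
    hi_ok = evidence.hi < claim.hi or (
        evidence.hi == claim.hi and (evidence.hi_open or not claim.hi_open)
    )
    return lo_ok and hi_ok


# ">220" evidence does not pin the value to exactly 220.
print(entails(to_interval(">", 220), to_interval("=", 220)))   # False
# Exact evidence supports the same exact claim.
print(entails(to_interval("=", 220), to_interval("=", 220)))   # True
# A strict lower bound is enough for a non-strict one.
print(entails(to_interval(">", 220), to_interval(">=", 220)))  # True
```

Under this model, a claim of exactly 220 is supported only by evidence that is itself exact, which is the behavior the B3 row is meant to pin.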

ACCEPT labels for A2/A3/B2 were grounded by fetching the live CrossRef/OpenAlex abstracts and confirming the value appears verbatim — no label asserts support absent from a fetched abstract.

Related: #10, #11, #13, #14.

🤖 Generated with Claude Code

evals.json covers skill-level LLM behavior; there was no deterministic, machine-checkable regression for the check-file engine itself. This adds a labeled corpus + runner so unit/source/matcher changes can be gated without an LLM in the loop.

Why: a usability test on a materials-science reference set surfaced that the tool accepted 0/8 real claims. Root causes split into a matcher/units layer (#7/#8 landed; #10/#11/#14 + an up-to comparator gap open) and an abstract reachability layer (#12/#13). This corpus pins both as invariants vs. progress.

- cli_regression.jsonl: 12 rows spanning the publisher matrix (crossref / openalex-only / none) and claim types (numeric value+unit, relational, fabricated, dead DOI), each labeled with expected_verdict, must_accept / must_not_accept invariants, and gated_on issues.
- run_cli_regression.py: stdlib runner. SAFETY invariants (no fabricated / relational / unreachable / over-accepting claim ever ACCEPTs; the one clean supported claim stays ACCEPT) exit non-zero; gated rows report as PENDING.
- cli_regression.md: documents the two invariant classes and how to run.

Current snapshot flags B3 (>220 evidence accepts exact-220 claim, #11) as the one live SAFETY failure; ACCEPT labels for A2/A3/B2 were grounded against live CrossRef/OpenAlex abstracts.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>


Re-ran the gate after #11/#13/#14 merged to main (commits 329cde9, a41e92f):

- A3 (1.7 eV) flipped PARTIAL -> ACCEPT (#14 condition handling)
- B3 over-acceptance flipped ACCEPT -> PARTIAL (#11)
- A2 now reachable via OpenAlex (#13) but still PARTIAL (residual trailing scope qualifier) -> remains PENDING

SAFETY is now 12/12 (gate green).

Runner: control rows (must_not_accept) now PASS on any non-ACCEPT verdict instead of pinning a single one, since UNVERIFIABLE vs PARTIAL can vary with Semantic-Scholar availability (e.g. D1 Amiour). The invariant is "never ACCEPT".

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
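
The relaxed control-row rule described in this commit could look roughly like the sketch below. The verdict strings and the must_not_accept / expected_verdict fields come from this PR; the `row_passes` function and the sample rows are hypothetical and are not taken from run_cli_regression.py.

```python
from typing import Dict


def row_passes(row: Dict, verdict: str) -> bool:
    """Check one row against the verdict the engine produced.

    Control rows (must_not_accept) pass on any verdict other than ACCEPT,
    so a change in upstream availability that swaps UNVERIFIABLE for
    PARTIAL does not turn the gate red. All other rows stay pinned to
    their expected verdict.
    """
    if row.get("must_not_accept"):
        return verdict != "ACCEPT"
    return verdict == row["expected_verdict"]


# Illustrative rows; the labels are not copied from the real corpus.
control = {"id": "demo-control", "expected_verdict": "UNVERIFIABLE",
           "must_not_accept": True}
pinned = {"id": "demo-pinned", "expected_verdict": "ACCEPT"}

for verdict in ("UNVERIFIABLE", "PARTIAL", "ACCEPT"):
    print(control["id"], verdict, row_passes(control, verdict))
print(pinned["id"], "PARTIAL", row_passes(pinned, "PARTIAL"))
```

The trade-off is that a control row no longer detects a drift between two non-ACCEPT verdicts; it only enforces the "never ACCEPT" invariant.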


moonweave and others added 3 commits June 8, 2026 22:25

Merge remote-tracking branch 'origin/main' into evals/cli-regression-corpus

a81228f

moonweave added 3 commits June 8, 2026 13:45

Merge remote-tracking branch 'origin/main' into evals/cli-regression-corpus

816ad62

Update CLI regression corpus gate status

50c224a

Wire CLI regression gate into live smoke

d7af64f

moonweave merged commit d625700 into main Jun 8, 2026
2 checks passed

moonweave merged 6 commits into main from evals/cli-regression-corpus