
Add CLI-level regression ship-gate corpus (evals/)#15

Merged
moonweave merged 6 commits into main from evals/cli-regression-corpus
Jun 8, 2026


@moonweave

Collaborator

What

A deterministic, machine-checkable regression set for the check-file engine, complementing the existing evals/evals.json (which covers skill-level LLM behavior). There is no LLM in the loop: each row pins an expected verdict.

  • evals/cli_regression.jsonl — 12 labeled rows across the publisher matrix (crossref / openalex-only / none) and claim types (numeric value+unit, relational, fabricated, dead DOI). Each row has expected_verdict, must_accept / must_not_accept invariants, gated_on issues, and reachable_via.
  • evals/run_cli_regression.py — stdlib runner. Splits results into SAFETY (release-blocking invariants) and PROGRESS (gated rows that flip to PASS as issues land).
  • evals/cli_regression.md — documents the two invariant classes and how to run.
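To make the row shape concrete, here is a minimal sketch of loading JSONL rows and deciding which report section(s) each row feeds. The field names come from the list above; the row ids, the field values, the boolean encoding of must_accept / must_not_accept, and the helper names `load_rows` and `row_roles` are all assumptions for illustration, not the actual corpus or runner.

```python
import json

# Hypothetical rows. Field names follow the PR description; ids and values are invented.
SAMPLE_JSONL = "\n".join([
    json.dumps({"id": "X1", "expected_verdict": "ACCEPT", "must_accept": True,
                "must_not_accept": False, "gated_on": [], "reachable_via": "crossref"}),
    json.dumps({"id": "X2", "expected_verdict": "UNVERIFIABLE", "must_accept": False,
                "must_not_accept": True, "gated_on": [], "reachable_via": "none"}),
    json.dumps({"id": "X3", "expected_verdict": "ACCEPT", "must_accept": False,
                "must_not_accept": False, "gated_on": ["#10"],
                "reachable_via": "openalex-only"}),
])


def load_rows(text):
    """Parse one JSON object per non-empty line (the JSONL convention)."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def row_roles(row):
    """Return the report sections a row contributes to.

    Assumption: a row with an invariant counts toward SAFETY, and a row with
    open gated_on issues counts toward PROGRESS. A row could carry both.
    """
    roles = []
    if row.get("must_accept") or row.get("must_not_accept"):
        roles.append("SAFETY")
    if row.get("gated_on"):
        roles.append("PROGRESS")
    return roles


rows = load_rows(SAMPLE_JSONL)
roles_by_id = {row["id"]: row_roles(row) for row in rows}
print(roles_by_id)
```

Returning a list rather than a single label leaves room for rows that have both an invariant and an open gating issue; whether the real corpus contains such rows is not stated here.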

Why

A usability test on a materials-science reference set found the tool accepted 0/8 real claims. Root causes split into two layers — a matcher/units layer (#7/#8 landed; #10/#11/#14 + an "up to" comparator gap still open) and an abstract-reachability layer (#12/#13). This corpus encodes both as a regression gate so progress is measurable and the safety promise can't silently regress.

Run

PYTHONPATH=src python3 evals/run_cli_regression.py

The runner exits non-zero if and only if a SAFETY invariant breaks; pending PROGRESS rows never affect the exit status.
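The exit-status rule can be sketched as follows. This is an illustration of the stated contract only, assuming the runner collects the ids of failing SAFETY rows and a count of pending rows; `exit_code` and `summarize` are hypothetical names, and the summary layout mirrors the snapshot lines shown in this description.

```python
def exit_code(safety_failures):
    """Return 1 if any SAFETY invariant failed, else 0.

    Pending PROGRESS rows are deliberately not an input here.
    """
    return 1 if safety_failures else 0


def summarize(total_safety, safety_failures, pending_count):
    """Build a two-part summary in the style of the snapshot output."""
    ok = total_safety - len(safety_failures)
    lines = [f"SAFETY: {ok}/{total_safety} ok  |  PROGRESS pending: {pending_count}"]
    if safety_failures:
        lines.append("SAFETY FAILURES (release blockers): " + ", ".join(safety_failures))
    return "\n".join(lines)


failures = ["B3-diez-overaccept"]
print(summarize(12, failures, 5))
print("exit status:", exit_code(failures))
# In a script entry point this would typically end with:
#     raise SystemExit(exit_code(failures))
```

Keeping pending rows out of the exit status is what lets the gate run in CI while gated issues are still open.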

Current snapshot (latest main)

SAFETY: 11/12 ok  |  PROGRESS pending: 5
SAFETY FAILURES (release blockers): B3-diez-overaccept

ACCEPT labels for A2/A3/B2 were grounded by fetching the live CrossRef/OpenAlex abstracts and confirming the value appears verbatim — no label asserts support absent from a fetched abstract.
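A grounding check of this kind could look roughly like the sketch below. It assumes the abstract text has already been fetched; the function name `appears_verbatim`, the whitespace normalization, and the sample abstract strings are invented for illustration and are not taken from the actual labeling process.

```python
import re


def normalize_ws(text):
    """Collapse runs of whitespace so line wraps in an abstract do not matter."""
    return " ".join(text.split())


def appears_verbatim(value, abstract):
    """True if `value` occurs in `abstract` as a standalone token sequence.

    The lookbehind rejects a preceding digit, letter, or dot so that
    "1.7 eV" does not match inside "11.7 eV"; the lookahead rejects a
    trailing word character.
    """
    pattern = r"(?<![\w.])" + re.escape(normalize_ws(value)) + r"(?!\w)"
    return re.search(pattern, normalize_ws(abstract)) is not None


# Invented abstract snippets, used only to exercise the check.
print(appears_verbatim("1.7 eV", "A band gap of 1.7\n eV was measured."))
print(appears_verbatim("1.7 eV", "A band gap of 11.7 eV was measured."))
```

A plain substring test would accept the second snippet, which is why the sketch anchors the match on both sides.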

Related: #10, #11, #13, #14.

🤖 Generated with Claude Code

moonweave and others added 3 commits June 8, 2026 22:25
evals.json covers skill-level LLM behavior; there was no deterministic,
machine-checkable regression for the check-file engine itself. This adds a
labeled corpus + runner so unit/source/matcher changes can be gated without
an LLM in the loop.

Why: a usability test on a materials-science reference set surfaced that the
tool accepted 0/8 real claims. Root causes split into a matcher/units layer
(#7/#8 landed; #10/#11/#14 + an up-to comparator gap open) and an abstract
reachability layer (#12/#13). This corpus pins both as invariants vs. progress.

- cli_regression.jsonl: 12 rows spanning the publisher matrix
  (crossref / openalex-only / none) and claim types (numeric value+unit,
  relational, fabricated, dead DOI), each labeled with expected_verdict,
  must_accept / must_not_accept invariants, and gated_on issues.
- run_cli_regression.py: stdlib runner. SAFETY invariants (no fabricated /
  relational / unreachable / over-accepting claim ever ACCEPTs; the one clean
  supported claim stays ACCEPT) exit non-zero; gated rows report as PENDING.
- cli_regression.md: documents the two invariant classes and how to run.

Current snapshot flags B3 (>220 evidence accepts exact-220 claim, #11) as the
one live SAFETY failure; ACCEPT labels for A2/A3/B2 were grounded against live
CrossRef/OpenAlex abstracts.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Re-ran the gate after #11/#13/#14 merged to main (commits 329cde9, a41e92f):
- A3 (1.7 eV) flipped PARTIAL -> ACCEPT (#14 condition handling)
- B3 over-acceptance flipped ACCEPT -> PARTIAL (#11)
- A2 now reachable via OpenAlex (#13) but still PARTIAL (residual trailing
  scope qualifier) -> remains PENDING
SAFETY is now 12/12 (gate green).

Runner: control rows (must_not_accept) now PASS on any non-ACCEPT verdict
instead of pinning a single one, since UNVERIFIABLE vs PARTIAL can vary with
Semantic-Scholar availability (e.g. D1 Amiour). The invariant is "never ACCEPT".
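A minimal sketch of that relaxed check, assuming boolean must_accept / must_not_accept fields on each row and a string verdict from the engine; `row_passes` and the sample row are hypothetical and not the runner's actual code:

```python
def row_passes(row, actual_verdict):
    """Evaluate one row against the verdict the engine produced.

    Control rows (must_not_accept) pass on ANY verdict other than ACCEPT,
    so UNVERIFIABLE vs PARTIAL drift does not fail the gate.
    """
    if row.get("must_not_accept"):
        return actual_verdict != "ACCEPT"
    if row.get("must_accept"):
        return actual_verdict == "ACCEPT"
    # Rows without an invariant fall back to the pinned label.
    return actual_verdict == row["expected_verdict"]


# Hypothetical control row; only the id prefix echoes the example above.
control = {"id": "D1", "expected_verdict": "UNVERIFIABLE", "must_not_accept": True}
for verdict in ("UNVERIFIABLE", "PARTIAL", "ACCEPT"):
    print(verdict, row_passes(control, verdict))
```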

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@moonweave moonweave merged commit d625700 into main Jun 8, 2026
2 checks passed
