evals: discrimination checks for --fixture (negative fixtures)#404

Draft
Alemusica wants to merge 1 commit into vercel-labs:main from Alemusica:evals-discrimination-fixtures

@Alemusica

Implements #403.

What

Adds an optional negativeFixtures field to EvalCase: near-miss sources the case's gates MUST reject. In --fixture mode each negative runs through the same local validation path as the golden; the case fails if any negative is not rejected by at least one gate — surfacing cases whose gates are too weak to discriminate a wrong answer.
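As a rough illustration of the rule described above, here is a minimal sketch of how a fixture-mode check might treat negatives. Only `EvalCase` and `negativeFixtures` are names from this PR; `Gate`, `GateResult`, `NegativeFixture`, `goldenSource`, `checkFixtures`, and `requiresText` are hypothetical stand-ins, and the real harness's shapes may differ.

```typescript
// Hypothetical shapes for illustration; only `EvalCase` and `negativeFixtures`
// are names taken from the PR description.
interface GateResult { gate: string; passed: boolean }
type Gate = (source: string) => GateResult;

interface NegativeFixture { name: string; source: string }

interface EvalCase {
  id: string;
  goldenSource: string;
  gates: Gate[];
  negativeFixtures?: NegativeFixture[]; // optional: cases without negatives still work
}

// Returns failure messages; an empty array means the case passes.
function checkFixtures(evalCase: EvalCase): string[] {
  const failures: string[] = [];

  // The golden must be accepted by every gate.
  for (const gate of evalCase.gates) {
    const result = gate(evalCase.goldenSource);
    if (!result.passed) failures.push(`golden rejected by ${result.gate}`);
  }

  // Each negative must be rejected by at least one gate.
  for (const negative of evalCase.negativeFixtures ?? []) {
    const rejected = evalCase.gates.some((gate) => !gate(negative.source).passed);
    if (!rejected) failures.push(`negative "${negative.name}" not rejected by any gate`);
  }
  return failures;
}

// Toy gate standing in for a source-pattern check (assumed, not the real gate).
const requiresText = (text: string): Gate => (source) => ({
  gate: `requires "${text}"`,
  passed: source.includes(text),
});

const example: EvalCase = {
  id: "toy-case",
  goldenSource: "check(fib(10) == 55)",
  gates: [requiresText("fib(10) == 55")],
  negativeFixtures: [{ name: "drops-verification", source: "print(fib(10))" }],
};
```

With the gate shown, `checkFixtures(example)` returns no failures. Weakening the gate to `requiresText("fib")` lets the negative through, and the case then fails with a "not rejected by any gate" message.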

Three base-case negatives

  • hello-world: drops the trailing \n → caught by the stdout/run comparison.
  • fibonacci: drops the fib(10) == 55 verification term (the run still prints the same line) → caught by requiredSourcePatterns, proving the per-value asserts are load-bearing.
  • scale-multi-command-cli: multiply returns x - y while help and the other commands stay correct → caught by the multiply 6 7 → 42 runCheck, proving that the happy path alone does not satisfy multi-command coverage.
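The third negative can be pictured with a toy model. The sketch below treats a CLI as a plain function from arguments to stdout, which is an assumption made for illustration (the real harness runs actual programs). `Cli`, `RunCheck`, `failedChecks`, the `add` command, and the exact output format are all hypothetical; only `multiply 6 7 → 42` and the `x - y` bug come from the description.

```typescript
// Hypothetical model: a CLI is a function from argv to stdout.
type Cli = (args: string[]) => string;
interface RunCheck { args: string[]; expectedStdout: string }

const HELP = "commands: help, add, multiply\n";

// Stand-in for the golden program.
const goldenCli: Cli = (args) => {
  const [cmd, a, b] = args;
  switch (cmd) {
    case "help": return HELP;
    case "add": return `${Number(a) + Number(b)}\n`;
    case "multiply": return `${Number(a) * Number(b)}\n`;
    default: return "unknown command\n";
  }
};

// Stand-in for the negative: only `multiply` is wrong (x - y instead of x * y).
const negativeCli: Cli = (args) =>
  args[0] === "multiply"
    ? `${Number(args[1]) - Number(args[2])}\n`
    : goldenCli(args);

// Returns the checks whose expected stdout the CLI did not produce.
function failedChecks(cli: Cli, checks: RunCheck[]): RunCheck[] {
  return checks.filter((check) => cli(check.args) !== check.expectedStdout);
}

const happyPathOnly: RunCheck[] = [{ args: ["help"], expectedStdout: HELP }];
const withMultiply: RunCheck[] = [
  ...happyPathOnly,
  { args: ["multiply", "6", "7"], expectedStdout: "42\n" },
];
```

In this model `failedChecks(negativeCli, happyPathOnly)` is empty, so a happy-path-only case would accept the wrong program, while `withMultiply` rejects it and still accepts `goldenCli`.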

Verified locally

All three goldens still pass and each negative is rejected by its expected gate. A self-test (injecting a valid source as a negative) correctly fails the case with "not rejected by any gate", confirming the check is not a no-op.
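The self-test idea can be sketched as follows, under a deliberately simplified model where a gate just returns true when it accepts a source. `ToyCase`, `Checker`, `realChecker`, `noOpChecker`, and `checkerDetectsValidNegative` are hypothetical names, not part of the harness.

```typescript
// Hypothetical minimal model: a gate returns true when it accepts a source.
type Gate = (source: string) => boolean;
interface ToyCase { golden: string; gates: Gate[]; negatives: string[] }
type Checker = (c: ToyCase) => string[];

// Flags every negative that all gates accept.
const realChecker: Checker = (c) =>
  c.negatives
    .map((negative, index) => ({
      index,
      accepted: c.gates.every((gate) => gate(negative)),
    }))
    .filter((result) => result.accepted)
    .map((result) => `negative #${result.index} not rejected by any gate`);

// A broken checker that never reports anything.
const noOpChecker: Checker = () => [];

// Self-test: inject the golden (a valid source) as a negative. A working
// checker must report it; a no-op checker will not. Assumes the golden
// itself is accepted by every gate.
function checkerDetectsValidNegative(checker: Checker, c: ToyCase): boolean {
  const poisoned: ToyCase = { ...c, negatives: [...c.negatives, c.golden] };
  return checker(poisoned).some((message) =>
    message.includes("not rejected by any gate"),
  );
}

const toyCase: ToyCase = {
  golden: "prints 55",
  gates: [(source) => source.includes("55")],
  negatives: ["prints 54"],
};
```

Here `checkerDetectsValidNegative(realChecker, toyCase)` is true and the same call with `noOpChecker` is false, which is the distinction the self-test is meant to expose.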

Direction (not in this PR)

These negatives are hand-written. A natural extension is to generate near-misses by mutating the golden with a few rules (mutation-testing flavor), so each case proves its own discrimination automatically. Happy to discuss whether that's wanted before building it.
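To make the proposed direction concrete, here is one possible shape for rule-based near-miss generation. It works on plain text and knows nothing about the language being evaluated; `MutationRule`, `swapOperator`, `dropLine`, and `generateNearMisses` are hypothetical names, and a real implementation would probably need rules that are aware of syntax.

```typescript
// Hypothetical sketch: a rule maps a golden source to zero or more mutants.
interface MutationRule { name: string; apply: (source: string) => string[] }
interface NearMiss { rule: string; source: string }

// One mutant per occurrence of `from`, replacing only that occurrence.
const swapOperator = (from: string, to: string): MutationRule => ({
  name: `swap ${from} -> ${to}`,
  apply: (source) => {
    const mutants: string[] = [];
    let index = source.indexOf(from);
    while (index !== -1) {
      mutants.push(source.slice(0, index) + to + source.slice(index + from.length));
      index = source.indexOf(from, index + from.length);
    }
    return mutants;
  },
});

// One mutant per line, with that line removed.
const dropLine: MutationRule = {
  name: "drop one line",
  apply: (source) => {
    const lines = source.split("\n");
    return lines.map((_, dropped) =>
      lines.filter((_, index) => index !== dropped).join("\n"),
    );
  },
};

// Collects distinct mutants that differ from the golden.
function generateNearMisses(golden: string, rules: MutationRule[]): NearMiss[] {
  const seen = new Set<string>([golden]);
  const nearMisses: NearMiss[] = [];
  for (const rule of rules) {
    for (const mutant of rule.apply(golden)) {
      if (seen.has(mutant)) continue; // skip duplicates and no-op mutations
      seen.add(mutant);
      nearMisses.push({ rule: rule.name, source: mutant });
    }
  }
  return nearMisses;
}
```

For example, `generateNearMisses("return x * y", [swapOperator("*", "-")])` yields a single mutant, `return x - y`, which mirrors the hand-written multiply negative. Mutants that happen to be equivalent to the golden would need to be filtered out or reviewed, since a correct gate cannot reject them.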

Scope: TS harness + eval methodology only; no compiler changes.

Add an optional negativeFixtures field to EvalCase: near-miss sources each
case must reject. In --fixture mode every negative runs through the same local
validation as the golden; the case fails if any negative is not rejected by at
least one gate, surfacing non-discriminating evals (gates too weak to reject a
wrong answer). Three base-case negatives: hello-world (stdout), fibonacci
(requiredSourcePatterns), scale-multi-command-cli (runCheck). TS harness only,
no compiler changes.

Implements vercel-labs#403.
@vercel

vercel Bot commented Jun 10, 2026

@alelab is attempting to deploy a commit to the Vercel Labs Team on Vercel.

A member of the Team first needs to authorize it.

2 participants