evals: discrimination checks for --fixture (negative fixtures) #404
Draft
Alemusica wants to merge 1 commit into
Add an optional negativeFixtures field to EvalCase: near-miss sources each case must reject. In --fixture mode every negative runs through the same local validation as the golden; the case fails if any negative is not rejected by at least one gate, surfacing non-discriminating evals (gates too weak to reject a wrong answer). Three base-case negatives: hello-world (stdout), fibonacci (requiredSourcePatterns), scale-multi-command-cli (runCheck). TS harness only, no compiler changes. Implements vercel-labs#403.
@alelab is attempting to deploy a commit to the Vercel Labs Team on Vercel. A member of the Team first needs to authorize it.
Implements #403.
What
Adds an optional `negativeFixtures` field to `EvalCase`: near-miss sources the case's gates MUST reject. In `--fixture` mode each negative runs through the same local validation path as the golden; the case fails if any negative is not rejected by at least one gate, surfacing cases whose gates are too weak to discriminate a wrong answer.
Three base-case negatives
- hello-world: … `\n` → caught by the stdout/run comparison.
- fibonacci: … `fib(10) == 55` verification term (the run still prints the same line) → caught by `requiredSourcePatterns`, proving the per-value asserts are load-bearing.
- scale-multi-command-cli: `multiply` returns `x - y` with `help` and the other commands correct → caught by the `multiply 6 7 → 42` `runCheck`, proving multi-command coverage isn't met by the happy path.
Verified locally
All three goldens still pass and each negative is rejected by its expected gate. A self-test (injecting a valid source as a negative) correctly fails the case with "not rejected by any gate", confirming the check is not a no-op.
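The check described above can be sketched roughly as follows. This is a minimal illustration, not the harness code: the `Gate`, `NegativeFixture`, and `EvalCase` shapes, the `checkDiscrimination` helper, and the placeholder source strings are all assumptions made for the example, and real gates would compile and run the source rather than inspect a string.

```typescript
// Hypothetical shapes for illustration; the real EvalCase in the harness may differ.
interface NegativeFixture {
  name: string;
  source: string;
}

// A gate returns a rejection reason, or null when the source passes.
type Gate = { name: string; check: (source: string) => string | null };

interface EvalCase {
  name: string;
  golden: string;
  gates: Gate[];
  negativeFixtures?: NegativeFixture[];
}

// Names of the gates that reject the given source.
function rejectingGates(source: string, gates: Gate[]): string[] {
  return gates.filter((g) => g.check(source) !== null).map((g) => g.name);
}

// Returns failure messages; an empty array means the case passes.
function checkDiscrimination(c: EvalCase): string[] {
  const failures: string[] = [];
  const goldenRejections = rejectingGates(c.golden, c.gates);
  if (goldenRejections.length > 0) {
    failures.push(`golden rejected by: ${goldenRejections.join(", ")}`);
  }
  for (const neg of c.negativeFixtures ?? []) {
    // A negative must be rejected by at least one gate.
    if (rejectingGates(neg.source, c.gates).length === 0) {
      failures.push(`negative "${neg.name}" not rejected by any gate`);
    }
  }
  return failures;
}

// Stand-in for a requiredSourcePatterns gate (assumed behaviour: substring match).
const requiresFib10: Gate = {
  name: "requiredSourcePatterns",
  check: (src) =>
    src.indexOf("fib(10) == 55") !== -1 ? null : "missing fib(10) == 55",
};

// Placeholder sources, not real fixture contents.
const golden = "assert fib(10) == 55; print(fib(10))";
const fibCase: EvalCase = {
  name: "fibonacci",
  golden,
  gates: [requiresFib10],
  negativeFixtures: [{ name: "drops-assert", source: "print(fib(10))" }],
};

// Normal run: the golden passes and the negative is rejected.
const ok = checkDiscrimination(fibCase);

// Self-test: a valid source injected as a negative must fail the case.
const selfTest = checkDiscrimination({
  ...fibCase,
  negativeFixtures: [{ name: "valid-as-negative", source: golden }],
});

console.log(ok, selfTest);
```

Returning a list of messages rather than a boolean keeps the "which negative slipped through" detail available to the reporter, which is the information a case author needs to strengthen a gate.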
Direction (not in this PR)
These negatives are hand-written. A natural extension is to generate near-misses by mutating the golden with a few rules (mutation-testing flavor), so each case proves its own discrimination automatically. Happy to discuss whether that's wanted before building it.
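To make the discussion concrete, one possible shape for such a generator is sketched below. Everything here is hypothetical (the `MutationRule` type, the two rules, `replaceFirst`, `generateNegatives`, and the sample golden string); it only illustrates the idea of deriving near-misses from the golden by simple textual rules.

```typescript
// A mutation rule rewrites the golden into a near-miss, or returns null when it does not apply.
type MutationRule = { name: string; apply: (golden: string) => string | null };

// Replace the first occurrence of `from`; null if `from` is absent.
function replaceFirst(s: string, from: string, to: string): string | null {
  const i = s.indexOf(from);
  return i === -1 ? null : s.slice(0, i) + to + s.slice(i + from.length);
}

// Illustrative rules only; a real rule set would be tuned to the source language.
const rules: MutationRule[] = [
  { name: "swap-multiply-for-subtract", apply: (s) => replaceFirst(s, "*", "-") },
  { name: "drop-newline-escape", apply: (s) => replaceFirst(s, "\\n", "") },
];

// Apply every rule once and keep the mutants that actually differ from the golden.
function generateNegatives(
  golden: string,
  rs: MutationRule[],
): { name: string; source: string }[] {
  const out: { name: string; source: string }[] = [];
  for (const r of rs) {
    const mutated = r.apply(golden);
    if (mutated !== null && mutated !== golden) {
      out.push({ name: r.name, source: mutated });
    }
  }
  return out;
}

// Placeholder golden; only the first rule applies to it.
const negatives = generateNegatives("return x * y", rules);
console.log(negatives);
```

One caveat familiar from mutation testing: a generated mutant can still be a correct program, in which case "not rejected by any gate" would be a false alarm, so generated negatives would likely need review or an allowlist.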
Scope: TS harness + eval methodology only; no compiler changes.