fix(eval): exit non-zero when any sheet hard-fails #73
Merged
…s a dishonest gate

A crashed/OOMed sheet contributes ZERO tested cells, so the accuracy-only exit gate (>=85%) never saw it. Observed live on the A-1 canonical eval: the 17-sheet cluster child OOMed its 12GB heap, only the 3 standalone sheets were scored, and the harness printed an overall accuracy of 99.9% and exited 0, a confident wrong summary from the canonical harness itself.

Hard failures (status crash/oom/error) now force exit 1; the report still records the surviving sheets' accuracy and the failed sheets' status (honest and visible, not masked).

Regression test kills the cluster child via a 10ms EVAL_CLUSTER_TIMEOUT_MS next to a healthy 100% standalone sheet: red pre-fix (exit 0), green post-fix.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
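
To make the masking concrete, here is a minimal sketch of an accuracy-only gate. It is not the harness's actual code: the result shape (`name`, `status`, `passed`, `tested`), the function names, and the assumption that overall accuracy is pooled over tested cells are all hypothetical, and the numbers are illustrative rather than taken from the A-1 run.

```javascript
// Hypothetical per-sheet result shape: { name, status, passed, tested }.
// Assumption: overall accuracy is pooled over tested cells, so a sheet with
// tested = 0 adds nothing to the numerator or the denominator.
function overallAccuracy(sheets) {
  const passed = sheets.reduce((sum, s) => sum + s.passed, 0);
  const tested = sheets.reduce((sum, s) => sum + s.tested, 0);
  return tested === 0 ? 0 : passed / tested;
}

// Accuracy-only gate, modelled on the pre-fix behaviour described above.
function accuracyOnlyExitCode(sheets, threshold = 0.85) {
  return overallAccuracy(sheets) >= threshold ? 0 : 1;
}

// Illustrative numbers only, not taken from the real eval run.
const sheets = [
  { name: "standalone-a", status: "ok", passed: 1000, tested: 1000 },
  { name: "cluster", status: "oom", passed: 0, tested: 0 },
];

console.log(overallAccuracy(sheets));      // 1
console.log(accuracyOnlyExitCode(sheets)); // 0
```

Because the dead cluster has zero tested cells, removing it from the list would not change either printed value, which is why a gate that only looks at accuracy cannot see it.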
What
Observed live on tonight's A-1 canonical eval: the 17-sheet cluster child OOMed its 12GB heap, only the 3 standalone sheets were scored, and the harness printed Overall accuracy: 99.9% and exited 0, a confident wrong summary from the canonical harness itself. A crashed sheet contributes zero tested cells, so the accuracy-only exit gate (>= 85%) never saw it.

Hard failures (status crash/oom/error) now force exit 1, with a loud summary line. The report still records surviving sheets' accuracy and the failed sheets' status: honest and visible, not masked.

Tests
New test-per-sheet-eval-exit-honesty.mjs: builds a cluster + healthy standalone through the real rust-parser and kills the cluster child via EVAL_CLUSTER_TIMEOUT_MS=10. Negative-controlled via stash: pre-fix the 100% standalone hides the dead cluster (exit 0); post-fix exit 1. 5/5; eval suites green.

🤖 Generated with Claude Code
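
The post-fix rule can be sketched as a pure function over per-sheet results. This is an illustration under stated assumptions, not the merged implementation: `honestExitCode`, `hardFailures`, the result shape, and the `"crash"` status given to the killed cluster are hypothetical. Only the three hard-failure statuses, the 85% threshold, and the exit codes come from the description above.

```javascript
// Statuses the PR description treats as hard failures.
const HARD_FAILURE_STATUSES = new Set(["crash", "oom", "error"]);

// Hypothetical helper: sheets whose status is a hard failure.
function hardFailures(results) {
  return results.filter((r) => HARD_FAILURE_STATUSES.has(r.status));
}

// Hypothetical gate: any hard failure forces exit 1 before accuracy is
// consulted; otherwise fall back to the accuracy threshold.
function honestExitCode(results, threshold = 0.85) {
  if (hardFailures(results).length > 0) return 1;
  const passed = results.reduce((sum, r) => sum + r.passed, 0);
  const tested = results.reduce((sum, r) => sum + r.tested, 0);
  const accuracy = tested === 0 ? 0 : passed / tested;
  return accuracy >= threshold ? 0 : 1;
}

// Shape of the regression scenario: a healthy 100% standalone sheet next to
// a killed cluster. The "crash" status for the killed child is an assumption.
const fixture = [
  { name: "standalone", status: "ok", passed: 50, tested: 50 },
  { name: "cluster", status: "crash", passed: 0, tested: 0 },
];

console.log(hardFailures(fixture).map((r) => r.name)); // [ 'cluster' ]
console.log(honestExitCode(fixture));                  // 1
```

Checking hard failures before accuracy keeps the two concerns separate: the surviving sheets' accuracy can still be reported as measured, while the exit code reflects that part of the run never produced results.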