fix(eval): exit non-zero when any sheet hard-fails #73

Merged
ebootheee merged 1 commit into main from fix/eval-exit-honesty
Jun 10, 2026

@ebootheee
Owner

What

Observed live on tonight's A-1 canonical eval: the 17-sheet cluster child OOMed its 12GB heap, only the 3 standalone sheets were scored, and the harness printed Overall accuracy: 99.9% and exited 0 — a confident wrong summary from the canonical harness itself. A crashed sheet contributes zero tested cells, so the accuracy-only exit gate (>= 85%) never saw it.
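To make the masking concrete, here is a minimal sketch of an accuracy-only gate. The record shape (`name`, `status`, `tested`, `matched`), the function names, and the cell counts are all hypothetical: they are chosen only to reproduce a 99.9% summary and are not taken from the harness or from the actual run.

```javascript
// Hypothetical per-sheet results. Field names and counts are illustrative
// assumptions, not the harness's real schema or the real A-1 numbers.
const sheets = [
  { name: "cluster-17", status: "oom", tested: 0, matched: 0 },
  { name: "standalone-a", status: "ok", tested: 400, matched: 400 },
  { name: "standalone-b", status: "ok", tested: 350, matched: 350 },
  { name: "standalone-c", status: "ok", tested: 250, matched: 249 },
];

// Accuracy is matched cells over tested cells, summed across all sheets.
// A sheet that crashed adds 0 to both sums, so it cannot lower the ratio.
function overallAccuracy(results) {
  const tested = results.reduce((n, s) => n + s.tested, 0);
  const matched = results.reduce((n, s) => n + s.matched, 0);
  return tested === 0 ? 0 : matched / tested;
}

// Sketch of the pre-fix behaviour: the exit code depends on accuracy alone.
function accuracyOnlyExitCode(results, threshold = 0.85) {
  return overallAccuracy(results) >= threshold ? 0 : 1;
}

const accuracy = overallAccuracy(sheets);
console.log(`Overall accuracy: ${(accuracy * 100).toFixed(1)}%`);
console.log(`exit code: ${accuracyOnlyExitCode(sheets)}`);
```

With these made-up counts the OOMed cluster is invisible to the gate: 999 of 1000 tested cells match, so the summary reads 99.9% and the exit code is 0.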

Hard failures (status crash/oom/error) now force exit 1, with a loud summary line. The report still records surviving sheets' accuracy and the failed sheets' status — honest and visible, not masked.
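The rule can be sketched roughly as below. This is an illustration under assumptions, not the code merged in this PR: `HARD_FAIL_STATUSES`, `evalExitCode`, the record fields, and the wording of the summary line are hypothetical. Only the three status values and the 85% threshold come from the description above.

```javascript
// Statuses treated as hard failures, as named in the PR description.
const HARD_FAIL_STATUSES = new Set(["crash", "oom", "error"]);

// Hypothetical gate: any hard-failed sheet forces exit 1, regardless of
// how accurate the surviving sheets are. Otherwise fall back to accuracy.
function evalExitCode(results, threshold = 0.85) {
  const hardFailed = results.filter((s) => HARD_FAIL_STATUSES.has(s.status));
  const tested = results.reduce((n, s) => n + s.tested, 0);
  const matched = results.reduce((n, s) => n + s.matched, 0);
  const accuracy = tested === 0 ? 0 : matched / tested;

  if (hardFailed.length > 0) {
    // Loud summary line (wording is an assumption) naming each failed sheet.
    const names = hardFailed.map((s) => `${s.name} (${s.status})`).join(", ");
    console.error(`HARD FAILURE: ${hardFailed.length} sheet(s) did not complete: ${names}`);
    return 1;
  }
  return accuracy >= threshold ? 0 : 1;
}

// Usage: a dead cluster next to a perfect standalone sheet now fails the run.
const code = evalExitCode([
  { name: "cluster", status: "oom", tested: 0, matched: 0 },
  { name: "standalone", status: "ok", tested: 100, matched: 100 },
]);
console.log(`exit code: ${code}`);
```

Checking hard failures before the accuracy threshold keeps the two concerns separate: the surviving sheets' accuracy can still be reported as-is, but it can no longer turn a partially crashed run into a passing one.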

Tests

New test-per-sheet-eval-exit-honesty.mjs: builds a cluster + healthy standalone through the real rust-parser and kills the cluster child via EVAL_CLUSTER_TIMEOUT_MS=10. Negative-controlled via stash — pre-fix the 100% standalone hides the dead cluster (exit 0); post-fix exit 1. 5/5; eval suites green.

🤖 Generated with Claude Code

…s a dishonest gate

A crashed/OOMed sheet contributes ZERO tested cells, so the accuracy-only exit
gate (>=85%) never saw it. Observed live on the A-1 canonical eval: the
17-sheet cluster child OOMed its 12GB heap, only the 3 standalone sheets were
scored, and the harness printed an overall accuracy of 99.9% and exited 0 — a
confident wrong summary from the canonical harness itself.

Hard failures (status crash/oom/error) now force exit 1; the report still
records the surviving sheets' accuracy and the failed sheets' status (honest
and visible, not masked). Regression test kills the cluster child via a 10ms
EVAL_CLUSTER_TIMEOUT_MS next to a healthy 100% standalone sheet — red pre-fix
(exit 0), green post-fix.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@ebootheee ebootheee merged commit 04f7ea3 into main Jun 10, 2026
2 checks passed