fix(eval): slim per-sheet-eval cluster child to the engine's exact surface #69
Merged
…t surface

The canonical cluster child OOMed a 16GB heap on the real 17-sheet returns cluster. Three legs, all removed by mirroring the SHIPPED engine's emitted cluster loop (chunked_emitter.rs) decision-for-decision:

- parse the GT seed DIRECTLY into ctx.values (the parse-then-copy pattern held two full copies of a multi-GB object); the orchestrator now points children at the build's own _ground-truth.json instead of re-serializing a copy, and frees its grouped-by-sheet GT once samples are on disk
- drop the ~4.7M-fresh-string _written Set and per-cell prevSnapshot object in favor of the engine's sampled surface: every numeric cell on the cluster's OWN sheets + a bounded strided safety net, with a parallel _before baseline array
- NaN-fill on non-convergence now covers the engine's exact surface (numeric cells on member sheets) instead of the written set; the old surface was subtly non-lockstep (it diffed/NaN-filled writes onto NON-member sheets and skipped never-written member-sheet input cells)

Also:

- a cluster member is NEVER size-skipped (MAX_SHEET_SIZE_MB now applies to standalone sheets only): the 150MB default silently dropped the monster sheets from the cluster, converging a partial cluster to a silently wrong fixed point; regression test goes red on the old code (clustersTotal=0)
- cluster tasks get their own timeout (EVAL_CLUSTER_TIMEOUT_MS, default 60min; a real warm-seeded cluster is ~2-3 passes at ~10-15min/pass)

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
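The sampled surface and NaN-fill described in the commit message can be sketched roughly as follows. This is an illustration, not the code from chunked_emitter.rs or the cluster child: the function names, the `Sheet!Cell` key format, the `maxStrided` bound, and the choice to draw the safety net from numeric cells on non-member sheets are all assumptions.

```javascript
// Hypothetical helper: assumes cell keys look like "Sheet!A1".
function sheetOf(key) {
  return key.slice(0, key.indexOf('!'));
}

// Build the sampled surface: every numeric cell on the cluster's own
// sheets, plus at most ~maxStrided evenly strided numeric cells from
// the remaining sheets as a bounded safety net (assumed semantics).
function buildSurface(values, memberSheets, maxStrided = 1000) {
  const surface = [];
  const others = [];
  for (const key of Object.keys(values)) {
    if (typeof values[key] !== 'number') continue;
    (memberSheets.has(sheetOf(key)) ? surface : others).push(key);
  }
  if (maxStrided > 0 && others.length > 0) {
    const stride = Math.max(1, Math.ceil(others.length / maxStrided));
    for (let i = 0; i < others.length; i += stride) surface.push(others[i]);
  }
  return surface;
}

// Parallel baseline array: one float per surface key, no per-cell
// snapshot object and no set of freshly allocated strings.
function snapshot(values, surface) {
  const before = new Float64Array(surface.length);
  for (let i = 0; i < surface.length; i++) before[i] = values[surface[i]];
  return before;
}

// Largest absolute change on the surface since the baseline was taken.
function maxDelta(values, surface, before) {
  let max = 0;
  for (let i = 0; i < surface.length; i++) {
    const d = Math.abs(values[surface[i]] - before[i]);
    if (d > max) max = d;
  }
  return max;
}

// On non-convergence, NaN-fill numeric cells on member sheets only.
function nanFillMembers(values, memberSheets) {
  for (const key of Object.keys(values)) {
    if (typeof values[key] === 'number' && memberSheets.has(sheetOf(key))) {
      values[key] = NaN;
    }
  }
}
```

A convergence loop under these assumptions would call `snapshot` before each pass, run the pass, and compare `maxDelta` against a tolerance, falling back to `nanFillMembers` when the pass budget is exhausted.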
This was referenced Jun 10, 2026
What
Closes the canonical-harness OOM (the cluster child OOMed a 16GB heap / 61min on the real 17-sheet A-1 returns cluster) by making the child mirror the shipped engine's emitted cluster loop decision-for-decision — which is simultaneously the memory fix and a lockstep-fidelity fix.
- Parse the GT seed directly into ctx.values; the orchestrator points children at the build's own _ground-truth.json (no re-serialized copy) and frees the grouped-by-sheet GT after sampling.
- Replace the _written Set (~4.7M fresh strings) + per-cell prevSnapshot object with the engine's sampled surface: every numeric cell on the cluster's OWN sheets + bounded strided safety net, parallel _before baseline array, verbatim from chunked_emitter.rs.
- MAX_SHEET_SIZE_MB (default 150) silently dropped the 200MB+ monster sheets from the cluster task, so the cluster converged without them: a silently wrong fixed point scored as the model. The cap now applies to standalone sheets only; oversized members are included loudly.
- Cluster tasks get their own EVAL_CLUSTER_TIMEOUT_MS (default 60min) so the cluster task isn't killed by the 5min per-sheet default mid-pass.

Tests

- test-per-sheet-eval-cluster-size.mjs: negative-controlled. On pre-fix code it fails exactly as predicted (clustersTotal=0, members silently dropped); green post-fix.
- npm test green.

🤖 Generated with Claude Code
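For readers skimming the size-cap and timeout changes, the intended task-planning rule might look something like the sketch below. Only MAX_SHEET_SIZE_MB, EVAL_CLUSTER_TIMEOUT_MS, and the 150MB / 60min / 5min defaults come from this PR; `planTasks`, the `sizeMb` field, the task shape, and the warning text are hypothetical.

```javascript
const DEFAULT_MAX_SHEET_SIZE_MB = 150;
const DEFAULT_SHEET_TIMEOUT_MS = 5 * 60 * 1000;    // 5min per-sheet default
const DEFAULT_CLUSTER_TIMEOUT_MS = 60 * 60 * 1000; // 60min cluster default

// sheets: { [name]: { sizeMb } }, clusters: array of arrays of sheet names.
// env: a plain object standing in for process.env (hypothetical wiring).
function planTasks(sheets, clusters, env = {}) {
  const maxMb = Number(env.MAX_SHEET_SIZE_MB ?? DEFAULT_MAX_SHEET_SIZE_MB);
  const clusterTimeoutMs = Number(
    env.EVAL_CLUSTER_TIMEOUT_MS ?? DEFAULT_CLUSTER_TIMEOUT_MS
  );
  const inCluster = new Set(clusters.flat());
  const tasks = [];

  // Cluster members are never size-skipped: dropping one would let the
  // rest converge to a fixed point that is not the model's.
  for (const members of clusters) {
    for (const name of members) {
      if (sheets[name].sizeMb > maxMb) {
        console.warn(`cluster member ${name} exceeds ${maxMb}MB; including anyway`);
      }
    }
    tasks.push({ kind: 'cluster', sheets: [...members], timeoutMs: clusterTimeoutMs });
  }

  // The size cap applies to standalone sheets only.
  for (const [name, sheet] of Object.entries(sheets)) {
    if (inCluster.has(name)) continue;
    if (sheet.sizeMb > maxMb) continue;
    tasks.push({ kind: 'sheet', sheets: [name], timeoutMs: DEFAULT_SHEET_TIMEOUT_MS });
  }
  return tasks;
}
```

Under these assumptions, a regression check in the spirit of test-per-sheet-eval-cluster-size.mjs would assert that an oversized cluster member still appears in the cluster task while an oversized standalone sheet does not get a task.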