fix(eval): slim per-sheet-eval cluster child to the engine's exact surface #69

Merged
ebootheee merged 1 commit into main from fix/per-sheet-eval-slim
Jun 10, 2026

@ebootheee

Owner

What

Closes the canonical-harness OOM: the cluster child OOMed (16GB heap, 61min) on the real 17-sheet A-1 returns cluster. The fix makes the child mirror the shipped engine's emitted cluster loop decision-for-decision, which is simultaneously the memory fix and a lockstep-fidelity fix.

  • Single GT copy: child parses the seed directly into ctx.values; orchestrator points children at the build's own _ground-truth.json (no re-serialized copy) and frees the grouped-by-sheet GT after sampling.
  • Engine's sampled surface replaces the _written Set (~4.7M fresh strings) and the per-cell prevSnapshot object. The surface is every numeric cell on the cluster's OWN sheets plus a bounded strided safety net, with a parallel _before baseline array, all taken verbatim from chunked_emitter.rs.
  • NaN-fill scope = engine's exact surface. The old written-set surface diffed/NaN-filled writes onto non-member sheets (engine doesn't) and skipped never-written member-sheet input cells (engine NaN-fills them).
  • A cluster member is never size-skipped. MAX_SHEET_SIZE_MB (default 150) silently dropped the 200MB+ monster sheets from the cluster task, so the cluster converged without them — a silently wrong fixed point scored as the model. Cap now applies to standalone sheets only; oversized members are included loudly.
  • EVAL_CLUSTER_TIMEOUT_MS (default 60min) so the cluster task isn't killed by the 5min per-sheet default mid-pass.
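
The sampled-surface idea in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR or from chunked_emitter.rs: the function names (`buildSurface`, `snapshot`, `maxDelta`), the `"Sheet!A1"` key format, the use of a `Map` for the cell values, and the choice to stride the safety net over non-member numeric cells are all assumptions made for the example.

```javascript
// Hypothetical sketch: all names and the "Sheet!A1" key format are assumptions.
// values: Map<string, unknown> holding every cell of the workbook.

// Collect the surface once: every numeric cell on member sheets, plus a
// bounded, strided sample of the remaining numeric cells (the safety net).
function buildSurface(values, memberSheets, safetyNetCap = 1000) {
  const members = new Set(memberSheets);
  const keys = [];
  const offSheet = [];
  for (const [key, v] of values) {
    if (typeof v !== 'number') continue;
    const sheet = key.slice(0, key.lastIndexOf('!'));
    if (members.has(sheet)) keys.push(key);
    else offSheet.push(key);
  }
  // Pick at most safetyNetCap off-sheet cells, evenly spaced.
  const stride = Math.max(1, Math.ceil(offSheet.length / safetyNetCap));
  for (let i = 0; i < offSheet.length; i += stride) keys.push(offSheet[i]);
  return keys;
}

// Record the pre-pass baseline in a typed array parallel to `keys`,
// reusing the same buffer on every pass instead of allocating an object.
function snapshot(values, keys, before = new Float64Array(keys.length)) {
  for (let i = 0; i < keys.length; i++) {
    const v = values.get(keys[i]);
    before[i] = typeof v === 'number' ? v : NaN;
  }
  return before;
}

// Largest absolute change across the surface since the baseline.
// NaN differences never compare greater, so they are ignored here.
function maxDelta(values, keys, before) {
  let max = 0;
  for (let i = 0; i < keys.length; i++) {
    const d = Math.abs(values.get(keys[i]) - before[i]);
    if (d > max) max = d;
  }
  return max;
}
```

In this sketch the per-pass bookkeeping is one key array built up front and one `Float64Array`, so no per-write strings or per-cell snapshot objects are created during a pass.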

Tests

  • New test-per-sheet-eval-cluster-size.mjs — negative-controlled: on pre-fix code it fails exactly as predicted (clustersTotal=0, members silently dropped); green post-fix.
  • Existing convergence-semantics suite green: per-sheet-eval (10), intracycle (4), lockstep (6), transient-div0 (15), divergent-cap (4).
  • Full npm test green.

🤖 Generated with Claude Code

…t surface

The canonical cluster child OOMed a 16GB heap on the real 17-sheet returns
cluster. Three legs, all removed by mirroring the SHIPPED engine's emitted
cluster loop (chunked_emitter.rs) decision-for-decision:

- parse the GT seed DIRECTLY into ctx.values (the parse-then-copy pattern held
  two full copies of a multi-GB object); the orchestrator now points children
  at the build's own _ground-truth.json instead of re-serializing a copy, and
  frees its grouped-by-sheet GT once samples are on disk
- drop the ~4.7M-fresh-string _written Set and per-cell prevSnapshot object in
  favor of the engine's sampled surface: every numeric cell on the cluster's
  OWN sheets + a bounded strided safety net, with a parallel _before baseline
  array
- NaN-fill on non-convergence now covers the engine's exact surface (numeric
  cells on member sheets) instead of the written set — the old surface was
  subtly non-lockstep (it diffed/NaN-filled writes onto NON-member sheets and
  skipped never-written member-sheet input cells)
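
To make the scope change in the last item concrete, here is a small sketch of NaN-filling over "numeric cells on member sheets" rather than over a written set. The function name `nanFillMemberSheets`, the `Map` of cell values, and the `"Sheet!A1"` key format are assumptions for illustration, not the harness's actual API.

```javascript
// Hypothetical sketch: names and key format are assumptions, not from the PR.
// On non-convergence, overwrite every numeric cell on a member sheet with NaN.
// Non-member sheets and non-numeric cells are left untouched.
function nanFillMemberSheets(values, memberSheets) {
  const members = new Set(memberSheets);
  let filled = 0;
  for (const [key, v] of values) {
    if (typeof v !== 'number') continue;
    const sheet = key.slice(0, key.lastIndexOf('!'));
    if (!members.has(sheet)) continue;
    values.set(key, NaN); // updating an existing key is safe during Map iteration
    filled++;
  }
  return filled;
}
```

Under these assumptions, a member-sheet input cell that was never written during the loop is still NaN-filled, and a write that landed on a non-member sheet is not, which matches the two differences the commit message calls out.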

Also:
- a cluster member is NEVER size-skipped (MAX_SHEET_SIZE_MB now applies to
  standalone sheets only): the 150MB default silently dropped the monster
  sheets from the cluster, converging a partial cluster to a silently wrong
  fixed point — regression test goes red on the old code (clustersTotal=0)
- cluster tasks get their own timeout (EVAL_CLUSTER_TIMEOUT_MS, default 60min;
  a real warm-seeded cluster is ~2-3 passes at ~10-15min/pass)
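
The two rules above could be expressed along these lines. This is a sketch under stated assumptions: `planTasks`, the `{ name, sizeMb }` sheet shape, and the returned structure are hypothetical, and only the env variable names and the 150MB, 5min and 60min defaults come from the text.

```javascript
// Hypothetical sketch: planTasks and its data shapes are assumptions.
// Only the env names and default values are taken from the PR text.
const DEFAULT_MAX_SHEET_SIZE_MB = 150;
const DEFAULT_SHEET_TIMEOUT_MS = 5 * 60 * 1000;
const DEFAULT_CLUSTER_TIMEOUT_MS = 60 * 60 * 1000;

function planTasks(sheets, clusterMembers, env = {}) {
  const capMb = Number(env.MAX_SHEET_SIZE_MB ?? DEFAULT_MAX_SHEET_SIZE_MB);
  const clusterTimeoutMs = Number(env.EVAL_CLUSTER_TIMEOUT_MS ?? DEFAULT_CLUSTER_TIMEOUT_MS);
  const members = new Set(clusterMembers);
  const cluster = { sheets: [], timeoutMs: clusterTimeoutMs };
  const standalone = [];
  const skipped = [];
  for (const { name, sizeMb } of sheets) {
    if (members.has(name)) {
      // A cluster member is never size-skipped; oversized members are included loudly.
      if (sizeMb > capMb) {
        console.warn(`cluster member ${name} is ${sizeMb}MB (> ${capMb}MB cap); including it anyway`);
      }
      cluster.sheets.push(name);
    } else if (sizeMb > capMb) {
      skipped.push(name); // the size cap applies to standalone sheets only
    } else {
      standalone.push({ name, timeoutMs: DEFAULT_SHEET_TIMEOUT_MS });
    }
  }
  return { cluster, standalone, skipped };
}
```

A call such as `planTasks(sheets, members, process.env)` would pick up overrides from the environment; passing the environment as a parameter keeps the function easy to test.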

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>