
fix(eval): cluster-child per-pass telemetry + full stderr crash capture #74

Merged
ebootheee merged 1 commit into main from fix/eval-cluster-telemetry
Jun 10, 2026
Conversation

@ebootheee

Owner

Two canonical A-1 cluster runs died silently at ~56 min, one under a 12GB heap and one under a 20GB heap, and the V8 GC dump was truncated to 200 chars. The child now logs pass/delta/heap for every pass (to stderr and to _cluster-progress.log next to the report, live-tailable), and the orchestrator keeps a 2500-char stderr tail and classifies V8 heap-dump signatures as OOM. Eval suites are green (10+6+5+6). The diagnosis run is next.
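
For readers who want a concrete picture of the orchestrator side, here is a minimal sketch of a bounded stderr tail plus OOM classification. It is not the code merged in this PR: `TailBuffer`, `classifyCrash`, `OOM_SIGNATURES`, and `TAIL_LIMIT` are hypothetical names, and the signature list is an assumption based on strings Node/V8 commonly print on heap exhaustion.

```typescript
// Hypothetical sketch; names and the signature list are assumptions,
// not the PR's actual implementation.
const TAIL_LIMIT = 2500; // tail size named in the PR description

/** Keeps only the last `limit` characters of everything appended. */
class TailBuffer {
  private text = "";
  private readonly limit: number;

  constructor(limit: number) {
    this.limit = limit;
  }

  /** Append a chunk, then drop anything older than the last `limit` chars. */
  append(chunk: string): void {
    const joined = this.text + chunk;
    this.text =
      joined.length > this.limit
        ? joined.slice(joined.length - this.limit)
        : joined;
  }

  toString(): string {
    return this.text;
  }

  /** Last `n` non-empty lines, e.g. for printing on crash. */
  lastLines(n: number): string[] {
    const lines = this.text.split("\n").filter((l) => l.length > 0);
    return lines.slice(Math.max(0, lines.length - n));
  }
}

// Assumed signatures: text V8 typically emits when the heap is exhausted.
const OOM_SIGNATURES: RegExp[] = [
  /JavaScript heap out of memory/,
  /FATAL ERROR: .*Allocation failed/,
  /<--- Last few GCs --->/,
];

type CrashKind = "oom" | "unknown";

/** Classify a crash from the retained stderr tail. */
function classifyCrash(stderrTail: string): CrashKind {
  return OOM_SIGNATURES.some((re) => re.test(stderrTail)) ? "oom" : "unknown";
}
```

In an orchestrator, a `new TailBuffer(TAIL_LIMIT)` could be fed from the child's stderr `data` events, with `classifyCrash(tail.toString())` called when the child exits non-zero. Trimming on every append keeps memory bounded no matter how much the child writes.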

🤖 Generated with Claude Code

The canonical A-1 cluster child died at ~56 min under BOTH a 12GB and a 20GB
heap — cap-independent death time — and the orchestrator kept only 200 chars
of err.message, discarding the V8 GC dump that says which heap space exhausted
and every trace of how far the loop got. Two hour-long runs were undiagnosable.

- the child logs every pass to stderr AND _cluster-progress.log next to the
  report (pass time, maxDelta, non-finite cell, heapUsed/rss, sample size) —
  live-tailable, survives tmp cleanup
- on crash the orchestrator keeps a 2500-char stderr tail (classifying V8
  heap-dump signatures as OOM) and prints the last lines; on success it
  surfaces the per-pass telemetry in the orchestrator log
- a 12-min/pass real-model cluster is never a black box again
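
The child-side logging described in the first bullet could look roughly like the sketch below. It is illustrative only: `PassTelemetry`, `formatPassLine`, and `logPass` are hypothetical names, and the line format is an assumption. Only the logged fields and the `_cluster-progress.log` file name come from this PR.

```typescript
import { appendFileSync } from "node:fs";
import { dirname, join } from "node:path";

// Hypothetical shape for one pass; field names are assumptions.
interface PassTelemetry {
  pass: number;
  passMs: number;               // wall-clock time of the pass
  maxDelta: number;             // largest change observed in the pass
  nonFiniteCell: string | null; // first NaN/Infinity cell, if any
  sampleSize: number;
}

interface MemorySnapshot {
  heapUsed: number; // bytes
  rss: number;      // bytes
}

const toMB = (bytes: number): string => (bytes / (1024 * 1024)).toFixed(0);

/** Render one pass as a single key=value line (assumed format). */
function formatPassLine(t: PassTelemetry, mem: MemorySnapshot): string {
  return [
    `pass=${t.pass}`,
    `ms=${t.passMs}`,
    `maxDelta=${t.maxDelta}`,
    `nonFinite=${t.nonFiniteCell ?? "none"}`,
    `heapUsedMB=${toMB(mem.heapUsed)}`,
    `rssMB=${toMB(mem.rss)}`,
    `sample=${t.sampleSize}`,
  ].join(" ");
}

/** Write the pass line to stderr and append it next to the report. */
function logPass(t: PassTelemetry, reportPath: string): void {
  const line = formatPassLine(t, process.memoryUsage());
  process.stderr.write(`[cluster] ${line}\n`);
  // Synchronous append: the line is handed to the OS before the next
  // (possibly fatal) pass starts.
  appendFileSync(
    join(dirname(reportPath), "_cluster-progress.log"),
    line + "\n",
  );
}
```

Calling `logPass` at the end of each loop iteration means that a later abort still leaves every completed pass in the log, so the heap trajectory up to the crash can be read back from the file.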

Diagnosis itself is the next step (probe converges in 2 passes under 6GB on a
strided sample; the engine-mirrored full surface dies — trajectory unknown
until an instrumented run lands).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
ebootheee merged commit 451acfb into main on Jun 10, 2026
2 checks passed