Harden CI eval: container crash detection, empty STEP guard, extended timeout #218

Merged
PunchTheDev merged 6 commits into main from punch/ci-eval-hardening on Jun 3, 2026

@PunchTheDev (Owner)

Summary

Three fixes from a scale readiness audit against the 100+ miner launch scenario:

  • Container crash detection (scripts/run_eval_pool.py): OOM kills, segfaults, and other non-zero container exits that produce no output now surface as, for example, Container exited 137 with the stderr tail, instead of the misleading Invalid JSON output: message. This separates hard crashes from containers that ran to completion but emitted malformed output.

  • 0-byte STEP guard (scripts/record_submissions.py): The eval pre-creates .forge_step_{spec_id}.step as an empty file before docker run so the container can write to it. If the container crashes, the file stays at 0 bytes. Previously this empty blob was base64-encoded and stored in SQLite, which set has_step=true for a submission with no geometry and broke the 3D viewer. STEP files under 200 bytes are now skipped.

  • Score-round timeout 90→150 min (.github/workflows/score.yml): 15 specs × ~180 s each (45 min) + Docker overhead ≈ 50 min per round. A 90-minute limit leaves too little margin for slower specs, especially on loaded GitHub runners. eval.yml and hidden-eval (3 specs each) remain at 90 min.
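
The crash-versus-bad-output split in the first fix can be sketched roughly as follows. This is a minimal illustration, not the code in scripts/run_eval_pool.py: the function name `classify_container_result`, the result dictionary shape, and the 500-character stderr tail are assumptions. Only the message formats (Container exited N | stderr: ... and Invalid JSON output:) and the "non-zero return code with no output" condition come from this PR.

```python
import json

# Hypothetical limit on how much stderr to echo back to the miner.
STDERR_TAIL_CHARS = 500


def classify_container_result(returncode: int, stdout: str, stderr: str) -> dict:
    """Separate a hard container crash from a run that produced bad JSON."""
    # Hard crash: non-zero exit and nothing on stdout (e.g. 137 = 128 + SIGKILL,
    # which is what an OOM kill typically looks like).
    if returncode != 0 and not stdout.strip():
        message = f"Container exited {returncode}"
        tail = stderr.strip()[-STDERR_TAIL_CHARS:]
        if tail:
            message += f" | stderr: {tail}"
        return {"ok": False, "error": message}

    # The container produced output; it may still be malformed.
    try:
        return {"ok": True, "result": json.loads(stdout)}
    except json.JSONDecodeError:
        return {"ok": False, "error": f"Invalid JSON output: {stdout[:200]}"}
```

Under these assumptions, a container killed with exit code 137 and `Killed` on stderr would report `Container exited 137 | stderr: Killed`, while a container that exits 0 with unparsable stdout still gets the Invalid JSON output: message.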

Test plan

  • CI eval with a crashing agent shows Container exited N | stderr: ... in the PR comment
  • CI eval with a passing agent but container crash mid-STEP-write does not set has_step=true
  • score.yml full-round job has timeout-minutes: 150 in the workflow file
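
The second test-plan item depends on the size check in the STEP guard. A minimal sketch of that check is shown below, assuming a helper named `load_step_b64` that returns None when there is no usable geometry. The helper name and return convention are hypothetical; the 200-byte threshold and the base64 encoding are taken from this PR.

```python
import base64
from pathlib import Path
from typing import Optional

# Threshold from this PR: anything smaller is treated as "no geometry".
MIN_STEP_BYTES = 200


def load_step_b64(step_path: Path) -> Optional[str]:
    """Return the base64-encoded STEP file, or None if it is missing or too small."""
    if not step_path.is_file():
        return None
    data = step_path.read_bytes()
    # The file is pre-created empty before docker run, so a crashed container
    # leaves a 0-byte file behind. Skip it rather than store an empty blob.
    if len(data) < MIN_STEP_BYTES:
        return None
    return base64.b64encode(data).decode("ascii")
```

A caller would then set has_step only when the helper returns a value, so a crashed run never marks a submission as having geometry.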

🤖 Generated with Claude Code

Punch and others added 6 commits June 3, 2026 13:14
Three independent fixes identified in scale readiness audit:

1. run_eval_pool.py: distinguish container crash (returncode != 0, no output)
   from bad JSON (container ran but output is garbage). Previously both
   showed "Invalid JSON output" — crash now shows "Container exited 137"
   with stderr tail, making OOM kills and segfaults debuggable by miners.

2. record_submissions.py: skip STEP files smaller than 200 bytes.
   The file is pre-created as 0 bytes before docker run so the container
   can write to it; if the container crashes mid-run the file stays empty.
   Storing an empty BLOB sets has_step=true for a submission with no
   geometry, breaking the 3D viewer for that entry.

3. score.yml: increase score-round timeout-minutes from 90 → 150.
   15 specs × ~180s each + Docker overhead ≈ 50 min per round; 90 min
   was dangerously close to the limit for slower specs under high load.
   eval.yml and hidden-eval remain at 90 min (3 specs each — sufficient).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
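
The timeout arithmetic in item 3 can be checked with a short calculation. The 15 specs and ~180 s per spec come from the commit message. The 5-minute Docker overhead is an assumption, inferred only as the difference between the stated ≈ 50 min total and the 45 min of spec time. The function names are illustrative.

```python
SPECS_PER_ROUND = 15       # from the commit message
SECONDS_PER_SPEC = 180     # approximate per-spec time from the commit message
DOCKER_OVERHEAD_MIN = 5    # assumption: 50 min total minus 45 min of spec time


def estimated_round_minutes(specs: int, seconds_per_spec: float, overhead_min: float) -> float:
    """Nominal wall-clock minutes for one scoring round."""
    return specs * seconds_per_spec / 60 + overhead_min


def headroom_ratio(timeout_min: float, estimate_min: float) -> float:
    """How many times the nominal estimate fits inside the timeout."""
    return timeout_min / estimate_min


estimate = estimated_round_minutes(SPECS_PER_ROUND, SECONDS_PER_SPEC, DOCKER_OVERHEAD_MIN)
for timeout in (90, 150):
    print(f"timeout={timeout} min -> {headroom_ratio(timeout, estimate):.1f}x the nominal estimate")
```

With these inputs the nominal round is 50 minutes, so the old limit allowed 1.8x the estimate and the new one allows 3.0x. The nominal figure assumes every spec takes about 180 s; slower specs or a loaded runner shrink that margin.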
@PunchTheDev merged commit 96c2320 into main on Jun 3, 2026
1 check passed