Harden CI eval: container crash detection, empty STEP guard, extended timeout by PunchTheDev · Pull Request #218 · PunchTheDev/forge

PunchTheDev · 2026-06-03T13:42:59Z

Summary

Three fixes from a scale readiness audit against the 100+ miner launch scenario:

Container crash detection (scripts/run_eval_pool.py): OOM kills, segfaults, and other non-zero Docker exits now surface as Container exited 137 with a stderr tail instead of the misleading Invalid JSON output: message. This distinguishes hard crashes from runs that completed but produced bad output.
0-byte STEP guard (scripts/record_submissions.py): The eval pre-creates .forge_step_{spec_id}.step as an empty file before docker run so the container can write to it. If the container crashes, the file stays at 0 bytes. Previously this empty blob was base64-encoded and stored in SQLite, setting has_step=true for a submission with no geometry and breaking the 3D viewer. The script now skips STEP files under 200 bytes.
Score-round timeout 90→150 min (.github/workflows/score.yml): 15 specs × ~180s each + Docker overhead ≈ 50 min per round. A 90-minute limit leaves too little margin, especially on loaded GitHub runners. eval.yml and hidden-eval (3 specs each) remain at 90 min.
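
The crash-versus-bad-output distinction in the first fix can be sketched as follows. This is a minimal illustration, not the code in scripts/run_eval_pool.py: the function name classify_container_result, the returned dict shape, and the tail_lines parameter are assumptions made for the example. Only the two message prefixes come from the PR text.

```python
import json
import subprocess


def classify_container_result(proc: subprocess.CompletedProcess, tail_lines: int = 5) -> dict:
    """Hypothetical helper: separate hard container crashes from bad output."""
    stdout = (proc.stdout or "").strip()
    stderr = (proc.stderr or "").strip()

    # Non-zero exit with nothing on stdout: treat as a crash
    # (for example 137 = 128 + SIGKILL, which is what an OOM kill produces).
    if proc.returncode != 0 and not stdout:
        tail = " / ".join(stderr.splitlines()[-tail_lines:])
        return {"ok": False, "error": f"Container exited {proc.returncode} | stderr: {tail}"}

    # The container produced some output; check that it is valid JSON.
    try:
        return {"ok": True, "result": json.loads(stdout)}
    except json.JSONDecodeError as exc:
        return {"ok": False, "error": f"Invalid JSON output: {exc.msg}"}


# Simulated OOM kill: exit code 137, empty stdout, one line on stderr.
crashed = subprocess.CompletedProcess(args=["docker", "run"], returncode=137, stdout="", stderr="Killed")
print(classify_container_result(crashed)["error"])  # Container exited 137 | stderr: Killed
```

Checking the return code before attempting to parse is what keeps a killed container from being reported as a JSON problem.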

Test plan

CI eval with a crashing agent shows Container exited N | stderr: ... in the PR comment
CI eval with an otherwise passing agent whose container crashes mid-STEP-write does not set has_step=true
score.yml full-round job has timeout-minutes: 150 in the workflow file
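
The has_step check in the second test-plan item depends on a size guard along these lines. This is a sketch under assumptions: load_step_blob and MIN_STEP_BYTES are hypothetical names, and the actual logic in scripts/record_submissions.py may be structured differently. The 200-byte threshold and the base64 encoding are taken from the PR description.

```python
import base64
from pathlib import Path
from typing import Optional

MIN_STEP_BYTES = 200  # threshold stated in the PR description


def load_step_blob(step_path: Path) -> Optional[str]:
    """Hypothetical helper: base64-encode a STEP file, or return None if it is unusable."""
    # A missing file means the container never produced geometry.
    if not step_path.is_file():
        return None
    data = step_path.read_bytes()
    # The file is pre-created empty before docker run, so a crash leaves 0 bytes.
    if len(data) < MIN_STEP_BYTES:
        return None
    return base64.b64encode(data).decode("ascii")
```

A caller would then derive has_step from whether the returned value is None, so an empty placeholder file never marks a submission as having geometry.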

🤖 Generated with Claude Code

Three independent fixes identified in scale readiness audit:

1. run_eval_pool.py: distinguish container crash (returncode != 0, no output) from bad JSON (container ran but output is garbage). Previously both showed "Invalid JSON output"; a crash now shows "Container exited 137" with stderr tail, making OOM kills and segfaults debuggable by miners.

2. record_submissions.py: skip STEP files smaller than 200 bytes. The file is pre-created as 0 bytes before docker run so the container can write to it; if the container crashes mid-run the file stays empty. Storing an empty BLOB sets has_step=true for a submission with no geometry, breaking the 3D viewer for that entry.

3. score.yml: increase score-round timeout-minutes from 90 → 150. 15 specs × ~180s each + Docker overhead ≈ 50 min per round; 90 min was dangerously close to the limit for slower specs under high load. eval.yml and hidden-eval remain at 90 min (3 specs each, which is sufficient).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
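
The round-time estimate behind the timeout change can be reproduced with a few lines of arithmetic. The spec count and per-spec time are the PR's figures. The 5-minute Docker overhead is an assumption inferred from the stated total of roughly 50 minutes, and round_minutes is a hypothetical helper name.

```python
def round_minutes(specs: int, seconds_per_spec: float, overhead_minutes: float) -> float:
    """Estimated wall-clock minutes for one scoring round."""
    return specs * seconds_per_spec / 60 + overhead_minutes


# 15 specs at ~180s each come from the PR; the 5-minute overhead is assumed.
estimate = round_minutes(specs=15, seconds_per_spec=180, overhead_minutes=5)
print(estimate)  # 50.0

# Ratio of each timeout to the nominal estimate.
for limit in (90, 150):
    print(limit, round(limit / estimate, 1))
```

The per-spec time is itself approximate, so slower specs or loaded runners reduce the margin that these ratios suggest.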

Punch and others added 6 commits June 3, 2026 13:14

Clamp baseline geometry to build volume, cover load point Z

29aab0e

Leave 2mm margin from build volume boundary in baseline

df509db

Fix model IDs in CONTRIBUTING, add llm-agent to examples list

78b6193

Serialize score.yml to prevent concurrent DB-write overload

ca1a7c6

Add session-2 changelog entries

97b274c

PunchTheDev merged commit 96c2320 into main Jun 3, 2026
1 check passed

PunchTheDev merged 6 commits into main from punch/ci-eval-hardening
