CI reliability + LLM-only doc fixes + threat model corrections by PunchTheDev · Pull Request #213 · PunchTheDev/forge

PunchTheDev · 2026-06-03T05:20:08Z

Summary

Critical bug fix — false-positive optimization label:

eval.yml: When FORGE_API_URL was temporarily down during a PR eval, the SOTA fetches failed silently and every spec appeared "unclaimed". That set anyBeatsSota = true, so every passing agent incorrectly received the optimization label (2× Gittensor multiplier). The fix tracks apiReachable separately: a 404 means the spec is genuinely unclaimed, while a fetch failure shows "— (API unreachable)" and does not set anyBeatsSota. The forge-api remains authoritative for eligibility at submission record time, but the label now reflects only what CI actually knows.

CI reliability (readiness audit for 100+ miners):

eval.yml: Add a concurrency group keyed on the PR branch, with cancel-in-progress: true. Prevents CI queue pile-up when a miner pushes multiple commits rapidly.
scripts/record_submissions.py: 3-attempt exponential backoff (1s → 2s → 4s) for submission POST.
scripts/run_hidden_eval.py: Same retry backoff via shared _api_request() helper.
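The retry behaviour described above can be sketched with the standard library alone. This is a minimal illustration, not the code in scripts/: the function name, parameters, timeout, and the set of exceptions treated as retryable are all assumptions. The source does not say whether a wait follows the final failed attempt, so this sketch skips it.

```python
import time
import urllib.error
import urllib.request


def api_request(url, data=None, method="GET", attempts=3, base_delay=1.0,
                opener=urllib.request.urlopen, sleep=time.sleep):
    """Send a request, retrying with exponential backoff (hypothetical helper).

    Delays double after each failure: base_delay, 2*base_delay, ...
    `opener` and `sleep` are injectable so the logic can be tested offline.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            request = urllib.request.Request(url, data=data, method=method)
            with opener(request, timeout=10) as response:
                return response.read()
        except (urllib.error.URLError, TimeoutError) as exc:
            last_error = exc
            # Assumption: no wait after the last attempt, since nothing follows it.
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    # Raised only after every attempt has failed, so the caller sees one error.
    raise RuntimeError(f"{method} {url} failed after {attempts} attempts") from last_error
```

Injecting `opener` and `sleep` keeps the helper free of third-party dependencies while still letting a test simulate a brief outage without real network calls or real waits.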

Critical doc fix (onboarding blocker):

README.md, QUICKSTART.md, agents/template/agent.py: All three still showed generate(spec) as a valid alternative, even though PR #212 (Require LLM agent contract; reject static agents) made generate(spec, llm) mandatory. New miners following the scaffold would write a static agent and get a cryptic rejection at eval time. This change removes all static-agent references; the template now imports LLMClient and uses the correct two-parameter signature with an inline usage example.
Removes the dead link to examples/deterministic-agent/ (deleted in PR #212).
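For readers unfamiliar with the contract, a two-parameter agent might look roughly like the sketch below. The `LLMClient` protocol here is a stand-in defined for illustration; the project's real LLMClient, its method names, and the type of `spec` are not shown in this PR, so `complete()` and the dict-shaped spec are assumptions.

```python
from typing import Protocol


class LLMClient(Protocol):
    """Stand-in for the project's LLMClient; the real interface may differ."""

    def complete(self, prompt: str) -> str:
        """Hypothetical method: send a prompt, return the model's text."""
        ...


def generate(spec: dict, llm: LLMClient) -> str:
    """Agent entry point using the two-parameter form required since PR #212."""
    # Build a prompt from the spec and delegate the design work to the LLM.
    prompt = f"Design a part for this spec: {spec}"
    return llm.complete(prompt)
```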

Threat model corrections:

Corrects "3× determinism check" → "2× on first spec only"
Documents known gap: stochastic agents varying only on later specs may slip through
Adds Threat 9: Specialist gaming (old avg_rank exploit, mitigated by overall_score)
Adds Determinism check coverage row to summary table
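The coverage gap in the corrected determinism claim is easy to see in a small sketch. The function below is an illustration of "2× on first spec only", not the harness code: `run_agent` is a hypothetical callable that maps a spec to an output, and comparing outputs with `!=` is an assumption.

```python
import itertools


def determinism_check(run_agent, specs):
    """Run the first spec twice and compare; run every other spec once.

    Returns (passed, outputs). Only specs[0] is actually checked for
    determinism, which is the known gap documented in the threat model.
    """
    first = run_agent(specs[0])
    second = run_agent(specs[0])
    if first != second:
        return False, []
    outputs = [first] + [run_agent(spec) for spec in specs[1:]]
    return True, outputs


# An agent that is stable on spec 0 but varies on later specs.
_counter = itertools.count()


def flaky_on_later_specs(spec):
    if spec == "spec-0":
        return "stable-output"
    return f"output-{next(_counter)}"


passed, _ = determinism_check(flaky_on_later_specs, ["spec-0", "spec-1", "spec-2"])
print(passed)  # the stochastic agent slips through the single-run check
```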

Test plan

API-unreachable path: spec key missing from sotaBySpec → "— (API unreachable)" shown, no optimization label
Unclaimed path: 404 from API → spec set to undefined in sotaBySpec → "⭐ unclaimed" shown, anyBeatsSota=true
Retry logic is stdlib-only, no new deps
Template agent TypeScript equivalent: imports LLMClient correctly

🤖 Generated with Claude Code

…gaming

- Correct '3× determinism check' → '2× on first spec only' (spec 0 runs twice; remaining specs run once to keep CI time manageable)
- Document the known gap: stochastic agents that vary only on later specs may slip through the single-run check
- Add Threat 9: Specialist gaming. The old avg_rank model allowed a miner entering 3 easy specs at rank 1 to beat a well-rounded agent averaging rank 1.5 across all 45. Mitigated by switching the overall leaderboard to breadth-normalized overall_score (unentered specs count as baseline 1.0)
- Add 'Determinism check coverage' row to summary table

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
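A small worked example may help show why avg_rank rewarded specialists. The commit does not give the exact overall_score formula, so the version below is an assumption: a plain mean over all specs of a normalized score where 1.0 is the baseline and lower is better, with unentered specs filled in at 1.0. The per-spec scores (0.80 and 0.85) are invented for illustration.

```python
def avg_rank(ranks):
    """Old model: mean rank over entered specs only (lower is better)."""
    return sum(ranks.values()) / len(ranks)


def overall_score(scores, all_specs, baseline=1.0):
    """Assumed breadth-normalized score: unentered specs count as baseline."""
    return sum(scores.get(spec, baseline) for spec in all_specs) / len(all_specs)


all_specs = [f"spec-{i}" for i in range(45)]

# Specialist: enters 3 easy specs and takes rank 1 on each.
specialist_ranks = {spec: 1 for spec in all_specs[:3]}
# Generalist: enters all 45 specs, averaging roughly rank 1.5.
generalist_ranks = {spec: 1 + (i % 2) for i, spec in enumerate(all_specs)}

# Illustrative normalized scores (assumption: lower is better, baseline 1.0).
specialist_scores = {spec: 0.80 for spec in all_specs[:3]}
generalist_scores = {spec: 0.85 for spec in all_specs}

# Under avg_rank the specialist wins despite entering only 3 of 45 specs.
print(avg_rank(specialist_ranks) < avg_rank(generalist_ranks))
# Under the assumed overall_score the generalist wins.
print(overall_score(generalist_scores, all_specs)
      < overall_score(specialist_scores, all_specs))
```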

eval.yml: Add concurrency group keyed on branch name with cancel-in-progress=true. When a miner pushes multiple commits rapidly to the same PR branch, only the latest push runs eval, which prevents CI queue pile-up from spurious duplicate runs.

record_submissions.py: Replace single-attempt POST with 3-attempt exponential backoff (1s → 2s → 4s). Transient forge-api blips (restart, brief 503) no longer silently drop leaderboard submissions. All three attempts must fail before the error is logged as non-blocking.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

run_hidden_eval.py GET and POST were single-attempt with no retry. Consolidate into _api_request() with 3-attempt exponential backoff (1s → 2s → 4s). Raises RuntimeError only after all attempts fail, which propagates cleanly to the CI step rather than leaving the hidden eval silently unrecorded.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…-only

PR #212 made generate(spec, llm) mandatory but left three places still showing generate(spec) as a supported alternative:

- README.md: Remove static-agent code block; clarify that agents without the llm parameter are rejected at eval time
- QUICKSTART.md: Remove static-agent section; remove reference to deleted examples/deterministic-agent/; update reference list to current examples
- agents/template/agent.py: Change scaffold to use generate(spec, llm) signature with LLMClient import; add LLM usage example in TODO comments; remove misleading "Two supported signatures" docstring

Without this fix, a miner following the scaffold + docs would write a static agent, run CI, and get a cryptic rejection error.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

When FORGE_API_URL is temporarily down during a PR eval, the SOTA fetch silently returns nothing and all specs appear 'unclaimed' → anyBeatsSota becomes true → every passing agent gets the optimization label.

Fix: track apiReachable separately. 404 responses from the API mark specs as genuinely unclaimed; no response (fetch failed, timeout, etc.) leaves sotaBySpec without that key. In the table-building loop, if the spec's key is missing from sotaBySpec AND the API was unreachable, show '— (API unreachable)' instead of '⭐ unclaimed' and don't set anyBeatsSota. This prevents false-positive optimization labels during API downtime.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
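The distinction between "unclaimed" and "unknown" comes down to a three-way lookup, sketched below in Python for illustration (the actual logic lives in eval.yml). The function name, return shape, and the text shown when a SOTA exists are assumptions; the two quoted status strings and the key-missing-versus-None convention follow the description above. The raw score comparison is the pre-eligibility behaviour.

```python
def sota_cell(spec_id, score, sota_by_spec, api_reachable, minimize=True):
    """Return (cell_text, beats_sota) for one row of the PR comment table.

    Convention taken from the commit message:
      - key present with value None  -> API answered 404, spec is unclaimed
      - key missing and API down     -> nothing is known, so no label
    """
    if spec_id not in sota_by_spec and not api_reachable:
        return "— (API unreachable)", False

    sota = sota_by_spec.get(spec_id)
    if sota is None:
        return "⭐ unclaimed", True

    # Raw comparison; direction depends on whether the spec minimizes or maximizes.
    beats = score < sota if minimize else score > sota
    text = f"beats {sota}" if beats else f"{sota}"  # display text is an assumption
    return text, beats


# anyBeatsSota is then the OR over all rows.
rows = [
    sota_cell("spec-a", 1.0, {}, api_reachable=False),
    sota_cell("spec-b", 1.0, {"spec-b": None}, api_reachable=True),
]
any_beats_sota = any(beats for _, beats in rows)
```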

Previously the PR comment compared raw scores against SOTA and applied the optimization label based on whether score < sota (minimize) or score > sota (maximize). This ignored the time-decay marginal-gain rule (1.0% for 0-7 days, 0.5% for 7-30 days, 0.1% for 30-90 days). A submission improving SOTA by 0.001g on day 3 would incorrectly receive the optimization label (2× Gittensor multiplier).

Fix: after fetching the current SOTA score, also call GET /sota/{spec_id}/eligibility?score={eval_score} for each spec. Use the eligibility.eligible field (which applies the marginal-gain rule) to set anyBeatsSota and the label:

- eligible=true → '✓ beats X (margin ok)' → optimization label
- beats raw score but margin too small → '⚠ beats X — margin too small'
- doesn't beat SOTA → current score shown, no label
- API unreachable → '— (API unreachable)', no label

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
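To make the marginal-gain rule concrete, here is a rough local model of it. The real check is performed server-side by the eligibility endpoint; this sketch only mirrors the thresholds quoted above. Several details are assumptions: that the day count is the age of the current SOTA, that gain is measured relative to the SOTA score, which bucket the boundary days (7 and 30) fall into, and that SOTA is non-zero. The source does not state what applies beyond 90 days, so the sketch refuses to guess.

```python
def required_margin(sota_age_days):
    """Minimum relative improvement needed, per the quoted time-decay rule."""
    if sota_age_days < 7:
        return 0.010   # 1.0% for 0-7 days
    if sota_age_days < 30:
        return 0.005   # 0.5% for 7-30 days
    if sota_age_days < 90:
        return 0.001   # 0.1% for 30-90 days
    raise ValueError("margin beyond 90 days is not specified in the source")


def is_eligible(score, sota, sota_age_days, minimize=True):
    """True when the relative gain over SOTA meets the required margin."""
    improvement = (sota - score) if minimize else (score - sota)
    gain = improvement / abs(sota)  # assumption: sota != 0
    return gain >= required_margin(sota_age_days)


# A 0.001 g improvement on a 100 g SOTA on day 3 beats the raw score
# but falls far short of the 1.0% margin.
print(is_eligible(99.999, 100.0, sota_age_days=3))
# A 1.1% improvement on the same day clears the margin.
print(is_eligible(98.9, 100.0, sota_age_days=3))
```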

The harness (since PR #212) rejects agents with a one-param generate(spec) signature. The baseline was using the old single-param form and would fail eval if anyone tried to run it through CI. Updated to generate(spec, llm: LLMClient) to match the required contract.

The baseline does not call the LLM (it is deterministic geometry), so llm is accepted but unused (noqa: ARG001). The docstring clarifies this is permitted: the harness requires the parameter in the signature, not that it must be called.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
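A signature-only check of this kind can be sketched with `inspect`. How the harness actually validates agents is not shown in this PR, so the function below is an assumption: it only looks for an `llm` parameter in the signature, which is why an agent that accepts but never calls it still passes.

```python
import inspect


def has_llm_contract(func):
    """Hypothetical check: does the callable declare an `llm` parameter?"""
    return "llm" in inspect.signature(func).parameters


def baseline_generate(spec, llm):  # noqa: ARG001 (llm accepted but unused)
    """Deterministic baseline: satisfies the signature without calling the LLM."""
    return f"geometry-for-{spec}"


def static_generate(spec):
    """Old one-parameter form, rejected since PR #212."""
    return f"geometry-for-{spec}"


print(has_llm_contract(baseline_generate))
print(has_llm_contract(static_generate))
```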

Punch and others added 3 commits June 3, 2026 05:19

PunchTheDev changed the title ~~Fix threat model: determinism claim + Threat 9 specialist gaming~~ Fix threat model doc + CI reliability improvements Jun 3, 2026

PunchTheDev changed the title ~~Fix threat model doc + CI reliability improvements~~ CI reliability + LLM-only doc fixes + threat model corrections Jun 3, 2026

Punch and others added 3 commits June 3, 2026 05:36

PunchTheDev merged commit b2f686d into main Jun 3, 2026
1 check failed

PunchTheDev merged 7 commits into main from punch/fix-threat-model-doc
