CI reliability + LLM-only doc fixes + threat model corrections #213
Merged
…gaming
- Correct '3× determinism check' → '2× on first spec only' (spec 0 runs twice; remaining specs run once to keep CI time manageable)
- Document the known gap: stochastic agents that vary only on later specs may slip through the single-run check
- Add Threat 9: Specialist gaming. The old avg_rank model allowed a miner entering 3 easy specs at rank 1 to beat a well-rounded agent averaging rank 1.5 across all 45. Mitigated by switching the overall leaderboard to breadth-normalized overall_score (unentered specs count as baseline 1.0)
- Add 'Determinism check coverage' row to summary table
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
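The specialist-gaming arithmetic can be illustrated with a small sketch. The exact overall_score formula is not shown in this PR; the version below assumes it is the mean of per-spec normalized scores over all 45 specs, with lower being better and unentered specs counted as the baseline 1.0. The per-spec values 0.5 and 0.7 are made up for illustration.

```python
TOTAL_SPECS = 45


def avg_rank(ranks):
    """Old model: mean rank over the specs an agent entered (lower is better)."""
    return sum(ranks) / len(ranks)


def overall_score(entered_scores, total_specs=TOTAL_SPECS, baseline=1.0):
    """Assumed breadth-normalized form: every unentered spec counts as `baseline`.

    Hypothetical formula; the real overall_score may differ.
    """
    if len(entered_scores) > total_specs:
        raise ValueError("more entries than specs")
    missing = total_specs - len(entered_scores)
    return (sum(entered_scores) + missing * baseline) / total_specs


# Specialist: 3 easy specs, rank 1 on each. Generalist: all 45, mean rank about 1.5.
specialist_ranks = [1, 1, 1]
generalist_ranks = [1, 2] * 22 + [2]

# Illustrative normalized scores (assumed: 1.0 = baseline, lower = better).
specialist_scores = [0.5, 0.5, 0.5]
generalist_scores = [0.7] * 45

print(avg_rank(specialist_ranks), avg_rank(generalist_ranks))
print(overall_score(specialist_scores), overall_score(generalist_scores))
```

Under avg_rank the specialist comes out ahead, while under the assumed overall_score the 42 unentered specs pull the specialist back toward the baseline and the generalist ranks first.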
eval.yml: Add concurrency group keyed on branch name with cancel-in-progress=true. When a miner pushes multiple commits rapidly to the same PR branch, only the latest push runs eval, which prevents CI queue pile-up from spurious duplicate runs.
record_submissions.py: Replace single-attempt POST with 3-attempt exponential backoff (1s → 2s → 4s). Transient forge-api blips (restart, brief 503) no longer silently drop leaderboard submissions. All three attempts must fail before the error is logged as non-blocking.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
run_hidden_eval.py GET and POST were single-attempt with no retry. Consolidate into _api_request() with 3-attempt exponential backoff (1s → 2s → 4s). Raises RuntimeError only after all attempts fail, which propagates cleanly to the CI step rather than leaving the hidden eval silently unrecorded.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
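A minimal sketch of the retry pattern described in the two commits above. This is not the actual _api_request() helper: the name call_with_backoff, its parameters, and the injectable sleep are hypothetical. The sketch also assumes no wait after the final failed attempt, so only the 1s and 2s delays are used; whether the real helper also waits 4s is not stated here.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_backoff(
    request: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `request` up to `attempts` times, doubling the delay after each failure."""
    last_error: Optional[BaseException] = None
    for attempt in range(attempts):
        try:
            return request()
        except Exception as exc:  # a real helper would likely catch only transient errors
            last_error = exc
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)  # 1s, then 2s
    # Raise only after every attempt has failed, so the CI step sees the failure.
    raise RuntimeError(f"request failed after {attempts} attempts") from last_error


# Usage: a request that fails twice, then succeeds on the third attempt.
_calls = {"n": 0}


def flaky_request() -> str:
    _calls["n"] += 1
    if _calls["n"] < 3:
        raise ConnectionError("simulated 503")
    return "ok"


recorded_delays = []
result = call_with_backoff(flaky_request, sleep=recorded_delays.append)
print(result, recorded_delays)
```

Passing sleep as a parameter keeps the sketch testable without real waiting; a caller that wants the failure to stay non-blocking, as record_submissions.py does, would catch the final RuntimeError and log it.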
…-only
PR #212 made generate(spec, llm) mandatory but left three places still showing generate(spec) as a supported alternative:
- README.md: Remove static-agent code block; clarify that agents without the llm parameter are rejected at eval time
- QUICKSTART.md: Remove static-agent section; remove reference to deleted examples/deterministic-agent/; update reference list to current examples
- agents/template/agent.py: Change scaffold to use generate(spec, llm) signature with LLMClient import; add LLM usage example in TODO comments; remove misleading "Two supported signatures" docstring
Without this fix, a miner following the scaffold + docs would write a static agent, run CI, and get a cryptic rejection error.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When FORGE_API_URL is temporarily down during a PR eval, the SOTA fetch silently returns nothing and all specs appear 'unclaimed' → anyBeatsSota becomes true → every passing agent gets the optimization label.
Fix: track apiReachable separately. 404 responses from the API mark specs as genuinely unclaimed; no response (fetch failed, timeout, etc.) leaves sotaBySpec without that key. In the table-building loop, if the spec's key is missing from sotaBySpec AND the API was unreachable, show '— (API unreachable)' instead of '⭐ unclaimed' and don't set anyBeatsSota. This prevents false-positive optimization labels during API downtime.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
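The distinction between "genuinely unclaimed" and "API unreachable" can be restated as a small Python sketch. The workflow's own code is not shown in this PR, so the names SotaStatus, classify_sota, and table_cell are hypothetical, as are the cell wording for claimed specs and the choice to treat non-404 error statuses as unreachable.

```python
from enum import Enum
from typing import Optional, Tuple


class SotaStatus(Enum):
    CLAIMED = "claimed"
    UNCLAIMED = "unclaimed"
    UNREACHABLE = "unreachable"


def classify_sota(status_code: Optional[int]) -> SotaStatus:
    """Map a SOTA fetch outcome to a status; None means no response at all."""
    if status_code is None:
        return SotaStatus.UNREACHABLE  # fetch failed, timed out, etc.
    if status_code == 404:
        return SotaStatus.UNCLAIMED  # the API answered: nobody holds this spec
    if 200 <= status_code < 300:
        return SotaStatus.CLAIMED
    return SotaStatus.UNREACHABLE  # assumption: 5xx and similar count as downtime


def table_cell(status: SotaStatus, beats_sota: bool = False) -> Tuple[str, bool]:
    """Return (cell text, whether this spec may set anyBeatsSota)."""
    if status is SotaStatus.UNREACHABLE:
        return "— (API unreachable)", False
    if status is SotaStatus.UNCLAIMED:
        return "⭐ unclaimed", True
    # Placeholder wording for claimed specs.
    return ("beats SOTA" if beats_sota else "does not beat SOTA"), beats_sota


for code in (None, 404, 200, 503):
    print(code, table_cell(classify_sota(code)))
```

The key property is that an absent response can never contribute to anyBeatsSota, so downtime alone cannot produce the optimization label.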
Previously the PR comment compared raw scores against SOTA and applied
the optimization label based on whether score < sota (minimize) or
score > sota (maximize). This ignored the time-decay marginal-gain rule
(1.0% for 0-7 days, 0.5% for 7-30 days, 0.1% for 30-90 days). A
submission improving SOTA by 0.001g on day 3 would incorrectly receive
the optimization label (2× Gittensor multiplier).
Fix: after fetching the current SOTA score, also call
GET /sota/{spec_id}/eligibility?score={eval_score} for each spec. Use
the eligibility.eligible field (which applies the marginal-gain rule)
to set anyBeatsSota and the label:
- eligible=true → '✓ beats X (margin ok)' → optimization label
- beats raw score but margin too small → '⚠ beats X — margin too small'
- doesn't beat SOTA → current score shown, no label
- API unreachable → '— (API unreachable)', no label
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
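As an illustration of the time-decay marginal-gain rule quoted in the commit above, here is a minimal sketch. The authoritative check is the eligibility endpoint, not this code. The function names are hypothetical, and several details are assumptions: the margin is taken relative to the current SOTA score, day 7 and day 30 fall into the later bracket, and ages beyond 90 days are rejected because the rule quoted here does not cover them. The 100 g SOTA value is made up.

```python
def required_margin(age_days: float) -> float:
    """Minimum relative improvement over SOTA, by age of the current SOTA."""
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    if age_days < 7:
        return 0.010  # 1.0% for 0-7 days
    if age_days < 30:
        return 0.005  # 0.5% for 7-30 days
    if age_days <= 90:
        return 0.001  # 0.1% for 30-90 days
    raise ValueError("rule for SOTA older than 90 days is not specified here")


def is_eligible(score: float, sota: float, age_days: float, minimize: bool = True) -> bool:
    """True if `score` improves on `sota` by at least the required relative margin."""
    if sota == 0:
        raise ValueError("relative margin is undefined for a SOTA of 0")
    improvement = (sota - score) if minimize else (score - sota)
    return improvement / abs(sota) >= required_margin(age_days)


# Hypothetical SOTA of 100 g on day 3 (minimize objective).
print(is_eligible(99.999, 100.0, age_days=3))  # improves by 0.001 g only
print(is_eligible(98.9, 100.0, age_days=3))    # improves by 1.1 g
```

With these assumptions, the 0.001 g improvement from the commit message beats the raw score but falls far short of the 1.0% margin, which corresponds to the "margin too small" row.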
The harness (since PR #212) rejects agents with a one-param generate(spec) signature. The baseline was using the old single-param form and would fail eval if anyone tried to run it through CI. Updated to generate(spec, llm: LLMClient) to match the required contract. The baseline does not call the LLM (it is deterministic geometry), so llm is accepted but unused (noqa: ARG001). The docstring clarifies this is permitted: the harness requires the parameter in the signature, not that it must be called.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
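A sketch of what the two-parameter contract looks like for a deterministic baseline. The LLMClient protocol below is a stand-in with a hypothetical complete() method, not the project's real class, and accepts_llm is only a guess at how a harness could check the signature. The returned dictionary is placeholder data.

```python
import inspect
from typing import Any, Callable, Dict, Protocol


class LLMClient(Protocol):
    """Hypothetical stand-in for the project's LLMClient."""

    def complete(self, prompt: str) -> str: ...


def generate(spec: Dict[str, Any], llm: LLMClient) -> Dict[str, Any]:  # noqa: ARG001
    """Deterministic baseline: `llm` is accepted to satisfy the contract but never called."""
    return {"spec_id": spec.get("id"), "geometry": "baseline"}


def accepts_llm(fn: Callable[..., Any]) -> bool:
    """Assumed harness-style check: the signature must include an `llm` parameter."""
    return "llm" in inspect.signature(fn).parameters


def static_generate(spec: Dict[str, Any]) -> Dict[str, Any]:
    """Old one-parameter form, which the harness rejects since PR #212."""
    return {"spec_id": spec.get("id")}


print(accepts_llm(generate), accepts_llm(static_generate))
```

A check of this kind inspects only the signature, which is consistent with the note above that the parameter must be present but need not be used.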
Summary
Critical bug fix: false-positive optimization label
- eval.yml: When FORGE_API_URL is temporarily down during a PR eval, all specs appeared "unclaimed" because SOTA fetches silently failed. This caused anyBeatsSota = true, so every passing agent incorrectly got the optimization label (2× Gittensor multiplier). The fix tracks apiReachable separately: 404 = genuinely unclaimed; fetch failure = show "— (API unreachable)" and don't set anyBeatsSota. The forge-api is still authoritative for eligibility at submission record time, but the label now correctly reflects what the CI actually knows.

CI reliability (readiness audit for 100+ miners):
- eval.yml: Add concurrency: group, cancel-in-progress: true keyed on PR branch. Prevents CI queue pile-up when a miner pushes multiple commits rapidly.
- scripts/record_submissions.py: 3-attempt exponential backoff (1s → 2s → 4s) for submission POST.
- scripts/run_hidden_eval.py: Same retry backoff via shared _api_request() helper.

Critical doc fix (onboarding blocker):
- README.md, QUICKSTART.md, agents/template/agent.py: All three showed generate(spec) as a valid alternative even after PR #212 (Require LLM agent contract; reject static agents) made generate(spec, llm) mandatory. New miners following the scaffold would write a static agent and get a cryptic rejection at eval time. Removes all static-agent references; the template now imports LLMClient and uses the correct two-param signature with an inline usage example.
- Removes the reference to examples/deterministic-agent/ (deleted in PR #212).

Threat model corrections:
- Adds Threat 9: Specialist gaming (avg_rank exploit, mitigated by overall_score)

Test plan
🤖 Generated with Claude Code