
CI reliability + LLM-only doc fixes + threat model corrections #213

Merged
PunchTheDev merged 7 commits into main from punch/fix-threat-model-doc
Jun 3, 2026

Conversation

@PunchTheDev
Owner

@PunchTheDev PunchTheDev commented Jun 3, 2026

Summary

Critical bug fix — false-positive optimization label:

  • eval.yml: When FORGE_API_URL is temporarily down during a PR eval, all specs appeared "unclaimed" because SOTA fetches silently failed. This caused anyBeatsSota = true → every passing agent got the optimization label (2× Gittensor multiplier) incorrectly. Fix tracks apiReachable separately: 404 = genuinely unclaimed; fetch failure = show "— (API unreachable)" and don't set anyBeatsSota. The forge-api is still authoritative for eligibility at submission record time, but the label now correctly reflects what the CI actually knows.

CI reliability (readiness audit for 100+ miners):

  • eval.yml: Add concurrency: group, cancel-in-progress: true keyed on PR branch. Prevents CI queue pile-up when a miner pushes multiple commits rapidly.
  • scripts/record_submissions.py: 3-attempt exponential backoff (1s → 2s → 4s) for submission POST.
  • scripts/run_hidden_eval.py: Same retry backoff via shared _api_request() helper.
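A minimal sketch of the retry pattern described above, using only the standard library. The function name `api_request`, its parameters, and the injectable `opener` and `sleep` hooks are illustrative assumptions, not the actual `_api_request()` helper. The source lists delays of 1s → 2s → 4s for three attempts without saying whether a delay follows the final failure; this sketch skips it.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, Optional


def api_request(
    url: str,
    data: Optional[bytes] = None,
    attempts: int = 3,
    base_delay: float = 1.0,
    opener: Callable[..., object] = urllib.request.urlopen,
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Send a GET (data=None) or POST (data=bytes), retrying with backoff."""
    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        try:
            request = urllib.request.Request(url, data=data)
            with opener(request, timeout=10) as response:
                return response.read()
        except (urllib.error.URLError, TimeoutError, ConnectionError) as exc:
            # HTTPError (e.g. a brief 503) is a subclass of URLError.
            last_error = exc
            if attempt < attempts - 1:
                # Delay doubles each time: base_delay, then 2 * base_delay.
                sleep(base_delay * 2 ** attempt)
    # Only raised after every attempt has failed.
    raise RuntimeError(f"{url} failed after {attempts} attempts") from last_error
```

Passing `opener` and `sleep` as parameters is a design choice for this sketch only: it lets the backoff schedule be exercised without a network or real waiting.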

Critical doc fix (onboarding blocker):

  • README.md, QUICKSTART.md, agents/template/agent.py: All three still showed generate(spec) as a valid alternative, even though PR #212 (Require LLM agent contract; reject static agents) made generate(spec, llm) mandatory. New miners following the scaffold would write a static agent and get a cryptic rejection at eval time. This PR removes all static-agent references; the template now imports LLMClient and uses the correct two-parameter signature with an inline usage example.
  • Removes the dead link to examples/deterministic-agent/ (deleted in PR #212).

Threat model corrections:

  • Corrects "3× determinism check" → "2× on first spec only"
  • Documents known gap: stochastic agents varying only on later specs may slip through
  • Adds Threat 9: Specialist gaming (old avg_rank exploit, mitigated by overall_score)
  • Adds Determinism check coverage row to summary table
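To make the corrected determinism claim and its known gap concrete, here is a small sketch under stated assumptions. `run_with_determinism_check` and the single-argument `run_agent` callable are hypothetical stand-ins for however the harness invokes an agent; only the "first spec twice, remaining specs once" behaviour comes from the threat model text.

```python
from typing import Callable, Sequence


def run_with_determinism_check(
    run_agent: Callable[[str], str], specs: Sequence[str]
) -> list[str]:
    """Run every spec once; run the first spec a second time and compare."""
    outputs = []
    for index, spec in enumerate(specs):
        result = run_agent(spec)
        # Only spec 0 gets the second run, which keeps CI time manageable
        # but leaves later specs unchecked (the documented gap).
        if index == 0 and run_agent(spec) != result:
            raise ValueError("agent output differs between two runs of the first spec")
        outputs.append(result)
    return outputs
```

An agent that is stable on the first spec but varies on later ones passes this check, which is exactly the gap the threat model now documents.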

Test plan

  • API-unreachable path: spec key missing from sotaBySpec → "— (API unreachable)" shown, no optimization label
  • Unclaimed path: 404 from API → spec set to undefined in sotaBySpec → "⭐ unclaimed" shown, anyBeatsSota=true
  • Retry logic is stdlib-only, no new deps
  • Template agent TypeScript equivalent: imports LLMClient correctly

🤖 Generated with Claude Code

Punch and others added 3 commits June 3, 2026 05:19
…gaming

- Correct '3× determinism check' → '2× on first spec only' (spec 0 runs
  twice; remaining specs run once to keep CI time manageable)
- Document the known gap: stochastic agents that vary only on later specs
  may slip through the single-run check
- Add Threat 9: Specialist gaming — old avg_rank model allowed a miner
  entering 3 easy specs at rank 1 to beat a well-rounded agent averaging
  rank 1.5 across all 45. Mitigated by switching overall leaderboard to
  breadth-normalized overall_score (unentered specs count as baseline 1.0)
- Add 'Determinism check coverage' row to summary table
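
The specialist-gaming arithmetic can be sketched as follows. The `avg_rank` part follows the commit message directly. The `overall_score` formula is an assumption: the source only says it is breadth-normalized and that unentered specs count as baseline 1.0, so this sketch takes a plain mean over all 45 specs and assumes higher per-spec scores are better. The per-spec values 2.0 and 1.5 are illustrative, not real data.

```python
from statistics import mean

TOTAL_SPECS = 45  # spec count quoted in the commit message


def avg_rank(ranks: dict[str, int]) -> float:
    """Old model: average rank over entered specs only (lower is better)."""
    return mean(ranks.values())


def overall_score(
    scores: dict[str, float], total_specs: int = TOTAL_SPECS, baseline: float = 1.0
) -> float:
    """Assumed new model: mean over all specs, unentered ones count as baseline."""
    unentered = total_specs - len(scores)
    return (sum(scores.values()) + unentered * baseline) / total_specs


spec_ids = [f"spec-{i}" for i in range(TOTAL_SPECS)]

# Specialist: rank 1 on three easy specs. Generalist: ranks 1 and 2 alternating,
# so an average rank of roughly 1.5 across all 45 specs.
specialist_ranks = {s: 1 for s in spec_ids[:3]}
generalist_ranks = {s: 1 + (i % 2) for i, s in enumerate(spec_ids)}

# Illustrative per-spec scores (hypothetical values).
specialist_scores = {s: 2.0 for s in spec_ids[:3]}
generalist_scores = {s: 1.5 for s in spec_ids}
```

Under `avg_rank` the specialist comes out ahead because the 42 skipped specs are simply ignored; under the assumed `overall_score` those specs pull the specialist back toward the baseline and the generalist ranks higher.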

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
eval.yml: Add concurrency group keyed on branch name with
cancel-in-progress=true. When a miner pushes multiple commits rapidly
to the same PR branch, only the latest push runs eval — prevents
CI queue pile-up from spurious duplicate runs.

record_submissions.py: Replace single-attempt POST with 3-attempt
exponential backoff (1s → 2s → 4s). Transient forge-api blips (restart,
brief 503) no longer silently drop leaderboard submissions. All three
attempts must fail before the error is logged as non-blocking.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
run_hidden_eval.py GET and POST were single-attempt with no retry.
Consolidate into _api_request() with 3-attempt exponential backoff
(1s → 2s → 4s). Raises RuntimeError only after all attempts fail,
which propagates cleanly to the CI step rather than leaving the hidden
eval silently unrecorded.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@PunchTheDev PunchTheDev changed the title from "Fix threat model: determinism claim + Threat 9 specialist gaming" to "Fix threat model doc + CI reliability improvements" Jun 3, 2026
…-only

PR #212 made generate(spec, llm) mandatory but left three places still
showing generate(spec) as a supported alternative:

- README.md: Remove static-agent code block; clarify that agents without
  the llm parameter are rejected at eval time
- QUICKSTART.md: Remove static-agent section; remove reference to deleted
  examples/deterministic-agent/; update reference list to current examples
- agents/template/agent.py: Change scaffold to use generate(spec, llm)
  signature with LLMClient import; add LLM usage example in TODO comments;
  remove misleading "Two supported signatures" docstring

Without this fix, a miner following the scaffold + docs would write a
static agent, run CI, and get a cryptic rejection error.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
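
For readers wondering what a rejection of `generate(spec)` might look like on the harness side, here is one possible check using `inspect.signature`. The function `accepts_llm` and its exact rule (second positional parameter named `llm`) are assumptions for illustration; the source does not show how the real harness inspects agents.

```python
import inspect
from typing import Callable


def accepts_llm(generate: Callable[..., object]) -> bool:
    """Return True if generate takes a second positional parameter named 'llm'."""
    positional_kinds = (
        inspect.Parameter.POSITIONAL_ONLY,
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
    )
    positional = [
        p
        for p in inspect.signature(generate).parameters.values()
        if p.kind in positional_kinds
    ]
    return len(positional) >= 2 and positional[1].name == "llm"


def static_agent(spec):  # old one-parameter form, rejected since PR #212
    return spec


def llm_agent(spec, llm):  # required two-parameter form
    return spec
```

A check of this kind only looks at the signature, which matches the commit's later note that the parameter must be present but need not be called.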
@PunchTheDev PunchTheDev changed the title from "Fix threat model doc + CI reliability improvements" to "CI reliability + LLM-only doc fixes + threat model corrections" Jun 3, 2026
Punch and others added 3 commits June 3, 2026 05:36
When FORGE_API_URL is temporarily down during a PR eval, the SOTA fetch
silently returns nothing and all specs appear 'unclaimed' → anyBeatsSota
becomes true → every passing agent gets the optimization label.

Fix: track apiReachable separately. 404 responses from the API mark
specs as genuinely unclaimed; no response (fetch failed, timeout, etc.)
leaves sotaBySpec without that key. In the table-building loop, if the
spec's key is missing from sotaBySpec AND the API was unreachable,
show '— (API unreachable)' instead of '⭐ unclaimed' and don't set
anyBeatsSota. This prevents false-positive optimization labels during
API downtime.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
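
The three-way distinction in this fix (known SOTA, genuinely unclaimed, API unreachable) can be sketched in Python as below. The workflow itself is not shown in the source, so the names `collect_sota`, `sota_cell`, and the `fetch_status` callable are hypothetical. `None` stands in for the `undefined` value mentioned in the test plan.

```python
from typing import Callable, Iterable, Optional, Tuple

FetchStatus = Callable[[str], Tuple[int, Optional[float]]]


def collect_sota(
    spec_ids: Iterable[str], fetch_status: FetchStatus
) -> Tuple[dict, bool]:
    """Build sota_by_spec and report whether every fetch got a usable answer."""
    sota_by_spec: dict = {}
    api_reachable = True
    for spec_id in spec_ids:
        try:
            status, score = fetch_status(spec_id)
        except OSError:  # URLError and TimeoutError are both OSError subclasses
            api_reachable = False
            continue  # key stays missing: we know nothing about this spec
        if status == 404:
            sota_by_spec[spec_id] = None  # genuinely unclaimed
        elif status == 200:
            sota_by_spec[spec_id] = score
        else:
            api_reachable = False
    return sota_by_spec, api_reachable


def sota_cell(spec_id: str, sota_by_spec: dict, api_reachable: bool) -> Tuple[str, bool]:
    """Return (table cell text, whether this spec may set anyBeatsSota by itself)."""
    if spec_id not in sota_by_spec and not api_reachable:
        return "— (API unreachable)", False
    sota = sota_by_spec.get(spec_id)
    if sota is None:
        return "⭐ unclaimed", True
    # A real comparison against the agent's score would happen here.
    return str(sota), False
```

The key point is that a missing key and a key mapped to `None` mean different things: the first is "unknown", the second is "known to be unclaimed".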
Previously the PR comment compared raw scores against SOTA and applied
the optimization label based on whether score < sota (minimize) or
score > sota (maximize). This ignored the time-decay marginal-gain rule
(1.0% for 0-7 days, 0.5% for 7-30 days, 0.1% for 30-90 days). A
submission improving SOTA by 0.001g on day 3 would incorrectly receive
the optimization label (2× Gittensor multiplier).

Fix: after fetching the current SOTA score, also call
GET /sota/{spec_id}/eligibility?score={eval_score} for each spec. Use
the eligibility.eligible field (which applies the marginal-gain rule)
to set anyBeatsSota and the label:
- eligible=true → '✓ beats X (margin ok)' → optimization label
- beats raw score but margin too small → '⚠ beats X — margin too small'
- doesn't beat SOTA → current score shown, no label
- API unreachable → '— (API unreachable)', no label

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
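
As a local illustration of the marginal-gain rule, here is a sketch of the tiers quoted above. Several details are assumptions: that the percentage is relative to the current SOTA value, that the age is the age of the current SOTA in days, and that tier boundaries are inclusive on the upper end. The source does not say what applies beyond 90 days, so the sketch raises there. The authoritative check remains the eligibility endpoint, not this code.

```python
def required_margin(sota_age_days: float) -> float:
    """Fractional improvement required, per the tiers in the commit message."""
    if sota_age_days < 0:
        raise ValueError("age must be non-negative")
    if sota_age_days <= 7:
        return 0.010  # 1.0%
    if sota_age_days <= 30:
        return 0.005  # 0.5%
    if sota_age_days <= 90:
        return 0.001  # 0.1%
    raise ValueError("tier beyond 90 days is not specified in the source")


def classify(score: float, sota: float, sota_age_days: float, minimize: bool = True) -> str:
    """Classify a score as 'eligible', 'margin too small', or 'no improvement'."""
    gain = (sota - score) if minimize else (score - sota)
    if gain <= 0:
        return "no improvement"
    if gain >= required_margin(sota_age_days) * abs(sota):
        return "eligible"
    return "margin too small"
```

With a hypothetical SOTA of 100.0 g on day 3, an improvement of 0.001 g falls far short of the 1.0 g this sketch would require, which is the false positive the commit describes.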
The harness (since PR #212) rejects agents with a one-param generate(spec)
signature. The baseline was using the old single-param form and would fail
eval if anyone tried to run it through CI.

Updated to generate(spec, llm: LLMClient) to match the required contract.
The baseline does not call the LLM — it is deterministic geometry — so
llm is accepted but unused (noqa: ARG001). The docstring clarifies this
is permitted: the harness requires the parameter in the signature, not
that it must be called.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
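
A sketch of what an agent that accepts but never calls `llm` might look like. The `LLMClient` protocol, its `complete` method, the spec fields, and the returned shape are all hypothetical placeholders; only the `generate(spec, llm: LLMClient)` signature and the unused-argument `noqa` come from the commit message.

```python
from typing import Protocol


class LLMClient(Protocol):
    """Hypothetical stand-in for the project's LLMClient type."""

    def complete(self, prompt: str) -> str: ...


def generate(spec: dict, llm: LLMClient) -> dict:  # noqa: ARG001
    """Deterministic baseline: llm satisfies the contract but is never called."""
    # Hypothetical spec fields; the real spec format is not shown in the source.
    width = float(spec.get("width", 1.0))
    height = float(spec.get("height", 1.0))
    return {"shape": "box", "width": width, "height": height}
```

Because the output depends only on `spec`, two runs on the same spec return identical results, so an agent of this kind also passes a repeat-run determinism check.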
@PunchTheDev PunchTheDev merged commit b2f686d into main Jun 3, 2026
1 check failed