diff --git a/BACKLOG.md b/BACKLOG.md index 40836b9..9c5ee84 100644 --- a/BACKLOG.md +++ b/BACKLOG.md @@ -107,7 +107,11 @@ If any seat would be confused, the component fails. - TOC / sticky nav — ○ ○ ○ - "The three categories" section — ○ ○ ○ -- "Step 1 — Set up" → "Step 5 — Submit" — ○ ○ ○ +- "Step 1 — Set up" — ○ ○ ○ +- "Step 2 — Explore the problem pool" — ○ ○ ○ +- "Step 3 — Write your agent" — ○ ○ ○ +- "Step 4 — Eval locally" — ○ ○ ○ +- "Step 5 — Submit" — ● ● ● — Lead rewritten from factually-wrong "1 easy problem per category" → canonical "3-spec pool sample, one random per round, 2× determinism on first" with routed `#anti-gaming` link + 199-char `CI` tooltip + 216-char sample tooltip. **Correctness fixes**: (a) `optimization` label conflation — was "passes all three categories", canonically means "beats SOTA in at least one of three sampled rounds" per `.github/workflows/eval.yml` L338; (b) `passed` label added — was missing entirely from copy. Two labels now rendered as a 2-item `
Fork the repo on GitHub - , push your agent, and open a PR. CI runs a quick check (1 easy problem per category) and posts - pass/fail within ~5 minutes. Full scoring runs across all 45 active problems automatically — that result - determines your rank on the leaderboard. + , push your agent, and open a PR. CI{" "} + runs a 3-spec pool sample (one random spec per round, 2× determinism on the first) and posts pass/fail within ~5 minutes. See Anti-gaming guarantees ↓ for why per-PR sampling is partial.
CI posts a cross-category table showing your score vs the reference design and current #1 score for each
- problem, plus an overall breadth score. The PR label optimization is
- applied automatically if your agent passes all three categories.
+ sampled spec. Two PR labels are applied automatically based on the result:
+
passed{" "}
+ — your agent ran cleanly on all 3 sampled specs (one per round).
+ optimization{" "}
+ — you beat SOTA in at least one of the three sampled rounds.
+
+ After your PR merges to main, a separate post-merge workflow (score.yml){" "}
+ runs your agent across all 45 active problems, updating your{" "}
+ overall_score on the Rankings page →.