Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 5 additions & 1 deletion BACKLOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -107,7 +107,11 @@ If any seat would be confused, the component fails.

- TOC / sticky nav — ○ ○ ○
- "The three categories" section — ○ ○ ○
- "Step 1 — Set up" → "Step 5 — Submit" — ○ ○ ○
- "Step 1 — Set up" — ○ ○ ○
- "Step 2 — Explore the problem pool" — ○ ○ ○
- "Step 3 — Write your agent" — ○ ○ ○
- "Step 4 — Eval locally" — ○ ○ ○
- "Step 5 — Submit" — ● ● ● — Lead rewritten from factually-wrong "1 easy problem per category" → canonical "3-spec pool sample, one random per round, 2× determinism on first" with routed `#anti-gaming` link + 199-char `CI` tooltip + 216-char sample tooltip. **Correctness fixes**: (a) `optimization` label conflation — was "passes all three categories", canonically means "beats SOTA in at least one of three sampled rounds" per `.github/workflows/eval.yml` L338; (b) `passed` label added — was missing entirely from copy. Two labels now rendered as a 2-item `<ul>` with per-label tooltips (194/186 chars). Closing paragraph adds the canonical "post-merge `score.yml`" workflow framing — clarifies that full-45-spec eval only runs after PR merges to main, with a routed `/rankings` link to the overall_score result. (`QuickstartGuide.tsx` L721–757, step 373)
- "Whitelisted models" — ○ ○ ○
- "Agent architecture patterns" — ○ ○ ○
- "API reference" — ○ ○ ○
Expand Down
23 changes: 18 additions & 5 deletions src/components/QuickstartGuide.tsx
Original file line number Diff line number Diff line change
Expand Up @@ -722,9 +722,8 @@ forge status agents/your-name/agent.py`} />
<Section id="submit" title="Step 5 — Submit">
<p className="text-forge-muted text-sm">
<a href={`${FORGE_REPO}/fork`} target="_blank" rel="noopener noreferrer" className="text-forge-accent hover:underline">Fork the repo on GitHub</a>
, push your agent, and open a PR. CI runs a quick check (1 easy problem per category) and posts
pass/fail within ~5 minutes. Full scoring runs across all 45 active problems automatically — that result
determines your rank on the leaderboard.
, push your agent, and open a PR. <span className="border-b border-dotted border-forge-muted cursor-help" title="GitHub Actions workflow .github/workflows/eval.yml runs scripts/run_eval_pool.py inside the same forge-eval:latest container you used in Step 4 — so a local PASS will not surprise you with a CI FAIL.">CI</span>{" "}
runs <span className="border-b border-dotted border-forge-muted cursor-help" title="run_eval_pool.py samples ONE random spec per round (3 of 45 total), then re-runs the first spec a second time to verify determinism — same shape as the anti-gaming sampling explained in the Anti-gaming section below.">a 3-spec pool sample</span> (one random spec per round, 2× determinism on the first) and posts pass/fail within ~5 minutes. See <a href="#anti-gaming" className="text-forge-accent hover:underline">Anti-gaming guarantees ↓</a> for why per-PR sampling is partial.
</p>
<CodeBlock code={`# After forking on GitHub:
git remote add mine git@github.com:YOUR_USERNAME/forge.git
Expand All @@ -735,8 +734,22 @@ git push mine your-name/my-design
# Open PR on GitHub`} />
<p className="text-forge-muted text-sm">
CI posts a cross-category table showing your score vs the reference design and current #1 score for each
problem, plus an overall breadth score. The PR label <code className="bg-forge-border px-1 rounded">optimization</code> is
applied automatically if your agent passes all three categories.
sampled spec. Two PR labels are applied automatically based on the result:
</p>
<ul className="text-forge-muted text-sm space-y-1.5 pl-1">
<li>
<code className="bg-forge-border px-1 rounded">passed</code>{" "}
<span className="border-b border-dotted border-forge-muted cursor-help" title="Applied by .github/workflows/eval.yml when your agent runs cleanly on all three sampled specs (one per round) without errors or constraint violations — does not require beating the current SOTA.">— your agent ran cleanly on all 3 sampled specs (one per round).</span>
</li>
<li>
<code className="bg-forge-border px-1 rounded">optimization</code>{" "}
<span className="border-b border-dotted border-forge-muted cursor-help" title="Applied when your sampled score beats the current SOTA on at least ONE of the three sampled rounds. You do not need to beat SOTA in all three — a single win is enough to earn this label.">— you beat SOTA in at least one of the three sampled rounds.</span>
</li>
</ul>
<p className="text-forge-muted text-sm">
After your PR merges to main, a separate <span className="border-b border-dotted border-forge-muted cursor-help" title="The .github/workflows/score.yml workflow triggers on push to main and runs your agent across every active spec in every round — this is the result that determines your overall_score and final placement on the leaderboard.">post-merge workflow (<code className="bg-forge-border px-1 rounded">score.yml</code>)</span>{" "}
runs your agent across all 45 active problems, updating your{" "}
<a href="/rankings" className="text-forge-accent hover:underline">overall_score on the Rankings page →</a>.
</p>
</Section>

Expand Down