
Commit 2dbae3c

docs: add eval prompt guidelines and evaluations section (#112)
Closes #99.
1 parent: f6be531

3 files changed: 35 additions & 1 deletion


.claude/CLAUDE.md

Lines changed: 2 additions & 0 deletions
````diff
@@ -60,6 +60,8 @@ node scripts/evaluate-skills.js <skill-name> --triggers-only # Trigger evals only
 ```
 Results are saved to `evaluations/results/` (gitignored). See `evaluations/icp-cli.json` for the format.

+**Eval prompt guidelines:** Keep prompts focused to avoid the 120s timeout. Scope the response ("just the function, no deploy steps"), ask for one thing, and match expected behaviors to what the prompt actually asks. See CONTRIBUTING.md for details and examples.
+
 ## Writing Guidelines

 - **Write for agents, not humans.** Be explicit with canister IDs, function signatures, and error messages.
````

CONTRIBUTING.md

Lines changed: 20 additions & 1 deletion
````diff
@@ -147,7 +147,26 @@ Create `evaluations/<skill-name>.json` with test cases that verify the skill works
 - **`output_evals`** — realistic prompts with expected behaviors a judge can check
 - **`trigger_evals`** — queries that should/shouldn't activate the skill

-See `evaluations/icp-cli.json` for a working example. Write prompts the way a developer would actually ask — vague and incomplete, not over-specified test questions. Aim for every pitfall in your skill to have at least one eval covering it — pitfalls are where agents hallucinate most.
+See `evaluations/icp-cli.json` for a working example. Aim for every pitfall in your skill to have at least one eval covering it — pitfalls are where agents hallucinate most.
+
+#### Writing eval prompts
+
+Eval prompts run through the `claude` CLI with a 120-second timeout. Open-ended prompts cause the model to generate long responses (full tutorials, backend code, deploy steps) that exceed this limit. Focus each prompt on one thing:
+
+- **Scope the response explicitly** — say what you want ("just the function", "just the YAML snippet") and what to exclude ("no backend code, no deploy steps")
+- **Ask for one thing** — "What URL should I use for the local identityProvider?" runs faster than "How do I set up II locally?"
+- **Match expected behaviors to the prompt** — don't expect the model to volunteer information the prompt doesn't ask for. If you ask about local URLs, don't fail the eval for missing the mainnet URL
+- **Test before committing** — run the eval and verify it completes within the timeout
+
+```
+# Bad — open-ended, will generate a full tutorial and likely time out
+"Show me how to add Internet Identity login to my Vite frontend app."
+
+# Good — scoped, excludes irrelevant content
+"Show me just the JavaScript module that initializes AuthClient and checks
+if the user is already authenticated. Keep it minimal — no backend code,
+no icp.yaml, no deploy steps."
+```

 **Running evaluations** (optional, requires `claude` CLI):
````
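To make the two eval types concrete, a minimal `evaluations/<skill-name>.json` might look like the sketch below. The top-level `output_evals` and `trigger_evals` keys come from the description above; the inner field names (`prompt`, `expected_behaviors`, `query`, `should_trigger`) are assumptions for illustration, so check `evaluations/icp-cli.json` for the actual schema.

```json
{
  "output_evals": [
    {
      "prompt": "Show me just the function that transfers ckBTC. Keep it minimal, no deploy steps.",
      "expected_behaviors": [
        "Uses the correct ledger method name rather than a hallucinated one",
        "Does not include deploy steps"
      ]
    }
  ],
  "trigger_evals": [
    { "query": "How do I check a ckBTC balance from the CLI?", "should_trigger": true },
    { "query": "How do I center a div in CSS?", "should_trigger": false }
  ]
}
```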

README.md

Lines changed: 13 additions & 0 deletions
````diff
@@ -73,6 +73,19 @@ The files are plain markdown — paste into any system prompt, rules file, or co
 | Skill index | [`llms.txt`](https://skills.internetcomputer.org/llms.txt) | All skills with descriptions and discovery links |
 | Skill page | [`/skills/{name}/`](https://skills.internetcomputer.org/skills/ckbtc/) | Pre-rendered skill page for humans |

+## Evaluations
+
+Each skill can have an evaluation file at `evaluations/<skill-name>.json` that tests whether agents produce correct output with the skill loaded. Evals compare agent output with and without the skill, using an LLM judge to score expected behaviors.
+
+```bash
+node scripts/evaluate-skills.js <skill-name>                 # All evals
+node scripts/evaluate-skills.js <skill-name> --eval 2        # Single eval
+node scripts/evaluate-skills.js <skill-name> --no-baseline   # Skip without-skill baseline
+node scripts/evaluate-skills.js <skill-name> --triggers-only # Trigger evals only
+```
+
+Results are saved to `evaluations/results/` (gitignored). See [CONTRIBUTING.md](CONTRIBUTING.md#4-add-evaluation-cases) for how to write eval cases and prompts.
+
 ## Contributing

 See [CONTRIBUTING.md](CONTRIBUTING.md) for how to add or update skills.
````
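The with-versus-without-skill comparison added to the README could, in principle, be scored as sketched below. This is a hypothetical illustration, not the actual logic in `scripts/evaluate-skills.js`; it assumes the LLM judge reduces each expected behavior to a boolean pass/fail verdict.

```javascript
// Hypothetical scoring sketch (not the actual scripts/evaluate-skills.js logic).
// Assumes the LLM judge has reduced each expected behavior to a boolean verdict.

function scoreRun(verdicts) {
  // Fraction of expected behaviors the agent's output satisfied.
  if (verdicts.length === 0) return 0;
  return verdicts.filter(Boolean).length / verdicts.length;
}

function compareRuns(withSkillVerdicts, baselineVerdicts) {
  // Positive `lift` means the skill improved the agent's output.
  const withSkill = scoreRun(withSkillVerdicts);
  const baseline = scoreRun(baselineVerdicts);
  return { withSkill, baseline, lift: withSkill - baseline };
}

// Example: skill run passes 3 of 4 behaviors, baseline passes 1 of 4.
console.log(compareRuns([true, true, true, false], [true, false, false, false]));
// → { withSkill: 0.75, baseline: 0.25, lift: 0.5 }
```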
