
Commit 2dbae3c

docs: add eval prompt guidelines and evaluations section (#112)
Closes #99.
1 parent: f6be531

3 files changed: 35 additions & 1 deletion


.claude/CLAUDE.md

Lines changed: 2 additions & 0 deletions
````diff
@@ -60,6 +60,8 @@ node scripts/evaluate-skills.js <skill-name> --triggers-only # Trigger evals only
 ```
 Results are saved to `evaluations/results/` (gitignored). See `evaluations/icp-cli.json` for the format.

+**Eval prompt guidelines:** Keep prompts focused to avoid the 120s timeout. Scope the response ("just the function, no deploy steps"), ask for one thing, and match expected behaviors to what the prompt actually asks. See CONTRIBUTING.md for details and examples.
+
 ## Writing Guidelines

 - **Write for agents, not humans.** Be explicit with canister IDs, function signatures, and error messages.
````

CONTRIBUTING.md

Lines changed: 20 additions & 1 deletion
````diff
@@ -147,7 +147,26 @@ Create `evaluations/<skill-name>.json` with test cases that verify the skill works
 - **`output_evals`** — realistic prompts with expected behaviors a judge can check
 - **`trigger_evals`** — queries that should/shouldn't activate the skill

-See `evaluations/icp-cli.json` for a working example. Write prompts the way a developer would actually ask — vague and incomplete, not over-specified test questions. Aim for every pitfall in your skill to have at least one eval covering it — pitfalls are where agents hallucinate most.
+See `evaluations/icp-cli.json` for a working example. Aim for every pitfall in your skill to have at least one eval covering it — pitfalls are where agents hallucinate most.
+
+#### Writing eval prompts
+
+Eval prompts run through the `claude` CLI with a 120-second timeout. Open-ended prompts cause the model to generate long responses (full tutorials, backend code, deploy steps) that exceed this limit. Focus each prompt on one thing:
+
+- **Scope the response explicitly** — say what you want ("just the function", "just the YAML snippet") and what to exclude ("no backend code, no deploy steps")
+- **Ask for one thing** — "What URL should I use for the local identityProvider?" runs faster than "How do I set up II locally?"
+- **Match expected behaviors to the prompt** — don't expect the model to volunteer information the prompt doesn't ask for. If you ask about local URLs, don't fail the eval for missing the mainnet URL
+- **Test before committing** — run the eval and verify it completes within the timeout
+
+```
+# Bad — open-ended, will generate a full tutorial and likely time out
+"Show me how to add Internet Identity login to my Vite frontend app."
+
+# Good — scoped, excludes irrelevant content
+"Show me just the JavaScript module that initializes AuthClient and checks
+if the user is already authenticated. Keep it minimal — no backend code,
+no icp.yaml, no deploy steps."
+```

 **Running evaluations** (optional, requires `claude` CLI):
````
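To make the two eval types concrete, a minimal `evaluations/<skill-name>.json` might look like the sketch below. The top-level `output_evals` and `trigger_evals` keys come from the description above; the inner field names (`prompt`, `expected_behaviors`, `query`, `should_trigger`) are assumptions for illustration, so check `evaluations/icp-cli.json` for the actual schema.

```json
{
  "output_evals": [
    {
      "prompt": "Show me just the function that transfers ckBTC. Keep it minimal, no deploy steps.",
      "expected_behaviors": [
        "Uses the correct ledger method name rather than a hallucinated one",
        "Does not include deploy steps"
      ]
    }
  ],
  "trigger_evals": [
    { "query": "How do I check a ckBTC balance from the CLI?", "should_trigger": true },
    { "query": "How do I center a div in CSS?", "should_trigger": false }
  ]
}
```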

README.md

Lines changed: 13 additions & 0 deletions
````diff
@@ -73,6 +73,19 @@ The files are plain markdown — paste into any system prompt, rules file, or co
 | Skill index | [`llms.txt`](https://skills.internetcomputer.org/llms.txt) | All skills with descriptions and discovery links |
 | Skill page | [`/skills/{name}/`](https://skills.internetcomputer.org/skills/ckbtc/) | Pre-rendered skill page for humans |

+## Evaluations
+
+Each skill can have an evaluation file at `evaluations/<skill-name>.json` that tests whether agents produce correct output with the skill loaded. Evals compare agent output with and without the skill, using an LLM judge to score expected behaviors.
+
+```bash
+node scripts/evaluate-skills.js <skill-name>                 # All evals
+node scripts/evaluate-skills.js <skill-name> --eval 2        # Single eval
+node scripts/evaluate-skills.js <skill-name> --no-baseline   # Skip without-skill baseline
+node scripts/evaluate-skills.js <skill-name> --triggers-only # Trigger evals only
+```
+
+Results are saved to `evaluations/results/` (gitignored). See [CONTRIBUTING.md](CONTRIBUTING.md#4-add-evaluation-cases) for how to write eval cases and prompts.
+
 ## Contributing

 See [CONTRIBUTING.md](CONTRIBUTING.md) for how to add or update skills.
````
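The with-versus-without-skill comparison added to the README could, in principle, be scored as sketched below. This is a hypothetical illustration, not the actual logic in `scripts/evaluate-skills.js`; it assumes the LLM judge reduces each expected behavior to a boolean pass/fail verdict.

```javascript
// Hypothetical scoring sketch (not the actual scripts/evaluate-skills.js logic).
// Assumes the LLM judge has reduced each expected behavior to a boolean verdict.

function scoreRun(verdicts) {
  // Fraction of expected behaviors the agent's output satisfied.
  if (verdicts.length === 0) return 0;
  return verdicts.filter(Boolean).length / verdicts.length;
}

function compareRuns(withSkillVerdicts, baselineVerdicts) {
  // Positive `lift` means the skill improved the agent's output.
  const withSkill = scoreRun(withSkillVerdicts);
  const baseline = scoreRun(baselineVerdicts);
  return { withSkill, baseline, lift: withSkill - baseline };
}

// Example: skill run passes 3 of 4 behaviors, baseline passes 1 of 4.
console.log(compareRuns([true, true, true, false], [true, false, false, false]));
// → { withSkill: 0.75, baseline: 0.25, lift: 0.5 }
```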
