Results are saved to `evaluations/results/` (gitignored). See `evaluations/icp-cli.json` for the format.
**Eval prompt guidelines:** Keep prompts focused to avoid the 120s timeout. Scope the response ("just the function, no deploy steps"), ask for one thing, and match expected behaviors to what the prompt actually asks. See CONTRIBUTING.md for details and examples.
## Writing Guidelines
- **Write for agents, not humans.** Be explicit with canister IDs, function signatures, and error messages.
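For instance, a hypothetical before/after (the canister ID shown is the mainnet ICP ledger's and `icrc1_transfer` is its standard ICRC-1 method, used here purely for illustration):

```
Vague:    "Transfer tokens using the ledger."
Explicit: "Call `icrc1_transfer` on the ICP ledger canister
          (`ryjl3-tyaaa-aaaaa-aaaba-cai`); on failure it returns a
          `TransferError` variant such as `InsufficientFunds`."
```

The explicit version gives an agent concrete identifiers to verify against instead of leaving it to guess.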
## CONTRIBUTING.md

Create `evaluations/<skill-name>.json` with test cases that verify the skill works:
- **`output_evals`** — realistic prompts with expected behaviors a judge can check
- **`trigger_evals`** — queries that should/shouldn't activate the skill
See `evaluations/icp-cli.json` for a working example. Aim for every pitfall in your skill to have at least one eval covering it — pitfalls are where agents hallucinate most.
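Concretely, an eval file might look roughly like this (the field names inside each entry are illustrative guesses, not the actual schema — treat `evaluations/icp-cli.json` as the authoritative format):

```json
{
  "output_evals": [
    {
      "prompt": "What URL should I use for the local identityProvider?",
      "expected_behaviors": [
        "Gives the local replica URL rather than the mainnet identity.ic0.app URL"
      ]
    }
  ],
  "trigger_evals": [
    { "query": "How do I deploy a canister with icp-cli?", "should_trigger": true },
    { "query": "How do I center a div in CSS?", "should_trigger": false }
  ]
}
```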
#### Writing eval prompts
Eval prompts run through the `claude` CLI with a 120-second timeout. Open-ended prompts cause the model to generate long responses (full tutorials, backend code, deploy steps) that exceed this limit. Focus each prompt on one thing:
- **Scope the response explicitly** — say what you want ("just the function", "just the YAML snippet") and what to exclude ("no backend code, no deploy steps")
- **Ask for one thing** — "What URL should I use for the local identityProvider?" runs faster than "How do I set up II locally?"
- **Match expected behaviors to the prompt** — don't expect the model to volunteer information the prompt doesn't ask for. If you ask about local URLs, don't fail the eval for missing the mainnet URL
- **Test before committing** — run the eval and verify it completes within the timeout
```
# Bad — open-ended, will generate full tutorial and likely timeout
"Show me how to add Internet Identity login to my Vite frontend app."
# Good — scoped, excludes irrelevant content
"Show me just the JavaScript module that initializes AuthClient and checks
if the user is already authenticated. Keep it minimal — no backend code,
no deploy steps."
```
## README.md

The files are plain markdown — paste into any system prompt, rules file, or co…
| Skill index | [`llms.txt`](https://skills.internetcomputer.org/llms.txt) | All skills with descriptions and discovery links |
| Skill page | [`/skills/{name}/`](https://skills.internetcomputer.org/skills/ckbtc/) | Pre-rendered skill page for humans |
## Evaluations
Each skill can have an evaluation file at `evaluations/<skill-name>.json` that tests whether agents produce correct output with the skill loaded. Evals compare agent output with and without the skill, using an LLM judge to score expected behaviors.
```bash
node scripts/evaluate-skills.js <skill-name>                 # All evals
node scripts/evaluate-skills.js <skill-name> --eval 2 # Single eval
node scripts/evaluate-skills.js <skill-name> --triggers-only # Trigger evals only
```
Results are saved to `evaluations/results/` (gitignored). See [CONTRIBUTING.md](CONTRIBUTING.md#4-add-evaluation-cases) for how to write eval cases and prompts.
## Contributing
See [CONTRIBUTING.md](CONTRIBUTING.md) for how to add or update skills.