diff --git a/.agents/skills/create-skill/SKILL.md b/.agents/skills/create-skill/SKILL.md index 8c44fc6..54b7fce 100644 --- a/.agents/skills/create-skill/SKILL.md +++ b/.agents/skills/create-skill/SKILL.md @@ -8,7 +8,7 @@ tags: - authoring metadata: author: Anthropic - version: "1.7.0" + version: "1.8.0" source: github.com/anthropics/skills catalog: utility category: meta diff --git a/.agents/skills/create-skill/evals/evals.json b/.agents/skills/create-skill/evals/evals.json index 79d07b7..0cb84a7 100644 --- a/.agents/skills/create-skill/evals/evals.json +++ b/.agents/skills/create-skill/evals/evals.json @@ -12,6 +12,7 @@ "Produces or revises SKILL.md instructions", "Keeps metadata and body budgets in mind", "Uses references only when they reduce main-file complexity", + "Uses the reference format that best teaches the behavior instead of defaulting to terse bullets", "Avoids placeholder bundled resources", "Preserves portability and safety" ] @@ -27,6 +28,7 @@ "Produces or revises SKILL.md instructions", "Keeps metadata and body budgets in mind", "Uses references only when they reduce main-file complexity", + "Uses the reference format that best teaches the behavior instead of defaulting to terse bullets", "Avoids placeholder bundled resources", "Preserves portability and safety" ] @@ -42,6 +44,7 @@ "Produces or revises SKILL.md instructions", "Keeps metadata and body budgets in mind", "Uses references only when they reduce main-file complexity", + "Uses the reference format that best teaches the behavior instead of defaulting to terse bullets", "Avoids placeholder bundled resources", "Preserves portability and safety" ] @@ -57,6 +60,7 @@ "Produces or revises SKILL.md instructions", "Keeps metadata and body budgets in mind", "Uses references only when they reduce main-file complexity", + "Uses the reference format that best teaches the behavior instead of defaulting to terse bullets", "Avoids placeholder bundled resources", "Preserves portability and safety" ] @@ 
-72,6 +76,7 @@ "Produces or revises SKILL.md instructions", "Keeps metadata and body budgets in mind", "Uses references only when they reduce main-file complexity", + "Uses the reference format that best teaches the behavior instead of defaulting to terse bullets", "Avoids placeholder bundled resources", "Preserves portability and safety" ] @@ -87,6 +92,7 @@ "Produces or revises SKILL.md instructions", "Keeps metadata and body budgets in mind", "Uses references only when they reduce main-file complexity", + "Uses the reference format that best teaches the behavior instead of defaulting to terse bullets", "Avoids placeholder bundled resources", "Preserves portability and safety" ] @@ -102,6 +108,7 @@ "Produces or revises SKILL.md instructions", "Keeps metadata and body budgets in mind", "Uses references only when they reduce main-file complexity", + "Uses the reference format that best teaches the behavior instead of defaulting to terse bullets", "Avoids placeholder bundled resources", "Preserves portability and safety" ] @@ -117,6 +124,7 @@ "Produces or revises SKILL.md instructions", "Keeps metadata and body budgets in mind", "Uses references only when they reduce main-file complexity", + "Uses the reference format that best teaches the behavior instead of defaulting to terse bullets", "Avoids placeholder bundled resources", "Preserves portability and safety" ] @@ -251,7 +259,9 @@ "Routes to evaluation guidance", "Creates realistic prompt-level eval cases", "Includes route or trigger boundary coverage", - "Uses objective expectations where possible", + "Derives assertions from the skill contract when objective checks are useful", + "Covers distinct failure modes or input classes without redundant assertions", + "Includes at least one negative assertion for evals with objective checks", "Keeps evals inside the skill folder", "Mentions reproducible iteration or benchmark workflow when relevant" ] @@ -266,7 +276,9 @@ "Routes to evaluation guidance", "Creates 
realistic prompt-level eval cases", "Includes route or trigger boundary coverage", - "Uses objective expectations where possible", + "Derives assertions from the skill contract when objective checks are useful", + "Covers distinct failure modes or input classes without redundant assertions", + "Includes at least one negative assertion for evals with objective checks", "Keeps evals inside the skill folder", "Mentions reproducible iteration or benchmark workflow when relevant" ] @@ -281,7 +293,9 @@ "Routes to evaluation guidance", "Creates realistic prompt-level eval cases", "Includes route or trigger boundary coverage", - "Uses objective expectations where possible", + "Derives assertions from the skill contract when objective checks are useful", + "Covers distinct failure modes or input classes without redundant assertions", + "Includes at least one negative assertion for evals with objective checks", "Keeps evals inside the skill folder", "Mentions reproducible iteration or benchmark workflow when relevant" ] @@ -296,7 +310,9 @@ "Routes to evaluation guidance", "Creates realistic prompt-level eval cases", "Includes route or trigger boundary coverage", - "Uses objective expectations where possible", + "Derives assertions from the skill contract when objective checks are useful", + "Covers distinct failure modes or input classes without redundant assertions", + "Includes at least one negative assertion for evals with objective checks", "Keeps evals inside the skill folder", "Mentions reproducible iteration or benchmark workflow when relevant" ] @@ -311,7 +327,9 @@ "Routes to evaluation guidance", "Creates realistic prompt-level eval cases", "Includes route or trigger boundary coverage", - "Uses objective expectations where possible", + "Derives assertions from the skill contract when objective checks are useful", + "Covers distinct failure modes or input classes without redundant assertions", + "Includes at least one negative assertion for evals with objective 
checks", "Keeps evals inside the skill folder", "Mentions reproducible iteration or benchmark workflow when relevant" ] @@ -326,7 +344,9 @@ "Routes to evaluation guidance", "Creates realistic prompt-level eval cases", "Includes route or trigger boundary coverage", - "Uses objective expectations where possible", + "Derives assertions from the skill contract when objective checks are useful", + "Covers distinct failure modes or input classes without redundant assertions", + "Includes at least one negative assertion for evals with objective checks", "Keeps evals inside the skill folder", "Mentions reproducible iteration or benchmark workflow when relevant" ] @@ -341,7 +361,9 @@ "Routes to evaluation guidance", "Creates realistic prompt-level eval cases", "Includes route or trigger boundary coverage", - "Uses objective expectations where possible", + "Derives assertions from the skill contract when objective checks are useful", + "Covers distinct failure modes or input classes without redundant assertions", + "Includes at least one negative assertion for evals with objective checks", "Keeps evals inside the skill folder", "Mentions reproducible iteration or benchmark workflow when relevant" ] @@ -356,7 +378,9 @@ "Routes to evaluation guidance", "Creates realistic prompt-level eval cases", "Includes route or trigger boundary coverage", - "Uses objective expectations where possible", + "Derives assertions from the skill contract when objective checks are useful", + "Covers distinct failure modes or input classes without redundant assertions", + "Includes at least one negative assertion for evals with objective checks", "Keeps evals inside the skill folder", "Mentions reproducible iteration or benchmark workflow when relevant" ] diff --git a/.agents/skills/create-skill/references/agent-compatibility.md b/.agents/skills/create-skill/references/agent-compatibility.md index 7d07511..848565c 100644 --- a/.agents/skills/create-skill/references/agent-compatibility.md +++ 
b/.agents/skills/create-skill/references/agent-compatibility.md @@ -1,14 +1,41 @@ # Agent Compatibility -Use this reference when adapting skill creation or evaluation to a specific runtime. +Use this reference when adapting skill creation, evaluation, or packaging to a specific agent runtime. + +The core rule is simple: preserve the skill's behavior, then swap only the runtime mechanics that do not exist in the current environment. + +## Start From Capability Gaps + +Before changing a workflow, identify what the current agent can and cannot do. + +Check for: + +- **Subagents:** Can it run skill and baseline attempts in parallel? +- **Trigger telemetry:** Can it tell whether a skill would activate? +- **File access:** Can it read and write the target skill directory? +- **Browser/display:** Can it show the review UI? +- **Command shape:** Does it take prompts through stdin, arguments, files, or an interactive session? + +Do not rewrite portable instructions just because a runtime lacks one convenience. Adapt the missing mechanism, not the skill's intent. + +--- ## Agents Without Subagents -Follow the same draft, test, review, and improve loop, but run test cases serially yourself. Skip baseline comparisons unless another local mechanism can produce them fairly. +Follow the same draft, test, review, and improve loop, but run test cases serially yourself. + +Baseline comparisons are weaker without isolated runs. Skip them unless another local mechanism can produce them fairly. Treat results as qualitative unless deterministic assertions can be checked locally. + +When review UI support is limited, use one of these fallbacks: + +- **Static review:** save a static HTML review file. +- **Inline summary:** summarize outputs directly in the conversation. +- **Focused questions:** ask concise inline review questions. +- **Deterministic checks:** use scripts for checks that do not need human judgment. 
-Present outputs directly in the conversation or save files for the user to inspect. If a browser is unavailable, skip the live review server and use a static HTML review file or concise inline review prompts. +### What changes -Quantitative benchmarking is less meaningful without isolated baseline runs. Prioritize qualitative feedback unless deterministic assertions can be checked locally. +The process gets slower and less statistically clean. The standard should not get lower. Keep transcripts, outputs, and grading results organized so another reviewer can reproduce the judgment. --- @@ -24,7 +51,7 @@ python -m scripts.run_loop \ --verbose ``` -Use the user's normal Claude Code configuration. +Use the user's normal Claude Code configuration. Do not silently switch models or tool settings, because trigger behavior should reflect the user's actual environment. --- @@ -40,7 +67,7 @@ python -m scripts.run_loop \ --verbose ``` -For CLIs that need arguments or files instead of stdin, use: +For CLIs that need arguments or files instead of stdin, use `--agent-command`: ```bash python -m scripts.run_loop \ @@ -51,20 +78,49 @@ python -m scripts.run_loop \ --verbose ``` +Use `{prompt}` when the CLI accepts inline prompt text. Use `{prompt_file}` when prompt files are safer for quoting, long inputs, or multiline content. + --- ## Cowork -Cowork has subagents, so parallel skill and baseline runs can work. If timeouts become a problem, run prompts in smaller batches. +Cowork has subagents, so parallel skill and baseline runs can work. If timeouts become a problem, run prompts in smaller batches instead of dropping coverage. + +Cowork may not have a display. Generate a static review file with: -Cowork may not have a display. Generate a static review file with `eval-viewer/generate_review.py --static ` and share that path. Use the generated review UI before revising from test outputs. 
+```bash +python /eval-viewer/generate_review.py \ + \ + --skill-name "" \ + --benchmark /benchmark.json \ + --static +``` -When feedback is downloaded as `feedback.json`, copy it into the current iteration directory before continuing. +Use the generated review UI before revising from test outputs. When feedback is downloaded as `feedback.json`, copy it into the current iteration directory before continuing. --- ## Updating Installed Skills -Preserve the original skill directory name and `name` frontmatter. If an installed skill path is read-only, copy it to a writable location, edit the copy, and package from there. +Preserve the original skill directory name and `name` frontmatter. Installed skills often rely on those identifiers for discovery. + +If the installed skill path is read-only: + +1. Copy the skill to a writable location. +2. Edit and validate the copy. +3. Package from the copy. +4. Tell the user which artifact or directory should replace the installed version. When packaging manually, stage temporary package contents in `/tmp/` first if direct writes fail. + +--- + +## Portability Checklist + +Before finishing a compatibility adaptation, verify: + +- **Core behavior:** the workflow still describes the same skill behavior. +- **Runtime isolation:** runtime-specific commands are isolated to compatibility notes. +- **Fallbacks:** unavailable features have explicit alternatives. +- **Result confidence:** eval results are described with the right confidence level. +- **Packaging:** package and install instructions match the user's actual runtime. diff --git a/.agents/skills/create-skill/references/authoring.md b/.agents/skills/create-skill/references/authoring.md index ecf7a1b..c463836 100644 --- a/.agents/skills/create-skill/references/authoring.md +++ b/.agents/skills/create-skill/references/authoring.md @@ -2,98 +2,169 @@ Use this reference when creating a new skill or revising an existing `SKILL.md`. +A good skill is not a long prompt. 
It is a compact operating procedure that tells another agent when to activate, what context to load, what steps to follow, and how to verify the result. + ## Capture Intent -Extract what the user already provided before asking questions. Identify what the skill enables the calling agent to do, when it should trigger, what output it should produce, and whether test cases should be created. Put the trigger scope in frontmatter `description`, not in a body `Scope` section. +Use what the user already gave you. Do not interview them for information that is already in the request, the repo, examples, or existing skill folder. + +Identify: + +- **Capability:** what the skill enables the calling agent to do +- **Activation:** which user phrases, artifacts, or contexts should trigger it +- **Output:** what the skill should produce or change +- **Inputs:** files, prompts, external tools, or structured data it consumes +- **Verification:** whether objective evals, validators, or review loops are useful -Ask at most one focused question when a missing answer would materially change the skill. +Ask at most one focused question when the missing answer would materially change the skill. Otherwise, make a conservative assumption and keep moving. --- -## Research and Interview +## Research Before Writing -Ask about edge cases, input formats, output formats, success criteria, dependencies, and example files. For existing skills, inspect the current folder before editing. Check bundled scripts, references, assets, and evals so you preserve local conventions. +For existing skills, inspect the current folder before editing. Check `SKILL.md`, `references/`, `scripts/`, `assets/`, and `evals/` so you preserve local conventions. -Ground the draft in real source material when available: prior task traces, existing docs, runbooks, review comments, issue history, example files, or failed eval outputs. Prefer concrete project facts and corrections over generic best practices. 
Fall back to general domain knowledge only when no better source exists. +Ground the draft in real source material when available: ---- +- **Task traces:** prior runs that show how the skill is actually used. +- **Docs and runbooks:** existing durable guidance. +- **Review and issues:** comments, bugs, and decisions from prior work. +- **Eval failures:** outputs that reveal recurring misses. +- **Examples and fixtures:** files with known structure or expected results. +- **Project context:** local rules or memory. -## Write `SKILL.md` +Use concrete project facts and corrections over generic best practices. Fall back to general domain knowledge only when no better source exists. -Required frontmatter fields are `name` and `description`. Optional top-level fields are `license`, `tags`, and `metadata`. Put `author`, `version`, `source`, `catalog`, `category`, and `references` inside `metadata`. Keep the complete frontmatter under 100 tokens. The description is the primary trigger signal, so write it as action plus activation cues: name the task the skill performs, the contexts that should trigger it, representative user phrases, and exclusions only when they prevent likely misfires. +--- -Use these metadata fields: +## Write The Frontmatter -| Field | Meaning | -| --- | --- | -| `name` | A unique identifier for the skill. | -| `description` | A concise explanation of the skill's purpose and when to use it. | -| `license` | The name of the license, such as `MIT` or `Apache-2.0`. | -| `tags` | A list of searchable keywords for discovery and filtering. | -| `metadata` | A nested mapping for arbitrary key-value pairs. | +- **Required fields:** Use `name` and `description`. Optional top-level fields are `license`, `tags`, and `metadata`. +- **Nested metadata:** Put `author`, `version`, `source`, `catalog`, `category`, and `references` inside `metadata`. +- **Trigger signal:** The `description` is the primary trigger signal. 
Write it as action plus activation cues: name the task the skill performs, the contexts that should trigger it, representative user phrases, and exclusions only when they prevent likely misfires. +- **Budget:** Keep the frontmatter `description` under 100 tokens. If it grows past that budget, make it shorter. -Common nested metadata fields: +### Metadata fields | Field | Meaning | | --- | --- | -| `metadata.author` | The creator's name or GitHub profile URL. | +| `name` | Unique skill identifier. | +| `description` | Concise purpose and activation signal. | +| `license` | License name, such as `MIT` or `Apache-2.0`. | +| `tags` | Searchable discovery and filtering keywords. | +| `metadata.author` | Creator name or GitHub profile URL. | | `metadata.version` | Semantic versioning string, such as `1.2.0`. | -| `metadata.source` | Repository or canonical source reference, such as `github.com/org/repo`. | +| `metadata.source` | Repository or canonical source reference. | | `metadata.catalog` | Optional catalog grouping string. | -| `metadata.category` | Optional domain category string in lowercase kebab-case, such as `development`, `documentation`, or `project-management`. | -| `metadata.references` | Optional list of local skill or rule names this skill explicitly uses or routes to. | +| `metadata.category` | Optional lowercase kebab-case category. | +| `metadata.references` | Local skills or rules this skill explicitly uses. | + +Use `metadata.references` only when the body tells the agent to use, apply, delegate to, run, or route follow-up work to that skill or rule. +Do not include route-away mentions, adjacent alternatives, near misses, exclusions, boundaries, or examples of work this skill should not handle. + +### Versioning -Use `metadata.references` when this skill actually uses another local skill or rule as part of its workflow. Include a referenced item when the body tells the agent to use, apply, delegate to, run, or route follow-up work to that skill/rule. 
Do not include skills that appear only as adjacent alternatives, near misses, exclusions, boundaries, or examples of work this skill should not handle. +Use Semantic Versioning for agent skills: -For `metadata.version`, use Semantic Versioning (Major.Minor.Patch) with these specific criteria for agent skills: +- **Major:** breaking changes to the skill contract, such as redefining the trigger, splitting the skill, removing required workflow steps, or changing expected output formats. +- **Minor:** backward-compatible additions, such as new references, expanded trigger coverage, or optional output sections. +- **Patch:** safe reliability tweaks, such as clearer wording, typo fixes, or boundaries that prevent common mistakes. -- **Major (X.0.0)**: Breaking changes to the skill's contract. This includes shrinking or redefining the trigger `description`, splitting the skill into multiple skills, removing a required workflow step, or altering expected output formats. -- **Minor (0.X.0)**: Backward-compatible additions. This includes adding a new `references/*.md` file, expanding the trigger to cover new intents without dropping old ones, or adding optional output sections. -- **Patch (0.0.X)**: Safe reliability tweaks. This includes wording adjustments to improve adherence, fixing typos, or adding boundaries to prevent hallucinations. +Always bump `metadata.version` when making a material change to a skill's files. --- -## Write `SKILL.md` Body +## Write The Body -Keep the Markdown body under 500 lines. The default body shape is a `#` title, one short purpose sentence, a standalone `---`, then `## Workflow`, `## Output`, `## Boundaries`, and `## Verification`. Add `## Error Paths` when failures need explicit handling. Use specialized sections such as `## Route the Work`, `## Source Handling`, or `## Bundled Resources` only when they replace or extend the default flow. Move deep detail into `references/` and point to it clearly. 
Do not use a body `Scope` section to describe when the skill should be called; that belongs in `description` per the Agent Skills spec. +Keep the Markdown body under 500 lines. +If the body grows past that budget, move detailed procedures, examples, platform notes, and variant-specific guidance into `references/`. +Router skills are the preferred shape for broad domains: keep the main file focused on routing and shared rules, then load only the relevant reference. -Start the body with the section that helps the activated agent act: usually `## Workflow`, `## Source Handling`, or `## Route the Work`. Put `## Boundaries` after the main workflow or output guidance so the file opens with execution, not limits. Place boundaries first only when safety or destructive behavior must be checked before any action. +The default shape is: -Apply the house Markdown style while writing, not as a later cleanup pass: +```text +# Skill Name -- **Section delimiters**: place a standalone `---` between `##` sections in `SKILL.md`. Keep the YAML frontmatter delimiters unchanged, and do not add an extra delimiter immediately after the frontmatter or before the `#` title. -- **Intro purpose**: after the `#` title, write one short sentence that states what the skill does, then place `---` before the first `##` section. -- **Scan anchors**: use bold labels inside steps or bullets when they make distinct actions, fields, or rules easier to scan. -- **Template exceptions**: do not force bold labels into schemas, command examples, literal output templates, or checklist items where they would make the example less accurate. +One short purpose sentence. -After editing, run `create-skill/scripts/quick_validate.py ` when this skill's scripts are available. Treat style failures as authoring bugs, not optional polish. +--- -Prefer deterministic helper scripts for repetitive validation, grading, packaging, report generation, or other mechanical checks that would otherwise be reimplemented by hand. 
+## Workflow +## Output +## Boundaries +## Verification +``` -For router skills with `references/*.md`, create `evals/evals.json` before validation is considered complete. Each eval must include a `reference` field that points to the routed reference, and every non-schema reference must have 8-10 evals. This keeps the router honest instead of giving it one polite smoke test and hoping for the best. +Add `## Error Paths` when failures need explicit handling. +Use specialized sections such as `## Route the Work`, `## Source Handling`, or `## Bundled Resources` only when they replace or extend the default flow. ---- +Start with the section that helps the activated agent act. +Usually that is `## Workflow`, `## Source Handling`, or `## Route the Work`. +Put `## Boundaries` after the main workflow unless safety or destructive behavior must be checked before any action. -## Write References +Do not use a body `## Scope` section to describe activation criteria. +Skill-call scope belongs in the frontmatter `description`. + +### House Markdown style + +Apply style while writing, not as a cleanup pass: -Use `references/*.md` for details that would bloat `SKILL.md`: variant workflows, platform notes, review checklists, schemas, long examples, eval guidance, or compatibility instructions. +- **Section delimiters:** place standalone `---` delimiters between `##` sections in `SKILL.md`; do not add an extra delimiter after the YAML frontmatter or before the `#` title. +- **Intro purpose:** after the `#` title, write one short purpose sentence, then place `---` before the first `##` section. +- **Scan anchors:** use bold labels inside steps or bullets when they make distinct actions, fields, or rules easier to scan. +- **Template exceptions:** do not force bold labels into schemas, command examples, literal output templates, or checklist items. -Each reference should start with a `#` title and one short purpose sentence. 
Use task-specific `##` sections instead of forcing the `SKILL.md` default body shape. Put standalone `---` delimiters between `##` sections in long references. Start with the most actionable section for that reference, not background or boundaries. +### Example -Use bold scan anchors inside steps or bullets when they make distinct actions, fields, or rules easier to scan. Schema references, command examples, literal templates, and field lists may use their natural formatting instead. +Situation: the skill needs a workflow for editing existing skill folders. Task: make the instructions reusable, not tied to one edit. + +Weak: + +```text +## Workflow + +- Analyze the files. +- Make improvements. +- Validate. +``` -Keep references loaded by clear conditions from `SKILL.md`. Do not create placeholder references, and do not use references as a dumping ground for detail that no workflow loads. +Strong: + +```text +## Workflow + +1. **Identify the target artifact:** Read the existing file before deciding whether to edit, replace, or add a reference. +2. **Preserve local conventions:** Match naming, metadata, validation scripts, and eval structure already present in the skill folder. +3. **Verify behavior:** Run the skill validator and any focused eval or schema check affected by the change. +``` + +The strong version tells the agent what decisions to make and what evidence to collect. The weak version is technically true and operationally soggy. + +--- + +## Write References + +- **Reference purpose:** Use `references/*.md` for details that would bloat `SKILL.md`: variant workflows, platform notes, review checklists, schemas, long examples, eval guidance, or compatibility instructions. +- **Reference shape:** Each reference should start with a `#` title and one short purpose sentence. Use task-specific `##` sections instead of forcing the `SKILL.md` default body shape. Put standalone `---` delimiters between `##` sections in long references. 
Start with the most actionable section for that reference, not background or boundaries. +- **Teaching format:** Write references in the format that best teaches the behavior. Use guide-style prose, concrete examples, checklists, weak/strong comparisons, and worked examples when they improve comprehension. Do not flatten references into terse operational bullets by default. Compress only repetition, stale context, or details that do not change agent behavior. +- **Reference scan anchors:** Use bold scan anchors inside steps or bullets when they make distinct actions, fields, or rules easier to scan. Schema references, command examples, literal templates, and field lists may use their natural formatting instead. +- **Load conditions:** Keep references loaded by clear conditions from `SKILL.md`. Do not create placeholder references, and do not use references as a dumping ground for detail that no workflow loads. --- -## Length Budgets +## Writing Style + +- **Imperative voice:** Use imperative instructions. Explain why constraints matter instead of stacking brittle all-caps rules. +- **Examples:** Write examples and eval prompts so reviewers can see the situation, task, expected action, and result criteria. Include a `### Example` subsection under the relevant `##` section only when it clarifies behavior, boundaries, or output shape. Keep examples short and move large examples into references. +- **Scan anchors:** Use bold scan anchors where they help another agent skim distinct actions, fields, or rules before reading details. +- **Delimiters:** Use standalone `---` delimiters between `##` sections so long skill files segment cleanly in model context. +- **Code responsibilities:** For code-generation skills and bundled helper scripts, keep responsibilities clear, interfaces small, and dependencies explicit without adding unnecessary layers. 
-Follow these budgets for every `SKILL.md`: +--- -- **Metadata/frontmatter**: no more than 100 tokens -- **Main instruction body**: no more than 500 lines +## Bundle Resources -If a skill exceeds either budget, shorten trigger metadata first, then move detailed procedures, examples, platform notes, and variant-specific guidance into `references/`. Router skills are the preferred shape for broad domains: keep the main file focused on routing and shared rules, then load only the relevant reference. +Add folders only when they contain useful files. Use this shape when helpful: @@ -106,34 +177,43 @@ skill-name/ └── evals/ ``` -Do not create placeholder directories. Add a folder only when it contains useful files. +Use deterministic helper scripts for repetitive validation, grading, packaging, report generation, or other mechanical checks that would otherwise be reimplemented by hand. + +Move long templates, large examples, fixture files, and generated review assets out of `SKILL.md` when they would make the main body harder to scan. --- -## Progressive Disclosure +## Add Evals -Use three levels: metadata loaded by the runtime, main body loaded when the skill triggers, and bundled resources loaded only when needed. +For router skills with `references/*.md`, create `evals/evals.json` before validation is considered complete. -Router skills should classify the request, choose the relevant reference, read only that reference, and act. +Each eval must include a `reference` field pointing to the routed reference. Every non-schema reference must have 8-10 evals. Near-miss prompts count toward the route they are intended to test. ---- +For objectively testable skills, include assertions, scripts, schemas, fixtures, or acceptance checks where practical. Use `references/evaluation.md` for deeper eval design guidance. -## Compatibility +--- -Write core instructions so they work in any agent runtime. 
Put runtime-specific notes under a short compatibility section or in `references/agent-compatibility.md`. +## Validate -Avoid relying on one agent's tool names, slash commands, event stream, or UI unless the skill is explicitly for that agent. +After editing, run: ---- +```bash +python /scripts/quick_validate.py +``` -## Writing Style +Treat style failures as authoring bugs, not optional polish. -Use imperative instructions. Explain why constraints matter instead of stacking brittle all-caps rules. Include a `### Example` subsection under the relevant `##` section only when it clarifies behavior, boundaries, or output shape. Write examples and eval prompts so reviewers can see the situation, task, expected action, and result criteria. Keep examples short and move large examples into references. +Also run focused checks for touched artifacts: -Use bold scan anchors where they help another agent skim distinct actions, fields, or rules before reading details. +- **JSON parsing:** parser check for `evals/*.json`. +- **Script tests:** relevant unit or integration tests for bundled scripts. +- **Packaging:** packaging check when producing a distributable skill. +- **Eval reruns:** trigger or behavior evals when the description, workflow, or references changed materially. -Use standalone `---` delimiters between `##` sections so long skill files segment cleanly in model context. +--- -For code-generation skills and bundled helper scripts, keep responsibilities clear, interfaces small, and dependencies explicit without adding unnecessary layers. +## Portability And Safety -Skills must not contain malware, hidden exfiltration behavior, credential capture, or instructions that would surprise the user relative to the skill description. +- **Portable core:** Write core instructions so they work in any agent runtime. Put runtime-specific notes under a short compatibility section or in `references/agent-compatibility.md`. 
+- **Runtime assumptions:** Avoid relying on one agent's tool names, slash commands, event stream, or UI unless the skill is explicitly for that agent. +- **User trust:** Skills must not contain malware, hidden exfiltration behavior, credential capture, or instructions that would surprise the user relative to the skill description. diff --git a/.agents/skills/create-skill/references/description-optimization.md b/.agents/skills/create-skill/references/description-optimization.md index 51a60ef..518445e 100644 --- a/.agents/skills/create-skill/references/description-optimization.md +++ b/.agents/skills/create-skill/references/description-optimization.md @@ -1,12 +1,26 @@ # Description Optimization -Use this reference when optimizing a skill's frontmatter description for trigger accuracy. +Use this reference when optimizing a skill's frontmatter `description` for trigger accuracy. -The `description` field is the main signal native skill runtimes use to decide whether to invoke a skill. Optimize it after the skill behavior is stable. +The description is the main signal native skill runtimes use to decide whether to invoke a skill. Optimize it after the skill behavior is stable; otherwise the trigger will faithfully route users into a moving target. -## Choose the Agent Adapter +## Start With Trigger Behavior -Infer the calling agent from the current session or CLI. Use native trigger detection when the agent exposes it. Otherwise, use routing-judgment evals. +Before changing words, define what the description must separate. + +Write down: + +- **Should trigger:** realistic prompts where this skill should help +- **Should not trigger:** near misses that mention similar words but need another skill or no skill +- **Ambiguous cases:** prompts where the right answer depends on missing context + +This prevents the common failure mode: adding keywords from one missed prompt and accidentally widening the trigger for everything else. 
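As a rough illustration of the three buckets above, the sketch below turns a written-down worksheet into labeled trigger evals. The `query` and `should_trigger` keys and the `to_trigger_evals` helper are assumptions made for this example, not a confirmed schema; ambiguous prompts are deliberately held back until someone decides how to label them.

```python
import json

# Hypothetical worksheet: the three buckets written down before editing the description.
buckets = {
    "should_trigger": [
        "Can you turn data/q4-orders.csv into an .xlsx workbook with formulas and a summary chart?",
    ],
    "should_not_trigger": [
        "Summarize the CSV schema in prose; do not create or edit a workbook.",
    ],
    "ambiguous": [
        "Clean up the orders file.",  # illustrative: unclear whether a workbook is wanted
    ],
}


def to_trigger_evals(buckets):
    """Label the resolved prompts; leave ambiguous ones out until they are decided."""
    evals = [{"query": q, "should_trigger": True} for q in buckets["should_trigger"]]
    evals += [{"query": q, "should_trigger": False} for q in buckets["should_not_trigger"]]
    return evals


trigger_evals = to_trigger_evals(buckets)
unresolved = buckets["ambiguous"]

print(json.dumps(trigger_evals, indent=2))
print(f"{len(unresolved)} ambiguous prompt(s) still need a label")
```

Keeping the ambiguous bucket separate makes it visible when a label was guessed rather than agreed on.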
+ +--- + +## Choose The Agent Adapter + +Use the adapter that matches the calling agent's normal behavior. If the runtime exposes native trigger detection, use it. Otherwise, use routing-judgment evals. Examples: @@ -22,13 +36,23 @@ python -m scripts.run_loop \ --agent my-agent ``` -For CLIs with unusual invocation shapes, pass `--agent-command` with `{prompt}` or `{prompt_file}` placeholders. Do not override the model unless the user explicitly asks; the eval should match the agent's normal behavior. +For CLIs with unusual invocation shapes, pass `--agent-command` with `{prompt}` or `{prompt_file}` placeholders: + +```bash +python -m scripts.run_loop \ + --eval-set /evals/trigger-evals.json \ + --skill-path \ + --agent custom \ + --agent-command "agent run --input {prompt_file}" +``` + +Do not override the model unless the user explicitly asks. The eval should match the agent's real operating conditions. --- ## Create Trigger Evals -Create about 20 realistic queries, split between should-trigger and should-not-trigger cases. Use concrete prompts that resemble real user requests, including file paths, domain details, typos, abbreviations, and ambiguous phrasing. +Create about 20 realistic queries split between should-trigger and should-not-trigger cases. Use concrete prompts that resemble real user requests: file paths, domain details, typos, abbreviations, shorthand, and ambiguous phrasing. Positive cases should cover varied ways users ask for the skill's core capability. Negative cases should be near misses, not obviously irrelevant prompts. @@ -41,21 +65,51 @@ Save them as: ] ``` +### Weak vs strong eval prompts + +Weak positive: + +```text +Use the spreadsheet skill. +``` + +Strong positive: + +```text +Can you turn data/q4-orders.csv into an .xlsx workbook with formulas and a summary chart? +``` + +Weak negative: + +```text +Tell me a joke. +``` + +Strong negative: + +```text +Summarize the CSV schema in prose; do not create or edit a workbook. 
+``` + +The strong cases test boundaries. The weak cases mostly test whether the word "spreadsheet" exists. + --- -## Review the Eval Set +## Review The Eval Set -When possible, present the eval set to the user before running optimization. Use `assets/eval_review.html` by replacing: +When possible, show the eval set to the user before running optimization. People spot mislabeled near misses faster than benchmark charts do. -- **Eval data**: replace `__EVAL_DATA_PLACEHOLDER__` with the JSON array -- **Skill name**: replace `__SKILL_NAME_PLACEHOLDER__` with the skill name -- **Current description**: replace `__SKILL_DESCRIPTION_PLACEHOLDER__` with the current description +Use `assets/eval_review.html` by replacing: + +- **Eval data:** replace `__EVAL_DATA_PLACEHOLDER__` with the JSON array +- **Skill name:** replace `__SKILL_NAME_PLACEHOLDER__` with the skill name +- **Current description:** replace `__SKILL_DESCRIPTION_PLACEHOLDER__` with the current description The user can edit queries and export the final eval set. --- -## Run the Optimization Loop +## Run The Optimization Loop Run: @@ -71,10 +125,27 @@ python -m scripts.run_loop \ The loop splits train and held-out test data, evaluates the current description, proposes revisions, and selects `best_description` by held-out test score. -Apply the best description to `SKILL.md`, then report the before/after and scores. Keep the updated metadata under 100 tokens. +Apply the best description to `SKILL.md`, then report: + +- **Old description:** the previous frontmatter text. +- **New description:** the applied replacement text. +- **Scores:** train and held-out results. +- **Misses:** notable false positives and false negatives. +- **Generalization:** why the new wording should hold beyond the train set. + +Keep the updated metadata under 100 tokens. --- -## Triggering Notes +## Interpret Results + +Do not chase a perfect score blindly. A worse held-out score means the change probably overfit the train set. 
A better score with new false positives may still be unacceptable if those false positives trigger an expensive or risky workflow. + +Common fixes: + +- **False negatives:** add missing intent phrases or artifact types +- **False positives:** add narrower action verbs, required context, or exclusions +- **High variance:** add repeated runs or simplify ambiguous eval labels +- **Overlong description:** remove examples and move detailed routing into the body or references -Agents may skip a skill for simple tasks they can handle directly. Trigger eval prompts should be substantive enough that a specialized skill would help. Tiny prompts like "read this file" are poor trigger tests even if the skill technically could help. +Tiny prompts like "read this file" are poor trigger tests even if the skill technically could help. Trigger eval prompts should be substantive enough that a specialized skill would actually add value. diff --git a/.agents/skills/create-skill/references/evaluation.md b/.agents/skills/create-skill/references/evaluation.md index 6087bb6..e8102b5 100644 --- a/.agents/skills/create-skill/references/evaluation.md +++ b/.agents/skills/create-skill/references/evaluation.md @@ -14,6 +14,137 @@ Start with prompt-level expectations. Add objective assertions after the test se --- +## Assertion Design + +Read this when drafting assertions for a skill's `evals/evals.json`. It answers one question: **what makes an assertion actually useful?** + +### Start with the contract, not the test cases + +Before writing a single assertion, extract the skill's contract from `SKILL.md`: + +- **What does it produce?** File format, structure, and content type. +- **What inputs does it consume?** Files, prompt content, or structured data. +- **What does it explicitly promise?** For example, "always produces a two-sheet workbook" or "never modifies the input file". 
+- **What would a user reasonably assume even if unstated?** Values are accurate, nothing is hallucinated, and the output is complete. + +This is the ground truth. Every assertion should trace back to something in the contract. + +### Enumerate failure modes first + +Don't jump to assertions. Reason about what can go wrong first: + +| Failure mode | What it looks like | +| --- | --- | +| **Structural** | Wrong file type, corrupt output, unparseable JSON, missing required sheets | +| **Completeness** | Missing rows, fields, or sections; the output exists but is partial | +| **Accuracy** | Wrong values; hallucinated, miscalculated, or copied incorrectly from the input | +| **Fidelity** | Input data was transformed unintentionally; reworded, rounded, or reformatted | +| **Contamination** | Placeholders, leftover content from a previous run, apology text, or defaults | +| **Process** | The skill's documented steps were skipped in favor of improvisation | + +Each assertion should target at least one of these. If you can't name which failure mode an assertion catches, it is probably not worth including. + +### Cover the input space with equivalence classes + +Write at least one eval per useful class rather than many similar prompts: + +| Class | Purpose | +| --- | --- | +| `smoke` | Simplest possible input; catches total breakdowns | +| `happy_path` | Realistic, typical use case; the core regression test | +| `complex` | High-volume or multi-part input; catches partial completion and off-by-one errors | +| `edge` | Boundary condition the skill implies but does not explicitly handle | +| `invalid` | Malformed, missing, or contradictory input; catches error handling | + +Skip classes that do not apply. For skills with file inputs, use real files with known content. Do not write accuracy assertions unless you know the correct answer before running. 
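A minimal sketch of a coverage check for the equivalence classes above. It assumes each eval carries a hypothetical `class` key; nothing in the eval schema shown here confirms that field, so treat it as an illustration of the idea rather than the skill's tooling.

```python
from collections import Counter

# Class names come from the table above. Drop any that do not apply to the skill.
APPLICABLE_CLASSES = frozenset({"smoke", "happy_path", "complex", "edge", "invalid"})


def class_coverage(evals, applicable=APPLICABLE_CLASSES):
    """Return (missing classes, per-class counts) for a list of eval dicts.

    Assumes a hypothetical "class" key on each eval.
    """
    counts = Counter(e.get("class") for e in evals)
    missing = sorted(c for c in applicable if counts[c] == 0)
    return missing, counts


evals = [
    {"id": 1, "class": "smoke"},
    {"id": 2, "class": "happy_path"},
    {"id": 3, "class": "happy_path"},
    {"id": 4, "class": "invalid"},
]

missing, counts = class_coverage(evals)
repeated = sorted(c for c, n in counts.items() if n > 1)

print("classes with no eval:", missing)
print("classes with several evals (check for near-duplicates):", repeated)
```

A repeated class is not automatically a problem; it is a prompt to confirm that the evals in it exercise different inputs.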
+ +### Write discriminating assertions + +An assertion that always passes is worse than no assertion; it creates false confidence. Every assertion should be hard to satisfy accidentally. + +**Structural** assertions check output shape, regardless of content: + +> `"The output file has the extension .docx"` +> `"The spreadsheet contains exactly two sheets named 'Summary' and 'Line Items'"` + +**Completeness** assertions check that required content is present: + +> `"Sheet 'Line Items' has exactly 7 data rows, one per item in the input"` +> `"No cell in column B is empty in rows 2 through 9"` + +**Accuracy** assertions check values against ground truth: + +> `"Cell B8 contains 3240.00, the correct subtotal of the 7 line items"` +> `"The vendor name in A1 is exactly 'Apex Industrial S.A.', copied from the PDF header"` + +**Fidelity** assertions check that input data was preserved: + +> `"Product descriptions in column A match the input verbatim, not paraphrased"` +> `"Numeric values are not rounded; '1247.50' appears, not '1248'"` + +**Negative** assertions check things that must not be present: + +> `"No cell contains '[PLACEHOLDER]' or is empty in the required range"` +> `"The total in B10 does not equal the subtotal in B8; tax was applied"` +> `"The word 'approximately' does not appear in any numeric field"` +> `"No text matching 'I was unable to' appears in the output file"` + +Every eval with objective checks should include at least one negative assertion. These catch cases where the model attempted the task but gave up, left defaults, or produced a plausible-looking wrong result. + +### Use `[MUST]` for blockers + +Some assertions represent total failure. If they do not pass, the output is worthless regardless of what else passes. Prefix these with `[MUST]`: + +> `"[MUST] The output file has the extension .xlsx"` +> `"[MUST] Cell B10 contains the correct grand total"` + +Do not overuse this. One or two per eval is usually enough. 
Treat `[MUST]` as a grading signal; do not imply deterministic tooling enforces it unless the grader or benchmark script actually does. + +### Use the discrimination checklist + +Before keeping each assertion, ask: + +1. Could this pass on hallucinated output? If yes, anchor it to a specific value from a specific input. +2. Could this pass on an empty file? If yes, add a content assertion alongside it. +3. Does another assertion in this eval target the same failure mode? If yes, merge or drop one. +4. What would have to go wrong for this to fail? If you can't name it, the assertion is too vague. + +### What good looks like + +**Skill:** extract invoice data from a PDF and write it to an Excel spreadsheet. + +Weak assertions: + +```text +"A .xlsx file was created" # passes if file is empty +"The file contains invoice data" # passes if one cell says "invoice" +"The total is calculated" # passes if any formula exists +``` + +Strong assertions for an eval with a known 7-item input totaling `$3,758.40`: + +```text +"[MUST] The output file has the extension .xlsx" +"[MUST] Cell B10 contains 3758.40, the correct grand total including 16% VAT" +"Sheet 'Line Items' has exactly 7 data rows matching the input" +"The vendor name in A1 is exactly 'Apex Industrial S.A.', from the PDF header" +"No cell in column B contains a string; all values are numeric" +"Cell B10 does not equal cell B8; VAT was applied, not just the subtotal" +"No cell in the required range contains '[PLACEHOLDER]' or is empty" +``` + +Each assertion catches a different failure. If B10 equals B8, the VAT step was skipped. If column B has strings, the model wrote `"3,240.00"` instead of a number. If there are 8 rows, a line item was hallucinated. None of these pass on a plausible-looking wrong output. 
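To make the contrast concrete, the sketch below replays a weak assertion and four of the strong assertions against three stand-in outputs. The figures (3240.00 subtotal, 16% VAT, 3758.40 grand total, 7 rows) come from the example above; the dict-of-cells representation and the function names are assumptions for illustration, not a real spreadsheet reader or grader.

```python
from decimal import Decimal

# Figures taken from the worked example above; the cell layout is a stand-in dict.
VAT_RATE = Decimal("0.16")
SUBTOTAL = Decimal("3240.00")
GRAND_TOTAL = Decimal("3758.40")

# Sanity check on the ground truth itself: 3240.00 plus 16% VAT is 3758.40.
assert SUBTOTAL * (1 + VAT_RATE) == GRAND_TOTAL


def weak_check(cells):
    """'The total is calculated': passes as soon as B10 holds anything."""
    return cells.get("B10") is not None


def strong_checks(cells, data_rows):
    """Evaluate four of the strong assertions; returns {assertion name: passed}."""
    b_values = [v for ref, v in cells.items() if ref.startswith("B")]
    return {
        "B10 is the correct grand total": cells.get("B10") == GRAND_TOTAL,
        "B10 differs from B8": cells.get("B10") != cells.get("B8"),
        "exactly 7 data rows": data_rows == 7,
        "column B is numeric": all(isinstance(v, Decimal) for v in b_values),
    }


def failed(results):
    """Names of the assertions that did not pass."""
    return sorted(name for name, ok in results.items() if not ok)


correct = {"B8": SUBTOTAL, "B10": GRAND_TOTAL}
vat_skipped = {"B8": SUBTOTAL, "B10": SUBTOTAL}        # plausible but wrong
string_total = {"B8": "3,240.00", "B10": "3,758.40"}   # numbers written as text

for name, cells in [("correct", correct), ("vat_skipped", vat_skipped), ("string_total", string_total)]:
    print(name, "| weak passes:", weak_check(cells), "| strong failures:", failed(strong_checks(cells, 7)))
```

The weak check passes on all three outputs, while each wrong output trips a different pair of strong assertions, which is the discrimination the checklist asks for.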
+ +### Use process assertions sparingly + +Process assertions check that the documented steps were followed, not just that the output looks right: + +> `"The skill's bundled script 'extract_fields.py' was called during execution"` + +Use these when the skill has a specific required tool or script. A model that produces the right output by improvising a different approach may fail on harder inputs, and process assertions catch that early. One process assertion is usually enough. + +--- + ## Run Iterations Put run results in `/evals/iterations/iteration-N/`. Each test case gets its own directory. For each case, save the prompt, generated outputs, timing when available, and grading results. @@ -26,7 +157,7 @@ When subagents are available, launch skill-enabled and baseline runs in the same ## Assertions and Grading -Good assertions are objective, specific, and named clearly enough to make benchmark output readable. Do not force quantitative assertions onto outputs that require human judgment. +Good assertions are objective, specific, contract-derived, and named clearly enough to make benchmark output readable. Do not force quantitative assertions onto outputs that require human judgment. Grade each run using `agents/grader.md` or a deterministic script. Save `grading.json` with expectation objects that use exactly `text`, `passed`, and `evidence`. diff --git a/.agents/skills/create-skill/references/review.md b/.agents/skills/create-skill/references/review.md index b1ff00a..6109944 100644 --- a/.agents/skills/create-skill/references/review.md +++ b/.agents/skills/create-skill/references/review.md @@ -1,69 +1,144 @@ # Skill Review Checklist -Use this reference when reviewing a created or revised skill folder. Treat the skill as executable guidance for another agent: review whether it will trigger at the right time, load the right amount of context, and produce reliable behavior. +Use this reference when reviewing a created or revised skill folder. 
Treat the skill as executable guidance for another agent: it should trigger at the right time, load the right amount of context, and produce reliable behavior. + +## Start With The Contract + +Before listing findings, identify what the skill is promising. + +A good review starts from four questions: + +1. **When should this skill activate?** Read the frontmatter `description` as the trigger contract. +2. **What work does it perform?** Read the workflow, output, and boundaries as the behavior contract. +3. **What extra context does it load?** Check references, scripts, assets, and evals. +4. **How would we know it worked?** Check validation, evals, and expected output shape. + +If those four answers do not line up, that is usually the core review finding. + +--- ## Trigger Description -Check whether the frontmatter `description` is a useful trigger signal. +The `description` is the main activation signal. It should describe the user intent, not the skill's internal mechanics. + +Good trigger descriptions usually say "Use when..." or "Use for..." and name the strongest contexts where the skill should help. They include representative user phrases only when those phrases improve routing. They avoid keyword stuffing because broad keywords create false positives on adjacent work. + +### What to check + +- **Intent phrasing:** Does the description name the task the agent should perform? +- **Trigger contexts:** Does it cover the real situations where the skill should activate, including cases where the user may not name the domain directly? +- **Near misses:** Does it avoid claiming adjacent tasks the skill cannot actually handle? +- **Length:** Is it under the 1024-character hard limit? +- **Skill value:** Would this skill help on tasks that require domain knowledge, project conventions, non-obvious APIs, or special workflows? -- **Intent phrasing**: use imperative, intent-focused language such as "Use when..." or "Use for..." 
rather than only describing internal mechanics. -- **Trigger contexts**: name the core user intents and strongest trigger contexts. Be proactive when the skill applies even if the user does not name the domain directly, such as when they omit an obvious domain keyword. -- **Near misses**: avoid broad keyword stuffing that would trigger on adjacent tasks the skill does not actually handle. -- **Hard limit**: stay under the **1024-character hard limit** enforced by the spec. Check the character count directly; detailed routing, exclusions, and examples belong in the body or references. -- **Skill value**: remember that agents tend to reach for skills only when a task requires knowledge or capabilities beyond what they can handle alone. Weight eval queries toward specialized knowledge, unfamiliar APIs, or domain-specific workflows. -- **Trigger evals**: check for realistic should-trigger and should-not-trigger prompts. Strong should-trigger cases are ones where the skill would help but the connection is not obvious; strong should-not-trigger cases are near misses. Vary phrasing, explicitness, detail level, and complexity. +### Weak vs strong + +Weak: + +```text +Use for documents. +``` + +Strong: + +```text +Use when creating, editing, rendering, or visually verifying .docx files, Word-style documents, or Google Docs-targeted document artifacts. +``` + +The strong version names actions, artifacts, and activation cues. The weak version is a fog machine with a YAML header. --- ## Coherence And Boundaries -Check whether the skill covers a coherent unit of work and adds genuine value. +A skill should cover one coherent unit of work. If it combines unrelated jobs, it becomes hard to trigger precisely and easy to misuse. + +Flag a skill when it: + +- **Mixed capabilities:** it combines unrelated work that needs different triggers or workflows. +- **Over-narrow scope:** it forces several skills to activate for one normal user task. 
+- **Generic value:** it adds advice the base agent already knows instead of project-specific or domain-specific value. +- **Misplaced scope:** it puts activation criteria in a body `## Scope` section instead of the frontmatter `description`. +- **Menu defaults:** it presents equal-choice menus where the skill should provide a default and name alternatives as escape hatches. + +### Useful test + +Ask: "Could I explain this skill's purpose in one sentence without using `and then also`?" -- **Added value**: ask whether the skill adds what the agent *lacks*, such as project-specific conventions, domain procedures, non-obvious edge cases, or particular tools and APIs. -- **Mixed jobs**: flag skills that combine unrelated work, because they become hard to trigger precisely and can load conflicting instructions. -- **Over-narrow boundaries**: flag skills that are so narrow they force several skills to activate for one normal user task. -- **Body scope sections**: flag `## Scope` sections that describe activation criteria. Skill-call scope belongs in the frontmatter `description`; body sections should cover workflow, boundaries, routing, and output rules. -- **Defaults first**: prefer defaults over menus when the skill names tools, formats, or procedures. Alternatives should be escape hatches, not equal-choice catalogs. +If not, the skill probably needs a router shape, a narrower trigger, or a split. --- ## Instruction Quality -Check whether the body teaches a reusable procedure rather than a one-off answer. +The body should teach a reusable procedure, not answer one past request. -- **Generalizable method**: describe how to approach a *class* of problems, not what to produce for a specific instance. -- **Purpose over procedure**: where multiple approaches are valid, explain *why* over prescribing *how*. Be prescriptive when operations are fragile, consistency matters, or a specific sequence must be followed. -- **Right-sized detail**: watch for overcomprehensiveness. 
When in doubt, cut and let the agent use its own judgment. -- **Ordered workflow**: make workflows stepwise, ordered, and validation-aware when the task has dependencies or failure modes. -- **Templates**: provide output templates when format consistency matters. Short templates can live inline; longer or conditional ones belong in `assets/`. -- **Section boundaries**: require standalone `---` delimiters between `##` sections in `SKILL.md`, without changing YAML frontmatter delimiters or adding an extra delimiter before the `#` title. -- **Execution-first order**: the first body section should usually be `## Workflow`, `## Source Handling`, or `## Route the Work`, not `## Boundaries`. Boundary-first order needs a concrete safety or destructive-action reason. -- **Scan anchors**: require bold labels inside steps or bullets when they make distinct actions, fields, or rules easier to scan. Do not require a bold principle sentence after each `##` heading. -- **Gotchas**: capture non-obvious mistakes an agent is likely to make, not generic advice like "handle errors appropriately." -- **Sensitive behavior**: make security-sensitive or destructive behavior explicit, expected by the user, and bounded by the skill description. +Good instructions describe how to approach a class of tasks. They explain why constraints matter when the agent needs judgment, and they prescribe exact steps when order, safety, or consistency matters. + +Look for: + +- **Generalizable method:** The workflow applies beyond one example. +- **Right-sized detail:** The skill gives enough guidance to prevent common mistakes without becoming a manual for every possible case. +- **Ordered workflow:** Dependent steps appear in execution order and include validation points. +- **Templates:** Output templates appear when format consistency matters. +- **Gotchas:** Non-obvious mistakes are named directly. 
+- **Sensitive behavior:** Destructive or security-sensitive actions are explicit, bounded, and expected from the description. + +### Formatting checks + +For `SKILL.md`, require standalone `---` delimiters between `##` sections. Do not add an extra delimiter between YAML frontmatter and the `#` title. + +The first body section should usually be `## Workflow`, `## Source Handling`, or `## Route the Work`. Put boundaries first only when safety or destructive behavior must be checked before any action. + +Use bold scan anchors when they help a reader skim distinct actions or fields. Do not require a bold principle sentence after every heading. --- ## Progressive Disclosure -Check whether context is spent deliberately. +A skill spends context in three layers: metadata, the main body, and bundled resources. + +`SKILL.md` should contain the routing logic and shared rules needed on every run. Deep details belong in `references/`, `assets/`, or `scripts/` only when a workflow actually uses them. -- **Top-level focus**: `SKILL.md` should contain the routing logic and shared rules needed on every run. The spec recommends keeping it under **500 lines and 5,000 tokens**. -- **Resource purpose**: deep details should live in `references/`, `assets/`, or `scripts/` only when they are actually used. -- **Load conditions**: references must be loaded by clear conditions, not vague "read everything" guidance. -- **Metadata references**: verify `metadata.references` lists only local skills or rules used inside the workflow. Remove route-away, adjacent-skill, near-miss, exclusion, and boundary mentions. -- **Unused folders**: placeholder folders or unused resources are review issues when they make the skill harder to understand or maintain. +Check that: + +- **Load conditions:** references are loaded by clear conditions, not vague "read everything" instructions. +- **Metadata references:** `metadata.references` lists only local skills or rules used in the workflow. 
+- **Metadata exclusions:** route-away, adjacent-skill, near-miss, exclusion, and boundary mentions are not listed as metadata references. +- **Resource hygiene:** placeholder folders and unused resources are removed. --- ## Validation And Evals -Check whether the skill can be tested and improved systematically. +Review whether the skill can be tested and improved systematically. + +- **Trigger coverage:** For trigger-sensitive skills, evals should include should-trigger and should-not-trigger cases. Aim for roughly 20 trigger queries, with 8-10 per side. Good positive cases are not just obvious keyword matches; they are realistic requests where the skill would help. Good negative cases are near misses. +- **Trigger rate:** Run each trigger query multiple times when the runtime is nondeterministic, then review the trigger rate instead of a single pass. A should-trigger query passes when its trigger rate is at least 0.5; a should-not-trigger query passes when its trigger rate is below 0.5. Keep train and validation splits fixed across iterations so changes are comparable. +- **Objective checks:** For behavior-sensitive skills, evals should include objective checks when practical: validators, schemas, deterministic scripts, known fixture outputs, or acceptance checks. + +### What good eval coverage catches + +- **False positives:** the skill triggers on adjacent work. +- **False negatives:** the skill does not activate when it should. +- **Skipped steps:** the workflow collapses under pressure. +- **Format drift:** outputs stop matching the promised shape. +- **Overfitting:** revisions copy one prompt instead of fixing the category. +- **Brittleness:** edge or invalid inputs break the workflow. + +If results stall, inspect the eval set before rewriting the skill. The queries may be too easy, too hard, mislabeled, or too repetitive to reveal useful signal. + +--- + +## Review Output + +Lead with findings. Put summaries after the issues, not before them.
+ +Each finding should include: + +- **Severity:** how much it affects triggering, correctness, safety, or maintainability +- **Evidence:** specific file and line reference when possible +- **Impact:** what can go wrong in agent behavior +- **Fix direction:** the smallest useful change -- **Trigger sets**: split trigger-sensitive eval queries into **should-trigger** and **should-not-trigger** sets. Aim for roughly 20 queries total, with 8-10 per side. -- **Repeated runs**: run each query multiple times and compute a **trigger rate**. A should-trigger query passes at a rate >= 0.5; a should-not-trigger query passes at a rate < 0.5. -- **Train and validation**: use about 60% as a **train set** and hold out about 40% as a **validation set**. Both sets should have a proportional mix of positive and negative cases. Keep the split fixed across iterations. -- **Avoid overfitting**: fix the *category* a failing query represents; do not copy keywords from the failing prompt into the description. -- **Objective checks**: include validators, scripts, schemas, or acceptance checks where practical. -- **Nondeterminism**: prefer repeated runs or trigger-rate style evaluation over a single pass. -- **Stalled results**: if performance is not improving, consider whether the queries are too easy, too hard, or poorly labeled. +If there are no findings, say that clearly and name any remaining test gaps or residual risk.
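One way to picture the output shape described above is the sketch below. The `Finding` dataclass, the three-level severity scale, and `render_review` are hypothetical names chosen for this example; the reference itself only fixes the four fields and the findings-first order.

```python
from dataclasses import dataclass

# Hypothetical severity scale; the checklist does not prescribe specific levels.
SEVERITY_ORDER = {"high": 0, "medium": 1, "low": 2}


@dataclass
class Finding:
    severity: str       # how much it affects triggering, correctness, safety, or maintainability
    evidence: str       # file and line reference when possible
    impact: str         # what can go wrong in agent behavior
    fix_direction: str  # the smallest useful change


def render_review(findings, summary, residual_risk="none identified"):
    """Render findings first (most severe on top), then the summary."""
    if not findings:
        return f"No findings. Remaining test gaps or residual risk: {residual_risk}."
    ordered = sorted(findings, key=lambda f: SEVERITY_ORDER[f.severity])
    lines = [
        f"[{f.severity}] {f.evidence}: {f.impact} Fix: {f.fix_direction}"
        for f in ordered
    ]
    lines.append(f"Summary: {summary}")
    return "\n".join(lines)


report = render_review(
    [
        Finding("low", "references/review.md", "Unused placeholder folder.", "Remove it."),
        Finding("high", "SKILL.md frontmatter", "Description claims adjacent tasks.", "Narrow the trigger."),
    ],
    summary="Two findings; trigger wording is the main risk.",
)
print(report)
print(render_review([], summary="", residual_risk="no invalid-input evals yet"))
```

Sorting by severity before rendering keeps the most consequential issue at the top, and the empty case still names the remaining gaps instead of returning nothing.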