Autoresearch

An LLM-as-judge optimizer for Claude Code skill files. It validates judge reliability, runs greedy mutation experiments, and reports stability-tested results with z-stat verdicts.

Real result: the cd:plan skill improved from 0.388 to 1.000 on holdout inputs (z=19.56, stdev=0.000 across 5 rounds).

Walkthrough: coherencedaddy.com/tutorials/autoresearch — 11-slide visual tutorial.

Install

As a Claude Code plugin

/plugin marketplace add Coherence-Daddy/autoresearch
/plugin install autoresearch@autoresearch
/reload-plugins

As a standalone CLI

git clone https://github.com/Coherence-Daddy/autoresearch
cd autoresearch
python3 -m venv .venv && source .venv/bin/activate
pip install -e .
export ANTHROPIC_API_KEY=sk-ant-...

Three commands

autoresearch validate my-config.yaml             # check judge reliability
autoresearch optimize my-config.yaml             # run mutation loop
autoresearch compare my-config.yaml a.md b.md \
  --holdout --repeat 5                            # head-to-head with z-stat
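
This README does not say how the z-stat from compare is computed. As one way to read such a number, the sketch below uses a standard two-proportion z-test on binary pass/fail counts. The function name and the example counts are hypothetical, and the formula autoresearch actually uses may differ.

```python
import math


def two_proportion_z(passes_a: int, total_a: int, passes_b: int, total_b: int) -> float:
    """z-statistic for the pass-rate difference (B minus A), using a pooled standard error.

    Assumption: this is a textbook two-proportion z-test, not necessarily
    the statistic that autoresearch reports.
    """
    if total_a <= 0 or total_b <= 0:
        raise ValueError("both totals must be positive")
    rate_a = passes_a / total_a
    rate_b = passes_b / total_b
    pooled = (passes_a + passes_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if std_err == 0:
        # Both variants passed everything or nothing: no measurable difference.
        return 0.0
    return (rate_b - rate_a) / std_err


if __name__ == "__main__":
    # Hypothetical counts: variant A passes 40 of 100 binary checks, B passes 70 of 100.
    print(round(two_proportion_z(40, 100, 70, 100), 2))
```

A large positive z means variant B's higher pass rate is unlikely to be sampling noise. Values near zero mean the two files are indistinguishable on the given inputs.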

How it works

Validate the judge: Fleiss' kappa measures agreement across K reruns of the judge, and Cohen's kappa measures agreement with hand grading. If both are ≥ 0.7, the judge is treated as trustworthy.
Optimize: Opus mutates the skill, Haiku scores it on train inputs, and the loop keeps changes that improve the score and discards the rest.
Compare: --repeat 5 runs the comparison five times to average out generator-temperature noise, then reports mean, stdev, and a z-stat.
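
To make the validation step concrete, here is a minimal sketch of the two agreement statistics and the 0.7 gate, written with the standard library only. The function names are illustrative and are not taken from the autoresearch source. The kappa formulas are the textbook definitions.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa between two raters (e.g. judge vs. hand grading) on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def fleiss_kappa(ratings):
    """Fleiss' kappa; ratings[i] holds the K labels that K judge reruns gave item i."""
    if not ratings:
        raise ValueError("need at least one rated item")
    k = len(ratings[0])
    if k < 2 or any(len(item) != k for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")
    label_totals = Counter()
    agreement_sum = 0.0
    for item in ratings:
        counts = Counter(item)
        label_totals.update(counts)
        # Fraction of rater pairs that agree on this item.
        agreement_sum += (sum(c * c for c in counts.values()) - k) / (k * (k - 1))
    mean_agreement = agreement_sum / len(ratings)
    chance = sum((t / (len(ratings) * k)) ** 2 for t in label_totals.values())
    if chance == 1.0:
        return 1.0  # every rerun gave the same single label to every item
    return (mean_agreement - chance) / (1 - chance)


def judge_is_trustworthy(fleiss, cohen, threshold=0.7):
    """Apply the rule stated above: both statistics must reach the threshold."""
    return fleiss >= threshold and cohen >= threshold
```

For binary evals, the labels are simply True/False (or 1/0) verdicts. The Fleiss input has one row per graded output and one column per rerun.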

Eval design rules

Keep evals binary (yes/no, not scored 1-10).
At least one eval must be locally satisfiable — pass/fail determinable from a short output fragment. Pure global properties give the mutator no gradient.
If the baseline scores 1.000, there's nothing to optimize. Make the evals harder.
If the baseline scores 0.000, the evals or prompts are broken. Fix those first.
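
The rules above can be sketched as a small scoring-and-gating routine feeding a greedy loop. This is an illustration under stated assumptions, not the autoresearch implementation: pass_rate, check_baseline, and greedy_optimize are hypothetical names, and the mutate and score callbacks stand in for the Opus mutator and the Haiku judge.

```python
from typing import Callable, Sequence, Tuple


def pass_rate(verdicts: Sequence[bool]) -> float:
    """Score a skill as the fraction of binary (yes/no) evals that passed."""
    if not verdicts:
        raise ValueError("need at least one eval verdict")
    return sum(verdicts) / len(verdicts)


def check_baseline(score: float) -> str:
    """Gate the run on the baseline score, following the rules above."""
    if score >= 1.0:
        return "saturated"  # nothing to optimize: make the evals harder
    if score <= 0.0:
        return "broken"     # evals or prompts are likely broken: fix those first
    return "ok"


def greedy_optimize(
    skill: str,
    mutate: Callable[[str], str],
    score: Callable[[str], float],
    rounds: int,
) -> Tuple[str, float]:
    """Keep a mutation only if it strictly improves the score; discard it otherwise."""
    best, best_score = skill, score(skill)
    for _ in range(rounds):
        candidate = mutate(best)
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best, best_score


if __name__ == "__main__":
    # Toy stand-ins: each "eval" checks for one keyword, so every eval is
    # locally satisfiable and a single mutation can move the score.
    required = ["goal", "steps", "risks"]

    def toy_score(text: str) -> float:
        return pass_rate([word in text for word in required])

    def toy_mutate(text: str) -> str:
        missing = [word for word in required if word not in text]
        return f"{text} {missing[0]}" if missing else text

    baseline = "goal"
    print(check_baseline(toy_score(baseline)))
    print(greedy_optimize(baseline, toy_mutate, toy_score, rounds=5))
```

The toy evals show why local satisfiability matters: each keyword check can flip from fail to pass after one small edit, so the greedy loop always has an improving step available. An eval that only passes when the whole output is right would leave every intermediate candidate tied with the baseline.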

License

MIT
