The docs said yes. The code said no. Only one of them gets to be true.
Cites the line or says nothing · rewrites nothing unasked · any language
A documentation-freshness reflex for AI coding agents.
Your README was true the day you wrote it. Then a flag got renamed, a file moved, a function started returning something else — and the docs stayed exactly where they were. That's how documentation lies: not by being wrong when written, by being left behind. The gap opens quietly and nobody sees it until someone pastes a command that no longer exists.
Evergreen is the reflex that closes the gap. The moment your agent touches code, it reads the affected docs back against the source and surfaces only what it can prove has gone false — pointing at the exact line. It rewrites nothing on its own. It just refuses to let the docs and the code disagree in silence.
You rename a flag and move on. Three files still document the old name. Nobody notices until someone copies a broken command.
With evergreen, in the same turn:
evergreen: you renamed --workers to --concurrency.
README.md:42 documents --workers — gone from cli.py → fix
docs/cli.md:8 same flag, same fix
left alone: docs/adr/0003.md mentions --workers — an ADR, frozen in time.
It cites the line or it says nothing. And it leaves the docs that are meant to describe the past — ADRs, specs, dated snapshots — alone. They lead the code; they don't lie about it.
More of what it catches, one per rung, in examples/.
When code changes, it stops at the first rung that catches:
1. A doc names a file that's gone? → grep, confirm, flag
2. A documented flag / env / route gone? → grep the code, flag
3. A shown snippet drifted from source? → read both, compare
4. Does the prose still tell the truth? → only then, reason
One rule above all: prove it or drop it. If it can't cite the code that makes the doc wrong, it isn't a finding. A checker that cries wolf gets muted — so this one never does.
That rule applies to evergreen itself. The eval seeds a fixture repo with catalogued lies, true claims that must not be flagged, and exempt docs, then lets a headless agent winnow it blind. The per-pair harness (eval/bench/) runs the judge over labeled code/doc pairs.
Against that corrected judge, the Python set (CoDocBench, three-LLM–validated labels) reduces to one confusion matrix over 332 pairs — 9 real drifts, 323 true claims:
| flagged | silent | |
|---|---|---|
| drift (9) | 8 | 1 |
| true (323) | 16 | 307 |
That's recall 0.89 (8/9) and specificity 0.95 (307/323) — it catches the drift and rarely cries wolf. Precision depends on how common drift is: 0.57 at a 10/90 mix, where the peer DocPrism reports 0.62. It rests on only 9 positives, so precision is the soft number — and the Java, TypeScript, Rust, and Go re-runs against this judge are still in flight. Full breakdown in eval/bench/.
/plugin marketplace add ChrisPachulski/evergreen
/plugin install evergreen@evergreen
It rides along every session: flags drift the moment a change leaves a doc lying, adds /evergreen:winnow, and leaves a quiet nudge if you changed code and forgot to look. Intensity is off | light | strict (default light). The truth reflex never blocks your commit — it flags, you decide. (The hygiene guard is the one exception, and it's the kind you want — see Commands.)
What it costs, since you count tokens: session start injects a ~40-line digest, not the full ruleset (that loads on demand), and the post-turn nudge fires once per new change — not on every turn while the tree sits dirty.
Want the check in CI too? Add the Action — it winnows the docs the PR's code touched and posts findings as a single comment. It never fails the build; it comments, you decide.
# .github/workflows/evergreen.yml
on: pull_request
permissions: { contents: read, pull-requests: write }
jobs:
docs:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
with: { fetch-depth: 0 }
- uses: ChrisPachulski/evergreen@v1
with:
anthropic_api_key: ${{ secrets.ANTHROPIC_API_KEY }}The whole skill is skills/evergreen/SKILL.md. Drop it into any skill-capable agent, or paste it into your system prompt. For Codex, Copilot, Gemini, and anything that reads AGENTS.md, the flat-prose ruleset already lives at the repo root.
That's it. From the next change on, the docs answer to the code.
Three axes — truth · craft · hygiene — one creed: prove it or drop it, you keep the final call.
| Command | What it does |
|---|---|
/evergreen [off | light | strict] |
Set the intensity for this repo. No argument reports the current one. |
/evergreen:winnow [base-ref] [--prove-by-test] |
Truth, deep. Walk every claim that changed since a ref and certify it true or surface it — silence means certified, not just "no lie found." Always strict. With --prove-by-test, behavioral claims that reading can't settle are settled by execution (write the test the doc implies, run it): fails → drift proven, passes → certified by test. |
/evergreen:flourish <file> [--all] [--manual] |
Craft. Rewrite an accurate-but-ugly doc to a gold standard (mined from 28 top READMEs), then prove every claim against the code. Emits a diff — never a silent overwrite. The only sanctioned prose-rewrite. |
/evergreen:cultivate [path] |
Hygiene. Local-only files leaking into git, gitignore gaps, AI-slop that shouldn't be tracked or public. Proposes untrack/ignore/delete — never auto. A commit-time guard backstops it (the one thing that blocks). |
Will it rewrite my prose?
Not unless you ask. The reflex points; you write — a dead flag or moved path it hands you a diff for, the why behind a design it won't touch. The one exception is /evergreen:flourish, invoked deliberately: it crafts a doc to the gold standard, then verifies its own rewrite against the code so it can't introduce a lie. Fact-checker by default; ghostwriter only on request — and one that cites its sources.
Won't it cry wolf?
It flags only what it can prove against the code. Git's flags, CSS variables, other repos' paths, your ADRs — not its business. Tell it to drop something once and it offers the .evergreen-ignore line that keeps it dropped in every session after.
Does it scale? It reads paths, contracts, and prose — not your AST. Any language, any repo, nothing to compile.
Why "evergreen"? A doc that stays true as the code grows is evergreen. Yours aren't. Yet.
Distilled from a survey of 309 repos — an idea mine, not a blueprint. The taxonomies and instincts behind the skill are credited to their sources in docs/DESIGN.md.
MIT. Keep the docs honest; do what you like with the code.