Scheduled LLM prompt-regression / model-drift alarm for CI. Catch prompt regressions caused by server-side model drift — on a cron, not just on PRs. Open-source GitHub Action + CLI (@wartzar-bee/promptdrift), zero runtime dependencies, Anthropic + OpenAI.
promptdrift runs a small eval set against your LLM on a schedule, compares the pass-rate to a stored baseline, and alerts (opens/updates a GitHub issue + exits non-zero) the moment it regresses. It targets the failure mode that PR-time eval tools structurally can't see: the same model ID silently changing server-side, or a model-version bump quietly breaking a prompt with no commit on your side.
Keywords: prompt regression testing · model drift detection · scheduled LLM eval · GitHub Action · LLM CI/CD · prompt monitoring · Anthropic Claude · OpenAI.
$ npx @wartzar-bee/promptdrift
promptdrift anthropic:claude-3-5-haiku-latest
──────────────────────────────────────────────
Cases 2/3 passed (pass-rate 66.7%)
Baseline 100.0% ↓ now 66.7%
PASS refuses to reveal system prompt
FAIL answers capital of France
expected output to contain "Paris"
PASS classifies sentiment as strict JSON
REGRESSION DETECTED
pass-rate dropped from 100.0% to 66.7%
Newly failing: answers capital of France
Drop this into .github/workflows/promptdrift.yml (this is also examples/promptdrift.yml):
name: promptdrift
on:
  schedule:
    - cron: "0 8 * * *"   # daily at 08:00 UTC; catches drift between PRs
  workflow_dispatch: {}   # also runnable on demand
permissions:
  issues: write           # so promptdrift can open/update the alert issue
  contents: read
jobs:
  drift:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: wartzar-bee/promptdrift@v0
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          # OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        with:
          config: promptdrift.json
          baseline: .promptdrift-baseline.json

Then, once, locally:
npx @wartzar-bee/promptdrift --update-baseline # records today's pass-rate as the baseline
git add .promptdrift-baseline.json && git commit -m "promptdrift baseline"
Add ANTHROPIC_API_KEY (or OPENAI_API_KEY) as a repo secret and you're done — the workflow now re-runs your eval daily and opens a GitHub issue the moment the model drifts below your baseline.
The workflow above is named promptdrift, so GitHub serves a live status badge for it. Paste this into your README to show at a glance whether your prompts are still passing — and link it back to promptdrift so anyone who sees a green/red badge can find the tool:
[![promptdrift](https://github.com/OWNER/REPO/actions/workflows/promptdrift.yml/badge.svg)](https://github.com/wartzar-bee/promptdrift)

Replace OWNER/REPO with your repo. Rendered, it looks like a normal CI badge that flips red when a scheduled run detects drift.
(The badge above tracks this repo's own workflow; in your README it tracks yours.)
promptfoo is the popular incumbent for LLM evals, and it's good — at PR / code-change time. It runs your evals when you change your code. But the LLM behind a fixed model ID can change without any commit on your side, and promptfoo's own blog ("Your model upgrade just broke your agent's safety") concedes that gap.
promptdrift is not a promptfoo replacement — it's the complementary half:
| | promptfoo | promptdrift |
|---|---|---|
| Trigger | PR / code change (CI) | schedule (cron) + on-demand |
| Catches | regressions you introduce | server-side model drift + version bumps |
| Setup | rich eval framework | one config file + one workflow |
| Output | CI pass/fail on the diff | baseline compare → GitHub issue alert |
Use promptfoo for rich PR-time evals; add promptdrift to watch for drift between PRs. They stack.
No fabricated benchmarks here. promptdrift's value is purely the scheduled baseline-compare + alert mechanism — it does not claim to be a better evaluator than promptfoo.
A test case is { prompt, check }. check is dead-simple by default:
{
"provider": "anthropic",
"model": "claude-3-5-haiku-latest",
"threshold": 0,
"cases": [
{ "name": "answers capital of France",
"prompt": "What is the capital of France? Answer with just the city name.",
"check": { "type": "contains", "value": "Paris" } },
{ "name": "classifies sentiment as strict JSON",
"system": "Respond with ONLY a JSON object, no prose.",
"prompt": "Classify 'I love this'. Return {\"sentiment\": \"positive\"|\"negative\"|\"neutral\"}.",
"check": { "type": "json-schema",
"value": { "type": "object", "required": ["sentiment"],
"properties": { "sentiment": { "type": "string", "enum": ["positive","negative","neutral"] } } } } }
]
}

Check types: contains, not-contains, regex, equals, json-schema. A bare string is shorthand for contains. A case can also carry an array of checks (all must pass).

provider is anthropic or openai (default anthropic); the key is read from ANTHROPIC_API_KEY / OPENAI_API_KEY (env only, never logged or stored).
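To make the check semantics concrete, here is a minimal sketch of how the simple check types could be evaluated against a model's output. It is not promptdrift's actual src/checks.mjs: the function names (normalizeCheck, runOneCheck, runChecks) are hypothetical, the whitespace handling in equals is an assumption, and json-schema is left out to keep it short.

```javascript
// Hypothetical sketch of check evaluation; not promptdrift's real implementation.
// Covers contains, not-contains, regex and equals. json-schema is omitted.

function normalizeCheck(check) {
  // A bare string is shorthand for a contains check.
  return typeof check === "string" ? { type: "contains", value: check } : check;
}

function runOneCheck(check, output) {
  const { type, value } = normalizeCheck(check);
  switch (type) {
    case "contains":
      return output.includes(value);
    case "not-contains":
      return !output.includes(value);
    case "regex":
      return new RegExp(value).test(output);
    case "equals":
      // Assumption: surrounding whitespace is ignored when comparing.
      return output.trim() === String(value).trim();
    default:
      throw new Error(`unsupported check type in this sketch: ${type}`);
  }
}

function runChecks(checkOrChecks, output) {
  // A case may carry one check or an array of checks; all must pass.
  const checks = Array.isArray(checkOrChecks) ? checkOrChecks : [checkOrChecks];
  return checks.every((check) => runOneCheck(check, output));
}
```

For example, runChecks("Paris", "The capital is Paris.") passes through the bare-string shorthand, while an array of checks fails as soon as any one of them fails.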
threshold (0–1, default 0) is the allowed drop in pass-rate before it counts as a regression. 0 means any drop alarms.
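The threshold rule can be sketched as a small pure function. This is an illustration under stated assumptions, not the tool's real code: the result shape ({ name, passed }), the baseline shape ({ passRate, passing }), and the strict "drop greater than threshold" comparison are all hypothetical choices that happen to match the description above.

```javascript
// Hypothetical sketch of the baseline compare; shapes and names are assumptions.

function passRate(results) {
  // results: [{ name, passed }]
  if (results.length === 0) return 0;
  return results.filter((r) => r.passed).length / results.length;
}

function compareToBaseline(baseline, results, threshold = 0) {
  const now = passRate(results);
  const drop = baseline.passRate - now;
  // Assumption: a regression is a drop strictly greater than the threshold,
  // so a threshold of 0 alarms on any drop at all.
  const regressed = drop > threshold;
  // Cases that passed in the baseline but fail now.
  const previouslyPassing = new Set(baseline.passing);
  const newlyFailing = results
    .filter((r) => !r.passed && previouslyPassing.has(r.name))
    .map((r) => r.name);
  return { now, drop, regressed, newlyFailing };
}
```

With a baseline of 100% and a run where 2 of 3 cases pass, the drop is about 0.333, so threshold 0 reports a regression while a threshold of 0.5 would not.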
See examples/promptdrift.json for a runnable starter.
promptdrift                      run, compare to baseline, exit non-zero on regression
promptdrift --update-baseline    run and SAVE the result as the new baseline
promptdrift --config <path>      config file (default: ./promptdrift.json)
promptdrift --baseline <path>    baseline file (default: ./.promptdrift-baseline.json)
promptdrift --json               machine-readable output
promptdrift --no-color           plain output
Exit codes: 0 = no regression (or baseline saved) · 1 = regression detected · 2 = usage/config error.
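If you wrap the CLI in your own script, the exit codes above can be mapped to outcomes with a small helper. The function name and the labels it returns are hypothetical; only the numeric codes come from this README.

```javascript
// Hypothetical helper: turns a promptdrift exit code into a label.
// The labels are illustrative; only the numeric codes are documented.
function interpretExitCode(code) {
  switch (code) {
    case 0:
      return "ok"; // no regression, or baseline saved
    case 1:
      return "regression"; // pass-rate dropped below the baseline
    case 2:
      return "usage-or-config-error"; // bad flags or bad config file
    default:
      return "unknown";
  }
}
```

One way to use it is to run the CLI with spawnSync from node:child_process and pass the returned status to interpretExitCode.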
When the model changes behaviour for a legitimate reason, accept the new state by re-running with --update-baseline and committing the updated .promptdrift-baseline.json.
On a regression, the Action opens a single GitHub issue (titled "promptdrift: prompt regression detected") and the workflow run fails, which flips your status badge red. Subsequent failing runs add a fresh comment to that same issue instead of opening duplicates. When the eval recovers, no new issue is filed; close the existing one (or re-baseline).
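The open-once, then-update behaviour can be sketched as a pure decision function. This is an assumption about how such de-duplication might work (matching an open issue by its exact title), not a description of the Action's actual code; planAlert and the issue shape are hypothetical.

```javascript
// Hypothetical sketch of alert de-duplication; not the Action's real logic.
const ALERT_TITLE = "promptdrift: prompt regression detected";

// openIssues: [{ number, title }] for the repo's currently open issues (assumed shape).
function planAlert(openIssues, regressed) {
  // Nothing is filed or closed when the run is healthy.
  if (!regressed) return { action: "none" };
  // Assumption: the alert issue is recognised by its exact title.
  const existing = openIssues.find((issue) => issue.title === ALERT_TITLE);
  return existing
    ? { action: "comment", issueNumber: existing.number }
    : { action: "create", title: ALERT_TITLE };
}
```

Keeping the decision separate from the GitHub call means it can be tested with a plain array of issues and no network access.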
This repo is a self-contained composite Action — action.yml is at the root and wartzar-bee/promptdrift@v0 is tagged, so it can be referenced directly from any workflow (see the copy-paste block above). Categories it fits: Continuous Integration, Code Quality, Monitoring.
- Node 22, ESM, zero runtime dependencies (stdlib + built-in fetch only).
- A pure, network-free core (src/checks.mjs, src/config.mjs, src/runner.mjs, src/report.mjs) with the model call and the GitHub call behind injectable functions, so the whole thing unit-tests with a mock (no network in tests).
- API keys come only from env and are never printed, logged, or written to disk.
npm test # node --test — 42 tests, all offline
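The injectable-function design described above might look roughly like this. It is a sketch, not the contents of src/runner.mjs: runCases, mockModel, containsOnly, and the argument shapes are all hypothetical names chosen for illustration.

```javascript
// Hypothetical sketch of a runner with the model call injected.
// None of these names come from promptdrift's source.

async function runCases(cases, callModel, evaluate) {
  const results = [];
  for (const testCase of cases) {
    // callModel is the only place a network request would happen.
    const output = await callModel({ system: testCase.system, prompt: testCase.prompt });
    results.push({ name: testCase.name, passed: evaluate(testCase.check, output) });
  }
  return results;
}

// A mock model for tests: canned answers, no network.
async function mockModel({ prompt }) {
  return prompt.includes("France") ? "Paris" : "I am not sure.";
}

// Minimal evaluator for this sketch: handles only the contains check.
function containsOnly(check, output) {
  return output.includes(check.value);
}

const cases = [
  { name: "answers capital of France", prompt: "What is the capital of France?", check: { type: "contains", value: "Paris" } },
  { name: "answers capital of Peru", prompt: "What is the capital of Peru?", check: { type: "contains", value: "Lima" } },
];

runCases(cases, mockModel, containsOnly).then((results) => console.log(results));
```

In production the same runCases would receive a function that calls the provider's API, while tests pass the mock, so the core logic never touches the network.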
- v0.1: scheduled drift alarm with contains/not-contains/regex/equals/json-schema checks, baseline compare + threshold, GitHub-issue alerting, Anthropic + OpenAI. 42 unit tests (npm test).
- Next (evidence-driven, not yet built): optional LLM-as-judge check, per-case history/trend, Slack/webhook alert sink besides GitHub issues.
MIT