promptdrift

Scheduled LLM prompt-regression / model-drift alarm for CI. Catch prompt regressions caused by server-side model drift — on a cron, not just on PRs. Open-source GitHub Action + CLI (@wartzar-bee/promptdrift), zero runtime dependencies, Anthropic + OpenAI.

promptdrift runs a small eval set against your LLM on a schedule, compares the pass-rate to a stored baseline, and alerts (opens/updates a GitHub issue + exits non-zero) the moment it regresses. It targets the failure mode that PR-time eval tools structurally can't see: the same model ID silently changing server-side, or a model-version bump quietly breaking a prompt with no commit on your side.

Keywords: prompt regression testing · model drift detection · scheduled LLM eval · GitHub Action · LLM CI/CD · prompt monitoring · Anthropic Claude · OpenAI.

$ npx @wartzar-bee/promptdrift

  promptdrift  anthropic:claude-3-5-haiku-latest
  ──────────────────────────────────────────────
  Cases   2/3 passed   (pass-rate 66.7%)
  Baseline 100.0%  ↓  now 66.7%

  PASS  refuses to reveal system prompt
  FAIL  answers capital of France
        expected output to contain "Paris"
  PASS  classifies sentiment as strict JSON

  REGRESSION DETECTED
  pass-rate dropped from 100.0% to 66.7%
  Newly failing: answers capital of France

Add the scheduled drift alarm in 1 step (copy-paste)

Drop this into .github/workflows/promptdrift.yml (this is also examples/promptdrift.yml):

name: promptdrift
on:
  schedule:
    - cron: "0 8 * * *"   # daily at 08:00 UTC — catches drift between PRs
  workflow_dispatch: {}   # also runnable on demand
permissions:
  issues: write           # so promptdrift can open/update the alert issue
  contents: read
jobs:
  drift:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: wartzar-bee/promptdrift@v0
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          # OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        with:
          config: promptdrift.json
          baseline: .promptdrift-baseline.json

Then, once, locally:

npx @wartzar-bee/promptdrift --update-baseline   # records today's pass-rate as the baseline
git add .promptdrift-baseline.json && git commit -m "promptdrift baseline"

Add ANTHROPIC_API_KEY (or OPENAI_API_KEY) as a repo secret and you're done. The workflow now re-runs your eval daily and opens a GitHub issue on the first scheduled run whose pass-rate drops below your baseline.

Show the drift status in your README (embeddable badge)

The workflow above is named promptdrift, so GitHub serves a live status badge for it. Paste this into your README to show at a glance whether your prompts are still passing — and link it back to promptdrift so anyone who sees a green/red badge can find the tool:

[![promptdrift](https://github.com/OWNER/REPO/actions/workflows/promptdrift.yml/badge.svg)](https://github.com/wartzar-bee/promptdrift)

Replace OWNER/REPO with your repo. Rendered, it looks like a normal CI badge that flips red when a scheduled run detects drift. (The badge in this repo's README tracks this repo's own workflow; in your README it tracks yours.)

Why this exists (honest positioning vs promptfoo)

promptfoo is the popular incumbent for LLM evals, and it's good — at PR / code-change time. It runs your evals when you change your code. But the LLM behind a fixed model ID can change without any commit on your side, and promptfoo's own blog ("Your model upgrade just broke your agent's safety") concedes that gap.

promptdrift is not a promptfoo replacement — it's the complementary half:

|         | promptfoo                 | promptdrift                             |
|---------|---------------------------|-----------------------------------------|
| Trigger | PR / code change (CI)     | schedule (cron) + on-demand             |
| Catches | regressions you introduce | server-side model drift + version bumps |
| Setup   | rich eval framework       | one config file + one workflow          |
| Output  | CI pass/fail on the diff  | baseline compare → GitHub issue alert   |

Use promptfoo for rich PR-time evals; add promptdrift to watch for drift between PRs. They stack.

No fabricated benchmarks here. promptdrift's value is purely the scheduled baseline-compare + alert mechanism — it does not claim to be a better evaluator than promptfoo.

Config (promptdrift.json)

A test case is { prompt, check }; the cases below also carry a name, and the second adds a system prompt. check is dead-simple by default:

{
  "provider": "anthropic",
  "model": "claude-3-5-haiku-latest",
  "threshold": 0,
  "cases": [
    { "name": "answers capital of France",
      "prompt": "What is the capital of France? Answer with just the city name.",
      "check": { "type": "contains", "value": "Paris" } },

    { "name": "classifies sentiment as strict JSON",
      "system": "Respond with ONLY a JSON object, no prose.",
      "prompt": "Classify 'I love this'. Return {\"sentiment\": \"positive\"|\"negative\"|\"neutral\"}.",
      "check": { "type": "json-schema",
        "value": { "type": "object", "required": ["sentiment"],
          "properties": { "sentiment": { "type": "string", "enum": ["positive","negative","neutral"] } } } } }
  ]
}

Check types: contains, not-contains, regex, equals, json-schema. A bare string is shorthand for contains. A case can also carry an array of checks (all must pass). provider is anthropic or openai (default anthropic); the key is read from the ANTHROPIC_API_KEY / OPENAI_API_KEY env var only and is never logged or stored.
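
To make the check semantics concrete, here is a minimal sketch of how the string-based check types, the bare-string shorthand, and the "all must pass" array rule could be evaluated. It is an illustration, not the project's src/checks.mjs: the function names are hypothetical, the whitespace trimming in equals is an assumption, and json-schema is left out.

```javascript
// Hypothetical sketch of check evaluation; NOT the project's src/checks.mjs.

// A bare string is shorthand for a "contains" check.
function normalizeCheck(check) {
  return typeof check === "string" ? { type: "contains", value: check } : check;
}

// Evaluates one check against the model output and returns true or false.
function runCheck(check, output) {
  const { type, value } = normalizeCheck(check);
  switch (type) {
    case "contains":
      return output.includes(value);
    case "not-contains":
      return !output.includes(value);
    case "regex":
      return new RegExp(value).test(output);
    case "equals":
      // Assumption: surrounding whitespace is ignored when comparing.
      return output.trim() === value;
    default:
      // json-schema is intentionally omitted from this sketch.
      throw new Error(`unsupported check type in this sketch: ${type}`);
  }
}

// A case may carry one check or an array of checks; every check must pass.
function casePasses(checkOrChecks, output) {
  const checks = Array.isArray(checkOrChecks) ? checkOrChecks : [checkOrChecks];
  return checks.every((c) => runCheck(c, output));
}
```

Under these assumptions, casePasses("Paris", "The capital is Paris.") is true, and a case with both a contains "Paris" and a not-contains "Lyon" check passes only when the output satisfies both.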

threshold (0–1, default 0) is the allowed drop in pass-rate before it counts as a regression. 0 means any drop alarms.
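
The threshold rule can be sketched as a single comparison. This assumes pass-rates are stored as fractions between 0 and 1 and that a regression means the drop is strictly greater than the threshold, which matches "0 means any drop alarms"; the function name is hypothetical and the real implementation may differ.

```javascript
// Hypothetical sketch of the baseline comparison; not the project's code.
// baselineRate, currentRate and threshold are all fractions in [0, 1].
function isRegression(baselineRate, currentRate, threshold = 0) {
  const drop = baselineRate - currentRate;
  // Assumption: a drop must exceed the threshold to alarm, so with the
  // default threshold of 0 any drop at all counts as a regression.
  return drop > threshold;
}
```

With the sample run shown earlier (baseline 100.0%, now 66.7%), isRegression(1, 2 / 3) is true, while isRegression(1, 2 / 3, 0.5) is false because a drop of about 0.33 stays within an allowed drop of 0.5.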

See examples/promptdrift.json for a runnable starter.

CLI

promptdrift                      run, compare to baseline, exit non-zero on regression
promptdrift --update-baseline    run and SAVE the result as the new baseline
promptdrift --config <path>      config file (default: ./promptdrift.json)
promptdrift --baseline <path>    baseline file (default: ./.promptdrift-baseline.json)
promptdrift --json               machine-readable output
promptdrift --no-color           plain output

Exit codes: 0 = no regression (or baseline saved) · 1 = regression detected · 2 = usage/config error.
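
If you wrap the CLI in your own script, the three exit codes let you tell real drift apart from a broken setup. A small sketch of that mapping follows; the function name and the returned labels are hypothetical, and the status itself would come from however you launch the process (for example the status field returned by Node's child_process.spawnSync).

```javascript
// Hypothetical helper for wrapper scripts; not part of promptdrift.
// Maps the documented exit codes to a label a caller can branch on.
function interpretExit(code) {
  switch (code) {
    case 0:
      return "ok"; // no regression, or baseline saved
    case 1:
      return "regression"; // pass-rate regressed against the baseline
    case 2:
      return "usage-or-config-error"; // bad flags or config, not drift
    default:
      return "unknown"; // anything else is outside the documented set
  }
}
```

A wrapper might page someone on "regression" but only log a warning on "usage-or-config-error", since the latter says nothing about the model.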

When the model changes behaviour for a legitimate reason, accept the new state by re-running with --update-baseline and committing the updated .promptdrift-baseline.json.

How the GitHub-issue alert behaves

On a regression the Action opens a single GitHub issue titled "promptdrift: prompt regression detected" and fails the workflow run, which flips your status badge red. Subsequent failing runs add a fresh comment to that same issue instead of opening duplicates. When the eval recovers, no new issue is filed; close the existing one (or re-baseline).
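
The open-once, comment-afterwards behaviour can be sketched with an injected GitHub client. Everything about the client here is an assumption made for illustration: the three method names are hypothetical, and looking up the existing issue by its title is a guess at how deduplication might work, not a description of the Action's internals.

```javascript
// Hypothetical sketch of the alert flow; not the Action's actual code.
const ISSUE_TITLE = "promptdrift: prompt regression detected";

// `github` is an assumed injected client with three async methods:
//   findOpenIssueByTitle(title) -> issue number, or null if none is open
//   createIssue(title, body)    -> number of the newly created issue
//   addComment(number, body)    -> resolves once the comment is posted
async function alertOnRegression(github, body) {
  const existing = await github.findOpenIssueByTitle(ISSUE_TITLE);
  if (existing === null) {
    // First failing run: open the single alert issue.
    const number = await github.createIssue(ISSUE_TITLE, body);
    return { action: "opened", number };
  }
  // Later failing runs: comment on the same issue instead of duplicating it.
  await github.addComment(existing, body);
  return { action: "commented", number: existing };
}
```

Because the client is passed in, the flow can be exercised with an in-memory fake: the first call should report "opened" and every later call "commented" with the same issue number.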

Use as a GitHub Action

This repo is a self-contained composite Action — action.yml is at the root and wartzar-bee/promptdrift@v0 is tagged, so it can be referenced directly from any workflow (see the copy-paste block above). Categories it fits: Continuous Integration, Code Quality, Monitoring.

Design / how to verify

  • Node 22, ESM, zero runtime dependencies (stdlib + built-in fetch only).
  • A pure, network-free core (src/checks.mjs, src/config.mjs, src/runner.mjs, src/report.mjs) with the model call and the GitHub call behind injectable functions — so the whole thing unit-tests with a mock (no network in tests).
  • API keys come only from env and are never printed, logged, or written to disk.
npm test     # node --test — 42 tests, all offline
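
As an illustration of the injectable-function design described above, here is a minimal sketch of a runner that takes the model call as a parameter, so a test can pass a canned fake instead of hitting the network. The names runCases, callModel and passes are hypothetical and the real src/runner.mjs may be shaped differently.

```javascript
// Hypothetical sketch of a network-free runner; not the real src/runner.mjs.
// callModel: injected async function ({ system, prompt }) => output string.
// passes:    injected function (check, output) => boolean.
async function runCases(cases, callModel, passes) {
  const results = [];
  for (const c of cases) {
    const output = await callModel({ system: c.system, prompt: c.prompt });
    results.push({ name: c.name, passed: passes(c.check, output) });
  }
  const passed = results.filter((r) => r.passed).length;
  // Avoid dividing by zero for an empty case list (treated as 0 here,
  // which is an arbitrary choice for this sketch).
  const passRate = cases.length === 0 ? 0 : passed / cases.length;
  return { results, passed, total: cases.length, passRate };
}
```

In a test, callModel can be a function that returns a fixed string per prompt, which keeps the whole run offline and deterministic.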

Status / roadmap

  • v0.1: scheduled drift alarm — contains / not-contains / regex / equals / json-schema checks, baseline compare + threshold, GitHub-issue alerting, Anthropic + OpenAI. 42 unit tests (npm test).
  • Next (evidence-driven, not yet built): optional LLM-as-judge check, per-case history/trend, Slack/webhook alert sink besides GitHub issues.

License

MIT