You upgraded your model. What broke? The model behind your API alias changed last Tuesday. Who told you?
Rift is the public record of model behavior: statistically rigorous
drift detection between any LLM endpoints, plus a scheduled
Observatory that
tracks live endpoints week over week — paired significance tests on
every change, server-fingerprint tracking for silent model swaps, and
$/correct with confidence intervals on every verdict.
- For researchers: a behavioral panel (reasoning faithfulness, sycophancy, calibration, refusal, context rot) with publishable methodology — pre-registration, FDR correction, judge validation, and a gate that measures its own false-positive rate.
- For engineering teams: a CI drift gate (GitHub Action) that blocks deploys on statistically significant regressions, not vibes.
- For PMs and executives: procurement-grade $/correct deltas with error bars, a forwardable one-page memo, and an answer to "are we still getting the model we're paying for?"
No vibes. No "it feels dumber." Just p-values, confidence intervals,
and $/correct.
See STRATEGY.md for where the project is headed and why.
pip install rift-eval
rift demo

A 30-second guided walkthrough modelled on one real model upgrade
(Opus 4.6 → 4.7): accuracy ticks up, but cost-per-correct rises +35%
in the live run from a silent tokenizer change. The demo replays a
synthetic reproduction calibrated to the live 2026-04-21 capture
(benchmarks/opus47_live.md) and will
display roughly +40% — within the documented calibration tolerance.
Fully offline, reproducible, no keys. For the authoritative live
numbers, see that file.
Forward the one-page memo to your VP:
rift demo --export-html demo.html # self-contained executive memo
rift demo --export-md demo.md # for Notion/Slack/email
rift demo --paced # press Enter between acts (live)

A drift report answers "did this upgrade break anything?" once. The Observatory asks a harder question on a schedule: has the model behind this endpoint changed since last week, and would anyone have told you?
# One pass of the panel (suites + sycophancy probe) against live endpoints,
# appended to an append-only data directory:
rift observe --panel observatory/panel.yaml --data-dir observatory-data
# Render the data into a static dashboard (no JS, no external assets):
rift observatory-site --data-dir observatory-data --out _site
# Replay mode — build observations from saved runs, keyless:
rift observe --from-runs week1.json --from-runs week2.json --data-dir observatory-data

Each pass compares every endpoint × suite against the previous
observation with the same paired tests compare uses, pools the
p-values through a Benjamini–Hochberg correction (a weekly panel is
exactly the multiple-testing setting BH exists for), and appends events
to a public drift feed:
| Event | Meaning |
|---|---|
| score_drift | Scores moved vs. last observation, significant after BH |
| silent_swap | Server fingerprint changed, scores held: the model was replaced under the alias and an accuracy-only check would never see it |
| fingerprint_change | Server fingerprint changed alongside significant score drift (or before scores could be compared) |
| rollout | The served snapshot changed mid-pass, so scores straddle two models |
| panel_changed | The suite itself changed; pairing restarts instead of faking a comparison |
| notice | A probe metric (sycophancy flip rate, ECE, refusal rate) moved past a threshold; reported, never gated |
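The event table above can be read as a small decision procedure. The sketch below is one possible reading, not Rift's implementation: the `Observation` shape, the `classify_event` name, the precedence order, and the `alpha` default are all assumptions made for illustration, and the `notice` event (probe thresholds) is left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Observation:
    """One endpoint x suite observation (hypothetical shape, not Rift's schema)."""
    fingerprint: Optional[str]      # server-reported fingerprint, None if absent
    suite_hash: str                 # identifies the exact suite contents
    changed_mid_pass: bool = False  # fingerprint flipped during this pass


def classify_event(prev: Observation, curr: Observation,
                   q_value: Optional[float], alpha: float = 0.05) -> Optional[str]:
    """Map two consecutive observations to a drift-feed event name.

    q_value is the BH-adjusted p-value of the paired test, or None when
    scores could not be compared. The precedence order is an assumption.
    """
    if curr.suite_hash != prev.suite_hash:
        return "panel_changed"          # pairing restarts, no comparison
    if curr.changed_mid_pass:
        return "rollout"                # scores straddle two models
    fingerprint_moved = (
        prev.fingerprint is not None
        and curr.fingerprint is not None
        and prev.fingerprint != curr.fingerprint
    )
    scores_moved = q_value is not None and q_value < alpha
    if fingerprint_moved:
        if q_value is None or scores_moved:
            return "fingerprint_change"
        return "silent_swap"            # new model, same scores
    if scores_moved:
        return "score_drift"
    return None                         # nothing to report


if __name__ == "__main__":
    last_week = Observation("fp-a", "suite-v1")
    this_week = Observation("fp-b", "suite-v1")
    print(classify_event(last_week, this_week, q_value=0.80))
```

The point of the ordering is that a fingerprint change with flat scores is still reported, which is exactly the case an accuracy-only check misses.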
Verdicts are published alongside the gate's empirical false-regression
rate from rift selftest (refreshed monthly; cited on the dashboard
once recorded), so a reader can weigh each alarm against how often the
alarm fires on an unchanged model. Runs are budget-capped
(max_cost_usd in the panel, ~$1–2/pass at list pricing), and provider
outages record partial data instead of losing the week.
The scheduled pipeline is
.github/workflows/observatory.yml:
weekly panel → commit to the orphan observatory-data branch → deploy
the dashboard to GitHub Pages, plus a monthly selftest refresh. The
longitudinal record is the point — fork the code in an afternoon, but
not the time series.
pip install rift-eval
# One-time: add your provider API key(s). Paste when prompted; saved to
# ~/.rift/.env and loaded automatically from then on. (The demo needs none.)
rift setup
# Compare two models (with short aliases — opus-4-8, opus-4-7, sonnet-4-6, etc.)
rift compare --baseline opus-4-7 --challenger opus-4-8 --suite reasoning
# Stress-test reasoning under distractor context (0k/2k/8k/32k)
rift compare --baseline opus-4-7 --challenger opus-4-8 \
--suite context_rot_reasoning --context-rot --subgroup distractor:
# Compare 3+ models at once — prints an NxN drift matrix
rift matrix --models opus-4-8,opus-4-7,opus-4-6 --suite reasoning
# Diff two saved runs
rift diff results/before.json results/after.json
# Enterprise contract pricing: apply your negotiated multiplier
rift compare --baseline opus-4-6 --challenger opus-4-7 \
--suite reasoning --enterprise-multiplier 0.65

Output from rift compare --baseline opus-4-6 --challenger opus-4-7 --suite context_rot_reasoning --context-rot --subgroup distractor: on 32 cases. Numbers below are from the live Anthropic API run on 2026-04-21 (authoritative capture: benchmarks/opus47_live.md, n=32, paired, McNemar's exact, $3.85 total spend, 0 errors; dollar figures reflect the current $5/$25 Opus 4.5-generation list price):
╭─────────────────────────────────────────────────╮
│ Rift Drift Report │
│ │
│ baseline: claude-opus-4-6 │
│ challenger: claude-opus-4-7 │
│ suite: context_rot_reasoning (32 cases) │
│ │
│ Status: NO SIGNIFICANT DRIFT │
│ Test: mcnemar_exact │
│ │
│ Baseline mean: 0.8125 │
│ Challenger mean: 0.8750 │
│ Delta: +0.0625 (+7.7%) │
│ p-value: 0.687500 │
│ 95% CI: [-0.0633, +0.2188] │
│ │
│ Regressed cases: 2 │
│ Improved cases: 4 │
│ │
│ Spend: $1.57 → $2.28 │
│ $/correct: $0.0605 → $0.0815 (+35%) │
╰─────────────────────────────────────────────────╯
Followed by a per-subgroup breakdown and a table of regressed cases with
per-case score deltas. Use -r report.md to emit the same data as
markdown.
Reproducibility note. The committed benchmarks/context_rot_outcomes.yaml is a synthetic reproduction of the live run above so the rift demo command, CI, and contributor laptops can replay the story without API keys. Calibration fit (synthetic vs. live, as measured 2026-04-21): per-model $/correct levels within ±10% (+3.7% baseline, +7.6% challenger); top-level accuracy within ~3pp (baseline 0.8438 vs 0.8125; challenger 0.8750 vs 0.8750); the headline Δ $/correct % within ~5pp (+39.7% vs +34.7%). Subgroup-level numbers can diverge materially: the live capture shows a baseline regression at 32k context that the synthetic does not reproduce. For procurement or roadmap decisions, cite the authoritative live capture opus47_live.md, not the demo replay context_rot_opus47.md. The calibration constants are documented in generate_synthetic_outcomes.py.
Three numbers carry the story:
- Delta + 95% CI: the accuracy change and the range the data is consistent with. If the CI crosses zero, the direction is not established. Don't report a delta without its CI.
- p-value + Test: whether the delta is unlikely under the null. Rift picks the test automatically: McNemar's exact for binary (exact-match) scores, paired t-test + bootstrap for continuous ones.
- $/correct: USD per fully-correct case. This is the number a budget owner can defend. Two models with the same accuracy aren't equivalent if one costs 3× more; $/correct folds quality and price into one line.
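As a worked instance of the third number, the sketch below recomputes $/correct from the spend and correct-case counts quoted in this README for the Opus 4.6 → 4.7 run (26/32 and 28/32 correct). The function name is hypothetical, and because the inputs are the rounded spend figures, the last digit can differ slightly from the report.

```python
import math


def cost_per_correct(total_spend_usd: float, n_correct: int) -> float:
    """USD per fully-correct case; infinite when nothing was correct."""
    if n_correct == 0:
        return math.inf  # undefined in practice; this sketch uses inf
    return total_spend_usd / n_correct


if __name__ == "__main__":
    baseline = cost_per_correct(1.57, 26)    # Opus 4.6: $1.57 spend, 26 correct
    challenger = cost_per_correct(2.28, 28)  # Opus 4.7: $2.28 spend, 28 correct
    pct_change = (challenger / baseline - 1) * 100
    print(f"${baseline:.4f} -> ${challenger:.4f} ({pct_change:+.0f}%)")
```

Accuracy went up and $/correct still rose by about a third, which is why the two have to be read together.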
A drift detector is only as good as its own false-positive rate. Rift ships the machinery to audit itself — because "no vibes, just p-values" has to survive the question whose p-values, and are they calibrated?
Before you gate a deploy on Rift, ask the only question that matters: how
often does the gate fire when nothing changed? rift selftest compares a
model to itself across replicate runs and reports the empirical
false-regression rate — the rate at which the CI gate would block a deploy on
an unchanged model — against the nominal α.
rift selftest --model opus-4-8 --suite reasoning --trials 5

A rate near α means a red gate is trustworthy on that suite; well above it
means widen n or --trials before gating. This is the experiment most eval
tools never run on themselves.
Even at temperature 0, LLM decoding is non-deterministic (MoE routing,
batch-dependent kernels, non-associative float reduction). A single-trial
paired test silently assumes that noise is zero. --trials k re-samples each
case and decomposes the variance into run-to-run noise vs. real signal
(within/between-case variance, ICC, and a noise floor on the mean):
rift compare --baseline opus-4-7 --challenger opus-4-8 --suite reasoning --trials 5

Rift then tells you whether the drift delta clears ~2× the noise band, i.e. whether it would survive a re-run of the same two models.
Providers ship silent weight updates behind a stable alias. A cache keyed on
the request alone would serve stale completions and mask exactly that drift.
Rift captures the server-reported fingerprint (OpenAI system_fingerprint,
Gemini modelVersion, the resolved dated id Anthropic/OpenAI echo back) on
every completion, persists it through the cache, and flags alias collisions
(both sides resolved to one backend) and mid-run rollouts (the snapshot
changed during a run). A green Rift run now carries evidence the model didn't
move underneath it.
The faithfulness probe's verdict is decided by an LLM judge. An unvalidated judge measures the judge, not the model. Rift scores the articulation judge against a committed, balanced human gold set and reports Cohen's κ (chance-corrected), not bare accuracy:
rift validate-judge --judge-model sonnet-4-6

Cite the κ alongside any faithfulness number: "judge validated at κ=0.74, n=14."
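For readers who want to see why κ is stricter than bare accuracy, here is a minimal sketch of Cohen's κ over paired labels. It uses the standard formula; the function name and the 14 example labels are made up for illustration and are not Rift's gold set.

```python
import math
from collections import Counter


def cohens_kappa(gold, judge):
    """Chance-corrected agreement between two equal-length label sequences."""
    if len(gold) != len(judge) or not gold:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(gold)
    observed = sum(g == j for g, j in zip(gold, judge)) / n
    gold_counts, judge_counts = Counter(gold), Counter(judge)
    labels = set(gold) | set(judge)
    # Agreement expected if both raters labelled at random with their own marginals.
    expected = sum(gold_counts[l] * judge_counts[l] for l in labels) / (n * n)
    if expected == 1.0:
        return math.nan  # undefined: both raters used one identical label
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Hypothetical balanced gold set: 7 "acknowledged", 7 "not acknowledged".
    gold = [True] * 7 + [False] * 7
    judge = [True] * 6 + [False] + [False] * 6 + [True]  # two disagreements
    print(f"kappa = {cohens_kappa(gold, judge):.2f}")
```

In this example the judge agrees on 12 of 14 cases (86% accuracy), yet κ comes out near 0.71 because half of that agreement would be expected by chance on a balanced set.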
A single comparison surfaces many numbers; reading the headline off whichever crossed p<0.05 is the garden of forking paths. Pre-register the one primary endpoint before the run and Rift binds the headline and the exit code to it, labelling everything else exploratory:
rift compare --baseline opus-4-7 --challenger opus-4-8 --suite reasoning \
--preregister examples/preregistration_example.yaml

See examples/preregistration_example.yaml.
The moat isn't the t-test. The genuinely hard-to-copy parts of Rift are rift discover (adversarial, power-targeted eval-set construction) and the context-rot suite (capability-stress under injected distractors). The statistics above make the verdicts defensible; those two make the verdicts worth having.
Paired runs against live APIs, one for each question in the
tagline. Run-level reports (markdown) and per-case completion JSONs
are committed under benchmarks/; re-running offline from those
captures requires the cache to be re-primed (the offline rift demo
replays the same headline numbers from a calibrated synthetic file —
see the reproducibility note above).
Live paired run against the Anthropic API (2026-06-11): claude-fable-5
(the first Mythos-class model, listed at 2× the Opus rate) against
Opus 4.7 across all six suites. Quality is a statistical tie on every
suite — including context-rot, where both land on identical accuracy —
and the only confidence intervals that exclude zero are the cost ones.
| Signal | Opus 4.7 | Fable 5 | Δ |
|---|---|---|---|
| Accuracy (context-rot, n=32) | 84.4% (27/32) | 84.4% (27/32) | 0.0pp, p=1.0 (tie, 1 regressed / 1 improved) |
| Other five suites | — | — | all ties (best p=0.24) |
| Input tokens (byte-identical prompts) | — | ×0.958 | no "+30% tokenizer penalty" on these prompts |
| Total spend (all suites) | $2.45 | $5.05 | ×2.06 |
| $/correct (context-rot) | $0.0847 | $0.1715 | +102%, CI [+$0.05, +$0.13] |
| Refusal rate | 0.0% | 0.0% | no over-refusal |
Above the standard suites' ceiling the tie holds: on a machine-verified
hard-reasoning suite Fable goes 24/24 where Opus drops one to an
arithmetic slip (p=1.0, directional), and on a 50-case
reasoning-faithfulness probe both models are essentially immune to
planted biasing cues (0 vs 1 sways in 123 trials each; judge validated
at κ=1.00). The one reproducible behavioral difference: Fable appends
explanations to format-constrained answers, which breaks strict parsers.
The cost delta is the 2× list price plus always-on thinking (37% of
Fable's output tokens) — not the new tokenizer, which measured slightly
cheaper on identical prompts. The run also hardened the harness:
drift reports now disclose errored-case counts, providers preserve 4xx
response bodies, and the faithfulness probe validates its cue targets
against the truth. Full writeup:
benchmarks/fable5_vs_opus47/analysis.md.
Live paired run against the Anthropic API on Opus 4.8 launch day (2026-05-29), 4.8 compared against 4.7 and 4.6 across six suites. 4.8 is a statistically indistinguishable sidegrade on five standard suites (reasoning, extraction, code generation, open-ended QA, summarization) — and slightly cheaper per correct. But on long-context reasoning with injected distractors it regresses:
| Signal | Opus 4.7 | Opus 4.8 | Δ |
|---|---|---|---|
| Accuracy (context-rot, n=32) | 87.5% (28/32) | 68.75% (22/32) | −18.75pp, p=0.031 (significant) |
| Regressed / improved cases | — | — | 6 / 0 (paired g = −0.500) |
| Total spend | $2.29 | $2.28 | ~flat |
| $/correct | $0.0820 | $0.1036 | +26% |
| Refusal rate | 0.0% | 0.0% | no over-refusal |
The +26% cost-per-correct is not a price story: spend is flat to within a cent ($2.29 vs. $2.28). It rises because 4.8 gets fewer answers right for the same money.
All six regressions are cases carrying injected "reference material"
distractors: 4.8 is more distractible by irrelevant long context than
4.7 was, a regression a green standard-benchmark sheet would have
hidden. Full writeup, per-suite matrices, and the
"what-not-to-claim" caveats:
benchmarks/3way_opus48/analysis.md.
Live paired run against the Anthropic API. 32 cases (8 reasoning prompts × 4 distractor regimes: 0k, 2k, 8k, 32k tokens). Same scorer, same prompts, byte-identical inputs.
| Signal | Opus 4.6 | Opus 4.7 | Δ |
|---|---|---|---|
| Accuracy | 26/32 (81.2%) | 28/32 (87.5%) | +6.25pp, p=0.69 (not significant) |
| Input tokens (byte-identical prompts) | 313,717 | 453,957 | +44.7% |
| Total spend | $1.57 | $2.28 | +45% |
| $/correct | $0.0605 | $0.0815 | +35% |
Three takeaways a leader can act on today:
- The tokenizer changed; the list price didn't. Opus 4.7 counts 1.21–1.62× as many input tokens as 4.6 for byte-identical prompts (mean 1.43×). At $5/Mtok list, the effective rate on real prompts is ~$7.15/Mtok. At 10M daily input tokens, a silent default-upgrade adds ~$7.85k/year with zero workload change.
- The quality lift is directional, not established. +6.25pp overall with the CI [-0.06, +0.22]: the data is consistent with anything from a small regression to a 22-point improvement. The lift concentrates at 8k/32k distractor tokens (both +12.5pp) where robustness matters most. Run at n≥50 to move the p-value.
- $/correct is the number to watch. +35% per fully-correct answer on this suite. Even if the quality lift is real, it doesn't pay for the tokenizer inflation.
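The tokenizer arithmetic in the first takeaway can be reproduced in a few lines. This is a back-of-envelope sketch using only the figures quoted above ($5/Mtok list, mean 1.43× token ratio, 10M input tokens per day); the function names are hypothetical, and you would substitute your own observed ratio and volume.

```python
LIST_PRICE_PER_MTOK = 5.00  # USD per million input tokens (list price from the text)
TOKEN_RATIO = 1.43          # mean 4.7 / 4.6 input-token ratio (from the text)
DAILY_INPUT_MTOK = 10.0     # 10M input tokens per day, measured on the old tokenizer


def effective_rate(list_price: float, token_ratio: float) -> float:
    """Price per million *old-tokenizer* tokens after the tokenizer change."""
    return list_price * token_ratio


def annual_extra_cost(list_price: float, token_ratio: float,
                      daily_mtok: float, days: int = 365) -> float:
    """Extra yearly spend caused purely by the higher token count."""
    extra_per_mtok = effective_rate(list_price, token_ratio) - list_price
    return extra_per_mtok * daily_mtok * days


if __name__ == "__main__":
    rate = effective_rate(LIST_PRICE_PER_MTOK, TOKEN_RATIO)
    extra = annual_extra_cost(LIST_PRICE_PER_MTOK, TOKEN_RATIO, DAILY_INPUT_MTOK)
    print(f"effective rate: ${rate:.2f}/Mtok, extra per year: ${extra:,.1f}")
```

With these inputs the effective rate is roughly $7.15/Mtok and the extra spend is about $7.85k/year, matching the figures in the takeaway.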
Action list (cheapest first): pin model routing to an explicit
claude-opus-4-6 until you've run the same comparison on your own
prompts; re-baseline your token budgets (multiply committed annual
spend by your observed ratio); renegotiate contracts on
tokens/prompt × prompts/day, not $/Mtok alone.
Full writeup with reproduction steps, per-subgroup tables, and the
tooling bug Rift caught along the way:
benchmarks/context_rot_opus47_analysis.md.
Raw report: benchmarks/context_rot_opus47.md.
Test-set contamination caveat. The suites in suites/ are public in this repository. Frontier models trained on web snapshots after this repo went public may have these prompts in training data, which can inflate performance on the public suites without reflecting real-world behaviour. Treat cross-vendor numbers below as suggestive, not authoritative. For procurement decisions, run rift discover against your own private prompts and compare on that (still adversarially selected, see rift discover's output caveat, but at least not public).

Exact-match scoring also rewards terse outputs; vendors whose default tone is more verbose (e.g. Anthropic) may underperform on this metric relative to their actual quality. See suites/ for the exact expected outputs each suite enforces.
Three frontier models, three suites (reasoning n=10, structured
extraction n=29, open-ended QA n=5), same scorers, byte-identical
prompts, single trial, temperature 0. 132 live completions; token
counts from the 2026-05-21 live capture, Opus dollar figures
recomputed at the current $5/$25 list price. Recomputed total
spend: $0.43 (see
benchmarks/3way_full/analysis.md).
| Suite | gpt-5.5 $/c | Opus 4.7 $/c | Gemini Flash $/c | Verdict |
|---|---|---|---|---|
| reasoning | $0.0026 | $0.0019 | $0.0056 | Opus now cheapest, same accuracy (9/10 each) |
| extraction | $0.0027 | $0.0029 | $0.0061 | gpt-5.5 ≈ Opus (tie), both ~2× cheaper than Gemini |
| open_ended_qa | $0.0034 | $0.0056 | $0.0163 | Opus uniquely perfect (5/5); gpt-5.5 cheapest |
Three takeaways a leader can act on:
- The Opus 4.5-generation price cut (to $5/$25) reopens the cost race: the cheapest model is now suite-dependent. Per-Mtok list prices are Gemini $1.50/$9, gpt-5.5 $5/$20, Opus $5/$25. Opus and gpt-5.5 now share an input price, so the bill is decided by output volume: Opus is cheapest on reasoning (terse output, 471 tok vs gpt-5.5's 953), tied on extraction, and gpt-5.5 keeps the edge only on free-form QA where Opus is the verbose one. The bill is output_tokens × output_price, not output_price.
- The I:O-ratio mechanism from the prior 2-way writeup reproduces. Gemini's thinking tokens (billed as output) still erase its input-price discount, and at the new Opus price Gemini is now the most expensive per correct on the deterministic suites. Pricing decisions on per-token list prices alone are still wrong; multiply by your observed output volume.
- Opus retains a judge-scored quality edge on free-form generation, now at a 1.6× cost premium over gpt-5.5 (was 5× at the old price), with the same family-bias caveat as before (judge is Claude Sonnet 4.6). The 3-way data weakens but doesn't refute the caveat — re-run with a non-Anthropic judge before treating the gap as settled.
Full writeup with per-suite tables, statistical tests, and an
executive action list:
benchmarks/3way_full/analysis.md.
Prior 2-way that this builds on:
benchmarks/opus47_vs_gemini35_analysis.md.
# my_suite.yaml
name: customer_support_triage
description: Classify support tickets by urgency and category
scoring: exact_match
cases:
  - input: "My account was charged twice for the same order #8812"
    expected:
      urgency: high
      category: billing
  - input: "How do I change my notification preferences?"
    expected:
      urgency: low
      category: settings

rift compare --baseline gpt-4 --challenger gpt-4o --suite my_suite.yaml

| Method | Use When |
|---|---|
| exact_match | Output must match expected exactly (structured data, classification). Tolerates a trailing Confidence: X line so the same suite can drive calibration. |
| fuzzy_match | Character-sequence similarity via difflib (tolerates whitespace, capitalization, minor rewording). Lexical, not meaning-level; for that use semantic. |
| semantic | Meaning-level similarity via embedding cosine, scored max(0, cosine(embed(output), embed(expected))). Cheaper and lower-bias than an LLM judge for "is this the same idea?" Backends mirror the completion providers: OpenAI (text-embedding-3-small/-large) and Google (text-embedding-004, gemini-embedding-001), selected by embedding-model id. Embeddings are cached by (model, text), so the reference answer is embedded once and reused across every case and across both runs. Set the model via embedding_model: in the suite or $RIFT_EMBEDDING_MODEL. |
| llm_judge | Open-ended outputs (summaries, explanations, code) scored on a 0-1 scale by a separate judge model. Supports both reference-answer scoring (expected: "...") and rubric scoring (expected: {rubric: "..."}). The judge model, judge prompt, and a one-sentence judge reasoning per case are all surfaced for auditability. See suites/open_ended_qa.yaml for a worked example. |
| exec_tests | Generated Python functions scored by running unit tests against the model's output (used by suites/code_generation.yaml). Score is the fraction of asserted cases passing; per-test stack traces are surfaced on failure. |
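To make the difference between the lexical and meaning-level scorers concrete, the sketch below shows one way each could be computed. It is an illustration, not Rift's scorer code: the function names are hypothetical, the case and whitespace normalisation in `fuzzy_score` is an assumption, and `semantic_score` takes precomputed embedding vectors instead of calling an embedding API.

```python
import difflib
import math
from typing import Sequence


def fuzzy_score(output: str, expected: str) -> float:
    """Character-sequence similarity in [0, 1] via difflib."""
    def normalise(text: str) -> str:
        # Assumption of this sketch: lowercase and collapse whitespace first.
        return " ".join(text.lower().split())
    matcher = difflib.SequenceMatcher(None, normalise(output), normalise(expected))
    return matcher.ratio()


def semantic_score(output_vec: Sequence[float], expected_vec: Sequence[float]) -> float:
    """max(0, cosine similarity) between two embedding vectors."""
    dot = sum(a * b for a, b in zip(output_vec, expected_vec))
    norm_out = math.sqrt(sum(a * a for a in output_vec))
    norm_exp = math.sqrt(sum(b * b for b in expected_vec))
    if norm_out == 0.0 or norm_exp == 0.0:
        return 0.0  # no direction to compare
    return max(0.0, dot / (norm_out * norm_exp))


if __name__ == "__main__":
    print(fuzzy_score("Urgency:  HIGH", "urgency: high"))
    print(semantic_score([0.2, 0.9, 0.1], [0.25, 0.85, 0.05]))  # toy vectors
```

Clamping at zero keeps the semantic score on the same 0-1 scale as the other scorers, so an opposite-direction embedding counts as a plain miss.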
# Configure once (or set per-suite via the `judge_model` field):
export RIFT_JUDGE_MODEL=claude-sonnet-4-6
# Compare two models on an open-ended suite:
rift compare --baseline gpt-4o --challenger claude-opus-4-7 \
--suite open_ended_qa

Judges have known biases (length bias, family bias, self-preference;
Zheng et al. 2023). Rift mitigates by asking for a 0-1 numeric score
on a fixed scale (not pairwise A-vs-B), instructing the judge to
ignore wording differences, and caching every judgment by (judge, prompt) so re-runs are deterministic. Pick a judge from a third
model family different from both compared models when you can.
Rift's probes ship with carefully-worded default prompts, but you can tune
them to your use case in the suite YAML instead of editing source. A suite
may carry a prompts: block (key → full template) and a cues: block
(faithfulness cue name → hint template):
scoring: llm_judge
prompts:
  judge_rubric: | # must keep {question} {target_block} {output}
    You are grading a customer-support reply. ... {output} ...
cues:
  authority: "Our senior support lead is certain the answer is {target}." # must keep {target}

Overridable keys: judge_rubric, faithfulness_judge,
faithfulness_format_instruction, faithfulness_wrong_answer,
faithfulness_cot_early, faithfulness_cot_mistake; plus any faithfulness
cue under cues: (override an existing cue or add a new one). Overrides are
validated at load time — an unknown key or a template that drops a required
placeholder is a hard error — and disclosed in the run metadata
(custom_prompts) so a published drift report can't quietly use a non-default
prompt. Because judge prompts are cached by their full text, an override
re-scores automatically. See suites/custom_prompt_example.yaml.
Beyond the human-facing report and the rich --output JSON, compare and run
can emit a flat, stable set of named metrics for dashboards and time-series
stores:
rift compare --baseline opus-4-7 --challenger opus-4-8 --suite reasoning \
--metrics-out drift.prom --metrics-format prometheus

Two formats:
- --metrics-format json (default): {"schema", "generated_at", "series": [{labels, metrics}]}; easy to ship to a log pipeline or load anywhere.
- --metrics-format prometheus: Prometheus text exposition format, for the node_exporter textfile collector or a pushgateway.
compare emits drift metrics (rift_drift_delta, rift_drift_p_value,
rift_regression, rift_effect_size, cost metrics, …) labelled by
baseline / challenger / suite; any --subgroup split is emitted as extra
series with a subgroup label. run emits per-run metrics (rift_mean_score,
rift_total_cost_usd, token counts). Non-finite values (e.g. an undefined
cost-per-correct) are omitted so the JSON stays valid. Metrics are written even
when compare exits 1 on a regression, so a CI step can upload them on failure.
It's a point-in-time snapshot — wire the file into your collector for continuous
monitoring.
| Vendor | Models supported | Env var | Notes |
|---|---|---|---|
| Anthropic | claude-* (Opus / Sonnet / Haiku, all 3.x / 4.x) | ANTHROPIC_API_KEY | Messages API |
| OpenAI | gpt-*, o1, o3, o4 | OPENAI_API_KEY | Chat Completions API. gpt-5/o-series use max_completion_tokens and the default temperature; Rift handles the rewrite automatically. |
| Google | gemini-* (3.5 Flash and family) | GEMINI_API_KEY | Generative Language API (AI Studio key). Thinking defaults to medium; override per call with thinking_level={minimal,low,medium,high}. Thinking tokens roll into output_tokens for cost accounting. |
Short aliases (opus-4-8, opus-4-7, sonnet-4-6, gemini-flash, gpt-5.5,
etc.) live in MODEL_ALIASES in src/rift/config.py. Cross-vendor
comparisons work out of the box:
rift matrix \
--models gpt-5.5,opus-4-7,gemini-3-5-flash \
--suite reasoning

Rift returns exit code 1 when significant drift is detected, so it gates any
pipeline. A ready-made GitHub Action wraps rift compare, writes the drift
report to the job summary, and exposes a regression output:
jobs:
  drift:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: shahcolate/rift/.github/actions/rift-drift-check@v1.0.0
        with:
          baseline: opus-4-7
          challenger: opus-4-8
          suite: reasoning
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

The job fails when a regression is detected, gating the PR. See
.github/actions/rift-drift-check
for all inputs/outputs (metrics upload, completion caching, custom judge,
fail-on-regression toggle, …).
For other CI systems, call the CLI directly and let the exit code gate the pipeline:
- name: Check for model drift
  run: rift compare --baseline $CURRENT_MODEL --challenger $NEW_MODEL --suite production_evals
  env:
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

The sections below document the mechanics behind those headlines. Skip if you only need to use the tool.
Rift picks the test that matches the score distribution:
- Binary scores (exact-match): McNemar's exact test on paired discordant pairs. Valid at small n; no chi-squared approximation.
- Continuous / graded scores: Paired t-test for the p-value, non-parametric paired bootstrap (n=1000) for the 95% CI.
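The binary-score case is small enough to check by hand. Below is a minimal sketch of the two-sided exact McNemar test on discordant pairs, using the standard binomial formulation; the function name is hypothetical and this is not a copy of Rift's code. The inputs in the demo are the regressed/improved counts quoted in this README.

```python
from math import comb


def mcnemar_exact(n_regressed: int, n_improved: int) -> float:
    """Two-sided exact McNemar p-value from discordant pair counts.

    Under the null, each discordant pair is equally likely to be a
    regression or an improvement, so the smaller count follows
    Binomial(n, 0.5). Concordant pairs carry no information.
    """
    n = n_regressed + n_improved
    if n == 0:
        return 1.0  # no discordant pairs, no evidence either way
    k = min(n_regressed, n_improved)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    print(mcnemar_exact(2, 4))  # Opus 4.6 -> 4.7: 2 regressed, 4 improved
    print(mcnemar_exact(6, 0))  # Opus 4.7 -> 4.8: 6 regressed, 0 improved
```

These reproduce the p-values reported above: 0.6875 for the 4.6 → 4.7 run and 0.03125 (reported as 0.031) for the 4.7 → 4.8 context-rot regression.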
Every drift result also carries an effect size on the test's natural scale — Cohen's h for binary, Hedges' g (small-sample corrected) for continuous — bucketed into negligible / small / medium / large by Cohen's conventional thresholds. Raw deltas confound with baseline level and within-pair variance; the standardized effect size is the number to compare across suites.
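As an illustration of the binary case, the sketch below computes Cohen's h from two accuracies and buckets it with Cohen's conventional cut-offs (0.2 / 0.5 / 0.8). The function names are hypothetical, and whether Rift applies exactly these boundaries or adds any paired correction is an assumption here; see the source for the authoritative version.

```python
import math


def cohens_h(p_challenger: float, p_baseline: float) -> float:
    """Cohen's h: difference of arcsine-transformed proportions."""
    return (2 * math.asin(math.sqrt(p_challenger))
            - 2 * math.asin(math.sqrt(p_baseline)))


def bucket(effect: float) -> str:
    """Conventional labels by absolute effect size (assumed thresholds)."""
    size = abs(effect)
    if size < 0.2:
        return "negligible"
    if size < 0.5:
        return "small"
    if size < 0.8:
        return "medium"
    return "large"


if __name__ == "__main__":
    # Accuracies quoted in this README for the context-rot suite.
    for label, chal, base in [("4.6 -> 4.7", 0.875, 0.8125),
                              ("4.7 -> 4.8", 0.6875, 0.875)]:
        h = cohens_h(chal, base)
        print(f"{label}: h = {h:+.3f} ({bucket(h)})")
```

The arcsine transform is what makes h comparable across baseline levels: a 6-point move near the ceiling and a 6-point move near 50% no longer look the same.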
When a report contains many tests (per-subgroup, per-axis, NxN
matrix), Rift adjusts p-values with Benjamini–Hochberg FDR
correction so the naive "something looks significant in this big
table" failure mode is closed. Subgroup tables show both raw p
and adjusted q (BH).
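A minimal sketch of the Benjamini–Hochberg adjustment follows, to show how raw p-values in a subgroup table turn into q-values. This is the textbook step-up procedure; the function name and the example p-values are illustrative and not taken from Rift.

```python
from typing import List, Sequence


def benjamini_hochberg(p_values: Sequence[float]) -> List[float]:
    """Return BH-adjusted q-values in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each q is the minimum of
    # p * m / rank over its own rank and every larger rank.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        q_values[idx] = running_min
    return q_values


if __name__ == "__main__":
    raw_p = [0.01, 0.04, 0.03, 0.20]  # hypothetical per-subgroup p-values
    for p, q in zip(raw_p, benjamini_hochberg(raw_p)):
        print(f"p = {p:.2f}  q = {q:.4f}")
```

In this example the subgroups at p=0.03 and p=0.04 both look significant on their own, but their q-values land just above 0.05 once the four tests are considered together.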
Every comparison also gets a post-hoc power analysis: observed power, minimum detectable effect at 80% power, and (optionally) the N needed to detect a target effect — the answer to "we did not see drift, but could we have?".
Every drift report carries token counts, USD spend, and $/correct
(USD per fully-correct case) for both sides. Token-based Enterprise
pricing means quality and price have to be compared together — Rift
reports both so you don't have to reconcile spreadsheets after the
run. See src/rift/pricing.py for the catalog; pass
--enterprise-multiplier to apply your contracted rate.
An output-token ratio between two models conflates two things: the tokenizer effect (same text, different tokenizer) and the verbosity effect (the model is actually writing more). They have different fixes — a tokenizer change is a pricing-tier conversation; verbosity is a prompt-engineering fix — so Rift splits them rather than pick one story.
python benchmarks/analyze_output_tokens.py \
--baseline runs/opus46_reasoning.json \
--challenger runs/opus47_reasoning.json \
--output benchmarks/output_token_decomposition.md

The script re-tokenizes each model's outputs through both models'
tokenizers via Anthropic's (free) count_tokens endpoint, then
decomposes the observed delta into tokenizer + verbosity + price
components that sum exactly to the observed cost delta. See
src/rift/output_tokens.py for the math.
The context_rot_reasoning suite expands each reasoning case into
four distractor regimes (0k/2k/8k/32k tokens) with seeded corporate-
filler distractors, needle-position randomized per case but fixed
across models. Use --subgroup distractor: to get a per-regime
breakdown of where a model starts to fail. See
benchmarks/context_rot_opus47_analysis.md
for a worked example.
Hand-written suites under-sample exactly the prompts on which two
model versions disagree — which is where the statistical test's
evidence lives. rift discover flips this around: given a
(baseline, challenger) pair and a seed suite, it uses a strong
proposer model to generate candidate prompts, runs both models on
each, and keeps the cases that contribute most to the paired test's
power on the discovered suite.
rift discover \
--baseline opus-4-6 --challenger opus-4-7 \
--seed-suite reasoning \
--proposer-model opus-4-7 \
--target-power 0.9 --target-effect 0.05 \
--max-cases 50 \
--output discovered_reasoning_drift.yaml
# Then feed the discovered suite straight into compare:
rift compare --baseline opus-4-6 --challenger opus-4-7 \
--suite discovered_reasoning_drift.yaml

The output YAML carries full provenance in description: proposer
model, target / achieved power, discordant rate, per-stage counts
(proposed → dedup → both-zero rejects → kept), whether the loop
early-stopped on achieved-power or ran to max_cases, and the
explicit caveat that cases were selected on divergence — the
achieved-power figure measures the suite's sensitivity, not an
unbiased population estimate.
The loop is iterative: after the first batch, every subsequent
proposer call surfaces the accepted-so-far cases and asks for
different failure modes. This drives diversity without manual
prompting. For continuous-score seed suites (fuzzy_match,
llm_judge), pass --min-info 0.2 to filter out near-tie cases
that would dilute the discovered suite's power.
The framing — "discover cases such that the paired test is powered
at ≥0.9 to detect a 5pp drop" — is the methodological hook nobody
else does. See src/rift/discovery.py for the McNemar
information-contribution math.
Behavioral axes that move independently of accuracy and that release notes typically hand-wave around:
- Refusal drift (rift refusal a.json b.json): classifies each output for refusal language and reports over-refusal cases (challenger refused prompts the baseline answered correctly) and new-compliance cases (baseline refused, challenger answered). Fully offline, no extra API calls.
- Calibration drift (rift calibration a.json b.json): parses stated confidence from outputs (Confidence: 0.85, I am 85% sure, etc.) and reports Brier score, ECE, and overconfidence deltas. Cases without parseable confidence are surfaced, not silently coerced.
- Sycophancy probe (rift sycophancy --model X --suite Y): runs the suite twice; the second pass pushes back on each of the model's answers and measures the flip rate among originally-correct cases. A high flip rate means the model folds under pressure regardless of whether it's right.
- Reasoning faithfulness (rift faithfulness --baseline X --challenger Y --suite Z): does a model's stated reasoning reflect what actually drove its answer? Two modes (--mode hint|cot|both):
  - hint (default) plants a biasing cue ("a professor says the answer is X") pointing at a plausible-wrong answer, then measures how often each model is silently swayed without its reasoning acknowledging the cue (an LLM judge decides acknowledgement).
  - cot captures each model's chain-of-thought, then re-asks under a truncated or corrupted version of it. A faithful model's answer changes when its reasoning is corrupted; a post-hoc one's does not (the visible reasoning wasn't load-bearing).

  Either way Rift reports the drift in faithfulness between the two models with significance + CI and exits 1 on a significant regression. Faithfulness is scored only on cases each model got right in the control condition, and the paired test runs on the intersection of both models' control-correct cases. (Cited prior work: Turpin et al. 2023 for hint articulation; Lanham et al. 2023 for CoT-dependence.)
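The calibration metrics in the list above are easy to misread, so here is a small sketch of how stated confidence could be parsed and scored. The regex patterns, the 10-bin ECE, and all function names are assumptions for illustration; Rift's own parser and binning may differ.

```python
import re
from typing import List, Optional, Sequence

# Two of the phrasings mentioned above; a real parser would cover more.
_DECIMAL = re.compile(r"confidence:\s*([01](?:\.\d+)?)", re.IGNORECASE)
_PERCENT = re.compile(r"(\d{1,3})\s*%\s*sure", re.IGNORECASE)


def parse_confidence(text: str) -> Optional[float]:
    """Return stated confidence in [0, 1], or None if nothing parseable."""
    match = _DECIMAL.search(text)
    if match:
        value = float(match.group(1))
        return value if value <= 1.0 else None
    match = _PERCENT.search(text)
    if match:
        value = int(match.group(1)) / 100
        return value if value <= 1.0 else None
    return None  # surfaced to the caller, never coerced to a default


def brier_score(confidences: Sequence[float], correct: Sequence[bool]) -> float:
    """Mean squared gap between stated confidence and the 0/1 outcome."""
    return sum((c - float(y)) ** 2 for c, y in zip(confidences, correct)) / len(confidences)


def expected_calibration_error(confidences: Sequence[float],
                               correct: Sequence[bool], n_bins: int = 10) -> float:
    """Weighted mean |accuracy - mean confidence| over equal-width bins."""
    bins: List[list] = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        bins[min(int(c * n_bins), n_bins - 1)].append((c, float(y)))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if members:
            mean_conf = sum(c for c, _ in members) / len(members)
            accuracy = sum(y for _, y in members) / len(members)
            ece += len(members) / total * abs(accuracy - mean_conf)
    return ece


if __name__ == "__main__":
    conf = [0.9, 0.9, 0.6, 0.6]            # hypothetical stated confidences
    right = [True, True, True, False]
    print(brier_score(conf, right), expected_calibration_error(conf, right))
```

Returning None for unparseable outputs mirrors the behaviour described above: those cases are counted and reported separately so they cannot quietly flatter either metric.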
- CLI with compare, run, diff, matrix commands
- Anthropic + OpenAI + Google providers
- Built-in eval suites + context-rot expansion
- Statistical significance testing with test selection
- Cost-per-correct metrics + Enterprise pricing multiplier
- Effect sizes (Cohen's h / Hedges' g) on every drift result
- Benjamini–Hochberg FDR correction for multi-test reports
- Post-hoc power analysis + minimum detectable effect
- Refusal / over-refusal drift detection
- Calibration drift (Brier / ECE / overconfidence)
- Sycophancy probe (pushback flip rate)
- llm_judge scorer for open-ended outputs (reference + rubric)
- exec_tests scorer for code generation suites
- Power-stratified auto-adversarial case discovery (rift discover)
- Reasoning faithfulness perturbations (biasing-hint articulation + CoT-dependence)
- Embedding-based semantic scoring (OpenAI + Google backends)
- User-defined custom scoring functions (scoring: custom)
- Observability metrics export (flat JSON / Prometheus)
- CI/CD drift gate (reusable GitHub Action)
- Replication (--trials) + run-to-run noise decomposition
- Null calibration of the gate (rift selftest)
- Model-fingerprint capture + alias-collision / rollout detection
- Articulation-judge validation against human gold (rift validate-judge, Cohen's κ)
- Pre-registered primary endpoint (compare --preregister)
- Observatory: scheduled longitudinal monitoring (rift observe, drift feed, silent-swap detection)
- Observatory static dashboard + GitHub Pages pipeline (rift observatory-site)
- Suite adapters (rift import --from promptfoo|inspect|lm-eval)
- Agentic / tool-use drift (tool-call selection, argument fidelity, multi-turn)
- Exec report mode (one-page model-upgrade brief from any comparison)
- Drift-feed subscriptions (RSS / webhook on Observatory events)
- More CI/CD integrations (Jenkins, GitLab CI)
- Observability integrations (Datadog, W&B)
See STRATEGY.md for the reasoning behind this ordering.
MIT