Shuffle a multiple-choice question's options, relabel them A/B/C/D → 1/2/3/4, reformat the prompt, or reword it, and a language model's answer often changes. That part is settled — there's a whole line of work showing benchmark scores wobble under changes that shouldn't matter.
The question I care about is sharper: when an answer flips, did the model change its mind, or did I just read the answer out of it badly?
It matters because a lot rides on the readout. The probability mass on the first answer token, the likelihood of the full answer string, a regex over the generated text, and an LLM judge can disagree with each other on the same output more than half the time. So some of what gets reported as "the model is unstable" is really "my measurement is unstable." A 2025 paper showed that's largely true for paraphrasing. Nobody had checked the option-order and formatting axes, on current models, with the extraction method treated as a knob you turn rather than a default you inherit.
So that's what this does: hold each question fixed, vary both the perturbation and the extraction method, and split the measured instability into two pieces.
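For a perturbation axis $a$ and an extraction method $e$, write $\mathrm{flip}(a, e)$ for the flip rate measured along $a$ when answers are read out with $e$.
Pick a faithful reference readout $e^{*}$. In v1 that's the generated-text answer read by a regex, justified by the finding that, when readouts disagree, the written answer is the more robust signal ("Look at the Text"). The LLM judge is not part of $e^{*}$; it's an independent bias gate on the run (see below). Then the instability measured under the field-default readout $e_{\text{field}}$ (first-token logprob) decomposes:
$$\underbrace{\mathrm{flip}(a, e_{\text{field}})}_{\text{total}} = \underbrace{\mathrm{flip}(a, e^{*})}_{\text{genuine}} + \underbrace{\mathrm{flip}(a, e_{\text{field}}) - \mathrm{flip}(a, e^{*})}_{\text{artifact}}$$
Genuine instability survives a faithful readout. Artifact is the part that exists only because of how you pulled the answer out. The headline claim, "much of the reported order/format sensitivity is an artifact", is just the statement that the artifact term accounts for a large share of the total on the order and format axes.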
The figures below are illustrative. They come from `scripts/demo_figures.py`, which runs synthetic data with a planted structure through flipbench's real decompose / figure / table code. No model has been run yet: this shows the output shape and the hypothesis the tool is built to test, not a result. Real numbers come from a pilot and then a full run (see Status).
The decomposition. Genuine (blue) at the base, extraction artifact (orange) stacked on top, with clustered-bootstrap CIs. On the structural axes most of the bar is orange; on paraphrase it's mostly blue.
Why the readout matters. Flip rate under each extraction method. First-token logprob (the field default) reports far more instability than reading the generated text or asking a judge — except on paraphrase, where they converge, because that instability is real.
| axis | model | genuine | total | artifact | artifact 95% CI |
|---|---|---|---|---|---|
| order | claude-opus | 0.08 | 0.58 | 0.50 | [0.31, 0.69] |
| label_set | claude-opus | 0.15 | 0.50 | 0.35 | [0.15, 0.54] |
| format | claude-opus | 0.08 | 0.46 | 0.38 | [0.19, 0.58] |
| paraphrase | claude-opus | 0.27 | 0.35 | 0.08 | [0.00, 0.19] |
The reading: on order, only ~0.08 of a 0.58 flip rate is the model genuinely changing its answer — the other 0.50 is an artifact of first-token scoring. On paraphrase it inverts, and the instability is mostly real.
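The arithmetic behind that reading can be checked directly. This is a minimal sketch using the illustrative (synthetic) values from the table above; the `ROWS` layout and the `artifact` and `artifact_share` helpers are hypothetical names for this example, not part of flipbench.

```python
# Illustrative values copied from the table above (synthetic, not a result).
ROWS = {
    "order": {"genuine": 0.08, "total": 0.58},
    "label_set": {"genuine": 0.15, "total": 0.50},
    "format": {"genuine": 0.08, "total": 0.46},
    "paraphrase": {"genuine": 0.27, "total": 0.35},
}


def artifact(row: dict[str, float]) -> float:
    """Artifact = total flip rate minus the part that survives the faithful readout."""
    return round(row["total"] - row["genuine"], 2)


def artifact_share(row: dict[str, float]) -> float:
    """Fraction of the total flip rate attributable to the extraction method."""
    return artifact(row) / row["total"]


for axis, row in ROWS.items():
    print(f"{axis:<11} artifact={artifact(row):.2f} share={artifact_share(row):.0%}")
```

On these planted numbers the artifact share is well above half on the three structural axes and well below half on paraphrase, which is the shape the hypothesis predicts.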
Per-item flip surface, same model, two readouts. Each row is an item, each column an axis, bright = it flipped. The field readout (left) lights up; the faithful readout (right) is much darker.
| first-token logprob | regex / generated text |
|---|---|
| ![flip surface, first-token logprob]() | ![flip surface, regex / generated text]() |
- Items — one MMLU-Pro subject (philosophy, 10-option MC) plus a GSM8K arithmetic slice with unambiguous answers. The arithmetic slice is the control: if instability shows up where the answer is a number, it can't be blamed on ambiguous options.
- Perturbation axes (4), all meaning-preserving — option order (cyclic permutations), option-label set, prompt format, and verified paraphrase (human spot-checked).
- Extraction methods: first-token logprob (the field default), regex over the generated text, and cloze/answer-string likelihood, plus a bias-validated LLM judge. A default real run uses the first two: the field logprob readout I'm interrogating and the generated-text readout that defines $e^{*}$. Cloze needs per-option answer-string logprobs that chat providers don't expose, so it's held back for providers that do; the judge is the independent bias gate, not part of $e^{*}$. Crossing extraction with the axes on the same items is the actual contribution.
- Statistics — per-item flip rate and entropy, with paired/clustered bootstrap CIs (clustered by item), so the artifact-vs-genuine split is a difference of paired quantities rather than two independent point estimates.
- Two validation gates, enforced in the run — the judge is checked against human labels and for its own position bias, and the paraphrase axis is gated on a human meaning/difficulty spot-check, before either is allowed to affect the numbers. The gate machinery is wired in as a hard precondition on a run, not left as good intentions. One honest caveat: the human-label inputs those gates read are placeholder fixtures right now (all-passing review labels), so the gates exercise the plumbing but won't catch a real problem until a real labelled set replaces them. If I got the judge-bias gate wrong, distrust the paraphrase-axis numbers first.
- The paraphrase store is a fixture too: the paraphrase axis reads pre-generated rewrites from `data/paraphrases/store.jsonl`, which currently holds a single example record. Before the paraphrase axis runs on the real dataset, that store has to be populated with the paraphrase generator (`generate_paraphrases`, the one model-touching step of the axis).
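The order axis and the per-item statistics described in the list above can be sketched in a few lines. This is a minimal sketch, not flipbench's code: every function name here is hypothetical, option texts are assumed to be distinct, and the flip-rate definition used (share of perturbed variants whose answer differs from the unperturbed one) is an assumption about what "per-item flip rate" means.

```python
import math
from collections import Counter

LABELS = "ABCDEFGHIJ"  # enough labels for a 10-option item


def cyclic_orders(options: list[str]) -> list[list[str]]:
    """All cyclic rotations of the option list (rotation 0 is the original)."""
    return [options[k:] + options[:k] for k in range(len(options))]


def to_canonical(label: str, shown: list[str], original: list[str]) -> int:
    """Map a label chosen in a rotated prompt back to the original option index."""
    return original.index(shown[LABELS.index(label)])


def flip_rate(canonical_answers: list[int]) -> float:
    """Share of perturbed variants whose answer differs from the unperturbed one."""
    base, *variants = canonical_answers
    return sum(a != base for a in variants) / len(variants)


def answer_entropy(canonical_answers: list[int]) -> float:
    """Shannon entropy (bits) of the answer distribution across variants."""
    n = len(canonical_answers)
    return -sum(c / n * math.log2(c / n) for c in Counter(canonical_answers).values())


options = ["free will", "determinism", "compatibilism", "fatalism"]
orders = cyclic_orders(options)

# A purely position-biased reader always says "A", so its choice follows the rotation.
biased = [to_canonical("A", shown, options) for shown in orders]

# A stable reader always picks the same option text, wherever it is shown.
stable = [
    to_canonical(LABELS[shown.index("compatibilism")], shown, options)
    for shown in orders
]
```

Mapping every answer back to the original option index before comparing is what keeps a relabelled or rotated prompt from registering as a flip when the model chose the same option text.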
Built on Inspect AI.
```bash
make install   # uv sync (Python 3.12)
make test      # full test suite
make smoke     # offline end-to-end on a stub model; no API keys, no network
make pilot     # cost estimate from 50 items (needs API keys / local Ollama)
make run       # full factorial -> results parquet
make report    # decomposition table + figures
```
`make smoke` is the one to run first: it wires fixture items through perturb → a deterministic stub model → all four extractors → the real factorial driver → the decomposition → the figures, with no network and no keys. If it passes, the plumbing is sound. Regenerate the illustrative figures above with `uv run python scripts/demo_figures.py`.
The real run targets $20–60. Open and reasoning models run locally through Ollama for free; only the two API models cost anything. The pilot estimates spend on 50 items first, and a full run that would blow the budget refuses to start and prints the projected number instead of surprising you with a bill.
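The refuse-to-start behaviour can be sketched as a simple guard. This is a minimal sketch under stated assumptions, not the runner's actual code: `PilotResult`, `projected_cost`, `check_budget`, and `BudgetExceeded` are hypothetical names, and the projection assumes per-item cost in the full run matches the pilot (a linear extrapolation).

```python
from dataclasses import dataclass


class BudgetExceeded(RuntimeError):
    """Raised before any request is sent when the projection exceeds the budget."""


@dataclass(frozen=True)
class PilotResult:
    n_items: int      # items in the pilot (50 in the text above)
    spend_usd: float  # measured API spend for those items


def projected_cost(pilot: PilotResult, n_items_full: int) -> float:
    """Linear extrapolation from pilot spend to the full run (an assumption)."""
    return pilot.spend_usd / pilot.n_items * n_items_full


def check_budget(pilot: PilotResult, n_items_full: int, budget_usd: float) -> float:
    """Return the projected cost, or refuse to start if it exceeds the budget."""
    cost = projected_cost(pilot, n_items_full)
    if cost > budget_usd:
        raise BudgetExceeded(
            f"projected ${cost:.2f} exceeds budget ${budget_usd:.2f}; not starting"
        )
    return cost


pilot = PilotResult(n_items=50, spend_usd=2.50)  # made-up pilot figures
```

With these made-up pilot figures, `check_budget(pilot, 1000, 60.0)` returns a projection of about 50 dollars, while `check_budget(pilot, 2000, 60.0)` raises with the projected number in the message.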
```
src/flipbench/
  schema.py        canonical Item / PerturbedItem / ModelOutput / ResultRow types
  items/           load + normalize MMLU-Pro philosophy and GSM8K
  perturbations/   order · label_set · format · paraphrase (pure, seeded)
  extraction/      first_token_logprob · cloze · regex_freeform · llm_judge
  models/          Inspect AI providers (Anthropic, OpenAI/Gemini, Ollama) + capability flags
  runner/          cache · cost guard · factorial driver · CLI
  analysis/        flip rate · clustered bootstrap · the decomposition · flip predictor
  validation/      judge-bias gate · paraphrase gate
  report/          figures · decomposition table
  publish.py       push the verified perturbation set as a Hugging Face dataset
docs/              design spec, implementation plan, writeup skeleton
scripts/           demo_figures.py (the illustrative figures above)
```
flipbench engages prior work directly rather than rediscovering it:
- "Flaw or Artifact? Rethinking Prompt Sensitivity" (arXiv:2509.01790) — showed instability is largely an extraction artifact, for paraphrasing. flipbench extends that test to the order and format axes and to current models.
- SCORE (arXiv:2503.00137) — multi-axis robustness with a per-item consistency rate, one extraction method, mid-2024 open models. flipbench crosses axes on the same items, treats extraction as a factor, and adds paired CIs.
- FormatSpread (arXiv:2310.11324), selection bias / PriDe (arXiv:2309.03882), "My Answer is C" (arXiv:2402.14499 — first-token vs. generated text disagree >60% of the time), "Look at the Text" (COLM 2024), and "Adding Error Bars to Evals" (arXiv:2411.00640) for the statistics.
The full pipeline is implemented and green — items, perturbations, extraction, models, runner, analysis, the two validation gates, and reporting — with a passing test suite (including a synthetic planted-decomposition test that checks the estimator recovers a known genuine/artifact split), and it type-checks clean under pyright.
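The idea behind a planted-decomposition check can be sketched with the standard library alone. This is a minimal sketch, not the repository's test: the `plant`, `decompose`, and `clustered_bootstrap_ci` names, the planted rates (10% genuine, 50% total), and the percentile interval are all assumptions chosen for illustration.

```python
import random


def plant(n_items: int = 100, n_variants: int = 4) -> list[tuple[float, float]]:
    """Per-item (field, faithful) flip rates with a known split.

    Every 2nd item flips under the field readout (total = 0.5); every 10th
    item also flips under the faithful readout (genuine = 0.1).
    """
    rates = []
    for i in range(n_items):
        field = [int(i % 2 == 0)] * n_variants
        faithful = [int(i % 10 == 0)] * n_variants
        rates.append((sum(field) / n_variants, sum(faithful) / n_variants))
    return rates


def decompose(rates: list[tuple[float, float]]) -> tuple[float, float, float]:
    """Return (genuine, total, artifact) averaged over items."""
    total = sum(f for f, _ in rates) / len(rates)
    genuine = sum(g for _, g in rates) / len(rates)
    return genuine, total, total - genuine


def clustered_bootstrap_ci(
    rates: list[tuple[float, float]],
    n_boot: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> tuple[float, float]:
    """Percentile CI for the artifact term, resampling whole items.

    Each resampled item carries both of its readouts, so the artifact is a
    difference of paired quantities within every bootstrap replicate.
    """
    rng = random.Random(seed)
    stats = sorted(
        decompose(rng.choices(rates, k=len(rates)))[2] for _ in range(n_boot)
    )
    return stats[int(alpha / 2 * n_boot)], stats[int((1 - alpha / 2) * n_boot) - 1]


rates = plant()
genuine, total, artifact = decompose(rates)
lo, hi = clustered_bootstrap_ci(rates)
```

The check is then that the estimator returns the planted split (0.1 genuine, 0.5 total, 0.4 artifact) and that the interval brackets the planted artifact value.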
It has not been run against real models yet. That's the next step: a 50-item pilot to pin down cost, then a full run on philosophy + GSM8K across a Claude and a GPT/Gemini model plus local Llama-3.1-8B and Qwen-2.5-7B through Ollama. The design spec, implementation plan, and writeup skeleton live under docs/.




