flipbench

Shuffle a multiple-choice question's options, relabel them A/B/C/D → 1/2/3/4, reformat the prompt, or reword it, and a language model's answer often changes. That part is settled — there's a whole line of work showing benchmark scores wobble under changes that shouldn't matter.

The question I care about is sharper: when an answer flips, did the model change its mind, or did I just read the answer out of it badly?

It matters because a lot rides on the readout. The probability mass on the first answer token, the likelihood of the full answer string, a regex over the generated text, and an LLM judge can disagree with each other on the same output more than half the time. So some of what gets reported as "the model is unstable" is really "my measurement is unstable." A 2025 paper showed that's largely true for paraphrasing. Nobody had checked the option-order and formatting axes, on current models, with the extraction method treated as a knob you turn rather than a default you inherit.

So that's what this does: hold each question fixed, vary both the perturbation and the extraction method, and split the measured instability into two pieces.

For a perturbation axis $a$, a model $m$, and an extraction method $e$, the total instability is the flip rate — how often two benign variants of the same item disagree once their answers are mapped back to a canonical option:

$$T(a,e,m) = \Pr_{v,\,v' \sim \mathcal{V}_a(x)}\big[\,\hat{y}_e(x_v) \neq \hat{y}_e(x_{v'})\,\big]$$
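A minimal sketch of how this flip rate could be estimated from data. It assumes each item's answers have already been mapped back to canonical option ids, that pairs are unordered pairs of distinct variants, and that items are weighted equally. The function names and the example answers are illustrative, not flipbench's API or real output.

```python
from itertools import combinations
from typing import Hashable, Mapping, Sequence


def item_flip_rate(answers: Sequence[Hashable]) -> float:
    """Fraction of unordered variant pairs whose canonical answers disagree."""
    pairs = list(combinations(answers, 2))
    if not pairs:
        raise ValueError("need at least two variants per item")
    return sum(a != b for a, b in pairs) / len(pairs)


def flip_rate(answers_by_item: Mapping[str, Sequence[Hashable]]) -> float:
    """Mean per-item flip rate: an estimate of T for one axis, extractor, and model."""
    rates = [item_flip_rate(v) for v in answers_by_item.values()]
    if not rates:
        raise ValueError("need at least one item")
    return sum(rates) / len(rates)


# Hypothetical canonical answers for two items, three variants each.
answers = {
    "q1": ["B", "B", "B"],  # stable: no pair disagrees
    "q2": ["A", "C", "A"],  # two of the three pairs disagree
}
print(flip_rate(answers))
```

Averaging per item rather than pooling all pairs keeps an item with many variants from dominating the estimate.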

Pick a faithful reference readout $e^{*}$. In v1 that's the generated-text answer read by a regex — justified by the finding that, when readouts disagree, the written answer is the more robust signal ("Look at the Text"). The LLM judge is not part of $e^{*}$; it's an independent bias gate on the run (see below). Then the instability decomposes:

$$\underbrace{T(a,e,m)}_{\text{measured}} \;=\; \underbrace{G(a,m)}_{\text{genuine}} \;+\; \underbrace{A(a,e,m)}_{\text{extraction artifact}}, \qquad G(a,m) = T(a,e^{*},m), \quad A(a,e,m) = T(a,e,m) - G(a,m).$$

Genuine instability survives a faithful readout. Artifact is the part that exists only because of how you pulled the answer out. The headline claim — "much of the reported order/format sensitivity is an artifact" — is just the statement that $A$ is large relative to $G$ on those axes, with confidence intervals that don't overlap. I'm not trying to show instability exists; it does. I'm trying to measure how much of it is the readout.
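The split can be sketched in a few lines of Python. This assumes per-item flip rates are available for the same items under the readout being examined and under the reference readout; `Decomposition` and `decompose` are hypothetical names and the numbers are made up for illustration.

```python
from dataclasses import dataclass
from statistics import fmean
from typing import Sequence


@dataclass(frozen=True)
class Decomposition:
    total: float     # T(a, e, m): flip rate under the readout being examined
    genuine: float   # G(a, m) = T(a, e*, m): flip rate under the reference readout
    artifact: float  # A(a, e, m) = T - G


def decompose(flips_e: Sequence[float], flips_ref: Sequence[float]) -> Decomposition:
    """Split measured instability into genuine and artifact parts on paired items."""
    if len(flips_e) != len(flips_ref) or not flips_e:
        raise ValueError("need paired, non-empty per-item flip rates")
    total = fmean(flips_e)
    genuine = fmean(flips_ref)
    return Decomposition(total, genuine, total - genuine)


# Hypothetical per-item flip rates for four items under two readouts.
d = decompose([1.0, 0.5, 0.5, 0.0], [0.0, 0.5, 0.0, 0.0])
print(d)
```

Nothing in the definition forces $A \ge 0$: if a readout happens to flip less often than the reference, the artifact term comes out negative, and the sketch reports it as is rather than clipping it.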

What it produces

The figures below are illustrative. They come from scripts/demo_figures.py, which runs synthetic data with a planted structure through flipbench's real decompose / figure / table code. No model has been run yet — this shows the output shape and the hypothesis the tool is built to test, not a result. Real numbers come from a pilot and then a full run (see Status).

The decomposition. Genuine (blue) at the base, extraction artifact (orange) stacked on top, with clustered-bootstrap CIs. On the structural axes most of the bar is orange; on paraphrase it's mostly blue.

Why the readout matters. Flip rate under each extraction method. First-token logprob (the field default) reports far more instability than reading the generated text or asking a judge — except on paraphrase, where they converge, because that instability is real.

| axis | model | genuine $G$ | total $T$ (logprob) | artifact $A$ | artifact 95% CI |
|---|---|---|---|---|---|
| order | claude-opus | 0.08 | 0.58 | 0.50 | [0.31, 0.69] |
| label_set | claude-opus | 0.15 | 0.50 | 0.35 | [0.15, 0.54] |
| format | claude-opus | 0.08 | 0.46 | 0.38 | [0.19, 0.58] |
| paraphrase | claude-opus | 0.27 | 0.35 | 0.08 | [0.00, 0.19] |

The reading: on order, only ~0.08 of a 0.58 flip rate is the model genuinely changing its answer — the other 0.50 is an artifact of first-token scoring. On paraphrase it inverts, and the instability is mostly real.
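For intuition about the interval column, here is a minimal sketch of a paired, item-clustered bootstrap for $A$. It assumes per-item flip rates under both readouts and a plain percentile interval, and it uses only the standard library. `bootstrap_artifact_ci` is a hypothetical name, and the code is not claimed to match flipbench's analysis module or to reproduce the table.

```python
import random
from statistics import fmean
from typing import Sequence, Tuple


def bootstrap_artifact_ci(
    field: Sequence[float],
    reference: Sequence[float],
    n_boot: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (point estimate, lower, upper) for A = mean(field) - mean(reference).

    field[i] and reference[i] belong to the same item, so items are resampled
    as units and each draw carries both readouts with it.
    """
    if len(field) != len(reference) or not field:
        raise ValueError("need paired, non-empty per-item flip rates")
    rng = random.Random(seed)
    # With paired resampling, the mean of differences equals the difference of means.
    diffs = [f - r for f, r in zip(field, reference)]
    n = len(diffs)
    stats = sorted(fmean(rng.choices(diffs, k=n)) for _ in range(n_boot))
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return fmean(diffs), lower, upper


# Hypothetical per-item flip indicators for eight items.
field = [1, 0, 1, 1, 0, 1, 0, 1]
reference = [0, 0, 1, 0, 0, 0, 0, 1]
print(bootstrap_artifact_ci(field, reference))
```

Resampling whole items keeps the two readouts paired, which is why the interval describes the difference itself rather than two separately estimated rates.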

Per-item flip surface, same model, two readouts. Each row is an item, each column an axis, bright = it flipped. The field readout (left) lights up; the faithful readout (right) is much darker.

Panels: first-token logprob (left), regex / generated text (right).

How it works

Items — one MMLU-Pro subject (philosophy, 10-option MC) plus a GSM8K arithmetic slice with unambiguous answers. The arithmetic slice is the control: if instability shows up where the answer is a number, it can't be blamed on ambiguous options.
Perturbation axes (4), all meaning-preserving — option order (cyclic permutations), option-label set, prompt format, and verified paraphrase (human spot-checked).
Extraction methods — first-token logprob (the field default), regex over the generated text, and cloze/answer-string likelihood, plus a bias-validated LLM judge. A default real run uses the first two: the field logprob readout I'm interrogating and the generated-text readout that defines $e^{*}$. Cloze needs per-option answer-string logprobs that chat providers don't expose, so it's held back for providers that do; the judge is the independent bias gate, not part of $e^{*}$. Crossing extraction with the axes on the same items is the actual contribution.
Statistics — per-item flip rate and entropy, with paired/clustered bootstrap CIs (clustered by item), so the artifact-vs-genuine split is a difference of paired quantities rather than two independent point estimates.
Two validation gates, enforced in the run — the judge is checked against human labels and for its own position bias, and the paraphrase axis is gated on a human meaning/difficulty spot-check, before either is allowed to affect the numbers. The gate machinery is wired in as a hard precondition on a run, not left as good intentions. One honest caveat: the human-label inputs those gates read are placeholder fixtures right now (all-passing review labels), so the gates exercise the plumbing but won't catch a real problem until a real labelled set replaces them. If I got the judge-bias gate wrong, distrust the paraphrase-axis numbers first.
The paraphrase store is a fixture too — the paraphrase axis reads pre-generated rewrites from data/paraphrases/store.jsonl, which currently holds a single example record. Before the paraphrase axis runs on the real dataset, that store has to be populated with the paraphrase generator (generate_paraphrases, the one model-touching step of the axis).
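To make the order axis and the generated-text readout concrete, here is a minimal sketch of a cyclic option permutation, a regex readout, and the mapping back to a canonical option. The prompt template, the regex, and every function name are assumptions made for illustration; flipbench's real perturbation and extraction modules may differ.

```python
import re
from string import ascii_uppercase
from typing import List, Optional, Tuple


def rotate_options(options: List[str], shift: int) -> Tuple[List[str], List[int]]:
    """Cyclic permutation. order[slot] is the canonical index shown in that slot."""
    n = len(options)
    order = [(i + shift) % n for i in range(n)]
    return [options[i] for i in order], order


def render(question: str, shown: List[str]) -> str:
    """One hypothetical prompt format: lettered options, then an answer cue."""
    lines = [question] + [f"{ascii_uppercase[i]}. {opt}" for i, opt in enumerate(shown)]
    return "\n".join(lines) + "\nAnswer:"


# Matches e.g. "The answer is C." or "Answer: (B)"; uppercase labels A-J only.
ANSWER_RE = re.compile(r"[Aa]nswer\s*(?:is|:)\s*\(?([A-J])\b")


def extract_canonical(text: str, order: List[int]) -> Optional[int]:
    """Regex readout mapped back to the canonical option index, or None."""
    match = ANSWER_RE.search(text)
    if match is None:
        return None
    slot = ascii_uppercase.index(match.group(1))
    return order[slot] if slot < len(order) else None


options = ["red", "green", "blue", "gold"]  # canonical order; "blue" is index 2
shown, order = rotate_options(options, shift=1)
print(render("Which colour?", shown))
# In the rotated prompt "blue" sits in slot B, so "B" maps back to canonical index 2.
print(extract_canonical("The answer is B.", order))
```

Two variants count as a flip only if their canonical indices differ, so a model that says "C" on the original prompt and "B" on this rotation has not changed its answer.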

Built on Inspect AI.

Running it

make install    # uv sync (Python 3.12)
make test       # full test suite
make smoke      # offline end-to-end on a stub model — no API keys, no network
make pilot      # cost estimate from 50 items (needs API keys / local Ollama)
make run        # full factorial -> results parquet
make report     # decomposition table + figures

make smoke is the one to run first: it wires fixture items through perturb → a deterministic stub model → all four extractors → the real factorial driver → the decomposition → the figures, with no network and no keys. If it passes, the plumbing is sound. Regenerate the illustrative figures above with uv run python scripts/demo_figures.py.

Cost

The real run targets $20–60. Open and reasoning models run locally through Ollama for free; only the two API models cost anything. The pilot estimates spend on 50 items first, and a full run that would blow the budget refuses to start and prints the projected number instead of surprising you with a bill.
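The guard described here can be sketched as follows. This assumes cost scales linearly with item count, which is a simplification, and it uses the 60 USD upper end of the stated target as the default budget. `project_cost`, `check_budget`, and `BudgetExceeded` are hypothetical names, not the runner's real interface, and the pilot numbers are invented.

```python
class BudgetExceeded(RuntimeError):
    """Raised before any model call when the projection is over budget."""


def project_cost(pilot_cost_usd: float, pilot_items: int, full_items: int) -> float:
    """Extrapolate full-run spend from a pilot, assuming cost is linear in items."""
    if pilot_items <= 0:
        raise ValueError("pilot must cover at least one item")
    return pilot_cost_usd / pilot_items * full_items


def check_budget(projected_usd: float, budget_usd: float = 60.0) -> None:
    """Refuse to start and report the projected number instead of running."""
    if projected_usd > budget_usd:
        raise BudgetExceeded(
            f"projected {projected_usd:.2f} USD exceeds budget {budget_usd:.2f} USD; not starting"
        )


# Hypothetical pilot: 50 items cost 1.50 USD; the full run has 1000 items.
projected = project_cost(1.50, 50, 1000)
check_budget(projected)  # about 30 USD, under the default budget, so no exception
print(f"projected spend: {projected:.2f} USD")
```

Raising before the first request is the design choice that matters: the failure mode is a printed projection, not a partial run and a bill.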

Repository layout

src/flipbench/
  schema.py          canonical Item / PerturbedItem / ModelOutput / ResultRow types
  items/             load + normalize MMLU-Pro philosophy and GSM8K
  perturbations/     order · label_set · format · paraphrase (pure, seeded)
  extraction/        first_token_logprob · cloze · regex_freeform · llm_judge
  models/            Inspect AI providers (Anthropic, OpenAI/Gemini, Ollama) + capability flags
  runner/            cache · cost guard · factorial driver · CLI
  analysis/          flip rate · clustered bootstrap · the decomposition · flip predictor
  validation/        judge-bias gate · paraphrase gate
  report/            figures · decomposition table
  publish.py         push the verified perturbation set as a Hugging Face dataset
docs/                design spec, implementation plan, writeup skeleton
scripts/             demo_figures.py (the illustrative figures above)

Where this sits in the literature

flipbench engages prior work directly rather than rediscovering it:

"Flaw or Artifact? Rethinking Prompt Sensitivity" (arXiv:2509.01790) — showed instability is largely an extraction artifact, for paraphrasing. flipbench extends that test to the order and format axes and to current models.
SCORE (arXiv:2503.00137) — multi-axis robustness with a per-item consistency rate, one extraction method, mid-2024 open models. flipbench crosses axes on the same items, treats extraction as a factor, and adds paired CIs.
FormatSpread (arXiv:2310.11324), selection bias / PriDe (arXiv:2309.03882), "My Answer is C" (arXiv:2402.14499 — first-token vs. generated text disagree >60% of the time), "Look at the Text" (COLM 2024), and "Adding Error Bars to Evals" (arXiv:2411.00640) for the statistics.

Status

The full pipeline is implemented and green — items, perturbations, extraction, models, runner, analysis, the two validation gates, and reporting — with a passing test suite (including a synthetic planted-decomposition test that checks the estimator recovers a known genuine/artifact split), and it type-checks clean under pyright.

It has not been run against real models yet. That's the next step: a 50-item pilot to pin down cost, then a full run on philosophy + GSM8K across a Claude and a GPT/Gemini model plus local Llama-3.1-8B and Qwen-2.5-7B through Ollama. The design spec, implementation plan, and writeup skeleton live under docs/.
