0xFFD/flipbench

flipbench

Shuffle a multiple-choice question's options, relabel them from A/B/C/D to 1/2/3/4, reformat the prompt, or reword it, and a language model's answer often changes. That part is settled: there's a whole line of work showing benchmark scores wobble under changes that shouldn't matter.

The question I care about is sharper: when an answer flips, did the model change its mind, or did I just read the answer out of it badly?

It matters because a lot rides on the readout. The probability mass on the first answer token, the likelihood of the full answer string, a regex over the generated text, and an LLM judge can disagree with each other on the same output more than half the time. So some of what gets reported as "the model is unstable" is really "my measurement is unstable." A 2025 paper showed that's largely true for paraphrasing. Nobody had checked the option-order and formatting axes, on current models, with the extraction method treated as a knob you turn rather than a default you inherit.

So that's what this does: hold each question fixed, vary both the perturbation and the extraction method, and split the measured instability into two pieces.

T(a,e,m) = G(a,m) + A(a,e,m)

For a perturbation axis $a$, a model $m$, and an extraction method $e$, the total instability is the flip rate — how often two benign variants of the same item disagree once their answers are mapped back to a canonical option:

$$T(a,e,m) = \Pr_{v,\,v' \sim \mathcal{V}_a(x)}\big[\,\hat{y}_e(x_v) \neq \hat{y}_e(x_{v'})\,\big]$$
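As a concrete reading of the definition above, here is a minimal sketch of the flip rate for one (axis, extractor, model) cell. It assumes the two variants in a pair are distinct and that items are averaged with equal weight; the function names and the example answers are hypothetical, not flipbench's API.

```python
from itertools import combinations


def item_flip_rate(answers: list[str]) -> float:
    """Fraction of unordered variant pairs whose canonical answers disagree."""
    pairs = list(combinations(answers, 2))
    if not pairs:
        raise ValueError("need at least two variants per item")
    return sum(a != b for a, b in pairs) / len(pairs)


def flip_rate(answers_by_item: dict[str, list[str]]) -> float:
    """Mean per-item flip rate, weighting every item equally (an assumption)."""
    rates = [item_flip_rate(v) for v in answers_by_item.values()]
    return sum(rates) / len(rates)


# Hypothetical canonical answers for three items, four variants each.
answers = {
    "q1": ["B", "B", "B", "B"],  # stable: 0 of 6 pairs disagree
    "q2": ["A", "C", "A", "A"],  # one outlier: 3 of 6 pairs disagree
    "q3": ["D", "D", "E", "E"],  # even split: 4 of 6 pairs disagree
}
print(flip_rate(answers))  # (0 + 1/2 + 2/3) / 3, about 0.389
```

The answers must already be mapped back to canonical options before they are compared; otherwise a relabelled or reordered variant would count as a flip even when the model picked the same option.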

Pick a faithful reference readout $e^{*}$. In v1 that's the generated-text answer read by a regex, justified by the finding that, when readouts disagree, the written answer is the more robust signal ("Look at the Text"). The LLM judge is not part of $e^{*}$; it's an independent bias gate on the run (see below). Then the instability decomposes:

$$\underbrace{T(a,e,m)}_{\text{measured}} \;=\; \underbrace{G(a,m)}_{\text{genuine}} \;+\; \underbrace{A(a,e,m)}_{\text{extraction artifact}}, \qquad G(a,m) = T(a,e^{*},m), \quad A(a,e,m) = T(a,e,m) - G(a,m).$$

Genuine instability survives a faithful readout. Artifact is the part that exists only because of how you pulled the answer out. The headline claim — "much of the reported order/format sensitivity is an artifact" — is just the statement that $A$ is large relative to $G$ on those axes, with confidence intervals that don't overlap. I'm not trying to show instability exists; it does. I'm trying to measure how much of it is the readout.
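The decomposition is just a subtraction against the reference readout. The sketch below shows it for one axis and one model. The `decompose` function and the flip rates are made up for illustration; only the extractor names are borrowed from the repository layout.

```python
def decompose(
    total_by_extractor: dict[str, float], reference: str
) -> dict[str, dict[str, float]]:
    """Split each extractor's flip rate T into genuine G and artifact A.

    G is the flip rate under the reference readout e*; A = T - G.
    """
    genuine = total_by_extractor[reference]
    return {
        extractor: {
            "total": total,
            "genuine": genuine,
            "artifact": total - genuine,
        }
        for extractor, total in total_by_extractor.items()
    }


# Hypothetical flip rates for one axis and one model.
rates = {"regex_freeform": 0.10, "first_token_logprob": 0.40}
for extractor, parts in decompose(rates, reference="regex_freeform").items():
    print(extractor, parts)
```

By construction the artifact of the reference readout is zero, and nothing stops $A$ from being negative if some extractor happens to be more stable than $e^{*}$.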

What it produces

The figures below are illustrative. They come from scripts/demo_figures.py, which runs synthetic data with a planted structure through flipbench's real decompose / figure / table code. No model has been run yet — this shows the output shape and the hypothesis the tool is built to test, not a result. Real numbers come from a pilot and then a full run (see Status).

The decomposition. Genuine (blue) at the base, extraction artifact (orange) stacked on top, with clustered-bootstrap CIs. On the structural axes most of the bar is orange; on paraphrase it's mostly blue.

[Figure: Instability decomposition, genuine + artifact per axis and model]

Why the readout matters. Flip rate under each extraction method. First-token logprob (the field default) reports far more instability than reading the generated text or asking a judge — except on paraphrase, where they converge, because that instability is real.

[Figure: Flip rate by extraction method]

| axis | model | genuine $G$ | total $T$ (logprob) | artifact $A$ | artifact 95% CI |
|---|---|---|---|---|---|
| order | claude-opus | 0.08 | 0.58 | 0.50 | [0.31, 0.69] |
| label_set | claude-opus | 0.15 | 0.50 | 0.35 | [0.15, 0.54] |
| format | claude-opus | 0.08 | 0.46 | 0.38 | [0.19, 0.58] |
| paraphrase | claude-opus | 0.27 | 0.35 | 0.08 | [0.00, 0.19] |

The reading: on order, only ~0.08 of a 0.58 flip rate is the model genuinely changing its answer — the other 0.50 is an artifact of first-token scoring. On paraphrase it inverts, and the instability is mostly real.
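The artifact CI column is a bootstrap clustered by item, taken on the paired difference between the two readouts. Here is a minimal sketch of that idea using only the standard library. The percentile method, the resample count, the function name, and the per-item values are assumptions for illustration, not flipbench's actual estimator.

```python
import random


def clustered_bootstrap_ci(
    per_item_total: list[float],
    per_item_genuine: list[float],
    n_boot: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> tuple[float, float, float]:
    """Point estimate and percentile CI for A = mean(total) - mean(genuine).

    Items are the resampling unit. Both readouts come from the same items,
    so resampling an item carries its paired difference along with it.
    """
    if len(per_item_total) != len(per_item_genuine):
        raise ValueError("readouts must cover the same items")
    n = len(per_item_total)
    diffs = [t - g for t, g in zip(per_item_total, per_item_genuine)]
    rng = random.Random(seed)
    estimates = sorted(
        sum(rng.choices(diffs, k=n)) / n for _ in range(n_boot)
    )
    lo = estimates[int((alpha / 2) * n_boot)]
    hi = estimates[int((1 - alpha / 2) * n_boot) - 1]
    return sum(diffs) / n, lo, hi


# Hypothetical per-item flip indicators under the two readouts.
total = [1, 1, 0, 1, 0, 1, 1, 0]    # first-token logprob
genuine = [0, 1, 0, 0, 0, 1, 0, 0]  # reference readout
point, lo, hi = clustered_bootstrap_ci(total, genuine)
print(f"A = {point:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

Resampling the paired difference, rather than bootstrapping $T$ and $G$ separately, is what lets the interval reflect that both numbers move together on the same items.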

Per-item flip surface, same model, two readouts. Each row is an item, each column an axis, bright = it flipped. The field readout (left) lights up; the faithful readout (right) is much darker.

[Figures: per-item flip surface under the field readout (first-token logprob, left) and the faithful readout (regex / generated text, right)]

How it works

  • Items — one MMLU-Pro subject (philosophy, 10-option MC) plus a GSM8K arithmetic slice with unambiguous answers. The arithmetic slice is the control: if instability shows up where the answer is a number, it can't be blamed on ambiguous options.
  • Perturbation axes (4), all meaning-preserving — option order (cyclic permutations), option-label set, prompt format, and verified paraphrase (human spot-checked).
  • Extraction methods — first-token logprob (the field default), regex over the generated text, and cloze/answer-string likelihood, plus a bias-validated LLM judge. A default real run uses the first two: the field logprob readout I'm interrogating and the generated-text readout that defines $e^{*}$. Cloze needs per-option answer-string logprobs that chat providers don't expose, so it's held back for providers that do; the judge is the independent bias gate, not part of $e^{*}$. Crossing extraction with the axes on the same items is the actual contribution.
  • Statistics — per-item flip rate and entropy, with paired/clustered bootstrap CIs (clustered by item), so the artifact-vs-genuine split is a difference of paired quantities rather than two independent point estimates.
  • Two validation gates, enforced in the run — the judge is checked against human labels and for its own position bias, and the paraphrase axis is gated on a human meaning/difficulty spot-check, before either is allowed to affect the numbers. The gate machinery is wired in as a hard precondition on a run, not left as good intentions. One honest caveat: the human-label inputs those gates read are placeholder fixtures right now (all-passing review labels), so the gates exercise the plumbing but won't catch a real problem until a real labelled set replaces them. If I got the judge-bias gate wrong, distrust the paraphrase-axis numbers first.
  • The paraphrase store is a fixture too — the paraphrase axis reads pre-generated rewrites from data/paraphrases/store.jsonl, which currently holds a single example record. Before the paraphrase axis runs on the real dataset, that store has to be populated with the paraphrase generator (generate_paraphrases, the one model-touching step of the axis).

Built on Inspect AI.

Running it

make install    # uv sync (Python 3.12)
make test       # full test suite
make smoke      # offline end-to-end on a stub model — no API keys, no network
make pilot      # cost estimate from 50 items (needs API keys / local Ollama)
make run        # full factorial -> results parquet
make report     # decomposition table + figures

make smoke is the one to run first: it wires fixture items through perturb → a deterministic stub model → all four extractors → the real factorial driver → the decomposition → the figures, with no network and no keys. If it passes, the plumbing is sound. Regenerate the illustrative figures above with uv run python scripts/demo_figures.py.

Cost

The real run targets $20–60. Open and reasoning models run locally through Ollama for free; only the two API models cost anything. The pilot estimates spend on 50 items first, and a full run that would blow the budget refuses to start and prints the projected number instead of surprising you with a bill.
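The budget check described above can be sketched as a linear extrapolation from the pilot followed by a hard stop. The names `projected_cost`, `check_budget`, and `BudgetExceeded`, the linear scaling, and the dollar figures in the example are all assumptions for illustration; the real cost guard lives in the runner and may work differently.

```python
class BudgetExceeded(RuntimeError):
    """Raised when the projected spend is above the allowed budget."""


def projected_cost(pilot_cost_usd: float, pilot_items: int, full_items: int) -> float:
    """Scale pilot spend linearly to the full item count (an assumption)."""
    if pilot_items <= 0:
        raise ValueError("pilot must cover at least one item")
    return pilot_cost_usd / pilot_items * full_items


def check_budget(
    pilot_cost_usd: float, pilot_items: int, full_items: int, budget_usd: float
) -> float:
    """Return the projection, or refuse to proceed if it exceeds the budget."""
    projected = projected_cost(pilot_cost_usd, pilot_items, full_items)
    if projected > budget_usd:
        raise BudgetExceeded(
            f"projected ${projected:.2f} exceeds budget ${budget_usd:.2f}"
        )
    return projected


# Hypothetical pilot: $1.50 spent on 50 items, with a $60 ceiling.
print(check_budget(1.50, pilot_items=50, full_items=1000, budget_usd=60.0))
try:
    check_budget(1.50, pilot_items=50, full_items=3000, budget_usd=60.0)
except BudgetExceeded as err:
    print(f"refusing to start: {err}")
```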

Repository layout

src/flipbench/
  schema.py          canonical Item / PerturbedItem / ModelOutput / ResultRow types
  items/             load + normalize MMLU-Pro philosophy and GSM8K
  perturbations/     order · label_set · format · paraphrase (pure, seeded)
  extraction/        first_token_logprob · cloze · regex_freeform · llm_judge
  models/            Inspect AI providers (Anthropic, OpenAI/Gemini, Ollama) + capability flags
  runner/            cache · cost guard · factorial driver · CLI
  analysis/          flip rate · clustered bootstrap · the decomposition · flip predictor
  validation/        judge-bias gate · paraphrase gate
  report/            figures · decomposition table
  publish.py         push the verified perturbation set as a Hugging Face dataset
docs/                design spec, implementation plan, writeup skeleton
scripts/             demo_figures.py (the illustrative figures above)

Where this sits in the literature

flipbench engages prior work directly rather than rediscovering it:

  • "Flaw or Artifact? Rethinking Prompt Sensitivity" (arXiv:2509.01790) — showed instability is largely an extraction artifact, for paraphrasing. flipbench extends that test to the order and format axes and to current models.
  • SCORE (arXiv:2503.00137) — multi-axis robustness with a per-item consistency rate, one extraction method, mid-2024 open models. flipbench crosses axes on the same items, treats extraction as a factor, and adds paired CIs.
  • FormatSpread (arXiv:2310.11324), selection bias / PriDe (arXiv:2309.03882), "My Answer is C" (arXiv:2402.14499 — first-token vs. generated text disagree >60% of the time), "Look at the Text" (COLM 2024), and "Adding Error Bars to Evals" (arXiv:2411.00640) for the statistics.

Status

The full pipeline is implemented and green — items, perturbations, extraction, models, runner, analysis, the two validation gates, and reporting — with a passing test suite (including a synthetic planted-decomposition test that checks the estimator recovers a known genuine/artifact split), and it type-checks clean under pyright.

It has not been run against real models yet. That's the next step: a 50-item pilot to pin down cost, then a full run on philosophy + GSM8K across a Claude and a GPT/Gemini model plus local Llama-3.1-8B and Qwen-2.5-7B through Ollama. The design spec, implementation plan, and writeup skeleton live under docs/.

About

An LLM eval that splits measured answer instability into genuine vs. extraction-artifact components, per perturbation axis, with paired confidence intervals.
