Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
47 changes: 47 additions & 0 deletions app/src/data/posts/articles/introducing-policybench.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,47 @@
People increasingly ask AI models to help them understand their taxes and benefits. We built PolicyBench to measure a concrete question underneath that: can a frontier model compute a household's taxes and benefits from the prompt alone — no tools, no lookups, one structured answer?

[PolicyBench](https://policybench.org) gives each model the same household description and asks for every scored tax and benefit output plus a short explanation, with no tool use, and scores the answers against PolicyEngine-US.

## The setup

PolicyBench evaluates 13 frontier models on 100 US households drawn from PolicyEngine's populace microdata, across 18 tax and benefit outputs per household, for tax year 2026.

The headline metric is exact match: the share of outputs where a model's value equals PolicyEngine's exactly — to the dollar for amounts, and the right decision for eligibility flags. It is the deployability bar, since a household filing taxes or claiming a benefit needs the right number, not a near one. The public leaderboard weights each output by its impact on household resources across the population, which keeps exact match informative even though the reference is zero-inflated: 84% of reference outputs are exact zeros, because most households are not eligible for most programs, so an unweighted exact rate would mostly measure how often a model answers zero. We report a within-1% hit rate alongside it as a near-miss-tolerant companion.

## GPT-5.5 leads

GPT-5.5 ranks first, matching PolicyEngine exactly on 80.3% of its outputs. The second through ninth models fall between 76.0% and 77.8%, and GPT-5.4 nano scores lowest, at 62.2%.

| Rank | Model | Exact | Within-1% |
| ---- | ----------------------------- | ----- | --------- |
| 1 | GPT-5.5 | 80.3% | 82.6% |
| 2 | Gemini 3.1 Pro Preview | 77.8% | 78.3% |
| 3 | Claude Opus 4.7 | 77.3% | 78.3% |
| 4 | Grok 4.3 | 77.2% | 79.1% |
| 5 | Claude Sonnet 4.6 | 77.0% | 78.2% |
| 6 | Gemini 3 Flash Preview | 76.8% | 77.5% |
| 7 | Gemini 3.5 Flash | 76.1% | 76.9% |
| 8 | Gemini 3.1 Flash Lite Preview | 76.0% | 77.9% |
| 9 | Grok Build 0.1 | 76.0% | 77.0% |
| 10 | Claude Opus 4.8 | 72.5% | 73.8% |
| 11 | Claude Haiku 4.5 | 71.7% | 73.2% |
| 12 | GPT-5.4 mini | 70.4% | 72.3% |
| 13 | GPT-5.4 nano | 62.2% | 63.2% |

## Amounts are harder than eligibility

The hardest outputs are the ones that require a multi-step dollar calculation. Federal and state income tax before credits are the lowest, at 47.1% and 53.3% exact: each requires selecting the right income concepts, applying exclusions and thresholds, and sequencing them correctly before arriving at a number. Payroll tax (68.2%) and SNAP (76.8%), a benefit paid as a computed amount, sit just above.

The highest-scoring outputs are eligibility flags and programs that are zero for most households. Medicare, CHIP, WIC, SSI, TANF, and school-meal outputs mostly exceed 95% exact. Part of that is that many are yes/no eligibility decisions rather than amounts; part is that most households do not qualify, and a correct "does not apply" is easier than a correct dollar figure.

So the divide is less taxes versus benefits than computed amounts versus eligibility. SNAP, a computed benefit, scores like the income taxes, while TANF and SSI — paid to a small share of households and zero for the rest — score above 96%. Models reliably decide whether a program applies; they are less reliable at computing how much.

## Auditing the errors

A benchmark is only as good as its reference. We reviewed by hand every cell where a model's answer was not an exact match — 3,300 in all. Every one was a model error; none was a PolicyEngine reference error.

## The data and code are open

PolicyBench includes a live leaderboard, a scenario explorer that exposes every prompt and every PolicyEngine reference output, and a [full paper](https://policybench.org/paper) documenting the method, scoring, and uncertainty. Because PolicyEngine is an open, deterministic calculator, it can serve as a reusable, open reference for evaluating AI on tax and benefit tasks.

Explore the benchmark at [policybench.org](https://policybench.org) and read the [paper](https://policybench.org/paper).
9 changes: 9 additions & 0 deletions app/src/data/posts/posts.json
Original file line number Diff line number Diff line change
@@ -1,4 +1,13 @@
[
{
"title": "Introducing PolicyBench: how accurately can AI compute taxes and benefits?",
"description": "We tested 13 frontier language models on 100 representative US households, with no tools, scored against PolicyEngine. GPT-5.5 leads, getting 80.3% exactly right.",
"date": "2026-06-16",
"tags": ["us", "ai", "featured"],
"authors": ["max-ghenis"],
"filename": "introducing-policybench.md",
"image": "introducing-policybench.webp"
},
{
"title": "The Keep Your Pay Act's AMT interaction",
"description": "Senator Booker's proposed standard deduction increase triggers AMT for some high earners, clawing back over half of tax savings compared to a simple bracket calculation.",
Expand Down
1 change: 1 addition & 0 deletions changelog_entry.yaml
Original file line number Diff line number Diff line change
@@ -1,6 +1,7 @@
- bump: patch
changes:
added:
- Add the Introducing PolicyBench research post
- Add UK Universal Credit rebalancing analysis dashboard at /uk/uc-rebalancing
- Add UK fuel duty rise cancellation analysis dashboard at /uk/cancelling-fuel-duty-rise
changed:
Expand Down
Loading