diff --git a/app/public/assets/posts/introducing-policybench.webp b/app/public/assets/posts/introducing-policybench.webp
new file mode 100644
index 000000000..c644d1591
Binary files /dev/null and b/app/public/assets/posts/introducing-policybench.webp differ
diff --git a/app/src/data/posts/articles/introducing-policybench.md b/app/src/data/posts/articles/introducing-policybench.md
new file mode 100644
index 000000000..e6db97a80
--- /dev/null
+++ b/app/src/data/posts/articles/introducing-policybench.md
@@ -0,0 +1,47 @@
+People increasingly ask AI models to help them understand their taxes and benefits. We built PolicyBench to measure a concrete question underneath that: can a frontier model compute a household's taxes and benefits from the prompt alone — no tools, no lookups, one structured answer?
+
+[PolicyBench](https://policybench.org) gives each model the same household description and asks for every scored tax and benefit output plus a short explanation, with no tool use, and scores the answers against PolicyEngine-US.
+
+## The setup
+
+PolicyBench evaluates 13 frontier models on 100 US households drawn from PolicyEngine's populace microdata, across 18 tax and benefit outputs per household, for tax year 2026.
+
+The headline metric is exact match: the share of outputs where a model's value equals PolicyEngine's exactly — to the dollar for amounts, and the right decision for eligibility flags. It is the deployability bar, since a household filing taxes or claiming a benefit needs the right number, not a near one. The public leaderboard weights each output by its impact on household resources across the population, which keeps exact match informative even though the reference is zero-inflated: 84% of reference outputs are exact zeros, because most households are not eligible for most programs, so an unweighted exact rate would mostly measure how often a model answers zero. We report a within-1% hit rate alongside it as a near-miss-tolerant companion.
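The scoring described in the setup can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not PolicyBench's actual implementation: the `Cell` type, its field names, the rounding rule for "to the dollar", and the example weights are all hypothetical, and the real impact weighting is documented in the paper.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Cell:
    """One scored output for one household (hypothetical structure)."""
    predicted: float     # model's answer; eligibility flags encoded as 0.0 or 1.0
    reference: float     # reference value for the same output
    weight: float = 1.0  # stand-in for the output's impact weight (assumed)


def is_exact(cell: Cell) -> bool:
    # Assumption: "to the dollar" means equal after rounding to whole dollars.
    return round(cell.predicted) == round(cell.reference)


def is_within_1pct(cell: Cell) -> bool:
    # An exact match always counts; otherwise allow a 1% relative error.
    # With a zero reference the tolerance is zero, so only zero is a hit.
    tolerance = 0.01 * abs(cell.reference)
    return is_exact(cell) or abs(cell.predicted - cell.reference) <= tolerance


def weighted_rate(cells: List[Cell], hit: Callable[[Cell], bool]) -> float:
    # Share of total weight carried by the cells that count as hits.
    total = sum(c.weight for c in cells)
    if total == 0:
        return 0.0
    return sum(c.weight for c in cells if hit(c)) / total


# Three illustrative cells: a correct zero, a clear miss, and a near miss.
cells = [
    Cell(predicted=0, reference=0, weight=1),
    Cell(predicted=1200, reference=1250, weight=5),
    Cell(predicted=3020, reference=3000, weight=4),
]

print(weighted_rate(cells, is_exact))
print(weighted_rate(cells, is_within_1pct))
```

In this toy data the correct zero is the only exact match, so an unweighted exact rate would be one in three, while the weighted rate is 0.1 because the zero carries little weight. The near miss on the 3,000-dollar output lifts the within-1% rate to 0.5.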
+
+## GPT-5.5 leads
+
+GPT-5.5 ranks first, matching PolicyEngine exactly on 80.3% of its outputs. The second through ninth models fall between 76.0% and 77.8%, and GPT-5.4 nano scores lowest, at 62.2%.
+
+| Rank | Model                         | Exact | Within-1% |
+| ---- | ----------------------------- | ----- | --------- |
+| 1    | GPT-5.5                       | 80.3% | 82.6%     |
+| 2    | Gemini 3.1 Pro Preview        | 77.8% | 78.3%     |
+| 3    | Claude Opus 4.7               | 77.3% | 78.3%     |
+| 4    | Grok 4.3                      | 77.2% | 79.1%     |
+| 5    | Claude Sonnet 4.6             | 77.0% | 78.2%     |
+| 6    | Gemini 3 Flash Preview        | 76.8% | 77.5%     |
+| 7    | Gemini 3.5 Flash              | 76.1% | 76.9%     |
+| 8    | Gemini 3.1 Flash Lite Preview | 76.0% | 77.9%     |
+| 9    | Grok Build 0.1                | 76.0% | 77.0%     |
+| 10   | Claude Opus 4.8               | 72.5% | 73.8%     |
+| 11   | Claude Haiku 4.5              | 71.7% | 73.2%     |
+| 12   | GPT-5.4 mini                  | 70.4% | 72.3%     |
+| 13   | GPT-5.4 nano                  | 62.2% | 63.2%     |
+
+## Amounts are harder than eligibility
+
+The hardest outputs are the ones that require a multi-step dollar calculation. Federal and state income tax before credits are the lowest, at 47.1% and 53.3% exact: each requires selecting the right income concepts, applying exclusions and thresholds, and sequencing them correctly before arriving at a number. Payroll tax (68.2%) and SNAP (76.8%), a benefit paid as a computed amount, sit just above.
+
+The highest-scoring outputs are eligibility flags and programs that are zero for most households. Medicare, CHIP, WIC, SSI, TANF, and school-meal outputs mostly exceed 95% exact. Part of that is that many are yes/no eligibility decisions rather than amounts; part is that most households do not qualify, and a correct "does not apply" is easier than a correct dollar figure.
+
+So the divide is less taxes versus benefits than computed amounts versus eligibility. SNAP, a computed benefit, scores like the income taxes, while TANF and SSI — paid to a small share of households and zero for the rest — score above 96%. Models reliably decide whether a program applies; they are less reliable at computing how much.
+
+## Auditing the errors
+
+A benchmark is only as good as its reference. We reviewed by hand every cell where a model's answer was not an exact match — 3,300 in all. Every one was a model error; none was a PolicyEngine reference error.
+
+## The data and code are open
+
+PolicyBench includes a live leaderboard, a scenario explorer that exposes every prompt and every PolicyEngine reference output, and a [full paper](https://policybench.org/paper) documenting the method, scoring, and uncertainty. Because PolicyEngine is an open, deterministic calculator, it can serve as a reusable, open reference for evaluating AI on tax and benefit tasks.
+
+Explore the benchmark at [policybench.org](https://policybench.org) and read the [paper](https://policybench.org/paper).
diff --git a/app/src/data/posts/posts.json b/app/src/data/posts/posts.json
index 7ef2f15ae..f5ae248cf 100644
--- a/app/src/data/posts/posts.json
+++ b/app/src/data/posts/posts.json
@@ -1,4 +1,13 @@
 [
+  {
+    "title": "Introducing PolicyBench: how accurately can AI compute taxes and benefits?",
+    "description": "We tested 13 frontier language models on 100 representative US households, with no tools, scored against PolicyEngine. GPT-5.5 leads, getting 80.3% exactly right.",
+    "date": "2026-06-16",
+    "tags": ["us", "ai", "featured"],
+    "authors": ["max-ghenis"],
+    "filename": "introducing-policybench.md",
+    "image": "introducing-policybench.webp"
+  },
   {
     "title": "The Keep Your Pay Act's AMT interaction",
     "description": "Senator Booker's proposed standard deduction increase triggers AMT for some high earners, clawing back over half of tax savings compared to a simple bracket calculation.",
diff --git a/changelog_entry.yaml b/changelog_entry.yaml
index 05a557f39..eb09a559c 100644
--- a/changelog_entry.yaml
+++ b/changelog_entry.yaml
@@ -1,6 +1,7 @@
 - bump: patch
   changes:
     added:
+    - Add the Introducing PolicyBench research post
     - Add UK Universal Credit rebalancing analysis dashboard at /uk/uc-rebalancing
     - Add UK fuel duty rise cancellation analysis dashboard at /uk/cancelling-fuel-duty-rise
     changed: