🧩 AutismBench

Constraint satisfaction benchmark for large language models.

Write one sentence. Satisfy N rules. No LLM judge. Pure code verification.

Live Leaderboard · How It Works · Quick Start · Results · Citation

Created by the Chief of Autism. It tests what autists can, and most models can't.

AutismBench stress-tests simultaneous constraint satisfaction, precision, forward planning, and zero-tolerance rule adherence. The model must write a single sentence that satisfies N constraints at once. Start at 3, scale to 30+.

Every constraint is verified programmatically - no LLM judge, zero judge bias, fully reproducible.

Leaderboard

Levels 3-15 | 5 trials per level | temperature 0.0

#	Model	Score	Validity	Perfect Solve
1	qwen3-235b-a22b	400	59.8%	16.9%
2	llama-4-maverick	322	52.7%	6.2%
3	gpt-4.1	315	52.0%	4.6%
4	claude-sonnet-4.5	308	50.8%	4.6%
5	claude-haiku-4.5	303	50.9%	3.1%
6	gpt-4.1-nano	270	45.8%	1.5%
7	llama-4-scout	259	43.2%	3.1%
8	claude-opus-4.6	257	37.1%	13.8%
9	claude-sonnet-4.6	253	39.9%	7.7%
10	deepseek-v3.2	251	43.0%	0.0%
11	mistral-large-2512	246	39.7%	6.2%
12	deepseek-r1	245	33.5%	16.9%
13	gemini-3.1-pro-preview	183	28.8%	6.2%
14	o3	12	1.0%	3.1%
15	o4-mini	6	0.5%	1.5%

Note: o3, o4-mini, and gemini-2.5-pro score near zero due to response format issues - their reasoning output breaks sentence extraction. This is a known issue being investigated.

Additional models (levels 3-10)

#	Model	Score	Validity	Perfect Solve
1	claude-sonnet-4	175	61.2%	10.0%
2	gemini-2.0-flash-lite	133	48.9%	5.0%
3	gemini-2.5-flash	131	48.1%	2.5%

Results

More charts

Why "AutismBench"?

The benchmark is designed around cognitive strengths commonly associated with autism:

Pattern recognition under constraints - seeing how 20 rules interact and finding the one solution that satisfies all of them
Hyperfocus on detail - one wrong letter, one extra word, one broken rule = total failure
Systematizing - building a mental system of interlocking rules and working within it precisely
Rule adherence without drift - no "close enough", no rounding off, no approximation
Parallel constraint tracking - holding many rules in working memory at once without losing any

"Can your model think like an autist?" If it can't hold 20 explicit rules without dropping one, it fails.

How It Works

The model receives a prompt like this:

Write a single sentence that satisfies ALL of the following constraints:

1. Contains exactly 8 words
2. Every word starts with a unique letter
3. The 3rd word is a past-tense verb
4. Contains exactly one number (as a digit)
5. The last word is an animal name

Valid answer: Big cats devoured 5 exotic fish near eagle

Each constraint is binary pass/fail, checked by code. No ambiguity.

Difficulty Scaling

Level	Difficulty
3-5	Warm-up, most models pass
6-10	Non-reasoning models start failing
11-15	Only strong reasoning models survive
16-20	Frontier reasoning models only
21-30	Approaching NP-hard constraint satisfaction

Constraint Categories

18 constraints across 5 categories:

Category	Examples
Structural	Exact word count, character count, word length at position
Lexical	Must contain a color / animal / profession / digit
Positional	First word length, last word suffix, word starts with letter X
Relational	Unique first letters, ascending word length, acrostic
Meta	Vowel count, unique letters, word length sum, forbidden letter

Scoring

+1 point per satisfied constraint
×2 bonus if ALL constraints satisfied (perfect solve)
Validity ratio = constraints passed / constraints attempted
Perfect solve rate = tasks where every constraint passed

Quick Start

git clone https://github.com/chiefautism/autism-bench.git
cd autism-bench
uv sync

echo "OPENROUTER_API_KEY=your-key-here" > .env

# Run benchmark
uv run python main.py

# Specific models
uv run python main.py --models "anthropic/claude-sonnet-4" "openai/gpt-4.1"

# Custom levels
uv run python main.py --min-level 3 --max-level 15 --trials 5

# Generate interactive dashboard
uv run python visualization.py results/autism_bench_results_*.json --save

CLI Options

Flag	Default	Description
`--models`	`model_list.py` defaults	Space-separated model IDs (OpenRouter format)
`--min-level`	3	Minimum constraint count
`--max-level`	25	Maximum constraint count
`--trials`	10	Random tasks per level
`--temperature`	0.0	Sampling temperature
`--threads`	10	Parallel API threads
`--output`	auto-timestamped	Output JSON path
`--dry-run`	-	Preview tasks without API calls

Estimated Cost

Models	Levels	Trials	API Calls	Est. Cost
1	3-15	5	65	<$1
10	3-25	10	2,300	$10-25
50	3-30	10	14,000	$100-250

Repository Structure

├── main.py              # CLI entry point
├── autism_bench.py       # Core benchmark orchestration
├── constraint_pool.py    # 18 constraints with validators
├── validator.py          # Sentence extraction + validation
├── completions.py        # Thread-safe OpenRouter client
├── model_list.py         # Supported model identifiers
├── visualization.py      # Plotly interactive dashboard
├── utils.py              # Word lists, prompts, helpers
├── pyproject.toml        # Project config (uv)
└── TECH_TASK.md          # Full implementation spec

Why This Benchmark

Property	AutismBench
Objectively verifiable	Every constraint = binary pass/fail via code
No LLM judge	Zero judge bias, fully reproducible
No ceiling	Combinatorial explosion with more constraints
Cheap	~500 API calls per model, $2-5 total
Contamination-resistant	Random constraint combos = infinite unique tasks
What it tests	Planning, working memory, precision, constraint satisfaction

Citation

@misc{autismbench2026,
    title={AutismBench: Constraint Satisfaction Benchmark for Large Language Models},
    author={Chief of Autism},
    year={2026},
    url={https://github.com/chiefautism/autism-bench}
}

License

MIT

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

🧩 AutismBench

Leaderboard

Results

Why "AutismBench"?

How It Works

Difficulty Scaling

Constraint Categories

Scoring

Quick Start

CLI Options

Estimated Cost

Repository Structure

Why This Benchmark

Citation

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 16 Commits
assets		assets
docs		docs
.gitignore		.gitignore
.python-version		.python-version
LICENSE		LICENSE
README.md		README.md
TECH_TASK.md		TECH_TASK.md
autism_bench.py		autism_bench.py
completions.py		completions.py
constraint_pool.py		constraint_pool.py
main.py		main.py
model_list.py		model_list.py
pyproject.toml		pyproject.toml
utils.py		utils.py
uv.lock		uv.lock
validator.py		validator.py
visualization.py		visualization.py

Folders and files

Latest commit

History

Repository files navigation

🧩 AutismBench

Leaderboard

Results

Why "AutismBench"?

How It Works

Difficulty Scaling

Constraint Categories

Scoring

Quick Start

CLI Options

Estimated Cost

Repository Structure

Why This Benchmark

Citation

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages