Probity

A reliability + accuracy benchmark for LLMs on real fundraising documents.

Probity measures how trustworthy a language model is when it reads the legal and financial documents that decide who owns what in a startup financing — term sheets, charters, SAFEs, convertible notes, cap tables. It reports two numbers that are usually conflated and shouldn't be:

Wobble (the core metric) — does the model give the same answer when you ask it the same question 20 times at temperature 0.7? A model whose answer flips run to run cannot be trusted in a workflow that touches money, even when it is often right. This is label-free: it needs no ground truth, only repetition.
Accuracy — does the model get the answer right, graded against a validated answer that a human extracted from the source document (not authored by an AI)?

These are scored separately and never averaged into one headline — a model can be perfectly consistent and consistently wrong. Models are run across a size ladder (1B → 12B local, plus a hosted model) to test whether wobble falls as capability rises. Heavier models (a 27B local model and hosted frontier models) are reserved for a single comprehensive sweep once every test is built.
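A minimal sketch of why the two numbers are kept apart, assuming each item's runs are stored as a list of answer strings. The function names and toy data are illustrative, not the repo's actual scorer, and it uses 4 runs per item instead of 20 to stay short.

```python
from collections import Counter

def wobble(runs_per_item):
    """Share of items whose repeated runs produced more than one distinct answer."""
    flipped = sum(1 for runs in runs_per_item.values() if len(set(runs)) > 1)
    return flipped / len(runs_per_item)

def majority_accuracy(runs_per_item, labels):
    """Share of items whose most common answer equals the validated label."""
    correct = 0
    for item_id, runs in runs_per_item.items():
        # On a tie, Counter keeps the answer that appeared first.
        majority, _count = Counter(runs).most_common(1)[0]
        correct += majority == labels[item_id]
    return correct / len(runs_per_item)

# Hypothetical toy data, not taken from the benchmark corpus.
runs = {
    "clause-a": ["participating"] * 4,                              # stable and right
    "clause-b": ["non-participating"] * 4,                          # stable but wrong
    "clause-c": ["capped", "capped", "participating", "capped"],    # flips once
}
labels = {
    "clause-a": "participating",
    "clause-b": "participating",
    "clause-c": "capped",
}

print(wobble(runs))                     # 0.333... (1 of 3 items flipped)
print(majority_accuracy(runs, labels))  # 0.666... (2 of 3 majority answers correct)
```

Here clause-b never wobbles yet is always wrong, and clause-c wobbles yet its majority answer is right, so neither number can stand in for the other.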

Benchmark results

39 tests so far. Each model is run 20× per item at temperature 0.7 unless a test header states a different count. Wobble is the percentage of items answered inconsistently across runs. During build-out, each leaf (one test) is run on the fast set (gemma3:1b + deepseek); the heavier rows (llama3.2 3B, gemma4:12b, and hosted frontier models) are filled in by one comprehensive sweep once every leaf exists, which is why newer leaves show fewer rows for now.

Test 1.3.2 — Preferred-stock liquidation participation — 18 clauses (5 part / 8 non-part / 5 capped), each model run 20×/item:

Model	Size	Wobble ↓	Consistency	Accuracy	part	non-part	capped
`gemma3:1b`	1B	61%	90%	39%	0/5	7/8	0/5
`llama3.2:latest`	3B	72%	84%	44%	0/5	7/8	1/5
`gemma4:12b`	12B	0%	100%	72%	2/5	6/8	5/5
`deepseek-v4-flash`	hosted	6%	98%	67%	1/5	6/8	5/5

Test 2.1.4 — SAFE valuation cap: pre-money vs post-money — 16 clauses (10 post / 6 pre), each model run 20×/item:

Model	Size	Wobble ↓	Consistency	Accuracy	post	pre
`gemma3:1b`	1B	6%	98%	62%	10/10	0/6
`llama3.2:latest`	3B	56%	88%	81%	10/10	3/6
`gemma4:12b`	12B	0%	100%	100%	10/10	6/6
`deepseek-v4-flash`	hosted	19%	99%	100%	10/10	6/6

Test 1.4.2 — Preferred dividends: cumulative vs non-cumulative — 16 clauses (8 cumulative / 8 non-cum), each model run 20×/item:

Model	Size	Wobble ↓	Consistency	Accuracy	cumulative	non-cum
`gemma3:1b`	1B	44%	93%	88%	7/8	7/8
`deepseek-v4-flash`	hosted	0%	100%	100%	8/8	8/8

Test 6.3 — Equity vesting acceleration: single-trigger vs double-trigger — 13 clauses (6 single / 7 double), each model run 20×/item:

Model	Size	Wobble ↓	Consistency	Accuracy	single	double
`gemma3:1b`	1B	46%	97%	85%	4/6	7/7
`deepseek-v4-flash`	hosted	0%	100%	100%	6/6	7/7

Test 1.3.4 — Multi-series preference seniority: pari-passu vs stacked — 11 clauses (6 pari-passu / 5 stacked), each model run 20×/item:

Model	Size	Wobble ↓	Consistency	Accuracy	pari-passu	stacked
`gemma3:1b`	1B	45%	97%	45%	0/6	5/5
`deepseek-v4-flash`	hosted	0%	100%	82%	4/6	5/5

Test 8.1 — Risk flag: off-market liquidation preference (>1x) — 10 clauses (5 off-market(>1x) / 5 standard(1x)), each model run 20×/item:

Model	Size	Wobble ↓	Consistency	Accuracy	off-market(>1x)	standard(1x)
`gemma3:1b`	1B	40%	95%	40%	3/5	1/5
`deepseek-v4-flash`	hosted	10%	99%	90%	5/5	4/5

Test 1.7 — Redemption rights: redeemable vs non-redeemable — 10 clauses (5 redeemable / 5 non-redeem), each model run 20×/item:

Model	Size	Wobble ↓	Consistency	Accuracy	redeemable	non-redeem
`gemma3:1b`	1B	20%	97%	50%	0/5	5/5
`deepseek-v4-flash`	hosted	10%	96%	100%	5/5	5/5

Test 5.6 — Transfer agreements: drag-along (obligation) vs co-sale (right) — 12 clauses (6 drag(obligated) / 6 co-sale(right)), each model run 20×/item:

Model	Size	Wobble ↓	Consistency	Accuracy	drag(obligated)	co-sale(right)
`gemma3:1b`	1B	17%	99%	42%	4/6	1/6
`deepseek-v4-flash`	hosted	8%	97%	100%	6/6	6/6

Test 5.5 — Right of First Refusal & Co-Sale: investor transfer right present vs absent — 12 clauses (6 rofr/cosale / 6 absent/other-right), each model run 20×/item:

Model	Size	Wobble ↓	Consistency	Accuracy	rofr/cosale	absent/other-right
`gemma3:1b`	1B	17%	98%	67%	6/6	2/6
`deepseek-v4-flash`	hosted	17%	94%	92%	6/6	5/6

Test 5.4 — Pro-rata right on future financings: granted vs not — 12 clauses (6 pro-rata / 6 absent/waived), each model run 20×/item:

Model	Size	Wobble ↓	Consistency	Accuracy	pro-rata	absent/waived
`gemma3:1b`	1B	17%	94%	100%	6/6	6/6
`deepseek-v4-flash`	hosted	8%	100%	100%	6/6	6/6

Test 6.2 — Vesting schedule: cliff present vs absent — 12 clauses (6 cliff / 6 no-cliff), each model run 20×/item:

Model	Size	Wobble ↓	Consistency	Accuracy	cliff	no-cliff
`gemma3:1b`	1B	17%	96%	67%	6/6	2/6
`deepseek-v4-flash`	hosted	0%	100%	92%	6/6	5/6

Test 5.2 — Protective provisions: investor class-veto right present vs absent — 12 clauses (6 veto-right / 6 absent), each model run 20×/item:

Model	Size	Wobble ↓	Consistency	Accuracy	veto-right	absent
`gemma3:1b`	1B	17%	97%	58%	6/6	1/6
`deepseek-v4-flash`	hosted	0%	100%	100%	6/6	6/6

Test 5.3 — Information rights: live financial-reporting obligation vs absent — 12 clauses (6 info-rights / 6 absent/waived), each model run 20×/item:

Model	Size	Wobble ↓	Consistency	Accuracy	info-rights	absent/waived
`gemma3:1b`	1B	17%	95%	50%	6/6	0/6
`deepseek-v4-flash`	hosted	0%	100%	92%	5/6	6/6

Test 5.7 — Vesting acceleration: granted on trigger vs absent — 9 clauses (6 accelerates / 3 no-acceleration), each model run 20×/item:

Model	Size	Wobble ↓	Consistency	Accuracy	accelerates	no-acceleration
`gemma3:1b`	1B	44%	93%	67%	4/6	2/3
`deepseek-v4-flash`	hosted	0%	100%	100%	6/6	3/3

Test 1.3.1 — Liquidation preference multiple: 1x vs 2x vs 3x vs other — 13 clauses (0 non-part / 4 1x / 5 2x / 4 3x / 0 other), each model run 20×/item:

Model	Size	Wobble ↓	Consistency	Accuracy	non-part	1x	2x	3x	other
`gemma3:1b`	1B	46%	91%	0%	-	0/4	0/5	0/3	-
`deepseek-v4-flash`	hosted	46%	87%	62%	-	3/4	1/5	4/4	-

Test 5.1 — Board seats: number an investor has the right to designate — 9 clauses (values range 1-9), each model run 20×/item:

Model	Size	Wobble ↓	Consistency	Accuracy
`gemma3:1b`	1B	44%	92%	78%
`deepseek-v4-flash`	hosted	11%	97%	78%

Test 2.1.6 — SAFE pro-rata side letter: granted vs absent — 15 clauses (9 pro-rata / 6 absent), each model run 20×/item:

Model	Size	Wobble ↓	Consistency	Accuracy	pro-rata	absent
`gemma3:1b`	1B	20%	96%	93%	9/9	5/6
`deepseek-v4-flash`	hosted	0%	100%	100%	9/9	6/6

Test 1.1.2 — Priced round basis: pre-money vs post-money — 21 clauses (15 pre / 6 post), each model run 20×/item:

Model	Size	Wobble ↓	Consistency	Accuracy	pre	post
`gemma3:1b`	1B	86%	84%	71%	9/15	6/6
`deepseek-v4-flash`	hosted	0%	100%	95%	15/15	5/6

Test 8.2 — Risk flag: full-ratchet anti-dilution present vs absent — 7 clauses (4 full-ratchet / 3 absent), each model run 28×/item:

Model	Size	Wobble ↓	Consistency	Accuracy	full-ratchet	absent
`gemma3:1b`	1B	0%	100%	57%	4/4	0/3
`deepseek-v4-flash`	hosted	0%	100%	86%	4/4	2/3

Test 1.1.1 — Post-money valuation extraction — 4 clauses (values range 5000000-275000000), each model run 30×/item:

Model	Size	Wobble ↓	Consistency	Accuracy
`gemma3:1b`	1B	25%	96%	25%
`deepseek-v4-flash`	hosted	0%	100%	50%

Test 1.5.1 — Anti-dilution mechanism: full-ratchet vs weighted-average vs none — 5 clauses (2 full-ratchet / 2 weighted-avg / 0 broad-based / 0 narrow-based / 1 none), each model run 20×/item:

Model	Size	Wobble ↓	Consistency	Accuracy	full-ratchet	weighted-avg	broad-based	narrow-based	none
`gemma3:1b`	1B	0%	100%	40%	2/2	0/2	-	-	0/1
`deepseek-v4-flash`	hosted	0%	100%	100%	2/2	2/2	-	-	1/1

Test 8.3 — Risk flag: uncapped participating-preferred present vs absent — 13 clauses (4 uncapped / 9 capped/none), each model run 20×/item:

Model	Size	Wobble ↓	Consistency	Accuracy	uncapped	capped/none
`gemma3:1b`	1B	8%	99%	31%	4/4	0/9
`deepseek-v4-flash`	hosted	8%	100%	85%	2/4	9/9

Test 2.1.5 — SAFE Most-Favored-Nation clause: present vs absent — 7 clauses (4 MFN / 3 absent), each model run 20×/item:

Model	Size	Wobble ↓	Consistency	Accuracy	MFN	absent
`gemma3:1b`	1B	29%	95%	100%	4/4	3/3
`deepseek-v4-flash`	hosted	0%	100%	100%	4/4	3/3

Test 1.3.3 — Participation cap multiple extraction — 3 clauses (values range 3-3.5), each model run 20×/item:

Model	Size	Wobble ↓	Consistency	Accuracy
`gemma3:1b`	1B	0%	100%	100%
`deepseek-v4-flash`	hosted	0%	100%	100%

Test 6.4 — Stock option exercise (strike) price extraction — 7 clauses (values range 0.03-11.0), each model run 20×/item:

Model	Size	Wobble ↓	Consistency	Accuracy
`gemma3:1b`	1B	71%	87%	57%
`deepseek-v4-flash`	hosted	29%	90%	43%

Test 2.1.1 — SAFE valuation cap extraction — 8 clauses (values range 15000000-150000000), each model run 20×/item:

Model	Size	Wobble ↓	Consistency	Accuracy
`gemma3:1b`	1B	38%	96%	88%
`deepseek-v4-flash`	hosted	0%	100%	100%

Test 2.2.1 — Convertible note principal amount extraction — 7 clauses (values range 12500-17364375), each model run 20×/item:

Model	Size	Wobble ↓	Consistency	Accuracy
`gemma3:1b`	1B	0%	100%	100%
`deepseek-v4-flash`	hosted	0%	100%	100%

Test 2.1.2 — SAFE discount rate extraction — 9 clauses (values range 10-50), each model run 20×/item:

Model	Size	Wobble ↓	Consistency	Accuracy
`gemma3:1b`	1B	44%	89%	56%
`deepseek-v4-flash`	hosted	0%	100%	100%

Test 2.2.4 — Convertible note valuation cap extraction — 4 clauses (values range 25000000-125000000), each model run 20×/item:

Model	Size	Wobble ↓	Consistency	Accuracy
`gemma3:1b`	1B	50%	88%	50%
`deepseek-v4-flash`	hosted	0%	100%	100%

Test 1.6.1 — Preferred-stock conversion ratio extraction — 5 clauses (values range 1-8000), each model run 20×/item:

Model	Size	Wobble ↓	Consistency	Accuracy
`gemma3:1b`	1B	20%	92%	80%
`deepseek-v4-flash`	hosted	20%	98%	100%

Test 1.5.2 — Anti-dilution weighted-average base: broad-based vs narrow-based vs n/a — 10 clauses (3 broad / 4 narrow / 3 n/a), each model run 20×/item:

Model	Size	Wobble ↓	Consistency	Accuracy	broad	narrow	n/a
`gemma3:1b`	1B	40%	96%	70%	3/3	4/4	0/3
`deepseek-v4-flash`	hosted	0%	100%	100%	3/3	4/4	3/3

Test 1.6.2 — Automatic conversion (QPO) proceeds threshold extraction — 5 clauses (values range 30000000-100000000), each model run 20×/item:

Model	Size	Wobble ↓	Consistency	Accuracy
`gemma3:1b`	1B	0%	100%	100%
`deepseek-v4-flash`	hosted	0%	100%	100%

Test 2.2.2 — Convertible note interest rate extraction — 6 clauses (values range 0.28-10.0), each model run 20×/item:

Model	Size	Wobble ↓	Consistency	Accuracy
`gemma3:1b`	1B	0%	100%	83%
`deepseek-v4-flash`	hosted	0%	100%	100%

Test 2.1.3 — SAFE conversion mechanic: cap-only vs discount-only vs both (MFN) — 11 clauses (0 cap / 1 discount / 10 both-mfn), each model run 20×/item:

Model	Size	Wobble ↓	Consistency	Accuracy	cap	discount	both-mfn
`gemma3:1b`	1B	18%	98%	73%	-	0/1	8/10
`deepseek-v4-flash`	hosted	9%	100%	100%	-	1/1	10/10

Test 1.1.3 — Priced-round price-per-share extraction — 9 clauses (values range 0.0031-1.5), each model run 20×/item:

Model	Size	Wobble ↓	Consistency	Accuracy
`gemma3:1b`	1B	22%	95%	56%
`deepseek-v4-flash`	hosted	22%	94%	67%

Test 2.2.3 — Convertible note maturity date extraction — 4 clauses (values range 2005-03-31-2026-12-31), each model run 20×/item:

Model	Size	Wobble ↓	Consistency	Accuracy
`gemma3:1b`	1B	50%	94%	50%
`deepseek-v4-flash`	hosted	0%	100%	100%

Test 2.2.5 — Convertible note conversion-discount rate extraction — 4 clauses (values range 5.0-50.0), each model run 20×/item:

Model	Size	Wobble ↓	Consistency	Accuracy
`gemma3:1b`	1B	0%	100%	25%
`deepseek-v4-flash`	hosted	0%	100%	100%

Test 2.2.6 — Convertible note Qualified Financing proceeds threshold extraction — 2 clauses (values range 10000000-40000000), each model run 20×/item:

Model	Size	Wobble ↓	Consistency	Accuracy
`gemma3:1b`	1B	0%	100%	100%
`deepseek-v4-flash`	hosted	0%	100%	100%

Test 6.1 — Equity vesting schedule extraction + normalization — 9 clauses (values range 1.5yr/no-cliff-4yr/no-cliff), each model run 20×/item:

Model	Size	Wobble ↓	Consistency	Accuracy
`gemma3:1b`	1B	0%	100%	25%
`deepseek-v4-flash`	hosted	22%	95%	89%

What the columns mean:

Wobble (headline, lower is better): the share of items where the model gave more than one answer across its repeated identical runs (20 per item unless the test header says otherwise). A model that wobbles can't be trusted in a money workflow even when it's often right.
Consistency: the average agreement within each item's runs (how often they matched that item's most common answer). Wobble counts whether an item flipped; Consistency measures how much.
Accuracy: the share of items whose majority answer matched the human-validated truth.
Class columns (right-hand side): accuracy within each true class (correct / total), so a model can't score well by always guessing the most common class. A "-" marks a class with no items in that test.
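The Consistency and per-class definitions above can be sketched as follows. This is an illustration under the same assumption of one list of answer strings per item. The helper names and the toy data are hypothetical and are not the engine's scorer.

```python
from collections import Counter, defaultdict

def consistency(runs_per_item):
    """Mean fraction of runs per item that match that item's most common answer."""
    shares = []
    for runs in runs_per_item.values():
        top_count = Counter(runs).most_common(1)[0][1]
        shares.append(top_count / len(runs))
    return sum(shares) / len(shares)

def per_class_accuracy(runs_per_item, labels):
    """Map each true class to (correct, total), judged on the majority answer."""
    tally = defaultdict(lambda: [0, 0])
    for item_id, runs in runs_per_item.items():
        truth = labels[item_id]
        majority = Counter(runs).most_common(1)[0][0]
        tally[truth][0] += majority == truth
        tally[truth][1] += 1
    return {cls: tuple(pair) for cls, pair in tally.items()}

# Hypothetical model that almost always answers "post", 5 runs per item.
runs = {
    "s1": ["post"] * 5,
    "s2": ["post"] * 5,
    "s3": ["post", "post", "post", "pre", "post"],
    "s4": ["post"] * 5,
}
labels = {"s1": "post", "s2": "post", "s3": "post", "s4": "pre"}

print(consistency(runs))                 # roughly 0.95
print(per_class_accuracy(runs, labels))  # {'post': (3, 3), 'pre': (0, 1)}
```

Overall accuracy on this toy set is 3 of 4, but the class split shows the model never gets a "pre" item right. The `gemma3:1b` row of Test 2.1.4 has the same pattern (10/10 post, 0/6 pre).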

Full per-item breakdown — including which clauses make each model wobble — in results/RESULTS.md.

Why the answers are trustworthy

Most LLM benchmarks in niche domains are built from synthetic data with synthetic answers. That has a hidden flaw: if an AI writes both the question and the answer key, the answer key can be wrong in exactly the ways the model under test is wrong. Probity avoids this with a strict oracle layer:

Source a real document that contains the ground truth in its own authoritative text — for example, a Certificate of Incorporation filed with the SEC that states, in legally precise language, whether its preferred stock is participating.
A human separates the question from the answer. The model sees only the clause (the question). The validated label, plus the exact quote that proves it, is stored in a separate oracle file the model never sees. Items whose answer cannot be determined with confidence are excluded, not guessed.
Run only the question through each model, N times, and score the majority answer against the validated label.
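One way to picture the separation in the steps above, as a hedged sketch: the record types, field names, and the sample clause are all hypothetical, and the repo's actual oracle file format is not described here. The aim is that the prompt builder only ever receives the question record, and that every oracle entry carries a quote that can be checked verbatim against the source text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Question:
    item_id: str
    clause: str   # the only text the model is shown

@dataclass(frozen=True)
class OracleEntry:
    item_id: str
    label: str    # human-validated answer
    quote: str    # exact passage from the source document that proves the label

def check_oracle(source_text, entry):
    """Reject an oracle entry whose supporting quote is not verbatim in the source."""
    if entry.quote not in source_text:
        raise ValueError(f"{entry.item_id}: quote not found in source document")

def build_prompt(question):
    """Build the prompt from the Question alone; no OracleEntry is passed in."""
    return f"Classify the following clause.\n\n{question.clause}"

# Illustrative clause written for this example, not quoted from a real filing.
source_text = (
    "After payment of the preference, the Series A Preferred shall "
    "participate with the Common Stock on an as-converted basis."
)
q = Question("demo-1", source_text)
o = OracleEntry("demo-1", "participating", "shall participate with the Common Stock")

check_oracle(source_text, o)   # passes silently because the quote is verbatim
prompt = build_prompt(q)       # contains the clause, never the label or quote fields
```

Keeping the two records in separate types (and, in practice, separate files) means a leak of the answer into the prompt would have to be written deliberately rather than happen by accident.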

Synthetic instantiation is used only to multiply difficulty (varying numbers, off-market terms, ambiguous phrasing) on top of a real, human-validated seed — never as the sole source of truth.

The test map

Probity's full test backlog is a structured map of fundraising-reasoning capabilities (engine/registry.json) — 67 atomic checks across priced equity, convertibles, cap-table math, exit waterfalls, investor rights, founder equity, regulatory filings, and off-market risk flags. Each check is built one at a time, to depth, against real sourced documents.

Structure

engine/    the model-agnostic core: clients, run harness, normalizer, reliability+accuracy scorers
leaves/    one folder per test, each with its real-document corpus, its separated oracle, and its runner
results/   the living benchmark table
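The listing above mentions a normalizer without describing it. As a guess at why the extraction tests need one: at temperature 0.7 a model may write the same amount as "$5,000,000", "5 million", or "5M", and comparing raw strings would count those as different answers. The sketch below is an assumption about the general idea, not the engine's implementation, and the function name is hypothetical.

```python
import re

# Multipliers for common magnitude suffixes (assumed set, not exhaustive).
_SCALE = {"k": 1e3, "thousand": 1e3, "m": 1e6, "million": 1e6, "b": 1e9, "billion": 1e9}

_AMOUNT = re.compile(r"(\d[\d,]*(?:\.\d+)?)\s*(thousand|million|billion|[kmb])?\b")

def normalize_amount(text):
    """Parse the first money-like number in text into a float, or None if absent."""
    match = _AMOUNT.search(text.lower())
    if not match:
        return None
    number = float(match.group(1).replace(",", ""))
    return number * _SCALE.get(match.group(2), 1)

answers = ["$5,000,000", "5 million", "5M", "USD 5.0m post-money"]
print({normalize_amount(a) for a in answers})  # {5000000.0}
```

With a step like this applied before scoring, all four phrasings collapse to one value, so a run is only counted as a flip when the number itself changes.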

Running a test

cd leaves/<test_name>
python3 run.py          # runs the corpus through gemma + DeepSeek, scores accuracy + reliability

Models default to a local Ollama model (gemma4:12b, zero egress) and DeepSeek (deepseek-v4-flash). API keys are read from the environment, never committed.
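A small sketch of reading a key from the environment and failing loudly when it is missing. The variable name `DEEPSEEK_API_KEY` and the helper are assumptions for illustration; check the engine's client code for the name it actually expects.

```python
import os

def load_api_key(var_name="DEEPSEEK_API_KEY", environ=None):
    """Return the key from the environment, raising if it is unset or blank."""
    env = os.environ if environ is None else environ
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"{var_name} is not set; export it before running")
    return key

# Typical use (the variable name is hypothetical):
#   export DEEPSEEK_API_KEY=...   in the shell, then
#   api_key = load_api_key()
```

Raising early avoids a long local run that only fails when the hosted model is first called. The optional `environ` argument exists so the helper can be exercised without touching real credentials.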

License

MIT — see LICENSE.
