# Open Weight Models

A curated list of open-weight AI models with commercially exploitable licenses, verified benchmarks, and no geographic restrictions. Built to decide which models to support in herbert-rs, a local LLM inference engine in Rust and hand-written assembly.

Selection criteria:

1. Commercially exploitable license, no geographic restriction (EU OK)
2. Total size < 200B parameters
3. Released after April 2024

This excludes Llama 4 multimodal (EU exclusion), Qwen 3.6 Plus (closed-source), full DeepSeek V3/R1 (671B), and others. Note: Llama text-only models (3.3 70B, 3.2 1B/3B) are EU-exploitable. See Rejected models for details.

Maintained by Philippe Anel. Last updated: April 2026.


## Table of Contents

- LLMs
- Specialized
- Observations
- Benchmarks reference
- Licenses
- How to choose
- Rejected models
- Contributing
- License


## LLMs

### Generalists

| Model | Publisher | Active | Total | Arch | Ctx | License | Key scores |
|---|---|---|---|---|---|---|---|
| Gemma 4 31B | Google | 31B | 31B | Dense | 256K | Apache 2.0 | GPQA 84.3, MMLU-Pro 85.2 |
| Qwen3.5-27B | Alibaba | 27B | 27B | Dense | 128K | Apache 2.0 | 201 languages |
| Qwen3.5-9B | Alibaba | 9B | 9B | Dense | 128K | Apache 2.0 | GPQA 81.7 (9B!) |
| Qwen3.5-122B-A10B | Alibaba | 10B | 122B | MoE | 256K | Apache 2.0 | 201 languages, multimodal |
| GPT-OSS-120B | OpenAI | 5.1B | 117B | MoE | 128K | Apache 2.0 | GPQA 80.9, Codeforces 2622, AIME 96.6% |
| GPT-OSS-20B | OpenAI | 3.6B | 21B | MoE | 128K | Apache 2.0 | AIME 96%, fits 16GB |
| Mistral Small 4 | Mistral | 6B | 119B | MoE | 256K | Apache 2.0 | GPQA 71.2, unified instruct/reasoning/coding |
| GLM-4.5-Air | Zhipu AI | 12B | 106B | MoE | 128K | MIT | MATH-500 98.1%, MMLU-Pro 81.4 |
| QwQ-32B | Alibaba | 32B | 32B | Dense | 128K | Apache 2.0 | AIME ~80%, reasoning RL |
| DeepSeek R1-Distill-32B | DeepSeek | 32B | 32B | Dense | 128K | MIT | Beats o1-mini |
| Step-3.5-Flash | StepFun | 11B | 196B | MoE | 262K | Apache 2.0 | SWE-bench 74.4%, 350 tok/s |
| Llama 3.3 70B | Meta | 70B | 70B | Dense | 128K | Llama Community (EU OK) | MMLU 86.0, HumanEval 88.4, MATH 77.0 |
| InternVL3-78B | Shanghai AI Lab | 78B | 78B | Dense | -- | Apache 2.0 | MMMU 72.2, SOTA open-source VLM |

### Code

| Model | SWE-bench | Codeforces | Active | License |
|---|---|---|---|---|
| Claude Opus 4.6 (closed) | 80.8% | -- | -- | -- |
| Gemini 3.1 Pro (closed) | 80.6% | -- | -- | -- |
| GPT-5.4 (closed) | ~80% | -- | -- | -- |
| Step-3.5-Flash | 74.4% | -- | 11B | Apache 2.0 |
| Devstral 2 | 72.2% | -- | ~12B | MIT modified |
| Qwen3-Coder-Next 80B-A3B | 70.6% | -- | 3B | Apache 2.0 |
| Qwen2.5-Coder-32B | 69.6% | -- | 32B | Apache 2.0 |
| Devstral Small 2 | 68.0% | -- | 24B | Apache 2.0 |
| GPT-OSS-120B | 62.4% | 2622 | 5.1B | Apache 2.0 |
| Gemma 4 31B | -- | 2150 | 31B | Apache 2.0 |

SWE-bench = real bugs in real GitHub repos (Django, Flask, scikit-learn). 500 human-validated issues. Codeforces = algorithmic competition, ELO-scored like chess. Different skills: fixing a codebase vs solving a puzzle.

### Reasoning

#### GPQA Diamond (198 questions)

Graduate-level questions in physics, chemistry, biology. Designed to be unsolvable by Google search. Experts reach 65%, non-experts 34%. The most discriminating reasoning benchmark available.

| Model | GPQA | Active |
|---|---|---|
| Gemini 3.1 Pro (closed) | 94.3 | -- |
| GPT-5.4 (closed) | 92.8 | -- |
| Claude Opus 4.6 (closed) | 91.3 | -- |
| Gemma 4 31B | 84.3 | 31B |
| Gemma 4 26B-A4B | 82.3 | 3.8B |
| Qwen3.5-9B | 81.7 | 9B |
| GPT-OSS-120B | 80.9 | 5.1B |
| GLM-4.5-Air | 75.0 | 12B |
| Nemotron 3 Nano | 73.0 | 3.5B |
| Mistral Small 4 | 71.2 | 6B |
| Llama 3.3 70B | 50.5 | 70B |

#### Math (AIME, 15 problems/year)

Competition-level math requiring creativity and multi-step reasoning. Each year's edition is different and harder. Only compare within the same version.

| Model | AIME | Conditions | Active |
|---|---|---|---|
| GPT-5.4 (closed) | ~100% | 2025 | -- |
| Claude Opus 4.6 (closed) | ~98% | 2025 | -- |
| Nemotron 3 Nano | 99.2% | 2025, with tools | 3.5B |
| GPT-OSS-120B | 96.6% | 2024, with tools | 5.1B |
| GPT-OSS-20B | 96.0% | 2024, with tools | 3.6B |
| Gemma 4 31B | 89.2% | 2026 | 31B |
| Gemma 4 26B-A4B | 88.3% | 2026 | 3.8B |
| Ministral 14B | 85.0% | 2025 | 14B |
| Nemotron Nano 9B v2 | 97.8% | MATH-500, /think mode | 9B |

AIME versions (2024/2025/2026) are not comparable. Each year is harder.

### Compact / Edge

Models that run on smartphones, laptops, or edge devices.

| Model | Active | VRAM (Q4) | Strength | License |
|---|---|---|---|---|
| SmolLM3-3B | 3B | ~2 GB | Best 3B, AIME 36.7%, /think mode, 64K ctx | Apache 2.0 |
| SmolLM2-1.7B | 1.7B | ~1 GB | 11T tokens, data-centric | Apache 2.0 |
| SmolLM2-360M | 360M | < 1 GB | 4T tokens | Apache 2.0 |
| SmolLM2-135M | 135M | < 1 GB | Ultra-compact, few MB quantized | Apache 2.0 |
| Gemma 4 E2B | 2.3B | ~4 GB | Multimodal + audio | Apache 2.0 |
| Gemma 4 E4B | 4.5B | ~6 GB | Multimodal + audio | Apache 2.0 |
| Phi-4-mini | 3.8B | ~2 GB | MATH-500 92.5% | MIT |
| Phi-4-multimodal | 5.6B | ~3 GB | Text + image + audio | MIT |
| Ministral 3B | 3B | ~2 GB | Vision + reasoning, 256K ctx | Apache 2.0 |
| Ministral 8B | 8B | ~5 GB | AIME 78.7%, vision | Apache 2.0 |
| Ministral 14B | 14B | ~8 GB | AIME 85%, vision, 256K ctx | Apache 2.0 |
| LFM2.5-1.2B | 1.2B | ~1 GB | IFBench 47.3 (2x Qwen3-1.7B), thinking, vision, audio | LFM Open v1.0 |
| Llama 3.2 1B/3B | 1-3B | < 2 GB | 128K ctx, edge/mobile, EU OK (text-only) | Llama Community |
| InternLM3-8B | 8B | ~5 GB | Thinking mode, 4T tokens (75% less training) | Apache 2.0 |
| InternVL3-1B→38B | 1-38B | 1-20 GB | Vision SOTA, full range edge→server | Apache 2.0 |
| Chocolatine-2-4B-DPO | 4B | ~2.5 GB | French-optimized DPO fine-tune of Qwen3-4B, 262K ctx, no `<think>` | Apache 2.0 |
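
The VRAM (Q4) column follows from simple arithmetic: 4-bit quantization costs roughly 4.5 bits per weight once per-block scales are included, with KV cache and activations on top. A minimal sketch of the estimate (the 4.5 bits/weight figure is a common rule of thumb, not a measured constant for these exact models):

```rust
/// Rough weight-only VRAM estimate for a Q4-quantized model.
/// Assumption (rule of thumb, not measured): ~4.5 bits per weight,
/// i.e. 4-bit values plus per-block quantization scales. KV cache
/// and activation memory come on top and grow with context length.
fn q4_weight_gb(params_billions: f64) -> f64 {
    let bits_per_weight = 4.5;
    params_billions * 1e9 * bits_per_weight / 8.0 / 1e9
}

fn main() {
    for (name, p) in [("SmolLM3-3B", 3.0), ("Ministral 8B", 8.0), ("Ministral 14B", 14.0)] {
        println!("{name}: ~{:.1} GB of weights at Q4", q4_weight_gb(p));
    }
}
```

The table's figures sit slightly above this weight-only estimate, which is consistent with runtime overhead.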

SmolLM3-3B beats all other 3B models and competes with 4B models (Qwen3-4B, Gemma3-4B). Data quality matters more than model size: SmolLM2-1.7B trained on 11T tokens beats larger models trained on less data.

Chocolatine-2-4B (Jonathan Pacifico) is a DPO fine-tune of Qwen3-4B-Instruct-2507 on French preference datasets (Compar:IA from the French Ministry of Culture + French-ORCA), merged with TIES. Gains on every French benchmark tested (GPQA-FR, French MMLU, French Bench, FR-MT-Bench) without degrading English performance. One of the rare French-focused open-weight models built by an individual contributor rather than a lab.

### Long context

| Model | Max ctx | RULER 1M | Architecture | Active | License |
|---|---|---|---|---|---|
| Nemotron 3 Nano | 1M | 86.3% | Mamba/MoE | 3.5B | Nemotron OML |
| Nemotron 3 Super | 1M | -- | Mamba/MoE | 12B | Nemotron OML |
| Jamba 1.6 Mini | 256K | -- | SSM+Transformer/MoE | 12B | Jamba OML |

RULER (GitHub) tests retrieval in long contexts with multiple needles, multi-hop tracing, and aggregation. Parametric by length (4K to 1M). Many models claim "1M context" without publishing RULER scores at that length. Without measurement, it's marketing.
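
For intuition, the simplest RULER task family can be reproduced in a few lines: bury key/value "needles" in filler text, ask the model to return them, and score the fraction recovered. A minimal sketch in Rust (illustrative only; the real harness also covers multi-hop tracing and aggregation):

```rust
/// Minimal sketch of a RULER-style multi-needle retrieval probe.
/// Not the real suite: RULER is parametric over length (4K to 1M
/// tokens) and includes multi-hop and aggregation tasks as well.
fn build_prompt(needles: &[(&str, u64)], filler_lines: usize) -> String {
    let mut lines: Vec<String> = (0..filler_lines)
        .map(|i| format!("Note {i}: nothing of interest happened today."))
        .collect();
    // Spread the needles evenly through the haystack.
    let step = filler_lines / (needles.len() + 1);
    for (i, (key, value)) in needles.iter().enumerate() {
        lines.insert((i + 1) * step, format!("The magic number for {key} is {value}."));
    }
    lines.push("\nQuestion: list the magic number for every key above.".into());
    lines.join("\n")
}

fn main() {
    let needles = [("alpha", 4417), ("bravo", 9021), ("charlie", 137)];
    let prompt = build_prompt(&needles, 1_000);
    // Score = needles present in the model's answer / needles planted.
    println!("prompt is {} chars long", prompt.len());
}
```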

### Alternative architectures

Non-Transformer or hybrid models.

| Model | Architecture | Active | Key metric | License |
|---|---|---|---|---|
| Granite 4.0 | 90% Mamba-2 / 10% Attention | 3-9B | 70% memory reduction, 2x speed | Apache 2.0 |
| LFM2/2.5 | Convolutions + grouped attention | 2.3B | 112 tok/s CPU, 2x Qwen3. LFM2.5: vision, audio, thinking | LFM Open v1.0 |
| Jamba 1.6 Mini | Mamba + Transformer + MoE | 12B | 2.5x Transformer speed | Jamba OML |

### Decentralized training

Models pre-trained outside traditional data centers, using distributed peer-to-peer or blockchain-coordinated networks. The story is the training method, not the model quality.

| Model | Method | Size | Tokens | Architecture | License |
|---|---|---|---|---|---|
| Covenant-72B | Permissionless P2P, SparseLoCo optimizer, Bittensor blockchain (Subnet 3) | 72B dense | 1.1T (+14.8B SFT) | LLaMA-3 style, GQA, 80 layers, d=8192, 64 heads, 8 KV heads, RoPE 500K, ctx 2048→8192 | Apache 2.0 (checkpoints) |

Pre-training benchmarks (0-shot) vs other dense baselines:

| Benchmark | Covenant-72B | LLaMA-2-70B (centralized) | LLM360 K2 (65B, centralized) | INTELLECT-1 (10B, P2P) |
|---|---|---|---|---|
| ARC-Challenge | 56.8 | 57.4 | 53.8 | 44.8 |
| ARC-Easy | 80.9 | 79.6 | 76.0 | 71.8 |
| PIQA | 81.6 | 82.6 | 82.5 | 77.4 |
| OpenBookQA | 44.0 | 49.4 | 48.0 | 43.8 |
| HellaSwag | 80.6 | 84.3 | 82.9 | 70.3 |
| WinoGrande | 75.9 | 80.4 | 76.4 | 63.3 |
| MMLU | 67.1 | 65.6 | 65.5 | 32.7 |

Covenant-72B-Chat (post-SFT) vs other chat models:

| Benchmark | Covenant-72B-Chat | LLaMA-2-70B-Chat | K2-Chat (65B) |
|---|---|---|---|
| ARC-Challenge | 64.2 | 65.4 | 62.0 |
| MMLU | 67.4 | 63.1 | 67.9 |
| IFEval | 64.7 | 40.7 | 45.5 |
| MATH | 26.3 | 10.7 | 19.1 |
| MMLU-Pro | 40.9 | 35.2 | 45.4 |
| GSM8K | 63.9 | 52.2 | 79.0 |

Why it matters: Covenant-72B is the first proof-of-concept that 72B-scale pre-training is possible without data centers, with peers joining and leaving freely. Coordination via the Bittensor blockchain (Subnet 3), communication via SparseLoCo (146× compression vs dense gradients), peers running 8×B200 GPUs over commodity internet (500 Mb/s down, 110 Mb/s up). The model achieves 94.5% compute utilization despite the network constraints, with an average of 16.9 contributing peers per round and 70+ unique peers over the run. On benchmarks, it beats LLaMA-2-70B on ARC-Challenge, ARC-Easy and MMLU (despite 1.8× fewer training tokens), and the chat variant has the best IFEval and MATH scores in its comparison group. It's the first credible alternative to the data-center duopoly for pre-training at 70B scale. Authors: Covenant AI + Mila. See arXiv 2603.08163.
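
The 146× figure is plausible from first principles: top-k sparsification sends only a small fraction of gradient entries, and 2-bit quantization shrinks the surviving values. A back-of-envelope sketch (the keep fraction and index width below are illustrative assumptions, not SparseLoCo's actual parameters; see the paper for the real scheme):

```rust
/// Back-of-envelope compression ratio for sparsified, quantized
/// gradient exchange. NOT the actual SparseLoCo algorithm: the keep
/// fraction and index encoding here are illustrative assumptions.
fn compression_ratio(keep_fraction: f64, value_bits: f64, index_bits: f64) -> f64 {
    let dense_bits_per_entry = 32.0; // fp32 dense gradient baseline
    dense_bits_per_entry / (keep_fraction * (value_bits + index_bits))
}

fn main() {
    // Keeping 1% of entries at 2 bits each, with ~20-bit packed
    // indices, already lands in the right ballpark:
    println!("~{:.0}x", compression_ratio(0.01, 2.0, 20.0)); // ~145x
}
```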


## Specialized

### Theorem provers (Lean 4)

miniF2F (GitHub): 488 formal Olympiad-level math problems. Proofs are compiler-verified: either correct or rejected. Zero hallucination possible on mathematical correctness.

| Model | miniF2F | PutnamBench | Active | License |
|---|---|---|---|---|
| BFS-Prover-V2-32B | 95.0% | -- | 32B | Apache 2.0 |
| Goedel-Prover-V2-32B | 90.4% | #1 | 32B | Apache 2.0 |
| DeepSeek-Prover-V2-7B | 88.9% | -- | 7B | MIT |
| Leanstral | -- | -- | 32B | Apache 2.0 |
| Kimina-Prover-72B | 84.0% | -- | 72B | MIT |
| Leanabell-Prover-V2-7B | 78.2% | -- | 7B | Apache 2.0 |


The sweet spot is 32B: BFS-Prover (95%) and Goedel-V2 (90.4%) both beat the 72B Kimina (84%).
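
To make "compiler-verified" concrete, here is what trivial Lean 4 proofs look like (far below miniF2F difficulty); the kernel either accepts them or rejects the file:

```lean
-- Toy Lean 4 proofs. The kernel either accepts a proof or rejects
-- the file; there is no partially-correct output to grade.
example : 2 + 2 = 4 := rfl

theorem add_zero_nat (n : Nat) : n + 0 = n := rfl

-- A wrong "proof" cannot hallucinate its way through:
-- example (n : Nat) : n + 1 = n := rfl   -- rejected by the type checker
```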

### GUI agents

ScreenSpot (GitHub): 1,200+ instructions across desktop, mobile, web. Tests if the model can locate the right UI element from a natural language instruction.

| Model | ScreenSpot | OSWorld | Active | License |
|---|---|---|---|---|
| UI-TARS-1.5-7B | 94.2% | 42.5 | 7B | Apache 2.0 |
| Qwen2.5-VL-7B | 84.7% | -- | 7B | Apache 2.0 |
| ShowUI-2B | -- | -- | 2B | MIT |

UI-TARS-7B beats Claude (87.6%) on ScreenSpot. 7B, Apache 2.0, runs on a laptop.

### Search agents

| Model | Specialty | Active | License |
|---|---|---|---|
| WebThinker-32B | RL web search, beats Gemini Deep Research | 32B | Apache 2.0 |
| DeepResearcher-7B | Emergent multi-step planning via RL | 7B | Apache 2.0 |
| Search-R1 | Framework: teach any LLM to search (+26% on 7B) | any | Apache 2.0 |

### Tool calling

BFCL (GitHub): Berkeley Function Calling Leaderboard. Tests function/tool calling accuracy: correct names, parameters, types. V4 adds web search and memory.

| Model | BFCL | Active | License |
|---|---|---|---|
| Hammer2.1-7B | #1 | 7B | CC-BY-NC 4.0 |
| xLAM-8B | #1 (alternate) | 8B | CC-BY-NC 4.0 |
| Hammer-0.5B | On-device | 0.5B | CC-BY-NC 4.0 |

Specialized tool-calling models clearly beat generalists. xLAM-8B beats GPT-4o on BFCL.
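
At its core, a BFCL-style check is mechanical: parse the emitted call, then verify the tool name and argument types against the declared schema. A minimal sketch using the serde_json crate (the schema layout here is simplified for illustration; BFCL's actual format also handles optional parameters, enums, and parallel calls):

```rust
use serde_json::{json, Value};

/// Minimal BFCL-style validation: does the emitted call name a known
/// tool and pass arguments of the declared JSON types? Simplified;
/// not the leaderboard's real evaluation harness.
fn validate_call(call: &Value, schema: &Value) -> bool {
    let (Some(name), Some(args)) = (call["name"].as_str(), call["arguments"].as_object())
    else {
        return false;
    };
    if schema["name"] != name {
        return false;
    }
    let Some(params) = schema["parameters"].as_object() else {
        return false;
    };
    // Every declared parameter must be present with the right type.
    params.iter().all(|(param, ty)| match (args.get(param), ty.as_str()) {
        (Some(v), Some("string")) => v.is_string(),
        (Some(v), Some("number")) => v.is_number(),
        _ => false,
    })
}

fn main() {
    let schema = json!({"name": "get_weather",
                        "parameters": {"city": "string", "days": "number"}});
    let call = json!({"name": "get_weather",
                      "arguments": {"city": "Paris", "days": 3}});
    println!("valid: {}", validate_call(&call, &schema)); // valid: true
}
```

Models like Hammer and xLAM are trained to maximize exactly this kind of strict match, which is why they can top generalists on BFCL.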

### Rust

| Model | Strandset-Rust | RustEvo2 | Active | License |
|---|---|---|---|---|
| Strand-Rust-Coder-14B | 0.50 | 0.43 | 14B | Apache 2.0 (base) |

Beats GPT-5-Codex and Claude Sonnet 4.5 on Rust benchmarks. Fine-tuned on 191K examples from 2,383 crates.

### Vision / Multimodal

| Model | MMMU | Active | Key feature | License |
|---|---|---|---|---|
| InternVL3-78B | 72.2 | 78B | SOTA open-source VLM, custom InternViT | Apache 2.0 |
| InternVL3-1B→38B | -- | 1-38B | Full range edge→server | Apache 2.0 |
| Gemma 4 31B Pro | 76.9 | 31B | Text + image + video | Apache 2.0 |
| Gemma 4 E2B/E4B | -- | 2.3-4.5B | Multimodal + audio, edge | Apache 2.0 |
| Qwen2.5-VL-7B | -- | 7B | Computer/phone use, DocVQA 95.7 | Apache 2.0 |

InternVL3-78B (72.2 MMMU) is on par with GPT-4o on multimodal. The InternViT encoder (300M–6B) is trained jointly with the LLM — not bolted on after the fact.


## Observations

Patterns observed across 60+ models. Not definitive truths.

### Architecture

- Dense retreats above 35B, but doesn't die. For generalists above 35B, MoE clearly dominates (GPT-OSS-120B, Mistral Small 4, Qwen3.5-122B-A10B, GLM-4.5-Air, Step-3.5-Flash, Nemotron 3 Super, all MoE). But dense survives where it has a structural advantage: Llama 3.3 70B (generalist), InternVL3-78B (vision), Kimina-Prover-72B (theorem proving), Qwen 2.5-72B (production NLP), Covenant-72B (decentralized training), DeepSeek R1-Distill-70B (distilled reasoning). Dense is becoming a specialization choice.

- Parameter count is no longer the determining factor. Qwen3.5-9B (9B) beats GPT-OSS-120B (5.1B active, 117B total) on GPQA Diamond.

- The 40-79B segment is the dense survivors' refuge. New models often jump from ~35B straight to ~120B total via MoE. But the 40-79B range is well populated by quality dense models (Llama 3.3 70B, InternVL3-78B, Kimina-Prover-72B, Qwen 2.5-72B, Covenant-72B, R1-Distill-70B, Jamba 1.6 Mini 52B). This is where dense resists, and where you find both solid generalists and specialists.

- InternVL3 is the best open-source VLM nobody was talking about. InternVL3-78B (Shanghai AI Lab) reaches 72.2 MMMU under Apache 2.0, on par with GPT-4o. InternLM3-8B achieves SOTA with 75% fewer training tokens (4T vs 15-18T). Less press than Alibaba, comparable results.

- Qwen is the de facto base model for fine-tuning. BFS-Prover, Goedel-Prover, Kimina-Prover, most community distillations: all built on Qwen. The ResNet of LLMs.

- Decentralized pre-training is no longer a toy. Covenant-72B (Mar 2026) pre-trained a 72B dense LLaMA-3-style model over a permissionless blockchain network (Bittensor Subnet 3) on 1.1T tokens. It beats LLaMA-2-70B on ARC-Challenge, ARC-Easy and MMLU despite 1.8× fewer training tokens, with 94.5% compute utilization over commodity internet (500/110 Mb/s) and dynamic peer participation. The data-center duopoly for pre-training at 70B scale now has a credible alternative. SparseLoCo + 2-bit quantization gives 146× compression on gradient communication.

### Benchmarks

- GPQA Diamond is the most discriminating benchmark for reasoning: 198 doctoral-level questions, impossible to solve by retrieval.

- SWE-bench and Codeforces measure different things. GPT-OSS-120B dominates competition (ELO 2622) but gets beaten on real bugs by Step-3.5-Flash (74.4% vs 62.4%).

- Many models claim "1M context" without RULER scores at that length. Without measurement, it's marketing.

- AIME versions (2024/2025/2026) are not comparable. Each year is harder. Only compare within the same version.

### Specialization

- Specialized models dominate on narrow tasks. UI-TARS-7B beats Claude on GUI (94.2% vs 87.6%). BFS-Prover-32B beats DeepSeek-671B on theorem proving (95% vs 88.9%).

- The sweet spot for theorem proving is 32B. Method (tree search, self-correction) compensates for size.

- Domain-specific models (medical, legal, finance) are less mature than code/math specialists. Generalists often outperform them on domain benchmarks. Specialization helps mainly for specific vocabulary, regulatory compliance, and private data fine-tuning.

### Licenses

- Gemma 4 under Apache 2.0 is a turning point. Google moved from a restrictive custom license to standard open-source for the first time.

- Llama 4 excludes the EU for multimodal models. But text-only Llama (3.3 70B, 3.2 1B/3B) is EU-exploitable: the exclusion only applies to multimodal.

- "Open-weight" is more nuanced than "open-source". Llama is technically open-weight but with geographic restrictions on multimodal. Always check the fine print.


## Benchmarks reference

What each benchmark measures, how many questions it has, and where to find more.

### Reasoning & Knowledge

- GPQA Diamond (198 questions) — Graduate-level questions in physics, chemistry, biology. Designed to be unsolvable by Google search. Experts reach 65%, non-experts 34%. The most discriminating reasoning benchmark.

- MMLU-Pro (12K+ questions) — Hardened version of MMLU: 10 choices instead of 4, requires chain-of-thought reasoning. 14 domains. Drops accuracy 16-33% vs MMLU. Published at NeurIPS 2024.

### Math

- AIME (15 problems/year) — American Invitational Mathematics Examination. Competition-level math requiring creativity and multi-step reasoning. Each year's edition is harder. Only compare within the same version (2024/2025/2026).

- MATH-500 (500 problems) — Diverse math problems (algebra, geometry, combinatorics, number theory). Good general math evaluation but easier to saturate than AIME.

### Code

- SWE-bench Verified (500 issues) — Real bugs from GitHub repos (Django, Flask, scikit-learn). The model must understand the codebase, find the bug, and produce a working patch. Human-validated by OpenAI. Paper

- Codeforces (ELO system) — Algorithmic competition performance, scored like chess ELO. Measures pure algorithmic skill, not real-world coding. Different skill from SWE-bench.

- LiveCodeBench (rotating, 700+) — Fresh competitive programming problems collected after model training cutoffs. Eliminates data contamination. Problems from LeetCode, AtCoder, Codeforces. GitHub

### Long context

- RULER (parametric) — Sophisticated "needle in a haystack" with multiple needles, multi-hop tracing, and aggregation. Tests at different lengths (4K to 1M). By NVIDIA. Many models claiming 1M context fail above 32K. GitHub

### Agents & Tools

- BFCL (2K+) — Berkeley Function Calling Leaderboard. Tests function/tool calling accuracy: correct names, parameters, types. V4 adds web search and memory. By UC Berkeley. GitHub

### Theorem proving

- miniF2F (488 problems) — Formal Olympiad-level math problems in Lean 4 (also Isabelle, HOL Light). Covers AMC, AIME, IMO, and university math. Proofs are compiler-verified: either correct or rejected. Zero hallucination possible. GitHub

### GUI

- ScreenSpot (1.2K+ instructions) — GUI element grounding across desktop, mobile, and web. Tests if the model can locate the right UI element from a natural language instruction. GitHub

## Licenses

| License | Models | Commercial | EU | Patent grant | OSI |
|---|---|---|---|---|---|
| Apache 2.0 | Gemma 4, Qwen 3/3.5, GPT-OSS, Ministral, Step-3.5-Flash | Yes | Yes | Yes | Yes |
| MIT | GLM-4.5-Air, DeepSeek R1-Distill, Phi-4 | Yes | Yes | No (implicit) | Yes |
| Nemotron OML | Nemotron 3 Nano/Super | Yes | Yes | Yes | No |
| Jamba OML | Jamba 1.6 | Yes | Yes | -- | No |
| Llama Community | Llama 3.3 70B, Llama 3.2 1B/3B (text-only) | Yes | Yes (text-only) | -- | No |
| LFM Open v1.0 | LFM2, LFM2.5 | Yes (< $10M) | Yes | -- | No |

## How to choose

| Constraint | Recommendation |
|---|---|
| Smartphone / edge (< 4 GB) | SmolLM3-3B, SmolLM2-135M/360M/1.7B, Gemma 4 E2B, Phi-4-mini, Ministral 3B, LFM2.5-1.2B, Llama 3.2 1B/3B |
| Laptop 16 GB | GPT-OSS-20B, Ministral 14B, Gemma 4 26B-A4B |
| Desktop 24 GB | Gemma 4 31B, DeepSeek R1-Distill-32B, Devstral Small 2 |
| Desktop 48+ GB (dense 70B) | Llama 3.3 70B (MMLU 86.0, EU OK), InternVL3-78B (vision) |
| Server single-GPU (80 GB) | GPT-OSS-120B |
| Server multi-GPU | Step-3.5-Flash, Nemotron 3 Super, Qwen3.5-122B |
| Long context (> 256K) | Nemotron 3 Nano (1M, RULER 86.3%) |
| Math | Nemotron Nano 9B v2 (/think mode), GPT-OSS-120B |
| Code (real bugs) | Step-3.5-Flash, Devstral Small 2 |
| Code (competition) | GPT-OSS-120B (Codeforces 2622) |
| Multilingual (100+ langs) | Qwen 3.5 (201), Qwen 3 (119) |
| Theorem proving | BFS-Prover-V2-32B (95% miniF2F) |
| GUI automation | UI-TARS-1.5-7B (94.2% ScreenSpot) |
| Throughput | Step-3.5-Flash (350 tok/s) |

## Rejected models

| Model | License | Reason |
|---|---|---|
| Llama 4 (Maverick, Scout) | Llama Community License | EU exclusion (multimodal) |
| Llama 3.2 Vision 11B/90B | Llama Community License | EU exclusion (multimodal) |
| Llama-Nemotron-Super-49B | Llama 3.3 License | Inherits EU exclusion (multimodal base) |
| Qwen 3.6 Plus | Proprietary | Closed-source, API-only |
| Codestral | Non-commercial | Research only |
| Falcon 3 | Ambiguous | Potential 10% royalty |
| Kimi K2.5 | Modified MIT (100M MAU) | User threshold |
| MiniMax M2.5 | Modified MIT | Custom restrictions |
| DeepSeek V3/R1 (full) | MIT | > 200B total (671B) |
| Qwen 3 235B / Qwen 3.5 397B | Apache 2.0 | > 200B total |

## Contributing

Found an error? Missing a model? Open an issue or submit a PR.

Sources: HuggingFace, Papers With Code, official model repos and papers.


## License

This list is licensed under CC-BY 4.0.
