# Open Weight Models

A curated list of open-weight AI models with commercially exploitable licenses, verified benchmarks, and no geographic restrictions. Built to decide which models to support in herbert-rs, a local LLM inference engine in Rust and hand-written assembly.

Selection criteria:

1. Commercially exploitable license, no geographic restriction (EU OK)
2. Total size < 200B parameters
3. Released after April 2024

This excludes Llama 4 multimodal (EU exclusion), Qwen 3.6 Plus (closed-source), full DeepSeek V3/R1 (671B), and others. Note: Llama text-only models (3.3 70B, 3.2 1B/3B) are EU-exploitable. See Rejected models for details.

Maintained by Philippe Anel. Last updated: April 2026.


## Table of Contents

- LLMs
- Specialized
- Observations
- Benchmarks reference
- Licenses
- How to choose
- Rejected models
- Contributing
- License


## LLMs

### Generalists

| Model | Publisher | Active | Total | Arch | Ctx | License | Key scores |
|---|---|---|---|---|---|---|---|
| Gemma 4 31B | Google | 31B | 31B | Dense | 256K | Apache 2.0 | GPQA 84.3, MMLU-Pro 85.2 |
| Qwen3.5-27B | Alibaba | 27B | 27B | Dense | 128K | Apache 2.0 | 201 languages |
| Qwen3.5-9B | Alibaba | 9B | 9B | Dense | 128K | Apache 2.0 | GPQA 81.7 (9B!) |
| Qwen3.5-122B-A10B | Alibaba | 10B | 122B | MoE | 256K | Apache 2.0 | 201 languages, multimodal |
| GPT-OSS-120B | OpenAI | 5.1B | 117B | MoE | 128K | Apache 2.0 | GPQA 80.9, Codeforces 2622, AIME 96.6% |
| GPT-OSS-20B | OpenAI | 3.6B | 21B | MoE | 128K | Apache 2.0 | AIME 96%, fits 16GB |
| Mistral Small 4 | Mistral | 6B | 119B | MoE | 256K | Apache 2.0 | GPQA 71.2, unified instruct/reasoning/coding |
| GLM-4.5-Air | Zhipu AI | 12B | 106B | MoE | 128K | MIT | MATH-500 98.1%, MMLU-Pro 81.4 |
| QwQ-32B | Alibaba | 32B | 32B | Dense | 128K | Apache 2.0 | AIME ~80%, reasoning RL |
| DeepSeek R1-Distill-32B | DeepSeek | 32B | 32B | Dense | 128K | MIT | Beats o1-mini |
| Step-3.5-Flash | StepFun | 11B | 196B | MoE | 262K | Apache 2.0 | SWE-bench 74.4%, 350 tok/s |
| Llama 3.3 70B | Meta | 70B | 70B | Dense | 128K | Llama Community (EU OK) | MMLU 86.0, HumanEval 88.4, MATH 77.0 |
| InternVL3-78B | Shanghai AI Lab | 78B | 78B | Dense | -- | Apache 2.0 | MMMU 72.2, SOTA open-source VLM |

### Code

| Model | SWE-bench | Codeforces | Active | License |
|---|---|---|---|---|
| Claude Opus 4.6 (closed) | 80.8% | -- | -- | -- |
| Gemini 3.1 Pro (closed) | 80.6% | -- | -- | -- |
| GPT-5.4 (closed) | ~80% | -- | -- | -- |
| Step-3.5-Flash | 74.4% | -- | 11B | Apache 2.0 |
| Devstral 2 | 72.2% | -- | ~12B | MIT modified |
| Qwen3-Coder-Next 80B-A3B | 70.6% | -- | 3B | Apache 2.0 |
| Qwen2.5-Coder-32B | 69.6% | -- | 32B | Apache 2.0 |
| Devstral Small 2 | 68.0% | -- | 24B | Apache 2.0 |
| GPT-OSS-120B | 62.4% | 2622 | 5.1B | Apache 2.0 |
| Gemma 4 31B | -- | 2150 | 31B | Apache 2.0 |

SWE-bench = real bugs in real GitHub repos (Django, Flask, scikit-learn). 500 human-validated issues. Codeforces = algorithmic competition, ELO-scored like chess. Different skills: fixing a codebase vs solving a puzzle.

### Reasoning

#### GPQA Diamond (198 questions)

Graduate-level questions in physics, chemistry, biology. Designed to be unsolvable by Google search. Experts reach 65%, non-experts 34%. The most discriminating reasoning benchmark available.

| Model | GPQA | Active |
|---|---|---|
| Gemini 3.1 Pro (closed) | 94.3 | -- |
| GPT-5.4 (closed) | 92.8 | -- |
| Claude Opus 4.6 (closed) | 91.3 | -- |
| Gemma 4 31B | 84.3 | 31B |
| Gemma 4 26B-A4B | 82.3 | 3.8B |
| Qwen3.5-9B | 81.7 | 9B |
| GPT-OSS-120B | 80.9 | 5.1B |
| GLM-4.5-Air | 75.0 | 12B |
| Nemotron 3 Nano | 73.0 | 3.5B |
| Mistral Small 4 | 71.2 | 6B |
| Llama 3.3 70B | 50.5 | 70B |

#### Math (AIME, 15 problems/year)

Competition-level math requiring creativity and multi-step reasoning. Each year's edition is different and harder. Only compare within the same version.

| Model | AIME | Conditions | Active |
|---|---|---|---|
| GPT-5.4 (closed) | ~100% | 2025 | -- |
| Claude Opus 4.6 (closed) | ~98% | 2025 | -- |
| Nemotron 3 Nano | 99.2% | 2025, with tools | 3.5B |
| GPT-OSS-120B | 96.6% | 2024, with tools | 5.1B |
| GPT-OSS-20B | 96.0% | 2024, with tools | 3.6B |
| Gemma 4 31B | 89.2% | 2026 | 31B |
| Gemma 4 26B-A4B | 88.3% | 2026 | 3.8B |
| Ministral 14B | 85.0% | 2025 | 14B |
| Nemotron Nano 9B v2 | 97.8% | MATH-500, /think mode | 9B |

AIME versions (2024/2025/2026) are not comparable. Each year is harder.

### Compact / Edge

Models that run on smartphones, laptops, or edge devices.

| Model | Active | VRAM (Q4) | Strength | License |
|---|---|---|---|---|
| SmolLM3-3B | 3B | ~2 GB | Best 3B, AIME 36.7%, /think mode, 64K ctx | Apache 2.0 |
| SmolLM2-1.7B | 1.7B | ~1 GB | 11T tokens, data-centric | Apache 2.0 |
| SmolLM2-360M | 360M | < 1 GB | 4T tokens | Apache 2.0 |
| SmolLM2-135M | 135M | < 1 GB | Ultra-compact, few MB quantized | Apache 2.0 |
| Gemma 4 E2B | 2.3B | ~4 GB | Multimodal + audio | Apache 2.0 |
| Gemma 4 E4B | 4.5B | ~6 GB | Multimodal + audio | Apache 2.0 |
| Phi-4-mini | 3.8B | ~2 GB | MATH-500 92.5% | MIT |
| Phi-4-multimodal | 5.6B | ~3 GB | Text + image + audio | MIT |
| Ministral 3B | 3B | ~2 GB | Vision + reasoning, 256K ctx | Apache 2.0 |
| Ministral 8B | 8B | ~5 GB | AIME 78.7%, vision | Apache 2.0 |
| Ministral 14B | 14B | ~8 GB | AIME 85%, vision, 256K ctx | Apache 2.0 |
| LFM2.5-1.2B | 1.2B | ~1 GB | IFBench 47.3 (2x Qwen3-1.7B), thinking, vision, audio | LFM Open v1.0 |
| Llama 3.2 1B/3B | 1-3B | < 2 GB | 128K ctx, edge/mobile, EU OK (text-only) | Llama Community |
| InternLM3-8B | 8B | ~5 GB | Thinking mode, 4T tokens (75% less training) | Apache 2.0 |
| InternVL3-1B→38B | 1-38B | 1-20 GB | Vision SOTA, full range edge→server | Apache 2.0 |
| Chocolatine-2-4B-DPO | 4B | ~2.5 GB | French-optimized DPO fine-tune of Qwen3-4B, 262K ctx, no `<think>` | Apache 2.0 |
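
The VRAM (Q4) column follows from simple arithmetic: 4-bit quantization costs roughly 4.5 bits per weight once per-block scales are included, with KV cache and activations on top. A minimal sketch of the estimate (the 4.5 bits/weight figure is a common rule of thumb, not a measured constant for these exact models):

```rust
/// Rough weight-only VRAM estimate for a Q4-quantized model.
/// Assumption (rule of thumb, not measured): ~4.5 bits per weight,
/// i.e. 4-bit values plus per-block quantization scales. KV cache
/// and activation memory come on top and grow with context length.
fn q4_weight_gb(params_billions: f64) -> f64 {
    let bits_per_weight = 4.5;
    params_billions * 1e9 * bits_per_weight / 8.0 / 1e9
}

fn main() {
    for (name, p) in [("SmolLM3-3B", 3.0), ("Ministral 8B", 8.0), ("Ministral 14B", 14.0)] {
        println!("{name}: ~{:.1} GB of weights at Q4", q4_weight_gb(p));
    }
}
```

The table's figures sit slightly above this weight-only estimate, which is consistent with runtime overhead.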

SmolLM3-3B beats all other 3B models and competes with 4B models (Qwen3-4B, Gemma3-4B). Data quality matters more than model size: SmolLM2-1.7B trained on 11T tokens beats larger models trained on less data.

Chocolatine-2-4B (Jonathan Pacifico) is a DPO fine-tune of Qwen3-4B-Instruct-2507 on French preference datasets (Compar:IA from the French Ministry of Culture + French-ORCA), merged with TIES. Gains on every French benchmark tested (GPQA-FR, French MMLU, French Bench, FR-MT-Bench) without degrading English performance. One of the rare French-focused open-weight models built by an individual contributor rather than a lab.

### Long context

| Model | Max ctx | RULER 1M | Architecture | Active | License |
|---|---|---|---|---|---|
| Nemotron 3 Nano | 1M | 86.3% | Mamba/MoE | 3.5B | Nemotron OML |
| Nemotron 3 Super | 1M | -- | Mamba/MoE | 12B | Nemotron OML |
| Jamba 1.6 Mini | 256K | -- | SSM+Transformer/MoE | 12B | Jamba OML |

RULER (GitHub) tests retrieval in long contexts with multiple needles, multi-hop tracing, and aggregation. Parametric by length (4K to 1M). Many models claim "1M context" without publishing RULER scores at that length. Without measurement, it's marketing.
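
For intuition, the simplest RULER task family can be reproduced in a few lines: bury key/value "needles" in filler text, ask the model to return them, and score the fraction recovered. A minimal sketch in Rust (illustrative only; the real harness also covers multi-hop tracing and aggregation):

```rust
/// Minimal sketch of a RULER-style multi-needle retrieval probe.
/// Not the real suite: RULER is parametric over length (4K to 1M
/// tokens) and includes multi-hop and aggregation tasks as well.
fn build_prompt(needles: &[(&str, u64)], filler_lines: usize) -> String {
    let mut lines: Vec<String> = (0..filler_lines)
        .map(|i| format!("Note {i}: nothing of interest happened today."))
        .collect();
    // Spread the needles evenly through the haystack.
    let step = filler_lines / (needles.len() + 1);
    for (i, (key, value)) in needles.iter().enumerate() {
        lines.insert((i + 1) * step, format!("The magic number for {key} is {value}."));
    }
    lines.push("\nQuestion: list the magic number for every key above.".into());
    lines.join("\n")
}

fn main() {
    let needles = [("alpha", 4417), ("bravo", 9021), ("charlie", 137)];
    let prompt = build_prompt(&needles, 1_000);
    // Score = needles present in the model's answer / needles planted.
    println!("prompt is {} chars long", prompt.len());
}
```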

### Alternative architectures

Non-Transformer or hybrid models.

| Model | Architecture | Active | Key metric | License |
|---|---|---|---|---|
| Granite 4.0 | 90% Mamba-2 / 10% Attention | 3-9B | 70% memory reduction, 2x speed | Apache 2.0 |
| LFM2/2.5 | Convolutions + grouped attention | 2.3B | 112 tok/s CPU, 2x Qwen3. LFM2.5: vision, audio, thinking | LFM Open v1.0 |
| Jamba 1.6 Mini | Mamba + Transformer + MoE | 12B | 2.5x Transformer speed | Jamba OML |

### Decentralized training

Models pre-trained outside traditional data centers, using distributed peer-to-peer or blockchain-coordinated networks. The story is the training method, not the model quality.

| Model | Method | Size | Tokens | Architecture | License |
|---|---|---|---|---|---|
| Covenant-72B | Permissionless P2P, SparseLoCo optimizer, Bittensor blockchain (Subnet 3) | 72B dense | 1.1T (+14.8B SFT) | LLaMA-3 style, GQA, 80 layers, d=8192, 64 heads, 8 KV heads, RoPE 500K, ctx 2048→8192 | Apache 2.0 (checkpoints) |

Pre-training benchmarks (0-shot) vs other dense baselines:

| Benchmark | Covenant-72B | LLaMA-2-70B (centralized) | LLM360 K2 (65B, centralized) | INTELLECT-1 (10B, P2P) |
|---|---|---|---|---|
| ARC-Challenge | 56.8 | 57.4 | 53.8 | 44.8 |
| ARC-Easy | 80.9 | 79.6 | 76.0 | 71.8 |
| PIQA | 81.6 | 82.6 | 82.5 | 77.4 |
| OpenBookQA | 44.0 | 49.4 | 48.0 | 43.8 |
| HellaSwag | 80.6 | 84.3 | 82.9 | 70.3 |
| WinoGrande | 75.9 | 80.4 | 76.4 | 63.3 |
| MMLU | 67.1 | 65.6 | 65.5 | 32.7 |

Covenant-72B-Chat (post-SFT) vs other chat models:

| Benchmark | Covenant-72B-Chat | LLaMA-2-70B-Chat | K2-Chat (65B) |
|---|---|---|---|
| ARC-Challenge | 64.2 | 65.4 | 62.0 |
| MMLU | 67.4 | 63.1 | 67.9 |
| IFEval | 64.7 | 40.7 | 45.5 |
| MATH | 26.3 | 10.7 | 19.1 |
| MMLU-Pro | 40.9 | 35.2 | 45.4 |
| GSM8K | 63.9 | 52.2 | 79.0 |

Why it matters: Covenant-72B is the first proof-of-concept that 72B-scale pre-training is possible without data centers, with peers joining and leaving freely. Coordination via the Bittensor blockchain (Subnet 3), communication via SparseLoCo (146× compression vs dense gradients), peers running 8×B200 GPUs over commodity internet (500 Mb/s down, 110 Mb/s up). The model achieves 94.5% compute utilization despite the network constraints, with an average of 16.9 contributing peers per round and 70+ unique peers over the run. On benchmarks, it beats LLaMA-2-70B on ARC-Challenge, ARC-Easy and MMLU (despite 1.8× fewer training tokens), and the chat variant has the best IFEval and MATH scores in its comparison group. It's the first credible alternative to the data-center duopoly for pre-training at 70B scale. Authors: Covenant AI + Mila. See arXiv 2603.08163.
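
The 146× figure is plausible from first principles: top-k sparsification sends only a small fraction of gradient entries, and 2-bit quantization shrinks the surviving values. A back-of-envelope sketch (the keep fraction and index width below are illustrative assumptions, not SparseLoCo's actual parameters; see the paper for the real scheme):

```rust
/// Back-of-envelope compression ratio for sparsified, quantized
/// gradient exchange. NOT the actual SparseLoCo algorithm: the keep
/// fraction and index encoding here are illustrative assumptions.
fn compression_ratio(keep_fraction: f64, value_bits: f64, index_bits: f64) -> f64 {
    let dense_bits_per_entry = 32.0; // fp32 dense gradient baseline
    dense_bits_per_entry / (keep_fraction * (value_bits + index_bits))
}

fn main() {
    // Keeping 1% of entries at 2 bits each, with ~20-bit packed
    // indices, already lands in the right ballpark:
    println!("~{:.0}x", compression_ratio(0.01, 2.0, 20.0)); // ~145x
}
```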


## Specialized

### Theorem provers (Lean 4)

miniF2F (GitHub): 488 formal Olympiad-level math problems. Proofs are compiler-verified: either correct or rejected. Zero hallucination possible on mathematical correctness.

| Model | miniF2F | PutnamBench | Active | License |
|---|---|---|---|---|
| BFS-Prover-V2-32B | 95.0% | -- | 32B | Apache 2.0 |
| Goedel-Prover-V2-32B | 90.4% | #1 | 32B | Apache 2.0 |
| DeepSeek-Prover-V2-7B | 88.9% | -- | 7B | MIT |
| Leanstral | -- | -- | 32B | Apache 2.0 |
| Kimina-Prover-72B | 84.0% | -- | 72B | MIT |
| Leanabell-Prover-V2-7B | 78.2% | -- | 7B | Apache 2.0 |


The sweet spot is 32B: BFS-Prover (95%) and Goedel-V2 (90.4%) both beat the 72B Kimina (84%).
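
To make "compiler-verified" concrete, here is what trivial Lean 4 proofs look like (far below miniF2F difficulty); the kernel either accepts them or rejects the file:

```lean
-- Toy Lean 4 proofs. The kernel either accepts a proof or rejects
-- the file; there is no partially-correct output to grade.
example : 2 + 2 = 4 := rfl

theorem add_zero_nat (n : Nat) : n + 0 = n := rfl

-- A wrong "proof" cannot hallucinate its way through:
-- example (n : Nat) : n + 1 = n := rfl   -- rejected by the type checker
```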

### GUI agents

ScreenSpot (GitHub): 1,200+ instructions across desktop, mobile, web. Tests if the model can locate the right UI element from a natural language instruction.

| Model | ScreenSpot | OSWorld | Active | License |
|---|---|---|---|---|
| UI-TARS-1.5-7B | 94.2% | 42.5 | 7B | Apache 2.0 |
| Qwen2.5-VL-7B | 84.7% | -- | 7B | Apache 2.0 |
| ShowUI-2B | -- | -- | 2B | MIT |

UI-TARS-7B beats Claude (87.6%) on ScreenSpot. 7B, Apache 2.0, runs on a laptop.

### Search agents

| Model | Specialty | Active | License |
|---|---|---|---|
| WebThinker-32B | RL web search, beats Gemini Deep Research | 32B | Apache 2.0 |
| DeepResearcher-7B | Emergent multi-step planning via RL | 7B | Apache 2.0 |
| Search-R1 | Framework: teach any LLM to search (+26% on 7B) | any | Apache 2.0 |

### Tool calling

BFCL (GitHub): Berkeley Function Calling Leaderboard. Tests function/tool calling accuracy: correct names, parameters, types. V4 adds web search and memory.

| Model | BFCL | Active | License |
|---|---|---|---|
| Hammer2.1-7B | #1 | 7B | CC-BY-NC 4.0 |
| xLAM-8B | #1 (alternate) | 8B | CC-BY-NC 4.0 |
| Hammer-0.5B | On-device | 0.5B | CC-BY-NC 4.0 |

Specialized tool-calling models clearly beat generalists. xLAM-8B beats GPT-4o on BFCL.
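
At its core, a BFCL-style check is mechanical: parse the emitted call, then verify the tool name and argument types against the declared schema. A minimal sketch using the serde_json crate (the schema layout here is simplified for illustration; BFCL's actual format also handles optional parameters, enums, and parallel calls):

```rust
use serde_json::{json, Value};

/// Minimal BFCL-style validation: does the emitted call name a known
/// tool and pass arguments of the declared JSON types? Simplified;
/// not the leaderboard's real evaluation harness.
fn validate_call(call: &Value, schema: &Value) -> bool {
    let (Some(name), Some(args)) = (call["name"].as_str(), call["arguments"].as_object())
    else {
        return false;
    };
    if schema["name"] != name {
        return false;
    }
    let Some(params) = schema["parameters"].as_object() else {
        return false;
    };
    // Every declared parameter must be present with the right type.
    params.iter().all(|(param, ty)| match (args.get(param), ty.as_str()) {
        (Some(v), Some("string")) => v.is_string(),
        (Some(v), Some("number")) => v.is_number(),
        _ => false,
    })
}

fn main() {
    let schema = json!({"name": "get_weather",
                        "parameters": {"city": "string", "days": "number"}});
    let call = json!({"name": "get_weather",
                      "arguments": {"city": "Paris", "days": 3}});
    println!("valid: {}", validate_call(&call, &schema)); // valid: true
}
```

Models like Hammer and xLAM are trained to maximize exactly this kind of strict match, which is why they can top generalists on BFCL.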

### Rust

| Model | Strandset-Rust | RustEvo2 | Active | License |
|---|---|---|---|---|
| Strand-Rust-Coder-14B | 0.50 | 0.43 | 14B | Apache 2.0 (base) |

Beats GPT-5-Codex and Claude Sonnet 4.5 on Rust benchmarks. Fine-tuned on 191K examples from 2,383 crates.

### Vision / Multimodal

| Model | MMMU | Active | Key feature | License |
|---|---|---|---|---|
| InternVL3-78B | 72.2 | 78B | SOTA open-source VLM, custom InternViT | Apache 2.0 |
| InternVL3-1B→38B | -- | 1-38B | Full range edge→server | Apache 2.0 |
| Gemma 4 31B Pro | 76.9 | 31B | Text + image + video | Apache 2.0 |
| Gemma 4 E2B/E4B | -- | 2.3-4.5B | Multimodal + audio, edge | Apache 2.0 |
| Qwen2.5-VL-7B | -- | 7B | Computer/phone use, DocVQA 95.7 | Apache 2.0 |

InternVL3-78B (72.2 MMMU) is on par with GPT-4o on multimodal. The InternViT encoder (300M–6B) is trained jointly with the LLM — not bolted on after the fact.


## Observations

Patterns observed across 60+ models. Not definitive truths.

### Architecture

- Dense retreats above 35B, but doesn't die. For generalists above 35B, MoE clearly dominates (GPT-OSS-120B, Mistral Small 4, Qwen3.5-122B-A10B, GLM-4.5-Air, Step-3.5-Flash, Nemotron 3 Super, all MoE). But dense survives where it has a structural advantage: Llama 3.3 70B (generalist), InternVL3-78B (vision), Kimina-Prover-72B (theorem proving), Qwen 2.5-72B (production NLP), Covenant-72B (decentralized training), DeepSeek R1-Distill-70B (distilled reasoning). Dense is becoming a specialization choice.

- Parameter count is no longer the determining factor. Qwen3.5-9B (9B) beats GPT-OSS-120B (5.1B active, 117B total) on GPQA Diamond.

- The 40-79B segment is the dense survivors' refuge. New models often jump from ~35B straight to ~120B total via MoE. But the 40-79B range is well populated by quality dense models (Llama 3.3 70B, InternVL3-78B, Kimina-Prover-72B, Qwen 2.5-72B, Covenant-72B, R1-Distill-70B, Jamba 1.6 Mini 52B). This is where dense resists, and where you find both solid generalists and specialists.

- InternVL3 is the best open-source VLM nobody was talking about. InternVL3-78B (Shanghai AI Lab) reaches 72.2 MMMU under Apache 2.0, on par with GPT-4o. InternLM3-8B achieves SOTA with 75% fewer training tokens (4T vs 15-18T). Less press than Alibaba, comparable results.

- Qwen is the de facto base model for fine-tuning. BFS-Prover, Goedel-Prover, Kimina-Prover, most community distillations: all built on Qwen. The ResNet of LLMs.

- Decentralized pre-training is no longer a toy. Covenant-72B (Mar 2026) pre-trained a 72B dense LLaMA-3-style model over a permissionless blockchain network (Bittensor Subnet 3) on 1.1T tokens. It beats LLaMA-2-70B on ARC-Challenge, ARC-Easy and MMLU despite 1.8× fewer training tokens, with 94.5% compute utilization over commodity internet (500/110 Mb/s) and dynamic peer participation. The data-center duopoly for pre-training at 70B scale now has a credible alternative. SparseLoCo + 2-bit quantization gives 146× compression on gradient communication.

### Benchmarks

- GPQA Diamond is the most discriminating benchmark for reasoning: 198 doctoral-level questions, impossible to solve by retrieval.

- SWE-bench and Codeforces measure different things. GPT-OSS-120B dominates competition (ELO 2622) but gets beaten on real bugs by Step-3.5-Flash (74.4% vs 62.4%).

- Many models claim "1M context" without RULER scores at that length. Without measurement, it's marketing.

- AIME versions (2024/2025/2026) are not comparable. Each year is harder. Only compare within the same version.

### Specialization

- Specialized models dominate on narrow tasks. UI-TARS-7B beats Claude on GUI (94.2% vs 87.6%). BFS-Prover-32B beats DeepSeek-671B on theorem proving (95% vs 88.9%).

- The sweet spot for theorem proving is 32B. Method (tree search, self-correction) compensates for size.

- Domain-specific models (medical, legal, finance) are less mature than code/math specialists. Generalists often outperform them on domain benchmarks. Specialization helps mainly for specific vocabulary, regulatory compliance, and private data fine-tuning.

### Licenses

- Gemma 4 under Apache 2.0 is a turning point. Google moved from a restrictive custom license to standard open-source for the first time.

- Llama 4 excludes the EU for multimodal models. But text-only Llama (3.3 70B, 3.2 1B/3B) is EU-exploitable: the exclusion only applies to multimodal.

- "Open-weight" is more nuanced than "open-source". Llama is technically open-weight but with geographic restrictions on multimodal. Always check the fine print.


## Benchmarks reference

What each benchmark measures, how many questions it has, and where to find more.

### Reasoning & Knowledge

- GPQA Diamond (198 questions) — Graduate-level questions in physics, chemistry, biology. Designed to be unsolvable by Google search. Experts reach 65%, non-experts 34%. The most discriminating reasoning benchmark.

- MMLU-Pro (12K+ questions) — Hardened version of MMLU: 10 choices instead of 4, requires chain-of-thought reasoning. 14 domains. Drops accuracy 16-33% vs MMLU. Published at NeurIPS 2024.

### Math

- AIME (15 problems/year) — American Invitational Mathematics Examination. Competition-level math requiring creativity and multi-step reasoning. Each year's edition is harder. Only compare within the same version (2024/2025/2026).

- MATH-500 (500 problems) — Diverse math problems (algebra, geometry, combinatorics, number theory). Good general math evaluation but easier to saturate than AIME.

### Code

- SWE-bench Verified (500 issues) — Real bugs from GitHub repos (Django, Flask, scikit-learn). The model must understand the codebase, find the bug, and produce a working patch. Human-validated by OpenAI. Paper

- Codeforces (ELO system) — Algorithmic competition performance, scored like chess ELO. Measures pure algorithmic skill, not real-world coding. Different skill from SWE-bench.

- LiveCodeBench (rotating, 700+) — Fresh competitive programming problems collected after model training cutoffs. Eliminates data contamination. Problems from LeetCode, AtCoder, Codeforces. GitHub

### Long context

- RULER (parametric) — Sophisticated "needle in a haystack" with multiple needles, multi-hop tracing, and aggregation. Tests at different lengths (4K to 1M). By NVIDIA. Many models claiming 1M context fail above 32K. GitHub

### Agents & Tools

- BFCL (2K+) — Berkeley Function Calling Leaderboard. Tests function/tool calling accuracy: correct names, parameters, types. V4 adds web search and memory. By UC Berkeley. GitHub

### Theorem proving

- miniF2F (488 problems) — Formal Olympiad-level math problems in Lean 4 (also Isabelle, HOL Light). Covers AMC, AIME, IMO, and university math. Proofs are compiler-verified: either correct or rejected. Zero hallucination possible. GitHub

### GUI

- ScreenSpot (1.2K+ instructions) — GUI element grounding across desktop, mobile, and web. Tests if the model can locate the right UI element from a natural language instruction. GitHub

## Licenses

| License | Models | Commercial | EU | Patent grant | OSI |
|---|---|---|---|---|---|
| Apache 2.0 | Gemma 4, Qwen 3/3.5, GPT-OSS, Ministral, Step-3.5-Flash | Yes | Yes | Yes | Yes |
| MIT | GLM-4.5-Air, DeepSeek R1-Distill, Phi-4 | Yes | Yes | No (implicit) | Yes |
| Nemotron OML | Nemotron 3 Nano/Super | Yes | Yes | Yes | No |
| Jamba OML | Jamba 1.6 | Yes | Yes | -- | No |
| Llama Community | Llama 3.3 70B, Llama 3.2 1B/3B (text-only) | Yes | Yes (text-only) | -- | No |
| LFM Open v1.0 | LFM2, LFM2.5 | Yes (< $10M) | Yes | -- | No |

## How to choose

| Constraint | Recommendation |
|---|---|
| Smartphone / edge (< 4 GB) | SmolLM3-3B, SmolLM2-135M/360M/1.7B, Gemma 4 E2B, Phi-4-mini, Ministral 3B, LFM2.5-1.2B, Llama 3.2 1B/3B |
| Laptop 16 GB | GPT-OSS-20B, Ministral 14B, Gemma 4 26B-A4B |
| Desktop 24 GB | Gemma 4 31B, DeepSeek R1-Distill-32B, Devstral Small 2 |
| Desktop 48+ GB (dense 70B) | Llama 3.3 70B (MMLU 86.0, EU OK), InternVL3-78B (vision) |
| Server single-GPU (80 GB) | GPT-OSS-120B |
| Server multi-GPU | Step-3.5-Flash, Nemotron 3 Super, Qwen3.5-122B |
| Long context (> 256K) | Nemotron 3 Nano (1M, RULER 86.3%) |
| Math | Nemotron Nano 9B v2 (/think mode), GPT-OSS-120B |
| Code (real bugs) | Step-3.5-Flash, Devstral Small 2 |
| Code (competition) | GPT-OSS-120B (Codeforces 2622) |
| Multilingual (100+ langs) | Qwen 3.5 (201), Qwen 3 (119) |
| Theorem proving | BFS-Prover-V2-32B (95% miniF2F) |
| GUI automation | UI-TARS-1.5-7B (94.2% ScreenSpot) |
| Throughput | Step-3.5-Flash (350 tok/s) |

## Rejected models

| Model | License | Reason |
|---|---|---|
| Llama 4 (Maverick, Scout) | Llama Community License | EU exclusion (multimodal) |
| Llama 3.2 Vision 11B/90B | Llama Community License | EU exclusion (multimodal) |
| Llama-Nemotron-Super-49B | Llama 3.3 License | Inherits EU exclusion (multimodal base) |
| Qwen 3.6 Plus | Proprietary | Closed-source, API-only |
| Codestral | Non-commercial | Research only |
| Falcon 3 | Ambiguous | Potential 10% royalty |
| Kimi K2.5 | Modified MIT (100M MAU) | User threshold |
| MiniMax M2.5 | Modified MIT | Custom restrictions |
| DeepSeek V3/R1 (full) | MIT | > 200B total (671B) |
| Qwen 3 235B / Qwen 3.5 397B | Apache 2.0 | > 200B total |

## Contributing

Found an error? Missing a model? Open an issue or submit a PR.

Sources: HuggingFace, Papers With Code, official model repos and papers.


## License

This list is licensed under CC-BY 4.0.
