A curated list of open-weight AI models with commercially exploitable licenses, verified benchmarks, and no geographic restrictions. Built to decide which models to support in herbert-rs, a local LLM inference engine in Rust and hand-written assembly.
Selection criteria:
- Commercially exploitable license, no geographic restriction (EU ok)
- Total size < 200B parameters
- Released after April 2024
This excludes Llama 4 multimodal (EU exclusion), Qwen 3.6 Plus (closed-source), full DeepSeek V3/R1 (671B), and others. Note: Llama text-only models (3.3 70B, 3.2 1B/3B) are EU-exploitable. See Rejected models for details.
Maintained by Philippe Anel. Last updated: April 2026.
- LLMs
- Specialized
- Observations
- Benchmarks reference
- Licenses
- How to choose
- Rejected models
- Contributing
| Model | Publisher | Active | Total | Arch | Ctx | License | Key scores |
|---|---|---|---|---|---|---|---|
| Gemma 4 31B | Google | 31B | 31B | Dense | 256K | Apache 2.0 | GPQA 84.3, MMLU-Pro 85.2 |
| Qwen3.5-27B | Alibaba | 27B | 27B | Dense | 128K | Apache 2.0 | 201 languages |
| Qwen3.5-9B | Alibaba | 9B | 9B | Dense | 128K | Apache 2.0 | GPQA 81.7 (9B!) |
| Qwen3.5-122B-A10B | Alibaba | 10B | 122B | MoE | 256K | Apache 2.0 | 201 languages, multimodal |
| GPT-OSS-120B | OpenAI | 5.1B | 117B | MoE | 128K | Apache 2.0 | GPQA 80.9, Codeforces 2622, AIME 96.6% |
| GPT-OSS-20B | OpenAI | 3.6B | 21B | MoE | 128K | Apache 2.0 | AIME 96%, fits 16GB |
| Mistral Small 4 | Mistral | 6B | 119B | MoE | 256K | Apache 2.0 | GPQA 71.2, unified instruct/reasoning/coding |
| GLM-4.5-Air | Zhipu AI | 12B | 106B | MoE | 128K | MIT | MATH-500 98.1%, MMLU-Pro 81.4 |
| QwQ-32B | Alibaba | 32B | 32B | Dense | 128K | Apache 2.0 | AIME ~80%, reasoning RL |
| DeepSeek R1-Distill-32B | DeepSeek | 32B | 32B | Dense | 128K | MIT | Beats o1-mini |
| Step-3.5-Flash | StepFun | 11B | 196B | MoE | 262K | Apache 2.0 | SWE-bench 74.4%, 350 tok/s |
| Llama 3.3 70B | Meta | 70B | 70B | Dense | 128K | Llama Community (EU OK) | MMLU 86.0, HumanEval 88.4, MATH 77.0 |
| InternVL3-78B | Shanghai AI Lab | 78B | 78B | Dense | -- | Apache 2.0 | MMMU 72.2, SOTA open-source VLM |
| Model | SWE-bench | Codeforces | Active | License |
|---|---|---|---|---|
| Claude Opus 4.6 (closed) | 80.8% | -- | -- | -- |
| Gemini 3.1 Pro (closed) | 80.6% | -- | -- | -- |
| GPT-5.4 (closed) | ~80% | -- | -- | -- |
| Step-3.5-Flash | 74.4% | -- | 11B | Apache 2.0 |
| Devstral 2 | 72.2% | -- | ~12B | MIT modified |
| Qwen3-Coder-Next 80B-A3B | 70.6% | -- | 3B | Apache 2.0 |
| Qwen2.5-Coder-32B | 69.6% | -- | 32B | Apache 2.0 |
| Devstral Small 2 | 68.0% | -- | 24B | Apache 2.0 |
| GPT-OSS-120B | 62.4% | 2622 | 5.1B | Apache 2.0 |
| Gemma 4 31B | -- | 2150 | 31B | Apache 2.0 |
SWE-bench = real bugs in real GitHub repos (Django, Flask, scikit-learn). 500 human-validated issues. Codeforces = algorithmic competition, ELO-scored like chess. Different skills: fixing a codebase vs solving a puzzle.
GPQA Diamond (198 questions)
Graduate-level questions in physics, chemistry, biology. Designed to be unsolvable by Google search. Experts reach 65%, non-experts 34%. The most discriminating reasoning benchmark available.
| Model | GPQA | Active |
|---|---|---|
| Gemini 3.1 Pro (closed) | 94.3 | -- |
| GPT-5.4 (closed) | 92.8 | -- |
| Claude Opus 4.6 (closed) | 91.3 | -- |
| Gemma 4 31B | 84.3 | 31B |
| Gemma 4 26B-A4B | 82.3 | 3.8B |
| Qwen3.5-9B | 81.7 | 9B |
| GPT-OSS-120B | 80.9 | 5.1B |
| GLM-4.5-Air | 75.0 | 12B |
| Nemotron 3 Nano | 73.0 | 3.5B |
| Mistral Small 4 | 71.2 | 6B |
| Llama 3.3 70B | 50.5 | 70B |
Math (AIME, 15 problems/year)
Competition-level math requiring creativity and multi-step reasoning. Each year's edition is different and harder. Only compare within the same version.
| Model | AIME | Conditions | Active |
|---|---|---|---|
| GPT-5.4 (closed) | ~100% | 2025 | -- |
| Claude Opus 4.6 (closed) | ~98% | 2025 | -- |
| Nemotron 3 Nano | 99.2% | 2025, with tools | 3.5B |
| GPT-OSS-120B | 96.6% | 2024, with tools | 5.1B |
| GPT-OSS-20B | 96.0% | 2024, with tools | 3.6B |
| Gemma 4 31B | 89.2% | 2026 | 31B |
| Gemma 4 26B-A4B | 88.3% | 2026 | 3.8B |
| Ministral 14B | 85.0% | 2025 | 14B |
| Nemotron Nano 9B v2 | 97.8% | MATH-500, /think mode | 9B |
AIME versions (2024/2025/2026) are not comparable. Each year is harder.
Models that run on smartphones, laptops, or edge devices.
| Model | Active | VRAM Q4 | Strength | License |
|---|---|---|---|---|
| SmolLM3-3B | 3B | ~2 GB | Best 3B, AIME 36.7%, /think mode, 64K ctx | Apache 2.0 |
| SmolLM2-1.7B | 1.7B | ~1 GB | 11T tokens, data-centric | Apache 2.0 |
| SmolLM2-360M | 360M | < 1 GB | 4T tokens | Apache 2.0 |
| SmolLM2-135M | 135M | < 1 GB | Ultra-compact, few MB quantized | Apache 2.0 |
| Gemma 4 E2B | 2.3B | ~4 GB | Multimodal + audio | Apache 2.0 |
| Gemma 4 E4B | 4.5B | ~6 GB | Multimodal + audio | Apache 2.0 |
| Phi-4-mini | 3.8B | ~2 GB | MATH-500 92.5% | MIT |
| Phi-4-multimodal | 5.6B | ~3 GB | Text + image + audio | MIT |
| Ministral 3B | 3B | ~2 GB | Vision + reasoning, 256K ctx | Apache 2.0 |
| Ministral 8B | 8B | ~5 GB | AIME 78.7%, vision | Apache 2.0 |
| Ministral 14B | 14B | ~8 GB | AIME 85%, vision, 256K ctx | Apache 2.0 |
| LFM2.5-1.2B | 1.2B | ~1 GB | IFBench 47.3 (2x Qwen3-1.7B), thinking, vision, audio | LFM Open v1.0 |
| Llama 3.2 1B/3B | 1-3B | < 2 GB | 128K ctx, edge/mobile, EU OK (text-only) | Llama Community |
| InternLM3-8B | 8B | ~5 GB | Thinking mode, 4T tokens (75% less training) | Apache 2.0 |
| InternVL3-1B→38B | 1-38B | 1-20 GB | Vision SOTA, full range edge→server | Apache 2.0 |
| Chocolatine-2-4B-DPO | 4B | ~2.5 GB | French-optimized DPO fine-tune of Qwen3-4B, 262K ctx, no <think> | Apache 2.0 |
SmolLM3-3B beats all other 3B models and competes with 4B models (Qwen3-4B, Gemma3-4B). Data quality matters more than model size: SmolLM2-1.7B trained on 11T tokens beats larger models trained on less data.
Chocolatine-2-4B (Jonathan Pacifico) is a DPO fine-tune of Qwen3-4B-Instruct-2507 on French preference datasets (Compar:IA from the French Ministry of Culture + French-ORCA), merged with TIES. Gains on every French benchmark tested (GPQA-FR, French MMLU, French Bench, FR-MT-Bench) without degrading English performance. One of the rare French-focused open-weight models built by an individual contributor rather than a lab.
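Where do the "VRAM Q4" figures above come from? A reasonable rule of thumb, not an official formula from any of these model cards: Q4 weights cost roughly 4.5 bits per parameter once quantization scales are counted, plus a KV cache that grows with context length. A minimal sketch in Rust, with assumed shape numbers (28 layers, 8 KV heads, head_dim 128) chosen for illustration only:

```rust
// Rough sizing behind a "VRAM Q4" estimate: weights at ~4.5 bits per
// parameter (4-bit values plus quantization scales), plus the KV cache.
// The shape numbers below are illustrative assumptions, not taken from
// any specific model card.

/// Approximate weight bytes at Q4: ~4.5 bits per parameter.
fn q4_weight_bytes(params: u64) -> u64 {
    params * 45 / 80 // 4.5 bits = 0.5625 bytes
}

/// KV cache bytes at fp16 for a GQA model:
/// 2 tensors (K, V) x layers x kv_heads x head_dim x ctx x 2 bytes.
fn kv_cache_bytes(layers: u64, kv_heads: u64, head_dim: u64, ctx: u64) -> u64 {
    2 * layers * kv_heads * head_dim * ctx * 2
}

fn main() {
    let weights = q4_weight_bytes(3_000_000_000); // a 3B dense model
    let kv = kv_cache_bytes(28, 8, 128, 8_192);   // 8K context
    println!(
        "weights ~{:.1} GB, KV cache ~{:.1} GB",
        weights as f64 / 1e9,
        kv as f64 / 1e9
    ); // ~1.7 GB + ~0.9 GB: the right order of magnitude for "~2 GB"
}
```

Note how fast the KV cache grows with context: at long contexts it dominates the weights, which is why GQA (few KV heads) matters so much for local inference.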
| Model | Max ctx | RULER 1M | Architecture | Active | License |
|---|---|---|---|---|---|
| Nemotron 3 Nano | 1M | 86.3% | Mamba/MoE | 3.5B | Nemotron OML |
| Nemotron 3 Super | 1M | -- | Mamba/MoE | 12B | Nemotron OML |
| Jamba 1.6 Mini | 256K | -- | SSM+Transformer/MoE | 12B | Jamba OML |
RULER (GitHub) tests retrieval in long contexts with multiple needles, multi-hop tracing, and aggregation. Parametric by length (4K to 1M). Many models claim "1M context" without publishing RULER scores at that length. Without measurement, it's marketing.
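For intuition, here is a toy single-needle probe in the spirit of RULER's simplest task. Real RULER adds multiple needles, multi-hop tracing, and aggregation, so treat this as a sketch of the idea, not the benchmark; `build_prompt` and the filler text are made up for illustration.

```rust
// Toy needle-in-a-haystack probe: plant a fact at a known relative
// depth in a long filler context, then ask the model to retrieve it.
// Sending the prompt to an actual model is left out.

fn build_prompt(filler: &str, needle: &str, depth: f64, target_len: usize) -> String {
    // Repeat ASCII filler until we reach the target character length.
    let mut haystack = filler.repeat(target_len / filler.len() + 1);
    haystack.truncate(target_len);
    // Insert the needle at the requested relative depth (0.0 = start).
    let pos = (target_len as f64 * depth) as usize;
    format!(
        "{}\n{}\n{}\n\nQuestion: what is the magic number?",
        &haystack[..pos],
        needle,
        &haystack[pos..]
    )
}

fn main() {
    let prompt = build_prompt(
        "The sky is blue. Water is wet. ",
        "The magic number is 42.",
        0.5,     // bury the needle mid-context
        100_000, // ~25K tokens of filler at ~4 chars/token
    );
    println!("{} chars; send to the model and check for '42'", prompt.len());
}
```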
Non-Transformer or hybrid models.
| Model | Architecture | Active | Key metric | License |
|---|---|---|---|---|
| Granite 4.0 | 90% Mamba-2 / 10% Attention | 3-9B | 70% memory reduction, 2x speed | Apache 2.0 |
| LFM2/2.5 | Convolutions + grouped attention | 2.3B | 112 tok/s CPU, 2x Qwen3. LFM2.5: vision, audio, thinking | LFM Open v1.0 |
| Jamba 1.6 Mini | Mamba + Transformer + MoE | 12B | 2.5x Transformer speed | Jamba OML |
Models pre-trained outside traditional data centers, using distributed peer-to-peer or blockchain-coordinated networks. The story is the training method, not the model quality.
| Model | Method | Size | Tokens | Architecture | License |
|---|---|---|---|---|---|
| Covenant-72B | Permissionless P2P, SparseLoCo optimizer, Bittensor blockchain (Subnet 3) | 72B dense | 1.1T (+14.8B SFT) | LLaMA-3 style, GQA, 80 layers, d=8192, 64 heads, 8 KV heads, RoPE 500K, ctx 2048→8192 | Apache 2.0 (checkpoints) |
Pre-training benchmarks (0-shot) vs other dense baselines:
| Benchmark | Covenant-72B | LLaMA-2-70B (centralized) | LLM360 K2 (65B, centralized) | INTELLECT-1 (10B, P2P) |
|---|---|---|---|---|
| ARC-Challenge | 56.8 | 57.4 | 53.8 | 44.8 |
| ARC-Easy | 80.9 | 79.6 | 76.0 | 71.8 |
| PIQA | 81.6 | 82.6 | 82.5 | 77.4 |
| OpenBookQA | 44.0 | 49.4 | 48.0 | 43.8 |
| HellaSwag | 80.6 | 84.3 | 82.9 | 70.3 |
| WinoGrande | 75.9 | 80.4 | 76.4 | 63.3 |
| MMLU | 67.1 | 65.6 | 65.5 | 32.7 |
Covenant-72B-Chat (post-SFT) vs other chat models:
| Benchmark | Covenant-72B-Chat | LLaMA-2-70B-Chat | K2-Chat (65B) |
|---|---|---|---|
| ARC-Challenge | 64.2 | 65.4 | 62.0 |
| MMLU | 67.4 | 63.1 | 67.9 |
| IFEval | 64.7 | 40.7 | 45.5 |
| MATH | 26.3 | 10.7 | 19.1 |
| MMLU-Pro | 40.9 | 35.2 | 45.4 |
| GSM8K | 63.9 | 52.2 | 79.0 |
Why it matters: Covenant-72B is the first proof-of-concept that 72B-scale pre-training is possible without data centers, with peers joining and leaving freely. Coordination via the Bittensor blockchain (Subnet 3), communication via SparseLoCo (146× compression vs dense gradients), peers running 8×B200 GPUs over commodity internet (500 Mb/s down, 110 Mb/s up). The model achieves 94.5% compute utilization despite the network constraints, with an average of 16.9 contributing peers per round and 70+ unique peers over the run. On benchmarks, it beats LLaMA-2-70B on ARC-Challenge, ARC-Easy and MMLU (despite 1.8× fewer training tokens), and the chat variant has the best IFEval and MATH scores in its comparison group. It's the first credible alternative to the data-center duopoly for pre-training at 70B scale. Authors: Covenant AI + Mila. See arXiv 2603.08163.
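To see why sparsification buys such large communication savings, here is a toy top-k gradient compressor in Rust. This illustrates the general technique only; the actual SparseLoCo details (error feedback, the 2-bit value codebook, optimizer integration) are in the paper.

```rust
// Toy top-k gradient compression: send only the k largest-magnitude
// entries as (index, value) pairs instead of the dense gradient.
// A sketch of the general technique, not the SparseLoCo algorithm.

/// Keep the k largest-magnitude gradient entries; return (index, value).
fn top_k_sparsify(grad: &[f32], k: usize) -> Vec<(u32, f32)> {
    let mut idx: Vec<u32> = (0..grad.len() as u32).collect();
    // Sort indices by descending |gradient|.
    idx.sort_unstable_by(|&a, &b| {
        grad[b as usize].abs().total_cmp(&grad[a as usize].abs())
    });
    idx.truncate(k);
    idx.into_iter().map(|i| (i, grad[i as usize])).collect()
}

fn main() {
    // Synthetic 1M-entry gradient.
    let grad: Vec<f32> = (0..1_000_000)
        .map(|i| ((i * 37) % 1000) as f32 / 1000.0 - 0.5)
        .collect();
    let k = grad.len() / 100; // keep 1% of entries
    let sparse = top_k_sparsify(&grad, k);
    // Dense fp32: 4 bytes/entry. Sparse: 4-byte index plus a value that
    // 2-bit quantization (as in the paper) shrinks to a fraction of a byte.
    let dense_bytes = grad.len() * 4;
    let sparse_bytes = sparse.len() * (4 + 1);
    println!("compression ~{}x", dense_bytes / sparse_bytes); // ~80x
}
```

Keeping 1% of entries already yields roughly 80x savings before value quantization, which is how figures like 146x become plausible.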
miniF2F (GitHub): 488 formal Olympiad-level math problems. Proofs are compiler-verified: either correct or rejected. Zero hallucination possible on mathematical correctness.
| Model | miniF2F | PutnamBench | Active | License |
|---|---|---|---|---|
| BFS-Prover-V2-32B | 95.0% | -- | 32B | Apache 2.0 |
| Goedel-Prover-V2-32B | 90.4% | #1 | 32B | Apache 2.0 |
| DeepSeek-Prover-V2-7B | 88.9% | -- | 7B | MIT |
| Leanstral | -- | -- | 32B | Apache 2.0 |
| Kimina-Prover-72B | 84.0% | -- | 72B | MIT |
| Leanabell-Prover-V2-7B | 78.2% | -- | 7B | Apache 2.0 |
The sweet spot is 32B: BFS-Prover (95%) and Goedel-V2 (90.4%) both beat the 72B Kimina (84%).
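To make "compiler-verified" concrete: a prover model's output is a Lean 4 proof term that either type-checks or fails. A trivial example of that kind of artifact (the theorem here is illustrative, far below miniF2F difficulty):

```lean
-- The kind of output a prover model emits: a Lean 4 proof term.
-- The compiler either accepts it (theorem proved) or rejects it;
-- there is no partially-correct answer a model can bluff past.
theorem sum_comm (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```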
ScreenSpot (GitHub): 1,200+ instructions across desktop, mobile, web. Tests if the model can locate the right UI element from a natural language instruction.
| Model | ScreenSpot | OSWorld | Active | License |
|---|---|---|---|---|
| UI-TARS-1.5-7B | 94.2% | 42.5 | 7B | Apache 2.0 |
| Qwen2.5-VL-7B | 84.7% | -- | 7B | Apache 2.0 |
| ShowUI-2B | -- | -- | 2B | MIT |
UI-TARS-7B beats Claude (87.6%) on ScreenSpot. 7B, Apache 2.0, runs on a laptop.
| Model | Specialty | Active | License |
|---|---|---|---|
| WebThinker-32B | RL web search, beats Gemini Deep Research | 32B | Apache 2.0 |
| DeepResearcher-7B | Emergent multi-step planning via RL | 7B | Apache 2.0 |
| Search-R1 | Framework: teach any LLM to search (+26% on 7B) | any | Apache 2.0 |
BFCL (GitHub): Berkeley Function Calling Leaderboard. Tests function/tool calling accuracy: correct names, parameters, types. V4 adds web search and memory.
| Model | BFCL | Active | License |
|---|---|---|---|
| Hammer2.1-7B | #1 | 7B | CC-BY-NC 4.0 |
| xLAM-8B | #1 (alternate) | 8B | CC-BY-NC 4.0 |
| Hammer-0.5B | On-device | 0.5B | CC-BY-NC 4.0 |
Specialized tool-calling models clearly beat generalists. xLAM-8B beats GPT-4o on BFCL.
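What BFCL-style scoring checks is easy to picture: given a declared tool signature, does the model's call use the right function name, parameter names, and types? A hand-rolled sketch of that check in Rust; the types and the `validate` helper are invented for illustration, and a real harness would validate against JSON Schema.

```rust
// Sketch of a tool-call validity check: name, parameter names, types.

#[derive(PartialEq, Clone, Copy)]
enum ParamType { Str, Int, Bool }

struct ToolParam { name: &'static str, ty: ParamType, required: bool }
struct ToolSig { name: &'static str, params: Vec<ToolParam> }

struct Arg { name: String, ty: ParamType }
struct Call { name: String, args: Vec<Arg> }

fn validate(sig: &ToolSig, call: &Call) -> bool {
    // Wrong function name: immediate failure.
    if call.name != sig.name { return false; }
    // Every argument must match a declared parameter (name and type)...
    let args_ok = call.args.iter()
        .all(|a| sig.params.iter().any(|p| p.name == a.name && p.ty == a.ty));
    // ...and every required parameter must actually be supplied.
    let required_ok = sig.params.iter().filter(|p| p.required)
        .all(|p| call.args.iter().any(|a| a.name == p.name));
    args_ok && required_ok
}

fn main() {
    let sig = ToolSig {
        name: "get_weather",
        params: vec![
            ToolParam { name: "city", ty: ParamType::Str, required: true },
            ToolParam { name: "celsius", ty: ParamType::Bool, required: false },
        ],
    };
    let good = Call {
        name: "get_weather".into(),
        args: vec![Arg { name: "city".into(), ty: ParamType::Str }],
    };
    let bad = Call { // wrong type for "city": scored as a miss
        name: "get_weather".into(),
        args: vec![Arg { name: "city".into(), ty: ParamType::Int }],
    };
    println!("{} {}", validate(&sig, &good), validate(&sig, &bad)); // true false
}
```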
| Model | Strandset-Rust | RustEvo2 | Active | License |
|---|---|---|---|---|
| Strand-Rust-Coder-14B | 0.50 | 0.43 | 14B | Apache 2.0 (base) |
Beats GPT-5-Codex and Claude Sonnet 4.5 on Rust benchmarks. Fine-tuned on 191K examples from 2,383 crates.
| Model | MMMU | Active | Key feature | License |
|---|---|---|---|---|
| InternVL3-78B | 72.2 | 78B | SOTA open-source VLM, custom InternViT | Apache 2.0 |
| InternVL3-1B→38B | -- | 1-38B | Full range edge→server | Apache 2.0 |
| Gemma 4 31B | 76.9 (MMMU-Pro) | 31B | Text + image + video | Apache 2.0 |
| Gemma 4 E2B/E4B | -- | 2.3-4.5B | Multimodal + audio, edge | Apache 2.0 |
| Qwen2.5-VL-7B | -- | 7B | Computer/phone use, DocVQA 95.7 | Apache 2.0 |
InternVL3-78B (72.2 MMMU) is on par with GPT-4o on multimodal. The InternViT encoder (300M–6B) is trained jointly with the LLM — not bolted on after the fact.
Patterns observed across 60+ models. Not definitive truths.
- Dense retreats above 35B, but doesn't die. For generalists above 35B, MoE clearly dominates (GPT-OSS-120B, Mistral Small 4, Qwen3.5-122B-A10B, GLM-4.5-Air, Step-3.5-Flash, Nemotron 3 Super; all MoE). But dense survives where it has a structural advantage: Llama 3.3 70B (generalist), InternVL3-78B (vision), Kimina-Prover-72B (theorem proving), Qwen 2.5-72B (production NLP), Covenant-72B (decentralized training), DeepSeek R1-Distill-70B (distilled reasoning). Dense is becoming a specialization choice.
- Parameter count is no longer the determining factor. Qwen3.5-9B (9B dense) beats GPT-OSS-120B (5.1B active, 117B total) on GPQA Diamond.
- The 40-79B segment is the dense survivors' refuge. New models often jump from ~35B straight to ~120B total via MoE, but the 40-79B range is well populated by quality dense models (Llama 3.3 70B, InternVL3-78B, Kimina-Prover-72B, Qwen 2.5-72B, Covenant-72B, R1-Distill-70B, Jamba 1.6 Mini 52B). This is where dense resists, and where you find both solid generalists and specialists.
- InternVL3 is the best open-source VLM nobody was talking about. InternVL3-78B (Shanghai AI Lab) reaches 72.2 MMMU under Apache 2.0, on par with GPT-4o. InternLM3-8B achieves SOTA with 75% fewer training tokens (4T vs 15-18T). Less press than Alibaba, comparable results.
- Qwen is the de facto base model for fine-tuning. BFS-Prover, Goedel-Prover, Kimina-Prover, most community distillations: all built on Qwen. The ResNet of LLMs.
- Decentralized pre-training is no longer a toy. Covenant-72B (Mar 2026) pre-trained a 72B dense LLaMA-3-style model over a permissionless blockchain network (Bittensor Subnet 3) on 1.1T tokens. It beats LLaMA-2-70B on ARC-Challenge, ARC-Easy and MMLU despite 1.8× fewer training tokens, with 94.5% compute utilization over commodity internet (500/110 Mb/s) and dynamic peer participation. The data-center duopoly for pre-training at 70B scale now has a credible alternative. SparseLoCo plus 2-bit quantization gives 146× compression on gradient communication.
- GPQA Diamond is the most discriminating benchmark for reasoning: 198 doctoral-level questions, impossible to solve by retrieval.
- SWE-bench and Codeforces measure different things. GPT-OSS-120B dominates competition (ELO 2622) but gets beaten on real bugs by Step-3.5-Flash (74.4% vs 62.4%).
- Many models claim "1M context" without RULER scores at that length. Without measurement, it's marketing.
- AIME versions (2024/2025/2026) are not comparable. Each year is harder. Only compare within the same version.
- Specialized models dominate on narrow tasks. UI-TARS-7B beats Claude on GUI grounding (94.2% vs 87.6%). BFS-Prover-32B beats DeepSeek-671B on theorem proving (95% vs 88.9%).
- The sweet spot for theorem proving is 32B. Method (tree search, self-correction) compensates for size.
- Domain-specific models (medical, legal, finance) are less mature than code/math specialists. Generalists often outperform them on domain benchmarks. Specialization helps mainly for specific vocabulary, regulatory compliance, and private-data fine-tuning.
- Gemma 4 under Apache 2.0 is a turning point. Google moved from a restrictive custom license to standard open source for the first time.
- Llama 4 excludes the EU for multimodal models. But text-only Llama (3.3 70B, 3.2 1B/3B) is EU-exploitable; the exclusion only applies to multimodal.
- "Open-weight" is more nuanced than "open-source". Llama is technically open-weight but with geographic restrictions on multimodal. Always check the fine print.
What each benchmark measures, how many questions it has, and where to find more.
- GPQA Diamond (198 questions) — Graduate-level questions in physics, chemistry, biology. Designed to be unsolvable by Google search. Experts reach 65%, non-experts 34%. The most discriminating reasoning benchmark.
- MMLU-Pro (12K+ questions) — Hardened version of MMLU: 10 choices instead of 4, requires chain-of-thought reasoning. 14 domains. Drops accuracy 16-33% vs MMLU. Published at NeurIPS 2024.
- AIME (15 problems/year) — American Invitational Mathematics Examination. Competition-level math requiring creativity and multi-step reasoning. Each year's edition is harder. Only compare within the same version (2024/2025/2026).
- MATH-500 (500 problems) — Diverse math problems (algebra, geometry, combinatorics, number theory). Good general math evaluation but easier to saturate than AIME.
- SWE-bench Verified (500 issues) — Real bugs from GitHub repos (Django, Flask, scikit-learn). The model must understand the codebase, find the bug, and produce a working patch. Human-validated by OpenAI. Paper
- Codeforces (ELO system) — Algorithmic competition performance, scored like chess ELO (see the worked example after this list). Measures pure algorithmic skill, not real-world coding. Different skill from SWE-bench.
- LiveCodeBench (rotating, 700+) — Fresh competitive programming problems collected after model training cutoffs. Eliminates data contamination. Problems from LeetCode, AtCoder, Codeforces. GitHub
- RULER (parametric) — Sophisticated "needle in a haystack" with multiple needles, multi-hop tracing, and aggregation. Tests at different lengths (4K to 1M). By NVIDIA. Many models claiming 1M context fail above 32K. GitHub
- BFCL (2K+) — Berkeley Function Calling Leaderboard. Tests function/tool calling accuracy: correct names, parameters, types. V4 adds web search and memory. By UC Berkeley. GitHub
- miniF2F (488 problems) — Formal Olympiad-level math problems in Lean 4 (also Isabelle, HOL Light). Covers AMC, AIME, IMO, and university math. Proofs are compiler-verified: either correct or rejected. Zero hallucination possible. GitHub
- ScreenSpot (1.2K+ instructions) — GUI element grounding across desktop, mobile, and web. Tests if the model can locate the right UI element from a natural language instruction. GitHub
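The worked ELO example promised above: expected score follows the standard chess formula E_A = 1 / (1 + 10^((R_B - R_A)/400)). A minimal sketch using two ratings from the coding table (GPT-OSS-120B at 2622, Gemma 4 31B at 2150):

```rust
// Elo expected score, the same formula chess uses, for reading
// Codeforces-style ratings.

/// Expected score of player A (rating ra) against player B (rating rb).
fn elo_expected(ra: f64, rb: f64) -> f64 {
    1.0 / (1.0 + 10f64.powf((rb - ra) / 400.0))
}

fn main() {
    let e = elo_expected(2622.0, 2150.0);
    println!("{:.2}", e); // ~0.94: a 472-point gap means ~94% expected score
}
```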
| License | Models | Commercial | EU | Patent grant | OSI |
|---|---|---|---|---|---|
| Apache 2.0 | Gemma 4, Qwen 3/3.5, GPT-OSS, Ministral, Step-3.5-Flash | Yes | Yes | Yes | Yes |
| MIT | GLM-4.5-Air, DeepSeek R1-Distill, Phi-4 | Yes | Yes | No (implicit) | Yes |
| Nemotron OML | Nemotron 3 Nano/Super | Yes | Yes | Yes | No |
| Jamba OML | Jamba 1.6 | Yes | Yes | -- | No |
| Llama Community | Llama 3.3 70B, Llama 3.2 1B/3B (text-only) | Yes | Yes (text-only) | -- | No |
| LFM Open v1.0 | LFM2, LFM2.5 | Yes (< $10M) | Yes | -- | No |
| Constraint | Recommendation |
|---|---|
| Smartphone / edge (< 4 GB) | SmolLM3-3B, SmolLM2-135M/360M/1.7B, Gemma 4 E2B, Phi-4-mini, Ministral 3B, LFM2.5-1.2B, Llama 3.2 1B/3B |
| Laptop 16 GB | GPT-OSS-20B, Ministral 14B, Gemma 4 26B-A4B |
| Desktop 24 GB | Gemma 4 31B, DeepSeek R1-Distill-32B, Devstral Small 2 |
| Desktop 48+ GB (dense 70B) | Llama 3.3 70B (MMLU 86.0, EU OK), InternVL3-78B (vision) |
| Server single-GPU (80 GB) | GPT-OSS-120B |
| Server multi-GPU | Step-3.5-Flash, Nemotron 3 Super, Qwen3.5-122B |
| Long context (> 256K) | Nemotron 3 Nano (1M, RULER 86.3%) |
| Math | Nemotron Nano 9B v2 (/think mode), GPT-OSS-120B |
| Code (real bugs) | Step-3.5-Flash, Devstral Small 2 |
| Code (competition) | GPT-OSS-120B (Codeforces 2622) |
| Multilingual (100+ langs) | Qwen 3.5 (201), Qwen 3 (119) |
| Theorem proving | BFS-Prover-V2-32B (95% miniF2F) |
| GUI automation | UI-TARS-1.5-7B (94.2% ScreenSpot) |
| Throughput | Step-3.5-Flash (350 tok/s) |
| Model | License | Reason |
|---|---|---|
| Llama 4 (Maverick, Scout) | Llama Community License | EU exclusion (multimodal) |
| Llama 3.2 Vision 11B/90B | Llama Community License | EU exclusion (multimodal) |
| Llama-Nemotron-Super-49B | Llama 3.3 License | Inherits EU exclusion (multimodal base) |
| Qwen 3.6 Plus | Proprietary | Closed-source, API-only |
| Codestral | Non-commercial | Research only |
| Falcon 3 | Ambiguous | Potential 10% royalty |
| Kimi K2.5 | Modified MIT (100M MAU) | User threshold |
| MiniMax M2.5 | Modified MIT | Custom restrictions |
| DeepSeek V3/R1 (full) | MIT | > 200B total (671B) |
| Qwen 3 235B / Qwen 3.5 397B | Apache 2.0 | > 200B total |
Found an error? Missing a model? Open an issue or submit a PR.
Sources: HuggingFace, Papers With Code, official model repos and papers.
This list is licensed under CC-BY 4.0.