Single-script benchmarking suite for the entire Qwen 3.5 model family (0.8B → 35B) running locally on Apple Silicon via MLX.
No cloud. No API keys. No GPU rental. Just your Mac and its unified memory.
You're building something with a local LLM. You've got six Qwen 3.5 sizes to choose from. The internet has opinions — but none of those people have your Mac, your RAM, or your workload.
The real question isn't "which model is best?" — it's "which is the smallest model that's good enough for what I need?"
That's what this tool answers. Here's why it matters:
- Your hardware, your numbers. Benchmarks run on the exact Apple Silicon Mac you'll use in production. Not someone else's cloud GPU, not a spec sheet — your actual unified memory, your actual Metal cores. The tok/s you see is the tok/s you'll get.
- Your prompts, your answers. The built-in suite covers factual, reasoning, creative, and code tasks — but you can plug in your own prompts. Whether it's a simple question or a gnarly multi-step instruction, you'll see exactly how each model handles your use case.
- Right-size your model. Maybe the 0.8B runs at 140 tok/s and scores 6/10 on your task. Maybe the 4B runs at 68 tok/s but scores 8.5/10. Is that extra quality worth 4x the memory and half the speed? Now you have the data to decide — not guess.
- One command, full picture. Throughput, memory footprint, response quality, cost-efficiency score — all in one run, all comparable, all on the same hardware. No spreadsheet stitching, no tab-switching between blog posts.
- 6 models, one script — benchmarks Qwen 3.5 at 0.8B, 2B, 4B, 9B, 27B, and 35B (MoE) parameters
- Interactive TUI — checkbox model/prompt selection, live config summary, zero CLI flags needed
- Real hardware metrics — generation tok/s, prompt tok/s, peak unified memory (GB), elapsed time
- Auto-judge — the largest model scores all smaller models' responses (accuracy, clarity, completeness)
- Cost-efficiency scoring — composite 0–100 score blending speed, memory, and quality
- Thinking extraction — separates `<think>...</think>` reasoning blocks from final answers (Qwen 3.5 4B+)
- Crash-resilient — incremental JSONL saves after every inference; resume picks up where you left off
- Side-by-side diffs — unified diff of smallest vs. largest model responses
- Export — JSONL (raw data), Markdown (GitHub-ready report), or HTML (Rich terminal capture)
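The `extract_thinking()` step named in the architecture diagram below can be sketched in a few lines. This is a minimal illustration assuming the response wraps its reasoning in a single `<think>...</think>` block; the regex and exact behavior of the real function are assumptions:

```python
import re

def extract_thinking(text: str) -> tuple[str, str]:
    """Split a model response into its <think>...</think> block and the final answer."""
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if not match:
        # Smaller models emit no thinking block; the whole text is the answer
        return "", text.strip()
    thinking = match.group(1).strip()
    answer = text[match.end():].strip()  # everything after the closing tag
    return thinking, answer
```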
┌─────────────────────────────────────────────────────────────────────────┐
│ qwen_text.py │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────┐ ┌──────────┐ │
│ │ CLI │ │ TUI │ Either path produces │
│ │ argparse │ │questiony │──► a BenchmarkConfig │
│ └────┬─────┘ └────┬─────┘ │
│ └───────┬───────┘ │
│ ▼ │
│ ┌───────────────┐ │
│ │BenchmarkConfig│ │
│ └───────┬───────┘ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ run_benchmark() │ │
│ │ │ │
│ │ ┌─── for each model ──────────────────────────────────┐ │ │
│ │ │ │ │ │
│ │ │ load(model_id) ◄── mlx_lm (download + load) │ │ │
│ │ │ │ │ │ │
│ │ │ ▼ │ │ │
│ │ │ ┌── for each prompt ───────────────────────────┐ │ │ │
│ │ │ │ │ │ │ │
│ │ │ │ stream_generate() ──► BenchmarkResult │ │ │ │
│ │ │ │ │ │ │ │ │ │
│ │ │ │ │ append to JSONL │ │ │ │
│ │ │ │ │ (crash-resilient) │ │ │ │
│ │ │ │ ▼ │ │ │ │
│ │ │ │ extract_thinking() ──► thinking | answer │ │ │ │
│ │ │ │ │ │ │ │
│ │ │ └──────────────────────────────────────────────┘ │ │ │
│ │ │ │ │ │
│ │ │ unload model + gc.collect() │ │ │
│ │ └─────────────────────────────────────────────────────┘ │ │
│ │ │ │
│ │ ┌─────────────┐ ┌──────────────────┐ │ │
│ │ │ Auto-Judge │──►│ Cost-Efficiency │ │ │
│ │ │ (largest LM) │ │ Scoring │ │ │
│ │ └─────────────┘ └──────────────────┘ │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Table │ │ Response │ │ Diff │ │ Export │ │
│ │ Summary │ │ Panels │ │ View │ │ MD/HTML │ │
│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
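The `BenchmarkConfig` and `BenchmarkResult` boxes in the diagram above are plain data containers. A minimal sketch of what they might hold, with field names inferred from the metrics and defaults listed in this README (the script's actual schema may differ):

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class BenchmarkConfig:
    # Field names and defaults are illustrative guesses, not the script's schema
    models: list = field(default_factory=lambda: ["0.8B", "2B", "4B", "9B", "27B", "35B"])
    prompts: list = field(default_factory=lambda: ["factual", "reasoning", "creative", "code"])
    max_tokens: int = 8192
    temp: float = 0.0
    judge: bool = True

@dataclass
class BenchmarkResult:
    model: str
    prompt_label: str
    response: str
    gen_tps: float         # generation tokens/sec
    prompt_tps: float      # prompt tokens/sec
    peak_memory_gb: float  # peak unified memory during the run
    elapsed_s: float

    def to_jsonl(self) -> str:
        # One self-contained JSON object per line, ready to append
        return json.dumps(asdict(self))
```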
| | Minimum | Recommended |
|---|---|---|
| Platform | Apple Silicon Mac (M1/M2/M3/M4) | M-series Pro/Max/Ultra |
| RAM | 8 GB unified memory | 32 GB+ unified memory |
| macOS | 13.0+ (Ventura) | Latest |
This project runs exclusively on Apple Silicon Macs. It uses MLX, Apple's machine learning framework optimized for the Metal GPU and unified memory architecture. It will not work on Intel Macs, Linux, or Windows.
| Model | Approx. Memory Needed |
|---|---|
| Qwen3.5-0.8B | ~2 GB |
| Qwen3.5-2B | ~5 GB |
| Qwen3.5-4B | ~9 GB |
| Qwen3.5-9B | ~19 GB |
| Qwen3.5-27B | ~55 GB |
| Qwen3.5-35B-A3B (MoE) | ~70 GB |
Models that exceed your available unified memory will fail gracefully with an error logged — the benchmark continues with the remaining models.
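The skip-and-continue behavior can be sketched like this. Here `load_model` stands in for the real mlx_lm loader, and catching `MemoryError` specifically is an assumption about how the script recovers, not its actual code:

```python
def run_all(model_ids, load_model, log=print):
    """Try each model in turn; log failures (e.g. out of memory) and keep going."""
    completed, skipped = [], []
    for model_id in model_ids:
        try:
            model = load_model(model_id)  # may raise if unified memory is exceeded
        except MemoryError as exc:
            log(f"skipping {model_id}: {exc}")
            skipped.append(model_id)
            continue
        # ... benchmark `model` here, then unload it before the next one ...
        completed.append(model_id)
    return completed, skipped
```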
The models are hosted on Hugging Face under the mlx-community organization. You need a one-time Hugging Face authentication to download them.
Option A — CLI login (recommended):

```bash
pip install huggingface_hub
huggingface-cli login
# Paste your token from https://huggingface.co/settings/tokens
```

Option B — Environment variable:

```bash
export HF_TOKEN="hf_your_token_here"
```

Get a free token at huggingface.co/settings/tokens. A read-only token is sufficient. Models are cached locally after the first download (`~/.cache/huggingface/hub`), so you only need network access once per model.
```bash
pip install mlx mlx-lm rich questionary
```

That's it. No requirements.txt bloat, no virtual environment ceremony — four packages.
```bash
python qwen_text.py
```

Launches a full TUI where you pick models, prompts, and options with checkboxes — see the TUI screenshot above.
```bash
# Benchmark specific models
python qwen_text.py --models 0.8B,4B,9B

# Single custom prompt
python qwen_text.py --prompt "Explain quantum entanglement simply."

# Subset of built-in prompts + markdown export
python qwen_text.py --prompts reasoning,code --format markdown

# Full run, no judge, with diff view
python qwen_text.py --no-judge --diff

# Fresh run (ignore cached results)
python qwen_text.py --no-resume
```

| Flag | Default | Description |
|---|---|---|
| `--tui` | (default with no args) | Launch interactive TUI |
| `--models` | all 6 | Comma-separated size filter (e.g. `0.8B,4B`) |
| `--prompt` | — | Single custom prompt (overrides suite) |
| `--prompts` | all 4 | Subset: `factual`, `reasoning`, `creative`, `code` |
| `--max-tokens` | 8192 | Max generation tokens |
| `--temp` | 0.0 | Sampling temperature |
| `--output` | `results.jsonl` | Output file path |
| `--format` | `jsonl` | Export: `jsonl`, `markdown`, or `html` |
| `--report` | — | Explicit report file path |
| `--no-judge` | — | Skip auto-judge step |
| `--no-resume` | — | Force re-run, ignore cache |
| `--diff` | — | Show diff of smallest vs. largest model |
| Label | Prompt | Tests |
|---|---|---|
| `factual` | "Explain what a transformer model is in 2 sentences." | Conciseness, accuracy |
| `reasoning` | "A farmer has 17 sheep. All but 9 die. How many are left?" | Logic, step-by-step |
| `creative` | "Write a short poem about a robot discovering the ocean." | Creativity, style |
| `code` | "Write a Python function that checks if a string is a valid IPv4 address." | Code quality, edge cases |
One JSON object per model × prompt run — every metric, the full response, thinking blocks, judge scores. Machine-readable, diff-friendly, appendable.
A self-contained .md report with summary table, per-prompt responses, collapsible thinking blocks, and cost-efficiency rankings. Drops right into a GitHub issue or wiki.
Rich terminal output captured as styled HTML. Dark-themed, monospace, looks exactly like your terminal — but shareable.
The largest model in your run acts as judge. For each prompt, it reads every other model's answer and scores them 1–10 on accuracy, completeness, clarity, and conciseness. Scores and rationales are stored in the results.
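One plausible shape for this step: prompt the judge for a small JSON object and parse it defensively. The rubric names match the four criteria above; everything else here (prompt wording, parsing strategy, function names) is an assumption, not the script's actual code:

```python
import json
import re

RUBRIC = ["accuracy", "completeness", "clarity", "conciseness"]

def judge_prompt(question: str, answer: str) -> str:
    # Ask the judge model for machine-parseable scores plus a short rationale
    return (
        f"Question: {question}\nCandidate answer: {answer}\n"
        f"Score the answer 1-10 on {', '.join(RUBRIC)}. "
        'Reply with JSON like {"accuracy": 8, ...} and a one-line rationale.'
    )

def parse_scores(judge_output: str) -> dict:
    """Pull the first JSON object out of the judge's reply; tolerate extra prose."""
    match = re.search(r"\{.*?\}", judge_output, flags=re.DOTALL)
    if not match:
        return {}
    try:
        raw = json.loads(match.group(0))
    except json.JSONDecodeError:
        return {}
    return {k: int(v) for k, v in raw.items() if k in RUBRIC}
```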
A composite score from three normalized signals:
```
With judge:    0.3 × speed + 0.3 × memory_efficiency + 0.4 × judge_score
Without judge: 0.5 × speed + 0.5 × memory_efficiency
```
Higher is better. A small model that's fast, lean, and still scores well will rank above a huge model that's slow and memory-hungry.
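As a sketch, with each signal already normalized to [0, 1] (how speed and memory efficiency are normalized is up to the implementation; the weights are the ones stated above):

```python
def cost_efficiency(speed: float, memory_efficiency: float, judge_score: float = None) -> float:
    """Composite 0-100 score from normalized [0, 1] inputs; judge_score is optional."""
    if judge_score is not None:
        raw = 0.3 * speed + 0.3 * memory_efficiency + 0.4 * judge_score
    else:
        raw = 0.5 * speed + 0.5 * memory_efficiency
    return round(100 * raw, 1)
```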
Every result is appended to the JSONL file immediately after inference. If the process crashes, gets killed, or you run out of memory on a large model:
```bash
# Just re-run — completed pairs are skipped automatically
python qwen_text.py
```

Use `--no-resume` to force a fresh run.
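The resume logic boils down to reading the JSONL back and skipping (model, prompt) pairs that already have a record. A sketch, assuming record keys `model` and `prompt_label` (the actual keys may differ):

```python
import json
from pathlib import Path

def load_completed(path: Path) -> set:
    """Read results.jsonl and return the (model, prompt) pairs already done."""
    done = set()
    if path.exists():
        for line in path.read_text().splitlines():
            if line.strip():
                rec = json.loads(line)
                done.add((rec["model"], rec["prompt_label"]))
    return done

def append_result(path: Path, record: dict) -> None:
    """Append one result immediately, so a crash loses at most the in-flight run."""
    with path.open("a") as f:
        f.write(json.dumps(record) + "\n")
```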
```
qwenbench/
├── qwen_text.py    # The entire benchmark suite (single file)
├── assets/         # SVG screenshots for README
├── results.jsonl   # Generated: raw benchmark data
├── results.md      # Generated: markdown report (if --format markdown)
├── results.html    # Generated: HTML report (if --format html)
└── README.md
```
| Problem | Fix |
|---|---|
| `ModuleNotFoundError: mlx` | You're not on Apple Silicon, or `pip install mlx mlx-lm` was missed |
| `MemoryError` on large model | Not enough unified memory — skip that model size with `--models` |
| `401 Unauthorized` from HF | Run `huggingface-cli login` or set the `HF_TOKEN` env var |
| TUI doesn't render properly | Ensure your terminal supports ANSI colors (iTerm2, Terminal.app, Warp, etc.) |
| Stuck on "Loading model..." | First download can take minutes depending on model size + connection speed |
MIT
Built for the silicon. Tested on M1 Pro, M2 Max, M3 Ultra, and M4 Max.