Single-script benchmarking suite for the entire Qwen 3.5 model family (0.8B → 35B) running locally on Apple Silicon via MLX.
No cloud. No API keys. No GPU rental. Just your Mac and its unified memory.
You're building something with a local LLM. You've got six Qwen 3.5 sizes to choose from. The internet has opinions — but none of those people have your Mac, your RAM, or your workload.
The real question isn't "which model is best?" — it's "which is the smallest model that's good enough for what I need?"
That's what this tool answers. Here's why it matters:
- Your hardware, your numbers. Benchmarks run on the exact Apple Silicon Mac you'll use in production. Not someone else's cloud GPU, not a spec sheet — your actual unified memory, your actual Metal cores. The tok/s you see is the tok/s you'll get.
- Your prompts, your answers. The built-in suite covers factual, reasoning, creative, and code tasks — but you can plug in your own prompts. Whether it's a simple question or a gnarly multi-step instruction, you'll see exactly how each model handles your use case.
- Right-size your model. Maybe the 0.8B runs at 140 tok/s and scores 6/10 on your task. Maybe the 4B runs at 68 tok/s but scores 8.5/10. Is that extra quality worth 4x the memory and half the speed? Now you have the data to decide — not guess.
- One command, full picture. Throughput, memory footprint, response quality, cost-efficiency score — all in one run, all comparable, all on the same hardware. No spreadsheet stitching, no tab-switching between blog posts.
- 6 models, one script — benchmarks Qwen 3.5 at 0.8B, 2B, 4B, 9B, 27B, and 35B (MoE) parameters
- Interactive TUI — checkbox model/prompt selection, live config summary, zero CLI flags needed
- Real hardware metrics — generation tok/s, prompt tok/s, peak unified memory (GB), elapsed time
- Auto-judge — the largest model scores all smaller models' responses (accuracy, clarity, completeness)
- Cost-efficiency scoring — composite 0–100 score blending speed, memory, and quality
- Thinking extraction — separates `<think>...</think>` reasoning blocks from final answers (Qwen 3.5 4B+)
- Crash-resilient — incremental JSONL saves after every inference; resume picks up where you left off
- Side-by-side diffs — unified diff of smallest vs. largest model responses
- Export — JSONL (raw data), Markdown (GitHub-ready report), or HTML (Rich terminal capture)
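The `extract_thinking()` step named in the architecture diagram below can be sketched in a few lines. This is a minimal illustration assuming the response wraps its reasoning in a single `<think>...</think>` block; the regex and exact behavior of the real function are assumptions:

```python
import re

def extract_thinking(text: str) -> tuple[str, str]:
    """Split a model response into its <think>...</think> block and the final answer."""
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if not match:
        # Smaller models emit no thinking block; the whole text is the answer
        return "", text.strip()
    thinking = match.group(1).strip()
    answer = text[match.end():].strip()  # everything after the closing tag
    return thinking, answer
```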
┌─────────────────────────────────────────────────────────────────────────┐
│ qwen_text.py │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────┐ ┌──────────┐ │
│ │ CLI │ │ TUI │ Either path produces │
│ │ argparse │ │questiony │──► a BenchmarkConfig │
│ └────┬─────┘ └────┬─────┘ │
│ └───────┬───────┘ │
│ ▼ │
│ ┌───────────────┐ │
│ │BenchmarkConfig│ │
│ └───────┬───────┘ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ run_benchmark() │ │
│ │ │ │
│ │ ┌─── for each model ──────────────────────────────────┐ │ │
│ │ │ │ │ │
│ │ │ load(model_id) ◄── mlx_lm (download + load) │ │ │
│ │ │ │ │ │ │
│ │ │ ▼ │ │ │
│ │ │ ┌── for each prompt ───────────────────────────┐ │ │ │
│ │ │ │ │ │ │ │
│ │ │ │ stream_generate() ──► BenchmarkResult │ │ │ │
│ │ │ │ │ │ │ │ │ │
│ │ │ │ │ append to JSONL │ │ │ │
│ │ │ │ │ (crash-resilient) │ │ │ │
│ │ │ │ ▼ │ │ │ │
│ │ │ │ extract_thinking() ──► thinking | answer │ │ │ │
│ │ │ │ │ │ │ │
│ │ │ └──────────────────────────────────────────────┘ │ │ │
│ │ │ │ │ │
│ │ │ unload model + gc.collect() │ │ │
│ │ └─────────────────────────────────────────────────────┘ │ │
│ │ │ │
│ │ ┌─────────────┐ ┌──────────────────┐ │ │
│ │ │ Auto-Judge │──►│ Cost-Efficiency │ │ │
│ │ │ (largest LM) │ │ Scoring │ │ │
│ │ └─────────────┘ └──────────────────┘ │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Table │ │ Response │ │ Diff │ │ Export │ │
│ │ Summary │ │ Panels │ │ View │ │ MD/HTML │ │
│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
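The `BenchmarkConfig` and `BenchmarkResult` boxes in the diagram above are plain data containers. A minimal sketch of what they might hold, with field names inferred from the metrics and defaults listed in this README (the script's actual schema may differ):

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class BenchmarkConfig:
    # Field names and defaults are illustrative guesses, not the script's schema
    models: list = field(default_factory=lambda: ["0.8B", "2B", "4B", "9B", "27B", "35B"])
    prompts: list = field(default_factory=lambda: ["factual", "reasoning", "creative", "code"])
    max_tokens: int = 8192
    temp: float = 0.0
    judge: bool = True

@dataclass
class BenchmarkResult:
    model: str
    prompt_label: str
    response: str
    gen_tps: float         # generation tokens/sec
    prompt_tps: float      # prompt tokens/sec
    peak_memory_gb: float  # peak unified memory during the run
    elapsed_s: float

    def to_jsonl(self) -> str:
        # One self-contained JSON object per line, ready to append
        return json.dumps(asdict(self))
```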
| | Minimum | Recommended |
|---|---|---|
| Platform | Apple Silicon Mac (M1/M2/M3/M4) | M-series Pro/Max/Ultra |
| RAM | 8 GB unified memory | 32 GB+ unified memory |
| macOS | 13.0+ (Ventura) | Latest |
This project runs exclusively on Apple Silicon Macs. It uses MLX, Apple's machine learning framework optimized for the Metal GPU and unified memory architecture. It will not work on Intel Macs, Linux, or Windows.
| Model | Approx. Memory Needed |
|---|---|
| Qwen3.5-0.8B | ~2 GB |
| Qwen3.5-2B | ~5 GB |
| Qwen3.5-4B | ~9 GB |
| Qwen3.5-9B | ~19 GB |
| Qwen3.5-27B | ~55 GB |
| Qwen3.5-35B-A3B (MoE) | ~70 GB |
Models that exceed your available unified memory will fail gracefully with an error logged — the benchmark continues with the remaining models.
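The skip-and-continue behavior can be sketched like this. Here `load_model` stands in for the real mlx_lm loader, and catching `MemoryError` specifically is an assumption about how the script recovers, not its actual code:

```python
def run_all(model_ids, load_model, log=print):
    """Try each model in turn; log failures (e.g. out of memory) and keep going."""
    completed, skipped = [], []
    for model_id in model_ids:
        try:
            model = load_model(model_id)  # may raise if unified memory is exceeded
        except MemoryError as exc:
            log(f"skipping {model_id}: {exc}")
            skipped.append(model_id)
            continue
        # ... benchmark `model` here, then unload it before the next one ...
        completed.append(model_id)
    return completed, skipped
```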
The models are hosted on Hugging Face under the mlx-community organization. You need a one-time Hugging Face authentication to download them.
Option A — CLI login (recommended):

```bash
pip install huggingface_hub
huggingface-cli login
# Paste your token from https://huggingface.co/settings/tokens
```

Option B — Environment variable:

```bash
export HF_TOKEN="hf_your_token_here"
```

Get a free token at huggingface.co/settings/tokens. A read-only token is sufficient. Models are cached locally after the first download (`~/.cache/huggingface/hub`), so you only need network access once per model.
```bash
pip install mlx mlx-lm rich questionary
```

That's it. No requirements.txt bloat, no virtual environment ceremony — four packages.
```bash
python qwen_text.py
```

Launches a full TUI where you pick models, prompts, and options with checkboxes — see the TUI screenshot above.
```bash
# Benchmark specific models
python qwen_text.py --models 0.8B,4B,9B

# Single custom prompt
python qwen_text.py --prompt "Explain quantum entanglement simply."

# Subset of built-in prompts + markdown export
python qwen_text.py --prompts reasoning,code --format markdown

# Full run, no judge, with diff view
python qwen_text.py --no-judge --diff

# Fresh run (ignore cached results)
python qwen_text.py --no-resume
```

| Flag | Default | Description |
|---|---|---|
| `--tui` | (default with no args) | Launch interactive TUI |
| `--models` | all 6 | Comma-separated size filter (e.g. `0.8B,4B`) |
| `--prompt` | — | Single custom prompt (overrides suite) |
| `--prompts` | all 4 | Subset: `factual`, `reasoning`, `creative`, `code` |
| `--max-tokens` | 8192 | Max generation tokens |
| `--temp` | 0.0 | Sampling temperature |
| `--output` | `results.jsonl` | Output file path |
| `--format` | `jsonl` | Export: `jsonl`, `markdown`, or `html` |
| `--report` | — | Explicit report file path |
| `--no-judge` | — | Skip auto-judge step |
| `--no-resume` | — | Force re-run, ignore cache |
| `--diff` | — | Show diff of smallest vs. largest model |
| Label | Prompt | Tests |
|---|---|---|
| `factual` | "Explain what a transformer model is in 2 sentences." | Conciseness, accuracy |
| `reasoning` | "A farmer has 17 sheep. All but 9 die. How many are left?" | Logic, step-by-step |
| `creative` | "Write a short poem about a robot discovering the ocean." | Creativity, style |
| `code` | "Write a Python function that checks if a string is a valid IPv4 address." | Code quality, edge cases |
One JSON object per model × prompt run — every metric, the full response, thinking blocks, judge scores. Machine-readable, diff-friendly, appendable.
A self-contained .md report with summary table, per-prompt responses, collapsible thinking blocks, and cost-efficiency rankings. Drops right into a GitHub issue or wiki.
Rich terminal output captured as styled HTML. Dark-themed, monospace, looks exactly like your terminal — but shareable.
The largest model in your run acts as judge. For each prompt, it reads every other model's answer and scores them 1–10 on accuracy, completeness, clarity, and conciseness. Scores and rationales are stored in the results.
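One plausible shape for this step: prompt the judge for a small JSON object and parse it defensively. The rubric names match the four criteria above; everything else here (prompt wording, parsing strategy, function names) is an assumption, not the script's actual code:

```python
import json
import re

RUBRIC = ["accuracy", "completeness", "clarity", "conciseness"]

def judge_prompt(question: str, answer: str) -> str:
    # Ask the judge model for machine-parseable scores plus a short rationale
    return (
        f"Question: {question}\nCandidate answer: {answer}\n"
        f"Score the answer 1-10 on {', '.join(RUBRIC)}. "
        'Reply with JSON like {"accuracy": 8, ...} and a one-line rationale.'
    )

def parse_scores(judge_output: str) -> dict:
    """Pull the first JSON object out of the judge's reply; tolerate extra prose."""
    match = re.search(r"\{.*?\}", judge_output, flags=re.DOTALL)
    if not match:
        return {}
    try:
        raw = json.loads(match.group(0))
    except json.JSONDecodeError:
        return {}
    return {k: int(v) for k, v in raw.items() if k in RUBRIC}
```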
A composite score from three normalized signals:
```
With judge:    0.3 × speed + 0.3 × memory_efficiency + 0.4 × judge_score
Without judge: 0.5 × speed + 0.5 × memory_efficiency
```
Higher is better. A small model that's fast, lean, and still scores well will rank above a huge model that's slow and memory-hungry.
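As a sketch, with each signal already normalized to [0, 1] (how speed and memory efficiency are normalized is up to the implementation; the weights are the ones stated above):

```python
def cost_efficiency(speed: float, memory_efficiency: float, judge_score: float = None) -> float:
    """Composite 0-100 score from normalized [0, 1] inputs; judge_score is optional."""
    if judge_score is not None:
        raw = 0.3 * speed + 0.3 * memory_efficiency + 0.4 * judge_score
    else:
        raw = 0.5 * speed + 0.5 * memory_efficiency
    return round(100 * raw, 1)
```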
Every result is appended to the JSONL file immediately after inference. If the process crashes, gets killed, or you run out of memory on a large model:
```bash
# Just re-run — completed pairs are skipped automatically
python qwen_text.py
```

Use `--no-resume` to force a fresh run.
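The resume logic boils down to reading the JSONL back and skipping (model, prompt) pairs that already have a record. A sketch, assuming record keys `model` and `prompt_label` (the actual keys may differ):

```python
import json
from pathlib import Path

def load_completed(path: Path) -> set:
    """Read results.jsonl and return the (model, prompt) pairs already done."""
    done = set()
    if path.exists():
        for line in path.read_text().splitlines():
            if line.strip():
                rec = json.loads(line)
                done.add((rec["model"], rec["prompt_label"]))
    return done

def append_result(path: Path, record: dict) -> None:
    """Append one result immediately, so a crash loses at most the in-flight run."""
    with path.open("a") as f:
        f.write(json.dumps(record) + "\n")
```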
```
qwenbench/
├── qwen_text.py    # The entire benchmark suite (single file)
├── assets/         # SVG screenshots for README
├── results.jsonl   # Generated: raw benchmark data
├── results.md      # Generated: markdown report (if --format markdown)
├── results.html    # Generated: HTML report (if --format html)
└── README.md
```
| Problem | Fix |
|---|---|
| `ModuleNotFoundError: mlx` | You're not on Apple Silicon, or `pip install mlx mlx-lm` was missed |
| `MemoryError` on large model | Not enough unified memory — skip that model size with `--models` |
| `401 Unauthorized` from HF | Run `huggingface-cli login` or set the `HF_TOKEN` env var |
| TUI doesn't render properly | Ensure your terminal supports ANSI colors (iTerm2, Terminal.app, Warp, etc.) |
| Stuck on "Loading model..." | First download can take minutes depending on model size + connection speed |
MIT
Built for the silicon. Tested on M1 Pro, M2 Max, M3 Ultra, and M4 Max.