
QwenBench

Single-script benchmarking suite for the entire Qwen 3.5 model family (0.8B → 35B) running locally on Apple Silicon via MLX.

No cloud. No API keys. No GPU rental. Just your Mac and its unified memory.

Apple Silicon · MLX · Python 3.10+ · Qwen 3.5 · Version 1.0.0 · MIT License · March 2026 · Built with Claude


Why This? Why Now?

You're building something with a local LLM. You've got six Qwen 3.5 sizes to choose from. The internet has opinions — but none of those people have your Mac, your RAM, or your workload.

The real question isn't "which model is best?" — it's "which is the smallest model that's good enough for what I need?"

That's what this tool answers. Here's why it matters:

  • Your hardware, your numbers. Benchmarks run on the exact Apple Silicon Mac you'll use in production. Not someone else's cloud GPU, not a spec sheet — your actual unified memory, your actual Metal cores. The tok/s you see is the tok/s you'll get.

  • Your prompts, your answers. The built-in suite covers factual, reasoning, creative, and code tasks — but you can plug in your own prompts. Whether it's a simple question or a gnarly multi-step instruction, you'll see exactly how each model handles your use case.

  • Right-size your model. Maybe the 0.8B runs at 140 tok/s and scores 6/10 on your task. Maybe the 4B runs at 68 tok/s but scores 8.5/10. Is that extra quality worth 4x the memory and half the speed? Now you have the data to decide — not guess.

  • One command, full picture. Throughput, memory footprint, response quality, cost-efficiency score — all in one run, all comparable, all on the same hardware. No spreadsheet stitching, no tab-switching between blog posts.


Features

  • 6 models, one script — benchmarks Qwen 3.5 at 0.8B, 2B, 4B, 9B, 27B, and 35B (MoE) parameters
  • Interactive TUI — checkbox model/prompt selection, live config summary, zero CLI flags needed
  • Real hardware metrics — generation tok/s, prompt tok/s, peak unified memory (GB), elapsed time
  • Auto-judge — the largest model scores all smaller models' responses (accuracy, clarity, completeness)
  • Cost-efficiency scoring — composite 0–100 score blending speed, memory, and quality
  • Thinking extraction — separates <think>...</think> reasoning blocks from final answers (Qwen 3.5 4B+)
  • Crash-resilient — incremental JSONL saves after every inference; resume picks up where you left off
  • Side-by-side diffs — unified diff of smallest vs. largest model responses
  • Export — JSONL (raw data), Markdown (GitHub-ready report), or HTML (Rich terminal capture)
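
The thinking-extraction step can be sketched in a few lines. This is a hypothetical helper, not the script's exact code, but it shows the idea: split the raw completion on Qwen-style `<think>...</think>` markers.

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def extract_thinking(raw: str) -> tuple[str, str]:
    """Split a raw completion into (thinking, answer).

    Assumes the Qwen-style <think>...</think> convention; if several
    blocks appear, their contents are concatenated.
    """
    thinking = "\n".join(m.strip() for m in THINK_RE.findall(raw))
    answer = THINK_RE.sub("", raw).strip()
    return thinking, answer
```

Models below 4B emit no `<think>` block, in which case the thinking half simply comes back empty.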

Screenshots

Interactive TUI

Benchmark Results Table

Side-by-Side Response Comparison

Cost-Efficiency Ranking


Architecture

┌──────────────────────────────────────────────────────────────┐
│                         qwen_text.py                         │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  ┌──────────┐    ┌─────────────┐                             │
│  │   CLI    │    │     TUI     │   Either path produces      │
│  │ argparse │    │ questionary │──► a BenchmarkConfig        │
│  └────┬─────┘    └──────┬──────┘                             │
│       └───────┬─────────┘                                    │
│               ▼                                              │
│       ┌───────────────┐                                      │
│       │BenchmarkConfig│                                      │
│       └───────┬───────┘                                      │
│               ▼                                              │
│  ┌────────────────────────────────────────────────────────┐  │
│  │                    run_benchmark()                     │  │
│  │                                                        │  │
│  │  ┌── for each model ──────────────────────────────┐    │  │
│  │  │                                                │    │  │
│  │  │  load(model_id)  ◄── mlx_lm (download + load)  │    │  │
│  │  │       │                                        │    │  │
│  │  │       ▼                                        │    │  │
│  │  │  ┌── for each prompt ──────────────────────┐   │    │  │
│  │  │  │                                         │   │    │  │
│  │  │  │  stream_generate() ──► BenchmarkResult  │   │    │  │
│  │  │  │       │                   │             │   │    │  │
│  │  │  │       │             append to JSONL     │   │    │  │
│  │  │  │       │             (crash-resilient)   │   │    │  │
│  │  │  │       ▼                                 │   │    │  │
│  │  │  │  extract_thinking() ──► thinking|answer │   │    │  │
│  │  │  │                                         │   │    │  │
│  │  │  └─────────────────────────────────────────┘   │    │  │
│  │  │                                                │    │  │
│  │  │  unload model + gc.collect()                   │    │  │
│  │  └────────────────────────────────────────────────┘    │  │
│  │                                                        │  │
│  │  ┌──────────────┐    ┌─────────────────┐               │  │
│  │  │  Auto-Judge  │───►│ Cost-Efficiency │               │  │
│  │  │ (largest LM) │    │     Scoring     │               │  │
│  │  └──────────────┘    └─────────────────┘               │  │
│  └────────────────────────────────────────────────────────┘  │
│               │                                              │
│               ▼                                              │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐      │
│  │  Table   │  │ Response │  │  Diff    │  │  Export  │      │
│  │ Summary  │  │ Panels   │  │ View     │  │ MD/HTML  │      │
│  └──────────┘  └──────────┘  └──────────┘  └──────────┘      │
│                                                              │
└──────────────────────────────────────────────────────────────┘

Requirements

Hardware

|          | Minimum                          | Recommended            |
|----------|----------------------------------|------------------------|
| Platform | Apple Silicon Mac (M1/M2/M3/M4)  | M-series Pro/Max/Ultra |
| RAM      | 8 GB unified memory              | 32 GB+ unified memory  |
| macOS    | 13.0+ (Ventura)                  | Latest                 |

This project runs exclusively on Apple Silicon Macs. It uses MLX, Apple's machine learning framework optimized for the Metal GPU and unified memory architecture. It will not work on Intel Macs, Linux, or Windows.

RAM Guide by Model

| Model                  | Approx. Memory Needed |
|------------------------|-----------------------|
| Qwen3.5-0.8B           | ~2 GB                 |
| Qwen3.5-2B             | ~5 GB                 |
| Qwen3.5-4B             | ~9 GB                 |
| Qwen3.5-9B             | ~19 GB                |
| Qwen3.5-27B            | ~55 GB                |
| Qwen3.5-35B-A3B (MoE)  | ~70 GB                |

Models that exceed your available unified memory will fail gracefully with an error logged — the benchmark continues with the remaining models.
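
If you want to pre-filter instead of relying on graceful failure, the table above is enough. A sketch (the 0.8 headroom factor is an assumption, since macOS and other apps need memory too):

```python
# Approximate footprints from the RAM guide above (GB of unified memory).
MODEL_MEM_GB = {"0.8B": 2, "2B": 5, "4B": 9, "9B": 19, "27B": 55, "35B": 70}

def runnable_models(total_ram_gb: float, headroom: float = 0.8) -> list[str]:
    """Model sizes whose footprint fits within a fraction of total RAM."""
    budget = total_ram_gb * headroom
    return [m for m, gb in MODEL_MEM_GB.items() if gb <= budget]
```

On a 16 GB machine this keeps the 0.8B, 2B, and 4B models; pass the result to `--models` to skip the rest up front.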


Setup

1. Hugging Face Authentication

The models are hosted on Hugging Face under the mlx-community organization. Downloading them requires a one-time Hugging Face authentication.

Option A — CLI login (recommended):

pip install huggingface_hub
huggingface-cli login
# Paste your token from https://huggingface.co/settings/tokens

Option B — Environment variable:

export HF_TOKEN="hf_your_token_here"

Get a free token at huggingface.co/settings/tokens. A read-only token is sufficient. Models are cached locally after the first download (~/.cache/huggingface/hub), so you only need network access once per model.

2. Install Dependencies

pip install mlx mlx-lm rich questionary

That's it. No requirements.txt bloat, no virtual environment ceremony — four packages.


Quick Start

Interactive Mode (default)

python qwen_text.py

Launches a full TUI where you pick models, prompts, and options with checkboxes — see the TUI screenshot above.

CLI Mode

# Benchmark specific models
python qwen_text.py --models 0.8B,4B,9B

# Single custom prompt
python qwen_text.py --prompt "Explain quantum entanglement simply."

# Subset of built-in prompts + markdown export
python qwen_text.py --prompts reasoning,code --format markdown

# Full run, no judge, with diff view
python qwen_text.py --no-judge --diff

# Fresh run (ignore cached results)
python qwen_text.py --no-resume

CLI Reference

| Flag          | Default                 | Description                                           |
|---------------|-------------------------|-------------------------------------------------------|
| `--tui`       | (default with no args)  | Launch interactive TUI                                |
| `--models`    | all 6                   | Comma-separated size filter (e.g. `0.8B,4B`)          |
| `--prompt`    |                         | Single custom prompt (overrides suite)                |
| `--prompts`   | all 4                   | Subset: `factual`, `reasoning`, `creative`, `code`    |
| `--max-tokens`| 8192                    | Max generation tokens                                 |
| `--temp`      | 0.0                     | Sampling temperature                                  |
| `--output`    | `results.jsonl`         | Output file path                                      |
| `--format`    | `jsonl`                 | Export: `jsonl`, `markdown`, or `html`                |
| `--report`    |                         | Explicit report file path                             |
| `--no-judge`  |                         | Skip auto-judge step                                  |
| `--no-resume` |                         | Force re-run, ignore cache                            |
| `--diff`      |                         | Show diff of smallest vs. largest model               |

Built-in Prompt Suite

| Label       | Prompt                                                                   | Tests                    |
|-------------|--------------------------------------------------------------------------|--------------------------|
| `factual`   | "Explain what a transformer model is in 2 sentences."                    | Conciseness, accuracy    |
| `reasoning` | "A farmer has 17 sheep. All but 9 die. How many are left?"               | Logic, step-by-step      |
| `creative`  | "Write a short poem about a robot discovering the ocean."                | Creativity, style        |
| `code`      | "Write a Python function that checks if a string is a valid IPv4 address." | Code quality, edge cases |
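
For reference, a solid answer to the `code` prompt might look like this. It's one reasonable implementation, not a grading rubric:

```python
def is_valid_ipv4(s: str) -> bool:
    """True iff s is a dotted-quad IPv4 address like '192.168.0.1'."""
    parts = s.split(".")
    if len(parts) != 4:
        return False
    for p in parts:
        # ASCII digits only: rejects '', '+1', ' 1', and unicode digits.
        if not (p.isascii() and p.isdigit()):
            return False
        # No leading zeros ('01' is invalid), and each octet must be <= 255.
        if (len(p) > 1 and p[0] == "0") or int(p) > 255:
            return False
    return True
```

The edge cases (empty octets, leading zeros, out-of-range values, extra segments) are exactly what separates the smaller models' answers from the larger ones'.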

Output Formats

JSONL (default)

One JSON object per run — every metric, the full response, thinking blocks, judge scores. Machine-readable, diff-friendly, appendable.
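
Because each line is a plain JSON object, post-processing needs nothing beyond the standard library. A sketch (field names like `model` and `gen_tps` are assumptions; check your own results.jsonl):

```python
import json
from collections import defaultdict

def mean_tps_by_model(jsonl_text: str) -> dict[str, float]:
    """Average generation tok/s per model from raw JSONL text."""
    samples: dict[str, list[float]] = defaultdict(list)
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        rec = json.loads(line)
        samples[rec["model"]].append(rec["gen_tps"])
    return {m: sum(v) / len(v) for m, v in samples.items()}
```
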

Markdown

A self-contained .md report with summary table, per-prompt responses, collapsible thinking blocks, and cost-efficiency rankings. Drops right into a GitHub issue or wiki.

HTML

Rich terminal output captured as styled HTML. Dark-themed, monospace, looks exactly like your terminal — but shareable.


How Scoring Works

Auto-Judge

The largest model in your run acts as judge. For each prompt, it reads every other model's answer and scores them 1–10 on accuracy, completeness, clarity, and conciseness. Scores and rationales are stored in the results.

Cost-Efficiency (0–100)

A composite score from three normalized signals:

With judge:    0.3 × speed + 0.3 × memory_efficiency + 0.4 × judge_score
Without judge: 0.5 × speed + 0.5 × memory_efficiency

Higher is better. A small model that's fast, lean, and still scores well will rank above a huge model that's slow and memory-hungry.


Resumability

Every result is appended to the JSONL file immediately after inference. If the process crashes, gets killed, or you run out of memory on a large model:

# Just re-run — completed pairs are skipped automatically
python qwen_text.py

Use --no-resume to force a fresh run.
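
Resume logic of this kind usually rereads the JSONL and skips (model, prompt) pairs that are already present. A sketch with assumed field names:

```python
import json
from pathlib import Path

def completed_pairs(path: str) -> set[tuple[str, str]]:
    """(model, prompt_label) pairs already saved in the JSONL file."""
    done: set[tuple[str, str]] = set()
    p = Path(path)
    if not p.exists():
        return done  # nothing cached yet: fresh run
    for line in p.read_text().splitlines():
        if line.strip():
            rec = json.loads(line)
            done.add((rec["model"], rec["prompt_label"]))
    return done

# In the run loop, something like:
#   if (model, label) in done: continue
```

Because results are appended one object per line, a crash mid-write at worst loses the final line; every earlier pair survives and is skipped on the next run.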


Project Structure

qwenbench/
├── qwen_text.py       # The entire benchmark suite (single file)
├── assets/            # SVG screenshots for README
├── results.jsonl      # Generated: raw benchmark data
├── results.md         # Generated: markdown report (if --format markdown)
├── results.html       # Generated: HTML report (if --format html)
└── README.md

Troubleshooting

| Problem                        | Fix                                                                      |
|--------------------------------|--------------------------------------------------------------------------|
| `ModuleNotFoundError: mlx`     | You're not on Apple Silicon, or `pip install mlx mlx-lm` was missed      |
| `MemoryError` on large model   | Not enough unified memory; skip that model size with `--models`          |
| `401 Unauthorized` from HF     | Run `huggingface-cli login` or set the `HF_TOKEN` env var                |
| TUI doesn't render properly    | Ensure your terminal supports ANSI colors (iTerm2, Terminal.app, Warp, etc.) |
| Stuck on "Loading model..."    | First download can take minutes depending on model size + connection speed |

License

MIT


Built for the silicon. Tested on M1 Pro, M2 Max, M3 Ultra, and M4 Max.
