Benchmark on-device MLX inference. One command, one number you can trust.
A tiny Swift CLI that loads any MLX-community model from HuggingFace, runs a prompt, and reports decode tok/s, TTFT (time-to-first-token), and load time. No frameworks, no notebooks — xcodebuild and you're benching.
```console
$ ./mlx-bench mlx-community/Qwen3.5-2B-MLX-4bit
🧠 MLX Bench
Model: mlx-community/Qwen3.5-2B-MLX-4bit
Prompt: Explain quantum computing in 3 sentences.
Max tokens: 200
Loading model... 100%
✓ Loaded in 1.5s
Generating...
═══════════════════════════════════════
Response:
Quantum computing utilizes qubits, which can exist in multiple states simultaneously...
Tokens: 200
TTFT: 0.21s
Decode: 2.82s (70.6 tok/s)
Total: 3.03s
Load time: 1.5s
═══════════════════════════════════════
```
- TTFT — wall time from the `generate()` call to the first token. Includes prompt prefill.
- Decode — wall time and tok/s for tokens 2..N. This is what you feel after the first token lands.
- Load — model weights → MLX container, after the HuggingFace download is cached.
Decode tok/s is the right number to compare across models. Total tok/s including prefill biases short generations low.
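As a sanity check on the arithmetic, all three numbers fall straight out of per-token wall-clock timestamps. A minimal Swift sketch (the names `metrics` and `BenchResult` are illustrative, not mlx-bench's actual API):

```swift
import Foundation

// Illustrative container for the three reported timings.
struct BenchResult {
    let ttft: Double          // seconds from generate() call to first token
    let decodeSeconds: Double // wall time for tokens 2..N
    let decodeTokSec: Double  // (N - 1) tokens / decodeSeconds
}

// start: timestamp of the generate() call.
// tokenTimes: timestamp of each emitted token, in order.
func metrics(start: Double, tokenTimes: [Double]) -> BenchResult {
    let ttft = tokenTimes[0] - start
    // Decode covers tokens 2..N, so divide by (N - 1), not N.
    let decode = tokenTimes.last! - tokenTimes[0]
    let tokSec = Double(tokenTimes.count - 1) / decode
    return BenchResult(ttft: ttft, decodeSeconds: decode, decodeTokSec: tokSec)
}
```

With the sample run's numbers (200 tokens, first token at 0.21 s, last at 3.03 s), this yields 199 / 2.82 ≈ 70.6 tok/s, matching the Decode line above.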
Requires Xcode 15+ and macOS 14+ (Apple Silicon).
```shell
git clone https://github.com/daslabhq/mlx-bench
cd mlx-bench
xcodebuild -scheme mlx-bench -destination 'platform=macOS' -configuration Release -derivedDataPath .build build
ln -sf .build/Build/Products/Release/mlx-bench mlx-bench
```

**Why `xcodebuild` and not `swift build`?** The Metal shaders that MLX uses can only be compiled by `xcodebuild` — `swift build` produces a binary that crashes with "Failed to load the default metallib." See the mlx-swift docs.
```shell
# defaults: Qwen3.5-0.8B, "Explain quantum computing in 3 sentences.", 200 tokens
./mlx-bench

# pick a model
./mlx-bench mlx-community/Qwen3.5-4B-MLX-4bit

# pick a model, prompt, max tokens
./mlx-bench mlx-community/Qwen3.5-2B-MLX-4bit "Write a haiku about MLX." 100
```

The first run downloads the model from HuggingFace; subsequent runs hit the local cache at `~/Documents/huggingface/`.
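The positional-argument scheme above can be sketched in a few lines of Swift. This is an assumption-laden illustration, not mlx-bench's source: the function name is invented, and the full default repo id is inferred from the naming pattern of the other models.

```swift
import Foundation

// Illustrative sketch of the positional CLI contract:
// mlx-bench [model] [prompt] [max-tokens], each falling back to the
// README defaults. The default repo id below is an assumption.
func parseArgs(_ args: [String]) -> (model: String, prompt: String, maxTokens: Int) {
    let model  = args.count > 1 ? args[1] : "mlx-community/Qwen3.5-0.8B-MLX-4bit"
    let prompt = args.count > 2 ? args[2] : "Explain quantum computing in 3 sentences."
    let max    = args.count > 3 ? (Int(args[3]) ?? 200) : 200
    return (model, prompt, max)
}
```

Positional arguments keep the happy path to one flag-free command per model, which is the whole point of the tool.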
M-series Mac, Qwen3.5-*-MLX-4bit, 200-token generation, with a prompt that forces the full output:
| Model | Size | Decode | TTFT | Load (cached) |
|---|---|---|---|---|
| Qwen3.5-0.8B | 625 MB | 137.7 tok/s | 0.15s | 1.2s |
| Qwen3.5-2B | 2.2 GB | 70.6 tok/s | 0.21s | 1.5s |
| Qwen3.5-4B | 3.0 GB | 31.1 tok/s | 0.45s | 1.5s |
Use these as a sanity check, not gospel — your numbers depend on chip, RAM pressure, and thermal state.
Picking the right on-device model is mostly a tok/s question, and "fire up Xcode, paste into a playground, wait" is too much friction to ask a dozen times. mlx-bench is one command per model.
Built for Daslab, where on-device inference is a first-class option.
MIT — see LICENSE.