mlx-bench

Benchmark on-device MLX inference. One command, one number you can trust.

A tiny Swift CLI that loads any MLX-community model from HuggingFace, runs a prompt, and reports decode tok/s, TTFT (time-to-first-token), and load time. No frameworks, no notebooks — xcodebuild and you're benching.

$ ./mlx-bench mlx-community/Qwen3.5-2B-MLX-4bit

🧠 MLX Bench
  Model:      mlx-community/Qwen3.5-2B-MLX-4bit
  Prompt:     Explain quantum computing in 3 sentences.
  Max tokens: 200

Loading model... 100%
✓ Loaded in 1.5s

Generating...

═══════════════════════════════════════
  Response:
  Quantum computing utilizes qubits, which can exist in multiple states simultaneously...

  Tokens:     200
  TTFT:       0.21s
  Decode:     2.82s  (70.6 tok/s)
  Total:      3.03s
  Load time:  1.5s
═══════════════════════════════════════

What it measures

  • TTFT — wall time from generate() call to the first token. Includes prompt prefill.
  • Decode — wall time and tok/s for tokens 2..N. This is what you feel after the first token lands.
  • Load — time to materialize the model weights as an in-memory MLX model, measured after the HuggingFace download is already cached.

Decode tok/s is the right number to compare across models. Total tok/s, which includes prefill, biases short generations low.
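To make the relationship between the reported numbers concrete: decode tok/s counts tokens 2..N over decode wall time (the first token is attributed to TTFT), and total time is TTFT plus decode. A quick sketch using the illustrative values from the sample run above:

```shell
# Reproduce the sample run's decode tok/s and total time.
# Decode covers tokens 2..N, so the first token is excluded from the rate.
awk 'BEGIN {
  tokens = 200    # generated tokens (N), from the sample run
  ttft   = 0.21   # time to first token, seconds
  decode = 2.82   # decode wall time for tokens 2..N, seconds
  printf "decode: %.1f tok/s\n", (tokens - 1) / decode
  printf "total:  %.2fs\n", ttft + decode
}'
```

This reproduces the 70.6 tok/s and 3.03s shown in the example output, and is why a short generation with the same decode rate would report a lower total tok/s.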

Install

Requires Xcode 15+ and macOS 14+ (Apple Silicon).

git clone https://github.com/daslabhq/mlx-bench
cd mlx-bench
xcodebuild -scheme mlx-bench -destination 'platform=macOS' -configuration Release -derivedDataPath .build build
ln -sf .build/Build/Products/Release/mlx-bench mlx-bench

Why xcodebuild and not swift build? The Metal shaders that MLX uses can only be compiled by xcodebuild — swift build produces a binary that crashes with "Failed to load the default metallib." See mlx-swift docs.

Use

# defaults: Qwen3.5-0.8B, "Explain quantum computing in 3 sentences.", 200 tokens
./mlx-bench

# pick a model
./mlx-bench mlx-community/Qwen3.5-4B-MLX-4bit

# pick a model, prompt, max tokens
./mlx-bench mlx-community/Qwen3.5-2B-MLX-4bit "Write a haiku about MLX." 100

First run downloads the model from HuggingFace (subsequent runs hit the local cache at ~/Documents/huggingface/).

Reference numbers

M-series Mac, Qwen3.5-*-MLX-4bit, 200-token generation, with a prompt chosen so the model reliably fills the full 200-token budget:

Model          Size     Decode        TTFT     Load (cached)
Qwen3.5-0.8B   625 MB   137.7 tok/s   0.15s    1.2s
Qwen3.5-2B     2.2 GB    70.6 tok/s   0.21s    1.5s
Qwen3.5-4B     3.0 GB    31.1 tok/s   0.45s    1.5s

Use these as a sanity check, not gospel — your numbers depend on chip, RAM pressure, and thermal state.

Why

Picking the right on-device model is mostly a tok/s question, and "fire up Xcode, paste into a playground, wait" is too much friction to ask a dozen times. mlx-bench is one command per model.

Built for Daslab, where on-device inference is a first-class option.

License

MIT — see LICENSE.
