Benchmark on-device MLX inference. One command, one number you can trust.
A tiny Swift CLI that loads any MLX-community model from HuggingFace, runs a prompt, and reports decode tok/s, TTFT (time-to-first-token), and load time. No frameworks, no notebooks — xcodebuild and you're benching.
```console
$ ./mlx-bench mlx-community/Qwen3.5-2B-MLX-4bit
🧠 MLX Bench
Model: mlx-community/Qwen3.5-2B-MLX-4bit
Prompt: Explain quantum computing in 3 sentences.
Max tokens: 200
Loading model... 100%
✓ Loaded in 1.5s
Generating...
═══════════════════════════════════════
Response:
Quantum computing utilizes qubits, which can exist in multiple states simultaneously...
Tokens: 200
TTFT: 0.21s
Decode: 2.82s (70.6 tok/s)
Total: 3.03s
Load time: 1.5s
═══════════════════════════════════════
```
- TTFT — wall time from the `generate()` call to the first token. Includes prompt prefill.
- Decode — wall time and tok/s for tokens 2..N. This is what you feel after the first token lands.
- Load — model weights → MLX container, after the HuggingFace download is cached.
Decode tok/s is the right number to compare across models. Total tok/s including prefill biases short generations low.
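As a sanity check on the arithmetic, all three numbers fall straight out of per-token wall-clock timestamps. A minimal Swift sketch (the names `metrics` and `BenchResult` are illustrative, not mlx-bench's actual API):

```swift
import Foundation

// Illustrative container for the three reported timings.
struct BenchResult {
    let ttft: Double          // seconds from generate() call to first token
    let decodeSeconds: Double // wall time for tokens 2..N
    let decodeTokSec: Double  // (N - 1) tokens / decodeSeconds
}

// start: timestamp of the generate() call.
// tokenTimes: timestamp of each emitted token, in order.
func metrics(start: Double, tokenTimes: [Double]) -> BenchResult {
    let ttft = tokenTimes[0] - start
    // Decode covers tokens 2..N, so divide by (N - 1), not N.
    let decode = tokenTimes.last! - tokenTimes[0]
    let tokSec = Double(tokenTimes.count - 1) / decode
    return BenchResult(ttft: ttft, decodeSeconds: decode, decodeTokSec: tokSec)
}
```

With the sample run's numbers (200 tokens, first token at 0.21 s, last at 3.03 s), this yields 199 / 2.82 ≈ 70.6 tok/s, matching the Decode line above.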
Requires Xcode 15+ and macOS 14+ (Apple Silicon).
```shell
git clone https://github.com/daslabhq/mlx-bench
cd mlx-bench
xcodebuild -scheme mlx-bench -destination 'platform=macOS' -configuration Release -derivedDataPath .build build
ln -sf .build/Build/Products/Release/mlx-bench mlx-bench
```

**Why `xcodebuild` and not `swift build`?** The Metal shaders that MLX uses can only be compiled by `xcodebuild` — `swift build` produces a binary that crashes with "Failed to load the default metallib." See the mlx-swift docs.
```shell
# defaults: Qwen3.5-0.8B, "Explain quantum computing in 3 sentences.", 200 tokens
./mlx-bench

# pick a model
./mlx-bench mlx-community/Qwen3.5-4B-MLX-4bit

# pick a model, prompt, max tokens
./mlx-bench mlx-community/Qwen3.5-2B-MLX-4bit "Write a haiku about MLX." 100
```

The first run downloads the model from HuggingFace; subsequent runs hit the local cache at `~/Documents/huggingface/`.
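The positional-argument scheme above can be sketched in a few lines of Swift. This is an assumption-laden illustration, not mlx-bench's source: the function name is invented, and the full default repo id is inferred from the naming pattern of the other models.

```swift
import Foundation

// Illustrative sketch of the positional CLI contract:
// mlx-bench [model] [prompt] [max-tokens], each falling back to the
// README defaults. The default repo id below is an assumption.
func parseArgs(_ args: [String]) -> (model: String, prompt: String, maxTokens: Int) {
    let model  = args.count > 1 ? args[1] : "mlx-community/Qwen3.5-0.8B-MLX-4bit"
    let prompt = args.count > 2 ? args[2] : "Explain quantum computing in 3 sentences."
    let max    = args.count > 3 ? (Int(args[3]) ?? 200) : 200
    return (model, prompt, max)
}
```

Positional arguments keep the happy path to one flag-free command per model, which is the whole point of the tool.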
M-series Mac, Qwen3.5-*-MLX-4bit, 200-token generation, with a prompt that forces the full output:
| Model | Size | Decode | TTFT | Load (cached) |
|---|---|---|---|---|
| Qwen3.5-0.8B | 625 MB | 137.7 tok/s | 0.15s | 1.2s |
| Qwen3.5-2B | 2.2 GB | 70.6 tok/s | 0.21s | 1.5s |
| Qwen3.5-4B | 3.0 GB | 31.1 tok/s | 0.45s | 1.5s |
Use these as a sanity check, not gospel — your numbers depend on chip, RAM pressure, and thermal state.
Picking the right on-device model is mostly a tok/s question, and "fire up Xcode, paste into a playground, wait" is too much friction to ask a dozen times. mlx-bench is one command per model.
Built for Daslab, where on-device inference is a first-class option.
MIT — see LICENSE.