Using ZMLX with exo (GLM-4.7-Flash / Qwen3-30B-A3B)

ZMLX integrates with exo to speed up MoE decode for GLM-4.7-Flash and Qwen3-30B-A3B. The speedup comes from gather_qmm_swiglu, a fused Metal primitive that replaces multiple kernel launches per MoE expert per layer with a single dispatch.
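For orientation, the sketch below is an unfused, pure-Python reference for the kind of work being fused. It assumes the primitive covers gathering the routed experts, the gate and up projections, and the `silu(gate) * up` activation; that reading is inferred from the name `gather_qmm_swiglu`, not stated here. It ignores quantization entirely, and the names `expert_swiglu`, `w_gate`, and `w_up` are hypothetical.

```python
import math


def silu(x: float) -> float:
    """SiLU (swish) activation, x * sigmoid(x), written to avoid overflow."""
    if x >= 0:
        return x / (1.0 + math.exp(-x))
    ex = math.exp(x)
    return x * ex / (1.0 + ex)


def matvec(w, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def expert_swiglu(x, w_gate, w_up, expert_ids):
    """Unfused reference: one gather, two matmuls, and one activation per expert.

    w_gate and w_up are indexed as [expert][out_row][in_col].
    """
    outputs = []
    for e in expert_ids:                # gather: pick the routed experts
        gate = matvec(w_gate[e], x)     # first matmul
        up = matvec(w_up[e], x)         # second matmul
        outputs.append([silu(g) * u for g, u in zip(gate, up)])
    return outputs
```

Each commented step in the loop is a separate pass in this reference; a fused kernel would do the same arithmetic in one dispatch.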

What to expect: speedups vary with prompt, sequence length, and hardware; see the repro capsules referenced below. Output should remain token-identical under greedy decoding (verify with python -m zmlx.validate).

Quick start

In a Python 3.13+ environment (exo requires >= 3.13; custom MLX builds must match your Python minor version):

# From a ZMLX checkout (recommended):
bash setup_zmlx.sh
bash exo/run_zmlx.sh

# If `exo` is already installed in your environment:
#   pip install zmlx
#   zmlx-exo

Then open http://localhost:52416 in your browser and select GLM-4.7-Flash or Qwen3-30B-A3B.

Which models benefit?

| Model | With ZMLX in exo | Notes | Capsule |
| --- | --- | --- | --- |
| GLM-4.7-Flash-4bit | Yes (custom primitive) | MoE expert SwiGLU fused via `gather_qmm_swiglu` | `benchmarks/repro_capsules/glm_stress_m4_20260204.json` |
| Qwen3-30B-A3B-4bit | Yes (custom primitive) | MoE expert SwiGLU fused via `gather_qmm_swiglu` | `benchmarks/repro_capsules/qwen3_a3b_moe_mlp_m4max_20260205.json` |
| LFM2-8B-A1B-4bit | Yes (stock MLX) | No custom MLX build needed | `benchmarks/repro_capsules/lfm2_m4max_20260131.json` |
| Other models | No change | `patch()` auto-skips; safe no-op | n/a |

GLM and Qwen3 require the custom MLX primitive (gather_qmm_swiglu). Without it, ZMLX auto-skips these models — no regressions, but no speedup either.

Prerequisites

macOS 14+ on Apple Silicon (M1 or later)
Python 3.13+ (exo requires >= 3.13). If you build custom MLX, its Python ABI must match the interpreter you use for exo.
uv (recommended) — brew install uv or curl -LsSf https://astral.sh/uv/install.sh | sh

Custom MLX primitive (optional, for GLM/Qwen3)

GLM and Qwen3 gains require mx.gather_qmm_swiglu, which is not in released MLX. See docs/EXPERIMENTAL_MLX.md for full details.

Recommended:

bash integrations/mlx_local_integration/setup_mlx_local.sh

Verify the custom primitive is active:

python -c "import mlx.core as mx; print(hasattr(mx, 'gather_qmm_swiglu'))"  # should print True
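The one-liner above raises an ImportError if MLX is not installed at all. A slightly more defensive variant is sketched below; the helper name `has_attr_if_importable` is made up for illustration, and it uses only the standard library.

```python
import importlib


def has_attr_if_importable(module_name: str, attr: str) -> bool:
    """Return True only if the module imports cleanly and exposes `attr`."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # Covers a missing package as well as an extension that fails to load.
        return False
    return hasattr(module, attr)


if __name__ == "__main__":
    # Prints True only when the custom MLX build is the one being imported.
    print(has_attr_if_importable("mlx.core", "gather_qmm_swiglu"))
```

A result of False means either that MLX is missing or that stock MLX was imported instead of the custom build.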

To run exo with the custom build without replacing your default MLX install, prepend mlx_local/python to PYTHONPATH:

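# ${PYTHONPATH:+...} avoids a trailing ':' (which Python treats as the current directory) when PYTHONPATH is unset
export PYTHONPATH="<REPO_ROOT>/mlx_local/python${PYTHONPATH:+:$PYTHONPATH}"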
zmlx-exo

Remove mlx_local/python from PYTHONPATH to revert to stock MLX.
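If you would rather not export the variable in your shell, the override can be scoped to a single launch. The following is a minimal sketch using only the standard library; `prepend_pythonpath` is a hypothetical helper, and `<REPO_ROOT>` remains a placeholder for your checkout.

```python
import os


def prepend_pythonpath(env, path):
    """Return a copy of `env` with `path` placed first on PYTHONPATH."""
    new_env = dict(env)
    existing = new_env.get("PYTHONPATH", "")
    # Drop empty entries and any earlier copy of `path` to avoid duplicates.
    rest = [p for p in existing.split(os.pathsep) if p and p != path]
    new_env["PYTHONPATH"] = os.pathsep.join([path] + rest)
    return new_env


# Usage (not executed here):
#   import subprocess
#   env = prepend_pythonpath(os.environ, "<REPO_ROOT>/mlx_local/python")
#   subprocess.run(["zmlx-exo"], env=env)
```

Because the parent environment is never modified, other commands in the same shell keep using stock MLX.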

How it works

The launcher (zmlx-exo / python -m zmlx.exo) installs a runtime hook on exo's MLX model loading path (exo.worker.engines.mlx.utils_mlx.load_mlx_items). After exo loads an MLX model, the hook calls zmlx.patch(), which:

1. Detects the model family (GLM, Qwen3, LFM2, etc.)
2. Checks which patterns are safe for this architecture
3. Replaces matching MoE layers with fused kernel equivalents
4. Returns the patched model; all subsequent generation uses fused paths
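The general shape of such a post-load hook can be sketched as a function wrapper. This illustrates the technique only, not ZMLX's actual code: the real signature and return value of `load_mlx_items` are not shown in this document, so the loader and patch function below are stand-ins, and `install_post_load_hook` is a hypothetical name.

```python
import functools
import types


def install_post_load_hook(module, func_name, post_load):
    """Replace module.func_name with a wrapper that runs post_load on its result."""
    original = getattr(module, func_name)
    if getattr(original, "_hooked", False):
        return  # already wrapped; keep the install idempotent

    @functools.wraps(original)
    def wrapper(*args, **kwargs):
        result = original(*args, **kwargs)   # let the loader run unchanged
        return post_load(result)             # then patch what it returned

    wrapper._hooked = True
    setattr(module, func_name, wrapper)


# Stand-in for exo's loader module (assumption: it returns something patchable).
fake_loader = types.SimpleNamespace(
    load_mlx_items=lambda name: {"name": name, "patched": False}
)


def fake_patch(model):
    """Stand-in for zmlx.patch(): marks the model and returns it."""
    model["patched"] = True
    return model


install_post_load_hook(fake_loader, "load_mlx_items", fake_patch)
```

Wrapping the loader rather than editing it keeps the unpatched path available: if the hook is never installed, the original function runs as before.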

The hook is controlled by the EXO_ZMLX environment variable:

EXO_ZMLX=1 — enable patching
Unset or 0 — exo runs normally, no ZMLX code is imported

Additional env vars:

EXO_ZMLX_VERBOSE=1 — log every patched module
EXO_ZMLX_PATTERNS=moe_mlp,swiglu_mlp — override auto-detected patterns
EXO_ZMLX_EXCLUDE=moe_mlp — exclude specific patterns
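To make the semantics of these variables concrete, here is one way they could be read from an environment mapping. The variable names come from the lists above; the parsing rules are assumptions (only the literal value `1` counts as enabled, and pattern lists are comma-separated with whitespace ignored), and `read_zmlx_settings` is a hypothetical helper, not part of ZMLX.

```python
def read_zmlx_settings(env):
    """Parse the EXO_ZMLX* variables from a mapping such as os.environ."""

    def split_csv(value):
        return [item.strip() for item in value.split(",") if item.strip()]

    return {
        # Assumption: anything other than "1" leaves patching disabled.
        "enabled": env.get("EXO_ZMLX", "0") == "1",
        "verbose": env.get("EXO_ZMLX_VERBOSE", "0") == "1",
        # None means "no override; keep the auto-detected patterns".
        "patterns": split_csv(env["EXO_ZMLX_PATTERNS"])
        if "EXO_ZMLX_PATTERNS" in env
        else None,
        "exclude": split_csv(env.get("EXO_ZMLX_EXCLUDE", "")),
    }
```

For example, `read_zmlx_settings({"EXO_ZMLX": "1", "EXO_ZMLX_EXCLUDE": "moe_mlp"})` reports patching enabled with `moe_mlp` excluded and no pattern override.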

Measured results

| Model | Hardware | Baseline | Patched | Change | Capsule |
| --- | --- | --- | --- | --- | --- |
| GLM-4.7-Flash-4bit | M4 Max 36 GB | 86.6 tok/s | 92.4 tok/s | +6.7% | `benchmarks/repro_capsules/glm47_flash_control_m4max_20260205.json` |
| Qwen3-30B-A3B-4bit | M4 Max 36 GB | 106.6 tok/s | 115.0 tok/s | +7.9% | `benchmarks/repro_capsules/qwen3_a3b_moe_mlp_m4max_20260205.json` |

All results are token-identical under greedy decoding.
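The Change column is the relative throughput gain, (patched - baseline) / baseline. A short check of the arithmetic, using only the figures from the table:

```python
def percent_change(baseline: float, patched: float) -> float:
    """Relative change from baseline to patched, in percent."""
    return (patched - baseline) / baseline * 100.0


rows = [
    ("GLM-4.7-Flash-4bit", 86.6, 92.4),
    ("Qwen3-30B-A3B-4bit", 106.6, 115.0),
]
for model, baseline, patched in rows:
    print(f"{model}: {percent_change(baseline, patched):+.1f}%")
# GLM-4.7-Flash-4bit: +6.7%
# Qwen3-30B-A3B-4bit: +7.9%
```

The same function can be applied to your own tok/s numbers when comparing against a capsule.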

Multi-device notes

| Mode | MoE fusions | SwiGLU fusions | Notes |
| --- | --- | --- | --- |
| Single device | Yes | Yes | Best case: all patterns active |
| Pipeline parallel | Yes | Yes | Each rank patches its own layers |
| Tensor parallel | No (auto-excluded) | Yes | exo's ShardedMoE handles distribution; ZMLX would bypass the all-reduce |
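The tensor-parallel row notes that a fused MoE path would bypass the all-reduce. The toy example below shows why that matters. It assumes each rank holds a slice of the input dimension, which is a common tensor-parallel layout; the internals of exo's ShardedMoE are not described in this document. Under that layout every rank computes only a partial sum, and the correct output exists only after the partials are added across ranks.

```python
def dot(w, x):
    """Plain dot product of two equal-length lists."""
    return sum(wi * xi for wi, xi in zip(w, x))


w = [1.0, 2.0, 3.0, 4.0]
x = [0.5, 0.5, 1.0, 1.0]

# Single device: one rank sees the whole weight row and input.
full = dot(w, x)

# Two ranks, each owning half of the input dimension.
partials = [dot(w[:2], x[:2]), dot(w[2:], x[2:])]

# The all-reduce (here a plain sum) is what recovers the full result.
all_reduced = sum(partials)

print(full, partials, all_reduced)
```

Any single entry of `partials` is wrong on its own, so a replacement layer that skips the reduction step would return incorrect activations.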

Troubleshooting

| Symptom | Fix |
| --- | --- |
| No `[zmlx.exo]` log lines after model load | Use `zmlx-exo` / `python -m zmlx.exo` (it installs the hook). If launching exo directly, ZMLX will not be loaded. |
| `Patched 0 modules` on GLM/Qwen | Custom MLX not active. Run `python -c "import mlx.core as mx; print(hasattr(mx, 'gather_qmm_swiglu'))"`; it should print `True`. |
| `ModuleNotFoundError: No module named 'zmlx'` | ZMLX not installed in the exo environment. Run `pip install zmlx` (or use an editable install). |
| `Error: couldn't import exo.main` | `exo` may not be installed, or your current directory contains an `exo/` folder shadowing the Python package. Run from a different directory or install exo into the current environment. |
| Port 52415 in use | Another process (IDE, previous exo) is using the port. The launcher uses 52416 by default, or pass `--api-port <N>`. |
| Line continuation breaks on paste | Prefer `zmlx-exo` over long multi-line commands. |
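For the port conflict case, you can check whether something is already listening before launching. This is a minimal sketch using the standard `socket` module; `port_in_use` is a hypothetical helper, and the port numbers are the ones mentioned in the table above.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(0.5)
        # connect_ex returns 0 on success and an errno value otherwise.
        return s.connect_ex((host, port)) == 0


if __name__ == "__main__":
    for port in (52415, 52416):
        state = "in use" if port_in_use(port) else "free"
        print(f"port {port}: {state}")
```

If both ports are taken, pick a free one and pass it with `--api-port <N>`.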

Verify token fidelity

python -m zmlx.validate mlx-community/GLM-4.7-Flash-4bit --max-tokens 128 --runs 5

This compares patched vs unpatched output token-by-token and reports throughput.
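The core of a token-by-token comparison is small. The sketch below is not the implementation of `zmlx.validate`; it only illustrates the check on two lists of token IDs, with `first_divergence` as a hypothetical helper name.

```python
def first_divergence(baseline_tokens, patched_tokens):
    """Return the index of the first differing token, or None if identical."""
    for i, (a, b) in enumerate(zip(baseline_tokens, patched_tokens)):
        if a != b:
            return i
    if len(baseline_tokens) != len(patched_tokens):
        # One sequence is a strict prefix of the other.
        return min(len(baseline_tokens), len(patched_tokens))
    return None


print(first_divergence([5, 9, 2, 7], [5, 9, 2, 7]))  # None
print(first_divergence([5, 9, 2, 7], [5, 9, 3, 7]))  # 2
```

Under greedy decoding a single mismatch usually changes every later token, so the first divergent index is the useful one to report.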
