Add optional fused SwitchGLU gate-up projection by xxxkkw · Pull Request #1319 · ml-explore/mlx-lm

xxxkkw · 2026-05-28T15:39:34Z

Summary

Add an optional fuse_gate_up path to non-quantized SwitchGLU.
When explicitly enabled and safe, compute the gate and up expert projections with a single gathered matmul, then split the result.
Keep the default path unchanged: fusion is off unless explicitly enabled.
Quantized SwitchGLU experts now intentionally fall back to the original path, because real 4-bit MoE testing showed the fused quantized path materialized persistent fused weights and regressed long-prompt memory/performance.
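
A minimal sketch of why the single-projection path is equivalent, using plain Python lists rather than MLX arrays. The helper names, the toy weights, the SiLU gating, and the omitted down projection are assumptions for illustration, not the PR's code. It stacks the gate and up rows of one selected expert, projects once, and splits the result.

```python
import math

def matmul_t(x, w):
    """Compute x @ w.T for a vector x and a weight matrix w (one row per output unit)."""
    return [sum(xi * wi for xi, wi in zip(x, row)) for row in w]

def silu(v):
    return v / (1.0 + math.exp(-v))

def glu_unfused(x, gate_w, up_w):
    # Two separate projections: one for the gate branch, one for the up branch.
    gate = matmul_t(x, gate_w)
    up = matmul_t(x, up_w)
    return [silu(g) * u for g, u in zip(gate, up)]

def glu_fused(x, gate_w, up_w):
    # Stack gate and up rows into one matrix, project once, then split.
    fused_w = gate_w + up_w
    fused = matmul_t(x, fused_w)
    n = len(gate_w)
    gate, up = fused[:n], fused[n:]
    return [silu(g) * u for g, u in zip(gate, up)]

# Toy expert bank (hypothetical values): 2 experts, hidden size 3, intermediate size 2.
experts = [
    {"gate": [[0.1, 0.2, 0.3], [0.0, -0.1, 0.4]],
     "up": [[0.5, 0.1, 0.0], [0.2, 0.2, 0.2]]},
    {"gate": [[-0.3, 0.1, 0.2], [0.6, 0.0, -0.2]],
     "up": [[0.1, 0.1, 0.1], [0.0, 0.3, -0.5]]},
]
x = [1.0, -2.0, 0.5]
expert_index = 1  # stands in for the router's choice for this token

w = experts[expert_index]
unfused = glu_unfused(x, w["gate"], w["up"])
fused = glu_fused(x, w["gate"], w["up"])
assert all(math.isclose(a, b) for a, b in zip(unfused, fused))
```

The fused form trades two gathers and two matmuls for one of each, at the cost of holding the stacked weights somewhere, which is where the quantized path ran into trouble.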

Root-cause note from real-weight testing

The first implementation also fused QuantizedSwitchLinear gate/up projections by concatenating packed weights, scales, and biases into _fused_gate_up_cache. On mlx-community/OLMoE-1B-7B-0125-4bit, simply building those fused params for 16 SwitchGLU layers added 2.416 GB of persistent active memory. The same quantized fused path also regressed large-token prefill in layer microbenchmarks.

This update disables fusion for quantized experts. Non-quantized SwitchGLU fusion remains available; quantized models continue to run the original two-projection path.
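
The 2.416 GB figure can be sanity-checked with rough arithmetic. The sketch below assumes values that are not stated in this PR: 64 experts per layer, hidden size 2048, expert intermediate size 1024, 4-bit weights with group size 64, and 2-byte scales and biases. Only the 16-layer count comes from the text above. Under those assumptions, a second copy of the gate and up parameters lands on the reported number.

```python
def quantized_proj_bytes(experts, out_dim, in_dim, bits=4, group_size=64, scale_bytes=2):
    """Approximate storage for one quantized expert projection (all experts in a layer)."""
    packed = experts * out_dim * in_dim * bits // 8
    groups = experts * out_dim * (in_dim // group_size)
    scales_and_biases = 2 * groups * scale_bytes  # one scale and one bias per group
    return packed + scales_and_biases

# Assumed model shape (not taken from the PR text).
per_proj = quantized_proj_bytes(experts=64, out_dim=1024, in_dim=2048)
per_layer = 2 * per_proj   # gate + up are both duplicated into the fused cache
total = 16 * per_layer     # 16 SwitchGLU layers, as reported above
total_gb = total / 1e9

print(f"{total_gb:.3f} GB")  # 2.416 GB under these assumptions
```

If the assumptions hold, the fused cache is a full duplicate of the gate and up parameters rather than a replacement for them, which would explain why the memory is persistent.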

Environment

macOS Darwin 25.4.0 arm64
Apple M1 Max, 32 GB unified memory
MLX / mlx-lm virtual environment on Apple Silicon

Benchmarks

Synthetic non-quantized SwitchGLU microbenchmark

This isolates the layer-level non-quantized path without attention/KV overhead.

| Shape | Experts | Hidden | Iterations | Unfused | Fused | Delta |
| --- | --- | --- | --- | --- | --- | --- |
| `(1, 4096)` | 128 | 1536 | 20 | 0.6656 ms | 0.6590 ms | 1.0% faster |
| `(512, 1024)` | 16 | 2048 | 10 | 1.7154 ms | 1.6392 ms | 4.4% faster |
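
For reference, the Delta column is consistent with latency reduction measured relative to the unfused time. That reading is an inference from the numbers, not something the PR states. A short check:

```python
def percent_faster(unfused_ms, fused_ms):
    """Latency reduction of the fused path, as a percentage of the unfused time."""
    return (unfused_ms - fused_ms) / unfused_ms * 100.0

# (shape, unfused ms, fused ms) copied from the microbenchmark rows above.
rows = [
    ("(1, 4096)", 0.6656, 0.6590),
    ("(512, 1024)", 1.7154, 1.6392),
]
deltas = {shape: round(percent_faster(u, f), 1) for shape, u, f in rows}
print(deltas)
```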

Real-weight quantized MoE benchmark after the fix

Model: mlx-community/OLMoE-1B-7B-0125-4bit, a standard mlx-community MoE checkpoint using quantized SwitchGLU. A 100K input / 10K output run was attempted first, but the unfused leg was still running after 20 minutes and was stopped as impractical for a local PR benchmark. The reported comparison therefore uses the smaller backup size: 50K input / 5K output.

Protocol: set all SwitchGLU.fuse_gate_up flags off/on after load, greedy generation, prefill_step_size=2048. Because this model uses quantized SwitchGLU experts, the fixed fused-flag run now correctly falls back to the original path and does not build fused quantized caches.
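
A sketch of how the flags could be toggled after load. The helper name, the stand-in classes, and the module paths are hypothetical. With a real model the pairs would presumably come from something like `model.named_modules()`, and the benchmark script used for this PR may have done it differently.

```python
def set_fuse_gate_up(named_modules, enabled):
    """Set fuse_gate_up on every module exposing the flag; return how many were touched."""
    count = 0
    for _name, module in named_modules:
        if hasattr(module, "fuse_gate_up"):
            module.fuse_gate_up = enabled
            count += 1
    return count

# Stand-in classes so the sketch runs without MLX installed.
class FakeSwitchGLU:
    def __init__(self):
        self.fuse_gate_up = False  # disabled by default, as in the PR

class FakeAttention:
    pass

# Hypothetical module paths; 16 SwitchGLU layers as in the benchmark.
modules = [(f"layers.{i}.mlp.switch_mlp", FakeSwitchGLU()) for i in range(16)]
modules.append(("layers.0.self_attn", FakeAttention()))

touched = set_fuse_gate_up(modules, True)
print(touched)
```

Returning the count makes it easy to confirm that the number of flagged layers matches the "SwitchGLU layers" column before timing anything.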

| Mode | Input tokens | Output tokens | SwitchGLU layers | TTFT | Total time | Prompt+generation TPS | Decode TPS after TTFT | Peak memory |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Unfused | 50,000 | 5,000 | 16 | 63.39s | 199.50s | 275.69 tok/s | 36.73 tok/s | 12.01 GB |
| `fuse_gate_up=True` with quantized fallback | 50,000 | 5,000 | 16 | 47.11s | 182.83s | 300.82 tok/s | 36.84 tok/s | 12.01 GB |

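The important correction is memory behavior: the previous quantized fused run peaked at 14.43 GB, while the fixed fallback path stays at 12.01 GB. The 2.42 GB difference is in line with the 2.416 GB of fused parameters noted in the root-cause section. Both fixed runs execute the same unfused code path, so the timing differences between them should be read as run-to-run variation; there is no quantized fusion speed claim.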
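
The two throughput columns follow from the token counts and timings in the table. The sketch below recomputes them, assuming "Prompt+generation TPS" is all tokens over total time and "Decode TPS after TTFT" is output tokens over the time remaining after the first token. The function name is illustrative.

```python
def derived_metrics(input_tokens, output_tokens, ttft_s, total_s):
    """Return (overall tokens/s, decode tokens/s after the first token)."""
    overall_tps = (input_tokens + output_tokens) / total_s
    decode_tps = output_tokens / (total_s - ttft_s)
    return overall_tps, decode_tps

# Values copied from the benchmark rows above.
unfused = derived_metrics(50_000, 5_000, 63.39, 199.50)
fallback = derived_metrics(50_000, 5_000, 47.11, 182.83)

print(unfused, fallback)
```

Both definitions reproduce the reported values to within rounding. Decode throughput is nearly identical across the two runs, so the gap in total time comes almost entirely from TTFT.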

Test plan

python -m unittest discover -s tests -p test_switch_layers.py
Regression: quantized SwitchGLU with fuse_gate_up=True matches unfused output and does not build _fused_gate_up_cache
Synthetic non-quantized SwitchGLU microbenchmarks above
Real mlx-community/OLMoE-1B-7B-0125-4bit 50K/5K fused-flag-vs-unfused benchmark above
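
The regression item can be illustrated with a toy stand-in. This is not the code in tests/test_switch_layers.py: the class below is a hypothetical single-expert model on plain lists, with no activation or down projection, and whether the real non-quantized path caches its fused weights is not stated in this PR. Only the `fuse_gate_up` and `_fused_gate_up_cache` names come from the description.

```python
class ToySwitchGLU:
    """Toy stand-in: one expert, plain lists, gate * up with no activation."""

    def __init__(self, gate_w, up_w, quantized=False):
        self.gate_w = gate_w
        self.up_w = up_w
        self.quantized = quantized
        self.fuse_gate_up = False          # disabled by default
        self._fused_gate_up_cache = None   # only built on the fused path

    @staticmethod
    def _project(x, w):
        return [sum(a * b for a, b in zip(x, row)) for row in w]

    def __call__(self, x):
        if self.fuse_gate_up and not self.quantized:
            # Fused path: stack gate and up rows, project once, split.
            if self._fused_gate_up_cache is None:
                self._fused_gate_up_cache = self.gate_w + self.up_w
            both = self._project(x, self._fused_gate_up_cache)
            n = len(self.gate_w)
            gate, up = both[:n], both[n:]
        else:
            # Original two-projection path, also used as the quantized fallback.
            gate = self._project(x, self.gate_w)
            up = self._project(x, self.up_w)
        return [g * u for g, u in zip(gate, up)]

gate_w = [[1.0, 2.0], [0.5, -1.0]]
up_w = [[0.0, 1.0], [2.0, 2.0]]
x = [3.0, 4.0]

reference = ToySwitchGLU(gate_w, up_w, quantized=True)(x)

layer = ToySwitchGLU(gate_w, up_w, quantized=True)
layer.fuse_gate_up = True
result = layer(x)

# The flag is set, but the quantized layer falls back and builds no cache.
assert result == reference
assert layer._fused_gate_up_cache is None
```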

xxxkkw added 2 commits May 28, 2026 23:37

Add optional fused SwitchGLU gate-up projection

d15917b

Let SwitchGLU combine gate and up expert projections when explicitly enabled, reducing duplicate gather matmul work while preserving the default unfused path.

Disable SwitchGLU fusion for quantized experts

7fe8992

Avoid building persistent fused gate/up weights for quantized SwitchGLU layers, which increases memory and can regress long-prompt MoE inference.

xxxkkw wants to merge 2 commits into ml-explore:main from xxxkkw:switchglu-fused-gate-up
