
Add optional fused SwitchGLU gate-up projection #1319

Open
xxxkkw wants to merge 2 commits into ml-explore:main from xxxkkw:switchglu-fused-gate-up


@xxxkkw xxxkkw commented May 28, 2026

Summary

  • Add an optional fuse_gate_up path to non-quantized SwitchGLU.
  • When explicitly enabled and safe, compute the gate and up expert projections with a single gathered matmul, then split the result.
  • Keep the default path unchanged: fusion is off unless fuse_gate_up is set explicitly.
  • Quantized SwitchGLU experts now intentionally fall back to the original path, because real 4-bit MoE testing showed the fused quantized path materialized persistent fused weights and regressed long-prompt memory/performance.
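The idea behind the fused path can be illustrated with a minimal, stdlib-only sketch. It is not the PR's code: the real layer uses MLX gathered matmuls over batched tokens, while this example handles one token and one routed expert with plain lists. The helper names (`matvec`, `glu_unfused`, `glu_fused`) are hypothetical, the SiLU gating is an assumption about the activation, and the down projection is omitted.

```python
import math
import random


def matvec(w, x):
    """Multiply a row-major matrix w (rows x cols) by a vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def silu(v):
    """SiLU activation, assumed here as the GLU gate nonlinearity."""
    return v / (1.0 + math.exp(-v))


def glu_unfused(gate_w, up_w, x):
    """Original path: two separate projections for the selected expert."""
    gate = matvec(gate_w, x)
    up = matvec(up_w, x)
    return [silu(g) * u for g, u in zip(gate, up)]


def glu_fused(gate_w, up_w, x):
    """Fused path: one projection over stacked weights, then a split."""
    hidden = len(gate_w)
    fused_w = gate_w + up_w  # stack rows: (2 * hidden) x dims
    out = matvec(fused_w, x)
    gate, up = out[:hidden], out[hidden:]
    return [silu(g) * u for g, u in zip(gate, up)]


def random_matrix(rows, cols):
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
dims, hidden, num_experts = 8, 4, 3
gate_ws = [random_matrix(hidden, dims) for _ in range(num_experts)]
up_ws = [random_matrix(hidden, dims) for _ in range(num_experts)]
x = [random.gauss(0.0, 1.0) for _ in range(dims)]

expert = 1  # index a router would have picked for this token
unfused_out = glu_unfused(gate_ws[expert], up_ws[expert], x)
fused_out = glu_fused(gate_ws[expert], up_ws[expert], x)
max_abs_diff = max(abs(a - b) for a, b in zip(unfused_out, fused_out))
print(max_abs_diff)
```

Stacking the weights once turns two matmuls into one, but the stacked copy has to live somewhere, which is where the memory cost discussed below comes from.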

Root-cause note from real-weight testing

The first implementation also fused QuantizedSwitchLinear gate/up projections by concatenating packed weights, scales, and biases into _fused_gate_up_cache. On mlx-community/OLMoE-1B-7B-0125-4bit, simply building those fused params for 16 SwitchGLU layers added 2.416 GB of persistent active memory. The same quantized fused path also regressed large-token prefill in layer microbenchmarks.

This update disables fusion for quantized experts. Non-quantized SwitchGLU fusion remains available; quantized models continue to run the original two-projection path.
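The fallback rule described above can be sketched as a small guard. This is an illustration under stated assumptions, not the PR's implementation: `ExpertProjection`, `GluLayer`, `should_fuse`, and `select_path` are hypothetical stand-ins, and the only "safe" condition modeled here is that neither projection is quantized.

```python
from dataclasses import dataclass


@dataclass
class ExpertProjection:
    """Hypothetical stand-in for an expert linear layer."""
    quantized: bool


@dataclass
class GluLayer:
    """Hypothetical stand-in for a SwitchGLU-like layer."""
    gate_proj: ExpertProjection
    up_proj: ExpertProjection
    fuse_gate_up: bool = False  # off by default, as in the PR


def should_fuse(layer):
    """Fuse only when requested and both projections are non-quantized."""
    if not layer.fuse_gate_up:
        return False
    return not (layer.gate_proj.quantized or layer.up_proj.quantized)


def select_path(layer):
    """Return which path a forward pass would take."""
    return "fused" if should_fuse(layer) else "unfused"


dense_on = GluLayer(ExpertProjection(False), ExpertProjection(False), fuse_gate_up=True)
dense_off = GluLayer(ExpertProjection(False), ExpertProjection(False))
quant_on = GluLayer(ExpertProjection(True), ExpertProjection(True), fuse_gate_up=True)

for layer in (dense_on, dense_off, quant_on):
    print(select_path(layer))
```

Putting the quantization check inside the guard, rather than rejecting the flag, lets callers enable the flag on every layer of a mixed model without building fused quantized weights.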

Environment

  • macOS Darwin 25.4.0 arm64
  • Apple M1 Max, 32 GB unified memory
  • MLX / mlx-lm virtual environment on Apple Silicon

Benchmarks

Synthetic non-quantized SwitchGLU microbenchmark

This isolates the layer-level non-quantized path without attention/KV overhead.

| Shape | Experts | Hidden | Iterations | Unfused | Fused | Delta |
|---|---|---|---|---|---|---|
| (1, 4096) | 128 | 1536 | 20 | 0.6656 ms | 0.6590 ms | 1.0% faster |
| (512, 1024) | 16 | 2048 | 10 | 1.7154 ms | 1.6392 ms | 4.4% faster |

Real-weight quantized MoE benchmark after the fix

Model: mlx-community/OLMoE-1B-7B-0125-4bit, a normal mlx-community MoE checkpoint using quantized SwitchGLU. The 100K input / 10K output run was attempted earlier, but the unfused leg was still running after 20 minutes and was stopped as impractical for this local PR benchmark. The reported comparison uses the fallback size: 50K input / 5K output.

Protocol: after loading the model, set every SwitchGLU.fuse_gate_up flag to off for one run and to on for the other; use greedy generation with prefill_step_size=2048. Because this model uses quantized SwitchGLU experts, the run with the flag on now correctly falls back to the original path and does not build fused quantized caches.
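The flag-toggling step of the protocol could look roughly like the helper below. It is a sketch, not the benchmark script used for this PR: `set_fuse_gate_up` is a hypothetical helper, it assumes the model exposes `named_modules()` returning `(name, module)` pairs as `mlx.nn.Module` does, and it detects target layers by the presence of the `fuse_gate_up` attribute. The `Fake*` classes exist only so the example runs without loading a model.

```python
def set_fuse_gate_up(model, enabled):
    """Set fuse_gate_up on every submodule that has it; return the count."""
    changed = 0
    for _name, module in model.named_modules():
        if hasattr(module, "fuse_gate_up"):
            module.fuse_gate_up = enabled
            changed += 1
    return changed


class FakeSwitchGLU:
    """Stand-in for a layer carrying the flag (hypothetical)."""

    def __init__(self):
        self.fuse_gate_up = False


class FakeAttention:
    """Stand-in for a layer without the flag (hypothetical)."""


class FakeModel:
    def __init__(self):
        self._modules = [
            ("layers.0.attn", FakeAttention()),
            ("layers.0.mlp", FakeSwitchGLU()),
            ("layers.1.attn", FakeAttention()),
            ("layers.1.mlp", FakeSwitchGLU()),
        ]

    def named_modules(self):
        return list(self._modules)


model = FakeModel()
changed = set_fuse_gate_up(model, True)
print(changed)
```

For the OLMoE checkpoint used here, a helper like this would be expected to report 16 layers, matching the SwitchGLU layer count in the results.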

| Mode | Input tokens | Output tokens | SwitchGLU layers | TTFT | Total time | Prompt+generation TPS | Decode TPS after TTFT | Peak memory |
|---|---|---|---|---|---|---|---|---|
| Unfused | 50,000 | 5,000 | 16 | 63.39 s | 199.50 s | 275.69 tok/s | 36.73 tok/s | 12.01 GB |
| fuse_gate_up=True with quantized fallback | 50,000 | 5,000 | 16 | 47.11 s | 182.83 s | 300.82 tok/s | 36.84 tok/s | 12.01 GB |

The important correction is memory behavior: the previous quantized fused run peaked at 14.43 GB, while the fixed fallback path stays at 12.01 GB. Both fixed quantized runs execute the same unfused code path, so the timing differences between them should be read as run-to-run variation; this PR makes no speed claim for quantized fusion.
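The derived columns can be recomputed from the raw timings. The sketch below assumes the definitions implied by the column names: overall TPS is (input + output tokens) divided by total time, and decode TPS is output tokens divided by the time remaining after TTFT. The function name `throughput` is hypothetical.

```python
def throughput(input_tokens, output_tokens, ttft_s, total_s):
    """Return (prompt+generation TPS, decode TPS after TTFT), rounded to 2 dp."""
    overall = (input_tokens + output_tokens) / total_s
    decode = output_tokens / (total_s - ttft_s)
    return round(overall, 2), round(decode, 2)


unfused = throughput(50_000, 5_000, 63.39, 199.50)
fallback = throughput(50_000, 5_000, 47.11, 182.83)

# Peak-memory gap between the earlier fused quantized run and the fixed path.
memory_saved_gb = round(14.43 - 12.01, 2)

print(unfused, fallback, memory_saved_gb)
```

The recomputed values agree with the table to within 0.01 tok/s, which is consistent with the reported timings having been rounded. The 2.42 GB peak-memory gap lines up with the 2.416 GB of persistent fused parameters measured in the root-cause note.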

Test plan

  • python -m unittest discover -s tests -p test_switch_layers.py
  • Regression: quantized SwitchGLU with fuse_gate_up=True matches unfused output and does not build _fused_gate_up_cache
  • Synthetic non-quantized SwitchGLU microbenchmarks above
  • Real mlx-community/OLMoE-1B-7B-0125-4bit 50K/5K fused-flag-vs-unfused benchmark above

xxxkkw added 2 commits May 28, 2026 23:37
  • Let SwitchGLU combine gate and up expert projections when explicitly enabled, reducing duplicate gather matmul work while preserving the default unfused path.
  • Avoid building persistent fused gate/up weights for quantized SwitchGLU layers, which increases memory and can regress long-prompt MoE inference.