Add optional fused SwitchGLU gate-up projection #1319
Open
xxxkkw wants to merge 2 commits into
Let SwitchGLU combine gate and up expert projections when explicitly enabled, reducing duplicate gather matmul work while preserving the default unfused path.
Avoid building persistent fused gate/up weights for quantized SwitchGLU layers, which increases memory and can regress long-prompt MoE inference.
Summary
- Adds an optional `fuse_gate_up` path to non-quantized `SwitchGLU`.
- Quantized `SwitchGLU` experts now intentionally fall back to the original path, because real 4-bit MoE testing showed that the fused quantized path materialized persistent fused weights and regressed long-prompt memory and performance.

Root-cause note from real-weight testing

The first implementation also fused `QuantizedSwitchLinear` gate/up projections by concatenating packed weights, scales, and biases into `_fused_gate_up_cache`. On `mlx-community/OLMoE-1B-7B-0125-4bit`, simply building those fused parameters for 16 SwitchGLU layers added 2.416 GB of persistent active memory. The same quantized fused path also regressed large-token prefill in layer microbenchmarks.

This update disables fusion for quantized experts. Non-quantized SwitchGLU fusion remains available; quantized models continue to run the original two-projection path.
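To make the fusion idea concrete, here is a minimal pure-Python sketch of a gated projection computed two ways: with separate gate and up matmuls, and with one matmul over the concatenated weights followed by a split. It is an illustration only, not the code in this PR. It uses a single expert with no gather step, assumes SiLU as the activation, and the helper names (`matmul`, `glu_unfused`, `glu_fused`) are hypothetical.

```python
import math


def matmul(x, w):
    """Compute x @ w.T for x: [tokens][d_in] and w: [d_out][d_in]."""
    return [[sum(a * b for a, b in zip(row, w_row)) for w_row in w] for row in x]


def silu(v):
    """SiLU activation (assumed here; the real layer's activation may differ)."""
    return v / (1.0 + math.exp(-v))


def glu_unfused(x, w_gate, w_up):
    """Original path: two separate projections over the same input."""
    gate = matmul(x, w_gate)
    up = matmul(x, w_up)
    return [[silu(g) * u for g, u in zip(g_row, u_row)]
            for g_row, u_row in zip(gate, up)]


def glu_fused(x, w_gate, w_up):
    """Fused path: one projection over concatenated weights, then a split."""
    hidden = len(w_gate)
    # Concatenate along the output dimension. This sketch rebuilds the fused
    # weight on every call; keeping it alive across calls is what costs
    # persistent memory.
    w_fused = w_gate + w_up
    fused = matmul(x, w_fused)
    return [[silu(g) * u for g, u in zip(row[:hidden], row[hidden:])]
            for row in fused]
```

Both functions return the same values for the same inputs. The fused variant replaces two passes over the input with one, at the cost of holding a second copy of the gate and up weights for as long as the concatenated weight is kept.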
Environment
Benchmarks
Synthetic non-quantized SwitchGLU microbenchmark
This isolates the layer-level non-quantized path without attention/KV overhead.
[Results table not preserved; surviving row labels: (1, 4096), (512, 1024)]

Real-weight quantized MoE benchmark after the fix

Model: `mlx-community/OLMoE-1B-7B-0125-4bit`, a normal mlx-community MoE checkpoint using quantized `SwitchGLU`. The 100K input / 10K output run was attempted earlier, but the unfused leg was still running after 20 minutes and was stopped as impractical for this local PR benchmark. The reported comparison uses the fallback size: 50K input / 5K output.

Protocol: set all `SwitchGLU.fuse_gate_up` flags off/on after load, greedy generation, `prefill_step_size=2048`. Because this model uses quantized SwitchGLU experts, the fixed fused-flag run now correctly falls back to the original path and does not build fused quantized caches.

[Results table not preserved; surviving row label: `fuse_gate_up=True` with quantized fallback]

The important correction is memory behavior: the previous quantized fused run peaked at 14.43 GB, while the fixed fallback path stays at 12.01 GB. Timing differences between the two fixed quantized runs are run-to-run variation; there is no quantized fusion speed claim.
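The flag-toggling step of the protocol can be sketched as follows. This is a minimal illustration under stated assumptions, not the benchmark script: `SwitchGLU`, `Block`, and `Model` below are hypothetical stand-in classes, and `modules()` is a stand-in traversal; a real model would need whatever module traversal its framework provides. Only the attribute name `fuse_gate_up` comes from this PR.

```python
class SwitchGLU:
    """Stand-in for the real layer; only the flag matters for this sketch."""

    def __init__(self):
        self.fuse_gate_up = False


class Block:
    """Stand-in transformer block holding one MoE MLP."""

    def __init__(self):
        self.mlp = SwitchGLU()


class Model:
    """Stand-in model with a flat list of blocks."""

    def __init__(self, num_layers):
        self.layers = [Block() for _ in range(num_layers)]

    def modules(self):
        """Yield every block and its MLP (hypothetical traversal)."""
        for block in self.layers:
            yield block
            yield block.mlp


def set_fuse_gate_up(model, enabled):
    """Set the flag on every SwitchGLU layer and return how many were touched."""
    touched = 0
    for module in model.modules():
        if isinstance(module, SwitchGLU):
            module.fuse_gate_up = enabled
            touched += 1
    return touched
```

Returning the count gives a quick sanity check that the toggle reached every layer, for example 16 on a model with 16 SwitchGLU layers.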
Test plan
- `python -m unittest discover -s tests -p test_switch_layers.py`
- `fuse_gate_up=True` matches unfused output and does not build `_fused_gate_up_cache`
- `mlx-community/OLMoE-1B-7B-0125-4bit` 50K/5K fused-flag-vs-unfused benchmark above
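The shape of the fallback check in the test plan can be illustrated with a toy stand-in. This is a sketch of the guard pattern only, not the test in `test_switch_layers.py`: `StandInSwitchGLU`, its `quantized` attribute, and the placeholder cache value are all hypothetical, and the forward pass is a trivial doubling so that both paths agree by construction. Only the names `fuse_gate_up` and `_fused_gate_up_cache` come from this PR.

```python
class StandInSwitchGLU:
    """Hypothetical stand-in layer, used only to illustrate the guard."""

    def __init__(self, quantized, fuse_gate_up=False):
        self.quantized = quantized
        self.fuse_gate_up = fuse_gate_up
        self._fused_gate_up_cache = None

    def _use_fused_path(self):
        # Fusion is allowed only when requested AND the experts are not quantized.
        return self.fuse_gate_up and not self.quantized

    def __call__(self, x):
        if self._use_fused_path() and self._fused_gate_up_cache is None:
            # Placeholder for the concatenated gate/up weights.
            self._fused_gate_up_cache = "fused-weights-placeholder"
        # Toy forward pass: identical on both paths in this sketch.
        return [2.0 * v for v in x]


def check_quantized_fallback():
    """Quantized layer with the flag on must match unfused output and build no cache."""
    x = [1.0, -2.0, 3.5]
    unfused = StandInSwitchGLU(quantized=True, fuse_gate_up=False)
    flagged = StandInSwitchGLU(quantized=True, fuse_gate_up=True)
    assert flagged(x) == unfused(x)
    assert flagged._fused_gate_up_cache is None
    return True
```

A real test would compare actual layer outputs within a numerical tolerance rather than relying on a toy forward pass.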