dusterbloom: perf(models): fused MoE gate+up — 3→2 expert matmuls per layer by dusterbloom · Pull Request #12 · dusterbloom/higgs

dusterbloom · 2026-04-30T16:53:28Z

Summary

SwitchMlpWeights::forward_gather_fused() lazy-concatenates gate+up weights on first call, then dispatches a single gather_qmm instead of two separate calls. FfnBlock::forward() now routes through the fused path instead of forward_gather_global_sort().
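The idea behind the fusion can be sketched in plain Rust. This is a minimal illustration, not the PR's code: `Matrix`, `gate_up_separate`, and `gate_up_fused` are hypothetical names, dense `f32` matrices stand in for the quantized expert weights that `gather_qmm` operates on, and the SiLU-gated combination is an assumption about the activation. It shows that stacking gate and up along the output axis turns two matmuls into one whose output is split afterwards.

```rust
/// Row-major dense matrix; a stand-in for the quantized expert weights.
#[derive(Clone, Debug, PartialEq)]
struct Matrix {
    rows: usize,
    cols: usize,
    data: Vec<f32>,
}

impl Matrix {
    fn new(rows: usize, cols: usize, data: Vec<f32>) -> Self {
        assert_eq!(data.len(), rows * cols, "shape/data mismatch");
        Self { rows, cols, data }
    }

    /// y = W x, with W of shape [rows, cols] and x of length cols.
    fn matvec(&self, x: &[f32]) -> Vec<f32> {
        assert_eq!(x.len(), self.cols);
        self.data
            .chunks(self.cols)
            .map(|row| row.iter().zip(x).map(|(w, v)| w * v).sum::<f32>())
            .collect()
    }

    /// Stack `self` on top of `other` along the output (row) axis.
    fn concat_rows(&self, other: &Matrix) -> Matrix {
        assert_eq!(self.cols, other.cols);
        let mut data = self.data.clone();
        data.extend_from_slice(&other.data);
        Matrix::new(self.rows + other.rows, self.cols, data)
    }
}

/// Assumed activation for the gated MLP.
fn silu(v: f32) -> f32 {
    v / (1.0 + (-v).exp())
}

/// Unfused path: two matmuls (gate and up) over the same input.
fn gate_up_separate(gate: &Matrix, up: &Matrix, x: &[f32]) -> Vec<f32> {
    let g = gate.matvec(x);
    let u = up.matvec(x);
    g.iter().zip(&u).map(|(g, u)| silu(*g) * u).collect()
}

/// Fused path: one matmul against the stacked weight, then split the output.
fn gate_up_fused(fused: &Matrix, hidden: usize, x: &[f32]) -> Vec<f32> {
    let gu = fused.matvec(x);
    let (g, u) = gu.split_at(hidden);
    g.iter().zip(u).map(|(g, u)| silu(*g) * u).collect()
}
```

Calling `gate_up_fused(&gate.concat_rows(&up), gate.rows, &x)` yields the same vector as `gate_up_separate(&gate, &up, &x)`, because each output row is the same dot product in both paths. With the down projection unchanged, the per-layer expert matmul count drops from three to two.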

Numbers (35B-A3B-3bit on M4 base)

| Metric | Before | After | Δ |
|---|---|---|---|
| S=1 decode | 27 ms | 17 ms | −37% |
| S=16 verify | 253 ms | 112 ms | −56% |
| MoE/layer at K=1 | ~0.68 ms | 0.47 ms | −31% |
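The Δ column follows from the Before and After timings. As a quick check, a small helper (the name `percent_change` is illustrative, not part of the PR) reproduces the rounded figures:

```rust
/// Relative change from `before` to `after`, in percent.
/// Negative values mean the "after" timing is faster.
fn percent_change(before: f64, after: f64) -> f64 {
    (after - before) / before * 100.0
}
```

For example, `percent_change(253.0, 112.0)` is about −55.7, which rounds to the −56% reported for S=16 verify.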

Test plan

cargo check -p higgs-models — clean
cargo test -p higgs-models --lib qwen3_next:: — 71 passed, 0 failed (includes new test_moe_gate_up_fusion_parity)
cargo run -p higgs-bench --release --bin bench_decode -- --model <qwen3_5_moe-key> — A/B against main on a Qwen3.5-MoE checkpoint
cargo clippy -p higgs-models — clean

Notes

Single-file change: crates/higgs-models/src/qwen3_next.rs (+175 / −9)
The fused weight tensor is built lazily on first forward call; concat happens once per model load, not per token
Co-authored with Claude Sonnet 4.6
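The "build once, reuse on every later call" behaviour described in the notes can be sketched with `std::sync::OnceLock`. This is an assumption about one way to do it, not the PR's actual mechanism: `ExpertWeights` and `fused_gate_up` here are hypothetical, and flat `Vec<f32>` buffers stand in for the real weight, scale, and bias tensors.

```rust
use std::sync::OnceLock;

/// Hypothetical holder for one layer's gate and up weights.
struct ExpertWeights {
    gate: Vec<f32>,
    up: Vec<f32>,
    /// Empty until the first forward call, then reused for every token.
    fused: OnceLock<Vec<f32>>,
}

impl ExpertWeights {
    fn new(gate: Vec<f32>, up: Vec<f32>) -> Self {
        Self { gate, up, fused: OnceLock::new() }
    }

    /// Returns the concatenated gate+up buffer, building it on first use only.
    fn fused_gate_up(&self) -> &[f32] {
        self.fused.get_or_init(|| {
            let mut w = Vec::with_capacity(self.gate.len() + self.up.len());
            w.extend_from_slice(&self.gate);
            w.extend_from_slice(&self.up);
            w
        })
    }
}
```

The trade-off of any such cache is memory: the fused copy lives alongside the original gate and up tensors unless those are dropped after the concat.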

…#148)

* feat(qwen3_next): mixed-bit Qwen3.5 GDN BA loading fallback

Adds a fallback path for loading Qwen3.5 models with mixed-bit GDN projection weights (some layers q4, some q8; common in unsloth's dynamic-quant variants). The default fused-projection loader fuses `in_proj_a` + `in_proj_b` into a single matmul; mixed-bit weights have incompatible shapes and the fusion fails.

Behaviour:

1. Detect via `is_mixed_bit_gdn_ba_fusion_error`, which matches a `ModelError::ShapeMismatch` whose message contains both `in_proj_ba` and `requires separate GDN projections`.
2. On detection, retry the load with `args.use_separate_gdn_projections = true`, taking the `load_qwen3_5_moe_weights_direct` path. Forward dispatches go from 2 to 4 GDN ops per layer; slightly slower but correct.
3. Forced separate (via `args.use_separate_gdn_projections` config or `HIGGS_SEPARATE_GDN_PROJ` env var) skips the fused attempt entirely.

Also adds:

* `qwen3_5_quantization_config`: parses `{group_size, bits}` from the per-layer `quantization` map in `config.json`.
* `qwen3_5_mixed_ba_quantization_layers`: scans for the layers where `in_proj_a` and `in_proj_b` differ in bits or group_size.
* `can_concatenate_axis0`: guard used inside `load_qwen3_5_moe_weights_fused` to emit the diagnostic `ShapeMismatch` error rather than panicking on the concat.
* `load_qwen3_5_model_with_gdn_fallback`: private helper called by both `load_qwen3_5_model` (dense) and `load_qwen3_5_moe_model` (MoE), unifying the fallback path.

Adaptations from feat/magic-canvas → origin/main:

* The dense `load_qwen3_5_model` previously only honoured the env var; now it honours `args.use_separate_gdn_projections` too, matching the MoE path. Strict improvement: the config flag is set only by the env var or by mixed-bit detection.
* No `unwrap()`, no `as` casts (use `i32::try_from`); match arms enumerate variants. No file-level allows added.

Verification on origin/main (rustc 1.95.0):

* `cargo check -p higgs-models`: clean
* `cargo clippy --all-targets --all-features -- -D warnings`: clean
* `cargo fmt --check`: clean
* `cargo test -p higgs-models --lib`: 333/333 pass (3 new)

Source: feat/magic-canvas commit `061e500c`. Direct cherry-pick had 5 conflict regions because origin/main has evolved the load functions independently; this is a manual surgical port that preserves origin/main's structure while adding the fallback behaviour.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(qwen3_next): preserve explicit GDN projection config

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Jonathan Reyes <me@jonathanreyes.com>
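The detect-and-retry flow described in this commit message can be sketched as follows. This is a simplified illustration under stated assumptions: the `ModelError` and `LoadArgs` definitions are reduced stand-ins, `load_with_gdn_fallback` is a hypothetical generic wrapper rather than the real `load_qwen3_5_model_with_gdn_fallback`, and the loader is passed in as a closure so the example stays self-contained. Only the predicate name, the matched substrings, and the `use_separate_gdn_projections` flag come from the commit text.

```rust
/// Reduced stand-in for the crate's error type (assumed shape).
#[derive(Debug)]
enum ModelError {
    ShapeMismatch(String),
    Other(String),
}

/// Reduced stand-in for the loader arguments (assumed shape).
#[derive(Clone, Debug, Default)]
struct LoadArgs {
    use_separate_gdn_projections: bool,
}

/// True only for the diagnostic error emitted when fused BA loading fails.
fn is_mixed_bit_gdn_ba_fusion_error(err: &ModelError) -> bool {
    match err {
        ModelError::ShapeMismatch(msg) => {
            msg.contains("in_proj_ba") && msg.contains("requires separate GDN projections")
        }
        ModelError::Other(_) => false,
    }
}

/// Try the fused load first; on the mixed-bit error, retry with separate
/// projections. If separate projections are forced, skip the fused attempt.
fn load_with_gdn_fallback<M>(
    mut args: LoadArgs,
    load: impl Fn(&LoadArgs) -> Result<M, ModelError>,
) -> Result<M, ModelError> {
    if args.use_separate_gdn_projections {
        return load(&args);
    }
    match load(&args) {
        Err(err) if is_mixed_bit_gdn_ba_fusion_error(&err) => {
            args.use_separate_gdn_projections = true;
            load(&args)
        }
        other => other,
    }
}
```

Matching on message substrings keeps the fallback narrow: any other error, including an unrelated `ShapeMismatch`, is returned to the caller unchanged instead of triggering a second load.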

Two errors flagged by `-D clippy::doc-markdown` and `-D clippy::cast-sign-loss`/`-D clippy::as-conversions`:
- Backtick `MoE` and `gather_qmm` in the `fused_gate_up` doc comment.
- Replace `top_k as u32` with the same `u32::try_from(top_k).map_err(...)` pattern already used by `forward_gather_global_sort`.
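The checked-conversion pattern named in the second fix looks roughly like this. It is a sketch with assumptions: `top_k` is taken to be an `i32`, the `ModelError` variant and message are placeholders, and `top_k_as_u32` is a hypothetical helper; the commit only states that `u32::try_from(top_k).map_err(...)` replaces `top_k as u32`.

```rust
/// Placeholder error type; the real crate's variant may differ.
#[derive(Debug, PartialEq)]
enum ModelError {
    ShapeMismatch(String),
}

/// Checked conversion: a negative `top_k` becomes an error instead of
/// silently wrapping to a large `u32`, as `top_k as u32` would.
fn top_k_as_u32(top_k: i32) -> Result<u32, ModelError> {
    u32::try_from(top_k)
        .map_err(|_| ModelError::ShapeMismatch(format!("top_k must be non-negative, got {top_k}")))
}
```

With an `as` cast, `-1_i32 as u32` evaluates to `4294967295`, which is the sign-loss case the lint guards against.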

fix(deps): update rust crate toml to v1

f671235

dusterbloom force-pushed the dusterbloom/perf-foundational-wins branch from 2e9e083 to 18bf33f Compare May 3, 2026 21:26

dusterbloom mentioned this pull request May 4, 2026

perf(models): opt-in fused MoE gate+up — 3→2 expert matmuls per layer panbanda/higgs#141

Merged

5 tasks

renovate Bot and others added 13 commits May 4, 2026 14:50

chore(deps): update rust crate tokio to v1.52.2

50daaef

chore(deps): update taiki-e/install-action digest to cca35ed

8659d38

chore(deps): update rust crate tower-http to v0.6.9

442063e

Merge pull request panbanda#135 from panbanda/renovate/toml-1.x

a08c1bf

fix(deps): update rust crate toml to v1

Merge pull request panbanda#140 from panbanda/renovate/taiki-e-instal…

ea2fb41

…l-action-digest chore(deps): update taiki-e/install-action digest to cca35ed

Merge pull request panbanda#146 from panbanda/renovate/tokio-1.x-lock…

0ae70ec

…file chore(deps): update rust crate tokio to v1.52.2

Merge pull request panbanda#149 from panbanda/renovate/tower-http-0.x…

4b4c0be

…-lockfile chore(deps): update rust crate tower-http to v0.6.9

feat(cache): AnyCache::trim_by dispatcher for spec-decode rollback (p…

229c111

…anbanda#143) Adds AnyCache::trim_by to roll back KV layers for speculative decode while leaving hybrid Arrays state untouched.

CI: https://github.com/panbanda/higgs/actions/runs/25312580791
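A rough picture of what such a dispatcher might do is given below. Everything except the name `trim_by` is an assumption: the real `AnyCache` layout is not shown on this page, so `LayerCache`, its variants, and the token-count representation are hypothetical. The sketch only illustrates the stated behaviour, namely that KV layers are rolled back by a number of rejected tokens while hybrid `Arrays` state is left alone.

```rust
/// Hypothetical per-layer cache: attention layers keep a KV cache with a
/// token length; hybrid layers keep opaque state that is not truncated.
#[derive(Debug, PartialEq)]
enum LayerCache {
    Kv { len: usize },
    Arrays { state: Vec<f32> },
}

/// Hypothetical container; the real `AnyCache` may be shaped differently.
struct AnyCache {
    layers: Vec<LayerCache>,
}

impl AnyCache {
    /// Drop the last `n` tokens from every KV layer; leave `Arrays` untouched.
    fn trim_by(&mut self, n: usize) {
        for layer in &mut self.layers {
            match layer {
                LayerCache::Kv { len } => *len = len.saturating_sub(n),
                LayerCache::Arrays { .. } => {}
            }
        }
    }
}
```

In speculative decode, `n` would be the number of drafted tokens the verifier rejected; how the hybrid state is restored in that case is outside what this commit message describes.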

style: cargo fmt qwen3_next.rs

b677d3b

Reflow `let fw/fs/fb = ops::concatenate_axis(..)` from broken-line indentation back onto single lines so `cargo fmt --all -- --check` passes in CI.

fix(qwen3_next): gate MoE gate-up fusion behind opt-in

371c2e3

panbanda force-pushed the dusterbloom/perf-foundational-wins branch from 009280b to 371c2e3 Compare May 6, 2026 13:01

dusterbloom wants to merge 14 commits into main from dusterbloom/perf-foundational-wins
