feat(qwen35): Qwen3.6 MoE MTP speculative decode + checkpoint memory-safety fix by dusterbloom · Pull Request #183 · panbanda/higgs

dusterbloom · 2026-06-10T11:22:52Z

What

Adds Qwen3.6-A3B (MoE) multi-token-prediction speculative decode, and fixes a memory-safety bug in the speculative-decode checkpoint path.

Three commits:

- `feat(qwen35): MoE-structured MTP head`: loads the `mtp.safetensors` sidecar as a `MoeMtpHead` and wires it into the speculative-decode loop for Qwen3.6-A3B-style checkpoints.
- `fix(qwen35): survive mixed-bit GDN quants`: handles OptiQ-style per-projection bit widths so mixed-bit GDN checkpoints load instead of erroring.
- `fix(mtp): eval-before-deep_clone + trim-based rollback`: the memory-safety fix (below).

The memory-safety bug

Speculative-decode checkpoints shared MLX buffers with the live cache. The backbone rollback did `*cache = base_cache` (a shallow `clone()`), and the MTP head cache was shallow-cloned too. The live cache's in-place `slice_update` during verify then let MLX donate or free a buffer the checkpoint still referenced, double-freeing on drop. This was the `malloc: pointer being freed was not allocated` abort that crashed MTP decode at ~44 tokens.

Fix:

- Backbone: roll KV layers back by offset (`trim_by`), never clone, so no buffers are aliased. Only hybrid SSM/recurrent state, which can't be offset-trimmed, still clone-restores, via `AnyCache::deep_clone`.
- MTP head + hybrid clones: deep-copy via `deep_clone_mtp_cache` / `SteppingKeyValueCache::deep_clone`, which allocate fresh buffers.
- `deep_clone` itself was unsafe: `Array::deep_clone` copies straight from the buffer pointer (valid only once evaluated), but the cache stores lazy `slice_update` results, so cloning read an unmaterialized pointer and segfaulted. `eval_deep_clone` forces eval first.
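The checkpoint/rollback split described above can be sketched without MLX. This is a minimal illustration only: `LayerCache`, `capture_checkpoint`, and `rollback` are hypothetical stand-ins built on plain `Vec<f32>`, not the PR's actual types, and the sketch rolls back the whole speculative advance rather than only the rejected suffix.

```rust
/// Toy stand-in for one layer of a hybrid model's cache (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
enum LayerCache {
    /// Append-only KV cache; only the first `offset` entries are valid.
    Kv { keys: Vec<f32>, offset: usize },
    /// Recurrent/SSM state, overwritten on every step, so it cannot be trimmed.
    Recurrent { state: Vec<f32> },
}

impl LayerCache {
    /// Advance the layer by `tokens` steps (the values are placeholders).
    fn advance(&mut self, tokens: usize) {
        match self {
            LayerCache::Kv { keys, offset } => {
                let start = *offset;
                keys.truncate(start); // discard any stale tail left by a rollback
                keys.extend((0..tokens).map(|i| (start + i) as f32));
                *offset = start + tokens;
            }
            LayerCache::Recurrent { state } => {
                for s in state.iter_mut() {
                    *s += tokens as f32;
                }
            }
        }
    }

    /// Roll a KV layer back by `n` tokens; recurrent layers are left untouched.
    fn trim_by(&mut self, n: usize) {
        if let LayerCache::Kv { offset, .. } = self {
            *offset = offset.saturating_sub(n);
        }
    }
}

/// KV layers checkpoint as `None` (nothing is copied, so nothing can alias);
/// recurrent layers store an owned copy of their state.
fn capture_checkpoint(layers: &[LayerCache]) -> Vec<Option<LayerCache>> {
    layers
        .iter()
        .map(|layer| match layer {
            LayerCache::Kv { .. } => None,
            LayerCache::Recurrent { .. } => Some(layer.clone()),
        })
        .collect()
}

/// Undo a speculative advance of `drafted` tokens.
fn rollback(layers: &mut [LayerCache], checkpoint: Vec<Option<LayerCache>>, drafted: usize) {
    for (layer, saved) in layers.iter_mut().zip(checkpoint) {
        match saved {
            Some(snapshot) => *layer = snapshot, // recurrent: restore the copy
            None => layer.trim_by(drafted),      // KV: move the offset back
        }
    }
}
```

In this toy version `Vec::clone` is already an independent copy, so aliasing cannot occur; the point is only the shape of the protocol, where KV layers never hold a second handle to their buffers.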

Testing

- `cargo test -p higgs-models --lib`: 356 passed, 0 failed.
- Two new unit tests for the contract: `deep_clone_preserves_contents_and_offset` (faithful copy) and `deep_clone_checkpoint_survives_live_in_place_update` (independence under a live in-place update). Both fail (segfault) without the fix.
- Soak: Qwen3.6-35B-A3B-4bit server, 5×200-token completions = 432 `mtp_cycle` iterations / ~1300 cache deep-clones, accept-rate 62–69%, zero aborts. Pre-fix this aborted at ~44 tokens.
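The contract those two unit tests check (a faithful copy that stays independent of later in-place writes) can be illustrated with a toy cache. This is a sketch under stated assumptions: `ToyCache` and `write_at` are hypothetical, and `Rc<RefCell<..>>` merely stands in for a shared MLX buffer. Safe Rust cannot double-free here, so the sketch shows only the aliasing, not the crash.

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// Toy cache whose derived `Clone` is shallow: both handles share one buffer.
#[derive(Clone)]
struct ToyCache {
    buf: Rc<RefCell<Vec<f32>>>,
    offset: usize,
}

impl ToyCache {
    fn new(capacity: usize) -> Self {
        ToyCache { buf: Rc::new(RefCell::new(vec![0.0; capacity])), offset: 0 }
    }

    /// Append one value in place (panics if the fixed capacity is exceeded).
    fn push(&mut self, value: f32) {
        self.buf.borrow_mut()[self.offset] = value;
        self.offset += 1;
    }

    /// Overwrite an existing slot in place, loosely analogous to a slice update.
    fn write_at(&mut self, index: usize, value: f32) {
        self.buf.borrow_mut()[index] = value;
    }

    /// Allocate a fresh buffer so the copy no longer shares storage.
    fn deep_clone(&self) -> Self {
        ToyCache {
            buf: Rc::new(RefCell::new(self.buf.borrow().clone())),
            offset: self.offset,
        }
    }

    /// The valid prefix of the buffer.
    fn contents(&self) -> Vec<f32> {
        self.buf.borrow()[..self.offset].to_vec()
    }
}
```

Taking a shallow `clone()` and a `deep_clone()` of the same cache and then calling `write_at` on the live one changes what the shallow checkpoint reads, while the deep checkpoint keeps its original contents and offset.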

Summary by CodeRabbit

New Features
- Added support for Mixture-of-Experts MTP head variant.
Bug Fixes
- Improved cache checkpoint/rollback safety in speculative decoding.
- Fixed potential buffer aliasing issues during cache updates.
Refactor
- Optimized cache cloning strategy in speculative decoding to reduce memory overhead.
Tests
- Added validation tests for MoE MTP functionality and cache deep-cloning correctness.

Qwen3.6-A3B ships its MTP layer as a full MoE decoder layer (router gate + 256 stacked experts + shared expert/gate) with a quantized fc, both bundled and as standalone drafter sidecars (e.g. mlx-community Qwen3.6-35B-A3B-MTP-4bit). The existing MtpHead/DenseMtpHead are dense-MLP only, so these checkpoints could not speculate.

- MoeMtpHead / MoeMtpTransformerLayer: full attention + SparseMoeBlock, quantized fc (these sidecars ship fc.{weight,scales,biases} triples). Constructed at the checkpoint's uniform quantization; the main model's gate_quantization override is deliberately NOT applied (sidecar router gates are quantized at the default width).
- Layout detection: MoE-structured MTP keys classify as MoeQuantized and enable the head (use_moe_mtp). Truly unprefixed sidecar keys (fc.weight, layers.0....) are mtp.-prefixed at detection and load time (normalize_sidecar_mtp_key), so mlx-community drafters work as mtp.safetensors drop-ins.
- Loading: mtp.* -> moe_mtp.* param remap through both the fused and direct loaders; no dense rmsnorm adjustment for MoE targets.
- Forward: MoE branches in mtp_step_hidden and mtp_advance_many; has_mtp / make_mtp_cache cover the new head.

Measured on Qwen3.6-35B-A3B-4bit + MTP drafter (M-series, kv-bits 4): short/structured output 60-62 tok/s at 100% accept (vs 40 tok/s baseline, +50%); long prose 37-41 tok/s at 60-71% accept (breakeven). Outputs verified exact (drafts go through the verify path).

Tests: MoE layout classification, mtp->moe_mtp key remap, sidecar key normalization (aux-only, idempotent).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
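The `mtp.* -> moe_mtp.*` remap mentioned in the commit message amounts to a prefix rewrite on checkpoint keys. A minimal sketch follows, assuming the rewrite touches only keys that start with `mtp.`; the helper name `remap_to_moe_mtp` is hypothetical and is not the PR's `moe_mtp_param_key`, whose exact signature is not shown in the source.

```rust
/// Rewrite a checkpoint key in the `mtp.` namespace to the `moe_mtp.` namespace.
/// Keys outside that namespace (backbone weights, already-remapped keys) pass through.
fn remap_to_moe_mtp(key: &str) -> String {
    match key.strip_prefix("mtp.") {
        Some(rest) => format!("moe_mtp.{rest}"),
        None => key.to_string(),
    }
}
```

Because `moe_mtp.` does not itself start with `mtp.`, applying the helper twice gives the same result as applying it once.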

… bits)

Mixed-precision checkpoints (e.g. mlx-community OptiQ quants) assign different bit-widths per GDN projection on sensitive layers. Two loader gaps turned those into hard load failures:

- The mixed-bit detector only compared the in_proj_a/in_proj_b pair; a mismatched in_proj_qkv/in_proj_z pair slipped through to the fused loader, which then failed concatenating packed shapes like (8192,256) vs (4096,512). Check both fusion pairs.
- In separate-GDN mode the fused in_proj_qkvz/in_proj_ba QLinears are still constructed (as unused placeholders; the forward dispatches on use_separate_projections), and the direct loader's completeness check flagged them as missing weights, rejecting every checkpoint that *requires* separate projections. Exempt the unused fused placeholders.

Note: fully running OptiQ-style quants also needs per-projection quantization plumbed into every QLinear (their overrides span attention, shared experts, etc.). This fix makes the GDN layer detection/loading correct and turns the failure mode from a crash into a clean report.

Test: mixed in_proj_qkv/in_proj_z bits force separate GDN projections.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
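The "check both fusion pairs" rule can be sketched as a small predicate. Everything here is illustrative: `GdnProjectionBits`, `needs_separate_projections`, and `packed_cols` are hypothetical names, and the packing formula assumes quantized weights are packed into 32-bit words, so a row of `in_features` values at `bits` bits occupies `in_features * bits / 32` columns. An input width of 2048 is likewise an assumption, chosen because it reproduces the 256 and 512 column counts quoted above.

```rust
/// Bit widths of the four GDN input projections for one layer (hypothetical type).
#[derive(Debug, Clone, Copy)]
struct GdnProjectionBits {
    in_proj_qkv: u8,
    in_proj_z: u8,
    in_proj_a: u8,
    in_proj_b: u8,
}

/// Fused loading concatenates qkv with z and b with a, so each pair must
/// share one bit width. A mismatch in either pair forces separate projections.
fn needs_separate_projections(bits: GdnProjectionBits) -> bool {
    let qkvz_mismatch = bits.in_proj_qkv != bits.in_proj_z;
    let ba_mismatch = bits.in_proj_a != bits.in_proj_b;
    qkvz_mismatch || ba_mismatch
}

/// Packed column count under the 32-bit-word assumption stated in the lead-in.
fn packed_cols(in_features: usize, bits: usize) -> usize {
    in_features * bits / 32
}
```

Under these assumptions a 4-bit and an 8-bit projection over the same input width have different packed column counts, which is why concatenating them along the output axis cannot work.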

…tive-decode double-free

Speculative-decode checkpoints shared MLX buffers with the live cache. The backbone rollback did `*cache = base_cache` (shallow clone) and the MTP head cache was shallow-cloned too; the live cache's in-place `slice_update` during verify then let MLX donate/free a buffer the checkpoint still held, double-freeing on drop. This was the `malloc: pointer being freed was not allocated` abort that crashed MTP decode at ~44 tokens.

Backbone: roll KV layers back by offset (`trim_by`), never clone, so no buffers are aliased. Only hybrid SSM/recurrent state (can't be offset-trimmed) still clone-restores via `AnyCache::deep_clone`.

MTP head + hybrid clones: deep-copy via `deep_clone_mtp_cache` / `SteppingKeyValueCache::deep_clone`, which allocate fresh buffers.

deep_clone itself was unsafe: `Array::deep_clone` copies straight from the buffer pointer (valid only once evaluated), but the cache stores lazy `slice_update` results, so cloning read an unmaterialized pointer and segfaulted. `eval_deep_clone` forces eval first.

Tests: 2 deep_clone unit tests (lazy-pointer + live-update independence); higgs-models lib suite 371 passed.

Soak: Qwen3.6-35B-A3B MTP server, 5x200-tok requests = 432 mtp_cycle iterations / ~1300 cache deep-clones, accept 62-69%, zero aborts (RED crashed ~44).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
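The eval-before-clone ordering can be shown with a toy lazy array. This is a sketch, not the MLX binding: `LazyArray` is hypothetical, and where the real `Array::deep_clone` read an unmaterialized pointer and crashed, the toy version returns `None` so the precondition is visible in safe code.

```rust
/// Toy lazy array: `buffer` is only populated once `eval` has run.
struct LazyArray {
    pending: Option<Box<dyn FnOnce() -> Vec<f32>>>,
    buffer: Option<Vec<f32>>,
}

impl LazyArray {
    /// Record a computation without running it.
    fn lazy(compute: impl FnOnce() -> Vec<f32> + 'static) -> Self {
        LazyArray { pending: Some(Box::new(compute)), buffer: None }
    }

    /// Materialize the buffer if that has not happened yet.
    fn eval(&mut self) {
        if let Some(compute) = self.pending.take() {
            self.buffer = Some(compute());
        }
    }

    /// Copy from the materialized buffer; `None` means there is nothing to copy yet.
    fn deep_clone(&self) -> Option<LazyArray> {
        self.buffer
            .as_ref()
            .map(|data| LazyArray { pending: None, buffer: Some(data.clone()) })
    }

    /// Force evaluation first, then copy, so the copy always has data to read.
    fn eval_deep_clone(&mut self) -> LazyArray {
        self.eval();
        self.deep_clone().expect("buffer is materialized after eval")
    }
}
```

Calling `deep_clone` on a freshly built lazy array yields `None`, while `eval_deep_clone` always returns a populated, independent copy.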

coderabbitai · 2026-06-10T11:23:05Z

Warning

Review limit reached

@dusterbloom, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 1 hour. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more credits in the billing tab to continue.


ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: f7a11c65-a83b-47ee-9139-9a9f4c027866

📥 Commits

Reviewing files that changed from the base of the PR and between 40bafa8 and 4cd7f47.

📒 Files selected for processing (2)

crates/higgs-models/src/lib.rs
crates/higgs-models/src/qwen3_next.rs

📝 Walkthrough

Walkthrough

This PR implements safe speculative decoding with deep-cloned KV cache checkpoints and adds MoE-structured MTP head support. KV cache deep-cloning infrastructure materializes lazy arrays into independent buffers, enabling safe snapshot/restore semantics during speculative verification. MTP engine cycles now use this infrastructure instead of shallow cloning. Additionally, MoE MTP heads are now detected in checkpoints, loaded alongside existing variants, and executed with corresponding cache management and weight remapping.

Changes

Safe speculative decoding with deep-cloned cache checkpoints and MoE MTP support

| Layer / File(s) | Summary |
|---|---|
| Cache deep-cloning infrastructure for speculative checkpoints `crates/higgs-models/src/cache.rs`, `crates/higgs-models/src/lib.rs` | `SteppingKeyValueCache::deep_clone()` materializes lazy MLX arrays into independent buffers and recursively deep-clones TurboQuant packed state, while `eval_deep_clone()` safely evaluates lazy `slice_update` results. `AnyCache::deep_clone()` and `deep_clone_mtp_cache()` provide layer-wise and MTP-specific snapshot helpers. Tests validate checkpoint correctness and rollback invariants. `AUXILIARY_SAFETENSORS_FILES` visibility adjusted to `pub(crate)`. |
| MTP engine checkpoint/rollback and deep-clone updates `crates/higgs-engine/src/mtp.rs` | Backbone cache checkpoint/rollback helpers replace shallow cloning: KV caches checkpoint as `None` and rollback via trimming; hybrid caches deep-clone and restore. `mtp_prompt_lookup_cycle`, `prompt_lookup_cycle`, and `mtp_cycle` updated to use `capture_backbone_checkpoint`, `rollback_backbone`, and `deep_clone_mtp_cache` to eliminate buffer aliasing during speculative advances and verification rejections. |
| MoE MTP head struct definitions and model integration `crates/higgs-models/src/qwen3_next.rs` (lines 218–226, 1963–2042, 3427–3428, 3542–3564) | `Qwen3NextModelArgs` gains `use_moe_mtp` bool flag. `MoeMtpTransformerLayer` and `MoeMtpHead` structs introduced with attention, sparse MoE blocks, and quantized projection, with `gate_quantization` cleared to match checkpoint sidecar. `Qwen3NextCausalLM` adds `moe_mtp: Option<MoeMtpHead>` field and initialization logic alongside existing dense/quantized MTP paths. |
| MoE MTP runtime execution `crates/higgs-models/src/qwen3_next.rs` (lines 3914–3925, 4011–4029, 4122–4168) | `has_mtp()` recognizes MoE variant. `make_mtp_cache()` includes MoE layer count. `mtp_step_hidden` and `mtp_advance_many` implement MoE execution paths calling `SparseMoeBlock` MLPs instead of dense projections. |
| MoE MTP checkpoint detection and weight loading `crates/higgs-models/src/qwen3_next.rs` (lines 4358–4388, 4404–4494, 4682–4692, 4949–4983, 5036–5036, 5070–5082, 14128–14178) | `MtpWeightLayout::MoeQuantized` variant detected by scanning for MoE MLP patterns. `normalize_sidecar_mtp_key` prefixes missing `mtp.` on auxiliary sidecar files. `moe_mtp_param_key` remaps `mtp.` keys to `moe_mtp.`. Qwen3.5 loader updated for mixed-bit quantization constraints and fused-param handling. Tests validate MoE detection, remapping, and key normalization. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

panbanda/higgs#166: Both PRs modify MTP speculative decoding cycles in crates/higgs-engine/src/mtp.rs around mtp_cycle/mtp_prompt_lookup_cycle and their control flow; this PR adds checkpoint/rollback and deep-clone cache state, while the retrieved PR introduces accepted-draft telemetry, adaptive draft depth, and hybrid prompt-lookup paths.

Poem

🐰 Deep clones now guard our cache state so fair,
Checkpoint and rollback—no aliasing nightmare!
MoE heads bloom forth in speculative decoding's light,
Snapshot integrity preserved throughout the night. ✨

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the two main contributions: adding Qwen3.6 MoE MTP speculative decode support and fixing a checkpoint memory-safety bug in speculative decoding. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

crates/higgs-models/src/qwen3_next.rs (1)
4527-4529: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

load_qwen3_next_model still skips the mtp.* → moe_mtp.* remap.

After MoeQuantized detection, this path constructs moe_mtp, then immediately calls the generic safetensor loader under the assumption that checkpoint keys already match model params exactly. They do not: the new MoE sidecars still live under the mtp.* namespace, and the only remap helper (moe_mtp_param_key) is wired into the qwen3.5 direct/fused loaders further down in this file. That means plain Qwen3Next/Qwen3.6-A3B loads will miss the MoE MTP weights or fail the completeness check.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/higgs-models/src/qwen3_next.rs` around lines 4527 - 4529, The
load_qwen3_next_model path detects MoeQuantized and constructs the moe_mtp
sidecar but then calls crate::load_safetensors_weights assuming keys match;
however safetensor checkpoints still use the mtp.* namespace so the moe MTP
params are never remapped or loaded. Update load_qwen3_next_model to apply the
same remapping used elsewhere: use the existing moe_mtp_param_key helper (or
equivalent remap function) to translate incoming safetensor keys from mtp.* →
moe_mtp.* before calling crate::load_safetensors_weights (or pass a key-remap
callback into that loader) so the MoE MTP weights are picked up and the
completeness check succeeds.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/higgs-models/src/qwen3_next.rs`:
- Around line 4961-4970: The function normalize_sidecar_mtp_key is
over-prefixing namespaced keys (e.g. "language_model.mtp.layers..."); change the
condition so you only add "mtp." for auxiliary files when the key is truly
un-namespaced—i.e., it does not already start with "mtp." and does not contain
any '.' namespace separator. Update normalize_sidecar_mtp_key to check is_aux &&
!key.starts_with("mtp.") && !key.contains('.') before returning
format!("mtp.{key}"), so qwen35_checkpoint_param_key can still recognize and
strip/remap legitimately namespaced keys.

---

Outside diff comments:
In `@crates/higgs-models/src/qwen3_next.rs`:
- Around line 4527-4529: The load_qwen3_next_model path detects MoeQuantized and
constructs the moe_mtp sidecar but then calls crate::load_safetensors_weights
assuming keys match; however safetensor checkpoints still use the mtp.*
namespace so the moe MTP params are never remapped or loaded. Update
load_qwen3_next_model to apply the same remapping used elsewhere: use the
existing moe_mtp_param_key helper (or equivalent remap function) to translate
incoming safetensor keys from mtp.* → moe_mtp.* before calling
crate::load_safetensors_weights (or pass a key-remap callback into that loader)
so the MoE MTP weights are picked up and the completeness check succeeds.


ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: c6a38926-c3b4-41e5-8efd-f9f189956210

📥 Commits

Reviewing files that changed from the base of the PR and between f6e3c2f and 40bafa8.

📒 Files selected for processing (4)

crates/higgs-engine/src/mtp.rs
crates/higgs-models/src/cache.rs
crates/higgs-models/src/lib.rs
crates/higgs-models/src/qwen3_next.rs

CodeRabbit review:

- normalize_sidecar_mtp_key: only prefix truly un-namespaced sidecar keys. Gate on !is_mtp_key() instead of !starts_with("mtp.") so already-namespaced keys (e.g. language_model.mtp.*) aren't mangled into unmatchable mtp.language_model.mtp.*. Extends the existing unit test to cover it.
- load_qwen3_next_model: route through a new MTP-aware load_qwen3_next_weights instead of the plain loader. maybe_disable_mtp_without_checkpoint_weights can select the dense/MoE head (params dense_mtp.* / moe_mtp.*) while the checkpoint ships the head under mtp.*; the plain loader did no remap and silently left the draft head uninitialized. Backbone keys still match directly, so behaviour is unchanged for the common Quantized layout.

Lint CI was failing at the fmt step, masking clippy -Dwarnings errors that the feature commits introduced. Fixed so the full Lint job passes:

- cargo fmt (AUXILIARY_SAFETENSORS_FILES wrap, in_proj filter closure)
- clippy::shadow_reuse on the three sidecar-key loaders (file convention)
- backtick bare `MoE` in MoE-MTP doc comments (doc_markdown)
- LayerCache wildcard -> explicit Arrays(_) arm (match_wildcard_for_single_variants)
- allow large_enum_variant on AnyModel (singleton dispatch handle; boxing every variant would add forward-path indirection for no real benefit)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
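The gating change described in this commit can be sketched as follows. The function names mirror those in the commit message, but the bodies and signatures are assumptions: in particular, `is_mtp_key` is taken here to mean "some dot-separated segment equals `mtp`", which the source does not confirm.

```rust
/// Assumed definition: a key is in the MTP namespace if any segment is `mtp`.
fn is_mtp_key(key: &str) -> bool {
    key.split('.').any(|segment| segment == "mtp")
}

/// Prefix `mtp.` only on auxiliary-sidecar keys that carry no MTP namespace yet.
fn normalize_sidecar_mtp_key(key: &str, is_aux: bool) -> String {
    if is_aux && !is_mtp_key(key) {
        format!("mtp.{key}")
    } else {
        key.to_string()
    }
}
```

Under this reading, `fc.weight` from a sidecar becomes `mtp.fc.weight`, while `language_model.mtp.layers.0.…`-style keys and already-prefixed keys pass through unchanged, and keys from non-auxiliary files are never touched. A gate of the form `!key.contains('.')`, as suggested in the review prompt, would leave `fc.weight` unprefixed, which appears to conflict with the earlier commit's stated goal of prefixing such keys.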

dusterbloom and others added 3 commits June 10, 2026 13:15

coderabbitai Bot requested changes Jun 10, 2026

dusterbloom wants to merge 4 commits into panbanda:main from dusterbloom:dusterbloom/qwen36-mtp-decode
