feat(qwen3_next): mixed-bit Qwen3.5 GDN BA loading fallback by dusterbloom · Pull Request #148 · panbanda/higgs

dusterbloom · 2026-05-05T14:44:25Z

Summary

Adds a load-time fallback for Qwen3.5 models with mixed-bit GDN BA projections (some layers q4, some q8; common in unsloth's dynamic-quant variants). The default fused loader concatenates the `in_proj_a` and `in_proj_b` weights so that both projections run as a single matmul; when the two are quantized at different bit widths, their shapes are incompatible and the fusion fails. This PR detects that specific failure and retries with separate (4-dispatch) GDN projections.

PR-1c of the feat/magic-canvas split — a deferred follow-up from #141's audit (source commit 061e500c).

Behaviour

- Default fused path: try `load_qwen3_5_moe_weights_fused` (2 GDN dispatches per layer).
- On `ModelError::ShapeMismatch` containing both "in_proj_ba" and "requires separate GDN projections", retry with `args.use_separate_gdn_projections = true` and `load_qwen3_5_moe_weights_direct` (4 dispatches per layer).
- Forced separate (`HIGGS_SEPARATE_GDN_PROJ` env or `args.use_separate_gdn_projections`) skips the fused attempt entirely.
- Any other error propagates.
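
The control flow described above can be sketched roughly as follows. This is a minimal illustration, not the code in `qwen3_next.rs`: `ModelError`, `LoadArgs`, `Loaded`, and `load_with_gdn_fallback` are hypothetical stand-ins, and the two loaders are passed in as closures so the sketch stays self-contained.

```rust
/// Hypothetical stand-in for the crate's error type; only the variant
/// named in the PR description is modelled in any detail.
#[derive(Debug)]
enum ModelError {
    ShapeMismatch(String),
    Other(String),
}

/// Hypothetical stand-in for the loader arguments.
#[derive(Debug, Clone, Default)]
struct LoadArgs {
    use_separate_gdn_projections: bool,
}

/// Which loader produced the weights (2 vs 4 GDN dispatches per layer).
#[derive(Debug, PartialEq)]
enum Loaded {
    Fused,
    Separate,
}

/// True only for the diagnostic error that signals mixed-bit BA fusion.
fn is_mixed_bit_gdn_ba_fusion_error(err: &ModelError) -> bool {
    match err {
        ModelError::ShapeMismatch(msg) => {
            msg.contains("in_proj_ba") && msg.contains("requires separate GDN projections")
        }
        ModelError::Other(_) => false,
    }
}

/// Try the fused loader first; retry with the direct loader only when the
/// fused attempt fails with the mixed-bit diagnostic.
fn load_with_gdn_fallback<F, D>(
    mut args: LoadArgs,
    load_fused: F,
    load_direct: D,
) -> Result<Loaded, ModelError>
where
    F: FnOnce(&LoadArgs) -> Result<Loaded, ModelError>,
    D: FnOnce(&LoadArgs) -> Result<Loaded, ModelError>,
{
    // Forced separate: skip the fused attempt entirely.
    if args.use_separate_gdn_projections {
        return load_direct(&args);
    }
    match load_fused(&args) {
        Ok(model) => Ok(model),
        Err(err) if is_mixed_bit_gdn_ba_fusion_error(&err) => {
            args.use_separate_gdn_projections = true;
            load_direct(&args)
        }
        // Any other error propagates unchanged.
        Err(err) => Err(err),
    }
}
```

Matching on two substrings of the message keeps the retry narrow: an unrelated `ShapeMismatch` still surfaces as a load failure instead of silently triggering a second load.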

What's added

- `is_mixed_bit_gdn_ba_fusion_error`: matches the diagnostic shape-mismatch error
- `qwen3_5_quantization_config`: parses `{group_size, bits}` from the per-layer quantization map
- `qwen3_5_mixed_ba_quantization_layers`: finds layers where `in_proj_a` and `in_proj_b` differ
- `can_concatenate_axis0`: guard inside `load_qwen3_5_moe_weights_fused` to emit the diagnostic error rather than panic
- `load_qwen3_5_model_with_gdn_fallback`: private helper called by both dense and MoE load paths
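
To see why mixed-bit weights cannot be fused, here is a hedged sketch of an axis-0 shape guard. It assumes the MLX-style packed layout in which a quantized weight has shape `[out_features, in_features * bits / 32]`; `packed_inner_dim`, the function bodies, and the dimensions used in the usage note are illustrative and are not taken from the PR.

```rust
/// Inner (axis-1) dimension of a packed quantized weight, assuming
/// `bits`-wide values are packed into 32-bit words (an assumption about
/// the storage layout, not something stated in the PR).
/// Returns `None` on overflow or when the bits do not fill whole words.
fn packed_inner_dim(in_features: i32, bits: i32) -> Option<i32> {
    let total_bits = in_features.checked_mul(bits)?;
    if total_bits % 32 != 0 {
        return None;
    }
    Some(total_bits / 32)
}

/// Two shapes can be concatenated along axis 0 when they have the same
/// non-zero rank and agree on every dimension after the first.
fn can_concatenate_axis0_shapes(a: &[i32], b: &[i32]) -> bool {
    !a.is_empty() && a.len() == b.len() && a[1..] == b[1..]
}
```

Under these assumptions, a q4 projection with 2048 input features packs to an inner dimension of 256 while a q8 projection packs to 512, so `can_concatenate_axis0_shapes(&[64, 256], &[64, 512])` is false and a fused loader can return a diagnostic error instead of attempting the concatenation.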

Adaptations to origin/main

- The dense `load_qwen3_5_model` on origin/main only honoured the env var; it now also honours `args.use_separate_gdn_projections` (matching MoE behaviour). Strict improvement: the flag is set only by env or by mixed-bit detection.
- Direct cherry-pick of 061e500c produced 5 conflict regions because origin/main has evolved the load functions independently. This is a manual surgical port preserving origin/main's structure.

Hygiene

- No `unwrap()` on `Result`/`Option`
- No `as` casts in shape-arithmetic paths (`i32::try_from`)
- Match arms enumerate variants explicitly
- No file-level blanket allows added

Test plan

- `cargo check -p higgs-models`: clean
- `cargo clippy --all-targets --all-features -- -D warnings`: clean (rustc 1.95.0, matches CI)
- `cargo fmt --check`: clean
- `cargo test -p higgs-models --lib`: 333/333 pass (3 new)
- End-to-end load test against an unsloth dynamic-quant Qwen3.5 model (e.g. Brooooooklyn/Qwen3.5-27B-unsloth-mlx): confirm the warn-then-retry path triggers and the model loads.

Context

Part of the feat/magic-canvas PR split. Prior PRs in the chain: #141 (fused MoE), #142 (Bonsai-Q1 fp16 +2.83×), #143 (AnyCache::trim_by), #144 (DraftModel trait), #145 (PLD drafter), #147 (DFlash drafter foundation).

🤖 Generated with Claude Code

Summary by CodeRabbit

Bug Fixes
- Improved Qwen3.5 model loading to detect per-layer mixed-bit quantization and automatically fall back to a compatible loading path when fusion is invalid, reducing load failures and shape-mismatch errors.
- Added safeguards to prevent incorrect tensor fusion during model import and clearer logging of fused vs. separate projection outcomes.
Tests
- Added tests covering mixed-bit quantization, forced separate-projection scenarios, fusion-preservation, and shape-mismatch detection.

coderabbitai · 2026-05-05T14:44:42Z

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: b51490d9-a8e4-43bb-a408-dacf55d33c97

📥 Commits

Reviewing files that changed from the base of the PR and between 358ac62 and dd9cf7a.

📒 Files selected for processing (1)

crates/higgs-models/src/qwen3_next.rs

📝 Walkthrough

Walkthrough

Qwen3.5 MoE/VLM loading now detects per-layer mixed-bit BA quantization for GDN projections during config parsing and uses a runtime path that tries fused projection loading first, then falls back to separate projections when BA fusion is invalid due to mismatched shapes or quantization.

Changes

GDN Mixed-Bit Quantization Handling

| Layer / File(s) | Summary |
|---|---|
| Config Detection & Helpers `crates/higgs-models/src/qwen3_next.rs` (lines 3757–3797) | Adds `qwen3_5_quantization_config` and `qwen3_5_mixed_ba_quantization_layers` to parse per-layer quantization overrides and identify layers where `in_proj_a` and `in_proj_b` differ in `bits`/`group_size`. `load_qwen3_5_moe_text_config_args` now sets `use_separate_gdn_projections` from env or detected mixed-bit layers. |
| Load Entry Points `crates/higgs-models/src/qwen3_next.rs` (lines 3825–3828) | `load_qwen3_5_model` / MOE entry now route through a new fallback loader instead of directly branching on `use_separate_gdn_projections`. |
| Runtime Fallback Orchestration `crates/higgs-models/src/qwen3_next.rs` (lines 3866–3919) | Introduces `load_qwen3_5_model_with_gdn_fallback` which attempts fused GDN loading, detects mixed-bit BA fusion errors via `is_mixed_bit_gdn_ba_fusion_error`, and retries with separate GDN projections when appropriate. |
| Fused Loader Integration `crates/higgs-models/src/qwen3_next.rs` (lines 4166–4181) | Fused GDN loader now calls `can_concatenate_axis0` before concatenation; if incompatible, returns a `ShapeMismatch` error indicating a need for separate GDN projections to trigger fallback. |
| Shape Validation Guard `crates/higgs-models/src/qwen3_next.rs` (lines 4019–4040) | Adds `can_concatenate_axis0_shapes` and `can_concatenate_axis0` utilities to validate axis-0 concatenation compatibility (including quantized-inner-shape cases). |
| Error Detection Helper `crates/higgs-models/src/qwen3_next.rs` (lines 3921–3930) | Adds `is_mixed_bit_gdn_ba_fusion_error` to recognize fusion failures that should trigger fallback. |
| Tests & Validation `crates/higgs-models/src/qwen3_next.rs` (lines 12955–13049) | Adds tests verifying: mixed-bit BA forces separate GDN projections; matching BA preserves fused GDN; explicit separate-GDN config is preserved; and `can_concatenate_axis0` detects mismatched quantized inner shapes. |

Sequence Diagram

sequenceDiagram
    participant Config as Config Parser
    participant Runtime as Runtime Loader
    participant FusedLoader as Fused GDN Loader
    participant DirectLoader as Direct Loader
    participant Fallback as Fallback Handler

    Config->>Config: Scan per-layer quantization overrides
    Config->>Runtime: Set use_separate_gdn_projections flag

    Runtime->>Runtime: Load model config
    alt use_separate_gdn_projections forced
        Runtime->>DirectLoader: Load with separate projections
        DirectLoader-->>Runtime: Weights loaded
    else Attempt fused path
        Runtime->>FusedLoader: Attempt fused GDN loading
        FusedLoader->>FusedLoader: Check can_concatenate_axis0(in_proj_a, in_proj_b)
        alt Shapes compatible
            FusedLoader->>FusedLoader: Fuse projections
            FusedLoader-->>Runtime: Weights loaded (fused)
        else Shapes incompatible or mixed-bit
            FusedLoader-->>Fallback: Return ShapeMismatch / fusion error
            Fallback->>Runtime: Rebuild config with separate_gdn=true
            Runtime->>DirectLoader: Reload via direct loader
            DirectLoader-->>Runtime: Weights loaded (separate)
        end
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 Hopping through bits where projections meet,

When BA mismatches make fused paths retreat,
We try the fast fuse, then gently divide—
Separate projections stand ready, bona fide.
A rabbit cheers: safe loading, stride by stride!

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'feat(qwen3_next): mixed-bit Qwen3.5 GDN BA loading fallback' directly and clearly summarizes the main change: adding a fallback mechanism for loading Qwen3.5 models with mixed-bit quantization in GDN BA projections. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/higgs-models/src/qwen3_next.rs`:
- Around line 3731-3736: The code currently unconditionally computes
use_separate and inserts "use_separate_gdn_projections" into map, overwriting
any config-provided value; change this to preserve an existing config entry:
before calling map.insert for the key "use_separate_gdn_projections" (around the
variable use_separate and map.insert), check whether map already contains that
key or whether map.get("use_separate_gdn_projections") is
Some(serde_json::Value::Bool(_)); only insert the computed
serde_json::Value::from(use_separate) when the key is absent (or not a boolean),
so the config-provided true/false is not clobbered by the env/mixed-bit
detection logic.
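
One way to read this suggestion is sketched below. The real code works on a `serde_json` map; to keep the example dependency-free, `Value` here is a small hypothetical stand-in enum, and `set_use_separate_gdn_projections` is an illustrative name rather than a function from the PR.

```rust
use std::collections::HashMap;

/// Minimal stand-in for a JSON value (the real code uses `serde_json::Value`).
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Bool(bool),
    Str(String),
}

/// Insert the computed flag only when the config did not already provide
/// an explicit boolean for the key.
fn set_use_separate_gdn_projections(map: &mut HashMap<String, Value>, computed: bool) {
    const KEY: &str = "use_separate_gdn_projections";
    // An explicit true/false from the config wins over env/mixed-bit detection.
    let has_explicit_bool = matches!(map.get(KEY), Some(Value::Bool(_)));
    if !has_explicit_bool {
        // Absent, or present but not a boolean: fall back to the computed value.
        map.insert(KEY.to_string(), Value::Bool(computed));
    }
}
```

With this shape, a config that sets the key to `false` stays `false` even when detection computes `true`, which is the clobbering the comment asks to avoid.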

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: d5bd8595-b442-4013-9fb3-29e3c1e8d6e1

📥 Commits

Reviewing files that changed from the base of the PR and between 3d5a136 and 7497649.

📒 Files selected for processing (1)

crates/higgs-models/src/qwen3_next.rs

Adds a fallback path for loading Qwen3.5 models with mixed-bit GDN projection weights (some layers q4, some q8; common in unsloth's dynamic-quant variants). The default fused-projection loader fuses `in_proj_a` + `in_proj_b` into a single matmul; mixed-bit weights have incompatible shapes and the fusion fails.

Behaviour:

1. Detect via `is_mixed_bit_gdn_ba_fusion_error`: matches a `ModelError::ShapeMismatch` whose message contains both `in_proj_ba` and `requires separate GDN projections`.
2. On detection, retry the load with `args.use_separate_gdn_projections = true`, taking the `load_qwen3_5_moe_weights_direct` path. Forward dispatches go from 2 to 4 GDN ops per layer: slightly slower but correct.
3. Forced separate (via `args.use_separate_gdn_projections` config or `HIGGS_SEPARATE_GDN_PROJ` env var) skips the fused attempt entirely.

Also adds:

* `qwen3_5_quantization_config`: parses `{group_size, bits}` from the per-layer `quantization` map in `config.json`.
* `qwen3_5_mixed_ba_quantization_layers`: scans for the layers where `in_proj_a` and `in_proj_b` differ in bits or group_size.
* `can_concatenate_axis0`: guard used inside `load_qwen3_5_moe_weights_fused` to emit the diagnostic `ShapeMismatch` error rather than panicking on the concat.
* `load_qwen3_5_model_with_gdn_fallback`: private helper called by both `load_qwen3_5_model` (dense) and `load_qwen3_5_moe_model` (MoE), unifying the fallback path.

Adaptations from feat/magic-canvas → origin/main:

* The dense `load_qwen3_5_model` previously only honoured the env var; now it honours `args.use_separate_gdn_projections` too, matching the MoE path. Strict improvement: the config flag is set only by the env var or by mixed-bit detection.
* No `unwrap()`, no `as` casts (use `i32::try_from`); match arms enumerate variants. No file-level allows added.

Verification on origin/main (rustc 1.95.0):

* `cargo check -p higgs-models`: clean
* `cargo clippy --all-targets --all-features -- -D warnings`: clean
* `cargo fmt --check`: clean
* `cargo test -p higgs-models --lib`: 333/333 pass (3 new)

Source: feat/magic-canvas commit `061e500c`. Direct cherry-pick had 5 conflict regions because origin/main has evolved the load functions independently; this is a manual surgical port that preserves origin/main's structure while adding the fallback behaviour.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/higgs-models/src/qwen3_next.rs`:
- Around line 3787-3797: The current layer-scan in the closure only looks up
quantization keys with the "language_model.model.layers.{layer_idx}.linear_attn"
prefix and therefore misses checkpoints that use the unprefixed
"model.layers.{layer_idx}.linear_attn" layout; update the closure (the filter
over (0..num_hidden_layers) that computes a_quant and b_quant) to try both key
forms when calling quant.get(...).and_then(qwen3_5_quantization_config) (i.e.,
check quant.get(prefixed_key) first and fall back to quant.get(unprefixed_key)
before defaulting to default_quant.clone()), mirroring the behavior already used
by gate_quantization_override so mixed-BA detection runs for both prefixed and
unprefixed checkpoints.
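
The two-key lookup suggested here might look roughly like the sketch below. `QuantConfig`, `projection_quant`, and `mixed_ba_layers` are hypothetical names, the quantization map is modelled as a plain `HashMap` of already-parsed entries, and the assumption that the projection name is appended to the `linear_attn` prefix with a dot is mine rather than something the comment states.

```rust
use std::collections::HashMap;

/// Hypothetical parsed form of a per-layer quantization override.
#[derive(Debug, Clone, PartialEq)]
struct QuantConfig {
    group_size: i32,
    bits: i32,
}

/// Look up the override for one projection, trying the prefixed key first,
/// then the unprefixed key, then falling back to the model-wide default.
fn projection_quant(
    quant: &HashMap<String, QuantConfig>,
    layer_idx: usize,
    proj: &str,
    default_quant: &QuantConfig,
) -> QuantConfig {
    // Key layout is an assumption: "<prefix>.linear_attn.<projection name>".
    let prefixed = format!("language_model.model.layers.{layer_idx}.linear_attn.{proj}");
    let unprefixed = format!("model.layers.{layer_idx}.linear_attn.{proj}");
    quant
        .get(&prefixed)
        .or_else(|| quant.get(&unprefixed))
        .cloned()
        .unwrap_or_else(|| default_quant.clone())
}

/// Indices of layers whose `in_proj_a` and `in_proj_b` quantization differ.
fn mixed_ba_layers(
    quant: &HashMap<String, QuantConfig>,
    num_hidden_layers: usize,
    default_quant: &QuantConfig,
) -> Vec<usize> {
    (0..num_hidden_layers)
        .filter(|&i| {
            projection_quant(quant, i, "in_proj_a", default_quant)
                != projection_quant(quant, i, "in_proj_b", default_quant)
        })
        .collect()
}
```

Checking the prefixed key first and the unprefixed key second means a checkpoint in either layout reaches the same comparison, so mixed-BA detection no longer depends on how the checkpoint names its layers.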

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 5fe2c153-be5c-448d-bb7c-f4c105efe65d

📥 Commits

Reviewing files that changed from the base of the PR and between 7497649 and 358ac62.

📒 Files selected for processing (1)

crates/higgs-models/src/qwen3_next.rs

coderabbitai Bot reviewed May 5, 2026

panbanda force-pushed the dusterbloom/qwen3-mixed-bit-gdn branch from 7497649 to 358ac62 Compare May 6, 2026 12:25

coderabbitai Bot reviewed May 6, 2026

panbanda force-pushed the dusterbloom/qwen3-mixed-bit-gdn branch from 358ac62 to 45ece14 Compare May 6, 2026 12:30

fix(qwen3_next): preserve explicit GDN projection config

dd9cf7a

panbanda force-pushed the dusterbloom/qwen3-mixed-bit-gdn branch from 45ece14 to dd9cf7a Compare May 6, 2026 12:32

panbanda merged commit cc18616 into panbanda:main May 6, 2026
6 checks passed

panbanda mentioned this pull request May 6, 2026

feat(bonsai-q1): packed engine scaffold with upstream MLX guard #142

Merged

5 tasks

github-actions Bot mentioned this pull request May 6, 2026

chore: release main #153

Merged

coderabbitai Bot mentioned this pull request Jun 9, 2026

Bonsai-Q1 (bits=1) engine from #142 fails at runtime on the pinned mlx-rs #181

Open

panbanda merged 2 commits into panbanda:main from dusterbloom:dusterbloom/qwen3-mixed-bit-gdn
