
feat(qwen3_next): mixed-bit Qwen3.5 GDN BA loading fallback #148

Merged
panbanda merged 2 commits into panbanda:main from dusterbloom:dusterbloom/qwen3-mixed-bit-gdn
May 6, 2026


dusterbloom (Contributor) commented May 5, 2026

Summary

Adds a load-time fallback for Qwen3.5 models with mixed-bit GDN BA projections, where in some layers in_proj_a and in_proj_b are quantized with different bits or group sizes (e.g. q4 vs q8, common in unsloth's dynamic-quant variants). The default fused loader concatenates in_proj_a and in_proj_b along axis 0 into a single in_proj_ba matmul; mixed-bit weights have incompatible packed shapes, so the fusion fails. This PR detects that specific failure and retries with separate GDN projections (4 dispatches per layer instead of 2).
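A minimal sketch of why the fused concatenation can fail, assuming MLX-style affine quantization (each u32 word packs 32 / bits values, with one scale per group_size inputs). `packed_weight_shape` and `scales_shape` are hypothetical helpers, `can_concatenate_axis0_shapes` is an illustrative reimplementation rather than the crate's code, and the 32 × 2048 projection size is invented for the example.

```rust
/// Packed shape of a quantized weight, assuming MLX-style packing where
/// each u32 word holds `32 / bits` values along the input dimension.
/// Hypothetical helper: returns None for bit widths this sketch does not
/// model (anything that does not divide 32) or inputs that do not pack evenly.
fn packed_weight_shape(out_features: i32, in_features: i32, bits: i32) -> Option<Vec<i32>> {
    if bits <= 0 || 32 % bits != 0 {
        return None;
    }
    let per_word = 32 / bits;
    if in_features % per_word != 0 {
        return None;
    }
    Some(vec![out_features, in_features / per_word])
}

/// Shape of the per-group scales: one entry per `group_size` inputs.
/// Hypothetical helper.
fn scales_shape(out_features: i32, in_features: i32, group_size: i32) -> Option<Vec<i32>> {
    if group_size <= 0 || in_features % group_size != 0 {
        return None;
    }
    Some(vec![out_features, in_features / group_size])
}

/// Axis-0 concatenation needs equal rank and identical trailing dimensions.
fn can_concatenate_axis0_shapes(a: &[i32], b: &[i32]) -> bool {
    !a.is_empty() && a.len() == b.len() && a[1..] == b[1..]
}
```

With those example dimensions, a q4 in_proj_a packs to [32, 256] while a q8 in_proj_b packs to [32, 512], so the trailing dimensions disagree and axis-0 concatenation is rejected; a group_size mismatch is caught the same way through the scales' shapes.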

PR-1c of the feat/magic-canvas split: a deferred follow-up from #141's audit (source commit 061e500c).

Behaviour

  1. Default fused path: try load_qwen3_5_moe_weights_fused (2 GDN dispatches per layer).
  2. On a ModelError::ShapeMismatch whose message contains both "in_proj_ba" and "requires separate GDN projections", retry with args.use_separate_gdn_projections = true via load_qwen3_5_moe_weights_direct (4 dispatches per layer).
  3. Forcing separate projections (via the HIGGS_SEPARATE_GDN_PROJ env var or args.use_separate_gdn_projections) skips the fused attempt entirely.
  4. Any other error propagates.
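The behaviour above can be sketched as a retry wrapper. This is a hedged illustration, not the crate's implementation: `ModelError`, `Args`, and `load_with_gdn_fallback` are simplified stand-ins, the two loaders are passed as closures in place of `load_qwen3_5_moe_weights_fused` / `load_qwen3_5_moe_weights_direct`, and the caller is assumed to have already read `HIGGS_SEPARATE_GDN_PROJ` into `force_separate_env`.

```rust
/// Simplified stand-in for the crate's error type (hypothetical).
#[derive(Debug)]
enum ModelError {
    ShapeMismatch(String),
    Other(String),
}

/// Simplified stand-in for the loader args (hypothetical).
#[derive(Clone, Debug, Default)]
struct Args {
    use_separate_gdn_projections: bool,
}

/// True only for the diagnostic the fused loader emits when in_proj_a and
/// in_proj_b cannot be concatenated; every other error is left alone.
fn is_mixed_bit_gdn_ba_fusion_error(err: &ModelError) -> bool {
    match err {
        ModelError::ShapeMismatch(msg) => {
            msg.contains("in_proj_ba") && msg.contains("requires separate GDN projections")
        }
        ModelError::Other(_) => false,
    }
}

/// Fused first, then a single retry with separate projections on the
/// mixed-bit BA diagnostic.
fn load_with_gdn_fallback<M, F, D>(
    mut args: Args,
    force_separate_env: bool,
    load_fused: F,
    load_direct: D,
) -> Result<M, ModelError>
where
    F: FnOnce(&Args) -> Result<M, ModelError>,
    D: FnOnce(&Args) -> Result<M, ModelError>,
{
    // Forced separate: skip the fused attempt entirely.
    if force_separate_env || args.use_separate_gdn_projections {
        args.use_separate_gdn_projections = true;
        return load_direct(&args);
    }
    match load_fused(&args) {
        Ok(model) => Ok(model),
        Err(err) if is_mixed_bit_gdn_ba_fusion_error(&err) => {
            eprintln!("warning: mixed-bit GDN BA weights; retrying with separate projections");
            args.use_separate_gdn_projections = true;
            load_direct(&args)
        }
        // Any other error propagates unchanged.
        Err(err) => Err(err),
    }
}
```

Matching on the exact diagnostic, rather than on any ShapeMismatch, keeps unrelated load errors (e.g. a genuinely malformed checkpoint) from being silently retried on the slower path.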

What's added

  • is_mixed_bit_gdn_ba_fusion_error — matches the diagnostic shape-mismatch error
  • qwen3_5_quantization_config — parses {group_size, bits} from the per-layer quantization map
  • qwen3_5_mixed_ba_quantization_layers — finds layers where in_proj_a and in_proj_b differ
  • can_concatenate_axis0 — guard inside load_qwen3_5_moe_weights_fused to emit the diagnostic error rather than panic
  • load_qwen3_5_model_with_gdn_fallback — private helper called by both dense and MoE load paths
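A sketch of the per-layer scan behind qwen3_5_mixed_ba_quantization_layers, assuming the overrides have already been parsed into {group_size, bits} pairs. `QuantConfig`, the function signature, and the `{prefix}.layers.{i}.linear_attn.in_proj_a` key layout are assumptions for illustration; the real code works on the JSON quantization map from config.json.

```rust
use std::collections::HashMap;

/// Hypothetical mirror of the parsed `{group_size, bits}` pair.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct QuantConfig {
    group_size: i32,
    bits: i32,
}

/// Indices of layers whose in_proj_a and in_proj_b quantization differ in
/// bits or group_size. Projections without an override use the default.
fn mixed_ba_quantization_layers(
    overrides: &HashMap<String, QuantConfig>,
    default_quant: QuantConfig,
    num_hidden_layers: usize,
    prefix: &str,
) -> Vec<usize> {
    (0..num_hidden_layers)
        .filter(|layer_idx| {
            let lookup = |proj: &str| {
                let key = format!("{prefix}.layers.{layer_idx}.linear_attn.{proj}");
                overrides.get(&key).copied().unwrap_or(default_quant)
            };
            lookup("in_proj_a") != lookup("in_proj_b")
        })
        .collect()
}
```

A non-empty result is what would flip use_separate_gdn_projections on at config time, before any fused load is attempted.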

Adaptations to origin/main

  • The dense load_qwen3_5_model on origin/main only honoured the env var; it now also honours args.use_separate_gdn_projections, matching the MoE behaviour. This is a strict improvement, since the flag is set only by the env var or by mixed-bit detection.
  • A direct cherry-pick of 061e500c produced 5 conflict regions because origin/main has evolved the load functions independently, so this is a manual surgical port that preserves origin/main's structure.

Hygiene

  • No unwrap() on Result/Option
  • No as casts in shape-arithmetic paths (i32::try_from)
  • Match arms enumerate variants explicitly
  • No file-level blanket allows added

Test plan

  • cargo check -p higgs-models — clean
  • cargo clippy --all-targets --all-features -- -D warnings — clean (rustc 1.95.0, matches CI)
  • cargo fmt --check — clean
  • cargo test -p higgs-models --lib — 333/333 pass (3 new)
  • End-to-end load test against an unsloth dynamic-quant Qwen3.5 model (e.g. Brooooooklyn/Qwen3.5-27B-unsloth-mlx) — confirm the warn-then-retry path triggers and the model loads.

Context

Part of the feat/magic-canvas PR split. Prior PRs in the chain: #141 (fused MoE), #142 (Bonsai-Q1 fp16 +2.83×), #143 (AnyCache::trim_by), #144 (DraftModel trait), #145 (PLD drafter), #147 (DFlash drafter foundation).

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes

    • Improved Qwen3.5 model loading to detect per-layer mixed-bit quantization and automatically fall back to a compatible loading path when fusion is invalid, reducing load failures and shape-mismatch errors.
    • Added safeguards to prevent incorrect tensor fusion during model import and clearer logging of fused vs. separate projection outcomes.
  • Tests

    • Added tests covering mixed-bit quantization, forced separate-projection scenarios, fusion-preservation, and shape-mismatch detection.

coderabbitai Bot (Contributor) commented May 5, 2026

Warning

Rate limit exceeded

@panbanda has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 53 minutes and 5 seconds before requesting another review.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: b51490d9-a8e4-43bb-a408-dacf55d33c97

📥 Commits

Reviewing files that changed from the base of the PR and between 358ac62 and dd9cf7a.

📒 Files selected for processing (1)
  • crates/higgs-models/src/qwen3_next.rs
📝 Walkthrough

Walkthrough

Qwen3.5 MoE/VLM loading now detects per-layer mixed-bit BA quantization for GDN projections during config parsing and uses a runtime path that tries fused projection loading first, then falls back to separate projections when BA fusion is invalid due to mismatched shapes or quantization.

Changes

GDN Mixed-Bit Quantization Handling

  • Config Detection & Helpers (crates/higgs-models/src/qwen3_next.rs, lines 3757–3797): Adds qwen3_5_quantization_config and qwen3_5_mixed_ba_quantization_layers to parse per-layer quantization overrides and identify layers where in_proj_a and in_proj_b differ in bits/group_size. load_qwen3_5_moe_text_config_args now sets use_separate_gdn_projections from the env var or detected mixed-bit layers.
  • Load Entry Points (crates/higgs-models/src/qwen3_next.rs, lines 3825–3828): load_qwen3_5_model and the MoE entry point now route through a new fallback loader instead of branching directly on use_separate_gdn_projections.
  • Runtime Fallback Orchestration (crates/higgs-models/src/qwen3_next.rs, lines 3866–3919): Introduces load_qwen3_5_model_with_gdn_fallback, which attempts fused GDN loading, detects mixed-bit BA fusion errors via is_mixed_bit_gdn_ba_fusion_error, and retries with separate GDN projections when appropriate.
  • Fused Loader Integration (crates/higgs-models/src/qwen3_next.rs, lines 4166–4181): The fused GDN loader now calls can_concatenate_axis0 before concatenation; if the shapes are incompatible, it returns a ShapeMismatch error stating that separate GDN projections are required, which triggers the fallback.
  • Shape Validation Guard (crates/higgs-models/src/qwen3_next.rs, lines 4019–4040): Adds the can_concatenate_axis0_shapes and can_concatenate_axis0 utilities to validate axis-0 concatenation compatibility (including quantized inner-shape cases).
  • Error Detection Helper (crates/higgs-models/src/qwen3_next.rs, lines 3921–3930): Adds is_mixed_bit_gdn_ba_fusion_error to recognize fusion failures that should trigger the fallback.
  • Tests & Validation (crates/higgs-models/src/qwen3_next.rs, lines 12955–13049): Adds tests verifying that mixed-bit BA forces separate GDN projections, matching BA preserves fused GDN, an explicit separate-GDN config is preserved, and can_concatenate_axis0 detects mismatched quantized inner shapes.

Sequence Diagram

sequenceDiagram
    participant Config as Config Parser
    participant Runtime as Runtime Loader
    participant FusedLoader as Fused GDN Loader
    participant DirectLoader as Direct Loader
    participant Fallback as Fallback Handler

    Config->>Config: Scan per-layer quantization overrides
    Config->>Runtime: Set use_separate_gdn_projections flag

    Runtime->>Runtime: Load model config
    alt use_separate_gdn_projections forced
        Runtime->>DirectLoader: Load with separate projections
        DirectLoader-->>Runtime: Weights loaded
    else Attempt fused path
        Runtime->>FusedLoader: Attempt fused GDN loading
        FusedLoader->>FusedLoader: Check can_concatenate_axis0(in_proj_a, in_proj_b)
        alt Shapes compatible
            FusedLoader->>FusedLoader: Fuse projections
            FusedLoader-->>Runtime: Weights loaded (fused)
        else Shapes incompatible or mixed-bit
            FusedLoader-->>Fallback: Return ShapeMismatch / fusion error
            Fallback->>Runtime: Rebuild config with separate_gdn=true
            Runtime->>DirectLoader: Reload via direct loader
            DirectLoader-->>Runtime: Weights loaded (separate)
        end
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 Hopping through bits where projections meet,

When BA mismatches make fused paths retreat,
We try the fast fuse, then gently divide—
Separate projections stand ready, bona fide.
A rabbit cheers: safe loading, stride by stride!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)

  • Description Check: ✅ Passed. Check skipped; CodeRabbit's high-level summary is enabled.
  • Title check: ✅ Passed. The title 'feat(qwen3_next): mixed-bit Qwen3.5 GDN BA loading fallback' directly and clearly summarizes the main change: adding a fallback mechanism for loading Qwen3.5 models with mixed-bit quantization in GDN BA projections.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%.
  • Linked Issues check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check: ✅ Passed. Check skipped because no linked issues were found for this pull request.


coderabbitai Bot (Contributor) left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/higgs-models/src/qwen3_next.rs`:
- Around line 3731-3736: The code currently unconditionally computes
use_separate and inserts "use_separate_gdn_projections" into map, overwriting
any config-provided value; change this to preserve an existing config entry:
before calling map.insert for the key "use_separate_gdn_projections" (around the
variable use_separate and map.insert), check whether map already contains that
key or whether map.get("use_separate_gdn_projections") is
Some(serde_json::Value::Bool(_)); only insert the computed
serde_json::Value::from(use_separate) when the key is absent (or not a boolean),
so the config-provided true/false is not clobbered by the env/mixed-bit
detection logic.
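One way to apply the suggested fix, sketched with a hypothetical two-variant `Value` enum standing in for serde_json::Value so it runs with std only: compute the flag from the env var and the mixed-bit scan, but write it only when the config has no boolean use_separate_gdn_projections of its own. `set_separate_gdn_flag` is an illustrative name, not the crate's function.

```rust
use std::collections::HashMap;

/// Hypothetical two-variant stand-in for serde_json::Value.
#[derive(Clone, Debug, PartialEq)]
enum Value {
    Bool(bool),
    Other(String),
}

const KEY: &str = "use_separate_gdn_projections";

/// Write the computed flag only when the config lacks a boolean value for
/// the key, so an explicit true/false from config.json is never clobbered.
/// A missing key or a non-boolean value is replaced.
fn set_separate_gdn_flag(map: &mut HashMap<String, Value>, env_forced: bool, mixed_layers: &[usize]) {
    let computed = env_forced || !mixed_layers.is_empty();
    let has_config_bool = matches!(map.get(KEY), Some(Value::Bool(_)));
    if !has_config_bool {
        map.insert(KEY.to_string(), Value::Bool(computed));
    }
}
```

If a config explicitly sets false on a checkpoint that does have mixed BA layers, the fused attempt would hit the shape-mismatch diagnostic and, per the fallback described in this PR, the runtime retry would still switch to separate projections.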

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: d5bd8595-b442-4013-9fb3-29e3c1e8d6e1

📥 Commits

Reviewing files that changed from the base of the PR and between 3d5a136 and 7497649.

📒 Files selected for processing (1)
  • crates/higgs-models/src/qwen3_next.rs

Comment thread crates/higgs-models/src/qwen3_next.rs Outdated
Adds a fallback path for loading Qwen3.5 models with mixed-bit GDN
projection weights (some layers q4, some q8 — common in unsloth's
dynamic-quant variants). The default fused-projection loader fuses
`in_proj_a` + `in_proj_b` into a single matmul; mixed-bit weights
have incompatible shapes and the fusion fails.

Behaviour:

  1. Detect via `is_mixed_bit_gdn_ba_fusion_error` — matches a
     `ModelError::ShapeMismatch` whose message contains both
     `in_proj_ba` and `requires separate GDN projections`.

  2. On detection, retry the load with
     `args.use_separate_gdn_projections = true`, taking the
     `load_qwen3_5_moe_weights_direct` path. Forward dispatches go from
     2 to 4 GDN ops per layer: slightly slower, but correct.

  3. Forced separate (via `args.use_separate_gdn_projections` config or
     `HIGGS_SEPARATE_GDN_PROJ` env var) skips the fused attempt
     entirely.

Also adds:

  * `qwen3_5_quantization_config` — parses `{group_size, bits}` from
    the per-layer `quantization` map in `config.json`.
  * `qwen3_5_mixed_ba_quantization_layers` — scans for the layers
    where `in_proj_a` and `in_proj_b` differ in bits or group_size.
  * `can_concatenate_axis0` — guard used inside
    `load_qwen3_5_moe_weights_fused` to emit the diagnostic
    `ShapeMismatch` error rather than panicking on the concat.
  * `load_qwen3_5_model_with_gdn_fallback` — private helper called by
    both `load_qwen3_5_model` (dense) and `load_qwen3_5_moe_model`
    (MoE), unifying the fallback path.

Adaptations from feat/magic-canvas → origin/main:

  * The dense `load_qwen3_5_model` previously only honoured the env
    var; now it honours `args.use_separate_gdn_projections` too,
    matching the MoE path. Strict improvement: the config flag is set
    only by the env var or by mixed-bit detection.

  * No `unwrap()`, no `as` casts (use `i32::try_from`); match arms
    enumerate variants. No file-level allows added.

Verification on origin/main (rustc 1.95.0):

  * `cargo check -p higgs-models` — clean
  * `cargo clippy --all-targets --all-features -- -D warnings` — clean
  * `cargo fmt --check` — clean
  * `cargo test -p higgs-models --lib` — 333/333 pass (3 new)

Source: feat/magic-canvas commit `061e500c`. Direct cherry-pick had 5
conflict regions because origin/main has evolved the load functions
independently; this is a manual surgical port that preserves
origin/main's structure while adding the fallback behaviour.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
panbanda force-pushed the dusterbloom/qwen3-mixed-bit-gdn branch from 7497649 to 358ac62 on May 6, 2026 12:25

coderabbitai Bot (Contributor) left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/higgs-models/src/qwen3_next.rs`:
- Around line 3787-3797: The current layer-scan in the closure only looks up
quantization keys with the "language_model.model.layers.{layer_idx}.linear_attn"
prefix and therefore misses checkpoints that use the unprefixed
"model.layers.{layer_idx}.linear_attn" layout; update the closure (the filter
over (0..num_hidden_layers) that computes a_quant and b_quant) to try both key
forms when calling quant.get(...).and_then(qwen3_5_quantization_config) (i.e.,
check quant.get(prefixed_key) first and fall back to quant.get(unprefixed_key)
before defaulting to default_quant.clone()), mirroring the behavior already used
by gate_quantization_override so mixed-BA detection runs for both prefixed and
unprefixed checkpoints.
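A minimal sketch of the two-layout lookup the comment asks for. `QuantConfig` and `linear_attn_quant` are hypothetical names, and the `.linear_attn.in_proj_a` / `.in_proj_b` key suffixes are assumed; only the prefixed-then-unprefixed order and the fallback to the default come from the review comment.

```rust
use std::collections::HashMap;

/// Hypothetical mirror of the parsed `{group_size, bits}` pair.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct QuantConfig {
    group_size: i32,
    bits: i32,
}

/// Look up a GDN projection's quantization override, trying the
/// `language_model.model.` prefixed key first, then the bare `model.` key,
/// and finally falling back to the model-wide default.
fn linear_attn_quant(
    overrides: &HashMap<String, QuantConfig>,
    default_quant: QuantConfig,
    layer_idx: usize,
    proj: &str,
) -> QuantConfig {
    let prefixed = format!("language_model.model.layers.{layer_idx}.linear_attn.{proj}");
    let unprefixed = format!("model.layers.{layer_idx}.linear_attn.{proj}");
    overrides
        .get(&prefixed)
        .or_else(|| overrides.get(&unprefixed))
        .copied()
        .unwrap_or(default_quant)
}
```

Comparing linear_attn_quant(.., "in_proj_a") with linear_attn_quant(.., "in_proj_b") per layer then detects mixed BA layers regardless of which key layout the checkpoint uses.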
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 5fe2c153-be5c-448d-bb7c-f4c105efe65d

📥 Commits

Reviewing files that changed from the base of the PR and between 7497649 and 358ac62.

📒 Files selected for processing (1)
  • crates/higgs-models/src/qwen3_next.rs

Comment thread crates/higgs-models/src/qwen3_next.rs Outdated
panbanda force-pushed the dusterbloom/qwen3-mixed-bit-gdn branch from 358ac62 to 45ece14 on May 6, 2026 12:30
panbanda force-pushed the dusterbloom/qwen3-mixed-bit-gdn branch from 45ece14 to dd9cf7a on May 6, 2026 12:32
panbanda merged commit cc18616 into panbanda:main on May 6, 2026
6 checks passed
github-actions Bot mentioned this pull request on May 6, 2026

2 participants