Add Qwen3.5-MoE (35B-A3B) model support by tanzeel-amd · Pull Request #2146 · microsoft/onnxruntime-genai

tanzeel-amd · 2026-05-08T10:25:35Z

Model builder and runtime support for Qwen3.5-MoE

- Added Qwen35MoeTextModel builder class for Qwen3_5MoeForConditionalGeneration architecture with 256 experts, shared expert, and SwiGLU activation
- Registered qwen3_5_moe as VLM type in C++ runtime (model_type.h, model.cpp)
- Added architecture dispatch in builder.py for Qwen3_5MoeForConditionalGeneration
- Key implementation details:
  - Repacks HF concatenated gate_up_proj to ORT interleaved format (swiglu_fusion=1)
  - Shared expert implemented as separate SiLU MLP path with sigmoid gating
  - Router uses bias-free MatMul matching Qwen3_5MoeTopKRouter
  - QMoE symmetric blockwise quantization without explicit zero_points
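The gate_up_proj repacking listed above can be illustrated with a small NumPy sketch. It is not the PR's code: the function name `interleave_gate_up`, the `(num_experts, 2 * inter, hidden)` layout, and the gate-first row order are assumptions made for illustration, and the exact layout expected with swiglu_fusion=1 should be checked against the ONNX Runtime operator documentation.

```python
import numpy as np


def interleave_gate_up(gate_up: np.ndarray) -> np.ndarray:
    """Turn [gate rows; up rows] into gate_0, up_0, gate_1, up_1, ...

    Assumed input layout (hypothetical, for illustration only):
    shape (num_experts, 2 * inter, hidden), where the first `inter` rows of
    each expert are gate_proj and the remaining rows are up_proj.
    """
    num_experts, two_inter, hidden = gate_up.shape
    if two_inter % 2 != 0:
        raise ValueError("fused gate/up dimension must be even")
    inter = two_inter // 2

    gate = gate_up[:, :inter, :]
    up = gate_up[:, inter:, :]

    # Stack on a new axis right after the row axis, then flatten that axis
    # away so consecutive rows alternate gate_i, up_i.
    stacked = np.stack([gate, up], axis=2)  # (experts, inter, 2, hidden)
    return stacked.reshape(num_experts, two_inter, hidden)


# One expert, inter = 2, hidden = 1: gate rows are 1, 2 and up rows are 10, 20.
example = np.array([[[1.0], [2.0], [10.0], [20.0]]])
repacked = interleave_gate_up(example)
```

In this toy case the rows come out in the order gate_0, up_0, gate_1, up_1, which is the interleaving the description refers to.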

Copilot

Pull request overview

Adds builder and runtime plumbing for Qwen3.5-MoE export and inference. The PR integrates a new MoE-capable Qwen3.5 builder into the Python model builder and registers the corresponding model type in the C++ runtime, so that the correct multi-modal processor and model-family behaviors are selected at runtime.

Changes:

- Introduces Qwen35MoeTextModel in the Python model builder with fused MoE/QMoE graph construction (router + routed experts + shared expert).
- Adds builder dispatch/import wiring so Qwen3_5MoeForConditionalGeneration can be exported via builder.py.
- Registers qwen3_5_moe in the C++ runtime as a VLM/Qwen-VL-family type and wires it to the existing Qwen image processor.
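As a reference for the router step in the fused MoE graph, here is a minimal NumPy sketch of bias-free top-k routing. It assumes softmax over all experts followed by top-k selection and renormalization; that ordering, the function name `route_top_k`, and the shapes are assumptions and should be checked against Qwen3_5MoeTopKRouter in Hugging Face Transformers.

```python
import numpy as np


def route_top_k(hidden: np.ndarray, router_weight: np.ndarray, top_k: int):
    """Pick top_k experts per token from a bias-free linear router.

    hidden:        (tokens, hidden_size)
    router_weight: (num_experts, hidden_size), no bias term
    Returns (expert indices, normalized routing weights), both (tokens, top_k).
    """
    logits = hidden @ router_weight.T  # plain MatMul, no bias add

    # Numerically stable softmax over the expert axis.
    logits = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(logits)
    probs /= probs.sum(axis=-1, keepdims=True)

    # Highest-probability experts first.
    idx = np.argsort(-probs, axis=-1, kind="stable")[:, :top_k]
    weights = np.take_along_axis(probs, idx, axis=-1)

    # Renormalize so the selected weights sum to 1 per token (assumed behavior).
    weights /= weights.sum(axis=-1, keepdims=True)
    return idx, weights


# One token, three experts with logits 0, 2 and 1.
tokens = np.array([[1.0, 0.0]])
w_router = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 0.0]])
expert_idx, expert_w = route_top_k(tokens, w_router, top_k=2)
```

The returned indices and weights are what a fused MoE operator would use to dispatch each token to its experts and scale their outputs.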

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| src/python/py/models/builders/qwen.py | Adds `Qwen35MoeTextModel` and fused MoE/QMoE subgraph generation for Qwen3.5-MoE. |
| src/python/py/models/builders/__init__.py | Exports `Qwen35MoeTextModel` from the builders package. |
| src/python/py/models/builder.py | Adds architecture dispatch for `Qwen3_5MoeForConditionalGeneration`. |
| src/models/model.cpp | Registers `qwen3_5_moe` with `MultiModalProcessor`'s processor factory. |
| src/models/model_type.h | Adds `qwen3_5_moe` to VLM and Qwen-VL-family classification helpers. |

tanzeel-amd · 2026-05-08T10:44:03Z

@microsoft-github-policy-service agree company="AMD"

VishalX · 2026-05-18T09:40:32Z

@kunal-vaishnavi pls review this. Olive recipes are added here: microsoft/olive-recipes#405

baijumeswani · 2026-05-18T16:46:38Z

@tanzeel-amd could you please address copilot review comments if relevant?

tanzeel-amd · 2026-05-19T08:26:42Z

@baijumeswani resolved the copilot comments. Please review.

VishalX · 2026-05-19T08:48:27Z

@baijumeswani resolved the copilot comments. Please review.

@kunal-vaishnavi

baijumeswani · 2026-05-21T17:34:17Z

@tanzeel-amd could you please resolve the merge conflict?

tanzeel-amd · 2026-05-22T11:58:37Z

Resolved conflicts @baijumeswani

- Added Qwen35MoeTextModel builder class for Qwen3_5MoeForConditionalGeneration architecture with 256 experts, shared expert, and SwiGLU activation
- Registered qwen3_5_moe as VLM type in C++ runtime (model_type.h, model.cpp)
- Added architecture dispatch in builder.py for Qwen3_5MoeForConditionalGeneration
- Key implementation details:
  - Repacks HF concatenated gate_up_proj to ORT interleaved format (swiglu_fusion=1)
  - Shared expert implemented as separate SiLU MLP path with sigmoid gating
  - Router uses bias-free MatMul matching Qwen3_5MoeTopKRouter
  - QMoE symmetric blockwise quantization without explicit zero_points
- Also includes existing gemma.py rope_local_base_freq fix for TranslateGemma
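The "symmetric blockwise quantization without explicit zero_points" item can be sketched in NumPy as follows. This is only an illustration of the technique: the function names, the block size of 32, and the signed int4 range are assumptions, and the packed storage format that the QMoE operator actually consumes is not shown.

```python
import numpy as np


def quantize_symmetric_blockwise(w: np.ndarray, block_size: int = 32, bits: int = 4):
    """Quantize each row of `w` in blocks along the last axis.

    Symmetric means the zero point is fixed at 0, so only a per-block scale
    is stored and dequantization is simply q * scale.
    """
    rows, cols = w.shape
    if cols % block_size != 0:
        raise ValueError("cols must be a multiple of block_size")
    qmax = 2 ** (bits - 1) - 1  # 7 for int4

    blocks = w.reshape(rows, cols // block_size, block_size)
    absmax = np.abs(blocks).max(axis=-1, keepdims=True)
    # Avoid division by zero for all-zero blocks.
    scales = np.where(absmax > 0, absmax / qmax, 1.0)

    q = np.clip(np.rint(blocks / scales), -qmax - 1, qmax).astype(np.int8)
    return q.reshape(rows, cols), scales.squeeze(-1)


def dequantize_blockwise(q: np.ndarray, scales: np.ndarray, block_size: int = 32):
    """Reverse the mapping: no zero point is subtracted."""
    rows, cols = q.shape
    blocks = q.reshape(rows, cols // block_size, block_size).astype(np.float32)
    return (blocks * scales[..., None]).reshape(rows, cols)


rng = np.random.default_rng(0)
weights = rng.standard_normal((2, 64))
q_weights, q_scales = quantize_symmetric_blockwise(weights)
restored = dequantize_blockwise(q_weights, q_scales)
```

Because the scale is derived from the largest absolute value in each block, the rounding error per element is bounded by half of that block's scale.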

…ly, set model_type in __init__
- model_type.h: Merge duplicate copyright lines into 2025-2026 range
- model_type.h: Rewrite IsQwenVLFamily to use std::array + std::find consistent with other methods
- qwen.py: Set model_type in __init__ for both Qwen35TextModel and Qwen35MoeTextModel instead of hardcoding in make_genai_config. Removes the make_genai_config override entirely.

Co-authored-by: Cursor <cursoragent@cursor.com>

…eric int4 config
- base.py: Add make_fused_moe() supporting router with/without bias, 2-weight SwiGLU layout with interleaving, and optional shared expert. Add make_shared_expert() using wrapper methods (make_sigmoid, make_mul, etc.). Move MoE /mlp/ int4 config cleanup into make_int4_algo_config().
- qwen.py: Remove _make_moe_fused (~150 lines) and make_moe dispatcher. Replace with single make_fused_moe() call from base class. Remove int4 algo cleanup from __init__ (now in base).

Co-authored-by: Cursor <cursoragent@cursor.com>

Per reviewer feedback, MoE builders in this codebase follow a model-specific pattern rather than a shared base class method. Moved make_moe, make_shared_expert, and int4 config cleanup back to Qwen35MoeTextModel. Retained use of wrapper methods (make_sigmoid, make_mul, make_add) instead of raw make_node/make_value.

Co-authored-by: Cursor <cursoragent@cursor.com>
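For orientation, the computation that a shared-expert subgraph built from sigmoid, mul, and add nodes would express can be written as a NumPy reference. This is a sketch under assumptions, not the PR's make_shared_expert: the function and parameter names are hypothetical, and the structure (a SiLU-gated MLP scaled by a per-token sigmoid gate, then added to the routed output) follows the description "separate SiLU MLP path with sigmoid gating".

```python
import numpy as np


def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))


def silu(x: np.ndarray) -> np.ndarray:
    return x * sigmoid(x)


def shared_expert(x, w_gate, w_up, w_down, w_shared_gate):
    """SiLU-gated MLP scaled by a per-token sigmoid gate.

    x:             (tokens, hidden)
    w_gate, w_up:  (inter, hidden)
    w_down:        (hidden, inter)
    w_shared_gate: (1, hidden), bias-free (assumed)
    """
    mlp_out = (silu(x @ w_gate.T) * (x @ w_up.T)) @ w_down.T
    gate = sigmoid(x @ w_shared_gate.T)  # (tokens, 1), broadcasts over hidden
    return gate * mlp_out


def combine_with_routed(routed_out, x, w_gate, w_up, w_down, w_shared_gate):
    """Final MoE block output: routed experts plus the gated shared expert."""
    return routed_out + shared_expert(x, w_gate, w_up, w_down, w_shared_gate)


# Tiny case: hidden = inter = 1, all projection weights 1, shared gate weight 0.
x_in = np.array([[2.0]])
ones = np.array([[1.0]])
shared_out = shared_expert(x_in, ones, ones, ones, np.array([[0.0]]))
total = combine_with_routed(np.array([[1.0]]), x_in, ones, ones, ones, np.array([[0.0]]))
```

With a zero shared-gate weight the sigmoid evaluates to 0.5, so the shared expert contributes half of its MLP output on top of the routed result.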

Copilot AI review requested due to automatic review settings May 8, 2026 10:25

tanzeel-amd requested a review from a team as a code owner May 8, 2026 10:25

Copilot started reviewing on behalf of tanzeel-amd May 8, 2026 10:26

Copilot AI reviewed May 8, 2026

Comment thread src/python/py/models/builders/qwen.py

xieofxie mentioned this pull request May 12, 2026

Add Qwen3.5-35B-A3B MoE VLM ONNX export recipe microsoft/olive-recipes#405

Open

baijumeswani added the 0.14.0 label May 18, 2026

tanzeel-amd force-pushed the turrahma/qwen3.5-moe-support branch from 9d80321 to 0be688b May 19, 2026 08:21