Add Qwen3.5-35B-A3B MoE VLM ONNX export recipe #405

Open
tanzeel-amd wants to merge 7 commits into microsoft:main from tanzeel-amd:turrahma/qwen3.5-moe-35B-A3B

Conversation

tanzeel-amd commented May 8, 2026

  • Olive recipe for exporting Qwen/Qwen3.5-35B-A3B (256 experts, 8 routed + 1 shared)
  • Three sub-model pipeline: text decoder (INT4 QMoE), embedding (FP32), vision (FP32)
  • Custom ONNX-export-friendly MoE model class (codes/modeling_qwen3_5_moe.py)
  • Inference script with text, image, interactive, and benchmark modes
  • Requires ORT GenAI built with qwen3_5_moe support
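
To make the "256 experts, 8 routed + 1 shared" layout concrete, here is a minimal pure-Python sketch of top-k expert routing for a single token. It is an illustration only, not the code in codes/modeling_qwen3_5_moe.py. The softmax-then-renormalise weighting and the ungated shared expert are assumptions, and `route_token` and `moe_forward` are hypothetical names.

```python
import math

NUM_EXPERTS = 256  # expert count stated in the PR description
TOP_K = 8          # routed experts per token stated in the PR description


def route_token(router_logits, top_k=TOP_K):
    """Return [(expert_index, weight), ...] for the top_k routed experts.

    Assumption: softmax over all experts, then renormalise over the chosen ones.
    """
    # Numerically stable softmax over every expert logit.
    peak = max(router_logits)
    exps = [math.exp(x - peak) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the indices of the top_k most probable experts.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]

    # Renormalise so the selected weights sum to 1.
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


def moe_forward(x, router_logits, experts, shared_expert):
    """Weighted sum of the routed experts plus the always-on shared expert.

    Assumption: the shared expert is added without a gate; a real model may gate it.
    """
    routed = sum(w * experts[i](x) for i, w in route_token(router_logits))
    return routed + shared_expert(x)


if __name__ == "__main__":
    # Toy scalar "experts": expert i multiplies its input by i.
    experts = [lambda x, i=i: i * x for i in range(NUM_EXPERTS)]
    logits = [float(i % 17) for i in range(NUM_EXPERTS)]
    picks = route_token(logits)
    print(len(picks), sum(w for _, w in picks))
    print(moe_forward(1.0, logits, experts, lambda x: x))
```

Only 8 of the 256 routed experts run for each token, which is why the model name distinguishes 35B total parameters from roughly 3B active ones.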

Ur Rahman and others added 7 commits April 14, 2026 03:25
- Olive recipe for exporting Qwen/Qwen3.5-35B-A3B (256 experts, 8 routed + 1 shared)
- Three sub-model pipeline: text decoder (INT4 QMoE), embedding (FP32), vision (FP32)
- Custom ONNX-export-friendly MoE model class (codes/modeling_qwen3_5_moe.py)
- Inference script with text, image, interactive, and benchmark modes
- Requires ORT GenAI built with qwen3_5_moe support (see DEBUG_STATUS.md)
Copilot AI review requested due to automatic review settings May 8, 2026 10:34
Contributor

Copilot AI left a comment

Pull request overview

Adds a new Olive recipe to export and run Qwen/Qwen3.5-35B-A3B as a three-submodel ONNX Runtime GenAI pipeline (vision encoder + embedding fusion + INT4 text decoder), including a custom ONNX-export-friendly MoE model shell and an inference/benchmark script.

Changes:

  • Introduces a custom Qwen3_5MoeModel implementation used for ONNX export of the vision and embedding submodels.
  • Adds Olive JSON pipelines for exporting/optimizing vision.onnx, embedding.onnx, and building text.onnx via ModelBuilder (INT4).
  • Adds end-to-end optimize.py config generation and inference.py runner with interactive + benchmark + optional PyTorch comparison.
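
The "embedding fusion" step in the overview can be pictured as follows: the embedding submodel looks up text token embeddings, and the rows at image placeholder positions are overwritten with the vision encoder's output. The sketch below assumes that behaviour; it is not taken from the PR's embedding.onnx graph. `IMAGE_TOKEN_ID` and `fuse_embeddings` are hypothetical names, and the placeholder id is made up.

```python
IMAGE_TOKEN_ID = -1  # hypothetical placeholder id, not the model's real token id


def fuse_embeddings(input_ids, text_embeds, image_features,
                    image_token_id=IMAGE_TOKEN_ID):
    """Replace placeholder rows in text_embeds with image feature rows, in order."""
    n_slots = sum(1 for tok in input_ids if tok == image_token_id)
    if n_slots != len(image_features):
        raise ValueError(
            f"{n_slots} image slots but {len(image_features)} feature rows"
        )
    features = iter(image_features)
    # Walk the sequence once; each placeholder consumes the next feature row.
    return [
        next(features) if tok == image_token_id else emb
        for tok, emb in zip(input_ids, text_embeds)
    ]


if __name__ == "__main__":
    ids = [5, IMAGE_TOKEN_ID, IMAGE_TOKEN_ID, 7]
    text = [[0.5], [0.0], [0.0], [0.7]]
    image = [[1.0], [2.0]]
    print(fuse_embeddings(ids, text, image))
```

The fused sequence is what the text decoder would consume in place of raw token ids, which is one reason a pipeline like this splits into three separate submodels.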

Reviewed changes

Copilot reviewed 8 out of 9 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| Qwen-Qwen3.5-35B-A3B/LICENSE | Adds upstream Apache-2.0 license text for the recipe content. |
| Qwen-Qwen3.5-35B-A3B/builtin/user_script.py | Provides Olive model loaders + dummy inputs for exporting embedding/vision via a custom model shell. |
| Qwen-Qwen3.5-35B-A3B/builtin/optimize.py | Orchestrates Olive runs and patches genai_config.json + writes processor_config.json + tokenizer fixups. |
| Qwen-Qwen3.5-35B-A3B/builtin/inference.py | Adds ORT GenAI inference script with interactive mode and benchmarking (optionally vs PyTorch). |
| Qwen-Qwen3.5-35B-A3B/builtin/cpu_and_mobile/text.json | Olive pipeline to build INT4 text decoder via ModelBuilder. |
| Qwen-Qwen3.5-35B-A3B/builtin/cpu_and_mobile/embedding.json | Olive pipeline to export embedding fusion model and apply graph surgeries/optimizations. |
| Qwen-Qwen3.5-35B-A3B/builtin/cpu_and_mobile/vision.json | Olive pipeline to export vision encoder, apply PackedAttention surgery, and optimization passes. |
| Qwen-Qwen3.5-35B-A3B/builtin/codes/modeling_qwen3_5_moe.py | Custom ONNX-export-friendly model implementation (vision + embedding shell + MoE text components). |
| Qwen-Qwen3.5-35B-A3B/builtin/codes/__init__.py | Initializes the codes module for imports. |
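
For readers unfamiliar with what "INT4" means for the text decoder, the sketch below shows generic symmetric block-wise 4-bit quantization in pure Python. It is not the scheme ModelBuilder or the QMoE operator applies; the block size of 32, the symmetric [-8, 7] range, and the function names are all assumptions for illustration.

```python
def quantize_block(values):
    """Symmetric INT4: map floats to integers in [-8, 7] with one scale per block."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    # Round to the nearest step, then clamp into the signed 4-bit range.
    quantized = [max(-8, min(7, round(v / scale))) for v in values]
    return quantized, scale


def dequantize_block(quantized, scale):
    """Recover approximate floats from the 4-bit integers and the block scale."""
    return [q * scale for q in quantized]


def quantize(weights, block_size=32):
    """Split a flat weight list into blocks and quantize each independently."""
    return [
        quantize_block(weights[start:start + block_size])
        for start in range(0, len(weights), block_size)
    ]


if __name__ == "__main__":
    weights = [i / 10 - 3.2 for i in range(64)]
    for quantized, scale in quantize(weights):
        print(scale, quantized[:4], dequantize_block(quantized, scale)[:4])
```

Each weight then needs 4 bits plus a share of one scale per block, at the cost of a rounding error of at most half a quantization step.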


Comment on lines +24 to +40
from safetensors.torch import load_file
from huggingface_hub import hf_hub_download
import glob

cfg_path = hf_hub_download(model_path, "config.json")
model_dir = os.path.dirname(cfg_path)
st_files = sorted(glob.glob(os.path.join(model_dir, "*.safetensors")))

state_dict = {}
for sf in st_files:
    tensors = load_file(sf)
    for k, v in tensors.items():
        if k.startswith("model."):
            stripped = k[6:]
            state_dict[stripped] = v
            if stripped.startswith("language_model.embed_tokens."):
                state_dict[stripped[len("language_model."):]] = v
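
The key handling in the excerpt above can be tried in isolation on a plain dictionary, without downloading any checkpoint. The sketch below mirrors only the lines shown: it strips the leading "model." and additionally registers embed_tokens entries without the "language_model." prefix. `remap_keys` and the sample key names are hypothetical, and the real loader may treat other keys differently in code not shown here.

```python
def remap_keys(tensors):
    """Strip the 'model.' prefix and alias embed_tokens without 'language_model.'."""
    prefix = "model."
    lm_prefix = "language_model."
    remapped = {}
    for key, value in tensors.items():
        if not key.startswith(prefix):
            continue  # the excerpt only handles keys under "model."
        stripped = key[len(prefix):]
        remapped[stripped] = value
        # The embedding table is stored a second time under a shorter name.
        if stripped.startswith(lm_prefix + "embed_tokens."):
            remapped[stripped[len(lm_prefix):]] = value
    return remapped


if __name__ == "__main__":
    # Hypothetical key names; the values stand in for tensors.
    sample = {
        "model.language_model.embed_tokens.weight": "E",
        "model.visual.blocks.0.attn.qkv.weight": "Q",
        "lm_head.weight": "H",
    }
    for name, value in remap_keys(sample).items():
        print(name, value)
```

Using `len(prefix)` instead of the literal `6` keeps the slice correct if the prefix ever changes.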
Comment on lines +7 to +15
"""End-to-end optimization pipeline for Qwen3.5-35B-A3B MoE VLM.

Exports three sub-models (vision encoder, text embedding, text decoder),
applies graph optimizations and INT4 quantization via Olive passes.

Usage:
    python optimize.py --config-dir cpu_and_mobile --device cpu
    python optimize.py --config-dir cpu_and_mobile --device cpu --skip-export
"""
# Copyright (C) 2026 Advanced Micro Devices, Inc. All rights reserved.
# Portions of this file consist of AI generated content.
# --------------------------------------------------------------------------
# SPDX-License-Identifier: MIT
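
The usage lines in the docstring imply a command-line interface with three flags. A minimal argparse sketch that accepts those invocations is shown below. The flag names come from the docstring, but the defaults, help strings, and the `build_parser` helper are assumptions, and the real optimize.py parser is not shown in this conversation.

```python
import argparse


def build_parser():
    """Build a parser for the flags used in the optimize.py usage lines."""
    parser = argparse.ArgumentParser(
        description="Export and optimize the three sub-models (sketch)."
    )
    parser.add_argument(
        "--config-dir",
        default="cpu_and_mobile",  # assumed default
        help="Directory holding the Olive JSON configs.",
    )
    parser.add_argument(
        "--device",
        default="cpu",  # assumed default
        help="Target device for the exported models.",
    )
    parser.add_argument(
        "--skip-export",
        action="store_true",
        help="Presumably reuses previously exported models instead of re-exporting.",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.config_dir, args.device, args.skip_export)
```

argparse exposes `--config-dir` as `args.config_dir` and `--skip-export` as the boolean `args.skip_export`.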
@tanzeel-amd
Author

@microsoft-github-policy-service agree company="AMD"

@VishalX

VishalX commented May 11, 2026

@xieofxie / @devang-ml pls review

@xieofxie
Contributor

please wait for microsoft/onnxruntime-genai#2146

4 participants