Add Qwen3.5-35B-A3B MoE VLM ONNX export recipe #405
Open
tanzeel-amd wants to merge 7 commits into
- Olive recipe for exporting Qwen/Qwen3.5-35B-A3B (256 experts, 8 routed + 1 shared)
- Three sub-model pipeline: text decoder (INT4 QMoE), embedding (FP32), vision (FP32)
- Custom ONNX-export-friendly MoE model class (codes/modeling_qwen3_5_moe.py)
- Inference script with text, image, interactive, and benchmark modes
- Requires ORT GenAI built with qwen3_5_moe support (see DEBUG_STATUS.md)
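
For readers unfamiliar with the "8 routed + 1 shared" layout, the following is a minimal pure-Python sketch of top-k expert routing with an always-on shared expert. It is illustrative only: the function names, the toy scalar experts, and the softmax over the selected logits are assumptions and are not taken from codes/modeling_qwen3_5_moe.py, which may gate and normalize differently.

```python
import math

def route_top_k(router_logits, k):
    """Return (expert_index, weight) pairs for the k largest logits.

    Weights are a softmax over the selected logits only (an assumption;
    the real model may normalize differently).
    """
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_forward(x, router_logits, routed_experts, shared_expert, k=8):
    """Weighted sum of the k selected routed experts plus the shared expert."""
    routed = sum(w * routed_experts[i](x)
                 for i, w in route_top_k(router_logits, k))
    return routed + shared_expert(x)

# Toy setup: 256 scalar "experts"; expert i multiplies its input by i.
NUM_EXPERTS = 256
experts = [lambda x, i=i: x * i for i in range(NUM_EXPERTS)]
shared = lambda x: x  # hypothetical shared expert, applied to every token
logits = [float(i % 17) for i in range(NUM_EXPERTS)]

selected = route_top_k(logits, k=8)
output = moe_forward(1.0, logits, experts, shared, k=8)
```

Only 8 of the 256 routed experts run per token in this sketch, which is why the active parameter count of such a model is far smaller than its total size.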
Contributor
Pull request overview
Adds a new Olive recipe to export and run Qwen/Qwen3.5-35B-A3B as a three-submodel ONNX Runtime GenAI pipeline (vision encoder + embedding fusion + INT4 text decoder), including a custom ONNX-export-friendly MoE model shell and an inference/benchmark script.
Changes:
- Introduces a custom `Qwen3_5MoeModel` implementation used for ONNX export of the vision and embedding submodels.
- Adds Olive JSON pipelines for exporting/optimizing `vision.onnx`, `embedding.onnx`, and building `text.onnx` via ModelBuilder (INT4).
- Adds end-to-end `optimize.py` config generation and `inference.py` runner with interactive + benchmark + optional PyTorch comparison.
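
As a rough illustration of what an "embedding fusion" stage does between the vision encoder and the text decoder, the sketch below replaces image-placeholder positions in the token embedding sequence with vision feature rows. This assumes the fusion works by placeholder substitution, as is common in vision-language models; `IMAGE_TOKEN_ID`, `fuse_embeddings`, and the toy vectors are hypothetical and do not come from this PR.

```python
IMAGE_TOKEN_ID = -1  # hypothetical placeholder id; the real id comes from the model config

def fuse_embeddings(input_ids, token_embeddings, image_features):
    """Replace each image-placeholder position with the next vision feature row."""
    n_placeholders = sum(1 for t in input_ids if t == IMAGE_TOKEN_ID)
    if n_placeholders != len(image_features):
        raise ValueError("placeholder count does not match number of image features")
    features = iter(image_features)
    return [
        next(features) if token == IMAGE_TOKEN_ID else emb
        for token, emb in zip(input_ids, token_embeddings)
    ]

# Toy 2-dimensional embeddings; real hidden sizes are much larger.
input_ids = [5, IMAGE_TOKEN_ID, IMAGE_TOKEN_ID, 9]
token_embeddings = [[0.5, 0.5], [0.0, 0.0], [0.0, 0.0], [0.9, 0.9]]
image_features = [[1.0, 2.0], [3.0, 4.0]]  # stand-in for the vision encoder output
fused = fuse_embeddings(input_ids, token_embeddings, image_features)
```

The fused sequence is what a text decoder would then consume in place of plain token embeddings.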
Reviewed changes
Copilot reviewed 8 out of 9 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `Qwen-Qwen3.5-35B-A3B/LICENSE` | Adds upstream Apache-2.0 license text for the recipe content. |
| `Qwen-Qwen3.5-35B-A3B/builtin/user_script.py` | Provides Olive model loaders + dummy inputs for exporting embedding/vision via a custom model shell. |
| `Qwen-Qwen3.5-35B-A3B/builtin/optimize.py` | Orchestrates Olive runs and patches genai_config.json + writes processor_config.json + tokenizer fixups. |
| `Qwen-Qwen3.5-35B-A3B/builtin/inference.py` | Adds ORT GenAI inference script with interactive mode and benchmarking (optionally vs PyTorch). |
| `Qwen-Qwen3.5-35B-A3B/builtin/cpu_and_mobile/text.json` | Olive pipeline to build INT4 text decoder via ModelBuilder. |
| `Qwen-Qwen3.5-35B-A3B/builtin/cpu_and_mobile/embedding.json` | Olive pipeline to export embedding fusion model and apply graph surgeries/optimizations. |
| `Qwen-Qwen3.5-35B-A3B/builtin/cpu_and_mobile/vision.json` | Olive pipeline to export vision encoder, apply PackedAttention surgery, and optimization passes. |
| `Qwen-Qwen3.5-35B-A3B/builtin/codes/modeling_qwen3_5_moe.py` | Custom ONNX-export-friendly model implementation (vision + embedding shell + MoE text components). |
| `Qwen-Qwen3.5-35B-A3B/builtin/codes/__init__.py` | Initializes the codes module for imports. |
Comment on lines +24 to +40

    from safetensors.torch import load_file
    from huggingface_hub import hf_hub_download
    import glob

    cfg_path = hf_hub_download(model_path, "config.json")
    model_dir = os.path.dirname(cfg_path)
    st_files = sorted(glob.glob(os.path.join(model_dir, "*.safetensors")))

    state_dict = {}
    for sf in st_files:
        tensors = load_file(sf)
        for k, v in tensors.items():
            if k.startswith("model."):
                stripped = k[6:]
                state_dict[stripped] = v
                if stripped.startswith("language_model.embed_tokens."):
                    state_dict[stripped[len("language_model."):]] = v

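
To make the key handling in the excerpt above easier to follow, here is a self-contained sketch of the same remapping applied to a toy dictionary, with no checkpoint download involved. The helper name `remap_keys` and the example key names are hypothetical. Skipping keys outside the `model.` prefix is also an assumption, since the excerpt shows only the `if` branch.

```python
def remap_keys(tensors):
    """Strip the leading "model." prefix and alias embed_tokens keys."""
    prefix = "model."
    lm_prefix = "language_model."
    remapped = {}
    for key, value in tensors.items():
        if not key.startswith(prefix):
            continue  # assumed: keys outside "model." are not copied
        stripped = key[len(prefix):]
        remapped[stripped] = value
        # Expose embed_tokens under a second, shorter name as well.
        if stripped.startswith(lm_prefix + "embed_tokens."):
            remapped[stripped[len(lm_prefix):]] = value
    return remapped

# Strings stand in for tensors; the key names are illustrative only.
toy = {
    "model.language_model.embed_tokens.weight": "E",
    "model.visual.patch_embed.proj.weight": "V",
    "lm_head.weight": "H",
}
result = remap_keys(toy)
```

Using `len(prefix)` instead of the literal `6` keeps the slice correct if the prefix ever changes.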
Comment on lines +7 to +15

    """End-to-end optimization pipeline for Qwen3.5-35B-A3B MoE VLM.

    Exports three sub-models (vision encoder, text embedding, text decoder),
    applies graph optimizations and INT4 quantization via Olive passes.

    Usage:
        python optimize.py --config-dir cpu_and_mobile --device cpu
        python optimize.py --config-dir cpu_and_mobile --device cpu --skip-export
    """

    # Copyright (C) 2026 Advanced Micro Devices, Inc. All rights reserved.
    # Portions of this file consist of AI generated content.
    # --------------------------------------------------------------------------
    # SPDX-License-Identifier: MIT

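
The usage lines in the docstring above imply a command line with three options. A minimal `argparse` sketch that would accept those invocations might look like the following. The defaults, help strings, and the reading of `--skip-export` as a boolean flag are assumptions; the actual parser in `optimize.py` is not shown in this excerpt.

```python
import argparse

def build_parser():
    """Build a parser that accepts the flags shown in the docstring usage lines."""
    parser = argparse.ArgumentParser(
        description="Sketch of a command line matching the optimize.py usage lines"
    )
    # Defaults below are assumptions chosen to mirror the usage examples.
    parser.add_argument("--config-dir", default="cpu_and_mobile",
                        help="directory holding the Olive JSON configs")
    parser.add_argument("--device", default="cpu", help="target device")
    parser.add_argument("--skip-export", action="store_true",
                        help="assumed meaning: skip the export step")
    return parser

# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(
    ["--config-dir", "cpu_and_mobile", "--device", "cpu", "--skip-export"]
)
```

Note that `argparse` exposes `--config-dir` and `--skip-export` as `args.config_dir` and `args.skip_export`.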
Author
@microsoft-github-policy-service agree company="AMD"
@xieofxie / @devang-ml pls review
Contributor
please wait for microsoft/onnxruntime-genai#2146