feat: performance improvement and Qwen3 support by drunkcoding · Pull Request #60 · EfficientMoE/MoE-Infinity

drunkcoding · 2025-05-11T20:03:10Z

Description

Major changes for performance improvement

Motivation

Support the latest Qwen3 MoE model
Overlap the hidden-states gather with the expert copy
Reduce torch kernel launch overhead

Type of Change

Bug fix
New feature
Breaking change
Documentation update

Checklist

I have read the CONTRIBUTION guide.
I have updated the tests (if applicable).
I have updated the documentation (if applicable).


lausannel · 2025-06-02T03:33:31Z

core/common/pytorch.h

+  return tensor_dtype;
+}
+
+inline size_t torch_dtype_size(int dtype) {


We might not need a tensor every time, so constructing one just to query its itemsize() may be unnecessarily expensive. A plain switch on the dtype avoids this:

inline size_t torch_dtype_size(int dtype) {
  switch (dtype) {
    case DTYPE_FLOAT32:
      return 4;
    case DTYPE_FLOAT16:
      return 2;
    case DTYPE_BFLOAT16:
      return 2;
    case DTYPE_FP8_E4M3FN:
      return 1;
    default:
      throw std::invalid_argument("Unknown dtype in torch_dtype_size()");
  }
}

lausannel · 2025-06-02T03:38:15Z

core/model/fused_mlp.cu

+  // std::endl; TORCH_CHECK(output.is_contiguous(), "Output tensor must be
+  // contiguous"); TORCH_CHECK(w1.is_contiguous() && w2.is_contiguous() &&
+  // w3.is_contiguous(), "Weight tensors must be contiguous");
+  // TORCH_CHECK(hidden.is_contiguous(), "Hidden tensor must be contiguous");


Just wondering, was there a specific reason for removing this?
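For context on what the commented-out checks guard against: a kernel that indexes raw pointers usually assumes a dense row-major layout, which is the property `is_contiguous()` reports. The sketch below shows that test over plain size and stride arrays. `is_dense_row_major` is a hypothetical helper written for illustration; it is not part of this PR or of PyTorch, and it simplifies the real rule (for example, it does not special-case empty tensors).

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns true when the strides describe a dense row-major layout for the
// given sizes. Hypothetical helper for illustration only.
inline bool is_dense_row_major(const std::vector<std::int64_t>& sizes,
                               const std::vector<std::int64_t>& strides) {
  if (sizes.size() != strides.size()) return false;
  std::int64_t expected = 1;  // stride the innermost dimension must have
  for (std::size_t i = sizes.size(); i-- > 0;) {
    if (sizes[i] == 1) continue;  // the stride of a size-1 dim is irrelevant
    if (strides[i] != expected) return false;
    expected *= sizes[i];
  }
  return true;
}
```

A transposed view such as sizes `{8, 4}` with strides `{1, 8}` fails this test, which is the kind of input the removed checks would have rejected before the kernel read memory in the wrong order.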

Copilot

Pull Request Overview

This PR adds support for the Qwen3 MoE model and implements several performance improvements by overlapping expert copying, introducing fused kernels, CUDA graph support, and refined memory allocators.

Added Qwen3MoeForCausalLM to model mappings and constants
Refactored expert modules with a DECLARE_MODULE macro and introduced MoEMLP using CUDA graphs
Overhauled caching allocators and fused MLP kernels for reduced overhead
Updated examples, documentation, and CI workflows for Ubuntu 22.04
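To make the caching-allocator idea concrete, here is a minimal sketch of a templated allocator that keeps freed blocks in per-size free lists and reuses them on the next request of the same size. Every name in it (`MallocBackend`, `CachingAllocator`, `Alloc`, `Free`, `cache_hits`) is an assumption made for illustration; the actual design in core/memory/caching_allocator.h is not shown on this page and may differ.

```cpp
#include <cstddef>
#include <cstdlib>
#include <new>
#include <unordered_map>
#include <vector>

// Hypothetical backend policy: malloc/free stand in for pinned-host or
// device allocation calls.
struct MallocBackend {
  static void* Allocate(std::size_t bytes) { return std::malloc(bytes); }
  static void Release(void* ptr) { std::free(ptr); }
};

// Sketch of a caching allocator templated on the backend. Freed blocks are
// cached by exact size instead of being returned to the backend.
template <typename Backend>
class CachingAllocator {
 public:
  CachingAllocator() = default;
  CachingAllocator(const CachingAllocator&) = delete;
  CachingAllocator& operator=(const CachingAllocator&) = delete;

  // Releases cached blocks only; blocks still in use belong to the caller.
  ~CachingAllocator() {
    for (auto& entry : free_blocks_) {
      for (void* ptr : entry.second) Backend::Release(ptr);
    }
  }

  void* Alloc(std::size_t bytes) {
    void* ptr = nullptr;
    auto it = free_blocks_.find(bytes);
    if (it != free_blocks_.end() && !it->second.empty()) {
      ptr = it->second.back();  // reuse a cached block of the same size
      it->second.pop_back();
      ++cache_hits_;
    } else {
      ptr = Backend::Allocate(bytes);
      if (ptr == nullptr) throw std::bad_alloc();
    }
    live_sizes_[ptr] = bytes;
    return ptr;
  }

  void Free(void* ptr) {
    auto it = live_sizes_.find(ptr);
    if (it == live_sizes_.end()) return;  // not ours: ignore
    free_blocks_[it->second].push_back(ptr);
    live_sizes_.erase(it);
  }

  std::size_t cache_hits() const { return cache_hits_; }

 private:
  std::unordered_map<std::size_t, std::vector<void*>> free_blocks_;
  std::unordered_map<void*, std::size_t> live_sizes_;
  std::size_t cache_hits_ = 0;
};
```

Expert buffers tend to come in a handful of fixed sizes, so exact-size free lists are a plausible fit for this workload; a general-purpose allocator would more likely round sizes up to buckets. The sketch is not thread-safe.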

Reviewed Changes

Copilot reviewed 45 out of 45 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| moe_infinity/common/constants.py | Added Qwen3 model to imports and mappings |
| examples/interface_example.py | Switched to chat template and cleaned dataset loading |
| core/parallel/expert_module.h | Refactored expert modules with macros and new fields |
| core/memory/caching_allocator.h | Introduced templated caching allocator |
| core/model/fused_mlp.{h,cu} | Added fused MLP CUDA kernel and launcher |
| .github/workflows/* | Upgraded Ubuntu runner from 20.04 to 22.04 |
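As a rough illustration of what a fused MLP computes, the sketch below is a CPU reference that evaluates the gate projection, up projection, activation, and down projection in a single pass. It assumes the gated (SwiGLU-style) form `w2 · (silu(w1 · x) * (w3 · x))` and row-major weight layouts; the review page only names `w1`, `w2`, `w3`, `hidden`, and `output`, so both the formula and the layouts are assumptions, and `fused_gated_mlp` is a hypothetical name. On a GPU the benefit of fusing is fewer kernel launches and fewer intermediate buffers; this host code only shows the arithmetic.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

inline float silu(float x) { return x / (1.0f + std::exp(-x)); }

// Assumed layouts (not confirmed by the PR):
//   x  : [hidden]
//   w1 : [inter x hidden]  gate projection
//   w3 : [inter x hidden]  up projection
//   w2 : [hidden x inter]  down projection
inline std::vector<float> fused_gated_mlp(const std::vector<float>& x,
                                          const std::vector<float>& w1,
                                          const std::vector<float>& w2,
                                          const std::vector<float>& w3,
                                          std::size_t hidden,
                                          std::size_t inter) {
  std::vector<float> out(hidden, 0.0f);
  for (std::size_t j = 0; j < inter; ++j) {
    // Gate and up projections share one loop over the input.
    float gate = 0.0f;
    float up = 0.0f;
    for (std::size_t k = 0; k < hidden; ++k) {
      gate += w1[j * hidden + k] * x[k];
      up += w3[j * hidden + k] * x[k];
    }
    // Activation and elementwise product, with no intermediate tensor.
    const float act = silu(gate) * up;
    // Accumulate this intermediate unit's contribution to the output.
    for (std::size_t i = 0; i < hidden; ++i) {
      out[i] += w2[i * inter + j] * act;
    }
  }
  return out;
}
```

A reference like this can also serve as a correctness oracle when comparing a CUDA kernel's output against expected values within a tolerance.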

Comments suppressed due to low confidence (1)

core/parallel/expert_dispatcher.h:49

[nitpick] The default num_threads was reduced from 8 to 1, which may degrade parallel throughput. If this is intentional, please document the rationale or expose it as a configurable parameter.

explicit ExpertDispatcher(int num_experts, int num_layers, int dtype, int expert_type, int num_threads = 1);
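One way to act on the "configurable parameter" suggestion is to resolve the thread count from an environment variable and fall back to the new default of 1. The sketch below is an assumption, not code from this PR: `resolve_num_threads` is a hypothetical helper, and the variable name `MOE_DISPATCH_THREADS` in the usage note is invented for illustration.

```cpp
#include <cstdlib>

// Parses a thread count from a string such as an environment variable value.
// Returns `fallback` when the value is missing, malformed, or out of range.
// Hypothetical helper for illustration only.
inline int resolve_num_threads(const char* env_value, int fallback = 1) {
  if (env_value == nullptr || *env_value == '\0') return fallback;
  char* end = nullptr;
  const long parsed = std::strtol(env_value, &end, 10);
  // Reject trailing garbage and values outside an arbitrary sane range.
  if (*end != '\0' || parsed < 1 || parsed > 1024) return fallback;
  return static_cast<int>(parsed);
}
```

A caller could then pass `resolve_num_threads(std::getenv("MOE_DISPATCH_THREADS"))` as the `num_threads` argument, which keeps the single-thread default while letting users opt back into more threads.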

xly and others added 30 commits February 27, 2024 14:22

update table format

000f22a

improve table clarity

c871b41

init code commit

9cd8e99

add openai api support

46cf81c

add test scripts, update readme, update api

87c3e28

Merge branch 'main' into feature/openai_api

ba9d66f

format and change to deepseek in example

9045494

fix format

72c641e

remove unused files

c218025

fix api server token id device

7b97703

fix gen broken

9906513

update readme links

5c87fe9

cancel concurrent job

9257e81

set dense node to device

18d08aa

sparse node set cpu

cc25124

Merge branch 'main' into feature/openai_api

9d0b4d8

remove OS def

de0ebf5

Merge branch 'feature/openai_api' of github.com:TorchMoE/MoE-Infinity… into feature/openai_api

ba35284

use update to date clang-format

128c30f

fix setuptools version

e5f625f

fix setuptools version for python 3.8

48324d8

keep single cuda version in publish

f73e5b0

add max length in gen openai

fe81a87

fix cache race condition

845e89d

all param init at host

ef028d8

add qwen3

eb0bb11

Merge branch 'feature/openai_api' into feature/qwen

50c9b65

ubuntu lts and build

5c7e368

pre-commit ubuntu version

cde7d3b

router weights update overlap

ea2f3b3

xly added 3 commits May 11, 2025 20:56

rename deepseek_v2 and reduce torch kernel launch

5017bcc

fix import

042b2ee

fix build and fix bug

8d190e9

drunkcoding requested a review from lausannel May 12, 2025 21:55

fix citation linebreak

d902eca

lausannel requested a review from Copilot June 2, 2025 03:31

lausannel reviewed Jun 2, 2025

xly added 4 commits June 14, 2025 13:39

fix typo

1a5e10f

fix dtype size

7916de6

remove comments

93bf9ad

fix example

33932d0

drunkcoding requested a review from Copilot June 14, 2025 12:48

Copilot AI reviewed Jun 14, 2025

core/parallel/expert_module.h (outdated)

examples/interface_example.py (outdated)

core/memory/caching_allocator.h

pr update init

823d393

lausannel reviewed Jun 15, 2025

examples/interface_example.py (outdated)

lausannel changed the title ~~Performance improvement and QWen3 Support~~ feat: performance improvement and Qwen3 support Jun 16, 2025

remove comment and unify deepseek preroute

afd0bd1

feat: performance improvement and Qwen3 support #60
drunkcoding wants to merge 40 commits into main from feature/qwen

3 participants
