feat: performance improvement and Qwen3 support #60
Open
drunkcoding wants to merge 40 commits into main
… into feature/openai_api
lausannel
reviewed
Jun 2, 2025
  return tensor_dtype;
}

inline size_t torch_dtype_size(int dtype) {
Collaborator
We might not use a tensor every time, so constructing one just to query its itemsize() may be unnecessarily expensive.
inline size_t torch_dtype_size(int dtype) {
  switch (dtype) {
    case DTYPE_FLOAT32:
      return 4;
    case DTYPE_FLOAT16:
      return 2;
    case DTYPE_BFLOAT16:
      return 2;
    case DTYPE_FP8_E4M3FN:
      return 1;
    default:
      throw std::invalid_argument("Unknown dtype in torch_dtype_size()");
  }
}

// std::endl; TORCH_CHECK(output.is_contiguous(), "Output tensor must be
// contiguous"); TORCH_CHECK(w1.is_contiguous() && w2.is_contiguous() &&
// w3.is_contiguous(), "Weight tensors must be contiguous");
// TORCH_CHECK(hidden.is_contiguous(), "Hidden tensor must be contiguous");
Collaborator
Just wondering—was there a specific reason for removing this?
added 4 commits
June 14, 2025 13:39
Pull Request Overview
This PR adds support for the Qwen3 MoE model and implements several performance improvements: overlapped expert copying, fused kernels, CUDA graph support, and refined memory allocators.
- Added `Qwen3MoeForCausalLM` to model mappings and constants
- Refactored expert modules with a `DECLARE_MODULE` macro and introduced `MoEMLP` using CUDA graphs
- Overhauled caching allocators and fused MLP kernels for reduced overhead
- Updated examples, documentation, and CI workflows for Ubuntu 22.04
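
To make the "caching allocator" item concrete, here is a minimal sketch of the general technique. It is not the code in `core/memory/caching_allocator.h`: the names `SketchCachingAllocator` and `HostMallocPolicy`, the 512-byte rounding, and the use of `malloc` in place of a device or pinned-memory allocator are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <unordered_map>
#include <vector>

// Hypothetical policy: supplies the raw allocate/release calls. A real
// policy might wrap a device or pinned-host allocator instead of malloc.
struct HostMallocPolicy {
  static void* allocate(std::size_t bytes) { return std::malloc(bytes); }
  static void release(void* ptr) { std::free(ptr); }
};

// Illustrative caching allocator: released blocks are parked in per-size
// free lists and reused for the next request of the same rounded size,
// so the underlying allocator is called less often.
template <typename Policy>
class SketchCachingAllocator {
 public:
  SketchCachingAllocator() = default;
  SketchCachingAllocator(const SketchCachingAllocator&) = delete;
  SketchCachingAllocator& operator=(const SketchCachingAllocator&) = delete;

  // Only cached (already deallocated) blocks are released here; blocks the
  // caller never returned are the caller's responsibility.
  ~SketchCachingAllocator() {
    for (auto& bucket : free_blocks_) {
      for (void* ptr : bucket.second) Policy::release(ptr);
    }
  }

  void* allocate(std::size_t bytes) {
    const std::size_t rounded = round_up(bytes);
    std::vector<void*>& bucket = free_blocks_[rounded];
    if (!bucket.empty()) {
      void* ptr = bucket.back();  // reuse a cached block of the same size
      bucket.pop_back();
      live_sizes_[ptr] = rounded;
      ++cache_hits_;
      return ptr;
    }
    void* ptr = Policy::allocate(rounded);  // cache miss: go to the policy
    if (ptr != nullptr) live_sizes_[ptr] = rounded;
    return ptr;
  }

  // Returns the block to the cache instead of releasing it immediately.
  void deallocate(void* ptr) {
    auto it = live_sizes_.find(ptr);
    if (it == live_sizes_.end()) return;  // not allocated by this instance
    free_blocks_[it->second].push_back(ptr);
    live_sizes_.erase(it);
  }

  std::size_t cache_hits() const { return cache_hits_; }

 private:
  // Round requests up to a multiple of 512 bytes (an assumed granularity)
  // so that similar sizes share one free list.
  static std::size_t round_up(std::size_t bytes) {
    constexpr std::size_t kAlign = 512;
    if (bytes == 0) bytes = 1;
    return (bytes + kAlign - 1) / kAlign * kAlign;
  }

  std::unordered_map<std::size_t, std::vector<void*>> free_blocks_;
  std::unordered_map<void*, std::size_t> live_sizes_;
  std::size_t cache_hits_ = 0;
};
```

Templating on a policy keeps the caching logic independent of where the memory lives, which is one plausible reason to template such an allocator. The sketch is single-threaded; a shared allocator would also need locking.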
Reviewed Changes
Copilot reviewed 45 out of 45 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| moe_infinity/common/constants.py | Added Qwen3 model to imports and mappings |
| examples/interface_example.py | Switched to chat template and cleaned dataset loading |
| core/parallel/expert_module.h | Refactored expert modules with macros and new fields |
| core/memory/caching_allocator.h | Introduced templated caching allocator |
| core/model/fused_mlp.{h,cu} | Added fused MLP CUDA kernel and launcher |
| .github/workflows/* | Upgraded Ubuntu runner from 20.04 to 22.04 |
Comments suppressed due to low confidence (1)
core/parallel/expert_dispatcher.h:49
- [nitpick] The default `num_threads` was reduced from 8 to 1, which may degrade parallel throughput. If this is intentional, please document the rationale or expose it as a configurable parameter.
explicit ExpertDispatcher(int num_experts, int num_layers, int dtype, int expert_type, int num_threads = 1);
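
One way to act on the "expose it as a configurable parameter" suggestion is to resolve the thread count from the environment and pass the result to the constructor. The sketch below is only an illustration: the helper names and the `MOE_DISPATCHER_THREADS` variable are hypothetical and do not appear in the PR, and the upper bound of 1024 is an arbitrary sanity limit.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical helper: parses a thread-count string and falls back to
// default_threads when the value is missing, malformed, or out of range.
inline int resolve_num_threads(const char* env_value, int default_threads = 1) {
  if (env_value == nullptr || *env_value == '\0') return default_threads;
  char* end = nullptr;
  const long parsed = std::strtol(env_value, &end, 10);
  // Reject trailing garbage ("8x"), non-numeric input, and values outside
  // the assumed sane range [1, 1024].
  if (*end != '\0' || parsed < 1 || parsed > 1024) return default_threads;
  return static_cast<int>(parsed);
}

// Hypothetical wrapper: reads the assumed MOE_DISPATCHER_THREADS variable.
inline int dispatcher_threads_from_env() {
  return resolve_num_threads(std::getenv("MOE_DISPATCHER_THREADS"));
}
```

A call site could then pass `dispatcher_threads_from_env()` as the `num_threads` argument, keeping 1 as the default while letting deployments raise it without a rebuild.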
lausannel
reviewed
Jun 15, 2025
Description
Major changes for performance improvement
Motivation
Type of Change
Checklist