feat: support Qwen3-next on npu device. #989
JC-ut0 wants to merge 13 commits into jd-opensource:main
Conversation
Code Review
This pull request introduces support for the Qwen3-next model, with extensive changes across the build system, environment setup, and core C++ components, including new layers, kernels, and model arguments. A critical security vulnerability was identified: user-supplied data in RPC requests is validated with CHECK macros, creating a denial-of-service (DoS) vector in which a malformed request can crash a worker process. These CHECK macros should be replaced with validation that returns an error status to the caller. In addition, a critical issue exists in the KV cache capacity estimation logic, where the variable names for the key and value head dimensions are swapped, potentially leading to incorrect memory allocation and runtime failures.
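The swapped head-dimension issue matters because the key and value caches are sized independently, so exchanging the two dimensions misallocates both. A minimal sketch of the idea (the struct, function name, and layout here are illustrative assumptions, not the project's actual estimator):

```cpp
#include <cassert>
#include <cstddef>

// Per-block byte counts for the key and value caches. Because the two
// caches are sized separately, swapping head_dim_k and head_dim_v would
// allocate the wrong number of bytes for each side.
struct KvCacheSizes {
  size_t key_bytes;
  size_t value_bytes;
};

KvCacheSizes KvCacheBytesPerBlock(size_t block_size, size_t n_kv_heads,
                                  size_t head_dim_k, size_t head_dim_v,
                                  size_t dtype_bytes) {
  return {block_size * n_kv_heads * head_dim_k * dtype_bytes,
          block_size * n_kv_heads * head_dim_v * dtype_bytes};
}
```

With asymmetric head dimensions (as in models where key and value heads differ), the two fields diverge, which is exactly the case the swapped-variable bug would corrupt.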
```cpp
CHECK(!(has_index_shape && (has_conv_shape || has_ssm_shape)))
    << "KVCacheShape does not support index_shape with conv/ssm shapes "
    << "simultaneously.";
```
The AllocateKVCache RPC method uses CHECK macros to validate the consistency of the request data. If a request is sent with conflicting shape information (e.g., both index_shape and conv_shape are present), the CHECK macro will fail and cause the worker process to abort. This allows an attacker who can reach the worker's RPC interface to crash the worker process, leading to a denial of service. It is recommended to replace CHECK macros with proper error handling that returns an error status to the caller instead of crashing the process.
```cpp
CHECK(has_conv_shape && has_ssm_shape)
    << "conv_shape and ssm_shape must be provided together.";
```
The AllocateKVCache RPC method uses a CHECK macro to ensure that conv_shape and ssm_shape are provided together. If a request is sent with only one of these shapes, the CHECK macro will fail and cause the worker process to abort. This is a denial of service vector. It is recommended to validate the request data and return an error status to the caller instead of using CHECK.
```cpp
CHECK(!(has_index_shape && (has_conv_shape || has_ssm_shape)))
    << "KVCacheShape does not support index_shape with conv/ssm shapes "
    << "simultaneously.";
```
The AllocateKVCacheWithTransfer RPC method uses CHECK macros to validate the consistency of the request data. If a request is sent with conflicting shape information (e.g., both index_shape and conv_shape are present), the CHECK macro will fail and cause the worker process to abort. This allows an attacker to crash the worker process. It is recommended to replace CHECK macros with proper error handling.
```cpp
CHECK(has_conv_shape && has_ssm_shape)
    << "conv_shape and ssm_shape must be provided together.";
```
The AllocateKVCacheWithTransfer RPC method uses a CHECK macro to ensure that conv_shape and ssm_shape are provided together. If a request is sent with only one of these shapes, the CHECK macro will fail and cause the worker process to abort. This is a denial of service vector. It is recommended to validate the request data and return an error status to the caller instead of using CHECK.
xllm/core/layers/npu/fused_moe.cpp
```cpp
torch::Tensor act_out;
```
Initializing an empty tensor is enough here; this code is redundant.
CMakeLists.txt
```diff
 endif()
-option(INSTALL_XLLM_KERNELS "Install xllm_kernels RPM" ON)
+option(INSTALL_XLLM_KERNELS "Install xllm_kernels RPM" OFF)
```
This is a temporary workaround; remember to revert it before merging to main.
```cmake
$ENV{NPU_HOME_PATH}/include
$ENV{ATB_HOME_PATH}/include
$ENV{NPU_HOME_PATH}/opp/vendors/xllm/op_api/include/
${CMAKE_CURRENT_SOURCE_DIR}/third_party/torch_npu_ops/
```
Need to check whether there is a better way to do this.
Force-pushed a3e3901 to 0bc39a0
/gemini review
Code Review
This pull request adds support for the "Qwen3-next" model on NPU devices. A high-severity Denial of Service (DoS) vulnerability has been identified in the RPC handlers of the "WorkerService", where "CHECK" macros used for input validation can cause the worker process to abort on invalid input, allowing remote attackers to crash the worker. Additionally, two critical bugs were found in the cache allocation logic: a typo in the "SSM" cache shape definition and a copy-paste error when handling cache shapes in the worker service. These issues need to be addressed to ensure both correctness and security, specifically by replacing "CHECK" macros with graceful error handling.
```diff
   CHECK(!(has_index_shape && (has_conv_shape || has_ssm_shape)))
       << "KVCacheShape does not support index_shape with conv/ssm shapes "
       << "simultaneously.";
   kv_cache_shape.reserve(has_conv_shape || has_ssm_shape ? 4 : 3);
   kv_cache_shape.emplace_back(
-      std::vector<int64_t>(req->kv_cache_shape().key_shape().begin(),
-                           req->kv_cache_shape().key_shape().end()));
+      std::vector<int64_t>(shape_req.key_shape().begin(),
+                           shape_req.key_shape().end()));
   kv_cache_shape.emplace_back(
-      std::vector<int64_t>(req->kv_cache_shape().value_shape().begin(),
-                           req->kv_cache_shape().value_shape().end()));
+      std::vector<int64_t>(shape_req.value_shape().begin(),
+                           shape_req.value_shape().end()));
   // add index shape if exists
-  if (req->kv_cache_shape().index_shape_size() > 0) {
+  if (has_index_shape) {
     kv_cache_shape.emplace_back(
         std::vector<int64_t>(shape_req.index_shape().begin(),
                              shape_req.index_shape().end()));
+  } else if (has_conv_shape || has_ssm_shape) {
+    CHECK(has_conv_shape && has_ssm_shape)
+        << "conv_shape and ssm_shape must be provided together.";
```
The AllocateKVCacheWithTransfer RPC handler uses CHECK macros for shape validation, which is a Denial of Service (DoS) vector allowing a remote caller to crash the worker by providing inconsistent shape flags. Proper error handling should be implemented to return a failure status without aborting the process. Additionally, there is a copy-paste error when constructing the kv_cache_shape for linear attention, where index_shape is incorrectly used instead of ssm_shape. This can lead to incorrect shape information, causing allocation failures or memory corruption.
```diff
   CHECK(!(has_index_shape && (has_conv_shape || has_ssm_shape)))
       << "KVCacheShape does not support index_shape with conv/ssm shapes "
       << "simultaneously.";
   // Reserve for key, value, and optional extra shapes
   kv_cache_shape.reserve(has_conv_shape || has_ssm_shape ? 4 : 3);
   kv_cache_shape.emplace_back(std::vector<int64_t>(
       request->key_shape().begin(), request->key_shape().end()));
   kv_cache_shape.emplace_back(std::vector<int64_t>(
       request->value_shape().begin(), request->value_shape().end()));
   // add index shape if exists
-  if (request->index_shape_size() > 0) {
+  if (has_index_shape) {
     kv_cache_shape.emplace_back(std::vector<int64_t>(
         request->index_shape().begin(), request->index_shape().end()));
+  } else if (has_conv_shape || has_ssm_shape) {
+    CHECK(has_conv_shape && has_ssm_shape)
+        << "conv_shape and ssm_shape must be provided together.";
```
The AllocateKVCache RPC handler uses the CHECK macro to validate input parameters (has_index_shape, has_conv_shape, has_ssm_shape). The CHECK macro causes the entire process to abort if the condition is not met. Since this logic operates on untrusted data received over the network, it creates a Denial of Service (DoS) vulnerability. An attacker can send a malformed request to crash the worker process. Input validation should be performed using conditional logic that returns an error response (e.g., via controller->SetFailed()) instead of crashing the process.
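The recommended pattern can be sketched as a standalone validation helper that returns an error message instead of aborting. The helper name and its use with `controller->SetFailed()` at the call site are illustrative assumptions, not the project's actual code:

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical sketch: validate the shape flags from an untrusted request
// up front. Returns an error message on inconsistency, std::nullopt when
// the combination is valid, so the caller can fail the RPC gracefully.
std::optional<std::string> ValidateKvCacheShape(bool has_index_shape,
                                                bool has_conv_shape,
                                                bool has_ssm_shape) {
  if (has_index_shape && (has_conv_shape || has_ssm_shape)) {
    return "KVCacheShape does not support index_shape with conv/ssm shapes "
           "simultaneously.";
  }
  // conv_shape and ssm_shape must both be present or both be absent.
  if (has_conv_shape != has_ssm_shape) {
    return "conv_shape and ssm_shape must be provided together.";
  }
  return std::nullopt;  // request is consistent
}
```

In the handler, the caller might then write `if (auto err = ValidateKvCacheShape(...)) { controller->SetFailed(*err); return; }`, so a malformed request produces an error response rather than crashing the worker.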
…a / Qwen3NextDecoderLayer.
feat: adjust cache allocation based on attention settings (support ssm_cache).
feat: update torch_npu_ops commit.
bugfix: layer CMake fix.
bugfix: add model arguments for enhanced configuration.
bugfix: QKV linear load fix.
bugfix: fallback pg_comm.
bugfix: set q_seq_len in prefill.
bugfix: handle optional finished tensor in moe gating and FusedMoE implementations.
bugfix: ensure activation output is correctly assigned in FusedMoE forward pass.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
/gemini review
Code Review
This pull request introduces support for the Qwen3-next model on NPU devices, which includes adding a linear attention cache. The changes are extensive, involving new model layers, kernels, and updates to the build system and data structures. My review identified a critical compilation error related to incorrect pointer access and a couple of high-severity issues where function signatures could lead to unexpected side effects by modifying input tensors. I have provided code suggestions to address these problems.
```cpp
torch::Tensor Qwen3NextRMSNormImpl::forward(torch::Tensor& input) {
  auto input_dtype = input.dtype();
  input = input.to(torch::kFloat32);

  // Calculate RMS
  auto variance = torch::mean(torch::pow(input, 2), -1, true);
  auto normalized = input * torch::rsqrt(variance + eps_);

  // Apply weight and convert back to original dtype
  return (normalized * (1.0f + weight_.to(torch::kFloat32))).to(input_dtype);
}
```
The forward method takes input as a non-const reference (torch::Tensor&) and modifies it by reassigning it with input = input.to(torch::kFloat32);. This modifies the caller's tensor, which can lead to unexpected side effects and bugs. The method should take a const torch::Tensor& and use a local variable for type conversion to avoid modifying the original tensor. Please also update the corresponding header file xllm/core/layers/common/qwen3_next_rms_norm.h.
Suggested change:

```diff
-torch::Tensor Qwen3NextRMSNormImpl::forward(torch::Tensor& input) {
-  auto input_dtype = input.dtype();
-  input = input.to(torch::kFloat32);
+torch::Tensor Qwen3NextRMSNormImpl::forward(const torch::Tensor& input) {
+  auto input_dtype = input.dtype();
+  auto float_input = input.to(torch::kFloat32);
   // Calculate RMS
-  auto variance = torch::mean(torch::pow(input, 2), -1, true);
-  auto normalized = input * torch::rsqrt(variance + eps_);
+  auto variance = torch::mean(torch::pow(float_input, 2), -1, true);
+  auto normalized = float_input * torch::rsqrt(variance + eps_);
   // Apply weight and convert back to original dtype
   return (normalized * (1.0f + weight_.to(torch::kFloat32))).to(input_dtype);
 }
```
```cpp
torch::Tensor RmsNormGatedImpl::forward(torch::Tensor& input,
                                        std::optional<torch::Tensor> gate) {
  xllm::kernel::GatedLayerNormParams params;
  auto input_type = input.dtype();
  input = input.to(torch::kFloat32);
  params.x = input;
  params.weight = weight_.to(torch::kFloat32);
  torch::Tensor bias;
  params.bias = bias;
  params.eps = eps_;
  if (gate.has_value()) {
    gate = gate.value().to(torch::kFloat32);
    params.z = gate;
  }
  params.group_size = input.size(-1);
  params.is_rms_norm = true;
  auto ret = xllm::kernel::gated_layer_norm(params);
  return ret.to(input_type);
}
```
The forward method takes input as a non-const reference (torch::Tensor&) and modifies it by reassigning it with input = input.to(torch::kFloat32);. This is an unexpected side effect for the caller. To avoid bugs, the method should take a const torch::Tensor& and use a local variable for the type conversion. Please also update the corresponding header file xllm/core/layers/common/rms_norm_gated.h.
Suggested change:

```diff
-torch::Tensor RmsNormGatedImpl::forward(torch::Tensor& input, std::optional<torch::Tensor> gate) {
+torch::Tensor RmsNormGatedImpl::forward(const torch::Tensor& input, std::optional<torch::Tensor> gate) {
   xllm::kernel::GatedLayerNormParams params;
   auto input_type = input.dtype();
-  input = input.to(torch::kFloat32);
-  params.x = input;
+  auto float_input = input.to(torch::kFloat32);
+  params.x = float_input;
   params.weight = weight_.to(torch::kFloat32);
   torch::Tensor bias;
   params.bias = bias;
   params.eps = eps_;
   if (gate.has_value()) {
     gate = gate.value().to(torch::kFloat32);
     params.z = gate;
   }
-  params.group_size = input.size(-1);
+  params.group_size = float_input.size(-1);
   params.is_rms_norm = true;
   auto ret = xllm::kernel::gated_layer_norm(params);
   return ret.to(input_type);
 }
```
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>