
feat: support Qwen3-next on npu device.#989

Open
JC-ut0 wants to merge 13 commits into jd-opensource:main from JC-ut0:qwen-next

Conversation

@JC-ut0
Contributor

@JC-ut0 JC-ut0 commented Mar 4, 2026

  1. Support Qwen3-next on NPU device; add a linear attention cache.
  2. Add a Triton kernel API, which depends on merging feat: adapt for CANN 8.5 and PyTorch 2.7.1 for npu device. #891.
  3. Modified from feat: support qwen3-next on npu device. #945 to resolve merge conflicts and bugs.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces support for the 'Qwen next' model, involving extensive changes across the build system, environment setup, and core C++ components, including new layers, kernels, and model arguments. A critical security vulnerability has been identified where user-supplied data in RPC requests is validated using CHECK macros, creating a Denial of Service (DoS) attack vector by allowing malformed requests to crash worker processes. It is strongly recommended to replace these CHECK macros with proper error validation and return error statuses. Furthermore, a critical issue exists in the KV cache capacity estimation logic where variable names for key and value head dimensions are swapped, potentially leading to incorrect memory allocation and runtime failures.

Comment on lines +289 to +291
CHECK(!(has_index_shape && (has_conv_shape || has_ssm_shape)))
<< "KVCacheShape does not support index_shape with conv/ssm shapes "
<< "simultaneously.";
Contributor

security-high high

The AllocateKVCache RPC method uses CHECK macros to validate the consistency of the request data. If a request is sent with conflicting shape information (e.g., both index_shape and conv_shape are present), the CHECK macro will fail and cause the worker process to abort. This allows an attacker who can reach the worker's RPC interface to crash the worker process, leading to a denial of service. It is recommended to replace CHECK macros with proper error handling that returns an error status to the caller instead of crashing the process.
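The suggested fix can be sketched as a small validation helper that reports the failure instead of aborting. This is a minimal illustration, not xllm's actual code: the `ValidationStatus` type and `ValidateKVCacheShape` name are hypothetical, and real handlers would propagate the message through the project's status type or the RPC controller.

```cpp
#include <string>

// Hypothetical lightweight status type; the real code would use the
// project's Status class or the RPC framework's error reporting.
struct ValidationStatus {
  bool ok;
  std::string message;
};

// Validate the request's shape flags with ordinary control flow instead of
// CHECK, so a malformed RPC yields an error response rather than a crash.
ValidationStatus ValidateKVCacheShape(bool has_index_shape,
                                      bool has_conv_shape,
                                      bool has_ssm_shape) {
  if (has_index_shape && (has_conv_shape || has_ssm_shape)) {
    return {false,
            "KVCacheShape does not support index_shape with conv/ssm shapes "
            "simultaneously."};
  }
  if (has_conv_shape != has_ssm_shape) {
    return {false, "conv_shape and ssm_shape must be provided together."};
  }
  return {true, ""};
}
```

In the handler, a failed status would be forwarded to the caller (for example via the controller's failure-reporting call) followed by an early return, leaving the worker process alive.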

Comment on lines +303 to +304
CHECK(has_conv_shape && has_ssm_shape)
<< "conv_shape and ssm_shape must be provided together.";
Contributor

security-high high

The AllocateKVCache RPC method uses a CHECK macro to ensure that conv_shape and ssm_shape are provided together. If a request is sent with only one of these shapes, the CHECK macro will fail and cause the worker process to abort. This is a denial of service vector. It is recommended to validate the request data and return an error status to the caller instead of using CHECK.

Comment on lines +361 to +363
CHECK(!(has_index_shape && (has_conv_shape || has_ssm_shape)))
<< "KVCacheShape does not support index_shape with conv/ssm shapes "
<< "simultaneously.";
Contributor

security-high high

The AllocateKVCacheWithTransfer RPC method uses CHECK macros to validate the consistency of the request data. If a request is sent with conflicting shape information (e.g., both index_shape and conv_shape are present), the CHECK macro will fail and cause the worker process to abort. This allows an attacker to crash the worker process. It is recommended to replace CHECK macros with proper error handling.

Comment on lines +377 to +378
CHECK(has_conv_shape && has_ssm_shape)
<< "conv_shape and ssm_shape must be provided together.";
Contributor

security-high high

The AllocateKVCacheWithTransfer RPC method uses a CHECK macro to ensure that conv_shape and ssm_shape are provided together. If a request is sent with only one of these shapes, the CHECK macro will fail and cause the worker process to abort. This is a denial of service vector. It is recommended to validate the request data and return an error status to the caller instead of using CHECK.

@yingxudeng yingxudeng marked this pull request as draft March 4, 2026 02:43
@JC-ut0 JC-ut0 changed the title from [Draft] Support Qwen next to [Draft] Support Qwen3-next on NPU device on Mar 4, 2026
Comment on lines 266 to 268
Contributor Author

@JC-ut0 JC-ut0 Mar 4, 2026

Suggested change
torch::Tensor act_out;

Initializing an empty tensor is enough here; this code is redundant.

CMakeLists.txt Outdated
endif()

-option(INSTALL_XLLM_KERNELS "Install xllm_kernels RPM" ON)
+option(INSTALL_XLLM_KERNELS "Install xllm_kernels RPM" OFF)
Contributor Author

This is a temporary workaround; remember to revert it before merging to main.

$ENV{NPU_HOME_PATH}/include
$ENV{ATB_HOME_PATH}/include
$ENV{NPU_HOME_PATH}/opp/vendors/xllm/op_api/include/
${CMAKE_CURRENT_SOURCE_DIR}/third_party/torch_npu_ops/
Contributor Author

Need to check whether there is a better way to do this.

@JC-ut0 JC-ut0 force-pushed the qwen-next branch 4 times, most recently from a3e3901 to 0bc39a0, on March 5, 2026 09:11
@JC-ut0
Contributor Author

JC-ut0 commented Mar 5, 2026

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request adds support for the "Qwen3-next" model on NPU devices. A high-severity Denial of Service (DoS) vulnerability has been identified in the RPC handlers of the "WorkerService", where "CHECK" macros used for input validation can cause the worker process to abort on invalid input, allowing remote attackers to crash the worker. Additionally, two critical bugs were found in the cache allocation logic: a typo in the "SSM" cache shape definition and a copy-paste error when handling cache shapes in the worker service. These issues need to be addressed to ensure both correctness and security, specifically by replacing "CHECK" macros with graceful error handling.

Comment on lines +361 to +378
+  CHECK(!(has_index_shape && (has_conv_shape || has_ssm_shape)))
+      << "KVCacheShape does not support index_shape with conv/ssm shapes "
+      << "simultaneously.";
+  kv_cache_shape.reserve(has_conv_shape || has_ssm_shape ? 4 : 3);
   kv_cache_shape.emplace_back(
-      std::vector<int64_t>(req->kv_cache_shape().key_shape().begin(),
-                           req->kv_cache_shape().key_shape().end()));
+      std::vector<int64_t>(shape_req.key_shape().begin(),
+                           shape_req.key_shape().end()));
   kv_cache_shape.emplace_back(
-      std::vector<int64_t>(req->kv_cache_shape().value_shape().begin(),
-                           req->kv_cache_shape().value_shape().end()));
+      std::vector<int64_t>(shape_req.value_shape().begin(),
+                           shape_req.value_shape().end()));
   // add index shape if exists
-  if (req->kv_cache_shape().index_shape_size() > 0) {
+  if (has_index_shape) {
     kv_cache_shape.emplace_back(
         std::vector<int64_t>(shape_req.index_shape().begin(),
                              shape_req.index_shape().end()));
+  } else if (has_conv_shape || has_ssm_shape) {
+    CHECK(has_conv_shape && has_ssm_shape)
+        << "conv_shape and ssm_shape must be provided together.";
Contributor

security-high high

The AllocateKVCacheWithTransfer RPC handler uses CHECK macros for shape validation, which is a Denial of Service (DoS) vector allowing a remote caller to crash the worker by providing inconsistent shape flags. Proper error handling should be implemented to return a failure status without aborting the process. Additionally, there is a copy-paste error when constructing the kv_cache_shape for linear attention, where index_shape is incorrectly used instead of ssm_shape. This can lead to incorrect shape information, causing allocation failures or memory corruption.
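The shape-assembly logic the comment describes can be illustrated in isolation. This is a hedged sketch: the `AssembleKVCacheShape` helper, its parameters, and the convention that linear-attention layers append conv and ssm state shapes after key/value are assumptions for illustration, not xllm's actual interface.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical sketch: build the per-layer cache shape list. Sparse-index
// layers append an index shape after key/value; linear-attention layers
// append conv and ssm state shapes instead. Empty vectors mean "absent".
std::vector<std::vector<int64_t>> AssembleKVCacheShape(
    const std::vector<int64_t>& key_shape,
    const std::vector<int64_t>& value_shape,
    const std::vector<int64_t>& index_shape,
    const std::vector<int64_t>& conv_shape,
    const std::vector<int64_t>& ssm_shape) {
  std::vector<std::vector<int64_t>> shapes;
  shapes.reserve(4);
  shapes.push_back(key_shape);
  shapes.push_back(value_shape);
  if (!index_shape.empty()) {
    shapes.push_back(index_shape);
  } else if (!conv_shape.empty() && !ssm_shape.empty()) {
    shapes.push_back(conv_shape);
    shapes.push_back(ssm_shape);  // ssm_shape here, not index_shape
  }
  return shapes;
}
```

Keeping the conv/ssm branch separate from the index branch, with the ssm shape pushed explicitly, is exactly what guards against the copy-paste error the review flags.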

Comment on lines +289 to +304
CHECK(!(has_index_shape && (has_conv_shape || has_ssm_shape)))
<< "KVCacheShape does not support index_shape with conv/ssm shapes "
<< "simultaneously.";
// Reserve for key, value, and optional extra shapes
kv_cache_shape.reserve(has_conv_shape || has_ssm_shape ? 4 : 3);
kv_cache_shape.emplace_back(std::vector<int64_t>(
request->key_shape().begin(), request->key_shape().end()));
kv_cache_shape.emplace_back(std::vector<int64_t>(
request->value_shape().begin(), request->value_shape().end()));
// add index shape if exists
if (request->index_shape_size() > 0) {
if (has_index_shape) {
kv_cache_shape.emplace_back(std::vector<int64_t>(
request->index_shape().begin(), request->index_shape().end()));
} else if (has_conv_shape || has_ssm_shape) {
CHECK(has_conv_shape && has_ssm_shape)
<< "conv_shape and ssm_shape must be provided together.";
Contributor

security-high high

The AllocateKVCache RPC handler uses the CHECK macro to validate input parameters (has_index_shape, has_conv_shape, has_ssm_shape). The CHECK macro causes the entire process to abort if the condition is not met. Since this logic operates on untrusted data received over the network, it creates a Denial of Service (DoS) vulnerability. An attacker can send a malformed request to crash the worker process. Input validation should be performed using conditional logic that returns an error response (e.g., via controller->SetFailed()) instead of crashing the process.

@JC-ut0 JC-ut0 changed the title from [Draft] Support Qwen3-next on NPU device to feat: Support Qwen3-next on NPU device on Mar 10, 2026
@JC-ut0 JC-ut0 marked this pull request as ready for review March 10, 2026 01:45
@XuZhang99 XuZhang99 changed the title from feat: Support Qwen3-next on NPU device to feat: support Qwen3-next on npu device. on Mar 12, 2026
ext.wangxiaochi1 and others added 8 commits March 14, 2026 10:58
feat: adjust cache allocation based on attention settings(support ssm_cache).

feat: update torch_npu_ops commit.

bugfix:layer CMake fix.

bugfix:add model arguments for enhanced configuration.

bugfix: QKV linear load fix.

bugfix: fallback pg_comm.

bugfix: set q_seq_len in prefill.

bugfix: handle optional finished tensor in moe gating and FusedMoE implementations.

bugfix: ensure activation output is correctly assigned in FusedMoE forward pass.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@JC-ut0
Contributor Author

JC-ut0 commented Mar 14, 2026

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces support for the Qwen3-next model on NPU devices, which includes adding a linear attention cache. The changes are extensive, involving new model layers, kernels, and updates to the build system and data structures. My review identified a critical compilation error related to incorrect pointer access and a couple of high-severity issues where function signatures could lead to unexpected side effects by modifying input tensors. I have provided code suggestions to address these problems.

Comment on lines +30 to +40
torch::Tensor Qwen3NextRMSNormImpl::forward(torch::Tensor& input) {
auto input_dtype = input.dtype();
input = input.to(torch::kFloat32);

// Calculate RMS
auto variance = torch::mean(torch::pow(input, 2), -1, true);
auto normalized = input * torch::rsqrt(variance + eps_);

// Apply weight and convert back to original dtype
return (normalized * (1.0f + weight_.to(torch::kFloat32))).to(input_dtype);
}
Contributor

high

The forward method takes input as a non-const reference (torch::Tensor&) and modifies it by reassigning it with input = input.to(torch::kFloat32);. This modifies the caller's tensor, which can lead to unexpected side effects and bugs. The method should take a const torch::Tensor& and use a local variable for type conversion to avoid modifying the original tensor. Please also update the corresponding header file xllm/core/layers/common/qwen3_next_rms_norm.h.

Suggested change
torch::Tensor Qwen3NextRMSNormImpl::forward(torch::Tensor& input) {
auto input_dtype = input.dtype();
input = input.to(torch::kFloat32);
// Calculate RMS
auto variance = torch::mean(torch::pow(input, 2), -1, true);
auto normalized = input * torch::rsqrt(variance + eps_);
// Apply weight and convert back to original dtype
return (normalized * (1.0f + weight_.to(torch::kFloat32))).to(input_dtype);
}
torch::Tensor Qwen3NextRMSNormImpl::forward(const torch::Tensor& input) {
auto input_dtype = input.dtype();
auto float_input = input.to(torch::kFloat32);
// Calculate RMS
auto variance = torch::mean(torch::pow(float_input, 2), -1, true);
auto normalized = float_input * torch::rsqrt(variance + eps_);
// Apply weight and convert back to original dtype
return (normalized * (1.0f + weight_.to(torch::kFloat32))).to(input_dtype);
}

Comment on lines +33 to +50
torch::Tensor RmsNormGatedImpl::forward(torch::Tensor& input, std::optional<torch::Tensor> gate) {
xllm::kernel::GatedLayerNormParams params;
auto input_type = input.dtype();
input = input.to(torch::kFloat32);
params.x = input;
params.weight = weight_.to(torch::kFloat32);
torch::Tensor bias;
params.bias = bias;
params.eps = eps_;
if (gate.has_value()) {
gate = gate.value().to(torch::kFloat32);
params.z = gate;
}
params.group_size = input.size(-1);
params.is_rms_norm = true;
auto ret = xllm::kernel::gated_layer_norm(params);
return ret.to(input_type);
}
Contributor

high

The forward method takes input as a non-const reference (torch::Tensor&) and modifies it by reassigning it with input = input.to(torch::kFloat32);. This is an unexpected side effect for the caller. To avoid bugs, the method should take a const torch::Tensor& and use a local variable for the type conversion. Please also update the corresponding header file xllm/core/layers/common/rms_norm_gated.h.

Suggested change
torch::Tensor RmsNormGatedImpl::forward(torch::Tensor& input, std::optional<torch::Tensor> gate) {
xllm::kernel::GatedLayerNormParams params;
auto input_type = input.dtype();
input = input.to(torch::kFloat32);
params.x = input;
params.weight = weight_.to(torch::kFloat32);
torch::Tensor bias;
params.bias = bias;
params.eps = eps_;
if (gate.has_value()) {
gate = gate.value().to(torch::kFloat32);
params.z = gate;
}
params.group_size = input.size(-1);
params.is_rms_norm = true;
auto ret = xllm::kernel::gated_layer_norm(params);
return ret.to(input_type);
}
torch::Tensor RmsNormGatedImpl::forward(const torch::Tensor& input, std::optional<torch::Tensor> gate) {
xllm::kernel::GatedLayerNormParams params;
auto input_type = input.dtype();
auto float_input = input.to(torch::kFloat32);
params.x = float_input;
params.weight = weight_.to(torch::kFloat32);
torch::Tensor bias;
params.bias = bias;
params.eps = eps_;
if (gate.has_value()) {
gate = gate.value().to(torch::kFloat32);
params.z = gate;
}
params.group_size = float_input.size(-1);
params.is_rms_norm = true;
auto ret = xllm::kernel::gated_layer_norm(params);
return ret.to(input_type);
}
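For reference, a gated RMS norm (as used by Mamba-style linear-attention layers) typically multiplies the normalized input by a SiLU-activated gate. The exact semantics of xllm::kernel::gated_layer_norm are not shown in this diff, so the scalar sketch below, including the `RmsNormGated` name and the gating order, is an assumption for illustration only.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical scalar sketch of a gated RMS norm:
// y = (x / sqrt(mean(x^2) + eps)) * w * silu(z), where silu(z) = z * sigmoid(z).
// The gating order used by the actual gated_layer_norm kernel is assumed.
std::vector<float> RmsNormGated(const std::vector<float>& x,
                                const std::vector<float>& w,
                                const std::vector<float>& z,
                                float eps = 1e-6f) {
  double mean_sq = 0.0;
  for (float v : x) mean_sq += static_cast<double>(v) * v;
  mean_sq /= static_cast<double>(x.size());
  const float inv_rms = 1.0f / std::sqrt(static_cast<float>(mean_sq) + eps);
  std::vector<float> out(x.size());
  for (size_t i = 0; i < x.size(); ++i) {
    const float silu = z[i] / (1.0f + std::exp(-z[i]));  // z * sigmoid(z)
    out[i] = x[i] * inv_rms * w[i] * silu;
  }
  return out;
}
```

As in the suggested change above, the inputs are taken by `const` reference so the caller's tensors (here, vectors) are left untouched.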

JC-ut0 and others added 2 commits March 14, 2026 15:47
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>