
feat: support Qwen3-next on npu device.#989

Open
JC-ut0 wants to merge 13 commits into jd-opensource:main from JC-ut0:qwen-next

Conversation

@JC-ut0
Contributor

@JC-ut0 JC-ut0 commented Mar 4, 2026

  1. Support Qwen3-next on NPU device; add a linear attention cache.
  2. Add a Triton kernel API, which depends on merging feat: adapt for CANN 8.5 and PyTorch 2.7.1 for npu device. #891.
  3. Modified from feat: support qwen3-next on npu device. #945 to resolve merge conflicts and bugs.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces support for the 'Qwen next' model, involving extensive changes across the build system, environment setup, and core C++ components, including new layers, kernels, and model arguments. A critical security vulnerability has been identified where user-supplied data in RPC requests is validated using CHECK macros, creating a Denial of Service (DoS) attack vector by allowing malformed requests to crash worker processes. It is strongly recommended to replace these CHECK macros with proper error validation and return error statuses. Furthermore, a critical issue exists in the KV cache capacity estimation logic where variable names for key and value head dimensions are swapped, potentially leading to incorrect memory allocation and runtime failures.

Comment on lines +289 to +291
CHECK(!(has_index_shape && (has_conv_shape || has_ssm_shape)))
<< "KVCacheShape does not support index_shape with conv/ssm shapes "
<< "simultaneously.";
Contributor

security-high high

The AllocateKVCache RPC method uses CHECK macros to validate the consistency of the request data. If a request is sent with conflicting shape information (e.g., both index_shape and conv_shape are present), the CHECK macro will fail and cause the worker process to abort. This allows an attacker who can reach the worker's RPC interface to crash the worker process, leading to a denial of service. It is recommended to replace CHECK macros with proper error handling that returns an error status to the caller instead of crashing the process.
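The suggested fix can be sketched as a small validation helper that reports the failure instead of aborting. This is a minimal illustration, not xllm's actual code: the `ValidationStatus` type and `ValidateKVCacheShape` name are hypothetical, and real handlers would propagate the message through the project's status type or the RPC controller.

```cpp
#include <string>

// Hypothetical lightweight status type; the real code would use the
// project's Status class or the RPC framework's error reporting.
struct ValidationStatus {
  bool ok;
  std::string message;
};

// Validate the request's shape flags with ordinary control flow instead of
// CHECK, so a malformed RPC yields an error response rather than a crash.
ValidationStatus ValidateKVCacheShape(bool has_index_shape,
                                      bool has_conv_shape,
                                      bool has_ssm_shape) {
  if (has_index_shape && (has_conv_shape || has_ssm_shape)) {
    return {false,
            "KVCacheShape does not support index_shape with conv/ssm shapes "
            "simultaneously."};
  }
  if (has_conv_shape != has_ssm_shape) {
    return {false, "conv_shape and ssm_shape must be provided together."};
  }
  return {true, ""};
}
```

In the handler, a failed status would be forwarded to the caller (for example via the controller's failure-reporting call) followed by an early return, leaving the worker process alive.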

Comment on lines +303 to +304
CHECK(has_conv_shape && has_ssm_shape)
<< "conv_shape and ssm_shape must be provided together.";
Contributor

security-high high

The AllocateKVCache RPC method uses a CHECK macro to ensure that conv_shape and ssm_shape are provided together. If a request is sent with only one of these shapes, the CHECK macro will fail and cause the worker process to abort. This is a denial of service vector. It is recommended to validate the request data and return an error status to the caller instead of using CHECK.

Comment on lines +361 to +363
CHECK(!(has_index_shape && (has_conv_shape || has_ssm_shape)))
<< "KVCacheShape does not support index_shape with conv/ssm shapes "
<< "simultaneously.";
Contributor

security-high high

The AllocateKVCacheWithTransfer RPC method uses CHECK macros to validate the consistency of the request data. If a request is sent with conflicting shape information (e.g., both index_shape and conv_shape are present), the CHECK macro will fail and cause the worker process to abort. This allows an attacker to crash the worker process. It is recommended to replace CHECK macros with proper error handling.

Comment on lines +377 to +378
CHECK(has_conv_shape && has_ssm_shape)
<< "conv_shape and ssm_shape must be provided together.";
Contributor

security-high high

The AllocateKVCacheWithTransfer RPC method uses a CHECK macro to ensure that conv_shape and ssm_shape are provided together. If a request is sent with only one of these shapes, the CHECK macro will fail and cause the worker process to abort. This is a denial of service vector. It is recommended to validate the request data and return an error status to the caller instead of using CHECK.

@yingxudeng yingxudeng marked this pull request as draft March 4, 2026 02:43
@JC-ut0 JC-ut0 changed the title from [Draft] Support Qwen next to [Draft] Support Qwen3-next on NPU device on Mar 4, 2026
Comment on lines 266 to 268
Contributor Author

@JC-ut0 JC-ut0 Mar 4, 2026

Suggested change
torch::Tensor act_out;

Initializing an empty tensor is enough here; this code is redundant.

CMakeLists.txt Outdated
endif()

-option(INSTALL_XLLM_KERNELS "Install xllm_kernels RPM" ON)
+option(INSTALL_XLLM_KERNELS "Install xllm_kernels RPM" OFF)
Contributor Author

This is a temporary workaround; remember to revert it before merging to main.

$ENV{NPU_HOME_PATH}/include
$ENV{ATB_HOME_PATH}/include
$ENV{NPU_HOME_PATH}/opp/vendors/xllm/op_api/include/
${CMAKE_CURRENT_SOURCE_DIR}/third_party/torch_npu_ops/
Contributor Author

Need to check whether there is a better way to do this.

@JC-ut0 JC-ut0 force-pushed the qwen-next branch 4 times, most recently from a3e3901 to 0bc39a0, on March 5, 2026 09:11
@JC-ut0
Contributor Author

JC-ut0 commented Mar 5, 2026

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request adds support for the "Qwen3-next" model on NPU devices. A high-severity Denial of Service (DoS) vulnerability has been identified in the RPC handlers of the "WorkerService", where "CHECK" macros used for input validation can cause the worker process to abort on invalid input, allowing remote attackers to crash the worker. Additionally, two critical bugs were found in the cache allocation logic: a typo in the "SSM" cache shape definition and a copy-paste error when handling cache shapes in the worker service. These issues need to be addressed to ensure both correctness and security, specifically by replacing "CHECK" macros with graceful error handling.

Comment on lines +361 to +378
+  CHECK(!(has_index_shape && (has_conv_shape || has_ssm_shape)))
+      << "KVCacheShape does not support index_shape with conv/ssm shapes "
+      << "simultaneously.";
+  kv_cache_shape.reserve(has_conv_shape || has_ssm_shape ? 4 : 3);
   kv_cache_shape.emplace_back(
-      std::vector<int64_t>(req->kv_cache_shape().key_shape().begin(),
-                           req->kv_cache_shape().key_shape().end()));
+      std::vector<int64_t>(shape_req.key_shape().begin(),
+                           shape_req.key_shape().end()));
   kv_cache_shape.emplace_back(
-      std::vector<int64_t>(req->kv_cache_shape().value_shape().begin(),
-                           req->kv_cache_shape().value_shape().end()));
+      std::vector<int64_t>(shape_req.value_shape().begin(),
+                           shape_req.value_shape().end()));
   // add index shape if exists
-  if (req->kv_cache_shape().index_shape_size() > 0) {
+  if (has_index_shape) {
     kv_cache_shape.emplace_back(
         std::vector<int64_t>(shape_req.index_shape().begin(),
                              shape_req.index_shape().end()));
+  } else if (has_conv_shape || has_ssm_shape) {
+    CHECK(has_conv_shape && has_ssm_shape)
+        << "conv_shape and ssm_shape must be provided together.";
Contributor

security-high high

The AllocateKVCacheWithTransfer RPC handler uses CHECK macros for shape validation, which is a Denial of Service (DoS) vector allowing a remote caller to crash the worker by providing inconsistent shape flags. Proper error handling should be implemented to return a failure status without aborting the process. Additionally, there is a copy-paste error when constructing the kv_cache_shape for linear attention, where index_shape is incorrectly used instead of ssm_shape. This can lead to incorrect shape information, causing allocation failures or memory corruption.
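The shape-assembly logic the comment describes can be illustrated in isolation. This is a hedged sketch: the `AssembleKVCacheShape` helper, its parameters, and the convention that linear-attention layers append conv and ssm state shapes after key/value are assumptions for illustration, not xllm's actual interface.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical sketch: build the per-layer cache shape list. Sparse-index
// layers append an index shape after key/value; linear-attention layers
// append conv and ssm state shapes instead. Empty vectors mean "absent".
std::vector<std::vector<int64_t>> AssembleKVCacheShape(
    const std::vector<int64_t>& key_shape,
    const std::vector<int64_t>& value_shape,
    const std::vector<int64_t>& index_shape,
    const std::vector<int64_t>& conv_shape,
    const std::vector<int64_t>& ssm_shape) {
  std::vector<std::vector<int64_t>> shapes;
  shapes.reserve(4);
  shapes.push_back(key_shape);
  shapes.push_back(value_shape);
  if (!index_shape.empty()) {
    shapes.push_back(index_shape);
  } else if (!conv_shape.empty() && !ssm_shape.empty()) {
    shapes.push_back(conv_shape);
    shapes.push_back(ssm_shape);  // ssm_shape here, not index_shape
  }
  return shapes;
}
```

Keeping the conv/ssm branch separate from the index branch, with the ssm shape pushed explicitly, is exactly what guards against the copy-paste error the review flags.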

Comment on lines +289 to +304
CHECK(!(has_index_shape && (has_conv_shape || has_ssm_shape)))
<< "KVCacheShape does not support index_shape with conv/ssm shapes "
<< "simultaneously.";
// Reserve for key, value, and optional extra shapes
kv_cache_shape.reserve(has_conv_shape || has_ssm_shape ? 4 : 3);
kv_cache_shape.emplace_back(std::vector<int64_t>(
request->key_shape().begin(), request->key_shape().end()));
kv_cache_shape.emplace_back(std::vector<int64_t>(
request->value_shape().begin(), request->value_shape().end()));
// add index shape if exists
if (request->index_shape_size() > 0) {
if (has_index_shape) {
kv_cache_shape.emplace_back(std::vector<int64_t>(
request->index_shape().begin(), request->index_shape().end()));
} else if (has_conv_shape || has_ssm_shape) {
CHECK(has_conv_shape && has_ssm_shape)
<< "conv_shape and ssm_shape must be provided together.";
Contributor

security-high high

The AllocateKVCache RPC handler uses the CHECK macro to validate input parameters (has_index_shape, has_conv_shape, has_ssm_shape). The CHECK macro causes the entire process to abort if the condition is not met. Since this logic operates on untrusted data received over the network, it creates a Denial of Service (DoS) vulnerability. An attacker can send a malformed request to crash the worker process. Input validation should be performed using conditional logic that returns an error response (e.g., via controller->SetFailed()) instead of crashing the process.

@JC-ut0 JC-ut0 changed the title from [Draft] Support Qwen3-next on NPU device to feat: Support Qwen3-next on NPU device on Mar 10, 2026
@JC-ut0 JC-ut0 marked this pull request as ready for review March 10, 2026 01:45
@XuZhang99 XuZhang99 changed the title from feat: Support Qwen3-next on NPU device to feat: support Qwen3-next on npu device. on Mar 12, 2026
ext.wangxiaochi1 and others added 8 commits March 14, 2026 10:58
feat: adjust cache allocation based on attention settings(support ssm_cache).

feat: update torch_npu_ops commit.

bugfix:layer CMake fix.

bugfix:add model arguments for enhanced configuration.

bugfix: QKV linear load fix.

bugfix: fallback pg_comm.

bugfix: set q_seq_len in prefill.

bugfix: handle optional finished tensor in moe gating and FusedMoE implementations.

bugfix: ensure activation output is correctly assigned in FusedMoE forward pass.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@JC-ut0
Contributor Author

JC-ut0 commented Mar 14, 2026

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces support for the Qwen3-next model on NPU devices, which includes adding a linear attention cache. The changes are extensive, involving new model layers, kernels, and updates to the build system and data structures. My review identified a critical compilation error related to incorrect pointer access and a couple of high-severity issues where function signatures could lead to unexpected side effects by modifying input tensors. I have provided code suggestions to address these problems.

Comment on lines +30 to +40
torch::Tensor Qwen3NextRMSNormImpl::forward(torch::Tensor& input) {
auto input_dtype = input.dtype();
input = input.to(torch::kFloat32);

// Calculate RMS
auto variance = torch::mean(torch::pow(input, 2), -1, true);
auto normalized = input * torch::rsqrt(variance + eps_);

// Apply weight and convert back to original dtype
return (normalized * (1.0f + weight_.to(torch::kFloat32))).to(input_dtype);
}
Contributor

high

The forward method takes input as a non-const reference (torch::Tensor&) and modifies it by reassigning it with input = input.to(torch::kFloat32);. This modifies the caller's tensor, which can lead to unexpected side effects and bugs. The method should take a const torch::Tensor& and use a local variable for type conversion to avoid modifying the original tensor. Please also update the corresponding header file xllm/core/layers/common/qwen3_next_rms_norm.h.

Suggested change
torch::Tensor Qwen3NextRMSNormImpl::forward(torch::Tensor& input) {
auto input_dtype = input.dtype();
input = input.to(torch::kFloat32);
// Calculate RMS
auto variance = torch::mean(torch::pow(input, 2), -1, true);
auto normalized = input * torch::rsqrt(variance + eps_);
// Apply weight and convert back to original dtype
return (normalized * (1.0f + weight_.to(torch::kFloat32))).to(input_dtype);
}
torch::Tensor Qwen3NextRMSNormImpl::forward(const torch::Tensor& input) {
auto input_dtype = input.dtype();
auto float_input = input.to(torch::kFloat32);
// Calculate RMS
auto variance = torch::mean(torch::pow(float_input, 2), -1, true);
auto normalized = float_input * torch::rsqrt(variance + eps_);
// Apply weight and convert back to original dtype
return (normalized * (1.0f + weight_.to(torch::kFloat32))).to(input_dtype);
}

Comment on lines +33 to +50
torch::Tensor RmsNormGatedImpl::forward(torch::Tensor& input, std::optional<torch::Tensor> gate) {
xllm::kernel::GatedLayerNormParams params;
auto input_type = input.dtype();
input = input.to(torch::kFloat32);
params.x = input;
params.weight = weight_.to(torch::kFloat32);
torch::Tensor bias;
params.bias = bias;
params.eps = eps_;
if (gate.has_value()) {
gate = gate.value().to(torch::kFloat32);
params.z = gate;
}
params.group_size = input.size(-1);
params.is_rms_norm = true;
auto ret = xllm::kernel::gated_layer_norm(params);
return ret.to(input_type);
}
Contributor

high

The forward method takes input as a non-const reference (torch::Tensor&) and modifies it by reassigning it with input = input.to(torch::kFloat32);. This is an unexpected side effect for the caller. To avoid bugs, the method should take a const torch::Tensor& and use a local variable for the type conversion. Please also update the corresponding header file xllm/core/layers/common/rms_norm_gated.h.

Suggested change
torch::Tensor RmsNormGatedImpl::forward(torch::Tensor& input, std::optional<torch::Tensor> gate) {
xllm::kernel::GatedLayerNormParams params;
auto input_type = input.dtype();
input = input.to(torch::kFloat32);
params.x = input;
params.weight = weight_.to(torch::kFloat32);
torch::Tensor bias;
params.bias = bias;
params.eps = eps_;
if (gate.has_value()) {
gate = gate.value().to(torch::kFloat32);
params.z = gate;
}
params.group_size = input.size(-1);
params.is_rms_norm = true;
auto ret = xllm::kernel::gated_layer_norm(params);
return ret.to(input_type);
}
torch::Tensor RmsNormGatedImpl::forward(const torch::Tensor& input, std::optional<torch::Tensor> gate) {
xllm::kernel::GatedLayerNormParams params;
auto input_type = input.dtype();
auto float_input = input.to(torch::kFloat32);
params.x = float_input;
params.weight = weight_.to(torch::kFloat32);
torch::Tensor bias;
params.bias = bias;
params.eps = eps_;
if (gate.has_value()) {
gate = gate.value().to(torch::kFloat32);
params.z = gate;
}
params.group_size = float_input.size(-1);
params.is_rms_norm = true;
auto ret = xllm::kernel::gated_layer_norm(params);
return ret.to(input_type);
}
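For reference, a gated RMS norm (as used by Mamba-style linear-attention layers) typically multiplies the normalized input by a SiLU-activated gate. The exact semantics of xllm::kernel::gated_layer_norm are not shown in this diff, so the scalar sketch below, including the `RmsNormGated` name and the gating order, is an assumption for illustration only.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical scalar sketch of a gated RMS norm:
// y = (x / sqrt(mean(x^2) + eps)) * w * silu(z), where silu(z) = z * sigmoid(z).
// The gating order used by the actual gated_layer_norm kernel is assumed.
std::vector<float> RmsNormGated(const std::vector<float>& x,
                                const std::vector<float>& w,
                                const std::vector<float>& z,
                                float eps = 1e-6f) {
  double mean_sq = 0.0;
  for (float v : x) mean_sq += static_cast<double>(v) * v;
  mean_sq /= static_cast<double>(x.size());
  const float inv_rms = 1.0f / std::sqrt(static_cast<float>(mean_sq) + eps);
  std::vector<float> out(x.size());
  for (size_t i = 0; i < x.size(); ++i) {
    const float silu = z[i] / (1.0f + std::exp(-z[i]));  // z * sigmoid(z)
    out[i] = x[i] * inv_rms * w[i] * silu;
  }
  return out;
}
```

As in the suggested change above, the inputs are taken by `const` reference so the caller's tensors (here, vectors) are left untouched.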

JC-ut0 and others added 2 commits March 14, 2026 15:47
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>