
[feat][ATOM-vLLM][Attention Refactor] Reconstruct the Attention Arch #750

Draft

zejunchen-zejun wants to merge 10 commits into main from zejun/refact_attn_0511

Conversation

@zejunchen-zejun (Collaborator) commented May 11, 2026

This PR refactors the attention architecture for ATOM-vLLM.

Here is the RFC: #758

Here are the validation results:
ATOM-vLLM CI

DeepSeek-R1-FP8 TP8 / atom-vllm CI: 0.9484457922668689 >= 0.93
gpt-oss-120b TP1 / atom-vllm CI: FAILED - aiter JIT header missing
Kimi-K2-Thinking-MXFP4 TP4 / atom-vllm CI: FAILED - server not ready
Qwen3.5-35B-A3B-FP8 TP2 / atom-vllm CI: 0.7862016679302501 >= 0.76

ATOM-vLLM nightly validation
Qwen3-235B-A22B-Instruct-2507-FP8 TP8+EP8 / atom-vllm nightly: FAILED - max_qlen=None
Qwen3-Next-80B-A3B-Instruct-FP8 TP1 / atom-vllm nightly: FAILED - tuple index error
Qwen3-Next-80B-A3B-Instruct-FP8 TP4 / atom-vllm nightly: FAILED - tuple index error
Qwen3.5-397B-A17B-FP8 TP8 / atom-vllm nightly: 0.8688400303260045 >= 0.83
Qwen3.5-397B-A17B TP8 / atom-vllm nightly: 0.8688400303260045 >= 0.83
Qwen3.5-397B-A17B-MXFP4 TP4 / atom-vllm nightly: 0.8468536770280516 >= 0.83
Meta-Llama-3.1-405B-Instruct-FP8 TP8 / atom-vllm nightly: FAILED - fp8 gemm dtype mismatch
Llama-3.1-8B-Instruct TP1 / atom-vllm nightly: FAILED - max_qlen=None
Kimi-K2-Thinking-MXFP4 TP8 / atom-vllm nightly: 0.931008339651251 >= 0.90
Kimi-K2.5-MXFP4 TP8 / atom-vllm nightly: FAILED - vision config missing hidden_size
DeepSeek-R1-FP8 TP8 / atom-vllm nightly: 0.9507202426080363 >= 0.93
DeepSeek-R1-0528-MXFP4 TP8 / atom-vllm nightly: 0.9370735405610311 >= 0.93
DeepSeek-V3.2-FP8 TP8 / atom-vllm nightly: 0.9492039423805914 >= 0.93
gpt-oss-120b TP1 / atom-vllm nightly: FAILED - aiter JIT header missing
gpt-oss-120b TP2 / atom-vllm nightly: FAILED - aiter JIT header missing
GLM-5.1-FP8 TP8 / atom-vllm nightly: 0.9423805913570887 >= 0.88

ATOM CI and nightly:
Meta-Llama-3-8B-Instruct / native atom: FAILED - local model missing
DeepSeek-R1-0528 / native atom: 0.9522365428354814 >= 0.94
DeepSeek-V4-Pro / native atom: 0.9552691432903715 >= 0.92
DeepSeek-R1-0528 MTP / native atom: 0.9461713419257013 >= 0.94
gpt-oss-120b / native atom: FAILED - no result JSON
Llama-3.3-70B-Instruct-MXFP4-Preview / native atom: FAILED - local model missing
DeepSeek-R1-0528-FP4 / native atom: 0.9492039423805914 >= 0.93
DeepSeek-R1-0528-FP4 MTP / native atom: 0.9401061410159212 >= 0.93
Qwen3-235B-A22B-Instruct-2507-FP8 / native atom: 0.8953752843062927 >= 0.87
Qwen3-Next-80B-A3B-Thinking / native atom: 0.6732373009855952 >= 0.65
gpt-oss-120b 2GPU / native atom: FAILED - no result JSON
Qwen3-235B-A22B-Instruct-2507-MXFP4 / native atom: 0.8764215314632298 >= 0.87
Kimi-K2.5-MXFP4 / native atom: 0.9423805913570887 >= 0.92
Kimi-K2.5-MXFP4 Eagle3 / native atom: 0.935557240333586 >= 0.91
GLM-5-FP8 / native atom: 0.9416224412433661 >= 0.93
GLM-5.1-FP8 / native atom: 0.8893100833965125 >= 0.875
GLM-5.1-MXFP4 MTP / native atom: 0.8809704321455648 >= 0.87
GLM-5.1-MXFP4 / native atom: 0.88855193328279 >= 0.87
Qwen3.5-397B-A17B-FP8 / native atom: 0.8786959818043972 >= 0.85
Qwen3.5-397B-A17B-FP8 MTP / native atom: 0.8673237300985596 >= 0.85
Qwen3.5-397B-A17B-MXFP4 / native atom: 0.8620166793025019 >= 0.835
Qwen3.5-397B-A17B-MXFP4 MTP / native atom: FAILED - 0.8339651250947687 < 0.835
MiniMax-M2.5 / native atom: 0.9317664897649734 >= 0.92
MiniMax-M2.5-MXFP4 / native atom: 0.9196360879454132 >= 0.91

from atom.plugin.prepare import is_plugin_mode
from atom.utils import CpuGpuBuffer
from atom.utils.block_convert import (
block_table_convert_triton,

⚠️ [ruff] <F401> reported by reviewdog 🐶
atom.utils.block_convert.block_table_convert_triton imported but unused

reviewdog suggestion error: GitHub comment range and suggestion line range must be same. L20-L20 vs. L19-L22

Comment on lines +10 to +11
from aiter.ops.triton.batched_gemm_a16wfp4 import batched_gemm_a16wfp4


⚠️ [ruff] <F401> reported by reviewdog 🐶
aiter.ops.triton.batched_gemm_a16wfp4.batched_gemm_a16wfp4 imported but unused

Suggested change
from aiter.ops.triton.batched_gemm_a16wfp4 import batched_gemm_a16wfp4

Comment on lines +12 to +15
from aiter.ops.triton.batched_gemm_a8w8_a_per_token_group_prequant_w_per_batched_tensor_quant import ( # noqa: E501 # isort: skip
batched_gemm_a8w8_a_per_token_group_prequant_w_per_batched_tensor_quant as _aiter_triton_fp8_bmm,
)


⚠️ [ruff] <F401> reported by reviewdog 🐶
aiter.ops.triton.batched_gemm_a8w8_a_per_token_group_prequant_w_per_batched_tensor_quant.batched_gemm_a8w8_a_per_token_group_prequant_w_per_batched_tensor_quant imported but unused

Suggested change
from aiter.ops.triton.batched_gemm_a8w8_a_per_token_group_prequant_w_per_batched_tensor_quant import ( # noqa: E501 # isort: skip
batched_gemm_a8w8_a_per_token_group_prequant_w_per_batched_tensor_quant as _aiter_triton_fp8_bmm,
)

Comment on lines +16 to +17
from functools import partial as functools_partial
from atom.model_ops.linear import use_triton_gemm

⚠️ [ruff] <F401> reported by reviewdog 🐶
functools.partial imported but unused

Suggested change
from functools import partial as functools_partial
from atom.model_ops.linear import use_triton_gemm
from atom.model_ops.linear import use_triton_gemm

if use_triton_gemm():
    try:
        from aiter.ops.triton.fused_gemm_a8w8_blockscale_split_cat import (
            fused_gemm_a8w8_blockscale_preshuffle_split_cat,

⚠️ [ruff] <F401> reported by reviewdog 🐶
aiter.ops.triton.fused_gemm_a8w8_blockscale_split_cat.fused_gemm_a8w8_blockscale_preshuffle_split_cat imported but unused; consider using importlib.util.find_spec to test for availability

            fused_gemm_a8w8_blockscale_preshuffle_split_cat,
        )
        from aiter.ops.triton.fused_gemm_afp4wfp4_split_cat import (
            fused_gemm_afp4wfp4_preshuffle_split_cat,

⚠️ [ruff] <F401> reported by reviewdog 🐶
aiter.ops.triton.fused_gemm_afp4wfp4_split_cat.fused_gemm_afp4wfp4_preshuffle_split_cat imported but unused; consider using importlib.util.find_spec to test for availability
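
For context, a minimal sketch of the find_spec pattern ruff hints at for both of these imports, assuming the try block exists only to probe kernel availability; the _kernels_available helper and _HAS_SPLIT_CAT_KERNELS name are illustrative, not part of the PR:

import importlib.util

from atom.model_ops.linear import use_triton_gemm


def _kernels_available(*modules: str) -> bool:
    # find_spec raises ModuleNotFoundError when a parent package (e.g. aiter)
    # is absent, so treat that the same as "not available".
    try:
        return all(importlib.util.find_spec(m) is not None for m in modules)
    except ModuleNotFoundError:
        return False


# Probe for the optional aiter Triton kernels without binding names that
# would otherwise sit unused at module scope (F401).
_HAS_SPLIT_CAT_KERNELS = use_triton_gemm() and _kernels_available(
    "aiter.ops.triton.fused_gemm_a8w8_blockscale_split_cat",
    "aiter.ops.triton.fused_gemm_afp4wfp4_split_cat",
)

# The kernels themselves would then be imported lazily at the call sites
# guarded by _HAS_SPLIT_CAT_KERNELS.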

Comment on lines +8 to +9
from aiter import get_mla_metadata_v1
from atom.utils.block_convert import kv_indices_generate_triton

⚠️ [ruff] <F401> reported by reviewdog 🐶
aiter.get_mla_metadata_v1 imported but unused

Suggested change
from aiter import get_mla_metadata_v1
from atom.utils.block_convert import kv_indices_generate_triton
from atom.utils.block_convert import kv_indices_generate_triton

Comment on lines +9 to +10
from atom.utils.block_convert import kv_indices_generate_triton
from atom.utils.forward_context import Context

⚠️ [ruff] <F401> reported by reviewdog 🐶
atom.utils.block_convert.kv_indices_generate_triton imported but unused

Suggested change
from atom.utils.block_convert import kv_indices_generate_triton
from atom.utils.forward_context import Context
from atom.utils.forward_context import Context

Comment on lines +11 to +12
from atom.model_ops.attention_mla import MLAAttention, _MLA_MIN_HEADS


⚠️ [ruff] <F401> reported by reviewdog 🐶
atom.model_ops.attention_mla.MLAAttention imported but unused

Suggested change
from atom.model_ops.attention_mla import MLAAttention, _MLA_MIN_HEADS

Comment on lines +11 to +12
from atom.model_ops.attention_mla import MLAAttention, _MLA_MIN_HEADS


⚠️ [ruff] <F401> reported by reviewdog 🐶
atom.model_ops.attention_mla._MLA_MIN_HEADS imported but unused

Suggested change
from atom.model_ops.attention_mla import MLAAttention, _MLA_MIN_HEADS

Comment on lines +16 to +17
from aiter.ops.triton.batched_gemm_a16wfp4 import batched_gemm_a16wfp4


⚠️ [ruff] <F401> reported by reviewdog 🐶
aiter.ops.triton.batched_gemm_a16wfp4.batched_gemm_a16wfp4 imported but unused

Suggested change
from aiter.ops.triton.batched_gemm_a16wfp4 import batched_gemm_a16wfp4

Comment on lines +18 to +21
from aiter.ops.triton.batched_gemm_a8w8_a_per_token_group_prequant_w_per_batched_tensor_quant import ( # noqa: E501 # isort: skip
batched_gemm_a8w8_a_per_token_group_prequant_w_per_batched_tensor_quant as _aiter_triton_fp8_bmm,
)
from aiter.mla import mla_decode_fwd

⚠️ [ruff] <F401> reported by reviewdog 🐶
aiter.ops.triton.batched_gemm_a8w8_a_per_token_group_prequant_w_per_batched_tensor_quant.batched_gemm_a8w8_a_per_token_group_prequant_w_per_batched_tensor_quant imported but unused

Suggested change
from aiter.ops.triton.batched_gemm_a8w8_a_per_token_group_prequant_w_per_batched_tensor_quant import ( # noqa: E501 # isort: skip
batched_gemm_a8w8_a_per_token_group_prequant_w_per_batched_tensor_quant as _aiter_triton_fp8_bmm,
)
from aiter.mla import mla_decode_fwd
from aiter.mla import mla_decode_fwd

Comment on lines +21 to +22
from aiter.mla import mla_decode_fwd
from aiter import (

⚠️ [ruff] <F401> reported by reviewdog 🐶
aiter.mla.mla_decode_fwd imported but unused

Suggested change
from aiter.mla import mla_decode_fwd
from aiter import (
from aiter import (

Comment on lines +22 to +29
from aiter import (
    fused_qk_rope_concat_and_cache_mla,
    cp_gather_indexer_k_quant_cache,
    dtypes,
    indexer_k_quant_and_cache,
    top_k_per_row_decode,
    top_k_per_row_prefill,
)

⚠️ [ruff] <F401> reported by reviewdog 🐶
aiter.fused_qk_rope_concat_and_cache_mla imported but unused

Suggested change
from aiter import (
    fused_qk_rope_concat_and_cache_mla,
    cp_gather_indexer_k_quant_cache,
    dtypes,
    indexer_k_quant_and_cache,
    top_k_per_row_decode,
    top_k_per_row_prefill,
)
from aiter import (
    cp_gather_indexer_k_quant_cache,
    dtypes,
    indexer_k_quant_and_cache,
    top_k_per_row_decode,
)

Comment on lines +22 to +29
from aiter import (
    fused_qk_rope_concat_and_cache_mla,
    cp_gather_indexer_k_quant_cache,
    dtypes,
    indexer_k_quant_and_cache,
    top_k_per_row_decode,
    top_k_per_row_prefill,
)

⚠️ [ruff] <F401> reported by reviewdog 🐶
aiter.top_k_per_row_prefill imported but unused

Suggested change
from aiter import (
    fused_qk_rope_concat_and_cache_mla,
    cp_gather_indexer_k_quant_cache,
    dtypes,
    indexer_k_quant_and_cache,
    top_k_per_row_decode,
    top_k_per_row_prefill,
)
from aiter import (
    cp_gather_indexer_k_quant_cache,
    dtypes,
    indexer_k_quant_and_cache,
    top_k_per_row_decode,
)

"""vLLM-facing sparse MLA backend surface for ATOM attention layers."""

@staticmethod
def get_builder_cls() -> Type["AiterMLASparseMetadataBuilder"]:

⚠️ [ruff] <F821> reported by reviewdog 🐶
Undefined name AiterMLASparseMetadataBuilder

"""vLLM-facing sparse MLA indexer backend surface."""

@staticmethod
def get_builder_cls() -> Type["AiterMLASparseIndexerMetadataBuilder"]:

⚠️ [ruff] <F821> reported by reviewdog 🐶
Undefined name AiterMLASparseIndexerMetadataBuilder
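
A minimal sketch of one way to clear both F821 findings, assuming the builder classes are importable somewhere in the ATOM tree; the atom.model_ops.attention_mla_sparse path and the AiterMLASparseBackend class name below are illustrative, not the PR's actual layout:

from typing import TYPE_CHECKING, Type

if TYPE_CHECKING:
    # Illustrative import path; point this at wherever the builders are
    # defined so the string annotations resolve for type checkers and ruff.
    from atom.model_ops.attention_mla_sparse import (
        AiterMLASparseIndexerMetadataBuilder,
        AiterMLASparseMetadataBuilder,
    )


class AiterMLASparseBackend:
    """vLLM-facing sparse MLA backend surface for ATOM attention layers."""

    @staticmethod
    def get_builder_cls() -> Type["AiterMLASparseMetadataBuilder"]:
        # Import at call time to keep module import light.
        from atom.model_ops.attention_mla_sparse import AiterMLASparseMetadataBuilder

        return AiterMLASparseMetadataBuilder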

divide the atom-vllm metadata from atom metadata

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
zejunchen-zejun force-pushed the zejun/refact_attn_0511 branch from 3b061e9 to b832aee on May 15, 2026 05:30
zejunchen-zejun changed the title from "[feat][Attention Refactor] Reconstruct the Attention arch" to "[feat][ATOM-vLLM][Attention Refactor] Reconstruct the Attention arch" on May 15, 2026
zejunchen-zejun changed the title from "[feat][ATOM-vLLM][Attention Refactor] Reconstruct the Attention arch" to "[feat][ATOM-vLLM][Attention Refactor] Reconstruct the Attention Arch" on May 15, 2026