
[feat][ATOM-vLLM][Attention Refactor] Reconstruct the Attention Arch #750

Draft

zejunchen-zejun wants to merge 10 commits into main from zejun/refact_attn_0511

Conversation

@zejunchen-zejun (Collaborator) commented May 11, 2026

This PR refactors the attention architecture for ATOM-vLLM.

Here is the RFC: #758

Here are the validation results:
ATOM-vLLM CI

DeepSeek-R1-FP8 TP8 / atom-vllm CI: 0.9484457922668689 >= 0.93
gpt-oss-120b TP1 / atom-vllm CI: FAILED - aiter JIT header missing
Kimi-K2-Thinking-MXFP4 TP4 / atom-vllm CI: FAILED - server not ready
Qwen3.5-35B-A3B-FP8 TP2 / atom-vllm CI: 0.7862016679302501 >= 0.76

ATOM-vLLM nightly validation
Qwen3-235B-A22B-Instruct-2507-FP8 TP8+EP8 / atom-vllm nightly: FAILED - max_qlen=None
Qwen3-Next-80B-A3B-Instruct-FP8 TP1 / atom-vllm nightly: FAILED - tuple index error
Qwen3-Next-80B-A3B-Instruct-FP8 TP4 / atom-vllm nightly: FAILED - tuple index error
Qwen3.5-397B-A17B-FP8 TP8 / atom-vllm nightly: 0.8688400303260045 >= 0.83
Qwen3.5-397B-A17B TP8 / atom-vllm nightly: 0.8688400303260045 >= 0.83
Qwen3.5-397B-A17B-MXFP4 TP4 / atom-vllm nightly: 0.8468536770280516 >= 0.83
Meta-Llama-3.1-405B-Instruct-FP8 TP8 / atom-vllm nightly: FAILED - fp8 gemm dtype mismatch
Llama-3.1-8B-Instruct TP1 / atom-vllm nightly: FAILED - max_qlen=None
Kimi-K2-Thinking-MXFP4 TP8 / atom-vllm nightly: 0.931008339651251 >= 0.90
Kimi-K2.5-MXFP4 TP8 / atom-vllm nightly: FAILED - vision config missing hidden_size
DeepSeek-R1-FP8 TP8 / atom-vllm nightly: 0.9507202426080363 >= 0.93
DeepSeek-R1-0528-MXFP4 TP8 / atom-vllm nightly: 0.9370735405610311 >= 0.93
DeepSeek-V3.2-FP8 TP8 / atom-vllm nightly: 0.9492039423805914 >= 0.93
gpt-oss-120b TP1 / atom-vllm nightly: FAILED - aiter JIT header missing
gpt-oss-120b TP2 / atom-vllm nightly: FAILED - aiter JIT header missing
GLM-5.1-FP8 TP8 / atom-vllm nightly: 0.9423805913570887 >= 0.88

ATOM CI and nightly:
Meta-Llama-3-8B-Instruct / native atom: FAILED - local model missing
DeepSeek-R1-0528 / native atom: 0.9522365428354814 >= 0.94
DeepSeek-V4-Pro / native atom: 0.9552691432903715 >= 0.92
DeepSeek-R1-0528 MTP / native atom: 0.9461713419257013 >= 0.94
gpt-oss-120b / native atom: FAILED - no result JSON
Llama-3.3-70B-Instruct-MXFP4-Preview / native atom: FAILED - local model missing
DeepSeek-R1-0528-FP4 / native atom: 0.9492039423805914 >= 0.93
DeepSeek-R1-0528-FP4 MTP / native atom: 0.9401061410159212 >= 0.93
Qwen3-235B-A22B-Instruct-2507-FP8 / native atom: 0.8953752843062927 >= 0.87
Qwen3-Next-80B-A3B-Thinking / native atom: 0.6732373009855952 >= 0.65
gpt-oss-120b 2GPU / native atom: FAILED - no result JSON
Qwen3-235B-A22B-Instruct-2507-MXFP4 / native atom: 0.8764215314632298 >= 0.87
Kimi-K2.5-MXFP4 / native atom: 0.9423805913570887 >= 0.92
Kimi-K2.5-MXFP4 Eagle3 / native atom: 0.935557240333586 >= 0.91
GLM-5-FP8 / native atom: 0.9416224412433661 >= 0.93
GLM-5.1-FP8 / native atom: 0.8893100833965125 >= 0.875
GLM-5.1-MXFP4 MTP / native atom: 0.8809704321455648 >= 0.87
GLM-5.1-MXFP4 / native atom: 0.88855193328279 >= 0.87
Qwen3.5-397B-A17B-FP8 / native atom: 0.8786959818043972 >= 0.85
Qwen3.5-397B-A17B-FP8 MTP / native atom: 0.8673237300985596 >= 0.85
Qwen3.5-397B-A17B-MXFP4 / native atom: 0.8620166793025019 >= 0.835
Qwen3.5-397B-A17B-MXFP4 MTP / native atom: FAILED - 0.8339651250947687 < 0.835
MiniMax-M2.5 / native atom: 0.9317664897649734 >= 0.92
MiniMax-M2.5-MXFP4 / native atom: 0.9196360879454132 >= 0.91

from atom.plugin.prepare import is_plugin_mode
from atom.utils import CpuGpuBuffer
from atom.utils.block_convert import (
block_table_convert_triton,

⚠️ [ruff] <F401> reported by reviewdog 🐶
atom.utils.block_convert.block_table_convert_triton imported but unused

reviewdog suggestion error: GitHub comment range and suggestion line range must be same. L20-L20 vs. L19-L22

Comment on lines +10 to +11
from aiter.ops.triton.batched_gemm_a16wfp4 import batched_gemm_a16wfp4


⚠️ [ruff] <F401> reported by reviewdog 🐶
aiter.ops.triton.batched_gemm_a16wfp4.batched_gemm_a16wfp4 imported but unused

Suggested change
from aiter.ops.triton.batched_gemm_a16wfp4 import batched_gemm_a16wfp4

Comment on lines +12 to +15
from aiter.ops.triton.batched_gemm_a8w8_a_per_token_group_prequant_w_per_batched_tensor_quant import ( # noqa: E501 # isort: skip
batched_gemm_a8w8_a_per_token_group_prequant_w_per_batched_tensor_quant as _aiter_triton_fp8_bmm,
)


⚠️ [ruff] <F401> reported by reviewdog 🐶
aiter.ops.triton.batched_gemm_a8w8_a_per_token_group_prequant_w_per_batched_tensor_quant.batched_gemm_a8w8_a_per_token_group_prequant_w_per_batched_tensor_quant imported but unused

Suggested change
from aiter.ops.triton.batched_gemm_a8w8_a_per_token_group_prequant_w_per_batched_tensor_quant import ( # noqa: E501 # isort: skip
batched_gemm_a8w8_a_per_token_group_prequant_w_per_batched_tensor_quant as _aiter_triton_fp8_bmm,
)

Comment on lines +16 to +17
from functools import partial as functools_partial
from atom.model_ops.linear import use_triton_gemm

⚠️ [ruff] <F401> reported by reviewdog 🐶
functools.partial imported but unused

Suggested change
from functools import partial as functools_partial
from atom.model_ops.linear import use_triton_gemm
from atom.model_ops.linear import use_triton_gemm

if use_triton_gemm():
    try:
        from aiter.ops.triton.fused_gemm_a8w8_blockscale_split_cat import (
            fused_gemm_a8w8_blockscale_preshuffle_split_cat,

⚠️ [ruff] <F401> reported by reviewdog 🐶
aiter.ops.triton.fused_gemm_a8w8_blockscale_split_cat.fused_gemm_a8w8_blockscale_preshuffle_split_cat imported but unused; consider using importlib.util.find_spec to test for availability

            fused_gemm_a8w8_blockscale_preshuffle_split_cat,
        )
        from aiter.ops.triton.fused_gemm_afp4wfp4_split_cat import (
            fused_gemm_afp4wfp4_preshuffle_split_cat,

⚠️ [ruff] <F401> reported by reviewdog 🐶
aiter.ops.triton.fused_gemm_afp4wfp4_split_cat.fused_gemm_afp4wfp4_preshuffle_split_cat imported but unused; consider using importlib.util.find_spec to test for availability
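
For context, a minimal sketch of the find_spec pattern ruff hints at for both of these imports, assuming the try block exists only to probe kernel availability; the _kernels_available helper and _HAS_SPLIT_CAT_KERNELS name are illustrative, not part of the PR:

import importlib.util

from atom.model_ops.linear import use_triton_gemm


def _kernels_available(*modules: str) -> bool:
    # find_spec raises ModuleNotFoundError when a parent package (e.g. aiter)
    # is absent, so treat that the same as "not available".
    try:
        return all(importlib.util.find_spec(m) is not None for m in modules)
    except ModuleNotFoundError:
        return False


# Probe for the optional aiter Triton kernels without binding names that
# would otherwise sit unused at module scope (F401).
_HAS_SPLIT_CAT_KERNELS = use_triton_gemm() and _kernels_available(
    "aiter.ops.triton.fused_gemm_a8w8_blockscale_split_cat",
    "aiter.ops.triton.fused_gemm_afp4wfp4_split_cat",
)

# The kernels themselves would then be imported lazily at the call sites
# guarded by _HAS_SPLIT_CAT_KERNELS.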

Comment on lines +8 to +9
from aiter import get_mla_metadata_v1
from atom.utils.block_convert import kv_indices_generate_triton

⚠️ [ruff] <F401> reported by reviewdog 🐶
aiter.get_mla_metadata_v1 imported but unused

Suggested change
from aiter import get_mla_metadata_v1
from atom.utils.block_convert import kv_indices_generate_triton
from atom.utils.block_convert import kv_indices_generate_triton

Comment on lines +9 to +10
from atom.utils.block_convert import kv_indices_generate_triton
from atom.utils.forward_context import Context

⚠️ [ruff] <F401> reported by reviewdog 🐶
atom.utils.block_convert.kv_indices_generate_triton imported but unused

Suggested change
from atom.utils.block_convert import kv_indices_generate_triton
from atom.utils.forward_context import Context
from atom.utils.forward_context import Context

Comment on lines +11 to +12
from atom.model_ops.attention_mla import MLAAttention, _MLA_MIN_HEADS


⚠️ [ruff] <F401> reported by reviewdog 🐶
atom.model_ops.attention_mla.MLAAttention imported but unused

Suggested change
from atom.model_ops.attention_mla import MLAAttention, _MLA_MIN_HEADS

Comment on lines +11 to +12
from atom.model_ops.attention_mla import MLAAttention, _MLA_MIN_HEADS


⚠️ [ruff] <F401> reported by reviewdog 🐶
atom.model_ops.attention_mla._MLA_MIN_HEADS imported but unused

Suggested change
from atom.model_ops.attention_mla import MLAAttention, _MLA_MIN_HEADS

Comment on lines +16 to +17
from aiter.ops.triton.batched_gemm_a16wfp4 import batched_gemm_a16wfp4


⚠️ [ruff] <F401> reported by reviewdog 🐶
aiter.ops.triton.batched_gemm_a16wfp4.batched_gemm_a16wfp4 imported but unused

Suggested change
from aiter.ops.triton.batched_gemm_a16wfp4 import batched_gemm_a16wfp4

Comment on lines +18 to +21
from aiter.ops.triton.batched_gemm_a8w8_a_per_token_group_prequant_w_per_batched_tensor_quant import ( # noqa: E501 # isort: skip
batched_gemm_a8w8_a_per_token_group_prequant_w_per_batched_tensor_quant as _aiter_triton_fp8_bmm,
)
from aiter.mla import mla_decode_fwd

⚠️ [ruff] <F401> reported by reviewdog 🐶
aiter.ops.triton.batched_gemm_a8w8_a_per_token_group_prequant_w_per_batched_tensor_quant.batched_gemm_a8w8_a_per_token_group_prequant_w_per_batched_tensor_quant imported but unused

Suggested change
from aiter.ops.triton.batched_gemm_a8w8_a_per_token_group_prequant_w_per_batched_tensor_quant import ( # noqa: E501 # isort: skip
batched_gemm_a8w8_a_per_token_group_prequant_w_per_batched_tensor_quant as _aiter_triton_fp8_bmm,
)
from aiter.mla import mla_decode_fwd
from aiter.mla import mla_decode_fwd

Comment on lines +21 to +22
from aiter.mla import mla_decode_fwd
from aiter import (

⚠️ [ruff] <F401> reported by reviewdog 🐶
aiter.mla.mla_decode_fwd imported but unused

Suggested change
from aiter.mla import mla_decode_fwd
from aiter import (
from aiter import (

Comment on lines +22 to +29
from aiter import (
    fused_qk_rope_concat_and_cache_mla,
    cp_gather_indexer_k_quant_cache,
    dtypes,
    indexer_k_quant_and_cache,
    top_k_per_row_decode,
    top_k_per_row_prefill,
)

⚠️ [ruff] <F401> reported by reviewdog 🐶
aiter.fused_qk_rope_concat_and_cache_mla imported but unused

Suggested change
from aiter import (
    fused_qk_rope_concat_and_cache_mla,
    cp_gather_indexer_k_quant_cache,
    dtypes,
    indexer_k_quant_and_cache,
    top_k_per_row_decode,
    top_k_per_row_prefill,
)
from aiter import (
    cp_gather_indexer_k_quant_cache,
    dtypes,
    indexer_k_quant_and_cache,
    top_k_per_row_decode,
)

Comment on lines +22 to +29
from aiter import (
    fused_qk_rope_concat_and_cache_mla,
    cp_gather_indexer_k_quant_cache,
    dtypes,
    indexer_k_quant_and_cache,
    top_k_per_row_decode,
    top_k_per_row_prefill,
)

⚠️ [ruff] <F401> reported by reviewdog 🐶
aiter.top_k_per_row_prefill imported but unused

Suggested change
from aiter import (
    fused_qk_rope_concat_and_cache_mla,
    cp_gather_indexer_k_quant_cache,
    dtypes,
    indexer_k_quant_and_cache,
    top_k_per_row_decode,
    top_k_per_row_prefill,
)
from aiter import (
    cp_gather_indexer_k_quant_cache,
    dtypes,
    indexer_k_quant_and_cache,
    top_k_per_row_decode,
)

"""vLLM-facing sparse MLA backend surface for ATOM attention layers."""

@staticmethod
def get_builder_cls() -> Type["AiterMLASparseMetadataBuilder"]:

⚠️ [ruff] <F821> reported by reviewdog 🐶
Undefined name AiterMLASparseMetadataBuilder

"""vLLM-facing sparse MLA indexer backend surface."""

@staticmethod
def get_builder_cls() -> Type["AiterMLASparseIndexerMetadataBuilder"]:

⚠️ [ruff] <F821> reported by reviewdog 🐶
Undefined name AiterMLASparseIndexerMetadataBuilder
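
A minimal sketch of one way to clear both F821 findings, assuming the builder classes are importable somewhere in the ATOM tree; the atom.model_ops.attention_mla_sparse path and the AiterMLASparseBackend class name below are illustrative, not the PR's actual layout:

from typing import TYPE_CHECKING, Type

if TYPE_CHECKING:
    # Illustrative import path; point this at wherever the builders are
    # defined so the string annotations resolve for type checkers and ruff.
    from atom.model_ops.attention_mla_sparse import (
        AiterMLASparseIndexerMetadataBuilder,
        AiterMLASparseMetadataBuilder,
    )


class AiterMLASparseBackend:
    """vLLM-facing sparse MLA backend surface for ATOM attention layers."""

    @staticmethod
    def get_builder_cls() -> Type["AiterMLASparseMetadataBuilder"]:
        # Import at call time to keep module import light.
        from atom.model_ops.attention_mla_sparse import AiterMLASparseMetadataBuilder

        return AiterMLASparseMetadataBuilder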

divide the atom-vllm metadata from atom metadata

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
zejunchen-zejun force-pushed the zejun/refact_attn_0511 branch from 3b061e9 to b832aee on May 15, 2026 05:30
zejunchen-zejun changed the title from "[feat][Attention Refactor] Reconstruct the Attention arch" to "[feat][ATOM-vLLM][Attention Refactor] Reconstruct the Attention arch" on May 15, 2026
zejunchen-zejun changed the title from "[feat][ATOM-vLLM][Attention Refactor] Reconstruct the Attention arch" to "[feat][ATOM-vLLM][Attention Refactor] Reconstruct the Attention Arch" on May 15, 2026