Align CPU attention KV cache layout hint with physical HND layout#1

Draft
Copilot wants to merge 2 commits into main from copilot/override-kv-cache-layout-hnd
Conversation

Copilot AI commented May 15, 2026

CPUAttentionBackend stores KV cache in HND order ((2, num_blocks, num_kv_heads, block_size, head_size)), but the backend did not declare a required layout, allowing default NHD hints to leak through. This PR makes CPU backend layout requirements explicit so selector-level layout configuration matches actual cache memory layout.

  • Backend contract update

    • Added CPUAttentionBackend.get_required_kv_cache_layout(), which returns "HND".
    • Imported KVCacheLayoutType in cpu_attn.py to keep the method signature aligned with other backends.
  • Coverage for the new backend requirement

    • Added a focused CPU attention test asserting:
      • CPUAttentionBackend.get_required_kv_cache_layout() == "HND".
  • Pattern consistency with existing backends

    • Follows the same selector-driven required-layout mechanism already used by FlashInferBackend (SM100), FlashInferMLABackend, and FlashInferMLASparseBackend.
@classmethod
def get_required_kv_cache_layout(cls) -> "KVCacheLayoutType | None":
    return "HND"
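
The focused test described above might look roughly like the following sketch. It does not import vLLM: `StubCPUAttentionBackend` is a hypothetical stand-in for the real class, the test function name is invented, and `KVCacheLayoutType` is assumed to be a `Literal` alias over the two layout names.

```python
from typing import Literal, Optional

# Assumption: a stand-in for vLLM's KVCacheLayoutType alias.
KVCacheLayoutType = Literal["NHD", "HND"]


class StubCPUAttentionBackend:
    """Hypothetical stand-in for CPUAttentionBackend; not the real class."""

    @staticmethod
    def get_kv_cache_shape(
        num_blocks: int, block_size: int, num_kv_heads: int, head_size: int
    ) -> tuple:
        # Same axis order as the shape quoted in this PR.
        return 2, num_blocks, num_kv_heads, block_size, head_size

    @classmethod
    def get_required_kv_cache_layout(cls) -> Optional[KVCacheLayoutType]:
        return "HND"


def test_cpu_backend_requires_hnd() -> None:
    # The declared requirement is HND.
    assert StubCPUAttentionBackend.get_required_kv_cache_layout() == "HND"
    # The heads axis (index 2) precedes the block_size axis (index 3).
    shape = StubCPUAttentionBackend.get_kv_cache_shape(
        num_blocks=4, block_size=16, num_kv_heads=8, head_size=64
    )
    assert shape == (2, 4, 8, 16, 64)


test_cpu_backend_requires_hnd()
```

Checking the shape alongside the declared layout keeps the two from drifting apart silently if one of them changes later.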

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

  • astral.sh
    • Triggering command: /usr/bin/curl curl -LsSf REDACTED (dns block)

Original prompt

Problem

In CPUAttentionBackend, the physical KV cache layout is HND as seen from get_kv_cache_shape:

return 2, num_blocks, num_kv_heads, block_size, head_size

The shape is (2, num_blocks, num_kv_heads, block_size, head_size): num_kv_heads (H) comes before block_size (N), which is the HND layout.

However, get_kv_cache_layout() returns "NHD" by default (when no override or env var is set), causing inconsistency between the actual memory layout and the reported layout hint.
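
To make the mismatch concrete, here is a small illustrative sketch that reads the layout string off the axis order of the cache shape and compares it with the default hint. The helper `layout_from_axes` and the axis-name mapping are hypothetical and exist only for this example; they are not part of vLLM.

```python
# Axis order taken from get_kv_cache_shape as quoted above.
CPU_KV_CACHE_AXES = ("kv", "num_blocks", "num_kv_heads", "block_size", "head_size")

# Hypothetical mapping from axis name to layout letter.
AXIS_TO_LETTER = {"num_kv_heads": "H", "block_size": "N", "head_size": "D"}


def layout_from_axes(axes: tuple) -> str:
    """Read off the layout string from the order of the H, N and D axes."""
    return "".join(AXIS_TO_LETTER[a] for a in axes if a in AXIS_TO_LETTER)


physical = layout_from_axes(CPU_KV_CACHE_AXES)
default_hint = "NHD"  # the default reported hint, per the description above

print(physical)                  # HND
print(physical == default_hint)  # False
```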

Fix

Override get_required_kv_cache_layout in CPUAttentionBackend to return "HND". The attention selector (vllm/v1/attention/selector.py) already checks this method after selecting a backend and calls set_kv_cache_layout() accordingly. This is the same pattern used by FlashInferBackend (SM100), FlashInferMLABackend, and FlashInferMLASparseBackend.

In vllm/v1/attention/backends/cpu_attn.py, add to the CPUAttentionBackend class:

@classmethod
def get_required_kv_cache_layout(cls) -> "KVCacheLayoutType | None":
    return "HND"

You'll need to import KVCacheLayoutType from vllm.v1.attention.backends.utils.
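
The selector-driven mechanism can be sketched in isolation as follows. This is a simplified model under stated assumptions, not vLLM's implementation: the module-level override, the stub backend classes, and `apply_required_layout` are hypothetical, and the environment-variable path is omitted entirely.

```python
from typing import Literal, Optional

# Assumption: a stand-in for vLLM's KVCacheLayoutType alias.
KVCacheLayoutType = Literal["NHD", "HND"]

# Simplified global override; the real utilities may behave differently.
_layout_override: Optional[KVCacheLayoutType] = None


def set_kv_cache_layout(layout: KVCacheLayoutType) -> None:
    global _layout_override
    _layout_override = layout


def get_kv_cache_layout() -> KVCacheLayoutType:
    # Falls back to "NHD" when nothing was set (env var handling omitted).
    return _layout_override if _layout_override is not None else "NHD"


class StubBackend:
    """Hypothetical backend that declares no layout requirement."""

    @classmethod
    def get_required_kv_cache_layout(cls) -> Optional[KVCacheLayoutType]:
        return None


class StubCPUBackend(StubBackend):
    """Hypothetical CPU backend that overrides the requirement."""

    @classmethod
    def get_required_kv_cache_layout(cls) -> Optional[KVCacheLayoutType]:
        return "HND"


def apply_required_layout(backend: type) -> None:
    # Mirrors the selector step: apply the layout only if one is required.
    required = backend.get_required_kv_cache_layout()
    if required is not None:
        set_kv_cache_layout(required)


apply_required_layout(StubBackend)
hint_without_requirement = get_kv_cache_layout()  # stays at the default

apply_required_layout(StubCPUBackend)
hint_with_requirement = get_kv_cache_layout()  # now follows the backend
```

Declaring the requirement on the backend keeps the layout decision next to the code that defines the cache shape, so the selector needs no CPU-specific branch.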

The following is the prior conversation context from the user's chat exploration (may be truncated):

User: With CPU attention, the KV layout is physically HND, but the layout hint returns NHD. Is there a way to specify it explicitly?

Assistant: Now let me look at the CPU attention backend to see how it handles the layout:

Assistant: I understand the problem now. Let me summarize:

Problem analysis

The physical KV cache layout for CPU attention is HND, as get_kv_cache_shape shows:

    @staticmethod
    def get_kv_cache_shape(
        num_blocks: int,
        block_size: int,
        num_kv_heads: int,
        head_size: int,
        cache_dtype_str: str = "auto",
    ) -> tuple[int, ...]:
        return 2, num_blocks, num_kv_heads, block_size, head_size

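The shape is (2, num_blocks, num_kv_heads, block_size, head_size): H (num_kv_heads) comes before N (block_size), which is the HND layout.

However, when no layout is set explicitly, get_kv_cache_layout() takes the default path (get_kv_connector_cache_layout()), which usually returns "NHD", so the hint and the physical layout disagree.

Solution

Following the XPU platform's approach, set the layout explicitly in CpuPlatform.get_attn_backend_cls():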

        from vllm.v1.attention.backends.utils import set_kv_cache_layout

        set_kv_cache_layout("NHD")
        logger.info(
            "Setting VLLM_KV_CACHE_LAYOUT to 'NHD' for XPU; "
            "only NHD layout is supported by XPU attention kernels."
        )

For the CPU platform, you can add the following to the get_attn_backend_cls method in vllm/platforms/cpu.py:

from vllm.v1.attention.backends.utils import set_kv_cache_layout

set_kv_cache_layout("HND")
logger.info(
    "Setting VLLM_KV_CACHE_LAYOUT to 'HND' for CPU; "
    "CPU attention kernels use HND layout."
)

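Alternatively, you can set the environment variable VLLM_KV_CACHE_LAYOUT=HND to specify the layout explicitly, with no code change.

Another option is to have CPUAttentionBackend implement the get_required_kv_cache_layout() class method (similar to how the benchmark runner handles FlashInfer/TRTLLM) and return "HND", so that the framework calls set_kv_cache_layout automatically during initialization.

Would you like me to create a PR to fix this?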

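User: Isn't implementing it in CPUAttentionBackend the right approach? The layout is tied to the attention backend, not to the CPU.

A: You're right! The code shows that the framework already has this mechanism. In vllm/v1/attention/selector.py: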

    # Adjust kv cache layout if the selected backend requires a specific one
    required_layout = backend.get_required_kv_cache_layout()
    if required_layout is not None:
        from vllm.v1.attention.backends.utils import set_kv_cache_layout

        set_kv_cache_layout(required_layout)
        logger.info(
            "Using %s KV cache layout for %s backend.",
            required_layout,
            backend.get_name(),
        )

After choosing a backend, the selector automatically checks get_required_kv_cache_layout() and applies the result. Other backends such as FlashInferMLABackend, FlashInferMLASparseBackend, and FlashInferBackend (SM100) do the same.

So the correct fix is to override this method in CPUAttentionBackend:

    @classmethod
    def get_required_kv_cache_layout(cls) -> "KVCacheLayoutType | None":
        return "HND"

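get_kv_cache_shape returns (2, num_blocks, num_kv_heads, block_size, head_size), with H before N, so the physical layout is HND. With this override, the selector automatically sets the global layout to HND whenever it picks the CPU attention backend, and the hint matches the actual memory layout.

This is more sensible than setting it in CpuPlatform, because the layout is a property of the attention kernel, not of the platform. Would you like me to create a PR for this change?

User: Go ahead and open a PR.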

Assistant: I've started creating a PR in hlin99/vllm to fix this issue. The coding agent is working on adding get_required_kv_cache_layout to CPUAttentionBackend that returns "HND", matching the actual physical layout defined in get_kv_cache_shape. You'll receive a notification when...

This pull request was created from Copilot chat.

Agent-Logs-Url: https://github.com/hlin99/vllm/sessions/69ab9065-255f-4c7d-bacf-f17a7c77d8ff

Co-authored-by: hlin99 <73271530+hlin99@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Fix KV cache layout for CPUAttentionBackend" to "Align CPU attention KV cache layout hint with physical HND layout" May 15, 2026
Copilot finished work on behalf of hlin99 May 15, 2026 12:15
Copilot AI requested a review from hlin99 May 15, 2026 12:15