Align CPU attention KV cache layout hint with physical HND layout #1
Draft
Copilot wants to merge 2 commits into
Agent-Logs-Url: https://github.com/hlin99/vllm/sessions/69ab9065-255f-4c7d-bacf-f17a7c77d8ff
Co-authored-by: hlin99 <73271530+hlin99@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Fix KV cache layout for CPUAttentionBackend" to "Align CPU attention KV cache layout hint with physical HND layout" on May 15, 2026
`CPUAttentionBackend` stores the KV cache in HND order (`(2, num_blocks, num_kv_heads, block_size, head_size)`), but the backend did not declare a required layout, so the default NHD hint could leak through. This PR makes the CPU backend's layout requirement explicit so that selector-level layout configuration matches the actual cache memory layout.

Backend contract update
- Add `CPUAttentionBackend.get_required_kv_cache_layout()`, which returns `"HND"`.
- Import `KVCacheLayoutType` in `cpu_attn.py` to keep the method signature aligned with other backends.

Coverage for the new backend requirement
- Add a test asserting `CPUAttentionBackend.get_required_kv_cache_layout() == "HND"`.

Pattern consistency with existing backends
- Follows the pattern already used by `FlashInferBackend` (SM100), `FlashInferMLABackend`, and `FlashInferMLASparseBackend`.
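The diff itself is not reproduced on this page, so the following is only a minimal sketch of the kind of override the description refers to. The classes `AttentionBackendSketch` and `CPUAttentionBackendSketch` are hypothetical stand-ins rather than vLLM classes, `KVCacheLayoutType` is assumed to be a string literal type, and the argument order of `get_kv_cache_shape` is an assumption; only the returned shape and the `"HND"` return value come from the description above.

```python
from typing import Literal, Optional

# Assumption: a local stand-in for vLLM's KVCacheLayoutType.
KVCacheLayoutType = Literal["NHD", "HND"]


class AttentionBackendSketch:
    """Hypothetical base class: by default no layout is required."""

    @classmethod
    def get_required_kv_cache_layout(cls) -> Optional[KVCacheLayoutType]:
        return None


class CPUAttentionBackendSketch(AttentionBackendSketch):
    """Hypothetical stand-in for CPUAttentionBackend."""

    @staticmethod
    def get_kv_cache_shape(
        num_blocks: int, block_size: int, num_kv_heads: int, head_size: int
    ) -> tuple[int, ...]:
        # H (num_kv_heads) precedes N (block_size): physically HND.
        return (2, num_blocks, num_kv_heads, block_size, head_size)

    @classmethod
    def get_required_kv_cache_layout(cls) -> Optional[KVCacheLayoutType]:
        # Declare the layout that matches the shape returned above.
        return "HND"
```

A unit test along the lines of the coverage item would then reduce to a single assertion on the class method's return value.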
Firewall rules blocked me from connecting to one or more addresses.
I tried to connect to the following addresses, but was blocked by firewall rules:
- astral.sh (dns block), triggered by `/usr/bin/curl curl -LsSf REDACTED`
If you need me to access, download, or install something from one of these locations, you can either:
Original prompt
Problem

In `CPUAttentionBackend`, the physical KV cache layout is HND, as seen from `get_kv_cache_shape`. The shape is `(2, num_blocks, num_kv_heads, block_size, head_size)`: `num_kv_heads` (H) comes before `block_size` (N), which is the HND layout.

However, `get_kv_cache_layout()` returns "NHD" by default (when no override or env var is set), so the reported layout hint is inconsistent with the actual memory layout.

Fix

Override `get_required_kv_cache_layout` in `CPUAttentionBackend` to return `"HND"`. The attention selector (`vllm/v1/attention/selector.py`) already checks this method after selecting a backend and calls `set_kv_cache_layout()` accordingly. This is the same pattern used by `FlashInferBackend` (SM100), `FlashInferMLABackend`, and `FlashInferMLASparseBackend`.

In `vllm/v1/attention/backends/cpu_attn.py`, add this override to the `CPUAttentionBackend` class. You'll need to import `KVCacheLayoutType` from `vllm.v1.attention.backends.utils`.

The following is the prior conversation context from the user's chat exploration (may be truncated):
User: With CPU attention, the KV layout is physically HND, but the layout hint returns NHD. Is there a way to specify it explicitly?
Assistant: Now let me look at the CPU attention backend to see how it handles the layout:
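Assistant: Now I understand the problem. Let me summarize:

Problem analysis

The physical KV cache layout for CPU attention is HND, as `get_kv_cache_shape` shows. The shape is `(2, num_blocks, num_kv_heads, block_size, head_size)`: H (`num_kv_heads`) comes before N (`block_size`), which is the HND layout. But when no layout is set explicitly, `get_kv_cache_layout()` takes the default path (`get_kv_connector_cache_layout()`), which usually returns "NHD", so the two are inconsistent.

Solutions

Following what the XPU platform does, the layout can be set explicitly in `CpuPlatform.get_attn_backend_cls()`. For the CPU platform, that means adding the explicit layout setting to the `get_attn_backend_cls` method in `vllm/platforms/cpu.py`.

Alternatively, you can specify it explicitly through the environment variable `VLLM_KV_CACHE_LAYOUT=HND`, with no code change.

Another option is to have `CPUAttentionBackend` implement the `get_required_kv_cache_layout()` class method (similar to how the benchmark runner handles FlashInfer/TRTLLM) and return `"HND"`, so that the upper-level framework automatically calls `set_kv_cache_layout` during initialization.

Would you like me to create a PR to fix this?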
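To make the mismatch in this message concrete, here is a small sketch of where one (head, token) vector sits inside a single block under each layout. The helper names and the example sizes are illustrative assumptions, not taken from vLLM; the only point is that the same index pair maps to different offsets, which is why a hint that disagrees with the physical layout matters.

```python
def row_major_strides(shape):
    """Element strides of a contiguous (row-major) array with this shape."""
    strides = []
    step = 1
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


def element_offset(layout, head, token, num_kv_heads, block_size, head_size):
    """Offset of the first element of one (head, token) vector in a block."""
    if layout == "HND":
        # Block laid out as (num_kv_heads, block_size, head_size).
        h_stride, n_stride, _ = row_major_strides(
            (num_kv_heads, block_size, head_size)
        )
    elif layout == "NHD":
        # Block laid out as (block_size, num_kv_heads, head_size).
        n_stride, h_stride, _ = row_major_strides(
            (block_size, num_kv_heads, head_size)
        )
    else:
        raise ValueError(f"unknown layout: {layout}")
    return head * h_stride + token * n_stride


# Illustrative sizes only.
H, N, D = 8, 16, 64
hnd_offset = element_offset("HND", head=1, token=0, num_kv_heads=H, block_size=N, head_size=D)
nhd_offset = element_offset("NHD", head=1, token=0, num_kv_heads=H, block_size=N, head_size=D)
```

With these sizes, moving to the next head skips a whole head's worth of tokens in HND (`block_size * head_size` elements) but only one vector in NHD (`head_size` elements).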
User: Isn't implementing it in `CPUAttentionBackend` the right approach? The layout is really tied to the attention backend, not to the CPU.
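Assistant: You're right! The code shows that the framework already has this mechanism. In `vllm/v1/attention/selector.py`, after choosing a backend the selector automatically checks `get_required_kv_cache_layout()` and sets the layout. Other backends such as `FlashInferMLABackend`, `FlashInferMLASparseBackend`, and `FlashInferBackend` (SM100) all do this.

So the correct fix is to override this method in `CPUAttentionBackend`, because `get_kv_cache_shape` returns `(2, num_blocks, num_kv_heads, block_size, head_size)`: H comes before N, so the layout is physically HND. With the override, the selector automatically sets the global layout to HND whenever it picks the CPU attention backend, and the hint matches the actual memory layout.

This is more reasonable than setting it in `CpuPlatform`, because the layout really is a property of the attention kernel, not of the platform.

Would you like me to create a PR for this change?

User: Go ahead and open a PR.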
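The selector behavior described in this exchange can be sketched as follows. This is not the code in `vllm/v1/attention/selector.py`: every function and class here is a simplified, hypothetical stand-in defined locally, the environment-variable path is deliberately left out, and the "NHD" fallback only mirrors the default mentioned earlier in the conversation.

```python
from typing import Literal, Optional

# Assumption: a local stand-in for vLLM's KVCacheLayoutType.
KVCacheLayoutType = Literal["NHD", "HND"]

# Hypothetical module-level override, standing in for the global layout state.
_layout_override: Optional[KVCacheLayoutType] = None


def set_kv_cache_layout(layout: KVCacheLayoutType) -> None:
    """Record an explicit layout (stand-in for the real setter)."""
    global _layout_override
    _layout_override = layout


def get_kv_cache_layout() -> KVCacheLayoutType:
    """Return the explicit layout if one was set, otherwise the NHD default."""
    return _layout_override if _layout_override is not None else "NHD"


class GenericBackendSketch:
    """Hypothetical backend that declares no layout requirement."""

    @classmethod
    def get_required_kv_cache_layout(cls) -> Optional[KVCacheLayoutType]:
        return None


class CpuBackendSketch(GenericBackendSketch):
    """Hypothetical backend whose cache is physically HND."""

    @classmethod
    def get_required_kv_cache_layout(cls) -> Optional[KVCacheLayoutType]:
        return "HND"


def select_backend(backend_cls):
    """After choosing a backend, apply its required layout if it has one."""
    required = backend_cls.get_required_kv_cache_layout()
    if required is not None:
        set_kv_cache_layout(required)
    return backend_cls


layout_before = get_kv_cache_layout()
select_backend(GenericBackendSketch)
layout_after_generic = get_kv_cache_layout()
select_backend(CpuBackendSketch)
layout_after_cpu = get_kv_cache_layout()
```

Keeping the requirement on the backend class means the platform code needs no layout logic of its own, which is the design point the user raises above.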
Assistant: I've started creating a PR in `hlin99/vllm` to fix this issue. The coding agent is working on adding `get_required_kv_cache_layout` to `CPUAttentionBackend` that returns `"HND"`, matching the actual physical layout defined in `get_kv_cache_shape`. You'll receive a notification when...

This pull request was created from Copilot chat.