From 29408d118ab7df51e8fa9b85e3ccaa2ebeb3a944 Mon Sep 17 00:00:00 2001
From: ZhiweiYan-96 <zhiwei.yan@amd.com>
Date: Mon, 11 May 2026 10:16:06 +0000
Subject: [PATCH 1/3] add work log

---
 .../2026-04-30-attn-refractory.md             | 354 +++++++++++++
 ...26-05-07-radix-cache-vs-paged-attention.md | 482 ++++++++++++++++++
 ...26-05-08-kv-indices-address-space-notes.md | 359 +++++++++++++
 .../2026-05-08-metadata-layering-summary.md   | 144 ++++++
 ...6-05-11-attn-refractory-session-summary.md | 398 +++++++++++++++
 .../attn_refractory/decoupling-diagram.md     | 158 ++++++
 .../vllm-integration-architecture.md          | 269 ++++++++++
 7 files changed, 2164 insertions(+)
 create mode 100644 work_log/attn_refractory/2026-04-30-attn-refractory.md
 create mode 100644 work_log/attn_refractory/2026-05-07-radix-cache-vs-paged-attention.md
 create mode 100644 work_log/attn_refractory/2026-05-08-kv-indices-address-space-notes.md
 create mode 100644 work_log/attn_refractory/2026-05-08-metadata-layering-summary.md
 create mode 100644 work_log/attn_refractory/2026-05-11-attn-refractory-session-summary.md
 create mode 100644 work_log/attn_refractory/decoupling-diagram.md
 create mode 100644 work_log/attn_refractory/vllm-integration-architecture.md

diff --git a/work_log/attn_refractory/2026-04-30-attn-refractory.md b/work_log/attn_refractory/2026-04-30-attn-refractory.md
new file mode 100644
index 000000000..f0946299a
--- /dev/null
+++ b/work_log/attn_refractory/2026-04-30-attn-refractory.md
@@ -0,0 +1,354 @@
+# 2026-04-30 ATOM Attention Refactory Session Summary
+
+## 1. 本次会话的核心目标
+
+本次连续讨论的主线是围绕 `ATOM plugin` 中 attention backend 的重构方向展开，重点包括：
+
+- 调研当前 `ATOM plugin` attention backend 的架构问题
+- 评估 `vLLM plugin` 与 `SGLang plugin` 的耦合程度
+- 分析 SGLang plugin 在支持新模型、特别是非传统 MHA/MLA 类型 backend 时会遇到的困难
+- 讨论如何在 plugin 层把 `vLLM` 与 `SGLang` 解耦
+- 讨论 `SGLang` 中 `hybrid/composite backend` 的接入方式
+- 评估三种模式共享 ATOM attention 实现的难度：
+  - `ATOM native/server`
+  - `ATOM vLLM plugin`
+  - `ATOM SGLang plugin`
+
+## 2. 对当前 ATOM plugin attention 架构的主要判断
+
+### 2.1 当前不是一个真正统一的 attention backend 架构
+
+本次调研得到的一个核心结论是：
+
+- `vLLM plugin` 和 `SGLang plugin` 虽然都叫 “ATOM attention”
+- 但它们并不是通过同一套干净的 backend contract 接进来的
+- 更准确地说，是两套宿主接入模型外加多层 patch / adapter / runtime glue 共同构成的系统
+
+### 2.2 当前的主要问题不是底层 kernel，而是 runtime / ownership / patch 层次
+
+现有问题主要集中在：
+
+- 全局状态过重
+  - `_CURRENT_FRAMEWORK`
+  - `current_atom_config`
+  - `ops.Attention`
+- vLLM 与 SGLang 的接入模型差异很大，但还共享 `plugin` 根目录里一批伪公共逻辑
+- 很多扩展点依赖 monkey patch 和 runtime override
+- SGLang 侧对 DeepSeek MLA 已经上卷到 model-level patch，而不是单纯 backend 替换
+
+### 2.3 attention 文件结构的拆分轴不统一
+
+本次对 `plugin` 目录下 attention 相关文件的解读认为，目前至少混杂了三层不同对象：
+
+- host/runtime glue
+- backend runtime core
+- model-specific specialization
+
+尤其在 `SGLang` 侧：
+
+- `radix_attention.py` 更像 adapter
+- `sgl_attn_backend.py` 更像 full-attn runtime backend core
+- `sgl_attention_mla.py` 更像 DeepSeek MLA specialization
+
+而在根目录：
+
+- `attention.py`
+- `attention_mha.py`
+- `attention_mla.py`
+- `attention_mla_sparse.py`
+
+虽然放在 `plugin/` 根目录，但职责上明显更接近 `vLLM plugin` 的 glue / patch 层，而不是真正共享的 plugin runtime core。
+
+## 3. 关于 vLLM plugin 与 SGLang plugin 的耦合
+
+### 3.1 结论：耦合度高
+
+这次讨论中对二者耦合的判断是：
+
+- 不是轻量共享几个工具函数
+- 而是共享了一套 runtime state、attention 抽象、selector 和 plugin-mode 判断方式
+
+关键耦合点包括：
+
+- `atom/plugin/prepare.py`
+- `atom/plugin/register.py`
+- `atom/plugin/config.py`
+- `atom.plugin.prepare.is_vllm()/is_sglang()/is_plugin_mode()`
+- `ops.Attention` 的全局切换
+
+### 3.2 关键判断
+
+真正的问题不是 “目录没有拆开”，而是：
+
+**host runtime 差异没有被限制在 adapter 边界，而是泄漏进了 backend 选择、metadata 组织和全局状态模型。**
+
+## 4. 关于 plugin 层解耦的共识
+
+### 4.1 plugin 层不应该继续保留共享 runtime
+
+讨论最终倾向于一个更激进但更干净的方向：
+
+- `atom/plugin/vllm/` 是一个完整子系统
+- `atom/plugin/sglang/` 是另一个完整子系统
+- `atom/plugin/` 根目录不再承担共享 runtime 中心的角色
+
+共享只保留给更底层的 `ATOM core`，例如：
+
+- `model_ops`
+- kernels
+- loader
+- metadata helpers
+- model-family specialization 中真正 host-agnostic 的部分
+
+### 4.2 “bootstrap” 的定义
+
+本次还明确了 `bootstrap` 在本讨论语境中的含义：
+
+- 不是核心执行逻辑
+- 而是插件被宿主框架加载时的最早期接线/初始化层
+
+也就是：
+
+- 注册扩展点
+- 安装 patch
+- 选择 wrapper / adapter / backend
+- 建立 host 与 plugin 的连接关系
+
+## 5. 已经做过的代码原型
+
+本次会话中，为了验证“按 host 拆入口”的思路，已经做了一批原型代码修改：
+
+### 5.1 新增的 bootstrap / prepare / register 原型
+
+- `atom/plugin/sglang/bootstrap.py`
+- `atom/plugin/vllm/bootstrap.py`
+- `atom/plugin/sglang/prepare.py`
+- `atom/plugin/sglang/register.py`
+
+### 5.2 `prepare.py` 已部分收缩为兼容层
+
+- 实际的 SGLang prepare 逻辑已经移动到 `atom/plugin/sglang/prepare.py`
+- `atom/plugin/prepare.py` 现在更像一个 legacy shim
+- 但它仍保留了 framework-state helper（因为仓库里很多地方还在依赖）
+
+### 5.3 SGLang register 逻辑已开始下沉
+
+SGLang 独有的几块逻辑已经迁入：
+
+- `register_ops_to_sglang`
+- `set_sglang_attn_cls`
+- `init_aiter_dist_for_sglang`
+- `bootstrap_sglang_runtime`
+
+共享 `atom/plugin/register.py` 目前更多是兼容 facade。
+
+### 5.4 说明
+
+这些改动的定位是：
+
+- 用来显式化未来结构
+- 不是完整重构
+- 还保留了较多兼容层
+
+## 6. 关于 SGLang attention backend 的重要判断
+
+### 6.1 `ATOMAttnBackendForSgl` 与 public `AiterAttnBackend`
+
+在一次反复讨论后，本次对它们的关系收敛为：
+
+- 在 backend 角色位置上，两者基本是 apple-to-apple
+- `ATOMAttnBackendForSgl` 不是“更高层的新东西”
+- 而是 `SGLang full-attention runtime backend` 的 ATOM 对位实现
+
+因此更合适的重构思路不是怀疑它的存在合法性，而是：
+
+- 保留它作为 full-attn backend core
+- 再去拆它内部的 metadata / kv_cache / decode / graph 等职责
+
+### 6.2 DeepSeek MLA 的特殊性
+
+`sgl_attention_mla.py` 暴露出一个很重要的事实：
+
+- 对 `MLA`，尤其是 `DeepSeek MLA`
+- SGLang plugin 已经不是单纯换 backend
+- 而是 patch 了模型级 `forward`
+
+这说明：
+
+**MLA 在 SGLang plugin 里已经进入了 model specialization 维度。**
+
+## 7. 关于 GDN / KDA / Lightning / Mamba2 等非传统 “attention”
+
+### 7.1 调研结论
+
+public SGLang 已经证明，系统里不止 full-attention backend，还存在：
+
+- linear attention backend
+  - `GDNAttnBackend`
+  - `KDAAttnBackend`
+  - `LightningAttentionBackend`
+  - `Mamba2AttnBackend`
+- hybrid/composite backend
+  - `HybridLinearAttnBackend`
+  - `HybridAttnBackend`
+- wrapper/composition backend
+  - `TboAttnBackend`
+
+### 7.2 对当前 ATOM plugin 设计造成的挑战
+
+当前 ATOM plugin 的主抽象仍偏向：
+
+- MHA
+- MLA
+- full attention runtime
+
+这会带来几个结构性问题：
+
+1. 还没有把 `linear backend` 当成一等公民
+2. 还没有给 `hybrid/composite backend` 预留独立宿主位置
+3. 容易把 algorithm backend、kernel backend、host glue 混成一层
+4. 容易继续把 `GDN` 等路径硬塞回 full-attn backend 体系
+
+### 7.3 当前共识
+
+后续不应该继续只谈 “attention backend 重构”，而应该升级为：
+
+**sequence backend / mixer backend 体系重构**
+
+建议至少在设计上并列三类：
+
+- `full_attn`
+- `linear_attn`
+- `hybrid/composite`
+
+## 8. hybrid/composite backend 在 plugin 侧的接入思路
+
+### 8.1 核心判断
+
+由于 ATOM SGLang plugin 的启动方式是：
+
+- 启动 `sglang server`
+- 再通过 plugin runtime override 接管 model / attention
+
+所以要接 `hybrid/composite backend`，不可避免需要 patch 某个 seam。
+
+### 8.2 最好的 seam
+
+本次讨论认为，public SGLang 里最好的 seam 是：
+
+- `model_runner.py` 中导入并调用的 `attn_backend_wrapper`
+
+注意：
+
+- 它不是 `ModelRunner` 的成员方法
+- 而是 `model_runner.py` 模块级导入的符号
+
+所以如果要 patch，更稳的是 patch：
+
+- `sglang.srt.model_executor.model_runner.attn_backend_wrapper`
+
+而不是仅 patch `attention_registry.attn_backend_wrapper`。
+
+### 8.3 最推荐的方式
+
+对 hybrid/composite backend 的接入方式，本次最终倾向于：
+
+**plugin-own 一个薄的 composite wrapper/factory，只在 backend 构造 seam 做单点 patch。**
+
+不推荐：
+
+- 深度 monkey patch public hybrid backend 实现
+- 一开始就复制整套 public linear/full backend 逻辑
+
+更推荐：
+
+- full side 先用 ATOM full backend
+- linear side 初期先复用 public SGLang 的 `GDNAttnBackend` / `KDAAttnBackend` / `LightningAttentionBackend` / `Mamba2AttnBackend`
+- composite wrapper 由 plugin 侧拥有
+
+## 9. 关于三种模式共享 ATOM attention 实现的难度
+
+本次对：
+
+- `ATOM native/server`
+- `ATOM vLLM plugin`
+- `ATOM SGLang plugin`
+
+三种模式共享 ATOM attention 实现，得到的判断如下。
+
+### 9.1 MHA
+
+难度：`Medium-High`
+
+原因：
+
+- `ATOM native` 与 `ATOM vLLM plugin` 已经在 `PagedAttentionImpl` 等层共享较多
+- 真正难的是 `SGLang plugin` 这一侧
+- 它不走 `PagedAttention` 的 ATOM 主路径，而走：
+  - `RadixAttention`
+  - `ATOMAttnBackendForSgl`
+  - `ForwardBatch`
+
+所以 MHA 共享的主要困难在于：
+
+**runtime orchestration 差异**
+
+### 9.2 MLA
+
+难度：`High`
+
+原因：
+
+- native/server 与 vLLM plugin 仍然较多共享 `MLAAttention`
+- 但 SGLang plugin 下的 DeepSeek MLA 已经上卷到 model specialization
+- 不只是 backend 不同，连 forward 组织方式都不同
+
+所以 MLA 的困难在于：
+
+**runtime orchestration + model specialization 双重叠加**
+
+## 10. 已产出的文档与图
+
+本次会话过程中，已额外产出下列文档/图，用于不同角度说明问题。
+
+### 10.1 代码目录内 Markdown
+
+- `atom/plugin/decoupling-diagram.md`
+- `atom/plugin/vllm-integration-architecture.md`
+- `atom/plugin/sglang-attention-backend-survey.md`
+
+### 10.2 Canvas / 可视化分析
+
+- `atom-attention-backend-architecture-review.canvas.tsx`
+- `atom-plugin-coupling-risk-analysis.canvas.tsx`
+- `atom-plugin-decoupling-diagram.canvas.tsx`
+- `atom-vllm-plugin-architecture.canvas.tsx`
+- `atom-attention-sharing-modes.canvas.tsx`
+
+### 10.3 这些产物对应的主题
+
+- 当前 attention backend 架构缺陷
+- vLLM / SGLang plugin 耦合与风险
+- plugin 解耦方向与 bootstrap 理解
+- vLLM plugin 集成架构
+- public SGLang backend survey
+- 三种模式共享 ATOM attention 的难度评估
+
+## 11. 当前最值得继续推进的方向
+
+如果延续本次会话的结论，后续工作最值得按下面顺序推进：
+
+1. 继续收缩 `atom/plugin/` 根目录的 shared runtime 语义
+2. 把 `full_attn / linear_attn / hybrid` 三层作为 plugin 后续结构设计的一等公民
+3. 在 `SGLang plugin` 侧先补出 `hybrid/composite` 组合层
+4. 初期复用 public SGLang linear backend，优先验证 runtime 结构是否成立
+5. 然后再评估哪些 linear backend 需要逐步替换成 ATOM 自己的实现
+6. 对 `MLA` 尽早拆开：
+   - generic MLA runtime
+   - DeepSeek specialization
+
+## 12. 一句话总结
+
+本次会话最终把问题收敛为：
+
+**当前 ATOM plugin attention 的关键矛盾，不是底层 kernel 能不能共享，而是 vLLM / SGLang / native 三种模式在 runtime、metadata、model specialization 和 backend ownership 上没有对齐；后续重构应从“统一 attention backend”升级为“重新定义 plugin 的 sequence backend / host-owned runtime 结构”。**
diff --git a/work_log/attn_refractory/2026-05-07-radix-cache-vs-paged-attention.md b/work_log/attn_refractory/2026-05-07-radix-cache-vs-paged-attention.md
new file mode 100644
index 000000000..f0f9d5bcc
--- /dev/null
+++ b/work_log/attn_refractory/2026-05-07-radix-cache-vs-paged-attention.md
@@ -0,0 +1,482 @@
+# 2026-05-07 Radix Cache vs Paged Attention Notes
+
+> 预估阅读时间：15 分钟  
+> 主题：梳理 `SGLang` 中 `radix cache` / `RadixAttention` 与 `ATOM` / `vLLM` 常见的 `paged attention` 之间的关系，解释为什么它们最终可以落到相同的 kernel，以及这件事对 `sglang plugin` attention backend 重构意味着什么。
+
+## 1. 这篇笔记想回答什么
+
+围绕 `sglang plugin attn backend` 的重构，最近反复出现了几个问题：
+
+1. `SGLang` 用的是 `RadixAttention`，`ATOM`/`vLLM plugin` 里常见的是 `PagedAttention`，两者是不是两套完全不同的注意力体系？
+2. 如果它们真的不同，为什么在 `DeepSeek MLA` 这类路径里，最终又会调用到相同的底层 kernel？
+3. 既然 `SGLang` 还多了一层 `radix cache` 前缀管理，为什么没有一个非常直观的结论说 "`SGLang` 一定比 `vLLM` 更快 / 更省显存"？
+4. 对当前重构来说，`radix` / `paged` 的差异到底应该被归类到：
+   - host runtime 差异
+   - KV cache 管理差异
+   - metadata lowering 差异
+   - kernel 差异
+   的哪一层？
+
+本文尝试把这几个问题放进同一个分析框架里。
+
+## 2. 先给结论
+
+### 2.1 最核心的一句话
+
+`RadixAttention` 和 `PagedAttention` 主要不是在底层注意力数学上不同，而是在 **runtime 如何管理、共享、命中和定位历史 KV** 上不同。  
+一旦 runtime 把历史 KV lowering 成底层 kernel 能吃的 `page_table`、`kv_indptr`、`kv_indices`、`qo_indptr` 之类的 metadata，kernel 就不再关心这些索引最初是来自 radix tree 还是来自 paged block table。
+
+### 2.2 另一句更工程化的结论
+
+对重构来说，不应该把 “`RadixAttention` vs `PagedAttention`” 当作 “能不能共享底层执行实现” 的直接判断依据。  
+更应该把系统拆成三层：
+
+1. **host/runtime 层**  
+   例如 `ForwardBatch`、`RadixAttention`、`ModelRunner`、prefix cache、speculative 调度
+2. **execution metadata lowering 层**  
+   例如 `page_table`、`block_tables`、`kv_indptr`、`kv_indices`、`qo_indptr`
+3. **kernel / execution core 层**  
+   例如 `pa_fwd_asm`、`pa_persistent_fwd`、`flash_attn_varlen_func`、`mla_decode_fwd`、`mla_prefill_fwd`
+
+`radix` 和 `paged` 的差异，主要在第 1、2 层；  
+兼容性和共享机会，主要出现在第 2、3 层。
+
+## 3. 为什么名字容易让人误解
+
+当前最容易造成误解的点是：
+
+- `PagedAttention` 这个名字听起来像“最终执行算法”
+- `RadixAttention` 这个名字也听起来像“最终执行算法”
+
+但在工程上，它们都更接近 **host-facing attention runtime abstraction**，而不是最终 kernel 名字。
+
+更具体一点：
+
+- `PagedAttention` 更强调 **按固定 page/block 管理 KV cache**
+- `RadixAttention` 更强调 **按 prefix/radix tree 管理请求历史与前缀复用**
+
+这两者都不是在说：
+
+- `QK^T` 怎么算
+- softmax 怎么做
+- decode kernel 用哪套 asm/triton/flashinfer
+
+这些底层执行问题，是更下一层的 backend / kernel 决定的。
+
+## 4. `RadixAttention` 到底是什么
+
+### 4.1 `RadixAttention` 不是 kernel
+
+在 public `SGLang` 里，`RadixAttention` 本身并不直接定义底层 kernel 选择。  
+它更像是：
+
+- SGLang 模型层里统一使用的 attention layer 壳
+- 它知道自己在 `ForwardBatch` 语义里运行
+- 它把真正的执行委托给 `forward_batch.attn_backend`
+
+因此，`RadixAttention` 更像一个 **layer/runtime 入口点**。
+
+### 4.2 `radix` 的重点是 prefix 管理
+
+`radix cache` 的本质，是一棵 prefix tree，用来做：
+
+- 前缀命中
+- 前缀复用
+- unfinished / finished request 的缓存插入
+- 基于 prefix 的 eviction / lifecycle 管理
+
+它回答的问题更像：
+
+- 这个请求的历史前缀已经缓存到哪里了？
+- 哪一部分可以直接复用，不必重新 prefill？
+- 哪些 KV slot/page 仍然属于共享前缀？
+- 哪些历史片段可以回收？
+
+所以 `radix` 优化的主要是 **prefix reuse**，而不是单次 attention kernel 的算力效率。
+
+### 4.3 `radix cache` 在 SGLang 里不是完全“反 page”的
+
+这一点很重要。public `SGLang` 的 `radix_cache` 本身就已经带有 page-aware 的逻辑。
+
+例如它有 page 对齐：
+
+- `page_align_keys()`
+
+也有按 page 粒度匹配 prefix：
+
+- `_key_match_paged()`
+
+这意味着：
+
+- `radix cache` 解决的是前缀共享与命中问题
+- 但它并不排斥最终按 page/block 粒度来表达缓存
+
+也就是说，**radix tree 的逻辑管理** 与 **paged/block 的物理表达** 并不冲突。
+
+## 5. `PagedAttention` 到底是什么
+
+### 5.1 `PagedAttention` 的重点不是 prefix tree
+
+`PagedAttention` 更强调：
+
+- KV cache 被组织成固定大小的 page/block
+- 每个请求通过 `page_table` / `block_table` 找到自己历史对应的 page
+- kernel 依照这些 page/block 索引读取历史 KV
+
+它回答的问题更像：
+
+- 当前请求历史有哪些 block/page？
+- 它们在物理内存中的位置是什么？
+- 这些 page 如何映射到 kernel 所需的 block table？
+
+所以 `paged attention` 更偏：
+
+- **物理存储布局**
+- **kernel 读取 KV 的地址组织方式**
+
+### 5.2 `PagedAttention` 天生不等于 prefix reuse
+
+`paged attention` 当然也可以配合 prefix cache 使用，但 prefix reuse 并不是它名字里最核心的概念。  
+它的核心价值更在于：
+
+- 支持非连续物理布局
+- 让 decode kernel 可以按 page/block 高效读取 KV
+
+## 6. 为什么 `radix` 和 `paged` 最后可以兼容
+
+### 6.1 因为它们主要解决的是不同层的问题
+
+可以把它们看成：
+
+- `radix`: 逻辑历史管理器
+- `paged`: 物理页表表达方式
+
+两者关系有点像：
+
+- `radix` 负责决定“哪些历史是共享的、哪些是命中的、请求当前逻辑历史是什么”
+- `paged` 负责决定“这些历史最终以什么 page/block 索引形式交给 kernel”
+
+只要 runtime 最终能把逻辑历史翻译成 page/block 或 indptr/index 形式，底层 kernel 并不需要知道前缀最初是如何组织的。
+
+### 6.2 兼容发生在 lowering 之后
+
+这一点非常关键：
+
+上层看起来是：
+
+- `ForwardBatch`
+- `RadixAttention`
+- `req_to_token`
+- prefix cache
+
+但 lower 到 backend 之后，会变成执行态 metadata，例如：
+
+- `page_table`
+- `block_tables`
+- `kv_indptr`
+- `kv_indices`
+- `qo_indptr`
+- `kv_last_page_len`
+
+到了这个层次，kernel 只看：
+
+- query 的 shape
+- KV cache 的 shape
+- 历史长度
+- 索引表 / indptr
+
+它不看：
+
+- 这套索引是不是来自 radix tree
+- 是不是来自某个 prefix cache 命中
+- 是不是通过 `PagedAttention` 对象构造出来的
+
+所以兼容性不是“名字兼容”，而是 **execution metadata 兼容**。
+
+## 7. public `SGLang` 自己就在做类似 lowering
+
+这不是 `ATOM plugin` 独有的现象。public `SGLang` 本身就在这么做，只是不同 backend lower 到的 metadata 形态略有不同。
+
+### 7.1 `aiter_backend`
+
+public `SGLang` 的 `aiter_backend` 会从：
+
+- `forward_batch.seq_lens`
+- `forward_batch.req_pool_indices`
+- `req_to_token`
+
+构造：
+
+- `kv_indptr`
+- `kv_indices`
+- `qo_indptr`
+
+也就是说，它会把 host runtime 的历史表示 lower 成 AITER kernel 所需的执行态索引。
+
+### 7.2 `triton_backend`
+
+public `SGLang` 的 `triton_backend` 也做类似事情。  
+它同样维护：
+
+- `kv_indptr`
+- `kv_indices`
+- `qo_indptr`
+
+然后再调用 Triton 的 decode/extend kernel。
+
+### 7.3 `trtllm_mha_backend` / 部分 `flashinfer` 路径
+
+这类 backend 更偏向：
+
+- 直接构造 `page_table`
+- 或 `block_tables`
+
+再喂给 flashinfer / TRTLLM 风格的 kernel。
+
+所以更准确地说，public `SGLang` 的共同模式不是：
+
+> “所有 backend 都把 radix runtime 翻成 paged attention”
+
+而是：
+
+> “所有 backend 都会把 radix/runtime 层的历史表示，lower 成各自 kernel 能吃的索引/页表 metadata”
+
+其中有的偏 `kv_indptr/kv_indices`，有的偏 `page_table/block_tables`。
+
+## 8. 从 kernel 角度看，为什么可以兼容
+
+### 8.1 kernel 真正关心什么
+
+绝大多数 attention kernel 真正关心的是：
+
+1. 当前 query 的排列方式
+2. 每个请求可见的历史 KV 长度
+3. 如何从某个索引结构里拿到历史 KV 的物理地址
+4. page/block 大小、stride、layout
+5. 是否需要 prefix/extend/speculative 的特殊 metadata
+
+它通常不关心：
+
+1. 这份索引最初是不是从 prefix tree 查出来的
+2. 命中前缀的逻辑是谁做的
+3. runtime 是 `RadixAttention` 还是 `PagedAttention`
+
+### 8.2 对 MHA kernel 的影响
+
+MHA decode 常见地会落到：
+
+- `pa_fwd_asm`
+- `pa_persistent_fwd`
+- `flash_attn_varlen_func`
+
+这些 kernel 更偏 page/block 或 varlen index 风格。  
+只要 runtime 最终给出：
+
+- `page_table`
+- `context_lens`
+- `block_tables`
+
+或者：
+
+- `kv_indptr`
+- `kv_indices`
+- `qo_indptr`
+
+它们都能算。
+
+### 8.3 对 MLA kernel 的影响
+
+`DeepSeek MLA` 更容易看出这种兼容性，因为 MLA 的底层算子路径更明确。
+
+典型 kernel 包括：
+
+- `mla_decode_fwd`
+- `mla_prefill_fwd`
+- `concat_and_cache_mla`
+- `fused_qk_rope_concat_and_cache_mla`
+
+这些 kernel 只要求：
+
+- 压缩后的 latent KV 表示
+- 以及相应的 `qo_indptr` / `kv_indptr` / `kv_indices`
+
+所以无论上层 host 是：
+
+- `ATOM native`
+- `ATOM vLLM plugin`
+- `ATOM SGLang plugin`
+
+只要最终 lower 成相同的 MLA execution metadata，就自然会落到同一套 MLA kernel。
+
+## 9. `DeepSeek MLA` 为什么尤其容易“看起来收敛”
+
+### 9.1 因为 MLA 的 execution core 更专用
+
+`DeepSeek MLA` 的真正执行对象不是：
+
+- “`RadixAttention` 版本的 MLA”
+- “`PagedAttention` 版本的 MLA”
+
+而是：
+
+- 一组固定的 MLA latent/cache 表示
+- 一套固定的 rope + kv cache 写入方式
+- 一套固定的 prefill/decode kernel
+
+换句话说，MLA 的底层 core 更容易抽成“共享执行层”。
+
+### 9.2 因为 host 差异主要体现在 metadata 准备与调度
+
+对于 `DeepSeek MLA` 来说，上层 host 差异更多体现在：
+
+- `positions` 从哪里来
+- `forward_batch` 怎么组织
+- decode / extend / speculative / target_verify 怎么分流
+- `req_to_token` 怎么转换成 `kv_indices`
+
+一旦进入 `mla_decode_fwd` 或 `mla_prefill_fwd` 前，这些差异已经被消化掉了。
+
+所以从外部观察，很容易得到一种印象：
+
+> “怎么上面完全不一样，下面却是同一套 kernel？”
+
+其实并不奇怪，因为两边只是上层 runtime 不同，而 MLA execution core 本来就在下层收敛。
+
+## 10. 既然兼容，为什么 `SGLang` 没有天然更快或更省显存
+
+这是另外一个很容易误判的问题。
+
+### 10.1 `radix cache` 省的是前缀重复，不是所有 KV
+
+`radix cache` 能节省的，是重复前缀带来的重复 prefill 和重复 KV 占用。
+
+它主要帮助的是：
+
+- 大量请求共享长前缀
+- 同 prompt 多分支采样
+- prefix-heavy workload
+- 某些 speculative / chunked prefill 场景
+
+如果 workload 不满足这些条件，它就不会天然表现为显著优势。
+
+### 10.2 它优化的是 prefix reuse，不是单 token decode kernel
+
+radix 管理层并不会让 `mla_decode_fwd`、`pa_fwd_asm` 这种 kernel 自己更便宜。  
+kernel 的 raw 速度主要还是受：
+
+- 头数
+- head dim
+- KV layout
+- quant
+- page size
+- GPU kernel 实现
+
+这些因素决定。
+
+所以常见现象是：
+
+- prefix-heavy workload 下，`SGLang` 的收益更明显
+- decode-heavy workload 下，收益没那么明显
+- 如果 `vLLM` 也有自己的 prefix caching/block caching，差距会进一步缩小
+
+### 10.3 radix 自己也有管理成本
+
+`radix cache` 不是免费层。它也有：
+
+- prefix match
+- tree split / insert / eviction
+- request -> KV slot 映射维护
+- speculative / unfinished request 的额外 bookkeeping
+
+如果命中率不高，这部分成本并不会自动转化成收益。
+
+## 11. 对当前重构最有价值的判断
+
+### 11.1 不要把 `RadixAttention` vs `PagedAttention` 当成是否共享 execution core 的边界
+
+因为从 public `SGLang` 和当前 `ATOM plugin` 的代码路径看，二者最终都在做类似事情：
+
+- 从 host runtime 的历史表示出发
+- 构造底层 kernel 所需的 execution metadata
+
+所以：
+
+- runtime 不同，不代表 execution core 必须不同
+- 名字不同，不代表 kernel 层必须分叉
+
+### 11.2 真正的边界更适合这样划
+
+#### A. `sglang plugin runtime` 应拥有的部分
+
+- `ForwardBatch`
+- `RadixAttention`
+- prefix cache / radix cache
+- `ModelRunner` / `attn_backend_wrapper` 相关调度
+- speculative / graph / verify / chunked prefill 等宿主语义
+
+#### B. execution metadata lowering 层
+
+这层是最值得重新定义的边界。它负责：
+
+- 把 `req_to_token`、prefix 命中结果、cache state
+- 转成 `page_table` / `kv_indptr` / `kv_indices` / `qo_indptr`
+
+这层既带有 host 语义，又已经开始接近共享执行核心，是未来最值得梳理清楚的一层。
+
+#### C. shared execution core
+
+例如：
+
+- `mla_decode_fwd`
+- `mla_prefill_fwd`
+- `pa_fwd_asm`
+- `pa_persistent_fwd`
+- `flash_attn_varlen_func`
+
+以及与这些 kernel 紧耦合的那部分 ATOM core helper。
+
+### 11.3 这对 `sglang plugin` 的重构意味着什么
+
+当前 `sglang plugin` 最不该做的是：
+
+- 因为底层 kernel 能共享，就把 `sglang` 的 runtime seam 也强行对齐到 `vllM`
+- 或者因为上层叫 `RadixAttention`，就认为它和 `PagedAttention` 完全不能共享底层执行实现
+
+更合理的重构方向是：
+
+1. 保持 `sglang` 的 host/runtime 语义完整
+2. 识别 runtime lowering 层的清晰边界
+3. 尽量把 execution core 继续下沉回共享 `ATOM core`
+
+## 12. 一种对外更容易讲清楚的表述
+
+如果后续要向别人解释，可以用下面这几句话：
+
+### 12.1 关于 `radix` vs `paged`
+
+`radix cache` 主要解决的是 prefix 命中与复用，`paged attention` 主要解决的是 page/block 形式的 KV 组织与读取。  
+前者偏逻辑管理，后者偏物理表达；两者不是同一层次的问题。
+
+### 12.2 关于为什么最终会调到同一个 kernel
+
+因为 attention kernel 只关心最终的执行态 metadata，比如 `page_table`、`kv_indptr`、`kv_indices`、`qo_indptr`。  
+一旦 runtime 把历史 KV lowering 成这类形式，kernel 就不再关心这些索引最初是由 radix tree 还是 paged runtime 生成的。
+
+### 12.3 关于为什么 `SGLang` 没有天然显著更快
+
+因为 `radix` 带来的主要收益是 prefix reuse，而不是单 token decode kernel 的 raw speed。  
+如果 workload 不是 prefix-heavy，或者 decode 占主要成本，那么 radix 带来的优势不会自然放大成“总是更快 / 更省显存”。
+
+## 13. 最终总结
+
+把这次探索收敛成一句话：
+
+**`RadixAttention` 与 `PagedAttention` 的主要差异，不在底层注意力数学，也不一定在最终 KV 的物理表示本身，而在 host runtime 如何管理、共享、命中并 lowering 历史 KV；一旦 lowering 完成，底层 kernel 完全可以是共享的。**
+
+对当前 `sglang plugin attn backend` 的重构来说，真正值得做的不是争论 “`radix` 和 `paged` 能不能统一成一个名字”，而是把系统拆清楚：
+
+- 哪些是 `sglang` 必须自有的 runtime / prefix cache / scheduling 语义
+- 哪些是 execution metadata lowering
+- 哪些已经是可以共享的 `ATOM` execution core
+
+这比直接讨论 “`radix` 和 `paged` 谁更先进、谁应该替代谁” 更接近真正的工程问题。
diff --git a/work_log/attn_refractory/2026-05-08-kv-indices-address-space-notes.md b/work_log/attn_refractory/2026-05-08-kv-indices-address-space-notes.md
new file mode 100644
index 000000000..e5a63be0f
--- /dev/null
+++ b/work_log/attn_refractory/2026-05-08-kv-indices-address-space-notes.md
@@ -0,0 +1,359 @@
+# 2026-05-08 KV Indices and Address Space Notes
+
+> 预估阅读时间：8-10 分钟  
+> 主题：解释 `kv_indptr` / `kv_indices` 为什么不是纯 host metadata，也不是完全 storage-agnostic 的抽象，并用 `sglang` / `ATOM` / `vllm plugin` 的实际 KV cache 存储举例说明。
+
+## 1. 想回答的问题
+
+最近有一个很容易卡住重构讨论的问题：
+
+> `kv_indptr` / `kv_indices` 看起来和 KV cache 的存储方式深度绑定。  
+> 既然如此，为什么还能把它们看成 `sglang` / `ATOM` / `vllm plugin` 之间可以共享的 metadata？
+
+这个问题的关键在于区分两件事：
+
+1. **host/runtime 是否相同**
+2. **kernel 最终消费的地址空间模型是否相同**
+
+一句话先说结论：
+
+**`kv_indptr` / `kv_indices` 不是完全 storage-agnostic 的抽象，它们绑定的是某种“可索引的 KV 地址空间”；  
+但它们可以是 host/runtime-agnostic 的，只要不同宿主最终都能把自己的 runtime state lowering 到同一种地址空间模型。**
+
+## 2. 三层视角
+
+理解这个问题，最好先把 attention 相关代码拆成三层：
+
+### 2.1 host/runtime 层
+
+这一层描述的是宿主框架如何组织请求和调度，例如：
+
+- `ATOM native` 的 `ScheduledBatch`
+- `vllm plugin` 的 `query_start_loc` / `num_prefills`
+- `sglang plugin` 的 `ForwardBatch` / `forward_mode` / `spec_info`
+
+这一层表达的是：
+
+- 哪些请求在 batch 中
+- 哪些是 decode，哪些是 prefill
+- prefix / speculative / graph 的语义是什么
+
+### 2.2 execution metadata lowering 层
+
+这一层做的事情是：
+
+- 把 host/runtime 里的 batch 状态
+- 翻译成 kernel 真正能消费的索引结构
+
+典型字段：
+
+- `block_tables` / `page_table`
+- `kv_indptr`
+- `kv_indices`
+- `qo_indptr`
+- `kv_last_page_len`
+
+### 2.3 kernel / execution core 层
+
+这一层只关心：
+
+- query tensor
+- KV cache tensor
+- index / indptr / page table
+- persistent kernel workspace
+
+例如：
+
+- `mla_decode_fwd`
+- `mla_prefill_fwd`
+- `pa_fwd_asm`
+- `pa_persistent_fwd`
+
+## 3. `sglang` 的实际 KV cache 存储
+
+在 `sglang` 里，KV cache 本身不是存在 radix tree 里。  
+radix tree 负责的是 prefix 复用与命中；真正的 KV 还是存在 memory pool 里。
+
+`sglang` 自己的注释说得很清楚：
+
+- `ReqToTokenPool` 负责 request -> token location 映射
+- `TokenToKVPoolAllocator` 负责 KV cache index 管理
+- `KVCache` 真正持有物理 K/V tensor
+
+也就是说，`sglang` 的 runtime 世界里至少有两层：
+
+1. **逻辑视角**：某个 request 的第 `i` 个 token 属于哪里
+2. **物理视角**：这个 token 对应的 K/V 存在物理 pool 的哪个 slot
+
+### 3.1 `ReqToTokenPool`
+
+`ReqToTokenPool` 本质上是一张二维表：
+
+```text
+req_to_token[req_id, token_pos] = physical_slot_id
+```
+
+它回答的是：
+
+- 某个 request 的第 `j` 个逻辑 token
+- 对应到物理 KV pool 里的哪个 slot
+
+### 3.2 物理 KV pool
+
+物理 pool 里，K 和 V 都是按 slot 存的。  
+写入 KV 时，最终就是按 `loc/indices` 写入：
+
+```text
+k_cache[indices] = k
+v_cache[indices] = v
+```
+
+所以从 kernel 角度看，最终它看到的是：
+
+- 一个可以索引的 KV 地址空间
+- 一组 slot index
+
+而不是看到“radix tree”本身。
+
+## 4. `ATOM` / `vllm plugin` 的实际 KV cache 存储
+
+`ATOM` / `vllm plugin` 更偏 paged/block 风格。
+
+典型思路是：
+
+- 每个 request 有一张 `block_table`
+- 每个 block 对应一个固定大小的 page
+- kernel 按 `block_table` 或其展开后的 index 去读 KV
+
+所以它的上层直觉更像：
+
+```text
+request -> block table -> page/block address -> physical KV
+```
+
+和 `sglang` 相比，最大的不同不是“有没有物理地址空间”，而是：
+
+- `sglang` 上层先经过 `req_to_token`
+- `vllm/ATOM` 上层先经过 `block_table`
+
+但两边最终都能 lower 到一组可供 kernel 读取的地址索引。
+
+## 5. 例子一：MLA decode 中的 `kv_indptr` / `kv_indices`
+
+这个例子里，可以把 `kv_indices` 理解为 **token-slot index**。
+
+### 5.1 在 `sglang` 里
+
+假设一个 request 当前长度是 10，  
+它在物理 KV pool 里的 slot 顺序不是连续的，而是：
+
+```text
+[8, 9, 10, 11, 0, 1, 2, 3, 4, 5]
+```
+
+那 `req_to_token[req_id, :10]` 就大致表示：
+
+```text
+logical token position: 0  1   2   3   4  5  6  7  8  9
+physical slot id:      8  9  10  11   0  1  2  3  4  5
+```
+
+如果这个 batch 只有这一个 request，那么 lowering 之后可以得到：
+
+```text
+kv_indptr = [0, 10]
+kv_indices = [8, 9, 10, 11, 0, 1, 2, 3, 4, 5]
+```
+
+这里：
+
+- `kv_indptr` 表示“第 0 个 request 的 KV 范围是 `[0, 10)`”
+- `kv_indices` 表示“真正去物理 KV pool 里读这些 slot”
+
+### 5.2 在 `vllm/ATOM` 里
+
+如果 page size 是 4，同样 10 个 token 对应的 block layout 可以写成：
+
+```text
+block 2 -> slots [8, 9, 10, 11]
+block 0 -> slots [0, 1, 2, 3]
+block 1 -> slots [4, 5, 6, 7]
+```
+
+block table 大致就是：
+
+```text
+[2, 0, 1]
+```
+
+展开前 10 个 token 后，对应的 slot 顺序仍然是：
+
+```text
+[8, 9, 10, 11, 0, 1, 2, 3, 4, 5]
+```
+
+于是 lower 之后，给 MLA decode kernel 的：
+
+```text
+kv_indptr = [0, 10]
+kv_indices = [8, 9, 10, 11, 0, 1, 2, 3, 4, 5]
+```
+
+和 `sglang` 这边是等价的。
+
+### 5.3 这个例子说明什么
+
+说明在 MLA decode 这个路径里：
+
+- `kv_indices` 绑定的是 **token-slot 地址空间**
+- 它不是“完全与存储无关”
+- 但它已经不关心 slot 列表最初是由 `req_to_token` 还是 `block_table` 推出来的
+
+也就是说，**它和具体宿主无关，但和被选定的地址空间模型有关。**
+
+## 6. 例子二：MHA persistent decode 中的 `page_table` / `kv_indices`
+
+这个例子里，`kv_indices` 更接近 **page/block index**，而不是 token-slot index。
+
+### 6.1 在 `sglang plugin` 里
+
+仍然假设 page size = 4。  
+如果某个 request 的 token-slot 顺序是：
+
+```text
+[8, 9, 10, 11, 0, 1, 2, 3, 4, 5]
+```
+
+那么每页的起始 slot 是：
+
+```text
+[8, 0, 4]
+```
+
+除以 page size 后，对应 page id：
+
+```text
+[2, 0, 1]
+```
+
+在 `sglang plugin` 里，这就是 `page_table` 的来源：  
+从 `req_to_token` 按页采样，再除以 `page_size`。
+
+然后 `_build_pa_metadata_for_decode()` 再把它变成 `pa_persistent_fwd` 所需的 page-level `kv_indices`。
+
+### 6.2 在 `ATOM` / `vllm` 里
+
+这一层本来就是 paged/block 视角，所以 `block_table` 原生就已经是：
+
+```text
+[2, 0, 1]
+```
+
+也就是说：
+
+- `sglang` 是 `req_to_token -> page_table`
+- `vllm/ATOM` 是直接 `block_table -> page_table`
+
+但到了 persistent kernel 眼里，最后看到的是同一种 page/block 地址空间。
+
+### 6.3 这个例子说明什么
+
+说明在 MHA persistent decode 这个路径里：
+
+- `kv_indices` 绑定的是 **page/block 地址空间**
+- 它仍然不是“完全 storage-agnostic”
+- 但也不需要知道上层 host 是 `sglang` 还是 `vllm`
+
+只要两边都能 lower 到同一个 page/block 地址空间，kernel 就能共享。
+
+## 7. 这两个例子真正说明的事
+
+上面两个例子其实在说明同一个结论：
+
+### 7.1 `kv_indptr` / `kv_indices` 不是 host metadata
+
+它们不描述：
+
+- `ForwardBatch.forward_mode`
+- `query_start_loc`
+- `num_prefills`
+- `run_graph`
+- prefix hit 的高层语义
+
+它们只描述：
+
+- 当前这次 kernel 调用
+- 要读哪些 KV 地址
+- 每个 request 的边界在哪里
+
+所以它们更接近：
+
+- execution metadata
+- kernel-facing metadata
+
+而不是 host/runtime metadata。
+
+### 7.2 但它们也不是完全 storage-agnostic
+
+它们依赖：
+
+- 当前 KV cache 的地址空间模型
+- kernel 对该地址空间的解释方式
+
+如果底层 cache 不是：
+
+- 可线性索引的 token slot
+- 或可 flatten 的 page/block index
+
+那 `kv_indptr` / `kv_indices` 这套抽象就未必成立。
+
+所以它们不是“无限通用”的。
+
+## 8. 对重构的直接启示
+
+这个背景知识对当前 `sglang attention` 重构最有价值的结论是：
+
+### 8.1 不要去统一 host metadata
+
+比如不要强行统一：
+
+- `ForwardBatch`
+- `query_start_loc`
+- `num_prefills`
+- `run_graph`
+
+这些都属于宿主框架自己的 runtime 语言。
+
+### 8.2 应该争取统一 execution metadata
+
+例如：
+
+- `kv_indptr`
+- `kv_indices`
+- `qo_indptr`
+- `page_table`
+- `kv_last_page_len`
+- `work_indptr`
+- `reduce_indptr`
+
+以及围绕这些字段构造的更小 dataclass，例如当前 draft 里已经开始尝试的：
+
+- `MLADecodeKernelMetadata`
+- `MHAAsmKernelMetadata`
+- `MHAPersistentKernelMetadata`
+
+### 8.3 正确的共享边界
+
+所以真正可共享的不是：
+
+- `sglang` 的 `ForwardMetadata`
+- `vllm plugin` 那一整套 metadata family
+
+而是：
+
+- 它们再往下切出来的
+- kernel-facing execution metadata
+
+## 9. 一句话总结
+
+**`kv_indptr` / `kv_indices` 不是“和存储彻底无关”的抽象，它们绑定的是某种被 kernel 接受的 KV 地址空间；但正因为它们已经脱离了宿主 batch/runtime 语义，`sglang` 和 `vllm/ATOM` 即使上层完全不同，也仍然可以在这层 metadata 上收敛并共享 kernel 调用逻辑。**
diff --git a/work_log/attn_refractory/2026-05-08-metadata-layering-summary.md b/work_log/attn_refractory/2026-05-08-metadata-layering-summary.md
new file mode 100644
index 000000000..86bccce8e
--- /dev/null
+++ b/work_log/attn_refractory/2026-05-08-metadata-layering-summary.md
@@ -0,0 +1,144 @@
+# 2026-05-08 Metadata Layering Summary
+
+## 1. 目的
+
+这份笔记简要总结 `ATOM native`、`vllm plugin`、`sglang plugin` 三边 metadata 的关系，重点回答：
+
+- `ATOM/atom/utils/forward_context.py` 中的 `AttentionMetaData` 是什么
+- `ATOM/atom/plugin/attention.py` 里的 metadata dataclass 在做什么
+- `ATOM/atom/plugin/sglang/attention_backend/sgl_attn_backend.py` 里的 `ForwardMetadata` 属于哪一层
+- 哪些 metadata 值得共享，哪些不值得强行统一
+
+## 2. 三套 metadata 的定位
+
+### 2.1 `AttentionMetaData`
+
+文件：`ATOM/atom/utils/forward_context.py`
+
+它是 `ATOM` attention 执行链路里的**通用外层容器**。  
+里面既有：
+
+- 通用 attention 字段
+  - `slot_mapping`
+  - `block_tables`
+  - `kv_indptr`
+  - `kv_indices`
+  - `work_meta_data`
+- 也预留了 plugin 扩展入口
+  - `plugin_metadata`
+
+所以它更像一个大容器，而不是某个 host 专属的数据模型。
+
+### 2.2 `vllm plugin metadata`
+
+文件：`ATOM/atom/plugin/attention.py`
+
+`vllm plugin` 不是重写一套完全独立的 metadata，而是在 `AttentionMetaData` 外壳上，额外挂了一层 `plugin_metadata` payload。
+
+代表类型包括：
+
+- `AiterFlashAttentionMetadataForPluginMode`
+- `AiterMLACommonMetadataForPluginMode`
+- `AiterMLADecodeMetadataForPluginMode`
+- `AiterMLASparseMetadataForPluginMode`
+
+这些 dataclass 主要承载：
+
+- `vllm` 的 batch/runtime 语义
+- decode/prefill/extend phase 拆分
+- chunked prefill / DCP / sparse MLA 等 feature-specific 信息
+
+一句话：**`vllm plugin` 是“外层复用 `AttentionMetaData`，内层自带一套 host-specific metadata family”。**
+
+### 2.3 `sglang plugin` 的 `ForwardMetadata`
+
+文件：`ATOM/atom/plugin/sglang/attention_backend/sgl_attn_backend.py`
+
+`ForwardMetadata` 不是 `AttentionMetaData.plugin_metadata` 那种 payload，而是 `sglang backend` 内部的**lowering 结果缓存对象**。
+
+它更接近：
+
+- `ForwardBatch`
+- `req_to_token`
+- `token_to_kv_pool`
+- graph/speculative path
+
+向 kernel metadata 的过渡层。
+
+一句话：**`ForwardMetadata` 更像 backend-local lowering state，而不是 host-agnostic 的最终执行 metadata。**
+
+## 3. 三类字段
+
+### 3.1 host/runtime-bound
+
+这类字段描述宿主框架如何组织 batch/runtime，而不是 kernel 如何计算。
+
+例子：
+
+- `vllm plugin`
+  - `query_start_loc`
+  - `num_decodes`
+  - `num_prefills`
+  - `chunked_context`
+- `sglang plugin`
+  - `run_graph`
+  - 各种 `ForwardBatch.forward_mode` 派生分支语义
+
+### 3.2 execution/kernel-bound
+
+这类字段最接近 kernel 真正需要的执行信息。
+
+例子：
+
+- `slot_mapping`
+- `block_tables`
+- `context_lens`
+- `kv_indptr`
+- `kv_indices`
+- `qo_indptr`
+- `kv_last_page_len`
+- `work_indptr`
+- `reduce_indptr`
+
+### 3.3 hardware-sensitive
+
+这类字段和 page size、dtype、kernel 选择、并行配置相关。
+
+例子：
+
+- `max_q_len`
+- `max_kv_len`
+- `num_kv_splits`
+- `attn_out_dtype`
+- `head_dim`
+
+## 4. 哪些值得共享
+
+### 不建议强行共享的
+
+- `AttentionMetaData` 整体
+- `vllm plugin` 整套 `plugin_metadata`
+- `sglang plugin` 整体的 `ForwardMetadata`
+
+原因不是它们没价值，而是它们都带着很重的 host/runtime 语义。
+
+### 值得重点共享的
+
+应该共享的是更小的 **execution metadata view**，例如当前 draft 已经开始尝试的：
+
+- `MLADecodeKernelMetadata`
+- `MHAAsmKernelMetadata`
+- `MHAPersistentKernelMetadata`
+
+这些结构只保留某个 kernel call 真正需要的字段，更适合在：
+
+- `ATOM core`
+- `vllm plugin`
+- `sglang plugin`
+
+之间共享。
+
+## 5. 一句话结论
+
+现在不是三边共用“一套 metadata”，而是三边各自都有一层 host-owned metadata / lowering 体系。  
+真正值得共享的，不是这些大 metadata 本身，而是从它们里面切出来的、更小的、kernel-facing execution metadata。
diff --git a/work_log/attn_refractory/2026-05-11-attn-refractory-session-summary.md b/work_log/attn_refractory/2026-05-11-attn-refractory-session-summary.md
new file mode 100644
index 000000000..3b2128f5b
--- /dev/null
+++ b/work_log/attn_refractory/2026-05-11-attn-refractory-session-summary.md
@@ -0,0 +1,398 @@
+# 2026-05-11 ATOM Attention Refactory Session Summary
+
+> 主题：从 `sglang plugin` 设计者视角，梳理 `attention backend` 重构中的 runtime / metadata / KV cache / kernel reuse 边界，并记录几种 kernel 共用 draft 的结论。
+
+## 1. 本次会话的主线
+
+本次讨论的核心不是继续从 `vllm plugin` 出发看问题，而是明确切换到：
+
+- **`sglang plugin attn backend` 设计者视角**
+- 关注 `sglang plugin` 自己的 runtime 语义
+- 判断哪些层应该继续由 `sglang plugin` 自有
+- 判断哪些层值得尽量与 `ATOM core` 共用
+
+围绕这个目标，本次会话主要探索了 6 个问题：
+
+1. `RadixAttention` / `radix cache` 与 `PagedAttention` 的关系
+2. public `sglang` 是否也会把 radix runtime lowering 到 paged / index metadata
+3. metadata 为什么会越长越多，以及哪些字段本质上属于 host/runtime
+4. `kv_indptr` / `kv_indices` 为什么既绑定存储模型，又仍然能跨 host 共用
+5. `sglang plugin` 和 public `sglang` 在 KV cache 管理上的异同
+6. 在不大改文件结构、不做过度软件工程的前提下，怎么试探性地共用 kernel call
+
+## 2. 对 `sglang plugin` 重构最关键的几个判断
+
+### 2.1 不是所有问题都该叫 “attention backend 复用”
+
+当前问题至少分成三层：
+
+1. **host/runtime 层**
+   - `ForwardBatch`
+   - `RadixAttention`
+   - speculative / graph / verify / extend 调度
+2. **execution metadata lowering 层**
+   - `page_table`
+   - `kv_indptr`
+   - `kv_indices`
+   - `qo_indptr`
+   - `kv_last_page_len`
+3. **kernel / execution core 层**
+   - `flash_attn_varlen_func`
+   - `mla_decode_fwd`
+   - `mla_prefill_fwd`
+   - `pa_fwd_asm`
+   - `pa_persistent_fwd`
+
+后续任何“共用更多 code”的讨论，都应该先说明是在第几层说话。
+
+### 2.2 `sglang plugin` 不应该为了复用去对齐到 `vllm` 的 host runtime
+
+`vllm plugin` 和 `sglang plugin` 上层接入模型不同：
+
+- `vllm plugin` 更像 layer-owned / metadata-builder-owned 路径
+- `sglang plugin` 更像 `ForwardBatch` 驱动的 backend runtime 路径
+
+所以：
+
+- 不该共享 `vllm plugin` 那层 runtime facade
+- 也不该强行统一大 metadata 对象
+- 真正值得共享的，是更靠近 kernel 的 execution metadata 和 kernel call
+
+### 2.3 `sglang plugin` 想共用 `ATOM` code，最现实的边界在 kernel call 一层
+
+如果先不做大规模结构改造，最适合先共用的是：
+
+- `mla_decode_fwd`
+- `pa_persistent_fwd`
+- `pa_fwd_asm`
+- `flash_attn_varlen_func`
+
+也就是：
+
+- 保留各自的 runtime lowering
+- 先把“掉 kernel 前最后一跳”抽出来
+
+## 3. `RadixAttention` / `radix cache` 和 `PagedAttention` 的结论
+
+### 3.1 `RadixAttention` 不是最终 kernel
+
+`RadixAttention` 更像：
+
+- `sglang` 模型层统一使用的 attention layer 壳
+- 它知道自己跑在 `ForwardBatch` 语义里
+- 最终还是委托给 `forward_batch.attn_backend`
+
+所以它不是 “另一套底层注意力数学”，而是 **host-facing runtime abstraction**。
+
+### 3.2 `PagedAttention` 和 `radix` 解决的问题不在同一层
+
+- `radix cache` 更偏 prefix 命中 / 复用 / eviction
+- `paged attention` 更偏 block/page 形式的物理组织和读取
+
+可以理解成：
+
+- `radix` 管“历史怎么共享和命中”
+- `paged` 管“命中的历史最后如何变成 page/block 索引给 kernel”
+
+### 3.3 public `sglang` 自己也在做类似 lowering
+
+这不是 `ATOM plugin` 才有的行为。
+
+public `sglang` 的：
+
+- `aiter_backend`
+- `triton_backend`
+- `trtllm_mha_backend`
+- 部分 `flashinfer` 路径
+
+本身也会把：
+
+- `ForwardBatch`
+- `req_to_token`
+- `seq_lens`
+
+lower 成：
+
+- `kv_indptr`
+- `kv_indices`
+- `qo_indptr`
+- `page_table` / `block_tables`
+
+所以，`radix runtime -> paged/index metadata` 不是 plugin 特例，而是公共设计模式的一部分。
+
+## 4. Metadata 分层结论
+
+### 4.1 现在至少有三套 metadata family
+
+1. **`ATOM native`**
+   - `AttentionMetaData`
+2. **`vllm plugin`**
+   - `AttentionMetaData` 外壳
+   - `plugin_metadata` 内层 payload
+3. **`sglang plugin`**
+   - `ForwardMetadata`
+
+### 4.2 `AttentionMetaData` 和 `plugin/attention.py` 的关系
+
+`AttentionMetaData` 是外层通用容器；  
+`plugin/attention.py` 里的很多 dataclass，是 `vllm plugin` 为表达宿主 batch/runtime 语义而塞进 `plugin_metadata` 的 payload。
+
+所以它们不是并列关系，而是：
+
+```text
+AttentionMetaData
+  └─ plugin_metadata
+       ├─ flash-style plugin metadata
+       ├─ MLA plugin metadata
+       └─ sparse MLA plugin metadata
+```
+
+### 4.3 `ForwardMetadata` 的定位
+
+`ForwardMetadata` 更像：
+
+- `sglang backend` 内部的 lowering state
+- 它不是最终统一 metadata
+- 更不是天然可共享的 host-agnostic object
+
+它里面混着：
+
+- runtime control 信息
+- execution metadata
+- persistent kernel 预构造结果
+
+## 5. 三类字段
+
+为了判断能不能共享，本次把字段分成三类：
+
+### 5.1 host/runtime-bound
+
+典型例子：
+
+- `vllm` 的 `query_start_loc`
+- `num_decodes`
+- `num_prefills`
+- `chunked_context`
+- `sglang` 的 `run_graph`
+
+这些字段描述的是宿主框架如何组织 batch/runtime，而不是 kernel 怎么算。
+
+### 5.2 execution/kernel-bound
+
+典型例子：
+
+- `slot_mapping`
+- `block_tables`
+- `page_table`
+- `kv_indptr`
+- `kv_indices`
+- `qo_indptr`
+- `kv_last_page_len`
+- `work_indptr`
+- `reduce_indptr`
+
+这些字段才是 kernel 真正需要消费的执行信息。
+
+### 5.3 hardware-sensitive
+
+典型例子：
+
+- `max_q_len`
+- `max_kv_len`
+- `num_kv_splits`
+- `head_dim`
+- `attn_out_dtype`
+
+这些字段不是 host 语义，但会影响：
+
+- page/block 解释
+- kernel 选路
+- workspace/split 策略
+
+## 6. 关于 `kv_indptr` / `kv_indices` 的关键结论
+
+这个问题本次单独做了较多背景解释，结论是：
+
+### 6.1 它们不是完全 storage-agnostic
+
+因为它们绑定的是某种 **可索引的 KV 地址空间**：
+
+- token slot index
+- page/block index
+- compacted KV index
+
+所以它们不能脱离底层存储模型独立存在。
+
+### 6.2 但它们可以是 host/runtime-agnostic
+
+因为它们不表达：
+
+- `ForwardBatch.forward_mode`
+- `query_start_loc`
+- prefix 命中策略
+- graph/speculative 的宿主控制语义
+
+它们只表达：
+
+- 这次 kernel 要去哪些 KV 地址读数据
+- 每个 request 的边界在哪里
+
+### 6.3 两个例子
+
+#### 例子 A：`sglang`
+
+- `req_to_token[req_id, token_pos] = slot_id`
+- lowering 后得到：
+  - `kv_indptr = [0, 10]`
+  - `kv_indices = [8, 9, 10, 11, 0, 1, 2, 3, 4, 5]`
+
+#### 例子 B：`vllm/ATOM`
+
+- 上层是 `block_table`
+- 展开后同样可得到：
+  - `kv_indptr = [0, 10]`
+  - `kv_indices = [8, 9, 10, 11, 0, 1, 2, 3, 4, 5]`
+
+这说明：
+
+- 上层来源不同
+- 但 lower 后的 execution metadata 可以收敛
+
+## 7. public `sglang` 与 `sglang plugin` 在 KV cache 管理上的异同
+
+### 7.1 相同点
+
+- owner 都还是 public `sglang`
+  - `ReqToTokenPool`
+  - `TokenToKVPool`
+  - `KVCache`
+  - `radix_cache`
+- request 到 slot 的映射机制没变
+- prefix / radix 复用体系没变
+
+### 7.2 不同点
+
+#### A. MHA 写 cache 路径不同
+
+public `sglang` 更多沿用 pool 提供的标准写入接口：
+
+- `token_to_kv_pool.set_kv_buffer(...)`
+
+而 `sglang plugin` 在 MHA 路径里会绕过标准写法，自己做一次：
+
+- `set_kv_buffer_with_layout_shuffle(...)`
+- 把同一块 public pool buffer 按更偏 `ATOM` kernel 的 SHUFFLE layout 写进去
+
+#### B. MLA 路径更接近 public `sglang`
+
+MLA 上，plugin 更多沿用 public pool 的 contract：
+
+- `token_to_kv_pool.set_kv_buffer(...)`
+- `get_key_buffer(...)`
+- `get_value_buffer(...)`
+
+#### C. Lowering 目标不同
+
+- public `sglang` 的 metadata 更通用
+- plugin 的 `ForwardMetadata` 更偏 `ATOM` kernel
+  - `page_table`
+  - `pa_metadata_*`
+
+### 7.3 一个更准确的说法
+
+`sglang plugin` **没有改写 public `sglang` 的 pool owner / allocator / radix tree**，  
+但它在 MHA 路径上：
+
+- 改变了同一块 public pool buffer 的布局解释与写入约定
+- 并把 lowering 结果继续朝 `ATOM` kernel 所需的 metadata 形态推进
+
+## 8. Kernel 共用的两种 draft
+
+本次会话里，实际做了两种方向的 draft。
+
+### 8.1 方案 A：共享 kernel wrapper
+
+思路：
+
+- 新增共享小 metadata
+  - `MLADecodeKernelMetadata`
+  - `MHAAsmKernelMetadata`
+  - `MHAPersistentKernelMetadata`
+- 新增共享 wrapper
+  - `run_mla_decode_kernel`
+  - `run_mha_asm_kernel`
+  - `run_mha_persistent_kernel`
+
+优点：
+
+- 边界清晰
+- 更像长期可演进的 execution API
+- 不需要统一大 metadata
+
+缺点：
+
+- 还是引入了一层新的 shared contract
+
+### 8.2 方案 B：`sglang plugin` 主动适配 `ATOM core`
+
+思路：
+
+- 不改 `ATOM core`
+- 在 `sglang plugin` 里把 `ForwardMetadata` 转成 `AttentionMetaData`
+- 再伪造最小 `shim/self`
+- 直接调用：
+  - `PagedAttentionImpl.paged_attention_asm`
+  - `PagedAttentionImpl.paged_attention_persistent_asm`
+  - `MLAAttention._forward_decode`
+
+优点：
+
+- 不改 `ATOM core`
+- 更直观地证明 “plugin 主动复用 core” 是可行的
+
+缺点：
+
+- 需要 shim
+- 尤其 MLA 路径更脆弱
+- 更适合验证可行性，不适合长期保留
+
+## 9. 当前最值得保留的判断
+
+### 9.1 不应该强行统一大 metadata
+
+不建议直接统一：
+
+- `AttentionMetaData`
+- `plugin_metadata`
+- `ForwardMetadata`
+
+因为它们都带有较重的宿主 runtime 语义。
+
+### 9.2 最值得共享的是 kernel-facing execution metadata
+
+真正最有价值的共享边界是：
+
+- `kv_indptr`
+- `kv_indices`
+- `qo_indptr`
+- `page_table`
+- `context_lens`
+- `work_*`
+- `reduce_*`
+
+以及基于它们的 kernel call / runner。
+
+### 9.3 重构顺序建议
+
+如果继续推进 `sglang plugin` 重构，更合理的顺序是：
+
+1. 保持 `sglang` runtime 自有
+2. 显式化 metadata lowering 边界
+3. 尽量共享 kernel call / execution helper
+4. 最后再考虑是否继续上推到 runner 层
+
+## 10. 一句话总结
+
+本次会话最终收敛出的关键认识是：
+
+**`sglang plugin attention` 重构的关键，不是把 `sglang` runtime 改得更像 `vllm`，而是把 runtime、metadata lowering、kernel execution 这三层拆清楚；只要 lowering 之后能在 execution metadata 上对齐，就能在不统一宿主语义的前提下，共用更多 `ATOM` 的底层 attention 代码。**
diff --git a/work_log/attn_refractory/decoupling-diagram.md b/work_log/attn_refractory/decoupling-diagram.md
new file mode 100644
index 000000000..b95c0031e
--- /dev/null
+++ b/work_log/attn_refractory/decoupling-diagram.md
@@ -0,0 +1,158 @@
+# ATOM Plugin Decoupling Diagram
+
+下面这张图对比了当前入口架构和目标解耦架构。这里采用的是更激进、也更干净的方案：
+
+- `vLLM plugin` 和 `SGLang plugin` 在 `atom/plugin/` 层彻底分家
+- 不再保留共享的 plugin runtime
+- 共享只发生在更底层的 `ATOM core`
+
+## Current
+
+```mermaid
+flowchart TD
+    A[Shared Plugin Entry<br/>prepare.py / register.py / config.py]
+    B[Global Runtime State<br/>_CURRENT_FRAMEWORK<br/>current_atom_config<br/>ops.Attention]
+    C[vLLM Branch<br/>platform / registry / patch / graph]
+    D[SGLang Branch<br/>wrapper / registry hijack / RadixAttention / graph]
+    E[Mixed Plugin Core<br/>attention selection / static_forward_context / loader glue]
+
+    A --> B
+    B --> C
+    B --> D
+    C --> E
+    D --> E
+```
+
+### 当前设计的核心问题
+
+- 入口层通过切换全局状态来区分 `vLLM` 和 `SGLang`，而不是通过物理隔离的子系统建立边界。
+- `prepare.py` / `register.py` / `config.py` 这些共享入口，本身就是耦合中心。
+- host 差异直接泄漏到 plugin 内部结构里，导致 backend 选择、attention 抽象和 runtime context 都知道上层框架。
+
+## Target
+
+```mermaid
+flowchart TD
+    A1[vLLM Plugin<br/>bootstrap.py<br/>register.py<br/>config.py<br/>runtime.py<br/>attention glue<br/>model adapters]
+    A2[SGLang Plugin<br/>bootstrap.py<br/>register.py<br/>config.py<br/>runtime.py<br/>attention glue<br/>model adapters]
+
+    B[ATOM Core<br/>model_ops<br/>kernels<br/>loader<br/>metadata structs<br/>model specialization]
+
+    A1 --> B
+    A2 --> B
+```
+
+### 目标设计的关键点
+
+- `vLLM` 和 `SGLang` 在 plugin 层是两个完整子系统，而不是一套共享 runtime 上的两个分支。
+- `atom/plugin/` 下不再存在共享的 `prepare_model()`、共享的 `register.py` 总入口、共享的 plugin config translator。
+- `vLLM plugin` 自己处理：
+  - platform registration
+  - model override
+  - graph / MLA / weight hook patch
+  - vLLM model adapter / attention glue
+- `SGLang plugin` 自己处理：
+  - external model wrapper
+  - attention backend registration / override
+  - graph / speculative / wrapper patch
+  - SGLang model adapter / attention glue
+- 真正共享的只保留在更底层的 `ATOM core`：
+  - `model_ops`
+  - kernels
+  - loader
+  - generic metadata structures
+  - model-family specialization
+
+## 最重要的切断点
+
+```mermaid
+flowchart LR
+    A[Cut 1<br/>shared prepare.py]
+    B[Cut 2<br/>shared register.py]
+    C[Cut 3<br/>shared config.py]
+    D[Cut 4<br/>framework global state]
+    E[Cut 5<br/>ops.Attention global rebinding]
+
+    A --> F[vllm/ and sglang/ own bootstrap]
+    B --> F
+    C --> F
+    D --> G[host-local runtime only]
+    E --> H[host-specific attention glue]
+```
+
+## 推荐目录形态
+
+```text
+atom/plugin/
+  vllm/
+    bootstrap.py
+    register.py
+    config.py
+    runtime.py
+    attention/
+    models/
+    graph/
+  sglang/
+    bootstrap.py
+    register.py
+    config.py
+    runtime.py
+    attention/
+    models/
+    graph/
+```
+
+这里的重点不是“换目录”，而是：
+
+- `vllm/runtime.py` 和 `sglang/runtime.py` 各自独立
+- `vllm/config.py` 和 `sglang/config.py` 各自独立
+- `vllm/register.py` 和 `sglang/register.py` 各自独立
+- 不再有 plugin 级共享 runtime
+
+## 术语说明
+
+### bootstrap
+
+`bootstrap` 在这里指的是“插件被宿主框架加载时，最先执行的接线初始化层”。
+
+它负责的事情通常包括：
+
+- 注册 plugin 扩展点
+- 安装 patch / hook
+- 构造 host adapter / wrapper
+- 把宿主配置翻译到 ATOM 可用的格式
+
+它不应该负责：
+
+- attention forward 本身
+- 长期 runtime state 管理
+- 每次请求都要走的核心执行逻辑
+
+### register
+
+`register` 指的是把 plugin 的能力挂到宿主暴露的扩展点上，例如：
+
+- 注册 platform
+- 注册 model class
+- 注册 attention backend
+
+它更偏“声明接入点”，通常是 bootstrap 的一部分。
+
+### adapter
+
+`adapter` 指的是宿主框架和 ATOM core 之间的桥接层。  
+它的职责是做接口和参数语义翻译，而不是定义 attention / model 的核心语义。
+
+### runtime
+
+`runtime` 指的是插件在宿主里长期存在、服务执行链路的那部分运行时逻辑。  
+如果按这份设计推进，`runtime` 应该是 host-local 的：
+
+- `vLLM runtime` 只属于 `vLLM plugin`
+- `SGLang runtime` 只属于 `SGLang plugin`
+
+而不是共享一个“Plugin runtime”。
+
+## 一句话结论
+
+真正的解耦不是把共享入口再包装一层，而是把 `atom/plugin` 从“共享状态机”改成“两个完全独立的 host plugin 子系统”，共享只留给更底层的 `ATOM core`。
diff --git a/work_log/attn_refractory/vllm-integration-architecture.md b/work_log/attn_refractory/vllm-integration-architecture.md
new file mode 100644
index 000000000..e8876268d
--- /dev/null
+++ b/work_log/attn_refractory/vllm-integration-architecture.md
@@ -0,0 +1,269 @@
+# ATOM vLLM Plugin Integration Architecture
+
+## 1. 启动入口
+
+vLLM plugin 的 setuptools entrypoint 定义在：
+
+- `pyproject.toml`
+
+关键位置：
+
+- `vllm.platform_plugins`
+  - `atom = "atom.plugin.vllm.register:register_platform"`
+- `vllm.general_plugins`
+  - `atom_model_registry = "atom.plugin.vllm.register:register_model"`
+
+这意味着 vLLM 启动时会先进入：
+
+- `atom/plugin/vllm/register.py::register_platform()`
+- `atom/plugin/vllm/register.py::register_model()`
+
+## 2. 关键代码文件与职责
+
+### 2.1 `atom/plugin/vllm/register.py`
+
+这是 vLLM plugin 的启动接线层，主要负责：
+
+- 设置 plugin mode
+- 返回自定义 platform
+- 覆盖 vLLM 的 `ModelRegistry`
+- 安装 MLA patch
+- patch `Attention.process_weights_after_loading`
+- 安装 graph capture patch
+
+重点符号：
+
+- `register_platform()`
+- `register_model()`
+- `_VLLM_MODEL_REGISTRY_OVERRIDES`
+
+## 2.2 `atom/plugin/vllm/platform.py`
+
+这是 vLLM 平台侧 attention backend 选择入口。
+
+重点符号：
+
+- `ATOMPlatform.get_attn_backend_cls()`
+
+它根据 `attn_selector_config` 决定：
+
+- MHA -> `atom.model_ops.attentions.aiter_attention.AiterBackend`
+- MLA -> `atom.model_ops.attentions.aiter_mla.AiterMLABackend`
+- sparse MLA -> `atom.plugin.vllm.attention_backend.mla_sparse.AiterMLASparseBackend`
+
+这里是 vLLM runtime 真正选择 “哪个 ATOM attention backend class” 的入口。
+
+## 2.3 `atom/plugin/vllm/model_wrapper.py`
+
+这是 vLLM model integration 的核心文件。
+
+主要职责：
+
+- 把 `VllmConfig` 翻译成 `atom_config`
+- 做 plugin 运行环境准备
+- 选择并实例化 ATOM 模型类
+- 把 vLLM forward context 中的 `positions` 传给 ATOM 路径
+- 对 sparse MLA indexer 做额外注册
+
+重点符号：
+
+- `_prepare_env()`
+- `ATOMModelBase.__init__()`
+- `_register_indexer_caches_with_vllm()`
+- `forward()`
+
+## 2.4 `atom/plugin/config.py`
+
+这是 vLLM 配置翻译的桥。
+
+重点符号：
+
+- `generate_atom_config_for_vllm_plugin()`
+- `_generate_atom_config_from_vllm_config()`
+
+它负责把：
+
+- `VllmConfig`
+- `scheduler_config`
+- `cache_config`
+- `parallel_config`
+- `quant_config`
+
+映射成 ATOM 自己的 `Config` 与 `PluginConfig`。
+
+## 2.5 `atom/model_ops/paged_attention.py`
+
+这是 ATOM attention 在 vLLM plugin 下真正进入执行的关键对象。
+
+重点符号：
+
+- `PagedAttention.__init__()`
+- `PagedAttention.forward()`
+
+在 `is_vllm()` 下，它不会直接走 native/server 的那套 `impl` 初始化，而是：
+
+- 构造 vLLM 的 `Attention` / `MLAAttention` 外壳
+- 把额外 impl 参数塞进去
+- 注册到 `static_forward_context`
+- forward 时通过 `unified_attention_with_output_base_for_plugin_mode()` 回调到 ATOM impl
+
+## 2.6 `atom/plugin/attention.py`
+
+这是 vLLM plugin mode 的核心 glue 层。
+
+虽然文件在 `plugin/` 根目录，但从职责看它明显偏 vLLM。
+
+主要职责：
+
+- 定义 `vllmAiterAttentionBackendMethods`
+- 提供 backend decorator / metadata builder decorator
+- 处理 plugin-mode metadata
+- 处理 `static_forward_context`
+- 为 vLLM 的 attention backend 补 OOT 接口行为
+
+重点符号：
+
+- `vllmAiterAttentionBackendMethods`
+- 各类 `*DecoratorForPluginMode`
+
+## 2.7 `atom/plugin/attention_mha.py`
+
+这是 vLLM plugin mode 下 MHA impl 的补丁层。
+
+重点符号：
+
+- `PagedAttentionImplPluginModeMethods`
+
+它补的是：
+
+- rope/cache/update
+- plugin-mode forward
+- 与 vLLM KV cache/metadata 对齐的逻辑
+
+## 2.8 `atom/plugin/attention_mla.py`
+
+这是 vLLM plugin mode 下 MLA impl 的补丁层。
+
+重点符号：
+
+- `MLAAttentionImplPluginModeMethods`
+- `_mla_plugin_mode_init`
+
+它补的是：
+
+- MLA plugin-mode metadata
+- qk rope / kv cache / chunked prefill
+- vLLM MLA 路径专有行为
+
+## 2.9 `atom/plugin/vllm/mla_patch.py`
+
+这是 vLLM 原生 `MLAAttention` 的 monkey patch 入口。
+
+重点符号：
+
+- `_patch_vllm_mla_attention_forward_impl()`
+- `_patch_vllm_mla_attention_process_weights_after_loading()`
+- `patch_vllm_mla_attention()`
+
+这层作用是：
+
+- 把 vLLM `MLAAttention.forward_impl()` 改接到 ATOM impl
+- 把权重后处理改接到 ATOM 的 `impl.process_weights_after_loading()`
+
+## 2.10 `atom/plugin/vllm/graph_capture_patch.py`
+
+这是 vLLM graph capture 补丁入口。
+
+重点符号：
+
+- `apply_graph_capture_patch()`
+
+它实际委托到共享实现：
+
+- `atom/plugin/graph_capture_patch.py`
+
+但调用点是 vLLM 自己的 plugin register 流程。
+
+## 3. 当前 vLLM plugin 集成链路
+
+```mermaid
+flowchart TD
+    A[vLLM startup]
+    B[entrypoint<br/>register_platform]
+    C[entrypoint<br/>register_model]
+    D[ATOMPlatform.get_attn_backend_cls]
+    E[ATOMModelBase]
+    F[generate_atom_config_for_vllm_plugin]
+    G[_prepare_env<br/>set_attn_cls + init_aiter_dist]
+    H[Instantiate ATOM model]
+    I[PagedAttention / MLAAttention wrapper]
+    J[plugin/attention*.py glue]
+    K[ATOM impl<br/>PagedAttentionImpl / MLAAttention]
+    L[aiter kernels]
+
+    A --> B
+    A --> C
+    B --> D
+    C --> E
+    E --> F
+    E --> G
+    E --> H
+    H --> I
+    I --> J
+    J --> K
+    K --> L
+```
+
+## 4. 最关键的架构判断
+
+### 4.1 vLLM plugin 不是简单“替换 backend class”
+
+它实际上同时做了三件事：
+
+1. 替换 vLLM 的 model entry
+2. 替换/选择 vLLM 的 attention backend class
+3. 用 patch + decorator 的方式把 vLLM 的 layer/runtime 语义桥接到 ATOM impl
+
+### 4.2 真正共享得最好的是 native/server 与 vLLM plugin 的 attention impl
+
+尤其：
+
+- `PagedAttentionImpl`
+- `MLAAttention`
+- 低层 `aiter` kernel family
+
+也就是说，vLLM plugin 侧的关键复杂度主要不在 kernel，而在：
+
+- `ModelRegistry` override
+- `VllmConfig -> atom_config`
+- `static_forward_context`
+- `MLAAttention` patch
+- graph capture patch
+
+### 4.3 当前 `plugin/attention.py`、`attention_mha.py`、`attention_mla.py`
+
+虽然放在 `atom/plugin/` 根目录，但从 vLLM integration 的角度看，它们本质上更像：
+
+- `vLLM plugin impl glue`
+- 而不是真正框架无关的 shared runtime
+
+## 5. 推荐你重点阅读的文件顺序
+
+如果你是为了快速理解 vLLM plugin 集成架构，建议按这个顺序读：
+
+1. `pyproject.toml`
+2. `atom/plugin/vllm/register.py`
+3. `atom/plugin/vllm/platform.py`
+4. `atom/plugin/vllm/model_wrapper.py`
+5. `atom/plugin/config.py`
+6. `atom/model_ops/paged_attention.py`
+7. `atom/plugin/vllm/mla_patch.py`
+8. `atom/plugin/attention.py`
+9. `atom/plugin/attention_mha.py`
+10. `atom/plugin/attention_mla.py`
+
+## 6. 一句话结论
+
+vLLM plugin 的集成方式，本质上是：
+
+**用 entrypoint + platform + model wrapper 接入 vLLM，用 ATOM 自己的 attention impl 和 aiter kernels 提供核心执行，再用 patch/decorator 把 vLLM 的 layer/runtime 语义桥接进去。**

From e1d06c729da7fecb4597449a4e606f07f88a93a6 Mon Sep 17 00:00:00 2001
From: ZhiweiYan-96 <zhiwei.yan@amd.com>
Date: Tue, 12 May 2026 08:29:41 +0000
Subject: [PATCH 2/3] [ATOM-SGL][Attn refrac] Separate model-specific MLA from
 SGL full attention backend

---
 atom/model_ops/__init__.py                    |   4 +-
 atom/model_ops/attentions/aiter_attention.py  |   4 +-
 atom/plugin/register.py                       |   2 +-
 .../full_attention/__init__.py                |   8 +
 .../full_attention_backend.py}                |  12 +-
 .../{ => full_attention}/radix_attention.py   |   0
 .../sglang/models/base_model_wrapper.py       |   2 +-
 atom/plugin/sglang/models/deepseek_mla.py     |  75 +++++++++
 .../deepseek_mla_forward.py}                  | 154 +++---------------
 tests/plugin/test_sglang_model_wrapper.py     |   6 +-
 tests/plugin/test_sglang_register.py          |  21 ++-
 11 files changed, 137 insertions(+), 151 deletions(-)
 create mode 100644 atom/plugin/sglang/attention_backend/full_attention/__init__.py
 rename atom/plugin/sglang/attention_backend/{sgl_attn_backend.py => full_attention/full_attention_backend.py} (99%)
 rename atom/plugin/sglang/attention_backend/{ => full_attention}/radix_attention.py (100%)
 create mode 100644 atom/plugin/sglang/models/deepseek_mla.py
 rename atom/plugin/sglang/{attention_backend/sgl_attention_mla.py => models/deepseek_mla_forward.py} (86%)

diff --git a/atom/model_ops/__init__.py b/atom/model_ops/__init__.py
index 3883dd694..4670a5cdb 100644
--- a/atom/model_ops/__init__.py
+++ b/atom/model_ops/__init__.py
@@ -1,5 +1,7 @@
 from .paged_attention import PagedAttention
-from atom.plugin.sglang.attention_backend.radix_attention import RadixAttention
+from atom.plugin.sglang.attention_backend.full_attention.radix_attention import (
+    RadixAttention,
+)
 
 # This global class is used to construct the attention op in model,
 # it can be assigned to different attention ops.
diff --git a/atom/model_ops/attentions/aiter_attention.py b/atom/model_ops/attentions/aiter_attention.py
index e4e7cc47e..ba2bec968 100644
--- a/atom/model_ops/attentions/aiter_attention.py
+++ b/atom/model_ops/attentions/aiter_attention.py
@@ -16,7 +16,9 @@
 import atom.model_ops as ops
 from atom.model_ops.paged_attention import PagedAttention
 from atom.model_ops.attention_mha import PagedAttentionImpl
-from atom.plugin.sglang.attention_backend.radix_attention import RadixAttention
+from atom.plugin.sglang.attention_backend.full_attention.radix_attention import (
+    RadixAttention,
+)
 from atom.utils.forward_context import AttentionMetaData, Context
 
 from .backends import AttentionBackend, CommonAttentionBuilder
diff --git a/atom/plugin/register.py b/atom/plugin/register.py
index 86f06623d..043f14133 100644
--- a/atom/plugin/register.py
+++ b/atom/plugin/register.py
@@ -45,7 +45,7 @@ def _register_custom_attention_to_sglang() -> None:
     from sglang.srt.layers.attention.attention_registry import (
         register_attention_backend,
     )
-    from atom.plugin.sglang.attention_backend.sgl_attn_backend import (
+    from atom.plugin.sglang.attention_backend.full_attention.full_attention_backend import (
         ATOMAttnBackendForSgl,
     )
 
diff --git a/atom/plugin/sglang/attention_backend/full_attention/__init__.py b/atom/plugin/sglang/attention_backend/full_attention/__init__.py
new file mode 100644
index 000000000..2b9f6f272
--- /dev/null
+++ b/atom/plugin/sglang/attention_backend/full_attention/__init__.py
@@ -0,0 +1,8 @@
+from .radix_attention import RadixAttention
+from .full_attention_backend import ATOMAttnBackendForSgl, ForwardMetadata
+
+__all__ = [
+    "RadixAttention",
+    "ATOMAttnBackendForSgl",
+    "ForwardMetadata",
+]
diff --git a/atom/plugin/sglang/attention_backend/sgl_attn_backend.py b/atom/plugin/sglang/attention_backend/full_attention/full_attention_backend.py
similarity index 99%
rename from atom/plugin/sglang/attention_backend/sgl_attn_backend.py
rename to atom/plugin/sglang/attention_backend/full_attention/full_attention_backend.py
index a8ebc1122..2b58f6fed 100644
--- a/atom/plugin/sglang/attention_backend/sgl_attn_backend.py
+++ b/atom/plugin/sglang/attention_backend/full_attention/full_attention_backend.py
@@ -1,10 +1,10 @@
 from __future__ import annotations
 
-# sglang-specific attention backend replacing sglang's built-in AiterAttnBackend.
-# Shared by ALL models (DeepSeek, Qwen3, etc.) — handles KV cache writes,
-# page-table fixup, pa_persistent_fwd decode path, and MLA prefill kernels.
-# Sits at the lowest layer of the attention stack: sglang's RadixAttention
-# delegates the actual kernel dispatch here.
+# SGLang full-attention backend replacing sglang's built-in AiterAttnBackend.
+# Shared by ALL full-attention models (DeepSeek, Qwen3, etc.) — handles KV
+# cache writes, page-table fixup, pa_persistent_fwd decode path, and MLA
+# prefill kernels. Sits at the lowest layer of the attention stack:
+# sglang's RadixAttention delegates the actual kernel dispatch here.
 #
 # TODO: rewrite this file once sglang's attention flow is unified into ATOM's
 # attention layer — KV cache management and attention kernel dispatch will then
@@ -47,7 +47,7 @@
 except ImportError as e:
     raise ImportError(
         "Failed to import 'aiter', which provides AMD-specific attention kernels "
-        "required by sgl_attn_backend. Please ensure 'aiter' is installed and "
+        "required by full_attention_backend. Please ensure 'aiter' is installed and "
         f"available on your AMD system. Original import error: {e}"
     ) from e
 
diff --git a/atom/plugin/sglang/attention_backend/radix_attention.py b/atom/plugin/sglang/attention_backend/full_attention/radix_attention.py
similarity index 100%
rename from atom/plugin/sglang/attention_backend/radix_attention.py
rename to atom/plugin/sglang/attention_backend/full_attention/radix_attention.py
diff --git a/atom/plugin/sglang/models/base_model_wrapper.py b/atom/plugin/sglang/models/base_model_wrapper.py
index 3f2c743b1..dbace2e64 100644
--- a/atom/plugin/sglang/models/base_model_wrapper.py
+++ b/atom/plugin/sglang/models/base_model_wrapper.py
@@ -386,7 +386,7 @@ def __init__(
         # Apply ds model-specific sglang patches (attn dispatch, weight hooks, etc.)
         # TODO: will remove this after sglang supports atom attention backend
         if self.model_arch_spec.apply_deepseek_patch:
-            from atom.plugin.sglang.attention_backend.sgl_attention_mla import (
+            from atom.plugin.sglang.models.deepseek_mla import (
                 setup_deepseek_for_sglang,
             )
 
diff --git a/atom/plugin/sglang/models/deepseek_mla.py b/atom/plugin/sglang/models/deepseek_mla.py
new file mode 100644
index 000000000..a8d4deb1a
--- /dev/null
+++ b/atom/plugin/sglang/models/deepseek_mla.py
@@ -0,0 +1,75 @@
+# SPDX-License-Identifier: MIT
+# Copyright (C) 2024-2025, Advanced Micro Devices, Inc. All rights reserved.
+
+"""Model-level DeepSeek MLA patching for SGLang plugin mode.
+
+This module owns the monkey-patch entrypoints that adapt DeepSeek MLA models to
+SGLang plugin mode. The heavy DeepSeek-specific forward and weight helpers live
+in `atom.plugin.sglang.models.deepseek_mla_forward`.
+"""
+
+from __future__ import annotations
+
+from typing import TYPE_CHECKING, Any
+
+import torch
+
+from atom.plugin.sglang.models.deepseek_mla_forward import (
+    forward_sgl_plugin_mode,
+    init_sgl_attrs,
+    process_mla_kv_b_proj_after_loading,
+)
+
+if TYPE_CHECKING:
+    from atom.models.deepseek_v2 import DeepseekV2MLAAttention
+
+
+def setup_deepseek_for_sglang(model) -> None:
+    """Patch a DeepSeek V2/V3 model for SGLang plugin mode."""
+    config = model.config
+
+    # Store atom_config for the OOT wrapper before install-time hooks run.
+    if not hasattr(model, "atom_config"):
+        from atom.config import get_current_atom_config
+
+        model.atom_config = get_current_atom_config()
+
+    kv_cache_dtype = model.atom_config.kv_cache_dtype
+
+    # Initialise SGLang's MLA TP context before patching per-layer forwards.
+    from sglang.srt.configs.model_config import is_deepseek_nsa
+    from sglang.srt.layers.communicator import get_attn_tp_context
+
+    get_attn_tp_context().init_context(config.q_lora_rank, is_deepseek_nsa(config))
+
+    from atom.models.deepseek_v2 import DeepseekV2MLAAttention
+
+    for module in model.modules():
+        if isinstance(module, DeepseekV2MLAAttention):
+            _patch_mla_attention_for_sglang(module, config, kv_cache_dtype)
+
+
+def _patch_mla_attention_for_sglang(
+    attn: "DeepseekV2MLAAttention",
+    config: Any,
+    kv_cache_dtype: str = "bf16",
+) -> None:
+    """Patch one DeepSeek MLA layer for SGLang plugin mode."""
+    init_sgl_attrs(attn, config, kv_cache_dtype)
+
+    def patched_forward(
+        positions: torch.Tensor,
+        hidden_states: torch.Tensor,
+        **kwargs: Any,
+    ) -> torch.Tensor:
+        from atom.plugin.sglang.models.base_model_wrapper import (
+            get_current_forward_batch,
+        )
+
+        kwargs["forward_batch"] = get_current_forward_batch()
+        return forward_sgl_plugin_mode(attn, positions, hidden_states, **kwargs)
+
+    attn.forward = patched_forward
+    attn.process_weights_after_loading = lambda: process_mla_kv_b_proj_after_loading(
+        attn
+    )
diff --git a/atom/plugin/sglang/attention_backend/sgl_attention_mla.py b/atom/plugin/sglang/models/deepseek_mla_forward.py
similarity index 86%
rename from atom/plugin/sglang/attention_backend/sgl_attention_mla.py
rename to atom/plugin/sglang/models/deepseek_mla_forward.py
index 1dae9349e..25f1ef79a 100644
--- a/atom/plugin/sglang/attention_backend/sgl_attention_mla.py
+++ b/atom/plugin/sglang/models/deepseek_mla_forward.py
@@ -1,17 +1,19 @@
-"""Sglang-specific MLA forward and weight processing for DeepseekV2/V3.
+# SPDX-License-Identifier: MIT
+# Copyright (C) 2024-2025, Advanced Micro Devices, Inc. All rights reserved.
 
-DeepSeek MLA (Multi-Latent Attention) forward logic for sglang plugin mode:
+"""Model-specific DeepSeek MLA helpers for SGLang plugin mode.
+
+DeepSeek MLA (Multi-Latent Attention) forward logic for SGLang plugin mode:
 absorbed BMM computation, MHA/MLA path dispatch (prefill -> MHA, decode -> MLA),
-kv_b_proj weight splitting (w_kc/w_vc), and monkey-patch setup via
-setup_deepseek_for_sglang().
+and kv_b_proj weight splitting (w_kc/w_vc).
 
-This module is lazily imported from base_model_wrapper.py only when running in
-sglang plugin mode (``is_sglang() == True``).  Keeping all sglang-dependent
-imports here avoids crashing when sglang is not installed.
+This module lives under ``atom.plugin.sglang.models`` because the logic is
+DeepSeek-model-specific rather than a generic SGLang attention backend.
 
 TODO: rewrite this file once sglang's attention flow is unified into ATOM's
-attention layer — the MLA absorbed path and MHA dispatch will then be handled
-natively by ATOM's attention ops, making this sglang-specific module unnecessary.
+attention layer - the MLA absorbed path and MHA dispatch will then be handled
+natively by ATOM's attention ops, making this sglang-specific module
+unnecessary.
 """
 
 from __future__ import annotations
@@ -29,7 +31,6 @@
 from atom.models.utils import maybe_prefix
 from atom.models.deepseek_v2 import _fuse_rmsnorm_quant
 
-# sglang imports
 from sglang.srt.layers.communicator import AttentionInputs, get_attn_tp_context
 from sglang.srt.layers.attention.nsa.utils import nsa_use_prefill_cp
 from sglang.srt.model_executor.cuda_graph_runner import get_is_capture_mode
@@ -59,7 +60,6 @@
     from atom.models.deepseek_v2 import DeepseekV2MLAAttention
 
 
-# bmm_fp8 custom-op wrapper (adapted from sglang forward_mla.py)
 if _is_cuda:
     from sgl_kernel import bmm_fp8 as _raw_bmm_fp8
     from sglang.srt.utils.custom_op import register_custom_op
@@ -90,7 +90,6 @@ def bmm_fp8(A, B, A_scale, B_scale, dtype, out=None):
         raise RuntimeError("bmm_fp8 requires CUDA (sgl_kernel)")
 
 
-# NamedTuple for prepare → core data flow
 class SglPrepareResult(NamedTuple):
     q_pe: torch.Tensor
     k_pe: torch.Tensor
@@ -182,7 +181,6 @@ def _prepare_weight_for_bmm(
     )
 
 
-# Init helpers
 def init_sgl_attrs(
     attn: DeepseekV2MLAAttention,
     config,
@@ -216,7 +214,6 @@ def init_sgl_attrs(
         attn.attn_mha.attn.kv_b_proj = None
 
 
-# Absorbed batched-matmul (shared by prepare and core)
 def mla_absorbed_bmm(
     attn: DeepseekV2MLAAttention,
     inp: torch.Tensor,
@@ -225,12 +222,7 @@ def mla_absorbed_bmm(
     weight_scale_k: Optional[torch.Tensor],
     out_dim: int,
 ) -> torch.Tensor:
-    """Batched matmul for MLA absorbed weights (w_kc / w_vc).
-
-    Handles deep_gemm, mxfp4, fp8-triton, fp8-cublas, and bf16 fallback paths.
-    inp: (num_tokens, num_heads, in_dim) — token-major
-    Returns: (num_tokens, num_heads, out_dim) — token-major
-    """
+    """Batched matmul for MLA absorbed weights (w_kc / w_vc)."""
     effective_weight_scale = (
         weight_scale_k if weight_scale_k is not None else weight_scale
     )
@@ -299,7 +291,6 @@ def mla_absorbed_bmm(
         )
         return out.transpose(0, 1)
 
-    # CUDA fp8 path
     if weight.dtype == torch.float8_e4m3fn:
         val, scale = per_tensor_quant_mla_fp8(
             inp.transpose(0, 1),
@@ -308,7 +299,6 @@ def mla_absorbed_bmm(
         out = bmm_fp8(val, weight, scale, effective_weight_scale, torch.bfloat16)
         return out.transpose(0, 1)
 
-    # bf16 fallback
     return torch.bmm(inp.transpose(0, 1), weight).transpose(0, 1)
 
 
@@ -352,14 +342,13 @@ def mla_v_up_proj(
     ).flatten(1, 2)
 
 
-# Forward: prepare → core
 def forward_sgl_prepare(
     attn: DeepseekV2MLAAttention,
     positions: torch.Tensor,
     hidden_states: torch.Tensor,
     **model_kwargs,
 ) -> SglPrepareResult:
-    """Prepare QKV for sglang MLA attention (adapted from sglang forward_absorb_prepare)."""
+    """Prepare QKV for sglang MLA attention."""
     hidden_states_scale = None
     if isinstance(hidden_states, tuple):
         hidden_states, hidden_states_scale = hidden_states
@@ -401,9 +390,6 @@ def forward_sgl_prepare(
         k_nope = latent_cache[..., : attn.kv_lora_rank]
         q_scale = None
 
-        # Reuse native ATOM gating for q/k RMSNorm fusion. Quant fusion is used
-        # when DeepSeek enables qknorm-quant; otherwise keep the non-quant fused
-        # path aligned with native ATOM before falling back to plain layernorm.
         if getattr(attn, "fuse_qknorm_quant", False):
             q, q_scale, q_lora, k_nope = _fuse_qk_rmsnorm_and_q_quant(
                 attn,
@@ -413,7 +399,6 @@ def forward_sgl_prepare(
             )
         elif getattr(attn, "fuse_qknorm", False):
             q, k_nope = _fuse_qk_rmsnorm(attn, q, k_nope)
-        # Otherwise keep the original overlap path for unfused qk norm.
         elif attn.alt_stream is not None and get_is_capture_mode():
             current_stream = torch.cuda.current_stream()
             attn.alt_stream.wait_stream(current_stream)
@@ -425,11 +410,9 @@ def forward_sgl_prepare(
             q = attn.q_a_layernorm(q)
             k_nope = attn.kv_a_layernorm(k_nope)
 
-        if attn.use_nsa:
-            if q_lora is None:
-                q_lora = q
+        if attn.use_nsa and q_lora is None:
+            q_lora = q
 
-        # overlap q_b_proj and indexer during decode
         if (
             attn.alt_stream is not None
             and get_is_capture_mode()
@@ -506,7 +489,7 @@ def forward_sgl_core(
     attn: DeepseekV2MLAAttention,
     prepared: SglPrepareResult,
 ) -> torch.Tensor:
-    """Core MLA attention computation for sglang (adapted from sglang forward_absorb_core)."""
+    """Core MLA attention computation for sglang."""
     save_kv_cache = True
 
     if attn.use_fused_qk_rope_concat_and_cache_mla:
@@ -545,7 +528,6 @@ def forward_sgl_core(
             is_neox=attn.rotary_emb.is_neox_style,
             is_nope_first=True,
         )
-        # Decode/speculative MLA consumes q plus packed MLA cache directly.
         k = None
         v = None
         save_kv_cache = False
@@ -571,7 +553,6 @@ def forward_sgl_core(
     )
     attn_output = attn_output.view(-1, attn.num_local_heads, attn.kv_lora_rank)
 
-    # up-proj by w_vc
     attn_bmm_output = mla_v_up_proj(
         attn, attn_output, attn.w_vc, attn.w_scale, attn.w_scale_v, attn.v_head_dim
     )
@@ -580,15 +561,7 @@ def forward_sgl_core(
 
 
 def _dispatch_sgl_plugin_attn_path(forward_batch) -> str:
-    """Decide the attention algorithm for this batch based on forward_mode.
-
-    Returns "mha" for extend/prefill (uses standard Q×K×V with flash_attn)
-    or "mla" for decode (uses absorbed weights + mla_decode_fwd).
-
-    This is the per-batch *routing* decision, distinct from
-    ``_can_run_sgl_mha_now`` which is a *capability* gate checking whether
-    the model configuration supports the MHA path at all.
-    """
+    """Decide the attention algorithm for this batch based on forward_mode."""
     if forward_batch.forward_mode.is_extend_without_speculative():
         return "mha"
     return "mla"
@@ -635,12 +608,7 @@ def _set_mla_kv_buffer_for_mha(
 
 
 def _can_run_sgl_mha_now(attn: DeepseekV2MLAAttention, forward_batch) -> bool:
-    """Check if the model configuration supports the MHA attention path.
-
-    This is a *capability* gate — NSA models and MXFP4-quantised weights
-    (uint8) cannot use the MHA path. Distinct from
-    ``_dispatch_sgl_plugin_attn_path`` which routes each batch.
-    """
+    """Check if the model configuration supports the MHA attention path."""
     del forward_batch
     if attn.use_nsa:
         return False
@@ -819,8 +787,6 @@ def prepare_qkv_latent(
         hidden_states, hidden_states_scale = hidden_states
     qkv_lora = attn.fused_qkv_a_proj(hidden_states, hidden_states_scale)
 
-    # Fallback: when communicator does not enable input_scattered gather,
-    # force qkv latent token dimension to align with positions.
     expected_tokens = 0
     if hasattr(forward_batch, "positions") and forward_batch.positions is not None:
         expected_tokens = int(forward_batch.positions.shape[0])
@@ -843,7 +809,6 @@ def prepare_qkv_latent(
     return qkv_lora
 
 
-# Top-level forward entry point
 def forward_sgl_plugin_mode(
     attn: DeepseekV2MLAAttention,
     positions: torch.Tensor,
@@ -884,7 +849,6 @@ def forward_sgl_plugin_mode(
         raise ValueError(f"Unsupported plugin attention path: {attn_path}")
 
 
-# Weight post-processing: decomposed into sub-functions
 def _read_kv_b_proj_weight(attn: DeepseekV2MLAAttention) -> torch.Tensor:
     """Read kv_b_proj weight, handling AWQ and fnuz dtypes."""
     if hasattr(attn.kv_b_proj, "qweight"):
@@ -901,8 +865,6 @@ def _read_kv_b_proj_weight(attn: DeepseekV2MLAAttention) -> torch.Tensor:
     else:
         w = attn.kv_b_proj.weight
 
-    # On ROCm, ATOM creates parameters with fnuz dtype but loads fn bytes.
-    # View-cast back to fn so the normalize path works correctly.
     if _is_fp8_fnuz and w.dtype == torch.float8_e4m3fnuz:
         w = w.view(torch.float8_e4m3fn)
 
@@ -926,10 +888,7 @@ def _process_fp8_weight(
     w: torch.Tensor,
     weight_block_size: Optional[list[int]],
 ) -> tuple[torch.Tensor, bool, Optional[torch.Tensor]]:
-    """Process FP8 weights for kv_b_proj.
-
-    Returns (w, use_deep_gemm_bmm, block_scale).
-    """
+    """Process FP8 weights for kv_b_proj."""
     from atom.model_ops.utils import normalize_e4m3fn_to_e4m3fnuz
     from sglang.srt.layers.quantization.fp8_utils import (
         block_quant_dequant,
@@ -1061,7 +1020,6 @@ def _split_and_assign_kc_vc(
         [attn.qk_nope_head_dim, attn.v_head_dim], dim=1
     )
 
-    # quark fp4 special path
     quant_method = getattr(attn.kv_b_proj, "quant_method", None)
     quant_config = getattr(quant_method, "quant_config", None)
     if (
@@ -1084,8 +1042,6 @@ def _split_and_assign_kc_vc(
             w_kc = w_kc.transpose(1, 2).contiguous().transpose(1, 2)
             w_vc = w_vc.contiguous().transpose(1, 2)
 
-        # Align bf16 kv_b_proj post-load handling with vLLM: split first, then
-        # quantize kc/vc independently for the fp8 BMM path.
         if w.dtype == torch.bfloat16 and (_is_hip or _is_cuda):
             w_kc, w_scale_k = dynamic_per_batched_tensor_quant(w_kc, dtype=dtypes.fp8)
             w_vc, w_scale_v = dynamic_per_batched_tensor_quant(w_vc, dtype=dtypes.fp8)
@@ -1117,89 +1073,19 @@ def _split_and_assign_kc_vc(
 
 
 def process_mla_kv_b_proj_after_loading(attn: DeepseekV2MLAAttention) -> None:
-    """Process kv_b_proj weights after loading for sglang MLA mode.
-
-    Orchestrates reading, quantization handling, and splitting of
-    kv_b_proj into absorbed w_kc / w_vc weights.
-    """
+    """Process kv_b_proj weights after loading for sglang MLA mode."""
     w = _read_kv_b_proj_weight(attn)
     weight_block_size = _get_weight_block_size(attn)
 
     use_deep_gemm_bmm = False
     block_scale = None
 
-    # fp8 path
     if w.dtype in (torch.float8_e4m3fn, torch.float8_e4m3fnuz):
         w, use_deep_gemm_bmm, block_scale = _process_fp8_weight(
             attn, w, weight_block_size
         )
 
-    # int8 path
     if w.dtype == torch.int8:
         w = _process_int8_weight(attn, w, weight_block_size)
 
-    # split and assign kc/vc
     _split_and_assign_kc_vc(attn, w, use_deep_gemm_bmm, block_scale, weight_block_size)
-
-
-# One-time model setup (called from base_model_wrapper.py)
-def setup_deepseek_for_sglang(model) -> None:
-    """Patch a DeepseekV2/V3 model for sglang plugin mode.
-
-    - Initialises sglang TP context
-    - Patches each MLAAttention.forward to dispatch to the sglang MLA path
-    - Registers process_weights_after_loading hooks
-    - Stores atom_config on the model
-    """
-    config = model.config
-
-    # Store atom_config (needed by load_weights in the OOT wrapper)
-    if not hasattr(model, "atom_config"):
-        from atom.config import get_current_atom_config
-
-        model.atom_config = get_current_atom_config()
-
-    kv_cache_dtype = model.atom_config.kv_cache_dtype
-
-    # Initialise sglang TP context for MLA gather/scatter
-    from sglang.srt.configs.model_config import is_deepseek_nsa
-    from sglang.srt.layers.communicator import get_attn_tp_context
-
-    get_attn_tp_context().init_context(config.q_lora_rank, is_deepseek_nsa(config))
-
-    # Patch each MLAAttention instance
-    from atom.models.deepseek_v2 import DeepseekV2MLAAttention
-
-    for module in model.modules():
-        if isinstance(module, DeepseekV2MLAAttention):
-            _patch_mla_attention_for_sglang(module, config, kv_cache_dtype)
-
-
-def _patch_mla_attention_for_sglang(attn, config, kv_cache_dtype: str = "bf16") -> None:
-    """Patch a single DeepseekV2MLAAttention for sglang plugin mode.
-
-    We patch attn.forward (rather than relying solely on ops.Attention =
-    RadixAttention) because MLA's absorbed-weight forward path replaces the
-    *entire* forward method — including RoPE, and absorbed
-    BMM — not just the attention backend.  ops.Attention = RadixAttention
-    handles the backend layer (flash_attn / paged_attn dispatch) and is
-    already set via set_attn_cls(); this patch sits above that layer.
-    """
-    init_sgl_attrs(attn, config, kv_cache_dtype)
-
-    def patched_forward(
-        positions: torch.Tensor,
-        hidden_states: torch.Tensor,
-        **kwargs,
-    ) -> torch.Tensor:
-        from atom.plugin.sglang.models.base_model_wrapper import (
-            get_current_forward_batch,
-        )
-
-        kwargs["forward_batch"] = get_current_forward_batch()
-        return forward_sgl_plugin_mode(attn, positions, hidden_states, **kwargs)
-
-    attn.forward = patched_forward
-    attn.process_weights_after_loading = lambda: process_mla_kv_b_proj_after_loading(
-        attn
-    )
diff --git a/tests/plugin/test_sglang_model_wrapper.py b/tests/plugin/test_sglang_model_wrapper.py
index e4015ed9d..20d0e0792 100644
--- a/tests/plugin/test_sglang_model_wrapper.py
+++ b/tests/plugin/test_sglang_model_wrapper.py
@@ -54,8 +54,7 @@ def _make_fake_modules(*, is_last_rank: bool, setup_hook=None) -> dict[str, Modu
     forward_batch_mod.ForwardBatch = object
     forward_batch_mod.PPProxyTensors = object
 
-    attn_backend_pkg = _package("atom.plugin.sglang.attention_backend")
-    mla_mod = ModuleType("atom.plugin.sglang.attention_backend.sgl_attention_mla")
+    mla_mod = ModuleType("atom.plugin.sglang.models.deepseek_mla")
     mla_mod.setup_deepseek_for_sglang = setup_hook or (lambda model: None)
 
     return {
@@ -68,8 +67,7 @@ def _make_fake_modules(*, is_last_rank: bool, setup_hook=None) -> dict[str, Modu
         "sglang.srt.layers.quantization.base_config": quant_base_mod,
         "sglang.srt.model_executor": model_executor_pkg,
         "sglang.srt.model_executor.forward_batch_info": forward_batch_mod,
-        "atom.plugin.sglang.attention_backend": attn_backend_pkg,
-        "atom.plugin.sglang.attention_backend.sgl_attention_mla": mla_mod,
+        "atom.plugin.sglang.models.deepseek_mla": mla_mod,
     }
 
 
diff --git a/tests/plugin/test_sglang_register.py b/tests/plugin/test_sglang_register.py
index 562aaf8bc..1adb20fb4 100644
--- a/tests/plugin/test_sglang_register.py
+++ b/tests/plugin/test_sglang_register.py
@@ -324,10 +324,13 @@ def __init__(self, runner):
         "atom.models.qwen3_moe": ModuleType("atom.models.qwen3_moe"),
         "atom.models.glm4_moe": ModuleType("atom.models.glm4_moe"),
         "atom.models.deepseek_v2": ModuleType("atom.models.deepseek_v2"),
+        "atom.models.minimax_m2": ModuleType("atom.models.minimax_m2"),
+        "atom.models.qwen3_next": ModuleType("atom.models.qwen3_next"),
+        "atom.models.qwen3_5": ModuleType("atom.models.qwen3_5"),
         "atom.config": ModuleType("atom.config"),
         "atom.plugin.prepare": fake_prepare_mod,
-        "atom.plugin.sglang.attention_backend.sgl_attn_backend": ModuleType(
-            "atom.plugin.sglang.attention_backend.sgl_attn_backend"
+        "atom.plugin.sglang.attention_backend.full_attention.full_attention_backend": ModuleType(
+            "atom.plugin.sglang.attention_backend.full_attention.full_attention_backend"
         ),
     }
     fake_modules["atom.models.qwen3"].Qwen3ForCausalLM = type(
@@ -342,9 +345,21 @@ def __init__(self, runner):
     fake_modules["atom.models.deepseek_v2"].DeepseekV3ForCausalLM = type(
         "DeepseekV3ForCausalLM", (), {}
     )
+    fake_modules["atom.models.minimax_m2"].MiniMaxM2ForCausalLM = type(
+        "MiniMaxM2ForCausalLM", (), {}
+    )
+    fake_modules["atom.models.qwen3_next"].Qwen3NextForCausalLM = type(
+        "Qwen3NextForCausalLM", (), {}
+    )
+    fake_modules["atom.models.qwen3_5"].Qwen3_5ForCausalLM = type(
+        "Qwen3_5ForCausalLM", (), {}
+    )
+    fake_modules["atom.models.qwen3_5"].Qwen3_5MoeForCausalLM = type(
+        "Qwen3_5MoeForCausalLM", (), {}
+    )
     fake_modules["atom.config"].Config = type("Config", (), {})
     fake_modules[
-        "atom.plugin.sglang.attention_backend.sgl_attn_backend"
+        "atom.plugin.sglang.attention_backend.full_attention.full_attention_backend"
     ].ATOMAttnBackendForSgl = _FakeBackend
 
     with patch.dict(sys.modules, fake_modules):

From 35982548b938776f00a212f5984a5b7f46095815 Mon Sep 17 00:00:00 2001
From: ZhiweiYan-96 <zhiwei.yan@amd.com>
Date: Tue, 12 May 2026 08:33:02 +0000
Subject: [PATCH 3/3] remove work log

---
 .../2026-04-30-attn-refractory.md             | 354 -------------
 ...26-05-07-radix-cache-vs-paged-attention.md | 482 ------------------
 ...26-05-08-kv-indices-address-space-notes.md | 359 -------------
 .../2026-05-08-metadata-layering-summary.md   | 144 ------
 ...6-05-11-attn-refractory-session-summary.md | 398 ---------------
 .../attn_refractory/decoupling-diagram.md     | 158 ------
 .../vllm-integration-architecture.md          | 269 ----------
 7 files changed, 2164 deletions(-)
 delete mode 100644 work_log/attn_refractory/2026-04-30-attn-refractory.md
 delete mode 100644 work_log/attn_refractory/2026-05-07-radix-cache-vs-paged-attention.md
 delete mode 100644 work_log/attn_refractory/2026-05-08-kv-indices-address-space-notes.md
 delete mode 100644 work_log/attn_refractory/2026-05-08-metadata-layering-summary.md
 delete mode 100644 work_log/attn_refractory/2026-05-11-attn-refractory-session-summary.md
 delete mode 100644 work_log/attn_refractory/decoupling-diagram.md
 delete mode 100644 work_log/attn_refractory/vllm-integration-architecture.md

diff --git a/work_log/attn_refractory/2026-04-30-attn-refractory.md b/work_log/attn_refractory/2026-04-30-attn-refractory.md
deleted file mode 100644
index f0946299a..000000000
--- a/work_log/attn_refractory/2026-04-30-attn-refractory.md
+++ /dev/null
@@ -1,354 +0,0 @@
-# 2026-04-30 ATOM Attention Refactory Session Summary
-
-## 1. 本次会话的核心目标
-
-本次连续讨论的主线是围绕 `ATOM plugin` 中 attention backend 的重构方向展开，重点包括：
-
-- 调研当前 `ATOM plugin` attention backend 的架构问题
-- 评估 `vLLM plugin` 与 `SGLang plugin` 的耦合程度
-- 分析 SGLang plugin 在支持新模型、特别是非传统 MHA/MLA 类型 backend 时会遇到的困难
-- 讨论如何在 plugin 层把 `vLLM` 与 `SGLang` 解耦
-- 讨论 `SGLang` 中 `hybrid/composite backend` 的接入方式
-- 评估三种模式共享 ATOM attention 实现的难度：
-  - `ATOM native/server`
-  - `ATOM vLLM plugin`
-  - `ATOM SGLang plugin`
-
-## 2. 对当前 ATOM plugin attention 架构的主要判断
-
-### 2.1 当前不是一个真正统一的 attention backend 架构
-
-本次调研得到的一个核心结论是：
-
-- `vLLM plugin` 和 `SGLang plugin` 虽然都叫 “ATOM attention”
-- 但它们并不是通过同一套干净的 backend contract 接进来的
-- 更准确地说，是两套宿主接入模型外加多层 patch / adapter / runtime glue 共同构成的系统
-
-### 2.2 当前的主要问题不是底层 kernel，而是 runtime / ownership / patch 层次
-
-现有问题主要集中在：
-
-- 全局状态过重
-  - `_CURRENT_FRAMEWORK`
-  - `current_atom_config`
-  - `ops.Attention`
-- vLLM 与 SGLang 的接入模型差异很大，但还共享 `plugin` 根目录里一批伪公共逻辑
-- 很多扩展点依赖 monkey patch 和 runtime override
-- SGLang 侧对 DeepSeek MLA 已经上卷到 model-level patch，而不是单纯 backend 替换
-
-### 2.3 attention 文件结构的拆分轴不统一
-
-本次对 `plugin` 目录下 attention 相关文件的解读认为，目前至少混杂了三层不同对象：
-
-- host/runtime glue
-- backend runtime core
-- model-specific specialization
-
-尤其在 `SGLang` 侧：
-
-- `radix_attention.py` 更像 adapter
-- `sgl_attn_backend.py` 更像 full-attn runtime backend core
-- `sgl_attention_mla.py` 更像 DeepSeek MLA specialization
-
-而在根目录：
-
-- `attention.py`
-- `attention_mha.py`
-- `attention_mla.py`
-- `attention_mla_sparse.py`
-
-虽然放在 `plugin/` 根目录，但职责上明显更接近 `vLLM plugin` 的 glue / patch 层，而不是真正共享的 plugin runtime core。
-
-## 3. 关于 vLLM plugin 与 SGLang plugin 的耦合
-
-### 3.1 结论：耦合度高
-
-这次讨论中对二者耦合的判断是：
-
-- 不是轻量共享几个工具函数
-- 而是共享了一套 runtime state、attention 抽象、selector 和 plugin-mode 判断方式
-
-关键耦合点包括：
-
-- `atom/plugin/prepare.py`
-- `atom/plugin/register.py`
-- `atom/plugin/config.py`
-- `atom.plugin.prepare.is_vllm()/is_sglang()/is_plugin_mode()`
-- `ops.Attention` 的全局切换
-
-### 3.2 关键判断
-
-真正的问题不是 “目录没有拆开”，而是：
-
-**host runtime 差异没有被限制在 adapter 边界，而是泄漏进了 backend 选择、metadata 组织和全局状态模型。**
-
-## 4. 关于 plugin 层解耦的共识
-
-### 4.1 plugin 层不应该继续保留共享 runtime
-
-讨论最终倾向于一个更激进但更干净的方向：
-
-- `atom/plugin/vllm/` 是一个完整子系统
-- `atom/plugin/sglang/` 是另一个完整子系统
-- `atom/plugin/` 根目录不再承担共享 runtime 中心的角色
-
-共享只保留给更底层的 `ATOM core`，例如：
-
-- `model_ops`
-- kernels
-- loader
-- metadata helpers
-- model-family specialization 中真正 host-agnostic 的部分
-
-### 4.2 “bootstrap” 的定义
-
-本次还明确了 `bootstrap` 在本讨论语境中的含义：
-
-- 不是核心执行逻辑
-- 而是插件被宿主框架加载时的最早期接线/初始化层
-
-也就是：
-
-- 注册扩展点
-- 安装 patch
-- 选择 wrapper / adapter / backend
-- 建立 host 与 plugin 的连接关系
-
-## 5. 已经做过的代码原型
-
-本次会话中，为了验证“按 host 拆入口”的思路，已经做了一批原型代码修改：
-
-### 5.1 新增的 bootstrap / prepare / register 原型
-
-- `atom/plugin/sglang/bootstrap.py`
-- `atom/plugin/vllm/bootstrap.py`
-- `atom/plugin/sglang/prepare.py`
-- `atom/plugin/sglang/register.py`
-
-### 5.2 `prepare.py` 已部分收缩为兼容层
-
-- 实际的 SGLang prepare 逻辑已经移动到 `atom/plugin/sglang/prepare.py`
-- `atom/plugin/prepare.py` 现在更像一个 legacy shim
-- 但它仍保留了 framework-state helper（因为仓库里很多地方还在依赖）
-
-### 5.3 SGLang register 逻辑已开始下沉
-
-SGLang 独有的几块逻辑已经迁入：
-
-- `register_ops_to_sglang`
-- `set_sglang_attn_cls`
-- `init_aiter_dist_for_sglang`
-- `bootstrap_sglang_runtime`
-
-共享 `atom/plugin/register.py` 目前更多是兼容 facade。
-
-### 5.4 说明
-
-这些改动的定位是：
-
-- 用来显式化未来结构
-- 不是完整重构
-- 还保留了较多兼容层
-
-## 6. 关于 SGLang attention backend 的重要判断
-
-### 6.1 `ATOMAttnBackendForSgl` 与 public `AiterAttnBackend`
-
-在一次反复讨论后，本次对它们的关系收敛为：
-
-- 在 backend 角色位置上，两者基本是 apple-to-apple
-- `ATOMAttnBackendForSgl` 不是“更高层的新东西”
-- 而是 `SGLang full-attention runtime backend` 的 ATOM 对位实现
-
-因此更合适的重构思路不是怀疑它的存在合法性，而是：
-
-- 保留它作为 full-attn backend core
-- 再去拆它内部的 metadata / kv_cache / decode / graph 等职责
-
-### 6.2 DeepSeek MLA 的特殊性
-
-`sgl_attention_mla.py` 暴露出一个很重要的事实：
-
-- 对 `MLA`，尤其是 `DeepSeek MLA`
-- SGLang plugin 已经不是单纯换 backend
-- 而是 patch 了模型级 `forward`
-
-这说明：
-
-**MLA 在 SGLang plugin 里已经进入了 model specialization 维度。**
-
-## 7. 关于 GDN / KDA / Lightning / Mamba2 等非传统 “attention”
-
-### 7.1 调研结论
-
-public SGLang 已经证明，系统里不止 full-attention backend，还存在：
-
-- linear attention backend
-  - `GDNAttnBackend`
-  - `KDAAttnBackend`
-  - `LightningAttentionBackend`
-  - `Mamba2AttnBackend`
-- hybrid/composite backend
-  - `HybridLinearAttnBackend`
-  - `HybridAttnBackend`
-- wrapper/composition backend
-  - `TboAttnBackend`
-
-### 7.2 对当前 ATOM plugin 设计造成的挑战
-
-当前 ATOM plugin 的主抽象仍偏向：
-
-- MHA
-- MLA
-- full attention runtime
-
-这会带来几个结构性问题：
-
-1. 还没有把 `linear backend` 当成一等公民
-2. 还没有给 `hybrid/composite backend` 预留独立宿主位置
-3. 容易把 algorithm backend、kernel backend、host glue 混成一层
-4. 容易继续把 `GDN` 等路径硬塞回 full-attn backend 体系
-
-### 7.3 当前共识
-
-后续不应该继续只谈 “attention backend 重构”，而应该升级为：
-
-**sequence backend / mixer backend 体系重构**
-
-建议至少在设计上并列三类：
-
-- `full_attn`
-- `linear_attn`
-- `hybrid/composite`
-
-## 8. hybrid/composite backend 在 plugin 侧的接入思路
-
-### 8.1 核心判断
-
-由于 ATOM SGLang plugin 的启动方式是：
-
-- 启动 `sglang server`
-- 再通过 plugin runtime override 接管 model / attention
-
-所以要接 `hybrid/composite backend`，不可避免需要 patch 某个 seam。
-
-### 8.2 最好的 seam
-
-本次讨论认为，public SGLang 里最好的 seam 是：
-
-- `model_runner.py` 中导入并调用的 `attn_backend_wrapper`
-
-注意：
-
-- 它不是 `ModelRunner` 的成员方法
-- 而是 `model_runner.py` 模块级导入的符号
-
-所以如果要 patch，更稳的是 patch：
-
-- `sglang.srt.model_executor.model_runner.attn_backend_wrapper`
-
-而不是仅 patch `attention_registry.attn_backend_wrapper`。
-
-### 8.3 最推荐的方式
-
-对 hybrid/composite backend 的接入方式，本次最终倾向于：
-
-**plugin-own 一个薄的 composite wrapper/factory，只在 backend 构造 seam 做单点 patch。**
-
-不推荐：
-
-- 深度 monkey patch public hybrid backend 实现
-- 一开始就复制整套 public linear/full backend 逻辑
-
-更推荐：
-
-- full side 先用 ATOM full backend
-- linear side 初期先复用 public SGLang 的 `GDNAttnBackend` / `KDAAttnBackend` / `LightningAttentionBackend` / `Mamba2AttnBackend`
-- composite wrapper 由 plugin 侧拥有
-
-## 9. 关于三种模式共享 ATOM attention 实现的难度
-
-本次对：
-
-- `ATOM native/server`
-- `ATOM vLLM plugin`
-- `ATOM SGLang plugin`
-
-三种模式共享 ATOM attention 实现，得到的判断如下。
-
-### 9.1 MHA
-
-难度：`Medium-High`
-
-原因：
-
-- `ATOM native` 与 `ATOM vLLM plugin` 已经在 `PagedAttentionImpl` 等层共享较多
-- 真正难的是 `SGLang plugin` 这一侧
-- 它不走 `PagedAttention` 的 ATOM 主路径，而走：
-  - `RadixAttention`
-  - `ATOMAttnBackendForSgl`
-  - `ForwardBatch`
-
-所以 MHA 共享的主要困难在于：
-
-**runtime orchestration 差异**
-
-### 9.2 MLA
-
-难度：`High`
-
-原因：
-
-- native/server 与 vLLM plugin 仍然较多共享 `MLAAttention`
-- 但 SGLang plugin 下的 DeepSeek MLA 已经上卷到 model specialization
-- 不只是 backend 不同，连 forward 组织方式都不同
-
-所以 MLA 的困难在于：
-
-**runtime orchestration + model specialization 双重叠加**
-
-## 10. 已产出的文档与图
-
-本次会话过程中，已额外产出下列文档/图，用于不同角度说明问题。
-
-### 10.1 代码目录内 Markdown
-
-- `atom/plugin/decoupling-diagram.md`
-- `atom/plugin/vllm-integration-architecture.md`
-- `atom/plugin/sglang-attention-backend-survey.md`
-
-### 10.2 Canvas / 可视化分析
-
-- `atom-attention-backend-architecture-review.canvas.tsx`
-- `atom-plugin-coupling-risk-analysis.canvas.tsx`
-- `atom-plugin-decoupling-diagram.canvas.tsx`
-- `atom-vllm-plugin-architecture.canvas.tsx`
-- `atom-attention-sharing-modes.canvas.tsx`
-
-### 10.3 这些产物对应的主题
-
-- 当前 attention backend 架构缺陷
-- vLLM / SGLang plugin 耦合与风险
-- plugin 解耦方向与 bootstrap 理解
-- vLLM plugin 集成架构
-- public SGLang backend survey
-- 三种模式共享 ATOM attention 的难度评估
-
-## 11. 当前最值得继续推进的方向
-
-如果延续本次会话的结论，后续工作最值得按下面顺序推进：
-
-1. 继续收缩 `atom/plugin/` 根目录的 shared runtime 语义
-2. 把 `full_attn / linear_attn / hybrid` 三层作为 plugin 后续结构设计的一等公民
-3. 在 `SGLang plugin` 侧先补出 `hybrid/composite` 组合层
-4. 初期复用 public SGLang linear backend，优先验证 runtime 结构是否成立
-5. 然后再评估哪些 linear backend 需要逐步替换成 ATOM 自己的实现
-6. 对 `MLA` 尽早拆开：
-   - generic MLA runtime
-   - DeepSeek specialization
-
-## 12. 一句话总结
-
-本次会话最终把问题收敛为：
-
-**当前 ATOM plugin attention 的关键矛盾，不是底层 kernel 能不能共享，而是 vLLM / SGLang / native 三种模式在 runtime、metadata、model specialization 和 backend ownership 上没有对齐；后续重构应从“统一 attention backend”升级为“重新定义 plugin 的 sequence backend / host-owned runtime 结构”。**
diff --git a/work_log/attn_refractory/2026-05-07-radix-cache-vs-paged-attention.md b/work_log/attn_refractory/2026-05-07-radix-cache-vs-paged-attention.md
deleted file mode 100644
index f0f9d5bcc..000000000
--- a/work_log/attn_refractory/2026-05-07-radix-cache-vs-paged-attention.md
+++ /dev/null
@@ -1,482 +0,0 @@
-# 2026-05-07 Radix Cache vs Paged Attention Notes
-
-> 预估阅读时间：15 分钟  
-> 主题：梳理 `SGLang` 中 `radix cache` / `RadixAttention` 与 `ATOM` / `vLLM` 常见的 `paged attention` 之间的关系，解释为什么它们最终可以落到相同的 kernel，以及这件事对 `sglang plugin` attention backend 重构意味着什么。
-
-## 1. 这篇笔记想回答什么
-
-围绕 `sglang plugin attn backend` 的重构，最近反复出现了几个问题：
-
-1. `SGLang` 用的是 `RadixAttention`，`ATOM`/`vLLM plugin` 里常见的是 `PagedAttention`，两者是不是两套完全不同的注意力体系？
-2. 如果它们真的不同，为什么在 `DeepSeek MLA` 这类路径里，最终又会调用到相同的底层 kernel？
-3. 既然 `SGLang` 还多了一层 `radix cache` 前缀管理，为什么没有一个非常直观的结论说 "`SGLang` 一定比 `vLLM` 更快 / 更省显存"？
-4. 对当前重构来说，`radix` / `paged` 的差异到底应该被归类到：
-   - host runtime 差异
-   - KV cache 管理差异
-   - metadata lowering 差异
-   - kernel 差异
-   的哪一层？
-
-本文尝试把这几个问题放进同一个分析框架里。
-
-## 2. 先给结论
-
-### 2.1 最核心的一句话
-
-`RadixAttention` 和 `PagedAttention` 主要不是在底层注意力数学上不同，而是在 **runtime 如何管理、共享、命中和定位历史 KV** 上不同。  
-一旦 runtime 把历史 KV lowering 成底层 kernel 能吃的 `page_table`、`kv_indptr`、`kv_indices`、`qo_indptr` 之类的 metadata，kernel 就不再关心这些索引最初是来自 radix tree 还是来自 paged block table。
-
-### 2.2 另一句更工程化的结论
-
-对重构来说，不应该把 “`RadixAttention` vs `PagedAttention`” 当作 “能不能共享底层执行实现” 的直接判断依据。  
-更应该把系统拆成三层：
-
-1. **host/runtime 层**  
-   例如 `ForwardBatch`、`RadixAttention`、`ModelRunner`、prefix cache、speculative 调度
-2. **execution metadata lowering 层**  
-   例如 `page_table`、`block_tables`、`kv_indptr`、`kv_indices`、`qo_indptr`
-3. **kernel / execution core 层**  
-   例如 `pa_fwd_asm`、`pa_persistent_fwd`、`flash_attn_varlen_func`、`mla_decode_fwd`、`mla_prefill_fwd`
-
-`radix` 和 `paged` 的差异，主要在第 1、2 层；  
-兼容性和共享机会，主要出现在第 2、3 层。
-
-## 3. 为什么名字容易让人误解
-
-当前最容易造成误解的点是：
-
-- `PagedAttention` 这个名字听起来像“最终执行算法”
-- `RadixAttention` 这个名字也听起来像“最终执行算法”
-
-但在工程上，它们都更接近 **host-facing attention runtime abstraction**，而不是最终 kernel 名字。
-
-更具体一点：
-
-- `PagedAttention` 更强调 **按固定 page/block 管理 KV cache**
-- `RadixAttention` 更强调 **按 prefix/radix tree 管理请求历史与前缀复用**
-
-这两者都不是在说：
-
-- `QK^T` 怎么算
-- softmax 怎么做
-- decode kernel 用哪套 asm/triton/flashinfer
-
-这些底层执行问题，是更下一层的 backend / kernel 决定的。
-
-## 4. `RadixAttention` 到底是什么
-
-### 4.1 `RadixAttention` 不是 kernel
-
-在 public `SGLang` 里，`RadixAttention` 本身并不直接定义底层 kernel 选择。  
-它更像是：
-
-- SGLang 模型层里统一使用的 attention layer 壳
-- 它知道自己在 `ForwardBatch` 语义里运行
-- 它把真正的执行委托给 `forward_batch.attn_backend`
-
-因此，`RadixAttention` 更像一个 **layer/runtime 入口点**。
-
-### 4.2 `radix` 的重点是 prefix 管理
-
-`radix cache` 的本质，是一棵 prefix tree，用来做：
-
-- 前缀命中
-- 前缀复用
-- unfinished / finished request 的缓存插入
-- 基于 prefix 的 eviction / lifecycle 管理
-
-它回答的问题更像：
-
-- 这个请求的历史前缀已经缓存到哪里了？
-- 哪一部分可以直接复用，不必重新 prefill？
-- 哪些 KV slot/page 仍然属于共享前缀？
-- 哪些历史片段可以回收？
-
-所以 `radix` 优化的主要是 **prefix reuse**，而不是单次 attention kernel 的算力效率。
-
-### 4.3 `radix cache` 在 SGLang 里不是完全“反 page”的
-
-这一点很重要。public `SGLang` 的 `radix_cache` 本身就已经带有 page-aware 的逻辑。
-
-例如它有 page 对齐：
-
-- `page_align_keys()`
-
-也有按 page 粒度匹配 prefix：
-
-- `_key_match_paged()`
-
-这意味着：
-
-- `radix cache` 解决的是前缀共享与命中问题
-- 但它并不排斥最终按 page/block 粒度来表达缓存
-
-也就是说，**radix tree 的逻辑管理** 与 **paged/block 的物理表达** 并不冲突。
-
-## 5. `PagedAttention` 到底是什么
-
-### 5.1 `PagedAttention` 的重点不是 prefix tree
-
-`PagedAttention` 更强调：
-
-- KV cache 被组织成固定大小的 page/block
-- 每个请求通过 `page_table` / `block_table` 找到自己历史对应的 page
-- kernel 依照这些 page/block 索引读取历史 KV
-
-它回答的问题更像：
-
-- 当前请求历史有哪些 block/page？
-- 它们在物理内存中的位置是什么？
-- 这些 page 如何映射到 kernel 所需的 block table？
-
-所以 `paged attention` 更偏：
-
-- **物理存储布局**
-- **kernel 读取 KV 的地址组织方式**
-
-### 5.2 `PagedAttention` 天生不等于 prefix reuse
-
-`paged attention` 当然也可以配合 prefix cache 使用，但 prefix reuse 并不是它名字里最核心的概念。  
-它的核心价值更在于：
-
-- 支持非连续物理布局
-- 让 decode kernel 可以按 page/block 高效读取 KV
-
-## 6. 为什么 `radix` 和 `paged` 最后可以兼容
-
-### 6.1 因为它们主要解决的是不同层的问题
-
-可以把它们看成：
-
-- `radix`: 逻辑历史管理器
-- `paged`: 物理页表表达方式
-
-两者关系有点像：
-
-- `radix` 负责决定“哪些历史是共享的、哪些是命中的、请求当前逻辑历史是什么”
-- `paged` 负责决定“这些历史最终以什么 page/block 索引形式交给 kernel”
-
-只要 runtime 最终能把逻辑历史翻译成 page/block 或 indptr/index 形式，底层 kernel 并不需要知道前缀最初是如何组织的。
-
-### 6.2 兼容发生在 lowering 之后
-
-这一点非常关键：
-
-上层看起来是：
-
-- `ForwardBatch`
-- `RadixAttention`
-- `req_to_token`
-- prefix cache
-
-但 lower 到 backend 之后，会变成执行态 metadata，例如：
-
-- `page_table`
-- `block_tables`
-- `kv_indptr`
-- `kv_indices`
-- `qo_indptr`
-- `kv_last_page_len`
-
-到了这个层次，kernel 只看：
-
-- query 的 shape
-- KV cache 的 shape
-- 历史长度
-- 索引表 / indptr
-
-它不看：
-
-- 这套索引是不是来自 radix tree
-- 是不是来自某个 prefix cache 命中
-- 是不是通过 `PagedAttention` 对象构造出来的
-
-所以兼容性不是“名字兼容”，而是 **execution metadata 兼容**。
-
-## 7. public `SGLang` 自己就在做类似 lowering
-
-这不是 `ATOM plugin` 独有的现象。public `SGLang` 本身就在这么做，只是不同 backend lower 到的 metadata 形态略有不同。
-
-### 7.1 `aiter_backend`
-
-public `SGLang` 的 `aiter_backend` 会从：
-
-- `forward_batch.seq_lens`
-- `forward_batch.req_pool_indices`
-- `req_to_token`
-
-构造：
-
-- `kv_indptr`
-- `kv_indices`
-- `qo_indptr`
-
-也就是说，它会把 host runtime 的历史表示 lower 成 AITER kernel 所需的执行态索引。
-
-### 7.2 `triton_backend`
-
-public `SGLang` 的 `triton_backend` 也做类似事情。  
-它同样维护：
-
-- `kv_indptr`
-- `kv_indices`
-- `qo_indptr`
-
-然后再调用 Triton 的 decode/extend kernel。
-
-### 7.3 `trtllm_mha_backend` / 部分 `flashinfer` 路径
-
-这类 backend 更偏向：
-
-- 直接构造 `page_table`
-- 或 `block_tables`
-
-再喂给 flashinfer / TRTLLM 风格的 kernel。
-
-所以更准确地说，public `SGLang` 的共同模式不是：
-
-> “所有 backend 都把 radix runtime 翻成 paged attention”
-
-而是：
-
-> “所有 backend 都会把 radix/runtime 层的历史表示，lower 成各自 kernel 能吃的索引/页表 metadata”
-
-其中有的偏 `kv_indptr/kv_indices`，有的偏 `page_table/block_tables`。
-
-## 8. 从 kernel 角度看，为什么可以兼容
-
-### 8.1 kernel 真正关心什么
-
-绝大多数 attention kernel 真正关心的是：
-
-1. 当前 query 的排列方式
-2. 每个请求可见的历史 KV 长度
-3. 如何从某个索引结构里拿到历史 KV 的物理地址
-4. page/block 大小、stride、layout
-5. 是否需要 prefix/extend/speculative 的特殊 metadata
-
-它通常不关心：
-
-1. 这份索引最初是不是从 prefix tree 查出来的
-2. 命中前缀的逻辑是谁做的
-3. runtime 是 `RadixAttention` 还是 `PagedAttention`
-
-### 8.2 对 MHA kernel 的影响
-
-MHA decode 常见地会落到：
-
-- `pa_fwd_asm`
-- `pa_persistent_fwd`
-- `flash_attn_varlen_func`
-
-这些 kernel 更偏 page/block 或 varlen index 风格。  
-只要 runtime 最终给出：
-
-- `page_table`
-- `context_lens`
-- `block_tables`
-
-或者：
-
-- `kv_indptr`
-- `kv_indices`
-- `qo_indptr`
-
-它们都能算。
-
-### 8.3 对 MLA kernel 的影响
-
-`DeepSeek MLA` 更容易看出这种兼容性，因为 MLA 的底层算子路径更明确。
-
-典型 kernel 包括：
-
-- `mla_decode_fwd`
-- `mla_prefill_fwd`
-- `concat_and_cache_mla`
-- `fused_qk_rope_concat_and_cache_mla`
-
-这些 kernel 只要求：
-
-- 压缩后的 latent KV 表示
-- 以及相应的 `qo_indptr` / `kv_indptr` / `kv_indices`
-
-所以无论上层 host 是：
-
-- `ATOM native`
-- `ATOM vLLM plugin`
-- `ATOM SGLang plugin`
-
-只要最终 lower 成相同的 MLA execution metadata，就自然会落到同一套 MLA kernel。
-
-## 9. `DeepSeek MLA` 为什么尤其容易“看起来收敛”
-
-### 9.1 因为 MLA 的 execution core 更专用
-
-`DeepSeek MLA` 的真正执行对象不是：
-
-- “`RadixAttention` 版本的 MLA”
-- “`PagedAttention` 版本的 MLA”
-
-而是：
-
-- 一组固定的 MLA latent/cache 表示
-- 一套固定的 rope + kv cache 写入方式
-- 一套固定的 prefill/decode kernel
-
-换句话说，MLA 的底层 core 更容易抽成“共享执行层”。
-
-### 9.2 因为 host 差异主要体现在 metadata 准备与调度
-
-对于 `DeepSeek MLA` 来说，上层 host 差异更多体现在：
-
-- `positions` 从哪里来
-- `forward_batch` 怎么组织
-- decode / extend / speculative / target_verify 怎么分流
-- `req_to_token` 怎么转换成 `kv_indices`
-
-一旦进入 `mla_decode_fwd` 或 `mla_prefill_fwd` 前，这些差异已经被消化掉了。
-
-所以从外部观察，很容易得到一种印象：
-
-> “怎么上面完全不一样，下面却是同一套 kernel？”
-
-其实并不奇怪，因为两边只是上层 runtime 不同，而 MLA execution core 本来就在下层收敛。
-
-## 10. 既然兼容，为什么 `SGLang` 没有天然更快或更省显存
-
-这是另外一个很容易误判的问题。
-
-### 10.1 `radix cache` 省的是前缀重复，不是所有 KV
-
-`radix cache` 能节省的，是重复前缀带来的重复 prefill 和重复 KV 占用。
-
-它主要帮助的是：
-
-- 大量请求共享长前缀
-- 同 prompt 多分支采样
-- prefix-heavy workload
-- 某些 speculative / chunked prefill 场景
-
-如果 workload 不满足这些条件，它就不会天然表现为显著优势。
-
-### 10.2 它优化的是 prefix reuse，不是单 token decode kernel
-
-radix 管理层并不会让 `mla_decode_fwd`、`pa_fwd_asm` 这种 kernel 自己更便宜。  
-kernel 的 raw 速度主要还是受：
-
-- 头数
-- head dim
-- KV layout
-- quant
-- page size
-- GPU kernel 实现
-
-这些因素决定。
-
-所以常见现象是：
-
-- prefix-heavy workload 下，`SGLang` 的收益更明显
-- decode-heavy workload 下，收益没那么明显
-- 如果 `vLLM` 也有自己的 prefix caching/block caching，差距会进一步缩小
-
-### 10.3 radix 自己也有管理成本
-
-`radix cache` 不是免费层。它也有：
-
-- prefix match
-- tree split / insert / eviction
-- request -> KV slot 映射维护
-- speculative / unfinished request 的额外 bookkeeping
-
-如果命中率不高，这部分成本并不会自动转化成收益。
-
-## 11. 对当前重构最有价值的判断
-
-### 11.1 不要把 `RadixAttention` vs `PagedAttention` 当成是否共享 execution core 的边界
-
-因为从 public `SGLang` 和当前 `ATOM plugin` 的代码路径看，二者最终都在做类似事情：
-
-- 从 host runtime 的历史表示出发
-- 构造底层 kernel 所需的 execution metadata
-
-所以：
-
-- runtime 不同，不代表 execution core 必须不同
-- 名字不同，不代表 kernel 层必须分叉
-
-### 11.2 真正的边界更适合这样划
-
-#### A. `sglang plugin runtime` 应拥有的部分
-
-- `ForwardBatch`
-- `RadixAttention`
-- prefix cache / radix cache
-- `ModelRunner` / `attn_backend_wrapper` 相关调度
-- speculative / graph / verify / chunked prefill 等宿主语义
-
-#### B. execution metadata lowering 层
-
-这层是最值得重新定义的边界。它负责：
-
-- 把 `req_to_token`、prefix 命中结果、cache state
-- 转成 `page_table` / `kv_indptr` / `kv_indices` / `qo_indptr`
-
-这层既带有 host 语义，又已经开始接近共享执行核心，是未来最值得梳理清楚的一层。
-
-#### C. shared execution core
-
-例如：
-
-- `mla_decode_fwd`
-- `mla_prefill_fwd`
-- `pa_fwd_asm`
-- `pa_persistent_fwd`
-- `flash_attn_varlen_func`
-
-以及与这些 kernel 紧耦合的那部分 ATOM core helper。
-
-### 11.3 这对 `sglang plugin` 的重构意味着什么
-
-当前 `sglang plugin` 最不该做的是：
-
-- 因为底层 kernel 能共享，就把 `sglang` 的 runtime seam 也强行对齐到 `vllM`
-- 或者因为上层叫 `RadixAttention`，就认为它和 `PagedAttention` 完全不能共享底层执行实现
-
-更合理的重构方向是：
-
-1. 保持 `sglang` 的 host/runtime 语义完整
-2. 识别 runtime lowering 层的清晰边界
-3. 尽量把 execution core 继续下沉回共享 `ATOM core`
-
-## 12. 一种对外更容易讲清楚的表述
-
-如果后续要向别人解释，可以用下面这几句话：
-
-### 12.1 关于 `radix` vs `paged`
-
-`radix cache` 主要解决的是 prefix 命中与复用，`paged attention` 主要解决的是 page/block 形式的 KV 组织与读取。  
-前者偏逻辑管理，后者偏物理表达；两者不是同一层次的问题。
-
-### 12.2 关于为什么最终会调到同一个 kernel
-
-因为 attention kernel 只关心最终的执行态 metadata，比如 `page_table`、`kv_indptr`、`kv_indices`、`qo_indptr`。  
-一旦 runtime 把历史 KV lowering 成这类形式，kernel 就不再关心这些索引最初是由 radix tree 还是 paged runtime 生成的。
-
-### 12.3 关于为什么 `SGLang` 没有天然显著更快
-
-因为 `radix` 带来的主要收益是 prefix reuse，而不是单 token decode kernel 的 raw speed。  
-如果 workload 不是 prefix-heavy，或者 decode 占主要成本，那么 radix 带来的优势不会自然放大成“总是更快 / 更省显存”。
-
-## 13. 最终总结
-
-把这次探索收敛成一句话：
-
-**`RadixAttention` 与 `PagedAttention` 的主要差异，不在底层注意力数学，也不一定在最终 KV 的物理表示本身，而在 host runtime 如何管理、共享、命中并 lowering 历史 KV；一旦 lowering 完成，底层 kernel 完全可以是共享的。**
-
-对当前 `sglang plugin attn backend` 的重构来说，真正值得做的不是争论 “`radix` 和 `paged` 能不能统一成一个名字”，而是把系统拆清楚：
-
-- 哪些是 `sglang` 必须自有的 runtime / prefix cache / scheduling 语义
-- 哪些是 execution metadata lowering
-- 哪些已经是可以共享的 `ATOM` execution core
-
-这比直接讨论 “`radix` 和 `paged` 谁更先进、谁应该替代谁” 更接近真正的工程问题。
diff --git a/work_log/attn_refractory/2026-05-08-kv-indices-address-space-notes.md b/work_log/attn_refractory/2026-05-08-kv-indices-address-space-notes.md
deleted file mode 100644
index e5a63be0f..000000000
--- a/work_log/attn_refractory/2026-05-08-kv-indices-address-space-notes.md
+++ /dev/null
@@ -1,359 +0,0 @@
-# 2026-05-08 KV Indices and Address Space Notes
-
-> 预估阅读时间：8-10 分钟  
-> 主题：解释 `kv_indptr` / `kv_indices` 为什么不是纯 host metadata，也不是完全 storage-agnostic 的抽象，并用 `sglang` / `ATOM` / `vllm plugin` 的实际 KV cache 存储举例说明。
-
-## 1. 想回答的问题
-
-最近有一个很容易卡住重构讨论的问题：
-
-> `kv_indptr` / `kv_indices` 看起来和 KV cache 的存储方式深度绑定。  
-> 既然如此，为什么还能把它们看成 `sglang` / `ATOM` / `vllm plugin` 之间可以共享的 metadata？
-
-这个问题的关键在于区分两件事：
-
-1. **host/runtime 是否相同**
-2. **kernel 最终消费的地址空间模型是否相同**
-
-一句话先说结论：
-
-**`kv_indptr` / `kv_indices` 不是完全 storage-agnostic 的抽象，它们绑定的是某种“可索引的 KV 地址空间”；  
-但它们可以是 host/runtime-agnostic 的，只要不同宿主最终都能把自己的 runtime state lowering 到同一种地址空间模型。**
-
-## 2. 三层视角
-
-理解这个问题，最好先把 attention 相关代码拆成三层：
-
-### 2.1 host/runtime 层
-
-这一层描述的是宿主框架如何组织请求和调度，例如：
-
-- `ATOM native` 的 `ScheduledBatch`
-- `vllm plugin` 的 `query_start_loc` / `num_prefills`
-- `sglang plugin` 的 `ForwardBatch` / `forward_mode` / `spec_info`
-
-这一层表达的是：
-
-- 哪些请求在 batch 中
-- 哪些是 decode，哪些是 prefill
-- prefix / speculative / graph 的语义是什么
-
-### 2.2 execution metadata lowering 层
-
-这一层做的事情是：
-
-- 把 host/runtime 里的 batch 状态
-- 翻译成 kernel 真正能消费的索引结构
-
-典型字段：
-
-- `block_tables` / `page_table`
-- `kv_indptr`
-- `kv_indices`
-- `qo_indptr`
-- `kv_last_page_len`
-
-### 2.3 kernel / execution core 层
-
-这一层只关心：
-
-- query tensor
-- KV cache tensor
-- index / indptr / page table
-- persistent kernel workspace
-
-例如：
-
-- `mla_decode_fwd`
-- `mla_prefill_fwd`
-- `pa_fwd_asm`
-- `pa_persistent_fwd`
-
-## 3. `sglang` 的实际 KV cache 存储
-
-在 `sglang` 里，KV cache 本身不是存在 radix tree 里。  
-radix tree 负责的是 prefix 复用与命中；真正的 KV 还是存在 memory pool 里。
-
-`sglang` 自己的注释说得很清楚：
-
-- `ReqToTokenPool` 负责 request -> token location 映射
-- `TokenToKVPoolAllocator` 负责 KV cache index 管理
-- `KVCache` 真正持有物理 K/V tensor
-
-也就是说，`sglang` 的 runtime 世界里至少有两层：
-
-1. **逻辑视角**：某个 request 的第 `i` 个 token 属于哪里
-2. **物理视角**：这个 token 对应的 K/V 存在物理 pool 的哪个 slot
-
-### 3.1 `ReqToTokenPool`
-
-`ReqToTokenPool` 本质上是一张二维表：
-
-```text
-req_to_token[req_id, token_pos] = physical_slot_id
-```
-
-它回答的是：
-
-- 某个 request 的第 `j` 个逻辑 token
-- 对应到物理 KV pool 里的哪个 slot
-
-### 3.2 物理 KV pool
-
-物理 pool 里，K 和 V 都是按 slot 存的。  
-写入 KV 时，最终就是按 `loc/indices` 写入：
-
-```text
-k_cache[indices] = k
-v_cache[indices] = v
-```
-
-所以从 kernel 角度看，最终它看到的是：
-
-- 一个可以索引的 KV 地址空间
-- 一组 slot index
-
-而不是看到“radix tree”本身。
-
-## 4. `ATOM` / `vllm plugin` 的实际 KV cache 存储
-
-`ATOM` / `vllm plugin` 更偏 paged/block 风格。
-
-典型思路是：
-
-- 每个 request 有一张 `block_table`
-- 每个 block 对应一个固定大小的 page
-- kernel 按 `block_table` 或其展开后的 index 去读 KV
-
-所以它的上层直觉更像：
-
-```text
-request -> block table -> page/block address -> physical KV
-```
-
-和 `sglang` 相比，最大的不同不是“有没有物理地址空间”，而是：
-
-- `sglang` 上层先经过 `req_to_token`
-- `vllm/ATOM` 上层先经过 `block_table`
-
-但两边最终都能 lower 到一组可供 kernel 读取的地址索引。
-
-## 5. 例子一：MLA decode 中的 `kv_indptr` / `kv_indices`
-
-这个例子里，可以把 `kv_indices` 理解为 **token-slot index**。
-
-### 5.1 在 `sglang` 里
-
-假设一个 request 当前长度是 10，  
-它在物理 KV pool 里的 slot 顺序不是连续的，而是：
-
-```text
-[8, 9, 10, 11, 0, 1, 2, 3, 4, 5]
-```
-
-那 `req_to_token[req_id, :10]` 就大致表示：
-
-```text
-logical token position: 0  1   2   3   4  5  6  7  8  9
-physical slot id:      8  9  10  11   0  1  2  3  4  5
-```
-
-如果这个 batch 只有这一个 request，那么 lowering 之后可以得到：
-
-```text
-kv_indptr = [0, 10]
-kv_indices = [8, 9, 10, 11, 0, 1, 2, 3, 4, 5]
-```
-
-这里：
-
-- `kv_indptr` 表示“第 0 个 request 的 KV 范围是 `[0, 10)`”
-- `kv_indices` 表示“真正去物理 KV pool 里读这些 slot”
-
-### 5.2 在 `vllm/ATOM` 里
-
-如果 page size 是 4，同样 10 个 token 对应的 block layout 可以写成：
-
-```text
-block 2 -> slots [8, 9, 10, 11]
-block 0 -> slots [0, 1, 2, 3]
-block 1 -> slots [4, 5, 6, 7]
-```
-
-block table 大致就是：
-
-```text
-[2, 0, 1]
-```
-
-展开前 10 个 token 后，对应的 slot 顺序仍然是：
-
-```text
-[8, 9, 10, 11, 0, 1, 2, 3, 4, 5]
-```
-
-于是 lower 之后，给 MLA decode kernel 的：
-
-```text
-kv_indptr = [0, 10]
-kv_indices = [8, 9, 10, 11, 0, 1, 2, 3, 4, 5]
-```
-
-和 `sglang` 这边是等价的。
-
-### 5.3 这个例子说明什么
-
-说明在 MLA decode 这个路径里：
-
-- `kv_indices` 绑定的是 **token-slot 地址空间**
-- 它不是“完全与存储无关”
-- 但它已经不关心 slot 列表最初是由 `req_to_token` 还是 `block_table` 推出来的
-
-也就是说，**它和具体宿主无关，但和被选定的地址空间模型有关。**
-
-## 6. 例子二：MHA persistent decode 中的 `page_table` / `kv_indices`
-
-这个例子里，`kv_indices` 更接近 **page/block index**，而不是 token-slot index。
-
-### 6.1 在 `sglang plugin` 里
-
-仍然假设 page size = 4。  
-如果某个 request 的 token-slot 顺序是：
-
-```text
-[8, 9, 10, 11, 0, 1, 2, 3, 4, 5]
-```
-
-那么每页的起始 slot 是：
-
-```text
-[8, 0, 4]
-```
-
-除以 page size 后，对应 page id：
-
-```text
-[2, 0, 1]
-```
-
-在 `sglang plugin` 里，这就是 `page_table` 的来源：  
-从 `req_to_token` 按页采样，再除以 `page_size`。
-
-然后 `_build_pa_metadata_for_decode()` 再把它变成 `pa_persistent_fwd` 所需的 page-level `kv_indices`。
-
-### 6.2 在 `ATOM` / `vllm` 里
-
-这一层本来就是 paged/block 视角，所以 `block_table` 原生就已经是：
-
-```text
-[2, 0, 1]
-```
-
-也就是说：
-
-- `sglang` 是 `req_to_token -> page_table`
-- `vllm/ATOM` 是直接 `block_table -> page_table`
-
-但到了 persistent kernel 眼里，最后看到的是同一种 page/block 地址空间。
-
-### 6.3 这个例子说明什么
-
-说明在 MHA persistent decode 这个路径里：
-
-- `kv_indices` 绑定的是 **page/block 地址空间**
-- 它仍然不是“完全 storage-agnostic”
-- 但也不需要知道上层 host 是 `sglang` 还是 `vllm`
-
-只要两边都能 lower 到同一个 page/block 地址空间，kernel 就能共享。
-
-## 7. 这两个例子真正说明的事
-
-上面两个例子其实在说明同一个结论：
-
-### 7.1 `kv_indptr` / `kv_indices` 不是 host metadata
-
-它们不描述：
-
-- `ForwardBatch.forward_mode`
-- `query_start_loc`
-- `num_prefills`
-- `run_graph`
-- prefix hit 的高层语义
-
-它们只描述：
-
-- 当前这次 kernel 调用
-- 要读哪些 KV 地址
-- 每个 request 的边界在哪里
-
-所以它们更接近：
-
-- execution metadata
-- kernel-facing metadata
-
-而不是 host/runtime metadata。
-
-### 7.2 但它们也不是完全 storage-agnostic
-
-它们依赖：
-
-- 当前 KV cache 的地址空间模型
-- kernel 对该地址空间的解释方式
-
-如果底层 cache 不是：
-
-- 可线性索引的 token slot
-- 或可 flatten 的 page/block index
-
-那 `kv_indptr` / `kv_indices` 这套抽象就未必成立。
-
-所以它们不是“无限通用”的。
-
-## 8. 对重构的直接启示
-
-这个背景知识对当前 `sglang attention` 重构最有价值的结论是：
-
-### 8.1 不要去统一 host metadata
-
-比如不要强行统一：
-
-- `ForwardBatch`
-- `query_start_loc`
-- `num_prefills`
-- `run_graph`
-
-这些都属于宿主框架自己的 runtime 语言。
-
-### 8.2 应该争取统一 execution metadata
-
-例如：
-
-- `kv_indptr`
-- `kv_indices`
-- `qo_indptr`
-- `page_table`
-- `kv_last_page_len`
-- `work_indptr`
-- `reduce_indptr`
-
-以及围绕这些字段构造的更小 dataclass，例如当前 draft 里已经开始尝试的：
-
-- `MLADecodeKernelMetadata`
-- `MHAAsmKernelMetadata`
-- `MHAPersistentKernelMetadata`
-
-### 8.3 正确的共享边界
-
-所以真正可共享的不是：
-
-- `sglang` 的 `ForwardMetadata`
-- `vllm plugin` 那一整套 metadata family
-
-而是：
-
-- 它们再往下切出来的
-- kernel-facing execution metadata
-
-## 9. 一句话总结
-
-**`kv_indptr` / `kv_indices` 不是“和存储彻底无关”的抽象，它们绑定的是某种被 kernel 接受的 KV 地址空间；但正因为它们已经脱离了宿主 batch/runtime 语义，`sglang` 和 `vllm/ATOM` 即使上层完全不同，也仍然可以在这层 metadata 上收敛并共享 kernel 调用逻辑。**
diff --git a/work_log/attn_refractory/2026-05-08-metadata-layering-summary.md b/work_log/attn_refractory/2026-05-08-metadata-layering-summary.md
deleted file mode 100644
index 86bccce8e..000000000
--- a/work_log/attn_refractory/2026-05-08-metadata-layering-summary.md
+++ /dev/null
@@ -1,144 +0,0 @@
-# 2026-05-08 Metadata Layering Summary
-
-## 1. 目的
-
-这份笔记简要总结 `ATOM native`、`vllm plugin`、`sglang plugin` 三边 metadata 的关系，重点回答：
-
-- `ATOM/atom/utils/forward_context.py` 中的 `AttentionMetaData` 是什么
-- `ATOM/atom/plugin/attention.py` 里的 metadata dataclass 在做什么
-- `ATOM/atom/plugin/sglang/attention_backend/sgl_attn_backend.py` 里的 `ForwardMetadata` 属于哪一层
-- 哪些 metadata 值得共享，哪些不值得强行统一
-
-## 2. 三套 metadata 的定位
-
-### 2.1 `AttentionMetaData`
-
-文件：`ATOM/atom/utils/forward_context.py`
-
-它是 `ATOM` attention 执行链路里的**通用外层容器**。  
-里面既有：
-
-- 通用 attention 字段
-  - `slot_mapping`
-  - `block_tables`
-  - `kv_indptr`
-  - `kv_indices`
-  - `work_meta_data`
-- 也预留了 plugin 扩展入口
-  - `plugin_metadata`
-
-所以它更像一个大容器，而不是某个 host 专属的数据模型。
-
-### 2.2 `vllm plugin metadata`
-
-文件：`ATOM/atom/plugin/attention.py`
-
-`vllm plugin` 不是重写一套完全独立的 metadata，而是在 `AttentionMetaData` 外壳上，额外挂了一层 `plugin_metadata` payload。
-
-代表类型包括：
-
-- `AiterFlashAttentionMetadataForPluginMode`
-- `AiterMLACommonMetadataForPluginMode`
-- `AiterMLADecodeMetadataForPluginMode`
-- `AiterMLASparseMetadataForPluginMode`
-
-这些 dataclass 主要承载：
-
-- `vllm` 的 batch/runtime 语义
-- decode/prefill/extend phase 拆分
-- chunked prefill / DCP / sparse MLA 等 feature-specific 信息
-
-一句话：**`vllm plugin` 是“外层复用 `AttentionMetaData`，内层自带一套 host-specific metadata family”。**
-
-### 2.3 `sglang plugin` 的 `ForwardMetadata`
-
-文件：`ATOM/atom/plugin/sglang/attention_backend/sgl_attn_backend.py`
-
-`ForwardMetadata` 不是 `AttentionMetaData.plugin_metadata` 那种 payload，而是 `sglang backend` 内部的**lowering 结果缓存对象**。
-
-它更接近：
-
-- `ForwardBatch`
-- `req_to_token`
-- `token_to_kv_pool`
-- graph/speculative path
-
-向 kernel metadata 的过渡层。
-
-一句话：**`ForwardMetadata` 更像 backend-local lowering state，而不是 host-agnostic 的最终执行 metadata。**
-
-## 3. 三类字段
-
-### 3.1 host/runtime-bound
-
-这类字段描述宿主框架如何组织 batch/runtime，而不是 kernel 如何计算。
-
-例子：
-
-- `vllm plugin`
-  - `query_start_loc`
-  - `num_decodes`
-  - `num_prefills`
-  - `chunked_context`
-- `sglang plugin`
-  - `run_graph`
-  - 各种 `ForwardBatch.forward_mode` 派生分支语义
-
-### 3.2 execution/kernel-bound
-
-这类字段最接近 kernel 真正需要的执行信息。
-
-例子：
-
-- `slot_mapping`
-- `block_tables`
-- `context_lens`
-- `kv_indptr`
-- `kv_indices`
-- `qo_indptr`
-- `kv_last_page_len`
-- `work_indptr`
-- `reduce_indptr`
-
-### 3.3 hardware-sensitive
-
-这类字段和 page size、dtype、kernel 选择、并行配置相关。
-
-例子：
-
-- `max_q_len`
-- `max_kv_len`
-- `num_kv_splits`
-- `attn_out_dtype`
-- `head_dim`
-
-## 4. 哪些值得共享
-
-### 不建议强行共享的
-
-- `AttentionMetaData` 整体
-- `vllm plugin` 整套 `plugin_metadata`
-- `sglang plugin` 整体的 `ForwardMetadata`
-
-原因不是它们没价值，而是它们都带着很重的 host/runtime 语义。
-
-### 值得重点共享的
-
-应该共享的是更小的 **execution metadata view**，例如当前 draft 已经开始尝试的：
-
-- `MLADecodeKernelMetadata`
-- `MHAAsmKernelMetadata`
-- `MHAPersistentKernelMetadata`
-
-这些结构只保留某个 kernel call 真正需要的字段，更适合在：
-
-- `ATOM core`
-- `vllm plugin`
-- `sglang plugin`
-
-之间共享。
-
-## 5. 一句话结论
-
-现在不是三边共用“一套 metadata”，而是三边各自都有一层 host-owned metadata / lowering 体系。  
-真正值得共享的，不是这些大 metadata 本身，而是从它们里面切出来的、更小的、kernel-facing execution metadata。
diff --git a/work_log/attn_refractory/2026-05-11-attn-refractory-session-summary.md b/work_log/attn_refractory/2026-05-11-attn-refractory-session-summary.md
deleted file mode 100644
index 3b2128f5b..000000000
--- a/work_log/attn_refractory/2026-05-11-attn-refractory-session-summary.md
+++ /dev/null
@@ -1,398 +0,0 @@
-# 2026-05-11 ATOM Attention Refactory Session Summary
-
-> 主题：从 `sglang plugin` 设计者视角，梳理 `attention backend` 重构中的 runtime / metadata / KV cache / kernel reuse 边界，并记录几种 kernel 共用 draft 的结论。
-
-## 1. 本次会话的主线
-
-本次讨论的核心不是继续从 `vllm plugin` 出发看问题，而是明确切换到：
-
-- **`sglang plugin attn backend` 设计者视角**
-- 关注 `sglang plugin` 自己的 runtime 语义
-- 判断哪些层应该继续由 `sglang plugin` 自有
-- 判断哪些层值得尽量与 `ATOM core` 共用
-
-围绕这个目标，本次会话主要探索了 6 个问题：
-
-1. `RadixAttention` / `radix cache` 与 `PagedAttention` 的关系
-2. public `sglang` 是否也会把 radix runtime lowering 到 paged / index metadata
-3. metadata 为什么会越长越多，以及哪些字段本质上属于 host/runtime
-4. `kv_indptr` / `kv_indices` 为什么既绑定存储模型，又仍然能跨 host 共用
-5. `sglang plugin` 和 public `sglang` 在 KV cache 管理上的异同
-6. 在不大改文件结构、不做过度软件工程的前提下，怎么试探性地共用 kernel call
-
-## 2. 对 `sglang plugin` 重构最关键的几个判断
-
-### 2.1 不是所有问题都该叫 “attention backend 复用”
-
-当前问题至少分成三层：
-
-1. **host/runtime 层**
-   - `ForwardBatch`
-   - `RadixAttention`
-   - speculative / graph / verify / extend 调度
-2. **execution metadata lowering 层**
-   - `page_table`
-   - `kv_indptr`
-   - `kv_indices`
-   - `qo_indptr`
-   - `kv_last_page_len`
-3. **kernel / execution core 层**
-   - `flash_attn_varlen_func`
-   - `mla_decode_fwd`
-   - `mla_prefill_fwd`
-   - `pa_fwd_asm`
-   - `pa_persistent_fwd`
-
-后续任何“共用更多 code”的讨论，都应该先说明是在第几层说话。
-
-### 2.2 `sglang plugin` 不应该为了复用去对齐到 `vllm` 的 host runtime
-
-`vllm plugin` 和 `sglang plugin` 上层接入模型不同：
-
-- `vllm plugin` 更像 layer-owned / metadata-builder-owned 路径
-- `sglang plugin` 更像 `ForwardBatch` 驱动的 backend runtime 路径
-
-所以：
-
-- 不该共享 `vllm plugin` 那层 runtime facade
-- 也不该强行统一大 metadata 对象
-- 真正值得共享的，是更靠近 kernel 的 execution metadata 和 kernel call
-
-### 2.3 `sglang plugin` 想共用 `ATOM` code，最现实的边界在 kernel call 一层
-
-如果先不做大规模结构改造，最适合先共用的是：
-
-- `mla_decode_fwd`
-- `pa_persistent_fwd`
-- `pa_fwd_asm`
-- `flash_attn_varlen_func`
-
-也就是：
-
-- 保留各自的 runtime lowering
-- 先把“掉 kernel 前最后一跳”抽出来
-
-## 3. `RadixAttention` / `radix cache` 和 `PagedAttention` 的结论
-
-### 3.1 `RadixAttention` 不是最终 kernel
-
-`RadixAttention` 更像：
-
-- `sglang` 模型层统一使用的 attention layer 壳
-- 它知道自己跑在 `ForwardBatch` 语义里
-- 最终还是委托给 `forward_batch.attn_backend`
-
-所以它不是 “另一套底层注意力数学”，而是 **host-facing runtime abstraction**。
-
-### 3.2 `PagedAttention` 和 `radix` 解决的问题不在同一层
-
-- `radix cache` 更偏 prefix 命中 / 复用 / eviction
-- `paged attention` 更偏 block/page 形式的物理组织和读取
-
-可以理解成：
-
-- `radix` 管“历史怎么共享和命中”
-- `paged` 管“命中的历史最后如何变成 page/block 索引给 kernel”
-
-### 3.3 public `sglang` 自己也在做类似 lowering
-
-这不是 `ATOM plugin` 才有的行为。
-
-public `sglang` 的：
-
-- `aiter_backend`
-- `triton_backend`
-- `trtllm_mha_backend`
-- 部分 `flashinfer` 路径
-
-本身也会把：
-
-- `ForwardBatch`
-- `req_to_token`
-- `seq_lens`
-
-lower 成：
-
-- `kv_indptr`
-- `kv_indices`
-- `qo_indptr`
-- `page_table` / `block_tables`
-
-所以，`radix runtime -> paged/index metadata` 不是 plugin 特例，而是公共设计模式的一部分。
-
-## 4. Metadata 分层结论
-
-### 4.1 现在至少有三套 metadata family
-
-1. **`ATOM native`**
-   - `AttentionMetaData`
-2. **`vllm plugin`**
-   - `AttentionMetaData` 外壳
-   - `plugin_metadata` 内层 payload
-3. **`sglang plugin`**
-   - `ForwardMetadata`
-
-### 4.2 `AttentionMetaData` 和 `plugin/attention.py` 的关系
-
-`AttentionMetaData` 是外层通用容器；  
-`plugin/attention.py` 里的很多 dataclass，是 `vllm plugin` 为表达宿主 batch/runtime 语义而塞进 `plugin_metadata` 的 payload。
-
-所以它们不是并列关系，而是：
-
-```text
-AttentionMetaData
-  └─ plugin_metadata
-       ├─ flash-style plugin metadata
-       ├─ MLA plugin metadata
-       └─ sparse MLA plugin metadata
-```
-
-### 4.3 `ForwardMetadata` 的定位
-
-`ForwardMetadata` 更像：
-
-- `sglang backend` 内部的 lowering state
-- 它不是最终统一 metadata
-- 更不是天然可共享的 host-agnostic object
-
-它里面混着：
-
-- runtime control 信息
-- execution metadata
-- persistent kernel 预构造结果
-
-## 5. 三类字段
-
-为了判断能不能共享，本次把字段分成三类：
-
-### 5.1 host/runtime-bound
-
-典型例子：
-
-- `vllm` 的 `query_start_loc`
-- `num_decodes`
-- `num_prefills`
-- `chunked_context`
-- `sglang` 的 `run_graph`
-
-这些字段描述的是宿主框架如何组织 batch/runtime，而不是 kernel 怎么算。
-
-### 5.2 execution/kernel-bound
-
-典型例子：
-
-- `slot_mapping`
-- `block_tables`
-- `page_table`
-- `kv_indptr`
-- `kv_indices`
-- `qo_indptr`
-- `kv_last_page_len`
-- `work_indptr`
-- `reduce_indptr`
-
-这些字段才是 kernel 真正需要消费的执行信息。
-
-### 5.3 hardware-sensitive
-
-典型例子：
-
-- `max_q_len`
-- `max_kv_len`
-- `num_kv_splits`
-- `head_dim`
-- `attn_out_dtype`
-
-这些字段不是 host 语义，但会影响：
-
-- page/block 解释
-- kernel 选路
-- workspace/split 策略
-
-## 6. 关于 `kv_indptr` / `kv_indices` 的关键结论
-
-这个问题本次单独做了较多背景解释，结论是：
-
-### 6.1 它们不是完全 storage-agnostic
-
-因为它们绑定的是某种 **可索引的 KV 地址空间**：
-
-- token slot index
-- page/block index
-- compacted KV index
-
-所以它们不能脱离底层存储模型独立存在。
-
-### 6.2 但它们可以是 host/runtime-agnostic
-
-因为它们不表达：
-
-- `ForwardBatch.forward_mode`
-- `query_start_loc`
-- prefix 命中策略
-- graph/speculative 的宿主控制语义
-
-它们只表达：
-
-- 这次 kernel 要去哪些 KV 地址读数据
-- 每个 request 的边界在哪里
-
-### 6.3 两个例子
-
-#### 例子 A：`sglang`
-
-- `req_to_token[req_id, token_pos] = slot_id`
-- lowering 后得到：
-  - `kv_indptr = [0, 10]`
-  - `kv_indices = [8, 9, 10, 11, 0, 1, 2, 3, 4, 5]`
-
-#### 例子 B：`vllm/ATOM`
-
-- 上层是 `block_table`
-- 展开后同样可得到：
-  - `kv_indptr = [0, 10]`
-  - `kv_indices = [8, 9, 10, 11, 0, 1, 2, 3, 4, 5]`
-
-这说明：
-
-- 上层来源不同
-- 但 lower 后的 execution metadata 可以收敛
-
-## 7. public `sglang` 与 `sglang plugin` 在 KV cache 管理上的异同
-
-### 7.1 相同点
-
-- owner 都还是 public `sglang`
-  - `ReqToTokenPool`
-  - `TokenToKVPool`
-  - `KVCache`
-  - `radix_cache`
-- request 到 slot 的映射机制没变
-- prefix / radix 复用体系没变
-
-### 7.2 不同点
-
-#### A. MHA 写 cache 路径不同
-
-public `sglang` 更多沿用 pool 提供的标准写入接口：
-
-- `token_to_kv_pool.set_kv_buffer(...)`
-
-而 `sglang plugin` 在 MHA 路径里会绕过标准写法，自己做一次：
-
-- `set_kv_buffer_with_layout_shuffle(...)`
-- 把同一块 public pool buffer 按更偏 `ATOM` kernel 的 SHUFFLE layout 写进去
-
-#### B. MLA 路径更接近 public `sglang`
-
-MLA 上，plugin 更多沿用 public pool 的 contract：
-
-- `token_to_kv_pool.set_kv_buffer(...)`
-- `get_key_buffer(...)`
-- `get_value_buffer(...)`
-
-#### C. Lowering 目标不同
-
-- public `sglang` 的 metadata 更通用
-- plugin 的 `ForwardMetadata` 更偏 `ATOM` kernel
-  - `page_table`
-  - `pa_metadata_*`
-
-### 7.3 一个更准确的说法
-
-`sglang plugin` **没有改写 public `sglang` 的 pool owner / allocator / radix tree**，  
-但它在 MHA 路径上：
-
-- 改变了同一块 public pool buffer 的布局解释与写入约定
-- 并把 lowering 结果继续朝 `ATOM` kernel 所需的 metadata 形态推进
-
-## 8. Kernel 共用的两种 draft
-
-本次会话里，实际做了两种方向的 draft。
-
-### 8.1 方案 A：共享 kernel wrapper
-
-思路：
-
-- 新增共享小 metadata
-  - `MLADecodeKernelMetadata`
-  - `MHAAsmKernelMetadata`
-  - `MHAPersistentKernelMetadata`
-- 新增共享 wrapper
-  - `run_mla_decode_kernel`
-  - `run_mha_asm_kernel`
-  - `run_mha_persistent_kernel`
-
-优点：
-
-- 边界清晰
-- 更像长期可演进的 execution API
-- 不需要统一大 metadata
-
-缺点：
-
-- 还是引入了一层新的 shared contract
-
-### 8.2 方案 B：`sglang plugin` 主动适配 `ATOM core`
-
-思路：
-
-- 不改 `ATOM core`
-- 在 `sglang plugin` 里把 `ForwardMetadata` 转成 `AttentionMetaData`
-- 再伪造最小 `shim/self`
-- 直接调用：
-  - `PagedAttentionImpl.paged_attention_asm`
-  - `PagedAttentionImpl.paged_attention_persistent_asm`
-  - `MLAAttention._forward_decode`
-
-优点：
-
-- 不改 `ATOM core`
-- 更直观地证明 “plugin 主动复用 core” 是可行的
-
-缺点：
-
-- 需要 shim
-- 尤其 MLA 路径更脆弱
-- 更适合验证可行性，不适合长期保留
-
-## 9. 当前最值得保留的判断
-
-### 9.1 不应该强行统一大 metadata
-
-不建议直接统一：
-
-- `AttentionMetaData`
-- `plugin_metadata`
-- `ForwardMetadata`
-
-因为它们都带有较重的宿主 runtime 语义。
-
-### 9.2 最值得共享的是 kernel-facing execution metadata
-
-真正最有价值的共享边界是：
-
-- `kv_indptr`
-- `kv_indices`
-- `qo_indptr`
-- `page_table`
-- `context_lens`
-- `work_*`
-- `reduce_*`
-
-以及基于它们的 kernel call / runner。
-
-### 9.3 重构顺序建议
-
-如果继续推进 `sglang plugin` 重构，更合理的顺序是：
-
-1. 保持 `sglang` runtime 自有
-2. 显式化 metadata lowering 边界
-3. 尽量共享 kernel call / execution helper
-4. 最后再考虑是否继续上推到 runner 层
-
-## 10. 一句话总结
-
-本次会话最终收敛出的关键认识是：
-
-**`sglang plugin attention` 重构的关键，不是把 `sglang` runtime 改得更像 `vllm`，而是把 runtime、metadata lowering、kernel execution 这三层拆清楚；只要 lowering 之后能在 execution metadata 上对齐，就能在不统一宿主语义的前提下，共用更多 `ATOM` 的底层 attention 代码。**
diff --git a/work_log/attn_refractory/decoupling-diagram.md b/work_log/attn_refractory/decoupling-diagram.md
deleted file mode 100644
index b95c0031e..000000000
--- a/work_log/attn_refractory/decoupling-diagram.md
+++ /dev/null
@@ -1,158 +0,0 @@
-# ATOM Plugin Decoupling Diagram
-
-下面这张图对比了当前入口架构和目标解耦架构。这里采用的是更激进、也更干净的方案：
-
-- `vLLM plugin` 和 `SGLang plugin` 在 `atom/plugin/` 层彻底分家
-- 不再保留共享的 plugin runtime
-- 共享只发生在更底层的 `ATOM core`
-
-## Current
-
-```mermaid
-flowchart TD
-    A[Shared Plugin Entry<br/>prepare.py / register.py / config.py]
-    B[Global Runtime State<br/>_CURRENT_FRAMEWORK<br/>current_atom_config<br/>ops.Attention]
-    C[vLLM Branch<br/>platform / registry / patch / graph]
-    D[SGLang Branch<br/>wrapper / registry hijack / RadixAttention / graph]
-    E[Mixed Plugin Core<br/>attention selection / static_forward_context / loader glue]
-
-    A --> B
-    B --> C
-    B --> D
-    C --> E
-    D --> E
-```
-
-### 当前设计的核心问题
-
-- 入口层通过切换全局状态来区分 `vLLM` 和 `SGLang`，而不是通过物理隔离的子系统建立边界。
-- `prepare.py` / `register.py` / `config.py` 这些共享入口，本身就是耦合中心。
-- host 差异直接泄漏到 plugin 内部结构里，导致 backend 选择、attention 抽象和 runtime context 都知道上层框架。
-
-## Target
-
-```mermaid
-flowchart TD
-    A1[vLLM Plugin<br/>bootstrap.py<br/>register.py<br/>config.py<br/>runtime.py<br/>attention glue<br/>model adapters]
-    A2[SGLang Plugin<br/>bootstrap.py<br/>register.py<br/>config.py<br/>runtime.py<br/>attention glue<br/>model adapters]
-
-    B[ATOM Core<br/>model_ops<br/>kernels<br/>loader<br/>metadata structs<br/>model specialization]
-
-    A1 --> B
-    A2 --> B
-```
-
-### 目标设计的关键点
-
-- `vLLM` 和 `SGLang` 在 plugin 层是两个完整子系统，而不是一套共享 runtime 上的两个分支。
-- `atom/plugin/` 下不再存在共享的 `prepare_model()`、共享的 `register.py` 总入口、共享的 plugin config translator。
-- `vLLM plugin` 自己处理：
-  - platform registration
-  - model override
-  - graph / MLA / weight hook patch
-  - vLLM model adapter / attention glue
-- `SGLang plugin` 自己处理：
-  - external model wrapper
-  - attention backend registration / override
-  - graph / speculative / wrapper patch
-  - SGLang model adapter / attention glue
-- 真正共享的只保留在更底层的 `ATOM core`：
-  - `model_ops`
-  - kernels
-  - loader
-  - generic metadata structures
-  - model-family specialization
-
-## 最重要的切断点
-
-```mermaid
-flowchart LR
-    A[Cut 1<br/>shared prepare.py]
-    B[Cut 2<br/>shared register.py]
-    C[Cut 3<br/>shared config.py]
-    D[Cut 4<br/>framework global state]
-    E[Cut 5<br/>ops.Attention global rebinding]
-
-    A --> F[vllm/ and sglang/ own bootstrap]
-    B --> F
-    C --> F
-    D --> G[host-local runtime only]
-    E --> H[host-specific attention glue]
-```
-
-## 推荐目录形态
-
-```text
-atom/plugin/
-  vllm/
-    bootstrap.py
-    register.py
-    config.py
-    runtime.py
-    attention/
-    models/
-    graph/
-  sglang/
-    bootstrap.py
-    register.py
-    config.py
-    runtime.py
-    attention/
-    models/
-    graph/
-```
-
-这里的重点不是“换目录”，而是：
-
-- `vllm/runtime.py` 和 `sglang/runtime.py` 各自独立
-- `vllm/config.py` 和 `sglang/config.py` 各自独立
-- `vllm/register.py` 和 `sglang/register.py` 各自独立
-- 不再有 plugin 级共享 runtime
-
-## 术语说明
-
-### bootstrap
-
-`bootstrap` 在这里指的是“插件被宿主框架加载时，最先执行的接线初始化层”。
-
-它负责的事情通常包括：
-
-- 注册 plugin 扩展点
-- 安装 patch / hook
-- 构造 host adapter / wrapper
-- 把宿主配置翻译到 ATOM 可用的格式
-
-它不应该负责：
-
-- attention forward 本身
-- 长期 runtime state 管理
-- 每次请求都要走的核心执行逻辑
-
-### register
-
-`register` 指的是把 plugin 的能力挂到宿主暴露的扩展点上，例如：
-
-- 注册 platform
-- 注册 model class
-- 注册 attention backend
-
-它更偏“声明接入点”，通常是 bootstrap 的一部分。
-
-### adapter
-
-`adapter` 指的是宿主框架和 ATOM core 之间的桥接层。  
-它的职责是做接口和参数语义翻译，而不是定义 attention / model 的核心语义。
-
-### runtime
-
-`runtime` 指的是插件在宿主里长期存在、服务执行链路的那部分运行时逻辑。  
-如果按这份设计推进，`runtime` 应该是 host-local 的：
-
-- `vLLM runtime` 只属于 `vLLM plugin`
-- `SGLang runtime` 只属于 `SGLang plugin`
-
-而不是共享一个“Plugin runtime”。
-
-## 一句话结论
-
-真正的解耦不是把共享入口再包装一层，而是把 `atom/plugin` 从“共享状态机”改成“两个完全独立的 host plugin 子系统”，共享只留给更底层的 `ATOM core`。
diff --git a/work_log/attn_refractory/vllm-integration-architecture.md b/work_log/attn_refractory/vllm-integration-architecture.md
deleted file mode 100644
index e8876268d..000000000
--- a/work_log/attn_refractory/vllm-integration-architecture.md
+++ /dev/null
@@ -1,269 +0,0 @@
-# ATOM vLLM Plugin Integration Architecture
-
-## 1. 启动入口
-
-vLLM plugin 的 setuptools entrypoint 定义在：
-
-- `pyproject.toml`
-
-关键位置：
-
-- `vllm.platform_plugins`
-  - `atom = "atom.plugin.vllm.register:register_platform"`
-- `vllm.general_plugins`
-  - `atom_model_registry = "atom.plugin.vllm.register:register_model"`
-
-这意味着 vLLM 启动时会先进入：
-
-- `atom/plugin/vllm/register.py::register_platform()`
-- `atom/plugin/vllm/register.py::register_model()`
-
-## 2. 关键代码文件与职责
-
-### 2.1 `atom/plugin/vllm/register.py`
-
-这是 vLLM plugin 的启动接线层，主要负责：
-
-- 设置 plugin mode
-- 返回自定义 platform
-- 覆盖 vLLM 的 `ModelRegistry`
-- 安装 MLA patch
-- patch `Attention.process_weights_after_loading`
-- 安装 graph capture patch
-
-重点符号：
-
-- `register_platform()`
-- `register_model()`
-- `_VLLM_MODEL_REGISTRY_OVERRIDES`
-
-## 2.2 `atom/plugin/vllm/platform.py`
-
-这是 vLLM 平台侧 attention backend 选择入口。
-
-重点符号：
-
-- `ATOMPlatform.get_attn_backend_cls()`
-
-它根据 `attn_selector_config` 决定：
-
-- MHA -> `atom.model_ops.attentions.aiter_attention.AiterBackend`
-- MLA -> `atom.model_ops.attentions.aiter_mla.AiterMLABackend`
-- sparse MLA -> `atom.plugin.vllm.attention_backend.mla_sparse.AiterMLASparseBackend`
-
-这里是 vLLM runtime 真正选择 “哪个 ATOM attention backend class” 的入口。
-
-## 2.3 `atom/plugin/vllm/model_wrapper.py`
-
-这是 vLLM model integration 的核心文件。
-
-主要职责：
-
-- 把 `VllmConfig` 翻译成 `atom_config`
-- 做 plugin 运行环境准备
-- 选择并实例化 ATOM 模型类
-- 把 vLLM forward context 中的 `positions` 传给 ATOM 路径
-- 对 sparse MLA indexer 做额外注册
-
-重点符号：
-
-- `_prepare_env()`
-- `ATOMModelBase.__init__()`
-- `_register_indexer_caches_with_vllm()`
-- `forward()`
-
-## 2.4 `atom/plugin/config.py`
-
-这是 vLLM 配置翻译的桥。
-
-重点符号：
-
-- `generate_atom_config_for_vllm_plugin()`
-- `_generate_atom_config_from_vllm_config()`
-
-它负责把：
-
-- `VllmConfig`
-- `scheduler_config`
-- `cache_config`
-- `parallel_config`
-- `quant_config`
-
-映射成 ATOM 自己的 `Config` 与 `PluginConfig`。
-
-## 2.5 `atom/model_ops/paged_attention.py`
-
-这是 ATOM attention 在 vLLM plugin 下真正进入执行的关键对象。
-
-重点符号：
-
-- `PagedAttention.__init__()`
-- `PagedAttention.forward()`
-
-在 `is_vllm()` 下，它不会直接走 native/server 的那套 `impl` 初始化，而是：
-
-- 构造 vLLM 的 `Attention` / `MLAAttention` 外壳
-- 把额外 impl 参数塞进去
-- 注册到 `static_forward_context`
-- forward 时通过 `unified_attention_with_output_base_for_plugin_mode()` 回调到 ATOM impl
-
-## 2.6 `atom/plugin/attention.py`
-
-这是 vLLM plugin mode 的核心 glue 层。
-
-虽然文件在 `plugin/` 根目录，但从职责看它明显偏 vLLM。
-
-主要职责：
-
-- 定义 `vllmAiterAttentionBackendMethods`
-- 提供 backend decorator / metadata builder decorator
-- 处理 plugin-mode metadata
-- 处理 `static_forward_context`
-- 为 vLLM 的 attention backend 补 OOT 接口行为
-
-重点符号：
-
-- `vllmAiterAttentionBackendMethods`
-- 各类 `*DecoratorForPluginMode`
-
-## 2.7 `atom/plugin/attention_mha.py`
-
-这是 vLLM plugin mode 下 MHA impl 的补丁层。
-
-重点符号：
-
-- `PagedAttentionImplPluginModeMethods`
-
-它补的是：
-
-- rope/cache/update
-- plugin-mode forward
-- 与 vLLM KV cache/metadata 对齐的逻辑
-
-## 2.8 `atom/plugin/attention_mla.py`
-
-这是 vLLM plugin mode 下 MLA impl 的补丁层。
-
-重点符号：
-
-- `MLAAttentionImplPluginModeMethods`
-- `_mla_plugin_mode_init`
-
-它补的是：
-
-- MLA plugin-mode metadata
-- qk rope / kv cache / chunked prefill
-- vLLM MLA 路径专有行为
-
-## 2.9 `atom/plugin/vllm/mla_patch.py`
-
-这是 vLLM 原生 `MLAAttention` 的 monkey patch 入口。
-
-重点符号：
-
-- `_patch_vllm_mla_attention_forward_impl()`
-- `_patch_vllm_mla_attention_process_weights_after_loading()`
-- `patch_vllm_mla_attention()`
-
-这层作用是：
-
-- 把 vLLM `MLAAttention.forward_impl()` 改接到 ATOM impl
-- 把权重后处理改接到 ATOM 的 `impl.process_weights_after_loading()`
-
-## 2.10 `atom/plugin/vllm/graph_capture_patch.py`
-
-这是 vLLM graph capture 补丁入口。
-
-重点符号：
-
-- `apply_graph_capture_patch()`
-
-它实际委托到共享实现：
-
-- `atom/plugin/graph_capture_patch.py`
-
-但调用点是 vLLM 自己的 plugin register 流程。
-
-## 3. 当前 vLLM plugin 集成链路
-
-```mermaid
-flowchart TD
-    A[vLLM startup]
-    B[entrypoint<br/>register_platform]
-    C[entrypoint<br/>register_model]
-    D[ATOMPlatform.get_attn_backend_cls]
-    E[ATOMModelBase]
-    F[generate_atom_config_for_vllm_plugin]
-    G[_prepare_env<br/>set_attn_cls + init_aiter_dist]
-    H[Instantiate ATOM model]
-    I[PagedAttention / MLAAttention wrapper]
-    J[plugin/attention*.py glue]
-    K[ATOM impl<br/>PagedAttentionImpl / MLAAttention]
-    L[aiter kernels]
-
-    A --> B
-    A --> C
-    B --> D
-    C --> E
-    E --> F
-    E --> G
-    E --> H
-    H --> I
-    I --> J
-    J --> K
-    K --> L
-```
-
-## 4. 最关键的架构判断
-
-### 4.1 vLLM plugin 不是简单“替换 backend class”
-
-它实际上同时做了三件事：
-
-1. 替换 vLLM 的 model entry
-2. 替换/选择 vLLM 的 attention backend class
-3. 用 patch + decorator 的方式把 vLLM 的 layer/runtime 语义桥接到 ATOM impl
-
-### 4.2 真正共享得最好的是 native/server 与 vLLM plugin 的 attention impl
-
-尤其：
-
-- `PagedAttentionImpl`
-- `MLAAttention`
-- 低层 `aiter` kernel family
-
-也就是说，vLLM plugin 侧的关键复杂度主要不在 kernel，而在：
-
-- `ModelRegistry` override
-- `VllmConfig -> atom_config`
-- `static_forward_context`
-- `MLAAttention` patch
-- graph capture patch
-
-### 4.3 当前 `plugin/attention.py`、`attention_mha.py`、`attention_mla.py`
-
-虽然放在 `atom/plugin/` 根目录，但从 vLLM integration 的角度看，它们本质上更像：
-
-- `vLLM plugin impl glue`
-- 而不是真正框架无关的 shared runtime
-
-## 5. 推荐你重点阅读的文件顺序
-
-如果你是为了快速理解 vLLM plugin 集成架构，建议按这个顺序读：
-
-1. `pyproject.toml`
-2. `atom/plugin/vllm/register.py`
-3. `atom/plugin/vllm/platform.py`
-4. `atom/plugin/vllm/model_wrapper.py`
-5. `atom/plugin/config.py`
-6. `atom/model_ops/paged_attention.py`
-7. `atom/plugin/vllm/mla_patch.py`
-8. `atom/plugin/attention.py`
-9. `atom/plugin/attention_mha.py`
-10. `atom/plugin/attention_mla.py`
-
-## 6. 一句话结论
-
-vLLM plugin 的集成方式，本质上是：
-
-**用 entrypoint + platform + model wrapper 接入 vLLM，用 ATOM 自己的 attention impl 和 aiter kernels 提供核心执行，再用 patch/decorator 把 vLLM 的 layer/runtime 语义桥接进去。**