[ATOM SGLang] SGL plugin Attention Refactor #863
Open
ZhiweiYan-96 wants to merge 11 commits into
Move the SGLang DeepSeek MLA runtime entry from legacy forward glue into SGLangDeepseekMLAAttention while keeping RadixAttention and the full-attention backend as the host/backend layers. Shrink deepseek_mla_forward.py into a helper module and clarify absorbed vs non-absorbed path naming.
Co-authored-by: Cursor <cursoragent@cursor.com>
Pull request overview
This PR refactors the ATOM SGLang plugin attention stack to make SGLang runtime state, model-level adaptation (e.g., DeepSeek MLA), and full-attention backend responsibilities explicit and better separated. It introduces a small model-adapter registry, moves runtime/forward-context bridging into a dedicated runtime package, and splits the previously monolithic backend helpers into focused modules while keeping behavior aligned with existing supported models.
Changes:
- Introduces `atom.plugin.sglang.runtime` (scoped runtime globals, forward-context bridge, and model adapter registry) and updates wrappers to use it.
- Decouples DeepSeek MLA model adaptation into `atom/plugin/sglang/models/deepseek_mla*` and removes the old monolithic `sgl_attention_mla.py`.
- Splits the SGLang full-attention backend into helper modules (`metadata`, `kv_cache`, `pa_metadata`) and updates import paths across plugin and core ops.
Reviewed changes
Copilot reviewed 24 out of 24 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| tests/plugin/test_sglang_register.py | Updates mocks/imports for the renamed full-attention backend module and additional model imports. |
| tests/plugin/test_sglang_model_wrapper.py | Updates DeepSeek MLA setup-hook import path to the new models.deepseek_mla module. |
| atom/plugin/sglang/runtime/model_arch.py | Adds SGLangModelAdapterSpec + registry for prepare/install hooks and wrapper flags. |
| atom/plugin/sglang/runtime/forward_context.py | Adds SGLangPluginRuntime to bridge ForwardBatch into ATOM forward_context and handle dummy/idle batches. |
| atom/plugin/sglang/runtime/context.py | Adds scoped runtime utilities (plugin_runtime_scope, forward-batch ContextVars, metadata binding helpers). |
| atom/plugin/sglang/runtime/__init__.py | Exposes the runtime utilities as a public package surface. |
| atom/plugin/sglang/models/qwen3_5.py | Switches to runtime package import and updates comment to reference MODEL_ARCH_SPECS. |
| atom/plugin/sglang/models/deepseek_nextn_wrapper.py | Migrates draft wrapper to SGLangPluginRuntime + plugin_runtime_scope. |
| atom/plugin/sglang/models/deepseek_mla.py | Adds install-time DeepSeek MLA patch entrypoint (setup_deepseek_for_sglang) in a model-owned module. |
| atom/plugin/sglang/models/deepseek_mla_forward.py | Extracts DeepSeek MLA shared helper functions (BMM paths, weight post-load processing, KV staging). |
| atom/plugin/sglang/models/deepseek_mla_attention.py | Adds SGLangDeepseekMLAAttention model-level adapter to lower latent MLA inputs into backend-ready attention calls. |
| atom/plugin/sglang/models/base_model_wrapper.py | Replaces embedded runtime/context logic with atom.plugin.sglang.runtime and adapter-driven hooks. |
| atom/plugin/sglang/attention_backend/sgl_attention_mla.py | Removes the old monolithic DeepSeek MLA SGLang plugin module. |
| atom/plugin/sglang/attention_backend/full_attention/radix_attention.py | Updates fallback get_current_forward_batch import to runtime package. |
| atom/plugin/sglang/attention_backend/full_attention/pa_metadata.py | Adds helper module for PA persistent metadata buffer allocation/build. |
| atom/plugin/sglang/attention_backend/full_attention/metadata.py | Adds ForwardMetadata dataclass in its own module. |
| atom/plugin/sglang/attention_backend/full_attention/kv_cache.py | Moves KV layout shuffle kernel + helper into a dedicated module. |
| atom/plugin/sglang/attention_backend/full_attention/full_attention_backend.py | Refactors backend to use extracted helper modules and updates naming/imports. |
| atom/plugin/sglang/attention_backend/full_attention/__init__.py | Adds package exports for full-attention backend components. |
| atom/plugin/sglang/attention_backend/attention_gdn.py | Updates import path for SGLangForwardBatchMetadata to runtime package. |
| atom/plugin/register.py | Updates custom attention backend import path to the new full-attention backend module. |
| atom/plugin/prepare.py | Routes model-specific config preparation via the new model adapter spec (get_model_arch_spec). |
| atom/model_ops/attentions/aiter_attention.py | Updates RadixAttention import path to the new full-attention location. |
| atom/model_ops/__init__.py | Updates RadixAttention import path to the new full-attention location. |
Comment on lines +160 to 172

    with SGLangPluginRuntime(
        atom_config=self.atom_config,
        forward_batch=forward_batch,
        positions=positions,
        input_ids=input_ids,
        input_embeds=input_embeds,
    ):
        hidden_states = self.model(
            input_ids=input_ids,
            positions=positions,
            hidden_states=forward_batch.spec_info.hidden_states,
            inputs_embeds=input_embeds,
        )

86024e8 to
8de3516
Comment on lines +100 to +117

    """Fuse q/k RMSNorm and q quant using ATOM's DeepSeek-V2 path."""

    (q_quantized, q_scale), q_normed, k_nope_normed, _ = _fuse_rmsnorm_quant(
        q,
        attn.q_a_layernorm.weight,
        attn.q_a_layernorm.eps,
        k_nope,
        attn.kv_a_layernorm.weight,
        attn.kv_a_layernorm.eps,
        None,
        dtype_quant=attn.quant_dtype,
        shuffle=False,
        scale_shuffle_padding=False,
        group_size=128,
        quant_type=_linear_quant_type_value(attn.q_b_proj),
        output_unquantized_inp1=output_unquantized_q,
        transpose_scale=True,
    )

Comment on lines +166 to 172

    ):
        hidden_states = self.model(
            input_ids=input_ids,
            positions=positions,
            hidden_states=forward_batch.spec_info.hidden_states,
            inputs_embeds=input_embeds,
        )

ATOM SGLang Attention Refactor
Status
Summary
This RFC proposes a staged refactor of the ATOM SGLang plugin attention stack. The goal is to make SGLang-specific runtime, model adaptation, and attention backend responsibilities explicit.
The current direction is:
- Introduce `SGLangDeepseekMLAAttention` as an explicit model-level attention adapter.
- Move `ForwardBatch -> ATOM forward_context` bridging into scoped runtime utilities.
- Add `SGLangModelAdapterSpec` so existing special cases are declared instead of hard-coded.
- Split `ATOMAttnBackendForSgl` by backend lifecycle responsibility.
Background
The existing SGLang plugin support grew through several overlapping concerns:
- `ATOMAttnBackendForSgl` handles metadata construction, cache writes, CUDA graph metadata, MHA/MLA dispatch, speculative modes, and kernel calls.
- `base_model_wrapper.py` collected generic wrapper logic, runtime state, model-specific flags, and forward-context bridging.
Recent branches split these concerns:
This PR contains all the changes from:
- `attn_model_decouple` ([ATOM-SGL][Attn refrac] Separate model-specific MLA from SGL full attention backend, zejunchen-zejun/ATOM#28) separates SGLang full-attention backend files from model-specific DeepSeek MLA files.
- `attn_refrac_share_model` ([ATOM-SGL][Attn refrac] Route DeepSeek MLA through an SGLang wrapper, zejunchen-zejun/ATOM#29) introduces `SGLangDeepseekMLAAttention` as a model-level DeepSeek MLA adapter.
- `attn_refractory_runtime` ([ATOM SGL] runtime extraction, zejunchen-zejun/ATOM#30) extracts SGLang runtime context into `atom/plugin/sglang/runtime`.
- `attn_backend_split` ([ATOM SGL] Split AtomAttnSGLBackend based on responsibility, zejunchen-zejun/ATOM#31) starts splitting full-attention backend helpers out of `ATOMAttnBackendForSgl`.
- `sglang_model_adapter` (Zhiwei/sglang model adapter, zejunchen-zejun/ATOM#32) introduces a first function-based model adapter spec.
Goals
Target architecture
Refactor Tracks
Track 1: Attention File and Responsibility Decoupling
This track has two parts: first, separate generic SGLang full-attention files from DeepSeek-specific MLA files; second, split the remaining full-attention backend by responsibility instead of keeping all backend lifecycle logic in `ATOMAttnBackendForSgl`.
The first problem was file ownership. Generic SGLang full-attention backend code and DeepSeek-specific MLA helpers lived too close together. The refactor moves them apart:
This track is represented by `attn_model_decouple`. Its purpose is not to change runtime behavior. Its purpose is to establish ownership:
- `full_attention/` owns SGLang framework backend behavior.
- `models/deepseek_mla*.py` owns DeepSeek model-specific MLA behavior.
- `RadixAttention` remains the SGLang framework adapter.
This is the foundation for every later PR. Without this move, DeepSeek-specific logic would continue to leak into generic backend files.
The second problem is that `ATOMAttnBackendForSgl` still owns too many backend responsibilities after the file move. The refactor starts splitting it into focused helpers:
Future splits can continue along the same responsibility boundary:
This split should not be top-level `MHA backend` vs `MLA backend`. MHA and MLA are dispatch cases, but metadata construction, KV cache layout, CUDA graph, PA metadata, and speculative modes cut across both.
Track 2: `SGLangDeepseekMLAAttention`
DeepSeek MLA cannot be treated like Qwen-style `q/k/v` attention. Its model forward passes latent MLA state:
These are model-level semantic inputs, not backend-ready attention inputs. The refactor introduces `SGLangDeepseekMLAAttention` to own this lowering:
This track is represented by `attn_refrac_share_model`.
The important design choice is that the wrapper sits above `RadixAttention`. `RadixAttention` is an SGLang framework adapter: it expects attention-ready tensors and a `ForwardBatch`. DeepSeek MLA, however, calls `self.mla_attn(...)` with model-specific latent state. The wrapper is the place where that semantic gap is closed.
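The layering described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the PR's code: `FakeForwardBatch`, `FakeRadixAttention`, `LatentAttentionWrapper`, the `kv_c` argument, and the `_current_forward_batch` context variable are hypothetical stand-ins, and the lowering step is a placeholder for the real projections.

```python
from contextvars import ContextVar
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeForwardBatch:
    """Hypothetical stand-in for SGLang's ForwardBatch."""
    batch_id: int


# Hypothetical context variable bound by the model wrapper for one forward pass.
_current_forward_batch: ContextVar[Optional[FakeForwardBatch]] = ContextVar(
    "current_forward_batch", default=None
)


class FakeRadixAttention:
    """Stand-in framework adapter: expects attention-ready q/k/v and a batch."""

    def forward(self, q, k, v, forward_batch):
        return {"q": q, "k": k, "v": v, "batch_id": forward_batch.batch_id}


class LatentAttentionWrapper:
    """Stand-in model-level adapter that sits above the framework adapter."""

    def __init__(self, inner):
        self.inner = inner

    def _resolve_forward_batch(self, kwargs):
        # Prefer an explicit kwarg, then fall back to the scoped context.
        forward_batch = kwargs.get("forward_batch")
        if forward_batch is None:
            forward_batch = _current_forward_batch.get()
        if forward_batch is None:
            raise RuntimeError("no ForwardBatch passed or bound in context")
        return forward_batch

    def _lower(self, q_c, kv_c):
        # Placeholder lowering; real projections, norms, and RoPE are omitted.
        return ("q_from", q_c), ("k_from", kv_c), ("v_from", kv_c)

    def __call__(self, q_c, kv_c, **kwargs):
        forward_batch = self._resolve_forward_batch(kwargs)
        q, k, v = self._lower(q_c, kv_c)
        return self.inner.forward(q, k, v, forward_batch)


wrapper = LatentAttentionWrapper(FakeRadixAttention())

# Path 1: the batch comes from the scoped context.
token = _current_forward_batch.set(FakeForwardBatch(batch_id=7))
try:
    out_from_context = wrapper("q_latent", "kv_latent")
finally:
    _current_forward_batch.reset(token)

# Path 2: the batch is passed explicitly.
out_from_kwarg = wrapper(
    "q_latent", "kv_latent", forward_batch=FakeForwardBatch(batch_id=3)
)
```

The model code only ever calls the wrapper with latent inputs; the inner adapter never sees them, which is the separation the paragraph above argues for.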
`SGLangDeepseekMLAAttention` is responsible for:
- resolving `forward_batch` from explicit kwargs or current runtime context,
- lowering `q_c` to final query when needed,
- calling `RadixAttention` / SGLang backend,
The absorbed path roughly lowers:
The non-absorbed path roughly lowers:
The wrapper should not own generic backend concerns such as page table construction, CUDA graph replay, or PA metadata buffers. Those stay under the SGLang framework backend.
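As background for the two paths, the following toy check assumes "absorbed" carries its usual MLA meaning: the key up-projection is folded into the query so scores are computed against the compressed latent cache directly. It is a single-head sketch with made-up sizes and names (`w_uk`, `kv_latent`, `q_nope` are illustrative), it ignores RoPE, softmax, and the value path, and it says nothing about how ATOM's kernels implement either path.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def rand_matrix(rows, cols):
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
num_tokens, latent_dim, head_dim = 4, 3, 5  # toy sizes, single head

q_nope = rand_matrix(1, head_dim)              # one query token, no-RoPE part
kv_latent = rand_matrix(num_tokens, latent_dim)  # compressed KV cache entries
w_uk = rand_matrix(latent_dim, head_dim)       # latent -> per-head key projection

# Non-absorbed: expand the latent cache into full keys, then take dot products.
k_nope = matmul(kv_latent, w_uk)                     # [tokens, head_dim]
scores_expanded = matmul(q_nope, transpose(k_nope))  # [1, tokens]

# Absorbed: fold w_uk into the query and score against the latent cache.
q_absorbed = matmul(q_nope, transpose(w_uk))                # [1, latent_dim]
scores_absorbed = matmul(q_absorbed, transpose(kv_latent))  # [1, tokens]

max_abs_diff = max(
    abs(a - b) for a, b in zip(scores_expanded[0], scores_absorbed[0])
)
```

Both orderings give the same scores up to floating-point rounding; they differ in whether the cache is expanded to per-head keys or the query is moved into latent space, which is why the choice belongs to the model-level wrapper and not to the generic backend.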
The wrapper solves several problems:
Track 3: SGLang Runtime Bridge
The SGLang wrapper must translate framework runtime state into what ATOM model code expects. This includes:
- `ForwardBatch`,
- `forward_context`,
The refactor extracts this into `atom/plugin/sglang/runtime`:
This track is represented by `attn_refractory_runtime`.
There are three distinct runtime problems:
1. Current SGLang Forward State
Some model-level adapters need access to the current SGLang `ForwardBatch` without threading it through every intermediate ATOM model call. The runtime package provides `SGLangForwardBatchMetadata` for this:
It also keeps `get_current_forward_batch()` as a narrow compatibility path for adapters such as `RadixAttention` fallback lookup and DeepSeek MLA wrapper input resolution.
2. ATOM Plugin Global State
ATOM still has process-global plugin state:
SGLang target/draft model wrappers can coexist, especially under speculative decoding. `plugin_runtime_scope()` scopes those globals around construction, load, patch, and forward sections so one wrapper does not leak runtime state into another.
3. SGLang `ForwardBatch` to ATOM `forward_context`
Many ATOM model ops read `atom.utils.forward_context.get_forward_context()` for information such as:
`SGLangPluginRuntime` is a scoped adapter for model wrappers:
It owns:
- `ForwardBatch`,
The important boundary is:
The runtime bridge is not for `ATOMAttnBackendForSgl` kernel dispatch. The full-attention backend should use SGLang `ForwardBatch` and backend metadata directly.
This separation prevents a common failure mode: pushing model-wrapper runtime concerns into the attention backend simply because both happen to see `ForwardBatch`.
Track 4: Model Adapter Interface
The current code already has multiple model adaptation patterns:
Using more booleans in `ModelArchSpec` does not scale. The first implementation step is `SGLangModelAdapterSpec`:
This is intentionally small. It replaces hard-coded special cases without claiming to be a complete future-proof framework.
Current uses:
- `DeepseekV3ForCausalLM` uses `install_adapters=setup_deepseek_for_sglang`.
- `Qwen3NextForCausalLM` keeps `wrapper_binds_gdn_context=True`.
- `Qwen3_5ForConditionalGeneration` and `Qwen3_5MoeForConditionalGeneration` use `prepare_config=apply_prepare_model_adaptations`.
Future lifecycle hooks may include:
- `construct_model`,
- `load_weights`,
- `post_load`,
- `runtime_policy`,
- `output_policy`,
The key point is that new models should declare adaptation needs through a registry instead of adding new one-off branches in the generic wrapper.
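A registry of this shape might look like the sketch below. The three field names come from the uses listed above; everything else is assumed for illustration, including the `AdapterSpec` class name, the field types, the `get_spec` and `build` helpers, and the `fake_*` hook bodies, which stand in for `setup_deepseek_for_sglang` and `apply_prepare_model_adaptations` without reproducing them.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional


@dataclass(frozen=True)
class AdapterSpec:
    """Illustrative spec: field names follow the text, types are assumed."""
    prepare_config: Optional[Callable[[Dict[str, Any]], None]] = None
    install_adapters: Optional[Callable[[Dict[str, Any]], None]] = None
    wrapper_binds_gdn_context: bool = False


def fake_setup_deepseek(model: Dict[str, Any]) -> None:
    # Hypothetical stand-in for an install-time patch hook.
    model["mla_patched"] = True


def fake_prepare(config: Dict[str, Any]) -> None:
    # Hypothetical stand-in for a config-preparation hook.
    config["prepared"] = True


REGISTRY: Dict[str, AdapterSpec] = {
    "DeepseekV3ForCausalLM": AdapterSpec(install_adapters=fake_setup_deepseek),
    "Qwen3NextForCausalLM": AdapterSpec(wrapper_binds_gdn_context=True),
    "Qwen3_5ForConditionalGeneration": AdapterSpec(prepare_config=fake_prepare),
}

_DEFAULT_SPEC = AdapterSpec()  # models with no special needs get an empty spec


def get_spec(arch: str) -> AdapterSpec:
    return REGISTRY.get(arch, _DEFAULT_SPEC)


def build(arch: str, config: Dict[str, Any]):
    """Generic wrapper path: no per-model branches, only optional hooks."""
    spec = get_spec(arch)
    if spec.prepare_config is not None:
        spec.prepare_config(config)
    model = {"arch": arch}
    if spec.install_adapters is not None:
        spec.install_adapters(model)
    return model, spec


deepseek_model, deepseek_spec = build("DeepseekV3ForCausalLM", {})
qwen_config: Dict[str, Any] = {}
qwen_model, qwen_spec = build("Qwen3_5ForConditionalGeneration", qwen_config)
plain_model, plain_spec = build("SomeOtherModel", {})
```

In this sketch, adding a model means adding one registry entry; `build` itself does not change, which is the property the registry is meant to provide.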
This track is represented by `sglang_model_adapter`. It is intentionally a small first step: it codifies existing DeepSeek and Qwen3.5 special cases without trying to solve every future model family in one PR.
The intended lifecycle for future adapters is:
The first PR only implements the two hooks that are already needed by existing code:
It deliberately leaves the rest as design direction. That keeps review scope small while still moving away from boolean flags.
Existing mappings:
Future mappings should be additive:
The adapter registry is therefore a coordination point, not a replacement for model-specific modules. Complex models should still keep their logic in focused files such as `deepseek_mla_attention.py`, `deepseek_nextn_wrapper.py`, or a future `deepseek_v4_adapter.py`.