[ATOM-SGLang][Feat] Enable Deepseek v3 MTP #23
Open
ZhiweiYan-96 wants to merge 12 commits into
Proposed Design
1. MTP module creation: Override the draft architecture through the external model package
As background, it helps to first describe how `SGLang` loads the DeepSeek MTP module in its native flow. From `SGLang`'s point of view, DeepSeek MTP is not an auxiliary block hidden inside the target model. It is a standalone draft architecture. The loading path is roughly:

- The server is launched with `--speculative-algorithm NEXTN`
- `SGLang` normalizes `NEXTN` into the `EAGLE` runtime family in server args
- `ModelConfig` rewrites the DeepSeek V3 draft architecture to `DeepseekV3ForCausalLMNextN` inside `_config_draft_model()`
- `ModelRegistry` resolves the model class by that architecture name

In other words, `SGLang` first interprets "DeepSeek MTP" as "a separately loaded draft model", and only then enters the runtime phase. The external model package hook works exactly at this architecture-resolution stage.

On the MTP side, `SGLang` uses `DeepseekV3ForCausalLMNextN` as the MTP model architecture.

The following diagram shows the native `SGLang` view of how the MTP module is loaded:

    flowchart TD
        subgraph SGL["SGLang domain"]
            A["CLI / server args<br/>--speculative-algorithm NEXTN"]
            B["Normalize algorithm<br/>NEXTN -> EAGLE"]
            C["Build draft ModelConfig"]
            D["_config_draft_model()<br/>rewrite architecture to<br/>DeepseekV3ForCausalLMNextN"]
            E["ModelRegistry.resolve_model_cls(...)"]
            F["Instantiate draft model class"]
            G["Speculative worker uses draft model<br/>propose / verify / extend"]
        end
        A --> B
        B --> C
        C --> D
        D --> E
        E --> F
        F --> G

`SGLang` allows external model packages to register architectures through `SGLANG_EXTERNAL_MODEL_PACKAGE` and a module-level `EntryClass`. This is also the core mechanism of the ATOM SGLang plugin. The plugin uses this mechanism to expose a class with the exact architecture name expected by `SGLang`, `DeepseekV3ForCausalLMNextN`.

Once this class is available in the plugin package, `SGLang` resolves the draft architecture to the plugin implementation instead of the upstream one in `sglang/srt/models/deepseek_nextn.py`.

The following diagram illustrates what "overriding the draft architecture" means in practice:

    flowchart TD
        subgraph SGL["SGLang domain"]
            A["launch server<br/>SGLANG_EXTERNAL_MODEL_PACKAGE=atom.plugin.sglang.models"]
            B["Import external model package"]
            C["Read module EntryClass"]
            D["Register architecture:<br/>DeepseekV3ForCausalLMNextN"]
            E["ModelRegistry.resolve_model_cls(...)"]
            H["upstream draft implementation<br/>sglang/srt/models/deepseek_nextn.py"]
        end
        subgraph PLUGIN["ATOM SGLang Plugin domain"]
            F["plugin wrapper<br/>DeepseekV3ForCausalLMNextN"]
        end
        subgraph CORE["ATOM Core domain"]
            G["DeepSeekMTP"]
        end
        A --> B
        B --> C
        C --> D
        D --> E
        E --> F
        F --> G
        H -. "same architecture name is overridden" .-> D

The important point is that architecture resolution and `ModelRegistry` selection still happen inside the `SGLang` domain. The `ATOM SGLang Plugin` domain only contributes a same-name wrapper through the external package entry point, while the actual draft computation is delegated to `DeepSeekMTP` in the `ATOM Core` domain. This makes it easier to separate who owns scheduling and model resolution from who owns the draft implementation.

2. MTP Wrapper in ATOM: A thin wrapper as the compatibility bridge
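As a rough picture of the bridge this section describes, the sketch below shows a same-name wrapper that forwards to a core draft model and is exported through a module-level `EntryClass`. It is only an illustration under stated assumptions: `FakeCoreMTP`, `FakeSpecInfo`, and `FakeForwardBatch` are hypothetical stand-ins, the simplified `forward()` signature is not the real `SGLang` one, and the list form of `EntryClass` is assumed.

```python
from dataclasses import dataclass
from typing import List, Tuple


class FakeCoreMTP:
    """Hypothetical stand-in for the core draft model (not ATOM's DeepSeekMTP)."""

    def __init__(self) -> None:
        self.embed = "draft-embed"
        self.head = "draft-head"

    def __call__(self, input_ids: List[int], hidden_states: List[float]) -> List[float]:
        # Toy computation: combine each token id with the target hidden state.
        return [h + t for h, t in zip(hidden_states, input_ids)]


@dataclass
class FakeSpecInfo:
    hidden_states: List[float]


@dataclass
class FakeForwardBatch:
    spec_info: FakeSpecInfo


class DeepseekV3ForCausalLMNextN:
    """Same-name wrapper: the outer contract stays put, the work is delegated."""

    def __init__(self, core: FakeCoreMTP) -> None:
        self.core = core

    def get_embed_and_head(self) -> Tuple[str, str]:
        return self.core.embed, self.core.head

    def set_embed_and_head(self, embed: str, head: str) -> None:
        # Lets a speculative worker share the target model's weights.
        self.core.embed, self.core.head = embed, head

    def set_embed(self, embed: str) -> None:
        self.core.embed = embed

    def forward(self, input_ids: List[int], forward_batch: FakeForwardBatch) -> List[float]:
        # Hidden states from the target model arrive through spec_info.
        return self.core(input_ids, forward_batch.spec_info.hidden_states)


# Module-level export read by the external model package mechanism
# (the list form is an assumption for this sketch).
EntryClass = [DeepseekV3ForCausalLMNextN]
```

The wrapper holds no model logic of its own, which is what keeps it thin: any change to the core draft model is picked up without touching the outer interface.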
The plugin adds a lightweight wrapper named `DeepseekV3ForCausalLMNextN`. Externally, it matches the draft-model interface expected by `SGLang`. Internally, it delegates the actual draft computation to `ATOM DeepSeekMTP`.

The wrapper is responsible for:

- `atom_config`
- `DeepSeekMTP` (`ATOM/atom/models/deepseek_mtp.py::DeepSeekMTP`)
- `get_embed_and_head()`, `set_embed_and_head()`, and `set_embed()` so speculative workers can share embeddings and LM head weights with the target model
- `forward_batch.spec_info.hidden_states` in `forward()`
- the `spec_decode=True` path

The design principle is:

- Keep the `SGLang` architecture name and draft-worker contract unchanged at the top layer
- Use `ATOM DeepSeekMTP` as the implementation at the lower layer

This minimizes duplication, avoids recreating the upstream NextN hierarchy inside the plugin, and makes future improvements to `ATOM`'s native MTP implementation reusable in plugin mode.

Risks
Intrusive change to formal runtime variable control codes
`ATOM` core code currently relies on some process-global runtime/config state. In speculative mode, target and draft wrappers coexist. Without isolation, initializing or running the draft wrapper may overwrite global state used by the target path, leading to subtle cross-contamination in MoE or attention behavior.

To address this, the plugin introduces a runtime scope that explicitly binds and restores the proper runtime context around wrapper `__init__`, `forward()`, and `load_weights()`. This allows target and draft instances to coexist safely.

However, this also exposes an architectural issue: the current plugin system still carries meaningful complexity around process-global state management. To let multiple wrappers coexist, the plugin must repeatedly switch and restore global runtime state at execution boundaries. In that sense, `runtime scoping` should be understood as a containment mechanism for the current global-state model, not as the ideal long-term abstraction. It solves the correctness problem for this branch, but it also points toward a future direction with fewer implicit globals and more explicitly instantiated runtime state.

Direct attn backend replacement

    sglang_aiter_backend.AiterAttnBackend = ATOMAttnBackendForSgl

The reason can be summarized by the key `SGLang` call chain. `EAGLE` draft multi-step decode actually goes through:

    flowchart LR
        A["EAGLEWorker"] --> B["DraftBackendFactory"]
        B --> C["AiterMultiStepDraftBackend"]
        C --> D["AiterAttnBackend(...)"]
        R["ATOM-sglang attention registry"] -. "not used on this path" .-> D

So if the plugin only overrides the `"aiter"` registry entry, but does not also rewrite `sglang.srt.layers.attention.aiter_backend.AiterAttnBackend`, then `EAGLE` draft decode still directly constructs the upstream `AiterAttnBackend`. That is why this monkeypatch is hacky but still practically necessary on the current branch.

The plugin is mutating an upstream module symbol directly. This is not a clean extension point.
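To make the difference between the two override points concrete, here is a minimal, self-contained sketch. It does not import `SGLang`; `upstream`, `registry`, `build_from_registry`, `build_multi_step_draft_backend`, and `PluginAttnBackend` are hypothetical stand-ins chosen only to mirror the shape of the call chain above.

```python
import types

# Hypothetical stand-in for the upstream backend module (not real SGLang code).
upstream = types.SimpleNamespace()


class _UpstreamAttnBackend:
    name = "upstream"


upstream.AiterAttnBackend = _UpstreamAttnBackend

# Hypothetical registry keyed by backend name.
registry = {"aiter": upstream.AiterAttnBackend}


def build_from_registry(key: str):
    # Path that consults the registry.
    return registry[key]()


def build_multi_step_draft_backend():
    # Path that bypasses the registry and constructs the module symbol directly.
    return upstream.AiterAttnBackend()


class PluginAttnBackend:
    name = "plugin"


# Step 1: override only the registry entry.
registry["aiter"] = PluginAttnBackend
registry_only = (
    build_from_registry("aiter").name,
    build_multi_step_draft_backend().name,
)

# Step 2: also rebind the module-level symbol (the monkeypatch).
upstream.AiterAttnBackend = PluginAttnBackend
after_patch = build_multi_step_draft_backend().name

print(registry_only)  # ('plugin', 'upstream')
print(after_patch)    # plugin
```

After step 1 the direct-construction path still returns the upstream class, which is the gap the monkeypatch closes. The rebinding only affects callers that look the symbol up through the module at call time; code that imported the class by name before the patch would keep its old reference.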
Other changes
Complete the radix attention forward in speculative mode, such as `TARGET_VERIFY` and `DRAFT_EXTEND`.
Accuracy
Acceptance ratio