fix(deepseek-v4): close MTP acceptance gap by Xiangyi1996 · Pull Request #207 · lightseekorg/tokenspeed

Xiangyi1996 · 2026-05-21T11:31:23Z

Summary

This PR closes the DeepSeek V4 MTP acceptance gap between TokenSpeed and TRTLLM.

Root cause:

The remaining gap was not from compressed KV / CSA indexer cache.
It came from MTP draft decode using stale/incorrect V4 paged KV cache metadata.
V4 has multiple cache tables; the SWA compact table could be observed with the wrong request context during draft/target-verify transitions.

Fix:

- Make target-verify/draft-extend forward modes explicit.
- Refresh paged cache group metadata for MTP draft/target paths.
- Carry V4 SWA/compressed KV/CSA metadata consistently through draft decode.
- Preserve target-verify logits and hidden states correctly for speculative decoding.
- Add tests for V4 SWA slot sanitization / paged metadata behavior.
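
The first two fix items can be pictured with a small sketch. This is not the TokenSpeed code: `PagedCacheGroupMeta`, `refresh_group_metadata`, the group names, and the exact `ForwardMode` members are hypothetical and chosen only for illustration. It shows the general idea of rebuilding per-group tables from the current batch on each mode transition, so that one step cannot observe another step's request context.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class ForwardMode(Enum):
    # Hypothetical enum; the PR only states that target-verify and
    # draft-extend become explicit modes.
    EXTEND = auto()
    DECODE = auto()
    TARGET_VERIFY = auto()
    DRAFT_EXTEND = auto()

    def is_target_verify(self) -> bool:
        return self is ForwardMode.TARGET_VERIFY

    def is_draft_extend(self) -> bool:
        return self is ForwardMode.DRAFT_EXTEND


@dataclass
class PagedCacheGroupMeta:
    # One block table per cache group (for example "swa" or "compressed").
    mode: ForwardMode
    request_ids: tuple
    block_tables: dict = field(default_factory=dict)


def refresh_group_metadata(mode, request_ids, tables_by_group):
    """Build fresh metadata for the current batch.

    Every table is copied, so metadata built for one step never aliases
    the tables of another step.
    """
    return PagedCacheGroupMeta(
        mode=mode,
        request_ids=tuple(request_ids),
        block_tables={
            group: [list(row) for row in table]
            for group, table in tables_by_group.items()
        },
    )


verify_meta = refresh_group_metadata(
    ForwardMode.TARGET_VERIFY, ["req-0"], {"swa": [[3, 4]], "compressed": [[7]]}
)
draft_meta = refresh_group_metadata(
    ForwardMode.DRAFT_EXTEND, ["req-1"], {"swa": [[9, 10]], "compressed": [[12]]}
)
```

Copying the tables is the conservative choice here; a real runtime would more likely refresh preallocated buffers in place.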

Validation

pre-commit run --all-files: passed
py_compile on touched runtime/test files: passed
Acceptance rerun after rebase:
- Decoded Tok/Iter = 2.8447
- Spec Accept Rate = 0.6485
- Decoded Tok/Iter is within the TRTLLM range of 2.8-2.9

chatgpt-codex-connector

💡 Codex Review

tokenspeed/python/tokenspeed/runtime/layers/attention/registry.py

Lines 433 to 436 in 63e22c5

    
           draft_cache_cell_size = ( 
        
               draft_attn_config.cache_cell_size() 
        
               * draft_model_config.num_attention_layers 
        
           )

Use V4 grouped draft cache size in page-budget profiling

When the draft model is also DeepSeek V4 (the new is_deepseek_v4_draft_model path), this branch still computes draft_cache_cell_size from draft_attn_config.cache_cell_size(), which is the generic MLA estimate and does not include V4 grouped caches (SWA/compressed/indexer/state). profile_deepseek_v4_max_num_pages then overestimates available KV pages for target+draft, so deployments can admit a token/page budget that exceeds real GPU memory and fail with OOM under load; this should use the V4-specific draft size (draft_profile_cache_cell_size / layout-based sizing) instead.
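
To make the overestimate concrete, here is a rough arithmetic sketch. All byte counts are made up, and `max_num_pages` is a hypothetical helper, not `profile_deepseek_v4_max_num_pages`. It only shows that a per-token draft size that omits the grouped caches yields more pages than memory can back.

```python
def max_num_pages(free_bytes: int, page_size: int, cell_bytes_per_token: int) -> int:
    """Number of KV pages that fit, given the per-token cache footprint."""
    return free_bytes // (page_size * cell_bytes_per_token)


# All numbers below are invented for illustration only.
free_bytes = 40 * 1024**3     # memory left for the KV cache
page_size = 64                # tokens per page
target_cell = 70_000          # target model bytes per token
draft_generic_cell = 1_200    # generic estimate for the draft model
draft_grouped_cell = 4_800    # estimate that also counts the grouped caches

optimistic = max_num_pages(free_bytes, page_size, target_cell + draft_generic_cell)
realistic = max_num_pages(free_bytes, page_size, target_cell + draft_grouped_cell)

# Pages admitted by the optimistic estimate that memory cannot actually hold.
overcommitted_pages = optimistic - realistic
print(optimistic, realistic, overcommitted_pages)
```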

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c01d2a08dd

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1f8108eed5

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2494dc30ac

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cb37d86925

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e4223a006a

lightseek-bot · 2026-05-22T08:59:13Z

@Xiangyi1996 please fix the conflicts thanks!

Co-authored-by: jiyingd <87510204+dongjiyingdjy@users.noreply.github.com> Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com> Signed-off-by: jiyingd <87510204+dongjiyingdjy@users.noreply.github.com>

…ken_id

R1-0528-NVFP4-v2 marks q_a_proj / kv_a_proj_with_mqa in exclude_modules (stored as bf16 at logical shape), but DeepseekV3FusedQkvAProjWithMqa allocates an NVFP4-packed buffer because the fused prefix is not in exclude_modules. Detect component-level exclusion and pass through quant_config=None to fall back to bf16.

Also add get_hot_token_id() returning None to DeepseekV3ForCausalLMNextN to match the EAGLE3/MTP drafter contract (mirrors qwen3_5_nextn.py).

Signed-off-by: Xiangyi Zhang <xiangyiz@nvidia.com>
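
The component-level exclusion check described in this commit message could look roughly like the sketch below. The helper `resolve_fused_quant_config`, the module prefixes, and the wildcard matching of `exclude_modules` entries are assumptions made for illustration; the actual TokenSpeed logic is not shown on this page.

```python
from fnmatch import fnmatchcase


def resolve_fused_quant_config(fused_prefix, component_prefixes, exclude_modules, quant_config):
    """Return None (unquantized fallback) if the fused layer, or any of the
    projections it fuses, matches an entry in exclude_modules."""
    patterns = list(exclude_modules)
    for name in (fused_prefix, *component_prefixes):
        if any(fnmatchcase(name, pattern) for pattern in patterns):
            return None
    return quant_config


# Hypothetical names and config; only the component projections are excluded.
nvfp4 = {"quant_algo": "NVFP4"}
exclude = [
    "model.layers.*.self_attn.q_a_proj",
    "model.layers.*.self_attn.kv_a_proj_with_mqa",
]
fused = "model.layers.0.self_attn.fused_qkv_a_proj_with_mqa"
parts = [
    "model.layers.0.self_attn.q_a_proj",
    "model.layers.0.self_attn.kv_a_proj_with_mqa",
]

# The fused prefix itself is not excluded, but its components are.
chosen = resolve_fused_quant_config(fused, parts, exclude, nvfp4)
```

Checking only the fused prefix would return the NVFP4 config here, which is the mismatch the commit message describes.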

Signed-off-by: Xiangyi Zhang <xiangyiz@nvidia.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c08927dc81

chatgpt-codex-connector · 2026-05-22T13:55:32Z

            self.drafter is not None
-            and ctx.forward_mode.is_decode()
+            and ctx.forward_mode.is_target_verify()
            and hasattr(self.attn_backend, "update_mamba_state_after_mtp_verify")


Keep Mamba verify-state update for decode-mode speculation

This condition now runs the post-verify Mamba state update only for TARGET_VERIFY, but non-DeepSeek-V4 speculative runs still use ForwardMode.DECODE. In those runs (e.g., hybrid linear attention backends with update_mamba_state_after_mtp_verify), the update is skipped, so the Mamba input/output cache pointers are not advanced after acceptance and subsequent decode steps can read stale state. This is a regression from the previous is_decode() check and can break speculative decoding correctness whenever CUDA-graph replay is used.
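
One possible shape for a condition that covers both cases is sketched below. It assumes that accepting either mode is safe, which this review comment does not confirm. `should_update_mamba_state` and the stub backend classes are hypothetical; only `is_decode()`, `is_target_verify()`, and `update_mamba_state_after_mtp_verify` appear in the diff above.

```python
from enum import Enum, auto


class ForwardMode(Enum):
    # Minimal stand-in for the runtime's forward-mode enum.
    EXTEND = auto()
    DECODE = auto()
    TARGET_VERIFY = auto()

    def is_decode(self) -> bool:
        return self is ForwardMode.DECODE

    def is_target_verify(self) -> bool:
        return self is ForwardMode.TARGET_VERIFY


def should_update_mamba_state(has_drafter, forward_mode, attn_backend) -> bool:
    """Run the post-verify update for decode-mode speculation as well as
    for the explicit target-verify mode."""
    return (
        has_drafter
        and (forward_mode.is_decode() or forward_mode.is_target_verify())
        and hasattr(attn_backend, "update_mamba_state_after_mtp_verify")
    )


class HybridBackend:
    # Stub backend exposing the hook named in the diff.
    def update_mamba_state_after_mtp_verify(self):
        pass


class PlainBackend:
    # Stub backend without the hook.
    pass


decode_hybrid = should_update_mamba_state(True, ForwardMode.DECODE, HybridBackend())
verify_hybrid = should_update_mamba_state(True, ForwardMode.TARGET_VERIFY, HybridBackend())
verify_plain = should_update_mamba_state(True, ForwardMode.TARGET_VERIFY, PlainBackend())
```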

Xiangyi1996 force-pushed the xiangyi/v4-mtp-gap-rebased branch from c285d95 to 63e22c5 Compare May 22, 2026 05:34

Xiangyi1996 marked this pull request as ready for review May 22, 2026 05:37

Xiangyi1996 requested a review from a team as a code owner May 22, 2026 05:37

chatgpt-codex-connector Bot reviewed May 22, 2026

Comment thread python/tokenspeed/runtime/execution/forward_batch_info.py Outdated

chatgpt-codex-connector Bot reviewed May 22, 2026

Comment thread python/tokenspeed/runtime/execution/drafter/eagle.py Outdated

Comment thread python/tokenspeed/runtime/execution/cuda_graph_wrapper.py Outdated

chatgpt-codex-connector Bot reviewed May 22, 2026

Comment thread python/tokenspeed/runtime/execution/model_executor.py Outdated

Comment thread python/tokenspeed/runtime/execution/cuda_graph_wrapper.py

chatgpt-codex-connector Bot reviewed May 22, 2026

Comment thread python/tokenspeed/runtime/execution/cuda_graph_wrapper.py Outdated

Xiangyi1996 force-pushed the xiangyi/v4-mtp-gap-rebased branch from f77d47a to e4223a0 Compare May 22, 2026 07:20

chatgpt-codex-connector Bot reviewed May 22, 2026

Comment thread python/tokenspeed/runtime/models/deepseek_v4.py

lightseek-bot requested review from SimonCqk and dongjiyingdjy May 22, 2026 08:40

yechank-nvidia and others added 8 commits May 22, 2026 06:47

feat(deepseek-v4): support mtp speculative decoding

fe47349

Co-authored-by: jiyingd <87510204+dongjiyingdjy@users.noreply.github.com> Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com> Signed-off-by: jiyingd <87510204+dongjiyingdjy@users.noreply.github.com>

fix(deepseek-v4): refresh mtp draft cache metadata

59f62bc

Signed-off-by: Xiangyi Zhang <xiangyiz@nvidia.com>

fix(deepseek-v4): profile grouped draft cache size

efa8282

Signed-off-by: Xiangyi Zhang <xiangyiz@nvidia.com>

fix(spec): gate target-verify mode to v4

65ef969

Signed-off-by: Xiangyi Zhang <xiangyiz@nvidia.com>

fix(spec): preserve non-v4 draft backend mode

7ff21fc

Signed-off-by: Xiangyi Zhang <xiangyiz@nvidia.com>

fix(spec): align idle and replay metadata modes

c28f0ee

Signed-off-by: Xiangyi Zhang <xiangyiz@nvidia.com>

fix(spec): preserve draft seq-lens alias

c08927d

Signed-off-by: Xiangyi Zhang <xiangyiz@nvidia.com>

Xiangyi1996 force-pushed the xiangyi/v4-mtp-gap-rebased branch from e4223a0 to c08927d Compare May 22, 2026 13:48

chatgpt-codex-connector Bot reviewed May 22, 2026

lightseek-bot mentioned this pull request May 22, 2026

[Draft]feat(deepseek-v4): support MTP speculative decoding #123

Closed

Xiangyi1996 wants to merge 8 commits into lightseekorg:main from Xiangyi1996:xiangyi/v4-mtp-gap-rebased

3 participants
