
fix(deepseek-v4): close MTP acceptance gap #207

Open
Xiangyi1996 wants to merge 8 commits into lightseekorg:main from Xiangyi1996:xiangyi/v4-mtp-gap-rebased

@Xiangyi1996

Summary

This PR closes the DeepSeek V4 MTP acceptance gap between TokenSpeed and TRTLLM.

Root cause:

  • The remaining gap did not come from the compressed KV / CSA indexer cache.
  • It came from MTP draft decode using stale or incorrect V4 paged KV cache metadata.
  • V4 has multiple cache tables; the SWA compact table could be observed with the wrong request context during draft/target-verify transitions.

Fix:

  • Make target-verify/draft-extend forward modes explicit.
  • Refresh paged cache group metadata for MTP draft/target paths.
  • Carry V4 SWA/compressed KV/CSA metadata consistently through draft decode.
  • Preserve target-verify logits and hidden states correctly for speculative decoding.
  • Add tests for V4 SWA slot sanitization / paged metadata behavior.
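
As a rough illustration of the "refresh paged cache group metadata" item, the sketch below rebuilds the per-group block tables whenever the request set or the forward mode changes, so a draft step never reuses tables that were built for a different request context. Every name here (`GroupMetadata`, `build_metadata`, `refresh_if_stale`, the group names, and the `page_lookup` layout) is hypothetical and is not taken from the TokenSpeed code.

```python
from dataclasses import dataclass
from enum import Enum, auto


class ForwardMode(Enum):
    # Hypothetical subset of forward modes, named after the PR description.
    DECODE = auto()
    TARGET_VERIFY = auto()
    DRAFT_EXTEND = auto()


@dataclass
class GroupMetadata:
    request_ids: tuple   # requests these tables were built for
    mode: ForwardMode    # forward mode these tables were built for
    block_tables: dict   # cache group name -> list of per-request page lists


def build_metadata(request_ids, mode, page_lookup,
                   groups=("swa", "compressed", "indexer")):
    """Build fresh block tables for every cache group (assumed group names)."""
    tables = {
        group: [list(page_lookup[group][rid]) for rid in request_ids]
        for group in groups
    }
    return GroupMetadata(tuple(request_ids), mode, tables)


def refresh_if_stale(cached, request_ids, mode, page_lookup):
    """Reuse cached metadata only if it matches the current batch and mode."""
    stale = (
        cached is None
        or cached.request_ids != tuple(request_ids)
        or cached.mode is not mode
    )
    return build_metadata(request_ids, mode, page_lookup) if stale else cached
```

The point of the sketch is the staleness check: metadata is keyed by both the batch and the mode, so a target-verify to draft-extend transition forces a rebuild for every cache group rather than only one of them.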

Validation

  • pre-commit run --all-files: passed
  • py_compile on touched runtime/test files: passed
  • Acceptance rerun after rebase:
    • Decoded Tok/Iter = 2.8447
    • Spec Accept Rate = 0.6485
    • Decoded Tok/Iter falls within the TRTLLM range of 2.8-2.9
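
A quick sanity check on the two reported numbers: they are consistent with an acceptance rate computed as the accepted draft tokens per iteration divided by all decoded tokens per iteration, i.e. `1 - 1 / tok_per_iter`. The PR does not state how Spec Accept Rate is defined, so this relation is only an observation about the figures above, and `implied_accept_rate` is a hypothetical helper.

```python
def implied_accept_rate(tok_per_iter: float) -> float:
    """Accept rate under the ASSUMED definition (tok_per_iter - 1) / tok_per_iter.

    Each iteration yields one token from the target model itself; the
    remaining tok_per_iter - 1 tokens are accepted draft tokens.
    """
    if tok_per_iter < 1.0:
        raise ValueError("an iteration decodes at least one token")
    return 1.0 - 1.0 / tok_per_iter


# Reported Decoded Tok/Iter from the acceptance rerun.
rate = implied_accept_rate(2.8447)
print(round(rate, 4))  # 0.6485, matching the reported Spec Accept Rate
```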

@Xiangyi1996 Xiangyi1996 force-pushed the xiangyi/v4-mtp-gap-rebased branch from c285d95 to 63e22c5 Compare May 22, 2026 05:34
@Xiangyi1996 Xiangyi1996 marked this pull request as ready for review May 22, 2026 05:37
@Xiangyi1996 Xiangyi1996 requested a review from a team as a code owner May 22, 2026 05:37

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

draft_cache_cell_size = (
    draft_attn_config.cache_cell_size()
    * draft_model_config.num_attention_layers
)

[P1] Use V4 grouped draft cache size in page-budget profiling

When the draft model is also DeepSeek V4 (the new is_deepseek_v4_draft_model path), this branch still computes draft_cache_cell_size from draft_attn_config.cache_cell_size(), which is the generic MLA estimate and does not include the V4 grouped caches (SWA/compressed/indexer/state). As a result, profile_deepseek_v4_max_num_pages overestimates the KV pages available for target+draft, so deployments can admit a token/page budget that exceeds real GPU memory and fail with OOM under load. This should use the V4-specific draft size (draft_profile_cache_cell_size / layout-based sizing) instead.
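
To make the failure mode concrete, here is a minimal sketch of how an underestimated per-token draft cell size inflates the page budget. The helper `max_num_pages` and every number below are hypothetical placeholders chosen for illustration; they are not the real `profile_deepseek_v4_max_num_pages` logic or real DeepSeek V4 sizes.

```python
def max_num_pages(free_bytes: int, page_size: int, cell_sizes_bytes: list) -> int:
    """Pages that fit when each page holds `page_size` tokens for every model."""
    bytes_per_page = page_size * sum(cell_sizes_bytes)
    return free_bytes // bytes_per_page


# All values are illustrative assumptions, not measured sizes.
free_bytes = 40 * 1024**3     # memory left for KV cache
page_size = 64                # tokens per page
target_cell = 70_000          # bytes per token, target model
draft_generic = 1_200         # bytes per token, generic MLA-style estimate
draft_grouped_extra = 2_400   # bytes per token for SWA/compressed/indexer/state

# Budget computed with the generic draft estimate (what the review flags).
optimistic = max_num_pages(free_bytes, page_size, [target_cell, draft_generic])

# Budget computed when the grouped caches are also counted.
realistic = max_num_pages(
    free_bytes, page_size, [target_cell, draft_generic + draft_grouped_extra]
)

# Pages the scheduler could admit that do not actually fit in memory.
overcommitted_pages = optimistic - realistic
```

Any positive `overcommitted_pages` is memory the scheduler believes it has but the allocator cannot provide, which is why the gap only shows up as OOM once the server is under load.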

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c01d2a08dd

Comment thread python/tokenspeed/runtime/execution/forward_batch_info.py Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1f8108eed5

Comment thread python/tokenspeed/runtime/execution/drafter/eagle.py Outdated
Comment thread python/tokenspeed/runtime/execution/cuda_graph_wrapper.py Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2494dc30ac

Comment thread python/tokenspeed/runtime/execution/model_executor.py Outdated
Comment thread python/tokenspeed/runtime/execution/cuda_graph_wrapper.py

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cb37d86925

Comment thread python/tokenspeed/runtime/execution/cuda_graph_wrapper.py Outdated
@Xiangyi1996 Xiangyi1996 force-pushed the xiangyi/v4-mtp-gap-rebased branch from f77d47a to e4223a0 Compare May 22, 2026 07:20

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e4223a006a

Comment thread python/tokenspeed/runtime/models/deepseek_v4.py
@lightseek-bot
Contributor

@Xiangyi1996 please fix the conflicts thanks!

yechank-nvidia and others added 8 commits May 22, 2026 06:47
Co-authored-by: jiyingd <87510204+dongjiyingdjy@users.noreply.github.com>

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
Signed-off-by: jiyingd <87510204+dongjiyingdjy@users.noreply.github.com>
…ken_id

R1-0528-NVFP4-v2 marks q_a_proj / kv_a_proj_with_mqa in exclude_modules
(stored as bf16 at logical shape), but DeepseekV3FusedQkvAProjWithMqa
allocates an NVFP4-packed buffer because the fused prefix is not in
exclude_modules. Detect component-level exclusion and pass through
quant_config=None to fall back to bf16.

Also add get_hot_token_id() returning None to DeepseekV3ForCausalLMNextN
to match the EAGLE3/MTP drafter contract (mirrors qwen3_5_nextn.py).

Signed-off-by: Xiangyi Zhang <xiangyiz@nvidia.com>
@Xiangyi1996 Xiangyi1996 force-pushed the xiangyi/v4-mtp-gap-rebased branch from e4223a0 to c08927d Compare May 22, 2026 13:48

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c08927dc81

Comment on lines 899 to 901
      self.drafter is not None
    - and ctx.forward_mode.is_decode()
    + and ctx.forward_mode.is_target_verify()
      and hasattr(self.attn_backend, "update_mamba_state_after_mtp_verify")

[P1] Keep Mamba verify-state update for decode-mode speculation

This condition now runs the post-verify Mamba state update only for TARGET_VERIFY, but non-DeepSeek-V4 speculative runs still use ForwardMode.DECODE. In those runs (e.g., hybrid linear attention backends with update_mamba_state_after_mtp_verify), the update is skipped, so the Mamba input/output cache pointers are not advanced after acceptance and subsequent decode steps can read stale state. This is a regression from the previous is_decode() check and can break speculative decoding correctness whenever CUDA-graph replay is used.
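
One possible way to address this, sketched under assumptions, is to accept both forward modes in the guard so that decode-mode speculation keeps the update while the new explicit target-verify mode also triggers it. The `ForwardMode` enum and `should_update_mamba_state` helper below are hypothetical stand-ins for illustration; only the method name `update_mamba_state_after_mtp_verify` and the `is_decode()` / `is_target_verify()` predicates come from the review comment, and the actual fix in the PR may differ.

```python
from enum import Enum, auto


class ForwardMode(Enum):
    # Hypothetical subset of modes, for illustration only.
    EXTEND = auto()
    DECODE = auto()
    TARGET_VERIFY = auto()

    def is_decode(self) -> bool:
        return self is ForwardMode.DECODE

    def is_target_verify(self) -> bool:
        return self is ForwardMode.TARGET_VERIFY


def should_update_mamba_state(has_drafter: bool, mode: ForwardMode,
                              attn_backend: object) -> bool:
    """Run the post-verify update for either speculative verification mode."""
    return (
        has_drafter
        # Legacy speculative runs verify in DECODE; V4 uses TARGET_VERIFY.
        and (mode.is_decode() or mode.is_target_verify())
        # Only backends that implement the hook need the update.
        and hasattr(attn_backend, "update_mamba_state_after_mtp_verify")
    )
```

The guard still depends on a drafter being present and on the backend exposing the hook, so plain decode without speculation is unaffected.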

3 participants