fix(deepseek-v4): close MTP acceptance gap #207
Force-pushed: c285d95 to 63e22c5
💡 Codex Review
tokenspeed/python/tokenspeed/runtime/layers/attention/registry.py
Lines 433 to 436 in 63e22c5
When the draft model is also DeepSeek V4 (the new is_deepseek_v4_draft_model path), this branch still computes draft_cache_cell_size from draft_attn_config.cache_cell_size(), which is the generic MLA estimate and does not include V4 grouped caches (SWA/compressed/indexer/state). profile_deepseek_v4_max_num_pages then overestimates available KV pages for target+draft, so deployments can admit a token/page budget that exceeds real GPU memory and fail with OOM under load; this should use the V4-specific draft size (draft_profile_cache_cell_size / layout-based sizing) instead.
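The sizing concern can be illustrated with a minimal sketch. It assumes a simplified budget in which every token slot stores one target cell and one draft cell; `max_num_pages` and all byte counts below are hypothetical stand-ins, not the actual `profile_deepseek_v4_max_num_pages` logic or real DeepSeek V4 sizes.

```python
def max_num_pages(free_bytes: int, page_size: int,
                  target_cell_bytes: int, draft_cell_bytes: int) -> int:
    """Number of KV pages that fit when each token slot holds a target and a draft cell."""
    bytes_per_page = page_size * (target_cell_bytes + draft_cell_bytes)
    return free_bytes // bytes_per_page


# Illustrative numbers only; none of these come from the PR.
FREE_BYTES = 8 * 1024**3
PAGE_SIZE = 64
TARGET_CELL_BYTES = 40_000
GENERIC_DRAFT_CELL_BYTES = 1_200  # stand-in for the generic MLA estimate
V4_DRAFT_CELL_BYTES = 2_000       # stand-in that also counts the grouped caches

pages_generic = max_num_pages(
    FREE_BYTES, PAGE_SIZE, TARGET_CELL_BYTES, GENERIC_DRAFT_CELL_BYTES
)
pages_v4 = max_num_pages(
    FREE_BYTES, PAGE_SIZE, TARGET_CELL_BYTES, V4_DRAFT_CELL_BYTES
)

# Memory actually needed if the generic page budget is admitted but the
# draft cache really costs the V4-specific size per token.
real_bytes_for_generic_budget = (
    pages_generic * PAGE_SIZE * (TARGET_CELL_BYTES + V4_DRAFT_CELL_BYTES)
)

print(pages_generic, pages_v4, real_bytes_for_generic_budget > FREE_BYTES)
```

Under these assumed numbers the generic estimate admits more pages than the V4-aware one, and the admitted budget needs more memory than is free, which is the OOM scenario the comment describes.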
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c01d2a08dd
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1f8108eed5
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2494dc30ac
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: cb37d86925
Force-pushed: f77d47a to e4223a0
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e4223a006a
@Xiangyi1996 please fix the conflicts, thanks!
Co-authored-by: jiyingd <87510204+dongjiyingdjy@users.noreply.github.com>
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
Signed-off-by: jiyingd <87510204+dongjiyingdjy@users.noreply.github.com>
…ken_id

R1-0528-NVFP4-v2 marks q_a_proj / kv_a_proj_with_mqa in exclude_modules (stored as bf16 at logical shape), but DeepseekV3FusedQkvAProjWithMqa allocates an NVFP4-packed buffer because the fused prefix is not in exclude_modules. Detect component-level exclusion and pass through quant_config=None to fall back to bf16.

Also add get_hot_token_id() returning None to DeepseekV3ForCausalLMNextN to match the EAGLE3/MTP drafter contract (mirrors qwen3_5_nextn.py).

Signed-off-by: Xiangyi Zhang <xiangyiz@nvidia.com>
Signed-off-by: Xiangyi Zhang <xiangyiz@nvidia.com>
Signed-off-by: Xiangyi Zhang <xiangyiz@nvidia.com>
Signed-off-by: Xiangyi Zhang <xiangyiz@nvidia.com>
Signed-off-by: Xiangyi Zhang <xiangyiz@nvidia.com>
Signed-off-by: Xiangyi Zhang <xiangyiz@nvidia.com>
Signed-off-by: Xiangyi Zhang <xiangyiz@nvidia.com>
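The component-level exclusion check described in the commit message above could look roughly like the following sketch. The helper name `resolve_fused_quant_config`, the exact-or-suffix matching rule, and the handling of the mixed case are assumptions made for illustration; the PR's actual implementation may differ.

```python
from typing import Any, Optional, Sequence


def resolve_fused_quant_config(
    quant_config: Optional[Any],
    component_names: Sequence[str],
    exclude_modules: Sequence[str],
) -> Optional[Any]:
    """Return None (bf16 fallback) when all fused components are excluded.

    Hypothetical helper: a component counts as excluded if an entry of
    exclude_modules equals its name or is a dotted suffix of it.
    """
    if quant_config is None:
        return None

    def is_excluded(name: str) -> bool:
        return any(name == e or name.endswith("." + e) for e in exclude_modules)

    flags = [is_excluded(name) for name in component_names]
    if flags and all(flags):
        # Checkpoint stores every component unquantized, so the fused layer
        # must not allocate a packed quantized buffer.
        return None
    if any(flags):
        # Assumed policy: a fused layer cannot mix quantized and bf16 parts.
        raise ValueError(f"mixed exclusion for fused components: {list(component_names)}")
    return quant_config


components = [
    "model.layers.0.self_attn.q_a_proj",
    "model.layers.0.self_attn.kv_a_proj_with_mqa",
]
nvfp4 = {"quant_algo": "NVFP4"}  # placeholder config object

print(resolve_fused_quant_config(nvfp4, components, ["q_a_proj", "kv_a_proj_with_mqa"]))
print(resolve_fused_quant_config(nvfp4, components, ["lm_head"]))
```

Checking the components rather than the fused prefix matters because the fused module's own name never appears in the checkpoint's exclude_modules list.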
e4223a0 to
c08927d
Compare
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c08927dc81
  self.drafter is not None
- and ctx.forward_mode.is_decode()
+ and ctx.forward_mode.is_target_verify()
  and hasattr(self.attn_backend, "update_mamba_state_after_mtp_verify")
Keep Mamba verify-state update for decode-mode speculation
This condition now runs the post-verify Mamba state update only for TARGET_VERIFY, but non-DeepSeek-V4 speculative runs still use ForwardMode.DECODE. In those runs (e.g., hybrid linear attention backends with update_mamba_state_after_mtp_verify), the update is skipped, so the Mamba input/output cache pointers are not advanced after acceptance and subsequent decode steps can read stale state. This is a regression from the previous is_decode() check and can break speculative decoding correctness whenever CUDA-graph replay is used.
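One way to address this would be to accept both modes in the guard. The sketch below is a self-contained illustration under that assumption; the `ForwardMode` enum, the `should_update_mamba_state` helper, and the dummy backend classes are simplified stand-ins, not the repository's actual definitions or the fix the PR adopted.

```python
from enum import Enum, auto


class ForwardMode(Enum):
    """Simplified stand-in for the runtime's forward-mode enum."""
    EXTEND = auto()
    DECODE = auto()
    TARGET_VERIFY = auto()

    def is_decode(self) -> bool:
        return self is ForwardMode.DECODE

    def is_target_verify(self) -> bool:
        return self is ForwardMode.TARGET_VERIFY


def should_update_mamba_state(drafter, forward_mode: ForwardMode, attn_backend) -> bool:
    """Run the post-verify update for both decode-mode and target-verify speculation."""
    return (
        drafter is not None
        and (forward_mode.is_decode() or forward_mode.is_target_verify())
        and hasattr(attn_backend, "update_mamba_state_after_mtp_verify")
    )


class HybridBackend:
    """Dummy backend that exposes the post-verify hook."""
    def update_mamba_state_after_mtp_verify(self) -> None:
        pass


class PlainBackend:
    """Dummy backend without the hook."""


drafter = object()  # any non-None drafter
print(should_update_mamba_state(drafter, ForwardMode.DECODE, HybridBackend()))
print(should_update_mamba_state(drafter, ForwardMode.TARGET_VERIFY, HybridBackend()))
print(should_update_mamba_state(drafter, ForwardMode.EXTEND, HybridBackend()))
```

With this guard, backends that still verify under ForwardMode.DECODE keep advancing their Mamba state, while the DeepSeek V4 TARGET_VERIFY path is also covered.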
Summary
This PR closes the DeepSeek V4 MTP acceptance gap between TokenSpeed and TRTLLM.
Root cause:
Fix:
Validation
- pre-commit run --all-files: passed
- py_compile on touched runtime/test files: passed
- Decoded Tok/Iter = 2.8447
- Spec Accept Rate = 0.6485