Warn when speculative decoding may hurt throughput for MoE models #1313

Open
Shylin26 wants to merge 2 commits into ml-explore:main from Shylin26:fix/moe-speculative-decoding-warning

@Shylin26

Fixes #1132

Problem

When --draft-model is used with an MoE target model (e.g. Qwen3.5-397B-A17B),
speculative decoding can reduce throughput by 25-45%. The target model's
active parameter count is close to the draft model's size, so drafting plus
verification costs more than generating directly from the target model.
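To make the trade-off above concrete, here is a rough sketch of the usual expected-speedup estimate for speculative decoding. It is not part of this PR. The names `expected_speedup`, `alpha` (per-token acceptance rate), `gamma` (draft tokens per step), and `cost_ratio` (cost of one draft step relative to one target step) are illustrative assumptions. The estimate also assumes that verifying a batch of drafted tokens costs the same as one target step, which is likely optimistic for MoE models because a batch of tokens can route to more experts than a single token does.

```python
def expected_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Estimate speculative-decoding speedup over plain decoding.

    alpha:      assumed probability that a drafted token is accepted (0..1)
    gamma:      number of tokens drafted per verification step (>= 1)
    cost_ratio: assumed cost of one draft step / one target step (>= 0)
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if gamma < 1:
        raise ValueError("gamma must be at least 1")
    if cost_ratio < 0.0:
        raise ValueError("cost_ratio must be non-negative")

    # Expected tokens produced per verification step (geometric series).
    if alpha == 1.0:
        tokens_per_step = float(gamma + 1)
    else:
        tokens_per_step = (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)

    # Cost of one step: gamma draft passes plus one target pass,
    # measured in units of a single target pass.
    step_cost = gamma * cost_ratio + 1.0
    return tokens_per_step / step_cost


if __name__ == "__main__":
    # Illustrative values only, not measurements from this PR.
    for ratio in (0.05, 0.25, 0.5, 1.0):
        print(f"cost_ratio={ratio}: speedup={expected_speedup(0.8, 4, ratio):.2f}")
```

A result below 1.0 means the estimate predicts a slowdown. As `cost_ratio` approaches 1 (a draft model about as expensive per token as the target's active path), the estimate drops below 1.0 even with a high acceptance rate.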

Solution

Added _warn_speculative_moe() that runs after both models are loaded. It:

  • Detects an MoE architecture via num_experts and num_experts_per_tok in the model config
  • Estimates active parameters as (num_experts_per_tok / num_experts) * total_params
  • Warns if the draft model's size is within 4x of the active parameter count
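The three steps above can be sketched roughly as follows. This is a standalone illustration, not the code in this PR: the helper names `estimate_active_params` and `warn_if_draft_too_large`, the plain-dict config, and the reading of "within 4x" as `active_params < 4 * draft_params` are all assumptions.

```python
import warnings
from typing import Any, Mapping, Optional


def estimate_active_params(
    config: Mapping[str, Any], total_params: float
) -> Optional[float]:
    """Return estimated active parameters, or None if the config is not MoE."""
    num_experts = config.get("num_experts")
    experts_per_tok = config.get("num_experts_per_tok")
    # Missing or zero values are treated as "not an MoE model".
    if not num_experts or not experts_per_tok:
        return None
    return (experts_per_tok / num_experts) * total_params


def warn_if_draft_too_large(
    target_config: Mapping[str, Any],
    target_params: float,
    draft_params: float,
    ratio: float = 4.0,  # assumed meaning of "within 4x"
) -> bool:
    """Warn and return True if the draft model is close in size to the active path."""
    active = estimate_active_params(target_config, target_params)
    if active is None:
        return False  # non-MoE targets are left alone
    if active < ratio * draft_params:
        warnings.warn(
            f"Draft model ({draft_params / 1e9:.1f}B params) is within {ratio:g}x "
            f"of the target's estimated active parameters ({active / 1e9:.1f}B); "
            "speculative decoding may reduce throughput.",
            stacklevel=2,
        )
        return True
    return False
```

The estimate ignores parameters that every token uses (attention, embeddings, shared experts), so it tends to understate the true active count. That makes the check more likely to warn than a precise count would.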

Non-MoE models are completely unaffected.

Development

Successfully merging this pull request may close these issues.

Feature request: warn when speculative decoding is unlikely to help (MoE models)