Warn when speculative decoding may hurt throughput for MoE models by Shylin26 · Pull Request #1313 · ml-explore/mlx-lm

Shylin26 · 2026-05-26T16:24:25Z

Problem

When --draft-model is used with an MoE target model (e.g. Qwen3.5-397B-A17B),
speculative decoding can reduce throughput by 25-45%. The target model's
active parameter count is close to the draft model's size, so drafting plus
verification costs more than direct generation.
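
The trade-off described above can be illustrated with a common simplified cost model for speculative decoding. This sketch is not part of the PR. It assumes a constant per-token acceptance rate `alpha`, `gamma` drafted tokens per step, and a draft-to-target cost ratio `c`. It also assumes that verifying the drafted tokens costs the same as one ordinary target step, which may be optimistic for MoE targets. The function and parameter names are hypothetical.

```python
def expected_speedup(alpha: float, gamma: int, c: float) -> float:
    """Estimate the speedup of speculative decoding over direct generation.

    alpha: assumed probability that a drafted token is accepted (0..1)
    gamma: number of tokens drafted per step (>= 1)
    c:     assumed cost of one draft step relative to one target step (>= 0)
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if gamma < 1 or c < 0.0:
        raise ValueError("gamma must be >= 1 and c must be >= 0")

    # Expected tokens produced per draft+verify step (geometric series).
    if alpha == 1.0:
        tokens_per_step = float(gamma + 1)
    else:
        tokens_per_step = (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)

    # Cost of one step in units of a single target forward pass:
    # gamma draft passes plus one verification pass.
    cost_per_step = gamma * c + 1.0
    return tokens_per_step / cost_per_step
```

Under these assumptions, a draft model that costs about half as much as the target per step (`c = 0.5`) gives a value below 1 for `alpha = 0.7` and `gamma = 4`, meaning slower than direct generation, while a much cheaper draft (`c = 0.05`) gives a value well above 1.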

Solution

Added _warn_speculative_moe() that runs after both models are loaded. It:

Detects MoE architecture via num_experts and num_experts_per_tok in model config
Estimates active parameters as (num_experts_per_tok / num_experts) * total_params
Warns if draft model size is within 4x of active parameters

Non-MoE models are completely unaffected.
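
The three steps listed above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the helper names, the dict-based config, and the explicit parameter counts are hypothetical, and "within 4x" is interpreted here as "the draft model is at least one quarter the size of the active parameters", which is an assumption. It uses logging.warning and a zero guard in line with the second commit message.

```python
import logging
from typing import Optional


def estimate_active_params(
    total_params: float, num_experts: Optional[int], num_experts_per_tok: Optional[int]
) -> Optional[float]:
    """Return the estimated active parameter count, or None for non-MoE configs."""
    # Missing or zero values mean "not MoE"; this also guards the division.
    if not num_experts or not num_experts_per_tok:
        return None
    return total_params * num_experts_per_tok / num_experts


def warn_speculative_moe(
    config: dict, total_params: float, draft_params: float, ratio: float = 4.0
) -> bool:
    """Log a warning and return True if the draft model looks too large.

    config:       target model config (hypothetical plain dict)
    total_params: total parameter count of the target model
    draft_params: parameter count of the draft model
    ratio:        assumed threshold; warn when draft * ratio >= active
    """
    active = estimate_active_params(
        total_params, config.get("num_experts"), config.get("num_experts_per_tok")
    )
    if active is None or draft_params <= 0:
        return False  # non-MoE target or unknown draft size: stay silent

    if draft_params * ratio >= active:
        logging.warning(
            "Speculative decoding may reduce throughput: draft model has "
            "%.1fB parameters, target has about %.1fB active parameters.",
            draft_params / 1e9,
            active / 1e9,
        )
        return True
    return False
```

For example, with a hypothetical config of 8 experts and 2 experts per token, a 40B-parameter target has roughly 10B active parameters under this estimate, so a 3B draft would trigger the warning and a 1B draft would not.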

warn when speculative decoding may hurt MoE model throughput

e4a4ae3

Shylin26 mentioned this pull request May 26, 2026

Feature request: warn when speculative decoding is unlikely to help (MoE models) #1132

Open

use logging.warning instead of print, add division by zero guard

07b0174

Shylin26 wants to merge 2 commits into ml-explore:main from Shylin26:fix/moe-speculative-decoding-warning
