Tested the new MLX backend (merged today) against mlx-community/Qwen3.5-35B-A3B-4bit + z-lab/Qwen3.5-35B-A3B-DFlash on an M4 Pro with 64GB. Results differ dramatically depending on thinking mode:
With thinking enabled (math/reasoning prompt):
• 117.7 t/s, 9.49 avg accepted tokens per round
• mlx_lm baseline: 90.5 t/s → 1.30x speedup
With thinking disabled (enable_thinking=False):
• 23-33 t/s, 2.71-3.86 avg accepted tokens per round
• mlx_lm baseline: 90.5 t/s → 0.26-0.36x — worse than autoregressive
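For reference, the speedup figures above follow directly from the measured throughputs; a quick sanity check of the ratios (numbers taken from the benchmarks above):

```python
# Sanity check: speedup = DFlash throughput / mlx_lm autoregressive baseline.
baseline_tps = 90.5          # mlx_lm baseline, tokens/sec

thinking_tps = 117.7         # DFlash with thinking enabled
no_thinking_tps = (23.0, 33.0)  # observed range with enable_thinking=False

speedup_thinking = thinking_tps / baseline_tps          # ≈ 1.30x
speedup_no_thinking = tuple(t / baseline_tps for t in no_thinking_tps)  # ≈ 0.25x–0.36x

print(f"thinking:    {speedup_thinking:.2f}x")
print(f"no thinking: {speedup_no_thinking[0]:.2f}x–{speedup_no_thinking[1]:.2f}x")
```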
The draft model was clearly trained on thinking-mode output distributions. Without thinking tokens in the context, its predictions don't align with the target model's, and nearly all drafts are rejected.
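For context on why the distributions diverge: with Qwen3-style chat templates, enable_thinking=False doesn't just skip reasoning output; it typically injects an empty think block into the assistant prefix, so the draft is conditioned on a prefix it may never have seen in training. A minimal sketch of the two prompt shapes (the template strings here are illustrative, not copied from the actual tokenizer config):

```python
# Illustrative only: approximates how a Qwen3-style chat template changes the
# assistant prefix depending on enable_thinking. The real template lives in the
# tokenizer config; exact whitespace and special tokens may differ.
def build_prompt(user_msg: str, enable_thinking: bool) -> str:
    prompt = f"<|im_start|>user\n{user_msg}<|im_end|>\n<|im_start|>assistant\n"
    if not enable_thinking:
        # "Soft switch": an empty think block signals the model to skip reasoning.
        prompt += "<think>\n\n</think>\n\n"
    return prompt

# With thinking disabled, every draft round starts from this altered prefix.
print(build_prompt("What is 17 * 23?", enable_thinking=False).endswith("</think>\n\n"))
```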
Why this matters for production use:
Tool-calling workloads (Home Assistant voice pipeline, function-calling agents, structured output) require enable_thinking=False. These are exactly the latency-sensitive use cases where DFlash speedup would have the most impact — a voice assistant needs fast first-token response, not fast reasoning chains.
Request:
1. A draft model variant trained with enable_thinking=False for Qwen3.5-35B-A3B — even if acceptance rate is lower than the thinking variant, anything above ~5 avg acceptance would beat baseline
2. Alternatively, a hybrid draft that works acceptably in both modes
Hardware: Apple M4 Pro, 64GB unified memory, MLX 0.31.1, dflash installed from main branch today.
Happy to run any benchmarks you need on Apple Silicon.