Tested the new MLX backend (merged today) against mlx-community/Qwen3.5-35B-A3B-4bit + z-lab/Qwen3.5-35B-A3B-DFlash on an M4 Pro with 64GB. Results differ dramatically depending on thinking mode:
With thinking enabled (math/reasoning prompt):
• 117.7 t/s, 9.49 avg accepted tokens per round
• mlx_lm baseline: 90.5 t/s → 1.30x speedup
With thinking disabled (enable_thinking=False):
• 23-33 t/s, 2.71-3.86 avg accepted tokens per round
• mlx_lm baseline: 90.5 t/s → 0.26-0.36x — worse than autoregressive
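For reference, the speedup figures above follow directly from the measured throughputs; a quick sanity check of the ratios (numbers taken from the benchmarks above):

```python
# Sanity check: speedup = DFlash throughput / mlx_lm autoregressive baseline.
baseline_tps = 90.5          # mlx_lm baseline, tokens/sec

thinking_tps = 117.7         # DFlash with thinking enabled
no_thinking_tps = (23.0, 33.0)  # observed range with enable_thinking=False

speedup_thinking = thinking_tps / baseline_tps          # ≈ 1.30x
speedup_no_thinking = tuple(t / baseline_tps for t in no_thinking_tps)  # ≈ 0.25x–0.36x

print(f"thinking:    {speedup_thinking:.2f}x")
print(f"no thinking: {speedup_no_thinking[0]:.2f}x–{speedup_no_thinking[1]:.2f}x")
```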
The draft model was clearly trained on thinking-mode output distributions. Without thinking tokens in the context, its predictions don't align with the target model's, and nearly all drafts are rejected.
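For context on why the distributions diverge: with Qwen3-style chat templates, enable_thinking=False doesn't just skip reasoning output; it typically injects an empty think block into the assistant prefix, so the draft is conditioned on a prefix it may never have seen in training. A minimal sketch of the two prompt shapes (the template strings here are illustrative, not copied from the actual tokenizer config):

```python
# Illustrative only: approximates how a Qwen3-style chat template changes the
# assistant prefix depending on enable_thinking. The real template lives in the
# tokenizer config; exact whitespace and special tokens may differ.
def build_prompt(user_msg: str, enable_thinking: bool) -> str:
    prompt = f"<|im_start|>user\n{user_msg}<|im_end|>\n<|im_start|>assistant\n"
    if not enable_thinking:
        # "Soft switch": an empty think block signals the model to skip reasoning.
        prompt += "<think>\n\n</think>\n\n"
    return prompt

# With thinking disabled, every draft round starts from this altered prefix.
print(build_prompt("What is 17 * 23?", enable_thinking=False).endswith("</think>\n\n"))
```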
Why this matters for production use:
Tool-calling workloads (Home Assistant voice pipeline, function-calling agents, structured output) require enable_thinking=False. These are exactly the latency-sensitive use cases where DFlash speedup would have the most impact — a voice assistant needs fast first-token response, not fast reasoning chains.
Request:
1. A draft model variant trained with enable_thinking=False for Qwen3.5-35B-A3B — even if acceptance rate is lower than the thinking variant, anything above ~5 avg acceptance would beat baseline
2. Alternatively, a hybrid draft that works acceptably in both modes
Hardware: Apple M4 Pro, 64GB unified memory, MLX 0.31.1, dflash installed from main branch today.
Happy to run any benchmarks you need on Apple Silicon.