Add BatchQuantizedKVCache by xxxkkw · Pull Request #1322 · ml-explore/mlx-lm

xxxkkw · 2026-05-28T16:05:48Z

Summary

Add BatchQuantizedKVCache for non-rotating quantized KV histories.
Allow quantized KV prompt caches to participate in continuous batching via merge, extract, filter, and extend operations.
Fix quantized SDPA mask broadcasting for batched grouped-query attention with left-padded histories, which was exposed by a real Qwen3 batch decode smoke test.
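The batching operations listed above can be illustrated with a minimal NumPy sketch. It is not the code in this PR: the helper names (`merge_left_padded`, `extract`, `filter_batch`, `fake_history`) are hypothetical, and the packed/scales/biases layout of a quantized history is an assumption made for illustration. The idea shown is that each component of a quantized history is left-padded to a common length and stacked along the batch axis, with the per-sequence padding tracked so that a sequence can be recovered later.

```python
import numpy as np

def fake_history(seq_len, n_kv_heads=8, head_dim=128, group_size=64, bits=4):
    # Assumed layout: 4-bit values packed into uint32 words, plus one
    # scale and one bias per quantization group.
    packed = np.ones((1, n_kv_heads, seq_len, head_dim * bits // 32), dtype=np.uint32)
    scales = np.ones((1, n_kv_heads, seq_len, head_dim // group_size), dtype=np.float16)
    biases = np.ones_like(scales)
    return packed, scales, biases

def merge_left_padded(histories):
    """Stack single-sequence histories into one left-padded batch."""
    lengths = [h[0].shape[2] for h in histories]
    max_len = max(lengths)
    left_padding = [max_len - n for n in lengths]
    merged = []
    for parts in zip(*histories):  # packed, then scales, then biases
        padded = [
            np.pad(p, ((0, 0), (0, 0), (pad, 0), (0, 0)))  # zeros on the left of the sequence axis
            for p, pad in zip(parts, left_padding)
        ]
        merged.append(np.concatenate(padded, axis=0))
    return tuple(merged), left_padding

def extract(batch, left_padding, idx):
    """Recover sequence idx from the batch, dropping its left padding."""
    pad = left_padding[idx]
    return tuple(c[idx : idx + 1, :, pad:, :] for c in batch)

def filter_batch(batch, left_padding, keep):
    """Keep only the batch rows listed in keep (e.g. unfinished requests)."""
    return tuple(c[keep] for c in batch), [left_padding[i] for i in keep]

histories = [fake_history(5), fake_history(3)]
batch, left_padding = merge_left_padded(histories)
second = extract(batch, left_padding, 1)
kept, kept_padding = filter_batch(batch, left_padding, [1])
```

An extend operation would follow the same pattern as the merge: pad both batches to the longer sequence length, then concatenate along the batch axis.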

Reproduction

Before this change, _merge_caches([[quantized_cache_1], [quantized_cache_2]]) fails because QuantizedKVCache has no batching support. After the initial cache support, a real Qwen3 batch decode with left-padded quantized histories exposed a second issue in quantized SDPA:

ValueError: [broadcast_shapes] Shapes (2,1,1,66) and (2,8,2,1,66) cannot be broadcast.

The follow-up commit expands batched masks across the grouped-query repeat dimension before applying them to quantized attention scores.
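The shape mismatch in the error above can be reproduced outside MLX. The following NumPy sketch uses the shapes from the error message; the variable names and the choice of inserting the new axis at position -3 are assumptions for illustration, not the PR's code. A per-sequence mask of shape (B, 1, L, S) cannot broadcast against grouped-query scores of shape (B, n_kv_heads, n_repeats, L, S) because the batch axis lines up with the KV-head axis. Inserting a singleton axis for the repeat dimension resolves this.

```python
import numpy as np

B, n_kv_heads, n_repeats, L, S = 2, 8, 2, 1, 66

# Grouped-query attention scores: (B, n_kv_heads, n_repeats, L, S).
scores = np.zeros((B, n_kv_heads, n_repeats, L, S), dtype=np.float32)

# Per-sequence boolean mask, True where a key may be attended to.
# The second sequence has two left-padded positions.
left_padding = np.array([0, 2])
mask = (np.arange(S)[None, :] >= left_padding[:, None])[:, None, None, :]  # (2, 1, 1, 66)

# Right-aligned broadcasting pairs the mask's batch axis (2) with
# the scores' KV-head axis (8), so this fails.
try:
    np.where(mask, scores, -np.inf)
    broadcast_failed = False
except ValueError:
    broadcast_failed = True

# Add a singleton axis for the repeat dimension: (2, 1, 1, 1, 66).
mask5 = np.expand_dims(mask, axis=-3)
masked = np.where(mask5, scores, -np.inf)
```

Unbatched masks of shape (L, S) broadcast without this step, which may be why the problem only surfaced with batched, left-padded histories.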

Environment

macOS Darwin 25.4.0 arm64
Apple M1 Max, 32 GB unified memory
MLX / mlx-lm virtual environment on Apple Silicon
Model: mlx-community/Qwen3-0.6B-4bit

Real-weight benchmark

Protocol: two batched requests with existing prompt-cache histories, prefill_step_size=2048, 4-bit KV quantization with group size 64 for the quantized run. The 100K input / 10K output baseline run was attempted first and failed with Metal OOM, so the reported long-context comparison uses the requested fallback size: 50K input / 5K output per sequence.

Mode	Input tokens	Output tokens	Prefill tokens	Prefill time	Decode time	Generated tokens	Decode TPS	KV cache bytes	Peak memory
BatchKVCache fp16 histories	50,000 / 49,872	5,000 each	99,870	131.87s	389.12s	10,000	25.70 tok/s	11.48 GB	25.00 GB
BatchQuantizedKVCache 4-bit histories	50,000 / 49,872	5,000 each	99,870	210.83s	151.72s	10,000	65.91 tok/s	3.23 GB	13.71 GB

Observed deltas at 50K/5K:

KV cache storage: 11.48 GB → 3.23 GB (71.9% lower)
Peak memory: 25.00 GB → 13.71 GB (45.2% lower)
Decode throughput: 25.70 → 65.91 tok/s (2.56x higher)
Prefill time: 131.87s → 210.83s (1.60x slower), because cache quantization is performed while building the histories.
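The deltas above can be recomputed from the table, and the storage ratio can be compared with a back-of-the-envelope estimate. The estimate assumes each group of 64 quantized values carries one 16-bit scale and one 16-bit bias. That is an assumption about the storage layout, not something stated in this PR.

```python
# Figures copied from the benchmark table.
fp16 = {"kv_gb": 11.48, "peak_gb": 25.00, "prefill_s": 131.87, "decode_s": 389.12}
q4 = {"kv_gb": 3.23, "peak_gb": 13.71, "prefill_s": 210.83, "decode_s": 151.72}
generated_tokens = 10_000

def pct_lower(before, after):
    """Percentage reduction from before to after."""
    return 100.0 * (before - after) / before

kv_saving = pct_lower(fp16["kv_gb"], q4["kv_gb"])
peak_saving = pct_lower(fp16["peak_gb"], q4["peak_gb"])
tps_fp16 = generated_tokens / fp16["decode_s"]
tps_q4 = generated_tokens / q4["decode_s"]
decode_speedup = tps_q4 / tps_fp16
prefill_slowdown = q4["prefill_s"] / fp16["prefill_s"]

# Assumed layout: 4-bit values plus a 16-bit scale and a 16-bit bias
# shared by each group of 64 values.
bits, group_size = 4, 64
bits_per_element = bits + 2 * 16 / group_size  # 4.5 bits
expected_ratio = bits_per_element / 16         # relative to fp16
observed_ratio = q4["kv_gb"] / fp16["kv_gb"]

print(f"KV cache: {kv_saving:.1f}% lower, peak memory: {peak_saving:.1f}% lower")
print(f"Decode: {tps_fp16:.2f} -> {tps_q4:.2f} tok/s ({decode_speedup:.2f}x)")
print(f"Prefill: {prefill_slowdown:.2f}x slower")
print(f"Storage ratio: expected {expected_ratio:.4f}, observed {observed_ratio:.4f}")
```

Under that assumption the expected ratio is 4.5 / 16, which is close to the observed ratio of the two reported cache sizes.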

Test plan

python -m unittest discover -s tests -p test_models.py -k quantized
python -m unittest discover -s tests -p test_prompt_cache.py -k quantized
Real Qwen3 smoke: batch decode two existing quantized prompt-cache histories with mlx-community/Qwen3-0.6B-4bit
Real Qwen3 long-context benchmark above

xxxkkw added 2 commits May 29, 2026 00:05

Add BatchQuantizedKVCache

8ed00d4

Allow quantized KV history caches to participate in continuous batching by adding batch merge, extract, filter, and extend support for non-rotating quantized caches.

Fix quantized SDPA batch masks

37a32d7

Expand batched attention masks across the grouped-query repeat dimension so quantized SDPA works with left-padded batch histories.

xxxkkw wants to merge 2 commits into ml-explore:main from xxxkkw:batch-quantized-kv-cache
