Add BatchQuantizedKVCache #1322

Open
xxxkkw wants to merge 2 commits into ml-explore:main from xxxkkw:batch-quantized-kv-cache

@xxxkkw xxxkkw commented May 28, 2026

Summary

  • Add BatchQuantizedKVCache for non-rotating quantized KV histories.
  • Allow quantized KV prompt caches to participate in continuous batching via merge, extract, filter, and extend operations.
  • Fix quantized SDPA mask broadcasting for batched grouped-query attention with left-padded histories, which was exposed by a real Qwen3 batch decode smoke test.
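The four batching operations named above can be pictured with a toy model of left-padded histories. This is only a sketch under stated assumptions: `ToyBatchCache`, `merge`, `extract`, `filter_rows`, and `extend` are hypothetical names invented for illustration, the rows are plain Python lists rather than quantized tensors, and none of this is the PR's actual implementation.

```python
from dataclasses import dataclass

PAD = None  # placeholder for a padded (masked-out) position


@dataclass
class ToyBatchCache:
    rows: list          # one left-padded list per sequence, all the same length
    left_padding: list  # number of PAD slots at the start of each row


def _pad_rows(histories):
    # Left-pad every history to the length of the longest one.
    width = max((len(h) for h in histories), default=0)
    rows = [[PAD] * (width - len(h)) + list(h) for h in histories]
    return ToyBatchCache(rows, [width - len(h) for h in histories])


def merge(histories):
    # Combine independent per-request histories into one batch.
    return _pad_rows(histories)


def extract(cache, idx):
    # Recover a single unpadded history from the batch.
    return cache.rows[idx][cache.left_padding[idx]:]


def filter_rows(cache, keep):
    # Keep only the listed batch rows (e.g. drop finished requests).
    return _pad_rows([extract(cache, i) for i in keep])


def extend(a, b):
    # Append the rows of batch b to batch a, re-padding to a common length.
    histories = [extract(a, i) for i in range(len(a.rows))]
    histories += [extract(b, i) for i in range(len(b.rows))]
    return _pad_rows(histories)


batch = merge([[1, 2, 3], [4]])
print(batch.rows, batch.left_padding)
```

In this toy version `filter_rows` and `extend` re-pad to the tightest common length; that is an illustrative choice, and a real cache may trade it off differently to avoid copying.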

Reproduction

Before this change, _merge_caches([[quantized_cache_1], [quantized_cache_2]]) fails because QuantizedKVCache has no batching support. Once the initial batching support was in place, a real Qwen3 batch decode with left-padded quantized histories exposed a second issue in quantized SDPA:

ValueError: [broadcast_shapes] Shapes (2,1,1,66) and (2,8,2,1,66) cannot be broadcast.

The follow-up commit expands batched masks across the grouped-query repeat dimension before applying them to quantized attention scores.
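The shape mismatch in the error above can be reproduced with shape arithmetic alone. The sketch below is a pure-Python illustration, not the code from the commit: `broadcast_shapes` re-implements the standard trailing-axis broadcasting rule, and `expand_mask_shape` is a hypothetical helper that assumes the mask is laid out as (batch, 1, query_len, key_len) while the scores are laid out as (batch, kv_heads, repeats, query_len, key_len).

```python
def broadcast_shapes(*shapes):
    """Trailing-axis broadcasting: aligned sizes must match or be 1."""
    ndim = max(len(s) for s in shapes)
    # Shorter shapes are padded with leading 1s before comparison.
    padded = [(1,) * (ndim - len(s)) + tuple(s) for s in shapes]
    out = []
    for dims in zip(*padded):
        sizes = {d for d in dims if d != 1}
        if len(sizes) > 1:
            raise ValueError(f"Shapes {shapes} cannot be broadcast.")
        out.append(sizes.pop() if sizes else 1)
    return tuple(out)


def expand_mask_shape(mask_shape, scores_shape):
    """Insert a size-1 repeat axis after the head axis of a 4-D mask.

    Hypothetical helper: only handles a 4-D mask against 5-D scores.
    """
    if len(mask_shape) == 4 and len(scores_shape) == 5:
        return mask_shape[:2] + (1,) + mask_shape[2:]
    return mask_shape


mask = (2, 1, 1, 66)       # shapes taken from the error message above
scores = (2, 8, 2, 1, 66)

try:
    broadcast_shapes(mask, scores)
except ValueError as err:
    print("before:", err)

fixed = expand_mask_shape(mask, scores)
print("after:", fixed, "->", broadcast_shapes(fixed, scores))
```

The 4-D mask is implicitly padded to (1, 2, 1, 1, 66), which lines its batch axis up against the 8 KV heads. With a batch size of 1 that axis is 1 and broadcasts cleanly, which is consistent with the failure only appearing in batched decode.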

Environment

  • macOS Darwin 25.4.0 arm64
  • Apple M1 Max, 32 GB unified memory
  • MLX / mlx-lm virtual environment on Apple Silicon
  • Model: mlx-community/Qwen3-0.6B-4bit

Real-weight benchmark

Protocol: two batched requests with existing prompt-cache histories, prefill_step_size=2048, 4-bit KV quantization with group size 64 for the quantized run. The 100K input / 10K output baseline run was attempted first and failed with Metal OOM, so the reported long-context comparison uses the requested fallback size: 50K input / 5K output per sequence.

| Mode | Input tokens | Output tokens | Prefill tokens | Prefill time | Decode time | Generated tokens | Decode TPS | KV cache bytes | Peak memory |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BatchKVCache fp16 histories | 50,000 / 49,872 | 5,000 each | 99,870 | 131.87s | 389.12s | 10,000 | 25.70 tok/s | 11.48 GB | 25.00 GB |
| BatchQuantizedKVCache 4-bit histories | 50,000 / 49,872 | 5,000 each | 99,870 | 210.83s | 151.72s | 10,000 | 65.91 tok/s | 3.23 GB | 13.71 GB |

Observed deltas at 50K/5K:

  • KV cache storage: 11.48 GB → 3.23 GB (71.9% lower)
  • Peak memory: 25.00 GB → 13.71 GB (45.2% lower)
  • Decode throughput: 25.70 → 65.91 tok/s (2.56x higher)
  • Quantized prefill is slower because cache quantization is performed while building the histories.
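The KV cache storage delta lines up with a simple per-element estimate. The sketch below assumes, as a labeled assumption rather than something stated in this PR, that each quantization group stores one 16-bit scale and one 16-bit bias alongside its packed 4-bit values; `bits_per_element` is a hypothetical helper name.

```python
def bits_per_element(bits, group_size, scale_bits=16, bias_bits=16):
    # Packed value bits plus the per-group scale and bias, amortized
    # over the elements of the group (assumed storage layout).
    return bits + (scale_bits + bias_bits) / group_size


quantized = bits_per_element(bits=4, group_size=64)
reduction = 1 - quantized / 16  # relative to 16-bit (fp16) storage
print(f"{quantized} bits/element, {reduction:.1%} lower than fp16")
```

Under that assumption, 4-bit quantization with group size 64 costs 4.5 bits per element, so the expected saving versus fp16 is about 71.9%, in line with the measured 11.48 GB → 3.23 GB.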

Test plan

  • python -m unittest discover -s tests -p test_models.py -k quantized
  • python -m unittest discover -s tests -p test_prompt_cache.py -k quantized
  • Real Qwen3 smoke: batch decode two existing quantized prompt-cache histories with mlx-community/Qwen3-0.6B-4bit
  • Real Qwen3 long-context benchmark above

xxxkkw added 2 commits May 29, 2026 00:05
  • Allow quantized KV history caches to participate in continuous batching by adding batch merge, extract, filter, and extend support for non-rotating quantized caches.
  • Expand batched attention masks across the grouped-query repeat dimension so quantized SDPA works with left-padded batch histories.