Add BatchQuantizedKVCache #1322
Open
xxxkkw wants to merge 2 commits into
Allow quantized KV history caches to participate in continuous batching by adding batch merge, extract, filter, and extend support for non-rotating quantized caches.
Expand batched attention masks across the grouped-query repeat dimension so quantized SDPA works with left-padded batch histories.
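A toy sketch of the merge and extract idea described above, using plain Python lists in place of cache arrays. This is not the PR's implementation: `merge_left_padded`, `extract`, and `PAD` are hypothetical names, and a real quantized cache would presumably apply the same alignment to each of its packed arrays (for example data, scales, and biases) rather than to a list of tokens.

```python
from typing import List, Optional, Sequence, Tuple

PAD: Optional[str] = None  # stands in for a padded (masked-out) cache slot


def merge_left_padded(
    histories: Sequence[Sequence[str]],
) -> Tuple[List[List[Optional[str]]], List[int]]:
    """Right-align variable-length histories into one rectangular batch."""
    max_len = max(len(h) for h in histories)
    # Shorter histories get padding on the left so the newest token of
    # every sequence lands in the last column.
    left_padding = [max_len - len(h) for h in histories]
    batch = [[PAD] * pad + list(h) for pad, h in zip(left_padding, histories)]
    return batch, left_padding


def extract(
    batch: List[List[Optional[str]]], left_padding: List[int], index: int
) -> List[Optional[str]]:
    """Recover one sequence's history by dropping its left padding."""
    return batch[index][left_padding[index]:]


batch, left_padding = merge_left_padded([["a0", "a1", "a2", "a3"], ["b0", "b1"]])
print(left_padding)                     # [0, 2]
print(batch[1])                         # [None, None, 'b0', 'b1']
print(extract(batch, left_padding, 1))  # ['b0', 'b1']
```

The per-sequence `left_padding` values are what the attention mask later has to account for, since padded slots must not receive attention weight.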
Summary
BatchQuantizedKVCache for non-rotating quantized KV histories.
Reproduction
Before this change, _merge_caches([[quantized_cache_1], [quantized_cache_2]]) fails because QuantizedKVCache has no batching support. After the initial cache support, a real Qwen3 batch decode with left-padded quantized histories exposed a second issue in quantized SDPA.
The follow-up commit expands batched masks across the grouped-query repeat dimension before applying them to quantized attention scores.
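The shape reasoning behind the mask fix can be sketched with a small broadcast check. The layouts here are assumptions made for illustration, not taken from the PR: scores are assumed to be 5-D, (B, n_kv_heads, n_repeats, L, S), and the batched mask 4-D, (B, 1, L, S). `broadcast_shape` and `expand_mask_shape` are hypothetical helpers that follow the usual NumPy-style broadcasting rules.

```python
from itertools import zip_longest
from typing import Optional, Tuple

Shape = Tuple[int, ...]


def broadcast_shape(a: Shape, b: Shape) -> Optional[Shape]:
    """Return the broadcast shape of a and b, or None if they are incompatible."""
    out = []
    # Compare dimensions from the trailing axis backwards; missing axes act as 1.
    for x, y in zip_longest(reversed(a), reversed(b), fillvalue=1):
        if x != y and x != 1 and y != 1:
            return None
        out.append(max(x, y))
    return tuple(reversed(out))


def expand_mask_shape(mask: Shape) -> Shape:
    """Insert a singleton repeat axis: (B, H, L, S) -> (B, H, 1, L, S)."""
    return mask[:-2] + (1,) + mask[-2:]


B, n_kv_heads, n_repeats, L, S = 2, 8, 2, 1, 16
scores = (B, n_kv_heads, n_repeats, L, S)  # assumed grouped-query score layout
mask = (B, 1, L, S)                        # assumed per-sequence batched mask

# Without the extra axis, the batch axis of the mask lines up with n_kv_heads.
print(broadcast_shape(mask, scores))                     # None
# With the singleton repeat axis, every axis lines up.
print(broadcast_shape(expand_mask_shape(mask), scores))  # (2, 8, 2, 1, 16)
```

A mask with a batch size of 1 would broadcast either way, which may be why the problem only surfaces once histories are batched.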
Environment
mlx-community/Qwen3-0.6B-4bit
Real-weight benchmark
Protocol: two batched requests with existing prompt-cache histories, prefill_step_size=2048, 4-bit KV quantization with group size 64 for the quantized run. The 100K input / 10K output baseline run was attempted first and failed with Metal OOM, so the reported long-context comparison uses the requested fallback size: 50K input / 5K output per sequence.
Observed deltas at 50K/5K:
Test plan
python -m unittest discover -s tests -p test_models.py -k quantized
python -m unittest discover -s tests -p test_prompt_cache.py -k quantized
mlx-community/Qwen3-0.6B-4bit