Guard MLA caches from KV quantization by xxxkkw · Pull Request #1323 · ml-explore/mlx-lm

xxxkkw · 2026-05-28T16:30:22Z

Summary

Add an explicit MLACache marker for DeepSeek-style latent MLA caches.
Route DeepSeek V3 and Kimi K2.5 through MLA caches, and mark the latent side of DeepSeek V3.2's CacheList as MLA-specific.
Skip kv_bits conversion for caches that do not support generic KV quantization, while keeping ordinary KV cache quantization unchanged.
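
To illustrate the third point, here is a minimal, library-free sketch of how such a guard could look. It is not the code in this PR: the `supports_kv_quantization` flag and the `quantize_supported_caches` helper are hypothetical names chosen for illustration, and the classes are plain-Python stand-ins that only borrow the names `KVCache`, `MLACache`, and `to_quantized` from the description above.

```python
class QuantizedKVCache:
    """Stand-in for the result of generic KV quantization."""

    def __init__(self, group_size, bits):
        self.group_size = group_size
        self.bits = bits


class KVCache:
    """Stand-in for an ordinary KV cache that can be quantized."""

    supports_kv_quantization = True  # hypothetical flag, not taken from the PR

    def to_quantized(self, group_size=64, bits=4):
        return QuantizedKVCache(group_size, bits)


class MLACache(KVCache):
    """Marker for latent MLA caches; opts out of generic quantization."""

    supports_kv_quantization = False


def quantize_supported_caches(caches, kv_bits, kv_group_size=64):
    """Quantize only the caches that opt in; pass the rest through unchanged."""
    if kv_bits is None:
        return list(caches)
    converted = []
    for cache in caches:
        if getattr(cache, "supports_kv_quantization", False):
            converted.append(
                cache.to_quantized(group_size=kv_group_size, bits=kv_bits)
            )
        else:
            converted.append(cache)
    return converted


caches = quantize_supported_caches(
    [KVCache(), MLACache()], kv_bits=4, kv_group_size=32
)
print([type(c).__name__ for c in caches])
```

In this sketch the ordinary cache is converted while the MLA cache comes back as the same `MLACache` object, which mirrors the intended behavior: ordinary KV quantization is unchanged and MLA caches are left alone.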

Reproduction

Before this change, a tiny DeepSeek V3 config with kv_bits=4, kv_group_size=32, prompt length 3, and prefill_step_size=2 fails on the second step with:

AttributeError: 'list' object has no attribute 'swapaxes'

The failure happens because generic KVCache.to_quantized() turns the MLA rope side channel into a quantized tuple, but the DeepSeek V3 MLA attention path still needs an array for k_pe.swapaxes(...).
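
The failure mode can be reproduced in miniature without MLX. The sketch below is an assumption-laden stand-in, not the real attention code: `FakeArray`, `fake_quantize`, and `attend_with_rope_side_channel` are hypothetical names, and the list returned by `fake_quantize` only imitates the shape of a quantized result (packed data plus scales and biases).

```python
class FakeArray:
    """Tiny stand-in for an array type that exposes swapaxes."""

    def __init__(self, shape):
        self.shape = tuple(shape)

    def swapaxes(self, axis1, axis2):
        dims = list(self.shape)
        dims[axis1], dims[axis2] = dims[axis2], dims[axis1]
        return FakeArray(dims)


def fake_quantize(array):
    """Imitate generic quantization: the array becomes a list of parts."""
    return [array, "scales", "biases"]


def attend_with_rope_side_channel(k_pe):
    """Imitate the MLA path, which expects k_pe to be a plain array."""
    return k_pe.swapaxes(-1, -2)


k_pe = FakeArray((1, 1, 3, 64))

# Unquantized side channel: the array method is available.
print(attend_with_rope_side_channel(k_pe).shape)

# Quantized side channel: a list has no swapaxes, so decode breaks.
try:
    attend_with_rope_side_channel(fake_quantize(k_pe))
except AttributeError as err:
    print(err)
```

The second call fails with the same `AttributeError` text quoted above, because the value reaching `swapaxes` is a list rather than an array.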

After this change, the same repro generates successfully and the cache remains MLACache.

Environment

macOS Darwin 25.4.0 arm64
Apple M1 Max, 32 GB unified memory
Python virtual environment with MLX / mlx-lm test dependencies

Test plan

python -m unittest discover -s tests -p test_models.py -k mla_models_make_mla_caches
python -m unittest discover -s tests -p test_models.py -k deepseek_v3_kv_bits_skips_mla_cache
python -m unittest discover -s tests -p test_models.py -k deepseek_v3
python -m unittest discover -s tests -p test_models.py -k quantized
python -m unittest discover -s tests -p test_prompt_cache.py -k cache_with_generate
Synthetic 1-layer DeepSeek V3.2 and Kimi K2.5 cache-forward check with MLACache

No throughput speedup is claimed here; this is a correctness guard for an invalid quantization path. Generic quantized KV cache coverage continues to pass.

Guard MLA caches from KV quantization

b3e3678

Keep DeepSeek MLA latent caches on the normal cache path so kv_bits does not convert the rope side channel into a quantized tuple and break decode.

xxxkkw wants to merge 1 commit into ml-explore:main from xxxkkw:mla-cache-quant-guard