Guard MLA caches from KV quantization #1323

Open
xxxkkw wants to merge 1 commit into ml-explore:main from xxxkkw:mla-cache-quant-guard

xxxkkw commented May 28, 2026

Summary

  • Add an explicit MLACache marker for DeepSeek-style latent MLA caches.
  • Route DeepSeek V3 and Kimi K2.5 through MLA caches, and mark the latent side of DeepSeek V3.2's CacheList as MLA-specific.
  • Skip kv_bits conversion for caches that do not support generic KV quantization, while keeping ordinary KV cache quantization unchanged.
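
The guard described in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `supports_kv_quantization` flag, the `maybe_quantize` helper, and the simplified class bodies are all hypothetical stand-ins, and only the names `KVCache`, `MLACache`, `to_quantized`, `kv_bits`, and `kv_group_size` come from the description.

```python
# Hypothetical stand-ins; these are NOT the actual mlx-lm classes or signatures.


class QuantizedKVCache:
    """Result of quantizing an ordinary cache (simplified)."""

    def __init__(self, group_size, bits):
        self.group_size = group_size
        self.bits = bits
        self.offset = 0


class KVCache:
    """Ordinary cache: generic KV quantization is allowed."""

    supports_kv_quantization = True  # assumed flag name, for illustration only

    def __init__(self):
        self.offset = 0

    def to_quantized(self, group_size=64, bits=4):
        quantized = QuantizedKVCache(group_size, bits)
        quantized.offset = self.offset
        return quantized


class MLACache(KVCache):
    """Marker subclass for latent MLA caches; opts out of quantization."""

    supports_kv_quantization = False


def maybe_quantize(prompt_cache, kv_bits, kv_group_size=64):
    """Quantize, in place, only the cache entries that allow it."""
    if kv_bits is None:
        return
    for i, c in enumerate(prompt_cache):
        # Skip caches that do not support generic KV quantization.
        if not getattr(c, "supports_kv_quantization", False):
            continue
        # Already-quantized entries have no to_quantized() and are left alone.
        if hasattr(c, "to_quantized"):
            prompt_cache[i] = c.to_quantized(group_size=kv_group_size, bits=kv_bits)


cache = [KVCache(), MLACache()]
maybe_quantize(cache, kv_bits=4, kv_group_size=32)
print([type(c).__name__ for c in cache])
# ['QuantizedKVCache', 'MLACache']
```

The point of a marker type is that the decision lives on the cache rather than in per-model special cases: ordinary caches still convert, while anything marked as MLA stays on the unquantized path.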

Reproduction

Before this change, a tiny DeepSeek V3 config with kv_bits=4, kv_group_size=32, prompt length 3, and prefill_step_size=2 fails on the second step with:

AttributeError: 'list' object has no attribute 'swapaxes'

The failure happens because generic KVCache.to_quantized() turns the MLA rope side channel into a quantized tuple, but the DeepSeek V3 MLA attention path still needs an array for k_pe.swapaxes(...).
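
The shape of this failure can be reproduced with a toy model of chunked prefill. Everything below is a hypothetical stand-in written for illustration (`Array`, `ToyCache`, `attention_step`, `quantize`, and `prefill` are not mlx-lm code), and it assumes the cache is quantized between prefill steps, which is why the second step is the first one to see the converted state.

```python
# Hypothetical stand-ins that mimic only the shape of the failure.


class Array:
    """Stand-in for an array type that has a swapaxes method."""

    def __init__(self, data):
        self.data = data

    def swapaxes(self, a, b):
        return self  # no real transposition; only the attribute matters here


class ToyCache:
    def __init__(self):
        self.k_pe = None  # rope side channel kept by the MLA path


def attention_step(cache, chunk):
    """The MLA path reads the stored k_pe as an array, then stores the new chunk."""
    if cache.k_pe is not None:
        cache.k_pe.swapaxes(-1, -2)  # fails if k_pe is no longer an array
    cache.k_pe = Array(chunk)


def quantize(cache):
    """Generic conversion: the array becomes a container of quantized parts."""
    cache.k_pe = [cache.k_pe, "scales", "biases"]


def prefill(prompt, prefill_step_size, kv_bits, guard):
    cache = ToyCache()
    for start in range(0, len(prompt), prefill_step_size):
        attention_step(cache, prompt[start:start + prefill_step_size])
        # Assumed ordering: quantization runs after each prefill step.
        if kv_bits is not None and not guard:
            quantize(cache)
    return cache


try:
    prefill([1, 2, 3], prefill_step_size=2, kv_bits=4, guard=False)
except AttributeError as err:
    print(err)  # 'list' object has no attribute 'swapaxes'

guarded = prefill([1, 2, 3], prefill_step_size=2, kv_bits=4, guard=True)
print(type(guarded.k_pe).__name__)  # Array
```

With a prompt of length 3 and a step size of 2, the first step stores an array and the conversion then replaces it, so the second step is where the attribute lookup breaks. With the guard enabled the conversion never runs and the side channel stays an array.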

After this change, the same repro generates successfully and the cache remains MLACache.

Environment

  • macOS Darwin 25.4.0 arm64
  • Apple M1 Max, 32 GB unified memory
  • Python virtual environment with MLX / mlx-lm test dependencies

Test plan

  • python -m unittest discover -s tests -p test_models.py -k mla_models_make_mla_caches
  • python -m unittest discover -s tests -p test_models.py -k deepseek_v3_kv_bits_skips_mla_cache
  • python -m unittest discover -s tests -p test_models.py -k deepseek_v3
  • python -m unittest discover -s tests -p test_models.py -k quantized
  • python -m unittest discover -s tests -p test_prompt_cache.py -k cache_with_generate
  • Synthetic 1-layer DeepSeek V3.2 and Kimi K2.5 cache-forward check with MLACache

No throughput speedup is claimed here; this is a correctness guard for an invalid quantization path. Generic quantized KV cache coverage continues to pass.

Keep DeepSeek MLA latent caches on the normal cache path so kv_bits does not convert the rope side channel into a quantized tuple and break decode.