TurboQuant compresses standard transformer KV caches.
| Model Family | Status | Notes |
|---|---|---|
| Llama-3 / 3.1 / 3.2 | Full support | GQA-aware mode recommended |
| Mistral / Mixtral | Full support | Sliding window auto-detected |
| Gemma / Gemma 2 | Full support | |
| Qwen2.5 / Qwen3 | Full support | |
| Phi-3 / Phi-4 | Full support | |
| Command-R | Full support | |
| DeepSeek-V2/V3 | Skip MLA layers | KV already compressed by MLA |
| Qwen3.5 / Jamba | Attention layers only | Non-attention layers skipped |
| T5 / BART / mBART | Partial | Self-attention KV only |
| Mamba / RWKV | Not applicable | No KV cache (SSM/RNN) |
turboquant.wrap() and TurboQuantDynamicCache.from_model() automatically:
- Detect head_dim, GQA ratio, layer count
- Identify hybrid layers (skip non-attention)
- Select optimal bit_width (3-bit for head_dim>=128, 4-bit otherwise)
- Detect MLA (DeepSeek) and raise informative error
For unsupported or exotic architectures:
from turboquant import TurboQuantKVCache
cache = TurboQuantKVCache(
head_dim=128, # set manually
bit_width=3,
residual_length=0,
)