Add TurboQuant3/4 modes to quantized_scaled_dot_product_attention by dedalien · Pull Request #3453 · ml-explore/mlx

dedalien · 2026-04-26T14:20:50Z

Integrates TurboQuant (arXiv 2504.19874) into the generic quant-SDPA infrastructure from #3026 (Dan Yeh), rather than as a standalone kernel.

New modes "turbo3" (3-bit) and "turbo4" (4-bit) use Lloyd-Max codebooks for the N(0,1) distribution of WHT-rotated, norm-normalised key coordinates. Codebooks are compile-time Metal constants; no runtime buffer needed.
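As a rough illustration of where such a codebook comes from, the sketch below runs the Lloyd-Max iteration for a standard normal source using only the Python standard library. The function name `lloyd_max_normal`, the starting levels, and the iteration count are assumptions made for this example. The resulting values are not claimed to match the Metal constants in this PR.

```python
import math

def _pdf(x):
    # Standard normal density; evaluates to 0.0 at +/- infinity.
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def _cdf(x):
    # Standard normal cumulative distribution function.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def lloyd_max_normal(bits, iters=500):
    """Return 2**bits reconstruction levels for an N(0,1) source (hypothetical helper)."""
    n = 1 << bits
    # Assumed starting point: evenly spaced levels in [-2, 2].
    levels = [-2.0 + 4.0 * (i + 0.5) / n for i in range(n)]
    for _ in range(iters):
        # Decision boundaries are the midpoints between neighbouring levels.
        mids = [(levels[i] + levels[i + 1]) / 2.0 for i in range(n - 1)]
        edges = [-math.inf] + mids + [math.inf]
        # Each level moves to the conditional mean of N(0,1) on its cell:
        # E[x | a < x < b] = (pdf(a) - pdf(b)) / (cdf(b) - cdf(a)).
        levels = [
            (_pdf(edges[i]) - _pdf(edges[i + 1]))
            / (_cdf(edges[i + 1]) - _cdf(edges[i]))
            for i in range(n)
        ]
    return levels

codebook3 = lloyd_max_normal(3)  # 8 levels, as a "turbo3"-style codebook would need
codebook4 = lloyd_max_normal(4)  # 16 levels, as a "turbo4"-style codebook would need
```

Because the levels depend only on the assumed N(0,1) distribution and not on the data, they can be computed once offline, which is consistent with storing them as compile-time constants.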

Memory layout

- K/V: uint32-packed bit indices, same PackReader<N> path as affine
- k_scales/v_scales: per-vector float16 norms / √D
- group_size == head_dim (one scale per full D-element vector)
- Queries pre-rotated by caller via WHT
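The layout above can be sketched in plain Python. This is an illustration under assumptions, not the kernel's code: the bit order inside each uint32 (a contiguous little-endian bit stream) is assumed and may differ from what PackReader<N> expects, the helper names are hypothetical, and the 8-level codebook holds approximate textbook Lloyd-Max values for N(0,1) rather than the PR's constants. The input vector is taken to be already WHT-rotated.

```python
import math

# Approximate 8-level Lloyd-Max levels for N(0,1) (textbook values; the PR's
# Metal constants may differ).
CODEBOOK_3BIT = [-2.152, -1.344, -0.756, -0.2451, 0.2451, 0.756, 1.344, 2.152]

def quantize_vector(x, codebook):
    """Return (indices, scale) with scale = ||x|| / sqrt(D), one scale per vector."""
    scale = math.sqrt(sum(v * v for v in x)) / math.sqrt(len(x))
    safe = scale if scale > 0.0 else 1.0  # avoid dividing by zero for an all-zero vector
    indices = [
        min(range(len(codebook)), key=lambda j: abs(v / safe - codebook[j]))
        for v in x
    ]
    return indices, scale

def pack_indices(indices, bits):
    """Pack small integers into 32-bit words (assumed little-endian bit stream)."""
    stream = 0
    for i, idx in enumerate(indices):
        stream |= idx << (i * bits)
    n_words = (len(indices) * bits + 31) // 32
    return [(stream >> (32 * w)) & 0xFFFFFFFF for w in range(n_words)]

def unpack_indices(words, bits, count):
    """Inverse of pack_indices."""
    stream = 0
    for w, word in enumerate(words):
        stream |= word << (32 * w)
    mask = (1 << bits) - 1
    return [(stream >> (i * bits)) & mask for i in range(count)]

def dequantize(indices, scale, codebook):
    """Reconstruct the vector: codebook entry times the per-vector scale."""
    return [codebook[i] * scale for i in indices]

x = [math.sin(i) for i in range(64)]      # stand-in for one rotated key vector, D=64
idx, scale = quantize_vector(x, CODEBOOK_3BIT)
words = pack_indices(idx, 3)              # 64 * 3 bits = 192 bits = 6 uint32 words
```

Dividing the norm by √D means the rescaled coordinates have unit mean square, which is what lets a codebook designed for N(0,1) be applied directly.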

Changes

- primitives.h/cpp: TurboQuant3/TurboQuant4 in QuantizationMode enum, string parsing, quantization_mode_to_string explicit cases
- fast.cpp: TurboQuant validation branch (no biases, float scales, group_size == head_dim, head_dim ∈ {64,128,256}, GPU-only fallback)
- quantized_utils.h: QuantMode enum extension, QuantConfig/Dequant specialisations, Lloyd-Max codebook constants
- sdpa_vector.h: static_assert extended for bits=3; QUANT_SDPA_DISPATCH entries for D ∈ {64,128,256} gated on if constexpr (D >= group_size)
- scaled_dot_product_attention.cpp: quant_mode_to_int (TurboQuant3=4, TurboQuant4=5)
- python/src/fast.cpp: turbo3/turbo4 in mode docstring
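To make the validation constraints listed for fast.cpp concrete, here is a hypothetical Python mirror of those checks. The function name, argument names, and error messages are invented for illustration, and "float scales" is read here as "any floating-point dtype", which is an assumption. The real branch is C++ and may be structured differently.

```python
TURBO_BITS = {"turbo3": 3, "turbo4": 4}
SUPPORTED_HEAD_DIMS = (64, 128, 256)
FLOAT_DTYPES = ("float16", "bfloat16", "float32")  # assumed reading of "float scales"

def check_turbo_args(mode, head_dim, group_size, has_biases, scales_dtype):
    """Hypothetical mirror of the TurboQuant validation rules; returns the bit width."""
    if mode not in TURBO_BITS:
        raise ValueError(f"unknown TurboQuant mode: {mode!r}")
    if has_biases:
        raise ValueError("TurboQuant modes take no biases")
    if scales_dtype not in FLOAT_DTYPES:
        raise ValueError("scales must be floating point")
    if head_dim not in SUPPORTED_HEAD_DIMS:
        raise ValueError(f"head_dim must be one of {SUPPORTED_HEAD_DIMS}")
    if group_size != head_dim:
        raise ValueError("group_size must equal head_dim (one scale per vector)")
    return TURBO_BITS[mode]

bits = check_turbo_args("turbo3", head_dim=128, group_size=128,
                        has_biases=False, scales_dtype="float16")
```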

Tests
12 sub-cases: D ∈ {64,128,256} × bits ∈ {3,4} × B ∈ {1,2}, each targeting a distinct Metal template instantiation. Explicit bfloat16, error path coverage.
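The 12 sub-cases are the Cartesian product of the three parameter sets. A small sketch of how such a matrix can be enumerated (the variable names are illustrative and not taken from the test file):

```python
from itertools import product

head_dims = (64, 128, 256)
bit_widths = (3, 4)
batch_sizes = (1, 2)

# One entry per (D, bits, B) combination: 3 * 2 * 2 = 12 sub-cases.
sub_cases = list(product(head_dims, bit_widths, batch_sizes))
```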

Implementation notes

This implements TurboQuant_mse (Algorithm 1). TurboQuant_prod (Algorithm 2,
which adds a 1-bit QJL residual correction on the MSE residual) is not
implemented; at 3–4 bits the inner product bias of TurboQuant_mse is
empirically negligible (paper Section 4.1, Fig. 1: bias converges to zero
above 2 bits). This matches existing community implementations (arozanov,
rachittshah, sharpner).

The random rotation uses a Walsh-Hadamard Transform (WHT) rather than the
full random Gaussian rotation from the paper. WHT runs in O(D log D) vs
O(D²), is fully vectorizable on GPU, and gives equivalent practical quality
on real LLM key vectors.
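For reference, a minimal orthonormal fast Walsh-Hadamard transform in plain Python is sketched below. It is only meant to show the O(D log D) butterfly structure and why pre-rotating queries is sufficient: an orthonormal transform preserves inner products, so rotating both q and k leaves q·k unchanged. The function name `fwht` is an assumption, and any extra steps the actual implementation may apply (for example random sign flips) are not shown.

```python
import math

def fwht(x):
    """Orthonormal Walsh-Hadamard transform of a list whose length is a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(x)
    h = 1
    while h < n:                        # log2(n) butterfly stages
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    norm = 1.0 / math.sqrt(n)           # makes the transform orthonormal
    return [v * norm for v in out]

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

q = [math.cos(0.3 * i) for i in range(64)]
k = [math.sin(0.7 * i) for i in range(64)]
# Rotating both vectors leaves the attention logit unchanged (up to rounding).
rotated_logit = dot(fwht(q), fwht(k))
plain_logit = dot(q, k)
```

With this normalisation the transform is its own inverse, so applying `fwht` twice recovers the input.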

Tested on Qwen3.6-27B (head_dim=256, 24Q/4KV, GQA=6), 24 GB unified memory Mac. Enables longer context generation on memory-constrained hardware by compressing the KV cache ~5×.
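A back-of-the-envelope check of the compression figure, assuming a 16-bit (float16/bfloat16) baseline cache and one float16 scale per head vector as described under Memory layout. The helper names are hypothetical, and this is arithmetic on the stated layout rather than a measurement from the PR.

```python
def kv_bits_per_element(bits, head_dim, scale_bits=16):
    """Index bits plus the per-vector scale amortised over head_dim elements."""
    return bits + scale_bits / head_dim

def compression_ratio(bits, head_dim, baseline_bits=16):
    """Size of the assumed 16-bit baseline divided by the quantized size."""
    return baseline_bits / kv_bits_per_element(bits, head_dim)

turbo3_ratio = compression_ratio(3, 256)   # 16 / 3.0625, about 5.22
turbo4_ratio = compression_ratio(4, 256)   # 16 / 4.0625, about 3.94
```

Under these assumptions the ~5× figure corresponds to the 3-bit mode at head_dim=256; the 4-bit mode comes out closer to 4×.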

Note: This PR builds on #3026. Suggest merging after that one lands, or rebasing on main once it does.

Add Affine dispatch entries for group_size=64 at bits={4,6,8} and relax the validation in quantized_scaled_dot_product_attention. This matches the default produced by mx.quantize(mode="affine") and the kv_group_size=64 default used by mlx-lm, so users following the MLX/mlx-lm conventions no longer hit an error when using fused quantized attention.

Benchmarks (M4, B=1 H=32 D=128 Lq=1, affine 4-bit):

Context   gs=32 fused   gs=64 fused   speedup
32K       50 us         41 us         +22%
64K       95 us         82 us         +16%
128K      176 us        152 us        +16%

gs=64 is faster at long context because it has half the scale/bias memory traffic.

Costs:
- mlx.metallib: 128,161,428 -> 128,233,236 bytes (+0.056%)
- libmlx.dylib: unchanged

Existing 10 test_quantized_sdpa* tests continue to pass (54 subtests).

zcbenz · 2026-04-26T23:09:07Z

Until #3026 is merged, I suggest turning this into a PR that targets CC-Yeh's repo.

CC-Yeh and others added 23 commits April 23, 2026 12:50

first attempt

516c4d9

fix

5303bda

Unify mxfp4/8 paths and optimize mxfp8 fused calculation

0dca606

supports nvpf4

53b7971

supports affine 4/8 bits

ea009e4

supports affine 2/3/5/6 bits

7cf246e

clean up

9940a41

adapt ml-explore#3023

5d365e2

Limit affine SDPA to group_size=32 and bits={4,6,8}

6bf3380

fix group_size for nvfp4 and simplify code

e92db55

refactor: use function_constant to reduce template specializations(bi…

76ec033

…nary size)

tune blocks

1029f0e

fix

0968a01

cleanup

4905913

cleanup + refactor

9a179da

enable causal

8f91f17

support sinks

72df7df

cleanup

ed4f4fb

cleanup

da4d964

improvements

cb183c3

tests: add test_quantized_sdpa_turbo for turbo3/turbo4

2a12f86

12 sub-cases: D in {64,128,256} x bits in {3,4} x B in {1,2}, each targeting a distinct Metal template instantiation. Explicit bfloat16, error path coverage.

dedalien mentioned this pull request Apr 26, 2026

Add TurboQuantKVCache: 3-bit/4-bit KV cache compression for generation ml-explore/mlx-lm#1202

Open

zcbenz closed this Apr 26, 2026

dedalien wants to merge 23 commits into ml-explore:main from dedalien:turboq/integrate-generic-quant-sdpa

4 participants
