[Experiment] QMV LUT Dequant#394

Draft
CC-Yeh wants to merge 13 commits into main from qmv_lut

CC-Yeh commented May 8, 2026

Tested replacing QmvFast's pure-ALU int4→float dequant (the uint_to_fp
mantissa trick: nibble extract via shift/mask, then bit-OR and fsub) with
a threadgroup-memory LUT lookup. No win at any tested shape, and a mild
E2E regression on real models.
The mantissa trick stays optimal.

Headline (Apple M4, ZP_BF16_gs64, 4-bit, kernel µs medians)

| Shape | M | main (µs) | LUT (µs) | Δ |
|---|---|---|---|---|
| (4096, 4096) | 1 | 80.7 | 81.7 | +1% (tied) |
| (4096, 4096) | 2 | 83.0 | 123.2 | +48% |
| (4096, 4096) | 4 | 150.5 | 241.0 | +60% |
| (14336, 4096) | 1 | 314.4 | 305.8 | −2.7% (tied, within noise) |

E2E LFM2.5 4-bit decode: −1.8% (RHT), −3.5% (MLX).

What was tried

  • Constant-memory LUT: divergent loads serialize. +73–130%.
  • Threadgroup-memory LUT: partial recovery, but still +48–60% at M≥2.
  • bfloat2 entries: same wallclock as half2.
  • Manual bf << 16 widen: eliminates air.convert from AIR
    (verified) but no wallclock change. Convert wasn't the cost.
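
The "bf << 16 widen" item can be illustrated on the CPU. This is a minimal sketch, not the kernel code: it assumes the usual bfloat16 layout (the top 16 bits of an IEEE float32), and the helper names are hypothetical.

```python
import struct


def f32_to_bf16_bits(x: float) -> int:
    """Truncate a float32 to its top 16 bits (bfloat16, round toward zero)."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16


def bf16_bits_to_f32(bf: int) -> float:
    """Widen bfloat16 bits to float32 by shifting them into the high half.

    This is the manual `bf << 16` widen: a pure integer shift followed by a
    bit reinterpretation, with no convert instruction involved.
    """
    (value,) = struct.unpack("<f", struct.pack("<I", (bf & 0xFFFF) << 16))
    return value


# Values exactly representable in bfloat16 survive the round trip.
for v in (1.0, -2.0, 0.5):
    assert bf16_bits_to_f32(f32_to_bf16_bits(v)) == v
```

The shift is exact because bfloat16 shares float32's sign and exponent fields, so widening only appends sixteen zero mantissa bits.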

Verdict

The actual cost is L1 cache port contention: LUT reads compete
with weight loads at the same L1 read port. The mantissa trick has
zero memory ops in the dequant chain (pure ALU, extracts nibbles
and converts to float in one fused bit-twiddle), so it doesn't fight
for the port. No LUT variant can beat that on Apple GPU.
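
The two dequant styles compared above can be sketched on the CPU as follows. This is an illustration under stated assumptions rather than the QmvFast kernel: it uses the float32 magic constant 2^23 (a half-precision kernel would use a different constant), assumes the low nibble holds the first weight, and assumes the dequant formula `(q - zero_point) * scale`. All function names are hypothetical.

```python
import struct

MAGIC_BITS = 0x4B000000  # float32 bit pattern of 2**23 = 8388608.0
MAGIC_F32 = 8388608.0


def bits_to_f32(bits: int) -> float:
    """Reinterpret a 32-bit integer as a float32 value."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def nibble_to_float(n: int) -> float:
    """Mantissa trick: OR the nibble into the mantissa, then subtract the magic.

    At 2**23 the float32 spacing is exactly 1.0, so MAGIC | n encodes
    8388608 + n, and the subtraction leaves float(n). No memory access.
    """
    return bits_to_f32(MAGIC_BITS | (n & 0xF)) - MAGIC_F32


def dequant_byte_mantissa(packed: int, scale: float, zero_point: float):
    """Dequantize both nibbles of one byte using only ALU-style operations."""
    lo, hi = packed & 0xF, (packed >> 4) & 0xF
    return tuple((nibble_to_float(n) - zero_point) * scale for n in (lo, hi))


def dequant_byte_lut(packed: int, lut):
    """LUT variant: one table read per nibble (a memory op on the GPU)."""
    return lut[packed & 0xF], lut[(packed >> 4) & 0xF]


# Both paths agree; only the mantissa path avoids the table reads.
lut = [(n - 8.0) * 0.5 for n in range(16)]
assert dequant_byte_mantissa(0xA3, 0.5, 8.0) == dequant_byte_lut(0xA3, lut)
```

The results are identical by construction, which is why the comparison in this PR comes down purely to where the per-nibble work is executed.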

CC-Yeh commented May 8, 2026

Kernel benchmarks (M4, ZP_BF16_gs64, criterion median µs)

(4096, 4096) — 8 MB, fits SLC

| M | main (µs) | LUT (µs) | Δ |
|---|---|---|---|
| 1 | 80.7 | 81.7 | +1% (tied) |
| 2 | 83.0 | 123.2 | +48% |
| 4 | 150.5 | 241.0 | +60% |

(14336, 4096) M=1 — 28 MB, 5-rep verification

| Stat | main (µs) | LUT (µs) |
|---|---|---|
| median | 314.4 | 305.8 |
| min / max | 309.3 / 323.8 | 304.9 / 319.1 |
| spread | 4.6% | 4.6% |

LUT 2.7% faster, within noise. No regression at this shape/M.
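
The "within noise" reading can be checked with a few lines of arithmetic. This sketch assumes spread is defined as (max − min) / median, which reproduces the 4.6% figures above; the `within_noise` rule is a hypothetical criterion, not necessarily the one the benchmark harness applies.

```python
def spread_pct(min_us: float, max_us: float, median_us: float) -> float:
    """Run-to-run spread as a percentage of the median (assumed definition)."""
    return (max_us - min_us) / median_us * 100.0


def delta_pct(baseline_us: float, candidate_us: float) -> float:
    """Relative change of the candidate vs the baseline; negative is faster."""
    return (candidate_us - baseline_us) / baseline_us * 100.0


def within_noise(baseline_us: float, candidate_us: float, spread: float) -> bool:
    """Hypothetical rule: a delta smaller than the spread is not a real change."""
    return abs(delta_pct(baseline_us, candidate_us)) <= spread


main_spread = spread_pct(309.3, 323.8, 314.4)  # main, (14336, 4096), M=1
lut_spread = spread_pct(304.9, 319.1, 305.8)   # LUT, same shape
delta = delta_pct(314.4, 305.8)

print(f"spread main={main_spread:.1f}% lut={lut_spread:.1f}% delta={delta:.1f}%")
print("within noise:", within_noise(314.4, 305.8, main_spread))
```

With the numbers from the table, the −2.7% delta is smaller than either 4.6% spread, which is consistent with calling the two builds tied at this shape.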

(Earlier sweep reported +275% for this case — turned out to be a
broken measurement; clean rerun didn't reproduce. The "wide-shallow
shapes are catastrophic" claim is retracted.)

E2E LFM2.5 4-bit (n=15, M4)

| Model | Build | Decode tok/s |
|---|---|---|
| RHT-4bitLmHead | main | 143.7 |
| RHT-4bitLmHead | LUT | 141.1 |
| MLX-4bit | main | 146.2 |
| MLX-4bit | LUT | 141.1 |

Decode regresses 1.8–3.5% across both formats — small but consistent.

CC-Yeh commented May 12, 2026

AWQ + LUT[256]

| Shape | M | Baseline (ms) | LUT (ms) | Δ% | Verdict |
|---|---|---|---|---|---|
| A 2048×2048 | 5 | 0.258 | 0.266 | +3.0% | regress |
| A 2048×2048 | 8 | bimodal* | bimodal* | ~−6% (fast mode) | small win |
| A 2048×2048 | 16 | 0.079 | 0.087 | +10.1% | regress |
| A 2048×2048 | 32 | 0.136 | 0.153 | +12.5% | regress |
| A 2048×2048 | 48 | 0.216 | 0.218 | +1.0% | ≈ same |
| A 2048×2048 | 64 | 0.217 | 0.220 | +1.7% | ≈ same |
| A 2048×2048 | 128 | 0.362 | 0.368 | +1.6% | ≈ same |
| B 2560×6912 | 5 | 0.180 | 0.201 | +11.4% | regress |
| B 2560×6912 | 8 | 0.178 | 0.195 | +9.0% | regress |
| B 2560×6912 | 16 | 0.300 | 0.338 | +12.8% | regress |
| B 2560×6912 | 32 | 0.557 | 0.635 | +14.2% | regress |
| B 2560×6912 | 48 | 0.711 | 0.719 | +1.2% | ≈ same |
| B 2560×6912 | 64 | 0.712 | 0.727 | +2.1% | ≈ same |
| B 2560×6912 | 128 | 1.377 | 1.402 | +1.8% | ≈ same |

CC-Yeh commented May 12, 2026

## QMM (BM-matched: BM=8 for M<48, BM=64 for M≥48) — Apple M4

Baseline is AWQ-LUT[256]...

| Shape | M | BM | AWQ-LUT[256] (ms) | NF4-const (ms) | NF4-tg (ms) | Δ const | Δ tg | Flag |
|---|---|---|---|---|---|---|---|---|
| 2048×2048 | 5 | 8 | 0.0605 | 0.0607 | 0.0679 | +0.3% | +12.2% | |
| 2048×2048 | 8 | 8 | 0.2501 | 0.2507 | 0.2537 | +0.2% | +1.4% | |
| 2048×2048 | 16 | 8 | 0.3847 | 0.3841 | 0.3865 | −0.2% | +0.5% | |
| 2048×2048 | 32 | 8 | 0.6963 | 0.6910 | 0.6956 | −0.8% | −0.1% | |
| 2048×2048 | 48 | 64 | 0.9889 | 0.9872 | 0.9890 | −0.2% | 0.0% | |
| 2048×2048 | 64 | 64 | 0.5432 | 0.5392 | 0.9941 | −0.7% | +83.0% | tg bimodal |
| 2048×2048 | 128 | 64 | 0.9302 | 1.6712 | 1.6989 | +79.7% | +82.6% | bimodal |
| 2560×6912 | 5 | 8 | 0.4755 | 0.8157 | 0.8272 | +71.6% | +74.0% | bimodal |
| 2560×6912 | 8 | 8 | 0.8203 | 0.8046 | 0.8154 | −1.9% | −0.6% | |
| 2560×6912 | 16 | 8 | 1.5123 | 1.4880 | 1.5029 | −1.6% | −0.6% | |
| 2560×6912 | 32 | 8 | 1.6051 | 1.2469 | 1.2423 | −22.3% | −22.6% | bimodal |
| 2560×6912 | 48 | 64 | 1.8350 | 1.8002 | 1.8073 | −1.9% | −1.5% | |
| 2560×6912 | 64 | 64 | 1.2140 | 1.2255 | 1.4245 | +0.9% | +17.3% | tg loses |
| 2560×6912 | 128 | 64 | 2.0861 | 1.8828 | 1.9234 | −9.7% | −7.8% | |

QMV (M ∈ {1, 2, 4}, 6 shapes) — Apple M4

| Shape | M | AWQ-LUT[256] (ms) | NF4-const (ms) | NF4-tg (ms) | Δ const | Δ tg | Flag |
|---|---|---|---|---|---|---|---|
| 2048×2048 (LFM) | 1 | 0.0393 | 0.0243 | 0.0242 | −38.1% | −38.5% | AWQ bimodal |
| 2560×6912 (Qwen MLPup) | 1 | 0.0993 | 0.0911 | 0.0892 | −8.2% | −10.2% | |
| 6912×2560 (Qwen MLPdown) | 1 | 0.0973 | 0.0909 | 0.0907 | −6.6% | −6.7% | |
| 4096×4096 (Llama) | 1 | 0.0883 | 0.0828 | 0.0852 | −6.2% | −3.4% | |
| 4096×14336 (Llama MLPup) | 1 | 0.3151 | 0.2976 | 0.2976 | −5.6% | −5.6% | |
| 14336×4096 (Llama MLPdown) | 1 | 0.3226 | 0.3049 | 0.3048 | −5.5% | −5.5% | |
| 2048×2048 (LFM) | 2 | 0.0383 | 0.0352 | 0.0335 | −8.2% | −12.6% | |
| 2560×6912 (Qwen MLPup) | 2 | 0.1540 | 0.1280 | 0.1222 | −16.9% | −20.7% | |
| 6912×2560 (Qwen MLPdown) | 2 | 0.1549 | 0.1324 | 0.1268 | −14.6% | −18.2% | |
| 4096×4096 (Llama) | 2 | 0.1454 | 0.1258 | 0.1183 | −13.4% | −18.7% | |
| 4096×14336 (Llama MLPup) | 2 | 0.4817 | 0.4063 | 1.7706 | −15.7% | +267.5% | tg collapse |
| 14336×4096 (Llama MLPdown) | 2 | 0.4864 | 0.4121 | 0.3942 | −15.3% | −19.0% | |
| 2048×2048 (LFM) | 4 | 0.0729 | 0.0621 | 0.0587 | −14.8% | −19.4% | |
| 2560×6912 (Qwen MLPup) | 4 | 0.2936 | 0.2434 | 0.2289 | −17.1% | −22.0% | |
| 6912×2560 (Qwen MLPdown) | 4 | 0.3014 | 0.2515 | 0.2405 | −16.5% | −20.2% | |
| 4096×4096 (Llama) | 4 | 0.2796 | 0.2337 | 0.2210 | −16.4% | −21.0% | |
| 4096×14336 (Llama MLPup) | 4 | 0.9523 | 0.7916 | 0.7458 | −16.9% | −21.7% | |
| 14336×4096 (Llama MLPdown) | 4 | 0.9621 | 0.8022 | 0.7612 | −16.6% | −20.9% | |

CC-Yeh commented May 12, 2026

## QMV vs scalar-AWQ (M ∈ {1, 2, 4}, 6 shapes) — Apple M4

| Shape | M | AWQ-scalar (ms) | NF4-const (ms) | NF4-tg (ms) | Δ const | Δ tg |
|---|---|---|---|---|---|---|
| LFM-2048 | 1 | 0.0248 | 0.0242 | 0.0236 | −2.5% | −5.0% |
| LFM-2048 | 2 | 0.0265 | 0.0355 | 0.0338 | +34.1% | +27.4% |
| LFM-2048 | 4 | 0.0411 | 0.0626 | 0.0590 | +52.4% | +43.6% |
| Qwen-MLPup | 1 | 0.0963 | 0.0900 | 0.0897 | −6.6% | −6.8% |
| Qwen-MLPup | 2 | 0.0971 | 0.1312 | 0.1248 | +35.1% | +28.5% |
| Qwen-MLPup | 4 | 0.1536 | 0.2445 | 0.2325 | +59.2% | +51.3% |
| Qwen-MLPdown | 1 | 0.0946 | 0.0875 | 0.0862 | −7.5% | −8.9% |
| Qwen-MLPdown | 2 | 0.1032 | 0.1340 | 0.1283 | +29.9% | +24.3% |
| Qwen-MLPdown | 4 | 0.1565 | 0.2516 | 0.2382 | +60.8% | +52.2% |
| Llama-4096 | 1 | 0.0916 | 0.0843 | 0.0841 | −8.1% | −8.2% |
| Llama-4096 | 2 | 0.0977 | 0.1265 | 0.1172 | +29.5% | +20.0% |
| Llama-4096 | 4 | 0.1460 | 0.2381 | 0.2208 | +63.1% | +51.2% |
| Llama-MLPup | 1 | 0.3233 | 0.2974 | 0.2975 | −8.0% | −8.0% |
| Llama-MLPup | 2 | 0.3358 | 0.4037 | 0.3823 | +20.2% | +13.8% |
| Llama-MLPup | 4 | 0.4759 | 0.7977 | 0.7489 | +67.6% | +57.4% |
| Llama-MLPdown | 1 | 0.3257 | 0.3040 | 0.3046 | −6.7% | −6.5% |
| Llama-MLPdown | 2 | 0.3241 | 0.4146 | 0.3947 | +27.9% | +21.8% |
| Llama-MLPdown | 4 | 0.4775 | 0.8064 | 0.7602 | +68.9% | +59.2% |

QMM vs scalar-AWQ (M ∈ {5..128}, 2 shapes) — Apple M4

| Shape | M | AWQ-scalar (ms) | NF4-const (ms) | NF4-tg (ms) | Δ const | Δ tg |
|---|---|---|---|---|---|---|
| 2048×2048 | 5 | 0.2568 | 0.1593 | 0.0711 | −38.0% | −72.3% |
| 2048×2048 | 8 | 0.0569 | 0.0606 | 0.0596 | +6.5% | +4.8% |
| 2048×2048 | 16 | 0.0789 | 0.0865 | 0.0904 | +9.6% | +14.6% |
| 2048×2048 | 32 | 0.1377 | 0.1552 | 0.1526 | +12.7% | +10.8% |
| 2048×2048 | 48 | 0.2181 | 0.2185 | 0.2191 | +0.2% | +0.4% |
| 2048×2048 | 64 | 0.2168 | 0.2177 | 0.2204 | +0.4% | +1.6% |
| 2048×2048 | 128 | 0.3621 | 0.3634 | 0.3687 | +0.4% | +1.8% |
| 2560×6912 | 5 | 0.1818 | 0.1947 | 0.1967 | +7.1% | +8.2% |
| 2560×6912 | 8 | 0.1788 | 0.1915 | 0.1915 | +7.1% | +7.1% |
| 2560×6912 | 16 | 0.2998 | 0.3311 | 0.3325 | +10.5% | +10.9% |
| 2560×6912 | 32 | 0.5562 | 0.6241 | 0.6278 | +12.2% | +12.9% |
| 2560×6912 | 48 | 0.7109 | 0.7150 | 0.7176 | +0.6% | +0.9% |
| 2560×6912 | 64 | 0.7121 | 0.7228 | 0.7219 | +1.5% | +1.4% |
| 2560×6912 | 128 | 1.3778 | 1.3929 | 1.4012 | +1.1% | +1.7% |

CC-Yeh added 6 commits May 15, 2026 23:08
…4-bit LUT zero-point variant alongside E4M3

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…/Zp variants vs QmvFast/AWQ across M

Extend the committed qmm_lut/qmv_lut harness to measure all four NF4
variants (Constant, Tg, E4M3, Zp) per cell against the production
baselines: QmvFast for the decode QMV sweep (M=1..4) and AWQ-int4
QmmTransposed for the prefill QMM sweep (M=5/16/32/64/128), on the two
existing real shapes (2048x2048, 2560x6912). E4M3 scales are built by
round-tripping the same f32 scales through f32_to_e4m3; Zp uses realistic
packed 4-bit zero-point indices. Same pick_tile() tiling applied to every
NF4 variant for apples-to-apples QMM. Format confound (QmvFast asymmetric
int4 vs NF4 codebook) annotated in the QMV output.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…g 4-way NF4 vs zp-scalar bench

Rewrite qmm_lut_bench to encode N dispatches of the same kernel into one
command buffer and divide gpu_execution_time by N, amortizing fixed
per-CB GPU overhead. Baseline is now zp-scalar int4 (QmvFast/
QmmTransposed, use_zero_points=true, use_mlx_quant=false, no bias);
challengers are nf4 / nf4-e4m3 / nf4-4bit-lut reported as % vs baseline.
Passes the magnitude + direction sanity gates against the trusted M4
reference table.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… SPECIALIZE + 3-way scalar/awq-lut256/nf4 bench

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…6 (byte-batched half2 256-LUT) + correctness + bench

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CC-Yeh commented May 15, 2026

NF4 vs scalar int4 — short, clear, with QMM (Apple M4, reproduced)

A positive value means NF4 is slower than scalar int4. Batched timing (N=128 dispatches per command buffer); 5 warmup runs, 20 measured runs, the 5 samples farthest from the median dropped, then the mean of the rest. Confirmed by an independent re-run.
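
The timing recipe above can be sketched as a small reduction function. This is an illustration, not the bench harness: it assumes the warmup samples are recorded at the front of the sample list and that each sample is the GPU time of one command buffer holding N dispatches. Function names are hypothetical.

```python
from statistics import mean, median


def robust_mean(samples, warmup=5, drop=5):
    """Discard warmup runs, drop the `drop` samples farthest from the median,
    and return the mean of what is left."""
    measured = list(samples[warmup:])
    if len(measured) <= drop:
        raise ValueError("not enough measured samples to drop outliers")
    mid = median(measured)
    # Sort by distance from the median and keep the closest ones.
    kept = sorted(measured, key=lambda t: abs(t - mid))[: len(measured) - drop]
    return mean(kept)


def per_dispatch_time(command_buffer_time, dispatches=128):
    """Amortize fixed per-command-buffer overhead across N dispatches."""
    return command_buffer_time / dispatches


# 5 slow warmup runs, 15 steady runs, 5 outliers (for example clock-ramp jitter).
samples = [100.0] * 5 + [10.0] * 15 + [50.0] * 5
print(robust_mean(samples))
```

Dropping by distance from the median, rather than trimming a fixed number from each tail, removes one-sided outliers such as occasional slow runs without biasing the kept set.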

Decode (QMV)

| Shape (K×N) | M=1 | M=2 | M=4 |
|---|---|---|---|
| LFM-2048 (2048×2048) | +36%* | +30–45% | +50% |
| Qwen-MLPup (896×4864) | +19%* | +42% | +49% |
| Qwen-MLPdown (4864×896) | +22%* | +41–45% | +49% |
| Llama-4096 (4096×4096) | ≈tie (−6%) | +40–47% | +55% |
| Llama-MLPup (4096×14336) | ≈tie (−2%) | +23–31% | +57% |
| Llama-MLPdown (14336×4096) | ≈tie (−3%) | +32–36% | +59% |

Prefill (QMM)

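| Shape (K×N) | M=5 | M=8 | M=16 | M=32 | M=48 | M=64 | M=128 |
|---|---|---|---|---|---|---|---|
| ShapeA (2048×2048) | −9% | ≈0 | +6% | +13% | ≈0 | ≈0 | ≈0 |
| ShapeB (2560×6912) | +5% | +9% | +11% | +12% | ≈0 | ≈0 | ≈0 |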

* Why these M=1 cells are less reliable (not "noise" hand-waving — here's the cause): at M=1 these small matrices run in only ~15–20 µs, so absolute time is dominated by GPU clock-ramp / kernel-launch jitter and the % vs scalar swings run-to-run (e.g. LFM-2048 M=1 measured +29% / +36% / +71% across three independent runs; the table shows the canonical-run value). Direction is consistent (NF4 always neutral-to-slower at M=1, never faster) but treat the exact M=1 small-shape % as ±high. All M≥2 and Llama cells are stable (σ small) and trustworthy.

The story in three lines

  1. Decode (the regime that matters): NF4 is +23–59% slower than scalar int4 at M≥2 on every shape; it only ties at M=1 on the big Llama shapes.
  2. Prefill: roughly even (±2%) at large tiles (M≥48); +6–13% slower in the mid range (M=16–32); mixed at small M (−9% to +9% at M=5–8).
  3. Why (RESOLVED): the cost is intrinsic to NF4's codebook dequant, not the kernel.
     • Decisive test: NF4 codebook dequant was grafted into the exact tuned QmvFast skeleton (identical tiling/occupancy/loop as the scalar path; bit-exact vs Nf4QmvConstant, worst_rel = 0.000e0, independently re-verified). The gap did not close: nf4-grafted ≈ nf4-const at every M≥2 (e.g. LFM-2048 M=4: +49.7% vs +49.0%; Llama-MLPup M=4: +57.4% vs +56.8%), both still +45–60% over scalar. So it is not an unoptimized-kernel artifact. The 16-entry codebook lookup is fundamentally more expensive than the int4 mantissa trick on Apple M4, largely independent of kernel structure.
     • Refinement: the codebook address space is a real but minor lever. A 16-entry codebook in threadgroup memory (nf4-tg, bit-exact vs const) avoids the constant-cache divergent-load serialization and beats nf4-const by ~5–13pp at M≥2 (e.g. LFM-2048 M=2: +36% vs +44%; Llama-MLPup M=4: +50% vs +62%). That makes nf4-tg the best NF4 dequant, but it still does not reach scalar (best M≥2 ≈ +22% slower).
     • Other zero-memory routes were also tested and are far worse: a register simd_shuffle codebook is +250–300% vs scalar at M≥2 (per-weight simd_shuffle is expensive on M4; size-invariant 8≈16≈32), and a full-replication in-register switch select chain (nf4-select) is catastrophic at +670–1376% (the per-nibble select serializes into a long dependent chain).
     • Design space exhausted (6 mechanisms). Final ranking at M≥2: awq-lut256 ≤ scalar < nf4-tg (+22–59%) < nf4-const ≈ nf4-grafted (+26–62%) < nf4-byte256 (+48–93%) ≪ nf4-shuffle (+200–320%) ≪ nf4-select (+670–1376%).
     • Conclusion: no kernel-level change makes NF4 competitive with scalar int4 at decode on Apple M4. nf4-tg narrows the gap but does not close it; NF4 is an accuracy-only format here. If NF4 must be used, ship the threadgroup-codebook (Nf4QmvTg) variant.
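
For readers unfamiliar with codebook dequant, the sketch below shows why NF4 needs a table read per weight, and what a byte-indexed 256-entry pair table (the nf4-byte256 idea) looks like. It is a CPU illustration only. The codebook values are the commonly published NF4 levels from QLoRA/bitsandbytes and are an assumption here, since the PR does not show its table; the nibble order (low nibble first) and the scale-only formula are also assumptions.

```python
# Assumed NF4 levels (commonly published QLoRA / bitsandbytes values).
NF4_CODEBOOK = (
    -1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453,
    -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0,
    0.07958029955625534, 0.16093020141124725, 0.24611230194568634,
    0.33791524171829224, 0.44070982933044434, 0.5626170039176941,
    0.7229568362236023, 1.0,
)


def dequant_nf4_byte(packed: int, scale: float):
    """Scale-only NF4 dequant: two codebook reads per packed byte.

    The levels are not evenly spaced, so there is no arithmetic shortcut
    like the int4 mantissa trick; each nibble must index the table.
    """
    lo, hi = packed & 0xF, (packed >> 4) & 0xF
    return NF4_CODEBOOK[lo] * scale, NF4_CODEBOOK[hi] * scale


def build_byte_pair_lut():
    """256-entry table indexed by the whole byte: one read yields both values."""
    return [(NF4_CODEBOOK[b & 0xF], NF4_CODEBOOK[b >> 4]) for b in range(256)]


pair_lut = build_byte_pair_lut()
print(dequant_nf4_byte(0xF0, 2.0))  # low nibble 0, high nibble 15
print(pair_lut[0xF0])
```

The pair table trades two small reads for one wider read from a 16× larger table, which is the trade-off the nf4-byte256 row in the ranking measures.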

CC-Yeh and others added 3 commits May 16, 2026 03:12
Adds a third dequant mode (use_nf4 SPECIALIZE) to QuantizedMatmulQmvFast:
16-entry NF4 codebook lookup, scale-only (no zero-points/bias), inside the
exact production QmvFast tiling/occupancy/loop. Precedence: use_nf4 ->
codebook path, else existing use_lut branch. Wired through the CPU kernel
signature (additive trailing bool), all existing call sites (default false),
and the qmv_lut_bench dispatcher (new nf4-grafted column). Adds
nf4_graft_test correctness gate: grafted-NF4 is bit-exact vs Nf4QmvConstant.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…s + bench

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>