[Experiment] QMV LUT Dequant by CC-Yeh · Pull Request #394 · trymirai/uzu

CC-Yeh · 2026-05-08T13:27:36Z

Tested replacing QmvFast's pure-ALU int4→float dequant (the uint_to_fp
mantissa trick: nibble extract via shift/mask, then bit-OR, then fsub)
with a threadgroup-memory LUT lookup. No win at any tested shape, and a
mild E2E regression on real models. The mantissa trick stays optimal.
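For readers unfamiliar with the mantissa trick, here is a minimal Python sketch of the arithmetic only. The real kernel is a Metal shader. The float16 magic constant 0x6400 (1024.0), the low-nibble-first order, and the names `unpack_nibbles`, `dequant_mantissa`, and `dequant_lut` are assumptions made for illustration, not taken from the PR.

```python
import struct

# Assumption: the trick uses float16 with the magic constant 0x6400 (= 1024.0).
# At 1024.0 the lowest mantissa bit is worth exactly 1.0, so OR-ing a 4-bit
# value into the mantissa yields 1024.0 + q.
HALF_MAGIC = 0x6400


def unpack_nibbles(packed: int) -> list[int]:
    """Split a 32-bit word into eight 4-bit values, low nibble first (assumed order)."""
    return [(packed >> (4 * i)) & 0xF for i in range(8)]


def dequant_mantissa(q: int) -> float:
    """Pure-ALU path: bit-OR into the mantissa, reinterpret as float16, subtract the magic."""
    bits = HALF_MAGIC | (q & 0xF)
    (as_half,) = struct.unpack("<e", struct.pack("<H", bits))
    return as_half - 1024.0


# Table path: the same mapping stored as 16 entries that must be read from memory.
LUT = [float(i) for i in range(16)]


def dequant_lut(q: int) -> float:
    """Memory path: one table read per 4-bit value."""
    return LUT[q & 0xF]


if __name__ == "__main__":
    for q in unpack_nibbles(0xFEDCBA98):
        print(q, dequant_mantissa(q), dequant_lut(q))
```

Both paths return identical values. The difference measured in this PR is only in how the value is produced: arithmetic on registers versus a memory read.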

Headline (Apple M4, ZP_BF16_gs64, 4-bit, kernel µs medians)

Shape	M	main	LUT	Δ
(4096, 4096)	1	80.7	81.7	+1% (tied)
(4096, 4096)	2	83.0	123.2	+48%
(4096, 4096)	4	150.5	241.0	+60%
(14336, 4096)	1	314.4	305.8	−2.7% (tied, within noise)

E2E LFM2.5 4-bit decode: −1.8% (RHT), −3.5% (MLX).

What was tried

Constant-memory LUT: divergent loads serialize. +73–130%.
Threadgroup-memory LUT: partial recovery. Still +50–60% at M≥2.
bfloat2 entries: same wallclock as half2.
Manual bf << 16 widen: eliminates air.convert from AIR
(verified) but no wallclock change. Convert wasn't the cost.
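The "manual bf << 16 widen" item relies on bfloat16 being the top 16 bits of an IEEE float32. A small Python sketch of that widening follows. The helper names are hypothetical, and the truncating `f32_to_bf16_bits` is only there to produce test inputs; it is not how the PR quantizes scales.

```python
import struct


def f32_to_bf16_bits(x: float) -> int:
    """Keep the top 16 bits of a float32 (truncation, no rounding); illustration only."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16


def bf16_bits_to_f32(bits16: int) -> float:
    """Widen bfloat16 to float32 by shifting its bits into the high half of a 32-bit word."""
    (value,) = struct.unpack("<f", struct.pack("<I", (bits16 & 0xFFFF) << 16))
    return value


if __name__ == "__main__":
    for x in (1.0, -2.0, 0.5, 1.5):
        b = f32_to_bf16_bits(x)
        print(f"{x:>5} -> 0x{b:04X} -> {bf16_bits_to_f32(b)}")
```

The widen is a single shift, which fits the observation above that removing the convert instruction did not change wall-clock time.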

Verdict

The actual cost is L1 cache port contention: LUT reads compete
with weight loads at the same L1 read port. The mantissa trick has
zero memory ops in the dequant chain (pure ALU, extracts nibbles
and converts to float in one fused bit-twiddle), so it doesn't fight
for the port. No LUT variant can beat that on Apple GPU.

CC-Yeh · 2026-05-08T13:28:15Z

Kernel benchmarks (M4, ZP_BF16_gs64, criterion median µs)

(4096, 4096) — 8 MB, fits SLC

M	main	LUT	Δ
1	80.7	81.7	+1% (tied)
2	83.0	123.2	+48%
4	150.5	241.0	+60%

(14336, 4096) M=1 — 28 MB, 5-rep verification

	main	LUT
median	314.4	305.8
min/max	309.3 / 323.8	304.9 / 319.1
spread	4.6%	4.6%

LUT 2.7% faster, within noise. No regression at this shape/M.
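The "within noise" reading can be checked from the summary rows alone. The sketch below assumes that spread means (max − min) / median, which reproduces the 4.6% figures in the table. The function names are hypothetical.

```python
def spread_pct(lo: float, hi: float, median: float) -> float:
    """Run-to-run spread as a percentage of the median (assumed definition)."""
    return 100.0 * (hi - lo) / median


def delta_pct(baseline: float, candidate: float) -> float:
    """Relative change of candidate vs baseline; negative means the candidate is faster."""
    return 100.0 * (candidate - baseline) / baseline


# Summary values copied from the (14336, 4096) M=1 table above, in microseconds.
main = {"median": 314.4, "min": 309.3, "max": 323.8}
lut = {"median": 305.8, "min": 304.9, "max": 319.1}

if __name__ == "__main__":
    d = delta_pct(main["median"], lut["median"])
    s_main = spread_pct(main["min"], main["max"], main["median"])
    s_lut = spread_pct(lut["min"], lut["max"], lut["median"])
    print(f"delta {d:+.1f}%, spread main {s_main:.1f}%, spread LUT {s_lut:.1f}%")
    print("within noise" if abs(d) < min(s_main, s_lut) else "outside noise")
```

The −2.7% difference is smaller than either build's own 4.6% spread, so it cannot be told apart from run-to-run variation.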

(Earlier sweep reported +275% for this case — turned out to be a
broken measurement; clean rerun didn't reproduce. The "wide-shallow
shapes are catastrophic" claim is retracted.)

E2E LFM2.5 4-bit (n=15, M4)

Model	Build	Decode tok/s
RHT-4bitLmHead	main	143.7
RHT-4bitLmHead	LUT	141.1
MLX-4bit	main	146.2
MLX-4bit	LUT	141.1

Decode regresses 1.8–3.5% across both formats — small but consistent.

CC-Yeh · 2026-05-12T12:39:01Z

AWQ + LUT[256]

Shape	M	Baseline (ms)	LUT (ms)	Δ%	verdict
A 2048×2048	5	0.258	0.266	+3.0%	regress
A 2048×2048	8	bimodal*	bimodal*	~−6% (fast mode)	small win
A 2048×2048	16	0.079	0.087	+10.1%	regress
A 2048×2048	32	0.136	0.153	+12.5%	regress
A 2048×2048	48	0.216	0.218	+1.0%	≈ same
A 2048×2048	64	0.217	0.220	+1.7%	≈ same
A 2048×2048	128	0.362	0.368	+1.6%	≈ same
B 2560×6912	5	0.180	0.201	+11.4%	regress
B 2560×6912	8	0.178	0.195	+9.0%	regress
B 2560×6912	16	0.300	0.338	+12.8%	regress
B 2560×6912	32	0.557	0.635	+14.2%	regress
B 2560×6912	48	0.711	0.719	+1.2%	≈ same
B 2560×6912	64	0.712	0.727	+2.1%	≈ same
B 2560×6912	128	1.377	1.402	+1.8%	≈ same
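The commit log describes the LUT[256] variant as a byte-batched 256-entry table of half2 pairs. A rough Python sketch of that idea follows. It assumes the table stores the two raw 4-bit values of each byte (low nibble first) and that scale and zero-point are applied afterwards; the actual kernel layout may differ, and all names here are hypothetical.

```python
def build_byte_lut() -> list[tuple[float, float]]:
    """256 entries: one packed byte maps to its (low nibble, high nibble) pair as floats."""
    return [(float(b & 0xF), float(b >> 4)) for b in range(256)]


def dequant_bytes(packed: bytes, table, scale: float, zero_point: float) -> list[float]:
    """One table read per byte yields two weights; scale and zero-point applied afterwards."""
    out = []
    for b in packed:
        lo, hi = table[b]
        out.append(scale * (lo - zero_point))
        out.append(scale * (hi - zero_point))
    return out


if __name__ == "__main__":
    table = build_byte_lut()
    print(dequant_bytes(bytes([0x80, 0xF0]), table, scale=0.5, zero_point=8.0))
```

Indexing by byte halves the number of lookups compared with a 16-entry table, at the cost of a table 16 times larger.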

CC-Yeh · 2026-05-12T14:15:43Z

QMM (BM-matched: BM=8 for M<48, BM=64 for M≥48), Apple M4

Baseline is AWQ-LUT[256]...

Shape	M	BM	AWQ-LUT[256] (ms)	NF4-const (ms)	NF4-tg (ms)	Δ const	Δ tg	flag
2048×2048	5	8	0.0605	0.0607	0.0679	+0.3%	+12.2%
2048×2048	8	8	0.2501	0.2507	0.2537	+0.2%	+1.4%
2048×2048	16	8	0.3847	0.3841	0.3865	−0.2%	+0.5%
2048×2048	32	8	0.6963	0.6910	0.6956	−0.8%	−0.1%
2048×2048	48	64	0.9889	0.9872	0.9890	−0.2%	0.0%
2048×2048	64	64	0.5432	0.5392	0.9941	−0.7%	+83.0%	tg bimodal
2048×2048	128	64	0.9302	1.6712	1.6989	+79.7%	+82.6%	bimodal
2560×6912	5	8	0.4755	0.8157	0.8272	+71.6%	+74.0%	bimodal
2560×6912	8	8	0.8203	0.8046	0.8154	−1.9%	−0.6%
2560×6912	16	8	1.5123	1.4880	1.5029	−1.6%	−0.6%
2560×6912	32	8	1.6051	1.2469	1.2423	−22.3%	−22.6%	bimodal
2560×6912	48	64	1.8350	1.8002	1.8073	−1.9%	−1.5%
2560×6912	64	64	1.2140	1.2255	1.4245	+0.9%	+17.3%	tg loses
2560×6912	128	64	2.0861	1.8828	1.9234	−9.7%	−7.8%

QMV (M ∈ {1, 2, 4}, 6 shapes) — Apple M4

Shape	M	AWQ-LUT[256] (ms)	NF4-const (ms)	NF4-tg (ms)	Δ const	Δ tg	flag
2048×2048 (LFM)	1	0.0393	0.0243	0.0242	−38.1%	−38.5%	AWQ bimodal
2560×6912 (Qwen MLPup)	1	0.0993	0.0911	0.0892	−8.2%	−10.2%
6912×2560 (Qwen MLPdown)	1	0.0973	0.0909	0.0907	−6.6%	−6.7%
4096×4096 (Llama)	1	0.0883	0.0828	0.0852	−6.2%	−3.4%
4096×14336 (Llama MLPup)	1	0.3151	0.2976	0.2976	−5.6%	−5.6%
14336×4096 (Llama MLPdown)	1	0.3226	0.3049	0.3048	−5.5%	−5.5%
2048×2048 (LFM)	2	0.0383	0.0352	0.0335	−8.2%	−12.6%
2560×6912 (Qwen MLPup)	2	0.1540	0.1280	0.1222	−16.9%	−20.7%
6912×2560 (Qwen MLPdown)	2	0.1549	0.1324	0.1268	−14.6%	−18.2%
4096×4096 (Llama)	2	0.1454	0.1258	0.1183	−13.4%	−18.7%
4096×14336 (Llama MLPup)	2	0.4817	0.4063	1.7706	−15.7%	+267.5%	tg collapse
14336×4096 (Llama MLPdown)	2	0.4864	0.4121	0.3942	−15.3%	−19.0%
2048×2048 (LFM)	4	0.0729	0.0621	0.0587	−14.8%	−19.4%
2560×6912 (Qwen MLPup)	4	0.2936	0.2434	0.2289	−17.1%	−22.0%
6912×2560 (Qwen MLPdown)	4	0.3014	0.2515	0.2405	−16.5%	−20.2%
4096×4096 (Llama)	4	0.2796	0.2337	0.2210	−16.4%	−21.0%
4096×14336 (Llama MLPup)	4	0.9523	0.7916	0.7458	−16.9%	−21.7%
14336×4096 (Llama MLPdown)	4	0.9621	0.8022	0.7612	−16.6%	−20.9%

CC-Yeh · 2026-05-12T14:49:03Z

QMV vs scalar-AWQ (M ∈ {1, 2, 4}, 6 shapes), Apple M4

Shape	M	AWQ-scalar (ms)	NF4-const (ms)	NF4-tg (ms)	Δ const	Δ tg
LFM-2048	1	0.0248	0.0242	0.0236	−2.5%	−5.0%
LFM-2048	2	0.0265	0.0355	0.0338	+34.1%	+27.4%
LFM-2048	4	0.0411	0.0626	0.0590	+52.4%	+43.6%
Qwen-MLPup	1	0.0963	0.0900	0.0897	−6.6%	−6.8%
Qwen-MLPup	2	0.0971	0.1312	0.1248	+35.1%	+28.5%
Qwen-MLPup	4	0.1536	0.2445	0.2325	+59.2%	+51.3%
Qwen-MLPdown	1	0.0946	0.0875	0.0862	−7.5%	−8.9%
Qwen-MLPdown	2	0.1032	0.1340	0.1283	+29.9%	+24.3%
Qwen-MLPdown	4	0.1565	0.2516	0.2382	+60.8%	+52.2%
Llama-4096	1	0.0916	0.0843	0.0841	−8.1%	−8.2%
Llama-4096	2	0.0977	0.1265	0.1172	+29.5%	+20.0%
Llama-4096	4	0.1460	0.2381	0.2208	+63.1%	+51.2%
Llama-MLPup	1	0.3233	0.2974	0.2975	−8.0%	−8.0%
Llama-MLPup	2	0.3358	0.4037	0.3823	+20.2%	+13.8%
Llama-MLPup	4	0.4759	0.7977	0.7489	+67.6%	+57.4%
Llama-MLPdown	1	0.3257	0.3040	0.3046	−6.7%	−6.5%
Llama-MLPdown	2	0.3241	0.4146	0.3947	+27.9%	+21.8%
Llama-MLPdown	4	0.4775	0.8064	0.7602	+68.9%	+59.2%

QMM vs scalar-AWQ (M ∈ {5..128}, 2 shapes) — Apple M4

Shape	M	AWQ-scalar (ms)	NF4-const (ms)	NF4-tg (ms)	Δ const	Δ tg
2048×2048	5	0.2568	0.1593	0.0711	−38.0%	−72.3%
2048×2048	8	0.0569	0.0606	0.0596	+6.5%	+4.8%
2048×2048	16	0.0789	0.0865	0.0904	+9.6%	+14.6%
2048×2048	32	0.1377	0.1552	0.1526	+12.7%	+10.8%
2048×2048	48	0.2181	0.2185	0.2191	+0.2%	+0.4%
2048×2048	64	0.2168	0.2177	0.2204	+0.4%	+1.6%
2048×2048	128	0.3621	0.3634	0.3687	+0.4%	+1.8%
2560×6912	5	0.1818	0.1947	0.1967	+7.1%	+8.2%
2560×6912	8	0.1788	0.1915	0.1915	+7.1%	+7.1%
2560×6912	16	0.2998	0.3311	0.3325	+10.5%	+10.9%
2560×6912	32	0.5562	0.6241	0.6278	+12.2%	+12.9%
2560×6912	48	0.7109	0.7150	0.7176	+0.6%	+0.9%
2560×6912	64	0.7121	0.7228	0.7219	+1.5%	+1.4%
2560×6912	128	0.3778	0.3929	0.4012	+1.1%	+1.7%

…4-bit LUT zero-point variant alongside E4M3 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…/Zp variants vs QmvFast/AWQ across M Extend the committed qmm_lut/qmv_lut harness to measure all four NF4 variants (Constant, Tg, E4M3, Zp) per cell against the production baselines: QmvFast for the decode QMV sweep (M=1..4) and AWQ-int4 QmmTransposed for the prefill QMM sweep (M=5/16/32/64/128), on the two existing real shapes (2048x2048, 2560x6912). E4M3 scales are built by round-tripping the same f32 scales through f32_to_e4m3; Zp uses realistic packed 4-bit zero-point indices. Same pick_tile() tiling applied to every NF4 variant for apples-to-apples QMM. Format confound (QmvFast asymmetric int4 vs NF4 codebook) annotated in the QMV output. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…g 4-way NF4 vs zp-scalar bench Rewrite qmm_lut_bench to encode N dispatches of the same kernel into one command buffer and divide gpu_execution_time by N, amortizing fixed per-CB GPU overhead. Baseline is now zp-scalar int4 (QmvFast/QmmTransposed, use_zero_points=true, use_mlx_quant=false, no bias); challengers are nf4 / nf4-e4m3 / nf4-4bit-lut reported as Δ% vs baseline. Passes the magnitude + direction sanity gates against the trusted M4 reference table. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… SPECIALIZE + 3-way scalar/awq-lut256/nf4 bench Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…6 (byte-batched half2 256-LUT) + correctness + bench Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

CC-Yeh · 2026-05-15T18:57:40Z

NF4 vs scalar int4 — short, clear, with QMM (Apple M4, reproduced)

+ = NF4 slower than scalar int4. Batched timing (N=128 dispatches per command buffer): 5 warmup runs, 20 measured runs, drop the 5 farthest from the median, report the mean. Confirmed by an independent re-run.
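The aggregation described above can be sketched as follows. This is an illustrative Python version, not the bench harness itself. The function names are hypothetical and the sample values are synthetic, chosen only to show the outlier rejection.

```python
from statistics import mean, median


def per_dispatch_time(command_buffer_time: float, dispatches: int = 128) -> float:
    """Amortize fixed per-command-buffer overhead by dividing by the dispatch count."""
    return command_buffer_time / dispatches


def robust_mean(samples: list[float], warmup: int = 5, drop: int = 5) -> float:
    """Discard warmup samples, drop the `drop` samples farthest from the median, then average."""
    measured = samples[warmup:]
    if len(measured) <= drop:
        raise ValueError("not enough measured samples")
    mid = median(measured)
    # Sort by distance from the median and keep the closest ones.
    kept = sorted(measured, key=lambda t: abs(t - mid))[: len(measured) - drop]
    return mean(kept)


if __name__ == "__main__":
    # Synthetic data: 5 slow warmup runs, 15 steady runs, 5 outliers.
    samples = [500.0] * 5 + [100.0] * 15 + [900.0] * 5
    print(robust_mean(samples))
```

Dropping by distance from the median removes both slow and fast outliers, which a plain trimmed maximum would not.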

Decode (QMV)

Shape (K×N)	M=1	M=2	M=4
LFM-2048 (2048×2048)	+36%*	+30–45%	+50%
Qwen-MLPup (896×4864)	+19%*	+42%	+49%
Qwen-MLPdown (4864×896)	+22%*	+41–45%	+49%
Llama-4096 (4096×4096)	≈tie (−6%)	+40–47%	+55%
Llama-MLPup (4096×14336)	≈tie (−2%)	+23–31%	+57%
Llama-MLPdown (14336×4096)	≈tie (−3%)	+32–36%	+59%

Prefill (QMM)

Shape (K×N)	M=5	8	16	32	48	64	128
ShapeA (2048×2048)	−9%	≈0	+6%	+13%	≈0	≈0	≈0
ShapeB (2560×6912)	+5%	+9%	+11%	+12%	≈0	≈0	≈0

* Why these M=1 cells are less reliable: at M=1 these small matrices run in only ~15–20 µs, so absolute time is dominated by GPU clock-ramp and kernel-launch jitter, and the % vs scalar swings from run to run (e.g. LFM-2048 M=1 measured +29% / +36% / +71% across three independent runs; the table shows the canonical-run value). The direction is consistent (NF4 is always neutral-to-slower at M=1, never faster), but treat the exact M=1 small-shape percentages as having wide error bars. All M≥2 cells and all Llama cells are stable (small σ) and trustworthy.

The story in three lines

Decode (the regime that matters): NF4 is +30–60% slower than scalar int4 at M≥2 on every shape; only ties at M=1 on big shapes.
Prefill: roughly even (±2%) at large tiles (M≥48) and tiny M; +6–13% slower in the mid range (M=16–32).
Why (RESOLVED): the cost is intrinsic to NF4's codebook dequant, not the kernel.
  Decisive test: NF4 codebook dequant was grafted into the exact tuned QmvFast skeleton (identical tiling/occupancy/loop as the scalar path; bit-exact vs Nf4QmvConstant, worst_rel = 0.000e0, independently re-verified). The gap did not close: nf4-grafted ≈ nf4-const at every M≥2 (e.g. LFM-2048 M=4: +49.7% vs +49.0%; Llama-MLPup M=4: +57.4% vs +56.8%), both still +45–60% over scalar. So it is not an unoptimized-kernel artifact, and the codebook access pattern is only a minor lever (see the refinement below: it helps but doesn't close the gap). The 16-entry codebook lookup is fundamentally more expensive than the int4 mantissa trick on Apple M4, largely independent of kernel structure.
  Refinement: the codebook address space is a real but minor lever. A 16-entry codebook in threadgroup memory (nf4-tg, bit-exact vs const) avoids the constant-cache divergent-load serialization and beats nf4-const by ~5–13pp at M≥2 (e.g. LFM-2048 M=2: +36% vs +44%; Llama-MLPup M=4: +50% vs +62%), making nf4-tg the best NF4 dequant, but it still does not reach scalar (best M≥2 ≈ +22% slower).
  Other zero-memory routes were also tested and are far worse: a register simd_shuffle codebook is +250–300% vs scalar at M≥2 (per-weight simd_shuffle is expensive on M4; size-invariant 8≈16≈32), and a full-replication in-register switch select chain (nf4-select) is catastrophic at +670–1376% (the per-nibble select serializes into a long dependent chain).
  Design space exhausted (6 mechanisms). Final ranking @ M≥2: awq-lut256 ≤ scalar < nf4-tg (+22–59%) < nf4-const ≈ nf4-grafted (+26–62%) < nf4-byte256 (+48–93%) ≪ nf4-shuffle (+200–320%) ≪ nf4-select (+670–1376%).
  No kernel-level change makes NF4 competitive with scalar int4 at decode on Apple M4; nf4-tg narrows the gap but does not close it. NF4 is an accuracy-only format here. If NF4 must be used, ship the threadgroup-codebook (Nf4QmvTg) variant.
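To make the contrast between the two dequant paths concrete, here is a small Python sketch. It assumes the PR uses the standard 16-entry NF4 codebook from the QLoRA / bitsandbytes reference; the PR does not list its codebook values. The function names are hypothetical.

```python
# Assumption: standard NF4 codebook values (QLoRA / bitsandbytes reference).
NF4_CODEBOOK = [
    -1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453,
    -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0,
    0.07958029955625534, 0.16093020141124725, 0.24611230194568634,
    0.33791524171829224, 0.44070982933044434, 0.5626170039176941,
    0.7229568362236023, 1.0,
]


def dequant_nf4(q: int, scale: float) -> float:
    """NF4: the 4-bit code is an index into a table, so every weight costs a lookup."""
    return NF4_CODEBOOK[q & 0xF] * scale


def dequant_int4(q: int, scale: float, zero_point: float) -> float:
    """Scalar int4: an affine map of the code, computable with arithmetic alone."""
    return (float(q & 0xF) - zero_point) * scale


def is_uniform(levels: list[float], tol: float = 1e-6) -> bool:
    """True if consecutive levels are equally spaced (i.e. an affine map could produce them)."""
    steps = [b - a for a, b in zip(levels, levels[1:])]
    return max(steps) - min(steps) < tol


if __name__ == "__main__":
    print(is_uniform(NF4_CODEBOOK))                   # NF4 levels
    print(is_uniform([float(i) for i in range(16)]))  # int4 levels
```

Because the NF4 levels are not equally spaced, no single shift, OR, and subtract can generate them, which is why every NF4 variant tested here needs some form of table read or selection.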

Adds a third dequant mode (use_nf4 SPECIALIZE) to QuantizedMatmulQmvFast: 16-entry NF4 codebook lookup, scale-only (no zero-points/bias), inside the exact production QmvFast tiling/occupancy/loop. Precedence: use_nf4 -> codebook path, else existing use_lut branch. Wired through the CPU kernel signature (additive trailing bool), all existing call sites (default false), and the qmv_lut_bench dispatcher (new nf4-grafted column). Adds nf4_graft_test correctness gate: grafted-NF4 is bit-exact vs Nf4QmvConstant. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

CC-Yeh added 2 commits May 8, 2026 10:47

lut experiments

41f0690

use threadgroup

66b6629

CC-Yeh added 2 commits May 12, 2026 20:45

support qmm lut

fe6e428

NF4 experiments

32ce008

CC-Yeh added 6 commits May 15, 2026 23:08

NF4 E4M3-scale variant (Nf4QmvE4m3/Nf4QmmE4m3) + CPU ref + test

2ed7ac0

CC-Yeh and others added 3 commits May 16, 2026 03:12

Add Nf4QmvShuffle (register shuffle codebook, S=8/16/32) + correctnes…

1b33506

…s + bench Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add Nf4QmvSelect + re-bench nf4-tg: closing NF4 dequant sweep

3f3f56c

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

CC-Yeh wants to merge 13 commits into main from qmv_lut
