fix(inference): wide f16 GEMV kernel restores 160 tok/s decode throughput #151
Merged
The lm_head dispatch for Q8 models was using gemv_decode_m1 (NR=1, one threadgroup per vocab row), which created 151,936 threadgroups of 256 threads each: 4× more shader invocations than the prior Q8 path. This was introduced in commit 4dab27e, which correctly switched to f16 weights for PPL quality but used an inefficient kernel for the large-N matmul.

Add gemv_decode_wide_f16: an NR=4 f16 GEMV kernel (same structure as the existing gemv_q8_decode_wide) that processes 4 output rows per threadgroup, reducing lm_head dispatch from 151,936 to 37,984 threadgroups. Same f16 weights, same f32 accumulation: zero quality regression.

bench_decode_ab (Qwen3.5-0.8B Q8, slope method):
  Before: T1=291ms, ~133 tok/s
  After: T1=127ms, ~160 tok/s

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
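The dispatch arithmetic in this commit message can be checked with a small host-side sketch. It is plain Python, not the Metal kernel: `threadgroups_for` and `gemv_tiled` are hypothetical names, and the sketch assumes the grid size is simply ceil(rows / NR). The tiled loop only illustrates that grouping NR output rows per threadgroup changes the number of groups, not the per-row result.

```python
VOCAB_ROWS = 151_936  # lm_head output rows, taken from the commit message


def threadgroups_for(n_rows: int, rows_per_group: int) -> int:
    """Threadgroups needed when each one covers `rows_per_group` output rows."""
    return (n_rows + rows_per_group - 1) // rows_per_group  # integer ceil


def gemv_tiled(weights, x, rows_per_group):
    """Reference GEMV that walks the output rows in tiles.

    Each iteration of the outer loop stands in for one threadgroup; the
    inner loop covers the (up to) `rows_per_group` rows assigned to it.
    """
    n_rows = len(weights)
    out = [0.0] * n_rows
    for group in range(threadgroups_for(n_rows, rows_per_group)):
        first = group * rows_per_group
        for row in range(first, min(first + rows_per_group, n_rows)):
            out[row] = sum(w * xi for w, xi in zip(weights[row], x))
    return out


if __name__ == "__main__":
    print(threadgroups_for(VOCAB_ROWS, 1))  # NR=1 dispatch size
    print(threadgroups_for(VOCAB_ROWS, 4))  # NR=4 dispatch size
```

With NR=1 this yields 151,936 groups and with NR=4 it yields 37,984, matching the figures quoted above.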
This was referenced May 31, 2026
ppl_metal: GPU-accelerated perplexity evaluation via Metal. Uses the same forward path as decode (including the wide f16 lm_head kernel). Configurable via PPL_TOKENS and CORPUS env vars.

pyproject.toml: tracks common Python dev dependencies (pyarrow, datasets, mlx, numpy, matplotlib) so scripts/ and one-shot comparisons work without ad-hoc installs.

Verified: Lattice PPL=20.60 vs MLX PPL=20.67 on wikitext-2 (2048 tokens, window=512, stride=256). Parity confirmed, no quality regression.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
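For readers unfamiliar with windowed perplexity, here is a rough sketch of how a 2048-token run with window=512 and stride=256 could be laid out. The scoring scheme is an assumption (the common strided approach in which later windows only score their final `stride` tokens); the commit message does not say how ppl_metal treats overlapping tokens, and all function names are hypothetical.

```python
import math


def window_starts(n_tokens: int, window: int, stride: int) -> list[int]:
    """Start offsets of every full window, advancing by `stride` tokens."""
    if n_tokens < window:
        raise ValueError("need at least one full window of tokens")
    return list(range(0, n_tokens - window + 1, stride))


def scored_spans(n_tokens: int, window: int, stride: int) -> list[tuple[int, int]]:
    """(begin, end) token ranges whose loss counts toward the average.

    Assumption: the first window scores all of its tokens; each later
    window scores only its last `stride` tokens and uses the rest as context.
    """
    spans = []
    for start in window_starts(n_tokens, window, stride):
        end = start + window
        begin = start if start == 0 else end - stride
        spans.append((begin, end))
    return spans


def perplexity(token_nlls: list[float]) -> float:
    """exp of the mean per-token negative log-likelihood (natural log)."""
    return math.exp(sum(token_nlls) / len(token_nlls))


def relative_gap(a: float, b: float) -> float:
    """Relative difference of `a` from the reference value `b`."""
    return abs(a - b) / b


if __name__ == "__main__":
    spans = scored_spans(2048, 512, 256)
    print(len(spans), sum(end - begin for begin, end in spans))
    print(f"{relative_gap(20.60, 20.67):.2%}")  # gap between the two reported PPLs
```

Under these assumptions the run uses 7 windows and scores each of the 2048 tokens exactly once; the two reported perplexities differ by roughly 0.3%.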
Perf regression report (ADR-058)
| Bench | Δ point | 95% CI | new ns | base ns | verdict |
|---|---|---|---|---|---|
| simd_query_batch_dot_product/pair_loop/768d_256c | +9.43% | [+9.16%, +9.69%] | 21794.3 | 21794.3 | ❌ FAIL |
| simd_batch_cosine_normalized_query/pair_loop_dot/768d_16c | +8.89% | [+8.25%, +9.52%] | 1067.7 | 1067.7 | ❌ FAIL |
| simd_query_batch_dot_product/simd_batch/768d_256c | +8.68% | [+8.18%, +9.19%] | 18169.6 | 18169.6 | ❌ FAIL |
| simd_batch_cosine_normalized_query/pair_loop_dot/768d_256c | +7.44% | [+6.91%, +7.97%] | 21801.5 | 21801.5 | ⚠ WARN |
| simd_batch_cosine_normalized_query/pair_loop_cosine/768d_256c | +5.85% | [+5.69%, +6.01%] | 29137.6 | 29137.6 | ⚠ WARN |
| simd_normalize/simd/384 | +7.13% | [+5.61%, +8.59%] | 72.4 | 72.4 | ⚠ WARN |
| simd_query_batch_dot_product/simd_batch/384d_256c | +5.54% | [+5.49%, +5.59%] | 9048.0 | 9048.0 | ⚠ WARN |
| simd_query_batch_dot_product/pair_loop/384d_256c | +5.48% | [+5.40%, +5.56%] | 10143.6 | 10143.6 | ⚠ WARN |
| simd_normalize/simd/768 | +6.43% | [+5.39%, +7.44%] | 123.9 | 123.9 | ⚠ WARN |
| simd_batch_cosine_non_normalized_query/pair_loop/768d_256c | +5.54% | [+5.33%, +5.74%] | 29062.8 | 29062.8 | ⚠ WARN |
| simd_batch_cosine_non_normalized_query/simd_batch/768d_256c | +5.35% | [+5.15%, +5.56%] | 29141.1 | 29141.1 | ⚠ WARN |
| simd_batch_cosine_normalized_query/simd_batch/768d_256c | +5.28% | [+5.06%, +5.51%] | 28787.3 | 28787.3 | ⚠ WARN |
| simd_query_batch_dot_product/simd_batch/768d_16c | +5.06% | [+5.02%, +5.09%] | 706.5 | 706.5 | ⚠ WARN |
| simd_normalized_cosine_fast_path/dot_product/768 | +5.56% | [+4.89%, +6.22%] | 59.9 | 59.9 | ⚠ WARN |
| simd_batch_cosine_normalized_query/pair_loop_dot/768d_64c | +4.29% | [+4.04%, +4.55%] | 4829.0 | 4829.0 | ⚠ WARN |
| simd_normalize/simd/1536 | +5.22% | [+4.03%, +6.41%] | 230.5 | 230.5 | ⚠ WARN |
| simd_query_batch_dot_product/pair_loop/128d_256c | +4.05% | [+4.02%, +4.08%] | 3918.1 | 3918.1 | ⚠ WARN |
| simd_batch_cosine_normalized_query/pair_loop_dot/768d_4c | +3.88% | [+3.76%, +4.00%] | 241.2 | 241.2 | ⚠ WARN |
| memory_size/search_1000_int8 | +3.63% | [+3.60%, +3.65%] | 15468.9 | 15468.9 | ⚠ WARN |
| simd_prepared_query_normalized_cosine/dot_product_loop/384 | +3.44% | [+3.40%, +3.48%] | 39445.0 | 39445.0 | ⚠ WARN |
| simd_query_batch_dot_product/pair_loop/768d_64c | +3.50% | [+3.24%, +3.76%] | 4772.6 | 4772.6 | ⚠ WARN |
| simd_batch_cosine_normalized_query/pair_loop_dot/1024d_16c | -3.02% | [-3.15%, -2.88%] | 1424.7 | 1424.7 | 🚀 WIN |
| simd_batch_cosine_non_normalized_query/pair_loop/1024d_256c | -3.03% | [-3.30%, -2.76%] | 35630.4 | 35630.4 | 🚀 WIN |
| simd_dot_product/simd/768 | -3.55% | [-3.75%, -3.34%] | 55.2 | 55.2 | 🚀 WIN |
| simd_normalized_cosine_fast_path/dot_product/384 | -3.70% | [-3.79%, -3.61%] | 30.3 | 30.3 | 🚀 WIN |
| simd_batch_cosine_normalized_query/pair_loop_dot/384d_16c | -3.96% | [-4.06%, -3.87%] | 493.8 | 493.8 | 🚀 WIN |
| simd_batch_dot_product/simd_batch/10 | -5.47% | [-5.51%, -5.43%] | 316.0 | 316.0 | 🚀 WIN |
| simd_batch_cosine_normalized_query/pair_loop_dot/1024d_256c | -6.22% | [-6.58%, -5.85%] | 26405.5 | 26405.5 | 🚀 WIN |
| simd_query_batch_dot_product/pair_loop/768d_16c | -7.04% | [-7.08%, -7.00%] | 939.4 | 939.4 | 🚀 WIN |
| simd_throughput_384/normalize | -8.36% | [-8.37%, -8.34%] | 113.9 | 113.9 | 🚀 WIN |
| simd_batch_dot_product/simd_batch/1000 | -19.92% | [-20.27%, -19.56%] | 73334.6 | 73334.6 | 🚀 WIN |
All 259 measurements
| Bench | Δ point | CI-lower | CI-upper |
|---|---|---|---|
add_bias_gelu/4096 |
+0.09% | +0.06% | +0.11% |
add_bias_gelu/896 |
+0.01% | -0.02% | +0.03% |
binary_cosine_distance/binary/1024 |
+2.71% | +2.59% | +2.83% |
binary_cosine_distance/binary/1536 |
+0.39% | +0.33% | +0.45% |
binary_cosine_distance/binary/384 |
+0.06% | +0.02% | +0.11% |
binary_cosine_distance/binary/768 |
+0.39% | +0.37% | +0.41% |
binary_cosine_distance/float32_simd/1024 |
+0.15% | +0.08% | +0.22% |
binary_cosine_distance/float32_simd/1536 |
-0.01% | -0.04% | +0.01% |
binary_cosine_distance/float32_simd/384 |
-0.03% | -0.06% | -0.01% |
binary_cosine_distance/float32_simd/768 |
+0.48% | +0.46% | +0.50% |
elementwise_mul/4096 |
+1.18% | +1.11% | +1.24% |
gelu/4096 |
-0.01% | -0.04% | +0.01% |
gelu/896 |
+0.03% | +0.01% | +0.06% |
int4_cosine_distance/float32_simd/1024 |
+0.34% | +0.14% | +0.53% |
int4_cosine_distance/float32_simd/1536 |
+0.00% | -0.02% | +0.03% |
int4_cosine_distance/float32_simd/384 |
-0.00% | -0.05% | +0.04% |
int4_cosine_distance/float32_simd/768 |
+0.08% | -0.00% | +0.16% |
int4_cosine_distance/int4/1024 |
-0.02% | -0.12% | +0.07% |
int4_cosine_distance/int4/1536 |
+0.38% | +0.24% | +0.52% |
int4_cosine_distance/int4/384 |
-0.02% | -0.22% | +0.18% |
int4_cosine_distance/int4/768 |
+0.22% | +0.08% | +0.36% |
int8_batch_cosine/float32_simd/10 |
-0.05% | -0.06% | -0.03% |
int8_batch_cosine/float32_simd/100 |
+0.28% | +0.26% | +0.30% |
int8_batch_cosine/float32_simd/1000 |
-0.70% | -0.78% | -0.62% |
int8_batch_cosine/int8_loop/10 |
+0.07% | -0.01% | +0.16% |
int8_batch_cosine/int8_loop/100 |
-0.61% | -0.65% | -0.57% |
int8_batch_cosine/int8_loop/1000 |
-2.66% | -2.99% | -2.34% |
int8_prepared_dot_product/per_call/1024 |
+0.09% | +0.02% | +0.16% |
int8_prepared_dot_product/per_call/127 |
+0.13% | +0.11% | +0.14% |
int8_prepared_dot_product/per_call/128 |
+0.47% | +0.32% | +0.62% |
int8_prepared_dot_product/per_call/129 |
+0.17% | +0.16% | +0.19% |
int8_prepared_dot_product/per_call/384 |
+0.04% | +0.03% | +0.04% |
int8_prepared_dot_product/per_call/768 |
+0.12% | +0.02% | +0.21% |
int8_prepared_dot_product/prepared/1024 |
+0.89% | +0.70% | +1.08% |
int8_prepared_dot_product/prepared/127 |
+0.26% | +0.23% | +0.29% |
int8_prepared_dot_product/prepared/128 |
+0.15% | -0.04% | +0.34% |
int8_prepared_dot_product/prepared/129 |
-0.20% | -0.28% | -0.12% |
int8_prepared_dot_product/prepared/384 |
-1.20% | -1.25% | -1.15% |
int8_prepared_dot_product/prepared/768 |
+0.29% | +0.21% | +0.36% |
int8_quantization/quantize/1024 |
+0.02% | +0.01% | +0.03% |
int8_quantization/quantize/1536 |
-0.43% | -0.44% | -0.42% |
int8_quantization/quantize/384 |
+0.00% | -0.01% | +0.01% |
int8_quantization/quantize/768 |
+0.02% | +0.01% | +0.03% |
int8_raw_dot_product/dot_product_i8/1024 |
+0.28% | +0.25% | +0.32% |
int8_raw_dot_product/dot_product_i8/127 |
-0.03% | -0.07% | +0.01% |
int8_raw_dot_product/dot_product_i8/128 |
+0.71% | +0.35% | +1.08% |
int8_raw_dot_product/dot_product_i8/129 |
-0.37% | -0.41% | -0.34% |
int8_raw_dot_product/dot_product_i8/384 |
-1.00% | -1.10% | -0.90% |
int8_raw_dot_product/dot_product_i8/768 |
-1.33% | -1.44% | -1.21% |
int8_raw_dot_product/dot_product_i8_raw/1024 |
+0.03% | -0.00% | +0.06% |
int8_raw_dot_product/dot_product_i8_raw/127 |
-0.27% | -0.35% | -0.19% |
int8_raw_dot_product/dot_product_i8_raw/128 |
-0.23% | -0.29% | -0.18% |
int8_raw_dot_product/dot_product_i8_raw/129 |
+0.10% | +0.00% | +0.20% |
int8_raw_dot_product/dot_product_i8_raw/384 |
-0.44% | -0.47% | -0.41% |
int8_raw_dot_product/dot_product_i8_raw/768 |
-0.32% | -0.38% | -0.25% |
int8_vs_float32_cosine/float32_simd/1024 |
-0.02% | -0.08% | +0.03% |
int8_vs_float32_cosine/float32_simd/1536 |
+0.11% | +0.09% | +0.12% |
int8_vs_float32_cosine/float32_simd/384 |
+0.69% | +0.53% | +0.85% |
int8_vs_float32_cosine/float32_simd/768 |
-0.06% | -0.11% | -0.02% |
int8_vs_float32_cosine/int8/1024 |
-0.06% | -0.15% | +0.02% |
int8_vs_float32_cosine/int8/1536 |
+0.63% | +0.56% | +0.70% |
int8_vs_float32_cosine/int8/384 |
+0.19% | +0.10% | +0.28% |
int8_vs_float32_cosine/int8/768 |
+0.26% | +0.19% | +0.32% |
layer_norm/4096 |
-0.77% | -0.81% | -0.74% |
layer_norm/896 |
-0.16% | -0.21% | -0.11% |
memory_size/search_1000_float32 |
+0.55% | +0.47% | +0.63% |
memory_size/search_1000_int8 |
+3.63% | +3.60% | +3.65% |
rms_norm/4096 |
-1.10% | -1.20% | -1.00% |
rms_norm/896 |
+0.09% | -0.03% | +0.20% |
silu_inplace/4096 |
-0.02% | -0.04% | -0.00% |
silu_inplace/896 |
-0.01% | -0.03% | +0.01% |
simd_batch_cosine/scalar_loop/10 |
+0.08% | +0.02% | +0.13% |
simd_batch_cosine/scalar_loop/100 |
-0.09% | -0.18% | -0.00% |
simd_batch_cosine/scalar_loop/1000 |
-0.72% | -0.85% | -0.59% |
simd_batch_cosine/simd_batch/10 |
+0.22% | +0.16% | +0.27% |
simd_batch_cosine/simd_batch/100 |
-1.88% | -1.94% | -1.83% |
simd_batch_cosine/simd_batch/1000 |
-0.71% | -1.17% | -0.26% |
simd_batch_cosine_non_normalized_query/pair_loop/1024d_1000c |
+0.92% | +0.75% | +1.07% |
simd_batch_cosine_non_normalized_query/pair_loop/1024d_16c |
-0.70% | -0.76% | -0.65% |
simd_batch_cosine_non_normalized_query/pair_loop/1024d_256c |
-3.03% | -3.30% | -2.76% |
simd_batch_cosine_non_normalized_query/pair_loop/1024d_4c |
+0.08% | +0.05% | +0.12% |
simd_batch_cosine_non_normalized_query/pair_loop/1024d_64c |
+0.02% | -0.06% | +0.10% |
simd_batch_cosine_non_normalized_query/pair_loop/384d_1000c |
+0.46% | +0.35% | +0.56% |
simd_batch_cosine_non_normalized_query/pair_loop/384d_16c |
+0.44% | +0.30% | +0.57% |
simd_batch_cosine_non_normalized_query/pair_loop/384d_256c |
-0.11% | -0.15% | -0.08% |
simd_batch_cosine_non_normalized_query/pair_loop/384d_4c |
+0.10% | +0.07% | +0.13% |
simd_batch_cosine_non_normalized_query/pair_loop/384d_64c |
-0.66% | -0.69% | -0.64% |
simd_batch_cosine_non_normalized_query/pair_loop/768d_1000c |
+1.44% | +1.20% | +1.67% |
simd_batch_cosine_non_normalized_query/pair_loop/768d_16c |
+1.45% | +1.34% | +1.55% |
simd_batch_cosine_non_normalized_query/pair_loop/768d_256c |
+5.54% | +5.33% | +5.74% |
simd_batch_cosine_non_normalized_query/pair_loop/768d_4c |
+0.03% | +0.02% | +0.04% |
simd_batch_cosine_non_normalized_query/pair_loop/768d_64c |
-0.43% | -0.46% | -0.40% |
simd_batch_cosine_non_normalized_query/simd_batch/1024d_1000c |
+0.73% | +0.68% | +0.77% |
simd_batch_cosine_non_normalized_query/simd_batch/1024d_16c |
-0.83% | -0.95% | -0.71% |
simd_batch_cosine_non_normalized_query/simd_batch/1024d_256c |
+2.99% | +2.21% | +3.79% |
simd_batch_cosine_non_normalized_query/simd_batch/1024d_4c |
+0.18% | +0.02% | +0.33% |
simd_batch_cosine_non_normalized_query/simd_batch/1024d_64c |
+0.02% | -0.01% | +0.05% |
simd_batch_cosine_non_normalized_query/simd_batch/384d_1000c |
+0.12% | +0.07% | +0.17% |
simd_batch_cosine_non_normalized_query/simd_batch/384d_16c |
+0.29% | +0.27% | +0.31% |
simd_batch_cosine_non_normalized_query/simd_batch/384d_256c |
+0.02% | -0.02% | +0.06% |
simd_batch_cosine_non_normalized_query/simd_batch/384d_4c |
+0.32% | +0.30% | +0.34% |
simd_batch_cosine_non_normalized_query/simd_batch/384d_64c |
-0.50% | -0.60% | -0.41% |
simd_batch_cosine_non_normalized_query/simd_batch/768d_1000c |
+1.50% | +1.29% | +1.71% |
simd_batch_cosine_non_normalized_query/simd_batch/768d_16c |
+1.42% | +1.39% | +1.45% |
simd_batch_cosine_non_normalized_query/simd_batch/768d_256c |
+5.35% | +5.15% | +5.56% |
simd_batch_cosine_non_normalized_query/simd_batch/768d_4c |
+0.01% | -0.02% | +0.03% |
simd_batch_cosine_non_normalized_query/simd_batch/768d_64c |
-0.41% | -0.45% | -0.38% |
simd_batch_cosine_normalized_query/pair_loop_cosine/1024d_1000c |
+1.46% | +1.31% | +1.61% |
simd_batch_cosine_normalized_query/pair_loop_cosine/1024d_16c |
-0.42% | -0.46% | -0.38% |
simd_batch_cosine_normalized_query/pair_loop_cosine/1024d_256c |
-2.67% | -3.00% | -2.34% |
simd_batch_cosine_normalized_query/pair_loop_cosine/1024d_4c |
+0.12% | +0.11% | +0.14% |
simd_batch_cosine_normalized_query/pair_loop_cosine/1024d_64c |
+0.04% | +0.02% | +0.05% |
simd_batch_cosine_normalized_query/pair_loop_cosine/384d_1000c |
+1.40% | +1.12% | +1.69% |
simd_batch_cosine_normalized_query/pair_loop_cosine/384d_16c |
-0.54% | -0.59% | -0.49% |
simd_batch_cosine_normalized_query/pair_loop_cosine/384d_256c |
-1.60% | -1.66% | -1.54% |
simd_batch_cosine_normalized_query/pair_loop_cosine/384d_4c |
+0.29% | +0.15% | +0.44% |
simd_batch_cosine_normalized_query/pair_loop_cosine/384d_64c |
+0.23% | +0.20% | +0.25% |
simd_batch_cosine_normalized_query/pair_loop_cosine/768d_1000c |
+0.15% | -0.02% | +0.32% |
simd_batch_cosine_normalized_query/pair_loop_cosine/768d_16c |
+0.71% | +0.67% | +0.74% |
simd_batch_cosine_normalized_query/pair_loop_cosine/768d_256c |
+5.85% | +5.69% | +6.01% |
simd_batch_cosine_normalized_query/pair_loop_cosine/768d_4c |
+0.16% | +0.10% | +0.22% |
simd_batch_cosine_normalized_query/pair_loop_cosine/768d_64c |
+0.27% | +0.22% | +0.32% |
simd_batch_cosine_normalized_query/pair_loop_dot/1024d_1000c |
+0.77% | +0.58% | +0.96% |
simd_batch_cosine_normalized_query/pair_loop_dot/1024d_16c |
-3.02% | -3.15% | -2.88% |
simd_batch_cosine_normalized_query/pair_loop_dot/1024d_256c |
-6.22% | -6.58% | -5.85% |
simd_batch_cosine_normalized_query/pair_loop_dot/1024d_4c |
-0.47% | -0.56% | -0.39% |
simd_batch_cosine_normalized_query/pair_loop_dot/1024d_64c |
+0.65% | +0.61% | +0.69% |
simd_batch_cosine_normalized_query/pair_loop_dot/384d_1000c |
+0.46% | +0.13% | +0.79% |
simd_batch_cosine_normalized_query/pair_loop_dot/384d_16c |
-3.96% | -4.06% | -3.87% |
simd_batch_cosine_normalized_query/pair_loop_dot/384d_256c |
-1.67% | -1.73% | -1.61% |
simd_batch_cosine_normalized_query/pair_loop_dot/384d_4c |
+3.02% | +2.02% | +4.03% |
simd_batch_cosine_normalized_query/pair_loop_dot/384d_64c |
+1.04% | +1.03% | +1.06% |
simd_batch_cosine_normalized_query/pair_loop_dot/768d_1000c |
-0.28% | -0.42% | -0.15% |
simd_batch_cosine_normalized_query/pair_loop_dot/768d_16c |
+8.89% | +8.25% | +9.52% |
simd_batch_cosine_normalized_query/pair_loop_dot/768d_256c |
+7.44% | +6.91% | +7.97% |
simd_batch_cosine_normalized_query/pair_loop_dot/768d_4c |
+3.88% | +3.76% | +4.00% |
simd_batch_cosine_normalized_query/pair_loop_dot/768d_64c |
+4.29% | +4.04% | +4.55% |
simd_batch_cosine_normalized_query/simd_batch/1024d_1000c |
+1.09% | +0.92% | +1.26% |
simd_batch_cosine_normalized_query/simd_batch/1024d_16c |
-0.49% | -0.50% | -0.48% |
simd_batch_cosine_normalized_query/simd_batch/1024d_256c |
-0.04% | -0.68% | +0.61% |
simd_batch_cosine_normalized_query/simd_batch/1024d_4c |
+0.24% | +0.22% | +0.26% |
simd_batch_cosine_normalized_query/simd_batch/1024d_64c |
+0.31% | +0.25% | +0.37% |
simd_batch_cosine_normalized_query/simd_batch/384d_1000c |
+0.82% | +0.64% | +1.00% |
simd_batch_cosine_normalized_query/simd_batch/384d_16c |
-1.00% | -1.01% | -0.99% |
simd_batch_cosine_normalized_query/simd_batch/384d_256c |
-1.70% | -1.75% | -1.64% |
simd_batch_cosine_normalized_query/simd_batch/384d_4c |
+0.22% | +0.17% | +0.28% |
simd_batch_cosine_normalized_query/simd_batch/384d_64c |
+0.48% | +0.47% | +0.48% |
simd_batch_cosine_normalized_query/simd_batch/768d_1000c |
+0.30% | +0.24% | +0.36% |
simd_batch_cosine_normalized_query/simd_batch/768d_16c |
+1.03% | +0.99% | +1.07% |
simd_batch_cosine_normalized_query/simd_batch/768d_256c |
+5.28% | +5.06% | +5.51% |
simd_batch_cosine_normalized_query/simd_batch/768d_4c |
+0.00% | -0.08% | +0.05% |
simd_batch_cosine_normalized_query/simd_batch/768d_64c |
+0.47% | +0.46% | +0.49% |
simd_batch_dot_product/scalar_loop/10 |
+0.03% | +0.02% | +0.05% |
simd_batch_dot_product/scalar_loop/100 |
+0.20% | +0.13% | +0.27% |
simd_batch_dot_product/scalar_loop/1000 |
-1.53% | -1.57% | -1.49% |
simd_batch_dot_product/simd_batch/10 |
-5.47% | -5.51% | -5.43% |
simd_batch_dot_product/simd_batch/100 |
+1.59% | +1.56% | +1.61% |
simd_batch_dot_product/simd_batch/1000 |
-19.92% | -20.27% | -19.56% |
simd_cosine_similarity/scalar/1024 |
-0.07% | -0.09% | -0.04% |
simd_cosine_similarity/scalar/1536 |
-0.01% | -0.05% | +0.02% |
simd_cosine_similarity/scalar/384 |
+0.03% | -0.11% | +0.17% |
simd_cosine_similarity/scalar/768 |
+0.34% | +0.31% | +0.38% |
simd_cosine_similarity/simd/1024 |
+0.42% | +0.34% | +0.50% |
simd_cosine_similarity/simd/1536 |
-0.21% | -0.24% | -0.18% |
simd_cosine_similarity/simd/384 |
-0.22% | -0.48% | +0.03% |
simd_cosine_similarity/simd/768 |
+0.90% | +0.88% | +0.92% |
simd_dot_product/scalar/1024 |
+0.00% | -0.01% | +0.02% |
simd_dot_product/scalar/1536 |
+0.07% | +0.06% | +0.08% |
simd_dot_product/scalar/384 |
+0.00% | -0.04% | +0.05% |
simd_dot_product/scalar/768 |
+0.16% | +0.10% | +0.21% |
simd_dot_product/simd/1024 |
-1.40% | -1.52% | -1.28% |
simd_dot_product/simd/1536 |
+0.40% | +0.36% | +0.45% |
simd_dot_product/simd/384 |
-0.16% | -0.21% | -0.10% |
simd_dot_product/simd/768 |
-3.55% | -3.75% | -3.34% |
simd_euclidean_distance/scalar/1024 |
-0.33% | -0.55% | -0.10% |
simd_euclidean_distance/scalar/1536 |
+0.02% | -0.05% | +0.08% |
simd_euclidean_distance/scalar/384 |
+0.03% | -0.30% | +0.36% |
simd_euclidean_distance/scalar/768 |
-0.11% | -0.22% | -0.02% |
simd_euclidean_distance/simd/1024 |
+0.09% | +0.05% | +0.12% |
simd_euclidean_distance/simd/1536 |
+0.02% | -0.00% | +0.05% |
simd_euclidean_distance/simd/384 |
+0.23% | +0.20% | +0.26% |
simd_euclidean_distance/simd/768 |
+0.28% | +0.24% | +0.32% |
simd_normalize/scalar/1024 |
+1.08% | +0.82% | +1.34% |
simd_normalize/scalar/1536 |
+1.07% | +0.91% | +1.22% |
simd_normalize/scalar/384 |
+1.84% | +1.46% | +2.21% |
simd_normalize/scalar/768 |
+1.15% | +1.01% | +1.27% |
simd_normalize/simd/1024 |
+4.14% | +2.80% | +5.44% |
simd_normalize/simd/1536 |
+5.22% | +4.03% | +6.41% |
simd_normalize/simd/384 |
+7.13% | +5.61% | +8.59% |
simd_normalize/simd/768 |
+6.43% | +5.39% | +7.44% |
simd_normalized_cosine_fast_path/cosine_full/1024 |
+0.47% | +0.43% | +0.51% |
simd_normalized_cosine_fast_path/cosine_full/384 |
-0.29% | -0.43% | -0.14% |
simd_normalized_cosine_fast_path/cosine_full/768 |
+0.96% | +0.91% | +1.00% |
simd_normalized_cosine_fast_path/dot_product/1024 |
-0.46% | -0.54% | -0.38% |
simd_normalized_cosine_fast_path/dot_product/384 |
-3.70% | -3.79% | -3.61% |
simd_normalized_cosine_fast_path/dot_product/768 |
+5.56% | +4.89% | +6.22% |
simd_prepared_query_normalized_cosine/dot_product_loop/1024 |
+1.78% | +1.56% | +2.00% |
simd_prepared_query_normalized_cosine/dot_product_loop/384 |
+3.44% | +3.40% | +3.48% |
simd_prepared_query_normalized_cosine/dot_product_loop/768 |
-0.57% | -0.75% | -0.40% |
simd_prepared_query_normalized_cosine/prepared_full_cosine/1024 |
-0.53% | -0.70% | -0.37% |
simd_prepared_query_normalized_cosine/prepared_full_cosine/384 |
+0.69% | +0.64% | +0.74% |
simd_prepared_query_normalized_cosine/prepared_full_cosine/768 |
+0.72% | +0.61% | +0.83% |
simd_prepared_query_normalized_cosine/prepared_meta_unit/1024 |
+0.54% | +0.32% | +0.77% |
simd_prepared_query_normalized_cosine/prepared_meta_unit/384 |
+1.85% | +1.64% | +2.05% |
simd_prepared_query_normalized_cosine/prepared_meta_unit/768 |
-0.94% | -1.38% | -0.50% |
simd_query_batch_dot_product/pair_loop/128d_16c |
+0.53% | +0.45% | +0.61% |
simd_query_batch_dot_product/pair_loop/128d_256c |
+4.05% | +4.02% | +4.08% |
simd_query_batch_dot_product/pair_loop/128d_4c |
+1.48% | +1.35% | +1.61% |
simd_query_batch_dot_product/pair_loop/128d_64c |
-0.28% | -0.32% | -0.23% |
simd_query_batch_dot_product/pair_loop/384d_16c |
-1.87% | -2.12% | -1.63% |
simd_query_batch_dot_product/pair_loop/384d_256c |
+5.48% | +5.40% | +5.56% |
simd_query_batch_dot_product/pair_loop/384d_4c |
+1.12% | +0.97% | +1.27% |
simd_query_batch_dot_product/pair_loop/384d_64c |
+1.30% | +1.28% | +1.33% |
simd_query_batch_dot_product/pair_loop/768d_16c |
-7.04% | -7.08% | -7.00% |
simd_query_batch_dot_product/pair_loop/768d_256c |
+9.43% | +9.16% | +9.69% |
simd_query_batch_dot_product/pair_loop/768d_4c |
+1.80% | +1.57% | +2.03% |
simd_query_batch_dot_product/pair_loop/768d_64c |
+3.50% | +3.24% | +3.76% |
simd_query_batch_dot_product/simd_batch/128d_16c |
+1.09% | +1.00% | +1.17% |
simd_query_batch_dot_product/simd_batch/128d_256c |
+1.71% | +1.68% | +1.73% |
simd_query_batch_dot_product/simd_batch/128d_4c |
-0.37% | -0.45% | -0.30% |
simd_query_batch_dot_product/simd_batch/128d_64c |
+0.76% | +0.74% | +0.78% |
simd_query_batch_dot_product/simd_batch/384d_16c |
+0.07% | +0.03% | +0.11% |
simd_query_batch_dot_product/simd_batch/384d_256c |
+5.54% | +5.49% | +5.59% |
simd_query_batch_dot_product/simd_batch/384d_4c |
+0.80% | +0.59% | +1.00% |
simd_query_batch_dot_product/simd_batch/384d_64c |
-1.31% | -1.75% | -0.86% |
simd_query_batch_dot_product/simd_batch/768d_16c |
+5.06% | +5.02% | +5.09% |
simd_query_batch_dot_product/simd_batch/768d_256c |
+8.68% | +8.18% | +9.19% |
simd_query_batch_dot_product/simd_batch/768d_4c |
-0.27% | -0.30% | -0.25% |
simd_query_batch_dot_product/simd_batch/768d_64c |
-0.69% | -0.89% | -0.49% |
simd_squared_euclidean_fast_path/euclidean_full/1024 |
+0.14% | +0.05% | +0.24% |
simd_squared_euclidean_fast_path/euclidean_full/384 |
+0.32% | +0.27% | +0.38% |
simd_squared_euclidean_fast_path/euclidean_full/768 |
-0.27% | -0.32% | -0.22% |
simd_squared_euclidean_fast_path/squared_euclidean/1024 |
+0.09% | +0.07% | +0.10% |
simd_squared_euclidean_fast_path/squared_euclidean/384 |
+0.27% | +0.23% | +0.31% |
simd_squared_euclidean_fast_path/squared_euclidean/768 |
-0.62% | -0.65% | -0.58% |
simd_throughput_384/cosine_similarity |
-0.21% | -0.30% | -0.13% |
simd_throughput_384/dot_product |
-2.50% | -2.54% | -2.46% |
simd_throughput_384/euclidean_distance |
-0.27% | -0.30% | -0.25% |
simd_throughput_384/normalize |
-8.36% | -8.37% | -8.34% |
softmax_attention/128 |
-0.08% | -0.09% | -0.06% |
softmax_attention/512 |
-1.19% | -1.25% | -1.12% |
tier_prepared_batch_sizes/int4_batch_prepared/10 |
+0.28% | +0.21% | +0.35% |
tier_prepared_batch_sizes/int4_batch_prepared/100 |
-0.14% | -0.23% | -0.07% |
tier_prepared_batch_sizes/int4_batch_prepared/1000 |
-0.05% | -0.31% | +0.20% |
tier_prepared_batch_sizes/int4_query_per_call/10 |
+0.74% | +0.71% | +0.77% |
tier_prepared_batch_sizes/int4_query_per_call/100 |
+0.61% | +0.57% | +0.64% |
tier_prepared_batch_sizes/int4_query_per_call/1000 |
+0.57% | +0.55% | +0.59% |
tier_prepared_batch_sizes/int8_batch_prepared/10 |
-0.56% | -0.61% | -0.51% |
tier_prepared_batch_sizes/int8_batch_prepared/100 |
+0.57% | +0.53% | +0.62% |
tier_prepared_batch_sizes/int8_batch_prepared/1000 |
+2.37% | +2.13% | +2.61% |
tier_prepared_batch_sizes/int8_query_per_call/10 |
-0.02% | -0.03% | -0.00% |
tier_prepared_batch_sizes/int8_query_per_call/100 |
+0.01% | -0.01% | +0.02% |
tier_prepared_batch_sizes/int8_query_per_call/1000 |
+0.01% | -0.02% | +0.03% |
tier_prepared_query/binary_query_once_1000 |
-0.04% | -0.22% | +0.14% |
tier_prepared_query/binary_query_per_call_1000 |
-0.02% | -0.02% | -0.01% |
tier_prepared_query/int4_query_once_1000 |
+0.15% | +0.11% | +0.19% |
tier_prepared_query/int4_query_per_call_1000 |
+0.23% | +0.23% | +0.24% |
tier_prepared_query/int8_query_once_1000 |
+0.58% | +0.53% | +0.62% |
tier_prepared_query/int8_query_per_call_1000 |
-0.02% | -0.03% | -0.00% |
Rule: CI-lower of change ≤3.0% passes silently; (3.0%, 7.0%] warns; >7.0% fails. Override via PR label bench-allow-regression.
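As a worked reading of the rule above, the following sketch maps the lower bound of the 95% CI to a verdict. The function name and return strings are hypothetical, not the bot's actual code. The 🚀 WIN verdict and the bench-allow-regression override are deliberately left out, because the rule line does not state how either is evaluated.

```python
WARN_THRESHOLD_PCT = 3.0  # CI-lower above this warns
FAIL_THRESHOLD_PCT = 7.0  # CI-lower above this fails


def regression_verdict(ci_lower_pct: float) -> str:
    """Classify a benchmark by the lower bound of its 95% CI (in percent).

    Mirrors the stated rule: <=3.0% passes, (3.0%, 7.0%] warns, >7.0% fails.
    """
    if ci_lower_pct <= WARN_THRESHOLD_PCT:
        return "PASS"
    if ci_lower_pct <= FAIL_THRESHOLD_PCT:
        return "WARN"
    return "FAIL"


if __name__ == "__main__":
    # CI-lower values copied from rows of the first report.
    for name, ci_lower in [
        ("simd_query_batch_dot_product/pair_loop/768d_256c", 9.16),
        ("simd_batch_cosine_normalized_query/pair_loop_dot/768d_256c", 6.91),
        ("simd_batch_cosine_normalized_query/pair_loop_dot/384d_4c", 2.02),
    ]:
        print(f"{name}: {regression_verdict(ci_lower)}")
```

Note that the verdict depends on the CI lower bound rather than the point estimate, which is why a row such as simd_normalize/simd/384 (+7.13% point, +5.61% CI-lower) is a WARN and not a FAIL.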
x86_64-linux — perf regression report
❌ 25 FAIL (regression >7.0% confirmed by 95% CI)
⚠ 30 WARN (regression 3.0-7.0% confirmed)
🚀 18 confirmed improvement
| Bench | Δ point | 95% CI | new ns | base ns | verdict |
|---|---|---|---|---|---|
| simd_normalized_cosine_fast_path/dot_product/1024 | +49.12% | [+48.74%, +49.50%] | 78.4 | 78.4 | ❌ FAIL |
| simd_squared_euclidean_fast_path/squared_euclidean/768 | +44.98% | [+44.21%, +45.71%] | 61.3 | 61.3 | ❌ FAIL |
| simd_squared_euclidean_fast_path/euclidean_full/768 | +38.32% | [+37.83%, +38.78%] | 65.6 | 65.6 | ❌ FAIL |
| simd_query_batch_dot_product/simd_batch/768d_4c | +35.22% | [+34.82%, +35.61%] | 199.1 | 199.1 | ❌ FAIL |
| simd_query_batch_dot_product/pair_loop/768d_4c | +33.50% | [+33.16%, +33.80%] | 300.1 | 300.1 | ❌ FAIL |
| simd_dot_product/simd/768 | +33.29% | [+32.99%, +33.54%] | 54.2 | 54.2 | ❌ FAIL |
| simd_query_batch_dot_product/simd_batch/384d_16c | +25.91% | [+25.64%, +26.18%] | 329.7 | 329.7 | ❌ FAIL |
| simd_query_batch_dot_product/pair_loop/768d_16c | +24.47% | [+23.61%, +25.13%] | 1054.6 | 1054.6 | ❌ FAIL |
| simd_batch_cosine_normalized_query/pair_loop_dot/768d_4c | +22.68% | [+22.32%, +22.99%] | 257.9 | 257.9 | ❌ FAIL |
| simd_batch_cosine_normalized_query/pair_loop_dot/768d_256c | +21.45% | [+19.46%, +23.43%] | 15950.4 | 15950.4 | ❌ FAIL |
| simd_normalized_cosine_fast_path/dot_product/768 | +17.94% | [+17.65%, +18.21%] | 60.5 | 60.5 | ❌ FAIL |
| simd_euclidean_distance/simd/1536 | +17.43% | [+17.21%, +17.57%] | 120.2 | 120.2 | ❌ FAIL |
| layer_norm/896 | +16.48% | [+16.28%, +16.68%] | 205.9 | 205.9 | ❌ FAIL |
| simd_batch_cosine_normalized_query/pair_loop_dot/768d_64c | +16.11% | [+16.00%, +16.21%] | 3862.5 | 3862.5 | ❌ FAIL |
| simd_query_batch_dot_product/pair_loop/768d_256c | +16.36% | [+15.89%, +16.69%] | 15542.2 | 15542.2 | ❌ FAIL |
| simd_batch_cosine_normalized_query/pair_loop_dot/768d_16c | +16.11% | [+15.84%, +16.33%] | 1003.6 | 1003.6 | ❌ FAIL |
| simd_query_batch_dot_product/pair_loop/768d_64c | +18.98% | [+15.73%, +22.22%] | 3938.8 | 3938.8 | ❌ FAIL |
| simd_query_batch_dot_product/simd_batch/768d_16c | +12.77% | [+12.37%, +13.09%] | 682.5 | 682.5 | ❌ FAIL |
| simd_batch_cosine_normalized_query/pair_loop_dot/384d_64c | +12.40% | [+12.34%, +12.45%] | 2155.2 | 2155.2 | ❌ FAIL |
| simd_batch_cosine_normalized_query/pair_loop_dot/384d_4c | +12.50% | [+11.87%, +13.10%] | 141.4 | 141.4 | ❌ FAIL |
| simd_query_batch_dot_product/pair_loop/384d_16c | +10.25% | [+9.92%, +10.52%] | 517.1 | 517.1 | ❌ FAIL |
| simd_batch_cosine_normalized_query/pair_loop_dot/384d_1000c | +9.89% | [+9.57%, +10.12%] | 34865.6 | 34865.6 | ❌ FAIL |
| simd_batch_cosine_normalized_query/pair_loop_dot/384d_256c | +9.93% | [+9.53%, +10.30%] | 8476.9 | 8476.9 | ❌ FAIL |
| simd_batch_cosine_non_normalized_query/pair_loop/384d_16c | +9.01% | [+8.63%, +9.31%] | 695.4 | 695.4 | ❌ FAIL |
| simd_batch_cosine_normalized_query/pair_loop_dot/768d_1000c | +9.52% | [+8.53%, +10.22%] | 61761.2 | 61761.2 | ❌ FAIL |
| simd_batch_cosine_normalized_query/pair_loop_cosine/768d_4c | +7.32% | [+6.99%, +7.64%] | 285.6 | 285.6 | ⚠ WARN |
| simd_normalized_cosine_fast_path/cosine_full/1024 | +7.29% | [+6.88%, +7.69%] | 91.1 | 91.1 | ⚠ WARN |
| simd_batch_cosine_non_normalized_query/pair_loop/768d_256c | +7.25% | [+6.79%, +7.70%] | 18093.3 | 18093.3 | ⚠ WARN |
| simd_batch_cosine_non_normalized_query/pair_loop/384d_4c | +7.30% | [+6.78%, +7.82%] | 179.5 | 179.5 | ⚠ WARN |
| simd_batch_cosine_non_normalized_query/pair_loop/768d_16c | +6.90% | [+6.76%, +7.03%] | 1146.7 | 1146.7 | ⚠ WARN |
| simd_batch_cosine_non_normalized_query/simd_batch/768d_256c | +6.35% | [+6.10%, +6.52%] | 17748.5 | 17748.5 | ⚠ WARN |
| elementwise_mul/4096 | +5.83% | [+5.46%, +6.15%] | 317.6 | 317.6 | ⚠ WARN |
| simd_normalize/simd/384 | +8.64% | [+5.37%, +12.03%] | 76.8 | 76.8 | ⚠ WARN |
| simd_batch_cosine_normalized_query/pair_loop_cosine/1024d_256c | +5.58% | [+5.29%, +5.77%] | 23228.0 | 23228.0 | ⚠ WARN |
| int8_batch_cosine/int8_loop/10 | +5.48% | [+5.21%, +5.67%] | 176.9 | 176.9 | ⚠ WARN |
| simd_batch_cosine_non_normalized_query/pair_loop/1024d_256c | +5.77% | [+5.17%, +6.14%] | 23273.3 | 23273.3 | ⚠ WARN |
| simd_batch_cosine_normalized_query/pair_loop_dot/1024d_256c | +5.32% | [+5.14%, +5.48%] | 19091.5 | 19091.5 | ⚠ WARN |
| simd_batch_cosine_non_normalized_query/pair_loop/768d_4c | +5.37% | [+5.13%, +5.53%] | 281.2 | 281.2 | ⚠ WARN |
| simd_batch_cosine_normalized_query/pair_loop_cosine/768d_256c | +5.34% | [+4.97%, +5.63%] | 17796.2 | 17796.2 | ⚠ WARN |
| simd_normalized_cosine_fast_path/cosine_full/768 | +5.42% | [+4.93%, +5.90%] | 72.9 | 72.9 | ⚠ WARN |
| simd_batch_cosine_non_normalized_query/simd_batch/1024d_256c | +5.16% | [+4.87%, +5.44%] | 22817.1 | 22817.1 | ⚠ WARN |
| simd_batch_cosine_non_normalized_query/pair_loop/768d_64c | +5.03% | [+4.81%, +5.19%] | 4427.7 | 4427.7 | ⚠ WARN |
| simd_batch_cosine_normalized_query/simd_batch/768d_256c | +5.17% | [+4.80%, +5.48%] | 17586.5 | 17586.5 | ⚠ WARN |
| simd_batch_cosine_non_normalized_query/simd_batch/768d_16c | +4.88% | [+4.71%, +5.04%] | 1112.2 | 1112.2 | ⚠ WARN |
| simd_batch_cosine_normalized_query/simd_batch/1024d_256c | +4.94% | [+4.71%, +5.13%] | 22946.6 | 22946.6 | ⚠ WARN |
| simd_batch_cosine_normalized_query/pair_loop_cosine/384d_4c | +4.69% | [+4.44%, +4.94%] | 175.4 | 175.4 | ⚠ WARN |
| simd_query_batch_dot_product/simd_batch/384d_64c | +5.21% | [+4.29%, +6.00%] | 1308.5 | 1308.5 | ⚠ WARN |
| simd_query_batch_dot_product/simd_batch/384d_4c | +4.66% | [+4.27%, +5.03%] | 81.9 | 81.9 | ⚠ WARN |
| simd_batch_cosine_normalized_query/pair_loop_dot/1024d_16c | +4.39% | [+4.10%, +4.64%] | 1125.9 | 1125.9 | ⚠ WARN |
| simd_query_batch_dot_product/pair_loop/384d_64c | +4.39% | [+3.97%, +4.74%] | 2003.0 | 2003.0 | ⚠ WARN |
| simd_batch_cosine_normalized_query/simd_batch/768d_4c | +4.60% | [+3.88%, +5.29%] | 276.5 | 276.5 | ⚠ WARN |
| simd_query_batch_dot_product/simd_batch/768d_256c | +4.17% | [+3.37%, +4.92%] | 10090.2 | 10090.2 | ⚠ WARN |
| simd_batch_cosine_non_normalized_query/simd_batch/768d_4c | +3.66% | [+3.34%, +3.97%] | 273.3 | 273.3 | ⚠ WARN |
| simd_query_batch_dot_product/pair_loop/384d_4c | +3.63% | [+3.30%, +3.89%] | 132.6 | 132.6 | ⚠ WARN |
| simd_batch_cosine_non_normalized_query/simd_batch/768d_64c | +3.46% | [+3.03%, +3.84%] | 4312.9 | 4312.9 | ⚠ WARN |
| simd_throughput_384/normalize | -3.23% | [-3.37%, -3.12%] | 107.5 | 107.5 | 🚀 WIN |
| simd_query_batch_dot_product/simd_batch/128d_16c | -3.08% | [-3.45%, -2.71%] | 128.7 | 128.7 | 🚀 WIN |
| simd_batch_cosine_normalized_query/pair_loop_cosine/384d_16c | -3.86% | [-4.07%, -3.66%] | 652.8 | 652.8 | 🚀 WIN |
| rms_norm/4096 | -3.98% | [-4.17%, -3.81%] | 898.9 | 898.9 | 🚀 WIN |
| int8_batch_cosine/int8_loop/1000 | -4.57% | [-4.72%, -4.42%] | 18243.6 | 18243.6 | 🚀 WIN |
| simd_batch_cosine_normalized_query/simd_batch/384d_16c | -4.97% | [-5.35%, -4.58%] | 631.7 | 631.7 | 🚀 WIN |
| simd_batch_cosine_normalized_query/pair_loop_dot/1024d_1000c | -4.94% | [-5.52%, -4.42%] | 69682.6 | 69682.6 | 🚀 WIN |
| int8_vs_float32_cosine/float32_simd/768 | -5.38% | [-5.55%, -5.22%] | 68.7 | 68.7 | 🚀 WIN |
| simd_cosine_similarity/simd/1024 | -5.65% | [-5.80%, -5.51%] | 85.7 | 85.7 | 🚀 WIN |
| int8_vs_float32_cosine/float32_simd/1536 | -6.06% | [-6.32%, -5.80%] | 119.6 | 119.6 | 🚀 WIN |
| binary_cosine_distance/float32_simd/768 | -6.14% | [-6.38%, -5.91%] | 68.8 | 68.8 | 🚀 WIN |
| simd_prepared_query_normalized_cosine/prepared_meta_unit/1024 | -5.83% | [-6.51%, -5.16%] | 66324.8 | 66324.8 | 🚀 WIN |
| int4_cosine_distance/float32_simd/768 | -6.16% | [-6.60%, -5.82%] | 69.0 | 69.0 | 🚀 WIN |
| simd_squared_euclidean_fast_path/squared_euclidean/1024 | -8.28% | [-8.85%, -7.77%] | 79.4 | 79.4 | 🚀 WIN |
| simd_prepared_query_normalized_cosine/dot_product_loop/384 | -11.05% | [-11.25%, -10.86%] | 31776.6 | 31776.6 | 🚀 WIN |
| simd_squared_euclidean_fast_path/euclidean_full/384 | -15.94% | [-16.45%, -15.50%] | 28.9 | 28.9 | 🚀 WIN |
| simd_squared_euclidean_fast_path/squared_euclidean/384 | -18.65% | [-18.95%, -18.35%] | 23.9 | 23.9 | 🚀 WIN |
| simd_normalized_cosine_fast_path/dot_product/384 | -19.25% | [-19.79%, -18.72%] | 22.3 | 22.3 | 🚀 WIN |
All 259 measurements
| Bench | Δ point | CI-lower | CI-upper |
|---|---|---|---|
| add_bias_gelu/4096 | +0.25% | +0.10% | +0.40% |
| add_bias_gelu/896 | +0.15% | -0.09% | +0.40% |
| binary_cosine_distance/binary/1024 | +0.50% | +0.09% | +0.91% |
| binary_cosine_distance/binary/1536 | +0.05% | -0.37% | +0.44% |
| binary_cosine_distance/binary/384 | +0.96% | +0.53% | +1.38% |
| binary_cosine_distance/binary/768 | -0.58% | -1.86% | +0.56% |
| binary_cosine_distance/float32_simd/1024 | +0.00% | -0.20% | +0.19% |
| binary_cosine_distance/float32_simd/1536 | -0.06% | -0.24% | +0.08% |
| binary_cosine_distance/float32_simd/384 | +0.02% | -0.40% | +0.45% |
| binary_cosine_distance/float32_simd/768 | -6.14% | -6.38% | -5.91% |
| elementwise_mul/4096 | +5.83% | +5.46% | +6.15% |
| gelu/4096 | -0.41% | -1.42% | +0.36% |
| gelu/896 | +0.09% | -0.04% | +0.18% |
| int4_cosine_distance/float32_simd/1024 | +0.04% | -0.22% | +0.30% |
| int4_cosine_distance/float32_simd/1536 | -0.09% | -0.34% | +0.09% |
| int4_cosine_distance/float32_simd/384 | -0.85% | -1.24% | -0.46% |
| int4_cosine_distance/float32_simd/768 | -6.16% | -6.60% | -5.82% |
| int4_cosine_distance/int4/1024 | -0.05% | -0.16% | +0.03% |
| int4_cosine_distance/int4/1536 | -0.44% | -0.83% | -0.07% |
| int4_cosine_distance/int4/384 | +0.37% | +0.11% | +0.55% |
| int4_cosine_distance/int4/768 | -2.77% | -4.74% | -1.10% |
| int8_batch_cosine/float32_simd/10 | -0.66% | -1.54% | +0.09% |
| int8_batch_cosine/float32_simd/100 | -2.49% | -3.09% | -1.89% |
| int8_batch_cosine/float32_simd/1000 | +0.72% | +0.28% | +1.14% |
| int8_batch_cosine/int8_loop/10 | +5.48% | +5.21% | +5.67% |
| int8_batch_cosine/int8_loop/100 | +1.29% | +0.87% | +1.70% |
| int8_batch_cosine/int8_loop/1000 | -4.57% | -4.72% | -4.42% |
| int8_prepared_dot_product/per_call/1024 | -0.19% | -0.45% | -0.04% |
| int8_prepared_dot_product/per_call/127 | -0.07% | -0.28% | +0.09% |
| int8_prepared_dot_product/per_call/128 | -0.05% | -0.22% | +0.08% |
| int8_prepared_dot_product/per_call/129 | +0.19% | +0.05% | +0.28% |
| int8_prepared_dot_product/per_call/384 | -0.50% | -1.03% | -0.13% |
| int8_prepared_dot_product/per_call/768 | +0.68% | -0.35% | +1.69% |
| int8_prepared_dot_product/prepared/1024 | -1.49% | -1.72% | -1.29% |
| int8_prepared_dot_product/prepared/127 | +2.14% | +1.46% | +2.71% |
| int8_prepared_dot_product/prepared/128 | +1.19% | -0.35% | +2.51% |
| int8_prepared_dot_product/prepared/129 | +3.56% | +2.91% | +4.21% |
| int8_prepared_dot_product/prepared/384 | -0.90% | -1.59% | -0.23% |
| int8_prepared_dot_product/prepared/768 | +0.98% | +0.08% | +1.77% |
| int8_quantization/quantize/1024 | -0.15% | -0.76% | +0.26% |
| int8_quantization/quantize/1536 | -0.04% | -0.62% | +0.42% |
| int8_quantization/quantize/384 | +0.53% | -0.37% | +1.40% |
| int8_quantization/quantize/768 | -0.28% | -0.77% | +0.06% |
| int8_raw_dot_product/dot_product_i8/1024 | -0.98% | -1.52% | -0.44% |
| int8_raw_dot_product/dot_product_i8/127 | +0.41% | +0.05% | +0.78% |
| int8_raw_dot_product/dot_product_i8/128 | +1.19% | +0.61% | +1.70% |
| int8_raw_dot_product/dot_product_i8/129 | +2.02% | +1.52% | +2.39% |
| int8_raw_dot_product/dot_product_i8/384 | -1.69% | -2.21% | -1.29% |
| int8_raw_dot_product/dot_product_i8/768 | +0.27% | -0.16% | +0.61% |
| int8_raw_dot_product/dot_product_i8_raw/1024 | -2.34% | -3.15% | -1.54% |
| int8_raw_dot_product/dot_product_i8_raw/127 | -0.42% | -0.98% | +0.01% |
| int8_raw_dot_product/dot_product_i8_raw/128 | -0.41% | -0.71% | -0.16% |
| int8_raw_dot_product/dot_product_i8_raw/129 | +3.35% | +2.11% | +4.54% |
| int8_raw_dot_product/dot_product_i8_raw/384 | -0.39% | -0.61% | -0.18% |
| int8_raw_dot_product/dot_product_i8_raw/768 | +0.62% | +0.26% | +0.96% |
| int8_vs_float32_cosine/float32_simd/1024 | -0.97% | -1.04% | -0.92% |
| int8_vs_float32_cosine/float32_simd/1536 | -6.06% | -6.32% | -5.80% |
| int8_vs_float32_cosine/float32_simd/384 | -0.51% | -0.73% | -0.29% |
| int8_vs_float32_cosine/float32_simd/768 | -5.38% | -5.55% | -5.22% |
| int8_vs_float32_cosine/int8/1024 | -0.83% | -1.07% | -0.60% |
| int8_vs_float32_cosine/int8/1536 | -0.53% | -1.00% | -0.08% |
| int8_vs_float32_cosine/int8/384 | -0.88% | -1.72% | -0.19% |
| int8_vs_float32_cosine/int8/768 | -0.45% | -0.78% | -0.16% |
| layer_norm/4096 | +2.00% | +1.86% | +2.12% |
| layer_norm/896 | +16.48% | +16.28% | +16.68% |
| memory_size/search_1000_float32 | -1.29% | -1.65% | -0.94% |
| memory_size/search_1000_int8 | +0.15% | -0.91% | +0.96% |
| rms_norm/4096 | -3.98% | -4.17% | -3.81% |
| rms_norm/896 | -0.79% | -1.00% | -0.58% |
| silu_inplace/4096 | -0.35% | -1.02% | +0.21% |
| silu_inplace/896 | -0.12% | -0.55% | +0.28% |
| simd_batch_cosine/scalar_loop/10 | -0.09% | -0.33% | +0.15% |
| simd_batch_cosine/scalar_loop/100 | -0.13% | -0.45% | +0.10% |
| simd_batch_cosine/scalar_loop/1000 | +0.00% | -0.10% | +0.08% |
| simd_batch_cosine/simd_batch/10 | +0.96% | +0.83% | +1.07% |
| simd_batch_cosine/simd_batch/100 | -2.02% | -2.45% | -1.60% |
| simd_batch_cosine/simd_batch/1000 | +0.30% | -0.37% | +0.97% |
| simd_batch_cosine_non_normalized_query/pair_loop/1024d_1000c | -1.27% | -1.51% | -1.11% |
| simd_batch_cosine_non_normalized_query/pair_loop/1024d_16c | +0.61% | +0.49% | +0.74% |
| simd_batch_cosine_non_normalized_query/pair_loop/1024d_256c | +5.77% | +5.17% | +6.14% |
| simd_batch_cosine_non_normalized_query/pair_loop/1024d_4c | +0.52% | +0.13% | +0.88% |
| simd_batch_cosine_non_normalized_query/pair_loop/1024d_64c | -0.95% | -1.76% | -0.31% |
| simd_batch_cosine_non_normalized_query/pair_loop/384d_1000c | +2.98% | +2.62% | +3.25% |
| simd_batch_cosine_non_normalized_query/pair_loop/384d_16c | +9.01% | +8.63% | +9.31% |
| simd_batch_cosine_non_normalized_query/pair_loop/384d_256c | +1.78% | +1.56% | +1.91% |
| simd_batch_cosine_non_normalized_query/pair_loop/384d_4c | +7.30% | +6.78% | +7.82% |
| simd_batch_cosine_non_normalized_query/pair_loop/384d_64c | +3.09% | +2.84% | +3.33% |
| simd_batch_cosine_non_normalized_query/pair_loop/768d_1000c | +2.87% | +2.53% | +3.16% |
| simd_batch_cosine_non_normalized_query/pair_loop/768d_16c | +6.90% | +6.76% | +7.03% |
| simd_batch_cosine_non_normalized_query/pair_loop/768d_256c | +7.25% | +6.79% | +7.70% |
| simd_batch_cosine_non_normalized_query/pair_loop/768d_4c | +5.37% | +5.13% | +5.53% |
| simd_batch_cosine_non_normalized_query/pair_loop/768d_64c | +5.03% | +4.81% | +5.19% |
| simd_batch_cosine_non_normalized_query/simd_batch/1024d_1000c | -1.23% | -1.49% | -0.98% |
| simd_batch_cosine_non_normalized_query/simd_batch/1024d_16c | +0.66% | +0.50% | +0.78% |
| simd_batch_cosine_non_normalized_query/simd_batch/1024d_256c | +5.16% | +4.87% | +5.44% |
| simd_batch_cosine_non_normalized_query/simd_batch/1024d_4c | +0.11% | -0.14% | +0.36% |
| simd_batch_cosine_non_normalized_query/simd_batch/1024d_64c | +0.08% | -0.10% | +0.19% |
| simd_batch_cosine_non_normalized_query/simd_batch/384d_1000c | +2.61% | +1.66% | +3.57% |
| simd_batch_cosine_non_normalized_query/simd_batch/384d_16c | +3.95% | +3.00% | +4.89% |
| simd_batch_cosine_non_normalized_query/simd_batch/384d_256c | +0.53% | -0.75% | +1.81% |
| simd_batch_cosine_non_normalized_query/simd_batch/384d_4c | +0.75% | +0.15% | +1.34% |
| simd_batch_cosine_non_normalized_query/simd_batch/384d_64c | +2.28% | +2.10% | +2.44% |
| simd_batch_cosine_non_normalized_query/simd_batch/768d_1000c | +3.27% | +2.89% | +3.60% |
| simd_batch_cosine_non_normalized_query/simd_batch/768d_16c | +4.88% | +4.71% | +5.04% |
| simd_batch_cosine_non_normalized_query/simd_batch/768d_256c | +6.35% | +6.10% | +6.52% |
| simd_batch_cosine_non_normalized_query/simd_batch/768d_4c | +3.66% | +3.34% | +3.97% |
| simd_batch_cosine_non_normalized_query/simd_batch/768d_64c | +3.46% | +3.03% | +3.84% |
| simd_batch_cosine_normalized_query/pair_loop_cosine/1024d_1000c | -1.83% | -2.31% | -1.50% |
| simd_batch_cosine_normalized_query/pair_loop_cosine/1024d_16c | +0.44% | +0.11% | +0.73% |
| simd_batch_cosine_normalized_query/pair_loop_cosine/1024d_256c | +5.58% | +5.29% | +5.77% |
| simd_batch_cosine_normalized_query/pair_loop_cosine/1024d_4c | -0.37% | -0.50% | -0.25% |
| simd_batch_cosine_normalized_query/pair_loop_cosine/1024d_64c | -0.19% | -0.41% | -0.03% |
| simd_batch_cosine_normalized_query/pair_loop_cosine/384d_1000c | +2.28% | +2.11% | +2.44% |
| simd_batch_cosine_normalized_query/pair_loop_cosine/384d_16c | -3.86% | -4.07% | -3.66% |
| simd_batch_cosine_normalized_query/pair_loop_cosine/384d_256c | +1.09% | +0.91% | +1.25% |
| simd_batch_cosine_normalized_query/pair_loop_cosine/384d_4c | +4.69% | +4.44% | +4.94% |
| simd_batch_cosine_normalized_query/pair_loop_cosine/384d_64c | +1.72% | +1.50% | +1.89% |
| simd_batch_cosine_normalized_query/pair_loop_cosine/768d_1000c | +2.45% | +2.24% | +2.66% |
| simd_batch_cosine_normalized_query/pair_loop_cosine/768d_16c | +3.25% | +2.92% | +3.56% |
| simd_batch_cosine_normalized_query/pair_loop_cosine/768d_256c | +5.34% | +4.97% | +5.63% |
| simd_batch_cosine_normalized_query/pair_loop_cosine/768d_4c | +7.32% | +6.99% | +7.64% |
| simd_batch_cosine_normalized_query/pair_loop_cosine/768d_64c | +2.99% | +2.85% | +3.09% |
| simd_batch_cosine_normalized_query/pair_loop_dot/1024d_1000c | -4.94% | -5.52% | -4.42% |
| simd_batch_cosine_normalized_query/pair_loop_dot/1024d_16c | +4.39% | +4.10% | +4.64% |
| simd_batch_cosine_normalized_query/pair_loop_dot/1024d_256c | +5.32% | +5.14% | +5.48% |
| simd_batch_cosine_normalized_query/pair_loop_dot/1024d_4c | +2.95% | +2.75% | +3.06% |
| simd_batch_cosine_normalized_query/pair_loop_dot/1024d_64c | +1.41% | +1.19% | +1.54% |
| simd_batch_cosine_normalized_query/pair_loop_dot/384d_1000c | +9.89% | +9.57% | +10.12% |
| simd_batch_cosine_normalized_query/pair_loop_dot/384d_16c | +0.87% | +0.36% | +1.38% |
| simd_batch_cosine_normalized_query/pair_loop_dot/384d_256c | +9.93% | +9.53% | +10.30% |
| simd_batch_cosine_normalized_query/pair_loop_dot/384d_4c | +12.50% | +11.87% | +13.10% |
| simd_batch_cosine_normalized_query/pair_loop_dot/384d_64c | +12.40% | +12.34% | +12.45% |
| simd_batch_cosine_normalized_query/pair_loop_dot/768d_1000c | +9.52% | +8.53% | +10.22% |
| simd_batch_cosine_normalized_query/pair_loop_dot/768d_16c | +16.11% | +15.84% | +16.33% |
| simd_batch_cosine_normalized_query/pair_loop_dot/768d_256c | +21.45% | +19.46% | +23.43% |
| simd_batch_cosine_normalized_query/pair_loop_dot/768d_4c | +22.68% | +22.32% | +22.99% |
| simd_batch_cosine_normalized_query/pair_loop_dot/768d_64c | +16.11% | +16.00% | +16.21% |
| simd_batch_cosine_normalized_query/simd_batch/1024d_1000c | -1.44% | -1.61% | -1.29% |
| simd_batch_cosine_normalized_query/simd_batch/1024d_16c | +0.79% | +0.71% | +0.85% |
| simd_batch_cosine_normalized_query/simd_batch/1024d_256c | +4.94% | +4.71% | +5.13% |
| simd_batch_cosine_normalized_query/simd_batch/1024d_4c | -0.68% | -1.03% | -0.40% |
| simd_batch_cosine_normalized_query/simd_batch/1024d_64c | -0.27% | -0.66% | +0.01% |
| simd_batch_cosine_normalized_query/simd_batch/384d_1000c | +1.93% | +1.13% | +2.69% |
| simd_batch_cosine_normalized_query/simd_batch/384d_16c | -4.97% | -5.35% | -4.58% |
| simd_batch_cosine_normalized_query/simd_batch/384d_256c | +2.10% | +0.20% | +4.00% |
| simd_batch_cosine_normalized_query/simd_batch/384d_4c | +2.76% | +2.25% | +3.26% |
| simd_batch_cosine_normalized_query/simd_batch/384d_64c | +0.97% | +0.72% | +1.21% |
| simd_batch_cosine_normalized_query/simd_batch/768d_1000c | +2.62% | +2.26% | +2.95% |
| simd_batch_cosine_normalized_query/simd_batch/768d_16c | +3.29% | +2.87% | +3.69% |
| simd_batch_cosine_normalized_query/simd_batch/768d_256c | +5.17% | +4.80% | +5.48% |
| simd_batch_cosine_normalized_query/simd_batch/768d_4c | +4.60% | +3.88% | +5.29% |
| simd_batch_cosine_normalized_query/simd_batch/768d_64c | +2.56% | +2.36% | +2.76% |
| simd_batch_dot_product/scalar_loop/10 | +0.13% | +0.06% | +0.19% |
| simd_batch_dot_product/scalar_loop/100 | +0.01% | -0.11% | +0.10% |
| simd_batch_dot_product/scalar_loop/1000 | +0.74% | +0.65% | +0.81% |
| simd_batch_dot_product/simd_batch/10 | -0.07% | -0.34% | +0.13% |
| simd_batch_dot_product/simd_batch/100 | +0.08% | -0.02% | +0.16% |
| simd_batch_dot_product/simd_batch/1000 | -1.52% | -2.21% | -0.90% |
| simd_cosine_similarity/scalar/1024 | -0.17% | -0.39% | +0.01% |
| simd_cosine_similarity/scalar/1536 | +0.01% | -0.06% | +0.07% |
| simd_cosine_similarity/scalar/384 | -0.12% | -0.36% | +0.05% |
| simd_cosine_similarity/scalar/768 | -0.00% | -0.11% | +0.10% |
| simd_cosine_similarity/simd/1024 | -5.65% | -5.80% | -5.51% |
| simd_cosine_similarity/simd/1536 | +1.67% | +1.25% | +2.08% |
| simd_cosine_similarity/simd/384 | +0.02% | -0.31% | +0.30% |
| simd_cosine_similarity/simd/768 | +1.96% | +1.26% | +2.65% |
| simd_dot_product/scalar/1024 | +0.06% | -0.06% | +0.17% |
| simd_dot_product/scalar/1536 | +0.11% | -0.30% | +0.51% |
| simd_dot_product/scalar/384 | -0.02% | -0.15% | +0.08% |
| simd_dot_product/scalar/768 | +0.24% | +0.18% | +0.30% |
| simd_dot_product/simd/1024 | +0.78% | +0.48% | +1.05% |
| simd_dot_product/simd/1536 | +0.17% | +0.01% | +0.30% |
| simd_dot_product/simd/384 | -1.33% | -2.06% | -0.58% |
| simd_dot_product/simd/768 | +33.29% | +32.99% | +33.54% |
| simd_euclidean_distance/scalar/1024 | -0.08% | -0.25% | +0.08% |
| simd_euclidean_distance/scalar/1536 | -0.16% | -0.42% | +0.07% |
| simd_euclidean_distance/scalar/384 | +0.77% | +0.22% | +1.32% |
| simd_euclidean_distance/scalar/768 | -0.07% | -0.20% | +0.06% |
| simd_euclidean_distance/simd/1024 | -1.35% | -1.88% | -0.84% |
| simd_euclidean_distance/simd/1536 | +17.43% | +17.21% | +17.57% |
| simd_euclidean_distance/simd/384 | -0.11% | -0.38% | +0.12% |
| simd_euclidean_distance/simd/768 | +0.10% | -0.06% | +0.26% |
| simd_normalize/scalar/1024 | +1.24% | +0.95% | +1.52% |
| simd_normalize/scalar/1536 | +0.48% | +0.30% | +0.65% |
| simd_normalize/scalar/384 | +1.82% | +1.51% | +2.11% |
| simd_normalize/scalar/768 | +0.93% | +0.56% | +1.24% |
| simd_normalize/simd/1024 | +1.73% | +0.01% | +3.48% |
| simd_normalize/simd/1536 | -2.35% | -4.65% | -0.02% |
| simd_normalize/simd/384 | +8.64% | +5.37% | +12.03% |
| simd_normalize/simd/768 | +4.10% | +2.13% | +6.15% |
| simd_normalized_cosine_fast_path/cosine_full/1024 | +7.29% | +6.88% | +7.69% |
| simd_normalized_cosine_fast_path/cosine_full/384 | -1.37% | -1.76% | -0.99% |
| simd_normalized_cosine_fast_path/cosine_full/768 | +5.42% | +4.93% | +5.90% |
| simd_normalized_cosine_fast_path/dot_product/1024 | +49.12% | +48.74% | +49.50% |
| simd_normalized_cosine_fast_path/dot_product/384 | -19.25% | -19.79% | -18.72% |
| simd_normalized_cosine_fast_path/dot_product/768 | +17.94% | +17.65% | +18.21% |
| simd_prepared_query_normalized_cosine/dot_product_loop/1024 | -1.84% | -1.99% | -1.71% |
| simd_prepared_query_normalized_cosine/dot_product_loop/384 | -11.05% | -11.25% | -10.86% |
| simd_prepared_query_normalized_cosine/dot_product_loop/768 | -2.66% | -2.95% | -2.40% |
| simd_prepared_query_normalized_cosine/prepared_full_cosine/1024 | -1.87% | -2.38% | -1.50% |
| simd_prepared_query_normalized_cosine/prepared_full_cosine/384 | +0.92% | +0.60% | +1.16% |
| simd_prepared_query_normalized_cosine/prepared_full_cosine/768 | +0.49% | -0.00% | +0.86% |
| simd_prepared_query_normalized_cosine/prepared_meta_unit/1024 | -5.83% | -6.51% | -5.16% |
| simd_prepared_query_normalized_cosine/prepared_meta_unit/384 | +0.99% | +0.77% | +1.20% |
| simd_prepared_query_normalized_cosine/prepared_meta_unit/768 | -1.28% | -1.63% | -0.93% |
| simd_query_batch_dot_product/pair_loop/128d_16c | -2.87% | -3.13% | -2.62% |
| simd_query_batch_dot_product/pair_loop/128d_256c | -1.86% | -2.30% | -1.43% |
| simd_query_batch_dot_product/pair_loop/128d_4c | +3.68% | +2.89% | +4.47% |
| simd_query_batch_dot_product/pair_loop/128d_64c | -2.53% | -2.67% | -2.42% |
| simd_query_batch_dot_product/pair_loop/384d_16c | +10.25% | +9.92% | +10.52% |
| simd_query_batch_dot_product/pair_loop/384d_256c | -0.30% | -0.49% | -0.15% |
| simd_query_batch_dot_product/pair_loop/384d_4c | +3.63% | +3.30% | +3.89% |
| simd_query_batch_dot_product/pair_loop/384d_64c | +4.39% | +3.97% | +4.74% |
| simd_query_batch_dot_product/pair_loop/768d_16c | +24.47% | +23.61% | +25.13% |
| simd_query_batch_dot_product/pair_loop/768d_256c | +16.36% | +15.89% | +16.69% |
| simd_query_batch_dot_product/pair_loop/768d_4c | +33.50% | +33.16% | +33.80% |
| simd_query_batch_dot_product/pair_loop/768d_64c | +18.98% | +15.73% | +22.22% |
| simd_query_batch_dot_product/simd_batch/128d_16c | -3.08% | -3.45% | -2.71% |
| simd_query_batch_dot_product/simd_batch/128d_256c | +0.38% | +0.10% | +0.55% |
| simd_query_batch_dot_product/simd_batch/128d_4c | -2.87% | -3.30% | -2.53% |
| simd_query_batch_dot_product/simd_batch/128d_64c | -1.68% | -2.30% | -1.27% |
| simd_query_batch_dot_product/simd_batch/384d_16c | +25.91% | +25.64% | +26.18% |
| simd_query_batch_dot_product/simd_batch/384d_256c | +2.57% | +2.36% | +2.79% |
| simd_query_batch_dot_product/simd_batch/384d_4c | +4.66% | +4.27% | +5.03% |
| simd_query_batch_dot_product/simd_batch/384d_64c | +5.21% | +4.29% | +6.00% |
| simd_query_batch_dot_product/simd_batch/768d_16c | +12.77% | +12.37% | +13.09% |
| simd_query_batch_dot_product/simd_batch/768d_256c | +4.17% | +3.37% | +4.92% |
| simd_query_batch_dot_product/simd_batch/768d_4c | +35.22% | +34.82% | +35.61% |
| simd_query_batch_dot_product/simd_batch/768d_64c | +1.58% | +1.26% | +1.84% |
| simd_squared_euclidean_fast_path/euclidean_full/1024 | -1.02% | -2.68% | +0.63% |
| simd_squared_euclidean_fast_path/euclidean_full/384 | -15.94% | -16.45% | -15.50% |
| simd_squared_euclidean_fast_path/euclidean_full/768 | +38.32% | +37.83% | +38.78% |
| simd_squared_euclidean_fast_path/squared_euclidean/1024 | -8.28% | -8.85% | -7.77% |
| simd_squared_euclidean_fast_path/squared_euclidean/384 | -18.65% | -18.95% | -18.35% |
| simd_squared_euclidean_fast_path/squared_euclidean/768 | +44.98% | +44.21% | +45.71% |
| simd_throughput_384/cosine_similarity | -1.20% | -1.52% | -0.89% |
| simd_throughput_384/dot_product | -1.06% | -1.79% | -0.35% |
| simd_throughput_384/euclidean_distance | -0.81% | -1.03% | -0.60% |
| simd_throughput_384/normalize | -3.23% | -3.37% | -3.12% |
| softmax_attention/128 | -0.49% | -0.64% | -0.33% |
| softmax_attention/512 | +0.11% | +0.00% | +0.18% |
| tier_prepared_batch_sizes/int4_batch_prepared/10 | +0.26% | -0.33% | +0.84% |
| tier_prepared_batch_sizes/int4_batch_prepared/100 | -0.66% | -1.30% | -0.21% |
| tier_prepared_batch_sizes/int4_batch_prepared/1000 | -1.30% | -2.96% | -0.12% |
| tier_prepared_batch_sizes/int4_query_per_call/10 | +0.63% | +0.44% | +0.78% |
| tier_prepared_batch_sizes/int4_query_per_call/100 | +0.49% | +0.05% | +0.84% |
| tier_prepared_batch_sizes/int4_query_per_call/1000 | +0.58% | +0.54% | +0.62% |
| tier_prepared_batch_sizes/int8_batch_prepared/10 | -0.73% | -0.84% | -0.64% |
| tier_prepared_batch_sizes/int8_batch_prepared/100 | +0.17% | -0.43% | +0.73% |
| tier_prepared_batch_sizes/int8_batch_prepared/1000 | +0.23% | -0.33% | +0.76% |
| tier_prepared_batch_sizes/int8_query_per_call/10 | +0.20% | -0.11% | +0.43% |
| tier_prepared_batch_sizes/int8_query_per_call/100 | -0.40% | -0.79% | -0.11% |
| tier_prepared_batch_sizes/int8_query_per_call/1000 | -0.03% | -0.14% | +0.09% |
| tier_prepared_query/binary_query_once_1000 | +0.55% | -0.05% | +1.16% |
| tier_prepared_query/binary_query_per_call_1000 | +0.41% | +0.25% | +0.52% |
| tier_prepared_query/int4_query_once_1000 | +0.11% | -0.06% | +0.28% |
| tier_prepared_query/int4_query_per_call_1000 | -1.57% | -1.62% | -1.54% |
| tier_prepared_query/int8_query_once_1000 | +0.57% | +0.41% | +0.70% |
| tier_prepared_query/int8_query_per_call_1000 | +0.11% | -0.01% | +0.18% |
Rule: CI-lower of change ≤3.0% passes silently; (3.0%, 7.0%] warns; >7.0% fails. Override via PR label bench-allow-regression.
Gate is in advisory mode (Rollout step 3, ADR-058 §Rollout). Failures do not block merge for the first 7 days.
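
The threshold rule can be sketched as a small classifier. This is a minimal illustration of the stated thresholds only, not the gate's actual implementation: the names `Verdict`, `classify`, `WARN_THRESHOLD_PCT`, and `FAIL_THRESHOLD_PCT` are hypothetical, and the sketch does not model the `bench-allow-regression` override or the 🚀 WIN label, neither of which the rule text defines.

```rust
/// Outcome of the regression gate for one benchmark (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Verdict {
    Pass,
    Warn,
    Fail,
}

/// Thresholds quoted in the rule text, in percent.
pub const WARN_THRESHOLD_PCT: f64 = 3.0;
pub const FAIL_THRESHOLD_PCT: f64 = 7.0;

/// Classify a benchmark by the lower bound of the 95% CI of its change.
/// <= 3.0% passes, (3.0%, 7.0%] warns, > 7.0% fails.
pub fn classify(ci_lower_pct: f64) -> Verdict {
    if ci_lower_pct <= WARN_THRESHOLD_PCT {
        Verdict::Pass
    } else if ci_lower_pct <= FAIL_THRESHOLD_PCT {
        Verdict::Warn
    } else {
        Verdict::Fail
    }
}
```

Under this reading, `simd_query_batch_dot_product/simd_batch/768d_256c` (CI-lower +3.37%) classifies as a warning, which matches its ⚠ WARN verdict in the report, and `layer_norm/896` (CI-lower +16.28%) classifies as a failure.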
Summary
- gemv_decode_wide_f16 MSL kernel: NR=4, same structure as the existing gemv_q8_decode_wide, 4× fewer threadgroups
- Decode (encode_final_head) and prefill (forward_prefill_impl) lm_head paths for Q8 format
Measurements
Root Cause
Commit 4dab27e correctly switched lm_head from Q8 to f16 weights to close a 0.79 PPL gap, but it routed the matmul through gemv_decode_m1 (one threadgroup per vocab row, 256 threads/TG), creating 151,936 TGs for Qwen3.5-0.8B. The old Q8 path used gemv_q8_decode with NR=2 (75,968 TGs of 128 threads). The new wide kernel uses NR=4, which gives 37,984 TGs.
Test plan
- cargo check --workspace --all-targets passes
- bench_decode_ab N=8: 127ms (down from 291ms)
- bench_decode_ab N=56: 429ms → slope = 160 tok/s
🤖 Generated with Claude Code
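
The threadgroup counts in the root-cause analysis and the slope figure in the test plan can be checked with a few lines of arithmetic. The sketch below is illustrative only: `threadgroups`, `total_threads`, and `slope_tok_per_s` are hypothetical helper names, not functions from the repository, and the slope formula assumes that the "slope method" means tokens per second computed from the time difference between the two run lengths.

```rust
/// Threadgroups needed when each threadgroup produces `nr` output rows
/// (ceiling division, so a partial last group still gets dispatched).
pub fn threadgroups(rows: u64, nr: u64) -> u64 {
    (rows + nr - 1) / nr
}

/// Total threads launched for one GEMV dispatch.
pub fn total_threads(rows: u64, nr: u64, threads_per_tg: u64) -> u64 {
    threadgroups(rows, nr) * threads_per_tg
}

/// Decode throughput from two timed runs of different lengths.
/// Assumption: fixed overhead cancels in the difference, so
/// tok/s = (n1 - n0) / (t1 - t0).
pub fn slope_tok_per_s(n0: u32, t0_ms: f64, n1: u32, t1_ms: f64) -> f64 {
    let ms_per_token = (t1_ms - t0_ms) / f64::from(n1 - n0);
    1000.0 / ms_per_token
}
```

With 151,936 vocab rows, `threadgroups` gives 151,936 for NR=1, 75,968 for NR=2, and 37,984 for NR=4, matching the counts quoted above. `total_threads(151_936, 1, 256)` is exactly four times `total_threads(151_936, 2, 128)`, which is consistent with the "4× more shader invocations" in the commit message. `slope_tok_per_s(8, 127.0, 56, 429.0)` evaluates to roughly 159 tok/s, in line with the reported ~160 tok/s.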