Skip to content

fix(inference): wide f16 GEMV kernel restores 160 tok/s decode throughput#151

Merged
ohdearquant merged 2 commits into
mainfrom
fix/lm-head-wide-kernel
May 31, 2026
Merged

fix(inference): wide f16 GEMV kernel restores 160 tok/s decode throughput#151
ohdearquant merged 2 commits into
mainfrom
fix/lm-head-wide-kernel

Conversation

@ohdearquant

Copy link
Copy Markdown
Owner

Summary

  • Root-caused 17% Metal decode regression (160→133 tok/s) to lm_head dispatch using NR=1 f16 kernel with 151,936 threadgroups per decode step
  • Added gemv_decode_wide_f16 MSL kernel: NR=4, same structure as existing gemv_q8_decode_wide, 4× fewer threadgroups
  • Wired into both decode (encode_final_head) and prefill (forward_prefill_impl) lm_head paths for Q8 format
  • Zero quality regression — same f16 embed_tokens weights, same f32 accumulation, bit-for-bit identical logits

Measurements

bench_decode_ab (Qwen3.5-0.8B Q8, slope method N1=8 N2=56):
  Before: T1=291ms, T2≈1654ms → 133 tok/s
  After:  T1=127ms, T2=429ms  → 160 tok/s

Root Cause

Commit 4dab27e correctly switched lm_head from Q8→f16 weights to close a 0.79 PPL gap. But it routed through gemv_decode_m1 (one threadgroup per vocab row, 256 threads/TG), creating 151,936 TGs for Qwen3.5-0.8B. The old Q8 path used gemv_q8_decode with NR=2 (75,968 TGs of 128 threads). The new wide kernel uses NR=4 → 37,984 TGs.

Test plan

  • cargo check --workspace --all-targets passes
  • bench_decode_ab N=8: 127ms (down from 291ms)
  • bench_decode_ab N=56: 429ms → slope = 160 tok/s
  • Deterministic output (completion=8/8 on all runs)
  • PPL validation against MLX golden (same f16 weights → no regression expected)

🤖 Generated with Claude Code

…hput

The lm_head dispatch for Q8 models was using gemv_decode_m1 (NR=1, one
threadgroup per vocab row) which created 151,936 threadgroups of 256 threads
each — 4× more shader invocations than the prior Q8 path. This was introduced
in commit 4dab27e which correctly switched to f16 weights for PPL quality but
used an inefficient kernel for the large-N matmul.

Add gemv_decode_wide_f16: an NR=4 f16 GEMV kernel (same structure as the
existing gemv_q8_decode_wide) that processes 4 output rows per threadgroup,
reducing lm_head dispatch from 151,936 to 37,984 threadgroups. Same f16
weights, same f32 accumulation — zero quality regression.

bench_decode_ab (Qwen3.5-0.8B Q8, slope method):
  Before: T1=291ms, ~133 tok/s
  After:  T1=127ms, ~160 tok/s

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ppl_metal: GPU-accelerated perplexity evaluation via Metal. Uses the same
forward path as decode (including the wide f16 lm_head kernel). Configurable
via PPL_TOKENS and CORPUS env vars.

pyproject.toml: tracks common Python dev dependencies (pyarrow, datasets,
mlx, numpy, matplotlib) so scripts/ and one-shot comparisons work without
ad-hoc installs.

Verified: Lattice PPL=20.60 vs MLX PPL=20.67 on wikitext-2 (2048 tokens,
window=512, stride=256). Parity confirmed — no quality regression.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions

Copy link
Copy Markdown

Perf regression report (ADR-058)

aarch64-linux — perf regression report

❌ 3 FAIL (regression >7.0% confirmed by 95% CI)
⚠ 18 WARN (regression 3.0-7.0% confirmed)
🚀 10 confirmed improvement

Bench Δ point 95% CI new ns base ns verdict
simd_query_batch_dot_product/pair_loop/768d_256c +9.43% [+9.16%, +9.69%] 21794.3 21794.3 ❌ FAIL
simd_batch_cosine_normalized_query/pair_loop_dot/768d_16c +8.89% [+8.25%, +9.52%] 1067.7 1067.7 ❌ FAIL
simd_query_batch_dot_product/simd_batch/768d_256c +8.68% [+8.18%, +9.19%] 18169.6 18169.6 ❌ FAIL
simd_batch_cosine_normalized_query/pair_loop_dot/768d_256c +7.44% [+6.91%, +7.97%] 21801.5 21801.5 ⚠ WARN
simd_batch_cosine_normalized_query/pair_loop_cosine/768d_256c +5.85% [+5.69%, +6.01%] 29137.6 29137.6 ⚠ WARN
simd_normalize/simd/384 +7.13% [+5.61%, +8.59%] 72.4 72.4 ⚠ WARN
simd_query_batch_dot_product/simd_batch/384d_256c +5.54% [+5.49%, +5.59%] 9048.0 9048.0 ⚠ WARN
simd_query_batch_dot_product/pair_loop/384d_256c +5.48% [+5.40%, +5.56%] 10143.6 10143.6 ⚠ WARN
simd_normalize/simd/768 +6.43% [+5.39%, +7.44%] 123.9 123.9 ⚠ WARN
simd_batch_cosine_non_normalized_query/pair_loop/768d_256c +5.54% [+5.33%, +5.74%] 29062.8 29062.8 ⚠ WARN
simd_batch_cosine_non_normalized_query/simd_batch/768d_256c +5.35% [+5.15%, +5.56%] 29141.1 29141.1 ⚠ WARN
simd_batch_cosine_normalized_query/simd_batch/768d_256c +5.28% [+5.06%, +5.51%] 28787.3 28787.3 ⚠ WARN
simd_query_batch_dot_product/simd_batch/768d_16c +5.06% [+5.02%, +5.09%] 706.5 706.5 ⚠ WARN
simd_normalized_cosine_fast_path/dot_product/768 +5.56% [+4.89%, +6.22%] 59.9 59.9 ⚠ WARN
simd_batch_cosine_normalized_query/pair_loop_dot/768d_64c +4.29% [+4.04%, +4.55%] 4829.0 4829.0 ⚠ WARN
simd_normalize/simd/1536 +5.22% [+4.03%, +6.41%] 230.5 230.5 ⚠ WARN
simd_query_batch_dot_product/pair_loop/128d_256c +4.05% [+4.02%, +4.08%] 3918.1 3918.1 ⚠ WARN
simd_batch_cosine_normalized_query/pair_loop_dot/768d_4c +3.88% [+3.76%, +4.00%] 241.2 241.2 ⚠ WARN
memory_size/search_1000_int8 +3.63% [+3.60%, +3.65%] 15468.9 15468.9 ⚠ WARN
simd_prepared_query_normalized_cosine/dot_product_loop/384 +3.44% [+3.40%, +3.48%] 39445.0 39445.0 ⚠ WARN
simd_query_batch_dot_product/pair_loop/768d_64c +3.50% [+3.24%, +3.76%] 4772.6 4772.6 ⚠ WARN
simd_batch_cosine_normalized_query/pair_loop_dot/1024d_16c -3.02% [-3.15%, -2.88%] 1424.7 1424.7 🚀 WIN
simd_batch_cosine_non_normalized_query/pair_loop/1024d_256c -3.03% [-3.30%, -2.76%] 35630.4 35630.4 🚀 WIN
simd_dot_product/simd/768 -3.55% [-3.75%, -3.34%] 55.2 55.2 🚀 WIN
simd_normalized_cosine_fast_path/dot_product/384 -3.70% [-3.79%, -3.61%] 30.3 30.3 🚀 WIN
simd_batch_cosine_normalized_query/pair_loop_dot/384d_16c -3.96% [-4.06%, -3.87%] 493.8 493.8 🚀 WIN
simd_batch_dot_product/simd_batch/10 -5.47% [-5.51%, -5.43%] 316.0 316.0 🚀 WIN
simd_batch_cosine_normalized_query/pair_loop_dot/1024d_256c -6.22% [-6.58%, -5.85%] 26405.5 26405.5 🚀 WIN
simd_query_batch_dot_product/pair_loop/768d_16c -7.04% [-7.08%, -7.00%] 939.4 939.4 🚀 WIN
simd_throughput_384/normalize -8.36% [-8.37%, -8.34%] 113.9 113.9 🚀 WIN
simd_batch_dot_product/simd_batch/1000 -19.92% [-20.27%, -19.56%] 73334.6 73334.6 🚀 WIN
All 259 measurements
Bench Δ point CI-lower CI-upper
add_bias_gelu/4096 +0.09% +0.06% +0.11%
add_bias_gelu/896 +0.01% -0.02% +0.03%
binary_cosine_distance/binary/1024 +2.71% +2.59% +2.83%
binary_cosine_distance/binary/1536 +0.39% +0.33% +0.45%
binary_cosine_distance/binary/384 +0.06% +0.02% +0.11%
binary_cosine_distance/binary/768 +0.39% +0.37% +0.41%
binary_cosine_distance/float32_simd/1024 +0.15% +0.08% +0.22%
binary_cosine_distance/float32_simd/1536 -0.01% -0.04% +0.01%
binary_cosine_distance/float32_simd/384 -0.03% -0.06% -0.01%
binary_cosine_distance/float32_simd/768 +0.48% +0.46% +0.50%
elementwise_mul/4096 +1.18% +1.11% +1.24%
gelu/4096 -0.01% -0.04% +0.01%
gelu/896 +0.03% +0.01% +0.06%
int4_cosine_distance/float32_simd/1024 +0.34% +0.14% +0.53%
int4_cosine_distance/float32_simd/1536 +0.00% -0.02% +0.03%
int4_cosine_distance/float32_simd/384 -0.00% -0.05% +0.04%
int4_cosine_distance/float32_simd/768 +0.08% -0.00% +0.16%
int4_cosine_distance/int4/1024 -0.02% -0.12% +0.07%
int4_cosine_distance/int4/1536 +0.38% +0.24% +0.52%
int4_cosine_distance/int4/384 -0.02% -0.22% +0.18%
int4_cosine_distance/int4/768 +0.22% +0.08% +0.36%
int8_batch_cosine/float32_simd/10 -0.05% -0.06% -0.03%
int8_batch_cosine/float32_simd/100 +0.28% +0.26% +0.30%
int8_batch_cosine/float32_simd/1000 -0.70% -0.78% -0.62%
int8_batch_cosine/int8_loop/10 +0.07% -0.01% +0.16%
int8_batch_cosine/int8_loop/100 -0.61% -0.65% -0.57%
int8_batch_cosine/int8_loop/1000 -2.66% -2.99% -2.34%
int8_prepared_dot_product/per_call/1024 +0.09% +0.02% +0.16%
int8_prepared_dot_product/per_call/127 +0.13% +0.11% +0.14%
int8_prepared_dot_product/per_call/128 +0.47% +0.32% +0.62%
int8_prepared_dot_product/per_call/129 +0.17% +0.16% +0.19%
int8_prepared_dot_product/per_call/384 +0.04% +0.03% +0.04%
int8_prepared_dot_product/per_call/768 +0.12% +0.02% +0.21%
int8_prepared_dot_product/prepared/1024 +0.89% +0.70% +1.08%
int8_prepared_dot_product/prepared/127 +0.26% +0.23% +0.29%
int8_prepared_dot_product/prepared/128 +0.15% -0.04% +0.34%
int8_prepared_dot_product/prepared/129 -0.20% -0.28% -0.12%
int8_prepared_dot_product/prepared/384 -1.20% -1.25% -1.15%
int8_prepared_dot_product/prepared/768 +0.29% +0.21% +0.36%
int8_quantization/quantize/1024 +0.02% +0.01% +0.03%
int8_quantization/quantize/1536 -0.43% -0.44% -0.42%
int8_quantization/quantize/384 +0.00% -0.01% +0.01%
int8_quantization/quantize/768 +0.02% +0.01% +0.03%
int8_raw_dot_product/dot_product_i8/1024 +0.28% +0.25% +0.32%
int8_raw_dot_product/dot_product_i8/127 -0.03% -0.07% +0.01%
int8_raw_dot_product/dot_product_i8/128 +0.71% +0.35% +1.08%
int8_raw_dot_product/dot_product_i8/129 -0.37% -0.41% -0.34%
int8_raw_dot_product/dot_product_i8/384 -1.00% -1.10% -0.90%
int8_raw_dot_product/dot_product_i8/768 -1.33% -1.44% -1.21%
int8_raw_dot_product/dot_product_i8_raw/1024 +0.03% -0.00% +0.06%
int8_raw_dot_product/dot_product_i8_raw/127 -0.27% -0.35% -0.19%
int8_raw_dot_product/dot_product_i8_raw/128 -0.23% -0.29% -0.18%
int8_raw_dot_product/dot_product_i8_raw/129 +0.10% +0.00% +0.20%
int8_raw_dot_product/dot_product_i8_raw/384 -0.44% -0.47% -0.41%
int8_raw_dot_product/dot_product_i8_raw/768 -0.32% -0.38% -0.25%
int8_vs_float32_cosine/float32_simd/1024 -0.02% -0.08% +0.03%
int8_vs_float32_cosine/float32_simd/1536 +0.11% +0.09% +0.12%
int8_vs_float32_cosine/float32_simd/384 +0.69% +0.53% +0.85%
int8_vs_float32_cosine/float32_simd/768 -0.06% -0.11% -0.02%
int8_vs_float32_cosine/int8/1024 -0.06% -0.15% +0.02%
int8_vs_float32_cosine/int8/1536 +0.63% +0.56% +0.70%
int8_vs_float32_cosine/int8/384 +0.19% +0.10% +0.28%
int8_vs_float32_cosine/int8/768 +0.26% +0.19% +0.32%
layer_norm/4096 -0.77% -0.81% -0.74%
layer_norm/896 -0.16% -0.21% -0.11%
memory_size/search_1000_float32 +0.55% +0.47% +0.63%
memory_size/search_1000_int8 +3.63% +3.60% +3.65%
rms_norm/4096 -1.10% -1.20% -1.00%
rms_norm/896 +0.09% -0.03% +0.20%
silu_inplace/4096 -0.02% -0.04% -0.00%
silu_inplace/896 -0.01% -0.03% +0.01%
simd_batch_cosine/scalar_loop/10 +0.08% +0.02% +0.13%
simd_batch_cosine/scalar_loop/100 -0.09% -0.18% -0.00%
simd_batch_cosine/scalar_loop/1000 -0.72% -0.85% -0.59%
simd_batch_cosine/simd_batch/10 +0.22% +0.16% +0.27%
simd_batch_cosine/simd_batch/100 -1.88% -1.94% -1.83%
simd_batch_cosine/simd_batch/1000 -0.71% -1.17% -0.26%
simd_batch_cosine_non_normalized_query/pair_loop/1024d_1000c +0.92% +0.75% +1.07%
simd_batch_cosine_non_normalized_query/pair_loop/1024d_16c -0.70% -0.76% -0.65%
simd_batch_cosine_non_normalized_query/pair_loop/1024d_256c -3.03% -3.30% -2.76%
simd_batch_cosine_non_normalized_query/pair_loop/1024d_4c +0.08% +0.05% +0.12%
simd_batch_cosine_non_normalized_query/pair_loop/1024d_64c +0.02% -0.06% +0.10%
simd_batch_cosine_non_normalized_query/pair_loop/384d_1000c +0.46% +0.35% +0.56%
simd_batch_cosine_non_normalized_query/pair_loop/384d_16c +0.44% +0.30% +0.57%
simd_batch_cosine_non_normalized_query/pair_loop/384d_256c -0.11% -0.15% -0.08%
simd_batch_cosine_non_normalized_query/pair_loop/384d_4c +0.10% +0.07% +0.13%
simd_batch_cosine_non_normalized_query/pair_loop/384d_64c -0.66% -0.69% -0.64%
simd_batch_cosine_non_normalized_query/pair_loop/768d_1000c +1.44% +1.20% +1.67%
simd_batch_cosine_non_normalized_query/pair_loop/768d_16c +1.45% +1.34% +1.55%
simd_batch_cosine_non_normalized_query/pair_loop/768d_256c +5.54% +5.33% +5.74%
simd_batch_cosine_non_normalized_query/pair_loop/768d_4c +0.03% +0.02% +0.04%
simd_batch_cosine_non_normalized_query/pair_loop/768d_64c -0.43% -0.46% -0.40%
simd_batch_cosine_non_normalized_query/simd_batch/1024d_1000c +0.73% +0.68% +0.77%
simd_batch_cosine_non_normalized_query/simd_batch/1024d_16c -0.83% -0.95% -0.71%
simd_batch_cosine_non_normalized_query/simd_batch/1024d_256c +2.99% +2.21% +3.79%
simd_batch_cosine_non_normalized_query/simd_batch/1024d_4c +0.18% +0.02% +0.33%
simd_batch_cosine_non_normalized_query/simd_batch/1024d_64c +0.02% -0.01% +0.05%
simd_batch_cosine_non_normalized_query/simd_batch/384d_1000c +0.12% +0.07% +0.17%
simd_batch_cosine_non_normalized_query/simd_batch/384d_16c +0.29% +0.27% +0.31%
simd_batch_cosine_non_normalized_query/simd_batch/384d_256c +0.02% -0.02% +0.06%
simd_batch_cosine_non_normalized_query/simd_batch/384d_4c +0.32% +0.30% +0.34%
simd_batch_cosine_non_normalized_query/simd_batch/384d_64c -0.50% -0.60% -0.41%
simd_batch_cosine_non_normalized_query/simd_batch/768d_1000c +1.50% +1.29% +1.71%
simd_batch_cosine_non_normalized_query/simd_batch/768d_16c +1.42% +1.39% +1.45%
simd_batch_cosine_non_normalized_query/simd_batch/768d_256c +5.35% +5.15% +5.56%
simd_batch_cosine_non_normalized_query/simd_batch/768d_4c +0.01% -0.02% +0.03%
simd_batch_cosine_non_normalized_query/simd_batch/768d_64c -0.41% -0.45% -0.38%
simd_batch_cosine_normalized_query/pair_loop_cosine/1024d_1000c +1.46% +1.31% +1.61%
simd_batch_cosine_normalized_query/pair_loop_cosine/1024d_16c -0.42% -0.46% -0.38%
simd_batch_cosine_normalized_query/pair_loop_cosine/1024d_256c -2.67% -3.00% -2.34%
simd_batch_cosine_normalized_query/pair_loop_cosine/1024d_4c +0.12% +0.11% +0.14%
simd_batch_cosine_normalized_query/pair_loop_cosine/1024d_64c +0.04% +0.02% +0.05%
simd_batch_cosine_normalized_query/pair_loop_cosine/384d_1000c +1.40% +1.12% +1.69%
simd_batch_cosine_normalized_query/pair_loop_cosine/384d_16c -0.54% -0.59% -0.49%
simd_batch_cosine_normalized_query/pair_loop_cosine/384d_256c -1.60% -1.66% -1.54%
simd_batch_cosine_normalized_query/pair_loop_cosine/384d_4c +0.29% +0.15% +0.44%
simd_batch_cosine_normalized_query/pair_loop_cosine/384d_64c +0.23% +0.20% +0.25%
simd_batch_cosine_normalized_query/pair_loop_cosine/768d_1000c +0.15% -0.02% +0.32%
simd_batch_cosine_normalized_query/pair_loop_cosine/768d_16c +0.71% +0.67% +0.74%
simd_batch_cosine_normalized_query/pair_loop_cosine/768d_256c +5.85% +5.69% +6.01%
simd_batch_cosine_normalized_query/pair_loop_cosine/768d_4c +0.16% +0.10% +0.22%
simd_batch_cosine_normalized_query/pair_loop_cosine/768d_64c +0.27% +0.22% +0.32%
simd_batch_cosine_normalized_query/pair_loop_dot/1024d_1000c +0.77% +0.58% +0.96%
simd_batch_cosine_normalized_query/pair_loop_dot/1024d_16c -3.02% -3.15% -2.88%
simd_batch_cosine_normalized_query/pair_loop_dot/1024d_256c -6.22% -6.58% -5.85%
simd_batch_cosine_normalized_query/pair_loop_dot/1024d_4c -0.47% -0.56% -0.39%
simd_batch_cosine_normalized_query/pair_loop_dot/1024d_64c +0.65% +0.61% +0.69%
simd_batch_cosine_normalized_query/pair_loop_dot/384d_1000c +0.46% +0.13% +0.79%
simd_batch_cosine_normalized_query/pair_loop_dot/384d_16c -3.96% -4.06% -3.87%
simd_batch_cosine_normalized_query/pair_loop_dot/384d_256c -1.67% -1.73% -1.61%
simd_batch_cosine_normalized_query/pair_loop_dot/384d_4c +3.02% +2.02% +4.03%
simd_batch_cosine_normalized_query/pair_loop_dot/384d_64c +1.04% +1.03% +1.06%
simd_batch_cosine_normalized_query/pair_loop_dot/768d_1000c -0.28% -0.42% -0.15%
simd_batch_cosine_normalized_query/pair_loop_dot/768d_16c +8.89% +8.25% +9.52%
simd_batch_cosine_normalized_query/pair_loop_dot/768d_256c +7.44% +6.91% +7.97%
simd_batch_cosine_normalized_query/pair_loop_dot/768d_4c +3.88% +3.76% +4.00%
simd_batch_cosine_normalized_query/pair_loop_dot/768d_64c +4.29% +4.04% +4.55%
simd_batch_cosine_normalized_query/simd_batch/1024d_1000c +1.09% +0.92% +1.26%
simd_batch_cosine_normalized_query/simd_batch/1024d_16c -0.49% -0.50% -0.48%
simd_batch_cosine_normalized_query/simd_batch/1024d_256c -0.04% -0.68% +0.61%
simd_batch_cosine_normalized_query/simd_batch/1024d_4c +0.24% +0.22% +0.26%
simd_batch_cosine_normalized_query/simd_batch/1024d_64c +0.31% +0.25% +0.37%
simd_batch_cosine_normalized_query/simd_batch/384d_1000c +0.82% +0.64% +1.00%
simd_batch_cosine_normalized_query/simd_batch/384d_16c -1.00% -1.01% -0.99%
simd_batch_cosine_normalized_query/simd_batch/384d_256c -1.70% -1.75% -1.64%
simd_batch_cosine_normalized_query/simd_batch/384d_4c +0.22% +0.17% +0.28%
simd_batch_cosine_normalized_query/simd_batch/384d_64c +0.48% +0.47% +0.48%
simd_batch_cosine_normalized_query/simd_batch/768d_1000c +0.30% +0.24% +0.36%
simd_batch_cosine_normalized_query/simd_batch/768d_16c +1.03% +0.99% +1.07%
simd_batch_cosine_normalized_query/simd_batch/768d_256c +5.28% +5.06% +5.51%
simd_batch_cosine_normalized_query/simd_batch/768d_4c +0.00% -0.08% +0.05%
simd_batch_cosine_normalized_query/simd_batch/768d_64c +0.47% +0.46% +0.49%
simd_batch_dot_product/scalar_loop/10 +0.03% +0.02% +0.05%
simd_batch_dot_product/scalar_loop/100 +0.20% +0.13% +0.27%
simd_batch_dot_product/scalar_loop/1000 -1.53% -1.57% -1.49%
simd_batch_dot_product/simd_batch/10 -5.47% -5.51% -5.43%
simd_batch_dot_product/simd_batch/100 +1.59% +1.56% +1.61%
simd_batch_dot_product/simd_batch/1000 -19.92% -20.27% -19.56%
simd_cosine_similarity/scalar/1024 -0.07% -0.09% -0.04%
simd_cosine_similarity/scalar/1536 -0.01% -0.05% +0.02%
simd_cosine_similarity/scalar/384 +0.03% -0.11% +0.17%
simd_cosine_similarity/scalar/768 +0.34% +0.31% +0.38%
simd_cosine_similarity/simd/1024 +0.42% +0.34% +0.50%
simd_cosine_similarity/simd/1536 -0.21% -0.24% -0.18%
simd_cosine_similarity/simd/384 -0.22% -0.48% +0.03%
simd_cosine_similarity/simd/768 +0.90% +0.88% +0.92%
simd_dot_product/scalar/1024 +0.00% -0.01% +0.02%
simd_dot_product/scalar/1536 +0.07% +0.06% +0.08%
simd_dot_product/scalar/384 +0.00% -0.04% +0.05%
simd_dot_product/scalar/768 +0.16% +0.10% +0.21%
simd_dot_product/simd/1024 -1.40% -1.52% -1.28%
simd_dot_product/simd/1536 +0.40% +0.36% +0.45%
simd_dot_product/simd/384 -0.16% -0.21% -0.10%
simd_dot_product/simd/768 -3.55% -3.75% -3.34%
simd_euclidean_distance/scalar/1024 -0.33% -0.55% -0.10%
simd_euclidean_distance/scalar/1536 +0.02% -0.05% +0.08%
simd_euclidean_distance/scalar/384 +0.03% -0.30% +0.36%
simd_euclidean_distance/scalar/768 -0.11% -0.22% -0.02%
simd_euclidean_distance/simd/1024 +0.09% +0.05% +0.12%
simd_euclidean_distance/simd/1536 +0.02% -0.00% +0.05%
simd_euclidean_distance/simd/384 +0.23% +0.20% +0.26%
simd_euclidean_distance/simd/768 +0.28% +0.24% +0.32%
simd_normalize/scalar/1024 +1.08% +0.82% +1.34%
simd_normalize/scalar/1536 +1.07% +0.91% +1.22%
simd_normalize/scalar/384 +1.84% +1.46% +2.21%
simd_normalize/scalar/768 +1.15% +1.01% +1.27%
simd_normalize/simd/1024 +4.14% +2.80% +5.44%
simd_normalize/simd/1536 +5.22% +4.03% +6.41%
simd_normalize/simd/384 +7.13% +5.61% +8.59%
simd_normalize/simd/768 +6.43% +5.39% +7.44%
simd_normalized_cosine_fast_path/cosine_full/1024 +0.47% +0.43% +0.51%
simd_normalized_cosine_fast_path/cosine_full/384 -0.29% -0.43% -0.14%
simd_normalized_cosine_fast_path/cosine_full/768 +0.96% +0.91% +1.00%
simd_normalized_cosine_fast_path/dot_product/1024 -0.46% -0.54% -0.38%
simd_normalized_cosine_fast_path/dot_product/384 -3.70% -3.79% -3.61%
simd_normalized_cosine_fast_path/dot_product/768 +5.56% +4.89% +6.22%
simd_prepared_query_normalized_cosine/dot_product_loop/1024 +1.78% +1.56% +2.00%
simd_prepared_query_normalized_cosine/dot_product_loop/384 +3.44% +3.40% +3.48%
simd_prepared_query_normalized_cosine/dot_product_loop/768 -0.57% -0.75% -0.40%
simd_prepared_query_normalized_cosine/prepared_full_cosine/1024 -0.53% -0.70% -0.37%
simd_prepared_query_normalized_cosine/prepared_full_cosine/384 +0.69% +0.64% +0.74%
simd_prepared_query_normalized_cosine/prepared_full_cosine/768 +0.72% +0.61% +0.83%
simd_prepared_query_normalized_cosine/prepared_meta_unit/1024 +0.54% +0.32% +0.77%
simd_prepared_query_normalized_cosine/prepared_meta_unit/384 +1.85% +1.64% +2.05%
simd_prepared_query_normalized_cosine/prepared_meta_unit/768 -0.94% -1.38% -0.50%
simd_query_batch_dot_product/pair_loop/128d_16c +0.53% +0.45% +0.61%
simd_query_batch_dot_product/pair_loop/128d_256c +4.05% +4.02% +4.08%
simd_query_batch_dot_product/pair_loop/128d_4c +1.48% +1.35% +1.61%
simd_query_batch_dot_product/pair_loop/128d_64c -0.28% -0.32% -0.23%
simd_query_batch_dot_product/pair_loop/384d_16c -1.87% -2.12% -1.63%
simd_query_batch_dot_product/pair_loop/384d_256c +5.48% +5.40% +5.56%
simd_query_batch_dot_product/pair_loop/384d_4c +1.12% +0.97% +1.27%
simd_query_batch_dot_product/pair_loop/384d_64c +1.30% +1.28% +1.33%
simd_query_batch_dot_product/pair_loop/768d_16c -7.04% -7.08% -7.00%
simd_query_batch_dot_product/pair_loop/768d_256c +9.43% +9.16% +9.69%
simd_query_batch_dot_product/pair_loop/768d_4c +1.80% +1.57% +2.03%
simd_query_batch_dot_product/pair_loop/768d_64c +3.50% +3.24% +3.76%
simd_query_batch_dot_product/simd_batch/128d_16c +1.09% +1.00% +1.17%
simd_query_batch_dot_product/simd_batch/128d_256c +1.71% +1.68% +1.73%
simd_query_batch_dot_product/simd_batch/128d_4c -0.37% -0.45% -0.30%
simd_query_batch_dot_product/simd_batch/128d_64c +0.76% +0.74% +0.78%
simd_query_batch_dot_product/simd_batch/384d_16c +0.07% +0.03% +0.11%
simd_query_batch_dot_product/simd_batch/384d_256c +5.54% +5.49% +5.59%
simd_query_batch_dot_product/simd_batch/384d_4c +0.80% +0.59% +1.00%
simd_query_batch_dot_product/simd_batch/384d_64c -1.31% -1.75% -0.86%
simd_query_batch_dot_product/simd_batch/768d_16c +5.06% +5.02% +5.09%
simd_query_batch_dot_product/simd_batch/768d_256c +8.68% +8.18% +9.19%
simd_query_batch_dot_product/simd_batch/768d_4c -0.27% -0.30% -0.25%
simd_query_batch_dot_product/simd_batch/768d_64c -0.69% -0.89% -0.49%
simd_squared_euclidean_fast_path/euclidean_full/1024 +0.14% +0.05% +0.24%
simd_squared_euclidean_fast_path/euclidean_full/384 +0.32% +0.27% +0.38%
simd_squared_euclidean_fast_path/euclidean_full/768 -0.27% -0.32% -0.22%
simd_squared_euclidean_fast_path/squared_euclidean/1024 +0.09% +0.07% +0.10%
simd_squared_euclidean_fast_path/squared_euclidean/384 +0.27% +0.23% +0.31%
simd_squared_euclidean_fast_path/squared_euclidean/768 -0.62% -0.65% -0.58%
simd_throughput_384/cosine_similarity -0.21% -0.30% -0.13%
simd_throughput_384/dot_product -2.50% -2.54% -2.46%
simd_throughput_384/euclidean_distance -0.27% -0.30% -0.25%
simd_throughput_384/normalize -8.36% -8.37% -8.34%
softmax_attention/128 -0.08% -0.09% -0.06%
softmax_attention/512 -1.19% -1.25% -1.12%
tier_prepared_batch_sizes/int4_batch_prepared/10 +0.28% +0.21% +0.35%
tier_prepared_batch_sizes/int4_batch_prepared/100 -0.14% -0.23% -0.07%
tier_prepared_batch_sizes/int4_batch_prepared/1000 -0.05% -0.31% +0.20%
tier_prepared_batch_sizes/int4_query_per_call/10 +0.74% +0.71% +0.77%
tier_prepared_batch_sizes/int4_query_per_call/100 +0.61% +0.57% +0.64%
tier_prepared_batch_sizes/int4_query_per_call/1000 +0.57% +0.55% +0.59%
tier_prepared_batch_sizes/int8_batch_prepared/10 -0.56% -0.61% -0.51%
tier_prepared_batch_sizes/int8_batch_prepared/100 +0.57% +0.53% +0.62%
tier_prepared_batch_sizes/int8_batch_prepared/1000 +2.37% +2.13% +2.61%
tier_prepared_batch_sizes/int8_query_per_call/10 -0.02% -0.03% -0.00%
tier_prepared_batch_sizes/int8_query_per_call/100 +0.01% -0.01% +0.02%
tier_prepared_batch_sizes/int8_query_per_call/1000 +0.01% -0.02% +0.03%
tier_prepared_query/binary_query_once_1000 -0.04% -0.22% +0.14%
tier_prepared_query/binary_query_per_call_1000 -0.02% -0.02% -0.01%
tier_prepared_query/int4_query_once_1000 +0.15% +0.11% +0.19%
tier_prepared_query/int4_query_per_call_1000 +0.23% +0.23% +0.24%
tier_prepared_query/int8_query_once_1000 +0.58% +0.53% +0.62%
tier_prepared_query/int8_query_per_call_1000 -0.02% -0.03% -0.00%

Rule: CI-lower of change ≤3.0% passes silently; (3.0%, 7.0%] warns; >7.0% fails. Override via PR label bench-allow-regression.

x86_64-linux — perf regression report

❌ 25 FAIL (regression >7.0% confirmed by 95% CI)
⚠ 30 WARN (regression 3.0-7.0% confirmed)
🚀 18 confirmed improvement

Bench Δ point 95% CI new ns base ns verdict
simd_normalized_cosine_fast_path/dot_product/1024 +49.12% [+48.74%, +49.50%] 78.4 78.4 ❌ FAIL
simd_squared_euclidean_fast_path/squared_euclidean/768 +44.98% [+44.21%, +45.71%] 61.3 61.3 ❌ FAIL
simd_squared_euclidean_fast_path/euclidean_full/768 +38.32% [+37.83%, +38.78%] 65.6 65.6 ❌ FAIL
simd_query_batch_dot_product/simd_batch/768d_4c +35.22% [+34.82%, +35.61%] 199.1 199.1 ❌ FAIL
simd_query_batch_dot_product/pair_loop/768d_4c +33.50% [+33.16%, +33.80%] 300.1 300.1 ❌ FAIL
simd_dot_product/simd/768 +33.29% [+32.99%, +33.54%] 54.2 54.2 ❌ FAIL
simd_query_batch_dot_product/simd_batch/384d_16c +25.91% [+25.64%, +26.18%] 329.7 329.7 ❌ FAIL
simd_query_batch_dot_product/pair_loop/768d_16c +24.47% [+23.61%, +25.13%] 1054.6 1054.6 ❌ FAIL
simd_batch_cosine_normalized_query/pair_loop_dot/768d_4c +22.68% [+22.32%, +22.99%] 257.9 257.9 ❌ FAIL
simd_batch_cosine_normalized_query/pair_loop_dot/768d_256c +21.45% [+19.46%, +23.43%] 15950.4 15950.4 ❌ FAIL
simd_normalized_cosine_fast_path/dot_product/768 +17.94% [+17.65%, +18.21%] 60.5 60.5 ❌ FAIL
simd_euclidean_distance/simd/1536 +17.43% [+17.21%, +17.57%] 120.2 120.2 ❌ FAIL
layer_norm/896 +16.48% [+16.28%, +16.68%] 205.9 205.9 ❌ FAIL
simd_batch_cosine_normalized_query/pair_loop_dot/768d_64c +16.11% [+16.00%, +16.21%] 3862.5 3862.5 ❌ FAIL
simd_query_batch_dot_product/pair_loop/768d_256c +16.36% [+15.89%, +16.69%] 15542.2 15542.2 ❌ FAIL
simd_batch_cosine_normalized_query/pair_loop_dot/768d_16c +16.11% [+15.84%, +16.33%] 1003.6 1003.6 ❌ FAIL
simd_query_batch_dot_product/pair_loop/768d_64c +18.98% [+15.73%, +22.22%] 3938.8 3938.8 ❌ FAIL
simd_query_batch_dot_product/simd_batch/768d_16c +12.77% [+12.37%, +13.09%] 682.5 682.5 ❌ FAIL
simd_batch_cosine_normalized_query/pair_loop_dot/384d_64c +12.40% [+12.34%, +12.45%] 2155.2 2155.2 ❌ FAIL
simd_batch_cosine_normalized_query/pair_loop_dot/384d_4c +12.50% [+11.87%, +13.10%] 141.4 141.4 ❌ FAIL
simd_query_batch_dot_product/pair_loop/384d_16c +10.25% [+9.92%, +10.52%] 517.1 517.1 ❌ FAIL
simd_batch_cosine_normalized_query/pair_loop_dot/384d_1000c +9.89% [+9.57%, +10.12%] 34865.6 34865.6 ❌ FAIL
simd_batch_cosine_normalized_query/pair_loop_dot/384d_256c +9.93% [+9.53%, +10.30%] 8476.9 8476.9 ❌ FAIL
simd_batch_cosine_non_normalized_query/pair_loop/384d_16c +9.01% [+8.63%, +9.31%] 695.4 695.4 ❌ FAIL
simd_batch_cosine_normalized_query/pair_loop_dot/768d_1000c +9.52% [+8.53%, +10.22%] 61761.2 61761.2 ❌ FAIL
simd_batch_cosine_normalized_query/pair_loop_cosine/768d_4c +7.32% [+6.99%, +7.64%] 285.6 285.6 ⚠ WARN
simd_normalized_cosine_fast_path/cosine_full/1024 +7.29% [+6.88%, +7.69%] 91.1 91.1 ⚠ WARN
simd_batch_cosine_non_normalized_query/pair_loop/768d_256c +7.25% [+6.79%, +7.70%] 18093.3 18093.3 ⚠ WARN
simd_batch_cosine_non_normalized_query/pair_loop/384d_4c +7.30% [+6.78%, +7.82%] 179.5 179.5 ⚠ WARN
simd_batch_cosine_non_normalized_query/pair_loop/768d_16c +6.90% [+6.76%, +7.03%] 1146.7 1146.7 ⚠ WARN
simd_batch_cosine_non_normalized_query/simd_batch/768d_256c +6.35% [+6.10%, +6.52%] 17748.5 17748.5 ⚠ WARN
elementwise_mul/4096 +5.83% [+5.46%, +6.15%] 317.6 317.6 ⚠ WARN
simd_normalize/simd/384 +8.64% [+5.37%, +12.03%] 76.8 76.8 ⚠ WARN
simd_batch_cosine_normalized_query/pair_loop_cosine/1024d_256c +5.58% [+5.29%, +5.77%] 23228.0 23228.0 ⚠ WARN
int8_batch_cosine/int8_loop/10 +5.48% [+5.21%, +5.67%] 176.9 176.9 ⚠ WARN
simd_batch_cosine_non_normalized_query/pair_loop/1024d_256c +5.77% [+5.17%, +6.14%] 23273.3 23273.3 ⚠ WARN
simd_batch_cosine_normalized_query/pair_loop_dot/1024d_256c +5.32% [+5.14%, +5.48%] 19091.5 19091.5 ⚠ WARN
simd_batch_cosine_non_normalized_query/pair_loop/768d_4c +5.37% [+5.13%, +5.53%] 281.2 281.2 ⚠ WARN
simd_batch_cosine_normalized_query/pair_loop_cosine/768d_256c +5.34% [+4.97%, +5.63%] 17796.2 17796.2 ⚠ WARN
simd_normalized_cosine_fast_path/cosine_full/768 +5.42% [+4.93%, +5.90%] 72.9 72.9 ⚠ WARN
simd_batch_cosine_non_normalized_query/simd_batch/1024d_256c +5.16% [+4.87%, +5.44%] 22817.1 22817.1 ⚠ WARN
simd_batch_cosine_non_normalized_query/pair_loop/768d_64c +5.03% [+4.81%, +5.19%] 4427.7 4427.7 ⚠ WARN
simd_batch_cosine_normalized_query/simd_batch/768d_256c +5.17% [+4.80%, +5.48%] 17586.5 17586.5 ⚠ WARN
simd_batch_cosine_non_normalized_query/simd_batch/768d_16c +4.88% [+4.71%, +5.04%] 1112.2 1112.2 ⚠ WARN
simd_batch_cosine_normalized_query/simd_batch/1024d_256c +4.94% [+4.71%, +5.13%] 22946.6 22946.6 ⚠ WARN
simd_batch_cosine_normalized_query/pair_loop_cosine/384d_4c +4.69% [+4.44%, +4.94%] 175.4 175.4 ⚠ WARN
simd_query_batch_dot_product/simd_batch/384d_64c +5.21% [+4.29%, +6.00%] 1308.5 1308.5 ⚠ WARN
simd_query_batch_dot_product/simd_batch/384d_4c +4.66% [+4.27%, +5.03%] 81.9 81.9 ⚠ WARN
simd_batch_cosine_normalized_query/pair_loop_dot/1024d_16c +4.39% [+4.10%, +4.64%] 1125.9 1125.9 ⚠ WARN
simd_query_batch_dot_product/pair_loop/384d_64c +4.39% [+3.97%, +4.74%] 2003.0 2003.0 ⚠ WARN
simd_batch_cosine_normalized_query/simd_batch/768d_4c +4.60% [+3.88%, +5.29%] 276.5 276.5 ⚠ WARN
simd_query_batch_dot_product/simd_batch/768d_256c +4.17% [+3.37%, +4.92%] 10090.2 10090.2 ⚠ WARN
simd_batch_cosine_non_normalized_query/simd_batch/768d_4c +3.66% [+3.34%, +3.97%] 273.3 273.3 ⚠ WARN
simd_query_batch_dot_product/pair_loop/384d_4c +3.63% [+3.30%, +3.89%] 132.6 132.6 ⚠ WARN
simd_batch_cosine_non_normalized_query/simd_batch/768d_64c +3.46% [+3.03%, +3.84%] 4312.9 4312.9 ⚠ WARN
simd_throughput_384/normalize -3.23% [-3.37%, -3.12%] 107.5 107.5 🚀 WIN
simd_query_batch_dot_product/simd_batch/128d_16c -3.08% [-3.45%, -2.71%] 128.7 128.7 🚀 WIN
simd_batch_cosine_normalized_query/pair_loop_cosine/384d_16c -3.86% [-4.07%, -3.66%] 652.8 652.8 🚀 WIN
rms_norm/4096 -3.98% [-4.17%, -3.81%] 898.9 898.9 🚀 WIN
int8_batch_cosine/int8_loop/1000 -4.57% [-4.72%, -4.42%] 18243.6 18243.6 🚀 WIN
simd_batch_cosine_normalized_query/simd_batch/384d_16c -4.97% [-5.35%, -4.58%] 631.7 631.7 🚀 WIN
simd_batch_cosine_normalized_query/pair_loop_dot/1024d_1000c -4.94% [-5.52%, -4.42%] 69682.6 69682.6 🚀 WIN
int8_vs_float32_cosine/float32_simd/768 -5.38% [-5.55%, -5.22%] 68.7 68.7 🚀 WIN
simd_cosine_similarity/simd/1024 -5.65% [-5.80%, -5.51%] 85.7 85.7 🚀 WIN
int8_vs_float32_cosine/float32_simd/1536 -6.06% [-6.32%, -5.80%] 119.6 119.6 🚀 WIN
binary_cosine_distance/float32_simd/768 -6.14% [-6.38%, -5.91%] 68.8 68.8 🚀 WIN
simd_prepared_query_normalized_cosine/prepared_meta_unit/1024 -5.83% [-6.51%, -5.16%] 66324.8 66324.8 🚀 WIN
int4_cosine_distance/float32_simd/768 -6.16% [-6.60%, -5.82%] 69.0 69.0 🚀 WIN
simd_squared_euclidean_fast_path/squared_euclidean/1024 -8.28% [-8.85%, -7.77%] 79.4 79.4 🚀 WIN
simd_prepared_query_normalized_cosine/dot_product_loop/384 -11.05% [-11.25%, -10.86%] 31776.6 31776.6 🚀 WIN
simd_squared_euclidean_fast_path/euclidean_full/384 -15.94% [-16.45%, -15.50%] 28.9 28.9 🚀 WIN
simd_squared_euclidean_fast_path/squared_euclidean/384 -18.65% [-18.95%, -18.35%] 23.9 23.9 🚀 WIN
simd_normalized_cosine_fast_path/dot_product/384 -19.25% [-19.79%, -18.72%] 22.3 22.3 🚀 WIN
All 259 measurements
Bench Δ point CI-lower CI-upper
add_bias_gelu/4096 +0.25% +0.10% +0.40%
add_bias_gelu/896 +0.15% -0.09% +0.40%
binary_cosine_distance/binary/1024 +0.50% +0.09% +0.91%
binary_cosine_distance/binary/1536 +0.05% -0.37% +0.44%
binary_cosine_distance/binary/384 +0.96% +0.53% +1.38%
binary_cosine_distance/binary/768 -0.58% -1.86% +0.56%
binary_cosine_distance/float32_simd/1024 +0.00% -0.20% +0.19%
binary_cosine_distance/float32_simd/1536 -0.06% -0.24% +0.08%
binary_cosine_distance/float32_simd/384 +0.02% -0.40% +0.45%
binary_cosine_distance/float32_simd/768 -6.14% -6.38% -5.91%
elementwise_mul/4096 +5.83% +5.46% +6.15%
gelu/4096 -0.41% -1.42% +0.36%
gelu/896 +0.09% -0.04% +0.18%
int4_cosine_distance/float32_simd/1024 +0.04% -0.22% +0.30%
int4_cosine_distance/float32_simd/1536 -0.09% -0.34% +0.09%
int4_cosine_distance/float32_simd/384 -0.85% -1.24% -0.46%
int4_cosine_distance/float32_simd/768 -6.16% -6.60% -5.82%
int4_cosine_distance/int4/1024 -0.05% -0.16% +0.03%
int4_cosine_distance/int4/1536 -0.44% -0.83% -0.07%
int4_cosine_distance/int4/384 +0.37% +0.11% +0.55%
int4_cosine_distance/int4/768 -2.77% -4.74% -1.10%
int8_batch_cosine/float32_simd/10 -0.66% -1.54% +0.09%
int8_batch_cosine/float32_simd/100 -2.49% -3.09% -1.89%
int8_batch_cosine/float32_simd/1000 +0.72% +0.28% +1.14%
int8_batch_cosine/int8_loop/10 +5.48% +5.21% +5.67%
int8_batch_cosine/int8_loop/100 +1.29% +0.87% +1.70%
int8_batch_cosine/int8_loop/1000 -4.57% -4.72% -4.42%
int8_prepared_dot_product/per_call/1024 -0.19% -0.45% -0.04%
int8_prepared_dot_product/per_call/127 -0.07% -0.28% +0.09%
int8_prepared_dot_product/per_call/128 -0.05% -0.22% +0.08%
int8_prepared_dot_product/per_call/129 +0.19% +0.05% +0.28%
int8_prepared_dot_product/per_call/384 -0.50% -1.03% -0.13%
int8_prepared_dot_product/per_call/768 +0.68% -0.35% +1.69%
int8_prepared_dot_product/prepared/1024 -1.49% -1.72% -1.29%
int8_prepared_dot_product/prepared/127 +2.14% +1.46% +2.71%
int8_prepared_dot_product/prepared/128 +1.19% -0.35% +2.51%
int8_prepared_dot_product/prepared/129 +3.56% +2.91% +4.21%
int8_prepared_dot_product/prepared/384 -0.90% -1.59% -0.23%
int8_prepared_dot_product/prepared/768 +0.98% +0.08% +1.77%
int8_quantization/quantize/1024 -0.15% -0.76% +0.26%
int8_quantization/quantize/1536 -0.04% -0.62% +0.42%
int8_quantization/quantize/384 +0.53% -0.37% +1.40%
int8_quantization/quantize/768 -0.28% -0.77% +0.06%
int8_raw_dot_product/dot_product_i8/1024 -0.98% -1.52% -0.44%
int8_raw_dot_product/dot_product_i8/127 +0.41% +0.05% +0.78%
int8_raw_dot_product/dot_product_i8/128 +1.19% +0.61% +1.70%
int8_raw_dot_product/dot_product_i8/129 +2.02% +1.52% +2.39%
int8_raw_dot_product/dot_product_i8/384 -1.69% -2.21% -1.29%
int8_raw_dot_product/dot_product_i8/768 +0.27% -0.16% +0.61%
int8_raw_dot_product/dot_product_i8_raw/1024 -2.34% -3.15% -1.54%
int8_raw_dot_product/dot_product_i8_raw/127 -0.42% -0.98% +0.01%
int8_raw_dot_product/dot_product_i8_raw/128 -0.41% -0.71% -0.16%
int8_raw_dot_product/dot_product_i8_raw/129 +3.35% +2.11% +4.54%
int8_raw_dot_product/dot_product_i8_raw/384 -0.39% -0.61% -0.18%
int8_raw_dot_product/dot_product_i8_raw/768 +0.62% +0.26% +0.96%
int8_vs_float32_cosine/float32_simd/1024 -0.97% -1.04% -0.92%
int8_vs_float32_cosine/float32_simd/1536 -6.06% -6.32% -5.80%
int8_vs_float32_cosine/float32_simd/384 -0.51% -0.73% -0.29%
int8_vs_float32_cosine/float32_simd/768 -5.38% -5.55% -5.22%
int8_vs_float32_cosine/int8/1024 -0.83% -1.07% -0.60%
int8_vs_float32_cosine/int8/1536 -0.53% -1.00% -0.08%
int8_vs_float32_cosine/int8/384 -0.88% -1.72% -0.19%
int8_vs_float32_cosine/int8/768 -0.45% -0.78% -0.16%
layer_norm/4096 +2.00% +1.86% +2.12%
layer_norm/896 +16.48% +16.28% +16.68%
memory_size/search_1000_float32 -1.29% -1.65% -0.94%
memory_size/search_1000_int8 +0.15% -0.91% +0.96%
rms_norm/4096 -3.98% -4.17% -3.81%
rms_norm/896 -0.79% -1.00% -0.58%
silu_inplace/4096 -0.35% -1.02% +0.21%
silu_inplace/896 -0.12% -0.55% +0.28%
simd_batch_cosine/scalar_loop/10 -0.09% -0.33% +0.15%
simd_batch_cosine/scalar_loop/100 -0.13% -0.45% +0.10%
simd_batch_cosine/scalar_loop/1000 +0.00% -0.10% +0.08%
simd_batch_cosine/simd_batch/10 +0.96% +0.83% +1.07%
simd_batch_cosine/simd_batch/100 -2.02% -2.45% -1.60%
simd_batch_cosine/simd_batch/1000 +0.30% -0.37% +0.97%
simd_batch_cosine_non_normalized_query/pair_loop/1024d_1000c -1.27% -1.51% -1.11%
simd_batch_cosine_non_normalized_query/pair_loop/1024d_16c +0.61% +0.49% +0.74%
simd_batch_cosine_non_normalized_query/pair_loop/1024d_256c +5.77% +5.17% +6.14%
simd_batch_cosine_non_normalized_query/pair_loop/1024d_4c +0.52% +0.13% +0.88%
simd_batch_cosine_non_normalized_query/pair_loop/1024d_64c -0.95% -1.76% -0.31%
simd_batch_cosine_non_normalized_query/pair_loop/384d_1000c +2.98% +2.62% +3.25%
simd_batch_cosine_non_normalized_query/pair_loop/384d_16c +9.01% +8.63% +9.31%
simd_batch_cosine_non_normalized_query/pair_loop/384d_256c +1.78% +1.56% +1.91%
simd_batch_cosine_non_normalized_query/pair_loop/384d_4c +7.30% +6.78% +7.82%
simd_batch_cosine_non_normalized_query/pair_loop/384d_64c +3.09% +2.84% +3.33%
simd_batch_cosine_non_normalized_query/pair_loop/768d_1000c +2.87% +2.53% +3.16%
simd_batch_cosine_non_normalized_query/pair_loop/768d_16c +6.90% +6.76% +7.03%
simd_batch_cosine_non_normalized_query/pair_loop/768d_256c +7.25% +6.79% +7.70%
simd_batch_cosine_non_normalized_query/pair_loop/768d_4c +5.37% +5.13% +5.53%
simd_batch_cosine_non_normalized_query/pair_loop/768d_64c +5.03% +4.81% +5.19%
simd_batch_cosine_non_normalized_query/simd_batch/1024d_1000c -1.23% -1.49% -0.98%
simd_batch_cosine_non_normalized_query/simd_batch/1024d_16c +0.66% +0.50% +0.78%
simd_batch_cosine_non_normalized_query/simd_batch/1024d_256c +5.16% +4.87% +5.44%
simd_batch_cosine_non_normalized_query/simd_batch/1024d_4c +0.11% -0.14% +0.36%
simd_batch_cosine_non_normalized_query/simd_batch/1024d_64c +0.08% -0.10% +0.19%
simd_batch_cosine_non_normalized_query/simd_batch/384d_1000c +2.61% +1.66% +3.57%
simd_batch_cosine_non_normalized_query/simd_batch/384d_16c +3.95% +3.00% +4.89%
simd_batch_cosine_non_normalized_query/simd_batch/384d_256c +0.53% -0.75% +1.81%
simd_batch_cosine_non_normalized_query/simd_batch/384d_4c +0.75% +0.15% +1.34%
simd_batch_cosine_non_normalized_query/simd_batch/384d_64c +2.28% +2.10% +2.44%
simd_batch_cosine_non_normalized_query/simd_batch/768d_1000c +3.27% +2.89% +3.60%
simd_batch_cosine_non_normalized_query/simd_batch/768d_16c +4.88% +4.71% +5.04%
simd_batch_cosine_non_normalized_query/simd_batch/768d_256c +6.35% +6.10% +6.52%
simd_batch_cosine_non_normalized_query/simd_batch/768d_4c +3.66% +3.34% +3.97%
simd_batch_cosine_non_normalized_query/simd_batch/768d_64c +3.46% +3.03% +3.84%
simd_batch_cosine_normalized_query/pair_loop_cosine/1024d_1000c -1.83% -2.31% -1.50%
simd_batch_cosine_normalized_query/pair_loop_cosine/1024d_16c +0.44% +0.11% +0.73%
simd_batch_cosine_normalized_query/pair_loop_cosine/1024d_256c +5.58% +5.29% +5.77%
simd_batch_cosine_normalized_query/pair_loop_cosine/1024d_4c -0.37% -0.50% -0.25%
simd_batch_cosine_normalized_query/pair_loop_cosine/1024d_64c -0.19% -0.41% -0.03%
simd_batch_cosine_normalized_query/pair_loop_cosine/384d_1000c +2.28% +2.11% +2.44%
simd_batch_cosine_normalized_query/pair_loop_cosine/384d_16c -3.86% -4.07% -3.66%
simd_batch_cosine_normalized_query/pair_loop_cosine/384d_256c +1.09% +0.91% +1.25%
simd_batch_cosine_normalized_query/pair_loop_cosine/384d_4c +4.69% +4.44% +4.94%
simd_batch_cosine_normalized_query/pair_loop_cosine/384d_64c +1.72% +1.50% +1.89%
simd_batch_cosine_normalized_query/pair_loop_cosine/768d_1000c +2.45% +2.24% +2.66%
simd_batch_cosine_normalized_query/pair_loop_cosine/768d_16c +3.25% +2.92% +3.56%
simd_batch_cosine_normalized_query/pair_loop_cosine/768d_256c +5.34% +4.97% +5.63%
simd_batch_cosine_normalized_query/pair_loop_cosine/768d_4c +7.32% +6.99% +7.64%
simd_batch_cosine_normalized_query/pair_loop_cosine/768d_64c +2.99% +2.85% +3.09%
simd_batch_cosine_normalized_query/pair_loop_dot/1024d_1000c -4.94% -5.52% -4.42%
simd_batch_cosine_normalized_query/pair_loop_dot/1024d_16c +4.39% +4.10% +4.64%
simd_batch_cosine_normalized_query/pair_loop_dot/1024d_256c +5.32% +5.14% +5.48%
simd_batch_cosine_normalized_query/pair_loop_dot/1024d_4c +2.95% +2.75% +3.06%
simd_batch_cosine_normalized_query/pair_loop_dot/1024d_64c +1.41% +1.19% +1.54%
simd_batch_cosine_normalized_query/pair_loop_dot/384d_1000c +9.89% +9.57% +10.12%
simd_batch_cosine_normalized_query/pair_loop_dot/384d_16c +0.87% +0.36% +1.38%
simd_batch_cosine_normalized_query/pair_loop_dot/384d_256c +9.93% +9.53% +10.30%
simd_batch_cosine_normalized_query/pair_loop_dot/384d_4c +12.50% +11.87% +13.10%
simd_batch_cosine_normalized_query/pair_loop_dot/384d_64c +12.40% +12.34% +12.45%
simd_batch_cosine_normalized_query/pair_loop_dot/768d_1000c +9.52% +8.53% +10.22%
simd_batch_cosine_normalized_query/pair_loop_dot/768d_16c +16.11% +15.84% +16.33%
simd_batch_cosine_normalized_query/pair_loop_dot/768d_256c +21.45% +19.46% +23.43%
simd_batch_cosine_normalized_query/pair_loop_dot/768d_4c +22.68% +22.32% +22.99%
simd_batch_cosine_normalized_query/pair_loop_dot/768d_64c +16.11% +16.00% +16.21%
simd_batch_cosine_normalized_query/simd_batch/1024d_1000c -1.44% -1.61% -1.29%
simd_batch_cosine_normalized_query/simd_batch/1024d_16c +0.79% +0.71% +0.85%
simd_batch_cosine_normalized_query/simd_batch/1024d_256c +4.94% +4.71% +5.13%
simd_batch_cosine_normalized_query/simd_batch/1024d_4c -0.68% -1.03% -0.40%
simd_batch_cosine_normalized_query/simd_batch/1024d_64c -0.27% -0.66% +0.01%
simd_batch_cosine_normalized_query/simd_batch/384d_1000c +1.93% +1.13% +2.69%
simd_batch_cosine_normalized_query/simd_batch/384d_16c -4.97% -5.35% -4.58%
simd_batch_cosine_normalized_query/simd_batch/384d_256c +2.10% +0.20% +4.00%
simd_batch_cosine_normalized_query/simd_batch/384d_4c +2.76% +2.25% +3.26%
simd_batch_cosine_normalized_query/simd_batch/384d_64c +0.97% +0.72% +1.21%
simd_batch_cosine_normalized_query/simd_batch/768d_1000c +2.62% +2.26% +2.95%
simd_batch_cosine_normalized_query/simd_batch/768d_16c +3.29% +2.87% +3.69%
simd_batch_cosine_normalized_query/simd_batch/768d_256c +5.17% +4.80% +5.48%
simd_batch_cosine_normalized_query/simd_batch/768d_4c +4.60% +3.88% +5.29%
simd_batch_cosine_normalized_query/simd_batch/768d_64c +2.56% +2.36% +2.76%
simd_batch_dot_product/scalar_loop/10 +0.13% +0.06% +0.19%
simd_batch_dot_product/scalar_loop/100 +0.01% -0.11% +0.10%
simd_batch_dot_product/scalar_loop/1000 +0.74% +0.65% +0.81%
simd_batch_dot_product/simd_batch/10 -0.07% -0.34% +0.13%
simd_batch_dot_product/simd_batch/100 +0.08% -0.02% +0.16%
simd_batch_dot_product/simd_batch/1000 -1.52% -2.21% -0.90%
simd_cosine_similarity/scalar/1024 -0.17% -0.39% +0.01%
simd_cosine_similarity/scalar/1536 +0.01% -0.06% +0.07%
simd_cosine_similarity/scalar/384 -0.12% -0.36% +0.05%
simd_cosine_similarity/scalar/768 -0.00% -0.11% +0.10%
simd_cosine_similarity/simd/1024 -5.65% -5.80% -5.51%
simd_cosine_similarity/simd/1536 +1.67% +1.25% +2.08%
simd_cosine_similarity/simd/384 +0.02% -0.31% +0.30%
simd_cosine_similarity/simd/768 +1.96% +1.26% +2.65%
simd_dot_product/scalar/1024 +0.06% -0.06% +0.17%
simd_dot_product/scalar/1536 +0.11% -0.30% +0.51%
simd_dot_product/scalar/384 -0.02% -0.15% +0.08%
simd_dot_product/scalar/768 +0.24% +0.18% +0.30%
simd_dot_product/simd/1024 +0.78% +0.48% +1.05%
simd_dot_product/simd/1536 +0.17% +0.01% +0.30%
simd_dot_product/simd/384 -1.33% -2.06% -0.58%
simd_dot_product/simd/768 +33.29% +32.99% +33.54%
simd_euclidean_distance/scalar/1024 -0.08% -0.25% +0.08%
simd_euclidean_distance/scalar/1536 -0.16% -0.42% +0.07%
simd_euclidean_distance/scalar/384 +0.77% +0.22% +1.32%
simd_euclidean_distance/scalar/768 -0.07% -0.20% +0.06%
simd_euclidean_distance/simd/1024 -1.35% -1.88% -0.84%
simd_euclidean_distance/simd/1536 +17.43% +17.21% +17.57%
simd_euclidean_distance/simd/384 -0.11% -0.38% +0.12%
simd_euclidean_distance/simd/768 +0.10% -0.06% +0.26%
simd_normalize/scalar/1024 +1.24% +0.95% +1.52%
simd_normalize/scalar/1536 +0.48% +0.30% +0.65%
simd_normalize/scalar/384 +1.82% +1.51% +2.11%
simd_normalize/scalar/768 +0.93% +0.56% +1.24%
simd_normalize/simd/1024 +1.73% +0.01% +3.48%
simd_normalize/simd/1536 -2.35% -4.65% -0.02%
simd_normalize/simd/384 +8.64% +5.37% +12.03%
simd_normalize/simd/768 +4.10% +2.13% +6.15%
simd_normalized_cosine_fast_path/cosine_full/1024 +7.29% +6.88% +7.69%
simd_normalized_cosine_fast_path/cosine_full/384 -1.37% -1.76% -0.99%
simd_normalized_cosine_fast_path/cosine_full/768 +5.42% +4.93% +5.90%
simd_normalized_cosine_fast_path/dot_product/1024 +49.12% +48.74% +49.50%
simd_normalized_cosine_fast_path/dot_product/384 -19.25% -19.79% -18.72%
simd_normalized_cosine_fast_path/dot_product/768 +17.94% +17.65% +18.21%
simd_prepared_query_normalized_cosine/dot_product_loop/1024 -1.84% -1.99% -1.71%
simd_prepared_query_normalized_cosine/dot_product_loop/384 -11.05% -11.25% -10.86%
simd_prepared_query_normalized_cosine/dot_product_loop/768 -2.66% -2.95% -2.40%
simd_prepared_query_normalized_cosine/prepared_full_cosine/1024 -1.87% -2.38% -1.50%
simd_prepared_query_normalized_cosine/prepared_full_cosine/384 +0.92% +0.60% +1.16%
simd_prepared_query_normalized_cosine/prepared_full_cosine/768 +0.49% -0.00% +0.86%
simd_prepared_query_normalized_cosine/prepared_meta_unit/1024 -5.83% -6.51% -5.16%
simd_prepared_query_normalized_cosine/prepared_meta_unit/384 +0.99% +0.77% +1.20%
simd_prepared_query_normalized_cosine/prepared_meta_unit/768 -1.28% -1.63% -0.93%
simd_query_batch_dot_product/pair_loop/128d_16c -2.87% -3.13% -2.62%
simd_query_batch_dot_product/pair_loop/128d_256c -1.86% -2.30% -1.43%
simd_query_batch_dot_product/pair_loop/128d_4c +3.68% +2.89% +4.47%
simd_query_batch_dot_product/pair_loop/128d_64c -2.53% -2.67% -2.42%
simd_query_batch_dot_product/pair_loop/384d_16c +10.25% +9.92% +10.52%
simd_query_batch_dot_product/pair_loop/384d_256c -0.30% -0.49% -0.15%
simd_query_batch_dot_product/pair_loop/384d_4c +3.63% +3.30% +3.89%
simd_query_batch_dot_product/pair_loop/384d_64c +4.39% +3.97% +4.74%
simd_query_batch_dot_product/pair_loop/768d_16c +24.47% +23.61% +25.13%
simd_query_batch_dot_product/pair_loop/768d_256c +16.36% +15.89% +16.69%
simd_query_batch_dot_product/pair_loop/768d_4c +33.50% +33.16% +33.80%
simd_query_batch_dot_product/pair_loop/768d_64c +18.98% +15.73% +22.22%
simd_query_batch_dot_product/simd_batch/128d_16c -3.08% -3.45% -2.71%
simd_query_batch_dot_product/simd_batch/128d_256c +0.38% +0.10% +0.55%
simd_query_batch_dot_product/simd_batch/128d_4c -2.87% -3.30% -2.53%
simd_query_batch_dot_product/simd_batch/128d_64c -1.68% -2.30% -1.27%
simd_query_batch_dot_product/simd_batch/384d_16c +25.91% +25.64% +26.18%
simd_query_batch_dot_product/simd_batch/384d_256c +2.57% +2.36% +2.79%
simd_query_batch_dot_product/simd_batch/384d_4c +4.66% +4.27% +5.03%
simd_query_batch_dot_product/simd_batch/384d_64c +5.21% +4.29% +6.00%
simd_query_batch_dot_product/simd_batch/768d_16c +12.77% +12.37% +13.09%
simd_query_batch_dot_product/simd_batch/768d_256c +4.17% +3.37% +4.92%
simd_query_batch_dot_product/simd_batch/768d_4c +35.22% +34.82% +35.61%
simd_query_batch_dot_product/simd_batch/768d_64c +1.58% +1.26% +1.84%
simd_squared_euclidean_fast_path/euclidean_full/1024 -1.02% -2.68% +0.63%
simd_squared_euclidean_fast_path/euclidean_full/384 -15.94% -16.45% -15.50%
simd_squared_euclidean_fast_path/euclidean_full/768 +38.32% +37.83% +38.78%
simd_squared_euclidean_fast_path/squared_euclidean/1024 -8.28% -8.85% -7.77%
simd_squared_euclidean_fast_path/squared_euclidean/384 -18.65% -18.95% -18.35%
simd_squared_euclidean_fast_path/squared_euclidean/768 +44.98% +44.21% +45.71%
simd_throughput_384/cosine_similarity -1.20% -1.52% -0.89%
simd_throughput_384/dot_product -1.06% -1.79% -0.35%
simd_throughput_384/euclidean_distance -0.81% -1.03% -0.60%
simd_throughput_384/normalize -3.23% -3.37% -3.12%
softmax_attention/128 -0.49% -0.64% -0.33%
softmax_attention/512 +0.11% +0.00% +0.18%
tier_prepared_batch_sizes/int4_batch_prepared/10 +0.26% -0.33% +0.84%
tier_prepared_batch_sizes/int4_batch_prepared/100 -0.66% -1.30% -0.21%
tier_prepared_batch_sizes/int4_batch_prepared/1000 -1.30% -2.96% -0.12%
tier_prepared_batch_sizes/int4_query_per_call/10 +0.63% +0.44% +0.78%
tier_prepared_batch_sizes/int4_query_per_call/100 +0.49% +0.05% +0.84%
tier_prepared_batch_sizes/int4_query_per_call/1000 +0.58% +0.54% +0.62%
tier_prepared_batch_sizes/int8_batch_prepared/10 -0.73% -0.84% -0.64%
tier_prepared_batch_sizes/int8_batch_prepared/100 +0.17% -0.43% +0.73%
tier_prepared_batch_sizes/int8_batch_prepared/1000 +0.23% -0.33% +0.76%
tier_prepared_batch_sizes/int8_query_per_call/10 +0.20% -0.11% +0.43%
tier_prepared_batch_sizes/int8_query_per_call/100 -0.40% -0.79% -0.11%
tier_prepared_batch_sizes/int8_query_per_call/1000 -0.03% -0.14% +0.09%
tier_prepared_query/binary_query_once_1000 +0.55% -0.05% +1.16%
tier_prepared_query/binary_query_per_call_1000 +0.41% +0.25% +0.52%
tier_prepared_query/int4_query_once_1000 +0.11% -0.06% +0.28%
tier_prepared_query/int4_query_per_call_1000 -1.57% -1.62% -1.54%
tier_prepared_query/int8_query_once_1000 +0.57% +0.41% +0.70%
tier_prepared_query/int8_query_per_call_1000 +0.11% -0.01% +0.18%

Rule: CI-lower of change ≤3.0% passes silently; (3.0%, 7.0%] warns; >7.0% fails. Override via PR label bench-allow-regression.

Gate is in advisory mode (Rollout step 3, ADR-058 §Rollout). Failures do not block merge for the first 7 days.

@ohdearquant ohdearquant merged commit 7d145f4 into main May 31, 2026
4 of 6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant