Motivation
The lm_head regression (#151) hid for months because we had no per-kernel timing, only E2E bench_decode_ab. We need a systematic micro-benchmark suite that catches per-kernel regressions before they compound.
Scope
Micro-benchmark binary (bench_kernel_isolate)
- Individual kernel timing: gemv_q8, gemv_decode_wide_f16, rms_norm, partial_rope, decode_attention (flash partial + reduce), silu_mul_fused, conv1d_silu, gdn_recurrence, fused_residual_add_norm, copy/add ops
- Each kernel: independent command buffer, 1000 iterations, median + p95 reported
- Full-layer composite (conv1d_silu + gdn_recurrence pipelined in a single command buffer)
- Sum-of-parts vs E2E comparison to detect pipeline stalls
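A minimal sketch of the reporting math described above, assuming per-iteration timings arrive as a list of microsecond values. The function names (`summarize`, `pipeline_overhead`) and the nearest-rank p95 definition are assumptions for illustration, not what bench_kernel_isolate necessarily does:

```python
import statistics


def summarize(samples_us):
    """Return (median, p95) for a list of per-iteration timings in microseconds."""
    if not samples_us:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_us)
    median = statistics.median(ordered)
    # Nearest-rank p95: the smallest sample with at least 95% of samples at or
    # below it. Integer arithmetic avoids float rounding in the rank.
    rank = (95 * len(ordered) + 99) // 100
    p95 = ordered[rank - 1]
    return median, p95


def pipeline_overhead(kernel_medians_us, e2e_us):
    """Fraction of E2E time not explained by the sum of per-kernel medians.

    A large positive value hints at stalls between kernels rather than a slow
    kernel. The value can be negative if kernels overlap in the real pipeline.
    """
    parts_us = sum(kernel_medians_us.values())
    return (e2e_us - parts_us) / e2e_us
```

With 1000 samples per kernel, `summarize` returns the two numbers to report, and `pipeline_overhead` turns the sum-of-parts vs E2E comparison into a single ratio that can be tracked alongside the per-kernel values.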
Regression gate
- CI job that runs micro-benchmarks on aarch64-macos (Apple Silicon)
- Per-kernel baseline stored on perf-baselines branch
- 10% regression on any kernel blocks merge (configurable threshold)
- Integrates with existing bench-regression.yml infrastructure from ADR-058
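The gate logic could look roughly like the sketch below. It assumes the baseline and the current run are both available as `{kernel_name: median_us}` mappings; that layout, the function name `find_regressions`, and the choice to treat a slowdown exactly at the threshold as blocking are all assumptions, and how this hooks into bench-regression.yml is not shown:

```python
def find_regressions(baseline, current, threshold=0.10):
    """Return {kernel: fractional slowdown} for kernels at or over the threshold.

    baseline and current map kernel name -> median time in microseconds.
    threshold is the configurable limit (0.10 means 10%).
    """
    regressions = {}
    for kernel, base_us in baseline.items():
        if kernel not in current:
            # Assumption: kernels missing from the current run are skipped
            # here. A stricter gate might treat that as a failure instead.
            continue
        slowdown = (current[kernel] - base_us) / base_us
        if slowdown >= threshold:
            regressions[kernel] = slowdown
    return regressions


def gate_exit_code(baseline, current, threshold=0.10):
    """Return 1 if any kernel regressed (blocks merge), otherwise 0."""
    return 1 if find_regressions(baseline, current, threshold) else 0
```

A CI step would load both mappings from the baseline JSON and the fresh run, then exit with `gate_exit_code(...)` so that a non-zero status blocks the merge.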
Tracking
- Baseline JSON committed per-platform
- Historical trend data for kernel-level performance over time
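One possible shape for the tracking data, sketched under assumptions: each run becomes one JSON record keyed by platform and commit, and the history is kept as JSON Lines. The record fields, the function names, and the JSON Lines choice are hypothetical, and the issue does not specify a schema:

```python
import json


def make_record(platform, commit, kernels):
    """Build one run record. kernels maps name -> {"median_us": ..., "p95_us": ...}."""
    return {"platform": platform, "commit": commit, "kernels": kernels}


def append_history(history_jsonl, record):
    """Return the history text with one more JSON line appended."""
    return history_jsonl + json.dumps(record, sort_keys=True) + "\n"


def trend(history_jsonl, platform, kernel):
    """Return [(commit, median_us), ...] for one kernel on one platform, oldest first."""
    series = []
    for line in history_jsonl.splitlines():
        record = json.loads(line)
        if record["platform"] != platform:
            continue
        timing = record["kernels"].get(kernel)
        if timing is not None:
            series.append((record["commit"], timing["median_us"]))
    return series
```

Keeping one record per line means the committed baseline for a platform can simply be the most recent record, while the full file serves as the trend data.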
Acceptance
- bench_kernel_isolate binary compiles and runs on macOS