feat(inference): comprehensive kernel micro-benchmarks + regression gate #153

@ohdearquant

Motivation

The lm_head regression (#151) went undetected for months because we had no per-kernel timing, only the end-to-end (E2E) bench_decode_ab benchmark. We need a systematic micro-benchmark suite that catches per-kernel regressions before they compound.

Scope

Micro-benchmark binary (bench_kernel_isolate)

  • Individual kernel timing: gemv_q8, gemv_decode_wide_f16, rms_norm, partial_rope, decode_attention (flash partial + reduce), silu_mul_fused, conv1d_silu, gdn_recurrence, fused_residual_add_norm, copy/add ops
  • Each kernel: independent command buffer, 1000 iterations, median + p95 + stdev reported
  • Full-layer composite (conv1d_silu + gdn_recurrence pipelined in a single command buffer)
  • Sum-of-parts vs E2E comparison to detect pipeline stalls
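
As a rough illustration of the reporting and the sum-of-parts check described above, here is a minimal sketch of the arithmetic in Python. It is not the bench_kernel_isolate implementation: the function names `summarize` and `pipeline_overhead`, the microsecond unit, and the nearest-rank definition of p95 are all assumptions made for this example.

```python
import statistics


def summarize(samples_us):
    """Return median, p95, and stdev for per-iteration timings (assumed microseconds)."""
    ordered = sorted(samples_us)
    n = len(ordered)
    # Nearest-rank p95: the smallest sample with at least 95% of samples at or
    # below it. Integer arithmetic avoids floating-point surprises in ceil().
    rank = max(1, (95 * n + 99) // 100)
    return {
        "median": statistics.median(ordered),
        "p95": ordered[rank - 1],
        "stdev": statistics.stdev(ordered) if n > 1 else 0.0,
    }


def pipeline_overhead(kernel_medians_us, e2e_median_us):
    """Fraction of E2E time that the isolated kernel medians do not account for.

    A large positive value hints at stalls or dispatch gaps between kernels;
    a negative value means the kernels overlap when pipelined.
    """
    parts = sum(kernel_medians_us.values())
    return (e2e_median_us - parts) / e2e_median_us
```

With 1000 iterations per kernel, the nearest-rank p95 is simply the 950th smallest sample. Whether the real binary interpolates between samples instead is not stated in this issue.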

Regression gate

  • CI job that runs micro-benchmarks on aarch64-macos (Apple Silicon)
  • Per-kernel baseline stored on perf-baselines branch
  • 10% regression on any kernel blocks merge (configurable threshold)
  • Integrates with existing bench-regression.yml infrastructure from ADR-058
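
The gate logic above could look something like the following sketch. It assumes the baseline and the current run are both available as `{kernel_name: median_us}` mappings and that "regression" means a slowdown strictly greater than the threshold. The function name `find_regressions` is hypothetical, and how bench-regression.yml actually consumes the result is not covered here.

```python
def find_regressions(baseline_us, current_us, threshold=0.10):
    """Return {kernel: fractional_slowdown} for kernels that exceed the threshold.

    Both arguments map kernel name -> median time (assumed microseconds).
    threshold=0.10 mirrors the 10% default from this issue and is configurable.
    """
    regressions = {}
    for kernel, base in baseline_us.items():
        if kernel not in current_us or base <= 0:
            # Missing kernels and unusable baselines need their own policy;
            # this sketch just skips them.
            continue
        slowdown = (current_us[kernel] - base) / base
        if slowdown > threshold:
            regressions[kernel] = slowdown
    return regressions
```

A CI step would then fail the job (for example, by exiting non-zero) whenever the returned mapping is non-empty, and print the offending kernels with their slowdown.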

Tracking

  • Baseline JSON committed per-platform
  • Historical trend data for kernel-level performance over time
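
One possible shape for the per-platform baseline file is sketched below. The schema (`platform`, `commit`, `kernels`), the `<platform>.json` file naming, and the helper names are assumptions for illustration only. The actual layout on the perf-baselines branch is not specified in this issue.

```python
import json
from pathlib import Path


def baseline_path(root, platform):
    """Hypothetical layout: one JSON file per platform under a root directory."""
    return Path(root) / f"{platform}.json"


def save_baseline(root, platform, commit, kernels):
    """Write a baseline record; `kernels` maps kernel name -> timing stats."""
    path = baseline_path(root, platform)
    path.parent.mkdir(parents=True, exist_ok=True)
    record = {"platform": platform, "commit": commit, "kernels": kernels}
    # Sorted keys and a trailing newline keep diffs between baselines small.
    path.write_text(json.dumps(record, indent=2, sort_keys=True) + "\n")
    return path


def load_baseline(root, platform):
    """Read back the baseline record for a platform."""
    return json.loads(baseline_path(root, platform).read_text())
```

Recording the commit alongside the timings means the git history of the baseline file can double as the historical trend data, since each update is tied to the revision that produced it.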

Acceptance

  • bench_kernel_isolate binary compiles and runs on macOS
  • Reports per-kernel timing with median/p95/stdev
  • CI gate integrated with existing perf-baselines workflow
  • Documents which kernels map to which model operations
