feat(inference): comprehensive kernel micro-benchmarks + regression gate #153

## Motivation

The lm_head regression (#151) hid for months because we had no per-kernel timing, only the E2E `bench_decode_ab` benchmark. We need a systematic micro-benchmark suite that catches per-kernel regressions before they compound.

## Scope

### Micro-benchmark binary (`bench_kernel_isolate`)
- Individual kernel timing: gemv_q8, gemv_decode_wide_f16, rms_norm, partial_rope, decode_attention (flash partial + reduce), silu_mul_fused, conv1d_silu, gdn_recurrence, fused_residual_add_norm, copy/add ops
- Each kernel: independent command buffer, 1000 iterations, median + p95 + stdev reported
- Full-layer composite (conv1d_silu + gdn_recurrence pipelined in a single command buffer)
- Sum-of-parts vs E2E comparison to detect pipeline stalls
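As a rough illustration of the reporting and the sum-of-parts check, here is a minimal Python sketch. It is not the `bench_kernel_isolate` implementation. The function names, the microsecond unit, and the nearest-rank p95 definition are all assumptions made for the example.

```python
import statistics


def summarize(samples_us):
    """Summarize per-iteration timings (assumed to be in microseconds)."""
    ordered = sorted(samples_us)
    n = len(ordered)
    # Nearest-rank p95, computed with integer arithmetic: ceil(0.95 * n).
    rank = (95 * n + 99) // 100
    return {
        "median_us": statistics.median(ordered),
        "p95_us": ordered[rank - 1],
        # Sample standard deviation needs at least two data points.
        "stdev_us": statistics.stdev(ordered) if n > 1 else 0.0,
    }


def unexplained_fraction(kernel_medians_us, e2e_median_us):
    """Fraction of E2E time not accounted for by the isolated kernels.

    A large positive value hints at pipeline stalls or launch overhead
    that no single-kernel benchmark can see.
    """
    sum_of_parts = sum(kernel_medians_us.values())
    return (e2e_median_us - sum_of_parts) / e2e_median_us
```

For example, if the isolated kernel medians add up to 90 µs and the E2E median is 100 µs, `unexplained_fraction` returns 0.1, meaning 10% of the decode step is spent outside the measured kernels.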

### Regression gate
- CI job that runs micro-benchmarks on `aarch64-macos` (Apple Silicon)
- Per-kernel baseline stored on `perf-baselines` branch
- >10% regression on any kernel blocks merge (configurable threshold)
- Integrates with existing `bench-regression.yml` infrastructure from ADR-058
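The gate logic could look roughly like the sketch below. It assumes a hypothetical baseline layout, a flat mapping from kernel name to median time. The actual schema on the `perf-baselines` branch and the wiring into `bench-regression.yml` are not specified here.

```python
def find_regressions(baseline, current, threshold=0.10):
    """Return (kernel, relative_slowdown) pairs that exceed the threshold.

    `baseline` and `current` are assumed to map kernel name -> median time.
    Kernels missing from `current` are reported with a slowdown of None so
    that a silently dropped benchmark cannot pass the gate.
    """
    failures = []
    for kernel, base_time in sorted(baseline.items()):
        cur_time = current.get(kernel)
        if cur_time is None:
            failures.append((kernel, None))
            continue
        slowdown = (cur_time - base_time) / base_time
        if slowdown > threshold:
            failures.append((kernel, slowdown))
    return failures


def gate_exit_code(baseline, current, threshold=0.10):
    """0 if the merge may proceed, 1 if any kernel regressed past the threshold."""
    return 1 if find_regressions(baseline, current, threshold) else 0
```

A CI step would load both JSON files, call `gate_exit_code`, and exit with its result. The threshold argument is what makes the 10% limit configurable.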

### Tracking
- Baseline JSON committed per-platform
- Historical trend data for kernel-level performance over time

## Acceptance

- [ ] `bench_kernel_isolate` binary compiles and runs on macOS
- [ ] Reports per-kernel timing with median/p95/stdev
- [ ] CI gate integrated with existing perf-baselines workflow
- [ ] Documents which kernels map to which model operations
