perf(inference): chunked GDN prefill scan (B=64 to 128, simdgroup_matrix) #175

GDN decode is inherently sequential, but **prefill** uses a chunkwise DeltaNet/GDN scan: parallel within a chunk of B tokens, with a sequential state carry across chunks (L/B carry steps for a length-L prompt instead of L). The within-chunk pieces (`KS0^T`, `KK^T`, `QS0^T`, `U^TK`) are GEMM-like with M=B, which makes them the right place for `simdgroup_matrix`.
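A minimal sketch of the equivalence the chunkwise scan relies on, in pure Python rather than Metal. It assumes the gated delta rule in the form `S_t = alpha_t * S_{t-1} (I - beta_t k_t k_t^T) + beta_t v_t k_t^T` with `o_t = S_t q_t`. The function names, the forward-substitution solve (standing in for the WY form), and the extra `QK` product used for the outputs are illustrative assumptions, not the kernel design from d3§2.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(S, x):
    """S is d_v x d_k (list of rows); returns S @ x."""
    return [dot(row, x) for row in S]

def sequential_scan(q, k, v, alpha, beta):
    """Reference: one state update per token, L sequential steps."""
    d_v, d_k = len(v[0]), len(k[0])
    S = [[0.0] * d_k for _ in range(d_v)]
    out = []
    for t in range(len(q)):
        Sk = matvec(S, k[t])
        # u_t = beta_t * (v_t - alpha_t * S_{t-1} k_t)
        u = [beta[t] * (v[t][r] - alpha[t] * Sk[r]) for r in range(d_v)]
        # S_t = alpha_t * S_{t-1} + u_t k_t^T
        S = [[alpha[t] * S[r][c] + u[r] * k[t][c] for c in range(d_k)]
             for r in range(d_v)]
        out.append(matvec(S, q[t]))
    return out

def chunked_scan(q, k, v, alpha, beta, B):
    """Chunkwise form: ceil(L/B) sequential carries instead of L."""
    d_v, d_k = len(v[0]), len(k[0])
    S = [[0.0] * d_k for _ in range(d_v)]
    out = []
    for start in range(0, len(q), B):
        end = min(start + B, len(q))
        Q, K, V, b = q[start:end], k[start:end], v[start:end], beta[start:end]
        n = end - start
        # g[i] = product of alpha over this chunk, up to and including i
        g, acc = [], 1.0
        for a in alpha[start:end]:
            acc *= a
            g.append(acc)
        # GEMM-like pieces (M = n <= B)
        KS0 = [matvec(S, x) for x in K]   # K S0^T, n x d_v
        QS0 = [matvec(S, x) for x in Q]   # Q S0^T, n x d_v
        KK = [[dot(K[i], K[j]) for j in range(n)] for i in range(n)]
        QK = [[dot(Q[i], K[j]) for j in range(n)] for i in range(n)]
        # Within-chunk causal correction: forward substitution on the
        # strictly lower-triangular system that defines the rows of U.
        U = []
        for i in range(n):
            U.append([b[i] * (V[i][r] - g[i] * KS0[i][r]
                              - sum(g[i] / g[j] * KK[i][j] * U[j][r]
                                    for j in range(i)))
                      for r in range(d_v)])
        # Outputs: decayed carried state plus causal intra-chunk term.
        for i in range(n):
            out.append([g[i] * QS0[i][r]
                        + sum(g[i] / g[j] * QK[i][j] * U[j][r]
                              for j in range(i + 1))
                        for r in range(d_v)])
        # Decayed state carry: S <- g_last * S + sum_j (g_last / g_j) u_j k_j^T
        S = [[g[-1] * S[r][c]
              + sum(g[-1] / g[j] * U[j][r] * K[j][c] for j in range(n))
              for c in range(d_k)] for r in range(d_v)]
    return out

def make_inputs(L, d_k, d_v, seed=0):
    """Random test data with unit-norm q/k so the state stays bounded."""
    rng = random.Random(seed)
    def unit(n):
        x = [rng.uniform(-1, 1) for _ in range(n)]
        norm = dot(x, x) ** 0.5
        return [xi / norm for xi in x]
    q = [unit(d_k) for _ in range(L)]
    k = [unit(d_k) for _ in range(L)]
    v = [[rng.uniform(-1, 1) for _ in range(d_v)] for _ in range(L)]
    alpha = [rng.uniform(0.9, 1.0) for _ in range(L)]
    beta = [rng.uniform(0.0, 1.0) for _ in range(L)]
    return q, k, v, alpha, beta

def max_abs_diff(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

if __name__ == "__main__":
    args = make_inputs(L=10, d_k=4, d_v=3)
    ref = sequential_scan(*args)
    for B in (1, 4, 10):
        print(B, max_abs_diff(ref, chunked_scan(*args, B=B)))
```

The ratios `g[i] / g[j]` are fine at toy sizes with decay close to 1; a real kernel would likely need to handle them more carefully (for example in log space), and the triangular solve would be restructured into matrix products to reach `simdgroup_matrix`.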

### Tasks
- [ ] chunkwise WY/DPLR form (Yang et al., Gated DeltaNet; Kimi Linear KDA chunkwise): within-chunk causal correction + decayed state carry
- [ ] B=64 bring-up, then B=128 tiled for production (autotune over B ∈ {32,64,128,256})
- [ ] `simdgroup_matrix` for the GEMM pieces (crossover at M=B≥64)
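One way the chunk-size autotune could be structured, as a hedged sketch: time each candidate B a few times and keep the one with the lowest median. `autotune_chunk_size`, `time_once`, and `fake_prefill` are hypothetical names for illustration; the real harness would dispatch the Metal kernel and wait for GPU completion inside the measured call.

```python
import statistics
import time

def time_once(fn, B):
    """Wall-clock seconds for a single call of fn(B)."""
    t0 = time.perf_counter()
    fn(B)
    return time.perf_counter() - t0

def autotune_chunk_size(fn, candidates=(32, 64, 128, 256), repeats=5,
                        measure=time_once):
    """Return (best_B, {B: median_seconds}) over the candidate chunk sizes."""
    results = {}
    for B in candidates:
        measure(fn, B)  # warm-up run, not recorded
        results[B] = statistics.median(measure(fn, B) for _ in range(repeats))
    best = min(results, key=results.get)
    return best, results

if __name__ == "__main__":
    # Stand-in workload (hypothetical): cost shrinks as B grows.
    def fake_prefill(B):
        sum(range(1000 * (256 // B)))

    best, timings = autotune_chunk_size(fake_prefill)
    print(best, timings)
```

Passing `measure` as a parameter keeps the selection logic testable without a GPU; the median is used here only to damp outliers from warm-up and scheduling noise.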

### Acceptance (ADR-064 gates)
- GDN-prefill recurrence speedup vs sequential scan: ≥5× @4K, ≥10× @16K (expected 8-18×)
- chunked vs sequential f32-state ref: **PPL Δ≤0.005**; no chunk-size-dependent greedy divergence
- TTFT reported separately at 1K/4K/16K
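The gates above could be checked mechanically along these lines. This is a sketch under assumptions: PPL is taken as the exponential of the mean per-token negative log-likelihood, "PPL Δ" is read as an absolute difference, and `check_gates` and its arguments are hypothetical names rather than anything defined in ADR-064.

```python
import math

def perplexity(nlls):
    """exp of the mean per-token negative log-likelihood (natural log)."""
    return math.exp(sum(nlls) / len(nlls))

def first_divergence(a, b):
    """Index of the first differing token, or None if identical."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return None if len(a) == len(b) else min(len(a), len(b))

def check_gates(speedup_4k, speedup_16k, nll_ref, nll_chunked,
                greedy_ref, greedy_by_chunk):
    """Return a dict of gate name -> pass/fail.

    greedy_by_chunk maps chunk size B to the greedy token ids produced
    with that B; greedy_ref is the sequential-scan output.
    """
    ppl_delta = abs(perplexity(nll_chunked) - perplexity(nll_ref))
    return {
        "speedup_4k": speedup_4k >= 5.0,
        "speedup_16k": speedup_16k >= 10.0,
        "ppl_delta": ppl_delta <= 0.005,
        "greedy_stable": all(first_divergence(greedy_ref, toks) is None
                             for toks in greedy_by_chunk.values()),
    }

if __name__ == "__main__":
    gates = check_gates(
        speedup_4k=6.0, speedup_16k=12.0,
        nll_ref=[2.0, 2.0], nll_chunked=[2.0, 2.0001],
        greedy_ref=[1, 2, 3],
        greedy_by_chunk={64: [1, 2, 3], 128: [1, 2, 3]},
    )
    print(gates)
```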

Ref: d3§2. Complements prefill FA work (#126). Note: d3§2 (WY/DPLR kernel math) is the least implementation-complete section and may warrant a focused research pass before coding.

