GDN decode is sequential, but prefill uses a chunkwise DeltaNet/GDN scan: parallel within a chunk, sequential carry across chunks (L/B carries instead of L). The within-chunk pieces (KS0^T, KK^T, QS0^T, U^TK) are GEMM-like (M=B) — the right place for simdgroup_matrix.
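The chunkwise form can be checked against the token-by-token recurrence with a small NumPy reference. This is a minimal sketch under stated assumptions, not the kernel design: it covers only the ungated DeltaNet update (GDN's decay gate is omitted), it assumes a state of shape (d_v, d_k), and the function names are hypothetical. It also forms QK^T, which the list above does not name. L is the sequence length and B the chunk size, so the outer loop runs about L/B times.

```python
import numpy as np

def delta_sequential(q, k, v, beta, s0):
    """Token-by-token delta rule. State s has shape (d_v, d_k) (assumed layout)."""
    s = s0.copy()
    out = np.empty_like(v)
    for t in range(q.shape[0]):
        u = beta[t] * (v[t] - s @ k[t])  # error-corrected write value
        s = s + np.outer(u, k[t])        # rank-1 state update
        out[t] = s @ q[t]                # read with the updated state
    return out, s

def delta_chunked(q, k, v, beta, s0, chunk):
    """Same recurrence, but one state carry per chunk of `chunk` tokens."""
    s = s0.copy()
    out = np.empty_like(v)
    for start in range(0, q.shape[0], chunk):
        sl = slice(start, start + chunk)
        qc, kc, vc, bc = q[sl], k[sl], v[sl], beta[sl]
        n = qc.shape[0]
        # GEMM-like pieces, each with M = chunk size.
        ks0 = kc @ s.T   # K S0^T, shape (n, d_v)
        qs0 = qc @ s.T   # Q S0^T, shape (n, d_v)
        kk = kc @ kc.T   # K K^T,  shape (n, n)
        qk = qc @ kc.T   # Q K^T,  shape (n, n)
        # Unit lower-triangular solve for the per-token writes U:
        # (I + diag(beta) * strict_lower(K K^T)) U = diag(beta) (V - K S0^T)
        a = np.eye(n) + bc[:, None] * np.tril(kk, -1)
        u = np.linalg.solve(a, bc[:, None] * (vc - ks0))
        out[sl] = qs0 + np.tril(qk) @ u  # causal read inside the chunk
        s = s + u.T @ kc                 # U^T K: the single carry for this chunk
    return out, s
```

Running both functions on the same random inputs with several chunk sizes and comparing with `np.allclose` gives a quick chunk-size invariance check in float64. A real kernel would replace the dense solve with a WY-style factorization and keep the state in f32.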
Tasks
- simdgroup_matrix for the GEMM pieces (M=B ≥64 crossover)
Acceptance (ADR-064 gates)
- GDN-prefill recurrence speedup vs sequential scan: ≥5× @4k, ≥10× @16k (expected 8-18×)
- chunked vs sequential f32-state ref: PPL Δ≤0.005; no chunk-size-dependent greedy divergence
- TTFT reported separately at 1K/4K/16K
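The gates above could be encoded as a small check so that benchmark runs fail loudly. This is a hedged sketch: only the thresholds come from the list above. The function name, argument shapes, and context-length labels are hypothetical, and treating "PPL Δ" as an absolute difference is an assumption.

```python
# Thresholds copied from the acceptance list; everything else is illustrative.
SPEEDUP_MIN = {"4k": 5.0, "16k": 10.0}  # context label -> minimum speedup
PPL_DELTA_MAX = 0.005
TTFT_CONTEXTS = ("1K", "4K", "16K")

def check_gates(speedup, ppl_delta, greedy_by_chunk, ttft):
    """Return a list of human-readable gate failures (empty list means pass).

    speedup:         {context label: measured speedup vs sequential scan}
    ppl_delta:       PPL(chunked) - PPL(sequential f32-state reference)
    greedy_by_chunk: {chunk size: greedy token ids for the same prompt}
    ttft:            {context label: time to first token}
    """
    failures = []
    for ctx, minimum in SPEEDUP_MIN.items():
        got = speedup.get(ctx)
        if got is None or got < minimum:
            failures.append(f"speedup @{ctx}: {got} (need >= {minimum}x)")
    # Assumption: the gate bounds the magnitude of the PPL difference.
    if abs(ppl_delta) > PPL_DELTA_MAX:
        failures.append(f"PPL delta {ppl_delta} exceeds {PPL_DELTA_MAX}")
    # Greedy decoding must produce identical tokens for every chunk size.
    if len({tuple(tokens) for tokens in greedy_by_chunk.values()}) > 1:
        failures.append("greedy output depends on chunk size")
    for ctx in TTFT_CONTEXTS:
        if ctx not in ttft:
            failures.append(f"TTFT @{ctx} not reported")
    return failures
```

Returning a list of failures rather than a single boolean keeps every violated gate visible in one run.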
Ref: d3§2. Complements prefill FA work (#126). Note: d3§2 (WY/DPLR kernel math) is the least implementation-complete section — may warrant a focused research pass before coding.