
perf(inference): chunked-parallel GatedDeltaNet scan (race-fixed, gated) #190

Open
ohdearquant wants to merge 1 commit into pr/prefill-2-attn from pr/prefill-3-gdn

ohdearquant (Owner) commented Jun 4, 2026

chunked-parallel GatedDeltaNet scan (race-fixed, gated)

Stacked on #189. Adds a chunked-parallel (Yang 2024, C=32) GatedDeltaNet scan for
the 18 GDN layers, replacing the serial recurrence in prefill. Env-gated
(LATTICE_GDN_CHUNKED); the serial path remains the default.
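
For readers unfamiliar with this kind of gate, a minimal sketch of reading it follows. Only the variable name `LATTICE_GDN_CHUNKED` comes from this PR; the helper names and the rule that `1` or `true` turns the path on are assumptions for illustration, not the code in this change.

```rust
use std::env;

/// Hypothetical parser. Which values count as "on" is an assumption.
fn chunked_enabled_from(value: Option<&str>) -> bool {
    match value {
        Some(v) => {
            let v = v.trim();
            v == "1" || v.eq_ignore_ascii_case("true")
        }
        // Unset: the serial path stays the default.
        None => false,
    }
}

/// Hypothetical helper that reads the LATTICE_GDN_CHUNKED variable.
/// A missing or non-UTF-8 value is treated as "off"; no unwrap() needed.
fn gdn_chunked_enabled() -> bool {
    chunked_enabled_from(env::var("LATTICE_GDN_CHUNKED").ok().as_deref())
}
```

Splitting the parsing from the environment read keeps the decision testable without mutating process state.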

Result — interleaved same-process A/B (Qwen3.5-0.8B, Apple Silicon, 18 GDN layers)

| Context (tok) | serial tok/s | chunked tok/s | Δ median | chunked CV |
|---|---|---|---|---|
| 289 | 459.7 | 608.5 | +32.4% | 3.2% |
| 529 | 413.9 | 585.6 | +41.5% | 1.8% |
| 1009 | 390.7 | 549.2 | +40.6% | 1.5% |
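
The Δ column is consistent with the ratio of the two throughput columns. As a sanity check, the arithmetic can be sketched as below. The function names are hypothetical, and the CV definition (population standard deviation over mean) is an assumption, since the PR does not say which estimator the benchmark uses.

```rust
/// Relative throughput change of `chunked` over `serial`, in percent.
fn speedup_pct(serial_tok_s: f64, chunked_tok_s: f64) -> f64 {
    (chunked_tok_s / serial_tok_s - 1.0) * 100.0
}

/// Coefficient of variation in percent (assumed: population std / mean).
/// Returns None for empty input or a zero mean.
fn cv_pct(samples: &[f64]) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    let n = samples.len() as f64;
    let mean = samples.iter().sum::<f64>() / n;
    if mean == 0.0 {
        return None;
    }
    let var = samples.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / n;
    Some(var.sqrt() / mean.abs() * 100.0)
}
```

With the table's values, `speedup_pct(459.7, 608.5)` comes out at about 32.4, matching the first row.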

Correctness — the interesting part

The chunked path initially failed correctness: its state was nondeterministic, with deviations up to 0.45. The fix:

  1. A standalone fp32 CPU oracle of the chunked algebra (W/U forward-sub, R = U − γ·S₀ᵀW,
    S_end, QKL output) matched the serial recurrence to 1e-7 — so the algebra was
    never wrong, disproving an earlier "state-algebra error" diagnosis.
  2. Real bug: four missing threadgroup_barrier calls, which left cross-simdgroup
    WAR (write-after-read) races on the shared reduction scratch (materialize ×2,
    solve ×2). A lagging simdgroup could read the next reduction's partial,
    corrupting the normalization.
  3. Localized headlessly via a gdn_chunked_race_localization_probe test; no GPU
    capture needed.
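
The hazard in item 2 can be illustrated with a CPU analogy. This is not the Metal kernel from this PR: each thread stands in for one simdgroup, `std::sync::Barrier` stands in for `threadgroup_barrier`, and integer partials replace the real floating-point ones. The function name `two_reductions` is hypothetical. The middle `wait()` plays the role of the kind of barrier that was missing: without it, a fast worker could overwrite its scratch slot for the second reduction while a slow worker is still summing the first.

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::{Arc, Barrier};
use std::thread;

/// Two back-to-back reductions that share one scratch buffer.
/// Returns the (first_sum, second_sum) pair observed by each worker.
fn two_reductions(first: &[u32], second: &[u32]) -> Vec<(u32, u32)> {
    let n = first.len();
    assert_eq!(n, second.len());
    let scratch: Arc<Vec<AtomicU32>> =
        Arc::new((0..n).map(|_| AtomicU32::new(0)).collect());
    let barrier = Arc::new(Barrier::new(n));

    let handles: Vec<_> = (0..n)
        .map(|i| {
            let scratch = Arc::clone(&scratch);
            let barrier = Arc::clone(&barrier);
            let (a, b) = (first[i], second[i]);
            thread::spawn(move || {
                // Reduction 1: publish this worker's partial.
                scratch[i].store(a, Ordering::SeqCst);
                // RAW: every partial is written before anyone reads.
                barrier.wait();
                let sum1: u32 = scratch.iter().map(|s| s.load(Ordering::SeqCst)).sum();
                // WAR: every worker finishes reading before the scratch is reused.
                barrier.wait();
                // Reduction 2 reuses the same scratch slots.
                scratch[i].store(b, Ordering::SeqCst);
                barrier.wait();
                let sum2: u32 = scratch.iter().map(|s| s.load(Ordering::SeqCst)).sum();
                (sum1, sum2)
            })
        })
        .collect();

    handles
        .into_iter()
        .map(|h| h.join().expect("worker thread panicked"))
        .collect()
}
```

With all three waits in place, every worker sees the same pair of sums on every run. Removing the second wait makes the first sum timing-dependent, which matches the nondeterministic symptom described above.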

Gates (post-fix, all green)

| gate | threshold | result |
|---|---|---|
| B-vs-B self-consistency | < 1e-3 | 0.000000 |
| chunked-vs-chunked state | < 1e-3 | 0.000000 (15+ runs) |
| parity (logits) | argmax stable | 6.3e-5, argmax stable |
| chunked-vs-serial state | magnitude-aware | 2.2e-4, 10/10 |
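
The PR does not spell out how these gates are computed. The sketch below shows one plausible shape for them. All three function names are hypothetical, and the "magnitude-aware" formula here (max absolute difference divided by the reference's peak magnitude, floored at 1) is an assumption, not the metric used in the test suite.

```rust
/// Largest element-wise absolute difference; None if the lengths differ.
fn max_abs_diff(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() {
        return None;
    }
    Some(a.iter().zip(b).map(|(x, y)| (x - y).abs()).fold(0.0_f32, f32::max))
}

/// Assumed magnitude-aware error: max abs diff scaled by the reference peak,
/// so large-valued states are not held to an absolute tolerance.
fn magnitude_aware_err(test: &[f32], reference: &[f32]) -> Option<f32> {
    let peak = reference.iter().map(|x| x.abs()).fold(0.0_f32, f32::max);
    max_abs_diff(test, reference).map(|d| d / peak.max(1.0))
}

/// Index of the largest logit (first on ties); None for empty input.
/// NaN handling is not considered in this sketch.
fn argmax(logits: &[f32]) -> Option<usize> {
    let mut best: Option<(usize, f32)> = None;
    for (i, &v) in logits.iter().enumerate() {
        match best {
            Some((_, b)) if v <= b => {}
            _ => best = Some((i, v)),
        }
    }
    best.map(|(i, _)| i)
}
```

Under these assumptions, a self-consistency gate would check `max_abs_diff(run_a, run_b) < 1e-3`, and the parity gate would check that `argmax` agrees between the serial and chunked logits.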

Notes

  • 0.8B shape only (kd=vd=128), prefill only, batch=1. Default-flip is a follow-up once
    it has mileage.
  • No new crates; no library unwrap().

Adds a chunked-parallel (C=32) GatedDeltaNet scan for the 18 GDN layers,
replacing the serial recurrence in prefill. Env-gated (LATTICE_GDN_CHUNKED);
serial path remains default. Includes B-vs-B and chunked-vs-serial state
correctness gates plus the threadgroup-barrier fix that eliminates a
cross-simdgroup WAR race on the reduction scratch.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>