perf(inference): chunked-parallel GatedDeltaNet scan (race-fixed, gated)#190
Open
ohdearquant wants to merge 1 commit into
Adds a chunked-parallel (C=32) GatedDeltaNet scan for the 18 GDN layers, replacing the serial recurrence in prefill. Env-gated (LATTICE_GDN_CHUNKED); serial path remains default. Includes B-vs-B and chunked-vs-serial state correctness gates plus the threadgroup-barrier fix that eliminates a cross-simdgroup WAR race on the reduction scratch.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
chunked-parallel GatedDeltaNet scan (race-fixed, gated)
Stacked on #189. Adds a chunked-parallel (Yang 2024, C=32) GatedDeltaNet scan for the 18 GDN layers, replacing the serial recurrence in prefill. Env-gated (LATTICE_GDN_CHUNKED); the serial path remains the default.

Result: interleaved same-process A/B (Qwen3.5-0.8B, Apple Silicon, 18 GDN layers)
Correctness — the interesting part
The chunked path first failed correctness (nondeterministic state up to 0.45). The fix:
R = U − γ·S₀ᵀW, S_end, QKL output) matched the serial recurrence to 1e-7, so the algebra was never wrong, disproving an earlier "state-algebra error" diagnosis.
threadgroup_barrier: cross-simdgroup WAR races on the shared reduction scratch (materialize ×2, solve ×2). A lagging simdgroup read the next reduction's partial, which corrupted the normalization.
gdn_chunked_race_localization_probe test; no GPU capture needed.
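To make the write-after-read hazard described above concrete, here is a small CPU-only toy, not the PR's Metal kernel. It assumes two "simdgroups" modelled as Python generators that share one reduction scratch and run two back-to-back reductions. A hypothetical cooperative scheduler (`run`) stands in for GPU scheduling, and a yielded `"barrier"` stands in for `threadgroup_barrier`. All names and values here are illustrative.

```python
def simdgroup(i, first, second, scratch, seen, fixed):
    """One simulated simdgroup running two reductions over a shared scratch."""
    scratch[i] = first[i]                  # write partial for reduction 1
    yield "barrier"                        # all reduction-1 partials are written
    seen[i].append(sum(scratch))           # read reduction 1
    # The fix: a second barrier so nobody overwrites the scratch until every
    # group has finished reading. Without it this is a plain scheduling point.
    yield "barrier" if fixed else "step"
    scratch[i] = second[i]                 # write partial for reduction 2
    yield "barrier"
    seen[i].append(sum(scratch))           # read reduction 2


def run(fixed, priority=(1, 0), first=(1, 2), second=(10, 20)):
    """Run the groups, always advancing the highest-priority runnable one.

    Group 0 is last in `priority`, so it plays the lagging simdgroup.
    """
    n = len(priority)
    scratch = [0] * n
    seen = [[] for _ in range(n)]
    gens = {i: simdgroup(i, first, second, scratch, seen, fixed) for i in range(n)}
    waiting = set()                        # groups parked at a barrier
    while gens:
        runnable = [i for i in priority if i in gens and i not in waiting]
        if not runnable:
            waiting.clear()                # every live group arrived: release
            continue
        i = runnable[0]
        try:
            if next(gens[i]) == "barrier":
                waiting.add(i)
        except StopIteration:
            del gens[i]
    return seen


racy = run(fixed=False)   # [[21, 30], [3, 30]]: group 0 read a reduction-2 partial
safe = run(fixed=True)    # [[3, 30], [3, 30]]: both groups agree on both sums
```

In the unfixed run the fast group overwrites its scratch slot with its reduction-2 partial (20) before the lagging group has read reduction 1, so the lagging group sees 1 + 20 = 21 instead of 3. With the extra barrier both groups read 3 and then 30.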
Gates (post-fix, all green)
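As a rough illustration of what these gates check, here is a CPU-only sketch in plain Python. It is not the PR's test code. It assumes the standard gated delta rule M_t = α_t·M_{t-1} + k_t·u_tᵀ with u_t = β_t·(v_t − α_t·M_{t-1}ᵀk_t), unit-norm keys, and a small chunk size. It also reads "B-vs-B" as "run the chunked path twice and compare", which is an assumption. All function names are hypothetical.

```python
import random


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def read(M, x, dv):
    """Return M^T x for a d_k x d_v state stored as nested lists."""
    return [sum(M[a][b] * x[a] for a in range(len(x))) for b in range(dv)]


def serial_scan(q, k, v, alpha, beta):
    """Token-by-token gated delta rule (the reference path)."""
    dk, dv = len(k[0]), len(v[0])
    M = [[0.0] * dv for _ in range(dk)]
    outs = []
    for t in range(len(q)):
        recalled = read(M, k[t], dv)
        u = [beta[t] * (v[t][b] - alpha[t] * recalled[b]) for b in range(dv)]
        M = [[alpha[t] * M[a][b] + k[t][a] * u[b] for b in range(dv)] for a in range(dk)]
        outs.append(read(M, q[t], dv))
    return outs, M


def chunked_scan(q, k, v, alpha, beta, C):
    """Same recurrence, but the state is only touched once per chunk of C tokens."""
    dk, dv = len(k[0]), len(v[0])
    M = [[0.0] * dv for _ in range(dk)]
    outs = []
    for s in range(0, len(q), C):
        qc, kc, vc = q[s:s + C], k[s:s + C], v[s:s + C]
        n = len(kc)
        gamma, g = [], 1.0                 # gamma[t]: cumulative decay inside the chunk
        for t in range(n):
            g *= alpha[s + t]
            gamma.append(g)
        R = []                             # per-token residuals, by forward substitution
        for t in range(n):
            from_state = read(M, kc[t], dv)
            r = [vc[t][b] - gamma[t] * from_state[b] for b in range(dv)]
            for i in range(t):
                w = gamma[t] / gamma[i] * dot(kc[i], kc[t])
                r = [r[b] - w * R[i][b] for b in range(dv)]
            R.append([beta[s + t] * x for x in r])
        for t in range(n):                 # outputs: chunk-start state plus causal in-chunk part
            from_state = read(M, qc[t], dv)
            o = [gamma[t] * from_state[b] for b in range(dv)]
            for i in range(t + 1):
                w = gamma[t] / gamma[i] * dot(kc[i], qc[t])
                o = [o[b] + w * R[i][b] for b in range(dv)]
            outs.append(o)
        M = [[gamma[-1] * M[a][b]          # state at the end of the chunk
              + sum(gamma[-1] / gamma[i] * kc[i][a] * R[i][b] for i in range(n))
              for b in range(dv)] for a in range(dk)]
    return outs, M


def make_inputs(T, dk, dv, seed):
    rng = random.Random(seed)

    def unit(d):
        x = [rng.uniform(-1, 1) for _ in range(d)]
        norm = dot(x, x) ** 0.5
        return [a / norm for a in x]

    q = [unit(dk) for _ in range(T)]
    k = [unit(dk) for _ in range(T)]
    v = [[rng.uniform(-1, 1) for _ in range(dv)] for _ in range(T)]
    alpha = [rng.uniform(0.9, 1.0) for _ in range(T)]
    beta = [rng.uniform(0.0, 1.0) for _ in range(T)]
    return q, k, v, alpha, beta


def max_abs_diff(X, Y):
    return max(abs(a - b) for rx, ry in zip(X, Y) for a, b in zip(rx, ry))


inputs = make_inputs(T=11, dk=4, dv=3, seed=0)   # 11 tokens: last chunk is partial
serial_out, serial_state = serial_scan(*inputs)
chunk_out, chunk_state = chunked_scan(*inputs, C=4)
rerun_out, rerun_state = chunked_scan(*inputs, C=4)

assert rerun_state == chunk_state                        # same path twice: identical
assert max_abs_diff(chunk_state, serial_state) < 1e-9    # chunked vs serial state
```

On a GPU the run-twice comparison is the one that catches races, since a deterministic kernel should reproduce its own state exactly. The chunked-vs-serial comparison checks the algebra and needs a tolerance because the two paths sum in a different order.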
Notes
it has mileage.
unwrap().