perf(laguna): CUDA-graph replay decode - 113→143 tok/s all-GPU, 101→129 at 60% residency #358
Merged
…29 at 60% residency

The fused decode rebuilt its graph every token with a kv_start-offset view as the K/V copy destination. ggml-cuda's CUDA-graph cache memcmps every node's properties and resets capture warmup on any change, so the moving copy destination forfeited replay permanently: every token paid ~1.4k individual kernel launches, and the ~200 LUT-remap ops of the hybrid path sat in the dependency chain at full launch latency.

Three changes make the graph bit-stable across decode steps:

1. K/V append via ggml_set_rows with the row index as a graph INPUT (kv_idx, I32, broadcasts over heads): index data changes per step, node properties don't. Bit-identical output vs the cpy append at equal padding (verified by full-sequence hash).
2. FA K/V views + masks stride-padded to 256 slots (kv_pad); the mask carries -inf for the padded tail and the KV buffer is zero-init'd, so topology only changes when decode crosses a 256-token boundary.
3. Persistent ggml arena (ggml_init with mem_buffer) so rebuilt graphs land at identical addresses -> stable graph key -> warmup completes and the captured CUDA graph replays.

Applied to both laguna_step (all-GPU) and laguna_step_hybrid (Spark hot/cold).

RTX 3090, laguna-xs2 Q4_K_M, 128p/256g greedy:
  all-GPU                   113.3 -> 143.3 tok/s (+26%, ids bit-identical)
  Spark 60% + 16-slot ring  101.4 -> 129.5 tok/s (114% of the old ceiling)
  99.4% resident            106.6 -> 134.8 tok/s

Escape hatches: DFLASH_LAGUNA_NO_KVPAD=1 restores the legacy exact-length path; DFLASH_LAGUNA_PAD_CPY=1 pads spans but keeps the cpy append (numerics decomposition).

New bench: bench_laguna_spark drives the real LagunaBackend hybrid path with all DFLASH_* knobs, DFLASH_BENCH_MIX/SEED varied prompts, and a full-sequence FNV hash for exactness gates.

Co-Authored-By: WOZCODE <contact@withwoz.com>
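The effect of the 256-slot stride padding (change 2) on how often the graph topology changes can be sketched as below. This is only an illustration: `kv_pad` here is a hypothetical helper named after the commit's term, `count_topology_changes` is invented for the example, and the assumption that the KV length at the first decode step is `n_prompt + 1` is mine, not the commit's.

```cpp
#include <cassert>
#include <cstdint>

// Round the live KV length up to the next multiple of `stride`
// (256 in the commit message). Hypothetical helper, not the project's code.
int64_t kv_pad(int64_t n_kv, int64_t stride = 256) {
    return ((n_kv + stride - 1) / stride) * stride;
}

// Count how often the padded span changes while decoding n_gen tokens after
// an n_prompt-token prefill. Each change implies a new graph topology.
int count_topology_changes(int64_t n_prompt, int64_t n_gen, int64_t stride = 256) {
    int changes = 0;
    int64_t prev = kv_pad(n_prompt + 1, stride);  // assumed first decode length
    for (int64_t i = 2; i <= n_gen; ++i) {
        const int64_t cur = kv_pad(n_prompt + i, stride);
        if (cur != prev) {
            ++changes;
            prev = cur;
        }
    }
    return changes;
}
```

Under these assumptions, a 128-prompt/256-generation run crosses one 256-token boundary, whereas an exact-length span (stride 1) changes shape on every step.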
2 issues found across 4 files
…nch args

- laguna_step/laguna_step_hybrid arenas become thread_local: decode is single-threaded per process today (the pre-existing static gallocr makes the same assumption), but a second decode thread must not share the arena. Same-thread address stability (what CUDA-graph replay needs) is preserved.
- bench_laguna_spark: clamp prompt_N/n_gen to >=1 so a 0 or non-numeric argv can't write past the end of an empty prompt vector.

Co-Authored-By: WOZCODE <contact@withwoz.com>
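Both fixes in this commit follow common C++ patterns, sketched here under stated assumptions. `step_arena` and `parse_count` are hypothetical names; the real code hands its arena to `ggml_init` through `mem_buffer`, which this standalone sketch does not attempt to reproduce.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Hypothetical per-thread arena: allocated once per thread and then reused,
// so repeated graph builds on the same thread see the same base address.
// The address can still move if a later call asks for more bytes.
uint8_t* step_arena(size_t bytes) {
    thread_local std::vector<uint8_t> arena;
    if (arena.size() < bytes) {
        arena.resize(bytes);
    }
    return arena.data();
}

// Parse a count argument and clamp it to at least 1. std::atoi returns 0
// for non-numeric input, so that case is clamped as well.
int parse_count(const char* arg) {
    const int v = std::atoi(arg);
    return v < 1 ? 1 : v;
}
```

Because the vector is `thread_local`, a second decode thread would get its own buffer rather than sharing one, while calls on the same thread keep returning the same pointer as long as the requested size does not grow.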
easel pushed a commit to easel/lucebox-hub that referenced this pull request Jun 10, 2026
…mma4

Same recipe as the laguna fix (set_rows KV append with index INPUTS, stride-padded FA spans, stable arenas), adapted per backend:

qwen35 (Qwen3.5/3.6-27B AR target-only decode):
- build_target_step gets a persistent thread_local arena (stable node addresses across steps -> stable CUDA-graph key).
- The existing step-invariant kv_write_rows path now also forces the 256-stride FA span for ALL KV types (was TQ3_0-only), clamped to the cache capacity, so view shapes stay constant within a 256-token window. Mask-less decode + zero-initialised cache: padded rows contribute exp(-row_max) ~ 0 (same approximation the TQ3_0 path already shipped).
- reset_recurrent_state now clears the whole base buffer device-side (cudaMemset, ~0.2ms) and restore_and_generate clears before restoring: with a padded mask-less span, stale K/V rows from the previous request inside the padded tail would otherwise be attended with real scores.
- DFLASH_QWEN35_NO_KVPAD=1 restores the legacy path.

qwen35moe (Qwen3.6-35B-A3B hybrid/Spark pipelined decode):
- build_qwen35_layer_prefn gains a kv_write_rows pass-through.
- New build_cached_attn_prefn: full-attention layers now get CACHED per-layer step graphs (like the DeltaNet ones) with positions + kv_write_rows inputs and the FA span baked per 256-token window; rebuilt only on window crossings. pipelined_decode_one_token reuses them with input-data updates instead of rebuilding a fresh 512MB graph ctx per attention layer per token.
- DFLASH_QWEN35MOE_NO_KVPAD=1 restores per-token rebuild.

gemma4 (26B-A4B / 31B):
- gemma4_step gets the persistent thread_local arena and two set_rows index inputs: absolute rows for full-attention layers, ring rows (pos % swa_size) for SWA layers, which also fixes the offset-view append writing past the ring end for chunks crossing the wrap boundary. FA spans/masks were already 256-padded.
- KV cache buffer now zero-initialised at creation (F16 garbage in the padded tail can be NaN/Inf; NaN + -inf mask = NaN).
- DFLASH_GEMMA4_NO_KVPAD=1 restores the legacy append.

Measured (RTX 3090, dflash_server /v1/chat/completions, greedy 400 tok):
  gemma4-26B-A4B       102.6 -> 117.6 tok/s (+15%), output 100% identical
  Qwen3.6-27B AR       36.7 -> 39.2 tok/s (+7%, bandwidth-bound), coherent
  Qwen3.6-35B all-GPU  unchanged (125 tok/s, pipelined path not used there)

Co-Authored-By: WOZCODE <contact@withwoz.com>
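The two gemma4 index inputs, and why per-row ring indices avoid the overrun, can be illustrated with a small sketch. The function names are hypothetical and the code only computes row indices; it does not touch any real KV cache.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Absolute row indices for full-attention layers (hypothetical helper).
std::vector<int32_t> full_rows(int32_t pos, int32_t n_tokens) {
    std::vector<int32_t> rows(n_tokens);
    for (int32_t i = 0; i < n_tokens; ++i) {
        rows[i] = pos + i;
    }
    return rows;
}

// Ring row indices for sliding-window layers: each token wraps on its own,
// so a chunk that crosses the wrap boundary continues at slot 0.
std::vector<int32_t> ring_rows(int32_t pos, int32_t n_tokens, int32_t swa_size) {
    std::vector<int32_t> rows(n_tokens);
    for (int32_t i = 0; i < n_tokens; ++i) {
        rows[i] = (pos + i) % swa_size;
    }
    return rows;
}

// A contiguous block starting at pos % swa_size would run past the end of
// the ring whenever the chunk crosses the wrap boundary.
bool contiguous_append_overruns(int32_t pos, int32_t n_tokens, int32_t swa_size) {
    return (pos % swa_size) + n_tokens > swa_size;
}
```

For a 4-token chunk starting at position 1022 in a 1024-slot ring, the ring indices are 1022, 1023, 0, 1, while a contiguous block would cover slots 1022 through 1025 and overrun the buffer.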
1 issue found across 8 files (changes from recent commits).
cubic P1: positions/kv_write_rows/kv_win were not moved or nulled in the source, so a vector reallocation of cached_prefn would leave the moved-to graph with null attn inputs (crash on tensor_set) and dangling pointers in the source. Co-Authored-By: WOZCODE <contact@withwoz.com>
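The class of bug described here, a hand-written move that skips some members, is usually fixed by transferring every member and resetting the source. The sketch below uses a hypothetical `CachedAttnGraph` with the three member names from the commit message; the member types (raw pointers, an `int` window index) are assumptions, not the project's actual definitions.

```cpp
#include <cassert>
#include <utility>
#include <vector>

struct Tensor { int id; };  // stand-in for a graph input tensor

// Hypothetical cached per-layer graph holding non-owning input pointers.
struct CachedAttnGraph {
    Tensor* positions     = nullptr;
    Tensor* kv_write_rows = nullptr;
    int     kv_win        = -1;  // assumed: index of the baked 256-token window

    CachedAttnGraph() = default;
    CachedAttnGraph(const CachedAttnGraph&) = delete;
    CachedAttnGraph& operator=(const CachedAttnGraph&) = delete;

    // Move every member and reset the source, so a std::vector reallocation
    // leaves the destination fully populated and the source inert.
    CachedAttnGraph(CachedAttnGraph&& o) noexcept
        : positions(std::exchange(o.positions, nullptr)),
          kv_write_rows(std::exchange(o.kv_write_rows, nullptr)),
          kv_win(std::exchange(o.kv_win, -1)) {}

    CachedAttnGraph& operator=(CachedAttnGraph&& o) noexcept {
        if (this != &o) {
            positions     = std::exchange(o.positions, nullptr);
            kv_write_rows = std::exchange(o.kv_write_rows, nullptr);
            kv_win        = std::exchange(o.kv_win, -1);
        }
        return *this;
    }
};
```

Marking the move operations `noexcept` also lets `std::vector` use them during reallocation instead of requiring a copy.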
The pull_request paths filter included server/src/** so every code PR rebuilt the cuda12+rocm images (~20 runner-minutes per push) for build-only coverage the self-hosted GPU jobs (RTX 3090 sm_86 + Radeon gfx1151) already provide on real hardware. Image-build regressions can only originate from the Dockerfile / bake file / dockerignore / this workflow — keep the PR guard scoped to those; main pushes and releases still build and push as before. Co-Authored-By: WOZCODE <contact@withwoz.com>
The RTX 3090 / Radeon gfx1151 jobs finish in ~2 minutes on real hardware but were gated on the ~18-minute GitHub-hosted cmake+megakernel build, so the strongest CI signal arrived last and was invisible in the checks list until the cloud build completed. Gate them on the 1-minute uv-workspace sanity check only; both jobs compile the tree themselves and fail just as fast on broken code. Co-Authored-By: WOZCODE <contact@withwoz.com>
What
Makes decode graphs bit-stable across tokens so ggml-cuda's CUDA-graph cache engages and replays the token as one captured graph instead of ~1.4k individual launches. Now covers all four backends: laguna (fused hybrid + all-GPU), qwen35 27B AR, qwen35moe pipelined (Spark), gemma4.
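To make "bit-stable" concrete, here is a minimal sketch of a bytewise graph-key comparison. It is not ggml-cuda's actual data structure: `NodeProps`, `update_required`, and the two builder functions are hypothetical stand-ins, and the shapes and op ids are arbitrary. The point is only that an offset-view destination changes the compared bytes every token, while an index passed as input data does not.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for the per-node properties a graph cache compares.
// All fields are 8 bytes wide so the struct has no padding bytes.
struct NodeProps {
    int64_t  op;          // operation id
    int64_t  ne[4];       // tensor shape
    uint64_t dst_offset;  // byte offset of the destination view
};

// True when any node differs bytewise, i.e. the cached graph cannot be
// replayed as-is.
bool update_required(const std::vector<NodeProps>& cached,
                     const std::vector<NodeProps>& next) {
    if (cached.size() != next.size()) return true;
    return std::memcmp(cached.data(), next.data(),
                       cached.size() * sizeof(NodeProps)) != 0;
}

// Offset-view append: the destination moves with kv_start, so the node's
// properties differ on every decoded token.
std::vector<NodeProps> build_cpy_append(uint64_t kv_start, uint64_t row_bytes) {
    return { NodeProps{1, {128, 8, 1, 1}, kv_start * row_bytes} };
}

// Index-input append: the destination is always the whole cache (offset 0);
// the row index is written into a separate input buffer outside the key.
std::vector<NodeProps> build_set_rows_append(int32_t* kv_idx, int32_t kv_start) {
    *kv_idx = kv_start;  // data changes per step, node properties do not
    return { NodeProps{2, {128, 8, 1, 1}, 0} };
}
```

In this toy model, two consecutive steps of the offset-view variant compare as different, while two consecutive steps of the index-input variant compare as equal even though the index value itself advanced.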
The blocker was structural everywhere: K/V appends wrote through `kv_start`-offset views, whose node properties change every token; `ggml_cuda_graph_update_required` memcmps every node property and resets capture warmup on any change, so replay never engaged (except qwen's DeltaNet layers, whose state is fixed-size, which is why #332's win never reached the attention paths).

How (per backend)
Recipe: (1) `ggml_set_rows` K/V append with the row index as a graph input, (2) FA spans + masks stride-padded to 256 so shapes change only at window boundaries, (3) persistent arenas so rebuilt graphs keep stable node addresses.

- laguna: `laguna_step` + `laguna_step_hybrid` (one fused graph/token).
- qwen35: `build_target_step`; the existing `kv_write_rows` path now forces the 256-stride FA span for all KV types (was TQ3_0-only). Mask-less decode + zero-init cache ⇒ padded rows contribute `exp(-row_max) ≈ 0` (the approximation the TQ3_0 path already shipped). `reset_recurrent_state` now clears the base buffer device-side and restore clears before restoring, which is required so stale K/V from a previous request can't sit inside the padded tail.
- qwen35moe: cached per-layer step graphs for the full-attention layers with `positions`/`kv_write_rows` inputs, FA span baked per 256-token window; the per-token 512MB graph rebuild per attention layer is gone.
- gemma4: `gemma4_step` with two `set_rows` index inputs (absolute rows for full attention, `pos % swa_size` ring rows for SWA, also fixing the contiguous-block append that could write past the ring end). KV cache now zero-initialised (F16 garbage in the padded tail can be NaN).

Escape hatches per backend: `DFLASH_{LAGUNA,QWEN35,QWEN35MOE,GEMMA4}_NO_KVPAD=1`.

Numbers (RTX 3090, greedy, real prompts via dflash_server; laguna via bench_laguna_spark)
Correctness / numerics, honestly
- `set_rows` append is bit-identical to the cpy append at equal padding (verified by full-sequence hash on laguna).
- The padded mask-less span attends `exp(-row_max)`-suppressed zero rows (same approximation as the pre-existing TQ3_0 padding); under greedy this flips near-ties and trajectories diverge while staying coherent. The stale-tail hazard this creates across requests is closed by the device-side buffer clears above.

Bench

`bench_laguna_spark` (new) drives the real `LagunaBackend` hybrid path with all `DFLASH_*` knobs, `DFLASH_BENCH_MIX`/`SEED` varied prompts, and a full-sequence FNV hash for exactness gates.
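A full-sequence hash used as an exactness gate could look roughly like the following. The PR does not say which FNV variant or width the bench uses, so 64-bit FNV-1a over the little-endian bytes of each token id is an assumption, and `fnv1a64` and `sequences_match` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// 64-bit FNV-1a over the bytes of the generated token ids, low byte first.
// The variant and width are assumptions; the PR only says "FNV hash".
uint64_t fnv1a64(const std::vector<int32_t>& ids) {
    uint64_t h = 14695981039346656037ull;  // FNV-1a 64-bit offset basis
    for (int32_t id : ids) {
        const uint32_t u = static_cast<uint32_t>(id);
        for (int b = 0; b < 4; ++b) {
            h ^= (u >> (8 * b)) & 0xffu;
            h *= 1099511628211ull;         // FNV 64-bit prime
        }
    }
    return h;
}

// Exactness gate: two runs pass when they have the same length and hash.
bool sequences_match(const std::vector<int32_t>& a,
                     const std::vector<int32_t>& b) {
    return a.size() == b.size() && fnv1a64(a) == fnv1a64(b);
}
```

Comparing one 64-bit value per run keeps the gate cheap to log and diff across configurations, at the cost of a small theoretical collision probability.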