perf(laguna): CUDA-graph replay decode — 113→143 tok/s all-GPU, 101→129 at 60% residency by davide221 · Pull Request #358 · Luce-Org/lucebox-hub

davide221 · 2026-06-09T22:51:03Z

What

Makes decode graphs bit-stable across tokens so ggml-cuda's CUDA-graph cache engages and each decode token replays as one captured graph instead of ~1.4k individual kernel launches. Now covers all four backends: laguna (fused hybrid + all-GPU), qwen35 27B AR, qwen35moe pipelined (Spark), gemma4.

The blocker was structural everywhere: K/V appends wrote through kv_start-offset views, whose node properties change every token; ggml_cuda_graph_update_required memcmps every node property and resets capture warmup on any change, so replay never engaged (except qwen's DeltaNet layers — fixed-size state — which is why #332's win never reached the attention paths).
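To make the blocker concrete, here is a minimal sketch of why an offset view defeats a property-compare cache while a row-index input does not. `NodeProps`, `append_via_offset_view`, `append_via_row_index` and `update_required` are hypothetical stand-ins written for this illustration; they are not ggml's real structs or functions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for the per-node properties a graph cache compares.
struct NodeProps {
    const void* data;    // destination address of the node
    int64_t     ne[2];   // shape: {row_bytes, n_rows}
    size_t      offset;  // byte offset of a view into its parent buffer
};

// Zero the struct first so memcmp never sees uninitialised bytes.
inline NodeProps make_props(const void* data, int64_t width, int64_t rows, size_t offset) {
    NodeProps p;
    std::memset(&p, 0, sizeof(p));
    p.data = data;
    p.ne[0] = width;
    p.ne[1] = rows;
    p.offset = offset;
    return p;
}

// Legacy-style append: the destination is a view kv_start rows into the
// cache, so address and offset move on every token.
inline NodeProps append_via_offset_view(const uint8_t* cache, size_t row_bytes, int64_t kv_start) {
    assert(kv_start >= 0);
    const size_t off = static_cast<size_t>(kv_start) * row_bytes;
    return make_props(cache + off, static_cast<int64_t>(row_bytes), 1, off);
}

// set_rows-style append: the destination is the whole cache; the row index
// is input data and never shows up in the node properties.
inline NodeProps append_via_row_index(const uint8_t* cache, size_t row_bytes, int64_t n_rows) {
    return make_props(cache, static_cast<int64_t>(row_bytes), n_rows, 0);
}

// True when a cached graph would have to be re-captured.
inline bool update_required(const NodeProps& a, const NodeProps& b) {
    return std::memcmp(&a, &b, sizeof(NodeProps)) != 0;
}
```

In this toy, two consecutive tokens through the offset-view path always compare unequal, while the row-index path compares equal for as long as the cache shape is unchanged.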

How (per backend)

Recipe: (1) ggml_set_rows K/V append with the row index as a graph input, (2) FA spans + masks stride-padded to 256 so shapes change only at window boundaries, (3) persistent arenas so rebuilt graphs keep stable node addresses.
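Step (2) of the recipe comes down to rounding the live KV length up to the next multiple of 256 and clamping to the cache capacity. A minimal sketch, assuming a helper named `padded_kv_span` (hypothetical; the PR does not show the real function):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

constexpr int64_t KV_PAD = 256;  // stride named in the PR text

// Round the live KV length up to the next multiple of `stride`, then clamp
// to the cache capacity. The result only changes when n_kv crosses a stride
// boundary, which is what keeps FA view shapes constant in between.
inline int64_t padded_kv_span(int64_t n_kv, int64_t capacity, int64_t stride = KV_PAD) {
    assert(n_kv >= 0 && stride > 0 && capacity >= n_kv);
    const int64_t padded = ((n_kv + stride - 1) / stride) * stride;
    return std::min(padded, capacity);
}
```

For example, every `n_kv` from 1 to 256 maps to a span of 256, and 257 is the first length that changes the span (to 512).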

laguna: applied to laguna_step + laguna_step_hybrid (one fused graph/token).
qwen35 (27B AR): persistent arena in build_target_step; the existing kv_write_rows path now forces the 256-stride FA span for all KV types (was TQ3_0-only). Mask-less decode + zero-init cache ⇒ padded rows contribute exp(-row_max) ≈ 0 (the approximation the TQ3_0 path already shipped). reset_recurrent_state now clears the base buffer device-side and restore_and_generate clears before restoring; both clears are required so stale K/V from a previous request can't sit inside the padded tail.
qwen35moe (Spark): full-attention layers get cached per-layer step graphs (like the DeltaNet ones) with positions/kv_write_rows inputs, FA span baked per 256-token window; the per-token 512MB graph rebuild per attention layer is gone.
gemma4: arena + two set_rows index streams (absolute rows for full layers, pos % swa_size ring rows for SWA — also fixing the contiguous-block append that could write past the ring end). KV cache now zero-initialised (F16 garbage in the padded tail can be NaN).
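The two gemma4 index streams can be sketched as follows. `absolute_rows` and `ring_rows` are hypothetical names for this illustration; the point is that a chunk crossing the ring end wraps per row instead of running past the buffer as a contiguous block would.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Full-attention layers: token at absolute position p is written to row p.
inline std::vector<int32_t> absolute_rows(int64_t pos0, int n_tokens) {
    assert(pos0 >= 0 && n_tokens >= 0);
    std::vector<int32_t> rows(static_cast<std::size_t>(n_tokens));
    for (int i = 0; i < n_tokens; ++i) {
        rows[static_cast<std::size_t>(i)] = static_cast<int32_t>(pos0 + i);
    }
    return rows;
}

// SWA layers: token at absolute position p is written to row p % swa_size,
// so each row wraps independently at the ring end.
inline std::vector<int32_t> ring_rows(int64_t pos0, int n_tokens, int64_t swa_size) {
    assert(pos0 >= 0 && n_tokens >= 0 && swa_size > 0);
    std::vector<int32_t> rows(static_cast<std::size_t>(n_tokens));
    for (int i = 0; i < n_tokens; ++i) {
        rows[static_cast<std::size_t>(i)] = static_cast<int32_t>((pos0 + i) % swa_size);
    }
    return rows;
}
```

With `swa_size = 8`, a 4-token chunk starting at position 6 yields rows 6, 7, 0, 1, whereas a contiguous append starting at row 6 would touch rows 8 and 9, which do not exist.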

Escape hatches per backend: DFLASH_{LAGUNA,QWEN35,QWEN35MOE,GEMMA4}_NO_KVPAD=1.
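A minimal sketch of how such a per-backend escape hatch could be read at startup. Both helpers are hypothetical, and treating only the exact value "1" as enabled is an assumption; the PR does not show how the variables are parsed.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>
#include <string>

// Hypothetical helper: true only when the variable is set to exactly "1".
inline bool env_flag_enabled(const char* name) {
    assert(name != nullptr);
    const char* v = std::getenv(name);
    return v != nullptr && std::strcmp(v, "1") == 0;
}

// Hypothetical helper: builds the variable name from an upper-case backend
// tag, following the DFLASH_{...}_NO_KVPAD pattern in the PR text.
inline std::string no_kvpad_var(const std::string& backend_upper) {
    return "DFLASH_" + backend_upper + "_NO_KVPAD";
}
```

Usage would look like `env_flag_enabled(no_kvpad_var("GEMMA4").c_str())`, evaluated once rather than per token so the choice of path stays fixed for the life of the process.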

Numbers (RTX 3090, greedy, real prompts via dflash_server; laguna via bench_laguna_spark)

Model / mode	legacy (tok/s)	replay (tok/s)	output vs legacy
laguna-xs2 all-GPU	113.3	143.3 (+26%)	token-identical
laguna-xs2 Spark 60%+ring	101.4	129.5 (+28%)	equal quality (drops dominate, pre-existing)
gemma4-26B-A4B	102.6	117.6 (+15%)	100% identical text
Qwen3.6-27B AR	36.7	39.2 (+7%)	coherent; greedy near-tie divergence
Qwen3.6-35B-A3B Spark	102.6	108.1 (+5%)	coherent; same divergence class
Qwen3.6-35B-A3B all-GPU	125	125 (=)	identical (path not touched)

Correctness / numerics, honestly

set_rows append is bit-identical to the cpy append at equal padding (verified by full-sequence hash on laguna).
gemma4 + masked paths: bit-identical output (spans were already padded ⇒ same numerics).
laguna: only delta vs unpadded legacy is FA tile-order rounding — the numerics upstream llama.cpp ships by default.
qwen35/qwen35moe decode is mask-less, so the padded span adds exp(-row_max)-suppressed zero rows (same approximation as the pre-existing TQ3_0 padding); under greedy this flips near-ties and trajectories diverge while staying coherent. The stale-tail hazard this creates across requests is closed by the device-side buffer clears above.
Spec-decode tree-verify paths are unchanged (variable tree shapes; follow-up — needs a fixed tree budget).
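The mask-less padding approximation for qwen35/qwen35moe can be reproduced with a toy single-query attention over scalar values. This is a sketch under stated assumptions, not the backend's kernel: `attend` is a hypothetical name, and each padded row is modelled as a zero K row (score 0) with a zero V row.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Softmax-weighted sum over real rows plus n_pad zero rows (score 0, value 0),
// mimicking a mask-less padded span over a zero-initialised cache.
inline double attend(const std::vector<double>& scores,
                     const std::vector<double>& values, int n_pad) {
    assert(!scores.empty() && scores.size() == values.size() && n_pad >= 0);
    double row_max = *std::max_element(scores.begin(), scores.end());
    if (n_pad > 0) row_max = std::max(row_max, 0.0);  // padded scores are 0
    double num = 0.0, den = 0.0;
    for (std::size_t i = 0; i < scores.size(); ++i) {
        const double w = std::exp(scores[i] - row_max);
        num += w * values[i];
        den += w;
    }
    // Zero V rows add nothing to the numerator; each zero K row adds
    // exp(0 - row_max) to the denominator.
    den += n_pad * std::exp(0.0 - row_max);
    return num / den;
}
```

In this toy the padded result is close to the exact one when the row maximum is large, but it is not equal, which is consistent with greedy near-ties flipping. When the row maximum is near zero the padded rows carry real weight.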

Bench

bench_laguna_spark (new) drives the real LagunaBackend hybrid path with all DFLASH_* knobs, DFLASH_BENCH_MIX/SEED varied prompts, and a full-sequence FNV hash for exactness gates.
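An exactness gate of this kind can be as small as one hash over the generated token ids. The sketch below uses 64-bit FNV-1a over little-endian id bytes; the PR only says "FNV", so the variant, width and byte order are assumptions, and `fnv1a64_tokens` is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// 64-bit FNV-1a over the token ids, each id fed as 4 little-endian bytes so
// the hash does not depend on host endianness.
inline uint64_t fnv1a64_tokens(const std::vector<int32_t>& ids) {
    uint64_t h = 0xcbf29ce484222325ULL;        // FNV-1a 64-bit offset basis
    for (int32_t id : ids) {
        const uint32_t u = static_cast<uint32_t>(id);
        for (int b = 0; b < 4; ++b) {
            h ^= (u >> (8 * b)) & 0xffu;       // xor in one byte
            h *= 0x100000001b3ULL;             // FNV 64-bit prime
        }
    }
    return h;
}
```

Comparing the hash of a legacy run with the hash of a replay run on the same prompt and seed gives a one-line pass/fail signal for "token-identical".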

🧙 Built with WOZCODE

…29 at 60% residency

The fused decode rebuilt its graph every token with a kv_start-offset view as the K/V copy destination. ggml-cuda's CUDA-graph cache memcmps every node's properties and resets capture warmup on any change, so the moving copy destination forfeited replay permanently: every token paid ~1.4k individual kernel launches, and the ~200 LUT-remap ops of the hybrid path sat in the dependency chain at full launch latency.

Three changes make the graph bit-stable across decode steps:

1. K/V append via ggml_set_rows with the row index as a graph INPUT (kv_idx, I32, broadcasts over heads): index data changes per step, node properties don't. Bit-identical output vs the cpy append at equal padding (verified by full-sequence hash).
2. FA K/V views + masks stride-padded to 256 slots (kv_pad); the mask carries -inf for the padded tail and the KV buffer is zero-init'd, so topology only changes when decode crosses a 256-token boundary.
3. Persistent ggml arena (ggml_init with mem_buffer) so rebuilt graphs land at identical addresses -> stable graph key -> warmup completes and the captured CUDA graph replays.

Applied to both laguna_step (all-GPU) and laguna_step_hybrid (Spark hot/cold).

RTX 3090, laguna-xs2 Q4_K_M, 128p/256g greedy:
  all-GPU                   113.3 -> 143.3 tok/s (+26%, ids bit-identical)
  Spark 60% + 16-slot ring  101.4 -> 129.5 tok/s (114% of the old ceiling)
  99.4% resident            106.6 -> 134.8 tok/s

Escape hatches: DFLASH_LAGUNA_NO_KVPAD=1 restores the legacy exact-length path; DFLASH_LAGUNA_PAD_CPY=1 pads spans but keeps the cpy append (numerics decomposition).

New bench: bench_laguna_spark drives the real LagunaBackend hybrid path with all DFLASH_* knobs, DFLASH_BENCH_MIX/SEED varied prompts, and a full-sequence FNV hash for exactness gates.

Co-Authored-By: WOZCODE <contact@withwoz.com>
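The claim that a set_rows append writes the same bytes as the cpy append can be checked on a toy CPU cache. `ToyCache`, `append_cpy` and `append_set_rows` are hypothetical stand-ins that only model the data movement, not the ggml ops themselves.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy K/V cache: n_rows x width floats, row-major, zero-initialised.
struct ToyCache {
    int64_t width;
    std::vector<float> data;
    ToyCache(int64_t n_rows, int64_t w)
        : width(w), data(static_cast<std::size_t>(n_rows * w), 0.0f) {}
    int64_t n_rows() const { return static_cast<int64_t>(data.size()) / width; }
};

// Legacy-style append: the destination offset kv_start is part of the op.
inline void append_cpy(ToyCache& c, const std::vector<float>& rows, int64_t kv_start) {
    const int64_t n = static_cast<int64_t>(rows.size()) / c.width;
    assert(kv_start >= 0 && kv_start + n <= c.n_rows());
    std::copy(rows.begin(), rows.end(), c.data.begin() + kv_start * c.width);
}

// set_rows-style append: the op always targets the whole cache; where each
// row lands is decided by index data supplied at run time.
inline void append_set_rows(ToyCache& c, const std::vector<float>& rows,
                            const std::vector<int32_t>& idx) {
    assert(rows.size() == idx.size() * static_cast<std::size_t>(c.width));
    for (std::size_t i = 0; i < idx.size(); ++i) {
        assert(idx[i] >= 0 && idx[i] < c.n_rows());
        const std::ptrdiff_t src = static_cast<std::ptrdiff_t>(i) * c.width;
        std::copy(rows.begin() + src, rows.begin() + src + c.width,
                  c.data.begin() + idx[i] * c.width);
    }
}
```

Writing the same row at position 5 through both functions leaves the two caches byte-for-byte equal; only the place where the position lives (op parameter versus input data) differs.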

cubic-dev-ai

2 issues found across 4 files


…nch args

- laguna_step/laguna_step_hybrid arenas become thread_local: decode is single-threaded per process today (the pre-existing static gallocr makes the same assumption), but a second decode thread must not share the arena. Same-thread address stability (what CUDA-graph replay needs) is preserved.
- bench_laguna_spark: clamp prompt_N/n_gen to >=1 so a 0 or non-numeric argv can't write past the end of an empty prompt vector.

Co-Authored-By: WOZCODE <contact@withwoz.com>
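The thread_local arena idea can be sketched with a plain byte buffer. `step_arena` is a hypothetical accessor; in the real code a buffer like this would presumably be handed to ggml_init as mem_buffer, as the earlier commit message describes, but that wiring is not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical arena accessor: one buffer per thread, allocated on first use
// and never resized afterwards, so the base address stays fixed for that
// thread while a second thread gets its own storage.
inline uint8_t* step_arena(std::size_t size) {
    thread_local std::vector<uint8_t> buf;
    if (buf.empty()) {
        buf.resize(size);
    }
    assert(buf.size() >= size);  // later requests must fit the first allocation
    return buf.data();
}
```

Repeated calls on one thread return the same pointer, which is the same-thread address stability the bullet above refers to.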


…mma4

Same recipe as the laguna fix (set_rows KV append with index INPUTS, stride-padded FA spans, stable arenas), adapted per backend:

qwen35 (Qwen3.5/3.6-27B AR target-only decode):
- build_target_step gets a persistent thread_local arena (stable node addresses across steps -> stable CUDA-graph key).
- The existing step-invariant kv_write_rows path now also forces the 256-stride FA span for ALL KV types (was TQ3_0-only), clamped to the cache capacity, so view shapes stay constant within a 256-token window. Mask-less decode + zero-initialised cache: padded rows contribute exp(-row_max) ~ 0 (same approximation the TQ3_0 path already shipped).
- reset_recurrent_state now clears the whole base buffer device-side (cudaMemset, ~0.2ms) and restore_and_generate clears before restoring: with a padded mask-less span, stale K/V rows from the previous request inside the padded tail would otherwise be attended with real scores.
- DFLASH_QWEN35_NO_KVPAD=1 restores the legacy path.

qwen35moe (Qwen3.6-35B-A3B hybrid/Spark pipelined decode):
- build_qwen35_layer_prefn gains a kv_write_rows pass-through.
- New build_cached_attn_prefn: full-attention layers now get CACHED per-layer step graphs (like the DeltaNet ones) with positions + kv_write_rows inputs and the FA span baked per 256-token window; rebuilt only on window crossings. pipelined_decode_one_token reuses them with input-data updates instead of rebuilding a fresh 512MB graph ctx per attention layer per token.
- DFLASH_QWEN35MOE_NO_KVPAD=1 restores per-token rebuild.

gemma4 (26B-A4B / 31B):
- gemma4_step gets the persistent thread_local arena and two set_rows index inputs: absolute rows for full-attention layers, ring rows (pos % swa_size) for SWA layers, which also fixes the offset-view append writing past the ring end for chunks crossing the wrap boundary. FA spans/masks were already 256-padded.
- KV cache buffer now zero-initialised at creation (F16 garbage in the padded tail can be NaN/Inf; NaN + -inf mask = NaN).
- DFLASH_GEMMA4_NO_KVPAD=1 restores the legacy append.

Measured (RTX 3090, dflash_server /v1/chat/completions, greedy 400 tok):
  gemma4-26B-A4B       102.6 -> 117.6 tok/s (+15%), output 100% identical
  Qwen3.6-27B AR        36.7 ->  39.2 tok/s (+7%, bandwidth-bound), coherent
  Qwen3.6-35B all-GPU  unchanged (125 tok/s, pipelined path not used there)

Co-Authored-By: WOZCODE <contact@withwoz.com>
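The "NaN + -inf mask = NaN" remark for gemma4 follows directly from IEEE 754 arithmetic. A minimal sketch, assuming the softmax weight of a row is computed as exp(score + mask - row_max); `masked_weight` is a hypothetical name and the code assumes no fast-math flags.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Softmax weight of one row before normalisation. A -inf mask drives the
// weight to exactly 0 for any finite score, but cannot cancel a NaN score.
inline float masked_weight(float raw_score, float mask, float row_max) {
    return std::exp(raw_score + mask - row_max);
}
```

A zero-initialised padded row has a finite score and so gets weight 0 under the mask; an uninitialised row whose garbage decodes to NaN yields a NaN weight that then poisons the whole output row.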

cubic-dev-ai

1 issue found across 8 files (changes from recent commits).


cubic P1: positions/kv_write_rows/kv_win were not moved or nulled in the source, so a vector reallocation of cached_prefn would leave the moved-to graph with null attn inputs (crash on tensor_set) and dangling pointers in the source. Co-Authored-By: WOZCODE <contact@withwoz.com>
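The fix pattern for this class of bug is a move constructor that transfers every handle and nulls the source. The sketch below is an illustration only: `Tensor` and `CachedGraph` are hypothetical stand-ins, and modelling kv_win as an int is an assumption since the PR does not give its type.

```cpp
#include <cassert>
#include <utility>
#include <vector>

struct Tensor { int id; };  // hypothetical stand-in for a graph input tensor

// Hypothetical cached per-layer graph holding non-owning input handles.
struct CachedGraph {
    Tensor* positions     = nullptr;
    Tensor* kv_write_rows = nullptr;
    int     kv_win        = -1;  // assumed to be a window index

    CachedGraph() = default;
    CachedGraph(const CachedGraph&) = delete;
    CachedGraph& operator=(const CachedGraph&) = delete;

    // Transfer every member and reset the source, so a vector reallocation
    // leaves the moved-to element fully populated and the source empty.
    CachedGraph(CachedGraph&& o) noexcept
        : positions(std::exchange(o.positions, nullptr)),
          kv_write_rows(std::exchange(o.kv_write_rows, nullptr)),
          kv_win(std::exchange(o.kv_win, -1)) {}

    CachedGraph& operator=(CachedGraph&& o) noexcept {
        if (this != &o) {
            positions     = std::exchange(o.positions, nullptr);
            kv_write_rows = std::exchange(o.kv_write_rows, nullptr);
            kv_win        = std::exchange(o.kv_win, -1);
        }
        return *this;
    }
};
```

Because the move operations are noexcept, std::vector uses them when it grows, so elements keep their inputs across a reallocation.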

The pull_request paths filter included server/src/** so every code PR rebuilt the cuda12+rocm images (~20 runner-minutes per push) for build-only coverage the self-hosted GPU jobs (RTX 3090 sm_86 + Radeon gfx1151) already provide on real hardware. Image-build regressions can only originate from the Dockerfile / bake file / dockerignore / this workflow — keep the PR guard scoped to those; main pushes and releases still build and push as before. Co-Authored-By: WOZCODE <contact@withwoz.com>

The RTX 3090 / Radeon gfx1151 jobs finish in ~2 minutes on real hardware but were gated on the ~18-minute GitHub-hosted cmake+megakernel build, so the strongest CI signal arrived last and was invisible in the checks list until the cloud build completed. Gate them on the 1-minute uv-workspace sanity check only; both jobs compile the tree themselves and fail just as fast on broken code. Co-Authored-By: WOZCODE <contact@withwoz.com>

cubic-dev-ai Bot reviewed Jun 9, 2026


Comment thread server/src/laguna/laguna_target_graph.cpp Outdated

Comment thread server/test/bench_laguna_spark.cpp

easel pushed a commit to easel/lucebox-hub that referenced this pull request Jun 10, 2026

docs: refresh auto-integration manifest after PR Luce-Org#358/Luce-Or…

33b8ec1

…g#356 merges

cubic-dev-ai Bot reviewed Jun 10, 2026


Comment thread server/src/qwen35moe/qwen35moe_pipelined_decode.h

davide221 and others added 3 commits June 10, 2026 11:10

davide221 merged commit 507a678 into main Jun 10, 2026
5 checks passed

davide221 deleted the feat/laguna-cudagraph-replay branch June 10, 2026 11:42

davide221 merged 6 commits into main from feat/laguna-cudagraph-replay
