perf(laguna): CUDA-graph replay decode — 113→143 tok/s all-GPU, 101→129 at 60% residency #358

Merged: davide221 merged 6 commits into main from feat/laguna-cudagraph-replay on Jun 10, 2026
@davide221 davide221 commented Jun 9, 2026

What

Makes decode graphs bit-stable across tokens so ggml-cuda's CUDA-graph cache engages and replays the token as one captured graph instead of ~1.4k individual launches. Now covers all four backends: laguna (fused hybrid + all-GPU), qwen35 27B AR, qwen35moe pipelined (Spark), gemma4.

The blocker was structural everywhere: K/V appends wrote through kv_start-offset views, whose node properties change every token; ggml_cuda_graph_update_required memcmps every node property and resets capture warmup on any change, so replay never engaged (except qwen's DeltaNet layers — fixed-size state — which is why #332's win never reached the attention paths).
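
A toy model of that failure mode, as a sketch only: `NodeProps`, `offset_view_append`, `indexed_append` and `count_property_changes` are hypothetical names, and the real check in ggml-cuda compares more fields than the two shown here. The point is that a destination address derived from kv_start differs on every token, while a whole-cache destination with the row index carried as input data does not.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the node properties a graph cache compares.
struct NodeProps {
    std::uintptr_t dst;   // address the node writes to
    std::int64_t   rows;  // rows covered by the destination
};

bool same_props(const NodeProps & a, const NodeProps & b) {
    return a.dst == b.dst && a.rows == b.rows;
}

// Legacy append: the destination is a view offset by kv_start rows,
// so its address moves on every token.
NodeProps offset_view_append(std::uintptr_t kv_base, std::int64_t kv_start,
                             std::int64_t row_bytes) {
    assert(kv_start >= 0 && row_bytes > 0);
    return NodeProps{ kv_base + static_cast<std::uintptr_t>(kv_start * row_bytes), 1 };
}

// Indexed append: the destination is the whole cache; the row index
// travels as input data and is not part of the node properties.
NodeProps indexed_append(std::uintptr_t kv_base, std::int64_t capacity_rows) {
    return NodeProps{ kv_base, capacity_rows };
}

// Number of steps whose properties differ from the previous step.
// In the scheme described above, each such step would reset capture warmup.
int count_property_changes(const std::vector<NodeProps> & steps) {
    int changes = 0;
    for (std::size_t i = 1; i < steps.size(); ++i) {
        if (!same_props(steps[i - 1], steps[i])) ++changes;
    }
    return changes;
}
```

With eight decode steps, the offset-view variant changes on all seven transitions and the indexed variant on none.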

How (per backend)

Recipe: (1) ggml_set_rows K/V append with the row index as a graph input, (2) FA spans + masks stride-padded to 256 so shapes change only at window boundaries, (3) persistent arenas so rebuilt graphs keep stable node addresses.
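
Step (2) of the recipe comes down to rounding the live KV length up to a multiple of 256. A minimal sketch, assuming a helper named `padded_span` (hypothetical) and the clamp to cache capacity that the qwen35 commit describes:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

constexpr std::int64_t kStride = 256;  // pad stride from the recipe

// Round the live KV length up to the next multiple of the stride,
// clamped to the cache capacity.
std::int64_t padded_span(std::int64_t kv_len, std::int64_t capacity) {
    assert(kv_len >= 0 && kv_len <= capacity);
    const std::int64_t padded = (kv_len + kStride - 1) / kStride * kStride;
    return std::min(padded, capacity);
}

// How many times the padded span changes while kv_len grows from
// first_len to last_len, one token at a time.
int count_shape_changes(std::int64_t first_len, std::int64_t last_len,
                        std::int64_t capacity) {
    int changes = 0;
    for (std::int64_t n = first_len + 1; n <= last_len; ++n) {
        if (padded_span(n, capacity) != padded_span(n - 1, capacity)) ++changes;
    }
    return changes;
}
```

Decoding from length 1 to 1024 changes the span three times (at 257, 513 and 769) instead of on every token.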

  • laguna: applied to laguna_step + laguna_step_hybrid (one fused graph/token).
  • qwen35 (27B AR): persistent arena in build_target_step; the existing kv_write_rows path now forces the 256-stride FA span for all KV types (was TQ3_0-only). Mask-less decode + zero-init cache ⇒ padded rows contribute exp(-row_max) ≈ 0 (the approximation the TQ3_0 path already shipped). reset_recurrent_state now clears the base buffer device-side and restore clears before restoring — required so stale K/V from a previous request can't sit inside the padded tail.
  • qwen35moe (Spark): full-attention layers get cached per-layer step graphs (like the DeltaNet ones) with positions/kv_write_rows inputs, FA span baked per 256-token window; the per-token 512MB graph rebuild per attention layer is gone.
  • gemma4: arena + two set_rows index streams (absolute rows for full layers, pos % swa_size ring rows for SWA — also fixing the contiguous-block append that could write past the ring end). KV cache now zero-initialised (F16 garbage in the padded tail can be NaN).
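
The two gemma4 index streams can be illustrated as below. This is a sketch with hypothetical helper names, not the backend's code; it shows why per-row `pos % swa_size` indices are safe where a contiguous block starting at the wrapped position is not.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Row indices for full-attention layers: absolute positions.
std::vector<std::int32_t> absolute_rows(std::int32_t pos, std::int32_t n_tokens) {
    assert(pos >= 0 && n_tokens >= 0);
    std::vector<std::int32_t> rows;
    for (std::int32_t i = 0; i < n_tokens; ++i) rows.push_back(pos + i);
    return rows;
}

// Row indices for SWA layers: every position wraps on its own.
std::vector<std::int32_t> ring_rows(std::int32_t pos, std::int32_t n_tokens,
                                    std::int32_t swa_size) {
    assert(pos >= 0 && n_tokens >= 0 && swa_size > 0);
    std::vector<std::int32_t> rows;
    for (std::int32_t i = 0; i < n_tokens; ++i) rows.push_back((pos + i) % swa_size);
    return rows;
}

// True if a contiguous block written at pos % swa_size would run past
// the end of the ring.
bool contiguous_append_overruns(std::int32_t pos, std::int32_t n_tokens,
                                std::int32_t swa_size) {
    assert(pos >= 0 && n_tokens >= 0 && swa_size > 0);
    return pos % swa_size + n_tokens > swa_size;
}
```

For a ring of 8 slots, a 4-token chunk at position 6 maps to rows 6, 7, 0, 1, whereas a contiguous block from row 6 would need rows 6 through 9.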

Escape hatches per backend: DFLASH_{LAGUNA,QWEN35,QWEN35MOE,GEMMA4}_NO_KVPAD=1.

Numbers (RTX 3090, greedy, real prompts via dflash_server; laguna via bench_laguna_spark)

| Model / mode | legacy (tok/s) | replay (tok/s) | output vs legacy |
| --- | --- | --- | --- |
| laguna-xs2 all-GPU | 113.3 | 143.3 (+26%) | token-identical |
| laguna-xs2 Spark 60%+ring | 101.4 | 129.5 (+28%) | equal quality (drops dominate, pre-existing) |
| gemma4-26B-A4B | 102.6 | 117.6 (+15%) | 100% identical text |
| Qwen3.6-27B AR | 36.7 | 39.2 (+7%) | coherent; greedy near-tie divergence |
| Qwen3.6-35B-A3B Spark | 102.6 | 108.1 (+5%) | coherent; same divergence class |
| Qwen3.6-35B-A3B all-GPU | 125 | 125 (=) | identical (path not touched) |

Correctness / numerics, honestly

  • set_rows append is bit-identical to the cpy append at equal padding (verified by full-sequence hash on laguna).
  • gemma4 + masked paths: bit-identical output (spans were already padded ⇒ same numerics).
  • laguna: only delta vs unpadded legacy is FA tile-order rounding — the numerics upstream llama.cpp ships by default.
  • qwen35/qwen35moe decode is mask-less, so the padded span adds exp(-row_max)-suppressed zero rows (same approximation as the pre-existing TQ3_0 padding); under greedy this flips near-ties and trajectories diverge while staying coherent. The stale-tail hazard this creates across requests is closed by the device-side buffer clears above.
  • Spec-decode tree-verify paths are unchanged (variable tree shapes; follow-up — needs a fixed tree budget).
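
The size of the mask-less padding approximation in the qwen35 bullet can be estimated numerically. The sketch below assumes padded K rows are exactly zero, so each padded row scores 0 and takes weight exp(-row_max) before normalisation. `padded_mass` is a hypothetical helper working in double precision; the real FA kernel is tiled and lower precision.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Fraction of softmax mass that falls on n_pad padded rows whose score is 0.
double padded_mass(const std::vector<double> & real_scores, int n_pad) {
    assert(!real_scores.empty() && n_pad >= 0);
    const double real_max = *std::max_element(real_scores.begin(), real_scores.end());
    // Padded rows take part in the row max only if there are any.
    const double row_max = n_pad > 0 ? std::max(real_max, 0.0) : real_max;
    double real_sum = 0.0;
    for (double s : real_scores) real_sum += std::exp(s - row_max);
    const double pad_sum = n_pad * std::exp(0.0 - row_max);
    return pad_sum / (real_sum + pad_sum);
}
```

With one real score of 20 and 255 padded rows the leaked mass is below 1e-6. With a real score of 0 and one padded row it is exactly one half. The approximation therefore depends on row_max being large, which is consistent with the near-tie flips reported above.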

Bench

bench_laguna_spark (new) drives the real LagunaBackend hybrid path with all DFLASH_* knobs, DFLASH_BENCH_MIX/SEED varied prompts, and a full-sequence FNV hash for exactness gates.
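
A full-sequence hash gate of this kind could look like the sketch below. The PR only says "FNV", so the variant (FNV-1a), the 64-bit width, and the little-endian byte order of token ids are assumptions, as are the function names.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// FNV-1a, 64-bit, over the bytes of each token id (little-endian, done
// explicitly so the result does not depend on host endianness).
std::uint64_t fnv1a_tokens(const std::vector<std::int32_t> & ids) {
    std::uint64_t h = 14695981039346656037ULL;       // FNV 64-bit offset basis
    for (std::int32_t id : ids) {
        const std::uint32_t u = static_cast<std::uint32_t>(id);
        for (int b = 0; b < 4; ++b) {
            h ^= (u >> (8 * b)) & 0xFFu;
            h *= 1099511628211ULL;                    // FNV 64-bit prime
        }
    }
    return h;
}

// Exactness gate: aborts if two runs produced different token sequences.
void require_identical(const std::vector<std::int32_t> & legacy,
                       const std::vector<std::int32_t> & replay) {
    assert(fnv1a_tokens(legacy) == fnv1a_tokens(replay));
}
```

Comparing one 64-bit value per run keeps the gate cheap enough to leave on for every benchmark prompt.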

🧙 Built with WOZCODE

…29 at 60% residency

The fused decode rebuilt its graph every token with a kv_start-offset view
as the K/V copy destination. ggml-cuda's CUDA-graph cache memcmps every
node's properties and resets capture warmup on any change, so the moving
copy destination forfeited replay permanently: every token paid ~1.4k
individual kernel launches, and the ~200 LUT-remap ops of the hybrid path
sat in the dependency chain at full launch latency.

Three changes make the graph bit-stable across decode steps:

1. K/V append via ggml_set_rows with the row index as a graph INPUT
   (kv_idx, I32, broadcasts over heads): index data changes per step,
   node properties don't. Bit-identical output vs the cpy append at
   equal padding (verified by full-sequence hash).
2. FA K/V views + masks stride-padded to 256 slots (kv_pad); the mask
   carries -inf for the padded tail and the KV buffer is zero-init'd,
   so topology only changes when decode crosses a 256-token boundary.
3. Persistent ggml arena (ggml_init with mem_buffer) so rebuilt graphs
   land at identical addresses -> stable graph key -> warmup completes
   and the captured CUDA graph replays.

Applied to both laguna_step (all-GPU) and laguna_step_hybrid (Spark
hot/cold). RTX 3090, laguna-xs2 Q4_K_M, 128p/256g greedy:

  all-GPU                  113.3 -> 143.3 tok/s (+26%, ids bit-identical)
  Spark 60% + 16-slot ring 101.4 -> 129.5 tok/s (114% of the old ceiling)
  99.4% resident           106.6 -> 134.8 tok/s

Escape hatches: DFLASH_LAGUNA_NO_KVPAD=1 restores the legacy exact-length
path; DFLASH_LAGUNA_PAD_CPY=1 pads spans but keeps the cpy append (numerics
decomposition). New bench: bench_laguna_spark drives the real LagunaBackend
hybrid path with all DFLASH_* knobs, DFLASH_BENCH_MIX/SEED varied prompts,
and a full-sequence FNV hash for exactness gates.

Co-Authored-By: WOZCODE <contact@withwoz.com>

@cubic-dev-ai cubic-dev-ai Bot left a comment

2 issues found across 4 files

Comment thread server/src/laguna/laguna_target_graph.cpp Outdated
Comment thread server/test/bench_laguna_spark.cpp
…nch args

- laguna_step/laguna_step_hybrid arenas become thread_local: decode is
  single-threaded per process today (the pre-existing static gallocr makes
  the same assumption), but a second decode thread must not share the
  arena. Same-thread address stability (what CUDA-graph replay needs) is
  preserved.
- bench_laguna_spark: clamp prompt_N/n_gen to >=1 so a 0 or non-numeric
  argv can't write past the end of an empty prompt vector.
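
The clamp in the second item might be written as below. How the bench actually parses argv is not shown in this PR, so `parse_count` and the use of std::strtol are assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <climits>
#include <cstdlib>

// Parse a count argument and clamp it to [1, INT_MAX].
// std::strtol returns 0 when no digits can be converted, so "0", "abc"
// and negative values all come back as 1.
int parse_count(const char * arg) {
    assert(arg != nullptr);
    const long v = std::strtol(arg, nullptr, 10);
    return static_cast<int>(std::min<long>(std::max<long>(v, 1), INT_MAX));
}
```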

Co-Authored-By: WOZCODE <contact@withwoz.com>
easel pushed a commit to easel/lucebox-hub that referenced this pull request Jun 10, 2026
…mma4

Same recipe as the laguna fix (set_rows KV append with index INPUTS,
stride-padded FA spans, stable arenas), adapted per backend:

qwen35 (Qwen3.5/3.6-27B AR target-only decode):
- build_target_step gets a persistent thread_local arena (stable node
  addresses across steps -> stable CUDA-graph key).
- The existing step-invariant kv_write_rows path now also forces the
  256-stride FA span for ALL KV types (was TQ3_0-only), clamped to the
  cache capacity, so view shapes stay constant within a 256-token window.
  Mask-less decode + zero-initialised cache: padded rows contribute
  exp(-row_max) ~ 0 (same approximation the TQ3_0 path already shipped).
- reset_recurrent_state now clears the whole base buffer device-side
  (cudaMemset, ~0.2ms) and restore_and_generate clears before restoring:
  with a padded mask-less span, stale K/V rows from the previous request
  inside the padded tail would otherwise be attended with real scores.
- DFLASH_QWEN35_NO_KVPAD=1 restores the legacy path.

qwen35moe (Qwen3.6-35B-A3B hybrid/Spark pipelined decode):
- build_qwen35_layer_prefn gains a kv_write_rows pass-through.
- New build_cached_attn_prefn: full-attention layers now get CACHED
  per-layer step graphs (like the DeltaNet ones) with positions +
  kv_write_rows inputs and the FA span baked per 256-token window;
  rebuilt only on window crossings. pipelined_decode_one_token reuses
  them with input-data updates instead of rebuilding a fresh 512MB
  graph ctx per attention layer per token.
- DFLASH_QWEN35MOE_NO_KVPAD=1 restores per-token rebuild.
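
The rebuild-on-window-crossing policy can be sketched as a cache keyed by the 256-token window index. `AttnGraphCache` and its fields are hypothetical; the real cache holds a per-layer graph rather than a counter.

```cpp
#include <cassert>
#include <cstdint>

// Caches one "graph" per attention layer and rebuilds it only when the
// padded FA span moves to a new 256-token window.
struct AttnGraphCache {
    std::int64_t cached_win = -1;  // window the cached graph was built for
    int          rebuilds   = 0;   // how many times the graph was rebuilt

    // Returns the padded FA span for this step.
    std::int64_t step(std::int64_t kv_len) {
        assert(kv_len > 0);
        const std::int64_t win = (kv_len + 255) / 256;
        if (win != cached_win) {   // window crossing: rebuild once
            cached_win = win;
            ++rebuilds;
        }
        return win * 256;          // otherwise only input data is updated
    }
};
```

Decoding 600 tokens rebuilds three times instead of 600.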

gemma4 (26B-A4B / 31B):
- gemma4_step gets the persistent thread_local arena and two set_rows
  index inputs: absolute rows for full-attention layers, ring rows
  (pos % swa_size) for SWA layers — which also fixes the offset-view
  append writing past the ring end for chunks crossing the wrap
  boundary. FA spans/masks were already 256-padded.
- KV cache buffer now zero-initialised at creation (F16 garbage in the
  padded tail can be NaN/Inf; NaN + -inf mask = NaN).
- DFLASH_GEMMA4_NO_KVPAD=1 restores the legacy append.
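
The NaN hazard in the gemma4 note follows from IEEE arithmetic and can be checked directly. The sketch uses float rather than F16, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Masked attention score: raw score plus an additive mask, where the mask
// is 0 for visible rows and -inf for hidden ones.
float masked_score(float raw, float mask) {
    assert(mask == 0.0f || (std::isinf(mask) && mask < 0.0f));
    return raw + mask;
}

// Softmax numerator for one row, before normalisation.
float unnormalised_weight(float masked, float row_max) {
    return std::exp(masked - row_max);
}
```

A zero-initialised padded row under a -inf mask yields weight 0. A padded row holding NaN yields NaN despite the mask, and that NaN then propagates through the softmax sum.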

Measured (RTX 3090, dflash_server /v1/chat/completions, greedy 400 tok):
  gemma4-26B-A4B   102.6 -> 117.6 tok/s (+15%), output 100% identical
  Qwen3.6-27B AR    36.7 ->  39.2 tok/s (+7%, bandwidth-bound), coherent
  Qwen3.6-35B all-GPU unchanged (125 tok/s, pipelined path not used there)

Co-Authored-By: WOZCODE <contact@withwoz.com>

@cubic-dev-ai cubic-dev-ai Bot left a comment

1 issue found across 8 files (changes from recent commits).

Comment thread server/src/qwen35moe/qwen35moe_pipelined_decode.h
davide221 and others added 3 commits June 10, 2026 11:10
cubic P1: positions/kv_write_rows/kv_win were not moved or nulled in the
source, so a vector reallocation of cached_prefn would leave the moved-to
graph with null attn inputs (crash on tensor_set) and dangling pointers
in the source.
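
A move constructor that avoids this class of bug transfers every pointer and nulls the source. The sketch below is illustrative only: `Tensor`, `CachedGraph`, `set_positions` and the `int` type of `kv_win` are assumptions; only the field names come from the commit message.

```cpp
#include <cassert>
#include <utility>
#include <vector>

struct Tensor { int id; };  // hypothetical stand-in for a graph input tensor

struct CachedGraph {
    Tensor * positions     = nullptr;
    Tensor * kv_write_rows = nullptr;
    int      kv_win        = -1;

    CachedGraph() = default;
    CachedGraph(const CachedGraph &) = delete;
    CachedGraph & operator=(const CachedGraph &) = delete;

    // Move every field and leave the source empty, so a vector
    // reallocation cannot drop inputs or leave dangling pointers behind.
    CachedGraph(CachedGraph && o) noexcept
        : positions(std::exchange(o.positions, nullptr)),
          kv_write_rows(std::exchange(o.kv_write_rows, nullptr)),
          kv_win(std::exchange(o.kv_win, -1)) {}

    CachedGraph & operator=(CachedGraph && o) noexcept {
        if (this != &o) {
            positions     = std::exchange(o.positions, nullptr);
            kv_write_rows = std::exchange(o.kv_write_rows, nullptr);
            kv_win        = std::exchange(o.kv_win, -1);
        }
        return *this;
    }
};

// Stand-in for writing input data; a null input here is the crash described above.
void set_positions(CachedGraph & g, int value) {
    assert(g.positions != nullptr);
    g.positions->id = value;
}
```

Marking the move operations noexcept lets std::vector move elements during reallocation.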

Co-Authored-By: WOZCODE <contact@withwoz.com>
The pull_request paths filter included server/src/** so every code PR
rebuilt the cuda12+rocm images (~20 runner-minutes per push) for
build-only coverage the self-hosted GPU jobs (RTX 3090 sm_86 + Radeon
gfx1151) already provide on real hardware. Image-build regressions can
only originate from the Dockerfile / bake file / dockerignore / this
workflow — keep the PR guard scoped to those; main pushes and releases
still build and push as before.

Co-Authored-By: WOZCODE <contact@withwoz.com>
The RTX 3090 / Radeon gfx1151 jobs finish in ~2 minutes on real hardware
but were gated on the ~18-minute GitHub-hosted cmake+megakernel build,
so the strongest CI signal arrived last and was invisible in the checks
list until the cloud build completed. Gate them on the 1-minute
uv-workspace sanity check only; both jobs compile the tree themselves
and fail just as fast on broken code.

Co-Authored-By: WOZCODE <contact@withwoz.com>
@davide221 davide221 merged commit 507a678 into main Jun 10, 2026
5 checks passed
@davide221 davide221 deleted the feat/laguna-cudagraph-replay branch June 10, 2026 11:42