DFlash on Qwen3.5-4B: hybrid rewind + GPU verify, server support, portable chunked GDN by AlexWortega · Pull Request #5 · ruixiang63/llama.cpp

AlexWortega · 2026-06-12T17:29:05Z

Builds on this DFlash PR to make speculative decoding actually faster than
autoregressive on the Qwen3.5-4B Gated-DeltaNet hybrid, add CPU support, land
it in llama-server, and add a portable (WebGPU/Metal/Vulkan) GDN verify path.

Branches from ruixiang63:dflash (67cb0d50), so this shows exactly the 4 commits below.

Commits

1. convert: map the Qwen3.5-4B multimodal tokenizer hash to the qwen35 pre-tokenizer
Qwen/Qwen3.5-4B is Qwen3_5ForConditionalGeneration; without the mapping neither
target nor drafter converts to GGUF.

2. dflash: make speculative decoding work and fast on the hybrid
The recurrent state can't be partially rolled back, so a naive verify is slower than
plain generation. Fixes:

- recurrent-state rewind via a per-token GDN state trace + on-device promote of the
  accepted state, instead of a ~50 MiB host checkpoint + re-decode per round
- graph reuse: device-resident target-context cache, encoder folded into the decoder
  graph, padding mask over a bucketed context
- on-device greedy verify (drafter block argmax + llama_set_out_argmax for the
  target), one host sync/round; optional GPU sampling verify (temperature)
- CUDA graphs opt-in on Volta (GGML_CUDA_GRAPHS_VOLTA)

Lossless; ~1.7x on V100/Q8 single-stream, scaling with the draft block on
high-acceptance (reasoning) workloads.
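
For readers new to the greedy verify step, here is a minimal host-side sketch of the acceptance rule that generic greedy speculative decoding uses. It is an illustration, not the code in this PR: the function name and the block layout (n drafted tokens checked against n+1 target argmax tokens) are assumptions.

```python
def accept_greedy(draft, target_argmax):
    """Return the tokens to emit for one verify round.

    draft:         the n tokens proposed by the drafter.
    target_argmax: n + 1 greedy tokens from the target, where
                   target_argmax[i] is the target's choice after draft[:i].
    (Both names and this layout are assumptions for illustration.)
    """
    n_accept = 0
    # Accept the longest prefix on which drafter and target agree.
    while n_accept < len(draft) and draft[n_accept] == target_argmax[n_accept]:
        n_accept += 1
    # The target's own token at the first mismatch (or after a fully
    # accepted block) is always emitted, so each round yields >= 1 token.
    return draft[:n_accept] + [target_argmax[n_accept]]
```

Because every emitted token equals the target's own greedy choice, the output matches plain greedy decoding, which is the sense in which this kind of verify is lossless.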

3. gdn: portable chunk-parallel Gated-DeltaNet verify path (opt-in)
Decomposes the GDN recurrence into a pure ggml-op graph (cumsum/exp/mul_mat/tri/
solve_tri/diag) so the verify can run on backends without a fused GDN kernel
(WebGPU/Metal/Vulkan). Supports multi-chunk tiling, vector + scalar gates, and GQA.
Validated against ggml_gated_delta_net on CPU and CUDA to ~1e-8
(tests/test-gdn-chunked). Opt-in via LLAMA_GDN_CHUNKED; default path unchanged.
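
As a rough illustration of what such a decomposition computes, the sketch below writes a gated delta rule two ways in plain Python: token by token, and per chunk using a cumulative sum of the log-gates, exponentials, dot products and a forward substitution (the role a triangular solve plays). It assumes a single head, a per-head scalar gate, and the common formulation S <- exp(g) S + k u^T with u = beta (v - S^T k). The exact conventions of ggml_gated_delta_net are not stated in this PR, and every function name here is hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def vec_mat(x, S):
    """S^T x for a d_k x d_v state S stored as a list of rows."""
    return [dot(x, col) for col in zip(*S)]

def gdn_sequential(q, k, v, g, beta, S):
    """Reference path: one token at a time. g is the log-decay (<= 0)."""
    out = []
    for qt, kt, vt, gt, bt in zip(q, k, v, g, beta):
        S = [[math.exp(gt) * s for s in row] for row in S]   # decay the state
        pred = vec_mat(kt, S)                                # state's prediction for k_t
        u = [bt * (vj - pj) for vj, pj in zip(vt, pred)]     # delta-rule correction
        S = [[s + ki * uj for s, uj in zip(row, u)] for row, ki in zip(S, kt)]
        out.append(vec_mat(qt, S))
    return out, S

def gdn_chunk(q, k, v, g, beta, S):
    """Same result for one chunk, from cumsum / exp / products / triangular solve."""
    L = len(q)
    c, acc = [], 0.0
    for gt in g:                                             # c = cumsum(g)
        acc += gt
        c.append(acc)
    # Right-hand side: beta * (v - exp(c_t) * S0^T k_t).
    rhs = [[beta[t] * (vj - math.exp(c[t]) * pj)
            for vj, pj in zip(v[t], vec_mat(k[t], S))] for t in range(L)]
    # Forward substitution for (I + A) U = rhs,
    # with A[t][j] = beta_t * exp(c_t - c_j) * (k_t . k_j) for j < t.
    U = []
    for t in range(L):
        row = list(rhs[t])
        for j in range(t):
            a = beta[t] * math.exp(c[t] - c[j]) * dot(k[t], k[j])
            row = [r - a * uj for r, uj in zip(row, U[j])]
        U.append(row)
    # Outputs: contribution of the incoming state plus the causal in-chunk part.
    out = []
    for t in range(L):
        o = [math.exp(c[t]) * x for x in vec_mat(q[t], S)]
        for j in range(t + 1):
            w = math.exp(c[t] - c[j]) * dot(q[t], k[j])
            o = [oi + w * uj for oi, uj in zip(o, U[j])]
        out.append(o)
    # State handed to the next chunk.
    S_new = [[math.exp(c[-1]) * s for s in row] for row in S]
    for j in range(L):
        w = math.exp(c[-1] - c[j])
        S_new = [[s + w * ki * uj for s, uj in zip(row, U[j])]
                 for row, ki in zip(S_new, k[j])]
    return out, S_new

def gdn_chunked(q, k, v, g, beta, S, chunk):
    """Tile the sequence into chunks, carrying the state between them."""
    out = []
    for s in range(0, len(q), chunk):
        e = s + chunk
        o, S = gdn_chunk(q[s:e], k[s:e], v[s:e], g[s:e], beta[s:e], S)
        out.extend(o)
    return out, S
```

In this sketch the two paths agree to within floating-point rounding, since the chunked form sums the same terms in a different order. All exponents are of the form c_t - c_j with j <= t inside one chunk, so they stay bounded by the chunk length.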

4. server: fix DFlash spec path + GPU greedy verify
The server processes a prompt in several ubatches (vs one decode in
speculative-simple), which exposed two crashes. Fixes:

- index target features by absolute position + accumulate across ubatches; reset
  per-request state in begin()
- set view buffers in dflash_promote_state so trace/promote also runs on CPU

Plus GPU greedy verify for pure-greedy requests (skip the per-block logits download).
Lossless (byte-identical to the host-verify path); ~2.0x -> 2.4x on reasoning.

Verification

- tests/test-gdn-chunked: ALL PASS on CPU and CUDA (~1e-8 vs the fused op)
- full ctest suite passes except for 2 failures (test-llama-archs grovemoe/EAGLE3,
  test-quant-type-selection), both pre-existing on the base branch (reproduced there)
- end-to-end lossless on V100 (Q8) and a CPU-only build

…re-tokenizer Qwen/Qwen3.5-4B is Qwen3_5ForConditionalGeneration (multimodal). Without this mapping neither the target nor the DFlash drafter converts to GGUF.

The DFlash drafter targets a Gated-DeltaNet hybrid (recurrent + attention). The recurrent state can't be partially rolled back, so a naive verify is slower than plain generation. This brings it to a lossless speedup:

- recurrent-state rewind via a per-token GDN state trace + on-device promote of the accepted state (llama_dflash_promote_state) instead of a ~50 MiB host checkpoint and re-decode per round
- graph reuse: fixed-capacity device-resident target-context cache, encoder folded into the decoder graph, padding mask over a bucketed context
- on-device greedy verify: drafter block argmax + target argmax (llama_set_out_argmax), one host sync per round
- optional GPU sampling verify (temperature; top-k/top-p behind LLAMA_SPEC_GPU_SAMPLE)
- CUDA graphs opt-in on Volta (GGML_CUDA_GRAPHS_VOLTA) and a stable sched uid

Lossless. ~1.7x on V100/Q8 single-stream, scaling with the draft block on high-acceptance (reasoning) workloads.
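
The trace-and-promote idea can be sketched on a toy recurrent layer. This is only an assumed shape of the mechanism: the class, its methods, and the scalar state are hypothetical stand-ins, and the real path keeps the trace on the device rather than in a Python list.

```python
class TracedState:
    """Toy recurrent layer: state <- decay * state + x (hypothetical names)."""

    def __init__(self, decay=0.5):
        self.decay = decay
        self.state = 0.0   # committed state, valid up to the last accepted token
        self.trace = []    # state after each token of the current verify block

    def verify_block(self, xs):
        """Run the whole draft block without committing, recording every step."""
        s = self.state
        self.trace = []
        for x in xs:
            s = self.decay * s + x
            self.trace.append(s)

    def promote(self, n_accept):
        """Commit the state reached after the first n_accept tokens."""
        if n_accept > 0:
            self.state = self.trace[n_accept - 1]
        self.trace = []
```

After `verify_block([1.0, 2.0, 3.0])` and `promote(2)`, the committed state is the one a sequential run over the first two tokens would have produced, with no checkpoint restore and no re-decode.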

Decompose the GDN recurrence into a pure ggml-op graph (cumsum, exp, mul_mat, tri, solve_tri, diag, concat) so the verify can run on backends that lack a fused GDN kernel (WebGPU, Metal, Vulkan). Multi-chunk tiling keeps exp(cumsum(g)) in fp32 range. The path handles both the vector (KDA) and per-head scalar gate, as well as GQA. Validated against ggml_gated_delta_net on CPU and CUDA to ~1e-8 (tests/test-gdn-chunked). Opt-in via LLAMA_GDN_CHUNKED; the default path is unchanged. This is for portability: on CUDA the fused kernel is faster and the GDN scan is not the verify bottleneck.
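
A quick arithmetic check shows why restarting the cumulative sum per chunk matters for fp32. The sketch assumes g is a log-decay (g <= 0) and uses an illustrative value of -1 per token; neither the sign convention nor the chunk sizes are taken from the PR.

```python
import math

FLT_MIN = 1.1754943508222875e-38  # smallest normal fp32 value

def min_cumulative_decay(g, chunk):
    """Smallest exp(cumsum(g)) seen when the cumsum restarts every `chunk` tokens."""
    smallest = 1.0
    for start in range(0, len(g), chunk):
        c = 0.0
        for gt in g[start:start + chunk]:
            c += gt
            smallest = min(smallest, math.exp(c))
    return smallest

g = [-1.0] * 128  # assumed log-decay per token, for illustration only

print(min_cumulative_decay(g, 128) < FLT_MIN)  # True: exp(-128) is below fp32 normal range
print(min_cumulative_decay(g, 64) < FLT_MIN)   # False: exp(-64) is still representable
```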

The server DFlash path was wired but crashed on every request, because the server processes a prompt in several ubatches while speculative-simple does it in one decode:

- index the target features by absolute position and accumulate across ubatches (a chunked prompt previously left the first draft reading stale features), and read the [n_total-n_new, n_total) slice in the drafter
- reset dflash_n_past per request in begin() (it carried over between requests)
- set the view buffers in dflash_promote_state so the trace/promote copy also runs on the CPU backend (this was a CUDA-only path that asserted on a null buffer)

Also add GPU greedy verify: for a pure-greedy request the target emits an on-device argmax of the verify block and the host skips the per-block logits download + CPU sampler. Enabled only after the first token is sampled from logits, reset per request; non-greedy requests fall back to the host sampler. Lossless (byte-identical to the host-verify path). ~2.0x -> 2.4x on reasoning.
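
The ubatch fix can be pictured with a small buffer keyed by absolute position. This is a sketch under assumptions: the class and method names are hypothetical, and features are plain Python values rather than device tensors.

```python
class TargetFeatureCache:
    """Hypothetical stand-in for the per-request target-feature buffer."""

    def __init__(self):
        self.feats = []   # feats[p] is the feature at absolute position p
        self.n_read = 0   # positions the drafter has already consumed

    def begin(self):
        """Reset per-request state so nothing carries over between requests."""
        self.feats = []
        self.n_read = 0

    def add_ubatch(self, pos0, ubatch_feats):
        """Store one ubatch at absolute positions [pos0, pos0 + len)."""
        end = pos0 + len(ubatch_feats)
        if len(self.feats) < end:
            self.feats.extend([None] * (end - len(self.feats)))
        self.feats[pos0:end] = ubatch_feats

    def take_new(self):
        """Return the [n_total - n_new, n_total) slice not yet seen by the drafter."""
        n_total = len(self.feats)
        new = self.feats[self.n_read:n_total]
        self.n_read = n_total
        return new
```

With this layout a prompt split into several ubatches still hands the first draft every position in order, which is the behavior the first bullet describes.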

AlexWortega added 4 commits June 12, 2026 18:11

convert: map the Qwen3.5-4B multimodal tokenizer hash to the qwen35 p…

10508e7



AlexWortega wants to merge 4 commits into ruixiang63:dflash from AlexWortega:work-qwen35-dflash
