DFlash on Qwen3.5-4B: hybrid rewind + GPU verify, server support, portable chunked GDN #5
Open
AlexWortega wants to merge 4 commits into
…re-tokenizer
Qwen/Qwen3.5-4B is Qwen3_5ForConditionalGeneration (multimodal). Without this mapping, neither the target nor the DFlash drafter converts to GGUF.
The DFlash drafter targets a Gated-DeltaNet hybrid (recurrent + attention). The recurrent state cannot be partially rolled back, so a naive verify is slower than plain generation. This brings it to a lossless speedup:
- recurrent-state rewind via a per-token GDN state trace + on-device promote of the accepted state (llama_dflash_promote_state) instead of a ~50 MiB host checkpoint and re-decode per round
- graph reuse: fixed-capacity device-resident target-context cache, encoder folded into the decoder graph, padding mask over a bucketed context
- on-device greedy verify: drafter block argmax + target argmax for the greedy verify (llama_set_out_argmax), one host sync per round
- optional GPU sampling verify (temperature; top-k/top-p behind LLAMA_SPEC_GPU_SAMPLE)
- CUDA graphs opt-in on Volta (GGML_CUDA_GRAPHS_VOLTA) and a stable sched uid
Lossless. ~1.7x on V100/Q8 single-stream, scaling with the draft block on high-acceptance (reasoning) workloads.
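To make the "lossless" claim for greedy verify concrete, here is a minimal Python sketch of the acceptance rule, not the PR's C++/CUDA code. `target_next` and `draft_block` are hypothetical stand-ins for the target and the drafter. The only point is that accepting the longest prefix where the draft equals the target argmax, then emitting the target's own token, reproduces plain greedy decoding.

```python
def target_next(ctx):
    # Hypothetical deterministic stand-in for the target's greedy next token.
    return (sum(ctx) * 31 + len(ctx)) % 17

def draft_block(ctx, n):
    # Hypothetical drafter: agrees with the target except when the context
    # length is a multiple of 3, so some drafts get rejected.
    out, c = [], list(ctx)
    for _ in range(n):
        t = target_next(c)
        if len(c) % 3 == 0:
            t = (t + 1) % 17
        out.append(t)
        c.append(t)
    return out

def greedy_verify(draft, target_argmax):
    """Return (n_accepted, emitted) for one speculative round.

    target_argmax[i] is the target's greedy choice after seeing draft[:i],
    so it has one more entry than the draft block.
    """
    assert len(target_argmax) == len(draft) + 1
    n_accepted = 0
    for d, t in zip(draft, target_argmax):
        if d != t:
            break
        n_accepted += 1
    # The target's token at the first mismatch (or the bonus token when the
    # whole block matched) is always emitted.
    return n_accepted, list(draft[:n_accepted]) + [target_argmax[n_accepted]]

def generate_greedy(prompt, n_new):
    ctx = list(prompt)
    for _ in range(n_new):
        ctx.append(target_next(ctx))
    return ctx[len(prompt):]

def generate_speculative(prompt, n_new, block):
    ctx = list(prompt)
    while len(ctx) - len(prompt) < n_new:
        draft = draft_block(ctx, block)
        # Stands in for one target pass over the block that returns only the
        # per-position argmax instead of full logits.
        target_argmax = [target_next(ctx + draft[:i]) for i in range(block + 1)]
        _, emitted = greedy_verify(draft, target_argmax)
        ctx.extend(emitted)
    return ctx[len(prompt):len(prompt) + n_new]
```

Because only the argmax ids are compared, a round needs the two small id arrays on the host rather than the full logits, which is consistent with the "one host sync per round" note above.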
Decompose the GDN recurrence into a pure ggml-op graph (cumsum, exp, mul_mat, tri, solve_tri, diag, concat) so the verify can run on backends that lack a fused GDN kernel (WebGPU, Metal, Vulkan). Multi-chunk tiling keeps exp(cumsum(g)) in fp32 range; handles both the vector (KDA) and per-head scalar gate, and GQA. Validated bitwise against ggml_gated_delta_net on CPU and CUDA (tests/test-gdn-chunked). Opt-in via LLAMA_GDN_CHUNKED; the default path is unchanged. This is for portability: on CUDA the fused kernel is faster and the GDN scan is not the verify bottleneck.
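Why multi-chunk tiling matters for the exp(cumsum(g)) range can be shown on a deliberately simplified recurrence. The Python sketch below covers only the scalar gated decay h_t = exp(g_t) * h_{t-1} + x_t; it omits the delta-rule correction and the solve_tri step of the real GDN, and the function names are hypothetical. The 88.0 bound is an approximation of log(FLT_MAX).

```python
import math

FP32_EXP_LIMIT = 88.0  # exp(x) overflows fp32 for x above roughly 88.7

def sequential(g, x):
    # Reference: token-by-token gated recurrence.
    h, out = 0.0, []
    for gt, xt in zip(g, x):
        h = math.exp(gt) * h + xt
        out.append(h)
    return out

def chunked(g, x, chunk):
    """Chunk-parallel form; returns (outputs, largest |local cumsum|)."""
    out, h_in, max_abs = [], 0.0, 0.0
    for s in range(0, len(g), chunk):
        gc, xc = g[s:s + chunk], x[s:s + chunk]
        # The cumulative log-gate restarts at every chunk boundary.
        c, acc = [], 0.0
        for gt in gc:
            acc += gt
            c.append(acc)
        max_abs = max(max_abs, max(abs(v) for v in c))
        # Factored decay exp(c_i) * exp(-c_j): the shape a matmul-based
        # decomposition would use. exp(-c_j) is the term that overflows when
        # the cumsum is taken over the whole sequence.
        scaled = [math.exp(-cj) * xj for cj, xj in zip(c, xc)]
        run = 0.0
        for ci, sj in zip(c, scaled):
            run += sj  # causal (lower-triangular) sum over j <= i
            out.append(math.exp(ci) * (h_in + run))
        h_in = out[-1]  # carry the state into the next chunk
    return out, max_abs

# Deterministic toy inputs: 4096 tokens, log-gates in [-0.5, -0.1].
T = 4096
g = [-0.1 - 0.4 * ((t * 7) % 10) / 10 for t in range(T)]
x = [((t * 13) % 11 - 5) / 5 for t in range(T)]

ref = sequential(g, x)
got, local_max = chunked(g, x, chunk=64)
global_max = abs(sum(g))  # what a single whole-sequence cumsum would reach
```

With these inputs the per-chunk cumsum stays within 64 * 0.5 = 32 in magnitude, while a whole-sequence cumsum would exceed 400, far outside what exp can represent in fp32.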
The server DFlash path was wired but crashed on every request, because the server processes a prompt in several ubatches while speculative-simple does it in one decode:
- index the target features by absolute position and accumulate across ubatches (a chunked prompt previously left the first draft reading stale features), and read the [n_total-n_new, n_total) slice in the drafter
- reset dflash_n_past per request in begin() (it carried over between requests)
- set the view buffers in dflash_promote_state so the trace/promote copy also runs on the CPU backend (was a CUDA-only path, asserted on a null buffer)
Also add GPU greedy verify: for a pure-greedy request the target emits an on-device argmax of the verify block and the host skips the per-block logits download + CPU sampler. Enabled only after the first token is sampled from logits, reset per request; non-greedy requests fall back to the host sampler. Lossless (byte-identical to the host-verify path). ~2.0x -> 2.4x on reasoning.
Builds on this DFlash PR to make speculative decoding actually faster than
autoregressive on the Qwen3.5-4B Gated-DeltaNet hybrid, add CPU support, land
it in llama-server, and add a portable (WebGPU/Metal/Vulkan) GDN verify path.
Branches from ruixiang63:dflash (67cb0d50), so this shows exactly the 4 commits below.
Commits
1. convert: map the Qwen3.5-4B multimodal tokenizer hash to the qwen35 pre-tokenizer
Qwen/Qwen3.5-4B is Qwen3_5ForConditionalGeneration; without the mapping neither
target nor drafter converts to GGUF.
2. dflash: make speculative decoding work and fast on the hybrid
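The recurrent state can't be partially rolled back, so a naive verify is slower than
plain generation. Fixes:
- recurrent-state rewind: per-token GDN state trace + on-device promote of the
  accepted state, instead of a ~50 MiB host checkpoint + re-decode per round
- graph reuse: fixed-capacity device-resident target-context cache, encoder folded
  into the decoder graph, padding mask over a bucketed context
- on-device greedy verify (drafter block argmax, llama_set_out_argmax for the
  target), one host sync/round; optional GPU sampling verify (temperature)
- CUDA graphs opt-in on Volta (GGML_CUDA_GRAPHS_VOLTA)
Lossless; ~1.7x on V100/Q8 single-stream, scaling with the draft block on
high-acceptance (reasoning) workloads.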
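The rewind idea can be illustrated with a toy recurrence. This is a Python sketch under stated assumptions, not the llama_dflash_promote_state implementation: `step` is a hypothetical stand-in for a recurrent layer update, and the state is a single integer instead of a device tensor. It shows that keeping the state after every verify token lets the accepted state be selected directly, giving the same result as restoring a checkpoint and re-decoding.

```python
def step(state, tok):
    # Hypothetical recurrent update standing in for one GDN layer state.
    return (state * 1103515245 + tok + 12345) % (2 ** 31)

def verify_with_trace(state, tokens):
    """Run the verify block once and keep the state after every token."""
    trace = []
    for t in tokens:
        state = step(state, t)
        trace.append(state)
    return trace

def promote(trace, n_keep):
    """Pick the state after the first n_keep verify tokens (n_keep >= 1)."""
    return trace[n_keep - 1]

def rewind_by_redecode(checkpoint, tokens, n_keep):
    """Baseline: restore the checkpoint, then re-run the kept tokens."""
    state = checkpoint
    for t in tokens[:n_keep]:
        state = step(state, t)
    return state

checkpoint = 42                 # state before the verify block
verify_tokens = [3, 8, 1, 4, 9]  # last sampled token followed by the draft
trace = verify_with_trace(checkpoint, verify_tokens)
```

The trace costs one extra state copy per verify token, but the promote is a single selection, whereas the baseline repeats the recurrent update for every kept token each round.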
3. gdn: portable chunk-parallel Gated-DeltaNet verify path (opt-in)
Decomposes the GDN recurrence into a pure ggml-op graph (cumsum/exp/mul_mat/tri/
solve_tri/diag) so the verify can run on backends without a fused GDN kernel
(WebGPU/Metal/Vulkan). Multi-chunk tiling, vector + scalar gates, GQA. Bitwise-validated
vs ggml_gated_delta_net on CPU and CUDA (tests/test-gdn-chunked).
Opt-in via LLAMA_GDN_CHUNKED; default path unchanged.
4. server: fix DFlash spec path + GPU greedy verify
The server processes a prompt in several ubatches (vs one decode in
speculative-simple), which exposed two crashes:
- index target features by absolute position and accumulate across ubatches; reset
  per-request state in begin()
- set the view buffers in dflash_promote_state so trace/promote also runs on CPU
Plus GPU greedy verify for pure-greedy requests (skip the per-block logits download).
Lossless (byte-identical to the host-verify path); ~2.0x -> 2.4x on reasoning.
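The ubatch fix amounts to simple bookkeeping, sketched below in Python. `FeatureCache` and its methods are hypothetical names for illustration, not the server's data structures. Features are appended at their absolute position as each ubatch arrives, the drafter reads the slice [n_total - n_new, n_total), and `begin()` clears everything per request.

```python
class FeatureCache:
    """Target features indexed by absolute token position."""

    def __init__(self):
        self.feats = []  # feats[p] is the feature at absolute position p
        self.n_new = 0   # positions added since the drafter last read

    def begin(self):
        # Per-request reset so nothing carries over between requests.
        self.feats.clear()
        self.n_new = 0

    def add_ubatch(self, pos0, ubatch_feats):
        # Each ubatch must continue exactly where the previous one ended.
        if pos0 != len(self.feats):
            raise ValueError(f"ubatch starts at {pos0}, expected {len(self.feats)}")
        self.feats.extend(ubatch_feats)
        self.n_new += len(ubatch_feats)

    def take_new(self):
        # The drafter reads the slice [n_total - n_new, n_total).
        n_total = len(self.feats)
        out = self.feats[n_total - self.n_new:n_total]
        self.n_new = 0
        return out

cache = FeatureCache()
cache.begin()
cache.add_ubatch(0, ["f0", "f1", "f2"])  # prompt split across two ubatches
cache.add_ubatch(3, ["f3", "f4"])
first_read = cache.take_new()
cache.add_ubatch(5, ["f5"])              # one decoded token
second_read = cache.take_new()
cache.begin()                            # next request starts from scratch
cache.add_ubatch(0, ["g0"])
third_read = cache.take_new()
```

Accumulating `n_new` across ubatches is what lets the first draft after a chunked prompt see every prompt position rather than only the last ubatch.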
Verification
- tests/test-gdn-chunked: ALL PASS on CPU and CUDA (~1e-8 vs the fused op)
- ctest suite passes; the only 2 failures (test-llama-archs grovemoe/EAGLE3, test-quant-type-selection) are pre-existing on the base branch (reproduced there)