DFlash on Qwen3.5-4B: hybrid rewind + GPU verify, server support, portable chunked GDN #5

Open
AlexWortega wants to merge 4 commits into ruixiang63:dflash from AlexWortega:work-qwen35-dflash

Builds on this DFlash PR to make speculative decoding actually faster than
autoregressive decoding on the Qwen3.5-4B Gated-DeltaNet hybrid, add CPU support,
land it in llama-server, and add a portable (WebGPU/Metal/Vulkan) GDN verify path.

Branches from ruixiang63:dflash (67cb0d50), so this shows exactly the 4 commits below.

Commits

1. convert: map the Qwen3.5-4B multimodal tokenizer hash to the qwen35 pre-tokenizer
Qwen/Qwen3.5-4B is Qwen3_5ForConditionalGeneration; without the mapping neither
target nor drafter converts to GGUF.
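
A minimal sketch of the kind of lookup this commit extends, assuming the converter identifies a pre-tokenizer by hashing the token ids that the tokenizer produces for a fixed probe string. The probe text, token ids, and function names below are hypothetical stand-ins, not the converter's real values.

```python
import hashlib

# Hypothetical probe text; the real converter uses its own probe string.
PROBE_TEXT = "Hello, world! 123"

def tokenizer_fingerprint(token_ids):
    """Hash the token ids a tokenizer produced for the probe text."""
    return hashlib.sha256(str(token_ids).encode()).hexdigest()

def resolve_pre_tokenizer(token_ids, known):
    """Return the pre-tokenizer name, or fail the conversion if the hash is unknown."""
    fp = tokenizer_fingerprint(token_ids)
    if fp not in known:
        raise NotImplementedError(f"unrecognized tokenizer fingerprint {fp[:12]}")
    return known[fp]

# Stand-in token ids for two tokenizer variants (made up for illustration).
text_only_ids = [9707, 11, 1879, 0]
multimodal_ids = [9707, 11, 1879, 0, 151643]

# Before the fix: only one variant is known, so the other fails to convert.
known = {tokenizer_fingerprint(text_only_ids): "qwen35"}

# The fix: map the second fingerprint to the same pre-tokenizer.
known[tokenizer_fingerprint(multimodal_ids)] = "qwen35"
```

With the extra entry, both variants resolve to `qwen35`; without it, `resolve_pre_tokenizer` raises for the multimodal fingerprint, which mirrors the "neither target nor drafter converts" failure described above.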

2. dflash: make speculative decoding work and fast on the hybrid
The recurrent state can't be partially rolled back, so a naive verify is slower than
plain generation. Fixes:

  • recurrent-state rewind via a per-token GDN state trace + on-device promote of the
    accepted state, instead of a ~50 MiB host checkpoint + re-decode per round
  • graph reuse: device-resident target-context cache, encoder folded into the decoder
    graph, padding mask over a bucketed context
  • on-device greedy verify (drafter block argmax + llama_set_out_argmax for the
    target), one host sync/round; optional GPU sampling verify (temperature)
  • CUDA graphs opt-in on Volta (GGML_CUDA_GRAPHS_VOLTA)

Lossless; ~1.7x on V100/Q8 single-stream, scaling with the draft block on
high-acceptance (reasoning) workloads.
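
The rewind idea can be sketched with a toy recurrent state: record the state after every token of the verify block, then promote the entry that matches the number of accepted tokens. The class, method names, and the toy `step` function are hypothetical; the real path keeps the trace on the device and promotes it there.

```python
class TracedRecurrentState:
    """Toy recurrent state with a per-token trace for the current verify block."""

    def __init__(self, state):
        self.state = state   # committed state (before the verify block)
        self.trace = []      # trace[i] = state after consuming block token i

    def verify_block(self, tokens, step):
        """Run the whole block once, recording the state after each token."""
        self.trace = []
        s = self.state
        for tok in tokens:
            s = step(s, tok)
            self.trace.append(s)

    def promote(self, n_accepted):
        """Commit the state after the accepted prefix; no checkpoint, no re-decode."""
        if n_accepted > 0:
            self.state = self.trace[n_accepted - 1]
        self.trace = []


def step(state, token):
    # Toy recurrence standing in for a GDN layer update.
    return (state * 31 + token) % 1_000_003


rs = TracedRecurrentState(state=7)
rs.verify_block([4, 8, 15, 16], step)
rs.promote(n_accepted=2)   # only the first two draft tokens were accepted
```

After `promote(2)` the committed state equals what two plain decode steps would have produced. The cost is one stored state per block token instead of a host checkpoint plus a second decode per round.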

3. gdn: portable chunk-parallel Gated-DeltaNet verify path (opt-in)
Decomposes the GDN recurrence into a pure ggml-op graph (cumsum/exp/mul_mat/tri/
solve_tri/diag) so the verify can run on backends without a fused GDN kernel
(WebGPU/Metal/Vulkan). Multi-chunk tiling, vector + scalar gates, GQA. Bitwise-
validated vs ggml_gated_delta_net on CPU and CUDA (tests/test-gdn-chunked).
Opt-in via LLAMA_GDN_CHUNKED; default path unchanged.
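
A pure-Python sketch of the decomposition, assuming the commonly written gated delta rule with a scalar gate per head: `S_t = exp(g_t) * S_{t-1} + beta_t * (v_t - exp(g_t) * S_{t-1} k_t) k_t^T`, `o_t = S_t q_t`. The exact conventions of ggml_gated_delta_net (scaling, layout, gate shape) may differ, and the vector gate and GQA are omitted here. The chunk form uses a cumsum of log-gates, pairwise decays, and a forward substitution that plays the role of the triangular solve.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def gdn_sequential(q, k, v, g, beta, S):
    """Token-by-token reference; S is d_v x d_k (list of rows)."""
    S = [row[:] for row in S]
    out = []
    for t in range(len(k)):
        a = math.exp(g[t])                          # decay in (0, 1], g[t] <= 0
        S = [[a * x for x in row] for row in S]
        for i, row in enumerate(S):
            d = beta[t] * (v[t][i] - dot(row, k[t]))  # delta-rule correction
            for j in range(len(row)):
                row[j] += d * k[t][j]
        out.append([dot(row, q[t]) for row in S])
    return out, S

def gdn_chunk(q, k, v, g, beta, S):
    """One chunk in closed form, starting from carried-in state S."""
    n = len(k)
    c, acc = [], 0.0
    for x in g:                                     # cumsum of log-gates in the chunk
        acc += x
        c.append(acc)
    u = []                                          # corrected per-token values
    for t in range(n):
        rhs = [beta[t] * (v[t][i] - math.exp(c[t]) * dot(S[i], k[t]))
               for i in range(len(S))]
        for s in range(t):                          # forward substitution (triangular solve)
            a = beta[t] * math.exp(c[t] - c[s]) * dot(k[t], k[s])
            rhs = [r - a * x for r, x in zip(rhs, u[s])]
        u.append(rhs)
    out = []
    for t in range(n):
        o = [math.exp(c[t]) * dot(row, q[t]) for row in S]   # carried-in state term
        for s in range(t + 1):                               # causal in-chunk term
            w = math.exp(c[t] - c[s]) * dot(k[s], q[t])
            o = [x + w * y for x, y in zip(o, u[s])]
        out.append(o)
    S_new = [[math.exp(c[-1]) * x for x in row] for row in S]
    for s in range(n):
        w = math.exp(c[-1] - c[s])
        for i, row in enumerate(S_new):
            for j in range(len(row)):
                row[j] += w * u[s][i] * k[s][j]
    return out, S_new

def gdn_chunked(q, k, v, g, beta, S, chunk):
    """Multi-chunk tiling: the state is carried from one chunk to the next."""
    out = []
    for lo in range(0, len(k), chunk):
        hi = lo + chunk
        o, S = gdn_chunk(q[lo:hi], k[lo:hi], v[lo:hi], g[lo:hi], beta[lo:hi], S)
        out += o
    return out, S

random.seed(0)
T, dk, dv = 10, 3, 2
vec = lambda n: [random.uniform(-0.5, 0.5) for _ in range(n)]
q = [vec(dk) for _ in range(T)]
k = [vec(dk) for _ in range(T)]
v = [vec(dv) for _ in range(T)]
g = [random.uniform(-1.0, 0.0) for _ in range(T)]
beta = [random.uniform(0.0, 1.0) for _ in range(T)]
S0 = [[0.0] * dk for _ in range(dv)]

ref_out, ref_S = gdn_sequential(q, k, v, g, beta, S0)
chk_out, chk_S = gdn_chunked(q, k, v, g, beta, S0, chunk=4)
max_err = max(abs(a - b) for r, c_ in zip(ref_out, chk_out) for a, b in zip(r, c_))
```

In this toy the two paths differ only by floating-point reordering. In a ggml graph the inner loops would become batched matrix products and a single triangular solve per chunk, which is what lets the verify run without a fused kernel.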

4. server: fix DFlash spec path + GPU greedy verify
The server processes a prompt in several ubatches (vs one decode in speculative-
simple), which exposed two crashes:

  • index target features by absolute position + accumulate across ubatches; reset
    per-request state in begin()
  • set view buffers in dflash_promote_state so trace/promote also runs on CPU

Plus GPU greedy verify for pure-greedy requests (skip the per-block logits download).
Lossless (byte-identical to the host-verify path); ~2.0x -> 2.4x on reasoning.

Verification

  • tests/test-gdn-chunked: ALL PASS on CPU and CUDA (~1e-8 vs the fused op)
  • full ctest suite passes except for 2 failures (test-llama-archs grovemoe/EAGLE3,
    test-quant-type-selection), both pre-existing on the base branch (reproduced there)
  • end-to-end lossless on V100 (Q8) and a CPU-only build

…re-tokenizer

Qwen/Qwen3.5-4B is Qwen3_5ForConditionalGeneration (multimodal). Without this
mapping neither the target nor the DFlash drafter converts to GGUF.

The DFlash drafter targets a Gated-DeltaNet hybrid (recurrent + attention). The
recurrent state can't be partially rolled back, so a naive verify is slower than
plain generation. This commit turns it into a lossless speedup:

- recurrent-state rewind via a per-token GDN state trace + on-device promote of
  the accepted state (llama_dflash_promote_state) instead of a ~50 MiB host
  checkpoint and re-decode per round
- graph reuse: fixed-capacity device-resident target-context cache, encoder
  folded into the decoder graph, padding mask over a bucketed context
- on-device greedy verify: drafter block argmax + target argmax for the greedy
  verify (llama_set_out_argmax), one host sync per round
- optional GPU sampling verify (temperature; top-k/top-p behind LLAMA_SPEC_GPU_SAMPLE)
- CUDA graphs opt-in on Volta (GGML_CUDA_GRAPHS_VOLTA) and a stable sched uid

Lossless. ~1.7x on V100/Q8 single-stream, scaling with the draft block on
high-acceptance (reasoning) workloads.
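
A sketch of the greedy acceptance rule that makes the verify lossless, assuming the target produces one argmax per draft position plus one for the position after the block. The function name and block layout are illustrative assumptions; the actual DFlash block layout may differ.

```python
def greedy_accept(draft, target_argmax):
    """Accept the longest draft prefix that matches the target's greedy choice.

    draft:         k proposed tokens
    target_argmax: k + 1 tokens; entry i is the target's argmax at the position
                   draft[i] occupies, entry k follows the full draft
    Returns (n_accepted, emitted_tokens).
    """
    n = 0
    while n < len(draft) and draft[n] == target_argmax[n]:
        n += 1
    # The target's own token at the first mismatch (or after the block) is always emitted.
    return n, draft[:n] + [target_argmax[n]]

n_ok, emitted = greedy_accept([5, 7, 9, 2], [5, 7, 1, 4, 8])
```

Every emitted token is one the target would have picked greedily on its own, which is why the output matches plain greedy decoding. Only the two small integer arrays need to be compared, so a single host sync per round is enough.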

Decompose the GDN recurrence into a pure ggml-op graph (cumsum, exp, mul_mat,
tri, solve_tri, diag, concat) so the verify can run on backends that lack a
fused GDN kernel (WebGPU, Metal, Vulkan). Multi-chunk tiling keeps exp(cumsum(g))
in fp32 range; handles both the vector (KDA) and per-head scalar gate, and GQA.

Validated bitwise against ggml_gated_delta_net on CPU and CUDA
(tests/test-gdn-chunked). Opt-in via LLAMA_GDN_CHUNKED; the default path is
unchanged. This is for portability: on CUDA the fused kernel is faster and the
GDN scan is not the verify bottleneck.
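
A small sketch of why tiling matters for range, assuming log-gates `g <= 0` and a cumsum that restarts at each chunk boundary. The gate value of -0.5 and the chunk size of 64 are made-up illustration values.

```python
import math

FP32_MIN_NORMAL = 1.1754943508222875e-38   # smallest normal fp32 value

def min_cumsum(g, chunk):
    """Most negative running sum of log-gates when the sum restarts every `chunk` tokens."""
    worst = 0.0
    for lo in range(0, len(g), chunk):
        acc = 0.0
        for x in g[lo:lo + chunk]:
            acc += x
            worst = min(worst, acc)
    return worst

def exp_stays_normal_fp32(g, chunk):
    """True if exp(cumsum(g)) never drops below the smallest normal fp32 value."""
    return min_cumsum(g, chunk) > math.log(FP32_MIN_NORMAL)

g = [-0.5] * 4096
whole_sequence_ok = exp_stays_normal_fp32(g, chunk=len(g))   # cumsum reaches -2048
tiled_ok = exp_stays_normal_fp32(g, chunk=64)                # cumsum reaches -32
```

`math.log(FP32_MIN_NORMAL)` is about -87.3, so a single cumsum over the whole sequence underflows long before 4096 tokens, while the per-chunk cumsum stays far inside the representable range.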

The server DFlash path was wired but crashed on every request, because the
server processes a prompt in several ubatches while speculative-simple does it
in one decode:

- index the target features by absolute position and accumulate across ubatches
  (a chunked prompt previously left the first draft reading stale features), and
  read the [n_total-n_new, n_total) slice in the drafter
- reset dflash_n_past per request in begin() (it carried over between requests)
- set the view buffers in dflash_promote_state so the trace/promote copy also
  runs on the CPU backend (was a CUDA-only path, asserted on a null buffer)
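
The ubatch handling can be sketched as a small cache that stores target features by absolute position and hands the drafter only the slice it has not consumed yet. The class, its methods, and the `n_past` bookkeeping are hypothetical stand-ins for the real state (dflash_n_past and the device-resident feature buffer).

```python
class TargetFeatureCache:
    """Toy per-request cache of target features, indexed by absolute position."""

    def __init__(self):
        self.begin()

    def begin(self):
        """Reset per-request state so nothing carries over between requests."""
        self.feats = []
        self.n_past = 0      # positions already consumed by the drafter

    def add_ubatch(self, pos0, feats):
        """Store one ubatch of features at absolute positions [pos0, pos0 + len)."""
        end = pos0 + len(feats)
        if end > len(self.feats):
            self.feats.extend([None] * (end - len(self.feats)))
        self.feats[pos0:end] = feats

    def take_new(self):
        """Return the [n_total - n_new, n_total) slice and mark it consumed."""
        n_total = len(self.feats)
        new = self.feats[self.n_past:n_total]
        self.n_past = n_total
        return new


cache = TargetFeatureCache()
cache.add_ubatch(0, ["f0", "f1", "f2"])   # prompt, first ubatch
cache.add_ubatch(3, ["f3", "f4"])         # prompt, second ubatch
first_draft_input = cache.take_new()      # sees all five positions
```

Indexing by absolute position means a prompt split across ubatches still reaches the first draft in full, and calling `begin()` at the start of each request prevents the previous request's offset from leaking into the next one.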

Also add GPU greedy verify: for a pure-greedy request the target emits an
on-device argmax of the verify block and the host skips the per-block logits
download + CPU sampler. Enabled only after the first token is sampled from
logits, reset per request; non-greedy requests fall back to the host sampler.

Lossless (byte-identical to the host-verify path). ~2.0x -> 2.4x on reasoning.