Gemma 4 MTP: bring in the upstream MTP lineage (qwen35 post-norm + gemma4) on TurboQuant+ #172
This fork's pre-norm MTP optimization is superseded by upstream's maintained post-norm/nextn MTP, brought in by the commits that follow. Removing it here so that lineage applies onto a clean base instead of colliding with it.
* llama: avoid copying logits during prompt decode in MTP
* review: update comment
* llama-graph: call set_output for t_h_pre_norm

(cherry picked from commit 3e12fbd)
* llama : disable equal splits for recurrent memory with partial rollback
* spec : re-enable p-min with MTP drafts
* spec : re-enable ngram spec in combination with RS rollback
* spec : fix ngram-map-* params
* spec : fix acceptance logic in combined ngram + draft configs
* graph : fix reuse for combined `token` + `embd` batches
* spec : log parameters for each speculative implementation
  - add LOG_INF in each constructor with implementation type and parameters
  - extract device string logic into common_speculative_get_devices_str()
  - move 'adding speculative implementation' log from init into constructors

  Assisted-by: llama.cpp:local pi
* spec : extend --spec-default with ngram-map-k4v

  Assisted-by: llama.cpp:local pi
* minor : fix n_embd log
* args : update draft.n_max == 3 + regen docs
* spec : relax ngram-mod rejection thold to 0.25 @ 5 low
* logs : improve
* docs : update speculative decoding CLI argument documentation
  - Add missing draft model CPU scheduling and tensor override parameters
  - Update --spec-type to include all available types (excluding draft-eagle3 WIP)
  - Fix default values to match implementation (n_max=3, n_min=0, p_min=0.0)
  - Remove deprecated options (spec-draft-ctx-size, spec-draft-replace)
  - Add environment variables for new parameters

  Assisted-by: llama.cpp:local pi
* arg : step-back on adding k4v to the default spec config
* cont : fix name

(cherry picked from commit d14ce3d)
This commit attempts to clarify a code comment in graph_mtp regarding where the MTP layer is stored. The motivation for this is that it was not obvious to me what the original comment meant and hopefully this makes it clearer. (cherry picked from commit baf3cc6)
* Move to backend sampling for MTP draft path

  Run top_k(10) on the draft backend. D2H transfers happen only for the top 10 logits.
  Make backend sampling more robust and fall back to CPU on failure cases, such as with "-sm tensor" or when a backend doesn't support TOP_K.
* Allow sampler chains to be partially offloaded to backend
* Add --spec-draft-backend-sampling argument. Enabled by default.

(cherry picked from commit ad27757)
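To illustrate why running top-k on the draft backend shrinks the device-to-host transfer, here is a minimal CPU-side sketch of the selection step. It is not the code from the commit: `top_k_logits` is a hypothetical helper, and in the real path the selection would run on the backend so that only the `k` surviving (token id, logit) pairs need to be copied back.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper (not a llama.cpp API): returns up to k (token id, logit)
// pairs with the largest logits, sorted in descending logit order.
std::vector<std::pair<int, float>> top_k_logits(const std::vector<float> & logits, std::size_t k) {
    std::vector<std::pair<int, float>> cand;
    cand.reserve(logits.size());
    for (std::size_t i = 0; i < logits.size(); ++i) {
        cand.emplace_back(static_cast<int>(i), logits[i]);
    }

    // Never ask for more candidates than the vocabulary holds.
    k = std::min(k, cand.size());

    // Only the first k entries need to be ordered; the rest are discarded.
    std::partial_sort(cand.begin(), cand.begin() + static_cast<std::ptrdiff_t>(k), cand.end(),
        [](const std::pair<int, float> & a, const std::pair<int, float> & b) {
            return a.second > b.second;
        });

    cand.resize(k);
    return cand;
}
```

With `k = 10`, the result holds at most 10 pairs regardless of vocabulary size, which is the property the commit message relies on.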
…r SWA-only models (ggml-org#23131)

When a model has zero non-SWA attention layers (e.g. a SWA-only slice of Gemma 4), the base KV cache has no layer tensors. The input tensors (self_k_idxs, self_v_idxs, self_kq_mask) are created as graph input nodes but never consumed by any compute node, so the backend scheduler never allocates a buffer for them. Calling mctx->get_base()->set_input_k_idxs() on an unallocated tensor then hits GGML_ASSERT(buffer) at ggml-backend.cpp:194.

The same scenario applies symmetrically: if a model had zero SWA layers, the SWA tensors would be unallocated.

Fix: guard both the base and SWA set_input calls with null/buffer checks, matching the pattern already used by llm_graph_input_mem_hybrid_iswa::set_input (line ~674) which has the comment: 'base tensors may not be allocated if there are no non-SWA attention layers'.

Also fix can_reuse() in the same class to skip the ne[0] and kq_mask checks for unallocated tensors, preventing a null-dereference on the reuse path.

(cherry picked from commit eeeaf61)
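The guard described above can be sketched in isolation. This is a simplified model, not the upstream patch: `mock_tensor`, `mock_buffer`, and `set_input_idxs_guarded` are hypothetical stand-ins for the ggml tensor, its backend buffer, and the guarded set_input call.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for a ggml tensor and its backend buffer.
struct mock_buffer {
    std::vector<int64_t> data;
};

struct mock_tensor {
    // Stays null when no compute node consumes the tensor, because the
    // scheduler then never allocates a buffer for it.
    mock_buffer * buffer = nullptr;
};

// Writes idxs into the tensor's buffer if one exists.
// Returns true if the write happened, false if the tensor was skipped.
bool set_input_idxs_guarded(mock_tensor * t, const std::vector<int64_t> & idxs) {
    // Null/buffer check: skip instead of asserting on an unallocated tensor.
    if (t == nullptr || t->buffer == nullptr) {
        return false;
    }
    t->buffer->data = idxs;
    return true;
}
```

Applying the same check to both the base and the SWA side covers the symmetric case where either sub-cache has no layers.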
when doing a follow-up decode for the draft model, we were always doing the logit computation even though it is not required. (cherry picked from commit 12e5d99)
* Add NVFP4 MTP scale tensors
* Link Qwen3.5 MTP tensors
* Aligned nullptr

(cherry picked from commit b0df4c0)
* llama: add llm_graph_input_mtp
* rename input_mtp -> input_token_embd
* add TODO about mtmd embedding
* cont : clean-up

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

(cherry picked from commit eef59a7)
…gml-org#23988)

* speculative : add common_speculative_n_max helper function

  Extract the speculative max-draft-size logic from server_n_outputs_max into a reusable common_speculative_n_max() function in common/speculative.

  Assisted-by: llama.cpp:local pi
* cont : draft context always has n_parallel outputs
* llama : log n_outputs_max
* speculative : remove draft-simple auto-enable
* ci : enable server tests on PRs

(cherry picked from commit 5dcb711)
* tests : add support for qwen3 SSM archs
* arch : add LLM_KV_ATTENTION_RECURRENT_LAYERS
* cont : naming + TODOs

(cherry picked from commit 06938ac)
* qwen35: use post-norm hidden state for MTP
* rename pre_norm to nextn
* fix step35

(cherry picked from commit 166fe29)
* hparams : refactor hparams.n_layer
* cont : remove `n_layer_kv()`, use n_layer_all instead
* cont : type consistency
* pi : update SYSTEM.md
* models : fix Step3.5 MTP
* cont : remove duplicate switch cases
* cont : explicitly set `false` to extra layers for `is_swa` and `is_recr`
* cont : fix nextn layer count handling

  Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

(cherry picked from commit 7acb4e8)
(cherry picked from commit 8a091c4)
(cherry picked from commit 04eb4c4)
Integration glue so the upstream MTP lineage (ggml-org#23198..ggml-org#23398) builds on this fork without disturbing TurboQuant+ or the custom kernels:

- llama_kv_cache ctor: thread the new `hparams` param and `layer_share_cb` through all call sites (iswa, memory-hybrid, dsa, model.cpp); keep the fork's turbo auto-asymmetric K upgrade, n_layer_kv() sizing (+3 rotation tensors), and per-side LLAMA_ATTN_ROT_* policy (default OFF), now nested under the new `if (other) { share } else { ... }` KV-sharing branch.
- hparams: carry n_layer_all/n_layer_nextn + n_layer()/n_layer_kv() from the refactor while keeping the fork's n_layer_kv_from_start; restore the swa_layers->is_swa_impl / recurrent_layer_arr->is_recr_impl / nextn_predict_layers->n_layer_nextn renames across fork models.
- add n_outputs_max to cparams / common_params / llama_context_params and wire it through; restore deepstack_mapping_arr.
- server: keep the ggml-org#23398 ctx_other (MTP draft KV-sharing) wiring; drop the ggml-org#23988 --fit VRAM pre-estimation block (depends on upstream helpers not on this fork; MTP does not need it).
- drop upstream-only models pulled in by the refactor (deepseek32, mellum, talkie); keep non-MTP fork models on their own source + mechanical refactor.

Builds clean on Metal; turbo quant unit test passes (turbo2/3/4 round-trip). Kernels (ggml-cuda / ggml-metal) untouched.
dequantize4() returns vec4; the USE_DECODE_K / USE_DECODE_V sites assigned it to f16vec4 locals/shared buffers, which glslc rejects under explicit-arithmetic-types. The f16-cache and zero-fill branches were already correct. Wrap each in f16vec4(). Unblocks coopmat1 FA with quantized KV caches (--cache-type-k q8_0 / --cache-type-v turbo4) on RDNA4.
Two fixes so a standalone MTP draft GGUF whose layers are ALL nextn (gemma4-assistant: n_layer_all == n_layer_nextn, so n_layer() == 0) initializes and engages speculative decoding:

1. llama-kv-cache.cpp: the ctor iterated hparams.n_layer() (excludes nextn layers) for the per-layer KV loop; the ggml-org#24060 reconciliation wired it to the nextn-excluding method, but upstream loops the full hparams.n_layer member. With n_layer() == 0 the draft registered ZERO KV layers -> map_layer_ids empty -> get_k(0) threw std::out_of_range during draft-context reserve. Loop over hparams.n_layer_all instead; has_kv() still gates per-layer.
2. llama-graph.cpp: port upstream ggml-org#24294: guard the iSWA kq_mask on its own buffer in set_input/can_reuse (base and swa). A SWA-only draft head leaves the base sub-cache empty, so its mask buffer is null.

Verified on RDNA4/Vulkan: gemma4-12B MTP assistant loads, drafts at ~0.56-0.75 acceptance with q8_0 K / turbo4 V.
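The first fix can be reproduced with a small model of the layer bookkeeping. This is a sketch under assumptions, not the fork's code: `mock_hparams` and `build_map_layer_ids` are hypothetical, and only the relationship n_layer() = n_layer_all - n_layer_nextn and the has_kv() gate are taken from the description above.

```cpp
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical, simplified hparams. Names mirror the commit message;
// the bodies are assumptions made for illustration.
struct mock_hparams {
    uint32_t n_layer_all   = 0; // every layer in the GGUF
    uint32_t n_layer_nextn = 0; // MTP (nextn) layers
    std::vector<bool> kv;       // per layer: does it own KV tensors?

    // Layer count excluding nextn layers.
    uint32_t n_layer() const { return n_layer_all - n_layer_nextn; }

    bool has_kv(uint32_t il) const { return il < kv.size() && kv[il]; }
};

// Builds a layer-id -> cache-slot map by visiting the first n_loop layers
// and registering only those that pass the has_kv() gate.
std::map<uint32_t, uint32_t> build_map_layer_ids(const mock_hparams & hp, uint32_t n_loop) {
    std::map<uint32_t, uint32_t> ids;
    for (uint32_t il = 0; il < n_loop; ++il) {
        if (!hp.has_kv(il)) {
            continue;
        }
        const uint32_t slot = static_cast<uint32_t>(ids.size());
        ids[il] = slot;
    }
    return ids;
}
```

For an all-nextn draft, passing `hp.n_layer()` as the loop bound yields an empty map, so any later lookup of layer 0 fails; passing `hp.n_layer_all` registers every layer that has_kv() accepts.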
Pushed two hardening commits (validated on the RDNA4/Vulkan box):
The bundled-MTP case (Qwen 3.5/3.6: base layers + nextn in one GGUF) is unaffected. Verified: gemma4-12B MTP assistant loads and drafts on RDNA4/Vulkan, q8_0 K / turbo4 V, acceptance ~0.56–0.75 depending on workload.
Hardware validation of
MTP self-spec is now a net +15% on RDNA4/Vulkan with this lineage (it was a net loss on the pre-PR fork build), likely thanks in part to the backend-sampling draft path. The Vulkan f16vec4 cast fix also gates the whole thing: without it glslc rejects the cm1 FA shaders for quantized KV caches.
What
Adds Gemma 4 MTP (multi-token prediction / self-speculative decoding) by bringing the upstream MTP lineage onto the fork, on top of TurboQuant+. Supersedes the fork's own pre-norm Qwen MTP (#149) with upstream's maintained post-norm/nextn design, which is what Gemma 4 MTP is built on.
Net: same Qwen35/Qwen35-MoE MTP coverage (now on upstream's design) plus Gemma 4 MTP, with the custom CUDA/Metal kernels and TurboQuant+ KV cache untouched.
How
Revert #149, then cherry-pick the full upstream MTP lineage in order, including `llm_graph_input_mtp`, the `hparams.n_layer` refactor, and the vocab check.

Plus one `fork: reconcile MTP lineage with TurboQuant+ KV cache` commit for the fork-specific glue (constructor `hparams`/`share` threading, turbo auto-asymmetric + attn-rotation policy preserved under the new KV-sharing branch, the `n_layer` field->method + array renames across fork models, `n_outputs_max`/`deepstack_mapping_arr` fields).

Constraints honored

- No `ggml-cuda`/`ggml-metal`/`ggml-vulkan` file is in any commit.
- Turbo auto-asymmetric K upgrade, per-side `LLAMA_ATTN_ROT_*` policy (default OFF), `n_layer_kv()` sizing (+3 rotation tensors) all preserved; turbo quant unit test passes (turbo2/3/4 round-trip).

Scope notes

- `--fit` VRAM pre-estimation in the server (speculative : fix n_outputs_max and remove draft-simple auto-enable ggml-org/llama.cpp#23988) is not carried (depends on helpers not on this fork; MTP does not need it). The `ctx_other` MTP draft KV-sharing wiring from llama : add Gemma4 MTP ggml-org/llama.cpp#23398 is kept.
- Upstream-only models the `n_layer` refactor dragged in (deepseek32, mellum, talkie) are dropped; they are not part of this fork.

Test status
Not for merge until the above pass on real hardware.