Gemma 4 MTP: bring in the upstream MTP lineage (qwen35 post-norm + gemma4) on TurboQuant+ by TheTom · Pull Request #172 · TheTom/llama-cpp-turboquant

TheTom · 2026-06-08T17:56:58Z

What

Adds Gemma 4 MTP (multi-token prediction / self-speculative decoding) by bringing the upstream MTP lineage onto the fork, on top of TurboQuant+. Supersedes the fork's own pre-norm Qwen MTP (#149) with upstream's maintained post-norm/nextn design, which is what Gemma 4 MTP is built on.

Net: same Qwen35/Qwen35-MoE MTP coverage (now on upstream's design) plus Gemma 4 MTP, with the custom CUDA/Metal kernels and TurboQuant+ KV cache untouched.

How

Revert #149 then cherry-pick the full upstream MTP lineage in order:

| Type | PR | Covers |
|---|---|---|
| infra | ggml-org#23198 ggml-org#23269 ggml-org#23386 ggml-org#23287 ggml-org#23131 ggml-org#23433 | logit/prompt-decode handling, backend sampling, crash fixes, inp_out_ids |
| infra | ggml-org#23563 ggml-org#23643 | NVFP4 MTP scale tensors, `llm_graph_input_mtp` |
| infra | ggml-org#23988 ggml-org#24031 ggml-org#24060 ggml-org#24256 | n_outputs_max, qwen3 SSM archs, `hparams.n_layer` refactor, vocab check |
| feature | ggml-org#24025 ggml-org#23398 | qwen35 post-norm MTP, Gemma 4 MTP |

Plus one fork commit, `fork: reconcile MTP lineage with TurboQuant+ KV cache`, for the fork-specific glue (constructor hparams/share threading, turbo auto-asymmetric + attn-rotation policy preserved under the new KV-sharing branch, the n_layer field->method + array renames across fork models, n_outputs_max/deepstack_mapping_arr fields).

Constraints honored

Kernels untouched — no ggml-cuda / ggml-metal / ggml-vulkan file is in any commit.
TurboQuant+ intact — turbo auto-asymmetric K upgrade, per-side LLAMA_ATTN_ROT_* policy (default OFF), n_layer_kv() sizing (+3 rotation tensors) all preserved; turbo quant unit test passes (turbo2/3/4 round-trip).

Scope notes

Upstream's --fit VRAM pre-estimation in the server (ggml-org/llama.cpp#23988, "speculative : fix n_outputs_max and remove draft-simple auto-enable") is not carried: it depends on helpers not on this fork, and MTP does not need it. The ctx_other MTP draft KV-sharing wiring from ggml-org/llama.cpp#23398 ("llama : add Gemma4 MTP") is kept.
Upstream-only models the n_layer refactor dragged in (deepseek32, mellum, talkie) are dropped — not part of this fork.

Test status

✅ Builds clean on Metal (full build: cli, server, tools).
✅ Turbo quant unit test passes.
⏳ Needs bench validation before merge (do NOT merge on the build alone): a Qwen 3.6 MTP regression check (acceptance rate vs pre-#149 behavior) and a Gemma 4 MTP functional run (assistant draft, accept rate, output sanity). A CUDA build on Spark is also worth a pass.

Not for merge until the above pass on real hardware.

This fork's pre-norm MTP optimization is superseded by upstream's maintained post-norm/nextn MTP, brought in by the commits that follow. Removing it here so that lineage applies onto a clean base instead of colliding with it.

* llama: avoid copying logits during prompt decode in MTP
* review: update comment
* llama-graph: call set_output for t_h_pre_norm

(cherry picked from commit 3e12fbd)

* llama : disable equal splits for recurrent memory with partial rollback
* spec : re-enable p-min with MTP drafts
* spec : re-enable ngram spec in combination with RS rollback
* spec : fix ngram-map-* params
* spec : fix acceptance logic in combined ngram + draft configs
* graph : fix reuse for combined `token` + `embd` batches
* spec : log parameters for each speculative implementation
  - add LOG_INF in each constructor with implementation type and parameters
  - extract device string logic into common_speculative_get_devices_str()
  - move 'adding speculative implementation' log from init into constructors
  Assisted-by: llama.cpp:local pi
* spec : extend --spec-default with ngram-map-k4v
  Assisted-by: llama.cpp:local pi
* minor : fix n_embd log
* args : update draft.n_max == 3 + regen docs
* spec : relax ngram-mod rejection thold to 0.25 @ 5 low
* logs : improve
* docs : update speculative decoding CLI argument documentation
  - Add missing draft model CPU scheduling and tensor override parameters
  - Update --spec-type to include all available types (excluding draft-eagle3 WIP)
  - Fix default values to match implementation (n_max=3, n_min=0, p_min=0.0)
  - Remove deprecated options (spec-draft-ctx-size, spec-draft-replace)
  - Add environment variables for new parameters
  Assisted-by: llama.cpp:local pi
* arg : step-back on adding k4v to the default spec config
* cont : fix name

(cherry picked from commit d14ce3d)

This commit attempts to clarify a code comment in graph_mtp regarding where the MTP layer is stored. The motivation for this is that it was not obvious to me what the original comment meant and hopefully this makes it clearer. (cherry picked from commit baf3cc6)

…3386) ggml_backend_dev_by_name always appends a nullptr sentinel to the devices vector. Skipping nullptr entries prevents assertion failure in ggml_backend_dev_name. Assisted-by: llama.cpp:local pi (cherry picked from commit 510b5c2)

* Move to backend sampling for MTP draft path
  Run top_k(10) on the draft backend. D2H transfers happen only for the top 10 logits.
  Make backend sampling more robust and fallback to CPU on failure cases, such as with "-sm tensor" or when a backend doesn't support TOP_K.
* Allow sampler chains to be partially offloaded to backend
* Add --spec-draft-backend-sampling argument. Enabled by default.

(cherry picked from commit ad27757)
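To illustrate why the top_k(10) step cuts transfer volume, the sketch below selects the k largest logits on the CPU. It is not the backend sampler from the commit: `top_k_logits` is a hypothetical helper, and in the real path the selection runs on the draft backend so that only those k values need a device-to-host copy.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: returns the k largest logits as (token id, logit)
// pairs, highest logit first. A CPU illustration only.
std::vector<std::pair<int, float>> top_k_logits(const std::vector<float> & logits, std::size_t k) {
    std::vector<std::pair<int, float>> cand;
    cand.reserve(logits.size());
    for (std::size_t i = 0; i < logits.size(); ++i) {
        cand.emplace_back(static_cast<int>(i), logits[i]);
    }

    k = std::min(k, cand.size());

    // Only the first k entries need to be ordered; the rest are discarded.
    std::partial_sort(cand.begin(), cand.begin() + static_cast<std::ptrdiff_t>(k), cand.end(),
                      [](const std::pair<int, float> & a, const std::pair<int, float> & b) {
                          return a.second > b.second;
                      });
    cand.resize(k);
    return cand;
}
```

With a vocabulary of tens of thousands of entries, returning 10 pairs instead of the full logit vector is the whole point of the change; the fallback to CPU sampling mentioned in the commit would still need the full vector.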

…r SWA-only models (ggml-org#23131)

When a model has zero non-SWA attention layers (e.g. a SWA-only slice of Gemma 4), the base KV cache has no layer tensors. The input tensors (self_k_idxs, self_v_idxs, self_kq_mask) are created as graph input nodes but never consumed by any compute node, so the backend scheduler never allocates a buffer for them. Calling mctx->get_base()->set_input_k_idxs() on an unallocated tensor then hits GGML_ASSERT(buffer) at ggml-backend.cpp:194.

The same scenario applies symmetrically: if a model had zero SWA layers, the SWA tensors would be unallocated.

Fix: guard both the base and SWA set_input calls with null/buffer checks, matching the pattern already used by llm_graph_input_mem_hybrid_iswa::set_input (line ~674) which has the comment: 'base tensors may not be allocated if there are no non-SWA attention layers'.

Also fix can_reuse() in the same class to skip the ne[0] and kq_mask checks for unallocated tensors, preventing a null-dereference on the reuse path.

(cherry picked from commit eeeaf61)
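The guard pattern described in this commit can be sketched with stand-in types. `tensor_sketch` and `set_input_guarded` are hypothetical names, not the llama.cpp API; the sketch only shows the shape of the check (tensor present, buffer allocated) that has to pass before any input data is written.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a graph input tensor. In this sketch a null
// `buffer` models a tensor the backend scheduler never allocated.
struct tensor_sketch {
    std::vector<int64_t> * buffer = nullptr;
};

// Writes the indices only if the tensor exists and is backed by a buffer.
// Returns true when the write happened, false when it was skipped.
bool set_input_guarded(tensor_sketch * t, const std::vector<int64_t> & idxs) {
    if (t == nullptr || t->buffer == nullptr) {
        // e.g. the base sub-cache of a SWA-only model: nothing to fill
        return false;
    }
    *t->buffer = idxs;
    return true;
}
```

Skipping the write is safe in this scenario because no compute node reads the unallocated input; the same reasoning is what lets can_reuse() skip its shape checks for those tensors.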

when doing a follow-up decode for the draft model, we were always doing the logit computation even though it is not required. (cherry picked from commit 12e5d99)

* Add NVFP4 MTP scale tensors
* Link Qwen3.5 MTP tensors
* Aligned nullptr

(cherry picked from commit b0df4c0)

* llama: add llm_graph_input_mtp
* rename input_mtp -> input_token_embd
* add TODO about mtmd embedding
* cont : clean-up

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

(cherry picked from commit eef59a7)

…gml-org#23988)

* speculative : add common_speculative_n_max helper function
  Extract the speculative max-draft-size logic from server_n_outputs_max into a reusable common_speculative_n_max() function in common/speculative.
  Assisted-by: llama.cpp:local pi
* cont : draft context always has n_parallel outputs
* llama : log n_outputs_max
* speculative : remove draft-simple auto-enable
* ci : enable server tests on PRs

(cherry picked from commit 5dcb711)

* tests : add support for qwen3 SSM archs
* arch : add LLM_KV_ATTENTION_RECURRENT_LAYERS
* cont : naming + TODOs

(cherry picked from commit 06938ac)

* qwen35: use post-norm hidden state for MTP
* rename pre_norm to nextn
* fix step35

(cherry picked from commit 166fe29)

* hparams : refactor hparams.n_layer
* cont : remove `n_layer_kv()`, use n_layer_all instead
* cont : type consistency
* pi : update SYSTEM.md
* models : fix Step3.5 MTP
* cont : remove duplicate switch cases
* cont : explicitly set `false` to extra layers for `is_swa` and `is_recr`
* cont : fix nextn layer count handling
  Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

(cherry picked from commit 7acb4e8)

(cherry picked from commit 8a091c4)

(cherry picked from commit 04eb4c4)

Integration glue so the upstream MTP lineage (ggml-org#23198..ggml-org#23398) builds on this fork without disturbing TurboQuant+ or the custom kernels:

- llama_kv_cache ctor: thread the new `hparams` param and `layer_share_cb` through all call sites (iswa, memory-hybrid, dsa, model.cpp); keep the fork's turbo auto-asymmetric K upgrade, n_layer_kv() sizing (+3 rotation tensors), and per-side LLAMA_ATTN_ROT_* policy (default OFF), now nested under the new `if (other) { share } else { ... }` KV-sharing branch.
- hparams: carry n_layer_all/n_layer_nextn + n_layer()/n_layer_kv() from the refactor while keeping the fork's n_layer_kv_from_start; restore the swa_layers->is_swa_impl / recurrent_layer_arr->is_recr_impl / nextn_predict_layers->n_layer_nextn renames across fork models.
- add n_outputs_max to cparams / common_params / llama_context_params and wire it through; restore deepstack_mapping_arr.
- server: keep the ggml-org#23398 ctx_other (MTP draft KV-sharing) wiring; drop the ggml-org#23988 --fit VRAM pre-estimation block (depends on upstream helpers not on this fork; MTP does not need it).
- drop upstream-only models pulled in by the refactor (deepseek32, mellum, talkie); keep non-MTP fork models on their own source + mechanical refactor.

Builds clean on Metal; turbo quant unit test passes (turbo2/3/4 round-trip). Kernels (ggml-cuda / ggml-metal) untouched.

dequantize4() returns vec4; the USE_DECODE_K / USE_DECODE_V sites assigned it to f16vec4 locals/shared buffers, which glslc rejects under explicit-arithmetic-types. The f16-cache and zero-fill branches were already correct. Wrap each in f16vec4(). Unblocks coopmat1 FA with quantized KV caches (--cache-type-k q8_0 / --cache-type-v turbo4) on RDNA4.

Two fixes so a standalone MTP draft GGUF whose layers are ALL nextn (gemma4-assistant: n_layer_all == n_layer_nextn, so n_layer() == 0) initializes and engages speculative decoding:

1. llama-kv-cache.cpp: the ctor iterated hparams.n_layer() (excludes nextn layers) for the per-layer KV loop; the ggml-org#24060 reconciliation wired it to the nextn-excluding method, but upstream loops the full hparams.n_layer member. With n_layer() == 0 the draft registered ZERO KV layers -> map_layer_ids empty -> get_k(0) threw std::out_of_range during draft-context reserve. Loop over hparams.n_layer_all instead; has_kv() still gates per-layer.
2. llama-graph.cpp: port upstream ggml-org#24294 - guard the iSWA kq_mask on its own buffer in set_input/can_reuse (base and swa). A SWA-only draft head leaves the base sub-cache empty, so its mask buffer is null.

Verified on RDNA4/Vulkan: gemma4-12B MTP assistant loads, drafts at ~0.56-0.75 acceptance with q8_0 K / turbo4 V.

TheTom · 2026-06-09T21:29:27Z

Pushed two hardening commits (validated on the RDNA4/Vulkan box):

de389e021 vulkan: fix f16vec4 casts in cm1 FA quantized K/V decode paths — glslc rejects the implicit vec4→f16vec4 assignment from dequantize4() under explicit-arithmetic-types; blocks coopmat1 FA with quantized KV caches. Note: this updates the PR's earlier "kernels untouched" claim — one Vulkan shader is now in scope.
469c9c432 mtp: fix standalone all-nextn draft KV cache (gemma4-assistant): a separate draft GGUF whose layers are all nextn has n_layer_all == n_layer_nextn, so hparams.n_layer() evaluates to 0. The KV cache ctor looped n_layer() and registered zero layers → map_layer_ids empty → get_k(0) threw std::out_of_range during draft-context reserve. Fixed by looping hparams.n_layer_all (matching upstream, which loops the full n_layer member; has_kv() still gates per-layer), plus a port of the upstream ggml-org/llama.cpp#24294 ("graph: guard iswa kq_mask on its own buffer") iSWA kq_mask buffer guards for SWA-only draft heads.

The bundled-MTP case (Qwen 3.5/3.6: base layers + nextn in one GGUF) is unaffected — n_layer_all is what the ctor iterated pre-ggml-org#24060 reconciliation. An alternative fix that special-cases hparams::n_layer() itself (if (n_layer_all == n_layer_nextn) return n_layer_all;) was floated externally, but that changes the accessor's semantics at every call site (tensor loading, graph build, layer-adaptive heuristics); keeping the fix at the KV ctor matches upstream layering exactly.
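The difference between the two loop bounds can be shown with a minimal sketch. `hparams_sketch` and `kv_layer_ids` are hypothetical stand-ins that model only the fields named above; the real has_kv() is model-specific, and here it is assumed to be true for every layer.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-in for the fork's hparams.
struct hparams_sketch {
    uint32_t n_layer_all   = 0; // every layer in the GGUF, including nextn
    uint32_t n_layer_nextn = 0; // MTP (nextn) layers

    // Accessor semantics as described in the PR: excludes nextn layers.
    uint32_t n_layer() const { return n_layer_all - n_layer_nextn; }

    // Assumption for this sketch: every layer owns KV.
    bool has_kv(uint32_t il) const { return il < n_layer_all; }
};

// Returns the layer ids that would be registered in the KV cache.
// use_all = false reproduces the buggy loop bound (n_layer()),
// use_all = true the fixed one (n_layer_all).
std::vector<uint32_t> kv_layer_ids(const hparams_sketch & hp, bool use_all) {
    const uint32_t n = use_all ? hp.n_layer_all : hp.n_layer();
    std::vector<uint32_t> ids;
    for (uint32_t il = 0; il < n; ++il) {
        if (hp.has_kv(il)) {
            ids.push_back(il);
        }
    }
    return ids;
}
```

For a 4-layer all-nextn draft (n_layer_all = 4, n_layer_nextn = 4), the buggy bound registers no layers, which is the empty map_layer_ids that made get_k(0) throw; the fixed bound registers all four. Keeping n_layer() unchanged and fixing only the loop bound leaves every other caller of the accessor untouched.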

Verified: gemma4-12B MTP assistant loads and drafts on RDNA4/Vulkan, q8_0 K / turbo4 V, acceptance ~0.56–0.75 depending on workload.

TheTom · 2026-06-09T21:57:05Z

Hardware validation of 469c9c432 + de389e021 on RDNA 4 (RX 9070 XT, Vulkan, Windows):

Standalone all-nextn gemma4-assistant draft (gemma-4-12B-it-MTP-Q8_0.gguf, 4 layers, all nextn) loads with no std::out_of_range crash; all 4 draft layers register KV and share with base layers 46/47.
Speculative decoding engages end-to-end (-c 16384 -ngl 999 --flash-attn on --cache-type-k q8_0 --cache-type-v turbo4 --spec-type draft-mtp --spec-draft-n-max 2).
A/B on the same prompt (temp 0, 330-token generation, warmed):
- plain decode: 36.9 t/s
- MTP spec decode: 42.4 t/s (draft acceptance 210/242 = 87%)

MTP self-spec is now a net +15% on RDNA4/Vulkan with this lineage (it was a net loss on the pre-PR fork build) — likely the backend-sampling draft path and inp_out_ids logit-skip commits. Acceptance is workload-dependent (87% on deterministic math; ~0.56–0.75 on open-ended chat), so plain decode may still win on chatty workloads.
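The two headline figures follow directly from the A/B numbers above. The helpers below are a small sketch of that arithmetic; the function names are illustrative and not part of llama.cpp.

```cpp
#include <cassert>
#include <cmath>

// Fraction of drafted tokens that the target model accepted.
double acceptance_rate(int accepted, int drafted) {
    return static_cast<double>(accepted) / drafted;
}

// Relative throughput gain of speculative decode over plain decode.
double relative_gain(double plain_tps, double spec_tps) {
    return spec_tps / plain_tps - 1.0;
}
```

With the reported values, acceptance_rate(210, 242) is about 0.868 (the quoted 87%) and relative_gain(36.9, 42.4) is about 0.149 (the quoted +15%).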

The vulkan f16vec4 cast fix also gates the whole thing — without it glslc rejects the cm1 FA shaders for quantized KV caches.

github-actions Bot added documentation, examples, server, testing, python, model labels Jun 8, 2026

TheTom and others added 17 commits June 8, 2026 13:22

llama: avoid copying logits during prompt decode in MTP (ggml-org#23198)

19c7616

mtp: use inp_out_ids for skipping logit computation (ggml-org#23433)

676b3ed

model : add NVFP4 MTP scale tensors (ggml-org#23563)

46cc11a

llama: add llm_graph_input_mtp (ggml-org#23643)

ab11a71

tests : add support for qwen3 SSM archs (ggml-org#24031)

7560617

qwen35: use post-norm hidden state for MTP (ggml-org#24025)

0c809f0

spec : fix vocab compatibility check (ggml-org#24256)

6d9a4a8

(cherry picked from commit 8a091c4)

llama : add Gemma4 MTP (ggml-org#23398)

d1e70aa

(cherry picked from commit 04eb4c4)

TheTom force-pushed the feat/gemma4-mtp branch from 01b40dc to 2f756e6 Compare June 8, 2026 18:22

TheTom added 2 commits June 9, 2026 16:29

github-actions Bot added ggml Vulkan labels Jun 9, 2026

TheTom merged commit 2e68087 into feature/turboquant-kv-cache Jun 10, 2026
31 of 50 checks passed

TheTom merged 19 commits into feature/turboquant-kv-cache from feat/gemma4-mtp


8 participants
