Gemma 4 MTP: bring in the upstream MTP lineage (qwen35 post-norm + gemma4) on TurboQuant+ #172

Merged
TheTom merged 19 commits into feature/turboquant-kv-cache from feat/gemma4-mtp on Jun 10, 2026

Conversation

TheTom (owner) commented Jun 8, 2026

What

Adds Gemma 4 MTP (multi-token prediction / self-speculative decoding) by bringing the upstream MTP lineage onto the fork, on top of TurboQuant+. Supersedes the fork's own pre-norm Qwen MTP (#149) with upstream's maintained post-norm/nextn design, which is what Gemma 4 MTP is built on.

Net: same Qwen35/Qwen35-MoE MTP coverage (now on upstream's design) plus Gemma 4 MTP, with the custom CUDA/Metal kernels and TurboQuant+ KV cache untouched.

How

Revert #149, then cherry-pick the full upstream MTP lineage in order:

|  | PR |  |
|---|---|---|
| infra | ggml-org#23198, ggml-org#23269, ggml-org#23386, ggml-org#23287, ggml-org#23131, ggml-org#23433 | logit/prompt-decode handling, backend sampling, crash fixes, inp_out_ids |
| infra | ggml-org#23563, ggml-org#23643 | NVFP4 MTP scale tensors, llm_graph_input_mtp |
| infra | ggml-org#23988, ggml-org#24031, ggml-org#24060, ggml-org#24256 | n_outputs_max, qwen3 SSM archs, hparams.n_layer refactor, vocab check |
| feature | ggml-org#24025, ggml-org#23398 | qwen35 post-norm MTP, Gemma 4 MTP |

Plus one fork-side commit, "fork: reconcile MTP lineage with TurboQuant+ KV cache", for the fork-specific glue: constructor hparams/share threading; the turbo auto-asymmetric and attn-rotation policy, preserved under the new KV-sharing branch; the n_layer field->method and array renames across fork models; and the n_outputs_max/deepstack_mapping_arr fields.

Constraints honored

  • Kernels untouched — no ggml-cuda / ggml-metal / ggml-vulkan file is in any commit.
  • TurboQuant+ intact — turbo auto-asymmetric K upgrade, per-side LLAMA_ATTN_ROT_* policy (default OFF), n_layer_kv() sizing (+3 rotation tensors) all preserved; turbo quant unit test passes (turbo2/3/4 round-trip).

Scope notes

Test status

  • ✅ Builds clean on Metal (full build: cli, server, tools).
  • ✅ Turbo quant unit test passes.
  • Needs bench validation before merge (do NOT merge on the build alone): a Qwen 3.6 MTP regression run (acceptance rate vs. pre-#149 behavior; #149 is "spec: avoid all-token outputs during MTP prefill") and a Gemma 4 MTP functional run (assistant draft, accept rate, output sanity). A CUDA build on Spark is also worth a pass.

Not for merge until the above pass on real hardware.

github-actions bot added the documentation, examples, server, testing, python, and model labels Jun 8, 2026
TheTom and others added 17 commits June 8, 2026 13:22
This fork's pre-norm MTP optimization is superseded by upstream's maintained
post-norm/nextn MTP, brought in by the commits that follow. Removing it here
so that lineage applies onto a clean base instead of colliding with it.
* llama: avoid copying logits during prompt decode in MTP

* review: update comment

* llama-graph: call set_output for t_h_pre_norm

(cherry picked from commit 3e12fbd)
* llama : disable equal splits for recurrent memory with partial rollback

* spec : re-enable p-min with MTP drafts

* spec : re-enable ngram spec in combination with RS rollback

* spec : fix ngram-map-* params

* spec : fix acceptance logic in combined ngram + draft configs

* graph : fix reuse for combined `token` + `embd` batches

* spec : log parameters for each speculative implementation

- add LOG_INF in each constructor with implementation type and parameters
- extract device string logic into common_speculative_get_devices_str()
- move 'adding speculative implementation' log from init into constructors

Assisted-by: llama.cpp:local pi

* spec : extend --spec-default with ngram-map-k4v

Assisted-by: llama.cpp:local pi

* minor : fix n_embd log

* args : update draft.n_max == 3 + regen docs

* spec : relax ngram-mod rejection thold to 0.25 @ 5 low

* logs : improve

* docs : update speculative decoding CLI argument documentation

- Add missing draft model CPU scheduling and tensor override parameters
- Update --spec-type to include all available types (excluding draft-eagle3 WIP)
- Fix default values to match implementation (n_max=3, n_min=0, p_min=0.0)
- Remove deprecated options (spec-draft-ctx-size, spec-draft-replace)
- Add environment variables for new parameters

Assisted-by: llama.cpp:local pi

* arg : step-back on adding k4v to the default spec config

* cont : fix name

(cherry picked from commit d14ce3d)
This commit attempts to clarify a code comment in graph_mtp regarding
where the MTP layer is stored.

The motivation for this is that it was not obvious to me what the
original comment meant and hopefully this makes it clearer.

(cherry picked from commit baf3cc6)
…3386)

ggml_backend_dev_by_name always appends a nullptr sentinel to the devices
vector. Skipping nullptr entries prevents assertion failure in
ggml_backend_dev_name.

Assisted-by: llama.cpp:local pi
(cherry picked from commit 510b5c2)
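
A minimal sketch of the sentinel-skipping pattern this commit message describes. The `device` struct and both helpers are hypothetical stand-ins, not the ggml API; the only behavior taken from the message is that the devices vector can carry a nullptr entry that must not reach a name lookup that asserts on null.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a backend device handle (not the real ggml type).
struct device {
    std::string name;
};

// Stand-in for a name lookup that asserts on a null handle.
static const std::string & device_name(const device * dev) {
    assert(dev != nullptr);
    return dev->name;
}

// Join device names, skipping nullptr entries such as a trailing sentinel.
static std::string devices_str(const std::vector<const device *> & devs) {
    std::string out;
    for (const device * dev : devs) {
        if (dev == nullptr) {
            continue; // sentinel entry: nothing to print, and the lookup would assert
        }
        if (!out.empty()) {
            out += ", ";
        }
        out += device_name(dev);
    }
    return out;
}
```

Filtering at the call site keeps the asserting lookup strict, so a genuinely unexpected null elsewhere would still be caught.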
* Move to backend sampling for MTP draft path

Run top_k(10) on the draft backend. D2H transfers happen only for the top 10 logits

Make backend sampling more robust and fallback to CPU on failure cases, such as with "-sm tensor" or when a backend doesn't support TOP_K.

* Allow sampler chains to be partially offloaded to backend

* Add --spec-draft-backend-sampling argument. Enabled by default.

(cherry picked from commit ad27757)
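
To illustrate why running top_k on the draft backend shrinks the device-to-host transfer, here is a CPU-only sketch: after selection, only k (id, logit) pairs remain to be copied instead of one logit per vocabulary entry. The `candidate` struct and `top_k` helper are hypothetical and use `std::partial_sort`; they are not the llama.cpp sampler API, and the real backend path may select differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One surviving candidate: token id plus its logit.
struct candidate {
    int32_t id;
    float   logit;
};

// Return the k highest-logit candidates in descending logit order.
// If k exceeds the vocabulary size, all candidates are returned.
static std::vector<candidate> top_k(const std::vector<float> & logits, size_t k) {
    std::vector<candidate> cands;
    cands.reserve(logits.size());
    for (size_t i = 0; i < logits.size(); ++i) {
        cands.push_back({static_cast<int32_t>(i), logits[i]});
    }
    k = std::min(k, cands.size());
    // Only the first k positions need to be ordered.
    std::partial_sort(cands.begin(), cands.begin() + static_cast<std::ptrdiff_t>(k), cands.end(),
        [](const candidate & a, const candidate & b) { return a.logit > b.logit; });
    cands.resize(k);
    return cands;
}
```

With k = 10, the result holds at most ten entries regardless of vocabulary size, which is the property the commit relies on to keep transfers small.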
…r SWA-only models (ggml-org#23131)

When a model has zero non-SWA attention layers (e.g. a SWA-only slice of Gemma 4),
the base KV cache has no layer tensors. The input tensors (self_k_idxs, self_v_idxs,
self_kq_mask) are created as graph input nodes but never consumed by any compute node,
so the backend scheduler never allocates a buffer for them. Calling
mctx->get_base()->set_input_k_idxs() on an unallocated tensor then hits
GGML_ASSERT(buffer) at ggml-backend.cpp:194.

The same scenario applies symmetrically: if a model had zero SWA layers, the SWA
tensors would be unallocated.

Fix: guard both the base and SWA set_input calls with null/buffer checks, matching
the pattern already used by llm_graph_input_mem_hybrid_iswa::set_input (line ~674)
which has the comment: 'base tensors may not be allocated if there are no non-SWA
attention layers'.

Also fix can_reuse() in the same class to skip the ne[0] and kq_mask checks for
unallocated tensors, preventing a null-dereference on the reuse path.

(cherry picked from commit eeeaf61)
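
The guard described above can be sketched as follows. This is a toy model under stated assumptions: `tensor`, `is_allocated`, `set_input_idxs`, and `iswa_inputs` are hypothetical names, and a null `buffer` pointer stands in for "the scheduler never allocated this graph input". It is not the actual llama-graph code.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a graph input tensor. `buffer` stays null when
// no compute node consumes the tensor, so nothing was allocated for it.
struct tensor {
    std::vector<int32_t> * buffer = nullptr;
};

static bool is_allocated(const tensor * t) {
    return t != nullptr && t->buffer != nullptr;
}

// Write indices only when the tensor has backing storage; report whether we did.
static bool set_input_idxs(tensor * t, const std::vector<int32_t> & idxs) {
    if (!is_allocated(t)) {
        return false; // e.g. the base sub-cache of a SWA-only model: skip
    }
    *t->buffer = idxs;
    return true;
}

// Shape of the fix: the base and SWA sides are guarded independently.
struct iswa_inputs {
    tensor * base_k_idxs = nullptr;
    tensor * swa_k_idxs  = nullptr;

    // Returns how many of the two sides were actually written.
    int set_input(const std::vector<int32_t> & idxs) {
        int n_set = 0;
        n_set += set_input_idxs(base_k_idxs, idxs) ? 1 : 0;
        n_set += set_input_idxs(swa_k_idxs,  idxs) ? 1 : 0;
        return n_set;
    }
};
```

Guarding each side on its own state covers both cases named in the message: a SWA-only model (base unallocated) and a model with zero SWA layers (SWA unallocated).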
When doing a follow-up decode for the draft model, the logit computation was always performed even though it is not required.

(cherry picked from commit 12e5d99)
* Add NVFP4 MTP scale tensors

* Link Qwen3.5 MTP tensors

* Aligned nullptr

(cherry picked from commit b0df4c0)
* llama: add llm_graph_input_mtp

* rename input_mtp -> input_token_embd

* add TODO about mtmd embedding

* cont : clean-up

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
(cherry picked from commit eef59a7)
…gml-org#23988)

* speculative : add common_speculative_n_max helper function

Extract the speculative max-draft-size logic from server_n_outputs_max
into a reusable common_speculative_n_max() function in common/speculative.

Assisted-by: llama.cpp:local pi

* cont : draft context always has n_parallel outputs

* llama : log n_outputs_max

* speculative : remove draft-simple auto-enable

* ci : enable server tests on PRs

(cherry picked from commit 5dcb711)
* tests : add support for qwen3 SSM archs

* arch : add LLM_KV_ATTENTION_RECURRENT_LAYERS

* cont : naming + TODOs

(cherry picked from commit 06938ac)
* qwen35: use post-norm hidden state for MTP

* rename pre_norm to nextn

* fix step35

(cherry picked from commit 166fe29)
* hparams : refactor hparams.n_layer

* cont : remove `n_layer_kv()`, use n_layer_all instead

* cont : type consistency

* pi : update SYSTEM.md

* models : fix Step3.5 MTP

* cont : remove duplicate switch cases

* cont : explicitly set `false` to extra layers for `is_swa` and `is_recr`

* cont : fix nextn layer count handling

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
(cherry picked from commit 7acb4e8)
Integration glue so the upstream MTP lineage (ggml-org#23198..ggml-org#23398) builds on
this fork without disturbing TurboQuant+ or the custom kernels:

- llama_kv_cache ctor: thread the new `hparams` param and `layer_share_cb`
  through all call sites (iswa, memory-hybrid, dsa, model.cpp); keep the
  fork's turbo auto-asymmetric K upgrade, n_layer_kv() sizing (+3 rotation
  tensors), and per-side LLAMA_ATTN_ROT_* policy (default OFF) — now nested
  under the new `if (other) { share } else { ... }` KV-sharing branch.
- hparams: carry n_layer_all/n_layer_nextn + n_layer()/n_layer_kv() from the
  refactor while keeping the fork's n_layer_kv_from_start; restore the
  swa_layers->is_swa_impl / recurrent_layer_arr->is_recr_impl /
  nextn_predict_layers->n_layer_nextn renames across fork models.
- add n_outputs_max to cparams / common_params / llama_context_params and
  wire it through; restore deepstack_mapping_arr.
- server: keep the ggml-org#23398 ctx_other (MTP draft KV-sharing) wiring; drop the
  ggml-org#23988 --fit VRAM pre-estimation block (depends on upstream helpers not on
  this fork; MTP does not need it).
- drop upstream-only models pulled in by the refactor (deepseek32, mellum,
  talkie); keep non-MTP fork models on their own source + mechanical refactor.

Builds clean on Metal; turbo quant unit test passes (turbo2/3/4 round-trip).
Kernels (ggml-cuda / ggml-metal) untouched.
TheTom force-pushed the feat/gemma4-mtp branch from 01b40dc to 2f756e6 on June 8, 2026 18:22
TheTom added 2 commits June 9, 2026 16:29
dequantize4() returns vec4; the USE_DECODE_K / USE_DECODE_V sites assigned
it to f16vec4 locals/shared buffers, which glslc rejects under
explicit-arithmetic-types. The f16-cache and zero-fill branches were
already correct. Wrap each in f16vec4(). Unblocks coopmat1 FA with
quantized KV caches (--cache-type-k q8_0 / --cache-type-v turbo4) on
RDNA4.
Two fixes so a standalone MTP draft GGUF whose layers are ALL nextn
(gemma4-assistant: n_layer_all == n_layer_nextn, so n_layer() == 0)
initializes and engages speculative decoding:

1. llama-kv-cache.cpp: the ctor iterated hparams.n_layer() (excludes
   nextn layers) for the per-layer KV loop; the ggml-org#24060 reconciliation
   wired it to the nextn-excluding method, but upstream loops the full
   hparams.n_layer member. With n_layer() == 0 the draft registered ZERO
   KV layers -> map_layer_ids empty -> get_k(0) threw std::out_of_range
   during draft-context reserve. Loop over hparams.n_layer_all instead;
   has_kv() still gates per-layer.

2. llama-graph.cpp: port upstream ggml-org#24294 - guard the iSWA kq_mask on its
   own buffer in set_input/can_reuse (base and swa). A SWA-only draft
   head leaves the base sub-cache empty, so its mask buffer is null.

Verified on RDNA4/Vulkan: gemma4-12B MTP assistant loads, drafts at
~0.56-0.75 acceptance with q8_0 K / turbo4 V.
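
A toy reproduction of fix 1, assuming that `n_layer()` is defined as `n_layer_all - n_layer_nextn` (consistent with the message above, but the exact upstream definition is not shown here). `hparams_sketch` and `kv_layer_ids` are hypothetical names, and `has_kv()` is reduced to "always true" for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy hparams carrying only the fields named in the commit message.
struct hparams_sketch {
    uint32_t n_layer_all   = 0; // every layer in the GGUF, nextn included
    uint32_t n_layer_nextn = 0; // MTP (nextn) layers

    // Assumed definition: the accessor excludes nextn layers.
    uint32_t n_layer() const { return n_layer_all - n_layer_nextn; }

    // Toy gate: every layer has KV here. The real has_kv() is per-layer logic.
    bool has_kv(uint32_t /*il*/) const { return true; }
};

// Collect the layer ids that would get KV tensors when the loop bound is `n`.
static std::vector<uint32_t> kv_layer_ids(const hparams_sketch & hp, uint32_t n) {
    std::vector<uint32_t> ids;
    for (uint32_t il = 0; il < n; ++il) {
        if (hp.has_kv(il)) {
            ids.push_back(il);
        }
    }
    return ids;
}
```

For an all-nextn draft (`n_layer_all == n_layer_nextn`), `kv_layer_ids(hp, hp.n_layer())` is empty, which is the zero-KV-layer state that made the later lookup throw, while `kv_layer_ids(hp, hp.n_layer_all)` registers every layer and leaves the per-layer gate to decide.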
TheTom (owner, author) commented Jun 9, 2026

Pushed two hardening commits (validated on the RDNA4/Vulkan box):

  • de389e021 vulkan: fix f16vec4 casts in cm1 FA quantized K/V decode paths — glslc rejects the implicit vec4→f16vec4 assignment from dequantize4() under explicit-arithmetic-types; blocks coopmat1 FA with quantized KV caches. Note: this updates the PR's earlier "kernels untouched" claim — one Vulkan shader is now in scope.
  • 469c9c432 mtp: fix standalone all-nextn draft KV cache (gemma4-assistant). A separate draft GGUF whose layers are all nextn has n_layer_all == n_layer_nextn, so hparams.n_layer() evaluates to 0. The KV cache ctor looped n_layer() and registered zero layers → map_layer_ids empty → get_k(0) threw std::out_of_range during draft-context reserve. Fixed by looping hparams.n_layer_all (matching upstream, which loops the full n_layer member; has_kv() still gates per-layer), plus a port of the upstream ggml-org/llama.cpp#24294 ("graph: guard iswa kq_mask on its own buffer") iSWA kq_mask buffer guards for SWA-only draft heads.

The bundled-MTP case (Qwen 3.5/3.6: base layers + nextn in one GGUF) is unaffected — n_layer_all is what the ctor iterated pre-ggml-org#24060 reconciliation. An alternative fix that special-cases hparams::n_layer() itself (if (n_layer_all == n_layer_nextn) return n_layer_all;) was floated externally, but that changes the accessor's semantics at every call site (tensor loading, graph build, layer-adaptive heuristics); keeping the fix at the KV ctor matches upstream layering exactly.

Verified: gemma4-12B MTP assistant loads and drafts on RDNA4/Vulkan, q8_0 K / turbo4 V, acceptance ~0.56–0.75 depending on workload.

TheTom (owner, author) commented Jun 9, 2026

Hardware validation of 469c9c432 + de389e021 on RDNA 4 (RX 9070 XT, Vulkan, Windows):

  • Standalone all-nextn gemma4-assistant draft (gemma-4-12B-it-MTP-Q8_0.gguf, 4 layers, all nextn) loads with no std::out_of_range crash; all 4 draft layers register KV and share with base layers 46/47.
  • Speculative decoding engages end-to-end (-c 16384 -ngl 999 --flash-attn on --cache-type-k q8_0 --cache-type-v turbo4 --spec-type draft-mtp --spec-draft-n-max 2).
  • A/B on the same prompt (temp 0, 330-token generation, warmed):
    • plain decode: 36.9 t/s
    • MTP spec decode: 42.4 t/s (draft acceptance 210/242 = 87%)

MTP self-spec is now a net +15% on RDNA4/Vulkan with this lineage (it was a net loss on the pre-PR fork build). The gain is likely due to the backend-sampling draft path and the inp_out_ids logit-skip commits. Acceptance is workload-dependent (87% on deterministic math; ~0.56–0.75 on open-ended chat), so plain decode may still win on chatty workloads.
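
For anyone re-running the A/B, the two headline figures follow directly from the raw numbers in the list above. A small sketch of the arithmetic (helper names are illustrative only):

```cpp
#include <cassert>
#include <cmath>

// Relative throughput gain of `test_tps` over `base_tps`, in percent.
static double gain_pct(double base_tps, double test_tps) {
    return (test_tps / base_tps - 1.0) * 100.0;
}

// Share of drafted tokens that were accepted, in percent.
static double accept_pct(int n_accepted, int n_drafted) {
    return 100.0 * n_accepted / n_drafted;
}
```

`gain_pct(36.9, 42.4)` is about 14.9 and `accept_pct(210, 242)` is about 86.8, which round to the +15% and 87% quoted.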

The Vulkan f16vec4 cast fix is also a prerequisite for this result: without it, glslc rejects the cm1 FA shaders for quantized KV caches.

TheTom merged commit 2e68087 into feature/turboquant-kv-cache on Jun 10, 2026
31 of 50 checks passed