fix(gemma4-mtp): resolve PARALLEL=2 multi-slot crash in Gemma 4 MTP speculative decoding by boxwrench · Pull Request #26 · AtomicBot-ai/atomic-llama-cpp-turboquant

boxwrench · 2026-06-06T20:41:39Z

What this fixes

Running llama-server with --n-parallel 2 and a Gemma 4 MTP assistant head crashed immediately when the second slot began its first speculative draft step:

GGML_ASSERT(ggml_nelements(a) == ne0*ne1*ne2) failed
ggml_reshape_3d() — ggml.c:3665
llm_build_gemma4_mtp::llm_build_gemma4_mtp()
llama_context::ensure_sched_mtp()
llama_context::decode_mtp_async()

The crash reproduces within 18 tokens on any two-slot request.

Root cause (3 related issues)

1. gemma4-assistant.cpp — wrong token dimension in Qcur reshape

The MTP draft step always processes a single token column. The reshape was using n_tokens (the main batch size, which is 2 under --n-parallel 2) as the third dimension:

// before
Qcur = ggml_reshape_3d(ctx0, Qcur, n_embd_head, n_head, n_tokens);

// after
Qcur = ggml_reshape_3d(ctx0, Qcur, n_embd_head, n_head, 1);

n_tokens is the right size for the main model forward pass. The MTP head's token dimension is always 1, because the draft head speculates one token position at a time regardless of how many slots the server is running.
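The arithmetic behind the failed assertion can be checked in isolation. The following is a minimal sketch, not code from this PR: reshape_3d_is_valid, require_valid_reshape, and mtp_qcur_elements are hypothetical helpers that mirror the element-count condition quoted in the GGML_ASSERT above, and any head sizes passed to them are illustrative values rather than Gemma 4's real dimensions.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the precondition quoted in the crash log:
// the source tensor's element count must equal ne0 * ne1 * ne2.
bool reshape_3d_is_valid(int64_t n_elements, int64_t ne0, int64_t ne1, int64_t ne2) {
    return n_elements == ne0 * ne1 * ne2;
}

// Same check, but aborting like an assertion would when the shapes disagree.
void require_valid_reshape(int64_t n_elements, int64_t ne0, int64_t ne1, int64_t ne2) {
    assert(reshape_3d_is_valid(n_elements, ne0, ne1, ne2));
}

// Element count of the draft head's Qcur, assuming it holds exactly one
// token column no matter how many tokens the main batch carries.
int64_t mtp_qcur_elements(int64_t n_embd_head, int64_t n_head) {
    return n_embd_head * n_head * 1;
}
```

With illustrative sizes n_embd_head = 256 and n_head = 8, reshape_3d_is_valid(mtp_qcur_elements(256, 8), 256, 8, 2) is false, which corresponds to the two-slot crash, while passing 1 as the last argument makes the check pass.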

2. llama-graph.cpp/h — stream partition size for MTP models

n_stream was not forced to 1 for MTP graphs, so self-attention (MHA) split the batch dimension across streams and produced mismatched shapes under multi-slot scheduling.
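One way to see why a stream count above 1 is a problem for a single-token graph is the rough sketch below. It assumes the attention path divides the token dimension evenly across n_stream partitions. The PR text does not show that code, so tokens_per_stream and split_preserves_elements are hypothetical helpers, not functions from llama-graph.cpp.

```cpp
#include <cassert>
#include <cstdint>

// Tokens each stream would receive if the token dimension is divided
// evenly across n_stream partitions (integer division).
int64_t tokens_per_stream(int64_t n_tokens, int64_t n_stream) {
    return n_tokens / n_stream;
}

// True when splitting into n_stream partitions still covers every token,
// i.e. the split shape has the same element count as the unsplit one.
bool split_preserves_elements(int64_t n_tokens, int64_t n_stream) {
    return tokens_per_stream(n_tokens, n_stream) * n_stream == n_tokens;
}
```

Under that assumption, a one-token MTP graph with n_stream = 2 leaves zero tokens per stream, so the split shape no longer matches the tensor. With n_stream = 1 the split is a no-op.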

3. llama-context.cpp — reservation graph shape mismatch

Scheduler reservations used init_full instead of init_mtp, so the reserved graph shape differed from the executed graph shape under --n-parallel 2.

Testing

Tested on AMD Ryzen AI Max+ 395 (Strix Halo APU), Vulkan/RADV, with:

Gemma 4 12B QAT Q4_0 + QAT-matched assistant head Q8_0
--n-parallel 2, --mtp-draft-n 3, --draft-p-min 0.75

The server survived two-slot speculative decoding without crashing. Acceptance rates and decode throughput were unchanged from the single-slot baseline.

Note: PR #25 addresses a different crash path (null embd dereference in llm_graph_input_embd::set_input). These two fixes are independent.


hogeheer499-commits · 2026-06-06T22:35:56Z

Great fix!

boxwrench · 2026-06-10T18:13:31Z

my first merge, whoo hoo

hogeheer499-commits · 2026-06-11T12:45:23Z

Congrats on your first merge, genuinely well earned. This is exactly the kind of practical runtime fix that makes the benchmark work useful beyond just posting numbers: it turns a Gemma 4 MTP caveat into something that can be retested cleanly with multi-slot serving. Thanks for digging into the actual graph/context issue and upstreaming it.

fix(gemma4-mtp): resolve speculator parallel=2 multi-slot crashes under Vulkan (commit 6564568)

github-actions Bot added the model label Jun 6, 2026

This was referenced Jun 6, 2026

Collect benchmark reports from more Strix Halo systems hogeheer499-commits/strix-halo-guide#4

Open

llama : add Gemma4 MTP ggml-org/llama.cpp#23398

Merged

Ooooze merged commit 0dbf74d into AtomicBot-ai:feature/turboquant-kv-cache Jun 9, 2026
1 check passed

Ooooze merged 1 commit into AtomicBot-ai:feature/turboquant-kv-cache from boxwrench:feature/turboquant-kv-cache
