fix(gemma4-mtp): resolve PARALLEL=2 multi-slot crash in Gemma 4 MTP speculative decoding #26

Merged
Ooooze merged 1 commit into AtomicBot-ai:feature/turboquant-kv-cache from boxwrench:feature/turboquant-kv-cache on Jun 9, 2026

@boxwrench

What this fixes

Running llama-server with --n-parallel 2 and a Gemma 4 MTP assistant head crashed immediately when the second slot began its first speculative draft step:

GGML_ASSERT(ggml_nelements(a) == ne0*ne1*ne2) failed
ggml_reshape_3d() — ggml.c:3665
llm_build_gemma4_mtp::llm_build_gemma4_mtp()
llama_context::ensure_sched_mtp()
llama_context::decode_mtp_async()

The crash reproduces within 18 tokens on any two-slot request.

Root cause (3 related issues)

1. gemma4-assistant.cpp — wrong token dimension in Qcur reshape

The MTP draft step always processes a single token column. The reshape was using n_tokens (the main batch size, which is 2 under --n-parallel 2) as the third dimension:

// before
Qcur = ggml_reshape_3d(ctx0, Qcur, n_embd_head, n_head, n_tokens);

// after
Qcur = ggml_reshape_3d(ctx0, Qcur, n_embd_head, n_head, 1);

n_tokens is correct for the main model forward pass. The MTP head, however, always sees a token dimension of 1: the draft head speculates one token position at a time regardless of how many slots the server is running.
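
To make the failing assertion concrete, here is a minimal sketch of the element-count check that the reshape trips. It is not ggml code: reshape_3d_is_valid is a hypothetical stand-in for the condition in the GGML_ASSERT above, and the head dimensions are illustrative values, not taken from the Gemma 4 configuration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the condition ggml_reshape_3d asserts:
// the tensor's element count must equal ne0 * ne1 * ne2.
constexpr bool reshape_3d_is_valid(int64_t n_elements,
                                   int64_t ne0, int64_t ne1, int64_t ne2) {
    return n_elements == ne0 * ne1 * ne2;
}

// Illustrative values only (assumptions, not the real model config).
constexpr int64_t n_embd_head    = 256;
constexpr int64_t n_head         = 8;
constexpr int64_t n_draft_tokens = 1; // MTP draft step: one token column
constexpr int64_t n_tokens       = 2; // main batch size under --n-parallel 2

// Elements actually present in the draft-step Q projection.
constexpr int64_t q_elements = n_embd_head * n_head * n_draft_tokens;

// Before the fix: reshaping with n_tokens asks for twice as many elements.
static_assert(!reshape_3d_is_valid(q_elements, n_embd_head, n_head, n_tokens),
              "reshape with n_tokens = 2 should not match a 1-token tensor");

// After the fix: a third dimension of 1 matches the tensor exactly.
static_assert(reshape_3d_is_valid(q_elements, n_embd_head, n_head, 1),
              "reshape with a third dimension of 1 should match");
```

With one slot, n_tokens happens to equal 1, which would explain why the mismatch only surfaces once a second slot is active.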

2. llama-graph.cpp/h — stream partition size for MTP models

n_stream was not set to 1 for MTP graphs, so self-attention (MHA) split the batch dimension across streams and produced mismatched shapes under multi-slot scheduling.
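
The effect of a stale n_stream can be sketched in the same way. This sketch assumes the attention path divides the token dimension evenly across n_stream. The PR does not show that code, so both helper names below are hypothetical and the arithmetic is only an illustration of why a one-token draft graph needs n_stream = 1.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: tokens assigned to each stream when the token
// dimension is divided evenly across n_stream (integer division).
constexpr int64_t tokens_per_stream(int64_t n_tokens, int64_t n_stream) {
    return n_tokens / n_stream;
}

// Hypothetical helper: the split is shape-preserving only if the
// per-stream pieces add back up to the original token count.
constexpr bool split_preserves_tokens(int64_t n_tokens, int64_t n_stream) {
    return tokens_per_stream(n_tokens, n_stream) * n_stream == n_tokens;
}

// One draft token split across two streams leaves zero tokens per stream,
// so the split view no longer covers the tensor.
static_assert(tokens_per_stream(1, 2) == 0, "1 / 2 truncates to 0");
static_assert(!split_preserves_tokens(1, 2), "split across 2 streams loses the token");

// With n_stream forced to 1, the single token stays intact.
static_assert(split_preserves_tokens(1, 1), "n_stream = 1 keeps the token");
```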

3. llama-context.cpp — reservation graph shape mismatch

Scheduler reservations used init_full instead of init_mtp, so the reserved graph shape differed from the executed graph shape under --n-parallel 2.

Testing

Tested on AMD Ryzen AI Max+ 395 (Strix Halo APU), Vulkan/RADV, with:

  • Gemma 4 12B QAT Q4_0 + QAT-matched assistant head Q8_0
  • --n-parallel 2, --mtp-draft-n 3, --draft-p-min 0.75

The server completed two-slot speculative decoding without crashing. Acceptance rates and decode throughput were unchanged from the single-slot baseline.

Note: PR #25 addresses a different crash path (null embd dereference in llm_graph_input_embd::set_input). These two fixes are independent.

@hogeheer499-commits

Great fix!

@Ooooze merged commit 0dbf74d into AtomicBot-ai:feature/turboquant-kv-cache on Jun 9, 2026
1 check passed

@boxwrench

Author

my first merge, whoo hoo

@hogeheer499-commits

Congrats on your first merge, genuinely well earned. This is exactly the kind of practical runtime fix that makes the benchmark work useful beyond just posting numbers: it turns a Gemma 4 MTP caveat into something that can be retested cleanly with multi-slot serving. Thanks for digging into the actual graph/context issue and upstreaming it.
