[codex] Add a reserve-only Gemma4 MTP reserve context by nycdubliner · Pull Request #23 · AtomicBot-ai/atomic-llama-cpp-turboquant

nycdubliner · 2026-06-01T17:32:10Z

What changed

- Add llama_kv_cache_iswa::init_mtp_reserve(llama_ubatch) for shape-only Gemma4 MTP graph reservation
- Switch llama_context::ensure_sched_mtp() to use the reserve-only helper
- Keep real MTP decode on init_mtp(seq_id, ubatch) unchanged
- Update the Gemma4 multislot MTP crash worklog with the follow-up hardening

Why

The original crash fix stopped using memory->init_full() during MTP graph reservation, but it still routed reserve through init_mtp(0, ub). That worked in practice, yet it implicitly treated reserve as if it were decoding real seq_id 0.

This change removes that semantic risk by introducing a dedicated reserve-only memory context with the exact single-stream, single-index topology the MTP graph needs, without depending on any user sequence id or KV state.
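The distinction can be illustrated with a toy model. This is a minimal sketch, not the code in this PR: `ToyKvCache`, `SlotTopology`, and `place()` are hypothetical names, and only the method names `init_mtp` and `init_mtp_reserve` mirror the ones mentioned above. It assumes the decode-style lookup reads per-sequence state while the reserve-style helper returns a fixed single-stream shape.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Toy model only: these types are hypothetical and are not the llama.cpp classes.
struct SlotTopology {
    uint32_t n_stream; // number of KV streams the graph is built against
    uint32_t stream;   // stream index the tokens are written to
    uint32_t head;     // first KV cell used by the ubatch
    uint32_t n_tokens; // tokens in the ubatch
};

inline bool operator==(const SlotTopology & a, const SlotTopology & b) {
    return a.n_stream == b.n_stream && a.stream == b.stream &&
           a.head == b.head && a.n_tokens == b.n_tokens;
}

class ToyKvCache {
public:
    // Record where a user sequence currently lives (stream and next free cell).
    void place(int32_t seq_id, uint32_t stream, uint32_t head) {
        seqs_[seq_id] = Placement{stream, head};
    }

    // Decode-style lookup: the result depends on the current state of seq_id.
    bool init_mtp(int32_t seq_id, uint32_t n_tokens, SlotTopology & out) const {
        auto it = seqs_.find(seq_id);
        if (it == seqs_.end()) {
            return false; // unknown sequence: nothing to build a context from
        }
        out = SlotTopology{1, it->second.stream, it->second.head, n_tokens};
        return true;
    }

    // Reserve-style helper: a fixed single-stream, single-index shape that
    // does not consult any sequence id or stored KV state.
    SlotTopology init_mtp_reserve(uint32_t n_tokens) const {
        return SlotTopology{1, 0, 0, n_tokens};
    }

private:
    struct Placement { uint32_t stream; uint32_t head; };
    std::map<int32_t, Placement> seqs_;
};
```

In this toy, `init_mtp(0, ...)` changes its answer as sequence 0 moves around the cache, while `init_mtp_reserve(...)` returns the same topology every time, which is the property a shape-only reservation wants.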

Root cause

With -np > 1, reserving the MTP graph through a full KV context built dummy slot info spanning multiple streams. Gemma4 MTP expects a single-stream topology in its reserve and draft paths, so the mismatched reserve shape tripped the ggml_reshape_3d() assert.
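The shape mismatch can be sketched in isolation. ggml_reshape_3d() requires the reshaped tensor to keep the same number of elements as its source, so a view that carries an extra stream dimension cannot be reshaped to the single-stream shape. The code below is a simplified illustration under that assumption: `Shape4`, `reserved_k_shape()`, `mtp_reshape_ok()`, and the dimension values are hypothetical and do not reproduce the actual Gemma4 MTP graph.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a 4D tensor shape; not the ggml tensor struct.
struct Shape4 {
    int64_t ne[4];
    int64_t nelements() const { return ne[0] * ne[1] * ne[2] * ne[3]; }
};

// Models the element-count precondition of a 3D reshape.
bool can_reshape_3d(const Shape4 & src, int64_t ne0, int64_t ne1, int64_t ne2) {
    return src.nelements() == ne0 * ne1 * ne2;
}

// Hypothetical: shape of a K view handed to the graph by a reserve context
// that spans n_stream streams.
Shape4 reserved_k_shape(int64_t head_dim, int64_t n_head_kv, int64_t n_kv, int64_t n_stream) {
    return Shape4{ { head_dim, n_head_kv, n_kv, n_stream } };
}

// In this sketch the MTP graph reshapes as if there were exactly one stream.
bool mtp_reshape_ok(const Shape4 & k, int64_t head_dim, int64_t n_head_kv, int64_t n_kv) {
    return can_reshape_3d(k, head_dim, n_head_kv, n_kv);
}
```

With one stream the element counts match and the reshape is valid; with two streams the source holds twice as many elements as the target shape, which is the kind of condition that trips the assert.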

Impact

- Preserves the Gemma4 A4B MTP multislot crash fix
- Makes the reserve path explicit and shape-only
- Does not change Qwen MTP / NextN behavior
- Does not change real MTP decode semantics

Validation

- cmake --build build-hip-rocwmma --target llama-server -j "$(nproc)"
- CTX=4096 PARALLEL=1 BATCH=128 UBATCH=64 SPLIT_MODE=layer KV_K=turbo4 KV_V=turbo4 REASONING_BUDGET=1024 ENABLE_MTP=1 PORT=8084 NO_WARMUP=1 ~/scripts/local-opencode-llama/scripts/run-gemma4-26b-a4b-mtp.sh
- CTX=4096 PARALLEL=2 BATCH=128 UBATCH=64 SPLIT_MODE=layer KV_K=turbo4 KV_V=turbo4 REASONING_BUDGET=1024 ENABLE_MTP=0 PORT=8084 NO_WARMUP=1 ~/scripts/local-opencode-llama/scripts/run-gemma4-26b-a4b-mtp.sh
- CTX=4096 PARALLEL=2 BATCH=128 UBATCH=64 SPLIT_MODE=layer KV_K=turbo4 KV_V=turbo4 REASONING_BUDGET=1024 ENABLE_MTP=1 PORT=8084 NO_WARMUP=1 ~/scripts/local-opencode-llama/scripts/run-gemma4-26b-a4b-mtp.sh
- Tiny /v1/messages request: {"model":"gemma4-26b-a4b-mtp","max_tokens":8,"messages":[{"role":"user","content":"hi"}]}
- Claude-style /v1/messages request with system + messages
- /metrics still exposes:
  - llamacpp:speculative_drafts_generated_total{spec_type="mtp"}
  - llamacpp:speculative_draft_tokens_generated_total{spec_type="mtp"}

nycdubliner added 2 commits May 31, 2026 23:25

Add speculative draft Prometheus metrics

ca97dde

Add a reserve-only Gemma4 MTP context

b7b5a3b

github-actions bot added the documentation, examples, and server labels on Jun 1, 2026

nycdubliner wants to merge 2 commits into AtomicBot-ai:feature/turboquant-kv-cache from nycdubliner:fix-gemma4-mtp-reserve-context
