[codex] Add a reserve-only Gemma4 MTP reserve context #23
Draft
nycdubliner wants to merge 2 commits into
What changed
- add `llama_kv_cache_iswa::init_mtp_reserve(llama_ubatch)` for shape-only Gemma4 MTP graph reservation
- update `llama_context::ensure_sched_mtp()` to use the reserve-only helper
- keep `init_mtp(seq_id, ubatch)` unchanged

Why
The original crash fix stopped using `memory->init_full()` during MTP graph reservation, but it still routed the reservation through `init_mtp(0, ub)`. That worked in practice, yet it implicitly treated the reservation as if it were decoding real `seq_id 0`.

This change removes that semantic risk by introducing a dedicated reserve-only memory context with the exact single-stream, single-index topology the MTP graph needs, without depending on any user sequence id or KV state.
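The distinction between the two paths can be sketched with a toy model. This is not the code in this PR: `toy_ubatch`, `toy_mtp_context`, `make_decode_context` and `make_reserve_context` are hypothetical stand-ins, and modelling "single-index" as one index set is an assumed reading of the description above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a micro-batch; not a llama.cpp type.
struct toy_ubatch {
    uint32_t n_tokens;
};

// Hypothetical stand-in for the memory context handed to the MTP graph.
struct toy_mtp_context {
    bool     has_seq_id;    // true only when tied to a real sequence
    int32_t  seq_id;        // meaningful only if has_seq_id is true
    uint32_t n_stream;      // number of KV streams the graph sees
    uint32_t n_index_sets;  // number of slot index sets (assumed meaning of "single-index")
    uint32_t n_tokens;      // taken from the micro-batch
    bool     reads_kv_state;
};

// Decode-style path: the context is bound to a concrete sequence and its KV state.
inline toy_mtp_context make_decode_context(int32_t seq_id, const toy_ubatch & ub) {
    toy_mtp_context ctx = { true, seq_id, 1, 1, ub.n_tokens, true };
    return ctx;
}

// Reserve-only path: same single-stream, single-index shape, but no sequence id
// and no dependency on KV contents. Only the shape matters here.
inline toy_mtp_context make_reserve_context(const toy_ubatch & ub) {
    toy_mtp_context ctx = { false, -1, 1, 1, ub.n_tokens, false };
    assert(ctx.n_stream == 1 && ctx.n_index_sets == 1);
    return ctx;
}
```

In this sketch, `make_reserve_context(ub)` and `make_decode_context(0, ub)` yield the same shape fields, which is why routing the reservation through sequence 0 worked in practice. The reserve-only variant simply makes the absence of a sequence explicit instead of borrowing one.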
Root cause
With `-np > 1`, reserving the MTP graph through a full KV context built dummy slot info spanning multiple streams. Gemma4 MTP expects a single-stream topology in its reserve and draft paths, so the mismatched reserve shape tripped the `ggml_reshape_3d()` assert.

Impact
Validation
- `cmake --build build-hip-rocwmma --target llama-server -j "$(nproc)"`
- `CTX=4096 PARALLEL=1 BATCH=128 UBATCH=64 SPLIT_MODE=layer KV_K=turbo4 KV_V=turbo4 REASONING_BUDGET=1024 ENABLE_MTP=1 PORT=8084 NO_WARMUP=1 ~/scripts/local-opencode-llama/scripts/run-gemma4-26b-a4b-mtp.sh`
- `CTX=4096 PARALLEL=2 BATCH=128 UBATCH=64 SPLIT_MODE=layer KV_K=turbo4 KV_V=turbo4 REASONING_BUDGET=1024 ENABLE_MTP=0 PORT=8084 NO_WARMUP=1 ~/scripts/local-opencode-llama/scripts/run-gemma4-26b-a4b-mtp.sh`
- `CTX=4096 PARALLEL=2 BATCH=128 UBATCH=64 SPLIT_MODE=layer KV_K=turbo4 KV_V=turbo4 REASONING_BUDGET=1024 ENABLE_MTP=1 PORT=8084 NO_WARMUP=1 ~/scripts/local-opencode-llama/scripts/run-gemma4-26b-a4b-mtp.sh`
- `/v1/messages` request: `{"model":"gemma4-26b-a4b-mtp","max_tokens":8,"messages":[{"role":"user","content":"hi"}]}`
- `/v1/messages` request with `system` + `messages`
- `/metrics` still exposes:
  - `llamacpp:speculative_drafts_generated_total{spec_type="mtp"}`
  - `llamacpp:speculative_draft_tokens_generated_total{spec_type="mtp"}`
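The `/metrics` check could be automated along these lines. This is a minimal sketch, not part of the PR: `missing_metrics` and `mtp_metric_names` are hypothetical helpers, the response body is assumed to have been fetched already (for example saved to a string by whatever HTTP client is at hand), and the prefix match assumes each sample line begins with the metric name and label exactly as written in the list above.

```cpp
#include <sstream>
#include <string>
#include <vector>

// The two MTP counters named in the validation list.
inline std::vector<std::string> mtp_metric_names() {
    std::vector<std::string> names;
    names.push_back("llamacpp:speculative_drafts_generated_total{spec_type=\"mtp\"}");
    names.push_back("llamacpp:speculative_draft_tokens_generated_total{spec_type=\"mtp\"}");
    return names;
}

// Returns every name in `wanted` that does not start any sample line in `body`.
// Empty lines and comment lines (those starting with '#') are skipped.
inline std::vector<std::string> missing_metrics(const std::string & body,
                                                const std::vector<std::string> & wanted) {
    std::vector<std::string> missing;
    for (size_t i = 0; i < wanted.size(); ++i) {
        const std::string & name = wanted[i];
        bool found = false;
        std::istringstream in(body);
        std::string line;
        while (std::getline(in, line)) {
            if (line.empty() || line[0] == '#') {
                continue;
            }
            // Prefix comparison: the sample value follows the name on the same line.
            if (line.compare(0, name.size(), name) == 0) {
                found = true;
                break;
            }
        }
        if (!found) {
            missing.push_back(name);
        }
    }
    return missing;
}
```

An empty result from `missing_metrics(body, mtp_metric_names())` means both counters are still exposed. If the server emits extra labels or a different label order, the exact prefix match would need loosening.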