
[codex] Add a reserve-only Gemma4 MTP reserve context #23

Draft
nycdubliner wants to merge 2 commits into AtomicBot-ai:feature/turboquant-kv-cache from nycdubliner:fix-gemma4-mtp-reserve-context


@nycdubliner


What changed

  • add llama_kv_cache_iswa::init_mtp_reserve(llama_ubatch) for shape-only Gemma4 MTP graph reservation
  • switch llama_context::ensure_sched_mtp() to use the reserve-only helper
  • keep real MTP decode on init_mtp(seq_id, ubatch) unchanged
  • update the Gemma4 multislot MTP crash worklog with the follow-up hardening

Why

The original crash fix stopped using memory->init_full() during MTP graph reservation, but it still routed reserve through init_mtp(0, ub). That worked in practice, yet it implicitly treated reserve as if it were decoding real seq_id 0.

This change removes that semantic risk by introducing a dedicated reserve-only memory context with the exact single-stream, single-index topology the MTP graph needs, without depending on any user sequence id or KV state.
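The distinction between the decode path and the reserve path can be illustrated with a small toy model. This is only a sketch: `ToyKvCache`, `ToyMemCtx`, `ToyUbatch`, and `TOY_NO_SEQ` are hypothetical stand-ins, not the actual llama.cpp types, and the model tracks nothing beyond the stream count and whether a context is bound to a sequence.

```cpp
#include <cassert>
#include <cstdint>

// Toy stand-ins for illustration only; these are NOT the llama.cpp types.
constexpr int TOY_NO_SEQ = -1; // marks a context that is not bound to any sequence

struct ToyUbatch {
    uint32_t n_tokens;
};

struct ToyMemCtx {
    uint32_t n_stream; // how many KV streams a graph built from this context sees
    uint32_t n_tokens; // ubatch size the graph is shaped for
    int      seq_id;   // bound sequence, or TOY_NO_SEQ for shape-only contexts
};

struct ToyKvCache {
    uint32_t n_stream_total; // assumption: one stream per parallel slot

    // Full context: spans every stream, which is the shape MTP reserve must avoid.
    ToyMemCtx init_full() const {
        return ToyMemCtx{n_stream_total, 0, TOY_NO_SEQ};
    }

    // Decode path: single stream, bound to a real sequence id.
    ToyMemCtx init_mtp(int seq_id, const ToyUbatch & ub) const {
        assert(seq_id >= 0 && static_cast<uint32_t>(seq_id) < n_stream_total);
        return ToyMemCtx{1, ub.n_tokens, seq_id};
    }

    // Reserve path: same single-stream shape, but no sequence id involved.
    ToyMemCtx init_mtp_reserve(const ToyUbatch & ub) const {
        return ToyMemCtx{1, ub.n_tokens, TOY_NO_SEQ};
    }
};
```

In this model both MTP paths produce the same single-stream shape; the reserve variant simply never names a sequence, so it cannot be mistaken for a decode of `seq_id` 0.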

Root cause

When the server runs with more than one parallel slot (-np > 1), reserving the MTP graph through a full KV context built dummy slot info spanning multiple streams. Gemma4 MTP expects a single-stream topology in both its reserve and draft paths, so the mismatched reserve shape tripped the ggml_reshape_3d() assert.
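The failure mode can be sketched in isolation. `ggml_reshape_3d()` requires the source tensor's element count to equal the product of the three target dimensions; the toy below mirrors only that precondition. `ToyTensor`, `make_kv_view`, and the dimension values used with them are hypothetical and chosen for illustration, not taken from the Gemma4 graph.

```cpp
#include <cassert>
#include <cstdint>

// Toy 4-D tensor shape; NOT the ggml_tensor struct.
struct ToyTensor {
    int64_t ne[4];
};

inline int64_t nelements(const ToyTensor & t) {
    return t.ne[0] * t.ne[1] * t.ne[2] * t.ne[3];
}

// Mirrors the precondition of a 3-D reshape: the element count must be preserved.
inline bool can_reshape_3d(const ToyTensor & t, int64_t ne0, int64_t ne1, int64_t ne2) {
    return nelements(t) == ne0 * ne1 * ne2;
}

// Hypothetical KV view whose last dimension is the number of streams.
inline ToyTensor make_kv_view(int64_t head_dim, int64_t n_head, int64_t n_kv, int64_t n_stream) {
    assert(n_stream >= 1);
    return ToyTensor{{head_dim, n_head, n_kv, n_stream}};
}
```

With one stream, a view of shape (128, 8, 256, 1) reshapes cleanly to (128, 8, 256). With two streams the element count doubles, so a graph that assumes a single-stream target shape fails the check, which is the kind of mismatch the reserve path produced.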

Impact

  • preserves the Gemma4 A4B MTP multislot crash fix
  • makes the reserve path explicit and shape-only
  • does not change Qwen MTP / NextN behavior
  • does not change real MTP decode semantics

Validation

  • cmake --build build-hip-rocwmma --target llama-server -j "$(nproc)"
  • CTX=4096 PARALLEL=1 BATCH=128 UBATCH=64 SPLIT_MODE=layer KV_K=turbo4 KV_V=turbo4 REASONING_BUDGET=1024 ENABLE_MTP=1 PORT=8084 NO_WARMUP=1 ~/scripts/local-opencode-llama/scripts/run-gemma4-26b-a4b-mtp.sh
  • CTX=4096 PARALLEL=2 BATCH=128 UBATCH=64 SPLIT_MODE=layer KV_K=turbo4 KV_V=turbo4 REASONING_BUDGET=1024 ENABLE_MTP=0 PORT=8084 NO_WARMUP=1 ~/scripts/local-opencode-llama/scripts/run-gemma4-26b-a4b-mtp.sh
  • CTX=4096 PARALLEL=2 BATCH=128 UBATCH=64 SPLIT_MODE=layer KV_K=turbo4 KV_V=turbo4 REASONING_BUDGET=1024 ENABLE_MTP=1 PORT=8084 NO_WARMUP=1 ~/scripts/local-opencode-llama/scripts/run-gemma4-26b-a4b-mtp.sh
  • tiny /v1/messages request: {"model":"gemma4-26b-a4b-mtp","max_tokens":8,"messages":[{"role":"user","content":"hi"}]}
  • Claude-style /v1/messages request with system + messages
  • /metrics still exposes:
    • llamacpp:speculative_drafts_generated_total{spec_type="mtp"}
    • llamacpp:speculative_draft_tokens_generated_total{spec_type="mtp"}

github-actions bot added the documentation, examples, and server labels on Jun 1, 2026