Eval bug: Gemma 4 31B MTP (draft-mtp) crashes on Vulkan backend, pre-allocated tensor cannot run operation NONE #24492

@kostich

Name and Version

version: 9601 (4c65955)
built with GNU 16.1.1 for Linux x86_64

Operating systems

Linux

GGML backends

Vulkan

Hardware

CPU: AMD Ryzen 9 5900X
GPU: AMD Radeon RX 7900 XTX (RDNA3, gfx1100), 24GB VRAM
ROCm 7.1.1, Mesa RADV NAVI31 (Vulkan)

Models

Main: https://huggingface.co/unsloth/gemma-4-31B-it-qat-GGUF/blob/main/gemma-4-31B-it-qat-UD-Q4_K_XL.gguf
Draft: https://huggingface.co/unsloth/gemma-4-31B-it-qat-GGUF/blob/main/MTP/gemma-4-31B-it-Q4_0-MTP.gguf

Problem description & steps to reproduce

Gemma 4 31B with --spec-type draft-mtp crashes immediately on context
initialization when using the Vulkan backend. The crash does not occur with
ROCm (--device ROCm0), which works correctly at 45-60 tok/s with MTP.

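The crash happens while the draft model's context is initialized, right
after its KV cache is set up. Per the backtrace, llama_context::graph_reserve
calls ggml_backend_sched_split_graph, and the ggml backend scheduler aborts
in ggml_backend_sched_backend_id_from_cur: the pre-allocated shared KV tensor
(cache_k_l58) sits in a Vulkan0 buffer, and the scheduler finds no backend
that can run operation NONE on it there.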
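
For triage, the abort can be pictured with a toy model of the check that the
error message implies. This is a simplified sketch, not the ggml source: the
Backend, Tensor, and backend_for_preallocated names are hypothetical, and it
assumes that a pre-allocated tensor must find a backend in the scheduler that
accepts both its buffer type and its op. Whether the Vulkan0 backend is missing
from the draft context's scheduler, or rejects the buffer type, is not verified
here.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    name: str
    buffer_types: set  # buffer types this backend can use directly
    ops: set           # ops this backend can execute


@dataclass
class Tensor:
    name: str
    buffer: str        # buffer type the tensor was pre-allocated in
    op: str = "NONE"   # KV cache tensors are leaves, so their op is NONE


def backend_for_preallocated(backends, tensor):
    """Return the index of the first backend that accepts the tensor's
    buffer type and op; raise if none does (hypothetical model)."""
    for i, backend in enumerate(backends):
        if tensor.buffer in backend.buffer_types and tensor.op in backend.ops:
            return i
    raise RuntimeError(
        f"pre-allocated tensor ({tensor.name}) in a buffer ({tensor.buffer}) "
        f"that cannot run the operation ({tensor.op})"
    )


cpu = Backend("CPU", {"CPU"}, {"NONE"})
vulkan = Backend("Vulkan0", {"Vulkan0"}, {"NONE"})
cache_k = Tensor("cache_k_l58", buffer="Vulkan0")

print(backend_for_preallocated([vulkan, cpu], cache_k))  # 0

try:
    # Hypothetical failing case: no backend in the list accepts Vulkan0 buffers.
    backend_for_preallocated([cpu], cache_k)
except RuntimeError as err:
    print(err)
```

In this model the op being NONE is not the problem by itself; the lookup fails
as soon as no listed backend accepts the buffer the tensor already lives in.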

The KV sharing warnings immediately precede the crash:
llama_kv_cache: layer 3: sharing with layer 59
llama_kv_cache: layer 0: sharing with layer 58
llama_kv_cache: layer 1: sharing with layer 58
llama_kv_cache: layer 2: sharing with layer 58

This is Gemma 4's MTP draft KV sharing pattern (layers 0-3 of the draft
sharing with layers 58-59 of the main model). With the shared tensors in a
Vulkan0 buffer, the scheduler appears unable to handle this cross-context
pre-allocated tensor assignment.
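
To compare the sharing pattern across backends or builds, the warnings can be
pulled out of a saved log. The following is a minimal sketch; the
parse_kv_sharing helper and the regular expression are illustrative
assumptions based only on the log lines quoted in this report.

```python
import re
from collections import defaultdict

# Matches e.g. "llama_kv_cache: layer   3: sharing with layer 59. k = ..."
SHARE_RE = re.compile(r"llama_kv_cache: layer\s+(\d+): sharing with layer\s+(\d+)")


def parse_kv_sharing(log_text):
    """Map each source layer to the sorted list of layers that share its KV."""
    shared = defaultdict(list)
    for match in SHARE_RE.finditer(log_text):
        draft_layer, source_layer = int(match.group(1)), int(match.group(2))
        shared[source_layer].append(draft_layer)
    return {src: sorted(layers) for src, layers in shared.items()}


sample = """\
llama_kv_cache: layer   3: sharing with layer 59. k = 0x8001000, v = 0xc001000
llama_kv_cache: layer   0: sharing with layer 58. k = 0x9c01000, v = 0xa801000
llama_kv_cache: layer   1: sharing with layer 58. k = 0x9c01000, v = 0xa801000
llama_kv_cache: layer   2: sharing with layer 58. k = 0x9c01000, v = 0xa801000
"""

print(parse_kv_sharing(sample))  # {59: [3], 58: [0, 1, 2]}
```

Running the same helper on a ROCm log would show whether the sharing map
itself differs between backends or only the scheduling step does.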

Tested with both f16 and q8_0 KV cache; the crash is the same either way.
--fit off is already set, so this is not the fit-path crash from issue #24117.

Build command:
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
cmake -S . -B build -DGGML_HIP=ON -DGGML_VULKAN=ON \
  -DAMDGPU_TARGETS=gfx1100 -DCMAKE_BUILD_TYPE=Release \
  -DGGML_HIP_ROCWMMA_FATTN=ON -DLLAMA_CURL=ON

Reproduce with:
llama-server \
  --model gemma-4-31B-it-qat-UD-Q4_K_XL.gguf \
  --spec-draft-model gemma-4-31B-it-Q4_0-MTP.gguf \
  --device Vulkan0 \
  --spec-type draft-mtp \
  --spec-draft-n-max 4 \
  --ctx-size 16384 \
  --n-gpu-layers 99 \
  --parallel 1 \
  --fit off \
  --flash-attn on \
  --cache-type-k f16 \
  --cache-type-v f16 \
  --reasoning on \
  --jinja

First Bad Commit

Not bisected. PR #23398 (Gemma 4 MTP support) was merged recently, so this
may be a pre-existing Vulkan limitation rather than a regression. ROCm was
not tested before #23398, so I cannot confirm whether this is a new regression.

Relevant log output

0.00.069.844 I log_info: verbosity = 3 (adjust with the `-lv N` CLI arg)
0.00.069.846 I device_info:
0.00.069.874 I   - ROCm0   : AMD Radeon RX 7900 XTX (24560 MiB, 24502 MiB free)
0.00.069.986 I   - Vulkan0 : AMD Radeon RX 7900 XTX (RADV NAVI31) (24576 MiB, 22496 MiB free)
0.00.069.990 I   - CPU     : AMD Ryzen 9 5900X 12-Core Processor (128703 MiB, 128703 MiB free)
0.00.070.038 I system_info: n_threads = 22 (n_threads_batch = 22) / 24 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 
0.00.070.072 I srv          init: running without SSL
0.00.070.108 I srv          init: using 23 threads for HTTP server
0.00.070.208 W srv  llama_server: -----------------
0.00.070.209 W srv  llama_server: CORS proxy is enabled, do not expose server to untrusted environments
0.00.070.210 W srv  llama_server: This feature is EXPERIMENTAL and may be removed or changed in future versions
0.00.070.210 W srv  llama_server: -----------------
0.00.070.219 W srv  llama_server: -----------------
0.00.070.219 W srv  llama_server: Built-in tools are enabled, do not expose server to untrusted environments
0.00.070.220 W srv  llama_server: This feature is EXPERIMENTAL and may be changed in the future
0.00.070.220 W srv  llama_server: -----------------
0.00.070.225 I srv         start: binding port with default address family
0.00.071.425 I srv  llama_server: loading model
0.00.071.435 I srv    load_model: loading model '/home/marko/AI/models/gemma-4-31b-it/ud_qr_k_xl_qat.gguf'
0.00.455.063 W load: override 'tokenizer.ggml.add_bos_token' to 'true' for Gemma4
0.00.508.593 W load: control-looking token:    212 '</s>' was not control-type; this is probably a bug in the model. its type will be overridden
0.00.508.930 W load: control-looking token:     50 '<|tool_response>' was not control-type; this is probably a bug in the model. its type will be overridden
0.00.528.480 W load: special_eog_ids contains '<|tool_response>', removing '</s>' token from EOG list
0.04.957.722 W llama_context: n_ctx_seq (16384) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
0.05.044.384 W common_init_from_params: KV cache shifting is not supported for this context, disabling KV cache shifting
0.05.044.392 I common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
0.05.121.077 I srv    load_model: loading draft model '/home/marko/AI/models/gemma-4-31b-it/q4_0_mtp.gguf'
0.05.543.412 W load: control-looking token:    212 '</s>' was not control-type; this is probably a bug in the model. its type will be overridden
0.05.543.764 W load: control-looking token:     50 '<|tool_response>' was not control-type; this is probably a bug in the model. its type will be overridden
0.05.564.629 W load: special_eog_ids contains '<|tool_response>', removing '</s>' token from EOG list
0.05.784.913 W llama_context: n_ctx_seq (16384) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
0.05.788.435 W llama_kv_cache: layer   3: sharing with layer 59. k = 0x8001000, v = 0xc001000
0.05.788.449 W llama_kv_cache: layer   0: sharing with layer 58. k = 0x9c01000, v = 0xa801000
0.05.788.450 W llama_kv_cache: layer   1: sharing with layer 58. k = 0x9c01000, v = 0xa801000
0.05.788.451 W llama_kv_cache: layer   2: sharing with layer 58. k = 0x9c01000, v = 0xa801000
/home/marko/AI/llama.cpp/ggml/src/ggml-backend.cpp:898: pre-allocated tensor (cache_k_l58) in a buffer (Vulkan0) that cannot run the operation (NONE)
[New LWP 111011]
[New LWP 110984]
[New LWP 110983]
[New LWP 110982]
[New LWP 110981]
[New LWP 110980]
[New LWP 110979]
[New LWP 110978]
[New LWP 110977]
[New LWP 110976]
[New LWP 110975]
[New LWP 110974]
[New LWP 110973]
[New LWP 110972]
[New LWP 110971]
[New LWP 110970]
[New LWP 110969]
[New LWP 110968]
[New LWP 110967]
[New LWP 110966]
[New LWP 110965]
[New LWP 110964]
[New LWP 110963]
[New LWP 110962]
[New LWP 110961]
[New LWP 110960]
[New LWP 110959]
[New LWP 110957]
[New LWP 110956]

This GDB supports auto-downloading debuginfo from the following URLs:
  <ima:enforcing>
  <https://debuginfod.fedoraproject.org/>
  <ima:ignore>
Enable debuginfod for this session? (y or [n]) [answered N; input not from terminal]
Debuginfod has been disabled.
To make this setting permanent, add 'set debuginfod enabled off' to .gdbinit.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007feeb4082412 in __syscall_cancel_arch () from /lib64/libc.so.6
#0  0x00007feeb4082412 in __syscall_cancel_arch () from /lib64/libc.so.6
#1  0x00007feeb407662c in __internal_syscall_cancel () from /lib64/libc.so.6
#2  0x00007feeb4076674 in __syscall_cancel () from /lib64/libc.so.6
#3  0x00007feeb40e624f in wait4 () from /lib64/libc.so.6
#4  0x00007feebed5801b in ggml_print_backtrace () from /home/marko/AI/llama.cpp/build/bin/libggml-base.so.0
#5  0x00007feebed5818d in ggml_abort () from /home/marko/AI/llama.cpp/build/bin/libggml-base.so.0
#6  0x00007feebed702e1 in ggml_backend_sched_backend_id_from_cur(ggml_backend_sched*, ggml_tensor*) () from /home/marko/AI/llama.cpp/build/bin/libggml-base.so.0
#7  0x00007feebed721d4 in ggml_backend_sched_split_graph () from /home/marko/AI/llama.cpp/build/bin/libggml-base.so.0
#8  0x00007feebd848d8d in llama_context::graph_reserve(unsigned int, unsigned int, unsigned int, llama_memory_context_i const*, bool, unsigned long*) () from /home/marko/AI/llama.cpp/build/bin/libllama.so.0
#9  0x00007feebd849e42 in llama_context::sched_reserve() () from /home/marko/AI/llama.cpp/build/bin/libllama.so.0
#10 0x00007feebd8501cd in llama_context::llama_context(llama_model const&, llama_context_params) () from /home/marko/AI/llama.cpp/build/bin/libllama.so.0
#11 0x00007feebd8516a4 in llama_init_from_model () from /home/marko/AI/llama.cpp/build/bin/libllama.so.0
#12 0x00007feebe305af5 in server_context_impl::load_model(common_params&) () from /home/marko/AI/llama.cpp/build/bin/libllama-server-impl.so
#13 0x00007feebe24e0de in llama_server(int, char**) () from /home/marko/AI/llama.cpp/build/bin/libllama-server-impl.so
#14 0x00007feeb400a681 in __libc_start_call_main () from /lib64/libc.so.6
#15 0x00007feeb400a798 in __libc_start_main_impl () from /lib64/libc.so.6
#16 0x00000000004003b5 in _start ()
[Inferior 1 (process 110954) detached]
