Eval bug: Gemma4 MTP is silently disabled in case of insufficient VRAM

### Name and Version

version: 9692 (f3e182816)
built with GNU 15.2.0 for Linux x86_64


### Operating systems

Linux

### GGML backends

CUDA

### Hardware

RTX 5090

### Models

unsloth/gemma-4-31B-it-GGUF:UD-Q4_K_XL

- https://huggingface.co/unsloth/gemma-4-31B-it-GGUF

### Problem description & steps to reproduce

I run:
```
llama-server -hf unsloth/gemma-4-31B-it-GGUF:UD-Q4_K_XL --spec-type draft-mtp --spec-draft-n-max 2
```
I observe first:
```
E llama_init_from_model: failed to initialize the context: Gemma4Assistant requires ctx_other to be set
```
Followed by:
```
E ggml_backend_cuda_buffer_type_alloc_buffer: allocating 2645.29 MiB on device 0: cudaMalloc failed: out of memory
E ggml_gallocr_reserve_n_impl: failed to allocate CUDA0 buffer of size 2773782656
E graph_reserve: failed to allocate compute buffers
E llama_init_from_model: failed to initialize the context: failed to allocate compute pp buffers
```

The server continues as usual, but **MTP does not work at all**.

---

No such errors occur with Qwen3.6 27B MTP:

```
llama-server -hf unsloth/Qwen3.6-27B-MTP-GGUF:Q6_K --spec-type draft-mtp --spec-draft-n-max 2
```

The workaround is to fit the model manually, e.g. with `--ctx-size 131072` or another context size that fits in VRAM.
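As a rough sanity check, the figures that the fitting step prints in the log below can be reproduced with simple arithmetic. This is only a sketch, not the actual llama.cpp logic: `fit_deficit` is a hypothetical helper, and the 1024 MiB base margin is my assumption (it is not printed in the log, but it makes the sum match the logged target of 2959 MiB).

```python
# Figures copied from the log output in this report (all in MiB).
FREE_MIB = 31583          # "31583 MiB of free device memory"
PROJECTED_MIB = 29641     # "projected to use 29641 MiB of device memory"
MTMD_EXTRA_MIB = 1285.58  # "[mtmd] adding 1285.58 MiB to fit_params_target"
SPEC_EXTRA_MIB = 650.09   # "[spec] adding 650.09 MiB to fit_params_target"

# ASSUMPTION: base free-memory margin of 1024 MiB. Not shown in the log;
# picked because it reproduces the logged target of 2959 MiB.
BASE_TARGET_MIB = 1024


def fit_deficit(free, projected, target):
    """Hypothetical helper: MiB to give back so `target` MiB stay unused."""
    return max(0, target - (free - projected))


# Truncate to whole MiB, as the log prints integer values.
target = int(BASE_TARGET_MIB + MTMD_EXTRA_MIB + SPEC_EXTRA_MIB)
deficit = fit_deficit(FREE_MIB, PROJECTED_MIB, target)

print(f"target = {target} MiB, deficit = {deficit} MiB")
```

Under these assumptions the result matches the logged line "cannot meet free memory target of 2959 MiB, need to reduce device memory by 1017 MiB".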

- Judging by the code, both `LLM_ARCH_GEMMA4_ASSISTANT` and `LLM_ARCH_EAGLE3` are affected, but I have not tried the latter.
- Possible duplicates: #24343 and #24350.
- PR #24590 is related.


### First Bad Commit

I believe this is an incomplete feature in https://github.com/ggml-org/llama.cpp/pull/23398

### Relevant log output

<details>
<summary>Logs</summary>


```console
0.00.119.038 I common_params_handle_remote_preset: looking for remote preset at https://huggingface.co/unsloth/gemma-4-31B-it-GGUF/resolve/main/preset.ini
0.00.120.654 D common_download_file_single_online: no previous model file found /home/debian/.cache/llama.cpp/unsloth_gemma-4-31B-it-GGUF_preset.ini
0.00.363.810 I common_download_file_single_online: HEAD failed, status: 404
0.00.364.565 I common_params_handle_remote_preset: no remote preset found, skipping
0.00.989.911 D common_download_file_single_online: using cached file: /home/debian/.cache/huggingface/hub/models--unsloth--gemma-4-31B-it-GGUF/snapshots/8906b3db2e669a0b1d6293c315d3f9fbf934a86d/gemma-4-31B-it-UD-Q4_K_XL.gguf
0.00.990.151 D common_download_file_single_online: using cached file: /home/debian/.cache/huggingface/hub/models--unsloth--gemma-4-31B-it-GGUF/snapshots/8906b3db2e669a0b1d6293c315d3f9fbf934a86d/mtp-gemma-4-31B-it.gguf
0.00.990.194 D common_download_file_single_online: using cached file: /home/debian/.cache/huggingface/hub/models--unsloth--gemma-4-31B-it-GGUF/snapshots/8906b3db2e669a0b1d6293c315d3f9fbf934a86d/mmproj-BF16.gguf
0.00.990.636 I common_params_print_info: build 9692 (f3e182816) with GNU 15.2.0 for Linux x86_64
0.00.990.643 I log_info: verbosity = 2147483647 (adjust with the `-lv N` CLI arg)
0.00.990.643 I device_info:
0.01.070.150 I   - CUDA0   : NVIDIA GeForce RTX 5090 (32109 MiB, 31585 MiB free)
0.01.070.162 I   - CPU     : Intel(R) Core(TM) i7-14700K (63543 MiB, 63543 MiB free)
0.01.070.242 I system_info: n_threads = 8 (n_threads_batch = 8) / 28 | CUDA : ARCHS = 1200 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | FA_ALL_QUANTS = 1 | BLACKWELL_NATIVE_FP4 = 1 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 
0.01.070.259 I srv          init: running without SSL
0.01.070.279 I srv          init: using 27 threads for HTTP server
0.01.070.295 D srv          init: serve nocache for _app/version.json
0.01.070.391 D srv          init: serve nocache for build.json
0.01.070.397 D srv          init: serve nocache for manifest.webmanifest
0.01.070.404 D srv          init: serve nocache for sw.js
0.01.070.459 I srv         start: binding port with default address family
0.01.071.588 I srv  llama_server: loading model
0.01.071.611 I srv    load_model: loading model '/home/debian/.cache/huggingface/hub/models--unsloth--gemma-4-31B-it-GGUF/snapshots/8906b3db2e669a0b1d6293c315d3f9fbf934a86d/gemma-4-31B-it-UD-Q4_K_XL.gguf'
0.01.274.895 I srv    load_model: [mtmd] estimated worst-case memory usage of mmproj is 1292.35 MiB
0.01.274.904 D srv    load_model: [mtmd] adding 1285.58 MiB to fit_params_target for device CUDA0
0.01.274.905 D srv    load_model: [mtmd] adding 6.77 MiB to fit_params_target for device CPU
0.01.399.683 D llama_prepare_model_devices: using device CUDA0 (NVIDIA GeForce RTX 5090) (0000:01:00.0) - 31583 MiB free
0.01.517.509 D init_tokenizer: initializing tokenizer for type 2
0.01.552.484 D load: 0 unused tokens
0.01.552.607 D load: control token: 255999 '<|image>' is not marked as EOG
0.01.554.390 D load: control token: 258882 '<image|>' is not marked as EOG
0.01.555.122 D load: control token: 258883 '<audio|>' is not marked as EOG
0.01.557.912 D load: control token:     98 '<|think|>' is not marked as EOG
0.01.559.160 D load: control token:    105 '<|turn>' is not marked as EOG
0.01.559.629 D load: control token: 258880 '<|image|>' is not marked as EOG
0.01.560.865 D load: control token:      2 '<bos>' is not marked as EOG
0.01.561.588 D load: control-looking token:    212 '</s>' was not control-type; this is probably a bug in the model. its type will be overridden
0.01.561.688 D load: control token:      0 '<pad>' is not marked as EOG
0.01.561.912 D load: control-looking token:     50 '<|tool_response>' was not control-type; this is probably a bug in the model. its type will be overridden
0.01.561.921 D load: control token:     46 '<|tool>' is not marked as EOG
0.01.562.265 D load: control token:     47 '<tool|>' is not marked as EOG
0.01.562.421 D load: control token: 256000 '<|audio>' is not marked as EOG
0.01.563.942 D load: control token:      3 '<unk>' is not marked as EOG
0.01.565.134 D load: control token: 258881 '<|audio|>' is not marked as EOG
0.01.566.784 D load: control token:      4 '<mask>' is not marked as EOG
0.01.579.868 D load: printing all EOG tokens:
0.01.579.869 D load:   - 1 ('<eos>')
0.01.579.870 D load:   - 50 ('<|tool_response>')
0.01.579.870 D load:   - 106 ('<turn|>')
0.01.579.870 D load:   - 212 ('</s>')
0.01.579.874 D load: special_eog_ids contains '<|tool_response>', removing '</s>' token from EOG list
0.01.580.277 D load: special tokens cache size = 23
0.01.594.886 D load: token to piece cache size = 1.9445 MB
'
0.01.598.087 D llama_context: constructing llama_context
0.01.598.151 E llama_init_from_model: failed to initialize the context: Gemma4Assistant requires ctx_other to be set (this warning is normal during memory fitting)
0.01.770.161 I print_info: file format = GGUF V3 (latest)
0.01.770.161 I print_info: file type   = Q4_K - Medium
0.01.770.163 I print_info: file size   = 17.52 GiB (4.90 BPW) 
0.01.770.231 I llama_prepare_model_devices: using device CUDA0 (NVIDIA GeForce RTX 5090) (0000:01:00.0) - 31583 MiB free
0.01.885.614 D init_tokenizer: initializing tokenizer for type 2
0.01.920.821 I load: 0 unused tokens
0.01.920.911 D load: control token: 258884 '<|video|>' is not marked as EOG
0.01.920.968 D load: control token: 255999 '<|image>' is not marked as EOG
0.01.922.727 D load: control token: 258882 '<image|>' is not marked as EOG
0.01.923.487 D load: control token: 258883 '<audio|>' is not marked as EOG
0.01.926.276 D load: control token:     98 '<|think|>' is not marked as EOG
0.01.927.600 D load: control token:    105 '<|turn>' is not marked as EOG
0.01.928.142 D load: control token: 258880 '<|image|>' is not marked as EOG
0.01.929.874 D load: control token:      2 '<bos>' is not marked as EOG
0.01.930.702 W load: control-looking token:    212 '</s>' was not control-type; this is probably a bug in the model. its type will be overridden
0.01.930.815 D load: control token:      0 '<pad>' is not marked as EOG
0.01.931.076 W load: control-looking token:     50 '<|tool_response>' was not control-type; this is probably a bug in the model. its type will be overridden
0.01.931.091 D load: control token:     46 '<|tool>' is not marked as EOG
0.01.931.519 D load: control token:     47 '<tool|>' is not marked as EOG
0.01.931.717 D load: control token: 256000 '<|audio>' is not marked as EOG
0.01.933.548 D load: control token:      3 '<unk>' is not marked as EOG
0.01.934.918 D load: control token: 258881 '<|audio|>' is not marked as EOG
0.01.935.815 W load: control-looking token:      1 '<eos>' was not control-type; this is probably a bug in the model. its type will be overridden
0.01.936.683 D load: control token:      4 '<mask>' is not marked as EOG
0.01.947.784 I load: printing all EOG tokens:
0.01.947.786 I load:   - 1 ('<eos>')
0.01.947.786 I load:   - 50 ('<|tool_response>')
0.01.947.786 I load:   - 106 ('<turn|>')
0.01.947.787 I load:   - 212 ('</s>')
0.01.947.788 W load: special_eog_ids contains '<|tool_response>', removing '</s>' token from EOG list
0.01.948.145 I load: special tokens cache size = 24
0.01.962.060 I load: token to piece cache size = 1.9445 MB
0.01.962.069 I print_info: arch                  = gemma4
0.01.962.069 I print_info: vocab_only            = 0
0.01.962.070 I print_info: no_alloc              = 1
0.01.962.070 I print_info: n_ctx_train           = 262144
0.01.962.070 I print_info: n_embd_inp            = 5376
0.01.962.070 I print_info: n_embd                = 5376
0.01.962.070 I print_info: n_embd_out            = 5376
0.01.962.071 I print_info: n_layer               = 60
0.01.962.071 I print_info: n_layer_all           = 60
0.01.962.076 I print_info: n_head                = 32
0.01.962.080 I print_info: n_head_kv             = [16, 16, 16, 16, 16, 4, 16, 16, 16, 16, 16, 4, 16, 16, 16, 16, 16, 4, 16, 16, 16, 16, 16, 4, 16, 16, 16, 16, 16, 4, 16, 16, 16, 16, 16, 4, 16, 16, 16, 16, 16, 4, 16, 16, 16, 16, 16, 4, 16, 16, 16, 16, 16, 4, 16, 16, 16, 16, 16, 4]
0.01.962.105 I print_info: n_rot                 = 512
0.01.962.105 I print_info: n_swa                 = 1024
0.01.962.105 I print_info: is_swa_any            = 1
0.01.962.105 I print_info: n_embd_head_k         = 512
0.01.962.105 I print_info: n_embd_head_v         = 512
0.01.962.107 I print_info: n_gqa                 = [2, 2, 2, 2, 2, 8, 2, 2, 2, 2, 2, 8, 2, 2, 2, 2, 2, 8, 2, 2, 2, 2, 2, 8, 2, 2, 2, 2, 2, 8, 2, 2, 2, 2, 2, 8, 2, 2, 2, 2, 2, 8, 2, 2, 2, 2, 2, 8, 2, 2, 2, 2, 2, 8, 2, 2, 2, 2, 2, 8]
0.01.962.110 I print_info: n_embd_k_gqa          = [4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048]
0.01.962.114 I print_info: n_embd_v_gqa          = [4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048]
0.01.962.114 I print_info: f_norm_eps            = 0.0e+00
0.01.962.115 I print_info: f_norm_rms_eps        = 1.0e-06
0.01.962.115 I print_info: f_clamp_kqv           = 0.0e+00
0.01.962.115 I print_info: f_max_alibi_bias      = 0.0e+00
0.01.962.116 I print_info: f_logit_scale         = 0.0e+00
0.01.962.116 I print_info: f_attn_scale          = 1.0e+00
0.01.962.116 I print_info: f_attn_value_scale    = 0.0000
0.01.962.116 I print_info: n_ff                  = 21504
0.01.962.117 I print_info: n_expert              = 0
0.01.962.117 I print_info: n_expert_used         = 0
0.01.962.117 I print_info: n_expert_groups       = 0
0.01.962.117 I print_info: n_group_used          = 0
0.01.962.117 I print_info: causal attn           = 1
0.01.962.117 I print_info: pooling type          = -1
0.01.962.117 I print_info: rope type             = 2
0.01.962.117 I print_info: rope scaling          = linear
0.01.962.118 I print_info: freq_base_train       = 1000000.0
0.01.962.118 I print_info: freq_scale_train      = 1
0.01.962.119 I print_info: freq_base_swa         = 10000.0
0.01.962.119 I print_info: freq_scale_swa        = 1
0.01.962.119 I print_info: n_embd_head_k_swa     = 256
0.01.962.119 I print_info: n_embd_head_v_swa     = 256
0.01.962.119 I print_info: n_rot_swa             = 256
0.01.962.119 I print_info: n_ctx_orig_yarn       = 262144
0.01.962.119 I print_info: rope_yarn_log_mul     = 0.0000
0.01.962.119 I print_info: rope_finetuned        = unknown
0.01.962.120 I print_info: model type            = 31B
0.01.962.120 I print_info: model params          = 30.70 B
0.01.962.120 I print_info: general.name          = Gemma-4-31B-It
0.01.962.121 I print_info: vocab type            = BPE
0.01.962.121 I print_info: n_vocab               = 262144
0.01.962.121 I print_info: n_merges              = 514906
0.01.962.122 I print_info: BOS token             = 2 '<bos>'
0.01.962.122 I print_info: EOS token             = 106 '<turn|>'
0.01.962.122 I print_info: UNK token             = 3 '<unk>'
0.01.962.122 I print_info: PAD token             = 0 '<pad>'
0.01.962.122 I print_info: MASK token            = 4 '<mask>'
0.01.962.122 I print_info: LF token              = 107 '
'
0.01.962.122 I print_info: EOG token             = 1 '<eos>'
0.01.962.123 I print_info: EOG token             = 50 '<|tool_response>'
0.01.962.123 I print_info: EOG token             = 106 '<turn|>'
0.01.962.123 I print_info: max token length      = 93
0.01.965.554 D done_getting_tensors: tensor 'token_embd.weight' (q5_K) (and 0 others) cannot be used with preferred buffer type CUDA_Host, using CPU instead
0.01.967.699 D llama_context: n_rs_seq=2 requested but model arch does not support recurrent partial rollback; clamping to 0
0.01.967.705 W llama_context: n_ctx_seq (256) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
0.01.967.716 D set_abort_callback: call
0.01.968.725 D llama_context: enumerating backends
0.01.968.728 D llama_context: backend_ptrs.size() = 2
0.01.968.729 I sched_reserve: reserving ...
0.01.968.732 D sched_reserve: max_nodes = 6672
0.01.969.766 D sched_reserve: reserving full memory module
0.01.969.782 D sched_reserve: worst-case: n_tokens = 32, n_seqs = 1, n_outputs = 1
0.01.969.783 D graph_reserve: reserving a graph for ubatch with n_tokens =    1, n_seqs =  1, n_outputs =    1
0.01.970.350 I sched_reserve: Flash Attention was auto, set to enabled
0.01.970.351 I sched_reserve: resolving fused Gated Delta Net support:
0.01.970.351 D graph_reserve: reserving a graph for ubatch with n_tokens =    1, n_seqs =  1, n_outputs =    1
0.01.970.738 I sched_reserve: fused Gated Delta Net (autoregressive) enabled
0.01.970.739 D graph_reserve: reserving a graph for ubatch with n_tokens =   16, n_seqs =  1, n_outputs =   16
0.01.971.111 I sched_reserve: fused Gated Delta Net (chunked) enabled
0.01.971.112 D graph_reserve: reserving a graph for ubatch with n_tokens =   32, n_seqs =  1, n_outputs =    3
0.01.971.826 D graph_reserve: reserving a graph for ubatch with n_tokens =    1, n_seqs =  1, n_outputs =    1
0.01.972.200 D graph_reserve: reserving a graph for ubatch with n_tokens =   32, n_seqs =  1, n_outputs =    3
0.01.972.592 I sched_reserve:      CUDA0 compute buffer size =    12.50 MiB
0.01.972.593 I sched_reserve:  CUDA_Host compute buffer size =     2.63 MiB
0.01.972.593 I sched_reserve: graph nodes  = 3179
0.01.972.593 I sched_reserve: graph splits = 2
0.01.972.595 I sched_reserve: reserve took 3.86 ms, sched copies = 1
0.02.096.379 D llama_prepare_model_devices: using device CUDA0 (NVIDIA GeForce RTX 5090) (0000:01:00.0) - 31583 MiB free
0.02.211.418 D init_tokenizer: initializing tokenizer for type 2
0.02.247.564 D load: 0 unused tokens
0.02.247.688 D load: control token: 255999 '<|image>' is not marked as EOG
0.02.249.432 D load: control token: 258882 '<image|>' is not marked as EOG
0.02.250.124 D load: control token: 258883 '<audio|>' is not marked as EOG
0.02.253.019 D load: control token:     98 '<|think|>' is not marked as EOG
0.02.254.364 D load: control token:    105 '<|turn>' is not marked as EOG
0.02.254.905 D load: control token: 258880 '<|image|>' is not marked as EOG
0.02.256.636 D load: control token:      2 '<bos>' is not marked as EOG
0.02.257.437 D load: control-looking token:    212 '</s>' was not control-type; this is probably a bug in the model. its type will be overridden
0.02.257.524 D load: control token:      0 '<pad>' is not marked as EOG
0.02.257.782 D load: control-looking token:     50 '<|tool_response>' was not control-type; this is probably a bug in the model. its type will be overridden
0.02.257.793 D load: control token:     46 '<|tool>' is not marked as EOG
0.02.258.187 D load: control token:     47 '<tool|>' is not marked as EOG
0.02.258.358 D load: control token: 256000 '<|audio>' is not marked as EOG
0.02.260.118 D load: control token:      3 '<unk>' is not marked as EOG
0.02.261.491 D load: control token: 258881 '<|audio|>' is not marked as EOG
0.02.263.241 D load: control token:      4 '<mask>' is not marked as EOG
0.02.273.840 D load: printing all EOG tokens:
0.02.273.842 D load:   - 1 ('<eos>')
0.02.273.843 D load:   - 50 ('<|tool_response>')
0.02.273.843 D load:   - 106 ('<turn|>')
0.02.273.843 D load:   - 212 ('</s>')
0.02.273.844 D load: special_eog_ids contains '<|tool_response>', removing '</s>' token from EOG list
0.02.274.195 D load: special tokens cache size = 23
0.02.288.790 D load: token to piece cache size = 1.9445 MB
'
0.02.290.869 D llama_context: constructing llama_context
0.02.290.871 D llama_context: n_seq_max     = 1
0.02.290.872 D llama_context: n_ctx         = 262144
0.02.290.872 D llama_context: n_ctx_seq     = 262144
0.02.290.872 D llama_context: n_batch       = 2048
0.02.290.872 D llama_context: n_ubatch      = 1024
0.02.290.872 D llama_context: causal_attn   = 1
0.02.290.872 D llama_context: flash_attn    = auto
0.02.290.872 D llama_context: kv_unified    = false
0.02.290.873 D llama_context: freq_base     = 1000000.0
0.02.290.873 D llama_context: freq_scale    = 1
0.02.290.874 D llama_context: n_rs_seq      = 0
0.02.290.874 D llama_context: n_outputs_max = 1
0.02.290.880 D set_abort_callback: call
0.02.290.941 D llama_context:  CUDA_Host  output buffer size =     1.00 MiB
0.02.291.048 D llama_context: enumerating backends
0.02.291.049 D llama_context: backend_ptrs.size() = 2
0.02.291.050 D sched_reserve: reserving ...
0.02.291.050 D sched_reserve: max_nodes = 1024
0.02.291.207 D sched_reserve: reserving full memory module
0.02.291.209 D sched_reserve: worst-case: n_tokens = 1024, n_seqs = 1, n_outputs = 1
0.02.291.210 D graph_reserve: reserving a graph for ubatch with n_tokens =    1, n_seqs =  1, n_outputs =    1
0.02.291.272 D sched_reserve: Flash Attention was auto, set to enabled
0.02.291.272 D sched_reserve: resolving fused Gated Delta Net support:
0.02.291.272 D graph_reserve: reserving a graph for ubatch with n_tokens =    1, n_seqs =  1, n_outputs =    1
0.02.291.306 D sched_reserve: fused Gated Delta Net (autoregressive) enabled
0.02.291.307 D graph_reserve: reserving a graph for ubatch with n_tokens =   16, n_seqs =  1, n_outputs =   16
0.02.291.339 D sched_reserve: fused Gated Delta Net (chunked) enabled
0.02.291.339 D graph_reserve: reserving a graph for ubatch with n_tokens = 1024, n_seqs =  1, n_outputs =    1
0.02.291.417 D graph_reserve: reserving a graph for ubatch with n_tokens =    1, n_seqs =  1, n_outputs =    1
0.02.291.451 D graph_reserve: reserving a graph for ubatch with n_tokens = 1024, n_seqs =  1, n_outputs =    1
0.02.291.483 D sched_reserve:      CUDA0 compute buffer size =   174.29 MiB
0.02.291.484 D sched_reserve:  CUDA_Host compute buffer size =    44.29 MiB
0.02.291.484 D sched_reserve: graph nodes  = 150
0.02.291.484 D sched_reserve: graph splits = 2
0.02.291.484 D sched_reserve: reserve took 0.43 ms, sched copies = 1
0.02.291.592 I common_memory_breakdown_print: | memory breakdown [MiB] | total    free    self   model   context   compute    unaccounted |
0.02.291.593 I common_memory_breakdown_print: |   - CUDA0 (RTX 5090)   | 32109 = 31583 + ( 650 =   475 +       0 +     174) +        -124 |
0.02.291.593 I common_memory_breakdown_print: |   - Host               |                   316 =   272 +       0 +      44                |
0.02.343.472 D srv    load_model: [spec] adding 650.09 MiB to fit_params_target for device CUDA0
0.02.343.475 I srv    load_model: [spec] estimated memory usage of draft model is 650.09 MiB
0.02.394.629 I common_init_result: fitting params to device memory ...
0.02.394.631 I common_init_result: (for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on)
0.02.394.634 I common_params_fit_impl: getting device memory data for initial parameters:
0.02.516.016 I print_info: file format = GGUF V3 (latest)
0.02.516.016 I print_info: file type   = Q4_K - Medium
0.02.516.018 I print_info: file size   = 17.52 GiB (4.90 BPW) 
0.02.516.090 I llama_prepare_model_devices: using device CUDA0 (NVIDIA GeForce RTX 5090) (0000:01:00.0) - 31583 MiB free
0.02.616.467 D init_tokenizer: initializing tokenizer for type 2
0.02.651.109 I load: 0 unused tokens
0.02.651.199 D load: control token: 258884 '<|video|>' is not marked as EOG
0.02.651.255 D load: control token: 255999 '<|image>' is not marked as EOG
0.02.653.056 D load: control token: 258882 '<image|>' is not marked as EOG
0.02.653.808 D load: control token: 258883 '<audio|>' is not marked as EOG
0.02.656.627 D load: control token:     98 '<|think|>' is not marked as EOG
0.02.657.929 D load: control token:    105 '<|turn>' is not marked as EOG
0.02.658.400 D load: control token: 258880 '<|image|>' is not marked as EOG
0.02.659.609 D load: control token:      2 '<bos>' is not marked as EOG
0.02.660.338 W load: control-looking token:    212 '</s>' was not control-type; this is probably a bug in the model. its type will be overridden
0.02.660.440 D load: control token:      0 '<pad>' is not marked as EOG
0.02.660.679 W load: control-looking token:     50 '<|tool_response>' was not control-type; this is probably a bug in the model. its type will be overridden
0.02.660.688 D load: control token:     46 '<|tool>' is not marked as EOG
0.02.661.059 D load: control token:     47 '<tool|>' is not marked as EOG
0.02.661.213 D load: control token: 256000 '<|audio>' is not marked as EOG
0.02.662.869 D load: control token:      3 '<unk>' is not marked as EOG
0.02.664.074 D load: control token: 258881 '<|audio|>' is not marked as EOG
0.02.664.867 W load: control-looking token:      1 '<eos>' was not control-type; this is probably a bug in the model. its type will be overridden
0.02.665.642 D load: control token:      4 '<mask>' is not marked as EOG
0.02.678.727 I load: printing all EOG tokens:
0.02.678.729 I load:   - 1 ('<eos>')
0.02.678.729 I load:   - 50 ('<|tool_response>')
0.02.678.729 I load:   - 106 ('<turn|>')
0.02.678.729 I load:   - 212 ('</s>')
0.02.678.731 W load: special_eog_ids contains '<|tool_response>', removing '</s>' token from EOG list
0.02.679.168 I load: special tokens cache size = 24
0.02.692.337 I load: token to piece cache size = 1.9445 MB
0.02.692.346 I print_info: arch                  = gemma4
0.02.692.346 I print_info: vocab_only            = 0
0.02.692.346 I print_info: no_alloc              = 1
0.02.692.346 I print_info: n_ctx_train           = 262144
0.02.692.347 I print_info: n_embd_inp            = 5376
0.02.692.347 I print_info: n_embd                = 5376
0.02.692.347 I print_info: n_embd_out            = 5376
0.02.692.347 I print_info: n_layer               = 60
0.02.692.347 I print_info: n_layer_all           = 60
0.02.692.351 I print_info: n_head                = 32
0.02.692.355 I print_info: n_head_kv             = [16, 16, 16, 16, 16, 4, 16, 16, 16, 16, 16, 4, 16, 16, 16, 16, 16, 4, 16, 16, 16, 16, 16, 4, 16, 16, 16, 16, 16, 4, 16, 16, 16, 16, 16, 4, 16, 16, 16, 16, 16, 4, 16, 16, 16, 16, 16, 4, 16, 16, 16, 16, 16, 4, 16, 16, 16, 16, 16, 4]
0.02.692.355 I print_info: n_rot                 = 512
0.02.692.355 I print_info: n_swa                 = 1024
0.02.692.355 I print_info: is_swa_any            = 1
0.02.692.355 I print_info: n_embd_head_k         = 512
0.02.692.355 I print_info: n_embd_head_v         = 512
0.02.692.358 I print_info: n_gqa                 = [2, 2, 2, 2, 2, 8, 2, 2, 2, 2, 2, 8, 2, 2, 2, 2, 2, 8, 2, 2, 2, 2, 2, 8, 2, 2, 2, 2, 2, 8, 2, 2, 2, 2, 2, 8, 2, 2, 2, 2, 2, 8, 2, 2, 2, 2, 2, 8, 2, 2, 2, 2, 2, 8, 2, 2, 2, 2, 2, 8]
0.02.692.361 I print_info: n_embd_k_gqa          = [4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048]
0.02.692.363 I print_info: n_embd_v_gqa          = [4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048]
0.02.692.363 I print_info: f_norm_eps            = 0.0e+00
0.02.692.365 I print_info: f_norm_rms_eps        = 1.0e-06
0.02.692.365 I print_info: f_clamp_kqv           = 0.0e+00
0.02.692.365 I print_info: f_max_alibi_bias      = 0.0e+00
0.02.692.365 I print_info: f_logit_scale         = 0.0e+00
0.02.692.365 I print_info: f_attn_scale          = 1.0e+00
0.02.692.365 I print_info: f_attn_value_scale    = 0.0000
0.02.692.366 I print_info: n_ff                  = 21504
0.02.692.366 I print_info: n_expert              = 0
0.02.692.366 I print_info: n_expert_used         = 0
0.02.692.367 I print_info: n_expert_groups       = 0
0.02.692.367 I print_info: n_group_used          = 0
0.02.692.367 I print_info: causal attn           = 1
0.02.692.367 I print_info: pooling type          = -1
0.02.692.367 I print_info: rope type             = 2
0.02.692.367 I print_info: rope scaling          = linear
0.02.692.368 I print_info: freq_base_train       = 1000000.0
0.02.692.368 I print_info: freq_scale_train      = 1
0.02.692.368 I print_info: freq_base_swa         = 10000.0
0.02.692.368 I print_info: freq_scale_swa        = 1
0.02.692.369 I print_info: n_embd_head_k_swa     = 256
0.02.692.369 I print_info: n_embd_head_v_swa     = 256
0.02.692.369 I print_info: n_rot_swa             = 256
0.02.692.369 I print_info: n_ctx_orig_yarn       = 262144
0.02.692.369 I print_info: rope_yarn_log_mul     = 0.0000
0.02.692.369 I print_info: rope_finetuned        = unknown
0.02.692.370 I print_info: model type            = 31B
0.02.692.370 I print_info: model params          = 30.70 B
0.02.692.370 I print_info: general.name          = Gemma-4-31B-It
0.02.692.371 I print_info: vocab type            = BPE
0.02.692.371 I print_info: n_vocab               = 262144
0.02.692.371 I print_info: n_merges              = 514906
0.02.692.371 I print_info: BOS token             = 2 '<bos>'
0.02.692.372 I print_info: EOS token             = 106 '<turn|>'
0.02.692.372 I print_info: UNK token             = 3 '<unk>'
0.02.692.372 I print_info: PAD token             = 0 '<pad>'
0.02.692.372 I print_info: MASK token            = 4 '<mask>'
0.02.692.372 I print_info: LF token              = 107 '
'
0.02.692.372 I print_info: EOG token             = 1 '<eos>'
0.02.692.373 I print_info: EOG token             = 50 '<|tool_response>'
0.02.692.373 I print_info: EOG token             = 106 '<turn|>'
0.02.692.373 I print_info: max token length      = 93
0.02.695.418 D done_getting_tensors: tensor 'token_embd.weight' (q5_K) (and 0 others) cannot be used with preferred buffer type CUDA_Host, using CPU instead
0.02.697.320 D llama_context: n_rs_seq=2 requested but model arch does not support recurrent partial rollback; clamping to 0
0.02.697.330 D set_abort_callback: call
0.02.699.188 D llama_context: enumerating backends
0.02.699.190 D llama_context: backend_ptrs.size() = 2
0.02.699.190 I sched_reserve: reserving ...
0.02.699.191 D sched_reserve: max_nodes = 6672
0.02.699.861 D sched_reserve: reserving full memory module
0.02.699.865 D sched_reserve: worst-case: n_tokens = 1024, n_seqs = 1, n_outputs = 1
0.02.699.866 D graph_reserve: reserving a graph for ubatch with n_tokens =    1, n_seqs =  1, n_outputs =    1
0.02.700.355 I sched_reserve: Flash Attention was auto, set to enabled
0.02.700.355 I sched_reserve: resolving fused Gated Delta Net support:
0.02.700.356 D graph_reserve: reserving a graph for ubatch with n_tokens =    1, n_seqs =  1, n_outputs =    1
0.02.700.752 I sched_reserve: fused Gated Delta Net (autoregressive) enabled
0.02.700.753 D graph_reserve: reserving a graph for ubatch with n_tokens =   16, n_seqs =  1, n_outputs =   16
0.02.701.137 I sched_reserve: fused Gated Delta Net (chunked) enabled
0.02.701.138 D graph_reserve: reserving a graph for ubatch with n_tokens = 1024, n_seqs =  1, n_outputs =    3
0.02.701.978 D graph_reserve: reserving a graph for ubatch with n_tokens =    1, n_seqs =  1, n_outputs =    1
0.02.702.374 D graph_reserve: reserving a graph for ubatch with n_tokens = 1024, n_seqs =  1, n_outputs =    3
0.02.702.758 I sched_reserve:      CUDA0 compute buffer size =  2735.32 MiB
0.02.702.760 I sched_reserve:  CUDA_Host compute buffer size =   559.32 MiB
0.02.702.760 I sched_reserve: graph nodes  = 3179
0.02.702.760 I sched_reserve: graph splits = 2
0.02.702.760 I sched_reserve: reserve took 3.57 ms, sched copies = 1
0.02.702.900 I common_memory_breakdown_print: | memory breakdown [MiB] | total    free     self   model   context   compute    unaccounted |
0.02.702.901 I common_memory_breakdown_print: |   - CUDA0 (RTX 5090)   | 32109 = 31583 + (29641 = 17935 +    8970 +    2735) +      -29115 |
0.02.702.901 I common_memory_breakdown_print: |   - Host               |                   1483 =   924 +       0 +     559                |
0.02.758.255 I common_params_fit_impl: projected to use 29641 MiB of device memory vs. 31583 MiB of free device memory
0.02.758.258 I common_params_fit_impl: cannot meet free memory target of 2959 MiB, need to reduce device memory by 1017 MiB
0.02.878.717 I print_info: file format = GGUF V3 (latest)
0.02.878.718 I print_info: file type   = Q4_K - Medium
0.02.878.720 I print_info: file size   = 17.52 GiB (4.90 BPW) 
0.02.878.784 I llama_prepare_model_devices: using device CUDA0 (NVIDIA GeForce RTX 5090) (0000:01:00.0) - 31583 MiB free
0.02.983.940 D init_tokenizer: initializing tokenizer for type 2
0.03.019.348 I load: 0 unused tokens
0.03.019.436 D load: control token: 258884 '<|video|>' is not marked as EOG
0.03.019.493 D load: control token: 255999 '<|image>' is not marked as EOG
0.03.021.299 D load: control token: 258882 '<image|>' is not marked as EOG
0.03.022.042 D load: control token: 258883 '<audio|>' is not marked as EOG
0.03.024.866 D load: control token:     98 '<|think|>' is not marked as EOG
0.03.026.063 D load: control token:    105 '<|turn>' is not marked as EOG
0.03.026.509 D load: control token: 258880 '<|image|>' is not marked as EOG
0.03.027.611 D load: control token:      2 '<bos>' is not marked as EOG
0.03.028.270 W load: control-looking token:    212 '</s>' was not control-type; this is probably a bug in the model. its type will be overridden
0.03.028.345 D load: control token:      0 '<pad>' is not marked as EOG
0.03.028.582 W load: control-looking token:     50 '<|tool_response>' was not control-type; this is probably a bug in the model. its type will be overridden
0.03.028.591 D load: control token:     46 '<|tool>' is not marked as EOG
0.03.028.920 D load: control token:     47 '<tool|>' is not marked as EOG
0.03.029.060 D load: control token: 256000 '<|audio>' is not marked as EOG
0.03.030.534 D load: control token:      3 '<unk>' is not marked as EOG
0.03.031.693 D load: control token: 258881 '<|audio|>' is not marked as EOG
0.03.032.456 W load: control-looking token:      1 '<eos>' was not control-type; this is probably a bug in the model. its type will be overridden
0.03.033.339 D load: control token:      4 '<mask>' is not marked as EOG
0.03.048.878 I load: printing all EOG tokens:
0.03.048.880 I load:   - 1 ('<eos>')
0.03.048.881 I load:   - 50 ('<|tool_response>')
0.03.048.881 I load:   - 106 ('<turn|>')
0.03.048.881 I load:   - 212 ('</s>')
0.03.048.883 W load: special_eog_ids contains '<|tool_response>', removing '</s>' token from EOG list
0.03.049.376 I load: special tokens cache size = 24
0.03.062.604 I load: token to piece cache size = 1.9445 MB
0.03.062.614 I print_info: arch                  = gemma4
0.03.062.615 I print_info: vocab_only            = 0
0.03.062.615 I print_info: no_alloc              = 1
0.03.062.615 I print_info: n_ctx_train           = 262144
0.03.062.616 I print_info: n_embd_inp            = 5376
0.03.062.616 I print_info: n_embd                = 5376
0.03.062.616 I print_info: n_embd_out            = 5376
0.03.062.616 I print_info: n_layer               = 60
0.03.062.616 I print_info: n_layer_all           = 60
0.03.062.621 I print_info: n_head                = 32
0.03.062.626 I print_info: n_head_kv             = [16, 16, 16, 16, 16, 4, 16, 16, 16, 16, 16, 4, 16, 16, 16, 16, 16, 4, 16, 16, 16, 16, 16, 4, 16, 16, 16, 16, 16, 4, 16, 16, 16, 16, 16, 4, 16, 16, 16, 16, 16, 4, 16, 16, 16, 16, 16, 4, 16, 16, 16, 16, 16, 4, 16, 16, 16, 16, 16, 4]
0.03.062.626 I print_info: n_rot                 = 512
0.03.062.626 I print_info: n_swa                 = 1024
0.03.062.627 I print_info: is_swa_any            = 1
0.03.062.627 I print_info: n_embd_head_k         = 512
0.03.062.627 I print_info: n_embd_head_v         = 512
0.03.062.629 I print_info: n_gqa                 = [2, 2, 2, 2, 2, 8, 2, 2, 2, 2, 2, 8, 2, 2, 2, 2, 2, 8, 2, 2, 2, 2, 2, 8, 2, 2, 2, 2, 2, 8, 2, 2, 2, 2, 2, 8, 2, 2, 2, 2, 2, 8, 2, 2, 2, 2, 2, 8, 2, 2, 2, 2, 2, 8, 2, 2, 2, 2, 2, 8]
0.03.062.656 I print_info: n_embd_k_gqa          = [4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048]
0.03.062.658 I print_info: n_embd_v_gqa          = [4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048]
0.03.062.659 I print_info: f_norm_eps            = 0.0e+00
0.03.062.660 I print_info: f_norm_rms_eps        = 1.0e-06
0.03.062.660 I print_info: f_clamp_kqv           = 0.0e+00
0.03.062.660 I print_info: f_max_alibi_bias      = 0.0e+00
0.03.062.660 I print_info: f_logit_scale         = 0.0e+00
0.03.062.660 I print_info: f_attn_scale          = 1.0e+00
0.03.062.660 I print_info: f_attn_value_scale    = 0.0000
0.03.062.661 I print_info: n_ff                  = 21504
0.03.062.661 I print_info: n_expert              = 0
0.03.062.661 I print_info: n_expert_used         = 0
0.03.062.661 I print_info: n_expert_groups       = 0
0.03.062.661 I print_info: n_group_used          = 0
0.03.062.661 I print_info: causal attn           = 1
0.03.062.662 I print_info: pooling type          = -1
0.03.062.662 I print_info: rope type             = 2
0.03.062.662 I print_info: rope scaling          = linear
0.03.062.662 I print_info: freq_base_train       = 1000000.0
0.03.062.663 I print_info: freq_scale_train      = 1
0.03.062.663 I print_info: freq_base_swa         = 10000.0
0.03.062.663 I print_info: freq_scale_swa        = 1
0.03.062.663 I print_info: n_embd_head_k_swa     = 256
0.03.062.663 I print_info: n_embd_head_v_swa     = 256
0.03.062.663 I print_info: n_rot_swa             = 256
0.03.062.664 I print_info: n_ctx_orig_yarn       = 262144
0.03.062.664 I print_info: rope_yarn_log_mul     = 0.0000
0.03.062.664 I print_info: rope_finetuned        = unknown
0.03.062.664 I print_info: model type            = 31B
0.03.062.665 I print_info: model params          = 30.70 B
0.03.062.665 I print_info: general.name          = Gemma-4-31B-It
0.03.062.665 I print_info: vocab type            = BPE
0.03.062.666 I print_info: n_vocab               = 262144
0.03.062.666 I print_info: n_merges              = 514906
0.03.062.666 I print_info: BOS token             = 2 '<bos>'
0.03.062.666 I print_info: EOS token             = 106 '<turn|>'
0.03.062.666 I print_info: UNK token             = 3 '<unk>'
0.03.062.667 I print_info: PAD token             = 0 '<pad>'
0.03.062.667 I print_info: MASK token            = 4 '<mask>'
0.03.062.667 I print_info: LF token              = 107 '
'
0.03.062.667 I print_info: EOG token             = 1 '<eos>'
0.03.062.667 I print_info: EOG token             = 50 '<|tool_response>'
0.03.062.667 I print_info: EOG token             = 106 '<turn|>'
0.03.062.667 I print_info: max token length      = 93
0.03.065.687 D done_getting_tensors: tensor 'token_embd.weight' (q5_K) (and 0 others) cannot be used with preferred buffer type CUDA_Host, using CPU instead
0.03.067.661 D llama_context: n_rs_seq=2 requested but model arch does not support recurrent partial rollback; clamping to 0
0.03.067.665 W llama_context: n_ctx_seq (4096) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
0.03.067.670 D set_abort_callback: call
0.03.068.607 D llama_context: enumerating backends
0.03.068.609 D llama_context: backend_ptrs.size() = 2
0.03.068.609 I sched_reserve: reserving ...
0.03.068.610 D sched_reserve: max_nodes = 6672
0.03.069.473 D sched_reserve: reserving full memory module
0.03.069.477 D sched_reserve: worst-case: n_tokens = 1024, n_seqs = 1, n_outputs = 1
0.03.069.478 D graph_reserve: reserving a graph for ubatch with n_tokens =    1, n_seqs =  1, n_outputs =    1
0.03.069.948 I sched_reserve: Flash Attention was auto, set to enabled
0.03.069.949 I sched_reserve: resolving fused Gated Delta Net support:
0.03.069.950 D graph_reserve: reserving a graph for ubatch with n_tokens =    1, n_seqs =  1, n_outputs =    1
0.03.070.358 I sched_reserve: fused Gated Delta Net (autoregressive) enabled
0.03.070.359 D graph_reserve: reserving a graph for ubatch with n_tokens =   16, n_seqs =  1, n_outputs =   16
0.03.070.757 I sched_reserve: fused Gated Delta Net (chunked) enabled
0.03.070.758 D graph_reserve: reserving a graph for ubatch with n_tokens = 1024, n_seqs =  1, n_outputs =    3
0.03.071.542 D graph_reserve: reserving a graph for ubatch with n_tokens =    1, n_seqs =  1, n_outputs =    1
0.03.071.941 D graph_reserve: reserving a graph for ubatch with n_tokens = 1024, n_seqs =  1, n_outputs =    3
0.03.072.339 I sched_reserve:      CUDA0 compute buffer size =   371.32 MiB
0.03.072.340 I sched_reserve:  CUDA_Host compute buffer size =    55.32 MiB
0.03.072.341 I sched_reserve: graph nodes  = 3179
0.03.072.341 I sched_reserve: graph splits = 2
0.03.072.341 I sched_reserve: reserve took 3.73 ms, sched copies = 1
0.03.072.483 I common_memory_breakdown_print: | memory breakdown [MiB] | total    free     self   model   context   compute    unaccounted |
0.03.072.484 I common_memory_breakdown_print: |   - CUDA0 (RTX 5090)   | 32109 = 31583 + (19087 = 17935 +     780 +     371) +      -18561 |
0.03.072.484 I common_memory_breakdown_print: |   - Host               |                    979 =   924 +       0 +      55                |
0.03.125.155 I common_params_fit_impl: context size reduced from 262144 to 237056 -> need 1026 MiB less memory in total
0.03.125.161 I common_params_fit_impl: entire model can be fit by reducing context
0.03.125.161 I common_fit_params: successfully fit params to free device memory
0.03.125.164 I common_fit_params: fitting params to free memory took 0.73 seconds
0.03.246.430 I print_info: file format = GGUF V3 (latest)
0.03.246.430 I print_info: file type   = Q4_K - Medium
0.03.246.432 I print_info: file size   = 17.52 GiB (4.90 BPW) 
0.03.246.473 I llama_prepare_model_devices: using device CUDA0 (NVIDIA GeForce RTX 5090) (0000:01:00.0) - 31583 MiB free
0.03.353.296 D init_tokenizer: initializing tokenizer for type 2
0.03.387.510 I load: 0 unused tokens
0.03.387.576 D load: control token: 258884 '<|video|>' is not marked as EOG
0.03.387.632 D load: control token: 255999 '<|image>' is not marked as EOG
0.03.389.349 D load: control token: 258882 '<image|>' is not marked as EOG
0.03.390.057 D load: control token: 258883 '<audio|>' is not marked as EOG
0.03.392.768 D load: control token:     98 '<|think|>' is not marked as EOG
0.03.394.039 D load: control token:    105 '<|turn>' is not marked as EOG
0.03.394.555 D load: control token: 258880 '<|image|>' is not marked as EOG
0.03.396.329 D load: control token:      2 '<bos>' is not marked as EOG
0.03.397.159 W load: control-looking token:    212 '</s>' was not control-type; this is probably a bug in the model. its type will be overridden
0.03.397.244 D load: control token:      0 '<pad>' is not marked as EOG
0.03.397.515 W load: control-looking token:     50 '<|tool_response>' was not control-type; this is probably a bug in the model. its type will be overridden
0.03.397.526 D load: control token:     46 '<|tool>' is not marked as EOG
0.03.397.936 D load: control token:     47 '<tool|>' is not marked as EOG
0.03.398.107 D load: control token: 256000 '<|audio>' is not marked as EOG
0.03.399.864 D load: control token:      3 '<unk>' is not marked as EOG
0.03.401.266 D load: control token: 258881 '<|audio|>' is not marked as EOG
0.03.402.205 W load: control-looking token:      1 '<eos>' was not control-type; this is probably a bug in the model. its type will be overridden
0.03.403.099 D load: control token:      4 '<mask>' is not marked as EOG
0.03.413.906 I load: printing all EOG tokens:
0.03.413.908 I load:   - 1 ('<eos>')
0.03.413.908 I load:   - 50 ('<|tool_response>')
0.03.413.908 I load:   - 106 ('<turn|>')
0.03.413.909 I load:   - 212 ('</s>')
0.03.413.909 W load: special_eog_ids contains '<|tool_response>', removing '</s>' token from EOG list
0.03.414.248 I load: special tokens cache size = 24
0.03.427.454 I load: token to piece cache size = 1.9445 MB
0.03.427.464 I print_info: arch                  = gemma4
0.03.427.465 I print_info: vocab_only            = 0
0.03.427.465 I print_info: no_alloc              = 0
0.03.427.465 I print_info: n_ctx_train           = 262144
0.03.427.465 I print_info: n_embd_inp            = 5376
0.03.427.465 I print_info: n_embd                = 5376
0.03.427.466 I print_info: n_embd_out            = 5376
0.03.427.466 I print_info: n_layer               = 60
0.03.427.466 I print_info: n_layer_all           = 60
0.03.427.471 I print_info: n_head                = 32
0.03.427.475 I print_info: n_head_kv             = [16, 16, 16, 16, 16, 4, 16, 16, 16, 16, 16, 4, 16, 16, 16, 16, 16, 4, 16, 16, 16, 16, 16, 4, 16, 16, 16, 16, 16, 4, 16, 16, 16, 16, 16, 4, 16, 16, 16, 16, 16, 4, 16, 16, 16, 16, 16, 4, 16, 16, 16, 16, 16, 4, 16, 16, 16, 16, 16, 4]
0.03.427.475 I print_info: n_rot                 = 512
0.03.427.475 I print_info: n_swa                 = 1024
0.03.427.476 I print_info: is_swa_any            = 1
0.03.427.476 I print_info: n_embd_head_k         = 512
0.03.427.476 I print_info: n_embd_head_v         = 512
0.03.427.478 I print_info: n_gqa                 = [2, 2, 2, 2, 2, 8, 2, 2, 2, 2, 2, 8, 2, 2, 2, 2, 2, 8, 2, 2, 2, 2, 2, 8, 2, 2, 2, 2, 2, 8, 2, 2, 2, 2, 2, 8, 2, 2, 2, 2, 2, 8, 2, 2, 2, 2, 2, 8, 2, 2, 2, 2, 2, 8, 2, 2, 2, 2, 2, 8]
0.03.427.505 I print_info: n_embd_k_gqa          = [4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048]
0.03.427.508 I print_info: n_embd_v_gqa          = [4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048]
0.03.427.508 I print_info: f_norm_eps            = 0.0e+00
0.03.427.509 I print_info: f_norm_rms_eps        = 1.0e-06
0.03.427.509 I print_info: f_clamp_kqv           = 0.0e+00
0.03.427.509 I print_info: f_max_alibi_bias      = 0.0e+00
0.03.427.510 I print_info: f_logit_scale         = 0.0e+00
0.03.427.510 I print_info: f_attn_scale          = 1.0e+00
0.03.427.510 I print_info: f_attn_value_scale    = 0.0000
0.03.427.511 I print_info: n_ff                  = 21504
0.03.427.511 I print_info: n_expert              = 0
0.03.427.511 I print_info: n_expert_used         = 0
0.03.427.511 I print_info: n_expert_groups       = 0
0.03.427.511 I print_info: n_group_used          = 0
0.03.427.511 I print_info: causal attn           = 1
0.03.427.511 I print_info: pooling type          = -1
0.03.427.511 I print_info: rope type             = 2
0.03.427.511 I print_info: rope scaling          = linear
0.03.427.512 I print_info: freq_base_train       = 1000000.0
0.03.427.512 I print_info: freq_scale_train      = 1
0.03.427.513 I print_info: freq_base_swa         = 10000.0
0.03.427.513 I print_info: freq_scale_swa        = 1
0.03.427.513 I print_info: n_embd_head_k_swa     = 256
0.03.427.513 I print_info: n_embd_head_v_swa     = 256
0.03.427.513 I print_info: n_rot_swa             = 256
0.03.427.513 I print_info: n_ctx_orig_yarn       = 262144
0.03.427.513 I print_info: rope_yarn_log_mul     = 0.0000
0.03.427.514 I print_info: rope_finetuned        = unknown
0.03.427.514 I print_info: model type            = 31B
0.03.427.514 I print_info: model params          = 30.70 B
0.03.427.515 I print_info: general.name          = Gemma-4-31B-It
0.03.427.515 I print_info: vocab type            = BPE
0.03.427.515 I print_info: n_vocab               = 262144
0.03.427.516 I print_info: n_merges              = 514906
0.03.427.516 I print_info: BOS token             = 2 '<bos>'
0.03.427.516 I print_info: EOS token             = 106 '<turn|>'
0.03.427.516 I print_info: UNK token             = 3 '<unk>'
0.03.427.516 I print_info: PAD token             = 0 '<pad>'
0.03.427.516 I print_info: MASK token            = 4 '<mask>'
0.03.427.516 I print_info: LF token              = 107 '
'
0.03.427.517 I print_info: EOG token             = 1 '<eos>'
0.03.427.517 I print_info: EOG token             = 50 '<|tool_response>'
0.03.427.517 I print_info: EOG token             = 106 '<turn|>'
0.03.427.517 I print_info: max token length      = 93
0.03.430.538 D done_getting_tensors: tensor 'token_embd.weight' (q5_K) (and 0 others) cannot be used with preferred buffer type CUDA_Host, using CPU instead
.............................................................................................
0.05.164.959 I common_init_result: added <eos> logit bias = -inf
0.05.164.961 I common_init_result: added <|tool_response> logit bias = -inf
0.05.164.962 I common_init_result: added <turn|> logit bias = -inf
0.05.165.439 D llama_context: n_rs_seq=2 requested but model arch does not support recurrent partial rollback; clamping to 0
0.05.165.443 W llama_context: n_ctx_seq (237056) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
0.05.165.454 D set_abort_callback: call
0.05.173.501 D llama_context: enumerating backends
0.05.173.505 D llama_context: backend_ptrs.size() = 2
0.05.173.506 I sched_reserve: reserving ...
0.05.173.507 D sched_reserve: max_nodes = 6672
0.05.174.194 D sched_reserve: reserving full memory module
0.05.174.198 D sched_reserve: worst-case: n_tokens = 1024, n_seqs = 1, n_outputs = 1
0.05.174.198 D graph_reserve: reserving a graph for ubatch with n_tokens =    1, n_seqs =  1, n_outputs =    1
0.05.174.669 I sched_reserve: Flash Attention was auto, set to enabled
0.05.174.670 I sched_reserve: resolving fused Gated Delta Net support:
0.05.174.671 D graph_reserve: reserving a graph for ubatch with n_tokens =    1, n_seqs =  1, n_outputs =    1
0.05.175.087 I sched_reserve: fused Gated Delta Net (autoregressive) enabled
0.05.175.088 D graph_reserve: reserving a graph for ubatch with n_tokens =   16, n_seqs =  1, n_outputs =   16
0.05.175.492 I sched_reserve: fused Gated Delta Net (chunked) enabled
0.05.175.492 D graph_reserve: reserving a graph for ubatch with n_tokens = 1024, n_seqs =  1, n_outputs =    3
0.05.269.778 D graph_reserve: reserving a graph for ubatch with n_tokens =    1, n_seqs =  1, n_outputs =    1
0.05.270.560 D graph_reserve: reserving a graph for ubatch with n_tokens = 1024, n_seqs =  1, n_outputs =    3
0.05.271.239 I sched_reserve:      CUDA0 compute buffer size =  2490.32 MiB
0.05.271.240 I sched_reserve:  CUDA_Host compute buffer size =   510.32 MiB
0.05.271.240 I sched_reserve: graph nodes  = 3179
0.05.271.241 I sched_reserve: graph splits = 2
0.05.271.241 I sched_reserve: reserve took 97.73 ms, sched copies = 1
0.05.271.325 D set_adapters_lora: adapters = (nil)
0.05.271.326 D adapters_lora_are_same: adapters = (nil)
0.05.271.327 I common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
0.05.304.769 I srv    load_model: loading draft model '/home/debian/.cache/huggingface/hub/models--unsloth--gemma-4-31B-it-GGUF/snapshots/8906b3db2e669a0b1d6293c315d3f9fbf934a86d/mtp-gemma-4-31B-it.gguf'
0.05.430.001 I print_info: file format = GGUF V3 (latest)
0.05.430.002 I print_info: file type   = Q8_0
0.05.430.004 I print_info: file size   = 475.81 MiB (8.50 BPW) 
0.05.430.050 I llama_prepare_model_devices: using device CUDA0 (NVIDIA GeForce RTX 5090) (0000:01:00.0) - 2925 MiB free
0.05.544.982 D init_tokenizer: initializing tokenizer for type 2
0.05.579.704 I load: 0 unused tokens
0.05.579.825 D load: control token: 255999 '<|image>' is not marked as EOG
0.05.581.581 D load: control token: 258882 '<image|>' is not marked as EOG
0.05.582.289 D load: control token: 258883 '<audio|>' is not marked as EOG
0.05.585.017 D load: control token:     98 '<|think|>' is not marked as EOG
0.05.586.290 D load: control token:    105 '<|turn>' is not marked as EOG
0.05.586.817 D load: control token: 258880 '<|image|>' is not marked as EOG
0.05.588.516 D load: control token:      2 '<bos>' is not marked as EOG
0.05.589.385 W load: control-looking token:    212 '</s>' was not control-type; this is probably a bug in the model. its type will be overridden
0.05.589.485 D load: control token:      0 '<pad>' is not marked as EOG
0.05.589.759 W load: control-looking token:     50 '<|tool_response>' was not control-type; this is probably a bug in the model. its type will be overridden
0.05.589.770 D load: control token:     46 '<|tool>' is not marked as EOG
0.05.590.184 D load: control token:     47 '<tool|>' is not marked as EOG
0.05.590.384 D load: control token: 256000 '<|audio>' is not marked as EOG
0.05.592.229 D load: control token:      3 '<unk>' is not marked as EOG
0.05.593.648 D load: control token: 258881 '<|audio|>' is not marked as EOG
0.05.595.448 D load: control token:      4 '<mask>' is not marked as EOG
0.05.609.148 I load: printing all EOG tokens:
0.05.609.150 I load:   - 1 ('<eos>')
0.05.609.150 I load:   - 50 ('<|tool_response>')
0.05.609.150 I load:   - 106 ('<turn|>')
0.05.609.151 I load:   - 212 ('</s>')
0.05.609.152 W load: special_eog_ids contains '<|tool_response>', removing '</s>' token from EOG list
0.05.609.596 I load: special tokens cache size = 23
0.05.624.199 I load: token to piece cache size = 1.9445 MB
0.05.624.207 I print_info: arch                  = gemma4-assistant
0.05.624.207 I print_info: vocab_only            = 0
0.05.624.208 I print_info: no_alloc              = 0
0.05.624.208 I print_info: n_ctx_train           = 262144
0.05.624.208 I print_info: n_embd_inp            = 5376
0.05.624.208 I print_info: n_embd                = 1024
0.05.624.208 I print_info: n_embd_out            = 5376
0.05.624.208 I print_info: n_layer               = 0
0.05.624.209 I print_info: n_layer_all           = 4
0.05.624.213 I print_info: n_head                = 32
0.05.624.214 I print_info: n_head_kv             = [16, 16, 16, 4]
0.05.624.214 I print_info: n_rot                 = 512
0.05.624.214 I print_info: n_swa                 = 1024
0.05.624.214 I print_info: is_swa_any            = 1
0.05.624.214 I print_info: n_embd_head_k         = 512
0.05.624.215 I print_info: n_embd_head_v         = 512
0.05.624.215 I print_info: n_gqa                 = [2, 2, 2, 8]
0.05.624.216 I print_info: n_embd_k_gqa          = [4096, 4096, 4096, 2048]
0.05.624.217 I print_info: n_embd_v_gqa          = [4096, 4096, 4096, 2048]
0.05.624.217 I print_info: f_norm_eps            = 0.0e+00
0.05.624.219 I print_info: f_norm_rms_eps        = 1.0e-06
0.05.624.219 I print_info: f_clamp_kqv           = 0.0e+00
0.05.624.219 I print_info: f_max_alibi_bias      = 0.0e+00
0.05.624.219 I print_info: f_logit_scale         = 0.0e+00
0.05.624.219 I print_info: f_attn_scale          = 1.0e+00
0.05.624.219 I print_info: f_attn_value_scale    = 0.0000
0.05.624.220 I print_info: n_ff                  = 8192
0.05.624.220 I print_info: n_expert              = 0
0.05.624.220 I print_info: n_expert_used         = 0
0.05.624.220 I print_info: n_expert_groups       = 0
0.05.624.220 I print_info: n_group_used          = 0
0.05.624.220 I print_info: causal attn           = 1
0.05.624.220 I print_info: pooling type          = -1
0.05.624.220 I print_info: rope type             = 2
0.05.624.221 I print_info: rope scaling          = linear
0.05.624.221 I print_info: freq_base_train       = 1000000.0
0.05.624.222 I print_info: freq_scale_train      = 1
0.05.624.222 I print_info: freq_base_swa         = 10000.0
0.05.624.222 I print_info: freq_scale_swa        = 1
0.05.624.222 I print_info: n_embd_head_k_swa     = 256
0.05.624.222 I print_info: n_embd_head_v_swa     = 256
0.05.624.222 I print_info: n_rot_swa             = 256
0.05.624.223 I print_info: n_ctx_orig_yarn       = 262144
0.05.624.223 I print_info: rope_yarn_log_mul     = 0.0000
0.05.624.223 I print_info: rope_finetuned        = unknown
0.05.624.223 I print_info: model type            = ?B
0.05.624.224 I print_info: model params          = 469.52 M
0.05.624.224 I print_info: general.name          = 31B
0.05.624.225 I print_info: vocab type            = BPE
0.05.624.225 I print_info: n_vocab               = 262144
0.05.624.225 I print_info: n_merges              = 514906
0.05.624.226 I print_info: BOS token             = 2 '<bos>'
0.05.624.226 I print_info: EOS token             = 1 '<eos>'
0.05.624.226 I print_info: UNK token             = 3 '<unk>'
0.05.624.226 I print_info: PAD token             = 0 '<pad>'
0.05.624.226 I print_info: MASK token            = 4 '<mask>'
0.05.624.226 I print_info: LF token              = 107 '
'
0.05.624.226 I print_info: EOG token             = 1 '<eos>'
0.05.624.227 I print_info: EOG token             = 50 '<|tool_response>'
0.05.624.227 I print_info: EOG token             = 106 '<turn|>'
0.05.624.227 I print_info: max token length      = 93
0.05.624.414 D done_getting_tensors: tensor 'token_embd.weight' (q8_0) (and 0 others) cannot be used with preferred buffer type CUDA_Host, using CPU instead
.........................
0.05.670.712 D set_abort_callback: call
0.05.671.893 D llama_context: enumerating backends
0.05.671.895 D llama_context: backend_ptrs.size() = 2
0.05.671.895 I sched_reserve: reserving ...
0.05.671.896 D sched_reserve: max_nodes = 1024
0.05.672.027 D sched_reserve: reserving full memory module
0.05.672.029 D sched_reserve: worst-case: n_tokens = 1024, n_seqs = 1, n_outputs = 1
0.05.672.030 D graph_reserve: reserving a graph for ubatch with n_tokens =    1, n_seqs =  1, n_outputs =    1
0.05.672.103 I sched_reserve: Flash Attention was auto, set to enabled
0.05.672.103 I sched_reserve: resolving fused Gated Delta Net support:
0.05.672.103 D graph_reserve: reserving a graph for ubatch with n_tokens =    1, n_seqs =  1, n_outputs =    1
0.05.672.142 I sched_reserve: fused Gated Delta Net (autoregressive) enabled
0.05.672.142 D graph_reserve: reserving a graph for ubatch with n_tokens =   16, n_seqs =  1, n_outputs =   16
0.05.672.177 I sched_reserve: fused Gated Delta Net (chunked) enabled
0.05.672.178 D graph_reserve: reserving a graph for ubatch with n_tokens = 1024, n_seqs =  1, n_outputs =    3
0.05.672.379 E ggml_backend_cuda_buffer_type_alloc_buffer: allocating 2490.29 MiB on device 0: cudaMalloc failed: out of memory
0.05.672.381 E ggml_gallocr_reserve_n_impl: failed to allocate CUDA0 buffer of size 2611253376
0.05.672.381 E graph_reserve: failed to allocate compute buffers
0.05.672.424 E llama_init_from_model: failed to initialize the context: failed to allocate compute pp buffers
0.05.672.651 I 
0.05.672.996 I clip_ctx: CLIP using CUDA0 backend
0.05.673.428 I load_hparams: projector:          gemma4v
0.05.673.429 I load_hparams: n_embd:             1152
0.05.673.429 I load_hparams: n_head:             16
0.05.673.429 I load_hparams: n_ff:               4304
0.05.673.429 I load_hparams: n_layer:            27
0.05.673.429 I load_hparams: ffn_op:             gelu_quick
0.05.673.429 I load_hparams: projection_dim:     5376
0.05.673.429 I 
--- vision hparams ---
0.05.673.430 I load_hparams: image_size:         224
0.05.673.430 I load_hparams: patch_size:         16
0.05.673.430 I load_hparams: has_llava_proj:     0
0.05.673.430 I load_hparams: minicpmv_version:   0
0.05.673.430 I load_hparams: n_merge:            3
0.05.673.430 I load_hparams: n_wa_pattern: 0
0.05.673.431 I load_hparams: image_min_pixels:   92160
0.05.673.431 I load_hparams: image_max_pixels:   645120
0.05.673.431 I 
0.05.673.431 I load_hparams: model size:         1145.08 MiB
0.05.673.432 I load_hparams: metadata size:      0.12 MiB
0.05.875.189 I get_dummy_batch: warmup with image size = 768 x 768
0.05.875.461 I get_dummy_batch: warmup with image size = 768 x 768
0.05.875.987 I reserve_compute_meta:      CUDA0 compute buffer size =   140.50 MiB
0.05.875.989 I reserve_compute_meta:        CPU compute buffer size =     6.77 MiB
0.05.875.989 I reserve_compute_meta: graph splits = 1, nodes = 1571
0.05.876.024 I warmup: flash attention is enabled
0.05.876.029 I srv    load_model: loaded multimodal model, '/home/debian/.cache/huggingface/hub/models--unsloth--gemma-4-31B-it-GGUF/snapshots/8906b3db2e669a0b1d6293c315d3f9fbf934a86d/mmproj-BF16.gguf'
0.05.876.036 I srv    load_model: initializing slots, n_slots = 1
0.05.882.283 D CUDA Graph id 95 reused
0.05.882.286 D ggml_backend_cuda_graph_compute: CUDA graph warmup complete
0.05.907.130 W common_speculative_init: no implementations specified for speculative decoding
0.05.907.134 I slot   load_model: id  0 | task -1 | new slot, n_ctx = 237056
0.05.907.134 D slot        reset: id  0 | task -1 | 
0.05.907.183 I srv    load_model: prompt cache is enabled, size limit: 8192 MiB
0.05.907.184 I srv    load_model: use `--cache-ram 0` to disable the prompt cache
0.05.907.184 I srv    load_model: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
0.05.907.184 I srv    load_model: context checkpoints enabled, max = 32, min spacing = 256
0.05.907.195 I srv          init: idle slots will be saved to prompt cache upon starting a new task
0.05.907.195 D srv          init: __TEST_TAG_CACHE_IDLE_SLOTS_ENABLED__
0.05.911.280 I init: chat template, example_format: '<|turn>system
<|think|>
You are a helpful assistant<turn|>
<|turn>user
Hello<turn|>
<|turn>model
Hi there<turn|>
<|turn>user
How are you?<turn|>
<|turn>model
'
0.05.911.695 I srv          init: init: chat template, thinking = 1
0.05.911.714 I srv  llama_server: model loaded
0.05.911.716 I srv  llama_server: server is listening on http://127.0.0.1:8080
0.05.911.718 D que    start_loop: processing new tasks
0.05.911.719 D que    start_loop: update slots
0.05.911.719 I srv  update_slots: all slots are idle
0.05.911.720 D que    start_loop: waiting for new tasks
```
</details>
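
Because the server keeps listening after the draft context fails to initialize, the failure is easy to miss. Below is a minimal sketch for flagging it in a captured log. It assumes the `<timestamp> <level> <message>` line shape seen in the log above and treats the `common_speculative_init: no implementations specified` warning as the sign that speculative decoding ended up disabled. The function name `find_draft_failures` is hypothetical and not part of llama.cpp.

```python
import re

# Assumed line shape, taken from the log above: "<timestamp> <level> <message>",
# e.g. "0.05.672.379 E ggml_backend_cuda_buffer_type_alloc_buffer: ..."
LINE_RE = re.compile(r"^(\d+(?:\.\d+){3}) ([DIWE]) (.*)$")

# Warning text seen in the log above once the draft context failed to initialize.
SPEC_DISABLED_MARKER = "common_speculative_init: no implementations specified"


def find_draft_failures(log_text):
    """Return (error_messages, spec_disabled) for a captured llama-server log.

    Hypothetical helper: it only pattern-matches log text and knows nothing
    about llama.cpp internals.
    """
    errors = []
    spec_disabled = False
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if match is None:
            continue  # progress dots, chat template lines, and similar
        _, level, message = match.groups()
        if level == "E":
            errors.append(message)
        if SPEC_DISABLED_MARKER in message:
            spec_disabled = True
    return errors, spec_disabled


# Lines copied from the log above.
SAMPLE = """\
0.05.672.379 E ggml_backend_cuda_buffer_type_alloc_buffer: allocating 2490.29 MiB on device 0: cudaMalloc failed: out of memory
0.05.672.424 E llama_init_from_model: failed to initialize the context: failed to allocate compute pp buffers
0.05.907.130 W common_speculative_init: no implementations specified for speculative decoding
0.05.911.716 I srv  llama_server: server is listening on http://127.0.0.1:8080
"""

if __name__ == "__main__":
    errors, spec_disabled = find_draft_failures(SAMPLE)
    for message in errors:
        print("error:", message)
    print("speculative decoding disabled:", spec_disabled)
```

A startup wrapper could refuse to continue when `spec_disabled` is true while `--spec-type` was requested, rather than serving without MTP.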
