unknown model architecture: 'gemma4_assistant' #10340

@scetu

LocalAI version:

LocalAI v4.4.3 (4d3d54d)

Environment, CPU architecture, OS, and Version:

Ubuntu 26.04 with LocalAI in Docker, 2x ROCm GPUs GFX1100

Describe the bug

The Gemma 4 QAT MTP models introduced in #10215 fail to load.

To Reproduce

Import gemma-4-26b-a4b-it-qat-mtp from the gallery and run it. Loading fails with the following error:

gemma-4-26b-a4b-it-qat-mtp
Error: failed to load model with internal loader: could not load model: rpc error: code = Internal desc = Failed to load model: /models/llama-cpp/models/gemma-4-26B-A4B-it-qat-q4_0-gguf/gemma-4-26B_q4_0-it.gguf (with mmproj: /models/llama-cpp/mmproj/gemma-4-26B-A4B-it-qat-q4_0-gguf/gemma-4-26B-it-mmproj.gguf) (with draft model: /models/llama-cpp/models/gemma-4-qat-mtp-assistant-heads/gemma-4-26B-A4B-it-qat-assistant-MTP-Q8_0.gguf). Error: llama_model_load: error loading model: unknown model architecture: 'gemma4_assistant'; llama_model_load_from_file_impl: failed to load model; llama_model_load: error loading model: unknown model architecture: 'gemma4_assistant'; llama_model_load_from_file_impl: failed to load model

This is the default config used for the model definition:

backend: llama-cpp
draft_model: llama-cpp/models/gemma-4-qat-mtp-assistant-heads/gemma-4-26B-A4B-it-qat-assistant-MTP-Q8_0.gguf
function:
    automatic_tool_parsing_fallback: true
    grammar:
        disable: true
known_usecases:
    - chat
min_p: 0
mmproj: llama-cpp/mmproj/gemma-4-26B-A4B-it-qat-q4_0-gguf/gemma-4-26B-it-mmproj.gguf
name: gemma-4-26b-a4b-it-qat-mtp
options:
    - use_jinja:true
    - spec_type:draft-mtp
    - spec_n_max:6
    - spec_p_min:0.75
parameters:
    min_p: 0
    model: llama-cpp/models/gemma-4-26B-A4B-it-qat-q4_0-gguf/gemma-4-26B_q4_0-it.gguf
    repeat_penalty: 1
    temperature: 1
    top_k: 64
    top_p: 0.95
repeat_penalty: 1
temperature: 1
template:
    use_tokenizer_template: true
top_k: 64
top_p: 0.95
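
To check whether the failure comes only from the MTP draft model, one option is to load the same model with speculative decoding turned off. The sketch below filters the config text above, dropping the `draft_model` line and the `spec_*` options. It assumes that removing these keys is enough for LocalAI to skip the draft model, which has not been verified here. The function name `strip_speculative` is hypothetical and only used for illustration.

```python
def strip_speculative(config_text):
    """Return the config text without draft-model / speculative settings.

    Assumption: dropping `draft_model` and the `spec_*` options is enough
    to make the backend skip the draft model. This is a diagnostic step,
    not a fix for the unknown-architecture error.
    """
    kept = []
    for line in config_text.splitlines():
        stripped = line.strip()
        # Drop the draft model reference.
        if stripped.startswith("draft_model:"):
            continue
        # Drop speculative decoding options such as "- spec_type:draft-mtp".
        if stripped.startswith("- spec_"):
            continue
        kept.append(line)
    return "\n".join(kept) + "\n"


if __name__ == "__main__":
    sample = """backend: llama-cpp
draft_model: llama-cpp/models/gemma-4-qat-mtp-assistant-heads/gemma-4-26B-A4B-it-qat-assistant-MTP-Q8_0.gguf
name: gemma-4-26b-a4b-it-qat-mtp
options:
    - use_jinja:true
    - spec_type:draft-mtp
    - spec_n_max:6
    - spec_p_min:0.75
"""
    print(strip_speculative(sample), end="")
```

If the model runs with the filtered config, the problem is isolated to the `gemma4_assistant` draft file rather than the main model or the mmproj.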

The llama-cpp backend is at version 3327693bf0da:

llama-cpp (LLM inference in C/C++)

Repository
    localai
License
    mit
Tags
    text-to-text, LLM, CPU, GPU, Metal, CUDA, HIP
Links
    https://github.com/ggerganov/llama.cpp
Source
    quay.io/go-skynet/local-ai-backends:latest-gpu-rocm-hipblas-llama-cpp
Digest
    sha256:3327693bf0da60c286771d5f86e37763c4ec7347a8b0450e358ed80658de61d7
Installed
    34m ago (2026-06-15 08:40:00 UTC)
Alias
    llama-cpp

Expected behavior

The model loads and runs.

Logs

+++ realpath run.sh
++ dirname /backends/rocm-llama-cpp/run.sh
+ CURDIR=/backends/rocm-llama-cpp
+ cd /
+ echo 'CPU info:'
CPU info:
+ grep -e 'model\sname' /proc/cpuinfo
+ head -1
model name	: AMD Ryzen 9 9900X 12-Core Processor
+ grep -e flags /proc/cpuinfo
+ head -1
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl xtopology nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpuid_fault cpb cat_l3 cdp_l3 hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk avx_vnni avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif x2avic v_spec_ctrl vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid bus_lock_detect movdiri movdir64b overflow_recov succor smca fsrm avx512_vp2intersect flush_l1d amd_lbr_pmc_freeze
+ BINARY=llama-cpp-fallback
+ grep -q -e '\savx\s' /proc/cpuinfo
+ echo 'CPU:    AVX    found OK'
+ '[' -e /backends/rocm-llama-cpp/llama-cpp-avx ']'
+ grep -q -e '\savx2\s' /proc/cpuinfo
CPU:    AVX    found OK
CPU:    AVX2   found OK
+ echo 'CPU:    AVX2   found OK'
+ '[' -e /backends/rocm-llama-cpp/llama-cpp-avx2 ']'
+ grep -q -e '\savx512f\s' /proc/cpuinfo
+ echo 'CPU:    AVX512F found OK'
+ '[' -e /backends/rocm-llama-cpp/llama-cpp-avx512 ']'
+ '[' -n '' ']'
CPU:    AVX512F found OK
++ uname
+ '[' Linux == Darwin ']'
+ export LD_LIBRARY_PATH=/backends/rocm-llama-cpp/lib:
+ LD_LIBRARY_PATH=/backends/rocm-llama-cpp/lib:
+ '[' -d /backends/rocm-llama-cpp/lib/rocblas/library ']'
+ export ROCBLAS_TENSILE_LIBPATH=/backends/rocm-llama-cpp/lib/rocblas/library
+ ROCBLAS_TENSILE_LIBPATH=/backends/rocm-llama-cpp/lib/rocblas/library
+ '[' -f /backends/rocm-llama-cpp/lib/ld.so ']'
Using lib/ld.so
+ echo 'Using lib/ld.so'
Using binary: llama-cpp-fallback
+ echo 'Using binary: llama-cpp-fallback'
+ exec /backends/rocm-llama-cpp/lib/ld.so /backends/rocm-llama-cpp/llama-cpp-fallback --addr 127.0.0.1:39233
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1781514620.978437     696 config.cc:230] gRPC experiments enabled: call_status_override_on_cancellation, event_engine_dns, event_engine_listener, http2_stats_fix, monitoring_experiment, pick_first_new, trace_record_callops, work_serializer_clears_time_cache, work_serializer_dispatch
I0000 00:00:1781514620.978591     696 ev_epoll1_linux.cc:125] grpc epoll fd: 3
I0000 00:00:1781514620.978697     696 server_builder.cc:392] Synchronous server. Num CQs: 1, Min pollers: 1, Max Pollers: 2, CQ timeout (msec): 10000
I0000 00:00:1781514620.979613     696 ev_epoll1_linux.cc:359] grpc epoll fd: 4
I0000 00:00:1781514620.979973     696 tcp_socket_utils.cc:634] TCP_USER_TIMEOUT is available. TCP_USER_TIMEOUT will be used thereafter
Server listening on 127.0.0.1:39233
start_llama_server: starting llama server
start_llama_server: waiting for model to be loaded
0.03.546.297 I system info: n_threads = 12, n_threads_batch = -1, total_threads = 24
0.03.546.300 I 
0.03.546.312 I system_info: n_threads = 12 / 24 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 
0.03.546.313 I 
0.03.546.314 I srv    load_model: loading model '/models/llama-cpp/models/gemma-4-26B-A4B-it-qat-q4_0-gguf/gemma-4-26B_q4_0-it.gguf'
0.04.015.661 I srv    load_model: [mtmd] estimated worst-case memory usage of mmproj is 1286.72 MiB
0.04.139.454 E llama_model_load: error loading model: unknown model architecture: 'gemma4_assistant'
0.04.139.464 E llama_model_load_from_file_impl: failed to load model
0.04.139.476 W srv    load_model: [spec] failed to measure draft model memory: failed to load model
0.04.139.763 I common_init_result: fitting params to device memory ...
0.04.139.764 I common_init_result: (for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on)
0.04.650.126 W load: override 'tokenizer.ggml.add_bos_token' to 'true' for Gemma4
0.04.664.813 W load: control-looking token:    212 '</s>' was not control-type; this is probably a bug in the model. its type will be overridden
0.04.664.993 W load: control-looking token:     50 '<|tool_response>' was not control-type; this is probably a bug in the model. its type will be overridden
0.04.674.013 W load: special_eog_ids contains '<|tool_response>', removing '</s>' token from EOG list
0.08.266.211 I common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
0.08.407.645 I srv    load_model: loading draft model '/models/llama-cpp/models/gemma-4-qat-mtp-assistant-heads/gemma-4-26B-A4B-it-qat-assistant-MTP-Q8_0.gguf'
0.08.517.883 E llama_model_load: error loading model: unknown model architecture: 'gemma4_assistant'
0.08.517.887 E llama_model_load_from_file_impl: failed to load model
0.08.517.889 E srv    load_model: failed to load draft model, '/models/llama-cpp/models/gemma-4-qat-mtp-assistant-heads/gemma-4-26B-A4B-it-qat-assistant-MTP-Q8_0.gguf'
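
In the log above the main model reaches the warm-up step, and only the draft file is rejected with `unknown model architecture`. To confirm which architecture string each GGUF file declares, the `general.architecture` metadata key can be read straight from the file header. The following is a minimal sketch based on the published GGUF layout (magic, version, tensor count, key/value count, then key/value pairs). It assumes GGUF version 2 or later and a little-endian file. The helper name `gguf_architecture` is hypothetical.

```python
import struct
import sys

GGUF_MAGIC = b"GGUF"

# Fixed-size GGUF value types: type id -> struct format (little-endian).
_SCALARS = {0: "<B", 1: "<b", 2: "<H", 3: "<h", 4: "<I", 5: "<i",
            6: "<f", 7: "<?", 10: "<Q", 11: "<q", 12: "<d"}
_STRING, _ARRAY = 8, 9


def _read(f, fmt):
    """Read and unpack one fixed-size value."""
    size = struct.calcsize(fmt)
    data = f.read(size)
    if len(data) != size:
        raise ValueError("unexpected end of file")
    return struct.unpack(fmt, data)[0]


def _read_string(f):
    """Read a GGUF string: uint64 length followed by UTF-8 bytes."""
    length = _read(f, "<Q")
    data = f.read(length)
    if len(data) != length:
        raise ValueError("unexpected end of file")
    return data.decode("utf-8", errors="replace")


def _read_value(f, vtype):
    """Read one metadata value of the given GGUF type id."""
    if vtype in _SCALARS:
        return _read(f, _SCALARS[vtype])
    if vtype == _STRING:
        return _read_string(f)
    if vtype == _ARRAY:
        item_type = _read(f, "<I")
        count = _read(f, "<Q")
        return [_read_value(f, item_type) for _ in range(count)]
    raise ValueError(f"unknown GGUF value type {vtype}")


def gguf_architecture(path):
    """Return the `general.architecture` string, or None if it is absent."""
    with open(path, "rb") as f:
        if f.read(4) != GGUF_MAGIC:
            raise ValueError("not a GGUF file")
        version = _read(f, "<I")
        if version < 2:
            raise ValueError("GGUF v1 uses 32-bit counts; not handled here")
        _read(f, "<Q")               # tensor count (not needed)
        kv_count = _read(f, "<Q")    # number of metadata key/value pairs
        for _ in range(kv_count):
            key = _read_string(f)
            value = _read_value(f, _read(f, "<I"))
            if key == "general.architecture":
                return value
    return None


if __name__ == "__main__":
    # Usage: python gguf_arch.py model.gguf draft.gguf
    for path in sys.argv[1:]:
        print(path, "->", gguf_architecture(path))
```

Run against the draft file from this report, the script would be expected to print the `gemma4_assistant` string quoted in the error. Whether a given llama.cpp build recognizes that string depends on the build and cannot be determined from the file alone.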

Additional context
