[BUG?] (Kobold v1.113) - Was MTP integrated from upstream? #2211

@SabinStargem

I have been trying to run some MTP models with the latest KoboldCpp. Since llama.cpp just added MTP support, I assumed KoboldCpp would be able to run them. Unfortunately, three separate MTP models fail to load. Here is the relevant part of the log:

00000000000

llama_model_load: error loading model: missing tensor 'blk.64.ssm_conv1d.weight'
llama_model_load_from_file_impl: failed to load model
Traceback (most recent call last):
File "koboldcpp.py", line 11664, in <module>
main(launch_args=parser.parse_args(),default_args=parser.parse_args([]))
File "koboldcpp.py", line 10230, in main
kcpp_main_process(args,global_memory,using_gui_launcher)
File "koboldcpp.py", line 10938, in kcpp_main_process
loadok = load_model(modelname)
File "koboldcpp.py", line 2014, in load_model
ret = handle.load_model(inputs)
OSError: exception: access violation reading 0x000000000000000C
[40812] Failed to execute script 'koboldcpp' due to unhandled exception!

[process exited with code 1 (0x00000001)]
You can now close this terminal with Ctrl+D, or press Enter to restart.

000000000

Here is the full log.

000000000


Welcome to KoboldCpp - Version 1.113
For command line arguments, please refer to --help


Auto Selected CUDA Backend (flag=0)

Loading Chat Completions Adapter: C:\Users\Janus\AppData\Local\Temp\_MEI164762\kcpp_adapters\AutoGuess.json
Chat Completions Adapter Loaded
Forced autofit is selected, moecpu and overridetensors will be set automatically.
Auto Recommended GPU Layers: 0
System: Windows 10.0.26200 AMD64 AMD64 Family 25 Model 33 Stepping 2, AuthenticAMD
Detected Available GPU Memory: 12288 MB
Detected Available RAM: 105147 MB
Initializing dynamic library: koboldcpp_cublas.dll

Namespace(admin=False, admindir='', adminpassword='', adminunloadtimeout=0, analyze='', autofit=True, autofitpadding=1024, autoswapmode=False, baseconfig='', batchsize=512, benchmark=None, blasthreads=None, chatcompletionsadapter='AutoGuess', cli=False, config=None, contextsize=262144, continuous_batching=0, debugmode=0, defaultgenamt=8192, device='', downloaddir='', draftamount=8, draftgpulayers=999, draftgpusplit=None, draftmodel=None, embeddingsgpu=False, embeddingsmaxctx=0, embeddingsmodel='', enableguidance=True, exportconfig='', exporttemplate='', failsafe=False, flashattention=False, forceversion=False, foreground=False, gendefaults='', gendefaultsoverwrite=False, genlimit=0, gpulayers=0, highpriority=False, hordeconfig=None, hordegenlen=0, hordekey='', hordemaxctx=0, hordemodelname='', hordeworkername='', host='', ignoremissing=False, jinja=True, jinja_kwargs='"preserve_thinking": True', jinja_tools=True, jinjatemplate='', launch=True, lora=None, loramult=1.0, lowvram=False, maingpu=-1, maxrequestsize=32, mcpfile=None, mmproj='C:/KoboldCPP/Models/Qwen3.6-27B-mmproj-BF16.gguf', mmprojcpu=False, model=[], model_param='C:/KoboldCPP/Models/Qwen3.6-27B-UD-Q6_K_XL.gguf', moecpu=0, moeexperts=-1, multiplayer=False, multiuser=10, musicdiffusion='', musicembeddings='', musicllm='', musiclowvram=False, musicvae='', noavx2=False, noblas=False, nobostoken=False, nocertify=False, nofastforward=False, noflashattention=False, nommap=False, nommq=False, nomodel=False, nopipelineparallel=False, noshift=True, onready='', overridekv=None, overridenativecontext=0, overridetensors='', password=None, pipelineparallel=False, port=5001, port_param=5001, preloadstory=None, prompt='', proxy_port=None, quantkv='bf16', quiet=False, ratelimit=0, remotetunnel=False, reqtimeout=0, ropeconfig=[0.0, 10000.0], routermode=False, savedatafile=None, sdclamped=0, sdclampedsoft=0, sdclip1='', sdclip2='', sdclipgpu=False, sdconfig=None, sdconvdirect='off', sdflashattention=False, 
sdgendefaults=False, sdlora=[], sdloramult=[1.0], sdmaingpu=-1, sdmodel='', sdnotile=False, sdoffloadcpu=False, sdphotomaker='', sdquant=0, sdt5xxl='', sdthreads=15, sdtiledvae=768, sdupscaler='', sdvae='', sdvaeauto=False, sdvaecpu=False, showgui=False, singleinstance=False, skiplauncher=False, smartcache=5, smartcontext=False, splitmode='layer', ssl=None, swapadding=1024, tensor_split=[1.2, 3.0], testmemory=False, threads=15, ttsdir='', ttsgpu=False, ttsmaxlen=4096, ttsmodel='', ttsthreads=15, ttswavtokenizer='', unpack='', usecpu=False, usecuda=['normal'], usemlock=False, usemmap=False, useswa=True, usevulkan=None, version=False, visionmaxres=2048, visionmaxtokens=-1, visionmintokens=-1, websearch=False, whispermodel='')

Loading Text Model: C:\KoboldCPP\Models\Qwen3.6-27B-UD-Q6_K_XL.gguf

The reported GGUF Arch is: qwen35
Arch Category: 32


Identified as GGUF model.
Attempting to Load...

SWA Mode is ENABLED!
Note that using SWA Mode cannot be used with Context Shifting, and can lead to degraded recall when combined with Fast Forwarding!
Using automatic RoPE scaling for GGUF. If the model has custom RoPE settings, they'll be used directly instead!
System Info: AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | AMX_INT8 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
ggml_cuda_init: found 2 CUDA devices (Total VRAM: 36851 MiB):
Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes, VRAM: 12287 MiB
Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes, VRAM: 24563 MiB
CUDA MMQ: True

Initializing CUDA/HIP, please wait, the following step may take a few minutes (only for first launch)...

Applying Tensor Split...

Attempting to use llama.cpp's automating fitting code. This will override all your layer configs, may or may not work!
Autofit Reserve Space: 2356 MB
Autofit Success: 0, Autofit Result: -c 262272 -ngl -1
llama_model_loader: loaded meta data with 52 key-value pairs and 866 tensors from C:\KoboldCPP\Models\Qwen3.6-27B-UD-Q6_K_XL.gguf (version GGUF V3 (latest))
print_info: file format = GGUF V3 (latest)
print_info: file size = 24.22 GiB (7.61 BPW)
llama_prepare_model_devices: using device CUDA0 (NVIDIA GeForce RTX 3060) (0000:03:00.0) - 11255 MiB free
llama_prepare_model_devices: using device CUDA1 (NVIDIA GeForce RTX 4090) (0000:0a:00.0) - 23040 MiB free
init_tokenizer: initializing tokenizer for type 2
load: 0 unused tokens
load: setting token '' (248069) attribute to USER_DEFINED (16), old attributes: 16
load: setting token '' (248068) attribute to USER_DEFINED (16), old attributes: 16
load: printing all EOG tokens:
load: - 248044 ('<|endoftext|>')
load: - 248046 ('<|im_end|>')
load: - 248063 ('<|fim_pad|>')
load: - 248064 ('<|repo_name|>')
load: - 248065 ('<|file_sep|>')
load: special tokens cache size = 33
load: token to piece cache size = 1.7581 MB
print_info: arch = qwen35
print_info: vocab_only = 0
print_info: no_alloc = 0
print_info: n_ctx_train = 262144
print_info: n_embd = 5120
print_info: n_embd_inp = 5120
print_info: n_layer = 65
print_info: n_head = 24
print_info: n_head_kv = 4
print_info: n_rot = 64
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 256
print_info: n_embd_head_v = 256
print_info: n_gqa = 6
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: f_attn_value_scale = 0.0000
print_info: n_ff = 17408
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = -1
print_info: rope type = 40
print_info: rope scaling = linear
print_info: freq_base_train = 10000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 262144
print_info: rope_yarn_log_mul = 0.0000
print_info: rope_finetuned = unknown
print_info: mrope sections = [11, 11, 10, 0]
print_info: ssm_d_conv = 4
print_info: ssm_d_inner = 6144
print_info: ssm_d_state = 128
print_info: ssm_dt_rank = 48
print_info: ssm_n_group = 16
print_info: ssm_dt_b_c_rms = 0
print_info: model type = ?B
print_info: model params = 27.32 B
print_info: general.name = Qwen3.6-27B
print_info: vocab type = BPE
print_info: n_vocab = 248320
print_info: n_merges = 247587
print_info: BOS token = 248044 '<|endoftext|>'
print_info: EOS token = 248046 '<|im_end|>'
print_info: EOT token = 248046 '<|im_end|>'
print_info: PAD token = 248055 '<|vision_pad|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 248060 '<|fim_prefix|>'
print_info: FIM SUF token = 248062 '<|fim_suffix|>'
print_info: FIM MID token = 248061 '<|fim_middle|>'
print_info: FIM PAD token = 248063 '<|fim_pad|>'
print_info: FIM REP token = 248064 '<|repo_name|>'
print_info: FIM SEP token = 248065 '<|file_sep|>'
print_info: EOG token = 248044 '<|endoftext|>'
print_info: EOG token = 248046 '<|im_end|>'
print_info: EOG token = 248063 '<|fim_pad|>'
print_info: EOG token = 248064 '<|repo_name|>'
print_info: EOG token = 248065 '<|file_sep|>'
print_info: max token length = 256

000000000000000000
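To make the two relevant log lines easier to compare, here is a minimal sketch that pulls the missing tensor's block index and the reported `n_layer` out of the loader output. The helper name `summarize_missing_tensor` is hypothetical and the regular expressions only cover the message formats shown in this log. The log itself only establishes that the missing tensor belongs to the last block (index 64 of 65); that this block is the MTP layer is my assumption, not something the loader states.

```python
import re

# Two lines copied verbatim from the log in this report.
LOG = """\
print_info: n_layer = 65
llama_model_load: error loading model: missing tensor 'blk.64.ssm_conv1d.weight'
"""


def summarize_missing_tensor(log_text):
    """Return (block_index, tensor_suffix, n_layer) parsed from loader output, or None.

    Hypothetical helper for illustration; it only understands the
    "missing tensor 'blk.N.<name>'" and "n_layer = N" formats seen above.
    """
    missing = re.search(r"missing tensor 'blk\.(\d+)\.([\w.]+)'", log_text)
    layers = re.search(r"n_layer\s*=\s*(\d+)", log_text)
    if not missing or not layers:
        return None
    return int(missing.group(1)), missing.group(2), int(layers.group(1))


result = summarize_missing_tensor(LOG)
if result is not None:
    block, suffix, n_layer = result
    # Block indices are zero-based, so the last block is n_layer - 1.
    is_last = block == n_layer - 1
    print(f"missing {suffix} in block {block} of {n_layer}; last block: {is_last}")
```

If the same check on the other two MTP models also points at the final block, that would suggest (but not prove) that the loader is treating the extra layer as a regular block and expecting tensors it does not have.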
