Eval bug: Gemma4 E4B MTP drafter crashes at slot init with fatal error in fattn.cu:110 #24376

@dboybaker

Name and Version

version: 9584 (e25a32e)
built with GNU 14.2.0 for Linux x86_64

Operating systems

Linux

GGML backends

CUDA

Hardware

Ryzen 5600x + 2 x RTX 3090

Models

Unsloth's gemma-4-E4B-it-qat-UD-Q4_K_XL.gguf and its MTP drafter mtp-gemma-4-E4B-it.gguf

Problem description & steps to reproduce

E4B with the MTP drafter crashes during the slot-initialization warmup decode with a fatal error in ggml_cuda_flash_attn_ext (fattn.cu:110). E2B with an identical config works fine at ~320 t/s. The crash is affected by whether --flash-attn is on or off. The bug appears once the following flags are added:

      --model-draft /gemma4-e4b/mtp-gemma-4-E4B-it.gguf
      --spec-type draft-mtp
      --spec-draft-n-max 1
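
For anyone trying to reproduce, the full invocation would look roughly like the sketch below. Only the three speculative-decoding flags and the drafter path come from this report. The main model's directory is an assumption (inferred from the drafter path), and any other server options used in the original setup are not shown.

```shell
# Sketch of a reproduction command.
# ASSUMPTION: the main model sits in the same directory as the drafter.
MODEL=/gemma4-e4b/gemma-4-E4B-it-qat-UD-Q4_K_XL.gguf
# Drafter path as given in this report.
DRAFT=/gemma4-e4b/mtp-gemma-4-E4B-it.gguf

llama-server \
  --model "$MODEL" \
  --model-draft "$DRAFT" \
  --spec-type draft-mtp \
  --spec-draft-n-max 1
```

Dropping the last three flags should give the non-crashing baseline for comparison.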

First Bad Commit

PR #24282, which introduced support for E4B/E2B

Relevant log output

0.03.104.831 I common_speculative_impl_draft_mtp: adding speculative implementation 'draft-mtp'
0.03.104.834 I common_speculative_impl_draft_mtp: - n_max=1, n_min=0, p_min=0.00, n_embd=2560, backend_sampling=1
0.03.104.835 I common_speculative_impl_draft_mtp: - gpu_layers=-1, cache_k=f16, cache_v=f16, ctx_tgt=yes, ctx_dft=yes, devices=[default]
/llama.cpp/ggml/src/ggml-cuda/fattn.cu:110: fatal error
/llama.cpp/build/bin/libggml-base.so.0(+0x18665) [0x7fa52615f665]
/llama.cpp/build/bin/libggml-base.so.0(ggml_print_backtrace+0x1df) [0x7fa52615fa3f]
/llama.cpp/build/bin/libggml-base.so.0(ggml_abort+0x11e) [0x7fa52615fbce]
/llama.cpp/build/bin/libggml-cuda.so.0(_Z24ggml_cuda_flash_attn_extR25ggml_backend_cuda_contextP11ggml_tensor+0x2a85) [0x7fa5217e3155]
/llama.cpp/build/bin/libggml-cuda.so.0(+0x235616) [0x7fa521835616]
/llama.cpp/build/bin/libggml-base.so.0(ggml_backend_sched_graph_compute_async+0x827) [0x7fa52617c287]
/llama.cpp/build/bin/libllama.so.0(_ZN13llama_context13graph_computeEP11ggml_cgraphb+0xa1) [0x7fa5258cd2f1]
/llama.cpp/build/bin/libllama.so.0(_ZN13llama_context14process_ubatchERK12llama_ubatch14llm_graph_typeP22llama_memory_context_iR11ggml_status+0xe4) [0x7fa5258cff04]
/llama.cpp/build/bin/libllama.so.0(_ZN13llama_context6decodeERK11llama_batch+0x365) [0x7fa5258d5e55]
/llama.cpp/build/bin/libllama.so.0(llama_decode+0xb) [0x7fa5258d7a3b]
/llama.cpp/build/bin/libllama-common.so.0(_Z25common_context_can_seq_rmP13llama_context+0xc8) [0x7fa525de28c8]
/llama.cpp/build/bin/libllama-server-impl.so(_ZN19server_context_impl10load_modelER13common_params+0xcd4) [0x7fa526972214]
/llama.cpp/build/bin/libllama-server-impl.so(_Z12llama_serveriPPc+0x2f03) [0x7fa5268c57d3]
/lib/x86_64-linux-gnu/libc.so.6(+0x29ca8) [0x7fa526235ca8]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85) [0x7fa526235d65]
/usr/local/bin/llama-server(+0x11b1) [0x556ba39dc1b1] 
