Eval bug: EAGLE3 with Qwen3.6 (qwen3_5 hybrid) target — missing t_layer_inp hooks, and llama_decode(ctx_dft) rc=-1 once context exceeds ~700 tokens #24541

@Zertruermmerdog

Name and Version

Reproduced on two builds:

  • Release binary b9611 (llama-b9611-bin-win-vulkan-x64) — for part 1 (assert)
  • Self-build of master 02182fc (MSVC 19.44 / Vulkan SDK 1.4.350, -DGGML_VULKAN=ON) with a 1-line patch (see below) — for part 2 (ctx_dft decode failure)

Operating systems

Windows 11 Pro (26200)

GGML backends

Vulkan

Hardware

2x AMD RDNA4 (RX 9070 XT 16 GB + RX 9070 16 GB), -sm layer

Models

  • Target: unsloth/Qwen3.6-27B-MTP-GGUF Qwen3.6-27B-UD-Q5_K_XL (arch qwen3_5, GDN hybrid)
  • EAGLE3 draft: Ex0bit/Qwen3.6-27B-PRISM-EAGLE3 compressed/ (standard EAGLE3 layout, draft_vocab 32768), converted with convert_hf_to_gguf.py --outtype f16 --target-model-dir <dir with Qwen/Qwen3.6-27B config.json + tokenizer files> — conversion works fine on master.

Problem description & steps to reproduce

Following up on #18039 (EAGLE3) with a Qwen3.6 / qwen3_5 hybrid (GDN) target. Two separate findings:

Part 1 — qwen3_5 target asserts immediately (stock b9611 and master):

llama-server -m Qwen3.6-27B-UD-Q5_K_XL-MTP.gguf -ngl 99 -sm layer -fa on -c 32768 --jinja -np 1 \
  --spec-type draft-eagle3 --spec-draft-model Qwen3.6-27B-EAGLE3-PRISM-f16.gguf

First request dies with:

src/llama-graph.cpp:956: GGML_ASSERT(t_layer_inp[il] != nullptr && "layer input tensor is null") failed

Cause: only llama.cpp, qwen3.cpp, qwen3moe.cpp, gemma4.cpp and openai-moe.cpp populate res->t_layer_inp[il]; qwen35.cpp does not. Adding the same one-liner as in qwen3.cpp at the top of the layer loop in qwen35.cpp fixes the assert. EAGLE3 then genuinely works on the hybrid target for short contexts: output is coherent and draft acceptance is reported (e.g. 0.57 over 550 greedy code-gen tokens with this third-party drafter). The change:

for (int il = 0; il < n_layer; ++il) {
    res->t_layer_inp[il] = inpL;   // <-- added, same as qwen3.cpp
    ...

Happy to open a PR for this if qwen3_5 targets are in scope (related stale attempt: #21437).
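
For readers unfamiliar with the hook, here is a minimal standalone sketch of the pattern, not the actual llama.cpp code. The names tensor, graph_result, build_layers and all_layer_inputs_present are hypothetical stand-ins. It only illustrates why a builder that never records its per-layer inputs leaves null entries that a later consumer (here, the EAGLE3 path) would trip over:

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a graph tensor; only pointer identity matters here.
struct tensor { int id; };

// Hypothetical miniature of a graph result that records each layer's input.
struct graph_result {
    std::vector<const tensor *> t_layer_inp;
    explicit graph_result(int n_layer) : t_layer_inp(n_layer, nullptr) {}
};

// Toy layer loop. acts must hold n_layer + 1 entries (input plus one output per layer).
// With record_inputs == false this mimics a builder that never fills t_layer_inp.
static void build_layers(graph_result & res, const std::vector<tensor> & acts, bool record_inputs) {
    const int n_layer = (int) res.t_layer_inp.size();
    const tensor * inpL = &acts[0];
    for (int il = 0; il < n_layer; ++il) {
        if (record_inputs) {
            res.t_layer_inp[il] = inpL; // the hook: remember this layer's input
        }
        inpL = &acts[il + 1]; // this layer's output feeds the next layer
    }
}

// Consumer-side check, analogous in spirit to the failing assert.
static bool all_layer_inputs_present(const graph_result & res) {
    for (const tensor * t : res.t_layer_inp) {
        if (t == nullptr) {
            return false;
        }
    }
    return true;
}
```

Calling build_layers with record_inputs set to false leaves every entry null, which corresponds to the unpatched qwen35.cpp case in this sketch; setting it to true corresponds to the one-line patch.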

Part 2 — with the hook in place, a deterministic fatal error occurs once total context exceeds ~700 tokens:

  • prompt of ~650 tokens: works
  • prompt of ~740 tokens: draft decode fails every time, and the server dies via the #20277 kill switch ("server : add kill switch when server is stuck"). The fatal-error message asks to report there, but this looked EAGLE3-specific enough for its own issue.
E process: llama_decode(ctx_dft) failed rc=-1 (n_tokens=4, ubatch_pos[0]=736)
tools/server/server-context.cpp:3185: fatal error - please provide logs and repro in https://github.com/ggml-org/llama.cpp/pull/20277

Notes:

  • Independent of -ub / -b (tried -ub 256 -b 4096 and -ub 2048 -b 2048) and of prompt content (plain repeated text, no chat history needed; single request on a fresh server reproduces it).
  • params_dft inherits the full n_ctx (32768 here), so it does not look like a simple draft-context-size limit.
  • rc=-1 (invalid input batch) comes from the EAGLE3 draft path in common/speculative.cpp; the failing micro-batch is tiny (n_tokens=4) at the position right below the prompt end.
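
Since the failure is deterministic and sits somewhere between ~650 and ~740 tokens, the exact threshold could be pinned down by bisection. The following is a generic sketch under assumptions: first_failing_length and the fails predicate are hypothetical, and in practice fails(n) would mean "start a fresh server, send an n-token prompt, and see whether llama_decode(ctx_dft) returns rc=-1". It also assumes the failure is monotonic in prompt length, which has not been verified here.

```cpp
#include <cassert>
#include <functional>

// Returns the smallest length in (lo, hi] for which fails(length) is true.
// Preconditions (assumed, not checked against a real server):
//   fails(lo) == false, fails(hi) == true, and fails is monotonic in length.
static int first_failing_length(int lo, int hi, const std::function<bool(int)> & fails) {
    assert(lo < hi);
    while (hi - lo > 1) {
        const int mid = lo + (hi - lo) / 2;
        if (fails(mid)) {
            hi = mid; // failure at mid: the threshold is at or below mid
        } else {
            lo = mid; // mid still works: the threshold is above mid
        }
    }
    return hi;
}
```

With lo = 650 and hi = 740 this needs at most 7 probes. An exact boundary (for example, a power of two or a multiple of some internal block size) might hint at which check in the draft path rejects the batch.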

Repro:

# 1) convert drafter (works)
python convert_hf_to_gguf.py prism-compressed/ --outtype f16 \
  --target-model-dir qwen36-27b-target/ --outfile Qwen3.6-27B-EAGLE3-PRISM-f16.gguf

# 2) patch src/models/qwen35.cpp as above, build with -DGGML_VULKAN=ON

# 3) start server (args above), then:
#    - send a ~650-token prompt  -> OK
#    - send a ~740-token prompt  -> llama_decode(ctx_dft) rc=-1, fatal

First Bad Commit

n/a (feature combination was never functional for this target arch)

Relevant log output

1.06.900.055 I slot print_timing: id  0 | task 0 | draft acceptance = 1.00000 (    3 accepted /     3 generated)
1.06.900.393 I statistics     draft-eagle3: #calls(b,g,a) =    1      1      1, ...
... (short contexts fine) ...
0.21.939.306 E process: llama_decode(ctx_dft) failed rc=-1 (n_tokens=4, ubatch_pos[0]=736)
F:\AI\llama.cpp-src\tools\server\server-context.cpp:3185: fatal error - please provide logs and repro in https://github.com/ggml-org/llama.cpp/pull/20277
