Cause: only llama.cpp, qwen3.cpp, qwen3moe.cpp, gemma4.cpp, openai-moe.cpp populate res->t_layer_inp[il]; qwen35.cpp does not. Adding the same one-liner as qwen3.cpp at the top of the layer loop in qwen35.cpp fixes the assert, and EAGLE3 then genuinely works on the hybrid target for short contexts (coherent output, draft acceptance reported, e.g. 0.57 over 550 greedy code-gen tokens with this third-party drafter):
for (int il = 0; il < n_layer; ++il) {
    res->t_layer_inp[il] = inpL; // <-- added, same as qwen3.cpp
    ...
Happy to open a PR for this if qwen3_5 targets are in scope (related stale attempt: #21437).
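To make the failure mode concrete, here is a minimal, self-contained sketch of the hook pattern. It is not llama.cpp code: `Tensor`, `GraphResult`, `build_layers` and `all_layer_inputs_present` are hypothetical stand-ins, and the assumption is only that the EAGLE3 path expects every `t_layer_inp` slot to be filled by the target's graph builder.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for illustration; these are not the real llama.cpp types.
struct Tensor { int id; };

struct GraphResult {
    // One slot per layer; stays nullptr unless the model builder fills it.
    std::vector<const Tensor *> t_layer_inp;
    explicit GraphResult(std::size_t n_layer) : t_layer_inp(n_layer, nullptr) {}
};

// Walks the layers; `record_inputs` toggles the one-line hook from the patch.
// `acts` must hold n_layer + 1 tensors: the embedding plus one output per layer.
void build_layers(GraphResult & res, const std::vector<Tensor> & acts, bool record_inputs) {
    const Tensor * inpL = &acts[0];
    for (std::size_t il = 0; il < res.t_layer_inp.size(); ++il) {
        if (record_inputs) {
            res.t_layer_inp[il] = inpL; // the hook: expose this layer's input
        }
        inpL = &acts[il + 1]; // stand-in for running the layer
    }
}

// What a consumer of the per-layer inputs needs: every slot populated.
bool all_layer_inputs_present(const GraphResult & res) {
    for (const Tensor * t : res.t_layer_inp) {
        if (t == nullptr) return false;
    }
    return true;
}
```

With `record_inputs = false` (the situation described for `qwen35.cpp`), `all_layer_inputs_present` returns false, which is the condition a consumer-side assert would trip on. With the hook enabled, each slot points at the tensor that entered that layer.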
Part 2: with the hook in place, the failure is deterministic and fatal once total context exceeds ~700 tokens:
prompt of ~650 tokens: works
prompt of ~740 tokens: draft decode fails every time, and the server dies via the kill switch from #20277 ("server : add kill switch when server is stuck"). The fatal-error message asks to report there, but this looked EAGLE3-specific enough for its own issue.
E process: llama_decode(ctx_dft) failed rc=-1 (n_tokens=4, ubatch_pos[0]=736)
tools/server/server-context.cpp:3185: fatal error - please provide logs and repro in https://github.com/ggml-org/llama.cpp/pull/20277
Notes:
Independent of -ub / -b (tried -ub 256 -b 4096 and -ub 2048 -b 2048) and of prompt content (plain repeated text, no chat history needed; single request on a fresh server reproduces it).
params_dft inherits the full n_ctx (32768 here), so it does not look like a simple draft-context-size limit.
rc=-1 (invalid input batch) comes from the EAGLE3 draft path in common/speculative.cpp; the failing micro-batch is tiny (n_tokens=4) at the position right below the prompt end.
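The two data points above (~650 tokens works, ~740 fails) leave the exact threshold open. One way to narrow it down is to bisect on prompt length. The sketch below assumes the failure is monotone in length, which has not been verified here; `first_failing_length` and the `fails` predicate are hypothetical names, and in practice `fails(n)` would have to start a fresh server, send an n-token prompt, and report whether the draft decode failed.

```cpp
#include <cassert>
#include <functional>

// Returns the smallest n in (lo, hi] for which fails(n) is true.
// Assumed preconditions (not checked beyond lo < hi):
//   fails(lo) == false, fails(hi) == true, and fails is monotone in n.
int first_failing_length(int lo, int hi, const std::function<bool(int)> & fails) {
    assert(lo < hi);
    while (hi - lo > 1) {
        const int mid = lo + (hi - lo) / 2;
        if (fails(mid)) {
            hi = mid; // failure reproduces: threshold is at or below mid
        } else {
            lo = mid; // still works: threshold is above mid
        }
    }
    return hi;
}
```

Starting from `lo = 650` and `hi = 740`, this needs roughly seven server runs to isolate a single token count, which could then be compared against the `ubatch_pos[0]` value in the failing log line.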
Repro:
# 1) convert drafter (works)
python convert_hf_to_gguf.py prism-compressed/ --outtype f16 \
--target-model-dir qwen36-27b-target/ --outfile Qwen3.6-27B-EAGLE3-PRISM-f16.gguf
# 2) patch src/models/qwen35.cpp as above, build with -DGGML_VULKAN=ON
# 3) start server (args above), then:
#    - send a ~650-token prompt -> OK
#    - send a ~740-token prompt -> llama_decode(ctx_dft) rc=-1, fatal
First Bad Commit
n/a (feature combination was never functional for this target arch)
Relevant log output
1.06.900.055 I slot print_timing: id 0 | task 0 | draft acceptance = 1.00000 ( 3 accepted / 3 generated)
1.06.900.393 I statistics draft-eagle3: #calls(b,g,a) = 1 1 1, ...
... (short contexts fine) ...
0.21.939.306 E process: llama_decode(ctx_dft) failed rc=-1 (n_tokens=4, ubatch_pos[0]=736)
F:\AI\llama.cpp-src\tools\server\server-context.cpp:3185: fatal error - please provide logs and repro in https://github.com/ggml-org/llama.cpp/pull/20277
Name and Version
Reproduced on two builds:
b9611 (llama-b9611-bin-win-vulkan-x64): for part 1 (assert)
02182fc (MSVC 19.44 / Vulkan SDK 1.4.350, -DGGML_VULKAN=ON) with the 1-line qwen35.cpp patch: for part 2 (ctx_dft decode failure)
Operating systems
Windows 11 Pro (26200)
GGML backends
Vulkan
Hardware
2x AMD RDNA4 (RX 9070 XT 16 GB + RX 9070 16 GB), -sm layer
Models
Qwen3.6-27B-UD-Q5_K_XL (arch qwen3_5, GDN hybrid)
compressed/ (standard EAGLE3 layout, draft_vocab 32768), converted with convert_hf_to_gguf.py --outtype f16 --target-model-dir <dir with Qwen/Qwen3.6-27B config.json + tokenizer files>; conversion works fine on master.
Problem description & steps to reproduce
Following up on #18039 (EAGLE3) with a Qwen3.6 / qwen3_5 hybrid (GDN) target. Two separate findings:
Part 1: the qwen3_5 target asserts immediately (stock b9611 and master). The first request dies on an assert.