Eval bug: EAGLE3 with Qwen3.6 (qwen3_5 hybrid) target is missing t_layer_inp hooks, and llama_decode(ctx_dft) fails with rc=-1 once context exceeds ~700 tokens (#24541)

### Name and Version

Reproduced on two builds:
- Release binary `b9611` (llama-b9611-bin-win-vulkan-x64) — for part 1 (assert)
- Self-build of master `02182fc` (MSVC 19.44 / Vulkan SDK 1.4.350, `-DGGML_VULKAN=ON`) with a 1-line patch (see below) — for part 2 (ctx_dft decode failure)

### Operating systems

Windows 11 Pro (26200)

### GGML backends

Vulkan

### Hardware

2x AMD RDNA4 (RX 9070 XT 16 GB + RX 9070 16 GB), `-sm layer`

### Models

- Target: [unsloth/Qwen3.6-27B-MTP-GGUF](https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF) `Qwen3.6-27B-UD-Q5_K_XL` (arch `qwen3_5`, GDN hybrid)
- EAGLE3 draft: [Ex0bit/Qwen3.6-27B-PRISM-EAGLE3](https://huggingface.co/Ex0bit/Qwen3.6-27B-PRISM-EAGLE3) `compressed/` (standard EAGLE3 layout, draft_vocab 32768), converted with `convert_hf_to_gguf.py --outtype f16 --target-model-dir <dir with Qwen/Qwen3.6-27B config.json + tokenizer files>` — conversion works fine on master.

### Problem description & steps to reproduce

Following up on #18039 (EAGLE3) with a Qwen3.6 / `qwen3_5` **hybrid (GDN) target**. Two separate findings:

**Part 1 — `qwen3_5` target asserts immediately (stock b9611 and master):**

```
llama-server -m Qwen3.6-27B-UD-Q5_K_XL-MTP.gguf -ngl 99 -sm layer -fa on -c 32768 --jinja -np 1 \
  --spec-type draft-eagle3 --spec-draft-model Qwen3.6-27B-EAGLE3-PRISM-f16.gguf
```

First request dies with:

```
src/llama-graph.cpp:956: GGML_ASSERT(t_layer_inp[il] != nullptr && "layer input tensor is null") failed
```

Cause: only `llama.cpp`, `qwen3.cpp`, `qwen3moe.cpp`, `gemma4.cpp`, and `openai-moe.cpp` populate `res->t_layer_inp[il]`; `qwen35.cpp` does not. Adding the same one-liner as in `qwen3.cpp` at the top of the layer loop in `qwen35.cpp` fixes the assert. EAGLE3 then works on the hybrid target for short contexts: output is coherent and draft acceptance is reported (e.g. 0.57 over 550 greedy code-gen tokens with this third-party drafter):

```cpp
for (int il = 0; il < n_layer; ++il) {
    res->t_layer_inp[il] = inpL;   // <-- added, same as qwen3.cpp
    ...
```

Happy to open a PR for this if qwen3_5 targets are in scope (related stale attempt: #21437).
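To check which other architectures would hit the same assert, the model sources can be scanned for the hook assignment. This is only a rough sketch: it assumes the graph builders live as `*.cpp` files in one directory (`src/models` in the tree patched here) and that the hook is always written as an assignment to `t_layer_inp[il]`; the helper name `split_by_hook` is made up for illustration.

```python
import re
from pathlib import Path

# Matches an assignment such as `res->t_layer_inp[il] = inpL;`
# but not comparisons like `t_layer_inp[il] != nullptr` or `==`.
HOOK = re.compile(r"t_layer_inp\s*\[\s*il\s*\]\s*=(?!=)")


def split_by_hook(models_dir):
    """Return (files_with_hook, files_without_hook) for *.cpp files in models_dir."""
    with_hook, without_hook = [], []
    for path in sorted(Path(models_dir).glob("*.cpp")):
        text = path.read_text(encoding="utf-8", errors="replace")
        if HOOK.search(text):
            with_hook.append(path.name)
        else:
            without_hook.append(path.name)
    return with_hook, without_hook
```

Calling `split_by_hook("src/models")` from the repository root should list `qwen35.cpp` in the second group before the patch and in the first group after it. A file without the hook is not necessarily a bug; it only means that architecture cannot currently serve as an EAGLE3 target.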

**Part 2 — with the hook in place, deterministic fatal once total context exceeds ~700 tokens:**

- prompt of ~650 tokens: works
- prompt of ~740 tokens: draft decode fails every time, server dies via the #20277 kill switch (the fatal-error message asks to report there, but this looked EAGLE3-specific enough for its own issue)

```
E process: llama_decode(ctx_dft) failed rc=-1 (n_tokens=4, ubatch_pos[0]=736)
tools/server/server-context.cpp:3185: fatal error - please provide logs and repro in https://github.com/ggml-org/llama.cpp/pull/20277
```

Notes:
- Independent of `-ub` / `-b` (tried `-ub 256 -b 4096` and `-ub 2048 -b 2048`) and of prompt content (plain repeated text, no chat history needed; single request on a fresh server reproduces it).
- `params_dft` inherits the full `n_ctx` (32768 here), so it does not look like a simple draft-context-size limit.
- `rc=-1` (invalid input batch) comes from the EAGLE3 draft path in `common/speculative.cpp`; the failing micro-batch is tiny (`n_tokens=4`) and starts just before the end of the prompt (`ubatch_pos[0]=736` for the ~740-token prompt).
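The exact threshold between the two data points above (~650 tokens OK, ~740 tokens fatal) could be narrowed down by bisecting over prompt length. A minimal sketch, assuming the failure is monotone in prompt length (not verified here); `fails` is a hypothetical callback that starts a fresh server, sends a prompt of `n` tokens, and reports whether the fatal error occurred:

```python
def first_failing(lo, hi, fails):
    """Smallest n in (lo, hi] with fails(n) true.

    Assumes fails(lo) is False, fails(hi) is True, and that fails is
    monotone: once a length fails, every longer length fails too.
    """
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if fails(mid):
            hi = mid  # failure reproduced: threshold is at or below mid
        else:
            lo = mid  # still fine: threshold is above mid
    return hi
```

With `lo=650` and `hi=740` this needs about seven probes. Since a failing probe kills the server, the callback has to restart it each time.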

Repro:

```bash
# 1) convert drafter (works)
python convert_hf_to_gguf.py prism-compressed/ --outtype f16 \
  --target-model-dir qwen36-27b-target/ --outfile Qwen3.6-27B-EAGLE3-PRISM-f16.gguf

# 2) patch src/models/qwen35.cpp as above, build with -DGGML_VULKAN=ON

# 3) start server (args above), then:
#    - send a ~650-token prompt  -> OK
#    - send a ~740-token prompt  -> llama_decode(ctx_dft) rc=-1, fatal
```
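Step 3 can be scripted. The sketch below builds a plain repeated-text prompt, asks the server for its token count, and then requests a completion. It assumes the server listens on `http://127.0.0.1:8080` and that the raw `/tokenize` and `/completion` endpoints are used rather than the chat endpoint, so the count may differ by a token or two from what the server logs. The helper names are made up for illustration.

```python
import http.client
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8080"  # assumption: default llama-server host/port


def make_prompt(n_repeats, unit="The quick brown fox jumps over the lazy dog. "):
    """Plain repeated text; the prompt content does not matter for the repro."""
    return unit * n_repeats


def build_request(path, payload, base_url=BASE_URL):
    """Build a JSON POST request without sending it."""
    return urllib.request.Request(
        base_url + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_json(path, payload, base_url=BASE_URL, timeout=600):
    """Send the request and decode the JSON response."""
    req = build_request(path, payload, base_url)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def count_tokens(text, base_url=BASE_URL):
    """Token count as reported by the server's /tokenize endpoint."""
    return len(post_json("/tokenize", {"content": text}, base_url)["tokens"])


def probe(n_repeats, n_predict=32, base_url=BASE_URL):
    """Return (prompt_tokens, ok); ok is False if the request did not complete."""
    prompt = make_prompt(n_repeats)
    n_tok = count_tokens(prompt, base_url)
    try:
        post_json("/completion",
                  {"prompt": prompt, "n_predict": n_predict, "temperature": 0},
                  base_url)
        return n_tok, True
    except (OSError, http.client.HTTPException):
        # Connection dropped or refused, e.g. because the server aborted.
        return n_tok, False
```

Adjust `n_repeats` until `count_tokens` reports roughly 650 and 740 tokens, then call `probe` on a freshly started server for each. A `False` result only says the request failed; the server log still has to be checked for the `llama_decode(ctx_dft)` line.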

### First Bad Commit

n/a (feature combination was never functional for this target arch)

### Relevant log output

```
1.06.900.055 I slot print_timing: id  0 | task 0 | draft acceptance = 1.00000 (    3 accepted /     3 generated)
1.06.900.393 I statistics     draft-eagle3: #calls(b,g,a) =    1      1      1, ...
... (short contexts fine) ...
0.21.939.306 E process: llama_decode(ctx_dft) failed rc=-1 (n_tokens=4, ubatch_pos[0]=736)
F:\AI\llama.cpp-src\tools\server\server-context.cpp:3185: fatal error - please provide logs and repro in https://github.com/ggml-org/llama.cpp/pull/20277
```
