fix(decode): reserve >= dflash_block_size outputs for DFlash draft contexts (#108) by marksverdhei · Pull Request #114 · heiervang-technologies/ht-llama.cpp

marksverdhei · 2026-06-14T03:43:39Z

Draft — the fix is complete and the diagnosis is high-confidence, but it must be smoke-tested against the cluster DFlash model before un-drafting (no local repro — gemma-4-31b-dflash-Q6_K is 62 G, doesn't fit here).

Root cause (full trace in #108)

The DFlash draft decodes the entire diffusion block in one llama_decode with every token logits=true (common/speculative.cpp:1325-1331), so n_outputs_all == dflash_block_size (default 16). But the server sizes the draft context's output buffer to n_parallel (tools/server/server-context.cpp:933, unconditional), so output_reserve(16) trips GGML_ASSERT(n_outputs_max <= cparams.n_outputs_max) (llama-context.cpp:2407) on the first draft() → the "loads clean, child dies on first decode" crash. MTP survives the same line because it outputs n_parallel per step; DFlash needs >= block_size.
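The mismatch can be modelled in isolation. What follows is a minimal stand-alone sketch, not the llama.cpp code: `DraftContextParams`, `output_reserve_ok`, and `demo_root_cause` are hypothetical names, and `n_parallel = 1` is an illustrative value. It models only the bound that the assertion enforces.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the one context parameter involved.
struct DraftContextParams {
    uint32_t n_outputs_max; // number of output rows the context was sized for
};

// Models the bound behind the GGML_ASSERT: a decode may not request more
// output rows than the context reserved. Returns false where the real
// code would abort.
inline bool output_reserve_ok(const DraftContextParams & cparams, uint32_t n_outputs) {
    return n_outputs <= cparams.n_outputs_max;
}

// Walks through the two cases from the root-cause description.
inline void demo_root_cause() {
    const uint32_t n_parallel        = 1;  // illustrative value (assumption)
    const uint32_t dflash_block_size = 16; // default block size per this PR

    const DraftContextParams draft{ n_parallel };

    // MTP-style step: n_parallel outputs per decode, within the bound.
    assert(output_reserve_ok(draft, n_parallel));
    // DFlash step: the whole block is flagged for output, exceeding the bound.
    assert(!output_reserve_ok(draft, dflash_block_size));
}
```

Calling `demo_root_cause()` passes both checks under these assumptions: the MTP-style request fits, and the DFlash request is the one the real assertion would reject.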

Fix

Clamp cparams.n_outputs_max up to dflash_block_size in the llama_context ctor for DFlash-arch models:

if (model.arch == LLM_ARCH_DFLASH) {
    cparams.n_outputs_max = std::max(cparams.n_outputs_max, model.hparams.dflash_block_size);
}

In-library guard → defends every caller, not just the server path.
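To make the guard's behaviour concrete, here is a small sketch of the same clamp as a pure function. `Arch` and `clamp_outputs` are hypothetical, reduced stand-ins for illustration and are not part of llama.cpp; the compile-time checks cover the three cases that matter.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical, reduced stand-in for the model architecture enum.
enum class Arch { Other, DFlash };

// Raises n_outputs_max to at least block_size for DFlash models and
// returns it unchanged for every other architecture.
constexpr uint32_t clamp_outputs(Arch arch, uint32_t n_outputs_max, uint32_t block_size) {
    return arch == Arch::DFlash ? std::max(n_outputs_max, block_size) : n_outputs_max;
}

// DFlash with an undersized buffer: raised to the block size.
static_assert(clamp_outputs(Arch::DFlash, 1, 16) == 16, "undersized buffer is raised");
// DFlash with an already large buffer: left alone (the clamp only raises).
static_assert(clamp_outputs(Arch::DFlash, 32, 16) == 32, "larger buffer is kept");
// Non-DFlash model: no-op regardless of block size.
static_assert(clamp_outputs(Arch::Other, 1, 16) == 1, "other architectures are untouched");
```

Because the clamp uses `std::max`, it can only increase the reservation, so a caller that already sized the context generously is unaffected.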

Validation

✅ Compiles clean (llama-server links; pre-existing clang-tidy warnings only, none on the change).
✅ Non-DFlash server load + /completion works (the ctor change is a no-op for non-DFlash — guarded on model.arch).
⏳ Pending: cluster smoke against gemma-4-31b-dflash-Q6_K — load + first draft() should now succeed instead of GGML_ASSERT. Un-draft once green.

Filed per /home/me/IDLE.md (draft-PR blocked-on-verification items rather than hold).

…ntexts

The DFlash draft decodes an entire diffusion block per step with every token flagged for output (block_size tokens, default 16), so output_reserve() needs n_outputs_max >= dflash_block_size. The server sizes draft contexts to n_parallel (server-context.cpp:933), which is < block_size, tripping GGML_ASSERT(n_outputs_max <= cparams.n_outputs_max) in output_reserve() on the first draft() call: the "loads clean, child dies on first decode" crash.

Clamp cparams.n_outputs_max up to dflash_block_size in the llama_context ctor for DFlash-arch models. In-library guard so it defends every caller, not just the server path.

No effect on non-DFlash models (guarded on arch). Addresses #108.

marksverdhei · 2026-06-14T05:23:59Z

Cross-confirmation from a bug-hunt of the adjacent DFlash multi-seq path: DFlash forces --parallel 1 — the server refuses to start otherwise (server-context.cpp:1022-1027, because dflash.target_features isn't keyed per seq_id). So the draft context's n_outputs_max (= n_parallel, server-context.cpp:933) is always exactly 1, and output_reserve(16) asserting 16 <= 1 makes the first-decode crash deterministic, not just 'possible when n_parallel < block_size'. That removes any doubt about repro conditions — DFlash + this server path crashes 100% of the time on first draft(), which matches the #108 report. The clamp in this PR (n_outputs_max >= dflash_block_size) is necessary and sufficient.
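The "necessary and sufficient" argument can be checked with a short sketch. This is a hypothetical model under the assumptions stated in the comment above (the draft context is sized to `n_parallel`, which is forced to 1); `first_draft_passes` and `unclamped_passes` are illustrative names, not llama.cpp functions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical model: would the first draft() decode pass the output bound?
//   n_parallel: value the server uses to size the draft context
//   block_size: outputs requested by one DFlash decode
//   clamped:    whether the n_outputs_max clamp from this PR is applied
constexpr bool first_draft_passes(uint32_t n_parallel, uint32_t block_size, bool clamped) {
    const uint32_t n_outputs_max = clamped ? std::max(n_parallel, block_size) : n_parallel;
    return block_size <= n_outputs_max;
}

// Counts block sizes in [2, max_block] for which the unclamped path would
// pass when n_parallel is forced to 1.
constexpr uint32_t unclamped_passes(uint32_t max_block) {
    uint32_t n = 0;
    for (uint32_t b = 2; b <= max_block; ++b) {
        if (first_draft_passes(1, b, false)) {
            ++n;
        }
    }
    return n;
}
```

In this model, `unclamped_passes` returns 0 for any `max_block`, since no block size above 1 fits a buffer of 1 (the clamp is necessary), while `first_draft_passes(1, b, true)` holds for every `b` (the clamp is sufficient).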

marksverdhei · 2026-06-16T12:34:06Z

Superseded by #121 ("clamp n_outputs_max for DFlash models"), which is merged and lands the identical fix — src/llama-context.cpp now has the same cparams.n_outputs_max = std::max(cparams.n_outputs_max, model.hparams.dflash_block_size) for LLM_ARCH_DFLASH. Issue #108 is closed. The only remaining delta in this branch is an explanatory comment; not worth a separate PR. Closing as redundant.

marksverdhei mentioned this pull request Jun 14, 2026

Hivemind Maintenance Tasks Epoch 1 #112

Open

8 tasks

marksverdhei closed this Jun 16, 2026

marksverdhei wants to merge 1 commit into ht from fix/dflash-noutputs-block-size
