fix(decode): reserve >= dflash_block_size outputs for DFlash draft contexts (#108) #114

Closed
marksverdhei wants to merge 1 commit into ht from fix/dflash-noutputs-block-size

Conversation

@marksverdhei

Draft — the fix is complete and the diagnosis is high-confidence, but it must be smoke-tested against the cluster DFlash model before un-drafting (no local repro — gemma-4-31b-dflash-Q6_K is 62 G, doesn't fit here).

Root cause (full trace in #108)

The DFlash draft decodes the entire diffusion block in a single llama_decode call with logits=true on every token (common/speculative.cpp:1325-1331), so n_outputs_all == dflash_block_size (default 16). The server, however, unconditionally sizes the draft context's output buffer to n_parallel (tools/server/server-context.cpp:933). As a result, output_reserve(16) trips GGML_ASSERT(n_outputs_max <= cparams.n_outputs_max) (llama-context.cpp:2407) on the first draft() call, which is the "loads clean, child dies on first decode" crash. MTP survives the same line because it produces n_parallel outputs per step; DFlash needs n_outputs_max >= dflash_block_size.
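The size mismatch can be illustrated with a minimal standalone sketch. It does not use the llama.cpp API: `DraftBatch`, `make_dflash_block`, `count_outputs`, and `fits_output_buffer` are hypothetical stand-ins for the draft batch, the output counting, and the condition that the assert in output_reserve() enforces. The values 16 and 1 are the default block size and an assumed n_parallel of 1.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a decode batch: one flag per token saying
// whether logits (an output row) are requested for that token.
struct DraftBatch {
    std::vector<bool> wants_logits;
};

// Build a DFlash-style draft batch: every token in the block requests logits.
inline DraftBatch make_dflash_block(uint32_t block_size) {
    assert(block_size > 0); // an empty block would not be a meaningful draft
    return DraftBatch{std::vector<bool>(block_size, true)};
}

// Count how many output rows a single decode call would need.
inline uint32_t count_outputs(const DraftBatch & batch) {
    uint32_t n = 0;
    for (bool w : batch.wants_logits) {
        if (w) {
            ++n;
        }
    }
    return n;
}

// Mirrors the asserted condition: requested outputs <= reserved outputs.
inline bool fits_output_buffer(uint32_t n_outputs, uint32_t n_outputs_max) {
    return n_outputs <= n_outputs_max;
}
```

With a block of 16 tokens and a buffer reserved for 1 output, `fits_output_buffer(count_outputs(make_dflash_block(16)), 1)` is false, which corresponds to the failing assert. With a buffer of 16 it is true.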

Fix

Clamp cparams.n_outputs_max up to dflash_block_size in the llama_context ctor for DFlash-arch models:

if (model.arch == LLM_ARCH_DFLASH) {
    cparams.n_outputs_max = std::max(cparams.n_outputs_max, model.hparams.dflash_block_size);
}

Because the guard lives in the library, it protects every caller, not just the server path.
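As a rough standalone illustration of the clamp's behavior, the sketch below uses hypothetical stand-ins rather than the actual llama.cpp types: `Arch`, `HParams`, `CParams`, and `clamp_outputs_for_arch` are assumed names that loosely mirror model.arch, model.hparams, cparams, and the constructor logic. It shows that the guard raises the limit only for the DFlash architecture and never lowers a value that is already large enough.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins; the real llama.cpp structures carry many more fields.
enum class Arch { Other, DFlash };

struct HParams {
    uint32_t dflash_block_size = 16; // default block size quoted in the PR
};

struct CParams {
    uint32_t n_outputs_max = 1; // what the server would set from n_parallel
};

// Raise n_outputs_max to at least the block size, for DFlash models only.
inline void clamp_outputs_for_arch(Arch arch, const HParams & hparams, CParams & cparams) {
    if (arch == Arch::DFlash) {
        cparams.n_outputs_max = std::max(cparams.n_outputs_max, hparams.dflash_block_size);
        // Postcondition: a full block of outputs now fits.
        assert(cparams.n_outputs_max >= hparams.dflash_block_size);
    }
}
```

Note that std::max requires both arguments to have the same type. The sketch assumes both fields are uint32_t; if the real fields differ in type, an explicit template argument or a cast would be needed.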

Validation

  • ✅ Compiles clean (llama-server links; pre-existing clang-tidy warnings only, none on the change).
  • ✅ Non-DFlash server load + /completion works (the ctor change is a no-op for non-DFlash — guarded on model.arch).
  • Pending: cluster smoke test against gemma-4-31b-dflash-Q6_K. Load and the first draft() should now succeed instead of hitting GGML_ASSERT. Un-draft once green.

Filed per /home/me/IDLE.md (open a draft PR for items blocked on verification rather than holding them).

…ntexts

The DFlash draft decodes an entire diffusion block per step with every token
flagged for output (block_size tokens, default 16), so output_reserve() needs
n_outputs_max >= dflash_block_size. The server sizes draft contexts to
n_parallel (server-context.cpp:933), which is < block_size, tripping
GGML_ASSERT(n_outputs_max <= cparams.n_outputs_max) in output_reserve() on the
first draft() call — the "loads clean, child dies on first decode" crash.

Clamp cparams.n_outputs_max up to dflash_block_size in the llama_context ctor
for DFlash-arch models. In-library guard so it defends every caller, not just
the server path. No effect on non-DFlash models (guarded on arch).

Addresses #108.
@marksverdhei

Author

Cross-confirmation from a bug-hunt of the adjacent DFlash multi-seq path: DFlash forces --parallel 1 — the server refuses to start otherwise (server-context.cpp:1022-1027, because dflash.target_features isn't keyed per seq_id). So the draft context's n_outputs_max (= n_parallel, server-context.cpp:933) is always exactly 1, and output_reserve(16) asserting 16 <= 1 makes the first-decode crash deterministic, not just 'possible when n_parallel < block_size'. That removes any doubt about repro conditions — DFlash + this server path crashes 100% of the time on first draft(), which matches the #108 report. The clamp in this PR (n_outputs_max >= dflash_block_size) is necessary and sufficient.

@marksverdhei

Author

Superseded by #121 ("clamp n_outputs_max for DFlash models"), which is merged and lands the identical fix — src/llama-context.cpp now has the same cparams.n_outputs_max = std::max(cparams.n_outputs_max, model.hparams.dflash_block_size) for LLM_ARCH_DFLASH. Issue #108 is closed. The only remaining delta in this branch is an explanatory comment; not worth a separate PR. Closing as redundant.
