fix(decode): reserve >= dflash_block_size outputs for DFlash draft contexts (#108) #114
marksverdhei wants to merge 1 commit into
…ntexts

The DFlash draft decodes an entire diffusion block per step with every token flagged for output (block_size tokens, default 16), so output_reserve() needs n_outputs_max >= dflash_block_size. The server sizes draft contexts to n_parallel (server-context.cpp:933), which is < block_size, tripping GGML_ASSERT(n_outputs_max <= cparams.n_outputs_max) in output_reserve() on the first draft() call: the "loads clean, child dies on first decode" crash.

Clamp cparams.n_outputs_max up to dflash_block_size in the llama_context ctor for DFlash-arch models. In-library guard so it defends every caller, not just the server path. No effect on non-DFlash models (guarded on arch).

Addresses #108.
Cross-confirmation from a bug-hunt of the adjacent DFlash multi-seq path: DFlash forces …
Superseded by #121 ("clamp n_outputs_max for DFlash models"), which is merged and lands the identical fix …
Draft: the fix is complete and the diagnosis is high-confidence, but it must be smoke-tested against the cluster DFlash model before un-drafting (no local repro: `gemma-4-31b-dflash-Q6_K` is 62 G and doesn't fit here).

Root cause (full trace in #108)
The DFlash draft decodes the entire diffusion block in one `llama_decode` with every token `logits=true` (`common/speculative.cpp:1325-1331`), so `n_outputs_all == dflash_block_size` (default 16). But the server sizes the draft context's output buffer to `n_parallel` (`tools/server/server-context.cpp:933`, unconditional), so `output_reserve(16)` trips `GGML_ASSERT(n_outputs_max <= cparams.n_outputs_max)` (`llama-context.cpp:2407`) on the first `draft()` → the "loads clean, child dies on first decode" crash. MTP survives the same line because it outputs `n_parallel` per step; DFlash needs `>= block_size`.

Fix
Clamp `cparams.n_outputs_max` up to `dflash_block_size` in the `llama_context` ctor for DFlash-arch models.

In-library guard → defends every caller, not just the server path.
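The mismatch and the clamp can be illustrated with a small standalone model. This is only a sketch under stated assumptions, not the actual diff: `toy_arch`, `toy_cparams`, `toy_model`, `output_reserve_ok`, and `clamp_for_dflash` are hypothetical stand-ins invented for illustration, and the real llama.cpp types and constructor logic may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for illustration only; NOT the real llama.cpp types.
enum class toy_arch { OTHER, DFLASH };

struct toy_cparams {
    uint32_t n_outputs_max;     // output-buffer capacity the context was created with
};

struct toy_model {
    toy_arch arch;
    uint32_t dflash_block_size; // tokens per diffusion block (default 16 per the PR text)
};

// Models the precondition quoted in the PR: a request for n_outputs rows
// must fit within the capacity recorded in cparams.
bool output_reserve_ok(const toy_cparams & cparams, uint32_t n_outputs) {
    return n_outputs <= cparams.n_outputs_max;
}

// Models the described fix: raise the capacity to at least one full block,
// but only for DFlash-arch models. Other models are returned unchanged.
toy_cparams clamp_for_dflash(toy_cparams cparams, const toy_model & model) {
    if (model.arch == toy_arch::DFLASH) {
        cparams.n_outputs_max = std::max(cparams.n_outputs_max, model.dflash_block_size);
    }
    return cparams;
}
```

With a context sized to a hypothetical `n_parallel` of 4 and a block size of 16, `output_reserve_ok` is false before the clamp and true after it. A non-DFlash model keeps its original capacity, and a capacity already above the block size is left alone.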
Validation

- `llama-server` links; pre-existing clang-tidy warnings only, none on the change.
- `/completion` works (the ctor change is a no-op for non-DFlash, guarded on `model.arch`).
- `gemma-4-31b-dflash-Q6_K`: load + first `draft()` should now succeed instead of `GGML_ASSERT`. Un-draft once green.

Filed per `/home/me/IDLE.md` (draft-PR blocked-on-verification items rather than hold).