[codex] Recover dflash spec-decode agent stalls #315
Code Review: dflash spec-decode agent stalls
Verdict: approve (with suggestions)
The implementation is functionally correct and aligned with the task goal. The env-gated approach (
Non-blocking findings
1.
2. The
3. When
4. Duplicated Both
5. Unbounded Every stall/recovery event appends to
6.
2 issues found across 4 files
3ba401f to 17a29de
Follow-up to the spec-decode stall-recovery commits, closing the remaining non-blocking review notes:

- Reuse the shared env_int_or_default() helper for the do_spec_decode DFLASH_MIN_TOKENS floor instead of a duplicated inline getenv lambda (do_ar_decode already used the helper).
- Document that /tmp/dflash_floor.log is a debug-only diagnostic, written only when the operator opts into DFLASH_MIN_TOKENS, so the default production lane never touches it.
- Explain why last_tok is updated only on the non-floor path (the floor_to_ar branch sets cache_.last_tok directly and returns).
- Explain why the action-suffix detector collects the trailing token of each colon variant rather than the full encoded sequence.

No behavior change in the default lane; suffix-anchored skip detection and tool_choice/value-aware gating were already in place.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
17a29de to 138b797
Rebuilt and rebased this branch to clear the merge conflict and restore the

Conflict cause. Upstream merged the empty-output→AR fallback in #314, so the

Resolution. Rebased the two stall-recovery commits onto current

Review findings (all addressed in the current commits):

Validation (taro, RTX 5090 / sm_120 / CUDA 13.3):
1 issue found across 2 files (changes from recent commits).
Updated PR #315 again for the merge-conflict/comment follow-up. Current head: What changed:
Verification:
Status for whoever picks this up: this recovery is deployed and live on the fork production lane (Qwen3.6-27B dflash on an RTX 5090), env-gated behind
…nflict

Sole conflict is the empty-spec-decode retry log message in ModelBackend::generate_impl/restore_and_generate_impl (the Luce-Org#319 wrapper). Keep this branch's richer message (includes decode_s timing). All other stall-recovery changes auto-merge clean onto the post-Luce-Org#319 generate_impl surface.

Verified on lucebox2 (RTX 3090): dflash_server + test_server_unit build clean, 1759 assertions, 0 failures, default-off generation unchanged.

Co-Authored-By: WOZCODE <contact@withwoz.com>
Summary
Env-gated recovery for the Qwen3.5/Qwen3.6 dflash spec-decode "agent stall": the model emits a short action preamble (e.g. `let me check …:`) and then EOS before producing the tool-call XML, so the turn returns prose with no `tool_call`.

When `DFLASH_STALL_TOOL_PREFIX` is enabled and the request carries tools, the spec-decode replay loop detects a premature EOS right after an action suffix, injects a minimal tool-call XML prefix (`tool_choice`/schema-aware so a forced non-terminal tool never receives terminal XML), replays KV to the right boundary, and tails off in AR decode. A separate `DFLASH_MIN_TOKENS` floor handles the same preamble→EOS stall in the pure-AR path, and a bounded repeated-token guard turns the residual malformed-tool-buffer case into a retryable `finish_reason=length` instead of burning the whole token budget.

Both envs default off, so the production lane is byte-for-byte unchanged.
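The detection condition described above can be sketched roughly as follows. This is a minimal illustration, not the code in `qwen35_backend.cpp`: the `StallCheck` struct, its field names, and `is_premature_eos` are hypothetical, and it assumes the action suffix is matched on the single trailing token of each colon variant.

```cpp
#include <cstdint>
#include <unordered_set>
#include <vector>

// Hypothetical inputs for illustration; the real request/state types differ.
struct StallCheck {
    bool gate_enabled;                          // e.g. DFLASH_STALL_TOOL_PREFIX is on
    bool request_has_tools;                     // the request carries tools
    std::unordered_set<int32_t> action_suffix;  // trailing token ids of ":"-style suffixes
    int32_t eos_id;                             // end-of-sequence token id
};

// Returns true when next_tok is an EOS that immediately follows an action
// suffix, i.e. the model stopped right after a preamble such as "let me check ...:".
inline bool is_premature_eos(const StallCheck& c,
                             const std::vector<int32_t>& emitted,
                             int32_t next_tok) {
    if (!c.gate_enabled || !c.request_has_tools) return false;
    if (next_tok != c.eos_id || emitted.empty()) return false;
    return c.action_suffix.count(emitted.back()) > 0;
}
```

In this sketch, a `true` result is the point at which the caller would inject the tool-call prefix and continue in AR decode; with the gate off, the function always returns `false`, which matches the default-off behavior.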
Root Cause
The original EOS floor lived only in the AR path, but most agentic stalls exit through the spec-decode replay/emit loop. That let Q4 accept short action preambles and stop before emitting tool XML. The residual `req_0031` case is different: dflash starts a plausible `execute_code` call, then degenerates into a repeated-punctuation run before closing XML. This PR intentionally does not salvage that incomplete code; it marks the decode as a bounded length-class failure so the Hermes Q6 retry can recover safely.
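One way to bound a degenerate run like the `req_0031` case is a small run-length counter over emitted tokens. The sketch below is an assumption about the general technique, not the PR's implementation: the class name `RepeatRunGuard` and the `max_run` threshold are hypothetical, and the real guard may compare on something other than exact token identity.

```cpp
#include <cstddef>
#include <cstdint>

// Tracks the length of the current run of identical tokens. The threshold is
// an assumed tuning knob, not a value taken from the PR.
class RepeatRunGuard {
public:
    explicit RepeatRunGuard(std::size_t max_run) : max_run_(max_run) {}

    // Feed one emitted token. Returns true once the same token has been seen
    // max_run times in a row, signalling the caller to stop decoding.
    bool push(int32_t tok) {
        if (run_ > 0 && tok == last_) {
            ++run_;
        } else {
            last_ = tok;
            run_ = 1;
        }
        return run_ >= max_run_;
    }

private:
    std::size_t max_run_;
    std::size_t run_ = 0;
    int32_t last_ = 0;
};
```

When `push` returns `true`, the caller would end the turn and report it as a length-class finish so the retry path can take over, rather than continuing to spend the token budget.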
Rebase / conflict resolution
This branch was rebased cleanly onto current `main`. Upstream now contains the empty-output→AR fallback (merged in #314), so the previously-bundled fork copy of that fix and an unrelated cancellation-handling commit (the #324 mirror) were dropped; they were the source of the merge conflict. The empty-spec retry now rides on upstream's `generate_with_empty_spec_fallback` wrapper instead of an inline fallback; the call sites in `generate`/`restore_and_generate` keep upstream's `force_ar_decode` branch and add the three `stall_*` parameters. Net result is a focused 4-file diff that no longer conflicts with `main`.

What changed
- `server/src/common/model_backend.h`: three optional `stall_*_tokens` request fields on `GenerateRequest`.
- `server/src/qwen35/qwen35_backend.{h,cpp}`: spec-decode stall detection + tool-prefix injection + AR tail-off; AR-path `DFLASH_MIN_TOKENS` floor; bounded degenerate-run guard; suffix-anchored skip detection.
- `server/src/server/http_server.cpp`: value-aware env gate, `tool_choice`/schema-aware prefix builder, action-suffix / skip token construction.
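The "suffix-anchored" idea in the list above can be illustrated with a check that only looks at the tail of the token history instead of the whole sequence. This is a sketch under assumptions: the function name `recent_tokens_contain` and the `window` parameter are hypothetical and are not the signature of the PR's `tokens_contain_recent_sequence`.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// True if `needle` occurs entirely within the last `window` tokens of `tokens`.
// Matches that only appear earlier in the history are ignored.
inline bool recent_tokens_contain(const std::vector<int32_t>& tokens,
                                  const std::vector<int32_t>& needle,
                                  std::size_t window) {
    if (needle.empty() || needle.size() > tokens.size()) return false;
    const std::size_t span = std::min(window, tokens.size());
    if (needle.size() > span) return false;
    const auto first = tokens.end() - static_cast<std::ptrdiff_t>(span);
    return std::search(first, tokens.end(), needle.begin(), needle.end()) != tokens.end();
}
```

Anchoring to the tail avoids treating a skip marker that appeared many tokens ago as if it applied to the current position.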
How to review
Start with `http_server.cpp` (`env_flag_enabled`, `select_stall_recovery_function`, `build_stall_tool_prefix`) to see how the recovery prefix is chosen and gated, then `qwen35_backend.cpp` `do_spec_decode` (the `floor_to_ar`/`inject_tool_prefix` block) for the replay-and-tail-off mechanics. The two `generate` call sites are the only places that interact with upstream's empty-spec wrapper.
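For orientation before reading `select_stall_recovery_function`, the selection rule stated in this PR (honor a forced function or a single required tool, otherwise fall back to `terminal`) might look like the sketch below. The `ToolChoiceKind` enum, the `ToolChoice` struct, and `pick_recovery_function` are hypothetical stand-ins; the real code works on the request's JSON and also does schema-aware parameter selection, which is not shown here.

```cpp
#include <string>
#include <vector>

// Hypothetical modeling of the request's tool_choice field.
enum class ToolChoiceKind { Auto, Required, Function };

struct ToolChoice {
    ToolChoiceKind kind;
    std::string function_name;  // meaningful only when kind == Function
};

// Picks the function name that the injected tool-call prefix should target.
inline std::string pick_recovery_function(const ToolChoice& choice,
                                          const std::vector<std::string>& tools,
                                          const std::string& fallback = "terminal") {
    // A forced function always wins, so it never receives terminal XML.
    if (choice.kind == ToolChoiceKind::Function && !choice.function_name.empty()) {
        return choice.function_name;
    }
    // "required" with exactly one tool leaves no ambiguity about the target.
    if (choice.kind == ToolChoiceKind::Required && tools.size() == 1) {
        return tools.front();
    }
    return fallback;
}
```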
Validation
- `cmake --build server/build --target test_server_unit` on taro (RTX 5090, sm_120, CUDA 13.3): clean, rc=0.
- `./test_server_unit` → 1620 assertions, 0 failures (rc=0). This includes upstream's new `ModelBackend empty-spec retry` tests, confirming the branch composes correctly with #314's wrapper (fix(common): retry empty spec-decode output through AR).
- revision: 16/17 captured stall turns produce real tool calls, 0/2 legit controls produce tool calls, and `req_0031` terminates as a bounded `finish_reason=length` at 321 completion tokens. The prefix-cache oracle remains covered with the recovery envs off.
Review findings (all addressed)
- `DFLASH_STALL_TOOL_PREFIX=0` still enabled): fixed via value-aware `env_flag_enabled` (rejects `0`/`false`/`no`/`off`).
- `tool_choice`: fixed via `select_stall_recovery_function`/`build_stall_tool_prefix`, which honor a forced function (or single `required` tool) and only fall back to `terminal` otherwise, with schema-aware first-parameter selection.
- `tokens_contain_recent_sequence`) instead of full-sequence; `DFLASH_MIN_TOKENS` reads through one shared `env_int_or_default` helper; `last_tok`/debug-log/suffix-token rationale is documented in comments.
- the retry (`[backend] spec-decode produced zero tokens; retrying with AR decode`).

Risks / gaps
- `DFLASH_STALL_TOOL_PREFIX` + `DFLASH_MIN_TOKENS`; with both unset the lane is byte-for-byte unchanged (covered by the default-off unit run).
- `/tmp/dflash_floor.log` is an opt-in debug diagnostic (written only when `DFLASH_MIN_TOKENS>0`); documented inline as append-only / rotate out of band.
- `req_0031`'s incomplete `execute_code` is intentionally not salvaged (out of scope here); it is handled as a bounded length-class failure for the Q6 retry.
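The value-aware gate and the shared integer helper mentioned under Review findings can be approximated as below. This is a sketch, not the PR's `env_flag_enabled` or `env_int_or_default`: the function names here are hypothetical, and treating an empty value as disabled, matching case-insensitively, and falling back on unparsable integers are assumptions beyond the stated `0`/`false`/`no`/`off` rejection.

```cpp
#include <algorithm>
#include <cctype>
#include <cstdlib>
#include <string>

// Interprets a raw env value. Unset, empty, "0", "false", "no" and "off"
// (any letter case) count as disabled; anything else counts as enabled.
inline bool flag_value_enabled(const char* raw) {
    if (raw == nullptr) return false;
    std::string v(raw);
    std::transform(v.begin(), v.end(), v.begin(),
                   [](unsigned char ch) { return static_cast<char>(std::tolower(ch)); });
    if (v.empty()) return false;
    return v != "0" && v != "false" && v != "no" && v != "off";
}

// Convenience wrapper that reads the process environment.
inline bool env_flag_enabled_sketch(const char* name) {
    return flag_value_enabled(std::getenv(name));
}

// Parses a base-10 integer, returning `fallback` for unset or malformed input.
inline int int_value_or_default(const char* raw, int fallback) {
    if (raw == nullptr || *raw == '\0') return fallback;
    char* end = nullptr;
    const long parsed = std::strtol(raw, &end, 10);
    if (end == raw || *end != '\0') return fallback;
    return static_cast<int>(parsed);
}
```

Separating value parsing from the `getenv` call keeps the parsing logic testable without mutating the process environment.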
Co-Authored-By: Claude Opus 4.8 (1M context) noreply@anthropic.com