[codex] Recover dflash spec-decode agent stalls by OmarB97 · Pull Request #315 · Luce-Org/lucebox-hub

OmarB97 · 2026-05-31T01:00:10Z

Summary

Env-gated recovery for the Qwen3.5/Qwen3.6 dflash spec-decode "agent stall": the
model emits a short action preamble (e.g. let me check …:) and then EOS before
producing the tool-call XML, so the turn returns prose with no tool_call.

When DFLASH_STALL_TOOL_PREFIX is enabled and the request carries tools, the
spec-decode replay loop detects a premature EOS right after an action suffix,
injects a minimal tool-call XML prefix (tool_choice/schema-aware so a forced
non-terminal tool never receives terminal XML), replays KV to the right boundary,
and tails off in AR decode. A separate DFLASH_MIN_TOKENS floor handles the same
preamble→EOS stall in the pure-AR path, and a bounded repeated-token guard turns
the residual malformed-tool-buffer case into a retryable finish_reason=length
instead of burning the whole token budget.

Both envs default off, so the production lane is byte-for-byte unchanged.

Root Cause

The original EOS floor lived only in the AR path, but most agentic stalls exit
through the spec-decode replay/emit loop. That let Q4 accept short action
preambles and stop before emitting tool XML. The residual req_0031 case is
different: dflash starts a plausible execute_code call, then degenerates into a
repeated-punctuation run before closing XML. This PR intentionally does not
salvage that incomplete code — it marks the decode as a bounded length-class
failure so the Hermes Q6 retry can recover safely.

Rebase / conflict resolution

This branch was rebased cleanly onto current main. Upstream now contains the
empty-output→AR fallback (merged in #314), so the previously-bundled fork copy of
that fix and an unrelated cancellation-handling commit (the #324 mirror) were
dropped — they were the source of the merge conflict. The empty-spec retry now
rides on upstream's generate_with_empty_spec_fallback wrapper instead of an
inline fallback; the call sites in generate / restore_and_generate keep
upstream's force_ar_decode branch and add the three stall_* parameters. Net
result is a focused 4-file diff that no longer conflicts with main.

What changed

server/src/common/model_backend.h — three optional stall_*_tokens request
fields on GenerateRequest.
server/src/qwen35/qwen35_backend.{h,cpp} — spec-decode stall detection +
tool-prefix injection + AR tail-off; AR-path DFLASH_MIN_TOKENS floor; bounded
degenerate-run guard; suffix-anchored skip detection.
server/src/server/http_server.cpp — value-aware env gate, tool_choice/
schema-aware prefix builder, action-suffix / skip token construction.

How to review

Start with http_server.cpp (env_flag_enabled, select_stall_recovery_function,
build_stall_tool_prefix) to see how the recovery prefix is chosen and gated, then
qwen35_backend.cpp do_spec_decode (the floor_to_ar / inject_tool_prefix
block) for the replay-and-tail-off mechanics. The two generate call sites are the
only place that interacts with upstream's empty-spec wrapper.

Validation

Build: cmake --build server/build --target test_server_unit on taro
(RTX 5090, sm_120, CUDA 13.3) — clean, rc=0.
Unit: ./test_server_unit → 1620 assertions, 0 failures (rc=0). This
includes upstream's new ModelBackend empty-spec retry tests, confirming the
branch composes correctly with fix(common): retry empty spec-decode output through AR #314's wrapper.
The stall-recovery decode logic is unchanged from the previously-validated
revision: 16/17 captured stall turns produce real tool calls, 0/2 legit controls
produce tool calls, and req_0031 terminates as a bounded finish_reason=length
at 321 completion tokens. The prefix-cache oracle remains covered with the
recovery envs off.

Review findings (all addressed)

cubic P1 — presence-only env gate (DFLASH_STALL_TOOL_PREFIX=0 still enabled):
fixed via value-aware env_flag_enabled (rejects 0/false/no/off).
cubic P1 — recovery prefix ignored tool_choice: fixed via
select_stall_recovery_function / build_stall_tool_prefix, which honor a forced
function (or single required tool) and only fall back to terminal otherwise,
with schema-aware first-parameter selection.
Skip detection is now suffix-anchored (tokens_contain_recent_sequence) instead
of full-sequence; DFLASH_MIN_TOKENS reads through one shared env_int_or_default
helper; last_tok/debug-log/suffix-token rationale is documented in comments.
The "empty-fallback has no logging" note is moot now that upstream's wrapper logs
the retry ([backend] spec-decode produced zero tokens; retrying with AR decode).

Risks / gaps

Recovery is behind DFLASH_STALL_TOOL_PREFIX + DFLASH_MIN_TOKENS; with both
unset the lane is byte-for-byte unchanged (covered by the default-off unit run).
/tmp/dflash_floor.log is an opt-in debug diagnostic (written only when
DFLASH_MIN_TOKENS>0); documented inline as append-only / rotate out of band.
req_0031's incomplete execute_code is intentionally not salvaged — out of
scope here; it is handled as a bounded length-class failure for the Q6 retry.

Co-Authored-By: Claude Opus 4.8 (1M context) noreply@anthropic.com

OmarB97 · 2026-05-31T01:04:30Z

Code Review: dflash spec-decode agent stalls

Verdict: approve (with suggestions)

The implementation is functionally correct and aligned with the task goal. The env-gated approach (DFLASH_MIN_TOKENS, DFLASH_STALL_TOOL_PREFIX) ensures zero behavioral change in the default production lane. The stall detection and recovery flow — detect preamble→EOS in spec-decode replay, inject tool-call XML prefix, replay KV, fall back to AR — is sound.

Non-blocking findings

1. tokens_contain for skip detection is full-sequence, not suffix-anchored (qwen35_backend.cpp ~L1461)

tokens_contain(out_tokens, *stall_skip_tokens) searches for the exact sequence " done" anywhere in out_tokens, not just at the end. If " done" appeared earlier in the conversation (e.g., in a previous assistant turn or the system prompt), the skip guard would fire on a stale match. Consider switching to a suffix-only check (e.g., searching only the last N+len(skip) positions) to avoid false negatives on stall recovery.

2. tokens_have_recent_any matches individual sub-tokens, not multi-token suffixes (http_server.cpp + qwen35_backend.cpp)

The stall_action_suffix_tokens vector is constructed by taking only the last token of each encoded suffix variant (":", "`:", "):", "":"). The tokens_have_recent_any check then looks for any of these individual tokens in the last 4 positions. This works when the relevant tokenizer always emits ":" as a single token, but if tokenization splits a suffix (e.g., "`:" → two tokens), the individual sub-token ":" could appear coincidentally. Consider adding a comment explaining why this flat-individual-token approach is safe for this tokenizer.

3. empty_result → AR fallback has no logging (qwen35_backend.cpp ~L608, ~L721)

When do_spec_decode returns true with empty result.tokens, the code silently falls back to do_ar_decode. This path is hit without logging to stdout/stderr or /tmp/dflash_floor.log. Consider adding a diagnostic fprintf here to distinguish "floor recovered" from "silent fallback" when debugging.

4. Duplicated _min_floor static init (qwen35_backend.cpp ~L1051 and ~L1211)

Both do_ar_decode and do_spec_decode initialize a static _min_floor via the same lambda reading getenv("DFLASH_MIN_TOKENS"). A shared helper (e.g., get_dflash_min_tokens()) would prevent drift.

5. Unbounded /tmp/dflash_floor.log growth

Every stall/recovery event appends to /tmp/dflash_floor.log without rotation or size cap. On a production server with frequent agentic turns, this could fill /tmp. Consider a max-size guard or documented log rotation expectation.

6. last_tok scoping change (qwen35_backend.cpp ~L1420)

last_tok = replay_last_tok was moved from its original unconditional position into an else block, and the floor_to_ar path sets cache_.last_tok directly before returning. I verified this is correct — the floor_to_ar path always returns via step_graph_destroy + return ok rather than falling through to the next loop iteration where last_tok would be needed. However, the control flow is now harder to follow. Consider a brief comment above the else block explaining why last_tok is conditionally updated.

cubic-dev-ai

2 issues found across 4 files

_{Reply with feedback, questions, or to request a fix.

Re-trigger cubic}

…recovery

Follow-up to the spec-decode stall-recovery commits, closing the remaining non-blocking review notes: - Reuse the shared env_int_or_default() helper for the do_spec_decode DFLASH_MIN_TOKENS floor instead of a duplicated inline getenv lambda (do_ar_decode already used the helper). - Document that /tmp/dflash_floor.log is a debug-only diagnostic, written only when the operator opts into DFLASH_MIN_TOKENS, so the default production lane never touches it. - Explain why last_tok is updated only on the non-floor path (the floor_to_ar branch sets cache_.last_tok directly and returns). - Explain why the action-suffix detector collects the trailing token of each colon variant rather than the full encoded sequence. No behavior change in the default lane; suffix-anchored skip detection and tool_choice/value-aware gating were already in place. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

OmarB97 · 2026-06-02T02:29:27Z

Rebuilt and rebased this branch to clear the merge conflict and restore the
reviewed state. Summary of what changed and why (force-push: 17a29de → 138b797):

Conflict cause. Upstream merged the empty-output→AR fallback in #314, so the
copy of that fix carried on this branch (plus an unrelated cancellation commit, the
#324 mirror) collided with main. The branch had also drifted back to a pre-review
revision, dropping the earlier review fixes.

Resolution. Rebased the two stall-recovery commits onto current main, dropping
the now-redundant empty-output duplicate and the #324 cancellation work. The
empty-spec retry now rides on upstream's generate_with_empty_spec_fallback
wrapper; the generate / restore_and_generate call sites keep upstream's
force_ar_decode branch and just add the three stall_* parameters. Result is a
focused 4-file diff (+346/-5) that no longer conflicts. PR now shows MERGEABLE.

Review findings — all addressed in the current commits:

cubic P1 (presence-only env gate) → value-aware env_flag_enabled rejects
0/false/no/off.
cubic P1 (prefix ignored tool_choice) → select_stall_recovery_function /
build_stall_tool_prefix honor a forced function (or single required tool),
only falling back to terminal, with schema-aware first-parameter selection.
skip detection not suffix-anchored → now tokens_contain_recent_sequence
(bounded trailing window); also note out_tokens is per-generation, so a stale
match from earlier turns/system prompt isn't possible.
duplicated DFLASH_MIN_TOKENS init → both paths read through one
env_int_or_default helper.
tokens_have_recent_any / last_tok / /tmp/dflash_floor.log → clarifying
comments added (the log is an opt-in debug diagnostic, silent in the default lane).
empty-fallback "no logging" → moot; upstream's wrapper logs the retry.

Validation (taro, RTX 5090 / sm_120 / CUDA 13.3): test_server_unit builds
clean and passes 1620 assertions, 0 failures — including upstream's new
ModelBackend empty-spec retry tests, confirming the branch composes with #314. The
stall-recovery decode path itself is unchanged from the previously-validated
revision (16/17 captured stall turns → real tool calls; req_0031 → bounded
finish_reason=length).

cubic-dev-ai

1 issue found across 2 files (changes from recent commits).

_{Tip: Review your code locally with the cubic CLI to iterate faster.

Re-trigger cubic}

OmarB97 · 2026-06-02T02:49:10Z

Updated PR #315 again for the merge-conflict/comment follow-up.

Current head: 3a06dd4

What changed:

Rebuilt/rebased the branch on current Luce-Org/main; PR is now MERGEABLE.
Addressed Omar's top-level review suggestions:
- skip detection now uses a trailing/suffix-window helper rather than a full-output search shape;
- suffix-token matching has explicit rationale at construction/use sites;
- empty spec-decode retry diagnostics now print decode timing and restore slot;
- DFLASH_MIN_TOKENS is centralized through one cached helper;
- /tmp/dflash_floor.log is capped through a safe open/fstat/ftruncate path;
- last_tok floor-to-AR control flow is documented.
Cubic identified a valid TOCTOU risk in my first log-cap implementation; fixed in 3a06dd4 with open + fstat + O_NOFOLLOW before truncation. The cubic thread is now marked addressed/resolved.

Verification:

git diff --check passed locally.
Clean taro worktree from pushed 3a06dd4:
- configured with CUDA 13.3 / sm_120 / consumer Blackwell fix;
- built test_server_unit and dflash_server;
- ran ./server/build-pr315/test_server_unit -> 1620 assertions, 0 failures.
GitHub checks as of this comment: uv workspace passed, cubic passed, Build (cmake + uv sync --extra megakernel) still in progress.

OmarB97 · 2026-06-02T05:22:23Z

Status for whoever picks this up: this recovery is deployed and live on the fork production lane (Qwen3.6-27B dflash on an RTX 5090), env-gated behind DFLASH_MIN_TOKENS + DFLASH_STALL_TOOL_PREFIX (default-off elsewhere). The /tmp/dflash_floor.log 1 MiB size-cap from 3a06dd4 is also merged into the fork deploy branch, so the deployed binary matches this PR head. test_server_unit is green (1620 here / 1624 on the fork variant). No code changes requested — this is purely a deploy-status note.

…nflict Sole conflict is the empty-spec-decode retry log message in ModelBackend::generate_impl/restore_and_generate_impl (the Luce-Org#319 wrapper). Keep this branch's richer message (includes decode_s timing). All other stall-recovery changes auto-merge clean onto the post-Luce-Org#319 generate_impl surface. Verified on lucebox2 (RTX 3090): dflash_server + test_server_unit build clean, 1759 assertions 0 failures, default-off generation unchanged. Co-Authored-By: WOZCODE <contact@withwoz.com>

OmarB97 marked this pull request as ready for review May 31, 2026 01:03

cubic-dev-ai Bot reviewed May 31, 2026

View reviewed changes

Comment thread server/src/server/http_server.cpp

Comment thread server/src/server/http_server.cpp Outdated

easel pushed a commit to easel/lucebox-hub that referenced this pull request May 31, 2026

Merge pull request Luce-Org#315 from Luce-Org/codex/dflash-spec-tool-…

cb6dde1

…recovery

OmarB97 force-pushed the codex/dflash-spec-tool-recovery branch from 3ba401f to 17a29de Compare June 2, 2026 01:52

codex and others added 3 commits June 1, 2026 19:11

fix(qwen35): recover spec-decode agent stalls

a1f1f91

fix(qwen35): bound residual dflash stall loops

f06f19e

OmarB97 force-pushed the codex/dflash-spec-tool-recovery branch from 17a29de to 138b797 Compare June 2, 2026 02:26

easel pushed a commit to easel/lucebox-hub that referenced this pull request Jun 2, 2026

Merge PR Luce-Org#315 into auto-integration

6d5aa53

fix(qwen35): tighten dflash review diagnostics

0805c94

cubic-dev-ai Bot reviewed Jun 2, 2026

View reviewed changes

Comment thread server/src/qwen35/qwen35_backend.cpp Outdated

fix(qwen35): harden dflash floor log cap

3a06dd4

OmarB97 mentioned this pull request Jun 2, 2026

fix(qwen35): bound the dflash floor-log diagnostic OmarB97/lucebox-hub#11

Merged

davide221 merged commit 89ce05d into Luce-Org:main Jun 4, 2026
3 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[codex] Recover dflash spec-decode agent stalls#315

[codex] Recover dflash spec-decode agent stalls#315
davide221 merged 6 commits into
Luce-Org:mainfrom
OmarB97:codex/dflash-spec-tool-recovery

OmarB97 commented May 31, 2026 •

edited

Loading

Uh oh!

OmarB97 commented May 31, 2026 •

edited

Loading

Uh oh!

cubic-dev-ai Bot left a comment •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

OmarB97 commented Jun 2, 2026

Uh oh!

cubic-dev-ai Bot left a comment •

edited

Loading

Uh oh!

Uh oh!

OmarB97 commented Jun 2, 2026

Uh oh!

OmarB97 commented Jun 2, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

OmarB97 commented May 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Root Cause

Rebase / conflict resolution

What changed

How to review

Validation

Review findings (all addressed)

Risks / gaps

Uh oh!

OmarB97 commented May 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review: dflash spec-decode agent stalls

Verdict: approve (with suggestions)

Non-blocking findings

Uh oh!

cubic-dev-ai Bot left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

OmarB97 commented Jun 2, 2026

Uh oh!

cubic-dev-ai Bot left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

OmarB97 commented Jun 2, 2026

Uh oh!

OmarB97 commented Jun 2, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

OmarB97 commented May 31, 2026 •

edited

Loading

OmarB97 commented May 31, 2026 •

edited

Loading

cubic-dev-ai Bot left a comment •

edited

Loading

cubic-dev-ai Bot left a comment •

edited

Loading