Skip to content

[codex] Recover dflash spec-decode agent stalls#315

Merged
davide221 merged 6 commits into
Luce-Org:mainfrom
OmarB97:codex/dflash-spec-tool-recovery
Jun 4, 2026
Merged

[codex] Recover dflash spec-decode agent stalls#315
davide221 merged 6 commits into
Luce-Org:mainfrom
OmarB97:codex/dflash-spec-tool-recovery

Conversation

@OmarB97

@OmarB97 OmarB97 commented May 31, 2026

Copy link
Copy Markdown
Contributor

Summary

Env-gated recovery for the Qwen3.5/Qwen3.6 dflash spec-decode "agent stall": the
model emits a short action preamble (e.g. let me check …:) and then EOS before
producing the tool-call XML, so the turn returns prose with no tool_call.

When DFLASH_STALL_TOOL_PREFIX is enabled and the request carries tools, the
spec-decode replay loop detects a premature EOS right after an action suffix,
injects a minimal tool-call XML prefix (tool_choice/schema-aware so a forced
non-terminal tool never receives terminal XML), replays KV to the right boundary,
and tails off in AR decode. A separate DFLASH_MIN_TOKENS floor handles the same
preamble→EOS stall in the pure-AR path, and a bounded repeated-token guard turns
the residual malformed-tool-buffer case into a retryable finish_reason=length
instead of burning the whole token budget.

Both envs default off, so the production lane is byte-for-byte unchanged.

Root Cause

The original EOS floor lived only in the AR path, but most agentic stalls exit
through the spec-decode replay/emit loop. That let Q4 accept short action
preambles and stop before emitting tool XML. The residual req_0031 case is
different: dflash starts a plausible execute_code call, then degenerates into a
repeated-punctuation run before closing XML. This PR intentionally does not
salvage that incomplete code — it marks the decode as a bounded length-class
failure so the Hermes Q6 retry can recover safely.

Rebase / conflict resolution

This branch was rebased cleanly onto current main. Upstream now contains the
empty-output→AR fallback (merged in #314), so the previously-bundled fork copy of
that fix and an unrelated cancellation-handling commit (the #324 mirror) were
dropped — they were the source of the merge conflict. The empty-spec retry now
rides on upstream's generate_with_empty_spec_fallback wrapper instead of an
inline fallback; the call sites in generate / restore_and_generate keep
upstream's force_ar_decode branch and add the three stall_* parameters. Net
result is a focused 4-file diff that no longer conflicts with main.

What changed

  • server/src/common/model_backend.h — three optional stall_*_tokens request
    fields on GenerateRequest.
  • server/src/qwen35/qwen35_backend.{h,cpp} — spec-decode stall detection +
    tool-prefix injection + AR tail-off; AR-path DFLASH_MIN_TOKENS floor; bounded
    degenerate-run guard; suffix-anchored skip detection.
  • server/src/server/http_server.cpp — value-aware env gate, tool_choice/
    schema-aware prefix builder, action-suffix / skip token construction.

How to review

Start with http_server.cpp (env_flag_enabled, select_stall_recovery_function,
build_stall_tool_prefix) to see how the recovery prefix is chosen and gated, then
qwen35_backend.cpp do_spec_decode (the floor_to_ar / inject_tool_prefix
block) for the replay-and-tail-off mechanics. The two generate call sites are the
only place that interacts with upstream's empty-spec wrapper.

Validation

  • Build: cmake --build server/build --target test_server_unit on taro
    (RTX 5090, sm_120, CUDA 13.3) — clean, rc=0.
  • Unit: ./test_server_unit1620 assertions, 0 failures (rc=0). This
    includes upstream's new ModelBackend empty-spec retry tests, confirming the
    branch composes correctly with fix(common): retry empty spec-decode output through AR #314's wrapper.
  • The stall-recovery decode logic is unchanged from the previously-validated
    revision: 16/17 captured stall turns produce real tool calls, 0/2 legit controls
    produce tool calls, and req_0031 terminates as a bounded finish_reason=length
    at 321 completion tokens. The prefix-cache oracle remains covered with the
    recovery envs off.

Review findings (all addressed)

  • cubic P1 — presence-only env gate (DFLASH_STALL_TOOL_PREFIX=0 still enabled):
    fixed via value-aware env_flag_enabled (rejects 0/false/no/off).
  • cubic P1 — recovery prefix ignored tool_choice: fixed via
    select_stall_recovery_function / build_stall_tool_prefix, which honor a forced
    function (or single required tool) and only fall back to terminal otherwise,
    with schema-aware first-parameter selection.
  • Skip detection is now suffix-anchored (tokens_contain_recent_sequence) instead
    of full-sequence; DFLASH_MIN_TOKENS reads through one shared env_int_or_default
    helper; last_tok/debug-log/suffix-token rationale is documented in comments.
  • The "empty-fallback has no logging" note is moot now that upstream's wrapper logs
    the retry ([backend] spec-decode produced zero tokens; retrying with AR decode).

Risks / gaps

  • Recovery is behind DFLASH_STALL_TOOL_PREFIX + DFLASH_MIN_TOKENS; with both
    unset the lane is byte-for-byte unchanged (covered by the default-off unit run).
  • /tmp/dflash_floor.log is an opt-in debug diagnostic (written only when
    DFLASH_MIN_TOKENS>0); documented inline as append-only / rotate out of band.
  • req_0031's incomplete execute_code is intentionally not salvaged — out of
    scope here; it is handled as a bounded length-class failure for the Q6 retry.

Co-Authored-By: Claude Opus 4.8 (1M context) noreply@anthropic.com

@OmarB97 OmarB97 marked this pull request as ready for review May 31, 2026 01:03
@OmarB97

OmarB97 commented May 31, 2026

Copy link
Copy Markdown
Contributor Author

Code Review: dflash spec-decode agent stalls

Verdict: approve (with suggestions)

The implementation is functionally correct and aligned with the task goal. The env-gated approach (DFLASH_MIN_TOKENS, DFLASH_STALL_TOOL_PREFIX) ensures zero behavioral change in the default production lane. The stall detection and recovery flow — detect preamble→EOS in spec-decode replay, inject tool-call XML prefix, replay KV, fall back to AR — is sound.

Non-blocking findings

1. tokens_contain for skip detection is full-sequence, not suffix-anchored (qwen35_backend.cpp ~L1461)

tokens_contain(out_tokens, *stall_skip_tokens) searches for the exact sequence " done" anywhere in out_tokens, not just at the end. If " done" appeared earlier in the conversation (e.g., in a previous assistant turn or the system prompt), the skip guard would fire on a stale match. Consider switching to a suffix-only check (e.g., searching only the last N+len(skip) positions) to avoid false negatives on stall recovery.

2. tokens_have_recent_any matches individual sub-tokens, not multi-token suffixes (http_server.cpp + qwen35_backend.cpp)

The stall_action_suffix_tokens vector is constructed by taking only the last token of each encoded suffix variant (":", "`:", "):", "":"). The tokens_have_recent_any check then looks for any of these individual tokens in the last 4 positions. This works when the relevant tokenizer always emits ":" as a single token, but if tokenization splits a suffix (e.g., "`:" → two tokens), the individual sub-token ":" could appear coincidentally. Consider adding a comment explaining why this flat-individual-token approach is safe for this tokenizer.

3. empty_result → AR fallback has no logging (qwen35_backend.cpp ~L608, ~L721)

When do_spec_decode returns true with empty result.tokens, the code silently falls back to do_ar_decode. This path is hit without logging to stdout/stderr or /tmp/dflash_floor.log. Consider adding a diagnostic fprintf here to distinguish "floor recovered" from "silent fallback" when debugging.

4. Duplicated _min_floor static init (qwen35_backend.cpp ~L1051 and ~L1211)

Both do_ar_decode and do_spec_decode initialize a static _min_floor via the same lambda reading getenv("DFLASH_MIN_TOKENS"). A shared helper (e.g., get_dflash_min_tokens()) would prevent drift.

5. Unbounded /tmp/dflash_floor.log growth

Every stall/recovery event appends to /tmp/dflash_floor.log without rotation or size cap. On a production server with frequent agentic turns, this could fill /tmp. Consider a max-size guard or documented log rotation expectation.

6. last_tok scoping change (qwen35_backend.cpp ~L1420)

last_tok = replay_last_tok was moved from its original unconditional position into an else block, and the floor_to_ar path sets cache_.last_tok directly before returning. I verified this is correct — the floor_to_ar path always returns via step_graph_destroy + return ok rather than falling through to the next loop iteration where last_tok would be needed. However, the control flow is now harder to follow. Consider a brief comment above the else block explaining why last_tok is conditionally updated.

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 issues found across 4 files

Reply with feedback, questions, or to request a fix.

Re-trigger cubic

Comment thread server/src/server/http_server.cpp
Comment thread server/src/server/http_server.cpp Outdated
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 31, 2026
@OmarB97 OmarB97 force-pushed the codex/dflash-spec-tool-recovery branch from 3ba401f to 17a29de Compare June 2, 2026 01:52
codex and others added 3 commits June 1, 2026 19:11
Follow-up to the spec-decode stall-recovery commits, closing the
remaining non-blocking review notes:

- Reuse the shared env_int_or_default() helper for the do_spec_decode
  DFLASH_MIN_TOKENS floor instead of a duplicated inline getenv lambda
  (do_ar_decode already used the helper).
- Document that /tmp/dflash_floor.log is a debug-only diagnostic, written
  only when the operator opts into DFLASH_MIN_TOKENS, so the default
  production lane never touches it.
- Explain why last_tok is updated only on the non-floor path (the
  floor_to_ar branch sets cache_.last_tok directly and returns).
- Explain why the action-suffix detector collects the trailing token of
  each colon variant rather than the full encoded sequence.

No behavior change in the default lane; suffix-anchored skip detection
and tool_choice/value-aware gating were already in place.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@OmarB97 OmarB97 force-pushed the codex/dflash-spec-tool-recovery branch from 17a29de to 138b797 Compare June 2, 2026 02:26
easel pushed a commit to easel/lucebox-hub that referenced this pull request Jun 2, 2026
@OmarB97

OmarB97 commented Jun 2, 2026

Copy link
Copy Markdown
Contributor Author

Rebuilt and rebased this branch to clear the merge conflict and restore the
reviewed state. Summary of what changed and why (force-push: 17a29de138b797):

Conflict cause. Upstream merged the empty-output→AR fallback in #314, so the
copy of that fix carried on this branch (plus an unrelated cancellation commit, the
#324 mirror) collided with main. The branch had also drifted back to a pre-review
revision, dropping the earlier review fixes.

Resolution. Rebased the two stall-recovery commits onto current main, dropping
the now-redundant empty-output duplicate and the #324 cancellation work. The
empty-spec retry now rides on upstream's generate_with_empty_spec_fallback
wrapper; the generate / restore_and_generate call sites keep upstream's
force_ar_decode branch and just add the three stall_* parameters. Result is a
focused 4-file diff (+346/-5) that no longer conflicts. PR now shows MERGEABLE.

Review findings — all addressed in the current commits:

  • cubic P1 (presence-only env gate) → value-aware env_flag_enabled rejects
    0/false/no/off.
  • cubic P1 (prefix ignored tool_choice)select_stall_recovery_function /
    build_stall_tool_prefix honor a forced function (or single required tool),
    only falling back to terminal, with schema-aware first-parameter selection.
  • skip detection not suffix-anchored → now tokens_contain_recent_sequence
    (bounded trailing window); also note out_tokens is per-generation, so a stale
    match from earlier turns/system prompt isn't possible.
  • duplicated DFLASH_MIN_TOKENS init → both paths read through one
    env_int_or_default helper.
  • tokens_have_recent_any / last_tok / /tmp/dflash_floor.log → clarifying
    comments added (the log is an opt-in debug diagnostic, silent in the default lane).
  • empty-fallback "no logging" → moot; upstream's wrapper logs the retry.

Validation (taro, RTX 5090 / sm_120 / CUDA 13.3): test_server_unit builds
clean and passes 1620 assertions, 0 failures — including upstream's new
ModelBackend empty-spec retry tests, confirming the branch composes with #314. The
stall-recovery decode path itself is unchanged from the previously-validated
revision (16/17 captured stall turns → real tool calls; req_0031 → bounded
finish_reason=length).

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 2 files (changes from recent commits).

Tip: Review your code locally with the cubic CLI to iterate faster.

Re-trigger cubic

Comment thread server/src/qwen35/qwen35_backend.cpp Outdated
@OmarB97

OmarB97 commented Jun 2, 2026

Copy link
Copy Markdown
Contributor Author

Updated PR #315 again for the merge-conflict/comment follow-up.

Current head: 3a06dd4

What changed:

  • Rebuilt/rebased the branch on current Luce-Org/main; PR is now MERGEABLE.
  • Addressed Omar's top-level review suggestions:
    • skip detection now uses a trailing/suffix-window helper rather than a full-output search shape;
    • suffix-token matching has explicit rationale at construction/use sites;
    • empty spec-decode retry diagnostics now print decode timing and restore slot;
    • DFLASH_MIN_TOKENS is centralized through one cached helper;
    • /tmp/dflash_floor.log is capped through a safe open/fstat/ftruncate path;
    • last_tok floor-to-AR control flow is documented.
  • Cubic identified a valid TOCTOU risk in my first log-cap implementation; fixed in 3a06dd4 with open + fstat + O_NOFOLLOW before truncation. The cubic thread is now marked addressed/resolved.

Verification:

  • git diff --check passed locally.
  • Clean taro worktree from pushed 3a06dd4:
    • configured with CUDA 13.3 / sm_120 / consumer Blackwell fix;
    • built test_server_unit and dflash_server;
    • ran ./server/build-pr315/test_server_unit -> 1620 assertions, 0 failures.
  • GitHub checks as of this comment: uv workspace passed, cubic passed, Build (cmake + uv sync --extra megakernel) still in progress.

@OmarB97

OmarB97 commented Jun 2, 2026

Copy link
Copy Markdown
Contributor Author

Status for whoever picks this up: this recovery is deployed and live on the fork production lane (Qwen3.6-27B dflash on an RTX 5090), env-gated behind DFLASH_MIN_TOKENS + DFLASH_STALL_TOOL_PREFIX (default-off elsewhere). The /tmp/dflash_floor.log 1 MiB size-cap from 3a06dd4 is also merged into the fork deploy branch, so the deployed binary matches this PR head. test_server_unit is green (1620 here / 1624 on the fork variant). No code changes requested — this is purely a deploy-status note.

…nflict

Sole conflict is the empty-spec-decode retry log message in
ModelBackend::generate_impl/restore_and_generate_impl (the Luce-Org#319 wrapper).
Keep this branch's richer message (includes decode_s timing). All other
stall-recovery changes auto-merge clean onto the post-Luce-Org#319 generate_impl
surface.

Verified on lucebox2 (RTX 3090): dflash_server + test_server_unit build
clean, 1759 assertions 0 failures, default-off generation unchanged.

Co-Authored-By: WOZCODE <contact@withwoz.com>
@davide221 davide221 merged commit 89ce05d into Luce-Org:main Jun 4, 2026
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants