Skip to content

fix(ai): stop chat turns hanging/corrupting on provider stream resets#17

Merged
Harsh-2002 merged 2 commits into
mainfrom
dev
Jun 12, 2026
Merged

fix(ai): stop chat turns hanging/corrupting on provider stream resets#17
Harsh-2002 merged 2 commits into
mainfrom
dev

Conversation

@Harsh-2002

Copy link
Copy Markdown
Owner

Problem

The in-product AI chat (dashboard + orva chat) intermittently "gets stuck" — most often mid tool-calling. Root-caused to the agent loop trusting the provider stream unconditionally; a flaky gateway connection (resets/stalls) produced three failure modes, all reproduced deterministically with a mock OpenAI-compatible provider:

# Trigger Old behavior New behavior
1 Provider reset mid-response llm.Stream emitted EventDone whenever the chunk channel closed (no finish_reason check) → silently truncated answer marked as success Stream ending without a finish reason → EventError ("provider stream ended unexpectedly"). Client disconnects still classify as context.Canceled
2 Reset mid tool-call Half-assembled call (truncated JSON args) dispatched → confusing unmarshal failure → model re-emits → churn Args JSON-validated before dispatch; fails fast with a self-correctable error
3 Model repeats a failing call Burned the full 25-iteration budget — minutes of billed calls ("stuck in the tool calling") Loop-breaker: turn stops with a clear error after 3 consecutive all-failed tool iterations

Also fixed en route: the tool_call SSE frame embedded raw args as json.RawMessage; invalid args failed the frame's own marshal and the sink degraded the payload to {}, losing the call id/name the UI renders.

Changes

  • backend/internal/ai/llm/llm.go — extracted the stream consumer into pump() (unit-testable termination contract: exactly one EventDone or EventError); reject streams that close without a finish reason.
  • backend/internal/ai/agent/agent.go — pre-dispatch JSON validation of tool args; processToolCalls reports an all-failed signal; advance() loop-breaker (maxFailedToolIterations = 3); safe args payload in the tool_call SSE event.
  • Tests: pump termination (clean / truncated / truncated-tool-call / cancelled-context), invalid-args guard (no dispatch, status=failed), valid/empty args still dispatch, all-failed signal.

Validation (live, patched binary on an isolated copy of the dev data dir)

  • Normal + tool-call turns through a real provider complete cleanfinish_reason is populated on healthy completions; fix 1 does not break normal turns.
  • Replayed reset mid-response → SSE error in 0.2s (was: silent truncation accepted as done).
  • Replayed reset mid tool-call → SSE error in 0.2s (was: doomed dispatch + churn).
  • Repeating broken call → exactly 3 iterations then a clear error (was: 25).
  • go test ./... fully green; go vet clean.

Notes

  • The 60s SSE freeze fixed earlier (2e41c9e) was the server→client leg; this PR fixes the Orva→provider leg. Both share the "stuck" symptom.
  • Not in scope (follow-ups): lowering the Bifrost 60s idle timeout, and the ~70KB first-turn system-prompt inline that inflates time-to-first-token.

The chat 'gets stuck' bug (UI + CLI): the agent loop trusted the provider
stream unconditionally, so a flaky gateway connection produced three
distinct failure modes, all reproduced deterministically against a mock
OpenAI-compatible provider:

1. Reset mid-response was accepted as success: llm.Stream emitted
   EventDone whenever the chunk channel closed, with no finish_reason
   check — the user got a silently truncated (or empty) answer marked
   done. Now a stream that ends without a finish reason terminates with
   EventError ('provider stream ended unexpectedly'); a cancelled request
   context still surfaces as context.Canceled so disconnects stay out of
   the error log.

2. Reset mid-tool-call dispatched a half-assembled call: truncated
   argument JSON went straight to the dispatcher, failed with a
   confusing tool-specific unmarshal error, the model re-emitted the
   call, and the turn burned the full 25-iteration budget — the
   'stuck in the tool calling' symptom. Tool arguments are now
   JSON-validated before dispatch and fail fast with a
   self-correctable error.

3. No loop-breaker: a model repeating a failing call churned for
   minutes of billed calls. advance() now stops with a clear error
   after 3 consecutive iterations in which every dispatched call
   failed.

Also fixed en route: the tool_call SSE frame embedded raw args as
json.RawMessage; invalid args failed the frame's marshal and the sink
degraded the payload to {}, losing the call id/name the UI renders.
Invalid args now fall back to a JSON string.

llm.Stream's chunk consumer is extracted into pump() so the
termination contract (exactly one EventDone or EventError) is
unit-tested with a synthetic chunk channel.

Verified live against a patched instance on a copy of the dev data
dir: normal + tool-call turns through a real provider complete clean
(finish_reason present); replayed resets mid-response and mid-tool-call
now error in <1s instead of silently truncating; a repeating broken
call stops after 3 iterations instead of 25.
Review of the initial fix surfaced real gaps; all verified and fixed:

- Loop-breaker false positive: 'every call failed this round' aborted
  legitimate exploration (three not-found probes in a row = three
  informative results, not a stuck model). The breaker now tracks
  per-call-signature (name+args) consecutive-failure streaks: only the
  SAME call failing 3 rounds in a row trips it. This also closes the
  converse escape — a broken call shielded by a succeeding companion
  call now still trips the breaker.

- History poisoning: a persisted tool call with malformed args was
  replayed verbatim to the provider on every later iteration and every
  future turn; strict endpoints 400 the whole request, permanently
  bricking the conversation. toBifrostMessage now replays '{}' for
  invalid/empty arguments (the model already saw the failure in the
  call's tool result).

- Cancel classification race: a client disconnect could surface as a
  Bifrost error chunk instead of a channel close, producing a bare
  string error that defeated the logAIError context.Canceled filter —
  every Stop click logged a spurious failure. pump now reports ctx.Err()
  when the context is cancelled, whichever way Bifrost delivers it.

- Token accounting on errors: usage reported before a stream cut was
  discarded. EventError now carries Usage and the agent persists it.

- tool_result frames had the same {}-degradation hazard fixed for
  tool_call: a non-JSON dispatcher result killed the frame's marshal and
  the UI lost the id/status that settles the call's spinner.

- Dashboard: the post-turn refreshActive() rebuilt the timeline from
  server truth, wiping the just-pushed ErrorCard (server has no error
  items) — errors are now carried across the rebuild. The error frame
  also settles the streaming state (thinking timer kept ticking on a
  dead turn).

- CLI: the error path sent no message_end, so the streamed half-answer
  was never closed out and the error printed on the same row — and then
  printed twice (runTurn + REPL). drive() now finishes the message block
  on error; runTurn returns the error without printing.

- Efficiency: invalid-args calls skip the dead status=running write and
  go straight to failed; single args conversion in runToolCall; the
  invalid-args SSE fallback payload is capped at 2KB.

- Tests: fragmented tool-call reassembly across deltas; failed-signature
  reporting (succeeding companion must not mask, distinct probes must
  not conflate).

All three failure modes re-validated live against the rebuilt binary
(mock provider replays), plus a real-provider multi-iteration tool turn
through the changed history-replay path.
@Harsh-2002 Harsh-2002 merged commit 7832b81 into main Jun 12, 2026
18 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant