fix(ai): stop chat turns hanging/corrupting on provider stream resets #17
Merged
The chat 'gets stuck' bug (UI + CLI): the agent loop trusted the provider
stream unconditionally, so a flaky gateway connection produced three
distinct failure modes, all reproduced deterministically against a mock
OpenAI-compatible provider:
1. Reset mid-response was accepted as success: llm.Stream emitted
EventDone whenever the chunk channel closed, with no finish_reason
check — the user got a silently truncated (or empty) answer marked
done. Now a stream that ends without a finish reason terminates with
EventError ('provider stream ended unexpectedly'); a cancelled request
context still surfaces as context.Canceled so disconnects stay out of
the error log.
2. Reset mid-tool-call dispatched a half-assembled call: truncated
argument JSON went straight to the dispatcher, failed with a
confusing tool-specific unmarshal error, the model re-emitted the
call, and the turn burned the full 25-iteration budget — the
'stuck in the tool calling' symptom. Tool arguments are now
JSON-validated before dispatch and fail fast with a
self-correctable error.
3. No loop-breaker: a model repeating a failing call churned for
minutes of billed calls. advance() now stops with a clear error
after 3 consecutive iterations in which every dispatched call
failed.
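The pre-dispatch check in item 2 can be sketched roughly as below. This is a minimal illustration, not the code from this PR: validateToolArgs and the error wording are hypothetical names chosen for the example, and it assumes empty arguments are treated as "no arguments" and still dispatch.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// validateToolArgs is a hypothetical pre-dispatch guard: it rejects argument
// strings that are not well-formed JSON so a truncated call never reaches the
// tool-specific unmarshal step.
func validateToolArgs(name, args string) error {
	if strings.TrimSpace(args) == "" {
		// Assumption: empty arguments mean "no arguments" and still dispatch.
		return nil
	}
	if !json.Valid([]byte(args)) {
		// The message is aimed at the model so it can correct itself.
		return fmt.Errorf("tool %q: arguments are not valid JSON (%d bytes received); re-emit the call with complete arguments", name, len(args))
	}
	return nil
}

func main() {
	fmt.Println(validateToolArgs("read_file", `{"path":"a.txt"}`))
	fmt.Println(validateToolArgs("read_file", `{"path":"a.t`))
}
```

Checking well-formedness once, before dispatch, gives the model the same generic error for every tool instead of a tool-specific unmarshal failure.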
Also fixed en route: the tool_call SSE frame embedded raw args as
json.RawMessage; invalid args failed the frame's marshal and the sink
degraded the payload to {}, losing the call id/name the UI renders.
Invalid args now fall back to a JSON string.
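A rough sketch of the string fallback described above, assuming a frame shape along these lines. toolCallFrame, safeArgs, and encodeFrame are hypothetical names for illustration; the 2KB cap mirrors the limit mentioned in the review notes of this message.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// toolCallFrame is an assumed, simplified shape of the tool_call SSE payload.
type toolCallFrame struct {
	ID   string          `json:"id"`
	Name string          `json:"name"`
	Args json.RawMessage `json:"args"`
}

const maxFallbackBytes = 2048 // cap for the invalid-args fallback payload

// safeArgs returns raw unchanged when it is valid JSON. Otherwise it encodes
// the (possibly truncated) text as a JSON string, so marshalling the frame
// cannot fail on the args field.
func safeArgs(raw string) json.RawMessage {
	if json.Valid([]byte(raw)) {
		return json.RawMessage(raw)
	}
	if len(raw) > maxFallbackBytes {
		raw = raw[:maxFallbackBytes]
	}
	quoted, _ := json.Marshal(raw) // encoding a Go string does not return an error
	return quoted
}

// encodeFrame builds and marshals a frame with guarded args.
func encodeFrame(id, name, rawArgs string) (string, error) {
	b, err := json.Marshal(toolCallFrame{ID: id, Name: name, Args: safeArgs(rawArgs)})
	return string(b), err
}

func main() {
	out, err := encodeFrame("call_1", "read_file", `{"path":"a.t`)
	fmt.Println(out, err)
}
```

With this guard the call id and name survive in the frame even when the arguments are unusable.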
llm.Stream's chunk consumer is extracted into pump() so the
termination contract (exactly one EventDone or EventError) is
unit-tested with a synthetic chunk channel.
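The termination contract could look something like the sketch below. The chunk and event types, the string event kinds, and the collect helper are simplified stand-ins invented for this example, not the real llm package or Bifrost types. The ordering of the final checks is also an assumption.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// chunk and event are simplified stand-ins for the provider and agent types.
type chunk struct {
	Text         string
	FinishReason string
	Err          error
}

type event struct {
	Kind string // "delta", "done", or "error"
	Text string
	Err  error
}

var errStreamEnded = errors.New("provider stream ended unexpectedly")

// pump drains chunks and emits exactly one terminal event ("done" or "error").
func pump(ctx context.Context, chunks <-chan chunk, emit func(event)) {
	finished := false
	for c := range chunks {
		if c.Err != nil {
			err := c.Err
			if ctxErr := ctx.Err(); ctxErr != nil {
				err = ctxErr // a cancelled context wins over the provider's error text
			}
			emit(event{Kind: "error", Err: err})
			return
		}
		if c.Text != "" {
			emit(event{Kind: "delta", Text: c.Text})
		}
		if c.FinishReason != "" {
			finished = true
		}
	}
	switch {
	case finished:
		emit(event{Kind: "done"})
	case ctx.Err() != nil:
		emit(event{Kind: "error", Err: ctx.Err()})
	default:
		emit(event{Kind: "error", Err: errStreamEnded})
	}
}

// collect runs pump over a synthetic, already-closed chunk channel.
func collect(cancelled bool, chunks ...chunk) []event {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	if cancelled {
		cancel()
	}
	ch := make(chan chunk, len(chunks))
	for _, c := range chunks {
		ch <- c
	}
	close(ch)
	var out []event
	pump(ctx, ch, func(e event) { out = append(out, e) })
	return out
}

func main() {
	clean := collect(false, chunk{Text: "hi", FinishReason: "stop"})
	cut := collect(false, chunk{Text: "hi"})
	stopped := collect(true, chunk{Err: errors.New("request aborted")})
	fmt.Println(clean[len(clean)-1].Kind, cut[len(cut)-1].Err)
	fmt.Println(errors.Is(stopped[len(stopped)-1].Err, context.Canceled))
}
```

Because pump takes a plain channel and a callback, each case (clean, truncated, cancelled) can be driven from a test without a network connection.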
Verified live against a patched instance on a copy of the dev data
dir: normal + tool-call turns through a real provider complete clean
(finish_reason present); replayed resets mid-response and mid-tool-call
now error in <1s instead of silently truncating; a repeating broken
call stops after 3 iterations instead of 25.
Review of the initial fix surfaced real gaps; all verified and fixed:
- Loop-breaker false positive: 'every call failed this round' aborted
legitimate exploration (three not-found probes in a row = three
informative results, not a stuck model). The breaker now tracks
per-call-signature (name+args) consecutive-failure streaks: only the
SAME call failing 3 rounds in a row trips it. This also closes the
converse escape — a broken call shielded by a succeeding companion
call now still trips the breaker.
- History poisoning: a persisted tool call with malformed args was
replayed verbatim to the provider on every later iteration and every
future turn; strict endpoints 400 the whole request, permanently
bricking the conversation. toBifrostMessage now replays '{}' for
invalid/empty arguments (the model already saw the failure in the
call's tool result).
- Cancel classification race: a client disconnect could surface as a
Bifrost error chunk instead of a channel close, producing a bare
string error that defeated the logAIError context.Canceled filter —
every Stop click logged a spurious failure. pump now reports ctx.Err()
when the context is cancelled, whichever way Bifrost delivers it.
- Token accounting on errors: usage reported before a stream cut was
discarded. EventError now carries Usage and the agent persists it.
- tool_result frames had the same {}-degradation hazard fixed for
tool_call: a non-JSON dispatcher result killed the frame's marshal and
the UI lost the id/status that settles the call's spinner.
- Dashboard: the post-turn refreshActive() rebuilt the timeline from
server truth, wiping the just-pushed ErrorCard (server has no error
items) — errors are now carried across the rebuild. The error frame
also settles the streaming state (thinking timer kept ticking on a
dead turn).
- CLI: the error path sent no message_end, so the streamed half-answer
was never closed out and the error printed on the same row — and then
printed twice (runTurn + REPL). drive() now finishes the message block
on error; runTurn returns the error without printing.
- Efficiency: invalid-args calls skip the dead status=running write and
go straight to failed; single args conversion in runToolCall; the
invalid-args SSE fallback payload is capped at 2KB.
- Tests: fragmented tool-call reassembly across deltas; failed-signature
reporting (succeeding companion must not mask, distinct probes must
not conflate).
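The per-call-signature streak tracking from the first review item might be sketched as follows. The breaker type, the callResult shape, and the signature encoding are assumptions made for illustration. Only the threshold of 3 comes from this change.

```go
package main

import "fmt"

const maxFailedToolIterations = 3

// callResult is an assumed summary of one dispatched tool call.
type callResult struct {
	Name   string
	Args   string
	Failed bool
}

// breaker tracks, per call signature, how many consecutive rounds it failed.
type breaker struct {
	streaks map[string]int
}

// signature joins name and args with a separator that should not occur in either.
func signature(name, args string) string { return name + "\x00" + args }

// observe records one round of results. It reports the name of the first call
// whose identical signature has now failed maxFailedToolIterations rounds in a
// row. Signatures that succeeded or were not repeated this round drop out,
// which resets their streak.
func (b *breaker) observe(results []callResult) (string, bool) {
	next := make(map[string]int, len(results))
	tripped, hit := "", false
	for _, r := range results {
		if !r.Failed {
			continue
		}
		sig := signature(r.Name, r.Args)
		next[sig] = b.streaks[sig] + 1 // reading a nil map yields 0
		if next[sig] >= maxFailedToolIterations && !hit {
			tripped, hit = r.Name, true
		}
	}
	b.streaks = next
	return tripped, hit
}

func main() {
	var b breaker
	for round := 1; round <= 3; round++ {
		name, hit := b.observe([]callResult{
			{Name: "write_file", Args: "{bad", Failed: true},
			{Name: "list_dir", Args: "{}", Failed: false},
		})
		fmt.Println(round, name, hit)
	}
}
```

Keying the streak on name plus args lets distinct failing probes pass, while a repeated broken call trips the breaker even when a companion call succeeds in the same round.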
All three failure modes re-validated live against the rebuilt binary
(mock provider replays), plus a real-provider multi-iteration tool turn
through the changed history-replay path.
Problem
The in-product AI chat (dashboard + orva chat) intermittently "gets stuck", most often mid tool-calling. Root-caused to the agent loop trusting the provider stream unconditionally; a flaky gateway connection (resets/stalls) produced three failure modes, all reproduced deterministically with a mock OpenAI-compatible provider:
- llm.Stream emitted EventDone whenever the chunk channel closed (no finish_reason check) → silently truncated answer marked as success. Now: EventError("provider stream ended unexpectedly"). Client disconnects still classify as context.Canceled.
Also fixed en route: the tool_call SSE frame embedded raw args as json.RawMessage; invalid args failed the frame's own marshal and the sink degraded the payload to {}, losing the call id/name the UI renders.
Changes
- backend/internal/ai/llm/llm.go: extracted the stream consumer into pump() (unit-testable termination contract: exactly one EventDone or EventError); reject streams that close without a finish reason.
- backend/internal/ai/agent/agent.go: pre-dispatch JSON validation of tool args; processToolCalls reports an all-failed signal; advance() loop-breaker (maxFailedToolIterations = 3); safe args payload in the tool_call SSE event.
- pump termination (clean / truncated / truncated-tool-call / cancelled-context), invalid-args guard (no dispatch, status=failed), valid/empty args still dispatch, all-failed signal.
Validation (live, patched binary on an isolated copy of the dev data dir)
- finish_reason is populated on healthy completions; fix 1 does not break normal turns.
- error in 0.2s (was: silent truncation accepted as done).
- error in 0.2s (was: doomed dispatch + churn).
- go test ./... fully green; go vet clean.
Notes