fix(ai): stop chat turns hanging/corrupting on provider stream resets by Harsh-2002 · Pull Request #17 · Harsh-2002/Orva

Harsh-2002 · 2026-06-12T19:38:25Z

Problem

The in-product AI chat (dashboard + orva chat) intermittently "gets stuck" — most often mid tool-calling. Root-caused to the agent loop trusting the provider stream unconditionally; a flaky gateway connection (resets/stalls) produced three failure modes, all reproduced deterministically with a mock OpenAI-compatible provider:

#	Trigger	Old behavior	New behavior
1	Provider reset mid-response	`llm.Stream` emitted `EventDone` whenever the chunk channel closed (no `finish_reason` check) → silently truncated answer marked as success	Stream ending without a finish reason → `EventError` ("provider stream ended unexpectedly"). Client disconnects still classify as `context.Canceled`
2	Reset mid tool-call	Half-assembled call (truncated JSON args) dispatched → confusing unmarshal failure → model re-emits → churn	Args JSON-validated before dispatch; fails fast with a self-correctable error
3	Model repeats a failing call	Burned the full 25-iteration budget — minutes of billed calls ("stuck in the tool calling")	Loop-breaker: turn stops with a clear error after 3 consecutive all-failed tool iterations

Also fixed en route: the tool_call SSE frame embedded raw args as json.RawMessage; invalid args failed the frame's own marshal and the sink degraded the payload to {}, losing the call id/name the UI renders.

Changes

backend/internal/ai/llm/llm.go — extracted the stream consumer into pump() (unit-testable termination contract: exactly one EventDone or EventError); reject streams that close without a finish reason.
backend/internal/ai/agent/agent.go — pre-dispatch JSON validation of tool args; processToolCalls reports an all-failed signal; advance() loop-breaker (maxFailedToolIterations = 3); safe args payload in the tool_call SSE event.
Tests: pump termination (clean / truncated / truncated-tool-call / cancelled-context), invalid-args guard (no dispatch, status=failed), valid/empty args still dispatch, all-failed signal.

Validation (live, patched binary on an isolated copy of the dev data dir)

Normal + tool-call turns through a real provider complete clean → finish_reason is populated on healthy completions; fix 1 does not break normal turns.
Replayed reset mid-response → SSE error in 0.2s (was: silent truncation accepted as done).
Replayed reset mid tool-call → SSE error in 0.2s (was: doomed dispatch + churn).
Repeating broken call → exactly 3 iterations then a clear error (was: 25).
go test ./... fully green; go vet clean.

Notes

The 60s SSE freeze fixed earlier (2e41c9e) was the server→client leg; this PR fixes the Orva→provider leg. Both share the "stuck" symptom.
Not in scope (follow-ups): lowering the Bifrost 60s idle timeout, and the ~70KB first-turn system-prompt inline that inflates time-to-first-token.

The chat 'gets stuck' bug (UI + CLI): the agent loop trusted the provider stream unconditionally, so a flaky gateway connection produced three distinct failure modes, all reproduced deterministically against a mock OpenAI-compatible provider: 1. Reset mid-response was accepted as success: llm.Stream emitted EventDone whenever the chunk channel closed, with no finish_reason check — the user got a silently truncated (or empty) answer marked done. Now a stream that ends without a finish reason terminates with EventError ('provider stream ended unexpectedly'); a cancelled request context still surfaces as context.Canceled so disconnects stay out of the error log. 2. Reset mid-tool-call dispatched a half-assembled call: truncated argument JSON went straight to the dispatcher, failed with a confusing tool-specific unmarshal error, the model re-emitted the call, and the turn burned the full 25-iteration budget — the 'stuck in the tool calling' symptom. Tool arguments are now JSON-validated before dispatch and fail fast with a self-correctable error. 3. No loop-breaker: a model repeating a failing call churned for minutes of billed calls. advance() now stops with a clear error after 3 consecutive iterations in which every dispatched call failed. Also fixed en route: the tool_call SSE frame embedded raw args as json.RawMessage; invalid args failed the frame's marshal and the sink degraded the payload to {}, losing the call id/name the UI renders. Invalid args now fall back to a JSON string. llm.Stream's chunk consumer is extracted into pump() so the termination contract (exactly one EventDone or EventError) is unit-tested with a synthetic chunk channel. Verified live against a patched instance on a copy of the dev data dir: normal + tool-call turns through a real provider complete clean (finish_reason present); replayed resets mid-response and mid-tool-call now error in <1s instead of silently truncating; a repeating broken call stops after 3 iterations instead of 25.

Review of the initial fix surfaced real gaps; all verified and fixed: - Loop-breaker false positive: 'every call failed this round' aborted legitimate exploration (three not-found probes in a row = three informative results, not a stuck model). The breaker now tracks per-call-signature (name+args) consecutive-failure streaks: only the SAME call failing 3 rounds in a row trips it. This also closes the converse escape — a broken call shielded by a succeeding companion call now still trips the breaker. - History poisoning: a persisted tool call with malformed args was replayed verbatim to the provider on every later iteration and every future turn; strict endpoints 400 the whole request, permanently bricking the conversation. toBifrostMessage now replays '{}' for invalid/empty arguments (the model already saw the failure in the call's tool result). - Cancel classification race: a client disconnect could surface as a Bifrost error chunk instead of a channel close, producing a bare string error that defeated the logAIError context.Canceled filter — every Stop click logged a spurious failure. pump now reports ctx.Err() when the context is cancelled, whichever way Bifrost delivers it. - Token accounting on errors: usage reported before a stream cut was discarded. EventError now carries Usage and the agent persists it. - tool_result frames had the same {}-degradation hazard fixed for tool_call: a non-JSON dispatcher result killed the frame's marshal and the UI lost the id/status that settles the call's spinner. - Dashboard: the post-turn refreshActive() rebuilt the timeline from server truth, wiping the just-pushed ErrorCard (server has no error items) — errors are now carried across the rebuild. The error frame also settles the streaming state (thinking timer kept ticking on a dead turn). - CLI: the error path sent no message_end, so the streamed half-answer was never closed out and the error printed on the same row — and then printed twice (runTurn + REPL). drive() now finishes the message block on error; runTurn returns the error without printing. - Efficiency: invalid-args calls skip the dead status=running write and go straight to failed; single args conversion in runToolCall; the invalid-args SSE fallback payload is capped at 2KB. - Tests: fragmented tool-call reassembly across deltas; failed-signature reporting (succeeding companion must not mask, distinct probes must not conflate). All three failure modes re-validated live against the rebuilt binary (mock provider replays), plus a real-provider multi-iteration tool turn through the changed history-replay path.

Harsh-2002 added 2 commits June 12, 2026 19:38

Harsh-2002 merged commit 7832b81 into main Jun 12, 2026
18 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(ai): stop chat turns hanging/corrupting on provider stream resets#17

fix(ai): stop chat turns hanging/corrupting on provider stream resets#17
Harsh-2002 merged 2 commits into
mainfrom
dev

Harsh-2002 commented Jun 12, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Harsh-2002 commented Jun 12, 2026

Problem

Changes

Validation (live, patched binary on an isolated copy of the dev data dir)

Notes

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant