You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Replace triage's current binary "cloud OR local-only" routing with a tiered fallback chain: cloud first → on 429 / transient failure wait-and-retry → on persistent failure fall through to local → on local failure defer the turn (via JobOutcome::Defer).
Problem
agent/triage/evaluator.rs currently picks one of cloud or local based on resolve_provider, runs the call, and surfaces any failure to the orchestrator. There's no graceful degradation:
429 from the cloud → triage fails → user-visible toast.
Local model not loaded yet → triage fails the same way.
Cloud is hard-down → no automatic fallback to local even though the local arm is fully wired.
Every other tiered system (HTTP retry, mailgun, S3) has this fallback chain. Triage doesn't because the routing logic predates the local-arm gating from #1073.
Solution
In triage/evaluator.rs::evaluate:
Cloud first when resolve_provider == cloud.
On 429: cooperate with the cloud rate limiter (separate issue) — wait Retry-After, retry once. After the second 429, fall through.
On transient cloud failure (timeout, 5xx, connection): retry once with backoff, then fall through.
On local failure: return JobOutcome::Defer (separate issue) with a short wake-up so the next tick retries the whole chain. Today's "hard fail" is the wrong default for a transient blocker.
Path tracking matters: the orchestrator should know whether triage hit cloud, fell back to local, or deferred — so it can color-code degraded turns and surface the state in /debug views.
Acceptance criteria
Tiered fallback — cloud → retry → local → defer chain implemented in triage/evaluator.rs.
Cloud retry budget — at most 2 attempts (initial + 1 retry) before fallthrough; Retry-After honored.
Resolution path is observable — response carries which arm produced it (cloud / cloud-after-retry / local-fallback / deferred).
Tests — happy cloud, 429 → retry → ok, 429 → retry → 429 → local fallback, cloud 5xx → local fallback, local fail → defer.
Diff coverage ≥ 80% — implementing PR meets the changed-lines coverage gate (Vitest + cargo-llvm-cov, enforced by .github/workflows/coverage.yml).
Summary
Replace triage's current binary "cloud OR local-only" routing with a tiered fallback chain: cloud first → on 429 / transient failure wait-and-retry → on persistent failure fall through to local → on local failure defer the turn (via
JobOutcome::Defer).Problem
agent/triage/evaluator.rscurrently picks one ofcloudorlocalbased onresolve_provider, runs the call, and surfaces any failure to the orchestrator. There's no graceful degradation:Every other tiered system (HTTP retry, mailgun, S3) has this fallback chain. Triage doesn't because the routing logic predates the local-arm gating from #1073.
Solution
In
triage/evaluator.rs::evaluate:resolve_provider == cloud.Retry-After, retry once. After the second 429, fall through.LlmPermit(already wired by Add throttling and power-aware execution for local LLM inference #1073) and run the same triage prompt against the local model. Mark the resolution path in the response so callers can tell it was a fallback.JobOutcome::Defer(separate issue) with a short wake-up so the next tick retries the whole chain. Today's "hard fail" is the wrong default for a transient blocker.Path tracking matters: the orchestrator should know whether triage hit cloud, fell back to local, or deferred — so it can color-code degraded turns and surface the state in
/debugviews.Acceptance criteria
triage/evaluator.rs.Retry-Afterhonored.cloud/cloud-after-retry/local-fallback/deferred)..github/workflows/coverage.yml).Related
Retry-Aftercooperation and bucket awareness.JobOutcome::Defer(separate issue) for the final deferred state.