You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
The triage circuit breaker (src/utils/circuit-breaker.ts) is the only blast-radius guard between webhook events and the Claude / Bedrock provider, but four of its five execution paths emit no log line of their own. The single point of observability is the optional onStateChange callback (src/utils/circuit-breaker.ts:71-75), which the triage caller wires up at src/orchestrator/triage.ts:157-159 to log from, to, reason. That fires on transitions only, so an operator watching during a Claude outage cannot answer the SRE-relevant questions:
How many requests did we drop while open? When state === "open" and the cooldown has not elapsed, execute returns { outcome: "circuit-open" } at src/utils/circuit-breaker.ts:108-111 with zero log emission. The triage caller produces one logger.warn at src/orchestrator/triage.ts:316 per skipped request, but it carries only deliveryId — no openMs (time since openedAt), no running skip counter, no shared correlation field across requests skipped in the same open window. To count skips during an incident the operator must grep "triage short-circuited by open breaker" and pipe through wc -l.
How close are we to tripping?recordFailure at src/utils/circuit-breaker.ts:147-156 increments consecutiveFailures privately and only logs when the counter crosses maxConsecutiveFailures (default 5) via the trip() path. There is no event at 1/5, 2/5, 3/5, 4/5, so the first signal an operator sees is the breaker already being open. The runbook at docs/operate/runbooks/triage.md:17-19 documents the threshold but has no pre-trip warning to alert on.
Was this a latency trip or a real error? Lines 119-128 of src/utils/circuit-breaker.ts swallow the latency violation into the same { outcome: "error" } shape as a thrown exception (lines 132-137). The caller at src/orchestrator/triage.ts:319-326 collapses both into reason: "llm-error", so the dashboard alert at docs/operate/observability.md:230 cannot distinguish "Anthropic is slow" (raise timeouts, ride it out) from "Anthropic is throwing" (page the on-call).
What's our MTTR when it does trip? On the half-open → closed transition (src/utils/circuit-breaker.ts:142-144) the breaker has this.openedAt from the most recent trip() and this.now() in hand, so it could emit openMs = now - openedAt, the recovery-time signal SREs need for outage post-mortems. Today the value is computed implicitly inside the state === "open" cooldown check (line 109) and never surfaced.
All four gaps are addressable with two changes inside the existing module and no new dependencies: (a) call the onStateChange callback for non-transition events too (rename to onEvent or add a sibling onObservation hook), or simpler — extend the executor's return type with the new fields (consecutiveFailures, openMs, skipsSinceOpened, latencyTripped: boolean) and let the triage caller log them through the existing pino child. The pattern follows precedent: issue #200 added queue_wait_ms to dispatcher.offer.*, #166 added pipeline.stagedelta_ms, #170 added octokit.rate-limit via hook.after. Issue #215 is the same idea for retryWithBackoff, this one is the matching guard for breaker.execute.
Diagram
flowchart TD
Start([breaker.execute fn]) --> CheckOpen{state == open?}
CheckOpen -- yes --> CheckCooldown{now - openedAt<br/>< cooldownMs?}
CheckCooldown -- yes --> SkipReturn["return outcome: circuit-open<br/><b>GAP: no skip event,<br/>no openMs, no skip counter</b>"]:::gap
CheckCooldown -- no --> HalfOpen[transition open to half-open<br/>onStateChange logs reason only]:::partial
CheckOpen -- no --> CallFn[await fn]
HalfOpen --> CallFn
CallFn --> CallResult{returned or threw?}
CallResult -- threw --> RecordErr["recordFailure error.message<br/><b>GAP: no failure event,<br/>consecutiveFailures hidden</b>"]:::gap
CallResult -- returned --> CheckLatency{latencyMs<br/>> latencyTripMs?}
CheckLatency -- yes --> RecordLatency["recordFailure latency-trip<br/><b>GAP: indistinguishable from<br/>real error at caller</b>"]:::gap
CheckLatency -- no --> RecordOk[recordSuccess]
RecordErr --> CheckThreshold{consecutiveFailures<br/>>= maxConsecutiveFailures?}
RecordLatency --> CheckThreshold
CheckThreshold -- yes --> Trip["trip to open<br/>onStateChange logs<br/><b>GAP: no openMs on later close</b>"]:::partial
CheckThreshold -- no --> Continue["no log<br/><b>GAP: operator misses 4 of 5<br/>pre-trip warning</b>"]:::gap
classDef gap fill:#cb2431;color:#ffffff
classDef partial fill:#dbab09;color:#000000
Loading
Red nodes are paths that emit no log line today. Yellow nodes have an onStateChange log but lack a key field (openMs).
Rationale
The triage breaker is in the request hot path: it fires on every webhook event subject to the kill switch (docs/operate/runbooks/triage.md:36). When Anthropic has a partial outage the breaker is the only thing that bounds cost — SC-005 (see src/utils/circuit-breaker.ts:16-19) caps error-path cost at ~$0.06/hour against a $1/hour budget, but that bound only holds if the operator notices the breaker is open within the cooldown window. Today the only signal is one onStateChange warn line at trip time, then silence. The alert recipe at docs/operate/observability.md:230 ("triage error rate above 10% over 15 minutes") fires too late on a low-traffic install: a handful of events at 5 consecutive failures each can trip the breaker without crossing the 10% rate threshold, and then the open period is invisible.
The pre-trip signal matters even more. maxConsecutiveFailures = 5 and latencyTripMs = 10s mean a single slow upstream can trip the breaker in under 50 seconds of wall clock. An operator who could alert on "consecutiveFailures >= 3" would have a 2-failure head start to investigate (provider status page, recent deploys, downstream rate limits) before the breaker actually opens. Resilience4j and Hystrix both publish this counter; the Azure Architecture Center explicitly recommends "logging changes to circuit breaker state with the reason for the change, and logging failures and successes detected by the circuit breaker" as the baseline observability bar.
The MTTR (openMs at close) and the latency-trip-vs-real-error distinction round out the trio of metrics every mature breaker library exposes (Resilience4j onCallNotPermitted, slow-call rate, state-transition events). This issue is a small, additive change within the existing pino-based observability discipline (issues #166, #170, #172, #174, #182, #186, #188, #189, #207, #209) — no new dependency, no new transport, no schema migration.
References
Internal:
src/utils/circuit-breaker.ts:107-138 — execute() method: the four silent paths
src/utils/circuit-breaker.ts:147-156 — recordFailure(): consecutiveFailures increment with no event
src/utils/circuit-breaker.ts:140-145 — recordSuccess() / half-open → closed: openMs discardable here
src/orchestrator/triage.ts:153-160 — sole CircuitBreaker instantiation + the lone onStateChange log
src/orchestrator/triage.ts:315-326 — caller-side log of circuit-open skip and llm-error collapse
docs/operate/runbooks/triage.md:17-19 — runbook expectation that operators can see breaker state
docs/operate/observability.md:230 — current alert recipe that depends on the rate signal alone
Extend CircuitBreakerExecuteResult<T> (src/utils/circuit-breaker.ts:45-54) with consecutiveFailures: number, skipsSinceOpened?: number, openMs?: number, latencyTripped: boolean. The first three are non-breaking additions; latencyTripped distinguishes the line-119 path from the line-132 path so the caller can stop conflating them.
Add a totalSkipsSinceOpened counter to the breaker private state, incremented at src/utils/circuit-breaker.ts:110 (the circuit-open return). Reset on trip() so each open window has its own running count.
Compute openMs = now - openedAt in recordSuccess's half-open → closed branch (src/utils/circuit-breaker.ts:142-144) and pass it through onStateChange's reason (or a new optional 4th param) so the existing log line at src/orchestrator/triage.ts:158 can emit it.
Emit one structured warn per skip at src/orchestrator/triage.ts:316 with { event: "triage.circuit.skip", openMs, skipsSinceOpened }; mirror the naming convention used by dispatcher.offer.* and pipeline.stage.*.
Optional pre-trip warning hook: callback onProgress?: (consecutiveFailures, maxConsecutiveFailures) => void, fired from recordFailure when consecutiveFailures is in [max - 2, max - 1]. Triage wires it to a logger.warn so operators get a 2-failure heads-up.
Update docs/operate/observability.md § "Alerts worth having" with a new bullet for event:"triage.circuit.skip" rate (sum over the open window) and a consecutiveFailures >= 3 warn rate. Cite the new event names in docs/operate/runbooks/triage.md Circuit-breaker section.
Tests: extend test/utils/circuit-breaker.test.ts to assert the new fields on CircuitBreakerExecuteResult and that openMs is populated on half-open → closed. Existing tests stay green because additions are non-breaking.
Areas Evaluated
This run focused on the observability area. Files read in detail:
Finding
The triage circuit breaker (
src/utils/circuit-breaker.ts) is the only blast-radius guard between webhook events and the Claude / Bedrock provider, but four of its five execution paths emit no log line of their own. The single point of observability is the optionalonStateChangecallback (src/utils/circuit-breaker.ts:71-75), which the triage caller wires up atsrc/orchestrator/triage.ts:157-159to logfrom,to,reason. That fires on transitions only, so an operator watching during a Claude outage cannot answer the SRE-relevant questions:state === "open"and the cooldown has not elapsed,executereturns{ outcome: "circuit-open" }atsrc/utils/circuit-breaker.ts:108-111with zero log emission. The triage caller produces onelogger.warnatsrc/orchestrator/triage.ts:316per skipped request, but it carries onlydeliveryId— noopenMs(time sinceopenedAt), no running skip counter, no shared correlation field across requests skipped in the same open window. To count skips during an incident the operator must grep"triage short-circuited by open breaker"and pipe throughwc -l.recordFailureatsrc/utils/circuit-breaker.ts:147-156incrementsconsecutiveFailuresprivately and only logs when the counter crossesmaxConsecutiveFailures(default 5) via thetrip()path. There is no event at 1/5, 2/5, 3/5, 4/5, so the first signal an operator sees is the breaker already being open. The runbook atdocs/operate/runbooks/triage.md:17-19documents the threshold but has no pre-trip warning to alert on.src/utils/circuit-breaker.tsswallow the latency violation into the same{ outcome: "error" }shape as a thrown exception (lines 132-137). The caller atsrc/orchestrator/triage.ts:319-326collapses both intoreason: "llm-error", so the dashboard alert atdocs/operate/observability.md:230cannot distinguish "Anthropic is slow" (raise timeouts, ride it out) from "Anthropic is throwing" (page the on-call).half-open → closedtransition (src/utils/circuit-breaker.ts:142-144) the breaker hasthis.openedAtfrom the most recenttrip()andthis.now()in hand, so it could emitopenMs = now - openedAt, the recovery-time signal SREs need for outage post-mortems. Today the value is computed implicitly inside thestate === "open"cooldown check (line 109) and never surfaced.All four gaps are addressable with two changes inside the existing module and no new dependencies: (a) call the
onStateChangecallback for non-transition events too (rename toonEventor add a siblingonObservationhook), or simpler — extend the executor's return type with the new fields (consecutiveFailures,openMs,skipsSinceOpened,latencyTripped: boolean) and let the triage caller log them through the existing pino child. The pattern follows precedent: issue #200 addedqueue_wait_mstodispatcher.offer.*, #166 addedpipeline.stagedelta_ms, #170 addedoctokit.rate-limitviahook.after. Issue #215 is the same idea forretryWithBackoff, this one is the matching guard forbreaker.execute.Diagram
flowchart TD Start([breaker.execute fn]) --> CheckOpen{state == open?} CheckOpen -- yes --> CheckCooldown{now - openedAt<br/>< cooldownMs?} CheckCooldown -- yes --> SkipReturn["return outcome: circuit-open<br/><b>GAP: no skip event,<br/>no openMs, no skip counter</b>"]:::gap CheckCooldown -- no --> HalfOpen[transition open to half-open<br/>onStateChange logs reason only]:::partial CheckOpen -- no --> CallFn[await fn] HalfOpen --> CallFn CallFn --> CallResult{returned or threw?} CallResult -- threw --> RecordErr["recordFailure error.message<br/><b>GAP: no failure event,<br/>consecutiveFailures hidden</b>"]:::gap CallResult -- returned --> CheckLatency{latencyMs<br/>> latencyTripMs?} CheckLatency -- yes --> RecordLatency["recordFailure latency-trip<br/><b>GAP: indistinguishable from<br/>real error at caller</b>"]:::gap CheckLatency -- no --> RecordOk[recordSuccess] RecordErr --> CheckThreshold{consecutiveFailures<br/>>= maxConsecutiveFailures?} RecordLatency --> CheckThreshold CheckThreshold -- yes --> Trip["trip to open<br/>onStateChange logs<br/><b>GAP: no openMs on later close</b>"]:::partial CheckThreshold -- no --> Continue["no log<br/><b>GAP: operator misses 4 of 5<br/>pre-trip warning</b>"]:::gap classDef gap fill:#cb2431;color:#ffffff classDef partial fill:#dbab09;color:#000000Red nodes are paths that emit no log line today. Yellow nodes have an
onStateChangelog but lack a key field (openMs).Rationale
The triage breaker is in the request hot path: it fires on every webhook event subject to the kill switch (
docs/operate/runbooks/triage.md:36). When Anthropic has a partial outage the breaker is the only thing that bounds cost —SC-005(seesrc/utils/circuit-breaker.ts:16-19) caps error-path cost at ~$0.06/hour against a $1/hour budget, but that bound only holds if the operator notices the breaker is open within the cooldown window. Today the only signal is oneonStateChangewarn line at trip time, then silence. The alert recipe atdocs/operate/observability.md:230("triage error rate above 10% over 15 minutes") fires too late on a low-traffic install: a handful of events at 5 consecutive failures each can trip the breaker without crossing the 10% rate threshold, and then the open period is invisible.The pre-trip signal matters even more.
maxConsecutiveFailures = 5andlatencyTripMs = 10smean a single slow upstream can trip the breaker in under 50 seconds of wall clock. An operator who could alert on "consecutiveFailures >= 3" would have a 2-failure head start to investigate (provider status page, recent deploys, downstream rate limits) before the breaker actually opens. Resilience4j and Hystrix both publish this counter; the Azure Architecture Center explicitly recommends "logging changes to circuit breaker state with the reason for the change, and logging failures and successes detected by the circuit breaker" as the baseline observability bar.The MTTR (
openMsat close) and the latency-trip-vs-real-error distinction round out the trio of metrics every mature breaker library exposes (Resilience4jonCallNotPermitted, slow-call rate, state-transition events). This issue is a small, additive change within the existing pino-based observability discipline (issues #166, #170, #172, #174, #182, #186, #188, #189, #207, #209) — no new dependency, no new transport, no schema migration.References
Internal:
src/utils/circuit-breaker.ts:107-138—execute()method: the four silent pathssrc/utils/circuit-breaker.ts:147-156—recordFailure():consecutiveFailuresincrement with no eventsrc/utils/circuit-breaker.ts:140-145—recordSuccess()/half-open → closed:openMsdiscardable heresrc/orchestrator/triage.ts:153-160— soleCircuitBreakerinstantiation + the loneonStateChangelogsrc/orchestrator/triage.ts:315-326— caller-side log ofcircuit-openskip andllm-errorcollapsedocs/operate/runbooks/triage.md:17-19— runbook expectation that operators can see breaker statedocs/operate/observability.md:230— current alert recipe that depends on the rate signal alonequeue_wait_ms), PR feat(observability): structured pipeline.stage timing events with delta_ms #182 (pipeline.stage delta_ms)External:
onSuccess,onError,onCallNotPermitted,onStateTransitionSuggested Next Steps
CircuitBreakerExecuteResult<T>(src/utils/circuit-breaker.ts:45-54) withconsecutiveFailures: number,skipsSinceOpened?: number,openMs?: number,latencyTripped: boolean. The first three are non-breaking additions;latencyTrippeddistinguishes the line-119 path from the line-132 path so the caller can stop conflating them.totalSkipsSinceOpenedcounter to the breaker private state, incremented atsrc/utils/circuit-breaker.ts:110(thecircuit-openreturn). Reset ontrip()so each open window has its own running count.openMs = now - openedAtinrecordSuccess'shalf-open → closedbranch (src/utils/circuit-breaker.ts:142-144) and pass it throughonStateChange'sreason(or a new optional 4th param) so the existing log line atsrc/orchestrator/triage.ts:158can emit it.src/orchestrator/triage.ts:316with{ event: "triage.circuit.skip", openMs, skipsSinceOpened }; mirror the naming convention used bydispatcher.offer.*andpipeline.stage.*.onProgress?: (consecutiveFailures, maxConsecutiveFailures) => void, fired fromrecordFailurewhenconsecutiveFailuresis in[max - 2, max - 1]. Triage wires it to alogger.warnso operators get a 2-failure heads-up.docs/operate/observability.md§ "Alerts worth having" with a new bullet forevent:"triage.circuit.skip"rate (sum over the open window) and aconsecutiveFailures >= 3warn rate. Cite the new event names indocs/operate/runbooks/triage.mdCircuit-breaker section.test/utils/circuit-breaker.test.tsto assert the new fields onCircuitBreakerExecuteResultand thatopenMsis populated onhalf-open → closed. Existing tests stay green because additions are non-breaking.Areas Evaluated
This run focused on the
observabilityarea. Files read in detail:src/logger.ts(root logger, child-logger contract, fatal handlers)src/utils/log-redaction.ts(REDACT_PATHS, err serializer, scrubStructured)src/utils/retry.ts(defaultLog, retry path — observability covered by open issue feat(observability): add structured retry.* events + retry.succeeded_after_retry to retryWithBackoff #215)src/utils/circuit-breaker.ts(the finding's target)src/orchestrator/triage.ts(sole breaker caller; verified the wiring and what gets logged)docs/operate/observability.md(alert recipes, dispatcher / scheduler event taxonomies)docs/operate/runbooks/triage.md(runbook expectations for breaker visibility)test/utils/circuit-breaker.test.ts(test structure; confirmed the proposed additions are non-breaking)area: observabilityresearch issues (feat(observability): wire pino.final exit logging for uncaughtException + unhandledRejection #164, feat(observability): structured pipeline.stage events with delta_ms for runPipeline #166, feat(observability): log octokit rate-limit headers via hook.after for per-installation quota visibility #170, feat(observability): structured pino logger for stdio MCP servers w/ deliveryId correlation + err redaction parity #172, feat(observability): periodic fleet-state gauge snapshot for queue depth, daemon counts, busy slots #174, fix(observability): child-logger field-name drift across webhook handlers breaks per-entity log correlation #175, feat(observability): periodic fleet-state gauge snapshot #186 via feat(observability): periodic fleet-state gauge snapshot for queue depth, daemon counts, busy slots #174 thread, feat(observability): structured dispatcher.offer.{sent,accepted,rejected,timed_out} events + offer_latency_ms #187, feat(observability): installationId on loggers + config-free MCP retry #189, feat(observability): log input/output tokens + modelUsage from SDKResultMessage; persist to executions table #192, feat(observability): add queue_wait_ms to dispatcher.offer.sent for enqueue->dispatch lag visibility #200, feat(observability): add queue_wait_ms to dispatcher offer/no-daemon logs #207, feat(observability): log + persist SDK token usage on executions #209, feat(observability): add structured retry.* events + retry.succeeded_after_retry to retryWithBackoff #215) to confirm non-overlap. The closest open issue (feat(observability): add structured retry.* events + retry.succeeded_after_retry to retryWithBackoff #215) coversretryWithBackoffevents; circuit breaker is a different module with a different event surface.Generated by the scheduled research action on 2026-06-04