Skip to content

feat(observability): emit structured circuit.* events for triage breaker with counters and openMs #216

@chrisleekr

Description

@chrisleekr

Finding

The triage circuit breaker (src/utils/circuit-breaker.ts) is the only blast-radius guard between webhook events and the Claude / Bedrock provider, but four of its five execution paths emit no log line of their own. The single point of observability is the optional onStateChange callback (src/utils/circuit-breaker.ts:71-75), which the triage caller wires up at src/orchestrator/triage.ts:157-159 to log from, to, reason. That fires on transitions only, so an operator watching during a Claude outage cannot answer the SRE-relevant questions:

  1. How many requests did we drop while open? When state === "open" and the cooldown has not elapsed, execute returns { outcome: "circuit-open" } at src/utils/circuit-breaker.ts:108-111 with zero log emission. The triage caller produces one logger.warn at src/orchestrator/triage.ts:316 per skipped request, but it carries only deliveryId — no openMs (time since openedAt), no running skip counter, no shared correlation field across requests skipped in the same open window. To count skips during an incident the operator must grep "triage short-circuited by open breaker" and pipe through wc -l.
  2. How close are we to tripping? recordFailure at src/utils/circuit-breaker.ts:147-156 increments consecutiveFailures privately and only logs when the counter crosses maxConsecutiveFailures (default 5) via the trip() path. There is no event at 1/5, 2/5, 3/5, 4/5, so the first signal an operator sees is the breaker already being open. The runbook at docs/operate/runbooks/triage.md:17-19 documents the threshold but has no pre-trip warning to alert on.
  3. Was this a latency trip or a real error? Lines 119-128 of src/utils/circuit-breaker.ts swallow the latency violation into the same { outcome: "error" } shape as a thrown exception (lines 132-137). The caller at src/orchestrator/triage.ts:319-326 collapses both into reason: "llm-error", so the dashboard alert at docs/operate/observability.md:230 cannot distinguish "Anthropic is slow" (raise timeouts, ride it out) from "Anthropic is throwing" (page the on-call).
  4. What's our MTTR when it does trip? On the half-open → closed transition (src/utils/circuit-breaker.ts:142-144) the breaker has this.openedAt from the most recent trip() and this.now() in hand, so it could emit openMs = now - openedAt, the recovery-time signal SREs need for outage post-mortems. Today the value is computed implicitly inside the state === "open" cooldown check (line 109) and never surfaced.

All four gaps are addressable with two changes inside the existing module and no new dependencies: (a) call the onStateChange callback for non-transition events too (rename to onEvent or add a sibling onObservation hook), or simpler — extend the executor's return type with the new fields (consecutiveFailures, openMs, skipsSinceOpened, latencyTripped: boolean) and let the triage caller log them through the existing pino child. The pattern follows precedent: issue #200 added queue_wait_ms to dispatcher.offer.*, #166 added pipeline.stage delta_ms, #170 added octokit.rate-limit via hook.after. Issue #215 is the same idea for retryWithBackoff, this one is the matching guard for breaker.execute.

Diagram

flowchart TD
    Start([breaker.execute fn]) --> CheckOpen{state == open?}
    CheckOpen -- yes --> CheckCooldown{now - openedAt<br/>< cooldownMs?}
    CheckCooldown -- yes --> SkipReturn["return outcome: circuit-open<br/><b>GAP: no skip event,<br/>no openMs, no skip counter</b>"]:::gap
    CheckCooldown -- no --> HalfOpen[transition open to half-open<br/>onStateChange logs reason only]:::partial
    CheckOpen -- no --> CallFn[await fn]
    HalfOpen --> CallFn
    CallFn --> CallResult{returned or threw?}
    CallResult -- threw --> RecordErr["recordFailure error.message<br/><b>GAP: no failure event,<br/>consecutiveFailures hidden</b>"]:::gap
    CallResult -- returned --> CheckLatency{latencyMs<br/>> latencyTripMs?}
    CheckLatency -- yes --> RecordLatency["recordFailure latency-trip<br/><b>GAP: indistinguishable from<br/>real error at caller</b>"]:::gap
    CheckLatency -- no --> RecordOk[recordSuccess]
    RecordErr --> CheckThreshold{consecutiveFailures<br/>>= maxConsecutiveFailures?}
    RecordLatency --> CheckThreshold
    CheckThreshold -- yes --> Trip["trip to open<br/>onStateChange logs<br/><b>GAP: no openMs on later close</b>"]:::partial
    CheckThreshold -- no --> Continue["no log<br/><b>GAP: operator misses 4 of 5<br/>pre-trip warning</b>"]:::gap

    classDef gap fill:#cb2431;color:#ffffff
    classDef partial fill:#dbab09;color:#000000
Loading

Red nodes are paths that emit no log line today. Yellow nodes have an onStateChange log but lack a key field (openMs).

Rationale

The triage breaker is in the request hot path: it fires on every webhook event subject to the kill switch (docs/operate/runbooks/triage.md:36). When Anthropic has a partial outage the breaker is the only thing that bounds cost — SC-005 (see src/utils/circuit-breaker.ts:16-19) caps error-path cost at ~$0.06/hour against a $1/hour budget, but that bound only holds if the operator notices the breaker is open within the cooldown window. Today the only signal is one onStateChange warn line at trip time, then silence. The alert recipe at docs/operate/observability.md:230 ("triage error rate above 10% over 15 minutes") fires too late on a low-traffic install: a handful of events at 5 consecutive failures each can trip the breaker without crossing the 10% rate threshold, and then the open period is invisible.

The pre-trip signal matters even more. maxConsecutiveFailures = 5 and latencyTripMs = 10s mean a single slow upstream can trip the breaker in under 50 seconds of wall clock. An operator who could alert on "consecutiveFailures >= 3" would have a 2-failure head start to investigate (provider status page, recent deploys, downstream rate limits) before the breaker actually opens. Resilience4j and Hystrix both publish this counter; the Azure Architecture Center explicitly recommends "logging changes to circuit breaker state with the reason for the change, and logging failures and successes detected by the circuit breaker" as the baseline observability bar.

The MTTR (openMs at close) and the latency-trip-vs-real-error distinction round out the trio of metrics every mature breaker library exposes (Resilience4j onCallNotPermitted, slow-call rate, state-transition events). This issue is a small, additive change within the existing pino-based observability discipline (issues #166, #170, #172, #174, #182, #186, #188, #189, #207, #209) — no new dependency, no new transport, no schema migration.

References

Internal:

External:

Suggested Next Steps

  1. Extend CircuitBreakerExecuteResult<T> (src/utils/circuit-breaker.ts:45-54) with consecutiveFailures: number, skipsSinceOpened?: number, openMs?: number, latencyTripped: boolean. The first three are non-breaking additions; latencyTripped distinguishes the line-119 path from the line-132 path so the caller can stop conflating them.
  2. Add a totalSkipsSinceOpened counter to the breaker private state, incremented at src/utils/circuit-breaker.ts:110 (the circuit-open return). Reset on trip() so each open window has its own running count.
  3. Compute openMs = now - openedAt in recordSuccess's half-open → closed branch (src/utils/circuit-breaker.ts:142-144) and pass it through onStateChange's reason (or a new optional 4th param) so the existing log line at src/orchestrator/triage.ts:158 can emit it.
  4. Emit one structured warn per skip at src/orchestrator/triage.ts:316 with { event: "triage.circuit.skip", openMs, skipsSinceOpened }; mirror the naming convention used by dispatcher.offer.* and pipeline.stage.*.
  5. Optional pre-trip warning hook: callback onProgress?: (consecutiveFailures, maxConsecutiveFailures) => void, fired from recordFailure when consecutiveFailures is in [max - 2, max - 1]. Triage wires it to a logger.warn so operators get a 2-failure heads-up.
  6. Update docs/operate/observability.md § "Alerts worth having" with a new bullet for event:"triage.circuit.skip" rate (sum over the open window) and a consecutiveFailures >= 3 warn rate. Cite the new event names in docs/operate/runbooks/triage.md Circuit-breaker section.
  7. Tests: extend test/utils/circuit-breaker.test.ts to assert the new fields on CircuitBreakerExecuteResult and that openMs is populated on half-open → closed. Existing tests stay green because additions are non-breaking.

Areas Evaluated

This run focused on the observability area. Files read in detail:

Generated by the scheduled research action on 2026-06-04

Metadata

Metadata

Assignees

No one assigned

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions