Skip to content

feat(observability): add structured retry.* events + retry.succeeded_after_retry to retryWithBackoff #215

@chrisleekr

Description

@chrisleekr

Finding

retryWithBackoff (src/utils/retry.ts:83) is the single chokepoint guarding every transient-failure recovery in the bot: it wraps 12 call sites spanning the GraphQL fetcher, every GitHub-touching MCP server, the orchestrator triage probe, the ship probe, and the pipeline's three GitHub writes (tracking-comment create, finalize, and the failure-path finalize at src/core/pipeline.ts:303, :317, :465, :577). Despite that load-bearing role, its telemetry has not migrated to the structured-event convention adopted across the rest of observability over the last several months (pipeline.stage in #166, github.api.* in #170, dispatcher.offer.* in #187, queue_wait_ms in #200): all three terminal emissions are plain text messages with no event field — log.warn({ attempt, status, err }, "Non-retriable error, throwing immediately") at src/utils/retry.ts:128, log.warn({ attempt, maxAttempts, err }, "Operation attempt failed") at src/utils/retry.ts:136, and log.error({ maxAttempts }, "Operation failed after all attempts") at src/utils/retry.ts:146.

The bigger blind spot is success-after-retry: the early-return path at src/utils/retry.ts:109 (return await operation()) emits nothing. A call that fails on attempt 1 with a 502 and succeeds on attempt 2 produces exactly one warn-level Operation attempt failed line — but no signal that the operation eventually succeeded, what its total wall-clock was, or that the system absorbed transient failure on the caller's behalf. This is the weakly-flaky signal that compounds: a 1 % per-call upstream flake on a 100-call/hour comment MCP server is one retry per hour today, invisible; the same flake on a 1,000-call/hour fleet is 10 retries per hour and still invisible, until the upstream degrades further and the warn-level attempt failed lines spike with no baseline to compare against. There is also no op field on any line, so an operator grepping attempt: 2 cannot tell whether the flake was pulls.get, issues.updateComment, the triage Bedrock call, or the GraphQL paginator — every call site collapses into one anonymous stream. Finally, the existing fields omit two operationally useful scalars the loop already has in scope: delay_ms (the next backoff that will be slept at src/utils/retry.ts:140:141) and a cumulative elapsed_ms, which would let an exhaustion line carry the full retry-window duration without parsing.

Concretely: extend RetryOptions (src/utils/retry.ts:32) with an op?: string; pin the four-event family with a .strict() Zod schema in a new src/utils/retry-log-fields.ts (mirroring GithubApiLogFieldsSchema in src/utils/octokit-observability.ts:39); emit retry.attempt_failed (warn) at the existing :136 line, retry.non_retriable (warn) at :128, retry.exhausted (error) at :146, and the new retry.succeeded_after_retry (info, gated on attempt > 1) on the early-return path. Every event carries event, op, attempt, max_attempts, and elapsed_ms; attempt_failed adds the next delay_ms; non_retriable adds status. The attempt field aligns with the OpenTelemetry HTTP http.request.resend_count semantic so an OTel exporter can map directly later.

Diagram

flowchart TD
    Start([retryWithBackoff op]):::path --> Attempt[Attempt N]:::path
    Attempt -->|success| Success[Return value]:::path
    Attempt -->|throws| Classify{Classify error}:::path
    Classify -->|non-retriable 4xx| Nonret[Throw to caller]:::path
    Classify -->|retriable or 429| Decide{attempt less than max?}:::path
    Decide -->|yes| Waitloop[Sleep delay_ms<br/>backoff doubles]:::path
    Waitloop --> Attempt
    Decide -->|no| Exhaust[Throw lastError]:::path

    Success -.->|attempt greater than 1: SILENT today| MissedEvt[retry.succeeded_after_retry<br/>NEW info-level event]:::silent
    Attempt -.->|today: Operation attempt failed<br/>no event field| FailEvt[retry.attempt_failed<br/>+ op, delay_ms, elapsed_ms]:::proposed
    Nonret -.->|today: Non-retriable error<br/>no event field| NREvt[retry.non_retriable<br/>+ op, status]:::proposed
    Exhaust -.->|today: Operation failed after all<br/>no event field| ExhEvt[retry.exhausted<br/>+ op, elapsed_ms]:::proposed

    classDef silent fill:#922b21;color:#ffffff
    classDef proposed fill:#196f3d;color:#ffffff
    classDef path fill:#1f618d;color:#ffffff
Loading

Rationale

Three observability properties this closes that no other event currently provides:

  1. Weak-flake visibility. retry.succeeded_after_retry is the missing primitive for measuring upstream stability without waiting for full failure. Today a 1 % transient-failure rate against GitHub or Bedrock is invisible at the default info level: the only retry telemetry is warn/error, which fires only on the long tail. The success-after-retry event makes the body of the transient-failure distribution observable. The recurring retry.succeeded_after_retry count per op is the canonical leading indicator for "the upstream is starting to wobble," analogous to the cache hit-ratio alert documented at docs/operate/observability.md:55-63 for prompt-cache regressions.

  2. Per-call-site attribution. Without the op field, the existing warn lines from 12 call sites are indistinguishable — an alert on Operation attempt failed count cannot tell pulls.get flakes from graphql.paginate flakes from bedrock.invoke flakes. With op pinned, an operator dashboard can break retry rate down per upstream and route the alert to the team that owns each one. This matches the per-route attribution github.api.request already provides via its route field (src/utils/octokit-observability.ts:91).

  3. Schema-pinned consistency. A .strict() Zod schema co-located with the emitter (the existing pattern in src/core/log-fields.ts:29 and src/utils/octokit-observability.ts:39) prevents the silent field-name drift that issue fix(observability): child-logger field-name drift across webhook handlers breaks per-entity log correlation #175 closed for child-logger fields. Without the schema, the four events would be a fresh drift surface — easy to mistype delayMs vs delay_ms or elapsedMs vs elapsed_ms across the four emit sites.

The change is feasible in isolation: ~40 lines added to src/utils/retry.ts, one new ~25-line src/utils/retry-log-fields.ts, no new npm dependencies, no schema changes, no migrations. Backwards-compatible: each call site can opt in to the op field at its own pace; the events still fire (with op: "unknown" or omitted) for sites that have not been threaded yet, and the existing message-text remains attached as the pino msg so log shippers keyed on the text still match.

References

Internal:

  • src/utils/retry.ts:83#retryWithBackoff — the chokepoint and its three unstructured emit sites at :128, :136, :146; the silent early-return at :109.
  • src/utils/octokit-observability.ts:39#GithubApiLogFieldsSchema — the .strict() Zod-schema pattern this would mirror.
  • src/core/log-fields.ts:29#PipelineStageLogSchema — sibling structured-event schema for the core pipeline.
  • src/core/pipeline.ts:303,317,465,577 — four retryWithBackoff call sites whose op would become trackingComment.create, github.fetch, trackingComment.finalize, and trackingComment.finalize.failure.
  • docs/operate/observability.md:42-52 — the existing "Core pipeline log fields" section that the new retry table would slot next to.
  • Closed issue fix(mcp): 4 of 5 GitHub-touching MCP servers + 7 state-fetchers bypass retryWithBackoff #199 — wired retryWithBackoff through the MCP/state-fetcher surface but did not address telemetry shape.
  • Closed issue refactor(observability): decouple resolve-review-thread MCP server from config via retry.ts logger import #184 — established the config-free default logger in retry.ts:19-26, which this change extends.

External:

Suggested Next Steps

  1. Add src/utils/retry-log-fields.ts with RETRY_LOG_EVENTS and a .strict() RetryLogFieldsSchema (Zod), mirroring the shape of GithubApiLogFieldsSchema.
  2. Extend RetryOptions in src/utils/retry.ts:32 with op?: string; capture startedAt = Date.now() at function entry and thread elapsed_ms = Date.now() - startedAt into each emit site.
  3. Replace the three unstructured log.warn / log.error calls at :128, :136, :146 with object-first emits carrying the new fields; add the new log.info({ event: "retry.succeeded_after_retry", op, attempt, ... }) immediately before the early-return at :109, gated on attempt > 1.
  4. Thread the op argument through the 12 known call sites (start with src/core/pipeline.ts, src/core/fetcher.ts, and the five MCP servers; the orchestrator triage and ship probe can follow).
  5. Add a co-located src/utils/retry.test.ts that round-trips a captured pino line through RetryLogFieldsSchema.parse(...) for each of the four events (the same drift-prevention pattern src/core/log-fields.test.ts uses).
  6. Add a "Retry log fields" section to docs/operate/observability.md between "GitHub API rate-limit fields" (line 65) and "Output secret-guard log events" (line 77).
  7. Add a baseline alert recipe: count(event = "retry.succeeded_after_retry") by op over 5-minute windows, with deviation from baseline as the wobble signal.

Areas Evaluated

Generated by the scheduled research action on 2026-06-03

Metadata

Metadata

Assignees

No one assigned

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions