You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
retryWithBackoff (src/utils/retry.ts:83) is the single chokepoint guarding every transient-failure recovery in the bot: it wraps 12 call sites spanning the GraphQL fetcher, every GitHub-touching MCP server, the orchestrator triage probe, the ship probe, and the pipeline's three GitHub writes (tracking-comment create, finalize, and the failure-path finalize at src/core/pipeline.ts:303, :317, :465, :577). Despite that load-bearing role, its telemetry has not migrated to the structured-event convention adopted across the rest of observability over the last several months (pipeline.stage in #166, github.api.* in #170, dispatcher.offer.* in #187, queue_wait_ms in #200): all three terminal emissions are plain text messages with no event field — log.warn({ attempt, status, err }, "Non-retriable error, throwing immediately") at src/utils/retry.ts:128, log.warn({ attempt, maxAttempts, err }, "Operation attempt failed") at src/utils/retry.ts:136, and log.error({ maxAttempts }, "Operation failed after all attempts") at src/utils/retry.ts:146.
The bigger blind spot is success-after-retry: the early-return path at src/utils/retry.ts:109 (return await operation()) emits nothing. A call that fails on attempt 1 with a 502 and succeeds on attempt 2 produces exactly one warn-level Operation attempt failed line — but no signal that the operation eventually succeeded, what its total wall-clock was, or that the system absorbed transient failure on the caller's behalf. This is the weakly-flaky signal that compounds: a 1 % per-call upstream flake on a 100-call/hour comment MCP server is one retry per hour today, invisible; the same flake on a 1,000-call/hour fleet is 10 retries per hour and still invisible, until the upstream degrades further and the warn-level attempt failed lines spike with no baseline to compare against. There is also no op field on any line, so an operator grepping attempt: 2 cannot tell whether the flake was pulls.get, issues.updateComment, the triage Bedrock call, or the GraphQL paginator — every call site collapses into one anonymous stream. Finally, the existing fields omit two operationally useful scalars the loop already has in scope: delay_ms (the next backoff that will be slept at src/utils/retry.ts:140–:141) and a cumulative elapsed_ms, which would let an exhaustion line carry the full retry-window duration without parsing.
Concretely: extend RetryOptions (src/utils/retry.ts:32) with an op?: string; pin the four-event family with a .strict() Zod schema in a new src/utils/retry-log-fields.ts (mirroring GithubApiLogFieldsSchema in src/utils/octokit-observability.ts:39); emit retry.attempt_failed (warn) at the existing :136 line, retry.non_retriable (warn) at :128, retry.exhausted (error) at :146, and the new retry.succeeded_after_retry (info, gated on attempt > 1) on the early-return path. Every event carries event, op, attempt, max_attempts, and elapsed_ms; attempt_failed adds the next delay_ms; non_retriable adds status. The attempt field aligns with the OpenTelemetry HTTP http.request.resend_count semantic so an OTel exporter can map directly later.
Three observability properties this closes that no other event currently provides:
Weak-flake visibility.retry.succeeded_after_retry is the missing primitive for measuring upstream stability without waiting for full failure. Today a 1 % transient-failure rate against GitHub or Bedrock is invisible at the default info level: the only retry telemetry is warn/error, which fires only on the long tail. The success-after-retry event makes the body of the transient-failure distribution observable. The recurring retry.succeeded_after_retry count per op is the canonical leading indicator for "the upstream is starting to wobble," analogous to the cache hit-ratio alert documented at docs/operate/observability.md:55-63 for prompt-cache regressions.
Per-call-site attribution. Without the op field, the existing warn lines from 12 call sites are indistinguishable — an alert on Operation attempt failed count cannot tell pulls.get flakes from graphql.paginate flakes from bedrock.invoke flakes. With op pinned, an operator dashboard can break retry rate down per upstream and route the alert to the team that owns each one. This matches the per-route attribution github.api.request already provides via its route field (src/utils/octokit-observability.ts:91).
Schema-pinned consistency. A .strict() Zod schema co-located with the emitter (the existing pattern in src/core/log-fields.ts:29 and src/utils/octokit-observability.ts:39) prevents the silent field-name drift that issue fix(observability): child-logger field-name drift across webhook handlers breaks per-entity log correlation #175 closed for child-logger fields. Without the schema, the four events would be a fresh drift surface — easy to mistype delayMs vs delay_ms or elapsedMs vs elapsed_ms across the four emit sites.
The change is feasible in isolation: ~40 lines added to src/utils/retry.ts, one new ~25-line src/utils/retry-log-fields.ts, no new npm dependencies, no schema changes, no migrations. Backwards-compatible: each call site can opt in to the op field at its own pace; the events still fire (with op: "unknown" or omitted) for sites that have not been threaded yet, and the existing message-text remains attached as the pino msg so log shippers keyed on the text still match.
References
Internal:
src/utils/retry.ts:83#retryWithBackoff — the chokepoint and its three unstructured emit sites at :128, :136, :146; the silent early-return at :109.
src/utils/octokit-observability.ts:39#GithubApiLogFieldsSchema — the .strict() Zod-schema pattern this would mirror.
src/core/log-fields.ts:29#PipelineStageLogSchema — sibling structured-event schema for the core pipeline.
src/core/pipeline.ts:303,317,465,577 — four retryWithBackoff call sites whose op would become trackingComment.create, github.fetch, trackingComment.finalize, and trackingComment.finalize.failure.
docs/operate/observability.md:42-52 — the existing "Core pipeline log fields" section that the new retry table would slot next to.
Add src/utils/retry-log-fields.ts with RETRY_LOG_EVENTS and a .strict()RetryLogFieldsSchema (Zod), mirroring the shape of GithubApiLogFieldsSchema.
Extend RetryOptions in src/utils/retry.ts:32 with op?: string; capture startedAt = Date.now() at function entry and thread elapsed_ms = Date.now() - startedAt into each emit site.
Replace the three unstructured log.warn / log.error calls at :128, :136, :146 with object-first emits carrying the new fields; add the new log.info({ event: "retry.succeeded_after_retry", op, attempt, ... }) immediately before the early-return at :109, gated on attempt > 1.
Thread the op argument through the 12 known call sites (start with src/core/pipeline.ts, src/core/fetcher.ts, and the five MCP servers; the orchestrator triage and ship probe can follow).
Add a co-located src/utils/retry.test.ts that round-trips a captured pino line through RetryLogFieldsSchema.parse(...) for each of the four events (the same drift-prevention pattern src/core/log-fields.test.ts uses).
Add a "Retry log fields" section to docs/operate/observability.md between "GitHub API rate-limit fields" (line 65) and "Output secret-guard log events" (line 77).
Add a baseline alert recipe: count(event = "retry.succeeded_after_retry") by op over 5-minute windows, with deviation from baseline as the wobble signal.
Finding
retryWithBackoff(src/utils/retry.ts:83) is the single chokepoint guarding every transient-failure recovery in the bot: it wraps 12 call sites spanning the GraphQL fetcher, every GitHub-touching MCP server, the orchestrator triage probe, the ship probe, and the pipeline's three GitHub writes (tracking-comment create, finalize, and the failure-path finalize atsrc/core/pipeline.ts:303,:317,:465,:577). Despite that load-bearing role, its telemetry has not migrated to the structured-eventconvention adopted across the rest of observability over the last several months (pipeline.stagein #166,github.api.*in #170,dispatcher.offer.*in #187,queue_wait_msin #200): all three terminal emissions are plain text messages with noeventfield —log.warn({ attempt, status, err }, "Non-retriable error, throwing immediately")atsrc/utils/retry.ts:128,log.warn({ attempt, maxAttempts, err }, "Operation attempt failed")atsrc/utils/retry.ts:136, andlog.error({ maxAttempts }, "Operation failed after all attempts")atsrc/utils/retry.ts:146.The bigger blind spot is success-after-retry: the early-return path at
src/utils/retry.ts:109(return await operation()) emits nothing. A call that fails on attempt 1 with a 502 and succeeds on attempt 2 produces exactly one warn-levelOperation attempt failedline — but no signal that the operation eventually succeeded, what its total wall-clock was, or that the system absorbed transient failure on the caller's behalf. This is the weakly-flaky signal that compounds: a 1 % per-call upstream flake on a 100-call/hour comment MCP server is one retry per hour today, invisible; the same flake on a 1,000-call/hour fleet is 10 retries per hour and still invisible, until the upstream degrades further and the warn-levelattempt failedlines spike with no baseline to compare against. There is also noopfield on any line, so an operator greppingattempt: 2cannot tell whether the flake waspulls.get,issues.updateComment, the triage Bedrock call, or the GraphQL paginator — every call site collapses into one anonymous stream. Finally, the existing fields omit two operationally useful scalars the loop already has in scope:delay_ms(the next backoff that will be slept atsrc/utils/retry.ts:140–:141) and a cumulativeelapsed_ms, which would let an exhaustion line carry the full retry-window duration without parsing.Concretely: extend
RetryOptions(src/utils/retry.ts:32) with anop?: string; pin the four-event family with a.strict()Zod schema in a newsrc/utils/retry-log-fields.ts(mirroringGithubApiLogFieldsSchemainsrc/utils/octokit-observability.ts:39); emitretry.attempt_failed(warn) at the existing:136line,retry.non_retriable(warn) at:128,retry.exhausted(error) at:146, and the newretry.succeeded_after_retry(info, gated onattempt > 1) on the early-return path. Every event carriesevent,op,attempt,max_attempts, andelapsed_ms;attempt_failedadds the nextdelay_ms;non_retriableaddsstatus. Theattemptfield aligns with the OpenTelemetry HTTPhttp.request.resend_countsemantic so an OTel exporter can map directly later.Diagram
flowchart TD Start([retryWithBackoff op]):::path --> Attempt[Attempt N]:::path Attempt -->|success| Success[Return value]:::path Attempt -->|throws| Classify{Classify error}:::path Classify -->|non-retriable 4xx| Nonret[Throw to caller]:::path Classify -->|retriable or 429| Decide{attempt less than max?}:::path Decide -->|yes| Waitloop[Sleep delay_ms<br/>backoff doubles]:::path Waitloop --> Attempt Decide -->|no| Exhaust[Throw lastError]:::path Success -.->|attempt greater than 1: SILENT today| MissedEvt[retry.succeeded_after_retry<br/>NEW info-level event]:::silent Attempt -.->|today: Operation attempt failed<br/>no event field| FailEvt[retry.attempt_failed<br/>+ op, delay_ms, elapsed_ms]:::proposed Nonret -.->|today: Non-retriable error<br/>no event field| NREvt[retry.non_retriable<br/>+ op, status]:::proposed Exhaust -.->|today: Operation failed after all<br/>no event field| ExhEvt[retry.exhausted<br/>+ op, elapsed_ms]:::proposed classDef silent fill:#922b21;color:#ffffff classDef proposed fill:#196f3d;color:#ffffff classDef path fill:#1f618d;color:#ffffffRationale
Three observability properties this closes that no other event currently provides:
Weak-flake visibility.
retry.succeeded_after_retryis the missing primitive for measuring upstream stability without waiting for full failure. Today a 1 % transient-failure rate against GitHub or Bedrock is invisible at the defaultinfolevel: the only retry telemetry is warn/error, which fires only on the long tail. The success-after-retry event makes the body of the transient-failure distribution observable. The recurringretry.succeeded_after_retrycount peropis the canonical leading indicator for "the upstream is starting to wobble," analogous to the cache hit-ratio alert documented atdocs/operate/observability.md:55-63for prompt-cache regressions.Per-call-site attribution. Without the
opfield, the existing warn lines from 12 call sites are indistinguishable — an alert onOperation attempt failedcount cannot tellpulls.getflakes fromgraphql.paginateflakes frombedrock.invokeflakes. Withoppinned, an operator dashboard can break retry rate down per upstream and route the alert to the team that owns each one. This matches the per-route attributiongithub.api.requestalready provides via itsroutefield (src/utils/octokit-observability.ts:91).Schema-pinned consistency. A
.strict()Zod schema co-located with the emitter (the existing pattern insrc/core/log-fields.ts:29andsrc/utils/octokit-observability.ts:39) prevents the silent field-name drift that issue fix(observability): child-logger field-name drift across webhook handlers breaks per-entity log correlation #175 closed for child-logger fields. Without the schema, the four events would be a fresh drift surface — easy to mistypedelayMsvsdelay_msorelapsedMsvselapsed_msacross the four emit sites.The change is feasible in isolation: ~40 lines added to
src/utils/retry.ts, one new ~25-linesrc/utils/retry-log-fields.ts, no new npm dependencies, no schema changes, no migrations. Backwards-compatible: each call site can opt in to theopfield at its own pace; the events still fire (withop: "unknown"or omitted) for sites that have not been threaded yet, and the existing message-text remains attached as the pinomsgso log shippers keyed on the text still match.References
Internal:
src/utils/retry.ts:83#retryWithBackoff— the chokepoint and its three unstructured emit sites at:128,:136,:146; the silent early-return at:109.src/utils/octokit-observability.ts:39#GithubApiLogFieldsSchema— the.strict()Zod-schema pattern this would mirror.src/core/log-fields.ts:29#PipelineStageLogSchema— sibling structured-event schema for the core pipeline.src/core/pipeline.ts:303,317,465,577— fourretryWithBackoffcall sites whoseopwould becometrackingComment.create,github.fetch,trackingComment.finalize, andtrackingComment.finalize.failure.docs/operate/observability.md:42-52— the existing "Core pipeline log fields" section that the new retry table would slot next to.retryWithBackoffthrough the MCP/state-fetcher surface but did not address telemetry shape.retry.ts:19-26, which this change extends.External:
http.request.resend_countas the ordinal attribute for retries; the proposedattemptfield maps 1:1.Suggested Next Steps
src/utils/retry-log-fields.tswithRETRY_LOG_EVENTSand a.strict()RetryLogFieldsSchema(Zod), mirroring the shape ofGithubApiLogFieldsSchema.RetryOptionsinsrc/utils/retry.ts:32withop?: string; capturestartedAt = Date.now()at function entry and threadelapsed_ms = Date.now() - startedAtinto each emit site.log.warn/log.errorcalls at:128,:136,:146with object-first emits carrying the new fields; add the newlog.info({ event: "retry.succeeded_after_retry", op, attempt, ... })immediately before the early-return at:109, gated onattempt > 1.opargument through the 12 known call sites (start withsrc/core/pipeline.ts,src/core/fetcher.ts, and the five MCP servers; the orchestrator triage and ship probe can follow).src/utils/retry.test.tsthat round-trips a captured pino line throughRetryLogFieldsSchema.parse(...)for each of the four events (the same drift-prevention patternsrc/core/log-fields.test.tsuses).docs/operate/observability.mdbetween "GitHub API rate-limit fields" (line 65) and "Output secret-guard log events" (line 77).count(event = "retry.succeeded_after_retry") by opover 5-minute windows, with deviation from baseline as the wobble signal.Areas Evaluated
src/logger.ts,src/utils/log-redaction.ts— root pino setup, redaction primitives, child-logger contract.src/utils/retry.ts— full review of the chokepoint, all four emit/no-emit sites, andRetryOptions.src/utils/octokit-observability.ts— the most recent structured-event implementation; pattern to mirror.src/core/log-fields.ts,docs/operate/observability.md— existing structured-event schema and its documentation layout.retryWithBackoffcall sites acrosssrc/core/,src/mcp/servers/,src/orchestrator/,src/workflows/ship/,src/github/.Generated by the scheduled research action on 2026-06-03