feat(observability): add structured retry.* events #225
Replace text-only retry logs with structured `retry.attempt_failed`, `retry.non_retriable`, `retry.exhausted` events carrying `op`, `attempt`, `max_attempts`, `elapsed_ms`, plus a new `retry.succeeded_after_retry` info event as a weak-flake leading indicator (gated to attempt > 1, so first-try successes stay silent). Schema lives in `retry-log-fields.ts` with `.strict()` field-drift guarding, mirroring `octokit-observability.ts`. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Pass `op` to every `retryWithBackoff` call so operators can break the retry rate down per upstream call site. Includes pipeline tracking-comment paths, GitHub state-fetchers, fetcher review followups, ship probe, and four MCP servers. Document the new events + fields in `docs/operate/observability.md`. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
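The event flow these two commits describe can be sketched roughly as follows. This is a simplified, hypothetical stand-in rather than the PR's actual `retryWithBackoff`: the names `retrySketch`, `RetrySketchOptions`, and `statusOf`, the exponential backoff formula, and the rule that every 4xx status is non-retriable are all assumptions made for illustration. Only the event names and field names are taken from the commit messages.

```typescript
// Hypothetical, simplified stand-in for retryWithBackoff (not the PR's code).
type RetryEvent =
  | "retry.attempt_failed"
  | "retry.non_retriable"
  | "retry.exhausted"
  | "retry.succeeded_after_retry";

interface RetryLogFields {
  event: RetryEvent;
  op: string;
  attempt: number;
  max_attempts: number;
  elapsed_ms: number;
  delay_ms?: number;
  status?: number;
}

interface RetrySketchOptions {
  op?: string;
  maxAttempts?: number;
  baseDelayMs?: number;
  log?: (fields: RetryLogFields) => void;
  sleep?: (ms: number) => Promise<void>;
}

// Reads a numeric `status` property off an arbitrary thrown value, if present.
function statusOf(err: unknown): number | undefined {
  if (typeof err === "object" && err !== null && "status" in err) {
    const status = (err as { status: unknown }).status;
    return typeof status === "number" ? status : undefined;
  }
  return undefined;
}

async function retrySketch<T>(
  fn: () => Promise<T>,
  options: RetrySketchOptions = {},
): Promise<T> {
  const {
    op = "unknown",
    maxAttempts = 3,
    baseDelayMs = 100,
    log = (fields: RetryLogFields): void => {
      console.log(JSON.stringify(fields));
    },
    sleep = (ms: number): Promise<void> =>
      new Promise<void>((resolve) => {
        setTimeout(resolve, ms);
      }),
  } = options;
  const startedAt = Date.now();
  let lastError: unknown;

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    // Fields shared by every retry.* event.
    const base = () => ({
      op,
      attempt,
      max_attempts: maxAttempts,
      elapsed_ms: Date.now() - startedAt,
    });
    try {
      const result = await fn();
      // Gated to attempt > 1 so first-try successes stay silent.
      if (attempt > 1) log({ event: "retry.succeeded_after_retry", ...base() });
      return result;
    } catch (err) {
      lastError = err;
      const status = statusOf(err);
      // Assumption: any 4xx status is treated as non-retriable.
      if (status !== undefined && status >= 400 && status < 500) {
        log({ event: "retry.non_retriable", ...base(), status });
        throw err;
      }
      const willRetry = attempt < maxAttempts;
      const delay_ms = baseDelayMs * 2 ** (attempt - 1);
      // delay_ms is only present when a sleep actually follows.
      log({ event: "retry.attempt_failed", ...base(), ...(willRetry ? { delay_ms } : {}) });
      if (willRetry) await sleep(delay_ms);
    }
  }
  log({
    event: "retry.exhausted",
    op,
    attempt: maxAttempts,
    max_attempts: maxAttempts,
    elapsed_ms: Date.now() - startedAt,
  });
  throw lastError;
}
```

A call site would then pass its own tag, for example `retrySketch(() => fetchSomething(), { op: "github.fetch" })`, where `fetchSomething` is a placeholder for whatever upstream call is being wrapped.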
📝 Walkthrough
The PR adds structured retry telemetry to
Changes
Retry Observability
Estimated code review effort
🎯 3 (Moderate) | ⏱️ ~25 minutes
Possibly related PRs
Suggested labels
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
bot workflow 🔍 Code review complete, 13 files, +529/-51.
Summary
PR #225 replaces
Overall verdict: approve modulo two non-blocking suggestions. Correctness is solid, the schema is well-shaped (no drift can land silently), the
What was checked
Files read in full:
Cross-references performed:
Findings
[minor]
Pull request overview
This PR upgrades retryWithBackoff observability from message-only logging to a schema-pinned, structured retry.* event family with a stable field contract and per-call-site attribution via an op tag, enabling reliable alerting and breakdowns of transient-failure behavior across upstream integrations.
Changes:
- Added `RETRY_LOG_EVENTS` + `RetryLogFieldsSchema` (Zod `.strict()`) for the `retry.*` event family and introduced tests to prevent field-name drift.
- Updated `retryWithBackoff` to emit structured `retry.attempt_failed`, `retry.non_retriable`, `retry.exhausted`, and `retry.succeeded_after_retry` events (the last gated to `attempt > 1`), plus threaded `op` through multiple call sites.
- Documented the new retry event schema and recommended alerting queries in the observability docs.
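To make the field-drift guard concrete, here is a minimal sketch of what a strict check over the retry fields does. It is a dependency-free stand-in written for illustration, not the PR's Zod `RetryLogFieldsSchema`: the function `checkRetryLogFields` and its error strings are hypothetical, and the exact numeric constraints are assumptions based on the cases the tests are said to reject (unknown fields, camelCase typos, negative `elapsed_ms`, non-integer `attempt`, missing `op`).

```typescript
// Dependency-free stand-in for a strict schema; names here are hypothetical.
const RETRY_EVENT_NAMES: readonly string[] = [
  "retry.attempt_failed",
  "retry.non_retriable",
  "retry.exhausted",
  "retry.succeeded_after_retry",
];

// snake_case field names as documented in the PR.
const ALLOWED_FIELDS = new Set<string>([
  "event",
  "op",
  "attempt",
  "max_attempts",
  "elapsed_ms",
  "delay_ms",
  "status",
]);

const isInteger = (v: unknown): boolean =>
  typeof v === "number" && Number.isInteger(v);
const isNonNegative = (v: unknown): boolean =>
  typeof v === "number" && Number.isFinite(v) && v >= 0;

// Returns a list of problems; an empty list means the payload is accepted.
function checkRetryLogFields(fields: Record<string, unknown>): string[] {
  const problems: string[] = [];
  // "Strict" behaviour: any key outside the allow-list is rejected, which is
  // what catches camelCase typos such as maxAttempts.
  for (const key of Object.keys(fields)) {
    if (!ALLOWED_FIELDS.has(key)) problems.push(`unknown field: ${key}`);
  }
  const event = fields["event"];
  if (!RETRY_EVENT_NAMES.some((name) => name === event)) {
    problems.push("event: not a retry.* event");
  }
  const op = fields["op"];
  if (typeof op !== "string" || op.length === 0) {
    problems.push("op: must be a non-empty string");
  }
  if (!isInteger(fields["attempt"])) problems.push("attempt: must be an integer");
  if (!isInteger(fields["max_attempts"])) problems.push("max_attempts: must be an integer");
  if (!isNonNegative(fields["elapsed_ms"])) problems.push("elapsed_ms: must be >= 0");
  return problems;
}

// A camelCase typo is reported as an unknown field and, because the real
// snake_case field is then absent, as a missing max_attempts as well.
console.log(
  checkRetryLogFields({
    event: "retry.exhausted",
    op: "github.fetch",
    attempt: 3,
    maxAttempts: 3,
    elapsed_ms: 1200,
  }),
);
```

The point of rejecting unknown keys, rather than silently stripping them, is that a renamed or mistyped field fails a test instead of quietly disappearing from dashboards.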
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| test/utils/retry.test.ts | Adds behavioral tests asserting structured retry event emission and gating behavior. |
| test/utils/retry-log-fields.test.ts | Adds schema tests ensuring strict rejection of drift (unknown fields, camelCase typos, invalid scalars). |
| src/workflows/ship/probe.ts | Threads op into retry-wrapped GraphQL probe calls. |
| src/utils/retry.ts | Emits structured retry.* events, adds op option, and refactors non-retriable classification helper. |
| src/utils/retry-log-fields.ts | Introduces canonical retry event constants and strict Zod schema for emitted fields. |
| src/mcp/servers/resolve-review-thread.ts | Threads op into retry-wrapped GraphQL preflight and mutation calls. |
| src/mcp/servers/merge-readiness.ts | Threads op into retry-wrapped probe call. |
| src/mcp/servers/inline-comment.ts | Threads op into retry-wrapped PR fetch and review-comment creation. |
| src/mcp/servers/comment.ts | Threads op into retry-wrapped comment update call. |
| src/github/state-fetchers.ts | Threads op into retry-wrapped GitHub state fetchers for attribution. |
| src/core/pipeline.ts | Threads op into retry-wrapped tracking-comment and GitHub fetch operations. |
| src/core/fetcher.ts | Threads op into retry-wrapped review comment pagination follow-up. |
| docs/operate/observability.md | Documents retry events/fields and adds operator guidance for monitoring success-after-retry. |
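The PR description gives the monitoring recipe as `count(event = "retry.succeeded_after_retry") by op`. As a rough illustration of what that aggregation computes, the sketch below does the same grouping over already-parsed log lines. The `RetryLogLine` shape and the function name are hypothetical, and a real deployment would run this as a query in its log backend rather than in application code.

```typescript
// Hypothetical in-memory equivalent of:
//   count(event = "retry.succeeded_after_retry") by op
interface RetryLogLine {
  event: string;
  op: string;
}

function countSucceededAfterRetryByOp(
  lines: readonly RetryLogLine[],
): Map<string, number> {
  const counts = new Map<string, number>();
  for (const line of lines) {
    // Only the "transient failure that resolved" event is counted.
    if (line.event !== "retry.succeeded_after_retry") continue;
    counts.set(line.op, (counts.get(line.op) ?? 0) + 1);
  }
  return counts;
}

const sampleLines: RetryLogLine[] = [
  { event: "retry.attempt_failed", op: "github.fetch" },
  { event: "retry.succeeded_after_retry", op: "github.fetch" },
  { event: "retry.succeeded_after_retry", op: "mcp.comment.update" },
  { event: "retry.succeeded_after_retry", op: "github.fetch" },
];
const sampleCounts = countSucceededAfterRetryByOp(sampleLines);
console.log(sampleCounts.get("github.fetch"), sampleCounts.get("mcp.comment.update"));
```

A rising count for one `op` while others stay flat points at a specific upstream call becoming flaky, which is the per-call-site breakdown the `op` tag is meant to enable.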
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/utils/retry.ts (1)
95-102: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Enforce non-empty `op` at runtime to preserve the retry event contract.
`op` defaults to `"unknown"`, but callers can still pass `""` (or whitespace), producing `retry.*` payloads that violate the schema/documented contract (`op` must be non-empty). Validate and normalize `op` before emitting logs.
Suggested fix

     const {
    @@
    -    op = "unknown",
    +    op = "unknown",
       } = options;
    +  const normalizedOp = op.trim();
    +  if (normalizedOp.length === 0) {
    +    throw new Error("retryWithBackoff: op must be a non-empty string when provided");
    +  }
    @@
    -    op,
    +    op: normalizedOp,
    @@
    -    op,
    +    op: normalizedOp,
    @@
    -    op,
    +    op: normalizedOp,
    @@
    -    op,
    +    op: normalizedOp,

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/utils/retry.ts` around lines 95 - 102, Normalize and validate the destructured op from options at runtime: after the options destructuring in retry (use the local symbol op) trim whitespace and if the result is empty replace it with "unknown" (or throw if you prefer stricter behavior) so any subsequent calls that emit retry.* events or call log/defaultLog always get a non-empty op; update any places that emit retry events or construct payloads to use this normalized op.
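The fallback variant of this suggestion (replace an empty `op` with `"unknown"` instead of throwing) could look like the sketch below. The follow-up commit on this PR mentions a `normalizeOp(op)` helper with this behaviour, but its body is not shown in the thread, so the implementation here is an assumption. In particular, returning the trimmed string rather than the caller's original value is a choice made for this example.

```typescript
// Sketch of an op normalizer: empty, whitespace-only, or non-string input
// falls back to "unknown" so emitted events always carry a non-empty op.
function normalizeOp(op: unknown): string {
  if (typeof op !== "string") return "unknown";
  const trimmed = op.trim();
  // Assumption: the trimmed value is what gets logged.
  return trimmed.length > 0 ? trimmed : "unknown";
}

console.log(normalizeOp("github.fetch"));
console.log(normalizeOp("   "));
console.log(normalizeOp(undefined));
```

Falling back rather than throwing keeps a logging concern from turning into a runtime failure for the wrapped call, at the cost of some events being attributed to `"unknown"`.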
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/utils/retry-log-fields.ts`:
- Around line 35-50: The schema RetryLogFieldsSchema uses z.object(...).strict()
— update it to the Zod v4 idiom by replacing the .object(...).strict() pattern
with z.strictObject({...}) so the same keys (event, op, attempt, max_attempts,
elapsed_ms, delay_ms, status) and validators are preserved; locate and refactor
the RetryLogFieldsSchema definition to call z.strictObject with the same
property validators instead of chaining .object(...).strict().
---
Outside diff comments:
In `@src/utils/retry.ts`:
- Around line 95-102: Normalize and validate the destructured op from options at
runtime: after the options destructuring in retry (use the local symbol op) trim
whitespace and if the result is empty replace it with "unknown" (or throw if you
prefer stricter behavior) so any subsequent calls that emit retry.* events or
call log/defaultLog always get a non-empty op; update any places that emit retry
events or construct payloads to use this normalized op.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 81c4a0b5-1992-4b3a-adda-dd8e40d597b3
📒 Files selected for processing (13)
- docs/operate/observability.md
- src/core/fetcher.ts
- src/core/pipeline.ts
- src/github/state-fetchers.ts
- src/mcp/servers/comment.ts
- src/mcp/servers/inline-comment.ts
- src/mcp/servers/merge-readiness.ts
- src/mcp/servers/resolve-review-thread.ts
- src/utils/retry-log-fields.ts
- src/utils/retry.ts
- src/workflows/ship/probe.ts
- test/utils/retry-log-fields.test.ts
- test/utils/retry.test.ts
bot workflow 🔎 Resolve incomplete, outstanding items remain after the agent finished.
Outstanding
None. All review threads resolved, CI green, PR is
cost: $9.9620 · turns: 116 · duration: 1897s
…ntion (#225)
Address PR review pushback on #225:
- retry.ts: normalize op via a small normalizeOp(op) helper so empty / whitespace-only / non-string op falls back to "unknown", holding the op: z.string().min(1) contract regardless of caller.
- retry.ts: always read status from the raw error and spread it into the retry.attempt_failed payload (was missing); operators can now slice transient-failure rate by HTTP status without parsing err.
- retry-log-fields.ts: refactor to z.discriminatedUnion of strictObject branches so per-event field presence is pinned, non_retriable requires status, only attempt_failed may carry delay_ms, and exhausted / succeeded_after_retry carry neither. Adopts Zod v4 idiomatic z.strictObject already used elsewhere in the repo.
- Rename camelCase op-tag segments to snake_case (12 sites in pipeline, state-fetchers, ship probe, and three MCP servers) so all 20 op tags follow lowercase-dotted segments. Documented under "Retry log fields" in observability.md.
- docs/operate/observability.md: add status row, document op convention, and refresh the retry-event table to reflect per-event field constraints.
- docs/use/workflows/ship.md: note that ship probe wraps its GraphQL calls in retryWithBackoff with ship.probe.main / ship.probe.review_threads op tags (satisfies docs-sync guard for FR-019).
- Tests: +6 schema cases for per-event field constraints, +5 behaviour cases for op normalisation and attempt_failed status parity.

Co-authored-by: chrisleekr-bot[bot] <chrisleekr-bot[bot]@users.noreply.github.com>
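The per-event field constraints listed in this commit can be illustrated without Zod. The sketch below is a hypothetical, dependency-free table of which extra fields each event requires or allows; it stands in for the `z.discriminatedUnion` of `z.strictObject` branches the commit describes and is not the PR's schema. `EXTRA_FIELDS` and `fieldPresenceProblems` are names invented for this example, and value validation is left out to keep the focus on field presence.

```typescript
// Hypothetical stand-in for a discriminated union keyed on `event`.
type RetryEventName =
  | "retry.attempt_failed"
  | "retry.non_retriable"
  | "retry.exhausted"
  | "retry.succeeded_after_retry";

// Fields every retry.* event carries.
const COMMON_FIELDS: readonly string[] = [
  "event",
  "op",
  "attempt",
  "max_attempts",
  "elapsed_ms",
];

interface ExtraFieldRules {
  required: readonly string[];
  optional: readonly string[];
}

// Per-event rules as stated in the commit message:
// - non_retriable requires status
// - only attempt_failed may carry delay_ms (and it may carry status)
// - exhausted / succeeded_after_retry carry neither
const EXTRA_FIELDS: Record<RetryEventName, ExtraFieldRules> = {
  "retry.attempt_failed": { required: [], optional: ["delay_ms", "status"] },
  "retry.non_retriable": { required: ["status"], optional: [] },
  "retry.exhausted": { required: [], optional: [] },
  "retry.succeeded_after_retry": { required: [], optional: [] },
};

// Checks field presence only; an empty result means the shape is accepted.
function fieldPresenceProblems(
  event: RetryEventName,
  payload: Record<string, unknown>,
): string[] {
  const rules = EXTRA_FIELDS[event];
  const required = ([] as string[]).concat(COMMON_FIELDS, rules.required);
  const allowed = new Set<string>(required.concat(rules.optional));
  const problems: string[] = [];
  for (const key of required) {
    if (!(key in payload)) problems.push(`missing field: ${key}`);
  }
  for (const key of Object.keys(payload)) {
    if (!allowed.has(key)) problems.push(`unexpected field: ${key}`);
  }
  return problems;
}

const commonPayload = { op: "github.fetch", attempt: 2, max_attempts: 3, elapsed_ms: 350 };
console.log(
  fieldPresenceProblems("retry.non_retriable", { event: "retry.non_retriable", ...commonPayload }),
);
```

Keying the rules on `event` is what lets a test fail when, say, `delay_ms` leaks onto `retry.exhausted`, something a single flat schema with optional fields would accept.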
Summary
Closes #215. Replaces `retryWithBackoff`'s text-only log emits with structured `retry.*` events that carry a stable schema (`event`, `op`, `attempt`, `max_attempts`, `elapsed_ms`, optional `delay_ms` / `status`), and adds a new `retry.succeeded_after_retry` info event so the rate of "transient failure that resolved" can be monitored as a weak-flake leading indicator without the noise of first-try successes. Threads an `op` tag through every call site so operators can break the retry rate down per upstream call (e.g. `github.fetch`, `mcp.comment.update`).
Diagram
flowchart LR
    classDef io fill:#1f6feb,stroke:#0a3069,color:#ffffff
    classDef logic fill:#dafbe1,stroke:#1a7f37,color:#0a3622
    classDef emit fill:#fff8c5,stroke:#9a6700,color:#1f2328
    caller[caller passes op<br/>e.g. github.fetch]:::io
    retry[retryWithBackoff loop]:::logic
    succ{success after<br/>attempt 1+?}:::logic
    nonret{non-retriable<br/>4xx?}:::logic
    exh{attempts<br/>exhausted?}:::logic
    e1[retry.attempt_failed<br/>+ delay_ms]:::emit
    e2[retry.non_retriable<br/>+ status]:::emit
    e3[retry.exhausted]:::emit
    e4[retry.succeeded_after_retry<br/>info]:::emit
    caller --> retry
    retry --> succ
    succ -- yes --> e4
    retry --> nonret
    nonret -- yes --> e2
    retry --> e1
    retry --> exh
    exh -- yes --> e3
Changes
- `src/utils/retry-log-fields.ts` exports `RETRY_LOG_EVENTS` constants + a `.strict()` Zod `RetryLogFieldsSchema` so field drift breaks tests, mirroring the `octokit-observability.ts` pattern.
- `retryWithBackoff` now emits structured events (`retry.attempt_failed`, `retry.non_retriable`, `retry.exhausted`) and a new `retry.succeeded_after_retry` info event gated to `attempt > 1` so first-try successes stay silent. `delay_ms` is omitted on the final-attempt emit since no sleep follows.
- `RetryOptions` gains optional `op?: string` (defaults to `"unknown"` when omitted).
- Extracted the `isNonRetriable()` helper out of the main loop to keep complexity under the ESLint warning threshold.
- Threaded `op` through 12 call sites: pipeline tracking-comment paths, `github.fetch`, fetcher review followups, GitHub state-fetchers (7 ops), ship probe (2 ops), and four MCP servers (`comment`, `inline-comment`, `merge-readiness`, `resolve-review-thread`).
- Documented the new events and fields in `docs/operate/observability.md`, including a `count(event = "retry.succeeded_after_retry") by op` alert recipe for transient-failure tracking.
Files changed
- `src/utils/retry-log-fields.ts` · new module: event constants + `.strict()` Zod schema for the four `retry.*` events.
- `src/utils/retry.ts` · structured emits, `op` threading, `succeeded_after_retry` event, `isNonRetriable` extraction.
- `src/core/fetcher.ts` · `op: "github.review.followup"` on the pagination follow-up retry.
- `src/core/pipeline.ts` · 4 ops on tracking-comment create/finalize paths and the GitHub fetch.
- `src/github/state-fetchers.ts` · 7 ops, one per state-fetch tool.
- `src/workflows/ship/probe.ts` · 2 ops on review-thread + main probe retries.
- `src/mcp/servers/inline-comment.ts` · 2 ops (PR fetch + create review comment).
- `src/mcp/servers/comment.ts` · `op: "mcp.comment.update"`.
- `src/mcp/servers/merge-readiness.ts` · `op: "mcp.mergeReadiness.probe"`.
- `src/mcp/servers/resolve-review-thread.ts` · 2 ops (preflight + mutate).
- `test/utils/retry-log-fields.test.ts` · new schema tests (accept well-formed events, reject camelCase typos / unknown fields / negative `elapsed_ms` / non-integer `attempt` / missing `op`).
- `test/utils/retry.test.ts` · 5 behaviour tests covering first-try-silent, success-after-retry, non-retriable-404, exhausted, and `op` default.
- `docs/operate/observability.md` · five new field rows + a "Retry log fields" section with the event table and alert recipe.
Commits
- `e22a82c` · feat(observability): add structured retry.* events (#215)
- `1181d77` · feat(observability): thread retry op tag through 12 call sites (#215)
Tests run
- `bun test test/utils/retry.test.ts test/utils/retry-log-fields.test.ts` · 41 pass / 0 fail
- `bun test test/utils/` · 194 pass / 0 fail
- `bun test test/core/fetcher.test.ts test/core/checkout.test.ts test/core/tracking-comment.test.ts test/workflows/ship/probe.test.ts` · 63 pass / 0 fail (matches main baseline)
- `bun run typecheck` · clean
- `bunx eslint <changed files>` · 0 errors, 12 warnings (all pre-existing on `main`, verified via `git diff`)
- `bun run docs:build` · strict build OK
- `bun run scripts/check-docs-versions.ts` · OK
- `bun run scripts/check-docs-citations.ts` · OK
Verification
- `RETRY_LOG_EVENTS` + `RetryLogFieldsSchema` exist (`.strict()`, rejects camelCase typo / unknown field). `RetryOptions` carries `op?: string`. The three existing emits became structured with `event` / `op` / `attempt` / `max_attempts` / `elapsed_ms`; `attempt_failed` carries `delay_ms` only when a sleep follows; `non_retriable` carries the `status`. The new `retry.succeeded_after_retry` info event fires only when `attempt > 1`, so first-try successes stay silent. The 12 documented call sites pass their `op` tag (verified via grep).
- Defaulted `op = "unknown"` rather than making `op` required: the helper has a config-free default path (used by stdio MCP servers, see `retry.ts:20-27`), so a required `op` would force every internal use to pass a string just to satisfy the schema. The schema enforces non-empty `op`, so emits always carry a meaningful identifier in practice.
- Extracted `isNonRetriable()` rather than suppressing the complexity warning: this keeps the main loop legible as more structured-log branches accumulate, per CLAUDE.md "Don't add features … beyond what the task requires". It was the smallest change that brought complexity back under the threshold without functional drift.
- `delay_ms` is conditionally spread (`...(willRetry ? { delay_ms } : {})`) rather than always present with a sentinel; the field genuinely doesn't exist on the final attempt because no sleep occurs, which keeps the structured schema honest.
- An `eslint-disable` covers the `lastError!` non-null assertion at the throw site; the `validateNumberOption` guards ensure the loop ran at least once, so the assertion is sound.
- The pre-existing warnings in `state-fetchers.ts`, `probe.ts`, `pipeline.ts`, `resolve-review-thread.ts` (complexity / await-in-loop / unnecessary-optional-chain) were left untouched; they are not introduced by this PR and cleaning them up would expand scope.
Related Issues
Test plan
- `bun run typecheck` clean
- `bun run lint` no new errors
Summary by CodeRabbit
Documentation
New Features