Skip to content

feat(observability): add structured retry.* events#225

Merged
chrisleekr merged 3 commits into
mainfrom
feat/structured-retry-events
Jun 12, 2026
Merged

feat(observability): add structured retry.* events#225
chrisleekr merged 3 commits into
mainfrom
feat/structured-retry-events

Conversation

@chrisleekr

@chrisleekr chrisleekr commented Jun 11, 2026

Copy link
Copy Markdown
Owner

Summary

Closes #215. Replaces retryWithBackoff's text-only log emits with structured retry.* events that carry a stable schema (event, op, attempt, max_attempts, elapsed_ms, optional delay_ms / status), and adds a new retry.succeeded_after_retry info event so the rate of "transient failure that resolved" can be monitored as a weak-flake leading indicator without the noise of first-try successes. Threads an op tag through every call site so operators can break the retry rate down per upstream call (e.g. github.fetch, mcp.comment.update).

Diagram

flowchart LR
    classDef io fill:#1f6feb,stroke:#0a3069,color:#ffffff
    classDef logic fill:#dafbe1,stroke:#1a7f37,color:#0a3622
    classDef emit fill:#fff8c5,stroke:#9a6700,color:#1f2328

    caller[caller passes op<br/>e.g. github.fetch]:::io
    retry[retryWithBackoff loop]:::logic
    succ{success after<br/>attempt 1+?}:::logic
    nonret{non-retriable<br/>4xx?}:::logic
    exh{attempts<br/>exhausted?}:::logic
    e1[retry.attempt_failed<br/>+ delay_ms]:::emit
    e2[retry.non_retriable<br/>+ status]:::emit
    e3[retry.exhausted]:::emit
    e4[retry.succeeded_after_retry<br/>info]:::emit

    caller --> retry
    retry --> succ
    succ -- yes --> e4
    retry --> nonret
    nonret -- yes --> e2
    retry --> e1
    retry --> exh
    exh -- yes --> e3
Loading

Changes

  • New src/utils/retry-log-fields.ts exports RETRY_LOG_EVENTS constants + a .strict() Zod RetryLogFieldsSchema so field drift breaks tests, mirroring the octokit-observability.ts pattern.
  • retryWithBackoff now emits structured events (retry.attempt_failed, retry.non_retriable, retry.exhausted) and a new retry.succeeded_after_retry info event gated to attempt > 1 so first-try successes stay silent. delay_ms is omitted on the final-attempt emit since no sleep follows.
  • RetryOptions gains optional op?: string (defaults to "unknown" when omitted).
  • Extracted isNonRetriable() helper out of the main loop to keep complexity under the ESLint warning threshold.
  • Threaded op through 12 call sites: pipeline tracking-comment paths, github.fetch, fetcher review followups, GitHub state-fetchers (7 ops), ship probe (2 ops), and four MCP servers (comment, inline-comment, merge-readiness, resolve-review-thread).
  • Documented the new events + fields in docs/operate/observability.md, including a count(event = "retry.succeeded_after_retry") by op alert recipe for transient-failure tracking.

Files changed

  • src/utils/retry-log-fields.ts · new module: event constants + .strict() Zod schema for the four retry.* events.
  • src/utils/retry.ts · structured emits, op threading, succeeded_after_retry event, isNonRetriable extraction.
  • src/core/fetcher.ts · op: "github.review.followup" on the pagination follow-up retry.
  • src/core/pipeline.ts · 4 ops on tracking-comment create/finalize paths and the GitHub fetch.
  • src/github/state-fetchers.ts · 7 ops, one per state-fetch tool.
  • src/workflows/ship/probe.ts · 2 ops on review-thread + main probe retries.
  • src/mcp/servers/inline-comment.ts · 2 ops (PR fetch + create review comment).
  • src/mcp/servers/comment.ts · op: "mcp.comment.update".
  • src/mcp/servers/merge-readiness.ts · op: "mcp.mergeReadiness.probe".
  • src/mcp/servers/resolve-review-thread.ts · 2 ops (preflight + mutate).
  • test/utils/retry-log-fields.test.ts · new schema tests (accept well-formed events, reject camelCase typos / unknown fields / negative elapsed_ms / non-integer attempt / missing op).
  • test/utils/retry.test.ts · 5 behaviour tests covering first-try-silent, success-after-retry, non-retriable-404, exhausted, and op default.
  • docs/operate/observability.md · five new field rows + a "Retry log fields" section with the event table and alert recipe.

Commits

Tests run

  • bun test test/utils/retry.test.ts test/utils/retry-log-fields.test.ts · 41 pass / 0 fail
  • bun test test/utils/ · 194 pass / 0 fail
  • bun test test/core/fetcher.test.ts test/core/checkout.test.ts test/core/tracking-comment.test.ts test/workflows/ship/probe.test.ts · 63 pass / 0 fail (matches main baseline)
  • bun run typecheck · clean
  • bunx eslint <changed files> · 0 errors, 12 warnings (all pre-existing on main, verified via git diff)
  • bun run docs:build · strict build OK
  • bun run scripts/check-docs-versions.ts · OK
  • bun run scripts/check-docs-citations.ts · OK

Verification

  • Acceptance criteria from feat(observability): add structured retry.* events + retry.succeeded_after_retry to retryWithBackoff #215: RETRY_LOG_EVENTS + RetryLogFieldsSchema exist (.strict(), rejects camelCase typo / unknown field). RetryOptions carries op?: string. The three existing emits became structured with event/op/attempt/max_attempts/elapsed_ms; attempt_failed carries delay_ms only when a sleep follows; non_retriable carries the status. The new retry.succeeded_after_retry info event fires only when attempt > 1, so first-try successes stay silent. The 12 documented call sites pass their op tag (verified via grep).
  • Non-trivial decisions:
    • Default op = "unknown" rather than making op required: the helper has a config-free default path (used by stdio MCP servers, see retry.ts:20-27), so a forced-required op would force every internal use to pass a string just to satisfy the schema. The schema enforces non-empty op, so emits always carry a meaningful identifier in practice.
    • Extracted isNonRetriable() rather than suppressing the complexity warning: keeps the main loop legible as more structured-log branches accumulate, per CLAUDE.md "Don't add features … beyond what the task requires" — this was the smallest change that brought complexity back under the threshold without functional drift.
    • delay_ms is conditionally spread (...(willRetry ? { delay_ms } : {})) rather than always present with a sentinel; the field genuinely doesn't exist on the final attempt because no sleep occurs, which keeps the structured schema honest.
    • Did NOT relax the existing inline eslint-disable for the lastError! non-null assertion at the throw site; the validateNumberOption guards ensure the loop ran at least once, so the assertion is sound.
  • Not done (intentional): pre-existing lint warnings in state-fetchers.ts, probe.ts, pipeline.ts, resolve-review-thread.ts (complexity / await-in-loop / unnecessary-optional-chain) were left untouched — they are not introduced by this PR and cleaning them up would expand scope.

Related Issues

Test plan

  • Tests added/updated where the change introduces new behaviour
  • bun run typecheck clean
  • bun run lint no new errors
  • Existing tests still pass (or pre-existing failures noted above)

Summary by CodeRabbit

  • Documentation

    • Updated observability documentation with additional structured fields for retry telemetry events, including retry event families and log levels.
  • New Features

    • Enhanced retry operation tracking with operation identifiers, providing improved visibility into retry attempts, timing metadata, and operation status across the system.

chrisleekr-bot[bot] and others added 2 commits June 11, 2026 13:21
Replace text-only retry logs with structured `retry.attempt_failed`,
`retry.non_retriable`, `retry.exhausted` events carrying `op`,
`attempt`, `max_attempts`, `elapsed_ms`, plus a new
`retry.succeeded_after_retry` info event as a weak-flake leading
indicator (gated to attempt > 1, so first-try successes stay silent).
Schema lives in `retry-log-fields.ts` with `.strict()` field-drift
guarding, mirroring `octokit-observability.ts`.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Pass `op` to every `retryWithBackoff` call so operators can break
the retry rate down per upstream call site. Includes pipeline
tracking-comment paths, GitHub state-fetchers, fetcher review
followups, ship probe, and four MCP servers. Document the new
events + fields in `docs/operate/observability.md`.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings June 11, 2026 13:23
@coderabbitai

coderabbitai Bot commented Jun 11, 2026

Copy link
Copy Markdown

Review Change Stack

Warning

Review limit reached

@chrisleekr, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 30 minutes and 40 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more credits in the billing tab to continue.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: f11f254b-0650-49ed-87df-44f1aeefdf82

📥 Commits

Reviewing files that changed from the base of the PR and between 1181d77 and 8599157.

📒 Files selected for processing (12)
  • docs/operate/observability.md
  • docs/use/workflows/ship.md
  • src/core/pipeline.ts
  • src/github/state-fetchers.ts
  • src/mcp/servers/inline-comment.ts
  • src/mcp/servers/merge-readiness.ts
  • src/mcp/servers/resolve-review-thread.ts
  • src/utils/retry-log-fields.ts
  • src/utils/retry.ts
  • src/workflows/ship/probe.ts
  • test/utils/retry-log-fields.test.ts
  • test/utils/retry.test.ts
📝 Walkthrough

Walkthrough

The PR adds structured retry telemetry to retryWithBackoff by introducing a Zod schema to pin event shapes, refactoring the retry loop to emit four event types with timing and operation context, threading operation identifiers through 12+ existing call sites, and documenting the retry event contract.

Changes

Retry Observability

Layer / File(s) Summary
Retry schema and structured event contract
src/utils/retry-log-fields.ts, test/utils/retry-log-fields.test.ts
New schema (RetryLogFieldsSchema) with strict validation defines the canonical structure for retry.* events: required fields event, op, attempt, max_attempts, elapsed_ms, and conditional fields delay_ms (for non-final attempts) and status (for non-retriable errors). Schema tests validate parsing success for well-formed events and rejection of field-name drift, unknown literals, and invalid values.
Retry loop refactoring with structured events
src/utils/retry.ts, test/utils/retry.test.ts
retryWithBackoff now accepts optional op identifier in RetryOptions, emits four structured log events—attempt_failed (warn on each error with elapsed/delay), non_retriable (warn on 4xx errors), exhausted (error after max attempts), succeeded_after_retry (info only when success follows failures)—and introduces isNonRetriable helper to centralize 4xx detection. Tests validate event emissions, schema compliance, and op defaulting to "unknown".
Threading op identifiers through retry call sites
src/core/pipeline.ts, src/core/fetcher.ts, src/github/state-fetchers.ts, src/mcp/servers/comment.ts, src/mcp/servers/inline-comment.ts, src/mcp/servers/merge-readiness.ts, src/mcp/servers/resolve-review-thread.ts, src/workflows/ship/probe.ts
All existing retryWithBackoff invocations now include operation-specific op labels (trackingComment.create, github.fetch, github.state.*, mcp.comment.update, etc.) alongside log to enable tracing of retry events by operation in observability systems.
Observability documentation
docs/operate/observability.md
New "Retry log fields" section documents the four retry events (attempt_failed, non_retriable, exhausted, succeeded_after_retry), their log levels, required/optional fields, and semantic meaning. Common log fields table updated with op, attempt, max_attempts, elapsed_ms, delay_ms definitions.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • chrisleekr/github-app-playground#205: Continues retry infrastructure work from PR #205 by extending retryWithBackoff with structured op labeling and updating the same MCP/state-fetcher call sites to pass operation metadata.
  • chrisleekr/github-app-playground#189: Both PRs modify src/utils/retry.ts and src/mcp/servers/resolve-review-thread.ts by updating how retryWithBackoff is configured; PR #189 adjusts retry logger/config while main PR adds structured op labeling and events.
  • chrisleekr/github-app-playground#182: Both PRs instrument src/core/pipeline.ts's runPipeline retry stages—main PR adds op labels to retryWithBackoff, while PR #182 adds timeStage/pipeline.stage timing logs to the same pipeline operations.

Suggested labels

type: feature ✨, type: docs 📋

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 40.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and accurately summarizes the main change: adding structured retry.* events to improve observability.
Linked Issues check ✅ Passed The PR fully addresses all coding requirements from issue #215: structured retry events, schema validation, op threading, elapsed_ms tracking, and comprehensive test coverage.
Out of Scope Changes check ✅ Passed All changes are directly scoped to implementing structured retry.* events: schema definition, retry logic refactoring, op threading through call sites, tests, and documentation.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@chrisleekr

chrisleekr commented Jun 11, 2026

Copy link
Copy Markdown
Owner Author

bot workflow review, succeeded

🔍 Code review complete, 13 files, +529/-51.

Summary

PR #225 replaces retryWithBackoff's text-only log lines with a four-event structured family (retry.attempt_failed / .non_retriable / .exhausted / .succeeded_after_retry) pinned by a .strict() Zod schema, and threads an op tag through 20 call sites in src/ (12 in the second commit per the PR description, plus a handful of new sites added in the first commit). The schema lives in a new src/utils/retry-log-fields.ts co-located with a 173-line test that drift-checks every field name and accepted value. The retry hot path stays config-free (no src/logger / src/config import), so the stdio MCP servers that use retryWithBackoff keep their JSON-RPC stdout clean — the default logger writes to stderr.

Overall verdict: approve modulo two non-blocking suggestions. Correctness is solid, the schema is well-shaped (no drift can land silently), the op coverage of all 20 call sites is complete, and the observability docs are updated in lockstep. Two non-blocking items below: one Minor (missing status on the attempt_failed event for retriable HTTP errors) and one Nit (op-tag naming convention drift). No blockers, no majors.

What was checked

Files read in full:

  • src/utils/retry.ts — the heart of the PR; verified the structured emits, the delay_ms conditional spread on the final attempt, the attempt > 1 gate on succeeded_after_retry, and the extracted isNonRetriable branch logic.
  • src/utils/retry-log-fields.ts — schema strictness, optional delay_ms / status, intentional non-enumeration of err (mirrors GithubApiLogFieldsSchema).
  • test/utils/retry-log-fields.test.ts — 12 cases covering accept + reject of delayMs / elapsedMs / maxAttempts typos, unknown event literals, extra fields, negative elapsed_ms, non-integer attempt, missing/empty op.
  • test/utils/retry.test.ts — 5 new structured-event integration tests for the four emit branches plus the silent-first-try success case.
  • All 12 caller files: src/core/fetcher.ts, src/core/pipeline.ts (4 sites), src/github/state-fetchers.ts (7 sites), src/workflows/ship/probe.ts (2 sites), src/mcp/servers/comment.ts, src/mcp/servers/inline-comment.ts (2 sites), src/mcp/servers/merge-readiness.ts, src/mcp/servers/resolve-review-thread.ts (2 sites).
  • docs/operate/observability.md — added 5 field rows under "Common log fields" + new "Retry log fields" H2 with the event table and the count(event = "retry.succeeded_after_retry") by op alert recipe.

Cross-references performed:

  • rg -n "retryWithBackoff\(" src/ → 20 call sites, every one carries an op tag in this PR (zero forgotten).
  • Schema vs emit-site field-name parity: every event / op / attempt / max_attempts / elapsed_ms / delay_ms / status key in retry.ts matches the snake_case schema; no delayMs / elapsedMs / maxAttempts typos leaked through.
  • RETRY_LOG_EVENTS constants vs the four emit sites — all four use the constant, none use the literal string.
  • isNonRetriable truth table: undefined status → retry, 5xx → retry, 429 → retry, 4xx with "secondary rate limit" in message → retry, other 4xx → throw immediately. Matches the existing test fixtures in test/utils/retry.test.ts and the GitHub secondary-rate-limit contract.
  • The succeeded_after_retry gate on attempt > 1 — confirmed first-try successes stay silent so the event rate tracks the body of the transient-failure distribution, not normal traffic (matches the comment on src/utils/retry.ts:122-124).
  • The validateNumberOption guard against NaN/Infinity bypass — confirmed !Number.isFinite(value) catches NaN < 1 === false before the value < opts.min comparison can silently accept it.
  • The config-free default logger contract — verified src/utils/retry.ts:1-5 only imports pino, a type-only Logger, the redaction primitives, and the new event constants; no src/logger or src/config runtime import, so stdio MCP servers stay config-free.

Findings

[minor] retry.attempt_failed drops status for retriable HTTP errors

File: src/utils/retry.ts:162-173

The non_retriable branch at src/utils/retry.ts:142-154 reads (error as { status?: number }).status and includes it in the log payload; the sibling attempt_failed branch does not. The Zod schema at src/utils/retry-log-fields.ts:48 already accepts status on every event in the family, so adding it is a one-liner. The consequence is that an operator graphing transient-failure rate with the new schema cannot break attempt_failed down by HTTP status — they cannot distinguish a 503 surge from a 429 surge from a connection-reset surge without falling back to parsing the err serializer payload. This undermines one of the stated PR goals (per-call-site retry telemetry slicing).

Suggested fix: Read status once at the top of the catch block and spread it into both non_retriable and attempt_failed payloads. Schema field is already optional, so undefined just omits the key.

Inline: #225 (comment)

[nit] Op-tag naming mixes pure-dotted and camelCase-segment conventions

File: src/mcp/servers/inline-comment.ts:118 (representative; pattern repeats across the PR)

The 20 op values split into two styles:

  • Pure dotted-lowercase: github.fetch, github.review.followup, ship.probe.main, mcp.comment.update.
  • camelCase-mid-segment: mcp.inlineComment.fetchPr, mcp.inlineComment.createReviewComment, mcp.mergeReadiness.probe, mcp.resolveReviewThread.preflight, mcp.resolveReviewThread.mutate, github.state.prStateCheckRollup (× 7 in src/github/state-fetchers.ts), trackingComment.create / .finalize / .finalize.failure (in src/core/pipeline.ts), and ship.probe.reviewThreads.

The alert recipe added in docs/operate/observability.md (count(event = "retry.succeeded_after_retry") by op) is unaffected because exact-string grouping works regardless. But a future operator wanting op =~ "mcp\\..*" is fine, while one wanting "all comment-mutation ops" gets trackingComment.* vs mcp.comment.* vs mcp.inlineComment.* and has to disjunct. Pure cosmetics today; the inconsistency tax compounds as ops accumulate.

Suggested fix: Pick one convention and pin it in the new "Retry log fields" section of docs/operate/observability.md. Lowercase-dotted segments (mcp.inline_comment.fetch_pr, tracking_comment.create, github.state.pr_state_check_rollup) matches both the pipeline.stage event names already in that file and the retry.* event names this PR adds, so future call sites converge on the existing site convention.

Inline: #225 (comment)

Reasoning

A handful of "this could have looked sketchy but isn't" calls I made during the review:

  • succeeded_after_retry is emitted with attempt = the successful attempt number, not the retry count. Initially this looked like it could be confused with http.request.resend_count (which is the resend count, not the attempt ordinal — they differ by 1). The doc comment on src/utils/retry-log-fields.ts:13-14 explicitly calls out that attempt aligns with OpenTelemetry http.request.resend_count, but the alignment is conceptual ("retry ordinal", not numeric). I checked the four emit sites and they all use the same attempt semantics (1-based attempt counter), so the family is internally consistent. The doc could be tightened to say "1-based attempt ordinal, equivalent to http.request.resend_count + 1", but that's a comment-only tweak, not worth an inline finding.

  • delay_ms is the next backoff that will be slept, not the delay that preceded this attempt. I confirmed this by reading src/utils/retry.ts:161-178: delayMs is updated after the sleep (delayMs = Math.min(delayMs * backoffFactor, maxDelayMs)), and the emit happens before the sleep. So a retry.attempt_failed line with delay_ms: 5000 means "this attempt failed, and the retry handler will sleep 5s before attempt N+1". The doc comment on src/utils/retry-log-fields.ts:14-17 matches. Good as is.

  • err: lastError on the emit payloads is intentionally not enumerated in the Zod schema (src/utils/retry-log-fields.ts:30-34). I cross-checked the rationale against src/utils/octokit-observability.ts GithubApiLogFieldsSchema (per the file-level doc comment) and confirmed the parity. The err field is serialized through pino's errSerializer (shared via src/utils/log-redaction.ts), and enumerating it in the schema would force the test to construct full Error objects to satisfy .strict(). The current pattern is correct.

  • The defaultLog writes to stderr, not stdout. Initially this looked like a regression risk — most operators expect pino logs on stdout — but the doc comment on src/utils/retry.ts:16-19 explicitly explains that stdio MCP servers speak JSON-RPC on stdout, so a default-path retry warning on stdout would corrupt the protocol. stderr is safe in every context (k8s ships both streams, warn/error-to-stderr is conventional). Good as is.

  • The op defaults to "unknown" when omitted. I verified this is desirable by checking the schema constraint at src/utils/retry-log-fields.ts:43 (op: z.string().min(1)); an empty string would fail validation, but "unknown" satisfies it, so existing callers that haven't been threaded yet (zero in this PR — all 20 sites are covered) would still emit valid events. The default is a useful safety net for future call sites that forget to set one.

  • isNonRetriable reads status from the raw error object ((error as { status?: number }).status), not from the normalized Error. This was worth double-checking because error instanceof Error ? error : new Error(String(error)) would strip a .status from a non-Error throw. The function takes both error (raw) and normalized (Error-wrapped) and reads .status from raw, message from normalized, which is correct: Octokit's wrapped HTTP errors are Error instances with a .status property, so reading from the raw value preserves it; non-Error throws have no .status, so the status === undefined check returns false and we retry, which is the safe default.

No blockers. The two non-blocking items above are the only deltas I'd want addressed before another reviewer eyeballs this.

cost: $4.1810 · turns: 44 · duration: 557s

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR upgrades retryWithBackoff observability from message-only logging to a schema-pinned, structured retry.* event family with a stable field contract and per-call-site attribution via an op tag, enabling reliable alerting and breakdowns of transient-failure behavior across upstream integrations.

Changes:

  • Added RETRY_LOG_EVENTS + RetryLogFieldsSchema (Zod .strict()) for the retry.* event family and introduced tests to prevent field-name drift.
  • Updated retryWithBackoff to emit structured retry.attempt_failed, retry.non_retriable, retry.exhausted, and retry.succeeded_after_retry events (gated to attempt > 1), plus threaded op through multiple call sites.
  • Documented the new retry event schema and recommended alerting queries in the observability docs.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
test/utils/retry.test.ts Adds behavioral tests asserting structured retry event emission and gating behavior.
test/utils/retry-log-fields.test.ts Adds schema tests ensuring strict rejection of drift (unknown fields, camelCase typos, invalid scalars).
src/workflows/ship/probe.ts Threads op into retry-wrapped GraphQL probe calls.
src/utils/retry.ts Emits structured retry.* events, adds op option, and refactors non-retriable classification helper.
src/utils/retry-log-fields.ts Introduces canonical retry event constants and strict Zod schema for emitted fields.
src/mcp/servers/resolve-review-thread.ts Threads op into retry-wrapped GraphQL preflight and mutation calls.
src/mcp/servers/merge-readiness.ts Threads op into retry-wrapped probe call.
src/mcp/servers/inline-comment.ts Threads op into retry-wrapped PR fetch and review-comment creation.
src/mcp/servers/comment.ts Threads op into retry-wrapped comment update call.
src/github/state-fetchers.ts Threads op into retry-wrapped GitHub state fetchers for attribution.
src/core/pipeline.ts Threads op into retry-wrapped tracking-comment and GitHub fetch operations.
src/core/fetcher.ts Threads op into retry-wrapped review comment pagination follow-up.
docs/operate/observability.md Documents retry events/fields and adds operator guidance for monitoring success-after-retry.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/utils/retry.ts
Comment thread src/utils/retry-log-fields.ts Outdated
Comment thread docs/operate/observability.md Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/utils/retry.ts (1)

95-102: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Enforce non-empty op at runtime to preserve the retry event contract.

op defaults to "unknown", but callers can still pass "" (or whitespace), producing retry.* payloads that violate the schema/documented contract (op must be non-empty). Validate and normalize op before emitting logs.

Suggested fix
   const {
@@
-    op = "unknown",
+    op = "unknown",
   } = options;
+  const normalizedOp = op.trim();
+  if (normalizedOp.length === 0) {
+    throw new Error("retryWithBackoff: op must be a non-empty string when provided");
+  }
@@
-            op,
+            op: normalizedOp,
@@
-            op,
+            op: normalizedOp,
@@
-          op,
+          op: normalizedOp,
@@
-      op,
+      op: normalizedOp,
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/utils/retry.ts` around lines 95 - 102, Normalize and validate the
destructured op from options at runtime: after the options destructuring in
retry (use the local symbol op) trim whitespace and if the result is empty
replace it with "unknown" (or throw if you prefer stricter behavior) so any
subsequent calls that emit retry.* events or call log/defaultLog always get a
non-empty op; update any places that emit retry events or construct payloads to
use this normalized op.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/utils/retry-log-fields.ts`:
- Around line 35-50: The schema RetryLogFieldsSchema uses z.object(...).strict()
— update it to the Zod v4 idiom by replacing the .object(...).strict() pattern
with z.strictObject({...}) so the same keys (event, op, attempt, max_attempts,
elapsed_ms, delay_ms, status) and validators are preserved; locate and refactor
the RetryLogFieldsSchema definition to call z.strictObject with the same
property validators instead of chaining .object(...).strict().

---

Outside diff comments:
In `@src/utils/retry.ts`:
- Around line 95-102: Normalize and validate the destructured op from options at
runtime: after the options destructuring in retry (use the local symbol op) trim
whitespace and if the result is empty replace it with "unknown" (or throw if you
prefer stricter behavior) so any subsequent calls that emit retry.* events or
call log/defaultLog always get a non-empty op; update any places that emit retry
events or construct payloads to use this normalized op.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 81c4a0b5-1992-4b3a-adda-dd8e40d597b3

📥 Commits

Reviewing files that changed from the base of the PR and between cbff565 and 1181d77.

📒 Files selected for processing (13)
  • docs/operate/observability.md
  • src/core/fetcher.ts
  • src/core/pipeline.ts
  • src/github/state-fetchers.ts
  • src/mcp/servers/comment.ts
  • src/mcp/servers/inline-comment.ts
  • src/mcp/servers/merge-readiness.ts
  • src/mcp/servers/resolve-review-thread.ts
  • src/utils/retry-log-fields.ts
  • src/utils/retry.ts
  • src/workflows/ship/probe.ts
  • test/utils/retry-log-fields.test.ts
  • test/utils/retry.test.ts

Comment thread src/utils/retry-log-fields.ts Outdated
Comment thread src/utils/retry.ts Outdated
Comment thread src/mcp/servers/inline-comment.ts Outdated
@chrisleekr

chrisleekr commented Jun 11, 2026

Copy link
Copy Markdown
Owner Author

bot workflow resolve, incomplete

🔎 Resolve incomplete, outstanding items remain after the agent finished.

Outstanding

None. All review threads resolved, CI green, PR is CLEAN + MERGEABLE. Ready for human merge.

cost: $9.9620 · turns: 116 · duration: 1897s

…ntion (#225)

Address PR review pushback on #225:

- retry.ts: normalize op via a small normalizeOp(op) helper so empty /
  whitespace-only / non-string op falls back to "unknown", holding the
  op: z.string().min(1) contract regardless of caller.
- retry.ts: always read status from the raw error and spread it into the
  retry.attempt_failed payload (was missing); operators can now slice
  transient-failure rate by HTTP status without parsing err.
- retry-log-fields.ts: refactor to z.discriminatedUnion of strictObject
  branches so per-event field presence is pinned, non_retriable requires
  status, only attempt_failed may carry delay_ms, and exhausted /
  succeeded_after_retry carry neither. Adopts Zod v4 idiomatic
  z.strictObject already used elsewhere in the repo.
- Rename camelCase op-tag segments to snake_case (12 sites in pipeline,
  state-fetchers, ship probe, and three MCP servers) so all 20 op tags
  follow lowercase-dotted segments. Documented under "Retry log fields"
  in observability.md.
- docs/operate/observability.md: add status row, document op convention,
  and refresh the retry-event table to reflect per-event field constraints.
- docs/use/workflows/ship.md: note that ship probe wraps its GraphQL calls
  in retryWithBackoff with ship.probe.main / ship.probe.review_threads
  op tags (satisfies docs-sync guard for FR-019).
- Tests: +6 schema cases for per-event field constraints, +5 behaviour
  cases for op normalisation and attempt_failed status parity.

Co-authored-by: chrisleekr-bot[bot] <chrisleekr-bot[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat(observability): add structured retry.* events + retry.succeeded_after_retry to retryWithBackoff

2 participants