Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
21 changes: 21 additions & 0 deletions docs/operate/observability.md
Original file line number Diff line number Diff line change
Expand Up @@ -38,6 +38,12 @@ The crash path is covered too. `installFatalHandlers(processName)` in `src/logge
| `stage` | Pipeline stage name on a `pipeline.stage` event (e.g. `github.fetch`, `repo.clone`, `executor.invoke`). |
| `delta_ms` | Wall-clock of a single `pipeline.stage` (integer ms). |
| `pipeline_wall_clock_ms` | Cumulative pipeline duration, on the terminal `pipeline.completed` / `pipeline.failed` line (integer ms). |
| `op` | Short dotted identifier on a `retry.*` event identifying the wrapped call site (e.g. `github.fetch`, `mcp.comment.update`). Lowercase-dotted segments with `snake_case` inside each segment, see [Retry log fields](#retry-log-fields) for the naming convention. |
| `attempt` | 1-based attempt ordinal on a `retry.*` event. Aligns with the OpenTelemetry `http.request.resend_count` semantic. |
| `max_attempts` | Maximum-attempts ceiling for the current `retryWithBackoff` call. |
| `elapsed_ms` | Wall-clock since `retryWithBackoff` entry on a `retry.*` event (integer ms). On `retry.exhausted` it carries the full retry-window duration. |
| `delay_ms` | Next backoff that will be slept after a `retry.attempt_failed`. Omitted on the final attempt because no sleep follows. |
| `status` | HTTP status code lifted from the raw error. Present on `retry.non_retriable` (always, since the branch only fires for 4xx) and on `retry.attempt_failed` when the failure carried one (e.g. 503, 429); also on `github.api.*` events. |

## Core pipeline log fields

Expand Down Expand Up @@ -74,6 +80,21 @@ Volume policy: the per-request line is `debug` (default `info` stays quiet on a
| `github.api.rate_limit_low` | warn | Same fields; emitted once `rate_limit_remaining` drops below `RATE_LIMIT_LOW_WATER` (500). |
| `github.api.rate_limit_warning` | warn | `route`, `status`, `retry_after_s`; on a 429 or 403 secondary-rate-limit response. |

## Retry log fields

`retryWithBackoff` (`src/utils/retry.ts`) is the single chokepoint guarding every transient-failure recovery in the bot, including the GraphQL fetcher, every GitHub-touching MCP server, the orchestrator triage probe, the ship probe, and the pipeline's three GitHub writes. Its four-event family is pinned by a `z.discriminatedUnion` of strict objects in `src/utils/retry-log-fields.ts` so an emitter that adds an unpinned field, mistypes one (e.g. `delayMs` vs `delay_ms`), or attaches a field to the wrong event (e.g. `delay_ms` on `retry.exhausted`, `status` on `retry.succeeded_after_retry`) trips the co-located test. Every event carries `op` (a short dotted identifier from the call site), `attempt`, `max_attempts`, and `elapsed_ms`. See [Common log fields](#common-log-fields) for the scalar field definitions.

**`op` naming convention.** Lowercase-dotted segments with `snake_case` inside each segment, e.g. `mcp.inline_comment.fetch_pr`, `github.state.pr_state_check_rollup`, `tracking_comment.create`. This matches the `pipeline.stage` event-name style already used elsewhere in this doc and keeps `op =~ "mcp\\..*"`-style operator groupings regular as new call sites are threaded. New `retryWithBackoff` call sites should follow it. Empty / whitespace-only `op` is normalised to `"unknown"` at the entry of `retryWithBackoff` so the non-empty `op` contract holds even when a caller threads an unexpected value.

The load-bearing event is `retry.succeeded_after_retry`: it is the only signal in the bot's telemetry today that an upstream is starting to wobble _before_ full failure. A 1% transient-failure rate against GitHub or Bedrock is invisible at the default `info` level otherwise, the only retry telemetry is warn/error, which fires only on the long tail. The `info`-level `succeeded_after_retry` line makes the body of the transient-failure distribution observable; an alert on `count(event = "retry.succeeded_after_retry") by op` over 5-minute windows surfaces the leading indicator without waiting for the warn-level `attempt_failed` count to spike.

| `event` | Level | When |
| ----------------------------- | ----- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `retry.attempt_failed` | warn | An attempt threw a retriable error (5xx, 429, or 403 secondary rate-limit). Carries `delay_ms` when another attempt will follow (omitted on the final attempt) and `status` when the underlying error carried one (HTTP errors; absent for non-HTTP errors like connection resets). |
| `retry.non_retriable` | warn | A 4xx (except 429 and 403 secondary rate-limit) bypasses retry. Always carries `status` (the branch only fires when the raw error carried a 4xx) and rethrows. |
| `retry.exhausted` | error | All `max_attempts` attempts failed. Carries the full retry-window `elapsed_ms`; rethrows the last error. Neither `delay_ms` nor `status` are emitted on this event. |
| `retry.succeeded_after_retry` | info | The call succeeded on `attempt > 1`. Weak-flake leading indicator: gated on `attempt > 1` so first-try successes stay silent. Alert on `count(...) by op` over 5-minute windows. Neither `delay_ms` nor `status` are emitted on this event. |

## Output secret-guard log events

`safePostToGitHub` (`src/utils/github-output-guard.ts`) is the output-side chokepoint for every byte sent to GitHub. It emits structured `warn`/`error` events when the regex pass or the optional LLM scanner acts on a body. Per the logging contract, none of these carry the matched bytes, surrounding context, or a hash, only `kinds`, counts, lengths, `callsite`, and `deliveryId`.
Expand Down
2 changes: 2 additions & 0 deletions docs/use/workflows/ship.md
Original file line number Diff line number Diff line change
Expand Up @@ -112,6 +112,8 @@ Other non-readiness reasons (`failing_checks`, `behind_base`, `pending_checks`,

The `chat-thread` executor's GitHub-state tool calls (CI rollup, check output, branch protection, PR diff/files, comments) go through `dispatchGithubStateTool`, whose fetchers wrap every octokit call in `retryWithBackoff`. A transient GitHub API blip (5xx, 429, or a secondary rate limit) is retried up to three times with exponential backoff and a deliveryId-correlated retry-warning log, instead of surfacing to the model as a tool error it would have to recover from semantically (issue #199).

The merge-readiness probe (`src/workflows/ship/probe.ts`) follows the same pattern. Both its main `PROBE_QUERY` call and the `paginateReviewThreads` follow-up are wrapped in `retryWithBackoff`, carrying `op: "ship.probe.main"` and `op: "ship.probe.review_threads"` respectively so probe-side retries can be sliced out from the surrounding fleet in observability dashboards (see [Retry log fields](../../operate/observability.md#retry-log-fields)). A single GraphQL rate-limit or network blip is recovered in-place rather than tearing down the verdict and yielding the session.

## Re-triggering

Re-applying the `bot:ship` label or re-commenting `bot:ship` on the same PR while a session is **active** is a no-op. Re-applying after the session is **terminal** starts a fresh session: the prior `ship_intents` row is preserved for audit.
Expand Down
2 changes: 1 addition & 1 deletion src/core/fetcher.ts
Original file line number Diff line number Diff line change
Expand Up @@ -441,7 +441,7 @@ async function fetchRemainingReviewComments(
reviewId,
cursor: startCursor,
}),
{ log },
{ log, op: "github.review.followup" },
);
return result.node?.comments.nodes ?? [];
} catch (err) {
Expand Down
4 changes: 4 additions & 0 deletions src/core/pipeline.ts
Original file line number Diff line number Diff line change
Expand Up @@ -304,6 +304,7 @@ export async function runPipeline(
maxAttempts: 3,
initialDelayMs: 1000,
log: ctx.log,
op: "tracking_comment.create",
}),
);
}
Expand All @@ -318,6 +319,7 @@ export async function runPipeline(
maxAttempts: 3,
initialDelayMs: 2000,
log: ctx.log,
op: "github.fetch",
}),
);

Expand Down Expand Up @@ -468,6 +470,7 @@ export async function runPipeline(
maxAttempts: 3,
initialDelayMs: 1000,
log: enrichedCtx.log,
op: "tracking_comment.finalize",
},
),
);
Expand Down Expand Up @@ -587,6 +590,7 @@ export async function runPipeline(
maxAttempts: 3,
initialDelayMs: 1000,
log: ctx.log,
op: "tracking_comment.finalize.failure",
},
);
} catch (commentError) {
Expand Down
14 changes: 7 additions & 7 deletions src/github/state-fetchers.ts
Original file line number Diff line number Diff line change
Expand Up @@ -118,7 +118,7 @@ export async function getPrStateCheckRollup(
repo: deps.repo,
number: prNumber,
}),
{ log: deps.log },
{ log: deps.log, op: "github.state.pr_state_check_rollup" },
);
const pr = data.repository?.pullRequest;
if (pr === null || pr === undefined) {
Expand Down Expand Up @@ -187,7 +187,7 @@ export async function getCheckRunOutput(
repo: deps.repo,
check_run_id: checkRunId,
}),
{ log: deps.log },
{ log: deps.log, op: "github.state.check_run_output" },
);
const text = result.data.output.text ?? "";
const truncated = truncate(text, MAX_CHECK_OUTPUT_BYTES);
Expand Down Expand Up @@ -216,7 +216,7 @@ export async function getWorkflowRun(deps: GithubStateDeps, runId: number): Prom
repo: deps.repo,
run_id: runId,
}),
{ log: deps.log },
{ log: deps.log, op: "github.state.workflow_run" },
);
return serialize({
id: result.data.id,
Expand Down Expand Up @@ -261,7 +261,7 @@ export async function getBranchProtection(deps: GithubStateDeps, branch: string)
throw err;
}
},
{ log: deps.log },
{ log: deps.log, op: "github.state.branch_protection" },
);
if (result === null) {
return serialize({ branch, protected: false });
Expand All @@ -285,7 +285,7 @@ export async function getPrDiff(deps: GithubStateDeps, prNumber: number): Promis
pull_number: prNumber,
mediaType: { format: "diff" },
}),
{ log: deps.log },
{ log: deps.log, op: "github.state.pr_diff" },
);
const diff = typeof result.data === "string" ? result.data : JSON.stringify(result.data);
const { text, truncated } = truncate(diff, MAX_DIFF_BYTES);
Expand All @@ -309,7 +309,7 @@ export async function getPrFiles(deps: GithubStateDeps, prNumber: number): Promi
pull_number: prNumber,
per_page: 100,
}),
{ log: deps.log },
{ log: deps.log, op: "github.state.pr_files" },
);
return serialize({
pr_number: prNumber,
Expand Down Expand Up @@ -340,7 +340,7 @@ export async function listPrComments(
per_page: 30,
page,
}),
{ log: deps.log },
{ log: deps.log, op: "github.state.pr_comments" },
);
const comments = result.data.map((c) => ({
id: c.id,
Expand Down
2 changes: 1 addition & 1 deletion src/mcp/servers/comment.ts
Original file line number Diff line number Diff line change
Expand Up @@ -113,7 +113,7 @@ server.tool(
comment_id: commentId,
body: bodyWithMarker,
}),
{ log },
{ log, op: "mcp.comment.update" },
);

return {
Expand Down
3 changes: 2 additions & 1 deletion src/mcp/servers/inline-comment.ts
Original file line number Diff line number Diff line change
Expand Up @@ -115,7 +115,7 @@ server.tool(
repo: REPO_NAME,
pull_number,
}),
{ log },
{ log, op: "mcp.inline_comment.fetch_pr" },
);
commitSha = pr.data.head.sha;
}
Expand All @@ -140,6 +140,7 @@ server.tool(

const result = await retryWithBackoff(() => octokit.rest.pulls.createReviewComment(params), {
log,
op: "mcp.inline_comment.create_review_comment",
});

return {
Expand Down
2 changes: 1 addition & 1 deletion src/mcp/servers/merge-readiness.ts
Original file line number Diff line number Diff line change
Expand Up @@ -87,7 +87,7 @@ server.tool(
repo: REPO_NAME,
number: pr_number,
}),
{ log },
{ log, op: "mcp.merge_readiness.probe" },
);
// The probe fetches the first 100 review threads only, sufficient for
// a scheduled action's own freshly-created PR. botPushedShas is empty:
Expand Down
4 changes: 2 additions & 2 deletions src/mcp/servers/resolve-review-thread.ts
Original file line number Diff line number Diff line change
Expand Up @@ -172,7 +172,7 @@ server.tool(
// still see prompt failures.
const preflight = await retryWithBackoff(
() => octokit.graphql<PreflightResponse>(GET_THREAD_QUERY, { threadId: thread_id }),
{ log },
{ log, op: "mcp.resolve_review_thread.preflight" },
);
const preflightPr = preflight.node?.pullRequest.number;
const preflightRepo = preflight.node?.pullRequest.repository;
Expand Down Expand Up @@ -208,7 +208,7 @@ server.tool(

const result = await retryWithBackoff(
() => octokit.graphql<ResolveResponse>(RESOLVE_MUTATION, { threadId: thread_id }),
{ log },
{ log, op: "mcp.resolve_review_thread.mutate" },
);
const thread = result.resolveReviewThread.thread;

Expand Down
98 changes: 98 additions & 0 deletions src/utils/retry-log-fields.ts
Original file line number Diff line number Diff line change
@@ -0,0 +1,98 @@
/**
* Canonical pino log-field schema for `retryWithBackoff` observability (issue #215).
*
* Mirrors `src/utils/octokit-observability.ts` and `src/core/log-fields.ts`: a
* strict Zod shape pins the structured `retry.*` event family so the four
* emit sites in `retryWithBackoff` cannot drift on a field name (e.g. `delayMs`
* vs `delay_ms`) without the co-located test catching it. Emitters log plain
* objects via `log.warn` / `log.info` / `log.error`; the schema is the
* drift-prevention contract, not a runtime validator on the hot path.
*
* `attempt` is the 1-based attempt ordinal (aligns with OpenTelemetry
* `http.request.resend_count`). `elapsed_ms` is wall-clock since
* `retryWithBackoff` entry, so an `exhausted` line carries the full retry
* window's duration without parsing prior lines. `delay_ms` is the *next*
* backoff that will be slept (set on `retry.attempt_failed` only when another
* attempt will follow; omitted on the final attempt because no sleep will
* occur).
*
* The schema is a `z.discriminatedUnion` on `event` so per-event field
* presence is pinned: `retry.non_retriable` requires `status` (the 4xx branch
* always has it), only `retry.attempt_failed` may carry `delay_ms`, and
* `retry.exhausted` / `retry.succeeded_after_retry` carry neither. Future
* emitter changes that drop a required field or attach `delay_ms` to the
* wrong event trip the co-located test.
*
* `op` follows lowercase-dotted segments with snake_case inside each segment
* (e.g. `mcp.inline_comment.fetch_pr`, `github.state.pr_state_check_rollup`,
* `tracking_comment.create`). The convention is documented under "Retry log
* fields" in `docs/operate/observability.md`; new call sites should follow it
* so future operator queries (`op =~ "mcp\\..*"`) stay regular.
*/
import { z } from "zod";

export const RETRY_LOG_EVENTS = {
attemptFailed: "retry.attempt_failed",
nonRetriable: "retry.non_retriable",
exhausted: "retry.exhausted",
succeededAfterRetry: "retry.succeeded_after_retry",
} as const;

/**
* Shared scalar shape across every `retry.*` event. `err` is intentionally
* not enumerated here: it is serialized via pino's existing `errSerializer`
* and remains attached to the line as-is, mirroring the existing
* `GithubApiLogFieldsSchema` pattern (which also does not enumerate `err`).
* Callers test the structured scalar fields; pino serializes the error.
*/
const baseFields = {
op: z.string().min(1),
attempt: z.number().int().positive(),
max_attempts: z.number().int().positive(),
elapsed_ms: z.number().int().nonnegative(),
} as const;

export const RetryLogFieldsSchema = z.discriminatedUnion("event", [
/**
* Warn-level emit on a retriable failure. `delay_ms` is present when
* another attempt will follow (and equals the next backoff sleep);
* omitted on the final attempt because no sleep occurs. `status` is
* present when the raw error carried one (HTTP errors) and absent for
* non-HTTP errors like connection resets.
*/
z.strictObject({
event: z.literal(RETRY_LOG_EVENTS.attemptFailed),
...baseFields,
delay_ms: z.number().int().nonnegative().optional(),
status: z.number().int().optional(),
}),
/**
* Warn-level emit on a 4xx (except 429 and 403 secondary-rate-limit) that
* bypasses retry. `status` is always present in this branch because
* `isNonRetriable` only returns `true` when a numeric 4xx status was
* read off the error.
*/
z.strictObject({
event: z.literal(RETRY_LOG_EVENTS.nonRetriable),
...baseFields,
status: z.number().int(),
}),
/**
* Error-level emit after `maxAttempts` retriable failures. No `delay_ms`
* (no further sleep), no `status` (last-error context lives on `err`).
*/
z.strictObject({
event: z.literal(RETRY_LOG_EVENTS.exhausted),
...baseFields,
}),
/**
* Info-level emit when an attempt succeeded after at least one prior
* failure. Gated on `attempt > 1` so first-try successes stay silent.
*/
z.strictObject({
event: z.literal(RETRY_LOG_EVENTS.succeededAfterRetry),
...baseFields,
}),
]);

export type RetryLogFields = z.infer<typeof RetryLogFieldsSchema>;
Loading
Loading