feat(observability): add structured retry.* events + retry.succeeded_after_retry to retryWithBackoff

## Finding

`retryWithBackoff` (`src/utils/retry.ts:83`) is the single chokepoint guarding every transient-failure recovery in the bot: it wraps **12 call sites** spanning the GraphQL fetcher, every GitHub-touching MCP server, the orchestrator triage probe, the ship probe, and the pipeline's three GitHub writes (tracking-comment create, finalize, and the failure-path finalize at `src/core/pipeline.ts:303`, `:317`, `:465`, `:577`). Despite that load-bearing role, its telemetry has not migrated to the structured-`event` convention adopted across the rest of observability over the last several months (`pipeline.stage` in #166, `github.api.*` in #170, `dispatcher.offer.*` in #187, `queue_wait_ms` in #200): all three terminal emissions are plain text messages with no `event` field — `log.warn({ attempt, status, err }, "Non-retriable error, throwing immediately")` at `src/utils/retry.ts:128`, `log.warn({ attempt, maxAttempts, err }, "Operation attempt failed")` at `src/utils/retry.ts:136`, and `log.error({ maxAttempts }, "Operation failed after all attempts")` at `src/utils/retry.ts:146`.

The bigger blind spot is **success-after-retry**: the early-return path at `src/utils/retry.ts:109` (`return await operation()`) emits nothing. A call that fails on attempt 1 with a 502 and succeeds on attempt 2 produces exactly one warn-level `Operation attempt failed` line — but no signal that the operation eventually succeeded, what its total wall-clock was, or that the system absorbed transient failure on the caller's behalf. This is the weakly-flaky signal that compounds: a 1 % per-call upstream flake on a 100-call/hour comment MCP server is one retry per hour today, invisible; the same flake on a 1,000-call/hour fleet is 10 retries per hour and still invisible, until the upstream degrades further and the warn-level `attempt failed` lines spike with no baseline to compare against. There is also no `op` field on any line, so an operator grepping `attempt: 2` cannot tell whether the flake was `pulls.get`, `issues.updateComment`, the triage Bedrock call, or the GraphQL paginator — every call site collapses into one anonymous stream. Finally, the existing fields omit two operationally useful scalars the loop already has in scope: `delay_ms` (the next backoff that will be slept at `src/utils/retry.ts:140`–`:141`) and a cumulative `elapsed_ms`, which would let an exhaustion line carry the full retry-window duration without parsing.

Concretely: extend `RetryOptions` (`src/utils/retry.ts:32`) with an `op?: string`; pin the four-event family with a `.strict()` Zod schema in a new `src/utils/retry-log-fields.ts` (mirroring `GithubApiLogFieldsSchema` in `src/utils/octokit-observability.ts:39`); emit `retry.attempt_failed` (warn) at the existing `:136` line, `retry.non_retriable` (warn) at `:128`, `retry.exhausted` (error) at `:146`, and the new `retry.succeeded_after_retry` (info, gated on `attempt > 1`) on the early-return path. Every event carries `event`, `op`, `attempt`, `max_attempts`, and `elapsed_ms`; `attempt_failed` adds the next `delay_ms`; `non_retriable` adds `status`. The `attempt` field aligns with the OpenTelemetry HTTP `http.request.resend_count` semantic so an OTel exporter can map directly later.

## Diagram

```mermaid
flowchart TD
 Start([retryWithBackoff op]):::path --> Attempt[Attempt N]:::path
 Attempt -->|success| Success[Return value]:::path
 Attempt -->|throws| Classify{Classify error}:::path
 Classify -->|non-retriable 4xx| Nonret[Throw to caller]:::path
 Classify -->|retriable or 429| Decide{attempt less than max?}:::path
 Decide -->|yes| Waitloop[Sleep delay_ms backoff doubles]:::path
 Waitloop --> Attempt
 Decide -->|no| Exhaust[Throw lastError]:::path

 Success -.->|attempt greater than 1: SILENT today| MissedEvt[retry.succeeded_after_retry NEW info-level event]:::silent
 Attempt -.->|today: Operation attempt failed no event field| FailEvt[retry.attempt_failed + op, delay_ms, elapsed_ms]:::proposed
 Nonret -.->|today: Non-retriable error no event field| NREvt[retry.non_retriable + op, status]:::proposed
 Exhaust -.->|today: Operation failed after all no event field| ExhEvt[retry.exhausted + op, elapsed_ms]:::proposed

 classDef silent fill:#922b21;color:#ffffff
 classDef proposed fill:#196f3d;color:#ffffff
 classDef path fill:#1f618d;color:#ffffff
```

## Rationale

Three observability properties this closes that no other event currently provides:

1. **Weak-flake visibility.** `retry.succeeded_after_retry` is the missing primitive for measuring upstream stability *without* waiting for full failure. Today a 1 % transient-failure rate against GitHub or Bedrock is invisible at the default `info` level: the only retry telemetry is warn/error, which fires only on the long tail. The success-after-retry event makes the body of the transient-failure distribution observable. The recurring `retry.succeeded_after_retry` count per `op` is the canonical leading indicator for "the upstream is starting to wobble," analogous to the cache hit-ratio alert documented at `docs/operate/observability.md:55-63` for prompt-cache regressions.

2. **Per-call-site attribution.** Without the `op` field, the existing warn lines from 12 call sites are indistinguishable — an alert on `Operation attempt failed` count cannot tell `pulls.get` flakes from `graphql.paginate` flakes from `bedrock.invoke` flakes. With `op` pinned, an operator dashboard can break retry rate down per upstream and route the alert to the team that owns each one. This matches the per-route attribution `github.api.request` already provides via its `route` field (`src/utils/octokit-observability.ts:91`).

3. **Schema-pinned consistency.** A `.strict()` Zod schema co-located with the emitter (the existing pattern in `src/core/log-fields.ts:29` and `src/utils/octokit-observability.ts:39`) prevents the silent field-name drift that issue #175 closed for child-logger fields. Without the schema, the four events would be a fresh drift surface — easy to mistype `delayMs` vs `delay_ms` or `elapsedMs` vs `elapsed_ms` across the four emit sites.

The change is feasible in isolation: ~40 lines added to `src/utils/retry.ts`, one new ~25-line `src/utils/retry-log-fields.ts`, no new npm dependencies, no schema changes, no migrations. Backwards-compatible: each call site can opt in to the `op` field at its own pace; the events still fire (with `op: "unknown"` or omitted) for sites that have not been threaded yet, and the existing message-text remains attached as the pino `msg` so log shippers keyed on the text still match.

## References

**Internal**:
- `src/utils/retry.ts:83#retryWithBackoff` — the chokepoint and its three unstructured emit sites at `:128`, `:136`, `:146`; the silent early-return at `:109`.
- `src/utils/octokit-observability.ts:39#GithubApiLogFieldsSchema` — the `.strict()` Zod-schema pattern this would mirror.
- `src/core/log-fields.ts:29#PipelineStageLogSchema` — sibling structured-event schema for the core pipeline.
- `src/core/pipeline.ts:303,317,465,577` — four `retryWithBackoff` call sites whose `op` would become `trackingComment.create`, `github.fetch`, `trackingComment.finalize`, and `trackingComment.finalize.failure`.
- `docs/operate/observability.md:42-52` — the existing "Core pipeline log fields" section that the new retry table would slot next to.
- Closed issue #199 — wired `retryWithBackoff` through the MCP/state-fetcher surface but did not address telemetry shape.
- Closed issue #184 — established the config-free default logger in `retry.ts:19-26`, which this change extends.

**External**:
- [OpenTelemetry semantic conventions for HTTP spans](https://opentelemetry.io/docs/specs/semconv/http/http-spans/) — defines `http.request.resend_count` as the ordinal attribute for retries; the proposed `attempt` field maps 1:1.
- [Monitoring API Call Retries (Qoala Engineering)](https://medium.com/qoala-engineering/monitoring-api-call-retries-b6b86e221c1c) — argues for emitting on both retry-failed and retry-succeeded paths with descriptive tags, the exact gap addressed here.
- [Structured Logging Best Practices (uptrace)](https://uptrace.dev/glossary/structured-logging) — schema-first event design and the value of stable, machine-queryable event names.

## Suggested Next Steps

1. Add `src/utils/retry-log-fields.ts` with `RETRY_LOG_EVENTS` and a `.strict()` `RetryLogFieldsSchema` (Zod), mirroring the shape of `GithubApiLogFieldsSchema`.
2. Extend `RetryOptions` in `src/utils/retry.ts:32` with `op?: string`; capture `startedAt = Date.now()` at function entry and thread `elapsed_ms = Date.now() - startedAt` into each emit site.
3. Replace the three unstructured `log.warn` / `log.error` calls at `:128`, `:136`, `:146` with object-first emits carrying the new fields; add the new `log.info({ event: "retry.succeeded_after_retry", op, attempt, ... })` immediately before the early-return at `:109`, gated on `attempt > 1`.
4. Thread the `op` argument through the 12 known call sites (start with `src/core/pipeline.ts`, `src/core/fetcher.ts`, and the five MCP servers; the orchestrator triage and ship probe can follow).
5. Add a co-located `src/utils/retry.test.ts` that round-trips a captured pino line through `RetryLogFieldsSchema.parse(...)` for each of the four events (the same drift-prevention pattern `src/core/log-fields.test.ts` uses).
6. Add a "Retry log fields" section to `docs/operate/observability.md` between "GitHub API rate-limit fields" (line 65) and "Output secret-guard log events" (line 77).
7. Add a baseline alert recipe: `count(event = "retry.succeeded_after_retry") by op` over 5-minute windows, with deviation from baseline as the wobble signal.

## Areas Evaluated

- `src/logger.ts`, `src/utils/log-redaction.ts` — root pino setup, redaction primitives, child-logger contract.
- `src/utils/retry.ts` — full review of the chokepoint, all four emit/no-emit sites, and `RetryOptions`.
- `src/utils/octokit-observability.ts` — the most recent structured-event implementation; pattern to mirror.
- `src/core/log-fields.ts`, `docs/operate/observability.md` — existing structured-event schema and its documentation layout.
- 12 `retryWithBackoff` call sites across `src/core/`, `src/mcp/servers/`, `src/orchestrator/`, `src/workflows/ship/`, `src/github/`.
- Closed research issues #164, #166, #170, #172, #174, #175, #182, #184, #187, #188, #192, #199, #200 — confirmed no existing or closed issue covers retry-event structure or success-after-retry telemetry.
- External: OpenTelemetry HTTP semantic conventions; structured-logging best-practice articles (uptrace, Qoala, openobserve).

*Generated by the scheduled research action on 2026-06-03*

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(observability): add structured retry.* events + retry.succeeded_after_retry to retryWithBackoff #215

Finding

Diagram

Rationale

References

Suggested Next Steps

Areas Evaluated

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

feat(observability): add structured retry.* events + retry.succeeded_after_retry to retryWithBackoff #215

Description

Finding

Diagram

Rationale

References

Suggested Next Steps

Areas Evaluated

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions