feat(observability): add duration_ms + github.api.slow to octokit hooks for per-request GitHub latency visibility

## Finding

Every webhook handler reaches GitHub at least twice (GraphQL fetch + tracking-comment update) and most paginated PR fetches in `src/core/fetcher.ts` fan out to four parallel `paginate(...)` calls, but the operator has zero per-request latency signal. The single octokit observability chokepoint, `installRateLimitHooks` in `src/utils/octokit-observability.ts:104-138`, registers `hook.after("request", ...)` and `hook.error("request", ...)` only: neither closes over a start timestamp, neither emits a `duration_ms`, and the `.strict()` `GithubApiLogFieldsSchema` at `src/utils/octokit-observability.ts:39-54` explicitly forbids any new field on the line, so a drop-in addition fails the co-located strict-schema test at `test/utils/octokit-observability.test.ts:109-118`.

The result is a latency blind spot at the layer that dominates end-to-end webhook lag. When `core/fetcher.ts` GraphQL pagination, `core/tracking-comment.ts` REST writes, or the `proposal-poller` reaction scans get slow, the existing `github.api.request` line still reads `status: 200` and the existing `dispatcher.offer.accepted` `offer_latency_ms` (`docs/operate/observability.md:121-126`) keeps reading healthy: the slow leg is the GitHub call, not the dispatcher round-trip, but neither event surfaces it. A `hook.wrap("request", ...)` closure that captures `start = Date.now()` and threads `duration_ms = Date.now() - start` into both `hook.after` and `hook.error` log lines closes the gap, mirrors the existing `offer_latency_ms` / `queue_wait_ms` `_ms` naming idiom, and extends (not replaces) the `installRateLimitHooks` plugin so the single `observableOctokit()` subclass wired in `src/app.ts:82-89` keeps owning every emitted line.

The same closure can fire a new `github.api.slow` warn event when `duration_ms >= SLOW_REQUEST_MS` (start at 3000ms — about 3x typical REST latency, well below the abuse-detection 10s search-API ceiling), giving an actionable alert on GitHub-side degradations without dialing every request to info volume.

## Diagram

```mermaid
flowchart LR
 Caller["pipeline / handler calls octokit.request or graphql"]
 Wrap["hook.wrap request start = Date.now"]
 Fetch["GitHub HTTP fetch REST / GraphQL"]
 After["hook.after request compute duration_ms = Date.now - start"]
 Schema["GithubApiLogFieldsSchema strict + duration_ms"]
 Decide{"duration_ms >= SLOW_REQUEST_MS or remaining < RATE_LIMIT_LOW_WATER?"}
 Slow["warn github.api.slow route, status, duration_ms, rate_limit fields"]
 Low["warn github.api.rate_limit_low route, status, duration_ms, rate_limit fields"]
 Req["debug github.api.request route, status, duration_ms, rate_limit fields"]
 ErrHook["hook.error request compute duration_ms on rate-limit error"]
 Warn["warn github.api.rate_limit_warning route, status, retry_after_s, duration_ms"]

 Caller --> Wrap
 Wrap --> Fetch
 Fetch -->|2xx 3xx 4xx response| After
 Fetch -->|throw 429 or 403 secondary| ErrHook
 After --> Schema
 Schema --> Decide
 Decide -->|slow| Slow
 Decide -->|low quota| Low
 Decide -->|healthy| Req
 ErrHook --> Warn

 Wrap:::added
 Slow:::added
 Decide:::added
 After:::current
 ErrHook:::current
 Req:::current
 Low:::current
 Warn:::current
 Schema:::current

 classDef added fill:#196f3d,color:#ffffff,stroke:#0e4d27
 classDef current fill:#1f4e79,color:#ffffff,stroke:#143d61
```

## Rationale

GitHub-side latency is the single largest variable component of `runPipeline` end-to-end time. `src/core/fetcher.ts` issues three parallel `paginate(...)` calls per PR (`PR_FIRST_QUERY`, `PR_COMMENTS_QUERY`, `PR_REVIEWS_QUERY`) plus a per-review `REVIEW_COMMENTS_QUERY` fan-out capped at `REVIEW_OVERFLOW_CONCURRENCY = 5` (`src/core/fetcher.ts:13-16`); a single slow page from any of them stalls the whole handler before the agent ever runs, but today the operator sees only the per-stage `pipeline.stage` line and has to guess whether the stall was network, GraphQL slowness, or local CPU. Adding `duration_ms` to the existing line — no new sink, no new schema, no extra log volume at default `info` — makes `p50` / `p95` / `p99` per-route directly queryable in the existing log shipper.

The `github.api.slow` warn variant is the operational lever. It mirrors how `github.api.rate_limit_low` already converts a per-request debug line into a warn once `remaining` crosses `RATE_LIMIT_LOW_WATER` (`src/utils/octokit-observability.ts:86-92`), reusing a pattern operators already know rather than introducing a new alert taxonomy. The pre-existing `dispatcher.offer.accepted` `offer_latency_ms` documentation block (`docs/operate/observability.md:234`) already calls out p99 tracking as the operating signal for slow daemon responses; symmetric coverage on GitHub's leg means an operator chasing a webhook-to-tracking-comment lag regression can finally tell whether the bottleneck is the daemon (`offer_latency_ms`) or GitHub (`duration_ms`) without ad-hoc instrumentation.

Cost is bounded: `hook.wrap` is the octokit-blessed wrapper hook (see octokit/core.js docs); `Date.now()` is constant-time; the per-request log line already exists, only the field set grows by one integer. The closed sibling issue #170 (`feat(observability): log octokit rate-limit headers via hook.after for per-installation quota visibility`) added the rate-limit half; the latency half is the natural follow-on within the same plugin.

## References

**Internal**:
- `src/utils/octokit-observability.ts:39-54` — `.strict()` `GithubApiLogFieldsSchema`, must gain a `duration_ms: z.number().int().nonnegative()` field.
- `src/utils/octokit-observability.ts:75-98` — `rateLimitFields()`, the pure builder that becomes the natural seam for accepting `durationMs` and re-emitting it on every event variant.
- `src/utils/octokit-observability.ts:104-138` — `installRateLimitHooks()`, the only place a `hook.wrap` / `hook.before` belongs (single chokepoint).
- `src/app.ts:82-89` — `App({ Octokit: observableOctokit() })`, the wiring that propagates the plugin to every `app.octokit` + `getInstallationOctokit()`.
- `src/orchestrator/connection-handler.ts:8` — second consumer of `observableOctokit()`; both pick up the new field for free.
- `src/core/fetcher.ts:13-78` — the four parallel paginated GraphQL calls per PR that this measurement is for.
- `test/utils/octokit-observability.test.ts:89-118` — strict-schema drift test that pins the schema; will need a `duration_ms` case alongside.
- `docs/operate/observability.md:73-75` — the `github.api.*` event table that documents the field set today; must list `duration_ms` + the new `github.api.slow` event.

**External**:
- octokit/core.js hooks documentation: <https://github.com/octokit/core.js#hooks> — `hook.wrap("request", async (request, options) => {...})` is the documented pattern for measuring request duration.
- octokit/octokit.js logging plugin example: <https://github.com/octokit/octokit.js#logging> — shows the `Date.now()` delta pattern explicitly applied via a `before-after-hook` wrap.
- before-after-hook README (the library backing `octokit.hook`): <https://github.com/gr2m/before-after-hook> — distinguishes `before` / `wrap` / `after` / `error` semantics; `wrap` is the recommended pattern when the before-side state (start time) must be visible to the after-side.
- GitHub REST API rate-limit headers reference: <https://docs.github.com/en/rest/using-the-rest-api/rate-limits-for-the-rest-api> — current behaviour for the `x-ratelimit-*` headers the existing plugin already reads.

## Suggested Next Steps

1. Extend `GithubApiLogFieldsSchema` (`src/utils/octokit-observability.ts:39-54`) with `duration_ms: z.number().int().nonnegative()` and add `GITHUB_API_LOG_EVENTS.slow = "github.api.slow"`.
2. Replace `hook.after("request", ...)` with `hook.wrap("request", ...)` in `installRateLimitHooks` (`src/utils/octokit-observability.ts:104-138`): capture `start = Date.now()` before `await request(options)`, compute `duration_ms` after, thread it through `rateLimitFields` (rename internal arg or add an overload). Mirror the same delta capture inside `hook.error("request", ...)` so error lines carry latency too.
3. Add a `SLOW_REQUEST_MS` constant (start at `3000`) and emit `github.api.slow` at `warn` when `duration_ms >= SLOW_REQUEST_MS`, taking precedence over `github.api.request` but yielding to `github.api.rate_limit_low` when both apply (or emit both — both are useful operational signals; document the precedence).
4. Update `test/utils/octokit-observability.test.ts:32-118` with two new cases: (a) `rateLimitFields` accepts and round-trips `durationMs`, (b) strict schema accepts a `github.api.slow` line and rejects a `latency_ms` typo.
5. Update `docs/operate/observability.md:73-75` event table with the new `duration_ms` field on every existing event + a row for `github.api.slow`. Add a one-line operations note mirroring the existing `offer_latency_ms` p99 guidance.

## Areas Evaluated

- `src/utils/octokit-observability.ts` — the only octokit observability chokepoint; confirmed `hook.after` and `hook.error` only, no timer / no `duration_ms`.
- `src/app.ts:82-89` + `src/orchestrator/connection-handler.ts:8` — the two callers of `observableOctokit()`; confirmed both inherit any plugin change for free.
- `src/logger.ts` — confirmed `duration_ms` would flow through the existing `REDACT_PATHS` + `errSerializer` redaction pipeline without modification.
- `src/core/fetcher.ts` — confirmed the per-handler GraphQL call volume that justifies the per-request granularity over a single coarse pipeline-stage metric.
- `docs/operate/observability.md:73-75,121-126,234` — confirmed the existing `_ms` naming convention (`offer_latency_ms`, `queue_wait_ms`) and the symmetric latency story this finding completes.
- `test/utils/octokit-observability.test.ts` — confirmed the existing strict-schema test is the drift gate the new field must pass.
- Open and closed `research`-labelled issues — confirmed no overlap: the closed sibling #170 (rate-limit headers) covers quota state; the open observability issues (`daemon.connection.*`, `scheduler.scan.*`, `circuit.*` triage, `retry.*`) live in unrelated subsystems.

*Generated by the scheduled research action on 2026-06-09*

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(observability): add duration_ms + github.api.slow to octokit hooks for per-request GitHub latency visibility #223

Finding

Diagram

Rationale

References

Suggested Next Steps

Areas Evaluated

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

feat(observability): add duration_ms + github.api.slow to octokit hooks for per-request GitHub latency visibility #223

Description

Finding

Diagram

Rationale

References

Suggested Next Steps

Areas Evaluated

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions