Skip to content

feat(observability): add duration_ms + github.api.slow to octokit hooks for per-request GitHub latency visibility #223

@chrisleekr

Description

@chrisleekr

Finding

Every webhook handler reaches GitHub at least twice (GraphQL fetch + tracking-comment update) and most paginated PR fetches in src/core/fetcher.ts fan out to four parallel paginate(...) calls, but the operator has zero per-request latency signal. The single octokit observability chokepoint, installRateLimitHooks in src/utils/octokit-observability.ts:104-138, registers hook.after("request", ...) and hook.error("request", ...) only: neither closes over a start timestamp, neither emits a duration_ms, and the .strict() GithubApiLogFieldsSchema at src/utils/octokit-observability.ts:39-54 explicitly forbids any new field on the line, so a drop-in addition fails the co-located strict-schema test at test/utils/octokit-observability.test.ts:109-118.

The result is a latency blind spot at the layer that dominates end-to-end webhook lag. When core/fetcher.ts GraphQL pagination, core/tracking-comment.ts REST writes, or the proposal-poller reaction scans get slow, the existing github.api.request line still reads status: 200 and the existing dispatcher.offer.accepted offer_latency_ms (docs/operate/observability.md:121-126) keeps reading healthy: the slow leg is the GitHub call, not the dispatcher round-trip, but neither event surfaces it. A hook.wrap("request", ...) closure that captures start = Date.now() and threads duration_ms = Date.now() - start into both hook.after and hook.error log lines closes the gap, mirrors the existing offer_latency_ms / queue_wait_ms _ms naming idiom, and extends (not replaces) the installRateLimitHooks plugin so the single observableOctokit() subclass wired in src/app.ts:82-89 keeps owning every emitted line.

The same closure can fire a new github.api.slow warn event when duration_ms >= SLOW_REQUEST_MS (start at 3000ms — about 3x typical REST latency, well below the abuse-detection 10s search-API ceiling), giving an actionable alert on GitHub-side degradations without dialing every request to info volume.

Diagram

flowchart LR
    Caller["pipeline / handler<br/>calls octokit.request or graphql"]
    Wrap["hook.wrap request<br/>start = Date.now"]
    Fetch["GitHub HTTP fetch<br/>REST / GraphQL"]
    After["hook.after request<br/>compute duration_ms<br/>= Date.now - start"]
    Schema["GithubApiLogFieldsSchema strict<br/>+ duration_ms"]
    Decide{"duration_ms<br/>>= SLOW_REQUEST_MS<br/>or remaining<br/>< RATE_LIMIT_LOW_WATER?"}
    Slow["warn github.api.slow<br/>route, status, duration_ms,<br/>rate_limit fields"]
    Low["warn github.api.rate_limit_low<br/>route, status, duration_ms,<br/>rate_limit fields"]
    Req["debug github.api.request<br/>route, status, duration_ms,<br/>rate_limit fields"]
    ErrHook["hook.error request<br/>compute duration_ms<br/>on rate-limit error"]
    Warn["warn github.api.rate_limit_warning<br/>route, status, retry_after_s,<br/>duration_ms"]

    Caller --> Wrap
    Wrap --> Fetch
    Fetch -->|2xx 3xx 4xx response| After
    Fetch -->|throw 429 or 403 secondary| ErrHook
    After --> Schema
    Schema --> Decide
    Decide -->|slow| Slow
    Decide -->|low quota| Low
    Decide -->|healthy| Req
    ErrHook --> Warn

    Wrap:::added
    Slow:::added
    Decide:::added
    After:::current
    ErrHook:::current
    Req:::current
    Low:::current
    Warn:::current
    Schema:::current

    classDef added fill:#196f3d,color:#ffffff,stroke:#0e4d27
    classDef current fill:#1f4e79,color:#ffffff,stroke:#143d61
Loading

Rationale

GitHub-side latency is the single largest variable component of runPipeline end-to-end time. src/core/fetcher.ts issues three parallel paginate(...) calls per PR (PR_FIRST_QUERY, PR_COMMENTS_QUERY, PR_REVIEWS_QUERY) plus a per-review REVIEW_COMMENTS_QUERY fan-out capped at REVIEW_OVERFLOW_CONCURRENCY = 5 (src/core/fetcher.ts:13-16); a single slow page from any of them stalls the whole handler before the agent ever runs, but today the operator sees only the per-stage pipeline.stage line and has to guess whether the stall was network, GraphQL slowness, or local CPU. Adding duration_ms to the existing line — no new sink, no new schema, no extra log volume at default info — makes p50 / p95 / p99 per-route directly queryable in the existing log shipper.

The github.api.slow warn variant is the operational lever. It mirrors how github.api.rate_limit_low already converts a per-request debug line into a warn once remaining crosses RATE_LIMIT_LOW_WATER (src/utils/octokit-observability.ts:86-92), reusing a pattern operators already know rather than introducing a new alert taxonomy. The pre-existing dispatcher.offer.accepted offer_latency_ms documentation block (docs/operate/observability.md:234) already calls out p99 tracking as the operating signal for slow daemon responses; symmetric coverage on GitHub's leg means an operator chasing a webhook-to-tracking-comment lag regression can finally tell whether the bottleneck is the daemon (offer_latency_ms) or GitHub (duration_ms) without ad-hoc instrumentation.

Cost is bounded: hook.wrap is the octokit-blessed wrapper hook (see octokit/core.js docs); Date.now() is constant-time; the per-request log line already exists, only the field set grows by one integer. The closed sibling issue #170 (feat(observability): log octokit rate-limit headers via hook.after for per-installation quota visibility) added the rate-limit half; the latency half is the natural follow-on within the same plugin.

References

Internal:

  • src/utils/octokit-observability.ts:39-54.strict() GithubApiLogFieldsSchema, must gain a duration_ms: z.number().int().nonnegative() field.
  • src/utils/octokit-observability.ts:75-98rateLimitFields(), the pure builder that becomes the natural seam for accepting durationMs and re-emitting it on every event variant.
  • src/utils/octokit-observability.ts:104-138installRateLimitHooks(), the only place a hook.wrap / hook.before belongs (single chokepoint).
  • src/app.ts:82-89App({ Octokit: observableOctokit() }), the wiring that propagates the plugin to every app.octokit + getInstallationOctokit().
  • src/orchestrator/connection-handler.ts:8 — second consumer of observableOctokit(); both pick up the new field for free.
  • src/core/fetcher.ts:13-78 — the four parallel paginated GraphQL calls per PR that this measurement is for.
  • test/utils/octokit-observability.test.ts:89-118 — strict-schema drift test that pins the schema; will need a duration_ms case alongside.
  • docs/operate/observability.md:73-75 — the github.api.* event table that documents the field set today; must list duration_ms + the new github.api.slow event.

External:

Suggested Next Steps

  1. Extend GithubApiLogFieldsSchema (src/utils/octokit-observability.ts:39-54) with duration_ms: z.number().int().nonnegative() and add GITHUB_API_LOG_EVENTS.slow = "github.api.slow".
  2. Replace hook.after("request", ...) with hook.wrap("request", ...) in installRateLimitHooks (src/utils/octokit-observability.ts:104-138): capture start = Date.now() before await request(options), compute duration_ms after, thread it through rateLimitFields (rename internal arg or add an overload). Mirror the same delta capture inside hook.error("request", ...) so error lines carry latency too.
  3. Add a SLOW_REQUEST_MS constant (start at 3000) and emit github.api.slow at warn when duration_ms >= SLOW_REQUEST_MS, taking precedence over github.api.request but yielding to github.api.rate_limit_low when both apply (or emit both — both are useful operational signals; document the precedence).
  4. Update test/utils/octokit-observability.test.ts:32-118 with two new cases: (a) rateLimitFields accepts and round-trips durationMs, (b) strict schema accepts a github.api.slow line and rejects a latency_ms typo.
  5. Update docs/operate/observability.md:73-75 event table with the new duration_ms field on every existing event + a row for github.api.slow. Add a one-line operations note mirroring the existing offer_latency_ms p99 guidance.

Areas Evaluated

  • src/utils/octokit-observability.ts — the only octokit observability chokepoint; confirmed hook.after and hook.error only, no timer / no duration_ms.
  • src/app.ts:82-89 + src/orchestrator/connection-handler.ts:8 — the two callers of observableOctokit(); confirmed both inherit any plugin change for free.
  • src/logger.ts — confirmed duration_ms would flow through the existing REDACT_PATHS + errSerializer redaction pipeline without modification.
  • src/core/fetcher.ts — confirmed the per-handler GraphQL call volume that justifies the per-request granularity over a single coarse pipeline-stage metric.
  • docs/operate/observability.md:73-75,121-126,234 — confirmed the existing _ms naming convention (offer_latency_ms, queue_wait_ms) and the symmetric latency story this finding completes.
  • test/utils/octokit-observability.test.ts — confirmed the existing strict-schema test is the drift gate the new field must pass.
  • Open and closed research-labelled issues — confirmed no overlap: the closed sibling feat(observability): log octokit rate-limit headers via hook.after for per-installation quota visibility #170 (rate-limit headers) covers quota state; the open observability issues (daemon.connection.*, scheduler.scan.*, circuit.* triage, retry.*) live in unrelated subsystems.

Generated by the scheduled research action on 2026-06-09

Metadata

Metadata

Assignees

No one assigned

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions