Skip to content

feat(observability): add scheduler.scan.* lifecycle events for tick heartbeat, duration, and reentrancy #217

@chrisleekr

Description

@chrisleekr

Finding

The scheduled-actions scheduler (src/scheduler/scheduler.ts) has zero scan-level observability. The only two structured log events it emits are scheduler.action.claimed (src/scheduler/scheduler.ts:132-143, inside enqueueRun) and scheduler.action.skipped_missed (src/scheduler/scheduler.ts:177-186, inside processAction). Both fire only when a specific action is due (run or advance). A healthy scheduler whose actions happen to have zero due slots this tick emits no log line at all, so an operator cannot distinguish "scheduler is running normally" from "scheduler timer was cleared / crashed / never started" — without either tailing for a far rarer claimed-action line or querying Postgres directly.

The reentrancy guard guardedScan (src/scheduler/scheduler.ts:394-405) compounds the gap. When a scan takes longer than SCHEDULER_SCAN_INTERVAL_MS, the next tick reads scanning === true and returns silently (line 395) with no log line. This is the most operationally dangerous failure mode of the scheduler — slow scans cause missed cron slots, which become scheduler.action.skipped_missed on the next successful scan, looking identical to a server restart — and it is the most invisible. The pattern matches the issue this codebase already has a precedent for fixing: fleet-snapshot.ts (issue #174) introduced periodic gauge lines for the dispatcher because the orchestrator likewise discarded fleet state between webhooks and operators had no log-visible signal.

Three other call sites in the scan path also lack structured event: fields and so cannot be alerted on: scheduler: scan tick failed (src/scheduler/scheduler.ts:400, no event, no duration), scheduler: processAction failed, continuing scan (src/scheduler/scheduler.ts:271-274, no event), and scheduler: repo enumeration failed for installation (src/scheduler/installation-enumerator.ts:44, no event). Adding a single scheduler.scan.* lifecycle — start, completed (with counters + duration_ms), skipped_overlap, failed — would close the heartbeat gap and make the scheduler a graphable component of the fleet.

Diagram

flowchart TD
    Tick["setInterval tick<br/>guardedScan"] --> Reentry{"scanning<br/>flag set?"}
    Reentry -- yes --> SilentSkip["return silently<br/>NO LOG LINE"]:::missing
    Reentry -- no --> ScanStart["scanOnce begins<br/>NO scan.started event"]:::missing
    ScanStart --> Enum["enumerateScheduledRepos<br/>NO repos_total emitted"]:::missing
    Enum --> Process["processAction per action"]
    Process --> Decide{"computeDueDecision"}
    Decide -- idle --> Quiet["return<br/>NO action.idle event"]:::missing
    Decide -- advance --> SkipMissed["scheduler.action.skipped_missed<br/>emitted"]:::existing
    Decide -- run --> Claim{"slot claimed?"}
    Claim -- no --> Raced["debug only<br/>NO event field"]:::missing
    Claim -- yes --> Claimed["scheduler.action.claimed<br/>emitted"]:::existing
    Process --> ScanEnd["scanOnce returns<br/>NO scan.completed plus duration_ms"]:::missing
    ScanEnd --> Loop["next tick"]

    classDef missing fill:#7b241c,color:#ffffff,stroke:#5a1812
    classDef existing fill:#196f3d,color:#ffffff,stroke:#0e4c27
Loading

Rationale

The scheduler is the only timer-driven control plane in the bot. Every other dispatch path (webhook → router → daemon) is webhook-driven, so an outage shows up as missing tracking comments and is loud. The scheduler's failure mode is silent: a stalled scan does not page anyone, it only manifests as cron slots gradually drifting to scheduler.action.skipped_missed on the next recovery scan. Production users of .github-app.yaml (research workflows, scheduled audits, periodic policy checks) get cancelled actions with no causal log line.

The four golden signals map directly onto this gap:

  • Latency: no duration_ms on either successful or failed scans. A scan that doubles in p95 (e.g. one installation's GraphQL latency regressed) is invisible.
  • Traffic: no count of repos enumerated or actions evaluated per scan. Operators cannot graph "the scheduler is reaching all installations".
  • Errors: per-installation enumeration failures, YAML/schema parse failures (src/scheduler/config-fetcher.ts:89,96,105,111-113), and prompt-resolution failures (src/scheduler/scheduler.ts:201-204) all log at warn with no event: field, so they cannot be reliably aggregated.
  • Saturation: the reentrancy guard fires silently. The most important saturation signal — "scans are exceeding intervalMs" — has zero observability today.

The fix is small and additive: add a SCHEDULER_LOG_EVENTS const + SchedulerScanLogSchema Zod union following the exact pattern of src/orchestrator/log-fields.ts:28-145 (which already pins DISPATCHER_LOG_EVENTS and DAEMON_HEARTBEAT_LOG_EVENTS) and src/orchestrator/fleet-snapshot.ts:27-40. The existing scheduler.action.* events stay unchanged — they describe per-action state transitions and are correctly orthogonal to the scan-level lifecycle being proposed.

References

Internal:

  • src/scheduler/scheduler.ts:132-143scheduler.action.claimed (existing, kept).
  • src/scheduler/scheduler.ts:177-186scheduler.action.skipped_missed (existing, kept).
  • src/scheduler/scheduler.ts:246-278scanOnce (no events; this is where scheduler.scan.started / scheduler.scan.completed should bracket).
  • src/scheduler/scheduler.ts:394-405guardedScan reentrancy guard (silent skip on overlap; needs scheduler.scan.skipped_overlap).
  • src/scheduler/scheduler.ts:400scheduler: scan tick failed (needs event: scheduler.scan.failed + duration_ms).
  • src/scheduler/installation-enumerator.ts:44scheduler: repo enumeration failed for installation (needs structured event).
  • src/scheduler/config-fetcher.ts:89,96,105,111-113 — config-fetch failure log lines (no event field).
  • src/orchestrator/log-fields.ts:28-145 — canonical pattern for *_LOG_EVENTS const + Zod discriminated union; the new schema should mirror it.
  • src/orchestrator/fleet-snapshot.ts:27-40 — precedent for "periodic gauge line so an operator always sees the loop is alive".
  • docs/operate/runbooks/scheduled-actions.md:37-44 — the "Diagnose" table that the new events would extend.

External:

Suggested Next Steps

  1. Add SCHEDULER_LOG_EVENTS const + SchedulerScanLogSchema Zod discriminated union in a new src/scheduler/log-fields.ts, mirroring src/orchestrator/log-fields.ts:28-145 (round-trip tested with a co-located log-fields.test.ts). Four variants: scheduler.scan.started, scheduler.scan.completed, scheduler.scan.skipped_overlap, scheduler.scan.failed.
  2. In scanOnce (src/scheduler/scheduler.ts:246-278), record startedAt = Date.now() at entry and accumulate counters (repos_enumerated, actions_evaluated, actions_claimed, actions_advanced, actions_failed) by threading a per-scan state object into processAction. Emit scheduler.scan.completed { duration_ms, ...counters } on the success path.
  3. Emit scheduler.scan.skipped_overlap { lastStartedAt } at src/scheduler/scheduler.ts:395 before the silent return, so an operator can alert on event = "scheduler.scan.skipped_overlap" rate > 0 over 5m.
  4. Convert scheduler: scan tick failed (src/scheduler/scheduler.ts:400) into event: "scheduler.scan.failed" { duration_ms, err }.
  5. Add structured event: fields to the three remaining warn lines: scheduler.installation.enumeration_failed, scheduler.config.invalid (single event with a reason discriminator for 404 / non-file / YAML / schema), and scheduler.action.prompt_resolution_failed.
  6. Extend the "Diagnose" table in docs/operate/runbooks/scheduled-actions.md:37-44 and the events table in docs/operate/observability.md:206-210 with the new events.

Areas Evaluated

Generated by the scheduled research action on 2026-06-05

Metadata

Metadata

Assignees

No one assigned

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions