Finding
The scheduled-actions scheduler (src/scheduler/scheduler.ts) has zero scan-level observability. The only two structured log events it emits are scheduler.action.claimed (src/scheduler/scheduler.ts:132-143, inside enqueueRun) and scheduler.action.skipped_missed (src/scheduler/scheduler.ts:177-186, inside processAction). Both fire only when a specific action is due (run or advance). A healthy scheduler whose actions happen to have zero due slots this tick emits no log line at all. An operator therefore cannot distinguish "scheduler is running normally" from "scheduler timer was cleared / crashed / never started" without either waiting for a far rarer claimed-action line or querying Postgres directly.
The reentrancy guard guardedScan (src/scheduler/scheduler.ts:394-405) compounds the gap. When a scan takes longer than SCHEDULER_SCAN_INTERVAL_MS, the next tick reads scanning === true and returns silently (line 395) with no log line. This is the scheduler's most operationally dangerous failure mode and also its least visible: slow scans cause missed cron slots, which surface as scheduler.action.skipped_missed on the next successful scan and look identical to a server restart. The codebase already has a precedent for fixing this kind of gap: fleet-snapshot.ts (issue #174) introduced periodic gauge lines for the dispatcher because the orchestrator likewise discarded fleet state between webhooks and operators had no log-visible signal.
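A minimal sketch of how the guard could surface overlaps instead of returning silently. It assumes a `log(fields, msg)` callback; `createGuardedScan`, `LogFn`, the injected `now` clock, and the closure layout are hypothetical and not taken from scheduler.ts. Only the event names and the lastStartedAt field come from this proposal.

```typescript
// Hypothetical sketch: names and structure are illustrative, not the real scheduler.ts.
type LogFn = (fields: Record<string, unknown>, msg: string) => void;

function createGuardedScan(
  scanOnce: () => Promise<void>,
  log: LogFn,
  now: () => number = Date.now,
): () => Promise<void> {
  let scanning = false;
  let lastStartedAt: number | null = null;

  return async function guardedScan(): Promise<void> {
    if (scanning) {
      // Today this branch returns silently; an event makes the overlap alertable.
      log(
        { event: "scheduler.scan.skipped_overlap", lastStartedAt },
        "scheduler: scan skipped, previous scan still running",
      );
      return;
    }
    scanning = true;
    const startedAt = now();
    lastStartedAt = startedAt;
    try {
      await scanOnce();
    } catch (err) {
      // Same warn line as before, now with an event name and a duration.
      log(
        { event: "scheduler.scan.failed", duration_ms: now() - startedAt, err: String(err) },
        "scheduler: scan tick failed",
      );
    } finally {
      scanning = false;
    }
  };
}
```

Because the flag is set before the first `await`, a second tick that arrives while the first scan is still pending takes the overlap branch and emits exactly one event.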
Three other call sites in the scan path also lack a structured event: field and so cannot be alerted on: "scheduler: scan tick failed" (src/scheduler/scheduler.ts:400, no event, no duration), "scheduler: processAction failed, continuing scan" (src/scheduler/scheduler.ts:271-274, no event), and "scheduler: repo enumeration failed for installation" (src/scheduler/installation-enumerator.ts:44, no event). Adding a single scheduler.scan.* lifecycle (started, completed with counters and duration_ms, skipped_overlap, failed) would close the heartbeat gap and make the scheduler a graphable component of the fleet.
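To illustrate why a missing event field blocks alerting, here is a small sketch that groups log records by event. The `LogRecord` shape, `countByEvent`, and the "action claimed" message are hypothetical; the three warn messages are the ones quoted above.

```typescript
// Hypothetical record shape modelled on a JSON log line.
interface LogRecord {
  level: "info" | "warn";
  msg: string;
  event?: string;
}

// Count records per `event`; records without one collapse into a single opaque bucket.
function countByEvent(records: LogRecord[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const record of records) {
    const key = record.event ?? "(no event field)";
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return counts;
}

const sample: LogRecord[] = [
  { level: "info", msg: "action claimed", event: "scheduler.action.claimed" },
  { level: "warn", msg: "scheduler: scan tick failed" },
  { level: "warn", msg: "scheduler: processAction failed, continuing scan" },
  { level: "warn", msg: "scheduler: repo enumeration failed for installation" },
];

const counts = countByEvent(sample);
```

The three warn lines end up in one bucket, so telling them apart requires matching on free-form message text, which breaks whenever a message is reworded.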
Rationale
The scheduler is the only timer-driven control plane in the bot. Every other dispatch path (webhook → router → daemon) is webhook-driven, so an outage shows up as missing tracking comments and is loud. The scheduler's failure mode is silent: a stalled scan does not page anyone; it only manifests as cron slots gradually drifting to scheduler.action.skipped_missed on the next recovery scan. Production users of .github-app.yaml (research workflows, scheduled audits, periodic policy checks) get cancelled actions with no causal log line.
The four golden signals map directly onto this gap:
Latency: no duration_ms on either successful or failed scans. A scan that doubles in p95 (e.g. one installation's GraphQL latency regressed) is invisible.
Traffic: no count of repos enumerated or actions evaluated per scan. Operators cannot graph "the scheduler is reaching all installations".
Errors: per-installation enumeration failures, YAML/schema parse failures (src/scheduler/config-fetcher.ts:89,96,105,111-113), and prompt-resolution failures (src/scheduler/scheduler.ts:201-204) all log at warn with no event: field, so they cannot be reliably aggregated.
Saturation: the reentrancy guard fires silently. The most important saturation signal — "scans are exceeding intervalMs" — has zero observability today.
The fix is small and additive: add a SCHEDULER_LOG_EVENTS const + SchedulerScanLogSchema Zod union following the exact pattern of src/orchestrator/log-fields.ts:28-145 (which already pins DISPATCHER_LOG_EVENTS and DAEMON_HEARTBEAT_LOG_EVENTS) and src/orchestrator/fleet-snapshot.ts:27-40. The existing scheduler.action.* events stay unchanged — they describe per-action state transitions and are correctly orthogonal to the scan-level lifecycle being proposed.
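A dependency-free sketch of what the event constants and scan-level schema could look like. The proposal calls for a Zod discriminated union; this sketch swaps in a plain TypeScript union with a hand-written parser so that it runs without any packages. The key names (`scanStarted` and so on), `parseSchedulerScanLog`, and the `err: string` field type are assumptions and are not copied from src/orchestrator/log-fields.ts.

```typescript
// Hypothetical stand-in for a future src/scheduler/log-fields.ts.
const SCHEDULER_LOG_EVENTS = {
  scanStarted: "scheduler.scan.started",
  scanCompleted: "scheduler.scan.completed",
  scanSkippedOverlap: "scheduler.scan.skipped_overlap",
  scanFailed: "scheduler.scan.failed",
} as const;

interface ScanCounters {
  repos_enumerated: number;
  actions_evaluated: number;
  actions_claimed: number;
  actions_advanced: number;
  actions_failed: number;
}

// One variant per event, discriminated on the `event` literal.
type SchedulerScanLog =
  | { event: typeof SCHEDULER_LOG_EVENTS.scanStarted }
  | ({ event: typeof SCHEDULER_LOG_EVENTS.scanCompleted; duration_ms: number } & ScanCounters)
  | { event: typeof SCHEDULER_LOG_EVENTS.scanSkippedOverlap; lastStartedAt: number | null }
  | { event: typeof SCHEDULER_LOG_EVENTS.scanFailed; duration_ms: number; err: string };

const isNum = (v: unknown): v is number => typeof v === "number" && Number.isFinite(v);

// Returns the typed record, or null when the input matches no variant.
function parseSchedulerScanLog(input: unknown): SchedulerScanLog | null {
  if (typeof input !== "object" || input === null) return null;
  const r = input as Record<string, unknown>;
  switch (r["event"]) {
    case SCHEDULER_LOG_EVENTS.scanStarted:
      return { event: SCHEDULER_LOG_EVENTS.scanStarted };
    case SCHEDULER_LOG_EVENTS.scanCompleted: {
      const { duration_ms, repos_enumerated, actions_evaluated, actions_claimed, actions_advanced, actions_failed } = r;
      if (
        isNum(duration_ms) && isNum(repos_enumerated) && isNum(actions_evaluated) &&
        isNum(actions_claimed) && isNum(actions_advanced) && isNum(actions_failed)
      ) {
        return {
          event: SCHEDULER_LOG_EVENTS.scanCompleted,
          duration_ms, repos_enumerated, actions_evaluated,
          actions_claimed, actions_advanced, actions_failed,
        };
      }
      return null;
    }
    case SCHEDULER_LOG_EVENTS.scanSkippedOverlap: {
      const { lastStartedAt } = r;
      if (lastStartedAt === null || isNum(lastStartedAt)) {
        return { event: SCHEDULER_LOG_EVENTS.scanSkippedOverlap, lastStartedAt };
      }
      return null;
    }
    case SCHEDULER_LOG_EVENTS.scanFailed: {
      const { duration_ms, err } = r;
      if (isNum(duration_ms) && typeof err === "string") {
        return { event: SCHEDULER_LOG_EVENTS.scanFailed, duration_ms, err };
      }
      return null;
    }
    default:
      return null;
  }
}
```

A round-trip test in this style would serialize each variant with JSON.stringify, parse it back, and assert that the parser accepts it, while records with missing counters or an unrelated event name are rejected.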
Suggested Next Steps
Add SCHEDULER_LOG_EVENTS const + SchedulerScanLogSchema Zod discriminated union in a new src/scheduler/log-fields.ts, mirroring src/orchestrator/log-fields.ts:28-145 (round-trip tested with a co-located log-fields.test.ts). Four variants: scheduler.scan.started, scheduler.scan.completed, scheduler.scan.skipped_overlap, scheduler.scan.failed.
In scanOnce (src/scheduler/scheduler.ts:246-278), record startedAt = Date.now() at entry and accumulate counters (repos_enumerated, actions_evaluated, actions_claimed, actions_advanced, actions_failed) by threading a per-scan state object into processAction. Emit scheduler.scan.completed { duration_ms, ...counters } on the success path.
Emit scheduler.scan.skipped_overlap { lastStartedAt } at src/scheduler/scheduler.ts:395 before the silent return, so an operator can alert on event = "scheduler.scan.skipped_overlap" rate > 0 over 5m.
Convert "scheduler: scan tick failed" (src/scheduler/scheduler.ts:400) into event: "scheduler.scan.failed" { duration_ms, err }.
Add structured event: fields to the three remaining warn lines: scheduler.installation.enumeration_failed, scheduler.config.invalid (single event with a reason discriminator for 404 / non-file / YAML / schema), and scheduler.action.prompt_resolution_failed.
Extend the "Diagnose" table in docs/operate/runbooks/scheduled-actions.md:37-44 and the events table in docs/operate/observability.md:206-210 with the new events.
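The counter-threading step above could be sketched as follows. Everything except the event names and counter field names is hypothetical: `ScheduledRepo`, the `ProcessAction` signature, and the injected `log` and `now` parameters stand in for the real scheduler.ts internals, which are not reproduced here.

```typescript
// Hypothetical sketch: types and injected dependencies are illustrative only.
type LogFn = (fields: Record<string, unknown>, msg: string) => void;

interface ScanCounters {
  repos_enumerated: number;
  actions_evaluated: number;
  actions_claimed: number;
  actions_advanced: number;
  actions_failed: number;
}

interface ScheduledRepo {
  name: string;
  actions: string[];
}

// processAction receives the per-scan counters and bumps claimed/advanced itself.
type ProcessAction = (repo: ScheduledRepo, action: string, counters: ScanCounters) => Promise<void>;

async function scanOnce(
  repos: ScheduledRepo[],
  processAction: ProcessAction,
  log: LogFn,
  now: () => number = Date.now,
): Promise<ScanCounters> {
  const startedAt = now();
  const counters: ScanCounters = {
    repos_enumerated: 0,
    actions_evaluated: 0,
    actions_claimed: 0,
    actions_advanced: 0,
    actions_failed: 0,
  };
  log({ event: "scheduler.scan.started" }, "scheduler: scan started");

  for (const repo of repos) {
    counters.repos_enumerated += 1;
    for (const action of repo.actions) {
      counters.actions_evaluated += 1;
      try {
        await processAction(repo, action, counters);
      } catch {
        // Mirrors the existing "processAction failed, continuing scan" behaviour.
        counters.actions_failed += 1;
      }
    }
  }

  log(
    { event: "scheduler.scan.completed", duration_ms: now() - startedAt, ...counters },
    "scheduler: scan completed",
  );
  return counters;
}
```

Passing one mutable counters object keeps processAction's return type unchanged. Since the source places the "scan tick failed" line inside guardedScan, the scheduler.scan.failed event would be emitted there rather than in this function.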
Areas Evaluated
Related: fleet.snapshot (feat(observability): periodic fleet-state gauge snapshot for queue depth, daemon counts, busy slots, #174), which is dispatcher-side and complementary.
src/scheduler/scheduler.ts: full file read; verified scan loop, reentrancy guard, and the two existing scheduler.action.* events.
src/scheduler/installation-enumerator.ts and src/scheduler/config-fetcher.ts: verified missing event: fields on every warn line.
src/orchestrator/log-fields.ts and src/orchestrator/fleet-snapshot.ts: confirmed the canonical pattern for new structured-event schemas.
src/daemon/scheduled-action-executor.ts: confirmed the daemon-side scheduler.action.daemon.{started,completed,failed} events already exist and are orthogonal.
docs/operate/runbooks/scheduled-actions.md and docs/operate/observability.md: confirmed the event table and runbook would need to grow.
Generated by the scheduled research action on 2026-06-05
Diagram
flowchart TD
  Tick["setInterval tick<br/>guardedScan"] --> Reentry{"scanning<br/>flag set?"}
  Reentry -- yes --> SilentSkip["return silently<br/>NO LOG LINE"]:::missing
  Reentry -- no --> ScanStart["scanOnce begins<br/>NO scan.started event"]:::missing
  ScanStart --> Enum["enumerateScheduledRepos<br/>NO repos_total emitted"]:::missing
  Enum --> Process["processAction per action"]
  Process --> Decide{"computeDueDecision"}
  Decide -- idle --> Quiet["return<br/>NO action.idle event"]:::missing
  Decide -- advance --> SkipMissed["scheduler.action.skipped_missed<br/>emitted"]:::existing
  Decide -- run --> Claim{"slot claimed?"}
  Claim -- no --> Raced["debug only<br/>NO event field"]:::missing
  Claim -- yes --> Claimed["scheduler.action.claimed<br/>emitted"]:::existing
  Process --> ScanEnd["scanOnce returns<br/>NO scan.completed plus duration_ms"]:::missing
  ScanEnd --> Loop["next tick"]
  classDef missing fill:#7b241c,color:#ffffff,stroke:#5a1812
  classDef existing fill:#196f3d,color:#ffffff,stroke:#0e4c27
References
Internal:
src/scheduler/scheduler.ts:132-143: scheduler.action.claimed (existing, kept).
src/scheduler/scheduler.ts:177-186: scheduler.action.skipped_missed (existing, kept).
src/scheduler/scheduler.ts:246-278: scanOnce (no events; this is where scheduler.scan.started / scheduler.scan.completed should bracket).
src/scheduler/scheduler.ts:394-405: guardedScan reentrancy guard (silent skip on overlap; needs scheduler.scan.skipped_overlap).
src/scheduler/scheduler.ts:400: "scheduler: scan tick failed" (needs event: scheduler.scan.failed + duration_ms).
src/scheduler/installation-enumerator.ts:44: "scheduler: repo enumeration failed for installation" (needs structured event).
src/scheduler/config-fetcher.ts:89,96,105,111-113: config-fetch failure log lines (no event field).
src/orchestrator/log-fields.ts:28-145: canonical pattern for *_LOG_EVENTS const + Zod discriminated union; the new schema should mirror it.
src/orchestrator/fleet-snapshot.ts:27-40: precedent for "periodic gauge line so an operator always sees the loop is alive".
docs/operate/runbooks/scheduled-actions.md:37-44: the "Diagnose" table that the new events would extend.
External:
[…] scan.started / scan.completed bracketing.
[…] job name, status, exit code, duration is the field set proposed here.