feat(observability): add structured workflow.run.* event family for workflow_runs lifecycle transitions #235

@chrisleekr

Finding

The workflow_runs state machine (queued → running → succeeded / failed / incomplete, plus the transient handed-off and refused paths) is the durable backbone for every comment- and label-triggered workflow (triage, plan, implement, review, resolve, and the composite ship chain). Yet none of its state transitions emits a structured event: discriminator, the convention every other observable subsystem in the codebase already uses. The mutators themselves (insertQueued at src/workflows/runs-store.ts:87, markRunning at src/workflows/runs-store.ts:124, markSucceeded at src/workflows/runs-store.ts:143, markFailed at src/workflows/runs-store.ts:159, markIncomplete at src/workflows/runs-store.ts:180) are pure SQL and never call the logger. All observability is left to the callers, and the callers do not agree on a schema.

The dispatcher emits text-only messages with a reason field instead of an event field: "Workflow run dispatched" at src/workflows/dispatcher.ts:166 carries reason: "workflow-dispatch", the in-flight collision at src/workflows/dispatcher.ts:104 carries reason: "workflow-dispatch-inflight", and the post-insertQueued enqueue failure at src/workflows/dispatcher.ts:151 carries reason: "workflow-dispatch-enqueue-failed". The daemon-side workflow executor uses a third schema, outcome: (see "Workflow run completed" at src/daemon/workflow-executor.ts:193, "Workflow run reported incomplete" at src/daemon/workflow-executor.ts:250, "Workflow run reported failure" at src/daemon/workflow-executor.ts:309, the hand-off at src/daemon/workflow-executor.ts:148, and the uncaught throw at src/daemon/workflow-executor.ts:364). The ship-iteration insert at src/workflows/ship/iteration.ts:170 and the orchestrator compensation markFailed at src/workflows/orchestrator.ts:341 and src/workflows/orchestrator.ts:347 emit nothing keyed to the run lifecycle at all. The markRunning call at src/daemon/workflow-executor.ts:92 has no paired log either.

This forces operators into per-message text greps ("Workflow run reported", "Workflow run dispatched", outcome:"succeeded") or table scans against workflow_runs to derive lifecycle metrics that should be one Loki/Datadog filter away. By contrast, sibling subsystems already expose discriminated event families: pipeline.stage.* (closed issue #166), retry.* (closed issue #225, commit 6713cbf), dispatcher.offer.*, ship.tickle.*, ship.intent.transition, scheduler.action.*, and ws.scoped_completion.* — 61 distinct event: names already, but workflow.run.* is absent. Propose a single workflow.run.{queued,running,succeeded,failed,incomplete,handed_off,dispatch_refused,enqueue_failed} event family written next to every existing call site, carrying runId, workflowName, target, deliveryId, and (for terminal events) durationMs and reason. No schema change to workflow_runs; no new dependency.
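One way to pin the proposed family down is a discriminated shape keyed on event. The sketch below is illustrative only: the type and function names (WorkflowRunState, WorkflowRunEvent, buildWorkflowRunEvent), the string types chosen for runId, target and deliveryId, and the choice of succeeded / failed / incomplete as the terminal set are all assumptions, not code from the repository.

```typescript
// Hypothetical types for the proposed workflow.run.* family.
// Every name and field type here is an assumption made for illustration.
type WorkflowRunState =
  | "queued"
  | "running"
  | "succeeded"
  | "failed"
  | "incomplete"
  | "handed_off"
  | "dispatch_refused"
  | "enqueue_failed";

type WorkflowRunEventName = `workflow.run.${WorkflowRunState}`;

// Assumed terminal set: states after which the row no longer changes.
const TERMINAL_STATES: ReadonlySet<WorkflowRunState> = new Set<WorkflowRunState>([
  "succeeded",
  "failed",
  "incomplete",
]);

interface WorkflowRunBaseFields {
  runId: string;
  workflowName: string;
  target: string; // assumed to be a printable reference to the issue or PR
  deliveryId: string;
}

interface WorkflowRunEvent extends WorkflowRunBaseFields {
  event: WorkflowRunEventName;
  durationMs?: number; // only set on terminal events
  reason?: string; // only set when the caller supplies one
}

function buildWorkflowRunEvent(
  state: WorkflowRunState,
  base: WorkflowRunBaseFields,
  extra: { durationMs?: number; reason?: string } = {},
): WorkflowRunEvent {
  const name: WorkflowRunEventName = `workflow.run.${state}`;
  const payload: WorkflowRunEvent = { event: name, ...base };
  // durationMs is dropped for non-terminal states so dashboards never see
  // a duration on queued or running events.
  if (TERMINAL_STATES.has(state) && extra.durationMs !== undefined) {
    payload.durationMs = extra.durationMs;
  }
  if (extra.reason !== undefined) {
    payload.reason = extra.reason;
  }
  return payload;
}
```

Keeping the payload builder separate from the logger call would let the same shape be asserted in tests without a logger in scope.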

Diagram

flowchart LR
  classDef miss fill:#b03a2e,color:#ffffff,stroke:#922b21,stroke-width:2px
  classDef partial fill:#1f618d,color:#ffffff,stroke:#154360,stroke-width:2px
  classDef good fill:#196f3d,color:#ffffff,stroke:#0e6251,stroke-width:2px

  IQ[insertQueued at 3 call sites]:::miss --> Q[queued]:::partial
  Q -->|markRunning| RUN[running]:::partial
  RUN -->|markSucceeded| OK[succeeded]:::partial
  RUN -->|markIncomplete| INC[incomplete]:::partial
  RUN -->|markFailed| FAIL[failed]:::partial
  RUN -->|uncaught throw| UNC[failed uncaught]:::partial
  RUN -->|hand off| HND[running handed-off]:::partial
  Q -->|enqueue throw| ENF[failed enqueue]:::miss
  Q -->|inflight collision| COL[refused at dispatch]:::miss

  subgraph LEG[Legend]
    direction TB
    L1[no log emitted today]:::miss
    L2[log message exists, no event field]:::partial
    L3[proposed workflow.run event family]:::good
  end
Rationale

Six (succeeded, failed, incomplete, handed_off, uncaught, dispatch_refused) of the eight workflow-run lifecycle transitions are user-observable outcomes that map directly to operator dashboards: success rate per workflow, p95/p99 duration per workflow, incomplete-rate (the resolve handler's incompleteReason distinguishes "agent ran cleanly but CI still red" from a true pipeline error, per migration 009 at src/db/migrations/009_workflow_runs_incomplete.sql:6), and dispatch refusal rate (an early signal that the partial unique index idx_workflow_runs_inflight is over-triggering or that an intent classifier regression is mass-routing into a single workflow). Today every one of those answers requires a SQL aggregation against workflow_runs; with the proposed events they become a single Loki / Datadog query keyed on event plus workflowName — the same pattern operators already use for dispatcher.offer.* and retry.*.
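To make the "one query keyed on event plus workflowName" claim concrete, the sketch below computes a per-workflow success rate from newline-delimited JSON log lines. It is only a stand-in for what a Loki or Datadog query would do server-side; the function name successRateByWorkflow and the assumption that each log line is a flat JSON object with event and workflowName keys are hypothetical.

```typescript
// Hypothetical offline aggregation over newline-delimited JSON log lines.
// Assumes each line is a flat JSON object with `event` and `workflowName`.
const TERMINAL_EVENTS: ReadonlySet<string> = new Set([
  "workflow.run.succeeded",
  "workflow.run.failed",
  "workflow.run.incomplete",
]);

function successRateByWorkflow(ndjson: string): Map<string, number> {
  const tallies = new Map<string, { succeeded: number; total: number }>();
  for (const raw of ndjson.split("\n")) {
    const line = raw.trim();
    if (line === "") continue;
    let parsed: unknown;
    try {
      parsed = JSON.parse(line);
    } catch {
      continue; // skip lines that are not JSON
    }
    if (typeof parsed !== "object" || parsed === null) continue;
    const { event, workflowName } = parsed as {
      event?: unknown;
      workflowName?: unknown;
    };
    if (typeof event !== "string" || typeof workflowName !== "string") continue;
    if (!TERMINAL_EVENTS.has(event)) continue; // only terminal events count
    const tally = tallies.get(workflowName) ?? { succeeded: 0, total: 0 };
    tally.total += 1;
    if (event === "workflow.run.succeeded") tally.succeeded += 1;
    tallies.set(workflowName, tally);
  }
  const rates = new Map<string, number>();
  tallies.forEach((tally, name) => rates.set(name, tally.succeeded / tally.total));
  return rates;
}
```

The filter needs nothing beyond the event name and workflowName, which is the property that removes the SQL aggregation against workflow_runs.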

The two transitions that emit nothing today (insertQueued at all three call sites, and the orchestrator compensation markFailed at src/workflows/orchestrator.ts:341 / src/workflows/orchestrator.ts:347) are the most painful: a run can be inserted-then-immediately-failed by the post-commit Valkey publish branch and leave only a failed row with state.failedReason = "enqueue failed: …", with no log breadcrumb tying that row back to the originating delivery. Adding the events closes the gap and parallels closed issues #207 (queue_wait_ms on dispatcher.offer.sent) and #209 (token usage on executions) without persisting any new column.

Because the mutators themselves stay log-free (they have no logger and are reused under transactions in src/workflows/orchestrator.ts), the right wiring point is the caller — same convention as src/scheduler/scheduler.ts:134 scheduler.action.claimed. The change is additive, fail-open (a logger failure cannot reach the DB write that already committed), and observable via the existing src/logger.ts Pino setup that already routes through the redaction paths documented in CLAUDE.md.
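The fail-open, caller-side wiring described above could look roughly like this. It is a minimal sketch under assumptions: withRunEvent and MinimalLogger are hypothetical names, the mutator is assumed to be async, and the logger interface only mirrors the object-then-message call shape of Pino's info method rather than importing Pino itself.

```typescript
// Hypothetical caller-side wrapper; not taken from the repository.
// MinimalLogger mirrors the (mergingObject, message) call shape of pino's info().
interface MinimalLogger {
  info(fields: Record<string, unknown>, msg: string): void;
}

async function withRunEvent<T>(
  mutate: () => Promise<T>, // e.g. a closure around markRunning(...)
  log: MinimalLogger,
  fields: Record<string, unknown>,
  msg: string,
): Promise<T> {
  // The DB write runs first; if it throws, nothing is logged and the
  // error propagates to the caller unchanged.
  const result = await mutate();
  try {
    log.info(fields, msg);
  } catch {
    // Fail-open: a logger failure must not surface after the write committed.
  }
  return result;
}
```

Emitting only after the mutator resolves keeps the event from claiming a transition that the database rejected.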

References

Internal:

  • src/workflows/runs-store.ts:87 insertQueued (no log, three callers, no shared schema).
  • src/workflows/runs-store.ts:124 markRunning (no log; daemon caller at src/daemon/workflow-executor.ts:92 also silent).
  • src/workflows/runs-store.ts:143,159,180 markSucceeded / markFailed / markIncomplete (no log; callers use outcome: only).
  • src/daemon/workflow-executor.ts:148,193,250,309,364 — five lifecycle log lines using outcome: instead of event:.
  • src/workflows/dispatcher.ts:104,151,166,368,435 — dispatcher lifecycle logs using reason: instead of event:.
  • src/workflows/orchestrator.ts:341,347 — compensation markFailed on post-commit enqueue failure with no run-lifecycle log emitted.
  • src/workflows/ship/iteration.ts:170 — ship iteration insertQueued with no log emitted.
  • src/db/migrations/005_workflow_runs.sql:24 and src/db/migrations/009_workflow_runs_incomplete.sql:20 — the canonical state set (queued, running, succeeded, failed, incomplete).
  • CLAUDE.md — the documented "Doc gates" and "Structured JSON logging via pino with child loggers per request" conventions the proposal matches.

Suggested Next Steps

  1. Add a tiny helper (logWorkflowRunEvent(log, event, fields)) in src/workflows/log-fields.ts (the file that already centralises this kind of binding alongside src/orchestrator/log-fields.ts) emitting event: "workflow.run.<state>" + runId + workflowName + target + deliveryId + (terminal-only) durationMs + (failure-only) reason.
  2. Wire it into the call sites behind the eight events identified above (src/workflows/dispatcher.ts:104,151,166,368,435; src/daemon/workflow-executor.ts:92,148,193,250,286,309,364; src/workflows/ship/iteration.ts:170; src/workflows/orchestrator.ts:341,347). The existing reason: / outcome: fields can stay in place during the transition window for backwards-compatible log queries.
  3. Document the event family alongside the already-listed events in docs/operate/observability.md (the doc-citations check at scripts/check-docs-citations.ts is the enforcement gate) and add one fanout assertion in src/daemon/workflow-executor.test.ts to lock the contract.
  4. Optional follow-up: emit workflow.run.completed with outcome: succeeded|failed|incomplete as a single rollup, so an operator dashboard can chart success rate from one query instead of three — parallel to how retry.* is structured (commit 6713cbf).
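Steps 1 and 4 can be sketched together as a helper that emits the per-state event and, for terminal states, the optional workflow.run.completed rollup. This is a hedged illustration rather than the intended implementation: the RunLogger interface, the field types, the second parameter being a bare state name, and the message strings are all assumptions.

```typescript
// Hypothetical helper for src/workflows/log-fields.ts; shapes are assumed.
interface RunLogger {
  info(fields: Record<string, unknown>, msg: string): void;
}

type RunState =
  | "queued"
  | "running"
  | "succeeded"
  | "failed"
  | "incomplete"
  | "handed_off"
  | "dispatch_refused"
  | "enqueue_failed";

interface RunFields {
  runId: string;
  workflowName: string;
  target: string;
  deliveryId: string;
  durationMs?: number;
  reason?: string;
}

// States that also produce the optional rollup from step 4.
const ROLLUP_STATES: ReadonlySet<RunState> = new Set<RunState>([
  "succeeded",
  "failed",
  "incomplete",
]);

function logWorkflowRunEvent(log: RunLogger, state: RunState, fields: RunFields): void {
  // Per-state event: one filter per transition.
  log.info({ event: `workflow.run.${state}`, ...fields }, `workflow run ${state}`);
  if (ROLLUP_STATES.has(state)) {
    // Rollup event: one filter for success rate across all outcomes.
    log.info(
      { event: "workflow.run.completed", outcome: state, ...fields },
      "workflow run completed",
    );
  }
}
```

A capturing fake logger is enough to assert the fanout in a unit test: one call for a non-terminal state, two for a terminal one.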

Areas Evaluated

  • src/workflows/runs-store.ts — the durable state store and its mutators.
  • src/daemon/workflow-executor.ts — daemon-side run-lifecycle execution + logging.
  • src/workflows/dispatcher.ts — orchestrator-side dispatch + post-commit failure compensation.
  • src/workflows/orchestrator.ts — cascade compensation markFailed paths.
  • src/workflows/ship/iteration.ts — ship-chain child run insertion.
  • src/db/migrations/005_workflow_runs.sql and src/db/migrations/009_workflow_runs_incomplete.sql — the authoritative state-set for the table.
  • The repo-wide event: taxonomy under src/ (61 distinct names, none under workflow.run.*).
  • Existing open and closed research issues under area: observability to confirm no overlap with pipeline.stage.*, retry.*, dispatcher.offer.*, digest.*, daemon.connection.*, or structured_output.* proposals.

Generated by the scheduled research action on 2026-06-17
