Finding
The workflow_runs state machine (queued → running → succeeded / failed / incomplete, plus the transient handed-off and refused paths) is the durable backbone for every comment- and label-triggered workflow (triage, plan, implement, review, resolve, and the composite ship chain). Yet none of its state transitions emit a structured event: discriminator, the convention every other observable subsystem in the codebase already uses. The mutators themselves (insertQueued at src/workflows/runs-store.ts:87, markRunning at src/workflows/runs-store.ts:124, markSucceeded at src/workflows/runs-store.ts:143, markFailed at src/workflows/runs-store.ts:159, markIncomplete at src/workflows/runs-store.ts:180) are pure SQL — they never call the logger. All observability is left to the callers, and the callers do not agree on a schema.
The dispatcher emits text-only messages with a reason field instead of an event discriminator: "Workflow run dispatched" at src/workflows/dispatcher.ts:166 carries reason: "workflow-dispatch", the in-flight collision at src/workflows/dispatcher.ts:104 carries reason: "workflow-dispatch-inflight", and the post-insertQueued enqueue failure at src/workflows/dispatcher.ts:151 carries reason: "workflow-dispatch-enqueue-failed". The daemon-side workflow executor uses a third schema, outcome:, as seen in "Workflow run completed" at src/daemon/workflow-executor.ts:193, "Workflow run reported incomplete" at src/daemon/workflow-executor.ts:250, "Workflow run reported failure" at src/daemon/workflow-executor.ts:309, the hand-off at src/daemon/workflow-executor.ts:148, and the uncaught throw at src/daemon/workflow-executor.ts:364. The ship-iteration insert at src/workflows/ship/iteration.ts:170 and the orchestrator compensation markFailed at src/workflows/orchestrator.ts:341 and src/workflows/orchestrator.ts:347 emit nothing keyed to the run lifecycle at all. The markRunning call at src/daemon/workflow-executor.ts:92 has no paired log either.
This forces operators into per-message text greps ("Workflow run reported", "Workflow run dispatched", outcome:"succeeded") or table scans against workflow_runs to derive lifecycle metrics that should be one Loki/Datadog filter away. By contrast, sibling subsystems already expose discriminated event families: pipeline.stage.* (closed issue #166), retry.* (closed issue #225, commit 6713cbf), dispatcher.offer.*, ship.tickle.*, ship.intent.transition, scheduler.action.*, and ws.scoped_completion.* — 61 distinct event: names already, but workflow.run.* is absent. Propose a single workflow.run.{queued,running,succeeded,failed,incomplete,handed_off,dispatch_refused,enqueue_failed} event family written next to every existing call site, carrying runId, workflowName, target, deliveryId, and (for terminal events) durationMs and reason. No schema change to workflow_runs; no new dependency.
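As a rough illustration of the proposed family, the eight event names and their shared fields could be typed along these lines. This is only a sketch: the names `WORKFLOW_RUN_STATES`, `WorkflowRunEventFields`, and `workflowRunEventName` are hypothetical, and the field types (in particular `target` as a string) are assumptions, since the proposal lists the fields but not their shapes.

```typescript
// Hypothetical sketch of the proposed workflow.run.* family.
// The state suffixes come from the proposal; everything else is an assumption.
const WORKFLOW_RUN_STATES = [
  "queued",
  "running",
  "succeeded",
  "failed",
  "incomplete",
  "handed_off",
  "dispatch_refused",
  "enqueue_failed",
] as const;

type WorkflowRunState = (typeof WORKFLOW_RUN_STATES)[number];

// Expands to the union "workflow.run.queued" | "workflow.run.running" | ...
type WorkflowRunEventName = `workflow.run.${WorkflowRunState}`;

interface WorkflowRunEventFields {
  runId: string;
  workflowName: string;
  target: string; // assumption: the real target may be a structured value
  deliveryId: string;
  durationMs?: number; // proposal: terminal events only
  reason?: string; // proposal: carried alongside durationMs on terminal events
}

// Builds the discriminator value for a given state suffix.
function workflowRunEventName(state: WorkflowRunState): WorkflowRunEventName {
  return `workflow.run.${state}`;
}
```

Typing the discriminator as a closed union means a misspelled event name fails at compile time rather than silently creating a new, unqueried event in the log stream.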
Diagram
flowchart LR
classDef miss fill:#b03a2e,color:#ffffff,stroke:#922b21,stroke-width:2px
classDef partial fill:#1f618d,color:#ffffff,stroke:#154360,stroke-width:2px
classDef good fill:#196f3d,color:#ffffff,stroke:#0e6251,stroke-width:2px
IQ[insertQueued at 3 call sites]:::miss --> Q[queued]:::partial
Q -->|markRunning| RUN[running]:::partial
RUN -->|markSucceeded| OK[succeeded]:::partial
RUN -->|markIncomplete| INC[incomplete]:::partial
RUN -->|markFailed| FAIL[failed]:::partial
RUN -->|uncaught throw| UNC[failed uncaught]:::partial
RUN -->|hand off| HND[running handed-off]:::partial
Q -->|enqueue throw| ENF[failed enqueue]:::miss
Q -->|inflight collision| COL[refused at dispatch]:::miss
subgraph LEG[Legend]
direction TB
L1[no log emitted today]:::miss
L2[log message exists, no event field]:::partial
L3[proposed workflow.run event family]:::good
end
Rationale
Six (succeeded, failed, incomplete, handed_off, uncaught, dispatch_refused) of the eight workflow-run lifecycle transitions are user-observable outcomes that map directly to operator dashboards: success rate per workflow, p95/p99 duration per workflow, incomplete-rate (the resolve handler's incompleteReason distinguishes "agent ran cleanly but CI still red" from a true pipeline error, per migration 009 at src/db/migrations/009_workflow_runs_incomplete.sql:6), and dispatch refusal rate (an early signal that the partial unique index idx_workflow_runs_inflight is over-triggering or that an intent classifier regression is mass-routing into a single workflow). Today every one of those answers requires a SQL aggregation against workflow_runs; with the proposed events they become a single Loki / Datadog query keyed on event plus workflowName — the same pattern operators already use for dispatcher.offer.* and retry.*.
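To make the dashboard claim concrete, the sketch below shows how terminal `workflow.run.*` events could be reduced to a per-workflow success rate and p95 duration. In practice this aggregation would live in a Loki or Datadog query; the code only illustrates the arithmetic. It assumes log lines are already parsed into objects, and the names `RunEventRecord`, `WorkflowStats`, and `summarizeRuns` are hypothetical.

```typescript
// Hypothetical parsed log record; only the fields this sketch needs.
interface RunEventRecord {
  event: string;
  workflowName: string;
  durationMs?: number;
}

interface WorkflowStats {
  total: number; // terminal events seen
  succeeded: number;
  successRate: number; // succeeded / total
  p95DurationMs: number | null; // null when no durations were logged
}

// Assumption: these three events correspond to the terminal table states.
const TERMINAL_EVENTS = new Set<string>([
  "workflow.run.succeeded",
  "workflow.run.failed",
  "workflow.run.incomplete",
]);

function summarizeRuns(records: RunEventRecord[]): Map<string, WorkflowStats> {
  const grouped = new Map<string, { total: number; succeeded: number; durations: number[] }>();
  for (const r of records) {
    if (!TERMINAL_EVENTS.has(r.event)) continue; // skip queued, running, etc.
    const g = grouped.get(r.workflowName) ?? { total: 0, succeeded: 0, durations: [] };
    g.total += 1;
    if (r.event === "workflow.run.succeeded") g.succeeded += 1;
    if (typeof r.durationMs === "number") g.durations.push(r.durationMs);
    grouped.set(r.workflowName, g);
  }

  const stats = new Map<string, WorkflowStats>();
  grouped.forEach((g, name) => {
    const sorted = g.durations.slice().sort((a, b) => a - b);
    // Nearest-rank percentile: the value at position ceil(0.95 * n), 1-indexed.
    const rank = Math.ceil(0.95 * sorted.length) - 1;
    const p95 = sorted.length === 0 ? null : sorted[rank] ?? null;
    stats.set(name, {
      total: g.total,
      succeeded: g.succeeded,
      successRate: g.succeeded / g.total,
      p95DurationMs: p95,
    });
  });
  return stats;
}
```

The same grouping key (`event` plus `workflowName`) is what a log-backend query would filter and aggregate on, which is why a single discriminator field matters more than the message text.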
The two transitions that emit nothing today (insertQueued at all three call sites, and the orchestrator compensation markFailed at src/workflows/orchestrator.ts:341 / src/workflows/orchestrator.ts:347) are the most painful: a run can be inserted-then-immediately-failed by the post-commit Valkey publish branch and leave only a failed row with state.failedReason = "enqueue failed: …", with no log breadcrumb tying that row back to the originating delivery. Adding the events closes the gap and parallels closed issues #207 (queue_wait_ms on dispatcher.offer.sent) and #209 (token usage on executions) without persisting any new column.
Because the mutators themselves stay log-free (they have no logger and are reused under transactions in src/workflows/orchestrator.ts), the right wiring point is the caller, the same convention as src/scheduler/scheduler.ts:134 (scheduler.action.claimed). The change is additive, fail-open (a logger failure cannot reach the DB write that already committed), and observable via the existing src/logger.ts Pino setup that already routes through the redaction paths documented in CLAUDE.md.
References
Internal:
src/workflows/runs-store.ts:87 — insertQueued (no log, three callers, no shared schema).
src/workflows/runs-store.ts:124 — markRunning (no log; daemon caller at src/daemon/workflow-executor.ts:92 also silent).
src/workflows/runs-store.ts:143,159,180 — markSucceeded/markFailed/markIncomplete (no log; callers use outcome: only).
src/daemon/workflow-executor.ts:148,193,250,309,364 — five lifecycle log lines using outcome: instead of event:.
src/workflows/dispatcher.ts:104,151,166,368,435 — dispatcher lifecycle logs using reason: instead of event:.
src/workflows/orchestrator.ts:341,347 — compensation markFailed on post-commit enqueue failure with no run-lifecycle log emitted.
src/workflows/ship/iteration.ts:170 — ship iteration insertQueued with no log emitted.
src/db/migrations/005_workflow_runs.sql:24 and src/db/migrations/009_workflow_runs_incomplete.sql:20 — the canonical state set (queued, running, succeeded, failed, incomplete).
CLAUDE.md — the documented Doc gates and Structured JSON logging via pino with child loggers per request conventions the proposal matches.
External:
BullMQ events — Events guide and Telemetry / Metrics: the upstream queue library exposes per-job-state events (completed, failed, progress, …) precisely so observability lives in the queue, not in the handler.
Temporal — Mastering Durable Execution in Distributed Systems: every step of execution should be trackable through a queryable audit trail; state transitions are the right unit of structured emission.
Suggested Next Steps
Add a tiny helper (logWorkflowRunEvent(log, event, fields)) in src/workflows/log-fields.ts (the file that already centralises this kind of binding alongside src/orchestrator/log-fields.ts) emitting event: "workflow.run.<state>" + runId + workflowName + target + deliveryId + (terminal-only) durationMs + (failure-only) reason.
Wire it into the call sites identified above (src/workflows/dispatcher.ts:104,151,166,368,435; src/daemon/workflow-executor.ts:92,148,193,250,286,309,364; src/workflows/ship/iteration.ts:170; src/workflows/orchestrator.ts:341,347). The existing reason: / outcome: fields can stay in place during the transition window for backwards-compatible log queries.
Document the event family alongside the already-listed events in docs/operate/observability.md (the doc-citations check at scripts/check-docs-citations.ts is the enforcement gate) and add one fanout assertion in src/daemon/workflow-executor.test.ts to lock the contract.
Optional follow-up: emit workflow.run.completed with outcome: succeeded|failed|incomplete as a single rollup, so an operator dashboard can chart success rate from one query instead of three — parallel to how retry.* is structured (commit 6713cbf).
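The helper in the first step might look roughly like the following. This is a sketch under stated assumptions, not the repository's code: `MinimalLogger` is a hypothetical structural stand-in for a pino-style logger (`info(mergingObject, message)`), the second parameter is taken as the state suffix rather than the full event name, `target` is assumed to be a string, and treating only `succeeded`, `failed`, and `incomplete` as terminal for `durationMs` is an assumption based on the canonical state set.

```typescript
type WorkflowRunState =
  | "queued" | "running" | "succeeded" | "failed" | "incomplete"
  | "handed_off" | "dispatch_refused" | "enqueue_failed";

interface WorkflowRunLogFields {
  runId: string;
  workflowName: string;
  target: string; // assumption: real shape not specified in the proposal
  deliveryId: string;
  durationMs?: number;
  reason?: string;
}

// Hypothetical minimal logger shape, compatible with pino's info(obj, msg).
interface MinimalLogger {
  info(obj: Record<string, unknown>, msg: string): void;
}

// Assumption: only these states carry durationMs.
const TERMINAL_STATES = new Set<WorkflowRunState>(["succeeded", "failed", "incomplete"]);

function logWorkflowRunEvent(
  log: MinimalLogger,
  state: WorkflowRunState,
  fields: WorkflowRunLogFields,
): void {
  const { durationMs, reason, ...base } = fields;
  const payload: Record<string, unknown> = { event: `workflow.run.${state}`, ...base };
  // Drop durationMs on non-terminal states so dashboards never see partial timings.
  if (TERMINAL_STATES.has(state) && durationMs !== undefined) payload.durationMs = durationMs;
  if (reason !== undefined) payload.reason = reason;
  try {
    log.info(payload, `Workflow run ${state}`);
  } catch {
    // Fail-open: a logger error must not surface to the caller,
    // whose DB write has already committed by the time this runs.
  }
}
```

A caller would invoke it immediately after the corresponding mutator returns, for example `logWorkflowRunEvent(log, "succeeded", { runId, workflowName, target, deliveryId, durationMs })`, leaving the existing `outcome:` or `reason:` log line in place during the transition window.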
Areas Evaluated
src/workflows/runs-store.ts - the durable state store and its mutators.
src/daemon/workflow-executor.ts - daemon-side run-lifecycle execution and logging.
src/workflows/dispatcher.ts - orchestrator-side dispatch and post-commit failure compensation.
src/workflows/orchestrator.ts - cascade compensation markFailed paths.
src/workflows/ship/iteration.ts - ship-chain child run insertion.
src/db/migrations/005_workflow_runs.sql and src/db/migrations/009_workflow_runs_incomplete.sql - the authoritative state set for the table.
The repo-wide event: taxonomy under src/ (61 distinct names, none under workflow.run.*).
Existing open and closed research issues under area: observability to confirm no overlap with pipeline.stage.*, retry.*, dispatcher.offer.*, digest.*, daemon.connection.*, or structured_output.* proposals.
Generated by the scheduled research action on 2026-06-17