Finished realtime StoryRuns retain Running stepStates after StepRuns are deleted #89

@lanycrost

Summary

A realtime StoryRun can reach a terminal phase while status.stepStates still reports every step as Running, even after the child StepRuns have been deleted.

This leaves the StoryRun status internally inconsistent and makes post-finish reconciliation/debugging much harder.

Live evidence

StoryRun:

  • livekit-voice/livekit-voice-assistant-rm-6jwy7uutjydd-241a4c029032ae1d

Observed object state:

  • status.phase: Finished
  • status.message: StoryRun gracefully canceled
  • status.finishedAt: 2026-04-22T18:15:32Z
  • status.stepStates[*].phase: Running for every entry

Cluster state at the same time:

  • kubectl get stepruns -A returned No resources found

Controller logs for the same run showed realtime StepRun deletion cleanup executing successfully:

  • Reconciling deletion for StepRun
  • Deleting realtime resource after TTL expiry
  • Owned resources are deleted, removing finalizer

Why this matters

A terminal StoryRun that still reports every step as Running breaks status invariants:

  • the user-facing run state is misleading
  • downstream controllers/debug tooling can no longer trust status.stepStates
  • it becomes difficult to tell whether the run terminated cleanly or was partially torn down
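The invariant described above can be written down as a small check. The following is a minimal sketch, not Bobrapet code: the `Phase`, `StepState`, and `StoryRunStatus` types are simplified stand-ins, `Failed` is an assumed extra terminal phase, and the step names are made up.

```go
package main

import (
	"fmt"
	"sort"
)

// Phase is a simplified stand-in for the phase enum used in StoryRun and
// StepRun status; the real Bobrapet type may differ.
type Phase string

const (
	PhaseRunning  Phase = "Running"  // named in this issue
	PhaseFinished Phase = "Finished" // named in this issue
	PhaseFailed   Phase = "Failed"   // assumed for illustration only
)

// StepState mirrors one entry of status.stepStates (assumed shape).
type StepState struct {
	Phase Phase
}

// StoryRunStatus holds only the fields this check needs.
type StoryRunStatus struct {
	Phase      Phase
	StepStates map[string]StepState
}

// isTerminal reports whether a phase counts as final in this sketch.
func isTerminal(p Phase) bool {
	return p == PhaseFinished || p == PhaseFailed
}

// staleSteps returns the sorted names of steps that are still non-terminal
// although the StoryRun itself is terminal. It returns nil while the run is
// still in progress, because the invariant does not apply yet.
func staleSteps(s StoryRunStatus) []string {
	if !isTerminal(s.Phase) {
		return nil
	}
	var stale []string
	for name, st := range s.StepStates {
		if !isTerminal(st.Phase) {
			stale = append(stale, name)
		}
	}
	sort.Strings(stale)
	return stale
}

func main() {
	// Same shape as the observed object state; step names are hypothetical.
	observed := StoryRunStatus{
		Phase: PhaseFinished,
		StepStates: map[string]StepState{
			"listen":  {Phase: PhaseRunning},
			"respond": {Phase: PhaseRunning},
		},
	}
	fmt.Println("stale steps:", staleSteps(observed))
}
```

A check like this could back a debug assertion or a metric, so that a terminal StoryRun with a non-empty stale list is flagged instead of silently persisted.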

Suspected root cause area

handleGracefulCancel in internal/controller/runs/storyrun_controller.go:1517-1577 can finish the StoryRun after force-deleting the remaining StepRuns, but nothing guarantees a final sync that converts the in-memory step states to terminal values before the StepRuns disappear.

The DAG sync path in internal/controller/runs/dag.go:955-1013 mirrors StepRun phases into StoryRun status while StepRuns still exist. Once the StepRuns are gone, there is no source of truth left to repair the stale Running entries.
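One possible shape for the missing final sync is sketched below. It is an assumption about how a fix could look, not the existing handleGracefulCancel logic: the types are simplified stand-ins, `Canceled` is an assumed terminal phase for steps, and `finalizeStepStates` and `finishCanceled` are hypothetical helper names.

```go
package main

import (
	"fmt"
	"time"
)

type Phase string

const (
	PhaseRunning  Phase = "Running"
	PhaseFinished Phase = "Finished"
	PhaseCanceled Phase = "Canceled" // assumed terminal phase for canceled steps
)

// StepState and StoryRunStatus are simplified stand-ins for the real types.
type StepState struct {
	Phase      Phase
	Message    string
	FinishedAt *time.Time
}

type StoryRunStatus struct {
	Phase      Phase
	Message    string
	FinishedAt *time.Time
	StepStates map[string]StepState
}

// exampleNow is the finishedAt timestamp quoted in this issue.
var exampleNow = time.Date(2026, time.April, 22, 18, 15, 32, 0, time.UTC)

func isTerminal(p Phase) bool {
	return p == PhaseFinished || p == PhaseCanceled
}

// finalizeStepStates rewrites every non-terminal entry to the given terminal
// phase and returns how many entries changed. Entries that are already
// terminal are left untouched.
func finalizeStepStates(s *StoryRunStatus, to Phase, msg string, now time.Time) int {
	changed := 0
	for name, st := range s.StepStates {
		if isTerminal(st.Phase) {
			continue
		}
		t := now
		st.Phase = to
		st.Message = msg
		st.FinishedAt = &t
		s.StepStates[name] = st
		changed++
	}
	return changed
}

// finishCanceled finalizes the step states first and only then marks the run
// finished. In a real controller the updated status would have to be
// persisted before the StepRuns are force-deleted.
func finishCanceled(s *StoryRunStatus, now time.Time) {
	finalizeStepStates(s, PhaseCanceled, "canceled with StoryRun", now)
	t := now
	s.Phase = PhaseFinished
	s.Message = "StoryRun gracefully canceled"
	s.FinishedAt = &t
}

func main() {
	status := &StoryRunStatus{
		Phase: PhaseRunning,
		StepStates: map[string]StepState{
			"listen":  {Phase: PhaseRunning}, // hypothetical step names
			"respond": {Phase: PhaseRunning},
		},
	}
	finishCanceled(status, exampleNow)
	for name, st := range status.StepStates {
		fmt.Println(name, st.Phase)
	}
	fmt.Println("run:", status.Phase)
}
```

Doing the rewrite from the StoryRun's own status, rather than from live StepRuns, means it still works after the StepRuns are gone.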

Acceptance criteria

  • A terminal StoryRun must not retain Running entries in status.stepStates
  • Bobrapet should mark any remaining realtime steps terminal before or during a cancellation-based finish
  • Add regression coverage for the sequence:
    1. realtime StoryRun running
    2. cancel requested
    3. StepRuns deleted
    4. StoryRun becomes terminal
    5. status.stepStates is terminal/consistent
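The five-step sequence above can be expressed as a reusable check that takes the cancel behaviour under test as a function. This is an in-memory sketch only: `storyRun`, `fakeCluster`, `buggyCancel`, and `fixedCancel` are hypothetical stand-ins, and a real regression test would presumably run against the controller with a test API server instead.

```go
package main

import (
	"errors"
	"fmt"
)

type Phase string

const (
	PhaseRunning  Phase = "Running"
	PhaseFinished Phase = "Finished"
)

// storyRun and fakeCluster are in-memory stand-ins for the real objects.
type storyRun struct {
	Phase      Phase
	StepStates map[string]Phase
}

type fakeCluster struct {
	StepRuns map[string]Phase // child StepRuns keyed by step name
}

// cancelFunc is the behaviour under test.
type cancelFunc func(run *storyRun, c *fakeCluster)

// checkCancelSequence walks the regression sequence and returns an error
// describing the first violated expectation, or nil if all hold.
func checkCancelSequence(cancel cancelFunc) error {
	// 1. realtime StoryRun running with two live StepRuns
	run := &storyRun{
		Phase:      PhaseRunning,
		StepStates: map[string]Phase{"a": PhaseRunning, "b": PhaseRunning},
	}
	c := &fakeCluster{
		StepRuns: map[string]Phase{"a": PhaseRunning, "b": PhaseRunning},
	}
	// 2. cancel requested
	cancel(run, c)
	// 3. StepRuns deleted
	if len(c.StepRuns) != 0 {
		return errors.New("StepRuns were not deleted")
	}
	// 4. StoryRun becomes terminal
	if run.Phase != PhaseFinished {
		return fmt.Errorf("StoryRun phase is %s, want a terminal phase", run.Phase)
	}
	// 5. status.stepStates is terminal/consistent
	for name, p := range run.StepStates {
		if p == PhaseRunning {
			return fmt.Errorf("step %q still Running after finish", name)
		}
	}
	return nil
}

// buggyCancel models the reported behaviour: StepRuns are deleted and the
// run is finished, but stepStates is never updated.
func buggyCancel(run *storyRun, c *fakeCluster) {
	for name := range c.StepRuns {
		delete(c.StepRuns, name)
	}
	run.Phase = PhaseFinished
}

// fixedCancel marks remaining steps terminal before deleting the StepRuns.
func fixedCancel(run *storyRun, c *fakeCluster) {
	for name, p := range run.StepStates {
		if p == PhaseRunning {
			run.StepStates[name] = PhaseFinished
		}
	}
	buggyCancel(run, c)
}

func main() {
	fmt.Println("buggy:", checkCancelSequence(buggyCancel))
	fmt.Println("fixed:", checkCancelSequence(fixedCancel))
}
```

Passing the cancel behaviour in as a function lets the same check demonstrate the failure against the current behaviour and pass against a fix.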

Labels

  • area/operator: Bobrapet controller or CRD-level change.
  • kind/bug: Unexpected behaviour or regression that needs fixing.
  • priority/high: Important issue to schedule soon.
