
fix(core): terminate runtime-lost sessions when agent process probe is indeterminate (#2025) #2027

Open
harshitsinghbhandari wants to merge 3 commits into
AgentWrapper:mainfrom
harshitsinghbhandari:session/ao-184

Conversation

Contributor

@harshitsinghbhandari harshitsinghbhandari commented May 22, 2026

Summary

Fixes #2025. Codex/claude worker sessions whose tmux runtime vanished got stuck permanently on the dashboard sidebar in detecting / runtime_lost ("Detecting runtime truth (runtime lost)") and never reached a terminal state.

Root cause: when a tmux session is gone, runtime.isAlive returns a clean false (authoritative dead), but the agent's tmux-based isProcessRunning (tmux list-panes) throws → mapped to PROCESS_PROBE_INDETERMINATE. The lifecycle poll short-circuited on the indeterminate probe at lifecycle-manager.ts with skipMetadataWrite: true, never reaching resolveProbeDecision — so the session froze in detecting forever (which deriveLegacyStatus maps to non-terminal detecting, keeping the card on the sidebar). The displayed activity=idle evidence was a stale fossil preserved by the skip.
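
The freeze described above can be reduced to a small sketch. This is an illustration, not the code in lifecycle-manager.ts: the `Probe` shape and the names `classifyRuntime`, `classifyAgentProcess`, and `pollBeforeFix` are hypothetical, and only `runtime.isAlive`, `isProcessRunning`, `skipMetadataWrite`, and `resolveProbeDecision` come from this description.

```typescript
// Hypothetical probe shape; the real types in lifecycle-manager.ts may differ.
type ProbeState = "alive" | "dead" | "unknown";

interface Probe {
  state: ProbeState;
  failed: boolean; // the probe call itself errored
  indeterminate?: boolean; // the probe could not decide either way
}

// runtime.isAlive: a clean boolean is authoritative; a throw is a probe failure.
function classifyRuntime(isAlive: () => boolean): Probe {
  try {
    return { state: isAlive() ? "alive" : "dead", failed: false };
  } catch {
    return { state: "unknown", failed: true };
  }
}

// Agent probe (assumed shape): with the tmux session gone, the underlying
// `tmux list-panes` call throws, and the result is reported as indeterminate
// rather than dead.
function classifyAgentProcess(listPanesFindsAgent: () => boolean): Probe {
  try {
    return { state: listPanesFindsAgent() ? "alive" : "dead", failed: false };
  } catch {
    return { state: "unknown", failed: false, indeterminate: true };
  }
}

// Pre-fix poll: an indeterminate agent probe returns early, so the decision
// step (resolveProbeDecision in the real code) is never reached and the
// session metadata is never rewritten.
function pollBeforeFix(probes: { runtime: Probe; agent: Probe }): {
  skipMetadataWrite: boolean;
  reachedDecision: boolean;
} {
  if (probes.agent.indeterminate) {
    return { skipMetadataWrite: true, reachedDecision: false };
  }
  return { skipMetadataWrite: false, reachedDecision: true };
}
```

With a vanished tmux session, `classifyRuntime(() => false)` is a clean dead while the agent probe is indeterminate, so every poll takes the early return and the stored `detecting` state never changes.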

Fix

A process living inside a dead tmux session cannot be alive. When the runtime probe authoritatively reports dead, reclassify the indeterminate agent process probe as dead, so the poll falls through to resolveProbeDecision → dead + dead → terminated / runtime_lost.

  • No agent-plugin changes — single core change in the lifecycle poll.
  • #1838 protection preserved (bug(agent-plugins): ps timeout conflated with "process not found", bulk-terminates all sessions when ps exceeds 5s): this only fires on an authoritative dead runtime (runtimeProbe.state === "dead" && !failed), never on a flaky/alive one. A transient ps/tmux timeout still yields runtimeProbe.failed (state unknown) → indeterminate still short-circuits, no false termination.
  • Recent-liveness guard intact: a dead runtime with fresh agent activity still routes to detecting via the existing signal_disagreement branch, so a genuinely-working agent whose runtime probe glitched is not killed.
  • resolveProbeDecision's general dead + unknown → detecting grace (e.g. no agent plugin to probe) is untouched.

Affects all agents — codex and claude-code share the same tmux-probe pattern.
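
A minimal sketch of the reclassification and its guards, under stated assumptions: the `Probe` shape, `reclassifyProcessProbe`, and `pollAfterFix` are hypothetical names for illustration, the decision step is collapsed to the dead + dead case only, and the actual change in lifecycle-manager.ts may be structured differently. The condition `state === "dead" && !failed` and the `recentActivitySupportsLiveness` guard are taken from this PR.

```typescript
// Hypothetical probe shape; the real types in lifecycle-manager.ts may differ.
type ProbeState = "alive" | "dead" | "unknown";

interface Probe {
  state: ProbeState;
  failed: boolean;
  indeterminate?: boolean;
}

// Only a clean, non-failed "dead" counts as authoritative (the #1838 guard):
// a probe that timed out or threw is failed/unknown and never qualifies.
function runtimeAuthoritativelyDead(runtimeProbe: Probe): boolean {
  return runtimeProbe.state === "dead" && !runtimeProbe.failed;
}

// An indeterminate agent probe under an authoritatively dead runtime is
// treated as dead; every other combination is passed through unchanged.
function reclassifyProcessProbe(runtimeProbe: Probe, processProbe: Probe): Probe {
  if (processProbe.indeterminate && runtimeAuthoritativelyDead(runtimeProbe)) {
    return { state: "dead", failed: false };
  }
  return processProbe;
}

type Outcome = "terminated" | "detecting" | "skipped" | "unhandled";

// Simplified poll: only the paths discussed in this PR are modelled.
function pollAfterFix(
  runtimeProbe: Probe,
  processProbe: Probe,
  recentActivitySupportsLiveness: boolean,
): Outcome {
  const effective = reclassifyProcessProbe(runtimeProbe, processProbe);
  if (effective.indeterminate) {
    return "skipped"; // still short-circuits when the runtime is not authoritatively dead
  }
  if (runtimeProbe.state === "dead" && effective.state === "dead") {
    // Fresh agent activity disagrees with the probes: hold in detecting.
    return recentActivitySupportsLiveness ? "detecting" : "terminated";
  }
  return "unhandled"; // other combinations are outside this sketch
}
```

The design choice here is that the runtime probe may overrule the agent probe only in one direction and only when it is itself trustworthy, which is why a failed runtime probe leaves the old short-circuit in place.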

User-facing change

When a worker's terminal ends because its tmux session is gone (e.g. the smoke test finished, ao stop killed the pane, the tmux server was reaped, laptop sleep/wake) — and the session was not manually killed and has no PR:

Before

  • The session stayed on the sidebar forever showing "Detecting runtime truth (runtime lost)", Runtime: missing, PR: none, and a stale activity=idle line.
  • It never moved to a terminal/ended state on its own. The only way to clear it was to manually kill it. Zombie cards piled up over time (one per dead worker).

Now

  • The session promptly resolves to a terminal ended state (terminated / runtime_lost, shown as killed), the same as any other finished session.
  • It leaves the active list, moves to the ended/terminated grouping, and becomes eligible for normal cleanup — no manual kill required.
  • No change for sessions you manually kill, sessions with a live runtime, or sessions whose agent is still genuinely active (a fresh activity signal still holds them in detecting, never falsely terminated).

Test plan

🤖 Generated with Claude Code

…s indeterminate (AgentWrapper#2025)

When a tmux session vanishes, runtime.isAlive returns a clean dead but the
agent's tmux-based isProcessRunning throws and is mapped to INDETERMINATE.
The lifecycle poll short-circuited on the indeterminate probe with
skipMetadataWrite, never reaching resolveProbeDecision, so the session froze
forever in detecting/runtime_lost on the sidebar.

A process living inside a dead tmux session cannot be alive. When the runtime
probe authoritatively reports dead, reclassify the indeterminate agent probe
as dead so the poll resolves terminal (dead+dead -> terminated/runtime_lost).
AgentWrapper#1838 protection is intact: this only fires on an authoritative dead runtime,
never on a flaky or alive one, and the recent-liveness guard in
resolveProbeDecision still keeps a genuinely-working agent in detecting.

Affects all agents (codex and claude-code share the tmux-probe pattern).
Contributor

greptile-apps Bot commented May 22, 2026

Greptile Summary

This PR fixes a bug where sessions with a vanished tmux runtime got permanently stuck in detecting / runtime_lost on the dashboard sidebar. The root cause was that the agent's tmux-based isProcessRunning returned indeterminate when the tmux session was gone, causing the lifecycle poll to short-circuit with skipMetadataWrite: true and never reach resolveProbeDecision.

Confidence Score: 5/5

Safe to merge — the change is tightly scoped to a single conditional branch in the lifecycle poll, all existing false-termination guards remain intact, and the new behaviour is fully covered by integration tests.

The fix is a minimal, well-contained addition: a single reclassification branch that only fires when the runtime is authoritatively dead (clean non-failed false from isAlive) and the process probe is indeterminate. It cannot fire on a flaky probe or a live runtime, so the risk of false session termination is negligible. The existing signal_disagreement branch in resolveProbeDecision still catches the recent-liveness edge case, and two new integration tests plus five new unit tests give strong confidence that all relevant paths behave correctly.

No files require special attention.

Important Files Changed

| Filename | Overview |
|---|---|
| packages/core/src/lifecycle-manager.ts | Reclassification block added before the indeterminate short-circuit: when the runtime is authoritatively dead, an indeterminate process probe is now treated as dead so the poll reaches resolveProbeDecision and terminates the session. Audit event added for traceability. Logic is well-guarded and reads correctly. |
| packages/core/src/__tests__/lifecycle-manager.test.ts | Two new integration tests added: one verifying the fix (indeterminate + authoritatively dead runtime → killed), and one verifying the #1838 guard still holds (indeterminate + dead runtime + fresh liveness → detecting). Both tests are correctly structured and assert the right state and evidence strings. |
| packages/core/src/__tests__/lifecycle-status-decisions.test.ts | Five new unit tests cover resolveProbeDecision for all relevant probe-state combinations: dead+dead→terminated, dead+unknown→detecting grace, dead+unknown+recent liveness→detecting, transient probe failure→detecting, and alive+unknown→null. Good coverage of the state machine. |
| .changeset/fix-runtime-lost-indeterminate-2025.md | Patch changeset for @aoagents/ao-core with an accurate description of the fix. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[lifecycle poll fires] --> B{runtimeHandle present?}
    B -->|yes| C[runtime.isAlive]
    C -->|true| D[runtimeProbe: alive]
    C -->|false| E[runtimeProbe: dead, failed=false]
    C -->|throws| F[runtimeProbe: unknown, failed=true]
    B -->|no| G[runtimeProbe: unknown, failed=false]

    D & E & F & G --> H[agent.isProcessRunning]
    H -->|true/false| I[processProbe: alive/dead]
    H -->|indeterminate| J[processProbe: unknown, indeterminate=true]
    H -->|throws| K[processProbe: unknown, failed=true]

    J --> L{runtimeAuthoritativelyDead?}
    L -->|YES - NEW| M[reclassify processProbe to dead]
    L -->|NO| N[short-circuit: skipMetadataWrite]

    M --> O[resolveProbeDecision dead+dead]
    I --> O
    K --> O

    O --> P{recentActivitySupportsLiveness?}
    P -->|yes| Q[signal_disagreement → detecting]
    P -->|no| R[terminated / runtime_lost]

Reviews (2): Last reviewed commit: "fix(core): log reclassification event + ..."

Comment thread packages/core/src/lifecycle-manager.ts
Comment thread packages/core/src/__tests__/lifecycle-manager.test.ts
…gentWrapper#2025 review)

Address PR review: record an agent.process_probe_failed event when an
indeterminate agent probe is reclassified to dead under an authoritative dead
runtime, so the audit trail explains the detecting->terminated transition. Add
a lifecycle-manager integration test asserting that a fresh liveness signal
still routes to detecting (never a false termination) at the reclassification
site, locking in the AgentWrapper#1838 guard for this combination.
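
For illustration, recording the audit event at the reclassification site might look like the sketch below. The `AuditEvent` shape, the `record` callback, and `reclassifyWithAudit` are hypothetical; only the event name `agent.process_probe_failed` is taken from the commit message above.

```typescript
// Hypothetical event shape; the real audit record in ao-core may differ.
interface AuditEvent {
  type: string;
  sessionId: string;
  detail: string;
  at: number; // epoch milliseconds
}

// Returns the effective agent process state and, when a reclassification
// happens, records why, so the detecting -> terminated transition is explained.
function reclassifyWithAudit(
  sessionId: string,
  runtimeAuthoritativelyDead: boolean,
  probeIndeterminate: boolean,
  record: (event: AuditEvent) => void,
): "dead" | "unknown" {
  if (runtimeAuthoritativelyDead && probeIndeterminate) {
    record({
      type: "agent.process_probe_failed",
      sessionId,
      detail: "indeterminate agent probe reclassified as dead: runtime authoritatively dead",
      at: Date.now(),
    });
    return "dead";
  }
  return "unknown"; // no reclassification, nothing recorded
}
```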
Development

Successfully merging this pull request may close these issues.

bug(core): indeterminate agent process-probe vetoes authoritative runtime-dead — sessions stuck forever in 'detecting/runtime_lost'