fix(core): terminate runtime-lost sessions when agent process probe is indeterminate (#2025) #2027
…s indeterminate (AgentWrapper#2025) When a tmux session vanishes, runtime.isAlive returns a clean dead but the agent's tmux-based isProcessRunning throws and is mapped to INDETERMINATE. The lifecycle poll short-circuited on the indeterminate probe with skipMetadataWrite, never reaching resolveProbeDecision, so the session froze forever in detecting/runtime_lost on the sidebar. A process living inside a dead tmux session cannot be alive. When the runtime probe authoritatively reports dead, reclassify the indeterminate agent probe as dead so the poll resolves terminal (dead+dead -> terminated/runtime_lost). AgentWrapper#1838 protection is intact: this only fires on an authoritative dead runtime, never on a flaky or alive one, and the recent-liveness guard in resolveProbeDecision still keeps a genuinely-working agent in detecting. Affects all agents (codex and claude-code share the tmux-probe pattern).
Greptile Summary
This PR fixes a bug where sessions with a vanished tmux runtime got permanently stuck in
Confidence Score: 5/5
Safe to merge: the change is tightly scoped to a single conditional branch in the lifecycle poll, all existing false-termination guards remain intact, and the new behaviour is fully covered by integration tests.
The fix is a minimal, well-contained addition: a single reclassification branch that only fires when the runtime is authoritatively dead (clean non-failed false from isAlive) and the process probe is indeterminate. It cannot fire on a flaky probe or a live runtime, so the risk of false session termination is negligible. The existing signal_disagreement branch in resolveProbeDecision still catches the recent-liveness edge case, and two new integration tests plus five new unit tests give strong confidence that all relevant paths behave correctly.
No files require special attention.
| Filename | Overview |
|---|---|
| packages/core/src/lifecycle-manager.ts | Reclassification block added before the indeterminate short-circuit: when the runtime is authoritatively dead, an indeterminate process probe is now treated as dead so the poll reaches resolveProbeDecision and terminates the session. Audit event added for traceability. Logic is well-guarded and reads correctly. |
| packages/core/src/tests/lifecycle-manager.test.ts | Two new integration tests added: one verifying the fix (indeterminate + authoritatively dead runtime → killed), and one verifying the #1838 guard still holds (indeterminate + dead runtime + fresh liveness → detecting). Both tests are correctly structured and assert the right state and evidence strings. |
| packages/core/src/tests/lifecycle-status-decisions.test.ts | Five new unit tests cover resolveProbeDecision for all relevant probe-state combinations: dead+dead→terminated, dead+unknown→detecting grace, dead+unknown+recent liveness→detecting, transient probe failure→detecting, and alive+unknown→null. Good coverage of the state machine. |
| .changeset/fix-runtime-lost-indeterminate-2025.md | Patch changeset for @aoagents/ao-core with an accurate description of the fix. |
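The probe-state combinations listed for the unit tests can be read as a small decision table. The sketch below is only an illustration of that table, not the code in the PR: the function name `decideFromProbes`, the `Decision` shape, and the `probe_failed` reason string are hypothetical, and any combination the PR does not describe (for example a dead runtime with a live process) is assumed here to fall back to `detecting`.

```typescript
// Hypothetical model of the decision table described for resolveProbeDecision.
type ProbeState = "alive" | "dead" | "unknown";

interface Probe {
  state: ProbeState;
  failed?: boolean; // true when the probe itself threw (transient failure)
}

interface Decision {
  status: "terminated" | "detecting";
  reason: string; // "probe_failed" is an invented label for this sketch
}

function decideFromProbes(
  runtime: Probe,
  agent: Probe,
  recentLiveness: boolean,
): Decision | null {
  // A transient probe failure is never treated as proof of death.
  if (runtime.failed || agent.failed) {
    return { status: "detecting", reason: "probe_failed" };
  }
  // alive + unknown -> null: nothing to decide in this sketch.
  if (runtime.state !== "dead") return null;
  // Fresh activity contradicts the dead probes, so keep detecting.
  if (recentLiveness) {
    return { status: "detecting", reason: "signal_disagreement" };
  }
  // dead + dead -> terminal.
  if (agent.state === "dead") {
    return { status: "terminated", reason: "runtime_lost" };
  }
  // dead + unknown (e.g. no agent plugin to probe) -> detecting grace.
  return { status: "detecting", reason: "runtime_lost" };
}
```

Checking the failure flag first is a deliberate choice in this sketch: it keeps a timed-out probe from ever reaching the terminal branch.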
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[lifecycle poll fires] --> B{runtimeHandle present?}
B -->|yes| C[runtime.isAlive]
C -->|true| D[runtimeProbe: alive]
C -->|false| E[runtimeProbe: dead, failed=false]
C -->|throws| F[runtimeProbe: unknown, failed=true]
B -->|no| G[runtimeProbe: unknown, failed=false]
D & E & F & G --> H[agent.isProcessRunning]
H -->|true/false| I[processProbe: alive/dead]
H -->|indeterminate| J[processProbe: unknown, indeterminate=true]
H -->|throws| K[processProbe: unknown, failed=true]
J --> L{runtimeAuthoritativelyDead?}
L -->|YES - NEW| M[reclassify processProbe to dead]
L -->|NO| N[short-circuit: skipMetadataWrite]
M --> O[resolveProbeDecision dead+dead]
I --> O
K --> O
O --> P{recentActivitySupportsLiveness?}
P -->|yes| Q[signal_disagreement → detecting]
P -->|no| R[terminated / runtime_lost]
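The top half of the flowchart, where raw probe results become probe records, can be sketched as follows. This is a simplified model under stated assumptions: the probes are synchronous here for brevity, the `"indeterminate"` string stands in for whatever sentinel the agent plugin really returns, and `probeRuntime`, `probeProcess`, and `ProbeResult` are hypothetical names.

```typescript
// Hypothetical probe records mirroring the labels in the flowchart.
type ProbeState = "alive" | "dead" | "unknown";

interface ProbeResult {
  state: ProbeState;
  failed: boolean;        // the probe threw
  indeterminate: boolean; // the probe ran but could not decide
}

// Assumed stand-in for the agent's "cannot tell" result.
type ProcessCheck = boolean | "indeterminate";

function probeRuntime(isAlive?: () => boolean): ProbeResult {
  // No runtime handle: unknown, but not a failure.
  if (!isAlive) return { state: "unknown", failed: false, indeterminate: false };
  try {
    // A clean boolean is authoritative: false means dead with failed=false.
    const alive = isAlive();
    return { state: alive ? "alive" : "dead", failed: false, indeterminate: false };
  } catch {
    // A throwing probe says nothing about liveness.
    return { state: "unknown", failed: true, indeterminate: false };
  }
}

function probeProcess(isProcessRunning: () => ProcessCheck): ProbeResult {
  try {
    const result = isProcessRunning();
    if (result === "indeterminate") {
      return { state: "unknown", failed: false, indeterminate: true };
    }
    return { state: result ? "alive" : "dead", failed: false, indeterminate: false };
  } catch {
    return { state: "unknown", failed: true, indeterminate: false };
  }
}
```

Keeping `failed` and `indeterminate` as separate flags is what lets a later step distinguish an authoritative dead runtime (`state === "dead"` with `failed === false`) from a probe that merely could not answer.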
Reviews (2): Last reviewed commit: "fix(core): log reclassification event + ..."
…gentWrapper#2025 review) Address PR review: record an agent.process_probe_failed event when an indeterminate agent probe is reclassified to dead under an authoritative dead runtime, so the audit trail explains the detecting->terminated transition. Add a lifecycle-manager integration test asserting that a fresh liveness signal still routes to detecting (never a false termination) at the reclassification site, locking in the AgentWrapper#1838 guard for this combination.
Summary
Fixes #2025. Codex/claude worker sessions whose tmux runtime vanished got stuck permanently on the dashboard sidebar in `detecting / runtime_lost` ("Detecting runtime truth (runtime lost)") and never reached a terminal state.
Root cause: when a tmux session is gone, `runtime.isAlive` returns a clean `false` (authoritative dead), but the agent's tmux-based `isProcessRunning` (`tmux list-panes`) throws and is mapped to `PROCESS_PROBE_INDETERMINATE`. The lifecycle poll short-circuited on the indeterminate probe at `lifecycle-manager.ts` with `skipMetadataWrite: true`, never reaching `resolveProbeDecision`, so the session froze in `detecting` forever (which `deriveLegacyStatus` maps to non-terminal `detecting`, keeping the card on the sidebar). The displayed `activity=idle` evidence was a stale fossil preserved by the skip.
Fix
A process living inside a dead tmux session cannot be alive. When the runtime probe authoritatively reports dead, reclassify the indeterminate agent process probe as `dead`, so the poll falls through to `resolveProbeDecision` → `dead + dead` → `terminated / runtime_lost`.
- #1838 (`ps` timeout conflated with "process not found", bulk-terminating all sessions when `ps` exceeds 5s) protection preserved: this only fires on an authoritative dead runtime (`runtimeProbe.state === "dead" && !failed`), never on a flaky/alive one. A transient `ps`/`tmux` timeout still yields `runtimeProbe.failed` (state `unknown`) → indeterminate still short-circuits, no false termination.
- The recent-liveness guard in `resolveProbeDecision` still routes to `detecting` via the existing `signal_disagreement` branch, so a genuinely-working agent whose runtime probe glitched is not killed.
- `resolveProbeDecision`'s general `dead + unknown → detecting` grace (e.g. no agent plugin to probe) is untouched.
Affects all agents: codex and claude-code share the same tmux-probe pattern.
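The reclassification rule described in the Fix section can be sketched as below. This is a minimal illustration assuming simplified probe records, not the actual change in `lifecycle-manager.ts`: `handleIndeterminate`, `PollStep`, and the `reclassified` flag are hypothetical names, and the audit event the PR records is omitted.

```typescript
// Hypothetical, simplified probe record.
type ProbeState = "alive" | "dead" | "unknown";

interface Probe {
  state: ProbeState;
  failed?: boolean;
  indeterminate?: boolean;
}

// Either the poll short-circuits, or it continues to the decision step.
type PollStep =
  | { kind: "short_circuit"; skipMetadataWrite: true }
  | { kind: "resolve"; processProbe: Probe; reclassified: boolean };

function handleIndeterminate(runtimeProbe: Probe, processProbe: Probe): PollStep {
  // Authoritative dead: a clean "dead" answer, not a failed probe.
  const runtimeAuthoritativelyDead =
    runtimeProbe.state === "dead" && !runtimeProbe.failed;

  if (processProbe.indeterminate && !runtimeAuthoritativelyDead) {
    // Old behaviour, kept for flaky or alive runtimes: do nothing this poll.
    return { kind: "short_circuit", skipMetadataWrite: true };
  }
  if (processProbe.indeterminate) {
    // New behaviour: a process cannot outlive its dead tmux session.
    return { kind: "resolve", processProbe: { state: "dead" }, reclassified: true };
  }
  // Determinate probes pass through unchanged.
  return { kind: "resolve", processProbe, reclassified: false };
}
```

For example, `handleIndeterminate({ state: "unknown", failed: true }, { state: "unknown", indeterminate: true })` still short-circuits in this sketch, which is the #1838 case where a timed-out probe must not terminate anything.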
User-facing change
When a worker's terminal ends because its tmux session is gone (e.g. the smoke test finished, `ao stop` killed the pane, the tmux server was reaped, laptop sleep/wake), and the session was not manually killed and has no PR:
Before
- The session stayed on the sidebar in `detecting / runtime_lost` indefinitely, showing `Runtime: missing`, `PR: none`, and a stale `activity=idle` line.
Now
- The session reaches a terminal state (`terminated / runtime_lost`, shown as `killed`), the same as any other finished session.
- A session with a fresh liveness signal is left alone (it stays in `detecting`, never falsely terminated).
Test plan
- `resolveProbeDecision` unit tests: dead+dead → terminated; dead+unknown (no agent) → detecting grace; dead+unknown+recent-liveness → detecting (#1838); transient probe failure → detecting (#1838); alive+unknown → null.
- `lifecycle-manager` integration test: authoritatively dead runtime + `indeterminate` + null activity → `killed` (`runtime_dead process_dead`), metadata written (not skipped).
- `lifecycle-manager` integration test: dead runtime + `indeterminate` + fresh liveness → `detecting`, locking in the #1838 guard.
🤖 Generated with Claude Code