Summary
agent.process_probe_failed events fire every polling cycle for already-terminated sessions with the message "agent.isProcessRunning indeterminate for <session>", without ever escalating to a definitive terminated state. The same evidence line repeats indefinitely until the lifecycle-manager itself restarts.
What I see
On 2026-06-04 two sessions (btbo-2, btbo-4) had been terminated cleanly, but their probe failures continued firing once per minute for hours:
2026-06-04T20:23:04.569Z btbo-4 agent.process_probe_failed "agent.isProcessRunning indeterminate for btbo-4"
2026-06-04T20:22:04.569Z btbo-2 agent.process_probe_failed "agent.isProcessRunning indeterminate for btbo-2"
2026-06-04T20:22:04.569Z btbo-4 agent.process_probe_failed "agent.isProcessRunning indeterminate for btbo-4"
2026-06-04T20:21:04.569Z btbo-2 agent.process_probe_failed "agent.isProcessRunning indeterminate for btbo-2"
2026-06-04T20:21:04.569Z btbo-4 agent.process_probe_failed "agent.isProcessRunning indeterminate for btbo-4"
…repeating every 60s, same evidence…
The runtime state stays alive / process_running indefinitely in the lifecycle-poll trace even though the AO-spawned worker for that session is gone:
{
"previousRuntimeState": "alive",
"newRuntimeState": "alive",
"previousRuntimeReason": "process_running",
"newRuntimeReason": "process_running",
"primaryReason": "probe_failure",
"evidence": "idle_beyond_threshold activity_signal=valid via_native activity=idle at=2026-06-04T10:12:49.964Z"
}
What I expected
After N consecutive indeterminate probes (e.g. 5 minutes' worth) the runtime state should escalate to terminated and the probe should stop firing for that session, OR the session should be silently reconciled and removed from the poll set the way runtime.lost_detected does on startup.
Why it matters
- Log noise: every terminated session keeps generating one of these lines every minute. Over a day a single dead session writes ~1,440 lines (60 × 24) to the observability ndjson. Across a fleet of sessions that's the dominant log noise category.
- Confused state model: runtimeState = alive, runtimeReason = process_running for a session whose actual process is gone is misleading when triaging from logs.
- Wasted polling: each indeterminate probe spends compute on a probe that will never succeed.
Workaround I'm using
After a Mac restart kills the tmux server, I run ao start --reap-orphans --restore. That spawns a fresh lifecycle-manager which correctly emits runtime.lost_detected for the stale sessions and clears them — instead of leaving them in the indeterminate loop.
Environment
- macOS Apple Silicon, @aoagents/ao v0.9.4
- runtime: tmux, agent: claude-code, workspace: worktree
- session lifecycle managed by lifecycle-manager in ~/.agent-orchestrator/c3c2ee38d54f-observability/processes/
Suggested fix shape
In whichever file owns agent.isProcessRunning / the poll loop, after N consecutive indeterminate results for the same sessionId, write a runtime.lost_detected event for that session and skip it on subsequent polls until it's explicitly restored. The 1-minute polling interval makes 5 consecutive indeterminates a sensible threshold.
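A minimal sketch of that counting logic in TypeScript. Every name here (ProbeResult, PollDecision, IndeterminateTracker, INDETERMINATE_THRESHOLD) is hypothetical; I haven't seen the real probe loop, so this only illustrates the shape of the fix and not how ao is actually structured.

```typescript
// Hypothetical types and names: none of these are taken from the ao codebase.
type ProbeResult = "running" | "stopped" | "indeterminate";

// What the poll loop should do with a session after recording a probe result.
type PollDecision = "continue" | "emit_lost_detected" | "skip";

// Assumed constant: 5 consecutive results at a 1-minute poll interval.
const INDETERMINATE_THRESHOLD = 5;

class IndeterminateTracker {
  private readonly counts = new Map<string, number>();
  private readonly lost = new Set<string>();
  private readonly threshold: number;

  constructor(threshold: number = INDETERMINATE_THRESHOLD) {
    this.threshold = threshold;
  }

  // Record one probe result for a session and decide what the loop does next.
  record(sessionId: string, result: ProbeResult): PollDecision {
    // Already escalated: stay out of the poll set until explicitly restored.
    if (this.lost.has(sessionId)) return "skip";

    // Any definitive answer resets the streak; existing handling takes over.
    if (result !== "indeterminate") {
      this.counts.delete(sessionId);
      return "continue";
    }

    const streak = (this.counts.get(sessionId) ?? 0) + 1;
    if (streak >= this.threshold) {
      // Escalate exactly once; the caller writes runtime.lost_detected.
      this.counts.delete(sessionId);
      this.lost.add(sessionId);
      return "emit_lost_detected";
    }

    this.counts.set(sessionId, streak);
    return "continue";
  }

  // Called when a session is explicitly restored, so polling resumes.
  restore(sessionId: string): void {
    this.lost.delete(sessionId);
    this.counts.delete(sessionId);
  }
}
```

The poll loop would call record(sessionId, result) after each probe, write the runtime.lost_detected event when it gets "emit_lost_detected", and not probe or log at all for sessions that return "skip". Resetting the streak on any definitive result keeps a single flaky probe from counting toward the threshold.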
Happy to send a patch if a maintainer can point me at the file that owns the probe loop and the convention for the threshold constant.