Log noise: agent.isProcessRunning 'indeterminate' fires every poll without escalating to terminated #2102

@Bdandc

Summary

agent.process_probe_failed events fire on every polling cycle for already-terminated sessions, with the message "agent.isProcessRunning indeterminate for <session>", and never escalate to a definitive terminated state. The same evidence line repeats indefinitely until the lifecycle-manager itself restarts.

What I see

On 2026-06-04 two sessions (btbo-2, btbo-4) had been terminated cleanly, but their probe failures continued firing once per minute for hours:

2026-06-04T20:23:04.569Z  btbo-4  agent.process_probe_failed  "agent.isProcessRunning indeterminate for btbo-4"
2026-06-04T20:22:04.569Z  btbo-2  agent.process_probe_failed  "agent.isProcessRunning indeterminate for btbo-2"
2026-06-04T20:22:04.569Z  btbo-4  agent.process_probe_failed  "agent.isProcessRunning indeterminate for btbo-4"
2026-06-04T20:21:04.569Z  btbo-2  agent.process_probe_failed  "agent.isProcessRunning indeterminate for btbo-2"
2026-06-04T20:21:04.569Z  btbo-4  agent.process_probe_failed  "agent.isProcessRunning indeterminate for btbo-4"
…repeating every 60s, same evidence…

The runtime state stays alive / process_running indefinitely in the lifecycle-poll trace even though the AO-spawned worker for that session is gone:

{
  "previousRuntimeState": "alive",
  "newRuntimeState":      "alive",
  "previousRuntimeReason": "process_running",
  "newRuntimeReason":      "process_running",
  "primaryReason": "probe_failure",
  "evidence": "idle_beyond_threshold activity_signal=valid via_native activity=idle at=2026-06-04T10:12:49.964Z"
}
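For triage, a trace like the one above can be flagged mechanically. This is only a sketch: the field names are copied from the trace, but the `LifecyclePollTrace` interface and the `isStuckOnProbeFailure` helper are hypothetical and not part of ao.

```typescript
// Field names mirror the lifecycle-poll trace above.
// The interface and helper names are hypothetical, for illustration only.
interface LifecyclePollTrace {
  previousRuntimeState: string;
  newRuntimeState: string;
  previousRuntimeReason: string;
  newRuntimeReason: string;
  primaryReason: string;
  evidence: string;
}

// True when a poll reported a probe failure but left the session
// in the same "alive" state and reason it already had.
function isStuckOnProbeFailure(t: LifecyclePollTrace): boolean {
  const unchanged =
    t.previousRuntimeState === t.newRuntimeState &&
    t.previousRuntimeReason === t.newRuntimeReason;
  return (
    unchanged &&
    t.primaryReason === "probe_failure" &&
    t.newRuntimeState === "alive"
  );
}
```

Applied to the trace above, the helper returns true, which is exactly the "probe failed but nothing changed" pattern this issue describes.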

What I expected

After N consecutive indeterminate probes (e.g. 5 minutes' worth) the runtime state should escalate to terminated and the probe should stop firing for that session, OR the session should be silently reconciled and removed from the poll set the way runtime.lost_detected does on startup.

Why it matters

  1. Log noise: every session that terminates keeps generating one of these events per minute. Over a day, a single dead session writes ~1,440 lines (60 per hour × 24 hours) to the observability ndjson. Across a fleet of sessions that's the dominant log noise category.
  2. Confused state model: runtimeState = alive, runtimeReason = process_running for a session whose actual process is gone is misleading when triaging from logs.
  3. Wasted polling: each indeterminate probe spends compute on a probe that will never succeed.
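The log-volume figure in point 1 follows from the polling interval alone. A quick sketch of the arithmetic, with `probeLinesPerDay` as a hypothetical helper name:

```typescript
// Hypothetical helper: estimates how many probe-failure lines per day
// a set of dead sessions produces at a given polling interval.
function probeLinesPerDay(
  pollIntervalSeconds: number,
  deadSessions: number,
): number {
  const secondsPerDay = 24 * 60 * 60;
  const pollsPerDay = Math.floor(secondsPerDay / pollIntervalSeconds);
  return pollsPerDay * deadSessions;
}

// One dead session at the 60s interval reported in this issue.
console.log(probeLinesPerDay(60, 1)); // 1440
// The two sessions from the log excerpt (btbo-2, btbo-4).
console.log(probeLinesPerDay(60, 2)); // 2880
```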

Workaround I'm using

After a Mac restart kills the tmux server, I run ao start --reap-orphans --restore. That spawns a fresh lifecycle-manager, which correctly emits runtime.lost_detected for the stale sessions and clears them instead of leaving them in the indeterminate loop.

Environment

  • macOS Apple Silicon, @aoagents/ao v0.9.4
  • runtime: tmux, agent: claude-code, workspace: worktree
  • session lifecycle managed by lifecycle-manager in ~/.agent-orchestrator/c3c2ee38d54f-observability/processes/

Suggested fix shape

In whichever file owns agent.isProcessRunning / the poll loop, after N consecutive indeterminate results for the same sessionId, write a runtime.lost_detected event for that session and skip it on subsequent polls until it's explicitly restored. The 1-minute polling interval makes 5 consecutive indeterminates a sensible threshold.
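A minimal sketch of the shape I have in mind. I haven't seen the probe loop, so every name here (`ProbeResult`, `IndeterminateTracker`, `INDETERMINATE_THRESHOLD`, the `"mark_lost"` action) is hypothetical; only the threshold of 5 and the runtime.lost_detected event come from this issue.

```typescript
// All names are hypothetical; this is not taken from the ao codebase.
type ProbeResult = "running" | "terminated" | "indeterminate";
type Action = "keep_polling" | "mark_lost";

// 5 polls at the 60s interval is roughly 5 minutes of indeterminate results.
const INDETERMINATE_THRESHOLD = 5;

class IndeterminateTracker {
  private readonly threshold: number;
  private readonly streaks = new Map<string, number>();
  private readonly lost = new Set<string>();

  constructor(threshold: number = INDETERMINATE_THRESHOLD) {
    this.threshold = threshold;
  }

  // True once a session has been escalated; the poll loop should skip it.
  isSkipped(sessionId: string): boolean {
    return this.lost.has(sessionId);
  }

  // Record one probe result and tell the poll loop what to do next.
  record(sessionId: string, result: ProbeResult): Action {
    if (result !== "indeterminate") {
      // Any definitive answer resets the consecutive streak.
      this.streaks.delete(sessionId);
      return "keep_polling";
    }
    const streak = (this.streaks.get(sessionId) ?? 0) + 1;
    this.streaks.set(sessionId, streak);
    if (streak >= this.threshold) {
      this.lost.add(sessionId);
      this.streaks.delete(sessionId);
      // The caller would write runtime.lost_detected at this point.
      return "mark_lost";
    }
    return "keep_polling";
  }

  // Explicit restore puts the session back into the poll set.
  restore(sessionId: string): void {
    this.lost.delete(sessionId);
    this.streaks.delete(sessionId);
  }
}
```

The poll loop would check `isSkipped(sessionId)` before probing and write the runtime.lost_detected event when `record` returns `"mark_lost"`. Resetting the streak on any definitive result keeps a single flaky probe from counting toward the threshold.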

Happy to send a patch if a maintainer can point me at the file that owns the probe loop and the convention for the threshold constant.
