I was running Claude and Codex together and both agents were constantly joining and leaving channels. Nonstop cycle, every 15 seconds or so, in every channel.
Claude dug into it and applied the following fix, which has been stable all day now:
The root cause is commit 56371a2, which reduced _CRASH_TIMEOUT from 60 to 15. The wrapper sends a heartbeat every 5 seconds, so a 15s timeout tolerates only 3 missed beats. When an agent is busy (processing a response, running MCP tool calls, executing code), the wrapper's heartbeat thread falls behind. Codex is hit hardest, since it's a Node process that puts more pressure on the system, but Claude hits it too.
The comment on line 361 of app.py still says "60s" even though the value is 15:
```python
# Crash timeout: if a wrapper hasn't heartbeated for 60s,
# it's dead — deregister it to free the slot.
_CRASH_TIMEOUT = 15
```
Server log looks like this on repeat:
```
09:17:22 [app] INFO: Crash timeout: deregistering codex (no heartbeat for 15s)
09:17:37 [app] INFO: Crash timeout: deregistering codex-2 (no heartbeat for 15s)
09:17:52 [app] INFO: Crash timeout: deregistering codex (no heartbeat for 15s)
09:18:22 [app] INFO: Crash timeout: deregistering claude (no heartbeat for 15s)
```
Reverting to 60 (12 missed heartbeats) fixed it. Clean shutdown still deregisters immediately via /api/deregister, so the only effect of the longer timeout is tolerating agents that are busy working.
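To make the arithmetic concrete, here is a rough sketch of how a staleness check behaves at the two timeout values. This is not the actual code in app.py: `stale_agents`, `tolerated_intervals`, and the `last_seen` snapshot are hypothetical names and data made up for illustration. Only the 5s heartbeat interval and the 15/60 timeouts come from the report above.

```python
HEARTBEAT_INTERVAL = 5  # seconds between wrapper heartbeats (from the report)


def stale_agents(last_seen, now, timeout):
    """Return the names whose last heartbeat is older than `timeout` seconds."""
    return sorted(name for name, ts in last_seen.items() if now - ts > timeout)


def tolerated_intervals(timeout, interval=HEARTBEAT_INTERVAL):
    """Number of whole heartbeat intervals that fit inside the timeout."""
    return timeout // interval


# Hypothetical snapshot: codex was busy and last heartbeated 20s ago,
# claude heartbeated 4s ago.
now = 1000.0
last_seen = {"claude": now - 4, "codex": now - 20}

print(stale_agents(last_seen, now, timeout=15))  # codex gets deregistered
print(stale_agents(last_seen, now, timeout=60))  # nobody gets deregistered
print(tolerated_intervals(15), tolerated_intervals(60))
```

With a 15s timeout, a single 20s stall in the heartbeat thread is enough to drop a healthy agent; at 60s the same stall is absorbed.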
Environment:
- agentchattr v0.3.2 (3e71d42)
- WSL2 (Ubuntu 24.04) on Windows
- Python 3.12.3, Node v24.14.0
- Claude Code 2.1.83, Codex CLI 0.116.0