Problem
When an ACP subprocess dies unexpectedly (crash, OOM, manual kill), weclaw continues dispatching messages to the dead process. Every subsequent message fails with:
session error: write to stdin: write |1: broken pipe
The only recovery is weclaw restart, which kills the entire process and drops all active sessions. There is no automatic detection or recovery.
Additionally, ACP sessions live forever. Over extended conversations, cachedReadTokens grows unboundedly (observed 52K → 136K+ in a single session), degrading response quality and increasing cost. There is no way to configure an idle timeout.
Proposal
1. ACP subprocess health check with lazy respawn
In agent/acp_agent.go, after spawning the ACP subprocess:
- Start a goroutine that calls cmd.Wait() on the child process
- On unexpected exit (non-zero or signal), remove the session from the session map and log a warning
- On the next incoming message, getOrCreateSession() will naturally spawn a fresh subprocess (lazy respawn)
- Add a max_respawn config (default 3) with a cooldown window (e.g. 5 min) to prevent infinite restart loops
Pseudocode:
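go func() {
    // Wait blocks until the child exits, then reaps it.
    err := cmd.Wait()
    if err != nil {
        // Non-nil err: non-zero exit status or termination by signal.
        log.Warnf("ACP subprocess pid=%d exited: %v", cmd.Process.Pid, err)
    }
    // Drop the session so the next message spawns a fresh subprocess.
    sessionManager.Remove(sessionID)
}()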
This is how wechat-acp handles it: it listens for process.exit and rebuilds the session on the next message.
2. Idle session timeout
Add two optional fields to config.json:
{
  "idle_timeout": "30m",
  "max_sessions": 10
}
idle_timeout (default: "0" = disabled):
- Each session tracks a lastActiveAt timestamp, updated on every incoming message
- A periodic ticker (e.g. every 60s) iterates sessions and kills + removes any that exceed idle_timeout
- Alternatively, check lazily on each new message (simpler, but less precise)
max_sessions (default: 0 = unlimited):
- When a new session would exceed the limit, evict the least-recently-active session
- Prevents resource exhaustion on shared deployments
Why both?
Health check (1) handles the crash case — subprocess dies, weclaw recovers automatically. Idle timeout (2) handles the drift case — session is alive but stale, context is polluted, tokens are wasted. Together they make weclaw's ACP lifecycle management production-ready.
Current workaround
We're running an external bash watchdog that monitors weclaw.log for idle gaps and calls weclaw restart — functional but blunt (kills all sessions, not just the stale one). Native support would be significantly better.
Related
- /new command (PR #14, "Add /new and /clear commands to reset agent session") handles user-initiated reset but not automated lifecycle
- weclaw restart is the only recovery path for dead subprocesses today