Feature: ACP subprocess health check + idle session timeout #40

@Gemini-Nick

Problem

When an ACP subprocess dies unexpectedly (crash, OOM, manual kill), weclaw continues dispatching messages to the dead process. Every subsequent message fails with:

session error: write to stdin: write |1: broken pipe

The only recovery is weclaw restart, which kills the entire process and drops all active sessions. There is no automatic detection or recovery.

Additionally, ACP sessions never expire. Over extended conversations, cachedReadTokens grows without bound (observed 52K → 136K+ in a single session), degrading response quality and increasing cost. There is no way to configure an idle timeout.

Proposal

1. ACP subprocess health check with lazy respawn

In agent/acp_agent.go, after spawning the ACP subprocess:

  • Start a goroutine that calls cmd.Wait() on the child process
  • On unexpected exit (non-zero or signal), remove the session from the session map and log a warning
  • On the next incoming message, getOrCreateSession() will naturally spawn a fresh subprocess (lazy respawn)
  • Add a max_respawn config (default 3) with a cooldown window (e.g. 5 min) to prevent infinite restart loops

Pseudocode:

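// Reap the child in the background so a dead subprocess is noticed promptly.
go func() {
    pid := cmd.Process.Pid // valid because cmd.Start() has already succeeded
    if err := cmd.Wait(); err != nil {
        // Non-zero exit status or termination by a signal.
        log.Warnf("ACP subprocess pid=%d exited unexpectedly: %v", pid, err)
    }
    // Drop the session either way so the next message spawns a fresh one.
    sessionManager.Remove(sessionID)
}()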

This is how wechat-acp handles it: it listens for process.exit and rebuilds the session on the next message.
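
The max_respawn limit with a cooldown window could be enforced with a small sliding-window counter. The following is only a sketch: respawnLimiter, newRespawnLimiter, allow and simulate are hypothetical names, not existing weclaw code, and the limiter is not safe for concurrent use on its own (it assumes the caller already holds whatever lock guards the session map).

```go
package main

import (
	"fmt"
	"time"
)

// respawnLimiter permits at most max respawns within a sliding window.
// Hypothetical helper for illustration; not part of weclaw.
type respawnLimiter struct {
	max    int
	window time.Duration
	events []time.Time // timestamps of recent respawns, oldest first
}

func newRespawnLimiter(max int, window time.Duration) *respawnLimiter {
	return &respawnLimiter{max: max, window: window}
}

// allow reports whether a respawn at time now fits the budget and,
// if it does, records it.
func (l *respawnLimiter) allow(now time.Time) bool {
	cutoff := now.Add(-l.window)
	// Drop respawns that have aged out of the window.
	i := 0
	for i < len(l.events) && !l.events[i].After(cutoff) {
		i++
	}
	l.events = l.events[i:]
	if len(l.events) >= l.max {
		return false
	}
	l.events = append(l.events, now)
	return true
}

// simulate replays four respawns one second apart, then one more after
// the window has passed, and returns the limiter's decision for each.
func simulate() []bool {
	l := newRespawnLimiter(3, 5*time.Minute)
	start := time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)
	var out []bool
	for i := 0; i < 4; i++ {
		out = append(out, l.allow(start.Add(time.Duration(i)*time.Second)))
	}
	out = append(out, l.allow(start.Add(6*time.Minute)))
	return out
}

func main() {
	fmt.Println(simulate()) // [true true true false true]
}
```

When allow returns false, getOrCreateSession() could return an error to the caller instead of spawning, which breaks the restart loop until the window clears.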

2. Idle session timeout

Add two optional fields to config.json:

{
  "idle_timeout": "30m",
  "max_sessions": 10
}
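
One detail worth noting if this is implemented in Go: encoding/json does not decode a string such as "30m" into a time.Duration by itself, so a small wrapper type is needed. A minimal sketch follows, assuming the two fields sit at the top level of config.json; Duration, Config and parseConfig are hypothetical names, and the rest of weclaw's config layout is not shown in this issue.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Duration wraps time.Duration so it can be decoded from a JSON string
// such as "30m". Hypothetical type for illustration.
type Duration time.Duration

// UnmarshalJSON parses the string form with time.ParseDuration,
// which also accepts the bare string "0".
func (d *Duration) UnmarshalJSON(b []byte) error {
	var s string
	if err := json.Unmarshal(b, &s); err != nil {
		return fmt.Errorf("duration must be a JSON string: %w", err)
	}
	v, err := time.ParseDuration(s)
	if err != nil {
		return err
	}
	if v < 0 {
		return fmt.Errorf("duration must not be negative: %q", s)
	}
	*d = Duration(v)
	return nil
}

// Config holds only the two proposed fields.
type Config struct {
	IdleTimeout Duration `json:"idle_timeout"` // zero means disabled
	MaxSessions int      `json:"max_sessions"` // zero means unlimited
}

// parseConfig decodes and validates the proposed fields. Absent fields
// keep their zero values, which match the proposed defaults.
func parseConfig(data []byte) (Config, error) {
	var c Config
	if err := json.Unmarshal(data, &c); err != nil {
		return Config{}, err
	}
	if c.MaxSessions < 0 {
		return Config{}, fmt.Errorf("max_sessions must not be negative: %d", c.MaxSessions)
	}
	return c, nil
}

func main() {
	c, err := parseConfig([]byte(`{"idle_timeout": "30m", "max_sessions": 10}`))
	if err != nil {
		panic(err)
	}
	fmt.Println(time.Duration(c.IdleTimeout), c.MaxSessions) // 30m0s 10
}
```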

idle_timeout (default: "0" = disabled):

  • Each session tracks a lastActiveAt timestamp, updated on every incoming message
  • A periodic ticker (e.g. every 60s) iterates over the sessions, then kills and removes any that have been idle longer than idle_timeout
  • Alternatively, check lazily on each new message (simpler, but less precise)
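
The ticker variant could look roughly like the sketch below. It is an illustration under assumptions, not weclaw's actual session code: sessionManager, session, touch, sweep and run are hypothetical names, and kill stands in for whatever stops the ACP subprocess. The sweep takes the current time as a parameter so it can be exercised without waiting for a real ticker.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
	"time"
)

// session is a stand-in for an ACP session; kill is assumed to stop
// its subprocess.
type session struct {
	lastActiveAt time.Time
	kill         func()
}

type sessionManager struct {
	mu          sync.Mutex
	sessions    map[string]*session
	idleTimeout time.Duration // zero means disabled
}

func newSessionManager(idleTimeout time.Duration) *sessionManager {
	return &sessionManager{sessions: make(map[string]*session), idleTimeout: idleTimeout}
}

// touch records activity for id, creating the session if needed.
func (m *sessionManager) touch(id string, now time.Time, kill func()) {
	m.mu.Lock()
	defer m.mu.Unlock()
	s, ok := m.sessions[id]
	if !ok {
		s = &session{kill: kill}
		m.sessions[id] = s
	}
	s.lastActiveAt = now
}

// sweep removes every session idle for longer than idleTimeout and
// returns the removed IDs in sorted order.
func (m *sessionManager) sweep(now time.Time) []string {
	if m.idleTimeout <= 0 {
		return nil
	}
	var ids []string
	var expired []*session
	m.mu.Lock()
	for id, s := range m.sessions {
		if now.Sub(s.lastActiveAt) > m.idleTimeout {
			delete(m.sessions, id) // deleting during range is safe in Go
			ids = append(ids, id)
			expired = append(expired, s)
		}
	}
	m.mu.Unlock()
	// Kill outside the lock so a slow shutdown does not block messages.
	for _, s := range expired {
		s.kill()
	}
	sort.Strings(ids)
	return ids
}

// run sweeps on every tick until stop is closed.
func (m *sessionManager) run(stop <-chan struct{}, interval time.Duration) {
	t := time.NewTicker(interval)
	defer t.Stop()
	for {
		select {
		case now := <-t.C:
			m.sweep(now)
		case <-stop:
			return
		}
	}
}

// simulate touches two sessions 20 minutes apart and sweeps at +35m
// with a 30m timeout, so only the older one should be evicted.
func simulate() (evicted []string, remaining int) {
	m := newSessionManager(30 * time.Minute)
	t0 := time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)
	noop := func() {}
	m.touch("a", t0, noop)
	m.touch("b", t0.Add(20*time.Minute), noop)
	evicted = m.sweep(t0.Add(35 * time.Minute))
	return evicted, len(m.sessions)
}

func main() {
	ev, n := simulate()
	fmt.Println(ev, n) // [a] 1
}
```

In a real process, run would be started once in its own goroutine, for example `go m.run(stop, 60*time.Second)`. The lazy alternative would call the same sweep logic from the message path instead.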

max_sessions (default: 0 = unlimited):

  • When a new session would exceed the limit, evict the least-recently-active session
  • Prevents resource exhaustion on shared deployments
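
Eviction of the least-recently-active session could be done with a linear scan over the session map, which is likely adequate for limits on the order of 10. The sketch below is an assumption about how it might fit into getOrCreateSession(); pool, session, getOrCreate and simulate are hypothetical names, locking is omitted for brevity, and a real implementation would also stop the evicted session's subprocess.

```go
package main

import (
	"fmt"
	"time"
)

// session is a stand-in for an ACP session.
type session struct {
	lastActiveAt time.Time
}

// pool caps the number of live sessions. Not safe for concurrent use;
// the caller is assumed to hold the session-map lock.
type pool struct {
	max      int // zero means unlimited
	sessions map[string]*session
}

// getOrCreate returns the session for id and refreshes its timestamp.
// If creating it would exceed max, the least-recently-active session is
// removed first and its ID is returned in evicted.
func (p *pool) getOrCreate(id string, now time.Time) (s *session, evicted string) {
	if existing, ok := p.sessions[id]; ok {
		existing.lastActiveAt = now
		return existing, ""
	}
	if p.max > 0 && len(p.sessions) >= p.max {
		found := false
		var oldest time.Time
		for k, v := range p.sessions {
			if !found || v.lastActiveAt.Before(oldest) {
				evicted, oldest, found = k, v.lastActiveAt, true
			}
		}
		// A real implementation would also kill the subprocess here.
		delete(p.sessions, evicted)
	}
	s = &session{lastActiveAt: now}
	p.sessions[id] = s
	return s, evicted
}

// simulate fills a pool of size 2, refreshes the first session, then
// adds a third; the untouched second session should be evicted.
func simulate() (evicted string, live int) {
	p := &pool{max: 2, sessions: make(map[string]*session)}
	t0 := time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)
	p.getOrCreate("a", t0)
	p.getOrCreate("b", t0.Add(1*time.Minute))
	p.getOrCreate("a", t0.Add(2*time.Minute)) // a is now more recent than b
	_, evicted = p.getOrCreate("c", t0.Add(3*time.Minute))
	return evicted, len(p.sessions)
}

func main() {
	ev, n := simulate()
	fmt.Println(ev, n) // b 2
}
```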

Why both?

The health check (1) handles the crash case: the subprocess dies and weclaw recovers automatically. The idle timeout (2) handles the drift case: the session is alive but stale, its context is polluted, and tokens are wasted. Together they make weclaw's ACP lifecycle management production-ready.

Current workaround

We're running an external bash watchdog that monitors weclaw.log for idle gaps and calls weclaw restart — functional but blunt (kills all sessions, not just the stale one). Native support would be significantly better.

Related
