Feature: ACP subprocess health check + idle session timeout #40

## Problem

When an ACP subprocess dies unexpectedly (crash, OOM, manual kill), weclaw continues dispatching messages to the dead process. Every subsequent message fails with:

```
session error: write to stdin: write |1: broken pipe
```

The only recovery is `weclaw restart`, which kills the entire process and drops all active sessions. There is no automatic detection or recovery.

Additionally, ACP sessions live forever. Over extended conversations, `cachedReadTokens` grows without bound (observed 52K → 136K+ in a single session), degrading response quality and increasing cost. There is currently no way to configure an idle timeout.

## Proposal

### 1. ACP subprocess health check with lazy respawn

In `agent/acp_agent.go`, after spawning the ACP subprocess:

- Start a goroutine that calls `cmd.Wait()` on the child process
- On unexpected exit (non-zero or signal), remove the session from the session map and log a warning
- On the next incoming message, `getOrCreateSession()` will naturally spawn a fresh subprocess (lazy respawn)
- Add a `max_respawn` config (default 3) with a cooldown window (e.g. 5 min) to prevent infinite restart loops

Pseudocode:

```go
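// Assumes cmd.Start() has already succeeded, so cmd.Process is non-nil.
go func() {
    // Wait blocks until the child exits; a non-nil error here includes
    // a non-zero exit status or death by signal.
    err := cmd.Wait()
    if err != nil {
        log.Warnf("ACP subprocess pid=%d exited: %v", cmd.Process.Pid, err)
    }
    // Drop the session either way so getOrCreateSession() respawns on the next message.
    sessionManager.Remove(sessionID)
}()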
```

This is how [wechat-acp](https://github.com/anthropics/wechat-acp) handles it: it listens for `process.exit` and rebuilds the session on the next message.

### 2. Idle session timeout

Add two optional fields to `config.json`:

```json
{
  "idle_timeout": "30m",
  "max_sessions": 10
}
```

**`idle_timeout`** (default: `"0"` = disabled):
- Each session tracks a `lastActiveAt` timestamp, updated on every incoming message
- A periodic ticker (e.g. every 60s) iterates sessions and kills + removes any that exceed `idle_timeout`
- Alternatively, check lazily on each new message (simpler, but less precise)

**`max_sessions`** (default: `0` = unlimited):
- When a new session would exceed the limit, evict the least-recently-active session
- Prevents resource exhaustion on shared deployments

### Why both?

Health check (1) handles the crash case: the subprocess dies and weclaw recovers automatically. Idle timeout (2) handles the drift case: the session is alive but stale, its context is polluted, and tokens are wasted. Together they make weclaw's ACP lifecycle management production-ready.

### Current workaround

We're running an external bash watchdog that monitors `weclaw.log` for idle gaps and calls `weclaw restart`. It works but is blunt: it kills all sessions, not just the stale one. Native support would be significantly better.

### Related

- `/new` command (PR #14) handles user-initiated reset but not automated lifecycle
- `weclaw restart` is the only recovery path for dead subprocesses today
