ops(swarm): jared-box worker pool was down; queue backlog 110, copilot agents starved #1481

## Incident Summary

**Detected by:** studio-qa (scheduled run 2026-03-30 ~11:00 UTC)
**Severity:** P0 (worker pool down, queue growing)
**Status:** MITIGATED — workers restarted, queue draining

---

## Root Cause

All 32 worker PIDs in `~/.agentguard/worker.pids` on **jared-box** were stale: they pointed at dead processes left over from a prior deployment. With no live workers to drain it, the queue grew unchecked.

`check-swarm-health.sh` did NOT catch this: it checks log freshness, not worker liveness. On this host the script skipped all 135 agents because `server/logs/` is not present (readybench holds the logs for most agents).
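
The gap described above can be illustrated with a minimal sketch of a PID-file liveness check. It assumes `worker.pids` holds one numeric PID per line, which this report does not confirm, and the function names `count_live_workers` and `workers_healthy` are hypothetical, not part of `check-swarm-health.sh`.

```shell
#!/usr/bin/env bash
# Sketch only: count how many PIDs listed in a PID file are still alive.
# Assumption: one numeric PID per line (actual worker.pids format unverified).

count_live_workers() {
  local pid_file="$1" pid live=0
  # A missing or unreadable PID file counts as zero live workers.
  [ -r "$pid_file" ] || { echo 0; return 0; }
  while read -r pid || [ -n "$pid" ]; do
    # Skip blank or non-numeric lines.
    case "$pid" in ''|*[!0-9]*) continue ;; esac
    # kill -0 sends no signal; it only tests whether the PID can be signalled.
    # Caveats: a recycled PID looks alive, and a process owned by another
    # user looks dead to an unprivileged caller.
    if kill -0 "$pid" 2>/dev/null; then
      live=$((live + 1))
    fi
  done < "$pid_file"
  echo "$live"
}

# Hypothetical wrapper: succeeds only when at least one worker is alive,
# independent of whether any log directory exists on the host.
workers_healthy() {
  [ "$(count_live_workers "$1")" -gt 0 ]
}
```

A health script could call `workers_healthy "$HOME/.agentguard/worker.pids"` before its log-freshness pass and alert on a non-zero exit, so a host without `server/logs/` is still covered.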

## Impact

| Metric | Value |
|--------|-------|
| Queue depth at detection | 110 items |
| Stale worker PIDs | 32 (all dead) |
| Copilot events in telemetry | 0 of 368,489 total events |
| Duration of outage | Unknown — stale PIDs from previous deployment cycle |

**Copilot event pipeline impact:** `verify-copilot-events.sh` returned `PARTIAL` — 43 copilot agents are registered and have SKILL.md files, but zero log files exist and zero copilot events appear in the telemetry API. These agents were enqueued on jared-box but never executed due to the dead worker pool. This is a **conference demo dependency** (issue #41, May 6).

## Mitigation (applied)

Cleared the stale PID file and started 32 fresh worker processes via `server/worker.sh`. Queue depth dropped from 110 to 70 within 3 seconds of the restart, confirming that the queue is draining.
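
For future incidents where only some workers have died, clearing the whole file may be too blunt. The following is a rough sketch of pruning just the dead entries, again assuming one numeric PID per line. `prune_stale_pids` is a hypothetical helper, not something `server/worker.sh` is known to provide, and it does not start replacement workers.

```shell
#!/usr/bin/env bash
# Sketch only: rewrite a PID file so it keeps just the PIDs that are alive.
# Assumption: one numeric PID per line.

prune_stale_pids() {
  local pid_file="$1" tmp_file pid
  # Nothing to prune if the file does not exist.
  [ -f "$pid_file" ] || return 0
  tmp_file="$(mktemp)" || return 1
  while read -r pid || [ -n "$pid" ]; do
    # Drop blank or non-numeric lines.
    case "$pid" in ''|*[!0-9]*) continue ;; esac
    # Keep the entry only if the process can still be signalled.
    if kill -0 "$pid" 2>/dev/null; then
      printf '%s\n' "$pid" >> "$tmp_file"
    fi
  done < "$pid_file"
  # Overwrite in place so the original file's owner and mode are preserved.
  cat "$tmp_file" > "$pid_file"
  rm -f "$tmp_file"
}
```

In this incident every entry was dead, so pruning would have left an empty file, the same outcome as the manual clear.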

## Follow-up Required

1. **Add a worker liveness check to `check-swarm-health.sh`**: cross-reference `worker.pids` against live processes, and alert if all workers are dead even when log dirs are absent.
2. **Add systemd supervision for jared-box workers**: `agentguard-worker@.service` exists but is inactive. Enable it so workers auto-restart on crash.
3. **Investigate the root cause of the original worker death**: was it a deploy, a crash, or a reboot? Check `journalctl` for the old PIDs `1890563`–`1890604`.
4. **Verify copilot events after recovery**: re-run `scripts/verify-copilot-events.sh` in ~2 hours to confirm that copilot-driven agents are producing telemetry events.
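
For item 3, one possible way to query the journal is to match on the `_PID` field. `journalctl` treats several matches on the same field as alternatives, so a run of `_PID=` arguments selects entries from any of the listed PIDs. The helper name `pid_match_args` is hypothetical. This also assumes the workers logged to the journal at all and that those entries have not been rotated out.

```shell
#!/usr/bin/env bash
# Sketch only: emit one journalctl "_PID=N" match per PID in an inclusive range.

pid_match_args() {
  local first="$1" last="$2" pid
  pid="$first"
  while [ "$pid" -le "$last" ]; do
    printf '_PID=%s\n' "$pid"
    pid=$((pid + 1))
  done
}

# Example invocation (not run here). The substitution is left unquoted on
# purpose so that each match becomes its own argument:
#   journalctl --no-pager $(pid_match_args 1890563 1890604)
```

The range `1890563`–`1890604` spans 42 PIDs while only 32 workers were listed, so some matches may belong to unrelated processes. Reading the exact PIDs from a saved copy of the stale `worker.pids`, if one exists, would be more precise.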

## Readybench Status

Readybench workers are healthy — 32 live `worker.sh` processes confirmed running.

---
*Filed by studio-qa during scheduled infrastructure health run*
