ops(swarm): jared-box worker pool was down — queue backlog 110, copilot agents starved #1481
Description
Incident Summary
Detected by: studio-qa (scheduled run 2026-03-30 ~11:00 UTC)
Severity: P0 (worker pool down, queue growing)
Status: MITIGATED — workers restarted, queue draining
Root Cause
All 32 worker PIDs in ~/.agentguard/worker.pids on jared-box were stale (dead processes from a prior deployment). The queue grew unchecked because no workers were alive to drain it.
check-swarm-health.sh did NOT catch this — it checks log freshness but not worker liveness. The script skipped all 135 agents because server/logs/ was not present on this host (readybench holds logs for most agents).
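The gap described above (log freshness checked, worker liveness not) could be closed with a check along the following lines. This is a minimal sketch, not the contents of check-swarm-health.sh: the function names `count_live_workers` and `check_worker_liveness` are hypothetical, and it assumes the PID file holds one numeric PID per line and that the check runs as the same user as the workers.

```shell
#!/usr/bin/env bash
# Sketch: count how many PIDs listed in a PID file belong to live processes.
# Assumptions (not confirmed by the incident report): one numeric PID per
# line, and the checker runs as the same user as the workers, since kill -0
# also fails for processes owned by another user.

count_live_workers() {
  local pidfile="$1"
  local pid
  local live=0
  # A missing or unreadable PID file counts as zero live workers.
  [ -r "$pidfile" ] || { echo 0; return 0; }
  while IFS= read -r pid || [ -n "$pid" ]; do
    # Skip blank lines, non-numeric lines, and PID 0 (the caller's own group).
    case "$pid" in
      ''|*[!0-9]*|0) continue ;;
    esac
    # kill -0 delivers no signal; it only tests whether the PID can be signalled.
    if kill -0 "$pid" 2>/dev/null; then
      live=$((live + 1))
    fi
  done < "$pidfile"
  echo "$live"
}

# Returns non-zero when the PID file lists no live worker at all,
# independent of whether any log directory exists on the host.
check_worker_liveness() {
  local pidfile="$1"
  local live
  live=$(count_live_workers "$pidfile")
  if [ "$live" -eq 0 ]; then
    echo "CRITICAL: no live workers listed in $pidfile" >&2
    return 1
  fi
  echo "OK: $live live worker(s) listed in $pidfile"
}
```

On jared-box this would be called as `check_worker_liveness ~/.agentguard/worker.pids`. Because it never touches server/logs/, it would still fail loudly on a host where the log directory is absent.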
Impact
| Metric | Value |
|---|---|
| Queue depth at detection | 110 items |
| Stale worker PIDs | 32 (all dead) |
| Copilot events in telemetry | 0 of 368,489 total events |
| Duration of outage | Unknown — stale PIDs from previous deployment cycle |
Copilot event pipeline impact: verify-copilot-events.sh returned PARTIAL — 43 copilot agents are registered and have SKILL.md files, but zero log files exist and zero copilot events appear in the telemetry API. These agents were enqueued on jared-box but never executed due to the dead worker pool. This is a conference demo dependency (issue #41, May 6).
Mitigation (applied)
Cleared stale PID file and started 32 fresh worker processes via server/worker.sh. Queue dropped 110 → 70 within 3 seconds of restart (confirmed draining).
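The drain figures above translate into a rough rate and time-to-empty. The sketch below is illustrative only: `drain_eta` is a hypothetical helper, and it assumes a constant drain rate with no new items arriving, which a live queue will not guarantee.

```shell
#!/usr/bin/env bash
# Sketch: estimate drain rate and remaining time from two queue-depth samples.
# Assumes a constant drain rate and no new arrivals between samples.
drain_eta() {
  local before="$1"
  local after="$2"
  local seconds="$3"
  awk -v b="$before" -v a="$after" -v s="$seconds" 'BEGIN {
    # Refuse to estimate when the queue did not shrink or the window is empty.
    if (s <= 0 || b <= a) { print "not draining"; exit 1 }
    rate = (b - a) / s
    printf "%.2f items/s, ~%.2f s to empty\n", rate, a / rate
  }'
}

# Figures from this incident: 110 -> 70 items in 3 seconds.
drain_eta 110 70 3   # prints "13.33 items/s, ~5.25 s to empty"
```

A second sample taken a minute later gives a more trustworthy rate than the first three seconds after a restart.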
Follow-up Required
- Add worker liveness check to check-swarm-health.sh: cross-reference worker.pids against live processes; alert if all workers are dead even when log dirs are absent.
- Add systemd supervision for jared-box workers: agentguard-worker@.service exists but is inactive. Enable it so workers auto-restart on crash.
- Investigate root cause of original worker death: was it a deploy, a crash, a reboot? Check journalctl for old PIDs 1890563–1890604.
- Verify copilot events after recovery: re-run scripts/verify-copilot-events.sh in ~2 hours to confirm copilot-driven agents are producing telemetry events.
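For the root-cause follow-up, the journal lookup over the old PID range could be sketched as below. The helper names `pid_matches` and `journal_for_pid_range` are hypothetical. The sketch relies on journalctl treating repeated matches on the same field as alternatives, and it only finds entries that the workers themselves logged to journald; if they were started outside systemd and wrote only to files, this may return nothing. PIDs can also be reused, so results should be read alongside timestamps.

```shell
#!/usr/bin/env bash
# Sketch: build journalctl match arguments for an inclusive PID range.

# Print one _PID=<n> match per line for every PID from first to last.
pid_matches() {
  local first="$1"
  local last="$2"
  local pid="$first"
  while [ "$pid" -le "$last" ]; do
    printf '_PID=%s\n' "$pid"
    pid=$((pid + 1))
  done
}

# Show journal entries logged by any PID in the range.
# Word splitting of the command substitution is intended: each line
# becomes one match argument, and same-field matches are ORed together.
journal_for_pid_range() {
  # shellcheck disable=SC2046
  journalctl --no-pager $(pid_matches "$1" "$2")
}
```

For this incident the call would be `journal_for_pid_range 1890563 1890604`; adding `--since` with the suspected deployment date would narrow the output.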
Readybench Status
Readybench workers are healthy — 32 live worker.sh processes confirmed running.
Filed by studio-qa during scheduled infrastructure health run