
ops(swarm): jared-box worker pool was down — queue backlog 110, copilot agents starved #1481

@jpleva91


Incident Summary

Detected by: studio-qa (scheduled run 2026-03-30 ~11:00 UTC)
Severity: P0 (worker pool down, queue growing)
Status: MITIGATED — workers restarted, queue draining


Root Cause

All 32 worker PIDs in ~/.agentguard/worker.pids on jared-box were stale (dead processes from a prior deployment). The queue grew unchecked because no workers were alive to drain it.

check-swarm-health.sh did NOT catch this — it checks log freshness but not worker liveness. The script skipped all 135 agents because server/logs/ was not present on this host (readybench holds logs for most agents).
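The gap described above (log freshness checked, worker liveness not) could be closed with a check along these lines. This is a minimal sketch, not the actual contents of check-swarm-health.sh: the function name `count_live_pids` is hypothetical, and it assumes worker.pids holds one numeric PID per line, which the report does not confirm.

```shell
# Hypothetical helper: count how many PIDs in a PID file are still alive.
# Assumption: the file holds one numeric PID per line.
count_live_pids() {
    pid_file=$1
    live=0
    # A missing or unreadable PID file means no known live workers.
    [ -r "$pid_file" ] || { echo 0; return 0; }
    while IFS= read -r pid || [ -n "$pid" ]; do
        case $pid in
            ''|*[!0-9]*) continue ;;   # skip blank or non-numeric lines
        esac
        # kill -0 delivers no signal; it only tests whether the PID exists.
        # It also fails for processes owned by another user, and a reused
        # PID would count as live, so treat the result as a heuristic.
        if kill -0 "$pid" 2>/dev/null; then
            live=$((live + 1))
        fi
    done < "$pid_file"
    echo "$live"
}

# Possible use inside a health check (path taken from the report):
#   if [ "$(count_live_pids "$HOME/.agentguard/worker.pids")" -eq 0 ]; then
#       echo "ALERT: no live workers" >&2
#   fi
```

Because the check reads only the PID file and the process table, it would still run on a host where server/logs/ is absent.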

Impact

| Metric | Value |
| --- | --- |
| Queue depth at detection | 110 items |
| Stale worker PIDs | 32 (all dead) |
| Copilot events in telemetry | 0 of 368,489 total events |
| Duration of outage | Unknown (stale PIDs date from a previous deployment cycle) |

Copilot event pipeline impact: verify-copilot-events.sh returned PARTIAL — 43 copilot agents are registered and have SKILL.md files, but zero log files exist and zero copilot events appear in the telemetry API. These agents were enqueued on jared-box but never executed due to the dead worker pool. This is a conference demo dependency (issue #41, May 6).

Mitigation (applied)

Cleared stale PID file and started 32 fresh worker processes via server/worker.sh. Queue dropped 110 → 70 within 3 seconds of restart (confirmed draining).
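For reference, the mitigation can be sketched as a small helper that truncates the PID file and records a fresh PID for each worker it launches. The function name `start_workers` is hypothetical, and the usage line assumes server/worker.sh runs a single worker in the foreground. The report does not say how that script is actually invoked.

```shell
# Hypothetical helper: start COUNT background copies of a command and
# record their PIDs, replacing whatever the PID file held before.
# Usage: start_workers PID_FILE COUNT CMD [ARGS...]
start_workers() {
    pid_file=$1
    count=$2
    shift 2
    : > "$pid_file"                # truncate: drops every stale entry
    i=0
    while [ "$i" -lt "$count" ]; do
        # Output is discarded in this sketch; a real deployment would
        # redirect it to a per-worker log instead.
        "$@" >/dev/null 2>&1 &
        echo "$!" >> "$pid_file"   # record the new PID, one per line
        i=$((i + 1))
    done
}

# Assumed invocation, mirroring the numbers in the report:
#   start_workers "$HOME/.agentguard/worker.pids" 32 server/worker.sh
```

Truncating before launching means the file never mixes old and new PIDs, which is the state that hid this outage.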

Follow-up Required

  1. Add worker liveness check to check-swarm-health.sh: cross-reference worker.pids against live processes and alert if all workers are dead, even when log dirs are absent.
  2. Add systemd supervision for jared-box workers: agentguard-worker@.service exists but is inactive. Enable it so workers auto-restart on crash.
  3. Investigate root cause of original worker death: was it a deploy, a crash, or a reboot? Check journalctl for old PIDs 1890563–1890604.
  4. Verify copilot events after recovery: re-run scripts/verify-copilot-events.sh in ~2 hours to confirm copilot-driven agents are producing telemetry events.
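The systemd and journalctl follow-ups above might look roughly like this. It is a sketch under assumptions: the instance names 1..N for the agentguard-worker@.service template are a guess, and both helper names are hypothetical. Enabling a unit only starts it at boot. Restarting after a crash also requires a `Restart=` setting (for example `Restart=on-failure`) in the unit file, which the report does not show.

```shell
# Hypothetical helper: print one template-unit instance name per worker.
# Assumption: instances are simply numbered 1..COUNT.
worker_unit_names() {
    count=$1
    i=1
    while [ "$i" -le "$count" ]; do
        printf 'agentguard-worker@%s.service\n' "$i"
        i=$((i + 1))
    done
}

# Hypothetical helper: show journal entries logged by one PID.
# _PID= is a standard journal field match.
journal_for_pid() {
    journalctl _PID="$1" --no-pager
}

# Assumed usage on jared-box (add --user if these are user units):
#   systemctl enable --now $(worker_unit_names 32)
#   journal_for_pid <old-pid>
```

Journal entries for the old PIDs are only available if the workers logged to the journal and the entries have not been rotated out.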

Readybench Status

Readybench workers are healthy — 32 live worker.sh processes confirmed running.


Filed by studio-qa during scheduled infrastructure health run
