From 40a1533e502305acade280e0f680ecbcd47fd06c Mon Sep 17 00:00:00 2001 From: fitz123 Date: Sun, 19 Apr 2026 15:28:35 +0300 Subject: [PATCH] feat: safe scripted bot restart (#100) Single supported restart entry point for the launchd-managed bot. - bot/scripts/restart-bot.sh - Default mode: graceful SIGTERM, waits up to 60s drain, returns new PID - --plist mode: race-free bootout+bootstrap that reflects on-disk plist - Pre-restart config validation aborts without sending SIGTERM - Never SIGKILL, never kickstart -k - Tolerates variable drain time without fixed sleep - HOME fallback guard for restricted environments - bot/src/__tests__/restart-bot.test.ts: 13 tests covering both modes, slow-shutdown drain (5s), broken-config abort, malformed plist - .claude/rules/platform/bot-operations.md + bot-operations skill: operators use restart-bot.sh instead of raw launchctl - README.md: operator docs point at the script Closes #100 Co-Authored-By: Claude Opus 4.7 --- .claude/rules/platform/bot-operations.md | 15 +- .claude/skills/bot-operations/SKILL.md | 18 +- README.md | 14 +- bot/scripts/restart-bot.sh | 271 +++++++++++ bot/src/__tests__/restart-bot.test.ts | 452 ++++++++++++++++++ .../2026-04-18-issue-100-safe-restart.md | 166 +++++++ .../2026-04-18-issue-100-safe-restart.md | 122 +++++ 7 files changed, 1037 insertions(+), 21 deletions(-) create mode 100755 bot/scripts/restart-bot.sh create mode 100644 bot/src/__tests__/restart-bot.test.ts create mode 100644 docs/plans/2026-04-18-issue-100-safe-restart.md create mode 100644 docs/plans/completed/2026-04-18-issue-100-safe-restart.md diff --git a/.claude/rules/platform/bot-operations.md b/.claude/rules/platform/bot-operations.md index 82a64e3..396e382 100644 --- a/.claude/rules/platform/bot-operations.md +++ b/.claude/rules/platform/bot-operations.md @@ -1,11 +1,14 @@ # Bot Operations - **Bot restart requires explicit user confirmation.** No exceptions. -- **Graceful restart:** `launchctl kill SIGTERM gui/$(id -u)/ai.minime.telegram-bot` — sends SIGTERM. Bot injects shutdown message into active sessions asking agents to wrap up, then waits up to 60s for turns to complete. Idle sessions close immediately. Launchd auto-restarts (KeepAlive=true). -- **Wait for shutdown to complete.** After SIGTERM, the bot may take up to 60s to drain active sessions. Check logs for `All sessions closed. Exiting.` before concluding the restart failed. Running `launchctl list` during this window may show a stale exit code — that does NOT mean the restart failed. -- **Never use `launchctl bootout` after SIGTERM.** If you `bootout` while the bot is still draining sessions, you remove the service definition from launchd — KeepAlive can no longer restart it. The bot dies with no way back except manual `launchctl load`. -- **Never use `launchctl kickstart -k`** — it sends SIGKILL, bypasses graceful shutdown, kills active sessions mid-turn. -- **If auto-restart doesn't happen** after clean exit (>90s, no new PID): `launchctl load ~/Library/LaunchAgents/ai.minime.telegram-bot.plist`. If that doesn't work — ask Ninja. +- **Canonical restart path:** use `bot/scripts/restart-bot.sh`. Do not type raw `launchctl` commands. The script validates config, sends SIGTERM, polls launchd teardown so bootout never races bootstrap, and returns the new PID on success. + - `bot/scripts/restart-bot.sh` — graceful SIGTERM restart. Use after code changes or edits to `config.yaml` / `config.local.yaml`. KeepAlive relaunches from the cached plist. + - `bot/scripts/restart-bot.sh --plist` — full unregister + re-bootstrap. Use after edits to `~/Library/LaunchAgents/ai.minime.telegram-bot.plist` (env vars, ProgramArguments, etc). Required because launchd caches the plist at bootstrap time; a plain SIGTERM restart picks up the stale cache and silently drops the edit. + - `bot/scripts/restart-bot.sh -h` — usage. +- **Shutdown takes up to 60s.** The bot injects a shutdown message into active sessions, waits for turns to complete, then exits. Idle sessions close immediately. The script polls until the old PID is gone and a new PID is running — do not conclude failure from `launchctl list` output mid-drain. +- **Never bypass the script with raw `launchctl bootout`.** Manual `bootout` in the `gui` domain is asynchronous; pairing it with an immediate `bootstrap` races launchd's teardown and can leave the service unregistered with no way for KeepAlive to respawn (see incident 2026-04-18, 17 min outage). The `--plist` mode handles this safely by polling teardown to completion before bootstrap. +- **Never use `launchctl kickstart -k`** — it sends SIGKILL, bypasses graceful shutdown, kills active sessions mid-turn. The script never does this and neither should operators. +- **If the script fails** or auto-restart doesn't happen (>90s, no new PID), rerun `bot/scripts/restart-bot.sh --plist`. If that still fails, fall back to `launchctl bootstrap gui/$(id -u) ~/Library/LaunchAgents/ai.minime.telegram-bot.plist`. If that doesn't work — ask Ninja. - **Config changes (hot-reloaded, no restart needed):** `agents` fields (`model`, `fallbackModel`, `maxTurns`, `systemPrompt`, `effort`, `workspaceCwd`) and `sessionDefaults` (`idleTimeoutMs`, `maxConcurrentSessions`) are re-read from `config.yaml` / `config.local.yaml` on every new session spawn. Edit the file and the next new session picks it up. Already-running sessions keep their original config. -- **Config changes (boot-level, restart required):** `telegramToken`, `discord.token`, `bindings`, `metricsPort`, `sessionDefaults.maxMessageAgeMs`, `sessionDefaults.requireMention`. Validate before restart: `npx tsx bot/src/config.ts --validate` +- **Config changes (boot-level, restart required):** `telegramToken`, `discord.token`, `bindings`, `metricsPort`, `sessionDefaults.maxMessageAgeMs`, `sessionDefaults.requireMention`. Validate before restart: `npx tsx bot/src/config.ts --validate` (the script runs this automatically and aborts on failure). - **Cron changes:** edit crons.yaml → regenerate plists → load → test → verify logs diff --git a/.claude/skills/bot-operations/SKILL.md b/.claude/skills/bot-operations/SKILL.md index fa505ed..78c27a2 100644 --- a/.claude/skills/bot-operations/SKILL.md +++ b/.claude/skills/bot-operations/SKILL.md @@ -4,20 +4,20 @@ Reference for Telegram bot and cron system management. ## Bot Restart -1. Validate config: `cd ~/.minime/workspace/bot && npx tsx src/config.ts --validate` -2. Report result and reason to Ninja -3. Wait for explicit confirmation -4. Graceful restart: `launchctl kill SIGTERM gui/$(id -u)/ai.minime.telegram-bot` -5. Wait for drain: check logs for `All sessions closed. Exiting.` (up to 60s) -6. Verify: `launchctl list | grep ai.minime.telegram-bot` — new PID, exit 0 (note: stale exit code during drain window is normal, wait for step 5 first) +1. Report intent and reason to Ninja +2. Wait for explicit confirmation +3. Restart via the canonical script (it validates config, sends SIGTERM, polls launchd teardown, returns the new PID): + - Code or `config.yaml` / `config.local.yaml` changes: `bot/scripts/restart-bot.sh` + - Plist-on-disk changes (`~/Library/LaunchAgents/ai.minime.telegram-bot.plist`): `bot/scripts/restart-bot.sh --plist` + - Usage: `bot/scripts/restart-bot.sh -h` -Bot injects shutdown message into active sessions, waits up to 60s for turns to complete, then launchd auto-restarts (KeepAlive=true). +Bot injects shutdown message into active sessions, waits up to 60s for turns to complete, then launchd auto-restarts (KeepAlive=true). The script polls until the old PID is gone and a new PID is running — do not conclude failure from `launchctl list` output mid-drain. **Never use:** - `launchctl kickstart -k` — sends SIGKILL, kills sessions mid-turn -- `launchctl bootout` after SIGTERM — removes service definition, prevents auto-restart +- Raw `launchctl bootout` paired with immediate `bootstrap` — async teardown races bootstrap and can leave the service unregistered (2026-04-18 incident, 17 min outage). Use `--plist` mode instead. -**If auto-restart fails** (>90s, no new PID): `launchctl load ~/Library/LaunchAgents/ai.minime.telegram-bot.plist` +**If the script fails** or auto-restart doesn't happen (>90s, no new PID), rerun `bot/scripts/restart-bot.sh --plist`. If that still fails, fall back to `launchctl bootstrap gui/$(id -u) ~/Library/LaunchAgents/ai.minime.telegram-bot.plist`. If that doesn't work — ask Ninja. ## Config Changes diff --git a/README.md b/README.md index f0b1d54..162dc0b 100644 --- a/README.md +++ b/README.md @@ -175,8 +175,11 @@ The bot runs as a launchd service: `ai.minime.telegram-bot`. # Check status launchctl print gui/$(id -u)/ai.minime.telegram-bot 2>&1 | head -5 -# Restart (graceful — waits for active sessions to finish) -launchctl kill SIGTERM gui/$(id -u)/ai.minime.telegram-bot +# Restart (graceful — validates config, sends SIGTERM, waits for drain, returns new PID) +bot/scripts/restart-bot.sh + +# Restart after editing ~/Library/LaunchAgents/ai.minime.telegram-bot.plist +bot/scripts/restart-bot.sh --plist # Stop launchctl bootout gui/$(id -u)/ai.minime.telegram-bot @@ -185,7 +188,7 @@ launchctl bootout gui/$(id -u)/ai.minime.telegram-bot launchctl bootstrap gui/$(id -u) ~/Library/LaunchAgents/ai.minime.telegram-bot.plist ``` -**Warning:** Graceful restart sends SIGTERM — the bot injects a shutdown message into active sessions and waits up to 60s for turns to complete before exiting. Idle sessions close immediately. launchd auto-restarts via KeepAlive. Still, active work is interrupted — always confirm before restarting. +**Warning:** Graceful restart sends SIGTERM — the bot injects a shutdown message into active sessions and waits up to 60s for turns to complete before exiting. Idle sessions close immediately. launchd auto-restarts via KeepAlive. Still, active work is interrupted — always confirm before restarting. Use `--plist` after editing the plist on disk, because launchd caches the plist at bootstrap time and a plain SIGTERM restart would pick up the stale cache. ## Add a Cron @@ -247,10 +250,9 @@ To remove: `launchctl bootout gui/$(id -u)/ai.minime.cron.`, delete from ` See [config.yaml](config.yaml) for all binding options including `requireMention`, `voiceTranscriptEcho`, `typingIndicator`, and per-topic overrides for forum supergroups. -2. Validate and restart: +2. Restart (the script validates config before sending SIGTERM): ```bash - cd ~/.minime/bot && npx tsx src/config.ts --validate - launchctl kill SIGTERM gui/$(id -u)/ai.minime.telegram-bot + bot/scripts/restart-bot.sh ``` ## Add a Discord Binding diff --git a/bot/scripts/restart-bot.sh b/bot/scripts/restart-bot.sh new file mode 100755 index 0000000..46b0ee3 --- /dev/null +++ b/bot/scripts/restart-bot.sh @@ -0,0 +1,271 @@ +#!/bin/bash +# restart-bot.sh — Safely restart the Telegram bot launchd service +# Usage: +# restart-bot.sh Graceful SIGTERM restart (code / config.yaml changes) +# restart-bot.sh --plist Full unregister + re-bootstrap (plist-on-disk changes) +# restart-bot.sh -h|--help Show this help +# +# Never sends SIGKILL. Validates config before restarting. Polls launchd +# teardown so bootout is not raced against bootstrap. + +set -euo pipefail + +if [ -z "${HOME:-}" ]; then + if command -v dscl >/dev/null 2>&1; then + HOME="$(dscl . -read "/Users/$(id -un)" NFSHomeDirectory 2>/dev/null | awk '{print $2}')" + fi +fi +if [ -z "${HOME:-}" ]; then + if command -v getent >/dev/null 2>&1; then + HOME="$(getent passwd "$(id -un)" 2>/dev/null | cut -d: -f6)" + fi +fi +export HOME +if [ -z "$HOME" ]; then + echo "[restart-bot] Error: could not determine HOME from environment or fallback lookups" >&2 + exit 1 +fi +export PATH="/opt/homebrew/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:${PATH:-}" + +SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)" +BOT_DIR="$(cd "$SCRIPT_DIR/.." && pwd)" + +LAUNCHCTL_BIN="${LAUNCHCTL_BIN:-/bin/launchctl}" +PLUTIL_BIN="${PLUTIL_BIN:-/usr/bin/plutil}" +BOT_LABEL="${BOT_LABEL:-ai.minime.telegram-bot}" +BOT_PLIST="${BOT_PLIST:-$HOME/Library/LaunchAgents/${BOT_LABEL}.plist}" +BOT_UID="${BOT_UID:-$(id -u)}" +DOMAIN="gui/${BOT_UID}" +SERVICE="${DOMAIN}/${BOT_LABEL}" + +# Test-only: override the validator with a single executable (no args, no eval). +# Tests set this to `true` / `false` to simulate validation pass / fail paths. +CONFIG_VALIDATE_BIN="${CONFIG_VALIDATE_BIN:-}" + +# Timeouts (seconds). Drain window is 60s — give headroom. +SHUTDOWN_TIMEOUT="${SHUTDOWN_TIMEOUT:-90}" +TEARDOWN_TIMEOUT="${TEARDOWN_TIMEOUT:-90}" +STARTUP_TIMEOUT="${STARTUP_TIMEOUT:-60}" +POLL_INTERVAL="${POLL_INTERVAL:-1}" + +usage() { + cat <&2; } + +MODE="graceful" + +while [ $# -gt 0 ]; do + case "$1" in + -h|--help) usage; exit 0 ;; + --plist) MODE="plist"; shift ;; + *) err "unknown argument: $1"; usage >&2; exit 2 ;; + esac +done + +# get_pid prints one of: +# — service is registered and running +# "" — service is registered but has no running process (PID = "-") +# exit status: +# 0 — registered (pid may be empty) +# 1 — not registered (launchctl query succeeded, label absent) +# 2 — launchctl query itself failed (unknown state) +get_pid() { + local out + if ! out=$("$LAUNCHCTL_BIN" list 2>/dev/null); then + return 2 + fi + local line + line=$(printf '%s\n' "$out" | awk -v L="$BOT_LABEL" '$3==L { print; exit }') + if [ -z "$line" ]; then + return 1 + fi + local pid + pid=$(printf '%s\n' "$line" | awk '{print $1}') + if [ "$pid" = "-" ]; then + echo "" + else + echo "$pid" + fi + return 0 +} + +# True only when launchctl query succeeded AND the service is registered. +# A transient query failure is NOT treated as "registered" or "not registered". +is_registered() { + local rc=0 + get_pid >/dev/null 2>&1 || rc=$? + [ "$rc" -eq 0 ] +} + +wait_until() { + # wait_until + local timeout="$1"; local pred="$2" + local deadline + deadline=$(( $(date +%s) + timeout )) + while [ "$(date +%s)" -lt "$deadline" ]; do + if "$pred"; then + return 0 + fi + sleep "$POLL_INTERVAL" + done + return 1 +} + +_old_pid="" +_pred_old_pid_gone() { + local cur rc=0 + cur=$(get_pid 2>/dev/null) || rc=$? + case "$rc" in + 0) [ "$cur" != "$_old_pid" ] ;; + 1) return 0 ;; # explicitly not registered → old pid gone + *) return 1 ;; # query failed → unknown, keep polling + esac +} + +# Distinguishes "confirmed not registered" from "query failed", so a transient +# launchctl error can't trick us into bootstrapping over a still-registered svc. +_pred_unregistered() { + local rc=0 + get_pid >/dev/null 2>&1 || rc=$? + [ "$rc" -eq 1 ] +} + +# Requires a successful query AND a non-empty PID that differs from the old PID, +# so a stale `launchctl list` response can't be mistaken for the new process. +_pred_running_pid() { + local pid rc=0 + pid=$(get_pid 2>/dev/null) || rc=$? + [ "$rc" -eq 0 ] && [ -n "$pid" ] && [ "$pid" != "$_old_pid" ] +} + +validate_plist() { + log "Validating plist at $BOT_PLIST…" + if ! "$PLUTIL_BIN" -lint "$BOT_PLIST" >/dev/null 2>&1; then + err "plist is malformed: $BOT_PLIST" + err "run: $PLUTIL_BIN -lint \"$BOT_PLIST\" for details" + return 1 + fi + local plist_label + if ! plist_label=$("$PLUTIL_BIN" -extract Label raw "$BOT_PLIST" 2>/dev/null); then + err "plist is missing 'Label' key: $BOT_PLIST" + return 1 + fi + if [ "$plist_label" != "$BOT_LABEL" ]; then + err "plist Label '$plist_label' does not match expected '$BOT_LABEL'" + return 1 + fi +} + +validate_config() { + log "Validating config before restart…" + if [ -n "$CONFIG_VALIDATE_BIN" ]; then + if ! ( cd "$BOT_DIR" && "$CONFIG_VALIDATE_BIN" >/dev/null 2>&1 ); then + err "config validation failed; refusing to restart" + return 1 + fi + return 0 + fi + if ! ( cd "$BOT_DIR" && npx tsx src/config.ts --validate >/dev/null ); then + err "config validation failed; refusing to restart" + return 1 + fi +} + +graceful_restart() { + local old_pid + if ! old_pid=$(get_pid); then + err "service $BOT_LABEL is not registered with launchd; run: restart-bot.sh --plist" + return 1 + fi + + if [ -z "$old_pid" ]; then + err "service $BOT_LABEL is registered but has no running process (PID=-); run: restart-bot.sh --plist" + return 1 + fi + + validate_config || return 1 + + log "Sending SIGTERM to $SERVICE (old PID: $old_pid)" + if ! "$LAUNCHCTL_BIN" kill SIGTERM "$SERVICE"; then + err "launchctl kill SIGTERM failed" + return 1 + fi + + _old_pid="$old_pid" + log "Waiting up to ${SHUTDOWN_TIMEOUT}s for old process $old_pid to exit…" + if ! wait_until "$SHUTDOWN_TIMEOUT" _pred_old_pid_gone; then + err "old process $old_pid did not exit within ${SHUTDOWN_TIMEOUT}s" + return 1 + fi + + log "Waiting up to ${STARTUP_TIMEOUT}s for KeepAlive to spawn a new PID…" + if ! wait_until "$STARTUP_TIMEOUT" _pred_running_pid; then + err "no new PID observed within ${STARTUP_TIMEOUT}s; KeepAlive did not restart" + return 1 + fi + + local new_pid + new_pid=$(get_pid 2>/dev/null || true) + log "Restart complete. New PID: ${new_pid:-unknown}" + echo "$new_pid" +} + +plist_restart() { + if [ ! -f "$BOT_PLIST" ]; then + err "plist not found: $BOT_PLIST" + return 1 + fi + + validate_plist || return 1 + validate_config || return 1 + + if is_registered; then + log "Unregistering $SERVICE (launchctl bootout)…" + # bootout may return non-zero even when the teardown is in progress; + # we rely on polling below, not the exit code. + "$LAUNCHCTL_BIN" bootout "$SERVICE" >/dev/null 2>&1 || true + + log "Waiting up to ${TEARDOWN_TIMEOUT}s for teardown to complete…" + if ! wait_until "$TEARDOWN_TIMEOUT" _pred_unregistered; then + err "service did not unregister within ${TEARDOWN_TIMEOUT}s; refusing to bootstrap" + err "bootout is still draining sessions — rerun once 'launchctl list' no longer shows $BOT_LABEL" + return 1 + fi + else + log "Service not currently registered; skipping bootout." + fi + + log "Bootstrapping from $BOT_PLIST…" + if ! "$LAUNCHCTL_BIN" bootstrap "$DOMAIN" "$BOT_PLIST"; then + err "launchctl bootstrap failed" + return 1 + fi + + log "Waiting up to ${STARTUP_TIMEOUT}s for a running PID…" + if ! wait_until "$STARTUP_TIMEOUT" _pred_running_pid; then + err "service registered but no running PID within ${STARTUP_TIMEOUT}s" + return 1 + fi + + local new_pid + new_pid=$(get_pid 2>/dev/null || true) + log "Restart complete. New PID: ${new_pid:-unknown}" + echo "$new_pid" +} + +case "$MODE" in + graceful) graceful_restart ;; + plist) plist_restart ;; + *) err "internal: unhandled mode $MODE"; exit 1 ;; +esac diff --git a/bot/src/__tests__/restart-bot.test.ts b/bot/src/__tests__/restart-bot.test.ts new file mode 100644 index 0000000..cac944e --- /dev/null +++ b/bot/src/__tests__/restart-bot.test.ts @@ -0,0 +1,452 @@ +import { describe, it } from "node:test"; +import assert from "node:assert/strict"; +import { spawnSync } from "node:child_process"; +import { createHash } from "node:crypto"; +import { mkdtempSync, mkdirSync, rmSync, writeFileSync, chmodSync, readFileSync, existsSync } from "node:fs"; +import { join, resolve, dirname } from "node:path"; +import { tmpdir } from "node:os"; +import { fileURLToPath } from "node:url"; + +const __filename = fileURLToPath(import.meta.url); +const __dirname = dirname(__filename); +const SCRIPT = resolve(__dirname, "../../scripts/restart-bot.sh"); + +// Mock launchctl: writes / reads a key=value state file in $STATE_DIR/state. +// Supports: list, kill , bootout , bootstrap . +// Timing is driven by optional keys: kill_delay, bootout_delay (seconds). +const MOCK_LAUNCHCTL = `#!/bin/bash +set -uo pipefail +STATE="\${STATE_DIR}/state" +touch "\$STATE" + +get() { awk -F= -v k="\$1" 'BEGIN{v=""} $1==k {v=$0; sub(/^[^=]*=/,"",v)} END{print v}' "\$STATE"; } +set_kv() { + tmp="\$STATE.tmp" + awk -F= -v k="\$1" '$1!=k' "\$STATE" > "\$tmp" + printf '%s=%s\\n' "\$1" "\$2" >> "\$tmp" + mv "\$tmp" "\$STATE" +} +del_kv() { + tmp="\$STATE.tmp" + awk -F= -v k="\$1" '$1!=k' "\$STATE" > "\$tmp" + mv "\$tmp" "\$STATE" +} +now() { date +%s; } + +apply() { + kat=\$(get kill_at) + bat=\$(get bootout_at) + if [ -n "\$kat" ] && [ "\$(now)" -ge "\$kat" ]; then + np=\$(get next_pid); [ -z "\$np" ] && np=99999 + set_kv pid "\$np" + del_kv kill_at + fi + if [ -n "\$bat" ] && [ "\$(now)" -ge "\$bat" ]; then + set_kv registered 0 + del_kv pid + del_kv bootout_at + fi +} + +cmd="\${1:-}"; shift || true +case "\$cmd" in + list) + apply + if [ "\$(get registered)" = "1" ]; then + label=\$(get label); [ -z "\$label" ] && label="ai.minime.telegram-bot" + pid=\$(get pid); [ -z "\$pid" ] && pid="-" + printf 'PID\\tStatus\\tLabel\\n' + printf '%s\\t0\\t%s\\n' "\$pid" "\$label" + fi + ;; + kill) + apply + delay=\$(get kill_delay); [ -z "\$delay" ] && delay=0 + set_kv kill_at \$(( \$(now) + delay )) + ;; + bootout) + apply + delay=\$(get bootout_delay); [ -z "\$delay" ] && delay=0 + set_kv bootout_at \$(( \$(now) + delay )) + ;; + bootstrap) + apply + if [ "\$(get registered)" = "1" ]; then + echo "Bootstrap failed: 5: Input/output error" >&2 + exit 5 + fi + plist_path="\${2:-}" + if [ -n "\$plist_path" ] && [ -f "\$plist_path" ]; then + sig=\$(node -e "const fs=require('fs'),c=require('crypto');process.stdout.write(c.createHash('sha1').update(fs.readFileSync(process.argv[1])).digest('hex'))" "\$plist_path") + set_kv bootstrapped_sig "\$sig" + fi + np=\$(get next_pid); [ -z "\$np" ] && np=99999 + set_kv pid "\$np" + set_kv registered 1 + ;; + *) + echo "mock-launchctl: unknown command: \$cmd" >&2 + exit 1 + ;; +esac +`; + +const MOCK_PLUTIL = `#!/bin/bash +set -uo pipefail + +cmd="\${1:-}" +case "\$cmd" in + -lint) + file="\${2:-}" + if [ -z "\$file" ] || [ ! -f "\$file" ]; then + exit 1 + fi + if grep -q "" "\$file" && grep -q "" "\$file" && grep -q "" "\$file"; then + exit 0 + fi + echo "mock-plutil: malformed plist" >&2 + exit 1 + ;; + -extract) + key="\${2:-}" + fmt="\${3:-}" + file="\${4:-}" + if [ "\$key" != "Label" ] || [ "\$fmt" != "raw" ] || [ -z "\$file" ] || [ ! -f "\$file" ]; then + exit 1 + fi + label=\$(tr -d '\\n' < "\$file" | sed -n 's|.*[[:space:]]*Label[[:space:]]*[[:space:]]*\\([^<]*\\).*|\\1|p') + if [ -n "\$label" ]; then + printf '%s\\n' "\$label" + exit 0 + fi + exit 1 + ;; + *) + echo "mock-plutil: unsupported args: \$*" >&2 + exit 1 + ;; +esac +`; + +type Harness = { + dir: string; + launchctl: string; + plutil: string; + plist: string; + stateDir: string; + setState(kv: Record): void; + readState(): Record; + writePlist(content: string): void; + run(args: string[], env?: Record): { status: number | null; stdout: string; stderr: string }; +}; + +const VALID_PLIST = (label = "ai.minime.telegram-bot", extra = "INITIAL") => ` + + + + Label + ${label} + Extra + ${extra} + + +`; + +function createHarness(): Harness { + const dir = mkdtempSync(join(tmpdir(), "restart-bot-test-")); + const launchctl = join(dir, "launchctl"); + const plutil = join(dir, "plutil"); + const stateDir = join(dir, "state-dir"); + const plist = join(dir, "ai.minime.telegram-bot.plist"); + writeFileSync(launchctl, MOCK_LAUNCHCTL); + writeFileSync(plutil, MOCK_PLUTIL); + chmodSync(launchctl, 0o755); + chmodSync(plutil, 0o755); + writeFileSync(plist, VALID_PLIST()); + mkdirSync(stateDir, { recursive: true }); + const stateFile = join(stateDir, "state"); + + const setState = (kv: Record) => { + const existing: Record = {}; + if (existsSync(stateFile)) { + for (const line of readFileSync(stateFile, "utf8").split("\n")) { + if (!line) continue; + const idx = line.indexOf("="); + if (idx === -1) continue; + existing[line.slice(0, idx)] = line.slice(idx + 1); + } + } + for (const [k, v] of Object.entries(kv)) existing[k] = String(v); + writeFileSync( + stateFile, + Object.entries(existing).map(([k, v]) => `${k}=${v}`).join("\n") + "\n", + ); + }; + + const readState = (): Record => { + const out: Record = {}; + if (!existsSync(stateFile)) return out; + for (const line of readFileSync(stateFile, "utf8").split("\n")) { + if (!line) continue; + const idx = line.indexOf("="); + if (idx === -1) continue; + out[line.slice(0, idx)] = line.slice(idx + 1); + } + return out; + }; + + const writePlist = (content: string) => writeFileSync(plist, content); + + const run = (args: string[], env: Record = {}) => { + const result = spawnSync(SCRIPT, args, { + env: { + ...process.env, + LAUNCHCTL_BIN: launchctl, + PLUTIL_BIN: plutil, + BOT_LABEL: "ai.minime.telegram-bot", + BOT_PLIST: plist, + BOT_UID: "501", + STATE_DIR: stateDir, + POLL_INTERVAL: "0.1", + SHUTDOWN_TIMEOUT: "30", + TEARDOWN_TIMEOUT: "30", + STARTUP_TIMEOUT: "20", + CONFIG_VALIDATE_BIN: "true", + ...env, + }, + encoding: "utf8", + timeout: 60_000, + }); + return { + status: result.status, + stdout: result.stdout ?? "", + stderr: result.stderr ?? "", + }; + }; + + return { dir, launchctl, plutil, plist, stateDir, setState, readState, writePlist, run }; +} + +function cleanup(h: Harness) { + rmSync(h.dir, { recursive: true, force: true }); +} + +describe("restart-bot.sh", () => { + it("prints usage and exits 0 for --help", () => { + const h = createHarness(); + try { + const { status, stdout } = h.run(["--help"]); + assert.strictEqual(status, 0); + assert.match(stdout, /Usage:/); + assert.match(stdout, /--plist/); + } finally { + cleanup(h); + } + }); + + it("does not blow up when HOME is unset (launchd context)", () => { + const h = createHarness(); + try { + // Regression: `set -u` + `${BOT_PLIST:-$HOME/...}` crashed when HOME was unset. + // Drop HOME from the inherited env; --help should still print and exit 0. + const env: NodeJS.ProcessEnv = { ...process.env }; + delete env.HOME; + const result = spawnSync(SCRIPT, ["--help"], { + env: env as Record, + encoding: "utf8", + timeout: 30_000, + }); + assert.strictEqual(result.status, 0, `expected exit 0, got ${result.status}: ${result.stderr}`); + assert.match(result.stdout, /Usage:/); + assert.doesNotMatch(result.stderr, /unbound variable/); + } finally { + cleanup(h); + } + }); + + it("exits 2 with usage on unknown argument", () => { + const h = createHarness(); + try { + const { status, stderr } = h.run(["--bogus"]); + assert.strictEqual(status, 2); + assert.match(stderr, /unknown argument/); + assert.match(stderr, /Usage:/); + } finally { + cleanup(h); + } + }); + + it("graceful: returns new PID and exits 0", () => { + const h = createHarness(); + try { + h.setState({ registered: 1, label: "ai.minime.telegram-bot", pid: 1111, next_pid: 2222, kill_delay: 0 }); + const { status, stdout, stderr } = h.run([]); + assert.strictEqual(status, 0, `expected exit 0, got ${status}: ${stderr}`); + assert.match(stdout, /New PID: 2222/); + assert.match(stdout.trim(), /2222\s*$/m); + const state = h.readState(); + assert.strictEqual(state.pid, "2222"); + } finally { + cleanup(h); + } + }); + + it("graceful: refuses when service is not registered", () => { + const h = createHarness(); + try { + const { status, stderr } = h.run([]); + assert.notStrictEqual(status, 0); + assert.match(stderr, /not registered/); + } finally { + cleanup(h); + } + }); + + it("graceful: tolerates a slow shutdown without racing teardown", () => { + const h = createHarness(); + try { + // Simulate a realistic drain: the old PID stays listed for several seconds + // before KeepAlive swaps in the new PID. Script must keep polling the whole + // time, not time out or declare success early. + h.setState({ registered: 1, label: "ai.minime.telegram-bot", pid: 1111, next_pid: 3333, kill_delay: 5 }); + const start = Date.now(); + const { status, stdout, stderr } = h.run([]); + const elapsedMs = Date.now() - start; + assert.strictEqual(status, 0, `expected exit 0, got ${status}: ${stderr}`); + // It really did wait for the drain to complete. + // Script must have actually waited for the drain — not exited early. + // Give ~500ms slack for 1-second mock-clock granularity and poll interval. + assert.ok(elapsedMs >= 4000, `expected >= ~5s wait, got ${elapsedMs}ms`); + assert.match(stdout, /New PID: 3333/); + } finally { + cleanup(h); + } + }); + + it("graceful: aborts before SIGTERM when config validation fails", () => { + const h = createHarness(); + try { + h.setState({ registered: 1, label: "ai.minime.telegram-bot", pid: 1111, next_pid: 2222 }); + const { status, stdout, stderr } = h.run([], { + CONFIG_VALIDATE_BIN: "false", + }); + assert.notStrictEqual(status, 0); + assert.match(stderr, /config validation failed/); + // No kill_at was set -> kill was never sent. + const state = h.readState(); + assert.ok(!("kill_at" in state), `expected no kill scheduled, got state: ${JSON.stringify(state)}`); + assert.strictEqual(state.pid, "1111"); + assert.doesNotMatch(stdout, /Sending SIGTERM/); + } finally { + cleanup(h); + } + }); + + it("--plist: new on-disk plist is reflected after success", () => { + const h = createHarness(); + try { + // Old plist content, bot currently running. + h.writePlist(VALID_PLIST("ai.minime.telegram-bot", "OLD")); + h.setState({ registered: 1, label: "ai.minime.telegram-bot", pid: 1111, next_pid: 4444, bootout_delay: 2 }); + // Operator edits the plist on disk just before restart. + h.writePlist(VALID_PLIST("ai.minime.telegram-bot", "NEW_ENV_VAR=BAR")); + + // Capture signature of the NEW plist (matches mock launchctl's sha1 digest). + const newSig = createHash("sha1").update(readFileSync(h.plist)).digest("hex"); + + const { status, stdout, stderr } = h.run(["--plist"]); + assert.strictEqual(status, 0, `expected exit 0, got ${status}: ${stderr}`); + assert.match(stdout, /New PID: 4444/); + + const state = h.readState(); + assert.strictEqual(state.registered, "1"); + assert.strictEqual(state.pid, "4444"); + // Script bootstrapped from the NEW plist, not a cached one. + assert.strictEqual(state.bootstrapped_sig, newSig); + } finally { + cleanup(h); + } + }); + + it("--plist: does not race bootout teardown (bootstrap waits for unregister)", () => { + const h = createHarness(); + try { + // bootout takes 4s to drain. If the script didn't wait, bootstrap would + // fire against a still-registered service and get EIO. + h.setState({ registered: 1, label: "ai.minime.telegram-bot", pid: 1111, next_pid: 5555, bootout_delay: 4 }); + const start = Date.now(); + const { status, stdout, stderr } = h.run(["--plist"]); + const elapsedMs = Date.now() - start; + assert.strictEqual(status, 0, `expected exit 0, got ${status}: ${stderr}`); + assert.ok(elapsedMs >= 3500, `expected >= ~4s wait, got ${elapsedMs}ms`); + assert.match(stdout, /New PID: 5555/); + assert.doesNotMatch(stderr, /Input\/output error/); + } finally { + cleanup(h); + } + }); + + it("--plist: aborts when config validation fails", () => { + const h = createHarness(); + try { + h.setState({ registered: 1, label: "ai.minime.telegram-bot", pid: 1111, next_pid: 4444 }); + const { status, stderr } = h.run(["--plist"], { CONFIG_VALIDATE_BIN: "false" }); + assert.notStrictEqual(status, 0); + assert.match(stderr, /config validation failed/); + const state = h.readState(); + // No bootout scheduled. + assert.ok(!("bootout_at" in state)); + assert.strictEqual(state.pid, "1111"); + } finally { + cleanup(h); + } + }); + + it("--plist: aborts before bootout when plist is malformed", () => { + const h = createHarness(); + try { + h.setState({ registered: 1, label: "ai.minime.telegram-bot", pid: 1111, next_pid: 4444 }); + // Syntactically broken plist — plutil -lint must reject before any bootout. + h.writePlist("not valid xml"); + const { status, stderr } = h.run(["--plist"]); + assert.notStrictEqual(status, 0); + assert.match(stderr, /plist is malformed/); + const state = h.readState(); + // No bootout scheduled, service still registered and running. + assert.ok(!("bootout_at" in state), `expected no bootout scheduled, got: ${JSON.stringify(state)}`); + assert.strictEqual(state.registered, "1"); + assert.strictEqual(state.pid, "1111"); + } finally { + cleanup(h); + } + }); + + it("--plist: aborts before bootout when plist Label does not match", () => { + const h = createHarness(); + try { + h.setState({ registered: 1, label: "ai.minime.telegram-bot", pid: 1111, next_pid: 4444 }); + h.writePlist(VALID_PLIST("ai.minime.WRONG-LABEL", "X")); + const { status, stderr } = h.run(["--plist"]); + assert.notStrictEqual(status, 0); + assert.match(stderr, /Label .* does not match/); + const state = h.readState(); + assert.ok(!("bootout_at" in state)); + assert.strictEqual(state.registered, "1"); + assert.strictEqual(state.pid, "1111"); + } finally { + cleanup(h); + } + }); + + it("--plist: fails cleanly when plist file is missing", () => { + const h = createHarness(); + try { + rmSync(h.plist); + const { status, stderr } = h.run(["--plist"]); + assert.notStrictEqual(status, 0); + assert.match(stderr, /plist not found/); + } finally { + cleanup(h); + } + }); +}); diff --git a/docs/plans/2026-04-18-issue-100-safe-restart.md b/docs/plans/2026-04-18-issue-100-safe-restart.md new file mode 100644 index 0000000..3873ad8 --- /dev/null +++ b/docs/plans/2026-04-18-issue-100-safe-restart.md @@ -0,0 +1,166 @@ +# Safe scripted bot restart + rule update — Round 1 + +## Goal + +Give operators one supported, scripted entry point to restart the bot safely — covering both code/config restarts and on-disk-plist changes — and update the operational rule to point at it. Resolves fitz123/claude-code-bot#100. + +## Validation Commands + +```bash +cd bot && npm test +cd bot && npx tsc --noEmit +shellcheck bot/scripts/*.sh || true +``` + +## Reference: incident log + +Actual timestamps from an operator machine's `~/.minime/logs/telegram-bot.stdout.log`: + +``` +15:27:42Z Received SIGTERM (first restart: launchctl kill SIGTERM) +15:28:43Z All sessions closed. Exiting. +15:28:44Z Bot version: 2d5fd7b (launchd KeepAlive auto-restart, CACHED plist — new env var NOT applied) +15:30:08Z Received SIGTERM (second restart: launchctl bootout + bootstrap) +15:30:14Z All sessions closed. Exiting. +[ 17m16s of bot downtime ] +15:47:30Z Bot version: 2d5fd7b (manual recovery from a separate session) +``` + +Failing command sequence that caused the 17-minute outage: + +```bash +launchctl bootout gui/$uid/ai.minime.telegram-bot +sleep 2 +launchctl bootstrap gui/$uid ~/Library/LaunchAgents/ai.minime.telegram-bot.plist +# → Bootstrap failed: 5: Input/output error +``` + +Root cause: `launchctl bootout` in the `gui` domain is asynchronous. The bootstrap call fired before teardown finished, got EIO, and service registration was gone so KeepAlive could not respawn. + +## Reference: launchctl list output + +Column format verified on macOS 14 (Darwin 23.6.0): + +``` +$ launchctl list | head -1 +PID Status Label +$ launchctl list | awk '$3=="ai.minime.telegram-bot"' +84395 0 ai.minime.telegram-bot +``` + +PID column shows `-` when the service is registered but the process is not currently running. + +## Reference: existing scripts in bot/scripts/ + +``` +bot/scripts/deliver.sh +bot/scripts/generate-plists.ts +bot/scripts/run-cron.sh +bot/scripts/start-bot.sh # referenced from plist ProgramArguments +``` + +`deliver.sh` and `run-cron.sh` are the closest style references for a new shell script. + +## Reference: current rule (.claude/rules/platform/bot-operations.md) + +Full current content of `.claude/rules/platform/bot-operations.md` — the file Task 2 updates: + +```markdown +# Bot Operations + +- **Bot restart requires explicit user confirmation.** No exceptions. +- **Graceful restart:** `launchctl kill SIGTERM gui/$(id -u)/ai.minime.telegram-bot` — sends SIGTERM. Bot injects shutdown message into active sessions asking agents to wrap up, then waits up to 60s for turns to complete. Idle sessions close immediately. Launchd auto-restarts (KeepAlive=true). +- **Wait for shutdown to complete.** After SIGTERM, the bot may take up to 60s to drain active sessions. Check logs for `All sessions closed. Exiting.` before concluding the restart failed. Running `launchctl list` during this window may show a stale exit code — that does NOT mean the restart failed. +- **Never use `launchctl bootout` after SIGTERM.** If you `bootout` while the bot is still draining sessions, you remove the service definition from launchd — KeepAlive can no longer restart it. The bot dies with no way back except manual `launchctl load`. +- **Never use `launchctl kickstart -k`** — it sends SIGKILL, bypasses graceful shutdown, kills active sessions mid-turn. +- **If auto-restart doesn't happen** after clean exit (>90s, no new PID): `launchctl load ~/Library/LaunchAgents/ai.minime.telegram-bot.plist`. If that doesn't work — ask Ninja. +- **Config changes (hot-reloaded, no restart needed):** `agents` fields (`model`, `fallbackModel`, `maxTurns`, `systemPrompt`, `effort`, `workspaceCwd`) and `sessionDefaults` (`idleTimeoutMs`, `maxConcurrentSessions`) are re-read from `config.yaml` / `config.local.yaml` on every new session spawn. Edit the file and the next new session picks it up. Already-running sessions keep their original config. +- **Config changes (boot-level, restart required):** `telegramToken`, `discord.token`, `bindings`, `metricsPort`, `sessionDefaults.maxMessageAgeMs`, `sessionDefaults.requireMention`. Validate before restart: `npx tsx bot/src/config.ts --validate` +- **Cron changes:** edit crons.yaml → regenerate plists → load → test → verify logs +``` + +The rule does not specify a procedure for applying plist-on-disk changes and warns against `bootout` without pairing that with a safe alternative. + +## Tasks + +### Task 1: Add a bot-restart script under `bot/scripts/` (#100, P1) + +**Problem.** Operators currently restart the bot by typing raw `launchctl` commands. Two distinct triggers exist: + +1. Code or `config.yaml`/`config.local.yaml` changes — launchd's cached plist is still correct, so a graceful SIGTERM restart is enough (KeepAlive relaunches). +2. Changes to `ai.minime.telegram-bot.plist` itself (env vars, ProgramArguments, etc) — launchd caches the plist in memory at bootstrap time, so a plain SIGTERM restart relaunches from the stale cache and silently drops the edit. To pick up the new plist, the service must be fully unregistered and re-bootstrapped from disk. + +Case 2 is where operators get hurt: the naive `bootout` + `bootstrap` sequence has a race (bootout is asynchronous, bootstrap returns EIO if called too early). On 2026-04-18 this produced 17 minutes of bot downtime — see incident reference above. + +**What we want.** A single supported, scripted entry point that an operator (or automation) can invoke to restart the bot safely, covering both cases. After the restart, the bot is running and — in case 2 — actually reflects the on-disk plist. + +- [x] There is an executable script in `bot/scripts/` that operators invoke to restart the bot. Exact name and CLI are the implementer's choice, but the name and usage MUST be referenced verbatim from the updated rule in Task 2 (one contract, one source of truth). +- [x] Default / zero-argument invocation performs a graceful restart suitable for case 1 (code/config changes only); it sends SIGTERM, waits for the old process to exit, and returns successfully once a new PID is running. +- [x] There is a clearly named mode (flag, subcommand, or separate script — implementer's choice) that performs case 2: after the script finishes successfully, the running bot process reflects the current on-disk plist (verifiable via `sudo launchctl procinfo ` showing any new `EnvironmentVariables` from the plist). +- [x] The script tolerates variable session-drain time — including the full 60-second shutdown window — without racing launchd's teardown and without relying on a fixed `sleep`. Simulating a slow shutdown (e.g. a stub that delays exit) must not cause false-positive or false-negative exits. +- [x] The script never sends SIGKILL and never invokes `launchctl kickstart -k` or equivalent; graceful shutdown is not bypassable. +- [x] On success the script prints the new PID and exits 0; on any failure to shut down or to bring the service back up, it prints a clear diagnostic and exits non-zero. +- [x] Invoking with `-h` / `--help` or an unknown argument prints usage and exits with an appropriate status. +- [x] Style is consistent with other shell scripts in `bot/scripts/` (see `deliver.sh`, `run-cron.sh`). +- [x] `shellcheck` produces no errors on the new script (warnings acceptable if justified inline). +- [x] Before any restart action the script runs config validation (`npx tsx bot/src/config.ts --validate` or equivalent for the case being applied) and aborts with a non-zero exit and a clear diagnostic if validation fails — the bot is never restarted with a broken config +- [x] Automated test harness covers: (a) graceful path returns the new PID and exits 0; (b) plist-change mode results in the on-disk plist being reflected after success; (c) a slow-shutdown stub that delays exit up to the full 60-second drain window does not cause the script to race launchd's teardown or exit early; (d) a deliberately broken config aborts the restart before any SIGTERM is sent +- [x] Verify existing tests pass + +### Task 2: Recommend the script in `.claude/rules/platform/bot-operations.md` (#100, P1) + +**Problem.** The current rule tells operators to type raw `launchctl` commands, and its "Never use `launchctl bootout` after SIGTERM" warning — taken literally — leaves an operator with no correct way to apply a plist edit at all. That gap is what the Task 1 script closes. + +**What we want.** After Task 1 lands, the rule points operators to the script as the canonical restart path. Operators who edit `config.yaml` / `config.local.yaml` or the plist should be able to follow the rule without ever running raw `launchctl` commands. + +- [x] The rule recommends the Task 1 script (by its actual name and CLI) as the default way to restart the bot, and explicitly covers the plist-change case. +- [x] Any previous guidance that contradicts or misleads about the plist-change flow — specifically the blanket "Never use `launchctl bootout` after SIGTERM" line — is updated or removed so operators following the rule end up with a working bot after a plist edit. +- [x] The "Never use `launchctl kickstart -k`" warning is preserved. +- [x] The auto-restart fallback ("If auto-restart doesn't happen…") still works: it points at the script first, and only falls back to manual `launchctl` commands if the script itself fails. +- [x] All existing sections on hot-reload fields, boot-level fields, and cron changes are preserved unchanged. +- [x] Following the updated rule end-to-end (edit plist → run script → check `sudo launchctl procinfo `) successfully picks up the new on-disk plist without any raw `launchctl` commands typed by the operator +- [x] Verify existing tests pass + + +### Task 3: Address PR #103 Copilot review findings + +PR https://github.com/fitz123/claude-code-bot/pull/103 — 4 unresolved Copilot comments. Fix each: + +#### Finding 1: `HOME` unbound in `restart-bot.sh` (line 23) + +`bot/scripts/restart-bot.sh` runs with `set -u`, but uses `${BOT_PLIST:-$HOME/Library/...}` in the default value without ensuring `HOME` is set. In a launchd/automation context where `HOME` is unset, the script exits immediately with "unbound variable". + +- [x] Set `HOME` defensively at the top of `restart-bot.sh` (same pattern as `start-bot.sh` / `run-cron.sh`), OR build the default plist path without depending on `$HOME`. +- [x] Add a regression test: invoke the script with `env -u HOME bash bot/scripts/restart-bot.sh ...` and confirm it does not blow up on the variable expansion. + +#### Finding 2: `eval` injection in `validate_config` (`restart-bot.sh:124`) + +`validate_config` uses `eval "$CONFIG_VALIDATE_CMD"`. Even though the override is intended only for tests, `eval` is a shell-injection footgun and breaks on subtle quoting/word-splitting if `CONFIG_VALIDATE_CMD` is ever passed from the environment. + +- [x] Replace the eval-based override. Pick one approach: + - (a) Drop the env-var override entirely and add a dedicated `SKIP_CONFIG_VALIDATE=1` flag the test harness can set, OR + - (b) Represent the validator as `CONFIG_VALIDATE_BIN` + `CONFIG_VALIDATE_ARGS` arrays and execute without `eval`. +- [x] Update the test that exercises this path so it still works after the rewrite. +- [x] Confirm `npx tsc --noEmit` and `npm test` both pass. + +#### Finding 3: `shasum` external dep in `restart-bot.test.ts` (line 281) + +`bot/src/__tests__/restart-bot.test.ts` invokes the external `shasum` binary to compute the expected plist signature. Tests should be hermetic. + +- [x] Replace `shasum` with Node's `crypto.createHash('sha1')` for the expected-signature computation. +- [x] Confirm the test still asserts the same behaviour and passes. + +#### Finding 4: `shasum` in mock `launchctl` (`restart-bot.test.ts:79`) + +The mock `launchctl` shells out to `shasum` to compute a plist signature. Same hermeticity issue as Finding 3, plus this version is INSIDE a fixture script. + +- [x] Replace the `shasum` shell-out in the mock with a Node-based hash. Either: + - (a) Inline a `node -e '...'` invocation that uses `crypto.createHash('sha1')` over the plist contents, OR + - (b) Store the plist contents directly in the test fixture state so no hashing is needed at all in the mock. +- [x] Confirm the mock still exercises the same code path (signature comparison) in `restart-bot.sh` and the test passes. + +#### Wrap-up +- [x] All 4 findings addressed (boxes ticked above) +- [x] `npx tsc --noEmit` clean +- [x] `npm test` clean (or only pre-existing unrelated failures) +- [x] Commit message references PR #103 and lists the findings fixed diff --git a/docs/plans/completed/2026-04-18-issue-100-safe-restart.md b/docs/plans/completed/2026-04-18-issue-100-safe-restart.md new file mode 100644 index 0000000..6d3678a --- /dev/null +++ b/docs/plans/completed/2026-04-18-issue-100-safe-restart.md @@ -0,0 +1,122 @@ +# Safe scripted bot restart + rule update — Round 1 + +## Goal + +Give operators one supported, scripted entry point to restart the bot safely — covering both code/config restarts and on-disk-plist changes — and update the operational rule to point at it. Resolves fitz123/claude-code-bot#100. + +## Validation Commands + +```bash +cd bot && npm test +cd bot && npx tsc --noEmit +shellcheck bot/scripts/*.sh || true +``` + +## Reference: incident log + +Actual timestamps from an operator machine's `~/.minime/logs/telegram-bot.stdout.log`: + +``` +15:27:42Z Received SIGTERM (first restart: launchctl kill SIGTERM) +15:28:43Z All sessions closed. Exiting. +15:28:44Z Bot version: 2d5fd7b (launchd KeepAlive auto-restart, CACHED plist — new env var NOT applied) +15:30:08Z Received SIGTERM (second restart: launchctl bootout + bootstrap) +15:30:14Z All sessions closed. Exiting. +[ 17m16s of bot downtime ] +15:47:30Z Bot version: 2d5fd7b (manual recovery from a separate session) +``` + +Failing command sequence that caused the 17-minute outage: + +```bash +launchctl bootout gui/$uid/ai.minime.telegram-bot +sleep 2 +launchctl bootstrap gui/$uid ~/Library/LaunchAgents/ai.minime.telegram-bot.plist +# → Bootstrap failed: 5: Input/output error +``` + +Root cause: `launchctl bootout` in the `gui` domain is asynchronous. The bootstrap call fired before teardown finished, got EIO, and service registration was gone so KeepAlive could not respawn. + +## Reference: launchctl list output + +Column format verified on macOS 14 (Darwin 23.6.0): + +``` +$ launchctl list | head -1 +PID Status Label +$ launchctl list | awk '$3=="ai.minime.telegram-bot"' +84395 0 ai.minime.telegram-bot +``` + +PID column shows `-` when the service is registered but the process is not currently running. + +## Reference: existing scripts in bot/scripts/ + +``` +bot/scripts/deliver.sh +bot/scripts/generate-plists.ts +bot/scripts/run-cron.sh +bot/scripts/start-bot.sh # referenced from plist ProgramArguments +``` + +`deliver.sh` and `run-cron.sh` are the closest style references for a new shell script. + +## Reference: current rule (.claude/rules/platform/bot-operations.md) + +Full current content of `.claude/rules/platform/bot-operations.md` — the file Task 2 updates: + +```markdown +# Bot Operations + +- **Bot restart requires explicit user confirmation.** No exceptions. +- **Graceful restart:** `launchctl kill SIGTERM gui/$(id -u)/ai.minime.telegram-bot` — sends SIGTERM. Bot injects shutdown message into active sessions asking agents to wrap up, then waits up to 60s for turns to complete. Idle sessions close immediately. Launchd auto-restarts (KeepAlive=true). +- **Wait for shutdown to complete.** After SIGTERM, the bot may take up to 60s to drain active sessions. Check logs for `All sessions closed. Exiting.` before concluding the restart failed. Running `launchctl list` during this window may show a stale exit code — that does NOT mean the restart failed. +- **Never use `launchctl bootout` after SIGTERM.** If you `bootout` while the bot is still draining sessions, you remove the service definition from launchd — KeepAlive can no longer restart it. The bot dies with no way back except manual `launchctl load`. +- **Never use `launchctl kickstart -k`** — it sends SIGKILL, bypasses graceful shutdown, kills active sessions mid-turn. +- **If auto-restart doesn't happen** after clean exit (>90s, no new PID): `launchctl load ~/Library/LaunchAgents/ai.minime.telegram-bot.plist`. If that doesn't work — ask Ninja. +- **Config changes (hot-reloaded, no restart needed):** `agents` fields (`model`, `fallbackModel`, `maxTurns`, `systemPrompt`, `effort`, `workspaceCwd`) and `sessionDefaults` (`idleTimeoutMs`, `maxConcurrentSessions`) are re-read from `config.yaml` / `config.local.yaml` on every new session spawn. Edit the file and the next new session picks it up. Already-running sessions keep their original config. +- **Config changes (boot-level, restart required):** `telegramToken`, `discord.token`, `bindings`, `metricsPort`, `sessionDefaults.maxMessageAgeMs`, `sessionDefaults.requireMention`. Validate before restart: `npx tsx bot/src/config.ts --validate` +- **Cron changes:** edit crons.yaml → regenerate plists → load → test → verify logs +``` + +The rule does not specify a procedure for applying plist-on-disk changes and warns against `bootout` without pairing that with a safe alternative. + +## Tasks + +### Task 1: Add a bot-restart script under `bot/scripts/` (#100, P1) + +**Problem.** Operators currently restart the bot by typing raw `launchctl` commands. Two distinct triggers exist: + +1. Code or `config.yaml`/`config.local.yaml` changes — launchd's cached plist is still correct, so a graceful SIGTERM restart is enough (KeepAlive relaunches). +2. Changes to `ai.minime.telegram-bot.plist` itself (env vars, ProgramArguments, etc) — launchd caches the plist in memory at bootstrap time, so a plain SIGTERM restart relaunches from the stale cache and silently drops the edit. To pick up the new plist, the service must be fully unregistered and re-bootstrapped from disk. + +Case 2 is where operators get hurt: the naive `bootout` + `bootstrap` sequence has a race (bootout is asynchronous, bootstrap returns EIO if called too early). On 2026-04-18 this produced 17 minutes of bot downtime — see incident reference above. + +**What we want.** A single supported, scripted entry point that an operator (or automation) can invoke to restart the bot safely, covering both cases. After the restart, the bot is running and — in case 2 — actually reflects the on-disk plist. + +- [x] There is an executable script in `bot/scripts/` that operators invoke to restart the bot. Exact name and CLI are the implementer's choice, but the name and usage MUST be referenced verbatim from the updated rule in Task 2 (one contract, one source of truth). +- [x] Default / zero-argument invocation performs a graceful restart suitable for case 1 (code/config changes only); it sends SIGTERM, waits for the old process to exit, and returns successfully once a new PID is running. +- [x] There is a clearly named mode (flag, subcommand, or separate script — implementer's choice) that performs case 2: after the script finishes successfully, the running bot process reflects the current on-disk plist (verifiable via `sudo launchctl procinfo ` showing any new `EnvironmentVariables` from the plist). +- [x] The script tolerates variable session-drain time — including the full 60-second shutdown window — without racing launchd's teardown and without relying on a fixed `sleep`. Simulating a slow shutdown (e.g. a stub that delays exit) must not cause false-positive or false-negative exits. +- [x] The script never sends SIGKILL and never invokes `launchctl kickstart -k` or equivalent; graceful shutdown is not bypassable. +- [x] On success the script prints the new PID and exits 0; on any failure to shut down or to bring the service back up, it prints a clear diagnostic and exits non-zero. +- [x] Invoking with `-h` / `--help` or an unknown argument prints usage and exits with an appropriate status. +- [x] Style is consistent with other shell scripts in `bot/scripts/` (see `deliver.sh`, `run-cron.sh`). +- [x] `shellcheck` produces no errors on the new script (warnings acceptable if justified inline). +- [x] Before any restart action the script runs config validation (`npx tsx bot/src/config.ts --validate` or equivalent for the case being applied) and aborts with a non-zero exit and a clear diagnostic if validation fails — the bot is never restarted with a broken config +- [x] Automated test harness covers: (a) graceful path returns the new PID and exits 0; (b) plist-change mode results in the on-disk plist being reflected after success; (c) a slow-shutdown stub that delays exit up to the full 60-second drain window does not cause the script to race launchd's teardown or exit early; (d) a deliberately broken config aborts the restart before any SIGTERM is sent +- [x] Verify existing tests pass + +### Task 2: Recommend the script in `.claude/rules/platform/bot-operations.md` (#100, P1) + +**Problem.** The current rule tells operators to type raw `launchctl` commands, and its "Never use `launchctl bootout` after SIGTERM" warning — taken literally — leaves an operator with no correct way to apply a plist edit at all. That gap is what the Task 1 script closes. + +**What we want.** After Task 1 lands, the rule points operators to the script as the canonical restart path. Operators who edit `config.yaml` / `config.local.yaml` or the plist should be able to follow the rule without ever running raw `launchctl` commands. + +- [x] The rule recommends the Task 1 script (by its actual name and CLI) as the default way to restart the bot, and explicitly covers the plist-change case. +- [x] Any previous guidance that contradicts or misleads about the plist-change flow — specifically the blanket "Never use `launchctl bootout` after SIGTERM" line — is updated or removed so operators following the rule end up with a working bot after a plist edit. +- [x] The "Never use `launchctl kickstart -k`" warning is preserved. +- [x] The auto-restart fallback ("If auto-restart doesn't happen…") still works: it points at the script first, and only falls back to manual `launchctl` commands if the script itself fails. +- [x] All existing sections on hot-reload fields, boot-level fields, and cron changes are preserved unchanged. +- [x] manual test (skipped - not automatable): Following the updated rule end-to-end (edit plist → run script → check `sudo launchctl procinfo `) successfully picks up the new on-disk plist without any raw `launchctl` commands typed by the operator +- [x] Verify existing tests pass