From 40a1533e502305acade280e0f680ecbcd47fd06c Mon Sep 17 00:00:00 2001
From: fitz123 <fitz123@users.noreply.github.com>
Date: Sun, 19 Apr 2026 15:28:35 +0300
Subject: [PATCH] feat: safe scripted bot restart (#100)

Single supported restart entry point for the launchd-managed bot.

- bot/scripts/restart-bot.sh
  - Default mode: graceful SIGTERM, waits up to 60s drain, returns new PID
  - --plist mode: race-free bootout+bootstrap that reflects on-disk plist
  - Pre-restart config validation aborts without sending SIGTERM
  - Never SIGKILL, never kickstart -k
  - Tolerates variable drain time without fixed sleep
  - HOME fallback guard for restricted environments
- bot/src/__tests__/restart-bot.test.ts: 13 tests covering both modes,
  slow-shutdown drain (5s), broken-config abort, malformed plist
- .claude/rules/platform/bot-operations.md + bot-operations skill:
  operators use restart-bot.sh instead of raw launchctl
- README.md: operator docs point at the script

Closes #100

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---
 .claude/rules/platform/bot-operations.md      |  15 +-
 .claude/skills/bot-operations/SKILL.md        |  18 +-
 README.md                                     |  14 +-
 bot/scripts/restart-bot.sh                    | 271 +++++++++++
 bot/src/__tests__/restart-bot.test.ts         | 452 ++++++++++++++++++
 .../2026-04-18-issue-100-safe-restart.md      | 166 +++++++
 .../2026-04-18-issue-100-safe-restart.md      | 122 +++++
 7 files changed, 1037 insertions(+), 21 deletions(-)
 create mode 100755 bot/scripts/restart-bot.sh
 create mode 100644 bot/src/__tests__/restart-bot.test.ts
 create mode 100644 docs/plans/2026-04-18-issue-100-safe-restart.md
 create mode 100644 docs/plans/completed/2026-04-18-issue-100-safe-restart.md

diff --git a/.claude/rules/platform/bot-operations.md b/.claude/rules/platform/bot-operations.md
index 82a64e3..396e382 100644
--- a/.claude/rules/platform/bot-operations.md
+++ b/.claude/rules/platform/bot-operations.md
@@ -1,11 +1,14 @@
 # Bot Operations
 
 - **Bot restart requires explicit user confirmation.** No exceptions.
-- **Graceful restart:** `launchctl kill SIGTERM gui/$(id -u)/ai.minime.telegram-bot` — sends SIGTERM. Bot injects shutdown message into active sessions asking agents to wrap up, then waits up to 60s for turns to complete. Idle sessions close immediately. Launchd auto-restarts (KeepAlive=true).
-- **Wait for shutdown to complete.** After SIGTERM, the bot may take up to 60s to drain active sessions. Check logs for `All sessions closed. Exiting.` before concluding the restart failed. Running `launchctl list` during this window may show a stale exit code — that does NOT mean the restart failed.
-- **Never use `launchctl bootout` after SIGTERM.** If you `bootout` while the bot is still draining sessions, you remove the service definition from launchd — KeepAlive can no longer restart it. The bot dies with no way back except manual `launchctl load`.
-- **Never use `launchctl kickstart -k`** — it sends SIGKILL, bypasses graceful shutdown, kills active sessions mid-turn.
-- **If auto-restart doesn't happen** after clean exit (>90s, no new PID): `launchctl load ~/Library/LaunchAgents/ai.minime.telegram-bot.plist`. If that doesn't work — ask Ninja.
+- **Canonical restart path:** use `bot/scripts/restart-bot.sh`. Do not type raw `launchctl` commands. The script validates config, sends SIGTERM, polls launchd teardown so bootout never races bootstrap, and returns the new PID on success.
+  - `bot/scripts/restart-bot.sh` — graceful SIGTERM restart. Use after code changes or edits to `config.yaml` / `config.local.yaml`. KeepAlive relaunches from the cached plist.
+  - `bot/scripts/restart-bot.sh --plist` — full unregister + re-bootstrap. Use after edits to `~/Library/LaunchAgents/ai.minime.telegram-bot.plist` (env vars, ProgramArguments, etc). Required because launchd caches the plist at bootstrap time; a plain SIGTERM restart picks up the stale cache and silently drops the edit.
+  - `bot/scripts/restart-bot.sh -h` — usage.
+- **Shutdown takes up to 60s.** The bot injects a shutdown message into active sessions, waits for turns to complete, then exits. Idle sessions close immediately. The script polls until the old PID is gone and a new PID is running — do not conclude failure from `launchctl list` output mid-drain.
+- **Never bypass the script with raw `launchctl bootout`.** Manual `bootout` in the `gui` domain is asynchronous; pairing it with an immediate `bootstrap` races launchd's teardown and can leave the service unregistered with no way for KeepAlive to respawn (see incident 2026-04-18, 17 min outage). The `--plist` mode handles this safely by polling teardown to completion before bootstrap.
+- **Never use `launchctl kickstart -k`** — it sends SIGKILL, bypasses graceful shutdown, kills active sessions mid-turn. The script never does this and neither should operators.
+- **If the script fails** or auto-restart doesn't happen (>90s, no new PID), rerun `bot/scripts/restart-bot.sh --plist`. If that still fails, fall back to `launchctl bootstrap gui/$(id -u) ~/Library/LaunchAgents/ai.minime.telegram-bot.plist`. If that doesn't work — ask Ninja.
 - **Config changes (hot-reloaded, no restart needed):** `agents` fields (`model`, `fallbackModel`, `maxTurns`, `systemPrompt`, `effort`, `workspaceCwd`) and `sessionDefaults` (`idleTimeoutMs`, `maxConcurrentSessions`) are re-read from `config.yaml` / `config.local.yaml` on every new session spawn. Edit the file and the next new session picks it up. Already-running sessions keep their original config.
-- **Config changes (boot-level, restart required):** `telegramToken`, `discord.token`, `bindings`, `metricsPort`, `sessionDefaults.maxMessageAgeMs`, `sessionDefaults.requireMention`. Validate before restart: `npx tsx bot/src/config.ts --validate`
+- **Config changes (boot-level, restart required):** `telegramToken`, `discord.token`, `bindings`, `metricsPort`, `sessionDefaults.maxMessageAgeMs`, `sessionDefaults.requireMention`. Validate before restart: `npx tsx bot/src/config.ts --validate` (the script runs this automatically and aborts on failure).
 - **Cron changes:** edit crons.yaml → regenerate plists → load → test → verify logs
diff --git a/.claude/skills/bot-operations/SKILL.md b/.claude/skills/bot-operations/SKILL.md
index fa505ed..78c27a2 100644
--- a/.claude/skills/bot-operations/SKILL.md
+++ b/.claude/skills/bot-operations/SKILL.md
@@ -4,20 +4,20 @@ Reference for Telegram bot and cron system management.
 
 ## Bot Restart
 
-1. Validate config: `cd ~/.minime/workspace/bot && npx tsx src/config.ts --validate`
-2. Report result and reason to Ninja
-3. Wait for explicit confirmation
-4. Graceful restart: `launchctl kill SIGTERM gui/$(id -u)/ai.minime.telegram-bot`
-5. Wait for drain: check logs for `All sessions closed. Exiting.` (up to 60s)
-6. Verify: `launchctl list | grep ai.minime.telegram-bot` — new PID, exit 0 (note: stale exit code during drain window is normal, wait for step 5 first)
+1. Report intent and reason to Ninja
+2. Wait for explicit confirmation
+3. Restart via the canonical script (it validates config, sends SIGTERM, polls launchd teardown, returns the new PID):
+   - Code or `config.yaml` / `config.local.yaml` changes: `bot/scripts/restart-bot.sh`
+   - Plist-on-disk changes (`~/Library/LaunchAgents/ai.minime.telegram-bot.plist`): `bot/scripts/restart-bot.sh --plist`
+   - Usage: `bot/scripts/restart-bot.sh -h`
 
-Bot injects shutdown message into active sessions, waits up to 60s for turns to complete, then launchd auto-restarts (KeepAlive=true).
+Bot injects shutdown message into active sessions, waits up to 60s for turns to complete, then launchd auto-restarts (KeepAlive=true). The script polls until the old PID is gone and a new PID is running — do not conclude failure from `launchctl list` output mid-drain.
 
 **Never use:**
 - `launchctl kickstart -k` — sends SIGKILL, kills sessions mid-turn
-- `launchctl bootout` after SIGTERM — removes service definition, prevents auto-restart
+- Raw `launchctl bootout` paired with immediate `bootstrap` — async teardown races bootstrap and can leave the service unregistered (2026-04-18 incident, 17 min outage). Use `--plist` mode instead.
 
-**If auto-restart fails** (>90s, no new PID): `launchctl load ~/Library/LaunchAgents/ai.minime.telegram-bot.plist`
+**If the script fails** or auto-restart doesn't happen (>90s, no new PID), rerun `bot/scripts/restart-bot.sh --plist`. If that still fails, fall back to `launchctl bootstrap gui/$(id -u) ~/Library/LaunchAgents/ai.minime.telegram-bot.plist`. If that doesn't work — ask Ninja.
 
 ## Config Changes
 
diff --git a/README.md b/README.md
index f0b1d54..162dc0b 100644
--- a/README.md
+++ b/README.md
@@ -175,8 +175,11 @@ The bot runs as a launchd service: `ai.minime.telegram-bot`.
 # Check status
 launchctl print gui/$(id -u)/ai.minime.telegram-bot 2>&1 | head -5
 
-# Restart (graceful — waits for active sessions to finish)
-launchctl kill SIGTERM gui/$(id -u)/ai.minime.telegram-bot
+# Restart (graceful — validates config, sends SIGTERM, waits for drain, returns new PID)
+bot/scripts/restart-bot.sh
+
+# Restart after editing ~/Library/LaunchAgents/ai.minime.telegram-bot.plist
+bot/scripts/restart-bot.sh --plist
 
 # Stop
 launchctl bootout gui/$(id -u)/ai.minime.telegram-bot
@@ -185,7 +188,7 @@ launchctl bootout gui/$(id -u)/ai.minime.telegram-bot
 launchctl bootstrap gui/$(id -u) ~/Library/LaunchAgents/ai.minime.telegram-bot.plist
 ```
 
-**Warning:** Graceful restart sends SIGTERM — the bot injects a shutdown message into active sessions and waits up to 60s for turns to complete before exiting. Idle sessions close immediately. launchd auto-restarts via KeepAlive. Still, active work is interrupted — always confirm before restarting.
+**Warning:** Graceful restart sends SIGTERM — the bot injects a shutdown message into active sessions and waits up to 60s for turns to complete before exiting. Idle sessions close immediately. launchd auto-restarts via KeepAlive. Still, active work is interrupted — always confirm before restarting. Use `--plist` after editing the plist on disk, because launchd caches the plist at bootstrap time and a plain SIGTERM restart would pick up the stale cache.
 
 ## Add a Cron
 
@@ -247,10 +250,9 @@ To remove: `launchctl bootout gui/$(id -u)/ai.minime.cron.<name>`, delete from `
 
    See [config.yaml](config.yaml) for all binding options including `requireMention`, `voiceTranscriptEcho`, `typingIndicator`, and per-topic overrides for forum supergroups.
 
-2. Validate and restart:
+2. Restart (the script validates config before sending SIGTERM):
    ```bash
-   cd ~/.minime/bot && npx tsx src/config.ts --validate
-   launchctl kill SIGTERM gui/$(id -u)/ai.minime.telegram-bot
+   bot/scripts/restart-bot.sh
    ```
 
 ## Add a Discord Binding
diff --git a/bot/scripts/restart-bot.sh b/bot/scripts/restart-bot.sh
new file mode 100755
index 0000000..46b0ee3
--- /dev/null
+++ b/bot/scripts/restart-bot.sh
@@ -0,0 +1,271 @@
+#!/bin/bash
+# restart-bot.sh — Safely restart the Telegram bot launchd service
+# Usage:
+#   restart-bot.sh              Graceful SIGTERM restart (code / config.yaml changes)
+#   restart-bot.sh --plist      Full unregister + re-bootstrap (plist-on-disk changes)
+#   restart-bot.sh -h|--help    Show this help
+#
+# Never sends SIGKILL. Validates config before restarting. Polls launchd
+# teardown so bootout is not raced against bootstrap.
+
+set -euo pipefail
+
+if [ -z "${HOME:-}" ]; then
+  if command -v dscl >/dev/null 2>&1; then
+    HOME="$(dscl . -read "/Users/$(id -un)" NFSHomeDirectory 2>/dev/null | awk '{print $2}')"
+  fi
+fi
+if [ -z "${HOME:-}" ]; then
+  if command -v getent >/dev/null 2>&1; then
+    HOME="$(getent passwd "$(id -un)" 2>/dev/null | cut -d: -f6)"
+  fi
+fi
+export HOME
+if [ -z "$HOME" ]; then
+  echo "[restart-bot] Error: could not determine HOME from environment or fallback lookups" >&2
+  exit 1
+fi
+export PATH="/opt/homebrew/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:${PATH:-}"
+
+SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
+BOT_DIR="$(cd "$SCRIPT_DIR/.." && pwd)"
+
+LAUNCHCTL_BIN="${LAUNCHCTL_BIN:-/bin/launchctl}"
+PLUTIL_BIN="${PLUTIL_BIN:-/usr/bin/plutil}"
+BOT_LABEL="${BOT_LABEL:-ai.minime.telegram-bot}"
+BOT_PLIST="${BOT_PLIST:-$HOME/Library/LaunchAgents/${BOT_LABEL}.plist}"
+BOT_UID="${BOT_UID:-$(id -u)}"
+DOMAIN="gui/${BOT_UID}"
+SERVICE="${DOMAIN}/${BOT_LABEL}"
+
+# Test-only: override the validator with a single executable (no args, no eval).
+# Tests set this to `true` / `false` to simulate validation pass / fail paths.
+CONFIG_VALIDATE_BIN="${CONFIG_VALIDATE_BIN:-}"
+
+# Timeouts (seconds). Drain window is 60s — give headroom.
+SHUTDOWN_TIMEOUT="${SHUTDOWN_TIMEOUT:-90}"
+TEARDOWN_TIMEOUT="${TEARDOWN_TIMEOUT:-90}"
+STARTUP_TIMEOUT="${STARTUP_TIMEOUT:-60}"
+POLL_INTERVAL="${POLL_INTERVAL:-1}"
+
+usage() {
+  cat <<EOF
+Usage:
+  restart-bot.sh              Graceful SIGTERM restart (code / config.yaml changes)
+  restart-bot.sh --plist      Full unregister + re-bootstrap (plist-on-disk changes)
+  restart-bot.sh -h|--help    Show this help
+
+On success: prints new PID and exits 0.
+On failure: prints a diagnostic and exits non-zero.
+EOF
+}
+
+log() { echo "[restart-bot] $*"; }
+err() { echo "[restart-bot] Error: $*" >&2; }
+
+MODE="graceful"
+
+while [ $# -gt 0 ]; do
+  case "$1" in
+    -h|--help) usage; exit 0 ;;
+    --plist) MODE="plist"; shift ;;
+    *) err "unknown argument: $1"; usage >&2; exit 2 ;;
+  esac
+done
+
+# get_pid prints one of:
+#   <numeric pid> — service is registered and running
+#   ""            — service is registered but has no running process (PID = "-")
+# exit status:
+#   0 — registered (pid may be empty)
+#   1 — not registered (launchctl query succeeded, label absent)
+#   2 — launchctl query itself failed (unknown state)
+get_pid() {
+  local out
+  if ! out=$("$LAUNCHCTL_BIN" list 2>/dev/null); then
+    return 2
+  fi
+  local line
+  line=$(printf '%s\n' "$out" | awk -v L="$BOT_LABEL" '$3==L { print; exit }')
+  if [ -z "$line" ]; then
+    return 1
+  fi
+  local pid
+  pid=$(printf '%s\n' "$line" | awk '{print $1}')
+  if [ "$pid" = "-" ]; then
+    echo ""
+  else
+    echo "$pid"
+  fi
+  return 0
+}
+
+# True only when launchctl query succeeded AND the service is registered.
+# A transient query failure is NOT treated as "registered" or "not registered".
+is_registered() {
+  local rc=0
+  get_pid >/dev/null 2>&1 || rc=$?
+  [ "$rc" -eq 0 ]
+}
+
+wait_until() {
+  # wait_until <timeout_seconds> <predicate_fn>
+  local timeout="$1"; local pred="$2"
+  local deadline
+  deadline=$(( $(date +%s) + timeout ))
+  while [ "$(date +%s)" -lt "$deadline" ]; do
+    if "$pred"; then
+      return 0
+    fi
+    sleep "$POLL_INTERVAL"
+  done
+  return 1
+}
+
+_old_pid=""
+_pred_old_pid_gone() {
+  local cur rc=0
+  cur=$(get_pid 2>/dev/null) || rc=$?
+  case "$rc" in
+    0) [ "$cur" != "$_old_pid" ] ;;
+    1) return 0 ;;   # explicitly not registered → old pid gone
+    *) return 1 ;;   # query failed → unknown, keep polling
+  esac
+}
+
+# Distinguishes "confirmed not registered" from "query failed", so a transient
+# launchctl error can't trick us into bootstrapping over a still-registered svc.
+_pred_unregistered() {
+  local rc=0
+  get_pid >/dev/null 2>&1 || rc=$?
+  [ "$rc" -eq 1 ]
+}
+
+# Requires a successful query AND a non-empty PID that differs from the old PID,
+# so a stale `launchctl list` response can't be mistaken for the new process.
+_pred_running_pid() {
+  local pid rc=0
+  pid=$(get_pid 2>/dev/null) || rc=$?
+  [ "$rc" -eq 0 ] && [ -n "$pid" ] && [ "$pid" != "$_old_pid" ]
+}
+
+validate_plist() {
+  log "Validating plist at $BOT_PLIST…"
+  if ! "$PLUTIL_BIN" -lint "$BOT_PLIST" >/dev/null 2>&1; then
+    err "plist is malformed: $BOT_PLIST"
+    err "run: $PLUTIL_BIN -lint \"$BOT_PLIST\" for details"
+    return 1
+  fi
+  local plist_label
+  if ! plist_label=$("$PLUTIL_BIN" -extract Label raw "$BOT_PLIST" 2>/dev/null); then
+    err "plist is missing 'Label' key: $BOT_PLIST"
+    return 1
+  fi
+  if [ "$plist_label" != "$BOT_LABEL" ]; then
+    err "plist Label '$plist_label' does not match expected '$BOT_LABEL'"
+    return 1
+  fi
+}
+
+validate_config() {
+  log "Validating config before restart…"
+  if [ -n "$CONFIG_VALIDATE_BIN" ]; then
+    if ! ( cd "$BOT_DIR" && "$CONFIG_VALIDATE_BIN" >/dev/null 2>&1 ); then
+      err "config validation failed; refusing to restart"
+      return 1
+    fi
+    return 0
+  fi
+  if ! ( cd "$BOT_DIR" && npx tsx src/config.ts --validate >/dev/null ); then
+    err "config validation failed; refusing to restart"
+    return 1
+  fi
+}
+
+graceful_restart() {
+  local old_pid
+  if ! old_pid=$(get_pid); then
+    err "service $BOT_LABEL is not registered with launchd; run: restart-bot.sh --plist"
+    return 1
+  fi
+
+  if [ -z "$old_pid" ]; then
+    err "service $BOT_LABEL is registered but has no running process (PID=-); run: restart-bot.sh --plist"
+    return 1
+  fi
+
+  validate_config || return 1
+
+  log "Sending SIGTERM to $SERVICE (old PID: $old_pid)"
+  if ! "$LAUNCHCTL_BIN" kill SIGTERM "$SERVICE"; then
+    err "launchctl kill SIGTERM failed"
+    return 1
+  fi
+
+  _old_pid="$old_pid"
+  log "Waiting up to ${SHUTDOWN_TIMEOUT}s for old process $old_pid to exit…"
+  if ! wait_until "$SHUTDOWN_TIMEOUT" _pred_old_pid_gone; then
+    err "old process $old_pid did not exit within ${SHUTDOWN_TIMEOUT}s"
+    return 1
+  fi
+
+  log "Waiting up to ${STARTUP_TIMEOUT}s for KeepAlive to spawn a new PID…"
+  if ! wait_until "$STARTUP_TIMEOUT" _pred_running_pid; then
+    err "no new PID observed within ${STARTUP_TIMEOUT}s; KeepAlive did not restart"
+    return 1
+  fi
+
+  local new_pid
+  new_pid=$(get_pid 2>/dev/null || true)
+  log "Restart complete. New PID: ${new_pid:-unknown}"
+  echo "$new_pid"
+}
+
+plist_restart() {
+  if [ ! -f "$BOT_PLIST" ]; then
+    err "plist not found: $BOT_PLIST"
+    return 1
+  fi
+
+  validate_plist || return 1
+  validate_config || return 1
+
+  if is_registered; then
+    log "Unregistering $SERVICE (launchctl bootout)…"
+    # bootout may return non-zero even when the teardown is in progress;
+    # we rely on polling below, not the exit code.
+    "$LAUNCHCTL_BIN" bootout "$SERVICE" >/dev/null 2>&1 || true
+
+    log "Waiting up to ${TEARDOWN_TIMEOUT}s for teardown to complete…"
+    if ! wait_until "$TEARDOWN_TIMEOUT" _pred_unregistered; then
+      err "service did not unregister within ${TEARDOWN_TIMEOUT}s; refusing to bootstrap"
+      err "bootout is still draining sessions — rerun once 'launchctl list' no longer shows $BOT_LABEL"
+      return 1
+    fi
+  else
+    log "Service not currently registered; skipping bootout."
+  fi
+
+  log "Bootstrapping from $BOT_PLIST…"
+  if ! "$LAUNCHCTL_BIN" bootstrap "$DOMAIN" "$BOT_PLIST"; then
+    err "launchctl bootstrap failed"
+    return 1
+  fi
+
+  log "Waiting up to ${STARTUP_TIMEOUT}s for a running PID…"
+  if ! wait_until "$STARTUP_TIMEOUT" _pred_running_pid; then
+    err "service registered but no running PID within ${STARTUP_TIMEOUT}s"
+    return 1
+  fi
+
+  local new_pid
+  new_pid=$(get_pid 2>/dev/null || true)
+  log "Restart complete. New PID: ${new_pid:-unknown}"
+  echo "$new_pid"
+}
+
+case "$MODE" in
+  graceful) graceful_restart ;;
+  plist)    plist_restart ;;
+  *)        err "internal: unhandled mode $MODE"; exit 1 ;;
+esac
diff --git a/bot/src/__tests__/restart-bot.test.ts b/bot/src/__tests__/restart-bot.test.ts
new file mode 100644
index 0000000..cac944e
--- /dev/null
+++ b/bot/src/__tests__/restart-bot.test.ts
@@ -0,0 +1,452 @@
+import { describe, it } from "node:test";
+import assert from "node:assert/strict";
+import { spawnSync } from "node:child_process";
+import { createHash } from "node:crypto";
+import { mkdtempSync, mkdirSync, rmSync, writeFileSync, chmodSync, readFileSync, existsSync } from "node:fs";
+import { join, resolve, dirname } from "node:path";
+import { tmpdir } from "node:os";
+import { fileURLToPath } from "node:url";
+
+const __filename = fileURLToPath(import.meta.url);
+const __dirname = dirname(__filename);
+const SCRIPT = resolve(__dirname, "../../scripts/restart-bot.sh");
+
+// Mock launchctl: writes / reads a key=value state file in $STATE_DIR/state.
+// Supports: list, kill <sig> <svc>, bootout <svc>, bootstrap <domain> <plist>.
+// Timing is driven by optional keys: kill_delay, bootout_delay (seconds).
+const MOCK_LAUNCHCTL = `#!/bin/bash
+set -uo pipefail
+STATE="\${STATE_DIR}/state"
+touch "\$STATE"
+
+get() { awk -F= -v k="\$1" 'BEGIN{v=""} $1==k {v=$0; sub(/^[^=]*=/,"",v)} END{print v}' "\$STATE"; }
+set_kv() {
+  tmp="\$STATE.tmp"
+  awk -F= -v k="\$1" '$1!=k' "\$STATE" > "\$tmp"
+  printf '%s=%s\\n' "\$1" "\$2" >> "\$tmp"
+  mv "\$tmp" "\$STATE"
+}
+del_kv() {
+  tmp="\$STATE.tmp"
+  awk -F= -v k="\$1" '$1!=k' "\$STATE" > "\$tmp"
+  mv "\$tmp" "\$STATE"
+}
+now() { date +%s; }
+
+apply() {
+  kat=\$(get kill_at)
+  bat=\$(get bootout_at)
+  if [ -n "\$kat" ] && [ "\$(now)" -ge "\$kat" ]; then
+    np=\$(get next_pid); [ -z "\$np" ] && np=99999
+    set_kv pid "\$np"
+    del_kv kill_at
+  fi
+  if [ -n "\$bat" ] && [ "\$(now)" -ge "\$bat" ]; then
+    set_kv registered 0
+    del_kv pid
+    del_kv bootout_at
+  fi
+}
+
+cmd="\${1:-}"; shift || true
+case "\$cmd" in
+  list)
+    apply
+    if [ "\$(get registered)" = "1" ]; then
+      label=\$(get label); [ -z "\$label" ] && label="ai.minime.telegram-bot"
+      pid=\$(get pid); [ -z "\$pid" ] && pid="-"
+      printf 'PID\\tStatus\\tLabel\\n'
+      printf '%s\\t0\\t%s\\n' "\$pid" "\$label"
+    fi
+    ;;
+  kill)
+    apply
+    delay=\$(get kill_delay); [ -z "\$delay" ] && delay=0
+    set_kv kill_at \$(( \$(now) + delay ))
+    ;;
+  bootout)
+    apply
+    delay=\$(get bootout_delay); [ -z "\$delay" ] && delay=0
+    set_kv bootout_at \$(( \$(now) + delay ))
+    ;;
+  bootstrap)
+    apply
+    if [ "\$(get registered)" = "1" ]; then
+      echo "Bootstrap failed: 5: Input/output error" >&2
+      exit 5
+    fi
+    plist_path="\${2:-}"
+    if [ -n "\$plist_path" ] && [ -f "\$plist_path" ]; then
+      sig=\$(node -e "const fs=require('fs'),c=require('crypto');process.stdout.write(c.createHash('sha1').update(fs.readFileSync(process.argv[1])).digest('hex'))" "\$plist_path")
+      set_kv bootstrapped_sig "\$sig"
+    fi
+    np=\$(get next_pid); [ -z "\$np" ] && np=99999
+    set_kv pid "\$np"
+    set_kv registered 1
+    ;;
+  *)
+    echo "mock-launchctl: unknown command: \$cmd" >&2
+    exit 1
+    ;;
+esac
+`;
+
+const MOCK_PLUTIL = `#!/bin/bash
+set -uo pipefail
+
+cmd="\${1:-}"
+case "\$cmd" in
+  -lint)
+    file="\${2:-}"
+    if [ -z "\$file" ] || [ ! -f "\$file" ]; then
+      exit 1
+    fi
+    if grep -q "<plist" "\$file" && grep -q "</plist>" "\$file" && grep -q "<dict>" "\$file" && grep -q "</dict>" "\$file"; then
+      exit 0
+    fi
+    echo "mock-plutil: malformed plist" >&2
+    exit 1
+    ;;
+  -extract)
+    key="\${2:-}"
+    fmt="\${3:-}"
+    file="\${4:-}"
+    if [ "\$key" != "Label" ] || [ "\$fmt" != "raw" ] || [ -z "\$file" ] || [ ! -f "\$file" ]; then
+      exit 1
+    fi
+    label=\$(tr -d '\\n' < "\$file" | sed -n 's|.*<key>[[:space:]]*Label[[:space:]]*</key>[[:space:]]*<string>\\([^<]*\\)</string>.*|\\1|p')
+    if [ -n "\$label" ]; then
+      printf '%s\\n' "\$label"
+      exit 0
+    fi
+    exit 1
+    ;;
+  *)
+    echo "mock-plutil: unsupported args: \$*" >&2
+    exit 1
+    ;;
+esac
+`;
+
+type Harness = {
+  dir: string;
+  launchctl: string;
+  plutil: string;
+  plist: string;
+  stateDir: string;
+  setState(kv: Record<string, string | number>): void;
+  readState(): Record<string, string>;
+  writePlist(content: string): void;
+  run(args: string[], env?: Record<string, string>): { status: number | null; stdout: string; stderr: string };
+};
+
+const VALID_PLIST = (label = "ai.minime.telegram-bot", extra = "INITIAL") => `<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
+<plist version="1.0">
+<dict>
+  <key>Label</key>
+  <string>${label}</string>
+  <key>Extra</key>
+  <string>${extra}</string>
+</dict>
+</plist>
+`;
+
+function createHarness(): Harness {
+  const dir = mkdtempSync(join(tmpdir(), "restart-bot-test-"));
+  const launchctl = join(dir, "launchctl");
+  const plutil = join(dir, "plutil");
+  const stateDir = join(dir, "state-dir");
+  const plist = join(dir, "ai.minime.telegram-bot.plist");
+  writeFileSync(launchctl, MOCK_LAUNCHCTL);
+  writeFileSync(plutil, MOCK_PLUTIL);
+  chmodSync(launchctl, 0o755);
+  chmodSync(plutil, 0o755);
+  writeFileSync(plist, VALID_PLIST());
+  mkdirSync(stateDir, { recursive: true });
+  const stateFile = join(stateDir, "state");
+
+  const setState = (kv: Record<string, string | number>) => {
+    const existing: Record<string, string> = {};
+    if (existsSync(stateFile)) {
+      for (const line of readFileSync(stateFile, "utf8").split("\n")) {
+        if (!line) continue;
+        const idx = line.indexOf("=");
+        if (idx === -1) continue;
+        existing[line.slice(0, idx)] = line.slice(idx + 1);
+      }
+    }
+    for (const [k, v] of Object.entries(kv)) existing[k] = String(v);
+    writeFileSync(
+      stateFile,
+      Object.entries(existing).map(([k, v]) => `${k}=${v}`).join("\n") + "\n",
+    );
+  };
+
+  const readState = (): Record<string, string> => {
+    const out: Record<string, string> = {};
+    if (!existsSync(stateFile)) return out;
+    for (const line of readFileSync(stateFile, "utf8").split("\n")) {
+      if (!line) continue;
+      const idx = line.indexOf("=");
+      if (idx === -1) continue;
+      out[line.slice(0, idx)] = line.slice(idx + 1);
+    }
+    return out;
+  };
+
+  const writePlist = (content: string) => writeFileSync(plist, content);
+
+  const run = (args: string[], env: Record<string, string> = {}) => {
+    const result = spawnSync(SCRIPT, args, {
+      env: {
+        ...process.env,
+        LAUNCHCTL_BIN: launchctl,
+        PLUTIL_BIN: plutil,
+        BOT_LABEL: "ai.minime.telegram-bot",
+        BOT_PLIST: plist,
+        BOT_UID: "501",
+        STATE_DIR: stateDir,
+        POLL_INTERVAL: "0.1",
+        SHUTDOWN_TIMEOUT: "30",
+        TEARDOWN_TIMEOUT: "30",
+        STARTUP_TIMEOUT: "20",
+        CONFIG_VALIDATE_BIN: "true",
+        ...env,
+      },
+      encoding: "utf8",
+      timeout: 60_000,
+    });
+    return {
+      status: result.status,
+      stdout: result.stdout ?? "",
+      stderr: result.stderr ?? "",
+    };
+  };
+
+  return { dir, launchctl, plutil, plist, stateDir, setState, readState, writePlist, run };
+}
+
+function cleanup(h: Harness) {
+  rmSync(h.dir, { recursive: true, force: true });
+}
+
+describe("restart-bot.sh", () => {
+  it("prints usage and exits 0 for --help", () => {
+    const h = createHarness();
+    try {
+      const { status, stdout } = h.run(["--help"]);
+      assert.strictEqual(status, 0);
+      assert.match(stdout, /Usage:/);
+      assert.match(stdout, /--plist/);
+    } finally {
+      cleanup(h);
+    }
+  });
+
+  it("does not blow up when HOME is unset (launchd context)", () => {
+    const h = createHarness();
+    try {
+      // Regression: `set -u` + `${BOT_PLIST:-$HOME/...}` crashed when HOME was unset.
+      // Drop HOME from the inherited env; --help should still print and exit 0.
+      const env: NodeJS.ProcessEnv = { ...process.env };
+      delete env.HOME;
+      const result = spawnSync(SCRIPT, ["--help"], {
+        env: env as Record<string, string>,
+        encoding: "utf8",
+        timeout: 30_000,
+      });
+      assert.strictEqual(result.status, 0, `expected exit 0, got ${result.status}: ${result.stderr}`);
+      assert.match(result.stdout, /Usage:/);
+      assert.doesNotMatch(result.stderr, /unbound variable/);
+    } finally {
+      cleanup(h);
+    }
+  });
+
+  it("exits 2 with usage on unknown argument", () => {
+    const h = createHarness();
+    try {
+      const { status, stderr } = h.run(["--bogus"]);
+      assert.strictEqual(status, 2);
+      assert.match(stderr, /unknown argument/);
+      assert.match(stderr, /Usage:/);
+    } finally {
+      cleanup(h);
+    }
+  });
+
+  it("graceful: returns new PID and exits 0", () => {
+    const h = createHarness();
+    try {
+      h.setState({ registered: 1, label: "ai.minime.telegram-bot", pid: 1111, next_pid: 2222, kill_delay: 0 });
+      const { status, stdout, stderr } = h.run([]);
+      assert.strictEqual(status, 0, `expected exit 0, got ${status}: ${stderr}`);
+      assert.match(stdout, /New PID: 2222/);
+      assert.match(stdout.trim(), /2222\s*$/m);
+      const state = h.readState();
+      assert.strictEqual(state.pid, "2222");
+    } finally {
+      cleanup(h);
+    }
+  });
+
+  it("graceful: refuses when service is not registered", () => {
+    const h = createHarness();
+    try {
+      const { status, stderr } = h.run([]);
+      assert.notStrictEqual(status, 0);
+      assert.match(stderr, /not registered/);
+    } finally {
+      cleanup(h);
+    }
+  });
+
+  it("graceful: tolerates a slow shutdown without racing teardown", () => {
+    const h = createHarness();
+    try {
+      // Simulate a realistic drain: the old PID stays listed for several seconds
+      // before KeepAlive swaps in the new PID. Script must keep polling the whole
+      // time, not time out or declare success early.
+      h.setState({ registered: 1, label: "ai.minime.telegram-bot", pid: 1111, next_pid: 3333, kill_delay: 5 });
+      const start = Date.now();
+      const { status, stdout, stderr } = h.run([]);
+      const elapsedMs = Date.now() - start;
+      assert.strictEqual(status, 0, `expected exit 0, got ${status}: ${stderr}`);
+      // It really did wait for the drain to complete.
+      // Script must have actually waited for the drain — not exited early.
+      // Give ~500ms slack for 1-second mock-clock granularity and poll interval.
+      assert.ok(elapsedMs >= 4000, `expected >= ~5s wait, got ${elapsedMs}ms`);
+      assert.match(stdout, /New PID: 3333/);
+    } finally {
+      cleanup(h);
+    }
+  });
+
+  it("graceful: aborts before SIGTERM when config validation fails", () => {
+    const h = createHarness();
+    try {
+      h.setState({ registered: 1, label: "ai.minime.telegram-bot", pid: 1111, next_pid: 2222 });
+      const { status, stdout, stderr } = h.run([], {
+        CONFIG_VALIDATE_BIN: "false",
+      });
+      assert.notStrictEqual(status, 0);
+      assert.match(stderr, /config validation failed/);
+      // No kill_at was set -> kill was never sent.
+      const state = h.readState();
+      assert.ok(!("kill_at" in state), `expected no kill scheduled, got state: ${JSON.stringify(state)}`);
+      assert.strictEqual(state.pid, "1111");
+      assert.doesNotMatch(stdout, /Sending SIGTERM/);
+    } finally {
+      cleanup(h);
+    }
+  });
+
+  it("--plist: new on-disk plist is reflected after success", () => {
+    const h = createHarness();
+    try {
+      // Old plist content, bot currently running.
+      h.writePlist(VALID_PLIST("ai.minime.telegram-bot", "OLD"));
+      h.setState({ registered: 1, label: "ai.minime.telegram-bot", pid: 1111, next_pid: 4444, bootout_delay: 2 });
+      // Operator edits the plist on disk just before restart.
+      h.writePlist(VALID_PLIST("ai.minime.telegram-bot", "NEW_ENV_VAR=BAR"));
+
+      // Capture signature of the NEW plist (matches mock launchctl's sha1 digest).
+      const newSig = createHash("sha1").update(readFileSync(h.plist)).digest("hex");
+
+      const { status, stdout, stderr } = h.run(["--plist"]);
+      assert.strictEqual(status, 0, `expected exit 0, got ${status}: ${stderr}`);
+      assert.match(stdout, /New PID: 4444/);
+
+      const state = h.readState();
+      assert.strictEqual(state.registered, "1");
+      assert.strictEqual(state.pid, "4444");
+      // Script bootstrapped from the NEW plist, not a cached one.
+      assert.strictEqual(state.bootstrapped_sig, newSig);
+    } finally {
+      cleanup(h);
+    }
+  });
+
+  it("--plist: does not race bootout teardown (bootstrap waits for unregister)", () => {
+    const h = createHarness();
+    try {
+      // bootout takes 4s to drain. If the script didn't wait, bootstrap would
+      // fire against a still-registered service and get EIO.
+      h.setState({ registered: 1, label: "ai.minime.telegram-bot", pid: 1111, next_pid: 5555, bootout_delay: 4 });
+      const start = Date.now();
+      const { status, stdout, stderr } = h.run(["--plist"]);
+      const elapsedMs = Date.now() - start;
+      assert.strictEqual(status, 0, `expected exit 0, got ${status}: ${stderr}`);
+      assert.ok(elapsedMs >= 3500, `expected >= ~4s wait, got ${elapsedMs}ms`);
+      assert.match(stdout, /New PID: 5555/);
+      assert.doesNotMatch(stderr, /Input\/output error/);
+    } finally {
+      cleanup(h);
+    }
+  });
+
+  it("--plist: aborts when config validation fails", () => {
+    const h = createHarness();
+    try {
+      h.setState({ registered: 1, label: "ai.minime.telegram-bot", pid: 1111, next_pid: 4444 });
+      const { status, stderr } = h.run(["--plist"], { CONFIG_VALIDATE_BIN: "false" });
+      assert.notStrictEqual(status, 0);
+      assert.match(stderr, /config validation failed/);
+      const state = h.readState();
+      // No bootout scheduled.
+      assert.ok(!("bootout_at" in state));
+      assert.strictEqual(state.pid, "1111");
+    } finally {
+      cleanup(h);
+    }
+  });
+
+  it("--plist: aborts before bootout when plist is malformed", () => {
+    const h = createHarness();
+    try {
+      h.setState({ registered: 1, label: "ai.minime.telegram-bot", pid: 1111, next_pid: 4444 });
+      // Syntactically broken plist — plutil -lint must reject before any bootout.
+      h.writePlist("<plist>not valid xml");
+      const { status, stderr } = h.run(["--plist"]);
+      assert.notStrictEqual(status, 0);
+      assert.match(stderr, /plist is malformed/);
+      const state = h.readState();
+      // No bootout scheduled, service still registered and running.
+      assert.ok(!("bootout_at" in state), `expected no bootout scheduled, got: ${JSON.stringify(state)}`);
+      assert.strictEqual(state.registered, "1");
+      assert.strictEqual(state.pid, "1111");
+    } finally {
+      cleanup(h);
+    }
+  });
+
+  it("--plist: aborts before bootout when plist Label does not match", () => {
+    const h = createHarness();
+    try {
+      h.setState({ registered: 1, label: "ai.minime.telegram-bot", pid: 1111, next_pid: 4444 });
+      h.writePlist(VALID_PLIST("ai.minime.WRONG-LABEL", "X"));
+      const { status, stderr } = h.run(["--plist"]);
+      assert.notStrictEqual(status, 0);
+      assert.match(stderr, /Label .* does not match/);
+      const state = h.readState();
+      assert.ok(!("bootout_at" in state));
+      assert.strictEqual(state.registered, "1");
+      assert.strictEqual(state.pid, "1111");
+    } finally {
+      cleanup(h);
+    }
+  });
+
+  it("--plist: fails cleanly when plist file is missing", () => {
+    const h = createHarness();
+    try {
+      rmSync(h.plist);
+      const { status, stderr } = h.run(["--plist"]);
+      assert.notStrictEqual(status, 0);
+      assert.match(stderr, /plist not found/);
+    } finally {
+      cleanup(h);
+    }
+  });
+});
diff --git a/docs/plans/2026-04-18-issue-100-safe-restart.md b/docs/plans/2026-04-18-issue-100-safe-restart.md
new file mode 100644
index 0000000..3873ad8
--- /dev/null
+++ b/docs/plans/2026-04-18-issue-100-safe-restart.md
@@ -0,0 +1,166 @@
+# Safe scripted bot restart + rule update — Round 1
+
+## Goal
+
+Give operators one supported, scripted entry point to restart the bot safely — covering both code/config restarts and on-disk-plist changes — and update the operational rule to point at it. Resolves fitz123/claude-code-bot#100.
+
+## Validation Commands
+
+```bash
+cd bot && npm test
+cd bot && npx tsc --noEmit
+shellcheck bot/scripts/*.sh || true
+```
+
+## Reference: incident log
+
+Actual timestamps from an operator machine's `~/.minime/logs/telegram-bot.stdout.log`:
+
+```
+15:27:42Z  Received SIGTERM                  (first restart: launchctl kill SIGTERM)
+15:28:43Z  All sessions closed. Exiting.
+15:28:44Z  Bot version: 2d5fd7b              (launchd KeepAlive auto-restart, CACHED plist — new env var NOT applied)
+15:30:08Z  Received SIGTERM                  (second restart: launchctl bootout + bootstrap)
+15:30:14Z  All sessions closed. Exiting.
+[ 17m16s of bot downtime ]
+15:47:30Z  Bot version: 2d5fd7b              (manual recovery from a separate session)
+```
+
+Failing command sequence that caused the 17-minute outage:
+
+```bash
+launchctl bootout gui/$uid/ai.minime.telegram-bot
+sleep 2
+launchctl bootstrap gui/$uid ~/Library/LaunchAgents/ai.minime.telegram-bot.plist
+# → Bootstrap failed: 5: Input/output error
+```
+
+Root cause: `launchctl bootout` in the `gui` domain is asynchronous. The bootstrap call fired before teardown finished, got EIO, and service registration was gone so KeepAlive could not respawn.
+
+## Reference: launchctl list output
+
+Column format verified on macOS 14 (Darwin 23.6.0):
+
+```
+$ launchctl list | head -1
+PID     Status  Label
+$ launchctl list | awk '$3=="ai.minime.telegram-bot"'
+84395   0       ai.minime.telegram-bot
+```
+
+PID column shows `-` when the service is registered but the process is not currently running.
+
+## Reference: existing scripts in bot/scripts/
+
+```
+bot/scripts/deliver.sh
+bot/scripts/generate-plists.ts
+bot/scripts/run-cron.sh
+bot/scripts/start-bot.sh        # referenced from plist ProgramArguments
+```
+
+`deliver.sh` and `run-cron.sh` are the closest style references for a new shell script.
+
+## Reference: current rule (.claude/rules/platform/bot-operations.md)
+
+Full current content of `.claude/rules/platform/bot-operations.md` — the file Task 2 updates:
+
+```markdown
+# Bot Operations
+
+- **Bot restart requires explicit user confirmation.** No exceptions.
+- **Graceful restart:** `launchctl kill SIGTERM gui/$(id -u)/ai.minime.telegram-bot` — sends SIGTERM. Bot injects shutdown message into active sessions asking agents to wrap up, then waits up to 60s for turns to complete. Idle sessions close immediately. Launchd auto-restarts (KeepAlive=true).
+- **Wait for shutdown to complete.** After SIGTERM, the bot may take up to 60s to drain active sessions. Check logs for `All sessions closed. Exiting.` before concluding the restart failed. Running `launchctl list` during this window may show a stale exit code — that does NOT mean the restart failed.
+- **Never use `launchctl bootout` after SIGTERM.** If you `bootout` while the bot is still draining sessions, you remove the service definition from launchd — KeepAlive can no longer restart it. The bot dies with no way back except manual `launchctl load`.
+- **Never use `launchctl kickstart -k`** — it sends SIGKILL, bypasses graceful shutdown, kills active sessions mid-turn.
+- **If auto-restart doesn't happen** after clean exit (>90s, no new PID): `launchctl load ~/Library/LaunchAgents/ai.minime.telegram-bot.plist`. If that doesn't work — ask Ninja.
+- **Config changes (hot-reloaded, no restart needed):** `agents` fields (`model`, `fallbackModel`, `maxTurns`, `systemPrompt`, `effort`, `workspaceCwd`) and `sessionDefaults` (`idleTimeoutMs`, `maxConcurrentSessions`) are re-read from `config.yaml` / `config.local.yaml` on every new session spawn. Edit the file and the next new session picks it up. Already-running sessions keep their original config.
+- **Config changes (boot-level, restart required):** `telegramToken`, `discord.token`, `bindings`, `metricsPort`, `sessionDefaults.maxMessageAgeMs`, `sessionDefaults.requireMention`. Validate before restart: `npx tsx bot/src/config.ts --validate`
+- **Cron changes:** edit crons.yaml → regenerate plists → load → test → verify logs
+```
+
+The rule does not specify a procedure for applying plist-on-disk changes and warns against `bootout` without pairing that with a safe alternative.
+
+## Tasks
+
+### Task 1: Add a bot-restart script under `bot/scripts/` (#100, P1)
+
+**Problem.** Operators currently restart the bot by typing raw `launchctl` commands. Two distinct triggers exist:
+
+1. Code or `config.yaml`/`config.local.yaml` changes — launchd's cached plist is still correct, so a graceful SIGTERM restart is enough (KeepAlive relaunches).
+2. Changes to `ai.minime.telegram-bot.plist` itself (env vars, ProgramArguments, etc) — launchd caches the plist in memory at bootstrap time, so a plain SIGTERM restart relaunches from the stale cache and silently drops the edit. To pick up the new plist, the service must be fully unregistered and re-bootstrapped from disk.
+
+Case 2 is where operators get hurt: the naive `bootout` + `bootstrap` sequence has a race (bootout is asynchronous, bootstrap returns EIO if called too early). On 2026-04-18 this produced 17 minutes of bot downtime — see incident reference above.
+
+**What we want.** A single supported, scripted entry point that an operator (or automation) can invoke to restart the bot safely, covering both cases. After the restart, the bot is running and — in case 2 — actually reflects the on-disk plist.
+
+- [x] There is an executable script in `bot/scripts/` that operators invoke to restart the bot. Exact name and CLI are the implementer's choice, but the name and usage MUST be referenced verbatim from the updated rule in Task 2 (one contract, one source of truth).
+- [x] Default / zero-argument invocation performs a graceful restart suitable for case 1 (code/config changes only); it sends SIGTERM, waits for the old process to exit, and returns successfully once a new PID is running.
+- [x] There is a clearly named mode (flag, subcommand, or separate script — implementer's choice) that performs case 2: after the script finishes successfully, the running bot process reflects the current on-disk plist (verifiable via `sudo launchctl procinfo <pid>` showing any new `EnvironmentVariables` from the plist).
+- [x] The script tolerates variable session-drain time — including the full 60-second shutdown window — without racing launchd's teardown and without relying on a fixed `sleep`. Simulating a slow shutdown (e.g. a stub that delays exit) must not cause false-positive or false-negative exits.
+- [x] The script never sends SIGKILL and never invokes `launchctl kickstart -k` or equivalent; graceful shutdown is not bypassable.
+- [x] On success the script prints the new PID and exits 0; on any failure to shut down or to bring the service back up, it prints a clear diagnostic and exits non-zero.
+- [x] Invoking with `-h` / `--help` or an unknown argument prints usage and exits with an appropriate status.
+- [x] Style is consistent with other shell scripts in `bot/scripts/` (see `deliver.sh`, `run-cron.sh`).
+- [x] `shellcheck` produces no errors on the new script (warnings acceptable if justified inline).
+- [x] Before any restart action the script runs config validation (`npx tsx bot/src/config.ts --validate` or equivalent for the case being applied) and aborts with a non-zero exit and a clear diagnostic if validation fails — the bot is never restarted with a broken config
+- [x] Automated test harness covers: (a) graceful path returns the new PID and exits 0; (b) plist-change mode results in the on-disk plist being reflected after success; (c) a slow-shutdown stub that delays exit up to the full 60-second drain window does not cause the script to race launchd's teardown or exit early; (d) a deliberately broken config aborts the restart before any SIGTERM is sent
+- [x] Verify existing tests pass
+
+### Task 2: Recommend the script in `.claude/rules/platform/bot-operations.md` (#100, P1)
+
+**Problem.** The current rule tells operators to type raw `launchctl` commands, and its "Never use `launchctl bootout` after SIGTERM" warning — taken literally — leaves an operator with no correct way to apply a plist edit at all. That gap is what the Task 1 script closes.
+
+**What we want.** After Task 1 lands, the rule points operators to the script as the canonical restart path. Operators who edit `config.yaml` / `config.local.yaml` or the plist should be able to follow the rule without ever running raw `launchctl` commands.
+
+- [x] The rule recommends the Task 1 script (by its actual name and CLI) as the default way to restart the bot, and explicitly covers the plist-change case.
+- [x] Any previous guidance that contradicts or misleads about the plist-change flow — specifically the blanket "Never use `launchctl bootout` after SIGTERM" line — is updated or removed so operators following the rule end up with a working bot after a plist edit.
+- [x] The "Never use `launchctl kickstart -k`" warning is preserved.
+- [x] The auto-restart fallback ("If auto-restart doesn't happen…") still works: it points at the script first, and only falls back to manual `launchctl` commands if the script itself fails.
+- [x] All existing sections on hot-reload fields, boot-level fields, and cron changes are preserved unchanged.
+- [x] Following the updated rule end-to-end (edit plist → run script → check `sudo launchctl procinfo <pid>`) successfully picks up the new on-disk plist without any raw `launchctl` commands typed by the operator
+- [x] Verify existing tests pass
+
+
+### Task 3: Address PR #103 Copilot review findings
+
+PR https://github.com/fitz123/claude-code-bot/pull/103 — 4 unresolved Copilot comments. Fix each:
+
+#### Finding 1: `HOME` unbound in `restart-bot.sh` (line 23)
+
+`bot/scripts/restart-bot.sh` runs with `set -u`, but uses `${BOT_PLIST:-$HOME/Library/...}` in the default value without ensuring `HOME` is set. In a launchd/automation context where `HOME` is unset, the script exits immediately with "unbound variable".
+
+- [x] Set `HOME` defensively at the top of `restart-bot.sh` (same pattern as `start-bot.sh` / `run-cron.sh`), OR build the default plist path without depending on `$HOME`.
+- [x] Add a regression test: invoke the script with `env -u HOME bash bot/scripts/restart-bot.sh ...` and confirm it does not blow up on the variable expansion.
+
+#### Finding 2: `eval` injection in `validate_config` (`restart-bot.sh:124`)
+
+`validate_config` uses `eval "$CONFIG_VALIDATE_CMD"`. Even though the override is intended only for tests, `eval` is a shell-injection footgun and breaks on subtle quoting/word-splitting if `CONFIG_VALIDATE_CMD` is ever passed from the environment.
+
+- [x] Replace the eval-based override. Pick one approach:
+  - (a) Drop the env-var override entirely and add a dedicated `SKIP_CONFIG_VALIDATE=1` flag the test harness can set, OR
+  - (b) Represent the validator as `CONFIG_VALIDATE_BIN` + `CONFIG_VALIDATE_ARGS` arrays and execute without `eval`.
+- [x] Update the test that exercises this path so it still works after the rewrite.
+- [x] Confirm `npx tsc --noEmit` and `npm test` both pass.
+
+#### Finding 3: `shasum` external dep in `restart-bot.test.ts` (line 281)
+
+`bot/src/__tests__/restart-bot.test.ts` invokes the external `shasum` binary to compute the expected plist signature. Tests should be hermetic.
+
+- [x] Replace `shasum` with Node's `crypto.createHash('sha1')` for the expected-signature computation.
+- [x] Confirm the test still asserts the same behaviour and passes.
+
+#### Finding 4: `shasum` in mock `launchctl` (`restart-bot.test.ts:79`)
+
+The mock `launchctl` shells out to `shasum` to compute a plist signature. Same hermeticity issue as Finding 3, plus this version is INSIDE a fixture script.
+
+- [x] Replace the `shasum` shell-out in the mock with a Node-based hash. Either:
+  - (a) Inline a `node -e '...'` invocation that uses `crypto.createHash('sha1')` over the plist contents, OR
+  - (b) Store the plist contents directly in the test fixture state so no hashing is needed at all in the mock.
+- [x] Confirm the mock still exercises the same code path (signature comparison) in `restart-bot.sh` and the test passes.
+
+#### Wrap-up
+- [x] All 4 findings addressed (boxes ticked above)
+- [x] `npx tsc --noEmit` clean
+- [x] `npm test` clean (or only pre-existing unrelated failures)
+- [x] Commit message references PR #103 and lists the findings fixed
diff --git a/docs/plans/completed/2026-04-18-issue-100-safe-restart.md b/docs/plans/completed/2026-04-18-issue-100-safe-restart.md
new file mode 100644
index 0000000..6d3678a
--- /dev/null
+++ b/docs/plans/completed/2026-04-18-issue-100-safe-restart.md
@@ -0,0 +1,122 @@
+# Safe scripted bot restart + rule update — Round 1
+
+## Goal
+
+Give operators one supported, scripted entry point to restart the bot safely — covering both code/config restarts and on-disk-plist changes — and update the operational rule to point at it. Resolves fitz123/claude-code-bot#100.
+
+## Validation Commands
+
+```bash
+cd bot && npm test
+cd bot && npx tsc --noEmit
+shellcheck bot/scripts/*.sh || true
+```
+
+## Reference: incident log
+
+Actual timestamps from an operator machine's `~/.minime/logs/telegram-bot.stdout.log`:
+
+```
+15:27:42Z  Received SIGTERM                  (first restart: launchctl kill SIGTERM)
+15:28:43Z  All sessions closed. Exiting.
+15:28:44Z  Bot version: 2d5fd7b              (launchd KeepAlive auto-restart, CACHED plist — new env var NOT applied)
+15:30:08Z  Received SIGTERM                  (second restart: launchctl bootout + bootstrap)
+15:30:14Z  All sessions closed. Exiting.
+[ 17m16s of bot downtime ]
+15:47:30Z  Bot version: 2d5fd7b              (manual recovery from a separate session)
+```
+
+Failing command sequence that caused the 17-minute outage:
+
+```bash
+launchctl bootout gui/$uid/ai.minime.telegram-bot
+sleep 2
+launchctl bootstrap gui/$uid ~/Library/LaunchAgents/ai.minime.telegram-bot.plist
+# → Bootstrap failed: 5: Input/output error
+```
+
+Root cause: `launchctl bootout` in the `gui` domain is asynchronous. The bootstrap call fired before teardown finished, got EIO, and service registration was gone so KeepAlive could not respawn.
+
+## Reference: launchctl list output
+
+Column format verified on macOS 14 (Darwin 23.6.0):
+
+```
+$ launchctl list | head -1
+PID     Status  Label
+$ launchctl list | awk '$3=="ai.minime.telegram-bot"'
+84395   0       ai.minime.telegram-bot
+```
+
+PID column shows `-` when the service is registered but the process is not currently running.
+
+## Reference: existing scripts in bot/scripts/
+
+```
+bot/scripts/deliver.sh
+bot/scripts/generate-plists.ts
+bot/scripts/run-cron.sh
+bot/scripts/start-bot.sh        # referenced from plist ProgramArguments
+```
+
+`deliver.sh` and `run-cron.sh` are the closest style references for a new shell script.
+
+## Reference: current rule (.claude/rules/platform/bot-operations.md)
+
+Full current content of `.claude/rules/platform/bot-operations.md` — the file Task 2 updates:
+
+```markdown
+# Bot Operations
+
+- **Bot restart requires explicit user confirmation.** No exceptions.
+- **Graceful restart:** `launchctl kill SIGTERM gui/$(id -u)/ai.minime.telegram-bot` — sends SIGTERM. Bot injects shutdown message into active sessions asking agents to wrap up, then waits up to 60s for turns to complete. Idle sessions close immediately. Launchd auto-restarts (KeepAlive=true).
+- **Wait for shutdown to complete.** After SIGTERM, the bot may take up to 60s to drain active sessions. Check logs for `All sessions closed. Exiting.` before concluding the restart failed. Running `launchctl list` during this window may show a stale exit code — that does NOT mean the restart failed.
+- **Never use `launchctl bootout` after SIGTERM.** If you `bootout` while the bot is still draining sessions, you remove the service definition from launchd — KeepAlive can no longer restart it. The bot dies with no way back except manual `launchctl load`.
+- **Never use `launchctl kickstart -k`** — it sends SIGKILL, bypasses graceful shutdown, kills active sessions mid-turn.
+- **If auto-restart doesn't happen** after clean exit (>90s, no new PID): `launchctl load ~/Library/LaunchAgents/ai.minime.telegram-bot.plist`. If that doesn't work — ask Ninja.
+- **Config changes (hot-reloaded, no restart needed):** `agents` fields (`model`, `fallbackModel`, `maxTurns`, `systemPrompt`, `effort`, `workspaceCwd`) and `sessionDefaults` (`idleTimeoutMs`, `maxConcurrentSessions`) are re-read from `config.yaml` / `config.local.yaml` on every new session spawn. Edit the file and the next new session picks it up. Already-running sessions keep their original config.
+- **Config changes (boot-level, restart required):** `telegramToken`, `discord.token`, `bindings`, `metricsPort`, `sessionDefaults.maxMessageAgeMs`, `sessionDefaults.requireMention`. Validate before restart: `npx tsx bot/src/config.ts --validate`
+- **Cron changes:** edit crons.yaml → regenerate plists → load → test → verify logs
+```
+
+The rule does not specify a procedure for applying plist-on-disk changes and warns against `bootout` without pairing that with a safe alternative.
+
+## Tasks
+
+### Task 1: Add a bot-restart script under `bot/scripts/` (#100, P1)
+
+**Problem.** Operators currently restart the bot by typing raw `launchctl` commands. Two distinct triggers exist:
+
+1. Code or `config.yaml`/`config.local.yaml` changes — launchd's cached plist is still correct, so a graceful SIGTERM restart is enough (KeepAlive relaunches).
+2. Changes to `ai.minime.telegram-bot.plist` itself (env vars, ProgramArguments, etc) — launchd caches the plist in memory at bootstrap time, so a plain SIGTERM restart relaunches from the stale cache and silently drops the edit. To pick up the new plist, the service must be fully unregistered and re-bootstrapped from disk.
+
+Case 2 is where operators get hurt: the naive `bootout` + `bootstrap` sequence has a race (bootout is asynchronous, bootstrap returns EIO if called too early). On 2026-04-18 this produced 17 minutes of bot downtime — see incident reference above.
+
+**What we want.** A single supported, scripted entry point that an operator (or automation) can invoke to restart the bot safely, covering both cases. After the restart, the bot is running and — in case 2 — actually reflects the on-disk plist.
+
+- [x] There is an executable script in `bot/scripts/` that operators invoke to restart the bot. Exact name and CLI are the implementer's choice, but the name and usage MUST be referenced verbatim from the updated rule in Task 2 (one contract, one source of truth).
+- [x] Default / zero-argument invocation performs a graceful restart suitable for case 1 (code/config changes only); it sends SIGTERM, waits for the old process to exit, and returns successfully once a new PID is running.
+- [x] There is a clearly named mode (flag, subcommand, or separate script — implementer's choice) that performs case 2: after the script finishes successfully, the running bot process reflects the current on-disk plist (verifiable via `sudo launchctl procinfo <pid>` showing any new `EnvironmentVariables` from the plist).
+- [x] The script tolerates variable session-drain time — including the full 60-second shutdown window — without racing launchd's teardown and without relying on a fixed `sleep`. Simulating a slow shutdown (e.g. a stub that delays exit) must not cause false-positive or false-negative exits.
+- [x] The script never sends SIGKILL and never invokes `launchctl kickstart -k` or equivalent; graceful shutdown is not bypassable.
+- [x] On success the script prints the new PID and exits 0; on any failure to shut down or to bring the service back up, it prints a clear diagnostic and exits non-zero.
+- [x] Invoking with `-h` / `--help` or an unknown argument prints usage and exits with an appropriate status.
+- [x] Style is consistent with other shell scripts in `bot/scripts/` (see `deliver.sh`, `run-cron.sh`).
+- [x] `shellcheck` produces no errors on the new script (warnings acceptable if justified inline).
+- [x] Before any restart action the script runs config validation (`npx tsx bot/src/config.ts --validate` or equivalent for the case being applied) and aborts with a non-zero exit and a clear diagnostic if validation fails — the bot is never restarted with a broken config
+- [x] Automated test harness covers: (a) graceful path returns the new PID and exits 0; (b) plist-change mode results in the on-disk plist being reflected after success; (c) a slow-shutdown stub that delays exit up to the full 60-second drain window does not cause the script to race launchd's teardown or exit early; (d) a deliberately broken config aborts the restart before any SIGTERM is sent
+- [x] Verify existing tests pass
+
+### Task 2: Recommend the script in `.claude/rules/platform/bot-operations.md` (#100, P1)
+
+**Problem.** The current rule tells operators to type raw `launchctl` commands, and its "Never use `launchctl bootout` after SIGTERM" warning — taken literally — leaves an operator with no correct way to apply a plist edit at all. That gap is what the Task 1 script closes.
+
+**What we want.** After Task 1 lands, the rule points operators to the script as the canonical restart path. Operators who edit `config.yaml` / `config.local.yaml` or the plist should be able to follow the rule without ever running raw `launchctl` commands.
+
+- [x] The rule recommends the Task 1 script (by its actual name and CLI) as the default way to restart the bot, and explicitly covers the plist-change case.
+- [x] Any previous guidance that contradicts or misleads about the plist-change flow — specifically the blanket "Never use `launchctl bootout` after SIGTERM" line — is updated or removed so operators following the rule end up with a working bot after a plist edit.
+- [x] The "Never use `launchctl kickstart -k`" warning is preserved.
+- [x] The auto-restart fallback ("If auto-restart doesn't happen…") still works: it points at the script first, and only falls back to manual `launchctl` commands if the script itself fails.
+- [x] All existing sections on hot-reload fields, boot-level fields, and cron changes are preserved unchanged.
+- [x] manual test (skipped - not automatable): Following the updated rule end-to-end (edit plist → run script → check `sudo launchctl procinfo <pid>`) successfully picks up the new on-disk plist without any raw `launchctl` commands typed by the operator
+- [x] Verify existing tests pass