15 changes: 9 additions & 6 deletions .claude/rules/platform/bot-operations.md
@@ -1,11 +1,14 @@
# Bot Operations

- **Bot restart requires explicit user confirmation.** No exceptions.
- **Graceful restart:** `launchctl kill SIGTERM gui/$(id -u)/ai.minime.telegram-bot` — sends SIGTERM. Bot injects shutdown message into active sessions asking agents to wrap up, then waits up to 60s for turns to complete. Idle sessions close immediately. Launchd auto-restarts (KeepAlive=true).
- **Wait for shutdown to complete.** After SIGTERM, the bot may take up to 60s to drain active sessions. Check logs for `All sessions closed. Exiting.` before concluding the restart failed. Running `launchctl list` during this window may show a stale exit code — that does NOT mean the restart failed.
- **Never use `launchctl bootout` after SIGTERM.** If you `bootout` while the bot is still draining sessions, you remove the service definition from launchd — KeepAlive can no longer restart it. The bot dies with no way back except manual `launchctl load`.
- **Never use `launchctl kickstart -k`** — it sends SIGKILL, bypasses graceful shutdown, kills active sessions mid-turn.
- **If auto-restart doesn't happen** after clean exit (>90s, no new PID): `launchctl load ~/Library/LaunchAgents/ai.minime.telegram-bot.plist`. If that doesn't work — ask Ninja.
- **Canonical restart path:** use `bot/scripts/restart-bot.sh`. Do not type raw `launchctl` commands. The script validates config, sends SIGTERM, polls launchd teardown so bootout never races bootstrap, and returns the new PID on success.
- `bot/scripts/restart-bot.sh` — graceful SIGTERM restart. Use after code changes or edits to `config.yaml` / `config.local.yaml`. KeepAlive relaunches from the cached plist.
- `bot/scripts/restart-bot.sh --plist` — full unregister + re-bootstrap. Use after edits to `~/Library/LaunchAgents/ai.minime.telegram-bot.plist` (env vars, ProgramArguments, etc). Required because launchd caches the plist at bootstrap time; a plain SIGTERM restart picks up the stale cache and silently drops the edit.
- `bot/scripts/restart-bot.sh -h` — usage.
- **Shutdown takes up to 60s.** The bot injects a shutdown message into active sessions, waits for turns to complete, then exits. Idle sessions close immediately. The script polls until the old PID is gone and a new PID is running — do not conclude failure from `launchctl list` output mid-drain.
- **Never bypass the script with raw `launchctl bootout`.** Manual `bootout` in the `gui` domain is asynchronous; pairing it with an immediate `bootstrap` races launchd's teardown and can leave the service unregistered with no way for KeepAlive to respawn (see incident 2026-04-18, 17 min outage). The `--plist` mode handles this safely by polling teardown to completion before bootstrap.
- **Never use `launchctl kickstart -k`** — it sends SIGKILL, bypasses graceful shutdown, kills active sessions mid-turn. The script never does this and neither should operators.
- **If the script fails** or auto-restart doesn't happen (>90s, no new PID), rerun `bot/scripts/restart-bot.sh --plist`. If that still fails, fall back to `launchctl bootstrap gui/$(id -u) ~/Library/LaunchAgents/ai.minime.telegram-bot.plist`. If that doesn't work — ask Ninja.
- **Config changes (hot-reloaded, no restart needed):** `agents` fields (`model`, `fallbackModel`, `maxTurns`, `systemPrompt`, `effort`, `workspaceCwd`) and `sessionDefaults` (`idleTimeoutMs`, `maxConcurrentSessions`) are re-read from `config.yaml` / `config.local.yaml` on every new session spawn. Edit the file and the next new session picks it up. Already-running sessions keep their original config.
- **Config changes (boot-level, restart required):** `telegramToken`, `discord.token`, `bindings`, `metricsPort`, `sessionDefaults.maxMessageAgeMs`, `sessionDefaults.requireMention`. Validate before restart: `npx tsx bot/src/config.ts --validate`
- **Config changes (boot-level, restart required):** `telegramToken`, `discord.token`, `bindings`, `metricsPort`, `sessionDefaults.maxMessageAgeMs`, `sessionDefaults.requireMention`. Validate before restart: `npx tsx bot/src/config.ts --validate` (the script runs this automatically and aborts on failure).
- **Cron changes:** edit crons.yaml → regenerate plists → load → test → verify logs
18 changes: 9 additions & 9 deletions .claude/skills/bot-operations/SKILL.md
@@ -4,20 +4,20 @@ Reference for Telegram bot and cron system management.

## Bot Restart

1. Validate config: `cd ~/.minime/workspace/bot && npx tsx src/config.ts --validate`
2. Report result and reason to Ninja
3. Wait for explicit confirmation
4. Graceful restart: `launchctl kill SIGTERM gui/$(id -u)/ai.minime.telegram-bot`
5. Wait for drain: check logs for `All sessions closed. Exiting.` (up to 60s)
6. Verify: `launchctl list | grep ai.minime.telegram-bot` — new PID, exit 0 (note: stale exit code during drain window is normal, wait for step 5 first)
1. Report intent and reason to Ninja
2. Wait for explicit confirmation
3. Restart via the canonical script (it validates config, sends SIGTERM, polls launchd teardown, returns the new PID):
   - Code or `config.yaml` / `config.local.yaml` changes: `bot/scripts/restart-bot.sh`
   - Plist-on-disk changes (`~/Library/LaunchAgents/ai.minime.telegram-bot.plist`): `bot/scripts/restart-bot.sh --plist`
   - Usage: `bot/scripts/restart-bot.sh -h`

Bot injects shutdown message into active sessions, waits up to 60s for turns to complete, then launchd auto-restarts (KeepAlive=true).
Bot injects shutdown message into active sessions, waits up to 60s for turns to complete, then launchd auto-restarts (KeepAlive=true). The script polls until the old PID is gone and a new PID is running — do not conclude failure from `launchctl list` output mid-drain.

**Never use:**
- `launchctl kickstart -k` — sends SIGKILL, kills sessions mid-turn
- `launchctl bootout` after SIGTERM — removes service definition, prevents auto-restart
- Raw `launchctl bootout` paired with immediate `bootstrap` — async teardown races bootstrap and can leave the service unregistered (2026-04-18 incident, 17 min outage). Use `--plist` mode instead.

**If auto-restart fails** (>90s, no new PID): `launchctl load ~/Library/LaunchAgents/ai.minime.telegram-bot.plist`
**If the script fails** or auto-restart doesn't happen (>90s, no new PID), rerun `bot/scripts/restart-bot.sh --plist`. If that still fails, fall back to `launchctl bootstrap gui/$(id -u) ~/Library/LaunchAgents/ai.minime.telegram-bot.plist`. If that doesn't work — ask Ninja.

## Config Changes

14 changes: 8 additions & 6 deletions README.md
@@ -175,8 +175,11 @@ The bot runs as a launchd service: `ai.minime.telegram-bot`.
# Check status
launchctl print gui/$(id -u)/ai.minime.telegram-bot 2>&1 | head -5

# Restart (graceful — waits for active sessions to finish)
launchctl kill SIGTERM gui/$(id -u)/ai.minime.telegram-bot
# Restart (graceful — validates config, sends SIGTERM, waits for drain, returns new PID)
bot/scripts/restart-bot.sh

# Restart after editing ~/Library/LaunchAgents/ai.minime.telegram-bot.plist
bot/scripts/restart-bot.sh --plist

# Stop
launchctl bootout gui/$(id -u)/ai.minime.telegram-bot
@@ -185,7 +188,7 @@ launchctl bootout gui/$(id -u)/ai.minime.telegram-bot
launchctl bootstrap gui/$(id -u) ~/Library/LaunchAgents/ai.minime.telegram-bot.plist
```

**Warning:** Graceful restart sends SIGTERM — the bot injects a shutdown message into active sessions and waits up to 60s for turns to complete before exiting. Idle sessions close immediately. launchd auto-restarts via KeepAlive. Still, active work is interrupted — always confirm before restarting.
**Warning:** Graceful restart sends SIGTERM — the bot injects a shutdown message into active sessions and waits up to 60s for turns to complete before exiting. Idle sessions close immediately. launchd auto-restarts via KeepAlive. Still, active work is interrupted — always confirm before restarting. Use `--plist` after editing the plist on disk, because launchd caches the plist at bootstrap time and a plain SIGTERM restart would pick up the stale cache.

## Add a Cron

@@ -247,10 +250,9 @@ To remove: `launchctl bootout gui/$(id -u)/ai.minime.cron.<name>`, delete from `

See [config.yaml](config.yaml) for all binding options including `requireMention`, `voiceTranscriptEcho`, `typingIndicator`, and per-topic overrides for forum supergroups.

2. Validate and restart:
2. Restart (the script validates config before sending SIGTERM):
```bash
cd ~/.minime/bot && npx tsx src/config.ts --validate
launchctl kill SIGTERM gui/$(id -u)/ai.minime.telegram-bot
bot/scripts/restart-bot.sh
```

## Add a Discord Binding
271 changes: 271 additions & 0 deletions bot/scripts/restart-bot.sh
@@ -0,0 +1,271 @@
#!/bin/bash
# restart-bot.sh — Safely restart the Telegram bot launchd service
# Usage:
#   restart-bot.sh              Graceful SIGTERM restart (code / config.yaml changes)
#   restart-bot.sh --plist      Full unregister + re-bootstrap (plist-on-disk changes)
#   restart-bot.sh -h|--help    Show this help
#
# Never sends SIGKILL. Validates config before restarting. Polls launchd
# teardown so bootout is not raced against bootstrap.

set -euo pipefail

if [ -z "${HOME:-}" ]; then
  if command -v dscl >/dev/null 2>&1; then
    HOME="$(dscl . -read "/Users/$(id -un)" NFSHomeDirectory 2>/dev/null | awk '{print $2}')"
  fi
fi
if [ -z "${HOME:-}" ]; then
  if command -v getent >/dev/null 2>&1; then
    HOME="$(getent passwd "$(id -un)" 2>/dev/null | cut -d: -f6)"
  fi
fi
export HOME
if [ -z "$HOME" ]; then
  echo "[restart-bot] Error: could not determine HOME from environment or fallback lookups" >&2
  exit 1
fi
Comment on lines +13 to +27
Copilot AI Apr 19, 2026

The script exits early if it cannot determine `HOME`, but this happens before argument parsing. That means `restart-bot.sh --help` (and unknown-arg usage) can fail in restricted environments even though help output shouldn’t depend on `HOME`. Consider parsing `--help`/invalid args before the `HOME` fallback/guard (or only requiring `HOME` when actually needed for default plist path resolution).
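
One possible shape for this reordering, as a minimal sketch rather than the PR's actual fix: arguments are parsed first, and `HOME` is only required once a restart mode has been selected. The names `parse_mode`, `require_home`, and `main` are hypothetical, and the usage text is shortened for the example.

```bash
#!/bin/bash
# Sketch only: parse_mode, require_home and main are hypothetical names,
# not functions from restart-bot.sh.

usage() {
  echo "Usage: restart-bot.sh [--plist] [-h|--help]"
}

# Prints the selected mode ("help", "graceful" or "plist") on stdout.
# Returns 2 on an unknown argument, mirroring the script's exit code.
parse_mode() {
  local mode="graceful"
  while [ $# -gt 0 ]; do
    case "$1" in
      -h|--help) echo "help"; return 0 ;;
      --plist)   mode="plist"; shift ;;
      *)         echo "unknown argument: $1" >&2; return 2 ;;
    esac
  done
  echo "$mode"
}

# Fails only when HOME is actually needed, i.e. after help was handled.
require_home() {
  [ -n "${HOME:-}" ] || { echo "could not determine HOME" >&2; return 1; }
}

main() {
  local mode
  mode=$(parse_mode "$@") || return $?
  if [ "$mode" = "help" ]; then
    usage
    return 0
  fi
  require_home || return 1
  echo "mode=$mode"
}
```

With this ordering, `main --help` prints the usage line and returns 0 even when `HOME` is empty or unset, while `main` and `main --plist` still fail early without a usable `HOME`.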

export PATH="/opt/homebrew/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:${PATH:-}"

SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
BOT_DIR="$(cd "$SCRIPT_DIR/.." && pwd)"

LAUNCHCTL_BIN="${LAUNCHCTL_BIN:-/bin/launchctl}"
PLUTIL_BIN="${PLUTIL_BIN:-/usr/bin/plutil}"
BOT_LABEL="${BOT_LABEL:-ai.minime.telegram-bot}"
BOT_PLIST="${BOT_PLIST:-$HOME/Library/LaunchAgents/${BOT_LABEL}.plist}"
BOT_UID="${BOT_UID:-$(id -u)}"
DOMAIN="gui/${BOT_UID}"
SERVICE="${DOMAIN}/${BOT_LABEL}"

# Test-only: override the validator with a single executable (no args, no eval).
# Tests set this to `true` / `false` to simulate validation pass / fail paths.
CONFIG_VALIDATE_BIN="${CONFIG_VALIDATE_BIN:-}"

# Timeouts (seconds). Drain window is 60s — give headroom.
SHUTDOWN_TIMEOUT="${SHUTDOWN_TIMEOUT:-90}"
TEARDOWN_TIMEOUT="${TEARDOWN_TIMEOUT:-90}"
STARTUP_TIMEOUT="${STARTUP_TIMEOUT:-60}"
POLL_INTERVAL="${POLL_INTERVAL:-1}"

usage() {
  cat <<EOF
Usage:
  restart-bot.sh              Graceful SIGTERM restart (code / config.yaml changes)
  restart-bot.sh --plist      Full unregister + re-bootstrap (plist-on-disk changes)
  restart-bot.sh -h|--help    Show this help

On success: prints new PID and exits 0.
On failure: prints a diagnostic and exits non-zero.
EOF
}

log() { echo "[restart-bot] $*"; }
err() { echo "[restart-bot] Error: $*" >&2; }

MODE="graceful"

while [ $# -gt 0 ]; do
  case "$1" in
    -h|--help) usage; exit 0 ;;
    --plist)   MODE="plist"; shift ;;
    *)         err "unknown argument: $1"; usage >&2; exit 2 ;;
  esac
done

# get_pid prints one of:
#   <numeric pid> — service is registered and running
#   ""            — service is registered but has no running process (PID = "-")
# exit status:
#   0 — registered (pid may be empty)
#   1 — not registered (launchctl query succeeded, label absent)
#   2 — launchctl query itself failed (unknown state)
get_pid() {
  local out
  if ! out=$("$LAUNCHCTL_BIN" list 2>/dev/null); then
    return 2
  fi
  local line
  line=$(printf '%s\n' "$out" | awk -v L="$BOT_LABEL" '$3==L { print; exit }')
  if [ -z "$line" ]; then
    return 1
  fi
  local pid
  pid=$(printf '%s\n' "$line" | awk '{print $1}')
  if [ "$pid" = "-" ]; then
    echo ""
  else
    echo "$pid"
  fi
  return 0
}

# True only when launchctl query succeeded AND the service is registered.
# A transient query failure is NOT treated as "registered" or "not registered".
is_registered() {
  local rc=0
  get_pid >/dev/null 2>&1 || rc=$?
  [ "$rc" -eq 0 ]
}

wait_until() {
  # wait_until <timeout_seconds> <predicate_fn>
  local timeout="$1"; local pred="$2"
  local deadline
  deadline=$(( $(date +%s) + timeout ))
  while [ "$(date +%s)" -lt "$deadline" ]; do
    if "$pred"; then
      return 0
    fi
    sleep "$POLL_INTERVAL"
  done
  return 1
}

_old_pid=""
_pred_old_pid_gone() {
  local cur rc=0
  cur=$(get_pid 2>/dev/null) || rc=$?
  case "$rc" in
    0) [ "$cur" != "$_old_pid" ] ;;
    1) return 0 ;;  # explicitly not registered → old pid gone
    *) return 1 ;;  # query failed → unknown, keep polling
  esac
}

# Distinguishes "confirmed not registered" from "query failed", so a transient
# launchctl error can't trick us into bootstrapping over a still-registered svc.
_pred_unregistered() {
  local rc=0
  get_pid >/dev/null 2>&1 || rc=$?
  [ "$rc" -eq 1 ]
}

# Requires a successful query AND a non-empty PID that differs from the old PID,
# so a stale `launchctl list` response can't be mistaken for the new process.
_pred_running_pid() {
  local pid rc=0
  pid=$(get_pid 2>/dev/null) || rc=$?
  [ "$rc" -eq 0 ] && [ -n "$pid" ] && [ "$pid" != "$_old_pid" ]
}

validate_plist() {
  log "Validating plist at $BOT_PLIST…"
  if ! "$PLUTIL_BIN" -lint "$BOT_PLIST" >/dev/null 2>&1; then
    err "plist is malformed: $BOT_PLIST"
    err "run: $PLUTIL_BIN -lint \"$BOT_PLIST\" for details"
    return 1
  fi
  local plist_label
  if ! plist_label=$("$PLUTIL_BIN" -extract Label raw "$BOT_PLIST" 2>/dev/null); then
    err "plist is missing 'Label' key: $BOT_PLIST"
    return 1
  fi
  if [ "$plist_label" != "$BOT_LABEL" ]; then
    err "plist Label '$plist_label' does not match expected '$BOT_LABEL'"
    return 1
Comment on lines +159 to +166
Copilot AI Apr 19, 2026

`plutil -extract` is invoked without specifying an output target. On macOS, `plutil -extract <key> raw` expects `-o -` (stdout) or an output file; otherwise it fails and will make `--plist` mode abort as if the `Label` key were missing. Update the command to include `-o -` (and adjust the mock `plutil` + tests accordingly).
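
A minimal sketch of the label check with an explicit stdout target, assuming a `plutil` that supports the `raw` format. `read_plist_label` and `check_plist_label` are hypothetical helper names; `PLUTIL_BIN` mirrors the override the script already exposes, so a mock can stand in for the real binary.

```bash
#!/bin/bash
# Sketch only: read_plist_label and check_plist_label are hypothetical helpers.
# PLUTIL_BIN follows the script's existing override convention.
PLUTIL_BIN="${PLUTIL_BIN:-/usr/bin/plutil}"

# Prints the plist's Label value; "-o -" sends the extracted value to stdout.
read_plist_label() {
  local plist="$1"
  "$PLUTIL_BIN" -extract Label raw -o - "$plist" 2>/dev/null
}

# Returns 0 when the plist's Label equals the expected label, 1 otherwise.
check_plist_label() {
  local plist="$1" expected="$2" actual
  if ! actual=$(read_plist_label "$plist"); then
    echo "plist is missing 'Label' key: $plist" >&2
    return 1
  fi
  if [ "$actual" != "$expected" ]; then
    echo "plist Label '$actual' does not match expected '$expected'" >&2
    return 1
  fi
}
```

Passing `-o -` makes the output target explicit, so the check does not depend on what `plutil` does when no target is given.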

  fi
}

validate_config() {
  log "Validating config before restart…"
  if [ -n "$CONFIG_VALIDATE_BIN" ]; then
    if ! ( cd "$BOT_DIR" && "$CONFIG_VALIDATE_BIN" >/dev/null 2>&1 ); then
      err "config validation failed; refusing to restart"
      return 1
    fi
    return 0
  fi
  if ! ( cd "$BOT_DIR" && npx tsx src/config.ts --validate >/dev/null ); then
    err "config validation failed; refusing to restart"
    return 1
  fi
}

graceful_restart() {
  local old_pid
  if ! old_pid=$(get_pid); then
    err "service $BOT_LABEL is not registered with launchd; run: restart-bot.sh --plist"
Comment on lines +187 to +188
Copilot AI Apr 19, 2026

`get_pid` documents exit status 2 as “launchctl query failed (unknown state)”, but `graceful_restart` treats any non-zero from `get_pid` as “service not registered”. This can misdiagnose transient/permission errors from `launchctl list`. Handle rc=2 explicitly (e.g., emit a distinct error like “launchctl list failed” and abort) so operators aren’t pointed at `--plist` when the real issue is an unreadable launchctl state.

Suggested change
if ! old_pid=$(get_pid); then
err "service $BOT_LABEL is not registered with launchd; run: restart-bot.sh --plist"
local get_pid_rc=0
if old_pid=$(get_pid); then
:
else
get_pid_rc=$?
case "$get_pid_rc" in
1)
err "service $BOT_LABEL is not registered with launchd; run: restart-bot.sh --plist"
;;
2)
err "launchctl list failed while querying service $BOT_LABEL; unable to determine launchd state"
;;
*)
err "failed to query PID for service $BOT_LABEL (get_pid exit status: $get_pid_rc)"
;;
esac

    return 1
  fi

  if [ -z "$old_pid" ]; then
    err "service $BOT_LABEL is registered but has no running process (PID=-); run: restart-bot.sh --plist"
    return 1
  fi

  validate_config || return 1

  log "Sending SIGTERM to $SERVICE (old PID: $old_pid)"
  if ! "$LAUNCHCTL_BIN" kill SIGTERM "$SERVICE"; then
    err "launchctl kill SIGTERM failed"
    return 1
  fi

  _old_pid="$old_pid"
  log "Waiting up to ${SHUTDOWN_TIMEOUT}s for old process $old_pid to exit…"
  if ! wait_until "$SHUTDOWN_TIMEOUT" _pred_old_pid_gone; then
    err "old process $old_pid did not exit within ${SHUTDOWN_TIMEOUT}s"
    return 1
  fi

  log "Waiting up to ${STARTUP_TIMEOUT}s for KeepAlive to spawn a new PID…"
  if ! wait_until "$STARTUP_TIMEOUT" _pred_running_pid; then
    err "no new PID observed within ${STARTUP_TIMEOUT}s; KeepAlive did not restart"
    return 1
  fi

  local new_pid
  new_pid=$(get_pid 2>/dev/null || true)
  log "Restart complete. New PID: ${new_pid:-unknown}"
  echo "$new_pid"
}

plist_restart() {
  if [ ! -f "$BOT_PLIST" ]; then
    err "plist not found: $BOT_PLIST"
    return 1
  fi

  validate_plist || return 1
  validate_config || return 1

  if is_registered; then
    log "Unregistering $SERVICE (launchctl bootout)…"
    # bootout may return non-zero even when the teardown is in progress;
    # we rely on polling below, not the exit code.
    "$LAUNCHCTL_BIN" bootout "$SERVICE" >/dev/null 2>&1 || true
Comment on lines +233 to +237
Copilot AI Apr 19, 2026

`is_registered` treats `get_pid` rc=2 (launchctl query failure/unknown state) the same as “not registered”. In `--plist` mode that can cause the script to skip `bootout` and go straight to `bootstrap` against a potentially still-registered service (often EIO). Consider detecting rc=2 explicitly and aborting with a clear “launchctl list failed” diagnostic instead of proceeding.
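
A rough sketch of the three-way handling this comment asks for, not the change that was merged. `registration_state` and `plan_plist_restart` are hypothetical names, and `get_pid` is replaced by a stand-in driven by a `MOCK_RC` variable so the example runs without launchd.

```bash
#!/bin/bash
# Sketch only: registration_state and plan_plist_restart are hypothetical.
# get_pid is a stand-in that returns MOCK_RC (0, 1 or 2), matching the exit
# codes documented for the real get_pid.
get_pid() { return "${MOCK_RC:-0}"; }

# Prints "registered", "unregistered" or "unknown".
registration_state() {
  local rc=0
  get_pid >/dev/null 2>&1 || rc=$?
  case "$rc" in
    0) echo "registered" ;;
    1) echo "unregistered" ;;
    *) echo "unknown" ;;
  esac
}

# Describes the next step; aborts when the launchd state cannot be read.
plan_plist_restart() {
  case "$(registration_state)" in
    registered)   echo "bootout, wait for teardown, then bootstrap" ;;
    unregistered) echo "skip bootout, bootstrap" ;;
    *)            echo "launchctl list failed; aborting" >&2; return 1 ;;
  esac
}
```

Keeping "unknown" as its own state means a failed `launchctl list` can never fall into the skip-bootout branch.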


    log "Waiting up to ${TEARDOWN_TIMEOUT}s for teardown to complete…"
    if ! wait_until "$TEARDOWN_TIMEOUT" _pred_unregistered; then
      err "service did not unregister within ${TEARDOWN_TIMEOUT}s; refusing to bootstrap"
      err "bootout is still draining sessions — rerun once 'launchctl list' no longer shows $BOT_LABEL"
      return 1
    fi
  else
    log "Service not currently registered; skipping bootout."
  fi

  log "Bootstrapping from $BOT_PLIST…"
  if ! "$LAUNCHCTL_BIN" bootstrap "$DOMAIN" "$BOT_PLIST"; then
    err "launchctl bootstrap failed"
    return 1
  fi

  log "Waiting up to ${STARTUP_TIMEOUT}s for a running PID…"
  if ! wait_until "$STARTUP_TIMEOUT" _pred_running_pid; then
    err "service registered but no running PID within ${STARTUP_TIMEOUT}s"
    return 1
  fi

  local new_pid
  new_pid=$(get_pid 2>/dev/null || true)
  log "Restart complete. New PID: ${new_pid:-unknown}"
  echo "$new_pid"
}

case "$MODE" in
  graceful) graceful_restart ;;
  plist)    plist_restart ;;
  *)        err "internal: unhandled mode $MODE"; exit 1 ;;
esac
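
The `LAUNCHCTL_BIN` override makes the `get_pid` parsing testable without launchd. The PR's own tests are not shown on this page, so the following is an independent sketch: the fake `launchctl`, its sample rows, and the `FAKE_PID` variable are assumptions made for illustration, and `get_pid` is a condensed copy of the script's function.

```bash
#!/bin/bash
# Sketch only: the fake launchctl and FAKE_PID are assumptions for this example.
BOT_LABEL="ai.minime.telegram-bot"
mock_dir=$(mktemp -d)

cat > "$mock_dir/launchctl" <<'EOF'
#!/bin/bash
# Fake launchctl: prints PID, last exit status and label, like `launchctl list`.
# FAKE_PID (exported by the caller) sets the bot's PID column; "-" means not running.
[ "$1" = "list" ] || exit 1
printf 'PID\tStatus\tLabel\n'
printf '%s\t0\tai.minime.telegram-bot\n' "${FAKE_PID:--}"
printf '321\t0\tcom.example.other\n'
EOF
chmod +x "$mock_dir/launchctl"
LAUNCHCTL_BIN="$mock_dir/launchctl"

# Condensed copy of the script's get_pid:
# exit 0 = registered (PID may be empty), 1 = label absent, 2 = query failed.
get_pid() {
  local out line pid
  out=$("$LAUNCHCTL_BIN" list 2>/dev/null) || return 2
  line=$(printf '%s\n' "$out" | awk -v L="$BOT_LABEL" '$3==L { print; exit }')
  [ -n "$line" ] || return 1
  pid=$(printf '%s\n' "$line" | awk '{print $1}')
  [ "$pid" = "-" ] && pid=""
  echo "$pid"
}
```

Exporting `FAKE_PID` before calling `get_pid` simulates a running service, leaving it unset simulates a registered but stopped one, and pointing `LAUNCHCTL_BIN` at a missing path exercises the exit status 2 branch.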