Conversation
20c5632 to
7064772
Compare
Single supported restart entry point for the launchd-managed bot. - bot/scripts/restart-bot.sh - Default mode: graceful SIGTERM, waits up to 60s drain, returns new PID - --plist mode: race-free bootout+bootstrap that reflects on-disk plist - Pre-restart config validation aborts without sending SIGTERM - Never SIGKILL, never kickstart -k - Tolerates variable drain time without fixed sleep - HOME fallback guard for restricted environments - bot/src/__tests__/restart-bot.test.ts: 13 tests covering both modes, slow-shutdown drain (5s), broken-config abort, malformed plist - .claude/rules/platform/bot-operations.md + bot-operations skill: operators use restart-bot.sh instead of raw launchctl - README.md: operator docs point at the script Closes #100 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
7064772 to
40a1533
Compare
There was a problem hiding this comment.
Pull request overview
Adds a single supported restart-bot.sh entry point for safe launchd restarts (graceful by default, race-free plist reload via --plist), wires operator documentation to use it, and adds automated tests for the restart behavior.
Changes:
- Added
bot/scripts/restart-bot.shto perform graceful SIGTERM restarts and safebootout+bootstraprestarts for plist-on-disk changes, with config validation and polling (no fixed sleeps). - Added
bot/src/__tests__/restart-bot.test.tsto cover graceful/--plistpaths, slow shutdown, config validation aborts, and malformed plist handling. - Updated operator docs (
.claude/*,README.md) to direct restarts through the script.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
bot/scripts/restart-bot.sh |
New canonical restart script implementing graceful + plist restart flows with validation and polling. |
bot/src/__tests__/restart-bot.test.ts |
New test harness using mocked launchctl/plutil to validate script behavior across scenarios. |
.claude/rules/platform/bot-operations.md |
Updates operator rule to use the script and explains why --plist is required for plist edits. |
.claude/skills/bot-operations/SKILL.md |
Updates operational skill instructions to use the script and preserves safety warnings. |
README.md |
Updates restart documentation to use bot/scripts/restart-bot.sh and --plist. |
docs/plans/2026-04-18-issue-100-safe-restart.md |
Planning/trace doc for the change and addressed prior review items. |
docs/plans/completed/2026-04-18-issue-100-safe-restart.md |
Completed plan record for the same work. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| local plist_label | ||
| if ! plist_label=$("$PLUTIL_BIN" -extract Label raw "$BOT_PLIST" 2>/dev/null); then | ||
| err "plist is missing 'Label' key: $BOT_PLIST" | ||
| return 1 | ||
| fi | ||
| if [ "$plist_label" != "$BOT_LABEL" ]; then | ||
| err "plist Label '$plist_label' does not match expected '$BOT_LABEL'" | ||
| return 1 |
There was a problem hiding this comment.
plutil -extract is invoked without specifying an output target. On macOS, plutil -extract <key> raw expects -o - (stdout) or an output file; otherwise it fails and will make --plist mode abort as if the Label key were missing. Update the command to include -o - (and adjust the mock plutil + tests accordingly).
| file="\${4:-}" | ||
| if [ "\$key" != "Label" ] || [ "\$fmt" != "raw" ] || [ -z "\$file" ] || [ ! -f "\$file" ]; then |
There was a problem hiding this comment.
The mock plutil fixture only supports -extract Label raw <file> and will not match the real plutil -extract CLI once the script is corrected to pass -o -. Update this mock to accept (and ideally assert) the real argument pattern so the tests remain representative of production behavior.
| file="\${4:-}" | |
| if [ "\$key" != "Label" ] || [ "\$fmt" != "raw" ] || [ -z "\$file" ] || [ ! -f "\$file" ]; then | |
| out_flag="\${4:-}" | |
| out_target="\${5:-}" | |
| file="\${6:-}" | |
| if [ "\$key" != "Label" ] || [ "\$fmt" != "raw" ] || [ "\$out_flag" != "-o" ] || [ "\$out_target" != "-" ] || [ -z "\$file" ] || [ ! -f "\$file" ]; then |
| if [ -z "${HOME:-}" ]; then | ||
| if command -v dscl >/dev/null 2>&1; then | ||
| HOME="$(dscl . -read "/Users/$(id -un)" NFSHomeDirectory 2>/dev/null | awk '{print $2}')" | ||
| fi | ||
| fi | ||
| if [ -z "${HOME:-}" ]; then | ||
| if command -v getent >/dev/null 2>&1; then | ||
| HOME="$(getent passwd "$(id -un)" 2>/dev/null | cut -d: -f6)" | ||
| fi | ||
| fi | ||
| export HOME | ||
| if [ -z "$HOME" ]; then | ||
| echo "[restart-bot] Error: could not determine HOME from environment or fallback lookups" >&2 | ||
| exit 1 | ||
| fi |
There was a problem hiding this comment.
The script exits early if it cannot determine HOME, but this happens before argument parsing. That means restart-bot.sh --help (and unknown-arg usage) can fail in restricted environments even though help output shouldn’t depend on HOME. Consider parsing --help/invalid args before the HOME fallback/guard (or only requiring HOME when actually needed for default plist path resolution).
| if ! old_pid=$(get_pid); then | ||
| err "service $BOT_LABEL is not registered with launchd; run: restart-bot.sh --plist" |
There was a problem hiding this comment.
get_pid documents exit status 2 as “launchctl query failed (unknown state)”, but graceful_restart treats any non-zero from get_pid as “service not registered”. This can misdiagnose transient/permission errors from launchctl list. Handle rc=2 explicitly (e.g., emit a distinct error like “launchctl list failed” and abort) so operators aren’t pointed at --plist when the real issue is an unreadable launchctl state.
| if ! old_pid=$(get_pid); then | |
| err "service $BOT_LABEL is not registered with launchd; run: restart-bot.sh --plist" | |
| local get_pid_rc=0 | |
| if old_pid=$(get_pid); then | |
| : | |
| else | |
| get_pid_rc=$? | |
| case "$get_pid_rc" in | |
| 1) | |
| err "service $BOT_LABEL is not registered with launchd; run: restart-bot.sh --plist" | |
| ;; | |
| 2) | |
| err "launchctl list failed while querying service $BOT_LABEL; unable to determine launchd state" | |
| ;; | |
| *) | |
| err "failed to query PID for service $BOT_LABEL (get_pid exit status: $get_pid_rc)" | |
| ;; | |
| esac |
| if is_registered; then | ||
| log "Unregistering $SERVICE (launchctl bootout)…" | ||
| # bootout may return non-zero even when the teardown is in progress; | ||
| # we rely on polling below, not the exit code. | ||
| "$LAUNCHCTL_BIN" bootout "$SERVICE" >/dev/null 2>&1 || true |
There was a problem hiding this comment.
is_registered treats get_pid rc=2 (launchctl query failure/unknown state) the same as “not registered”. In --plist mode that can cause the script to skip bootout and go straight to bootstrap against a potentially still-registered service (often EIO). Consider detecting rc=2 explicitly and aborting with a clear “launchctl list failed” diagnostic instead of proceeding.
Summary
bot/scripts/restart-bot.sh— single supported restart entry point--plistmode: race-free bootout+bootstrap that reflects on-disk plist after successkickstart -ksleepbot/src/__tests__/restart-bot.test.ts— 13 tests covering graceful path, plist mode, slow-shutdown stub (5s drain), broken-config abort, malformed plist.claude/rules/platform/bot-operations.mdand.claude/skills/bot-operations/SKILL.md— operators use the scriptREADME.md— operator docs point at the scriptRebased clean onto main after PR #104 merged. Supersedes closed PR #103. Contains only the restart-bot specific changes (media-store changes from the original 103 branch were superseded by PR #104's in-flight protection).
Closes #100
Test plan
npx tsx --test bot/src/__tests__/restart-bot.test.ts(13/13 pass)cd bot && npx tsc --noEmitshellcheck bot/scripts/restart-bot.shclean