Skip to content

feat: safe scripted bot restart (#100)#105

Merged
fitz123 merged 1 commit intomainfrom
issue-100-safe-restart
Apr 19, 2026
Merged

feat: safe scripted bot restart (#100)#105
fitz123 merged 1 commit intomainfrom
issue-100-safe-restart

Conversation

@fitz123
Copy link
Copy Markdown
Owner

@fitz123 fitz123 commented Apr 19, 2026

Summary

  • New bot/scripts/restart-bot.sh — single supported restart entry point
    • Default mode: graceful SIGTERM, waits up to 60s drain, returns new PID
    • --plist mode: race-free bootout+bootstrap that reflects on-disk plist after success
    • Pre-restart config validation aborts without sending SIGTERM
    • Never SIGKILL, never kickstart -k
    • Tolerates variable drain time without fixed sleep
    • HOME fallback guard for restricted environments
  • New bot/src/__tests__/restart-bot.test.ts — 13 tests covering graceful path, plist mode, slow-shutdown stub (5s drain), broken-config abort, malformed plist
  • .claude/rules/platform/bot-operations.md and .claude/skills/bot-operations/SKILL.md — operators use the script
  • README.md — operator docs point at the script

Rebased clean onto main after PR #104 merged. Supersedes closed PR #103. Contains only the restart-bot specific changes (media-store changes from the original 103 branch were superseded by PR #104's in-flight protection).

Closes #100

Test plan

  • npx tsx --test bot/src/__tests__/restart-bot.test.ts (13/13 pass)
  • cd bot && npx tsc --noEmit
  • shellcheck bot/scripts/restart-bot.sh clean
  • codex + claude + copilot reviews: addressed

Copilot AI review requested due to automatic review settings April 19, 2026 12:28
@fitz123 fitz123 force-pushed the issue-100-safe-restart branch from 20c5632 to 7064772 Compare April 19, 2026 12:30
Single supported restart entry point for the launchd-managed bot.

- bot/scripts/restart-bot.sh
  - Default mode: graceful SIGTERM, waits up to 60s drain, returns new PID
  - --plist mode: race-free bootout+bootstrap that reflects on-disk plist
  - Pre-restart config validation aborts without sending SIGTERM
  - Never SIGKILL, never kickstart -k
  - Tolerates variable drain time without fixed sleep
  - HOME fallback guard for restricted environments
- bot/src/__tests__/restart-bot.test.ts: 13 tests covering both modes,
  slow-shutdown drain (5s), broken-config abort, malformed plist
- .claude/rules/platform/bot-operations.md + bot-operations skill:
  operators use restart-bot.sh instead of raw launchctl
- README.md: operator docs point at the script

Closes #100

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@fitz123 fitz123 force-pushed the issue-100-safe-restart branch from 7064772 to 40a1533 Compare April 19, 2026 12:30
@fitz123 fitz123 merged commit 32e386f into main Apr 19, 2026
1 check passed
@fitz123 fitz123 deleted the issue-100-safe-restart branch April 19, 2026 12:31
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a single supported restart-bot.sh entry point for safe launchd restarts (graceful by default, race-free plist reload via --plist), wires operator documentation to use it, and adds automated tests for the restart behavior.

Changes:

  • Added bot/scripts/restart-bot.sh to perform graceful SIGTERM restarts and safe bootout+bootstrap restarts for plist-on-disk changes, with config validation and polling (no fixed sleeps).
  • Added bot/src/__tests__/restart-bot.test.ts to cover graceful/--plist paths, slow shutdown, config validation aborts, and malformed plist handling.
  • Updated operator docs (.claude/*, README.md) to direct restarts through the script.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
bot/scripts/restart-bot.sh New canonical restart script implementing graceful + plist restart flows with validation and polling.
bot/src/__tests__/restart-bot.test.ts New test harness using mocked launchctl/plutil to validate script behavior across scenarios.
.claude/rules/platform/bot-operations.md Updates operator rule to use the script and explains why --plist is required for plist edits.
.claude/skills/bot-operations/SKILL.md Updates operational skill instructions to use the script and preserves safety warnings.
README.md Updates restart documentation to use bot/scripts/restart-bot.sh and --plist.
docs/plans/2026-04-18-issue-100-safe-restart.md Planning/trace doc for the change and addressed prior review items.
docs/plans/completed/2026-04-18-issue-100-safe-restart.md Completed plan record for the same work.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +159 to +166
local plist_label
if ! plist_label=$("$PLUTIL_BIN" -extract Label raw "$BOT_PLIST" 2>/dev/null); then
err "plist is missing 'Label' key: $BOT_PLIST"
return 1
fi
if [ "$plist_label" != "$BOT_LABEL" ]; then
err "plist Label '$plist_label' does not match expected '$BOT_LABEL'"
return 1
Copy link

Copilot AI Apr 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

plutil -extract is invoked without specifying an output target. On macOS, plutil -extract <key> raw expects -o - (stdout) or an output file; otherwise it fails and will make --plist mode abort as if the Label key were missing. Update the command to include -o - (and adjust the mock plutil + tests accordingly).

Copilot uses AI. Check for mistakes.
Comment on lines +113 to +114
file="\${4:-}"
if [ "\$key" != "Label" ] || [ "\$fmt" != "raw" ] || [ -z "\$file" ] || [ ! -f "\$file" ]; then
Copy link

Copilot AI Apr 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The mock plutil fixture only supports -extract Label raw <file> and will not match the real plutil -extract CLI once the script is corrected to pass -o -. Update this mock to accept (and ideally assert) the real argument pattern so the tests remain representative of production behavior.

Suggested change
file="\${4:-}"
if [ "\$key" != "Label" ] || [ "\$fmt" != "raw" ] || [ -z "\$file" ] || [ ! -f "\$file" ]; then
out_flag="\${4:-}"
out_target="\${5:-}"
file="\${6:-}"
if [ "\$key" != "Label" ] || [ "\$fmt" != "raw" ] || [ "\$out_flag" != "-o" ] || [ "\$out_target" != "-" ] || [ -z "\$file" ] || [ ! -f "\$file" ]; then

Copilot uses AI. Check for mistakes.
Comment on lines +13 to +27
if [ -z "${HOME:-}" ]; then
if command -v dscl >/dev/null 2>&1; then
HOME="$(dscl . -read "/Users/$(id -un)" NFSHomeDirectory 2>/dev/null | awk '{print $2}')"
fi
fi
if [ -z "${HOME:-}" ]; then
if command -v getent >/dev/null 2>&1; then
HOME="$(getent passwd "$(id -un)" 2>/dev/null | cut -d: -f6)"
fi
fi
export HOME
if [ -z "$HOME" ]; then
echo "[restart-bot] Error: could not determine HOME from environment or fallback lookups" >&2
exit 1
fi
Copy link

Copilot AI Apr 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The script exits early if it cannot determine HOME, but this happens before argument parsing. That means restart-bot.sh --help (and unknown-arg usage) can fail in restricted environments even though help output shouldn’t depend on HOME. Consider parsing --help/invalid args before the HOME fallback/guard (or only requiring HOME when actually needed for default plist path resolution).

Copilot uses AI. Check for mistakes.
Comment on lines +187 to +188
if ! old_pid=$(get_pid); then
err "service $BOT_LABEL is not registered with launchd; run: restart-bot.sh --plist"
Copy link

Copilot AI Apr 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get_pid documents exit status 2 as “launchctl query failed (unknown state)”, but graceful_restart treats any non-zero from get_pid as “service not registered”. This can misdiagnose transient/permission errors from launchctl list. Handle rc=2 explicitly (e.g., emit a distinct error like “launchctl list failed” and abort) so operators aren’t pointed at --plist when the real issue is an unreadable launchctl state.

Suggested change
if ! old_pid=$(get_pid); then
err "service $BOT_LABEL is not registered with launchd; run: restart-bot.sh --plist"
local get_pid_rc=0
if old_pid=$(get_pid); then
:
else
get_pid_rc=$?
case "$get_pid_rc" in
1)
err "service $BOT_LABEL is not registered with launchd; run: restart-bot.sh --plist"
;;
2)
err "launchctl list failed while querying service $BOT_LABEL; unable to determine launchd state"
;;
*)
err "failed to query PID for service $BOT_LABEL (get_pid exit status: $get_pid_rc)"
;;
esac

Copilot uses AI. Check for mistakes.
Comment on lines +233 to +237
if is_registered; then
log "Unregistering $SERVICE (launchctl bootout)…"
# bootout may return non-zero even when the teardown is in progress;
# we rely on polling below, not the exit code.
"$LAUNCHCTL_BIN" bootout "$SERVICE" >/dev/null 2>&1 || true
Copy link

Copilot AI Apr 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is_registered treats get_pid rc=2 (launchctl query failure/unknown state) the same as “not registered”. In --plist mode that can cause the script to skip bootout and go straight to bootstrap against a potentially still-registered service (often EIO). Consider detecting rc=2 explicitly and aborting with a clear “launchctl list failed” diagnostic instead of proceeding.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Document correct bot restart procedure for plist changes

2 participants