dashprotocol · dashprotocol · May 28, 2026 · May 28, 2026
diff --git a/docs/runbook/postgres-backup-and-recovery.md b/docs/runbook/postgres-backup-and-recovery.md
@@ -0,0 +1,243 @@
+# Runbook: PostgreSQL Backup and Recovery
+
+## Purpose
+Install PostgreSQL 16 on the Havenhold app host, configure daily automated backups with 7-day local retention, and provide a tested restore procedure. This runbook covers two phases:
+
+- **D0 phase (H-007):** PostgreSQL setup and cron configuration. Completed before app code is ever deployed. Steps A and C only.
+- **Pre-launch phase (H-010):** First deployment of app code, schema migration, seed, and restore test. Completed during the H-010 initial deploy cycle, before the app process is started for the first time. Steps B, D, and E.
+
+The restore test (Step E) can only be run after `prisma migrate deploy` has created the schema. Do not attempt it at D0 time on an empty database.
+
+## Security Handling
+- This runbook is git-tracked and must stay sanitized.
+- Do not commit `DB_PASSWORD`, connection strings, `.pgpass` contents, or operator-only CIDRs.
+- Store `DB_PASSWORD` in approved team secret storage (e.g. 1Password, AWS Secrets Manager).
+- Backup files contain full database contents — `/var/backups/havenhold/` is restricted to `postgres:postgres` (mode 750). Do not scp backups over unencrypted channels.
+- `.pgpass` files are mode 600; never commit them.
+
+## Prerequisites
+- Host hardening baseline complete (H-005) — admin user, UFW, and fail2ban active.
+- Nginx/TLS setup complete (H-006) — HTTPS serving the app.
+- Lightsail firewall has no inbound rule for port 5432 (PostgreSQL must not be externally reachable).
+- Repo checked out locally with `infra/postgres-setup.sh`, `infra/postgres-backup.sh`, and `infra/postgres-restore.sh` present.
+- A strong, randomly generated `DB_PASSWORD` stored in team secret storage.
+
+## Deploy Steps
+
+> **Phase label key**
+> - `[D0]` — run during H-007, before any app code is deployed
+> - `[H-010]` — run during H-010 initial deploy, with app code present but app process not yet started
+
+### Step A `[D0]`: Install and configure PostgreSQL
+
+Copy and run the setup script on the host:
+
+```bash
+KEY_PATH=/path/to/key.pem
+STATIC_IP=203.0.113.10
+ADMIN_USER=adminuser
+
+scp -i "$KEY_PATH" infra/postgres-setup.sh "$ADMIN_USER"@"$STATIC_IP":/tmp/postgres-setup.sh
+ssh -i "$KEY_PATH" "$ADMIN_USER"@"$STATIC_IP" \
+  "DB_PASSWORD='<strong-password>' sudo -E bash /tmp/postgres-setup.sh"
+```
+
+Replace `<strong-password>` with the password from team secret storage. The script is idempotent — safe to re-run after fixing any failed step.
+
+### Step B `[H-010]`: Deploy app code, set DATABASE_URL, run migrations
+
+This step is performed during the H-010 initial deploy cycle. App code must be on the server before running these commands. The app process is **not started** yet.
+
+```bash
+# On host — edit the app env file with the DB connection string
+ssh -i "$KEY_PATH" "$ADMIN_USER"@"$STATIC_IP"
+sudo nano /opt/havenhold/server/.env
+# Set: DATABASE_URL="postgresql://havenhold:<DB_PASSWORD>@127.0.0.1:5432/havenhold"
+```
+
+Run Prisma migrations to create the schema (no app process required):
+
+```bash
+ssh -i "$KEY_PATH" "$ADMIN_USER"@"$STATIC_IP" \
+  "cd /opt/havenhold/server && npx prisma migrate deploy"
+```
+
+Seed demo data so the backup/restore test has rows to validate:
+
+```bash
+ssh -i "$KEY_PATH" "$ADMIN_USER"@"$STATIC_IP" \
+  "cd /opt/havenhold/server && npx prisma db seed"
+```
+
+### Step C `[D0]`: Install daily backup cron job
+
+```bash
+scp -i "$KEY_PATH" infra/postgres-backup.sh "$ADMIN_USER"@"$STATIC_IP":/tmp/postgres-backup.sh
+ssh -i "$KEY_PATH" "$ADMIN_USER"@"$STATIC_IP" \
+  "INSTALL_CRON=true sudo -E bash /tmp/postgres-backup.sh"
+```
+
+Default schedule: **02:30 UTC daily**. Override with `CRON_HOUR` and `CRON_MINUTE` env vars.
+
+### Step D `[H-010]`: Run first manual backup
+
+Run after Step B (migrations + seed) so the backup captures a schema with real rows:
+
+```bash
+ssh -i "$KEY_PATH" "$ADMIN_USER"@"$STATIC_IP" \
+  "sudo -u postgres /usr/local/bin/postgres-backup.sh"
+```
+
+Expected: `Backup written: /var/backups/havenhold/havenhold_<timestamp>.sql`
+
+### Step E `[H-010]`: Test restore (required AC — restore tested once)
+
+The app process has not been started yet at this point, so there are no active connections to terminate. Run restore directly:
+
+```bash
+scp -i "$KEY_PATH" infra/postgres-restore.sh "$ADMIN_USER"@"$STATIC_IP":/tmp/postgres-restore.sh
+ssh -i "$KEY_PATH" "$ADMIN_USER"@"$STATIC_IP" \
+  "sudo bash /tmp/postgres-restore.sh"
+```
+
+The script auto-selects the most recent backup, prompts for confirmation, drops and recreates the database, restores, and prints row counts. Fill in the [Restore Tested Once](#restore-tested-once-h-010-pre-launch-confirmation) table below.
+
+After the restore test passes, the app can be started for the first time:
+
+```bash
+ssh -i "$KEY_PATH" "$ADMIN_USER"@"$STATIC_IP" \
+  "sudo systemctl start havenhold-app"
+```
+
+## D0 Verification Checklist (H-007 — infra only)
+
+Complete these before any app code is deployed:
+
+- [ ] PostgreSQL 16 installed and active (`systemctl status postgresql`).
+- [ ] Database `havenhold` exists and app user `havenhold` can connect from `127.0.0.1`.
+- [ ] PostgreSQL binds to localhost only — no external port 5432 exposure.
+- [ ] Lightsail firewall has no inbound 5432 rule; UFW has no 5432 rule.
+- [ ] `/etc/cron.d/havenhold-backup` installed with `CRON_TZ=UTC` and `postgres` as the run user.
+
+## Pre-Launch Checklist (H-010 — complete before starting the app process)
+
+Complete these during the H-010 initial deploy, after app code is on the server but before the app process is started:
+
+- [ ] `DATABASE_URL` set in `/opt/havenhold/server/.env`.
+- [ ] `npx prisma migrate deploy` completed without errors.
+- [ ] `npx prisma db seed` completed — demo patient/user rows exist.
+- [ ] First manual backup written to `/var/backups/havenhold/` with non-zero size.
+- [ ] `postgres-restore.sh` run against the most recent backup; row-count validation passed.
+- [ ] [Restore Tested Once](#restore-tested-once-h-010-pre-launch-confirmation) table filled in and committed.
+
+## Evidence Commands
+
+Run on the host unless noted:
+
+```bash
+# PostgreSQL service health
+systemctl status postgresql --no-pager
+
+# Database and user exist
+sudo -u postgres psql -c "\l" | grep havenhold
+sudo -u postgres psql -c "\du" | grep havenhold
+
+# Localhost-only binding (expect 127.0.0.1:5432 only, no 0.0.0.0)
+sudo ss -tulpen | grep 5432
+
+# UFW has no 5432 rule (no output = correct)
+sudo ufw status verbose | grep 5432 || echo "No 5432 rule — correct"
+
+# Lightsail firewall (run locally, not on host)
+aws lightsail get-instance-port-states --instance-name havenhold-app-01 \
+  | grep -i fromport | grep 5432 || echo "No 5432 Lightsail rule — correct"
+
+# Backup directory and files
+ls -lh /var/backups/havenhold/
+
+# Cron job
+sudo cat /etc/cron.d/havenhold-backup
+
+# Backup log (after first run)
+sudo tail -30 /var/log/havenhold-backup.log
+
+# Quick pg_dump smoke test (dumps to /dev/null — non-destructive)
+sudo -u postgres pg_dump --no-password havenhold > /dev/null && echo "pg_dump OK"
+```
+
+## Restore Tested Once (H-010 Pre-Launch Confirmation)
+
+Fill in this table during the H-010 pre-launch restore test (Step E) and commit the change before starting the app process for the first time.
+
+| Field | Value |
+|---|---|
+| Date of test | _________________ |
+| Operator | _________________ |
+| Backup file used | _________________ |
+| Backup file size | _________________ |
+| Restore duration (approx) | _________________ |
+| `Patient` row count post-restore | _________________ |
+| `User` row count post-restore | _________________ |
+| `Medication` row count post-restore | _________________ |
+| Script exit code | _________________ |
+| Pass / Fail | _________________ |
+
+## Operations: Periodic Backup Verification
+
+### Check last backup ran successfully
+
+```bash
+sudo tail -20 /var/log/havenhold-backup.log
+ls -lht /var/backups/havenhold/ | head -5
+```
+
+### Check cron job is active
+
+```bash
+sudo cat /etc/cron.d/havenhold-backup
+systemctl status cron --no-pager
+```
+
+### Force a manual backup
+
+```bash
+sudo -u postgres DB_NAME=havenhold BACKUP_DIR=/var/backups/havenhold \
+  /usr/local/bin/postgres-backup.sh
+```
+
+### List all backups with sizes
+
+```bash
+ls -lh /var/backups/havenhold/
+```
+
+### Check what would be pruned (backups older than 7 days)
+
+```bash
+find /var/backups/havenhold/ -name "havenhold_*.sql" -mtime +7 -ls
+```
+
+### Restore from a specific backup file
+
+```bash
+BACKUP_FILE=/var/backups/havenhold/havenhold_20260528T023001Z.sql \
+  sudo -E bash /tmp/postgres-restore.sh
+```
+
+## Rollback / Recovery
+
+- **`postgres-setup.sh` failed mid-way:** Re-run — the script is idempotent.
+- **pg_dump fails in cron:** Check `/var/log/havenhold-backup.log`. Common causes: disk full (`df -h`), PostgreSQL service stopped (`systemctl status postgresql`). Resolve and run manually to verify.
+- **`postgres-restore.sh` fails at DROP — active connections persist:** Stop the application first (`sudo systemctl stop havenhold-app`), then retry.
+- **Restore SQL errors (`ON_ERROR_STOP` exits):** The backup may be corrupt or from an incompatible schema version. Try the previous day's backup from `/var/backups/havenhold/`.
+- **`pg_hba.conf` conflict on a partially provisioned host:** The validation step at the end of `postgres-setup.sh` catches this. Inspect the file manually (`sudo cat /etc/postgresql/16/main/pg_hba.conf`) and remove duplicate or conflicting entries before re-running.
+- **All local backups lost (disk failure):** No S3 offload exists at D0. Restore the Lightsail instance snapshot from the AWS console, then re-run `postgres-setup.sh` with the original `DB_PASSWORD`. Pending migrations can be replayed with `npx prisma migrate deploy`.
+- **Disk pressure from backups:** Emergency prune: `find /var/backups/havenhold/ -name "*.sql" -mtime +2 -delete` (keeps last 2 days). Adjust `RETENTION_DAYS` in `/etc/cron.d/havenhold-backup` for the long term.
+
+## Notes
+
+- PostgreSQL is intentionally not exposed outside localhost. There is no pgAdmin or external connection. All DBA work is done via `sudo -u postgres psql` on the host.
+- Backup files are plain SQL (`pg_dump -Fp`). They can be inspected with `less` or `grep`. To upgrade to compressed custom format (`-Fc`) in a future ticket, restore uses `pg_restore -d havenhold file.dump` instead of `psql -f file.sql`.
+- S3 backup offload is deferred to a follow-up ticket. Until then, Lightsail instance snapshots are the only off-host recovery path.
+- At typical Havenhold data volumes (< 10 MB/day), 7 days of backups costs under 100 MB of disk. Adjust via `RETENTION_DAYS` in `/etc/cron.d/havenhold-backup`.
+- After any restore, run `npx prisma migrate deploy` to verify the schema migration state is current.
diff --git a/infra/postgres-backup.sh b/infra/postgres-backup.sh
@@ -0,0 +1,123 @@
+#!/usr/bin/env bash
+# infra/postgres-backup.sh — pg_dump daily backup for Havenhold
+#
+# BACKUP MODE (default — run as postgres OS user or via sudo -u postgres):
+#   sudo -u postgres DB_NAME=havenhold BACKUP_DIR=/var/backups/havenhold \
+#     /usr/local/bin/postgres-backup.sh
+#
+# CRON INSTALL MODE (run once as root to schedule daily backups):
+#   Operator invocation (from repo root):
+#     KEY_PATH=/path/to/key.pem
+#     STATIC_IP=203.0.113.10
+#     ADMIN_USER=adminuser
+#     scp -i "$KEY_PATH" infra/postgres-backup.sh "$ADMIN_USER"@"$STATIC_IP":/tmp/postgres-backup.sh
+#     ssh -i "$KEY_PATH" "$ADMIN_USER"@"$STATIC_IP" \
+#       "INSTALL_CRON=true sudo -E bash /tmp/postgres-backup.sh"
+
+set -euo pipefail
+
+DB_NAME="${DB_NAME:-havenhold}"
+BACKUP_DIR="${BACKUP_DIR:-/var/backups/havenhold}"
+RETENTION_DAYS="${RETENTION_DAYS:-7}"
+INSTALL_CRON="${INSTALL_CRON:-false}"
+CRON_HOUR="${CRON_HOUR:-2}"
+CRON_MINUTE="${CRON_MINUTE:-30}"
+
+# Validate inputs before any value is written into the cron file.
+# A newline in DB_NAME or BACKUP_DIR would inject additional cron lines
+# (executed by the cron daemon as separate commands, potentially as root).
+[[ "${DB_NAME}" =~ ^[a-zA-Z0-9_]+$ ]] \
+  || { echo "ERROR: DB_NAME '${DB_NAME}' contains invalid characters"; exit 1; }
+[[ "${BACKUP_DIR}" =~ ^[a-zA-Z0-9/_-]+$ ]] \
+  || { echo "ERROR: BACKUP_DIR '${BACKUP_DIR}' contains invalid characters"; exit 1; }
+[[ "${RETENTION_DAYS}" =~ ^[0-9]+$ ]] \
+  || { echo "ERROR: RETENTION_DAYS '${RETENTION_DAYS}' must be a number"; exit 1; }
+[[ "${CRON_HOUR}" =~ ^([0-9]|1[0-9]|2[0-3])$ ]] \
+  || { echo "ERROR: CRON_HOUR '${CRON_HOUR}' must be 0-23"; exit 1; }
+[[ "${CRON_MINUTE}" =~ ^([0-9]|[1-5][0-9])$ ]] \
+  || { echo "ERROR: CRON_MINUTE '${CRON_MINUTE}' must be 0-59"; exit 1; }
+
+log() { echo "[postgres-backup] $(date -u +'%H:%M:%S') $*"; }
+
+# ---------------------------------------------------------------------------
+# CRON INSTALL MODE
+# ---------------------------------------------------------------------------
+if [[ "${INSTALL_CRON}" == "true" ]]; then
+  if [[ "$(id -u)" -ne 0 ]]; then
+    exec sudo -E bash "$0" "$@"
+  fi
+
+  SCRIPT_DEST="/usr/local/bin/postgres-backup.sh"
+  log "Installing backup script to ${SCRIPT_DEST}"
+  cp "$(realpath "$0")" "${SCRIPT_DEST}"
+  chmod 750 "${SCRIPT_DEST}"
+  chown root:postgres "${SCRIPT_DEST}"
+
+  CRON_FILE="/etc/cron.d/havenhold-backup"
+  # Build the desired content first so we can compare and only write on change.
+  # Path-presence check alone is not enough — CRON_HOUR, CRON_MINUTE, RETENTION_DAYS
+  # etc. may have changed since the last install.
+  NEW_CRON_CONTENT="$(cat <<EOF
+# Havenhold daily PostgreSQL backup — managed by postgres-backup.sh
+SHELL=/bin/bash
+PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
+CRON_TZ=UTC
+${CRON_MINUTE} ${CRON_HOUR} * * * postgres DB_NAME=${DB_NAME} BACKUP_DIR=${BACKUP_DIR} RETENTION_DAYS=${RETENTION_DAYS} ${SCRIPT_DEST} >> /var/log/havenhold-backup.log 2>&1
+EOF
+  )"
+
+  if [[ -f "${CRON_FILE}" ]] && [[ "$(cat "${CRON_FILE}")" == "${NEW_CRON_CONTENT}" ]]; then
+    log "Cron job already up to date — skipping"
+  else
+    printf '%s\n' "${NEW_CRON_CONTENT}" > "${CRON_FILE}"
+    chmod 644 "${CRON_FILE}"
+    log "Cron job written: ${CRON_FILE} (schedule: ${CRON_HOUR}:$(printf '%02d' "${CRON_MINUTE}") UTC daily)"
+  fi
+
+  log "Cron install complete."
+  log "Test with: sudo -u postgres ${SCRIPT_DEST}"
+  log "Logs at:   sudo tail -f /var/log/havenhold-backup.log"
+  exit 0
+fi
+
+# ---------------------------------------------------------------------------
+# BACKUP MODE
+# ---------------------------------------------------------------------------
+log "Step 1/4: Verify backup directory"
+if [[ ! -d "${BACKUP_DIR}" ]]; then
+  log "ERROR: ${BACKUP_DIR} does not exist. Run postgres-setup.sh first." >&2
+  exit 1
+fi
+
+log "Step 2/4: Running pg_dump for '${DB_NAME}'"
+TIMESTAMP="$(date -u +'%Y%m%dT%H%M%SZ')"
+BACKUP_FILE="${BACKUP_DIR}/${DB_NAME}_${TIMESTAMP}.sql"
+
+# Peer auth: script runs as postgres OS user — no password or PGPASSWORD needed.
+# --no-password ensures we fail fast rather than hang if auth is misconfigured.
+pg_dump \
+  --format=plain \
+  --no-password \
+  --file="${BACKUP_FILE}" \
+  "${DB_NAME}"
+
+log "Step 3/4: Verify backup is non-empty"
+if [[ ! -s "${BACKUP_FILE}" ]]; then
+  log "ERROR: Backup file is empty — removing and aborting." >&2
+  rm -f "${BACKUP_FILE}"
+  exit 1
+fi
+BACKUP_SIZE="$(du -sh "${BACKUP_FILE}" | cut -f1)"
+log "Backup written: ${BACKUP_FILE} (${BACKUP_SIZE})"
+
+log "Step 4/4: Pruning backups older than ${RETENTION_DAYS} days"
+PRUNED=0
+while IFS= read -r -d '' OLD_FILE; do
+  rm -f "${OLD_FILE}"
+  log "Pruned: ${OLD_FILE}"
+  PRUNED=$((PRUNED + 1))
+done < <(find "${BACKUP_DIR}" -maxdepth 1 -name "${DB_NAME}_*.sql" \
+          -mtime "+${RETENTION_DAYS}" -print0)
+log "Pruned ${PRUNED} old backup(s)"
+
+log "Backup complete: ${BACKUP_FILE}"