feat: wire an opt-in Slack notifier and fold alert delivery into Run 2 #41

Merged: rhoninl merged 1 commit into main from feat/alertmanager-run2-notifier on May 22, 2026

@rhoninl (Owner) commented on May 22, 2026

What

The observability overlay's Alertmanager routed only to a null receiver, so none of the 11 alert rules paged anyone. The Run 2 plan deferred wiring a real notifier to a post-Run-2 "~5-minute change", but the alert delivery path can only be tested end to end: a wrong webhook URL passes every config load and healthcheck, then fails silently during a real incident. That is worse than no alerting, because operators trust it.

This PR makes Slack delivery an opt-in path that the burn-in itself validates, folding it into Run 2 Phase 3 instead of deferring it.
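
Since delivery can only be confirmed by watching a notification arrive, one way an operator might exercise the path during burn-in is to push a synthetic alert into Alertmanager and check Slack. The sketch below only builds the request body for Alertmanager's v2 alerts endpoint; the helper name, the alert name `BurnInDeliveryProbe`, and the label values are hypothetical and are not taken from this PR.

```typescript
// Shape of one entry in the JSON array accepted by POST /api/v2/alerts.
interface SyntheticAlert {
  labels: Record<string, string>;
  annotations: Record<string, string>;
  startsAt: string; // RFC 3339 timestamp
  endsAt: string; // RFC 3339 timestamp; the alert resolves after this time
}

// Hypothetical helper: builds a single short-lived probe alert.
function buildSyntheticAlert(now: Date, ttlMinutes: number): SyntheticAlert[] {
  const endsAt = new Date(now.getTime() + ttlMinutes * 60_000);
  return [
    {
      // Assumed label set; real routing may depend on other labels.
      labels: { alertname: "BurnInDeliveryProbe", severity: "warning" },
      annotations: { summary: "Synthetic alert to confirm Slack delivery" },
      startsAt: now.toISOString(),
      endsAt: endsAt.toISOString(),
    },
  ];
}

const payload = buildSyntheticAlert(new Date("2026-05-22T00:00:00.000Z"), 5);
console.log(JSON.stringify(payload, null, 2));
```

The resulting JSON could then be posted to the Alertmanager instance (for example `http://localhost:9093/api/v2/alerts`, assuming the default port is exposed). The check that matters is a human seeing the message in the Slack channel, not the HTTP status of the POST.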

Changes

  • deploy/alertmanager.slack.yml (new) — a Slack-delivering Alertmanager config. The webhook URL is read at notification time from a tmpfs file via api_url_file, never stored in the repo.
  • deploy/docker-compose.observability.yml — the alertmanager service gains an entrypoint that materializes INFLOOP_ALERT_SLACK_WEBHOOK_URL into that tmpfs file and selects the Slack config when set, the null-receiver alertmanager.yml when not. Mirrors the sibling prometheus scrape-token service exactly. The local burn-in default is unchanged.
  • deploy/alertmanager.yml — header comment rewritten; behaviour unchanged (still the null receiver default).
  • .env.example — new optional INFLOOP_ALERT_SLACK_WEBHOOK_URL entry.
  • deploy/burn-in-preflight.test.ts — three static wiring checks, including a guard that the two configs' route: blocks stay in sync.
  • docs: burn-in-run2-plan.md Phase 3 now validates observed alert delivery; production-readiness-decision.md records the deferral as withdrawn; burn-in.md Phase 3 + sign-off updated.
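
The route-sync guard mentioned in the list above could be sketched roughly as follows. This is not the code in burn-in-preflight.test.ts; it is a minimal illustration that assumes both configs keep `route:` as a top-level mapping and use the same receiver name (here `default`), so the two blocks can be compared as text. The sample configs, the receiver name, the channel, and the secret path are all hypothetical.

```typescript
// Returns the top-level block for `key`, ignoring blank lines and comments,
// so that cosmetic differences between files do not trip the guard.
function extractTopLevelBlock(yamlText: string, key: string): string {
  const lines = yamlText.split("\n");
  const start = lines.findIndex((line) => line.startsWith(`${key}:`));
  if (start === -1) throw new Error(`no top-level "${key}:" block found`);
  const block: string[] = [lines[start].trimEnd()];
  for (const line of lines.slice(start + 1)) {
    const trimmed = line.trim();
    if (trimmed === "" || trimmed.startsWith("#")) continue;
    // A non-indented line starts the next top-level key, so the block ends here.
    if (!/^\s/.test(line)) break;
    block.push(line.trimEnd());
  }
  return block.join("\n");
}

// Hypothetical stand-ins for alertmanager.yml and alertmanager.slack.yml.
const nullConfig = `
# default: null receiver
route:
  receiver: default
  group_by: [alertname]
  # grouping window
  group_wait: 30s
receivers:
  - name: default
`;

const slackConfig = `
route:
  receiver: default
  group_by: [alertname]
  group_wait: 30s
receivers:
  - name: default
    slack_configs:
      - api_url_file: /run/secrets/slack_webhook_url
        channel: "#alerts"
`;

const inSync =
  extractTopLevelBlock(nullConfig, "route") ===
  extractTopLevelBlock(slackConfig, "route");
console.log(inSync ? "route blocks in sync" : "route blocks have drifted");
```

A text comparison like this is deliberately strict: any change to grouping or matching in one file fails the check until the other file is updated, which is the property the guard is after. A real test would likely read the two files from `deploy/` and might use a YAML parser instead of line matching.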

Validation

  • bun test: 1255 pass / 1 skip / 0 fail (+3 new tests)
  • bun run typecheck (tsc) — clean
  • amtool check-config: SUCCESS on both alertmanager.yml and alertmanager.slack.yml
  • docker compose config merges cleanly; entrypoint branch logic verified both ways (unset → null receiver, set → Slack config)
  • Design review (sound-with-changes) and senior code review (ship-with-minor-fixes) both completed; all findings incorporated.
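
The two-way entrypoint check listed above (unset → null receiver, set → Slack config) amounts to a single decision, which can be modelled as a pure function for illustration. The real logic lives in the compose entrypoint, not in TypeScript; the function name is hypothetical, and treating an empty or whitespace-only value the same as unset is an assumption.

```typescript
type Env = Record<string, string | undefined>;

// Hypothetical model of the entrypoint's branch: pick the Slack config only
// when the webhook variable carries a non-empty value.
function selectAlertmanagerConfig(env: Env): string {
  const webhook = env.INFLOOP_ALERT_SLACK_WEBHOOK_URL ?? "";
  return webhook.trim() !== "" ? "alertmanager.slack.yml" : "alertmanager.yml";
}

console.log(selectAlertmanagerConfig({}));
console.log(
  selectAlertmanagerConfig({
    INFLOOP_ALERT_SLACK_WEBHOOK_URL: "https://hooks.slack.example/T000/B000/XXXX",
  }),
);
```

Modelling the branch this way makes the default explicit: with no webhook configured, the local burn-in keeps using the null-receiver alertmanager.yml.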

Follow-up after merge

The burn-in-run2-candidate tag must be moved forward to the merge commit — it currently points at 0425c9f, which lacks alertmanager.slack.yml, so a Run 2 operator doing git checkout burn-in-run2-candidate would not get this change. (Not done in this PR because the tag must point at a commit actually on main.)

🤖 Generated with Claude Code

The observability overlay's Alertmanager routed only to a null receiver,
so all 11 alert rules paged nobody. The earlier plan deferred wiring a
real notifier to a post-Run-2 "~5-minute change" — but the alert delivery
path is end-to-end-only-testable: a wrong webhook URL passes every config
load and healthcheck, then fails silently in the real incident.

Make Slack delivery an opt-in the burn-in itself can validate:

- deploy/alertmanager.slack.yml: a Slack-delivering config; the webhook
  URL is read at notification time from a tmpfs file via api_url_file,
  never stored in the repo.
- docker-compose.observability.yml: the alertmanager service gains an
  entrypoint that materializes INFLOOP_ALERT_SLACK_WEBHOOK_URL into that
  tmpfs file and selects the Slack config when set, the null-receiver
  alertmanager.yml when not — mirroring the prometheus scrape-token
  service. The local burn-in default is unchanged.
- burn-in-preflight.test.ts: three static wiring checks, including a
  guard that the two configs' route blocks stay in sync.
- docs: Run 2 Phase 3 now validates observed alert delivery; the
  production-readiness decision records the deferral as withdrawn.

bun test 1255 pass / 1 skip / 0 fail; tsc clean; amtool check-config
passes on both configs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@rhoninl merged commit b695fe7 into main on May 22, 2026
2 checks passed
@rhoninl deleted the feat/alertmanager-run2-notifier branch on May 22, 2026 04:44