
RFC: customizable alert configuration #2845

@jshearer

Description


Context

We've been building out the alert system over the past year: shard failure alerts, data movement stalled alerts, and more recently, abandoned task detection alerts (idle, chronically failing, auto-disabled). We want to keep going. Specifically:

  1. Make ShardFailed alerts default-on for all users
  2. Enable auto-disabling of abandoned tasks with confidence
  3. Support bulk and programmatic alert configuration without requiring per-task spec changes

Goals 1 and 2 are blocked because a single global threshold doesn't work for all tasks. Tasks that retry frequently but still move data every hour would be classified as "chronically failing" under a 30-day definition, but should not be disabled. The 30-day idle threshold is too short for tasks that sync monthly. Every time we try to tighten one scenario, we break another. Per-task tuning is needed to make these features safe to enable broadly.

Goal 3 is blocked because there's no good place for alert configuration to live. The only per-task config today (DataMovementStalled interval) is in a side-table (alert_data_processing) that flowctl doesn't know about. All other thresholds are global env vars with no per-task or per-prefix override.

The solution: a new prefix-scoped database table for alert configuration, analogous to storage mappings. Alert config is only ever read by the control plane (not the data plane or reactor), so it doesn't need to be in the task spec. A database table also allows bulk configuration by prefix and instant changes without publication.

Current state

Per-task alert configuration currently exists in a limited form: DataMovementStalled has a per-task evaluation_interval stored in the alert_data_processing database table. The UI writes to this table directly via PostgREST. This table is not visible to flowctl and is not version-controlled.

All other alert thresholds (ALERT_AFTER_SHARD_FAILURES, SHARD_FAILURE_RETENTION, CHRONICALLY_FAILING_THRESHOLD, IDLE_THRESHOLD, etc.) are global env vars in ControllerConfig (crates/agent/src/controllers/mod.rs). These apply identically to every task with no per-task or per-prefix override.

Proposed solution

Create a new alert_configs table that associates alert configuration with catalog prefixes or exact task names:

CREATE TABLE public.alert_configs (
    id                      public.flowid NOT NULL DEFAULT internal.id_generator(),
    catalog_prefix_or_name  text NOT NULL,
    config                  jsonb NOT NULL,
    detail                  text,
    created_at              timestamptz NOT NULL DEFAULT now(),
    updated_at              timestamptz NOT NULL DEFAULT now(),
    last_modified_by        uuid REFERENCES auth.users(id),
    UNIQUE (catalog_prefix_or_name)
);

A config row applies to all tasks whose catalog name matches its catalog_prefix_or_name. The config column holds a JSON document such as:

{
  "dataMovementStalled": { "threshold": "2h" },
  "shardFailed": { "failureThreshold": 3, "retentionWindow": "8h" },
  "taskChronicallyFailing": { "threshold": "30d" },
  "taskIdle": { "threshold": "60d" },
  "autoDisable": { "idle": true, "failing": false }
}

Example: a config at acmeCo/prod/ applies to all tasks under that prefix. A config at acmeCo/prod/source-postgres (exact name, no trailing /) applies to only that task. An exact name match takes priority over a prefix match.

Changes to alert config are instant database writes with no publication required.

Alert type categorization

Not all 12 alert types make sense as configurable. The split:

Configurable (in the alert_configs table):

| Alert type | Parameter | Current source | Current default | Meaning |
|---|---|---|---|---|
| dataMovementStalled | threshold | alert_data_processing table | 2h | How long without data movement before alerting |
| shardFailed | failureThreshold | ALERT_AFTER_SHARD_FAILURES env var | 3 | Number of failures within the retention window required to fire the alert |
| shardFailed | retentionWindow | SHARD_FAILURE_RETENTION env var | 8h | Time window within which failures are counted; older failures are discarded |
| taskChronicallyFailing | threshold | CHRONICALLY_FAILING_THRESHOLD env var | 30d | How long ShardFailed must be continuously active before firing |
| taskIdle | threshold | IDLE_THRESHOLD env var | 35d | How long without data movement before firing (bumped from 30d to accommodate monthly syncs) |
| (auto-disable) | autoDisable.idle | DISABLE_IDLE_TASKS env var | false | Whether idle tasks matching this config are auto-disabled |
| (auto-disable) | autoDisable.failing | DISABLE_FAILING_TASKS env var | false | Whether chronically failing tasks matching this config are auto-disabled |

Some notes:

  • For tasks with unusual cadences (monthly syncs, seasonal data sources), setting a longer taskIdle.threshold is the intended mechanism for preventing false alerts and unwanted auto-disable.
  • Auto-disable grace periods (CHRONICALLY_FAILING_DISABLE_AFTER, IDLE_DISABLE_AFTER, both default 7d) remain global. The threshold controls when an alert fires; the grace period is a system-level policy about how much notice to give before disabling, and doesn't need to vary per-prefix.
  • resolve_shard_failed_alert_after (how long a shard must remain healthy before an active ShardFailed alert resolves, default 2h) remains a global setting.
  • user_pub_threshold (how recently a user must have published to suppress abandoned-task alerts, default 14d) remains a global setting. It represents a system-level definition of "active ownership" rather than a user-facing knob.
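The failureThreshold/retentionWindow pair above amounts to a sliding-window count. A minimal sketch of that semantics (timestamps are seconds since epoch; all names here are illustrative, not the agent's actual API):

```rust
// Count shard failures inside the retention window; fire once the
// threshold is met. Failures older than the window are discarded.
fn should_fire_shard_failed(
    failure_times: &[u64],      // timestamps of observed shard failures
    now: u64,
    retention_window_secs: u64, // e.g. 8h = 28_800
    failure_threshold: usize,   // e.g. 3
) -> bool {
    let cutoff = now.saturating_sub(retention_window_secs);
    failure_times.iter().filter(|&&t| t >= cutoff).count() >= failure_threshold
}

fn main() {
    let now = 100_000;
    // Two failures in the 8h window plus one stale failure: no alert.
    assert!(!should_fire_shard_failed(&[1_000, 95_000, 99_000], now, 28_800, 3));
    // Three failures in the window: alert fires.
    assert!(should_fire_shard_failed(&[80_000, 95_000, 99_000], now, 28_800, 3));
}
```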

Not configurable (remain global or tenant-level):

  • freeTrial, freeTrialEnding, freeTrialStalled, missingPaymentMethod: billing/tenant-level, not task-level
  • autoDiscoverFailed, backgroundPublicationFailed: binary (failed or didn't), no meaningful threshold to configure
  • taskAutoDisabledFailing, taskAutoDisabledIdle: these fire as consequences of the chronically-failing and idle alert chains, not independently configurable

Design decisions

Prefix resolution

Longest-prefix-match with exact-name override. For a task acmeCo/prod/source-postgres:

  1. An exact-name config acmeCo/prod/source-postgres wins if it exists
  2. Otherwise, the longest matching prefix wins (e.g., acmeCo/prod/ over acmeCo/)
  3. For fields not specified in the matching config, fall back to ControllerConfig env var defaults
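Steps 1 and 2 can be sketched as a pure function over the set of configured keys (illustrative only; the real lookup would be a query against alert_configs):

```rust
// Resolve which alert_configs key applies to a task: an exact name
// (no trailing '/') beats any prefix; otherwise the longest matching
// prefix (ending in '/') wins.
fn resolve<'a>(task: &str, keys: &[&'a str]) -> Option<&'a str> {
    // 1. Exact-name match wins outright.
    if let Some(&exact) = keys.iter().find(|&&k| k == task) {
        return Some(exact);
    }
    // 2. Otherwise, the longest prefix key the task name starts with.
    keys.iter()
        .filter(|&&k| k.ends_with('/') && task.starts_with(k))
        .max_by_key(|k| k.len())
        .copied()
}

fn main() {
    let keys = ["acmeCo/", "acmeCo/prod/", "acmeCo/prod/source-postgres"];
    // Exact name beats both prefixes.
    assert_eq!(resolve("acmeCo/prod/source-postgres", &keys),
               Some("acmeCo/prod/source-postgres"));
    // Longest prefix wins over shorter one.
    assert_eq!(resolve("acmeCo/prod/source-mysql", &keys), Some("acmeCo/prod/"));
    assert_eq!(resolve("acmeCo/dev/source-mysql", &keys), Some("acmeCo/"));
    // No configured key matches.
    assert_eq!(resolve("otherCo/task", &keys), None);
}
```

Step 3 (field-level fallback to env var defaults) then applies to whichever row this resolution selects.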

Per-prefix-or-exact-name matching (discussion topic)

The catalog_prefix_or_name column accepts both prefixes (ending in /) and exact catalog names (no trailing /). A prefix applies to all matching tasks; an exact name applies to one task. This differs from storage_mappings, which uses the catalog_prefix domain type and only supports prefixes.

This is a design decision open for discussion. The alternative is prefix-only, which would mean per-task config requires a naming convention that puts the task in its own sub-prefix. Most customers manage tasks as cattle, but exact-name matching provides an escape hatch for the cases that need it.

Auto-disable

Currently DISABLE_IDLE_TASKS and DISABLE_FAILING_TASKS are global env var booleans. Moving them into alert_configs makes auto-disable prefix-scoped. A customer (or the support team) can enable auto-disable for acmeCo/prod/ but leave it off for acmeCo/dev/. This addresses a real use case where a customer's shard failure alert subscriptions had to be restricted to prod because dev pipelines were noisy.

Config modeling

The config JSONB column is validated against a typed Rust struct in the GraphQL resolver before writing to the table. Each alert type has a different config shape (DataMovementStalledConfig has threshold; ShardFailedConfig has failureThreshold and retentionWindow). Typos and non-configurable alert type names produce validation errors in the API layer.

Adding a new configurable alert type requires a code change to the struct. This is the standard approach for typed configuration.
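The field-level fallback half of this can be sketched with Option fields, where None means "not configured for this prefix" and the global env var default applies (struct and field names are illustrative; the real structs in the agent may differ):

```rust
// A matched alert_configs row only overrides the fields it specifies;
// everything else falls back to ControllerConfig defaults.
#[derive(Default)]
struct AlertConfig {
    task_idle_threshold_days: Option<u32>,
    chronically_failing_threshold_days: Option<u32>,
}

struct ControllerDefaults {
    idle_threshold_days: u32,                // IDLE_THRESHOLD, default 35d
    chronically_failing_threshold_days: u32, // CHRONICALLY_FAILING_THRESHOLD, default 30d
}

fn effective_idle_threshold(cfg: &AlertConfig, defaults: &ControllerDefaults) -> u32 {
    cfg.task_idle_threshold_days.unwrap_or(defaults.idle_threshold_days)
}

fn main() {
    let defaults = ControllerDefaults {
        idle_threshold_days: 35,
        chronically_failing_threshold_days: 30,
    };
    // A prefix config that only sets taskIdle: other alerts keep their defaults.
    let cfg = AlertConfig { task_idle_threshold_days: Some(60), ..Default::default() };
    assert_eq!(effective_idle_threshold(&cfg, &defaults), 60);
    assert_eq!(effective_idle_threshold(&AlertConfig::default(), &defaults), 35);
}
```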

DataMovementStalled: move evaluation into the controller

The DataMovementStalled evaluation currently runs through a DB view (alert_data_movement_stalled) polled by an AlertEvaluator automation. This is a legacy pattern: the view joins alert_data_processing.evaluation_interval with catalog_stats_hourly to determine which tasks have stalled, and the evaluator diffs the results against alert_history to fire/resolve alerts.

Moving DataMovementStalled evaluation into the controller eliminates this indirection. The controller already evaluates a similar condition for TaskIdle via fetch_last_data_movement_ts() in abandon.rs. Adding DataMovementStalled evaluation is the same pattern: check catalog_stats_hourly for recent byte movement, compare against the threshold from alert_configs, fire or resolve the alert.
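The fire/resolve decision the controller would make can be sketched as a small state transition (enum and function names are illustrative, not the agent's API):

```rust
#[derive(Debug, PartialEq)]
enum AlertAction {
    Fire,
    Resolve,
    NoChange,
}

// Compare the last observed data movement against the resolved threshold,
// then fire or resolve depending on the alert's current state.
fn evaluate_data_movement_stalled(
    last_movement_ts: u64, // last byte movement, seconds since epoch
    now: u64,
    threshold_secs: u64,   // dataMovementStalled.threshold for this task
    alert_active: bool,    // whether the alert is currently firing
) -> AlertAction {
    let stalled = now.saturating_sub(last_movement_ts) > threshold_secs;
    match (stalled, alert_active) {
        (true, false) => AlertAction::Fire,
        (false, true) => AlertAction::Resolve,
        _ => AlertAction::NoChange,
    }
}

fn main() {
    let two_hours = 7_200;
    // No movement for 3h with a 2h threshold: fire.
    assert_eq!(evaluate_data_movement_stalled(0, 10_800, two_hours, false),
               AlertAction::Fire);
    // Movement 1h ago while the alert is active: resolve.
    assert_eq!(evaluate_data_movement_stalled(7_200, 10_800, two_hours, true),
               AlertAction::Resolve);
}
```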

This replaces the alert_data_movement_stalled DB view and the AlertEvaluator<DataMovementStalledAlerts> automation. The AlertEvaluator<TenantAlerts> automation remains for now to limit scope; tenant-level billing alerts could move to controllers in the future.

A comment in evaluator.rs hints at this direction:

// This queries the `internal.alert_data_movement_stalled` view for
// historical reasons. If we ever need to change that view, we should
// consider dropping the view in favor of a regular sql query, which is
// easier to manage.

API and authorization

The table is accessed exclusively through GraphQL, with authorization evaluated in the resolver layer rather than through RLS.

Authorization follows the storage mappings pattern: admin capability on the prefix is required to create, update, or delete an alert_configs row. GraphQL queries support exactPrefixes and underPrefix filters.

Audit trail

Deferred. The last_modified_by column captures who made each change. Full change history can be derived from CDC events on the table in the future.

Migration from alert_data_processing

The migration avoids a flag-day deploy by using a transitional fallback in the controller. The ordering is important to avoid stale data shadowing fresh writes:

  1. Create the alert_configs table (empty). Deploy the controller reading from alert_configs with fallback to alert_data_processing when no matching alert_configs row exists. Since the table is empty, everything falls back to alert_data_processing. No behavioral change.
  2. Deploy UI changes to read/write alert_configs via GraphQL instead of alert_data_processing via PostgREST. New user changes go to the new table. Old untouched values still resolve via fallback to alert_data_processing.
  3. Run a data migration to copy remaining alert_data_processing rows (those not already overwritten by UI writes in step 2) into alert_configs. After this, alert_configs has all values.
  4. Remove the fallback from the controller and drop alert_data_processing.

The UI must switch before the data migration. If the data migration ran first, subsequent UI writes to alert_data_processing would be shadowed by stale copies in alert_configs (since alert_configs takes priority in the fallback logic).
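The fallback priority in steps 1-4 can be sketched as a simple precedence chain (names are illustrative):

```rust
// A value in alert_configs always wins over the legacy alert_data_processing
// row, which in turn wins over the global default. This precedence is why
// the UI must switch to writing alert_configs before the copy migration runs.
fn effective_threshold_secs(
    alert_configs_value: Option<u64>, // from the new table, if a row matches
    legacy_value: Option<u64>,        // from alert_data_processing
    global_default: u64,              // ControllerConfig env var default
) -> u64 {
    alert_configs_value.or(legacy_value).unwrap_or(global_default)
}

fn main() {
    // Step 1: new table empty, legacy value still applies.
    assert_eq!(effective_threshold_secs(None, Some(3_600), 7_200), 3_600);
    // Steps 2-3: once a value exists in alert_configs it shadows the legacy
    // row -- exactly why a stale migrated copy would hide fresh legacy writes.
    assert_eq!(effective_threshold_secs(Some(1_800), Some(3_600), 7_200), 1_800);
    // No config anywhere: global default.
    assert_eq!(effective_threshold_secs(None, None, 7_200), 7_200);
}
```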

Alternatives considered

Alert config in the task spec. Putting an alerts block directly on CaptureSpec/MaterializationSpec/DerivationSpec would make config version-controlled, automatically visible to flowctl, and preserved in publication history. We ended up not going down this path because:

  • Changing config would require a full publication, which is slow, can fail due to unrelated connector issues, and unnecessarily restarts the connector.
  • There's no way to configure by prefix, meaning that every task must be tuned individually.
  • Alert config is only consumed by the control plane so it doesn't need to live in the spec.

Labels: control-plane, control-plane-api, control-plane-models