
RFC: customizable alert configuration #2845

@jshearer

Description


Context

We've been building out the alert system over the past year: shard failure alerts, data movement stalled alerts, and more recently, abandoned task detection alerts (idle, chronically failing, auto-disabled). We want to keep going. Specifically:

  1. Make ShardFailed alerts default-on for all users
  2. Enable auto-disabling of abandoned tasks with confidence
  3. Support bulk and programmatic alert configuration without requiring per-task spec changes

Goals 1 and 2 are blocked because a single global threshold doesn't work for all tasks. Tasks that retry frequently but still move data every hour would be classified as "chronically failing" under a 30-day definition, but should not be disabled. The 30-day idle threshold is too short for tasks that sync monthly. Every time we try to tighten one scenario, we break another. Per-task tuning is needed to make these features safe to enable broadly.

Goal 3 is blocked because there's no good place for alert configuration to live. The only per-task config today (DataMovementStalled interval) is in a side-table (alert_data_processing) that flowctl doesn't know about. All other thresholds are global env vars with no per-task or per-prefix override.

The solution: a new prefix-scoped database table for alert configuration, analogous to storage mappings. Alert config is only ever read by the control plane (not the data plane or reactor), so it doesn't need to be in the task spec. A database table also allows bulk configuration by prefix and instant changes without publication.

Current state

Per-task alert configuration currently exists in a limited form: DataMovementStalled has a per-task evaluation_interval stored in the alert_data_processing database table. The UI writes to this table directly via PostgREST. This table is not visible to flowctl and is not version-controlled.

All other alert thresholds (ALERT_AFTER_SHARD_FAILURES, SHARD_FAILURE_RETENTION, CHRONICALLY_FAILING_THRESHOLD, IDLE_THRESHOLD, etc.) are global env vars in ControllerConfig (crates/agent/src/controllers/mod.rs). These apply identically to every task with no per-task or per-prefix override.

Proposed solution

Create a new alert_configs table that associates alert configuration with catalog prefixes or exact task names:

CREATE TABLE public.alert_configs (
    id                      public.flowid NOT NULL DEFAULT internal.id_generator(),
    catalog_prefix_or_name  text NOT NULL,
    config                  jsonb NOT NULL,
    detail                  text,
    created_at              timestamptz NOT NULL DEFAULT now(),
    updated_at              timestamptz NOT NULL DEFAULT now(),
    last_modified_by        uuid REFERENCES auth.users(id),
    UNIQUE (catalog_prefix_or_name)
);

A config row applies to all tasks whose catalog name matches its catalog_prefix_or_name. The config column holds a JSON document such as:

{
  "dataMovementStalled": { "threshold": "2h" },
  "shardFailed": { "failureThreshold": 3, "retentionWindow": "8h" },
  "taskChronicallyFailing": { "threshold": "30d" },
  "taskIdle": { "threshold": "60d" },
  "autoDisable": { "idle": true, "failing": false }
}

Example: a config at acmeCo/prod/ applies to all tasks under that prefix. A config at acmeCo/prod/source-postgres (exact name, no trailing /) applies to only that task. An exact name match takes priority over a prefix match.

Changes to alert config are instant database writes with no publication required.

Alert type categorization

Not all 12 alert types make sense as configurable. The split:

Configurable (in the alert_configs table):

| Alert type | Parameter | Current source | Current default | Meaning |
|---|---|---|---|---|
| dataMovementStalled | threshold | alert_data_processing table | 2h | How long without data movement before alerting |
| shardFailed | failureThreshold | ALERT_AFTER_SHARD_FAILURES env var | 3 | Number of failures within the retention window required to fire the alert |
| shardFailed | retentionWindow | SHARD_FAILURE_RETENTION env var | 8h | Time window within which failures are counted; older failures are discarded |
| taskChronicallyFailing | threshold | CHRONICALLY_FAILING_THRESHOLD env var | 30d | How long ShardFailed must be continuously active before firing |
| taskIdle | threshold | IDLE_THRESHOLD env var | 35d | How long without data movement before firing (bumped from 30d to accommodate monthly syncs) |
| (auto-disable) | autoDisable.idle | DISABLE_IDLE_TASKS env var | false | Whether idle tasks matching this config are auto-disabled |
| (auto-disable) | autoDisable.failing | DISABLE_FAILING_TASKS env var | false | Whether chronically failing tasks matching this config are auto-disabled |

Some notes:

  • For tasks with unusual cadences (monthly syncs, seasonal data sources), setting a longer taskIdle.threshold is the intended mechanism for preventing false alerts and unwanted auto-disable.
  • Auto-disable grace periods (CHRONICALLY_FAILING_DISABLE_AFTER, IDLE_DISABLE_AFTER, both default 7d) remain global. The threshold controls when an alert fires; the grace period is a system-level policy about how much notice to give before disabling, and doesn't need to vary per-prefix.
  • resolve_shard_failed_alert_after (how long a shard must remain healthy before an active ShardFailed alert resolves, default 2h) remains a global setting.
  • user_pub_threshold (how recently a user must have published to suppress abandoned-task alerts, default 14d) remains a global setting. It represents a system-level definition of "active ownership" rather than a user-facing knob.
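The failureThreshold/retentionWindow pair above amounts to a sliding-window count. A minimal sketch of that semantics (timestamps are seconds since epoch; all names here are illustrative, not the agent's actual API):

```rust
// Count shard failures inside the retention window; fire once the
// threshold is met. Failures older than the window are discarded.
fn should_fire_shard_failed(
    failure_times: &[u64],      // timestamps of observed shard failures
    now: u64,
    retention_window_secs: u64, // e.g. 8h = 28_800
    failure_threshold: usize,   // e.g. 3
) -> bool {
    let cutoff = now.saturating_sub(retention_window_secs);
    failure_times.iter().filter(|&&t| t >= cutoff).count() >= failure_threshold
}

fn main() {
    let now = 100_000;
    // Two failures in the 8h window plus one stale failure: no alert.
    assert!(!should_fire_shard_failed(&[1_000, 95_000, 99_000], now, 28_800, 3));
    // Three failures in the window: alert fires.
    assert!(should_fire_shard_failed(&[80_000, 95_000, 99_000], now, 28_800, 3));
}
```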

Not configurable (remain global or tenant-level):

  • freeTrial, freeTrialEnding, freeTrialStalled, missingPaymentMethod: billing/tenant-level, not task-level
  • autoDiscoverFailed, backgroundPublicationFailed: binary (failed or didn't), no meaningful threshold to configure
  • taskAutoDisabledFailing, taskAutoDisabledIdle: these fire as consequences of the chronically-failing and idle alert chains, not independently configurable

Design decisions

Prefix resolution

Longest-prefix-match with exact-name override. For a task acmeCo/prod/source-postgres:

  1. An exact-name config acmeCo/prod/source-postgres wins if it exists
  2. Otherwise, the longest matching prefix wins (e.g., acmeCo/prod/ over acmeCo/)
  3. For fields not specified in the matching config, fall back to ControllerConfig env var defaults
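Steps 1 and 2 can be sketched as a pure function over the set of configured keys (illustrative only; the real lookup would be a query against alert_configs):

```rust
// Resolve which alert_configs key applies to a task: an exact name
// (no trailing '/') beats any prefix; otherwise the longest matching
// prefix (ending in '/') wins.
fn resolve<'a>(task: &str, keys: &[&'a str]) -> Option<&'a str> {
    // 1. Exact-name match wins outright.
    if let Some(&exact) = keys.iter().find(|&&k| k == task) {
        return Some(exact);
    }
    // 2. Otherwise, the longest prefix key the task name starts with.
    keys.iter()
        .filter(|&&k| k.ends_with('/') && task.starts_with(k))
        .max_by_key(|k| k.len())
        .copied()
}

fn main() {
    let keys = ["acmeCo/", "acmeCo/prod/", "acmeCo/prod/source-postgres"];
    // Exact name beats both prefixes.
    assert_eq!(resolve("acmeCo/prod/source-postgres", &keys),
               Some("acmeCo/prod/source-postgres"));
    // Longest prefix wins over shorter one.
    assert_eq!(resolve("acmeCo/prod/source-mysql", &keys), Some("acmeCo/prod/"));
    assert_eq!(resolve("acmeCo/dev/source-mysql", &keys), Some("acmeCo/"));
    // No configured key matches.
    assert_eq!(resolve("otherCo/task", &keys), None);
}
```

Step 3 (field-level fallback to env var defaults) then applies to whichever row this resolution selects.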

Per-prefix-or-exact-name matching (discussion topic)

The catalog_prefix_or_name column accepts both prefixes (ending in /) and exact catalog names (no trailing /). A prefix applies to all matching tasks; an exact name applies to one task. This differs from storage_mappings, which uses the catalog_prefix domain type and only supports prefixes.

This is a design decision open for discussion. The alternative is prefix-only, which would mean per-task config requires a naming convention that puts the task in its own sub-prefix. Most customers manage tasks as cattle, but exact-name matching provides an escape hatch for the cases that need it.

Auto-disable

Currently DISABLE_IDLE_TASKS and DISABLE_FAILING_TASKS are global env var booleans. Moving them into alert_configs makes auto-disable prefix-scoped. A customer (or the support team) can enable auto-disable for acmeCo/prod/ but leave it off for acmeCo/dev/. This addresses a real use case where a customer's shard failure alert subscriptions had to be restricted to prod because dev pipelines were noisy.

Config modeling

The config JSONB column is validated against a typed Rust struct in the GraphQL resolver before writing to the table. Each alert type has a different config shape (DataMovementStalledConfig has threshold; ShardFailedConfig has failureThreshold and retentionWindow). Typos and non-configurable alert type names produce validation errors in the API layer.

Adding a new configurable alert type requires a code change to the struct. This is the standard approach for typed configuration.
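The field-level fallback half of this can be sketched with Option fields, where None means "not configured for this prefix" and the global env var default applies (struct and field names are illustrative; the real structs in the agent may differ):

```rust
// A matched alert_configs row only overrides the fields it specifies;
// everything else falls back to ControllerConfig defaults.
#[derive(Default)]
struct AlertConfig {
    task_idle_threshold_days: Option<u32>,
    chronically_failing_threshold_days: Option<u32>,
}

struct ControllerDefaults {
    idle_threshold_days: u32,                // IDLE_THRESHOLD, default 35d
    chronically_failing_threshold_days: u32, // CHRONICALLY_FAILING_THRESHOLD, default 30d
}

fn effective_idle_threshold(cfg: &AlertConfig, defaults: &ControllerDefaults) -> u32 {
    cfg.task_idle_threshold_days.unwrap_or(defaults.idle_threshold_days)
}

fn main() {
    let defaults = ControllerDefaults {
        idle_threshold_days: 35,
        chronically_failing_threshold_days: 30,
    };
    // A prefix config that only sets taskIdle: other alerts keep their defaults.
    let cfg = AlertConfig { task_idle_threshold_days: Some(60), ..Default::default() };
    assert_eq!(effective_idle_threshold(&cfg, &defaults), 60);
    assert_eq!(effective_idle_threshold(&AlertConfig::default(), &defaults), 35);
}
```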

DataMovementStalled: move evaluation into the controller

The DataMovementStalled evaluation currently runs through a DB view (alert_data_movement_stalled) polled by an AlertEvaluator automation. This is a legacy pattern: the view joins alert_data_processing.evaluation_interval with catalog_stats_hourly to determine which tasks have stalled, and the evaluator diffs the results against alert_history to fire/resolve alerts.

Moving DataMovementStalled evaluation into the controller eliminates this indirection. The controller already evaluates a similar condition for TaskIdle via fetch_last_data_movement_ts() in abandon.rs. Adding DataMovementStalled evaluation is the same pattern: check catalog_stats_hourly for recent byte movement, compare against the threshold from alert_configs, fire or resolve the alert.
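The fire/resolve decision the controller would make can be sketched as a small state transition (enum and function names are illustrative, not the agent's API):

```rust
#[derive(Debug, PartialEq)]
enum AlertAction {
    Fire,
    Resolve,
    NoChange,
}

// Compare the last observed data movement against the resolved threshold,
// then fire or resolve depending on the alert's current state.
fn evaluate_data_movement_stalled(
    last_movement_ts: u64, // last byte movement, seconds since epoch
    now: u64,
    threshold_secs: u64,   // dataMovementStalled.threshold for this task
    alert_active: bool,    // whether the alert is currently firing
) -> AlertAction {
    let stalled = now.saturating_sub(last_movement_ts) > threshold_secs;
    match (stalled, alert_active) {
        (true, false) => AlertAction::Fire,
        (false, true) => AlertAction::Resolve,
        _ => AlertAction::NoChange,
    }
}

fn main() {
    let two_hours = 7_200;
    // No movement for 3h with a 2h threshold: fire.
    assert_eq!(evaluate_data_movement_stalled(0, 10_800, two_hours, false),
               AlertAction::Fire);
    // Movement 1h ago while the alert is active: resolve.
    assert_eq!(evaluate_data_movement_stalled(7_200, 10_800, two_hours, true),
               AlertAction::Resolve);
}
```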

This replaces the alert_data_movement_stalled DB view and the AlertEvaluator<DataMovementStalledAlerts> automation. The AlertEvaluator<TenantAlerts> automation remains for now to limit scope; tenant-level billing alerts could move to controllers in the future.

A comment in evaluator.rs hints at this direction:

// This queries the `internal.alert_data_movement_stalled` view for
// historical reasons. If we ever need to change that view, we should
// consider dropping the view in favor of a regular sql query, which is
// easier to manage.

API and authorization

The table is accessed exclusively through GraphQL, with authorization evaluated in the resolver layer rather than through RLS.

Authorization follows the storage mappings pattern: admin capability on the prefix is required to create, update, or delete an alert_configs row. GraphQL queries support exactPrefixes and underPrefix filters.

Audit trail

Deferred. The last_modified_by column captures who made each change. Full change history can be derived from CDC events on the table in the future.

Migration from alert_data_processing

The migration avoids a flag-day deploy by using a transitional fallback in the controller. The ordering is important to avoid stale data shadowing fresh writes:

  1. Create the alert_configs table (empty). Deploy the controller reading from alert_configs with fallback to alert_data_processing when no matching alert_configs row exists. Since the table is empty, everything falls back to alert_data_processing. No behavioral change.
  2. Deploy UI changes to read/write alert_configs via GraphQL instead of alert_data_processing via PostgREST. New user changes go to the new table. Old untouched values still resolve via fallback to alert_data_processing.
  3. Run a data migration to copy remaining alert_data_processing rows (those not already overwritten by UI writes in step 2) into alert_configs. After this, alert_configs has all values.
  4. Remove the fallback from the controller and drop alert_data_processing.

The UI must switch before the data migration. If the data migration ran first, subsequent UI writes to alert_data_processing would be shadowed by stale copies in alert_configs (since alert_configs takes priority in the fallback logic).
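The fallback priority in steps 1-4 can be sketched as a simple precedence chain (names are illustrative):

```rust
// A value in alert_configs always wins over the legacy alert_data_processing
// row, which in turn wins over the global default. This precedence is why
// the UI must switch to writing alert_configs before the copy migration runs.
fn effective_threshold_secs(
    alert_configs_value: Option<u64>, // from the new table, if a row matches
    legacy_value: Option<u64>,        // from alert_data_processing
    global_default: u64,              // ControllerConfig env var default
) -> u64 {
    alert_configs_value.or(legacy_value).unwrap_or(global_default)
}

fn main() {
    // Step 1: new table empty, legacy value still applies.
    assert_eq!(effective_threshold_secs(None, Some(3_600), 7_200), 3_600);
    // Steps 2-3: once a value exists in alert_configs it shadows the legacy
    // row -- exactly why a stale migrated copy would hide fresh legacy writes.
    assert_eq!(effective_threshold_secs(Some(1_800), Some(3_600), 7_200), 1_800);
    // No config anywhere: global default.
    assert_eq!(effective_threshold_secs(None, None, 7_200), 7_200);
}
```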

Alternatives considered

Alert config in the task spec. Putting an alerts block directly on CaptureSpec/MaterializationSpec/DerivationSpec would make config version-controlled, automatically visible to flowctl, and preserved in publication history. We ended up not going down this path because:

  • Changing config would require a full publication, which is slow, can fail due to unrelated connector issues, and unnecessarily restarts the connector.
  • There's no way to configure by prefix, meaning that every task must be tuned individually.
  • Alert config is only consumed by the control plane so it doesn't need to live in the spec.

Labels: control-plane, control-plane-api, control-plane-models