Context
We've been building out the alert system over the past year: shard failure alerts, data movement stalled alerts, and more recently, abandoned task detection alerts (idle, chronically failing, auto-disabled). We want to keep going. Specifically:
1. Make ShardFailed alerts default-on for all users
2. Enable auto-disabling of abandoned tasks with confidence
3. Support bulk and programmatic alert configuration without requiring per-task spec changes
Goals 1 and 2 are blocked because a single global threshold doesn't work for all tasks. Tasks that retry frequently but still move data every hour would be classified as "chronically failing" under a 30-day definition, but should not be disabled. The 30-day idle threshold is too short for tasks that sync monthly. Every time we try to tighten one scenario, we break another. Per-task tuning is needed to make these features safe to enable broadly.
Goal 3 is blocked because there's no good place for alert configuration to live. The only per-task config today (DataMovementStalled interval) is in a side-table (alert_data_processing) that flowctl doesn't know about. All other thresholds are global env vars with no per-task or per-prefix override.
The solution: a new prefix-scoped database table for alert configuration, analogous to storage mappings. Alert config is only ever read by the control plane (not the data plane or reactor), so it doesn't need to be in the task spec. A database table also allows bulk configuration by prefix and instant changes without publication.
Current state
Per-task alert configuration currently exists in a limited form: DataMovementStalled has a per-task evaluation_interval stored in the alert_data_processing database table. The UI writes to this table directly via PostgREST. This table is not visible to flowctl and is not version-controlled.
All other alert thresholds (ALERT_AFTER_SHARD_FAILURES, SHARD_FAILURE_RETENTION, CHRONICALLY_FAILING_THRESHOLD, IDLE_THRESHOLD, etc.) are global env vars in ControllerConfig (crates/agent/src/controllers/mod.rs). These apply identically to every task with no per-task or per-prefix override.
Proposed solution
Create a new alert_configs table that associates alert configuration with catalog prefixes or exact task names:
CREATE TABLE public.alert_configs (
    id                     public.flowid NOT NULL DEFAULT internal.id_generator(),
    catalog_prefix_or_name text          NOT NULL,
    config                 jsonb         NOT NULL,
    detail                 text,
    created_at             timestamptz   NOT NULL DEFAULT now(),
    updated_at             timestamptz   NOT NULL DEFAULT now(),
    last_modified_by       uuid          REFERENCES auth.users(id),
    UNIQUE (catalog_prefix_or_name)
);
A config row applies to all tasks whose catalog name matches:
{
    "dataMovementStalled": { "threshold": "2h" },
    "shardFailed": { "failureThreshold": 3, "retentionWindow": "8h" },
    "taskChronicallyFailing": { "threshold": "30d" },
    "taskIdle": { "threshold": "60d" },
    "autoDisable": { "idle": true, "failing": false }
}
Example: a config at acmeCo/prod/ applies to all tasks under that prefix. A config at acmeCo/prod/source-postgres (exact name, no trailing /) applies to only that task. An exact name match takes priority over a prefix match.
Changes to alert config are instant database writes with no publication required.
Alert type categorization
Not all 12 alert types make sense as configurable. The split:
Configurable (in the alert_configs table):
| Alert type | Parameter | Current source | Current default | Meaning |
| --- | --- | --- | --- | --- |
| dataMovementStalled | threshold | alert_data_processing table | 2h | How long without data movement before alerting |
| shardFailed | failureThreshold | ALERT_AFTER_SHARD_FAILURES env var | 3 | Number of failures within the retention window required to fire the alert |
| shardFailed | retentionWindow | SHARD_FAILURE_RETENTION env var | 8h | Time window within which failures are counted; older failures are discarded |
| taskChronicallyFailing | threshold | CHRONICALLY_FAILING_THRESHOLD env var | 30d | How long ShardFailed must be continuously active before firing |
| taskIdle | threshold | IDLE_THRESHOLD env var | 35d | How long without data movement before firing (bumped from 30d to accommodate monthly syncs) |
| (auto-disable) | autoDisable.idle | DISABLE_IDLE_TASKS env var | false | Whether idle tasks matching this config are auto-disabled |
| (auto-disable) | autoDisable.failing | DISABLE_FAILING_TASKS env var | false | Whether chronically failing tasks matching this config are auto-disabled |
Some notes:
- For tasks with unusual cadences (monthly syncs, seasonal data sources), setting a longer taskIdle.threshold is the intended mechanism for preventing false alerts and unwanted auto-disable.
- Auto-disable grace periods (CHRONICALLY_FAILING_DISABLE_AFTER, IDLE_DISABLE_AFTER, both default 7d) remain global. The threshold controls when an alert fires; the grace period is a system-level policy about how much notice to give before disabling, and doesn't need to vary per-prefix.
- resolve_shard_failed_alert_after (how long shard status must stay healthy before resolving a ShardFailed alert, default 2h) remains a global setting.
- user_pub_threshold (how recently a user must have published to suppress abandoned-task alerts, default 14d) remains a global setting. It represents a system-level definition of "active ownership" rather than a user-facing knob.
Not configurable (remain global or tenant-level):
- freeTrial, freeTrialEnding, freeTrialStalled, missingPaymentMethod: billing/tenant-level, not task-level
- autoDiscoverFailed, backgroundPublicationFailed: binary (failed or didn't), no meaningful threshold to configure
- taskAutoDisabledFailing, taskAutoDisabledIdle: these fire as consequences of the chronically-failing and idle alert chains, not independently configurable
Design decisions
Prefix resolution
Longest-prefix-match with exact-name override. For a task acmeCo/prod/source-postgres:
- An exact-name config acmeCo/prod/source-postgres wins if it exists
- Otherwise, the longest matching prefix wins (e.g., acmeCo/prod/ over acmeCo/)
- For fields not specified in the matching config, fall back to ControllerConfig env var defaults
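This resolution order can be sketched as a plain function, assuming config rows are keyed by catalog_prefix_or_name (the function shape and row representation are illustrative, not the actual controller code):

```rust
/// Pick the config row governing `task_name`. Prefixes end in '/', exact
/// names do not. Field-level fallback to ControllerConfig defaults happens
/// after this step. Illustrative sketch only, not the actual controller API.
fn resolve_alert_config<'a, C>(task_name: &str, rows: &'a [(String, C)]) -> Option<&'a C> {
    // An exact-name row wins outright over any prefix.
    if let Some((_, cfg)) = rows.iter().find(|(key, _)| key.as_str() == task_name) {
        return Some(cfg);
    }
    // Otherwise the longest matching prefix wins.
    rows.iter()
        .filter(|(key, _)| key.ends_with('/') && task_name.starts_with(key.as_str()))
        .max_by_key(|(key, _)| key.len())
        .map(|(_, cfg)| cfg)
}
```

With rows for acmeCo/, acmeCo/prod/, and acmeCo/prod/source-postgres, a task named acmeCo/prod/source-mysql resolves to the acmeCo/prod/ row, while acmeCo/prod/source-postgres resolves to its exact-name row.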
Per-prefix-or-exact-name matching (discussion topic)
The catalog_prefix_or_name column accepts both prefixes (ending in /) and exact catalog names (no trailing /). A prefix applies to all matching tasks; an exact name applies to one task. This differs from storage_mappings, which uses the catalog_prefix domain type and only supports prefixes.
This is a design decision open for discussion. The alternative is prefix-only, which would mean per-task config requires a naming convention that puts the task in its own sub-prefix. Most customers manage tasks as cattle, but exact-name matching provides an escape hatch for the cases that need it.
Auto-disable
Currently DISABLE_IDLE_TASKS and DISABLE_FAILING_TASKS are global env var booleans. Moving them into alert_configs makes auto-disable prefix-scoped. A customer (or the support team) can enable auto-disable for acmeCo/prod/ but leave it off for acmeCo/dev/. This addresses a real use case where a customer's shard failure alert subscriptions had to be restricted to prod because dev pipelines were noisy.
Config modeling
The config JSONB column is validated against a typed Rust struct in the GraphQL resolver before writing to the table. Each alert type has a different config shape (DataMovementStalledConfig has threshold; ShardFailedConfig has failureThreshold and retentionWindow). Typos and non-configurable alert type names produce validation errors in the API layer.
Adding a new configurable alert type requires a code change to the struct. This is the standard approach for typed configuration.
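As a std-only illustration of the rejection behavior (the function is hypothetical; the real check happens during typed deserialization in the GraphQL resolver, e.g. via serde structs that reject unknown fields), validating top-level alert type names might look like:

```rust
use std::collections::HashSet;

/// Reject config objects naming unknown or non-configurable alert types.
/// Hypothetical sketch: the real implementation deserializes into typed
/// structs rather than checking a hand-rolled key list.
fn validate_alert_config_keys(keys: &[&str]) -> Result<(), String> {
    let allowed: HashSet<&str> = [
        "dataMovementStalled",
        "shardFailed",
        "taskChronicallyFailing",
        "taskIdle",
        "autoDisable",
    ]
    .into_iter()
    .collect();
    for key in keys {
        if !allowed.contains(key) {
            return Err(format!("unknown or non-configurable alert type: {key}"));
        }
    }
    Ok(())
}
```

A typo like "shardFialed" or a non-configurable type like "freeTrial" fails validation in the API layer before anything is written.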
DataMovementStalled: move evaluation into the controller
The DataMovementStalled evaluation currently runs through a DB view (alert_data_movement_stalled) polled by an AlertEvaluator automation. This is a legacy pattern: the view joins alert_data_processing.evaluation_interval with catalog_stats_hourly to determine which tasks have stalled, and the evaluator diffs the results against alert_history to fire/resolve alerts.
Moving DataMovementStalled evaluation into the controller eliminates this indirection. The controller already evaluates a similar condition for TaskIdle via fetch_last_data_movement_ts() in abandon.rs. Adding DataMovementStalled evaluation is the same pattern: check catalog_stats_hourly for recent byte movement, compare against the threshold from alert_configs, fire or resolve the alert.
This replaces the alert_data_movement_stalled DB view and the AlertEvaluator<DataMovementStalledAlerts> automation. The AlertEvaluator<TenantAlerts> automation remains for now to limit scope; tenant-level billing alerts could move to controllers in the future.
A comment in evaluator.rs (crates/agent/src/alerts/evaluator.rs, lines 201 to 204) already hints at this direction:

// This queries the `internal.alert_data_movement_stalled` view for
// historical reasons. If we ever need to change that view, we should
// consider dropping the view in favor of a regular sql query, which is
// easier to manage.
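The controller-side evaluation reduces to a timestamp comparison. A rough sketch (the function shape and names are assumptions, not the actual abandon.rs code):

```rust
use std::time::{Duration, SystemTime};

/// Whether a DataMovementStalled alert should be firing: the last observed
/// byte movement (per catalog_stats_hourly) is older than the resolved
/// threshold (from alert_configs, else the global default). Illustrative only.
fn is_stalled(last_movement: SystemTime, threshold: Duration, now: SystemTime) -> bool {
    now.duration_since(last_movement)
        .map(|idle| idle >= threshold)
        // A last_movement in the future is clock skew, not a stall.
        .unwrap_or(false)
}
```

The controller fires the alert when this returns true and resolves it when it returns false, mirroring the existing TaskIdle evaluation.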
API and authorization
The table is accessed exclusively through GraphQL, with authorization evaluated in the resolver layer rather than through RLS.
Authorization follows the storage mappings pattern: admin capability on the prefix is required to create, update, or delete an alert_configs row. GraphQL queries support exactPrefixes and underPrefix filters.
Audit trail
Deferred. The last_modified_by column captures who made each change. Full change history can be derived from CDC events on the table in the future.
Migration from alert_data_processing
The migration avoids a flag-day deploy by using a transitional fallback in the controller. The ordering is important to avoid stale data shadowing fresh writes:
1. Create the alert_configs table (empty). Deploy the controller reading from alert_configs with fallback to alert_data_processing when no matching alert_configs row exists. Since the table is empty, everything falls back to alert_data_processing. No behavioral change.
2. Deploy UI changes to read/write alert_configs via GraphQL instead of alert_data_processing via PostgREST. New user changes go to the new table. Old untouched values still resolve via fallback to alert_data_processing.
3. Run a data migration to copy remaining alert_data_processing rows (those not already overwritten by UI writes in step 2) into alert_configs. After this, alert_configs has all values.
4. Remove the fallback from the controller and drop alert_data_processing.
The UI must switch before the data migration. If the data migration ran first, subsequent UI writes to alert_data_processing would be shadowed by stale copies in alert_configs (since alert_configs takes priority in the fallback logic).
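The transitional precedence amounts to a simple lookup chain (sketch; names assumed):

```rust
/// During the migration, a matching alert_configs value shadows the legacy
/// alert_data_processing value, which shadows the global env-var default.
/// This is why the UI must switch before rows are copied: once a copy exists
/// in alert_configs, the legacy row is never read again.
fn effective_value(
    alert_configs: Option<u64>,
    alert_data_processing: Option<u64>,
    global_default: u64,
) -> u64 {
    alert_configs.or(alert_data_processing).unwrap_or(global_default)
}
```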
Alternatives considered
Alert config in the task spec. Putting an alerts block directly on CaptureSpec/MaterializationSpec/DerivationSpec would make config version-controlled, automatically visible to flowctl, and preserved in publication history. We ended up not going down this path because:
- Changing config would require a full publication, which is slow, can fail due to unrelated connector issues, and unnecessarily restarts the connector.
- There's no way to configure by prefix, meaning that every task must be tuned individually.
- Alert config is only consumed by the control plane so it doesn't need to live in the spec.