feat(monitoring): declarative alert rules engine #1663

Open: Nixxx19 wants to merge 1 commit into mofa-org:main from Nixxx19:nityam/monitoring-alert-rules

Nixxx19 commented on Apr 21, 2026

Summary

Adds a declarative alert-rules engine to mofa-monitoring. It plugs into the existing MetricsCollector / Prometheus / OpenTelemetry stack as a consumer: the evaluation loop reads metrics, evaluates rules, and fans the resulting events out to pluggable notifiers.

Closes the SLO/alerting gap in the dashboard: today mofa-monitoring exports metrics to Prometheus but has no first-class way to express "page me when LLM error rate exceeds 5% for two minutes".

Architecture

flowchart LR
    subgraph Sources [Metric sources]
        MC[MetricsCollector]
        PR[Prometheus scrape]
        IM[InMemoryMetricSource<br/>test fixture]
    end

    subgraph Engine [Alerts engine]
        RS[Rule set]
        EV[Evaluator]
    end

    subgraph Sinks [Notifiers]
        LN[LogNotifier]
        CN[CollectingNotifier]
        CO[CompositeNotifier]
    end

    MC -->|MetricSource| EV
    PR -->|MetricSource| EV
    IM -->|MetricSource| EV
    RS --> EV
    EV -->|AlertEvent| CO
    CO --> LN
    CO --> CN

Both edges are trait-parameterised: MetricSource abstracts the read side; Notifier abstracts delivery. Downstream integrations (webhook, Slack, PagerDuty, YAML rule loader) plug in without changing the evaluator.
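
To make the read-side seam concrete, here is a minimal sketch of what a MetricSource-style trait could look like. It is a synchronous, std-only simplification: the PR lists async-trait among its dependencies, so the real trait is presumably async. The `MapSource` type, the `value` field, and the `read_value` helper are hypothetical names used only for illustration; the delivery side (Notifier) would follow the same pattern.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// One observation of a metric. The PR names `MetricSample`; this field
/// layout is an assumption made for the sketch.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct MetricSample {
    pub value: f64,
}

/// Read side: anything that can answer "what is this metric right now?".
pub trait MetricSource {
    fn sample(&self, metric: &str) -> Option<MetricSample>;
}

/// Hypothetical in-memory source, in the spirit of the PR's test fixture.
#[derive(Default)]
pub struct MapSource {
    values: Mutex<HashMap<String, f64>>,
}

impl MapSource {
    /// Store or overwrite the current value of a metric.
    pub fn set(&self, metric: &str, value: f64) {
        self.values.lock().unwrap().insert(metric.to_string(), value);
    }
}

impl MetricSource for MapSource {
    fn sample(&self, metric: &str) -> Option<MetricSample> {
        self.values
            .lock()
            .unwrap()
            .get(metric)
            .map(|v| MetricSample { value: *v })
    }
}

/// The consumer only sees the trait object, so any source can be swapped in.
pub fn read_value(source: &dyn MetricSource, metric: &str) -> Option<f64> {
    source.sample(metric).map(|s| s.value)
}
```

Because the consumer depends only on the trait, a Prometheus-backed or collector-backed source could replace `MapSource` without touching `read_value`.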

Rule model

classDiagram
    class Rule {
        +name: String
        +description: String
        +severity: Severity
        +condition: Condition
        +for_duration: Duration
        +labels: HashMap
        +annotations: HashMap
    }
    class Severity {
        <<enumeration>>
        Info
        Warning
        Critical
    }
    class Condition {
        <<enumeration>>
        Threshold
        RateOfChange
        Absent
    }
    class ComparisonOp {
        <<enumeration>>
        Gt
        Gte
        Lt
        Lte
        Eq
        Neq
    }
    Rule --> Severity
    Rule --> Condition
    Condition --> ComparisonOp

Condition families:

| Condition | Semantics |
| --- | --- |
| Threshold | Fire when metric OP threshold holds (e.g. error_rate > 0.05). |
| RateOfChange | Fire when the per-second derivative over a sliding window satisfies OP threshold. |
| Absent | Fire when the metric is missing or has not been observed within a staleness window. Used for liveness/heartbeat checks. |
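
The three condition families reduce to small pure functions. The sketch below is illustrative only: the function names, the use of plain seconds as `f64`, and the `f64::EPSILON` tolerance for Eq/Neq are assumptions rather than the PR's actual implementation.

```rust
/// Comparison operators, mirroring the variants listed for `ComparisonOp`.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ComparisonOp {
    Gt,
    Gte,
    Lt,
    Lte,
    Eq,
    Neq,
}

impl ComparisonOp {
    /// Returns true when `value OP threshold` holds.
    pub fn apply(self, value: f64, threshold: f64) -> bool {
        match self {
            ComparisonOp::Gt => value > threshold,
            ComparisonOp::Gte => value >= threshold,
            ComparisonOp::Lt => value < threshold,
            ComparisonOp::Lte => value <= threshold,
            // Float equality uses a small tolerance (an assumption of this sketch).
            ComparisonOp::Eq => (value - threshold).abs() <= f64::EPSILON,
            ComparisonOp::Neq => (value - threshold).abs() > f64::EPSILON,
        }
    }
}

/// Per-second derivative between two `(timestamp_secs, value)` observations.
/// Returns `None` when the timestamps do not advance.
pub fn rate_per_second(prev: (f64, f64), curr: (f64, f64)) -> Option<f64> {
    let dt = curr.0 - prev.0;
    if dt <= 0.0 {
        return None;
    }
    Some((curr.1 - prev.1) / dt)
}

/// Absent check: true when the metric was never seen, or was last seen
/// longer ago than `staleness_secs`.
pub fn is_absent(last_seen_secs: Option<f64>, now_secs: f64, staleness_secs: f64) -> bool {
    match last_seen_secs {
        None => true,
        Some(t) => now_secs - t > staleness_secs,
    }
}
```

A RateOfChange rule would then feed the result of `rate_per_second` into `ComparisonOp::apply` with the rule's threshold.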

State machine (Prometheus-compatible)

stateDiagram-v2
    [*] --> Inactive
    Inactive --> Firing: match && for_duration == 0
    Inactive --> Pending: match && for_duration > 0
    Pending --> Firing: match && elapsed >= for_duration
    Pending --> Inactive: no match (silent)
    Firing --> Firing: match
    Firing --> Inactive: no match (emits Resolved)

| Transition | Emitted |
| --- | --- |
| Inactive → Pending | Pending |
| Pending → Firing | Firing |
| Inactive → Firing (when for=0) | Firing |
| Firing → Inactive | Resolved |
| Pending → Inactive | none (silent) |

Only transitions emit events — the evaluator does not duplicate-fire while a rule is stably firing.
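
The transition table can be written as a single pure step function. This is a sketch under stated assumptions: `AlertState`, `Emitted`, and `step` are hypothetical names, and time is passed in as a `Duration` since an arbitrary origin so the logic stays deterministic.

```rust
use std::time::Duration;

/// Per-rule state. `since` records when the condition first matched.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum AlertState {
    Inactive,
    Pending { since: Duration },
    Firing,
}

/// Event kinds emitted on transitions.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Emitted {
    Pending,
    Firing,
    Resolved,
}

/// Advance one rule by one tick. `now` is a monotonic timestamp expressed
/// as a `Duration` since an arbitrary origin.
pub fn step(
    state: AlertState,
    matched: bool,
    now: Duration,
    for_duration: Duration,
) -> (AlertState, Option<Emitted>) {
    match (state, matched) {
        // for=0 skips Pending entirely.
        (AlertState::Inactive, true) if for_duration.is_zero() => {
            (AlertState::Firing, Some(Emitted::Firing))
        }
        (AlertState::Inactive, true) => {
            (AlertState::Pending { since: now }, Some(Emitted::Pending))
        }
        // The match has held long enough: promote to Firing.
        (AlertState::Pending { since }, true) if now.saturating_sub(since) >= for_duration => {
            (AlertState::Firing, Some(Emitted::Firing))
        }
        (AlertState::Pending { since }, true) => (AlertState::Pending { since }, None),
        // Pending rules that stop matching go back silently.
        (AlertState::Pending { .. }, false) => (AlertState::Inactive, None),
        // Stable firing emits nothing, so there is no duplicate fire.
        (AlertState::Firing, true) => (AlertState::Firing, None),
        (AlertState::Firing, false) => (AlertState::Inactive, Some(Emitted::Resolved)),
        (AlertState::Inactive, false) => (AlertState::Inactive, None),
    }
}
```

Keeping the step function free of clocks and I/O makes each row of the table directly testable.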

Evaluation sequence

sequenceDiagram
    participant Tick as Tick loop
    participant EV as Evaluator
    participant MS as MetricSource
    participant N as Notifier

    Tick->>EV: evaluate()
    loop per rule
        EV->>MS: sample(metric)
        MS-->>EV: Some(MetricSample) | None
        EV->>EV: apply condition
        EV->>EV: update state machine
        alt state transition
            EV-->>Tick: AlertEvent
        end
    end
    Tick->>N: notify(event) per event
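
The sequence can be approximated with a small synchronous loop. This sketch deliberately simplifies: it supports only "greater than" threshold rules with no `for` duration, treats a missing sample as "no match" (an assumption), and uses a closure in place of the MetricSource trait. `SimpleRule`, `SimpleEvaluator`, and `Event` are hypothetical names.

```rust
use std::collections::HashMap;

/// Simplified rule: matches when `metric > threshold`.
pub struct SimpleRule {
    pub name: String,
    pub metric: String,
    pub threshold: f64,
}

/// Events are produced on state transitions only.
#[derive(Debug, Clone, PartialEq)]
pub enum Event {
    Firing(String),
    Resolved(String),
}

pub struct SimpleEvaluator {
    rules: Vec<SimpleRule>,
    /// Per-rule state keyed by rule name: true while firing.
    firing: HashMap<String, bool>,
}

impl SimpleEvaluator {
    pub fn new(rules: Vec<SimpleRule>) -> Self {
        Self { rules, firing: HashMap::new() }
    }

    /// One tick: sample each rule's metric, apply the condition, update the
    /// stored state, and collect an event for every transition.
    pub fn evaluate(&mut self, sample: impl Fn(&str) -> Option<f64>) -> Vec<Event> {
        let mut events = Vec::new();
        for rule in &self.rules {
            // Assumption: a missing sample never satisfies a threshold rule.
            let matched = sample(&rule.metric).map_or(false, |v| v > rule.threshold);
            let was_firing = self
                .firing
                .insert(rule.name.clone(), matched)
                .unwrap_or(false);
            match (was_firing, matched) {
                (false, true) => events.push(Event::Firing(rule.name.clone())),
                (true, false) => events.push(Event::Resolved(rule.name.clone())),
                _ => {}
            }
        }
        events
    }
}
```

The caller (the tick loop in the diagram) would then hand each returned event to its notifier, which keeps delivery failures out of the evaluation path.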

Notifiers shipped

| Notifier | Purpose |
| --- | --- |
| LogNotifier | Emit through tracing. Info → info!, Warning/Critical → warn!. Safe default. |
| CollectingNotifier | Bounded in-memory ring buffer. Powers dashboard "recent alerts" panels and tests. |
| CompositeNotifier | Best-effort fan-out so a single broken sink can't block others. |
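
Two of these behaviours, bounded eviction and best-effort fan-out, are easy to illustrate. The following is a synchronous sketch, not the PR's code: `RingNotifier`, `BrokenNotifier`, and `FanOut` are hypothetical stand-ins, a plain `&str` replaces AlertEvent, and the `Result` return type is an assumption.

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

/// Delivery seam; `&str` stands in for the PR's `AlertEvent`.
pub trait Notifier {
    fn notify(&self, event: &str) -> Result<(), String>;
}

/// Bounded buffer that evicts the oldest event once `capacity` is reached.
pub struct RingNotifier {
    capacity: usize,
    events: Mutex<VecDeque<String>>,
}

impl RingNotifier {
    pub fn with_capacity(capacity: usize) -> Self {
        Self { capacity, events: Mutex::new(VecDeque::with_capacity(capacity)) }
    }

    /// Snapshot of buffered events, oldest first.
    pub fn recent(&self) -> Vec<String> {
        self.events.lock().unwrap().iter().cloned().collect()
    }
}

impl Notifier for RingNotifier {
    fn notify(&self, event: &str) -> Result<(), String> {
        if self.capacity == 0 {
            return Ok(()); // nothing can be stored
        }
        let mut events = self.events.lock().unwrap();
        if events.len() == self.capacity {
            events.pop_front(); // drop the oldest entry
        }
        events.push_back(event.to_string());
        Ok(())
    }
}

/// A sink that always fails, used to demonstrate best-effort delivery.
pub struct BrokenNotifier;

impl Notifier for BrokenNotifier {
    fn notify(&self, _event: &str) -> Result<(), String> {
        Err("sink unavailable".to_string())
    }
}

/// Fan-out that tries every sink and reports how many failed.
pub struct FanOut<'a> {
    pub sinks: Vec<&'a dyn Notifier>,
}

impl FanOut<'_> {
    pub fn notify_all(&self, event: &str) -> usize {
        self.sinks
            .iter()
            .filter(|sink| sink.notify(event).is_err())
            .count()
    }
}
```

Counting failures instead of returning on the first error is what keeps one broken sink from starving the others.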

Usage

use std::sync::Arc;
use std::time::Duration;
use mofa_monitoring::alerts::{
    ComparisonOp, CollectingNotifier, Condition, Evaluator,
    InMemoryMetricSource, Rule, Severity,
};

let source = Arc::new(InMemoryMetricSource::new());
source.set("llm_error_rate", 0.07);

let rule = Rule::new(
    "high-error-rate",
    "LLM error rate above 5%",
    Severity::Warning,
    Condition::Threshold {
        metric: "llm_error_rate".into(),
        op: ComparisonOp::Gt,
        threshold: 0.05,
    },
)
.with_for(Duration::from_secs(120))
.with_label("team", "platform");

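let evaluator = Evaluator::new(vec![rule], source);
let sink = CollectingNotifier::with_capacity(256);

// Run inside an async context. With a 120 s `for` duration, the first tick
// reports the rule as Pending; Firing follows once the match has held for 120 s.
for event in evaluator.evaluate().await {
    sink.notify(&event).await;
}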

Architecture adherence

  • Lives under mofa_monitoring::alerts as a self-contained module.
  • No cross-crate changes — depends only on existing workspace dependencies (tokio, serde, tracing, async-trait).
  • All public enums are #[non_exhaustive].
  • Rule, Condition, AlertEvent serialise via serde so rules can ship in configuration files and events can stream over the existing dashboard WS channel.
  • #[must_use] on every builder method.
  • Evaluator state is stored in a keyed map guarded by a Mutex; the contract document explains the shard-across-instances scaling pattern.
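
As an illustration of the attribute conventions above, here is a std-only sketch of a builder with `#[must_use]` on each chained method and a `#[non_exhaustive]` enum. `RuleSketch` is a hypothetical, trimmed-down stand-in for Rule; the serde derives are omitted so the example has no external dependencies, and deriving `Ord` for severity ordering is an assumption.

```rust
use std::collections::HashMap;
use std::time::Duration;

/// `#[non_exhaustive]` lets new variants be added later without breaking
/// downstream `match` expressions. Derived ordering follows declaration
/// order: Info < Warning < Critical.
#[non_exhaustive]
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Severity {
    Info,
    Warning,
    Critical,
}

/// Trimmed-down stand-in for the PR's `Rule`.
#[derive(Debug, Clone)]
pub struct RuleSketch {
    pub name: String,
    pub severity: Severity,
    pub for_duration: Duration,
    pub labels: HashMap<String, String>,
}

impl RuleSketch {
    pub fn new(name: &str, severity: Severity) -> Self {
        Self {
            name: name.to_string(),
            severity,
            for_duration: Duration::ZERO,
            labels: HashMap::new(),
        }
    }

    /// `#[must_use]` makes the compiler warn if the returned rule is dropped,
    /// which catches `rule.with_for(..);` written as a bare statement.
    #[must_use]
    pub fn with_for(mut self, for_duration: Duration) -> Self {
        self.for_duration = for_duration;
        self
    }

    #[must_use]
    pub fn with_label(mut self, key: &str, value: &str) -> Self {
        self.labels.insert(key.to_string(), value.to_string());
        self
    }
}
```

Because each builder method consumes and returns `self`, forgetting to rebind the result silently discards the change; the attribute turns that mistake into a warning.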

Test coverage

37 tests pass, covering:

  • Rule model: builder invariants, primary-metric extraction, JSON round-trip, severity ordering, display/string consistency.
  • Comparison ops: every operator against equal, above, below, and epsilon-boundary values.
  • Event model: state display, firing/resolved disjointness, JSON round-trip, summary formatting with and without observed value.
  • Metric source: read/write, overwrite, forget, set-at-timestamp, missing-returns-None.
  • Evaluator state transitions: immediate fire (for=0), pending-to-firing soak, resolved emission, pending-to-inactive silent, no-double-fire across stable ticks.
  • Absent-rule semantics: fires on missing, fires on stale, silent on fresh.
  • Rate-of-change rule: fires when derivative exceeds threshold.
  • Notifier family: collecting buffer capacity eviction, composite fan-out, log notifier infallibility, clear-and-reuse.

Follow-ups

  • MetricsCollector adapter — bridge the existing in-process collector into MetricSource so agents emit directly into alerts.
  • Prometheus scrape adapter — let the evaluator consume the exported metrics endpoint (useful for cross-process deployments).
  • Webhook / Slack / PagerDuty notifiers.
  • YAML rule-file loader so rules live in config rather than code.
  • Dashboard REST route /api/alerts/recent backed by a CollectingNotifier.

Test plan

  • cargo check -p mofa-monitoring
  • cargo test -p mofa-monitoring alerts --lib — 37/37 pass
  • cargo test --workspace (CI)
  • cargo clippy (CI)
