feat(monitoring): declarative alert rules engine #1663
Open
Nixxx19 wants to merge 1 commit into
Summary
Adds a declarative alert-rules engine to `mofa-monitoring`. Plugs into the existing `MetricsCollector` / Prometheus / OpenTelemetry stack as a consumer: the evaluation loop reads metrics, runs rules, and fans events through pluggable notifiers.

Closes the SLO/alerting gap in the dashboard: today `mofa-monitoring` exports metrics to Prometheus but has no first-class way to express "page me when LLM error rate exceeds 5% for two minutes".

Architecture
    flowchart LR
      subgraph Sources [Metric sources]
        MC[MetricsCollector]
        PR[Prometheus scrape]
        IM[InMemoryMetricSource<br/>test fixture]
      end
      subgraph Engine [Alerts engine]
        RS[Rule set]
        EV[Evaluator]
      end
      subgraph Sinks [Notifiers]
        LN[LogNotifier]
        CN[CollectingNotifier]
        CO[CompositeNotifier]
      end
      MC -->|MetricSource| EV
      PR -->|MetricSource| EV
      IM -->|MetricSource| EV
      RS --> EV
      EV -->|AlertEvent| CO
      CO --> LN
      CO --> CN

Both edges are trait-parameterised: `MetricSource` abstracts the read side; `Notifier` abstracts delivery. Downstream integrations (webhook, Slack, PagerDuty, YAML rule loader) plug in without changing the evaluator.

Rule model
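As a reading aid for the rule model in this section, here is a minimal std-only sketch of how the diagrammed types might look in Rust. The type and field names of `Rule`, `Severity`, `Condition`, and `ComparisonOp` follow the class diagram; the payload of each `Condition` variant, the `condition_matches` helper, the `example_rule` constructor, and the metric name `llm_error_rate` are assumptions made for illustration and are not taken from the PR's code.

```rust
use std::collections::HashMap;
use std::time::Duration;

/// Severity levels, as listed in the class diagram.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Severity { Info, Warning, Critical }

/// Comparison operators, as listed in the class diagram.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ComparisonOp { Gt, Gte, Lt, Lte, Eq, Neq }

impl ComparisonOp {
    /// Evaluates `value OP threshold`.
    pub fn compare(self, value: f64, threshold: f64) -> bool {
        match self {
            ComparisonOp::Gt => value > threshold,
            ComparisonOp::Gte => value >= threshold,
            ComparisonOp::Lt => value < threshold,
            ComparisonOp::Lte => value <= threshold,
            ComparisonOp::Eq => value == threshold,
            ComparisonOp::Neq => value != threshold,
        }
    }
}

/// Condition families. The fields of each variant are assumed.
#[derive(Debug, Clone, PartialEq)]
pub enum Condition {
    Threshold { metric: String, op: ComparisonOp, threshold: f64 },
    RateOfChange { metric: String, op: ComparisonOp, threshold: f64 },
    Absent { metric: String },
}

/// Rule fields, as listed in the class diagram.
#[derive(Debug, Clone, PartialEq)]
pub struct Rule {
    pub name: String,
    pub description: String,
    pub severity: Severity,
    pub condition: Condition,
    pub for_duration: Duration,
    pub labels: HashMap<String, String>,
    pub annotations: HashMap<String, String>,
}

/// Hypothetical helper: does the condition match, given the latest value
/// and an already-computed rate of change (either may be missing)?
pub fn condition_matches(condition: &Condition, value: Option<f64>, rate: Option<f64>) -> bool {
    match condition {
        Condition::Threshold { op, threshold, .. } => {
            value.map_or(false, |v| op.compare(v, *threshold))
        }
        Condition::RateOfChange { op, threshold, .. } => {
            rate.map_or(false, |r| op.compare(r, *threshold))
        }
        // Assumed semantics: `Absent` matches when no sample is available.
        Condition::Absent { .. } => value.is_none(),
    }
}

/// "Page me when LLM error rate exceeds 5% for two minutes", as a rule value.
pub fn example_rule() -> Rule {
    Rule {
        name: "llm_error_rate_high".to_string(),
        description: "LLM error rate above 5% for two minutes".to_string(),
        severity: Severity::Critical,
        condition: Condition::Threshold {
            metric: "llm_error_rate".to_string(), // hypothetical metric name
            op: ComparisonOp::Gt,
            threshold: 0.05,
        },
        for_duration: Duration::from_secs(120),
        labels: HashMap::new(),
        annotations: HashMap::new(),
    }
}
```

Passing the rate of change in as a precomputed `Option<f64>` keeps the sketch independent of how the engine actually derives it, which the description does not specify.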
    classDiagram
      class Rule {
        +name: String
        +description: String
        +severity: Severity
        +condition: Condition
        +for_duration: Duration
        +labels: HashMap
        +annotations: HashMap
      }
      class Severity {
        <<enumeration>>
        Info
        Warning
        Critical
      }
      class Condition {
        <<enumeration>>
        Threshold
        RateOfChange
        Absent
      }
      class ComparisonOp {
        <<enumeration>>
        Gt
        Gte
        Lt
        Lte
        Eq
        Neq
      }
      Rule --> Severity
      Rule --> Condition
      Condition --> ComparisonOp

Condition families:

- `Threshold`: `metric OP threshold` (e.g. `error_rate > 0.05`).
- `RateOfChange`: … `OP threshold`.
- `Absent`

State machine (Prometheus-compatible)
    stateDiagram-v2
      [*] --> Inactive
      Inactive --> Firing: match && for_duration == 0
      Inactive --> Pending: match && for_duration > 0
      Pending --> Firing: match && elapsed >= for_duration
      Pending --> Inactive: no match (silent)
      Firing --> Firing: match
      Firing --> Inactive: no match (emits Resolved)

- Inactive → Pending: emits `Pending`
- Pending → Firing: emits `Firing`
- Inactive → Firing (when `for=0`): emits `Firing`
- Firing → Inactive: emits `Resolved`
- Pending → Inactive: no event (silent)

Only transitions emit events; the evaluator does not duplicate-fire while a rule is stably firing.
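The transitions above can be sketched as a pure step function. This is a minimal illustration under stated assumptions, not the PR's implementation: `AlertState`, `Transition`, and `step` are hypothetical names, and the sketch assumes the pending timer starts at the first matching tick.

```rust
use std::time::{Duration, Instant};

/// Per-rule state. `Pending` remembers when the condition first matched.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum AlertState {
    Inactive,
    Pending { since: Instant },
    Firing,
}

/// Event kinds emitted on a state transition.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Transition { Pending, Firing, Resolved }

/// Advances one rule by one tick and returns the new state plus the event
/// to emit, if any. Stable states and Pending -> Inactive emit nothing.
pub fn step(
    state: AlertState,
    matched: bool,
    for_duration: Duration,
    now: Instant,
) -> (AlertState, Option<Transition>) {
    match (state, matched) {
        // for_duration == 0: skip Pending and fire immediately.
        (AlertState::Inactive, true) if for_duration.is_zero() => {
            (AlertState::Firing, Some(Transition::Firing))
        }
        (AlertState::Inactive, true) => {
            (AlertState::Pending { since: now }, Some(Transition::Pending))
        }
        (AlertState::Inactive, false) => (AlertState::Inactive, None),
        (AlertState::Pending { since }, true) => {
            if now.duration_since(since) >= for_duration {
                (AlertState::Firing, Some(Transition::Firing))
            } else {
                // Still soaking: stay Pending, emit nothing.
                (AlertState::Pending { since }, None)
            }
        }
        // Silent reset: the condition cleared before the soak elapsed.
        (AlertState::Pending { .. }, false) => (AlertState::Inactive, None),
        // Stably firing: no duplicate event.
        (AlertState::Firing, true) => (AlertState::Firing, None),
        (AlertState::Firing, false) => (AlertState::Inactive, Some(Transition::Resolved)),
    }
}
```

Taking `now` as a parameter instead of reading the clock inside `step` keeps the function deterministic, so the soak behaviour can be tested by adding offsets to a fixed `Instant`.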
Evaluation sequence
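The evaluation sequence in this section (tick, sample each rule's metric, apply the condition, emit an event only on a state change, then notify) could be sketched roughly as follows. This is a deliberately simplified, synchronous, std-only illustration: the trait names `MetricSource` and `Notifier` and the `sample`, `evaluate`, and `notify` calls come from the description, but every signature here is an assumption, as are `SimpleRule`, `InMemorySource`, `Collecting`, and `tick`. The sketch only handles a `>` threshold with no `for` duration and treats a missing sample as "no match".

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
pub struct MetricSample { pub value: f64 }

/// Read side: anything that can produce the latest sample for a metric.
pub trait MetricSource {
    fn sample(&self, metric: &str) -> Option<MetricSample>;
}

#[derive(Debug, Clone, PartialEq)]
pub enum AlertEvent {
    Firing { rule: String },
    Resolved { rule: String },
}

/// Delivery side: anything that can receive an event.
pub trait Notifier {
    fn notify(&mut self, event: &AlertEvent);
}

/// Hypothetical reduced rule: fires while `metric > threshold`.
pub struct SimpleRule { pub name: String, pub metric: String, pub threshold: f64 }

pub struct Evaluator { rules: Vec<SimpleRule>, firing: Vec<bool> }

impl Evaluator {
    pub fn new(rules: Vec<SimpleRule>) -> Self {
        let firing = vec![false; rules.len()];
        Self { rules, firing }
    }

    /// One pass over all rules; returns events for state changes only.
    pub fn evaluate(&mut self, source: &dyn MetricSource) -> Vec<AlertEvent> {
        let mut events = Vec::new();
        for (rule, firing) in self.rules.iter().zip(self.firing.iter_mut()) {
            let matched = source
                .sample(&rule.metric)
                .map_or(false, |s| s.value > rule.threshold);
            if matched != *firing {
                *firing = matched;
                let name = rule.name.clone();
                events.push(if matched {
                    AlertEvent::Firing { rule: name }
                } else {
                    AlertEvent::Resolved { rule: name }
                });
            }
        }
        events
    }
}

/// One tick: evaluate, then notify once per event. Returns the event count.
pub fn tick(ev: &mut Evaluator, source: &dyn MetricSource, notifier: &mut dyn Notifier) -> usize {
    let events = ev.evaluate(source);
    for event in &events {
        notifier.notify(event);
    }
    events.len()
}

/// Stand-in for an in-memory test fixture.
pub struct InMemorySource(pub HashMap<String, f64>);

impl MetricSource for InMemorySource {
    fn sample(&self, metric: &str) -> Option<MetricSample> {
        self.0.get(metric).map(|v| MetricSample { value: *v })
    }
}

/// Stand-in notifier that stores every event it receives.
#[derive(Default)]
pub struct Collecting(pub Vec<AlertEvent>);

impl Notifier for Collecting {
    fn notify(&mut self, event: &AlertEvent) {
        self.0.push(event.clone());
    }
}
```

Because the evaluator only sees `&dyn MetricSource` and the tick loop only sees `&mut dyn Notifier`, either side can be swapped without touching `evaluate`, which is the property the description attributes to the trait-parameterised edges.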
    sequenceDiagram
      participant Tick as Tick loop
      participant EV as Evaluator
      participant MS as MetricSource
      participant N as Notifier
      Tick->>EV: evaluate()
      loop per rule
        EV->>MS: sample(metric)
        MS-->>EV: Some(MetricSample) | None
        EV->>EV: apply condition
        EV->>EV: update state machine
        alt state transition
          EV-->>Tick: AlertEvent
        end
      end
      Tick->>N: notify(event) per event

Notifiers shipped
- `LogNotifier`: … `tracing`. Info → `info!`, Warning/Critical → `warn!`. Safe default.
- `CollectingNotifier`
- `CompositeNotifier`

Usage
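The following is a hedged, std-only sketch of how the three shipped notifiers might compose; it is not the crate's actual API. The names `LogNotifier`, `CollectingNotifier`, `CompositeNotifier`, `Notifier`, `AlertEvent`, and `Severity` come from the description, while every signature, the `log_level` helper, and the `events` accessor are assumptions. The real `LogNotifier` is described as going through `tracing`; `eprintln!` stands in for it here so the example has no dependencies.

```rust
use std::sync::{Arc, Mutex};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Severity { Info, Warning, Critical }

#[derive(Debug, Clone, PartialEq)]
pub struct AlertEvent { pub rule: String, pub severity: Severity }

pub trait Notifier {
    fn notify(&self, event: &AlertEvent);
}

/// Severity-to-level mapping from the description:
/// Info logs at info level, Warning and Critical at warn level.
pub fn log_level(severity: Severity) -> &'static str {
    match severity {
        Severity::Info => "INFO",
        Severity::Warning | Severity::Critical => "WARN",
    }
}

/// Stand-in for the tracing-backed notifier; writes to stderr instead.
pub struct LogNotifier;

impl Notifier for LogNotifier {
    fn notify(&self, event: &AlertEvent) {
        eprintln!("[{}] alert rule={}", log_level(event.severity), event.rule);
    }
}

/// Stores events in shared memory so a clone can read them back later.
#[derive(Clone, Default)]
pub struct CollectingNotifier { events: Arc<Mutex<Vec<AlertEvent>>> }

impl CollectingNotifier {
    /// Snapshot of everything collected so far.
    pub fn events(&self) -> Vec<AlertEvent> {
        self.events.lock().unwrap().clone()
    }
}

impl Notifier for CollectingNotifier {
    fn notify(&self, event: &AlertEvent) {
        self.events.lock().unwrap().push(event.clone());
    }
}

/// Fans each event out to every child notifier, in order.
pub struct CompositeNotifier { children: Vec<Box<dyn Notifier>> }

impl CompositeNotifier {
    pub fn new(children: Vec<Box<dyn Notifier>>) -> Self {
        Self { children }
    }
}

impl Notifier for CompositeNotifier {
    fn notify(&self, event: &AlertEvent) {
        for child in &self.children {
            child.notify(event);
        }
    }
}
```

In this sketch, a caller would build a `CompositeNotifier` from a boxed `LogNotifier` and a clone of a `CollectingNotifier`, call `notify` on the composite, and then read the retained events back through the original `CollectingNotifier` handle.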
Architecture adherence
- `mofa-monitoring::alerts` as a self-contained module.
- … `tokio`, `serde`, `tracing`, `async-trait`).
- … `#[non_exhaustive]`.
- `Rule`, `Condition`, `AlertEvent` serialise via `serde` so rules can ship in configuration files and events can stream over the existing dashboard WS channel.
- `#[must_use]` on every builder method.
- … `Mutex`; the contract document explains the shard-across-instances scaling pattern.

Test coverage
37 tests pass covering:
- … `for=0`), pending-to-firing soak, resolved emission, pending-to-inactive silent, no-double-fire across stable ticks.

Follow-ups
- `MetricsCollector` adapter: bridge the existing in-process collector into `MetricSource` so agents emit directly into alerts.
- … `/api/alerts/recent` backed by a `CollectingNotifier`.

Test plan
- `cargo check -p mofa-monitoring`
- `cargo test -p mofa-monitoring alerts --lib`: 37/37 pass
- `cargo test` workspace (CI)
- `cargo clippy` (CI)