diff --git a/docs/alert-rule-classification.md b/docs/alert-rule-classification.md new file mode 100644 index 000000000..ec42704f8 --- /dev/null +++ b/docs/alert-rule-classification.md @@ -0,0 +1,254 @@ +# Alert Rule Classification - Design and Usage + +## Overview +The backend classifies Prometheus alerting rules into a "component" and an "impact layer". It: +- Computes an `openshift_io_alert_rule_id` per alerting rule. +- Determines component/layer based on matcher logic and rule labels. +- Allows operator-managed classification overrides via AlertRelabelConfigs (ARCs) for platform + rules. Operator-managed classification overrides of user-defined workload rules require the `ENABLE_USER_WORKLOAD_ARCS` feature flag. +- Enriches the Alerts API response with `openshift_io_alert_rule_id`, `openshift_io_alert_component`, and `openshift_io_alert_layer`. + +This document explains how it works, how to override, and how to test it. + + +## Terminology +- openshift_io_alert_rule_id: Identifier for an alerting rule. Computed from a canonicalized view of the rule definition and encoded as `rid_` + base64url(nopad(sha256(payload))). Independent of `PrometheusRule` name. +- component: Logical owner of the alert (e.g., `kube-apiserver`, `etcd`, a namespace, etc.). +- layer: Impact scope. Allowed values: + - `cluster` + - `namespace` + +Notes: +- **Stability**: + - The id is **always derived from the rule spec**. If the rule definition changes (expr/for/business labels/name), the id may change. + - For **platform rules**, this API currently only supports label updates via `AlertRelabelConfig` (not editing expr/for), so the id is effectively stable unless the upstream operator changes the rule definition. + - For **user-defined rules**, the API stamps the computed id into the `PrometheusRule` rule labels. If you update the rule definition, the API returns the **new** id and migrates any existing classification override to the new id. 
+- Layer values are validated as `cluster|namespace` when set. To remove an override, set the field to `null` via the API; empty/invalid values are ignored at read time. + +## Rule ID computation (openshift_io_alert_rule_id) +Location: `pkg/alert_rule/alert_rule.go` + +The backend computes a specHash-like value from: +- `kind`/`name`: `alert` + `alert:` name or `record` + `record:` name +- `expr`: trimmed with consecutive whitespace collapsed +- `for`: trimmed (duration string as written in the rule) +- `labels`: only non-system labels + - excludes labels with `openshift_io_` prefix and the `alertname` label + - drops empty values + - keeps only valid Prometheus label names (`[a-zA-Z_][a-zA-Z0-9_]*`) + - sorted by key and joined as `key=value` lines + +Annotations are intentionally ignored to reduce id churn on documentation-only changes. + +## Classification Logic (How component/layer are determined) +Location: `pkg/alertcomponent/matcher.go` + +1) The code adapts `cluster-health-analyzer` matchers: + - CVO-related alerts (update/upgrade) → component/layer based on known patterns + - Compute / node-related alerts + - Core control plane components (renamed to layer `cluster`) + - Workload/namespace-level alerts (renamed to layer `namespace`) + +2) Fallback: + - If the computed component is empty or "Others", we set: + - `component = other` + - `layer` derived from source: + - `openshift_io_alert_source=platform` → `cluster` + - `openshift_io_prometheus_rule_namespace=openshift-monitoring` → `cluster` + - `prometheus` label starting with `openshift-monitoring/` → `cluster` + - otherwise → `namespace` + +3) Result: + - Each alerting rule is assigned a `(component, layer)` tuple following the above logic. 
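The fallback order in step 2 can be sketched as a small function. This is a minimal illustration that assumes the alert's labels arrive as a plain map; `fallbackLayer` is a hypothetical name and not the function in `pkg/alertcomponent/matcher.go`.

```go
package main

import (
	"fmt"
	"strings"
)

// fallbackLayer applies the documented fallback checks in order and returns
// "cluster" or "namespace". Hypothetical helper, for illustration only.
func fallbackLayer(labels map[string]string) string {
	if labels["openshift_io_alert_source"] == "platform" {
		return "cluster"
	}
	if labels["openshift_io_prometheus_rule_namespace"] == "openshift-monitoring" {
		return "cluster"
	}
	if strings.HasPrefix(labels["prometheus"], "openshift-monitoring/") {
		return "cluster"
	}
	// No platform signal found: treat the alert as workload-scoped.
	return "namespace"
}

func main() {
	fmt.Println(fallbackLayer(map[string]string{"prometheus": "openshift-monitoring/k8s"}))
	fmt.Println(fallbackLayer(map[string]string{"namespace": "my-app"}))
}
```

The checks are ordered so that any single platform signal is enough to select `cluster`; only alerts with none of them fall through to `namespace`.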
+ +## Developer Overrides via Rule Labels (Recommended) +If you want explicit component/layer values and do not want to rely on the matcher, set +these labels on each rule in your `PrometheusRule`: +- `openshift_io_alert_rule_component` +- `openshift_io_alert_rule_layer` + +Both are validated the same way as API overrides: +- `component`: 1-253 chars, alphanumeric + `._-`, must start/end alphanumeric +- `layer`: `cluster` or `namespace` + +When these labels are present and valid, they override matcher-derived values. + +## Classification Override Storage + +Location: `pkg/management/update_classification.go`, `pkg/management/get_alerts.go` + +Classification overrides are stored differently depending on the rule type: + +### Platform rules → AlertRelabelConfig (ARC) + +For operator-managed platform rules (rules whose `PrometheusRule` is registered as a +platform resource), overrides are stored in an `AlertRelabelConfig` (ARC) CR in the +`openshift-monitoring` namespace. + +- **ARC naming**: `arc--` + (generated by `k8s.GetAlertRelabelConfigName`) +- **ARC namespace**: `openshift-monitoring` +- **Shared ARC**: classification labels are written into the same ARC that the platform + alert management path uses for other label changes (severity, Drop/Restore). This avoids + creating separate CRs per concern. +- **Labels on the ARC**: + - `monitoring.openshift.io/prometheus-rule-name`: name of the source `PrometheusRule` + - `monitoring.openshift.io/alert-name`: alert name +- **Annotation on the ARC**: + - `monitoring.openshift.io/alert-rule-id`: the `openshift_io_alert_rule_id` + +The ARC contains `RelabelConfig` entries that: +1. Match the rule by its original labels (alert name + all non-namespace labels) and + stamp `openshift_io_alert_rule_id` via a `Replace` action. +2. Apply each classification label as a `Replace` action keyed on `openshift_io_alert_rule_id`. + +When all overrides are removed, the ARC is deleted. 
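The first relabel entry (matching the rule by its original labels) relies on an anchored regex over the `;`-joined source label values. The sketch below mirrors the approach of `buildRelabelConfigs` in this change using plain strings instead of the `osmv1` types; `matchPattern` is a hypothetical name, and `alertname` is assumed to be the value of `managementlabels.AlertNameLabel`.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
	"strings"
)

// matchPattern returns the ordered source labels and the anchored regex that
// matches exactly one rule: alert name first, then every non-namespace label
// in sorted key order. Values are escaped so they match literally.
func matchPattern(alertName string, labels map[string]string) ([]string, string) {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		if k == "namespace" {
			continue // namespace is excluded from the match
		}
		keys = append(keys, k)
	}
	sort.Strings(keys)

	source := []string{"alertname"} // assumed value of AlertNameLabel
	values := []string{alertName}
	for _, k := range keys {
		source = append(source, k)
		values = append(values, labels[k])
	}
	return source, "^" + regexp.QuoteMeta(strings.Join(values, ";")) + "$"
}

func main() {
	source, pattern := matchPattern("KubePodCrashLooping", map[string]string{
		"severity":  "warning",
		"namespace": "openshift-etcd",
	})
	fmt.Println(source, pattern)
}
```

Sorting the keys keeps the source label list and the pattern stable across calls, since Go map iteration order is not deterministic.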
+ +**AlertingRule CR distinction:** Some platform alerts are defined via `AlertingRule` CRs, +which the cluster-monitoring-operator reconciles into `PrometheusRule` resources. When +the owning `AlertingRule` CR is operator-managed (has operator owner references), the +backend cannot modify it directly (the operator would reconcile the change back). In +this case, label updates are applied through an ARC instead. When the `AlertingRule` CR +is not externally managed, label updates are written directly into the CR. Classification +overrides always use the ARC path regardless of the `AlertingRule` management status. + +### User-defined workload rules → blocked by default, ARC when enabled + +Classification updates for operator-managed user-defined workload rules are **not +allowed by default**. The API returns a `NotAllowedError` when the feature flag is +disabled. + +### Feature flag: `ENABLE_USER_WORKLOAD_ARCS` + +Setting the environment variable `ENABLE_USER_WORKLOAD_ARCS=true` enables full +alert management for operator-managed user-defined workload rules, including +classification overrides, label updates, and rule disable/enable (Drop/Restore). +When enabled, these rules use the same ARC-based path as platform rules, with +ARCs stored in the `openshift-user-workload-monitoring` namespace. + +### Dynamic classification (`_from` labels) + +Two special labels allow deriving component/layer dynamically from the alert itself +at query time: +- `openshift_io_alert_rule_component_from`: name of an alert label whose value + becomes the component (e.g., `"name"` → use the alert's `name` label). +- `openshift_io_alert_rule_layer_from`: same pattern for layer. + +These `_from` labels are stored in the ARC alongside static classification labels. +At read time, `ApplyDynamicClassification` resolves them against the alert's labels. + +### Read path + +The read path is unified regardless of storage mechanism: +1. 
The relabeled rules cache (`k8s.RelabeledRules().Get`) returns each rule with all + ARC relabel configs already applied. This means classification labels (whether set + via ARC or directly on the `PrometheusRule`) are available as rule labels. +2. `ApplyDynamicClassification` checks for `_from` labels on the relabeled rule and + resolves them against the alert's own labels to produce the final component/layer. + +Notes: +- `_from` values must be valid Prometheus label names (`[a-zA-Z_][a-zA-Z0-9_]*`). +- If a `_from` label is present but the alert does not carry that label or the derived + value is invalid, the backend falls back to static values (if present) or defaults. +- If all overrides are removed, the ARC is deleted. + + +## Alerts API Enrichment +Location: `pkg/management/get_alerts.go`, `pkg/k8s/prometheus_alerts.go` + +- Endpoint: `GET /api/v1/alerting/alerts` (Prometheus-compatible schema) +- The backend fetches active alerts and enriches each alert with: + - `openshift_io_alert_rule_id` + - `openshift_io_alert_component` + - `openshift_io_alert_layer` + - `prometheusRuleName`: name of the PrometheusRule resource the alert originates from + - `prometheusRuleNamespace`: namespace of that PrometheusRule resource + - `alertingRuleName`: name of the AlertingRule CR that generated the PrometheusRule (empty when the PrometheusRule is not owned by an AlertingRule CR) +- Prometheus compatibility: + - Base response matches Prometheus `/api/v1/alerts`. + - The enrichment fields are purely additive, so they are safe for clients like Perses that expect the Prometheus schema. 
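The `_from` precedence described in the read-path notes (dynamic value first, then the static label, then the default) can be sketched as follows. `ApplyDynamicClassification` itself is not part of this excerpt, so `resolveComponent` is a hypothetical stand-in; the value check reuses the component rules from `pkg/classification/validation.go`.

```go
package main

import (
	"fmt"
	"regexp"
)

// componentRe matches the component value rule: alphanumeric plus "._-",
// starting and ending with an alphanumeric character.
var componentRe = regexp.MustCompile(`^[A-Za-z0-9]([A-Za-z0-9_.-]*[A-Za-z0-9])?$`)

// resolveComponent is a hypothetical sketch of the documented precedence.
// ruleLabels are the relabeled rule labels, alertLabels the alert's own labels,
// and fallback the matcher-derived default.
func resolveComponent(ruleLabels, alertLabels map[string]string, fallback string) string {
	if from := ruleLabels["openshift_io_alert_rule_component_from"]; from != "" {
		// Use the dynamic value only when the alert carries a valid one.
		if v := alertLabels[from]; v != "" && len(v) <= 253 && componentRe.MatchString(v) {
			return v
		}
	}
	if v := ruleLabels["openshift_io_alert_rule_component"]; v != "" {
		return v
	}
	return fallback
}

func main() {
	rule := map[string]string{
		"openshift_io_alert_rule_component_from": "name",
		"openshift_io_alert_rule_component":      "team-x",
	}
	fmt.Println(resolveComponent(rule, map[string]string{"name": "etcd"}, "other"))
	fmt.Println(resolveComponent(rule, map[string]string{}, "other"))
}
```

The same pattern would apply to the layer, with the derived value checked against `cluster|namespace` instead of the component rule.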
+ +## Prometheus/Thanos Sources +Location: `pkg/k8s/prometheus_alerts.go` + +- Order of candidates: + 1) Thanos Route `thanos-querier` at `/api` + `/v1/alerts` (oauth-proxied) + 2) In-cluster Thanos service `https://thanos-querier.openshift-monitoring.svc:9091/api/v1/alerts` + 3) In-cluster Prometheus `https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/alerts` + 4) In-cluster Prometheus (plain HTTP) `http://prometheus-k8s.openshift-monitoring.svc:9090/api/v1/alerts` (fallback) + 5) Prometheus Route `prometheus-k8s` at `/api/v1/alerts` + +- TLS and Auth: + - Bearer token: service account token from in-cluster config. + - CA trust: system pool + `SSL_CERT_FILE` + `/var/run/configmaps/service-ca/service-ca.crt`. + +RBAC: +- Read routes in `openshift-monitoring`. +- Access `prometheuses/api` as needed for oauth-proxied endpoints. + +## Updating Rules Classification +APIs: +- Single update: + - Method: `PATCH /api/v1/alerting/rules/{ruleId}` + - Request body: + ```json + { + "classification": { + "openshift_io_alert_rule_component": "team-x", + "openshift_io_alert_rule_layer": "namespace", + "openshift_io_alert_rule_component_from": "name", + "openshift_io_alert_rule_layer_from": "layer" + } + } + ``` + - `openshift_io_alert_rule_layer`: `cluster` or `namespace` + - To remove a classification override, set the field to `null` (e.g. `"openshift_io_alert_rule_layer": null`). + - Response: + - 200 OK with a status payload (same format as other rule PATCH responses), where `status_code` is 204 on success. + - Standard error body on failure (400 validation, 404 not found, etc.) +- Bulk update: + - Method: `PATCH /api/v1/alerting/rules` + - Request body: + ```json + { + "ruleIds": ["", ""], + "classification": { + "openshift_io_alert_rule_component": "etcd", + "openshift_io_alert_rule_layer": "cluster" + } + } + ``` + - Response: + - 200 OK with per-rule results (same format as other bulk rule PATCH responses). Clients should handle partial failures. 
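On the client side, the difference between omitting a field (leave unchanged) and sending `null` (clear the override) has to survive JSON encoding. A minimal sketch, assuming the client builds the body from a map of nullable strings; `classificationPatchBody` is a hypothetical helper and not part of this change.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// classificationPatchBody encodes a PATCH body. Keys absent from fields are
// omitted from the JSON (unchanged), keys mapped to nil encode as null (clear),
// and keys mapped to a string pointer encode as that string (set).
func classificationPatchBody(fields map[string]*string) ([]byte, error) {
	return json.Marshal(map[string]map[string]*string{"classification": fields})
}

func main() {
	component := "team-x"
	body, err := classificationPatchBody(map[string]*string{
		"openshift_io_alert_rule_component": &component, // set
		"openshift_io_alert_rule_layer":     nil,        // clear
		// the *_from keys are omitted, so they stay unchanged
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```

A struct with `*string` fields and `omitempty` would not work here, because a nil pointer would be dropped from the output instead of being encoded as `null`.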
+ +Direct K8s (supported for power users/GitOps): +- For platform rules: create or update the `AlertRelabelConfig` CR in `openshift-monitoring` + with the appropriate relabel configs (respect `resourceVersion` for optimistic concurrency). +- For user-defined rules (requires `ENABLE_USER_WORKLOAD_ARCS=true`): create or update the + `AlertRelabelConfig` CR in `openshift-user-workload-monitoring`. +- UI should check update permissions with SelfSubjectAccessReview before showing an editor. + +Notes: +- These endpoints are intended for updating **classification only** (component/layer overrides), + with permissions enforced based on the rule's ownership (platform, user workload, operator-managed, + GitOps-managed). +- To update other rule fields (expr/labels/annotations/etc.), use `PATCH /api/v1/alerting/rules/{ruleId}`. + Clients that need to update both should issue two requests. The combined operation is not atomic. + +## Security Notes +- Classification overrides are stored in AlertRelabelConfig CRs (`openshift-monitoring` + for platform rules, `openshift-user-workload-monitoring` for user-defined rules when + enabled), subject to standard Kubernetes RBAC. +- No secrets or sensitive data are persisted in classification metadata. + +## Testing and Ops +Unit tests: +- `pkg/management/update_classification_test.go` + - ARC-based classification for platform rules, blocked-by-default for user-defined + rules, ARC in user-workload namespace when flag enabled, dynamic `_from` label resolution. +- `pkg/management/get_alerts_test.go` + - Alert enrichment with classification labels, `_from` label behavior, fallback behavior. + +## Future Work +- Optional composite update API if we need to update rule fields and classification atomically. +- De-duplication/merge logic when aggregating alerts across sources. 
diff --git a/internal/managementrouter/alert_rule_classification_patch.go b/internal/managementrouter/alert_rule_classification_patch.go new file mode 100644 index 000000000..6a64f6a2a --- /dev/null +++ b/internal/managementrouter/alert_rule_classification_patch.go @@ -0,0 +1,65 @@ +package managementrouter + +import "encoding/json" + +// AlertRuleClassificationPatch represents a partial update ("patch") payload for +// alert rule classification labels. +// +// This type supports a three-state contract per field: +// - omitted: leave unchanged +// - null: clear the override +// - string: set the override +// +// Note: Go's encoding/json cannot represent "explicit null" vs "omitted" using **string +// (both decode to nil), so we custom-unmarshal and track key presence with *Set flags. +type AlertRuleClassificationPatch struct { + Component *string `json:"openshift_io_alert_rule_component,omitempty"` + ComponentSet bool `json:"-"` + Layer *string `json:"openshift_io_alert_rule_layer,omitempty"` + LayerSet bool `json:"-"` + ComponentFrom *string `json:"openshift_io_alert_rule_component_from,omitempty"` + ComponentFromSet bool `json:"-"` + LayerFrom *string `json:"openshift_io_alert_rule_layer_from,omitempty"` + LayerFromSet bool `json:"-"` +} + +func (p *AlertRuleClassificationPatch) UnmarshalJSON(b []byte) error { + var m map[string]json.RawMessage + if err := json.Unmarshal(b, &m); err != nil { + return err + } + + decodeNullableString := func(key string) (set bool, v *string, err error) { + raw, ok := m[key] + if !ok { + return false, nil, nil + } + if len(raw) == 0 || string(raw) == "null" { + return true, nil, nil + } + var s string + if err := json.Unmarshal(raw, &s); err != nil { + return true, nil, err + } + return true, &s, nil + } + + var err error + p.ComponentSet, p.Component, err = decodeNullableString("openshift_io_alert_rule_component") + if err != nil { + return err + } + p.LayerSet, p.Layer, err = decodeNullableString("openshift_io_alert_rule_layer") 
+ if err != nil { + return err + } + p.ComponentFromSet, p.ComponentFrom, err = decodeNullableString("openshift_io_alert_rule_component_from") + if err != nil { + return err + } + p.LayerFromSet, p.LayerFrom, err = decodeNullableString("openshift_io_alert_rule_layer_from") + if err != nil { + return err + } + return nil +} diff --git a/internal/managementrouter/alert_rule_classification_patch_test.go b/internal/managementrouter/alert_rule_classification_patch_test.go new file mode 100644 index 000000000..23e3c33ff --- /dev/null +++ b/internal/managementrouter/alert_rule_classification_patch_test.go @@ -0,0 +1,50 @@ +package managementrouter_test + +import ( + "encoding/json" + "testing" + + "github.com/openshift/monitoring-plugin/internal/managementrouter" +) + +func TestAlertRuleClassificationPatch_FieldOmitted(t *testing.T) { + var p managementrouter.AlertRuleClassificationPatch + if err := json.Unmarshal([]byte(`{}`), &p); err != nil { + t.Fatalf("unexpected error: %v", err) + } + if p.ComponentSet { + t.Error("expected ComponentSet to be false when field is omitted") + } + if p.Component != nil { + t.Error("expected Component to be nil when field is omitted") + } +} + +func TestAlertRuleClassificationPatch_FieldExplicitNull(t *testing.T) { + var p managementrouter.AlertRuleClassificationPatch + if err := json.Unmarshal([]byte(`{"openshift_io_alert_rule_component":null}`), &p); err != nil { + t.Fatalf("unexpected error: %v", err) + } + if !p.ComponentSet { + t.Error("expected ComponentSet to be true when field is explicitly null") + } + if p.Component != nil { + t.Error("expected Component to be nil when field is explicitly null") + } +} + +func TestAlertRuleClassificationPatch_FieldString(t *testing.T) { + var p managementrouter.AlertRuleClassificationPatch + if err := json.Unmarshal([]byte(`{"openshift_io_alert_rule_component":"team-x"}`), &p); err != nil { + t.Fatalf("unexpected error: %v", err) + } + if !p.ComponentSet { + t.Error("expected ComponentSet to 
be true when field is a string") + } + if p.Component == nil { + t.Fatal("expected Component to be non-nil when field is a string") + } + if *p.Component != "team-x" { + t.Errorf("expected Component %q, got %q", "team-x", *p.Component) + } +} diff --git a/pkg/alert_rule/alert_rule.go b/pkg/alert_rule/alert_rule.go index 862cb59ac..0b1ee5dc4 100644 --- a/pkg/alert_rule/alert_rule.go +++ b/pkg/alert_rule/alert_rule.go @@ -4,7 +4,6 @@ import ( "crypto/sha256" "encoding/base64" "fmt" - "regexp" "sort" "strings" "unicode/utf8" @@ -12,11 +11,10 @@ import ( monitoringv1 "github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring/v1" "github.com/prometheus/prometheus/promql/parser" + "github.com/openshift/monitoring-plugin/pkg/classification" "github.com/openshift/monitoring-plugin/pkg/managementlabels" ) -var promLabelNameRegexp = regexp.MustCompile(`^[A-Za-z_][A-Za-z0-9_]*$`) - func GetAlertingRuleId(alertRule *monitoringv1.Rule) string { var name string var kind string @@ -74,14 +72,12 @@ func normalizedBusinessLabelsBlock(in map[string]string) string { continue } if strings.HasPrefix(key, "openshift_io_") || key == managementlabels.AlertNameLabel { - // Skip system labels continue } - if !promLabelNameRegexp.MatchString(key) { + if !classification.ValidatePromLabelName(key) { continue } if v == "" { - // Align with specHash behavior: drop empty values continue } if !utf8.ValidString(v) { diff --git a/pkg/classification/validation.go b/pkg/classification/validation.go new file mode 100644 index 000000000..32f78b784 --- /dev/null +++ b/pkg/classification/validation.go @@ -0,0 +1,34 @@ +package classification + +import ( + "regexp" + "strings" +) + +var allowedLayers = map[string]struct{}{ + "cluster": {}, + "namespace": {}, +} + +var labelValueRegexp = regexp.MustCompile(`^[A-Za-z0-9]([A-Za-z0-9_.-]*[A-Za-z0-9])?$`) +var labelNameRegexp = regexp.MustCompile(`^[A-Za-z_][A-Za-z0-9_]*$`) + +// ValidateLayer returns true if the provided layer is one of the 
allowed values. +func ValidateLayer(layer string) bool { + _, ok := allowedLayers[strings.ToLower(strings.TrimSpace(layer))] + return ok +} + +// ValidateComponent returns true if the component is a reasonable label value. +// Accept 1-253 chars, [A-Za-z0-9._-], must start/end alphanumeric. +func ValidateComponent(component string) bool { + c := strings.TrimSpace(component) + if c == "" || len(c) > 253 { + return false + } + return labelValueRegexp.MatchString(c) +} + +func ValidatePromLabelName(name string) bool { + return labelNameRegexp.MatchString(strings.TrimSpace(name)) +} diff --git a/pkg/k8s/const.go b/pkg/k8s/const.go index 243cea8d8..699dc452e 100644 --- a/pkg/k8s/const.go +++ b/pkg/k8s/const.go @@ -1,5 +1,6 @@ package k8s const ( - ClusterMonitoringNamespace = "openshift-monitoring" + ClusterMonitoringNamespace = "openshift-monitoring" + UserWorkloadMonitoringNamespace = "openshift-user-workload-monitoring" ) diff --git a/pkg/k8s/relabeled_rules.go b/pkg/k8s/relabeled_rules.go index a8a862908..02452c385 100644 --- a/pkg/k8s/relabeled_rules.go +++ b/pkg/k8s/relabeled_rules.go @@ -38,8 +38,10 @@ const ( PrometheusRuleLabelName = "openshift_io_prometheus_rule_name" AlertRuleLabelId = "openshift_io_alert_rule_id" - AlertRuleClassificationComponentKey = "openshift_io_alert_rule_component" - AlertRuleClassificationLayerKey = "openshift_io_alert_rule_layer" + AlertRuleClassificationComponentKey = "openshift_io_alert_rule_component" + AlertRuleClassificationLayerKey = "openshift_io_alert_rule_layer" + AlertRuleClassificationComponentFromKey = "openshift_io_alert_rule_component_from" + AlertRuleClassificationLayerFromKey = "openshift_io_alert_rule_layer_from" AppKubernetesIoComponent = "app.kubernetes.io/component" AppKubernetesIoComponentAlertManagementApi = "alert-management-api" diff --git a/pkg/management/alert_rule_id_match.go b/pkg/management/alert_rule_id_match.go index b202b16dd..8fc5f2cdf 100644 --- a/pkg/management/alert_rule_id_match.go +++ 
b/pkg/management/alert_rule_id_match.go @@ -18,6 +18,8 @@ func ruleMatchesAlertRuleID(rule monitoringv1.Rule, alertRuleId string) bool { return alertRuleId != "" && alertRuleId == alertrule.GetAlertingRuleId(&rule) } +// getOriginalPlatformRule fetches the PrometheusRule and delegates the rule +// lookup to getOriginalPlatformRuleFromPR. func (c *client) getOriginalPlatformRule(ctx context.Context, namespace string, name string, alertRuleId string) (*monitoringv1.Rule, error) { pr, found, err := c.k8sClient.PrometheusRules().Get(ctx, namespace, name) if err != nil { @@ -31,19 +33,5 @@ func (c *client) getOriginalPlatformRule(ctx context.Context, namespace string, AdditionalInfo: fmt.Sprintf("PrometheusRule %s/%s not found", namespace, name), } } - - for groupIdx := range pr.Spec.Groups { - for ruleIdx := range pr.Spec.Groups[groupIdx].Rules { - rule := &pr.Spec.Groups[groupIdx].Rules[ruleIdx] - if ruleMatchesAlertRuleID(*rule, alertRuleId) { - return rule, nil - } - } - } - - return nil, &NotFoundError{ - Resource: "AlertRule", - Id: alertRuleId, - AdditionalInfo: fmt.Sprintf("in PrometheusRule %s/%s", namespace, name), - } + return getOriginalPlatformRuleFromPR(pr, namespace, name, alertRuleId) } diff --git a/pkg/management/client_factory.go b/pkg/management/client_factory.go index e71b7f93b..de5b940ce 100644 --- a/pkg/management/client_factory.go +++ b/pkg/management/client_factory.go @@ -2,6 +2,8 @@ package management import ( "context" + "os" + "strings" "github.com/openshift/monitoring-plugin/pkg/k8s" ) @@ -9,6 +11,7 @@ import ( // New creates a new management client. 
func New(ctx context.Context, k8sClient k8s.Client) Client { return &client{ - k8sClient: k8sClient, + k8sClient: k8sClient, + enableUserWorkloadARCs: strings.EqualFold(strings.TrimSpace(os.Getenv("ENABLE_USER_WORKLOAD_ARCS")), "true"), } } diff --git a/pkg/management/management.go b/pkg/management/management.go index 652ac14de..b7eec3c09 100644 --- a/pkg/management/management.go +++ b/pkg/management/management.go @@ -7,7 +7,8 @@ import ( ) type client struct { - k8sClient k8s.Client + k8sClient k8s.Client + enableUserWorkloadARCs bool } // isPlatformManagedPrometheusRule returns true when the target diff --git a/pkg/management/types.go b/pkg/management/types.go index 837f75e88..96fb4cd7d 100644 --- a/pkg/management/types.go +++ b/pkg/management/types.go @@ -16,6 +16,11 @@ type Client interface { // CreatePlatformAlertRule creates a new platform alert rule CreatePlatformAlertRule(ctx context.Context, alertRule monitoringv1.Rule) (alertRuleId string, err error) + + // UpdateAlertRuleClassification updates component/layer for a single alert rule id + UpdateAlertRuleClassification(ctx context.Context, req UpdateRuleClassificationRequest) error + // BulkUpdateAlertRuleClassification updates classification for multiple rule ids + BulkUpdateAlertRuleClassification(ctx context.Context, items []UpdateRuleClassificationRequest) []error } // PrometheusRuleOptions specifies options for selecting PrometheusRule resources and groups diff --git a/pkg/management/update_classification.go b/pkg/management/update_classification.go new file mode 100644 index 000000000..83d092bf3 --- /dev/null +++ b/pkg/management/update_classification.go @@ -0,0 +1,459 @@ +package management + +import ( + "context" + "fmt" + "regexp" + "sort" + "strings" + + osmv1 "github.com/openshift/api/monitoring/v1" + monitoringv1 "github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring/v1" + metav1 "k8s.io/apimachinery/pkg/apis/meta/v1" + "k8s.io/apimachinery/pkg/types" + + 
"github.com/openshift/monitoring-plugin/pkg/classification" + "github.com/openshift/monitoring-plugin/pkg/k8s" + "github.com/openshift/monitoring-plugin/pkg/managementlabels" +) + +// UpdateRuleClassificationRequest represents a single classification update +type UpdateRuleClassificationRequest struct { + RuleId string `json:"ruleId"` + Component *string `json:"openshift_io_alert_rule_component,omitempty"` + ComponentSet bool `json:"-"` + Layer *string `json:"openshift_io_alert_rule_layer,omitempty"` + LayerSet bool `json:"-"` + ComponentFrom *string `json:"openshift_io_alert_rule_component_from,omitempty"` + ComponentFromSet bool `json:"-"` + LayerFrom *string `json:"openshift_io_alert_rule_layer_from,omitempty"` + LayerFromSet bool `json:"-"` +} + +// UpdateAlertRuleClassification updates classification labels for a single +// operator-managed alert rule. +func (c *client) UpdateAlertRuleClassification(ctx context.Context, req UpdateRuleClassificationRequest) error { + if req.RuleId == "" { + return &ValidationError{Message: "ruleId is required"} + } + + if req.Component != nil && !classification.ValidateComponent(*req.Component) { + return &ValidationError{Message: fmt.Sprintf("invalid component %q", *req.Component)} + } + if req.Layer != nil && !classification.ValidateLayer(*req.Layer) { + return &ValidationError{Message: fmt.Sprintf("invalid layer %q (allowed: cluster, namespace)", *req.Layer)} + } + if req.ComponentFrom != nil { + v := strings.TrimSpace(*req.ComponentFrom) + if v != "" && !classification.ValidatePromLabelName(v) { + return &ValidationError{Message: fmt.Sprintf("invalid openshift_io_alert_rule_component_from %q (must be a valid Prometheus label name)", *req.ComponentFrom)} + } + } + if req.LayerFrom != nil { + v := strings.TrimSpace(*req.LayerFrom) + if v != "" && !classification.ValidatePromLabelName(v) { + return &ValidationError{Message: fmt.Sprintf("invalid openshift_io_alert_rule_layer_from %q (must be a valid Prometheus label name)", 
*req.LayerFrom)} + } + } + + // Find the base rule to locate its PrometheusRule namespace + rule, found := c.k8sClient.RelabeledRules().Get(ctx, req.RuleId) + if !found { + return &NotFoundError{Resource: "AlertRule", Id: req.RuleId} + } + + if !req.ComponentSet && !req.LayerSet && !req.ComponentFromSet && !req.LayerFromSet { + return nil + } + + labels := buildClassificationLabels(req) + + ns := rule.Labels[k8s.PrometheusRuleLabelNamespace] + name := rule.Labels[k8s.PrometheusRuleLabelName] + + if c.isPlatformManagedPrometheusRule(types.NamespacedName{Namespace: ns, Name: name}) { + return c.applyClassificationViaARC(ctx, req.RuleId, rule, labels, k8s.ClusterMonitoringNamespace) + } + + if !c.enableUserWorkloadARCs { + return &NotAllowedError{Message: "classification updates for user-defined workload rules require ENABLE_USER_WORKLOAD_ARCS"} + } + + return c.applyClassificationViaARC(ctx, req.RuleId, rule, labels, k8s.UserWorkloadMonitoringNamespace) +} + +// BulkUpdateAlertRuleClassification updates multiple entries; returns per-item errors collected by caller +func (c *client) BulkUpdateAlertRuleClassification(ctx context.Context, items []UpdateRuleClassificationRequest) []error { + errs := make([]error, len(items)) + for i := range items { + errs[i] = c.UpdateAlertRuleClassification(ctx, items[i]) + } + return errs +} + +// buildClassificationLabels converts the classification request fields into a +// label map suitable for the label-based update paths. Empty string means "drop +// this label" which the ARC/PR update paths already handle. 
+func buildClassificationLabels(req UpdateRuleClassificationRequest) map[string]string { + labels := map[string]string{} + anySet := false + + if req.ComponentSet { + if req.Component == nil { + labels[k8s.AlertRuleClassificationComponentKey] = "" + } else { + labels[k8s.AlertRuleClassificationComponentKey] = *req.Component + } + anySet = true + } + if req.LayerSet { + if req.Layer == nil { + labels[k8s.AlertRuleClassificationLayerKey] = "" + } else { + labels[k8s.AlertRuleClassificationLayerKey] = strings.ToLower(strings.TrimSpace(*req.Layer)) + } + anySet = true + } + if req.ComponentFromSet { + if req.ComponentFrom == nil { + labels[k8s.AlertRuleClassificationComponentFromKey] = "" + } else { + labels[k8s.AlertRuleClassificationComponentFromKey] = strings.TrimSpace(*req.ComponentFrom) + } + anySet = true + } + if req.LayerFromSet { + if req.LayerFrom == nil { + labels[k8s.AlertRuleClassificationLayerFromKey] = "" + } else { + labels[k8s.AlertRuleClassificationLayerFromKey] = strings.TrimSpace(*req.LayerFrom) + } + anySet = true + } + + if anySet { + labels[managementlabels.ClassificationManagedByKey] = managementlabels.ClassificationManagedByValue + } + return labels +} + +// applyClassificationViaARC applies classification labels through an AlertRelabelConfig. +// The ARC is named per-rule and shared with other label changes (severity, etc.). +// arcNamespace determines where the ARC is stored (e.g. openshift-monitoring for +// platform rules, openshift-user-workload-monitoring for user-defined rules). 
+func (c *client) applyClassificationViaARC( + ctx context.Context, + alertRuleId string, + relabeled monitoringv1.Rule, + classificationLabels map[string]string, + arcNamespace string, +) error { + namespace := relabeled.Labels[k8s.PrometheusRuleLabelNamespace] + name := relabeled.Labels[k8s.PrometheusRuleLabelName] + + pr, prFound, err := c.k8sClient.PrometheusRules().Get(ctx, namespace, name) + if err != nil { + return err + } + if !prFound { + return &NotFoundError{ + Resource: "PrometheusRule", + Id: alertRuleId, + AdditionalInfo: fmt.Sprintf("PrometheusRule %s/%s not found", namespace, name), + } + } + + originalRule, err := getOriginalPlatformRuleFromPR(pr, namespace, name, alertRuleId) + if err != nil { + return err + } + + prName := relabeled.Labels[k8s.PrometheusRuleLabelName] + arcName := k8s.GetAlertRelabelConfigName(prName, alertRuleId) + + original := copyStringMap(originalRule.Labels) + + existingArc, found, err := c.k8sClient.AlertRelabelConfigs().Get(ctx, arcNamespace, arcName) + if err != nil { + return fmt.Errorf("failed to get AlertRelabelConfig %s/%s: %w", arcNamespace, arcName, err) + } + + existingOverrides, existingDrops := collectExistingFromARC(found, existingArc) + existingRuleDrops := getExistingRuleDrops(existingArc, alertRuleId) + effective := computeEffectiveLabels(original, existingOverrides, existingDrops) + + desired := buildDesiredLabels(effective, classificationLabels) + nextChanges := buildNextLabelChanges(original, desired) + + if len(nextChanges) == 0 { + if found { + if err := c.k8sClient.AlertRelabelConfigs().Delete(ctx, arcNamespace, arcName); err != nil { + return fmt.Errorf("failed to delete AlertRelabelConfig %s/%s: %w", arcNamespace, arcName, err) + } + } + return nil + } + + relabelConfigs := buildRelabelConfigs(originalRule.Alert, original, alertRuleId, nextChanges) + relabelConfigs = appendPreservedRuleDrops(relabelConfigs, existingRuleDrops) + + return upsertAlertRelabelConfig(c.k8sClient, ctx, arcNamespace, 
arcName, prName, originalRule.Alert, alertRuleId, found, existingArc, relabelConfigs) +} + +// --- ARC helpers (shared with platform alert rule updates in downstream branches) --- + +func copyStringMap(in map[string]string) map[string]string { + out := make(map[string]string, len(in)) + for k, v := range in { + out[k] = v + } + return out +} + +func collectExistingFromARC(found bool, arc *osmv1.AlertRelabelConfig) (map[string]string, map[string]struct{}) { + overrides := map[string]string{} + drops := map[string]struct{}{} + if found && arc != nil { + for _, rc := range arc.Spec.Configs { + switch rc.Action { + case "Replace": + if rc.TargetLabel != "" && rc.Replacement != "" { + overrides[string(rc.TargetLabel)] = rc.Replacement + } + case "LabelDrop": + if rc.Regex != "" { + drops[rc.Regex] = struct{}{} + } + } + } + } + return overrides, drops +} + +func computeEffectiveLabels(original map[string]string, overrides map[string]string, drops map[string]struct{}) map[string]string { + effective := copyStringMap(original) + for k, v := range overrides { + effective[k] = v + } + for dropKey := range drops { + delete(effective, dropKey) + } + return effective +} + +func buildDesiredLabels(effective map[string]string, newLabels map[string]string) map[string]string { + desired := copyStringMap(effective) + for k, v := range newLabels { + if v == "" { + delete(desired, k) + } else { + desired[k] = v + } + } + return desired +} + +func buildNextLabelChanges(original map[string]string, desired map[string]string) []labelChange { + var changes []labelChange + for k, v := range desired { + if k == k8s.AlertRuleLabelId { + continue + } + if ov, ok := original[k]; !ok || ov != v { + changes = append(changes, labelChange{ + action: "Replace", + targetLabel: k, + value: v, + }) + } + } + return changes +} + +type labelChange struct { + action string + sourceLabel string + targetLabel string + value string +} + +func getExistingRuleDrops(arc *osmv1.AlertRelabelConfig, alertRuleId 
string) []osmv1.RelabelConfig { + if arc == nil { + return nil + } + var out []osmv1.RelabelConfig + escaped := regexp.QuoteMeta(alertRuleId) + for _, rc := range arc.Spec.Configs { + if rc.Action != "Drop" { + continue + } + if len(rc.SourceLabels) == 1 && rc.SourceLabels[0] == k8s.AlertRuleLabelId && + (rc.Regex == alertRuleId || rc.Regex == escaped) { + out = append(out, rc) + } + } + return out +} + +func appendPreservedRuleDrops(configs []osmv1.RelabelConfig, drops []osmv1.RelabelConfig) []osmv1.RelabelConfig { + if len(drops) == 0 { + return configs + } +nextDrop: + for _, d := range drops { + for _, cfg := range configs { + if cfg.Action == "Drop" && cfg.Regex == d.Regex && + len(cfg.SourceLabels) == 1 && cfg.SourceLabels[0] == k8s.AlertRuleLabelId { + continue nextDrop + } + } + configs = append(configs, d) + } + return configs +} + +func buildRelabelConfigs(alertName string, originalLabels map[string]string, alertRuleId string, changes []labelChange) []osmv1.RelabelConfig { + var configs []osmv1.RelabelConfig + + var keys []string + for k := range originalLabels { + if k == "namespace" { + continue + } + keys = append(keys, k) + } + sort.Strings(keys) + source := []osmv1.LabelName{managementlabels.AlertNameLabel} + values := []string{alertName} + for _, k := range keys { + source = append(source, osmv1.LabelName(k)) + values = append(values, originalLabels[k]) + } + pat := "^" + regexp.QuoteMeta(strings.Join(values, ";")) + "$" + configs = append(configs, osmv1.RelabelConfig{ + SourceLabels: source, + Regex: pat, + TargetLabel: k8s.AlertRuleLabelId, + Replacement: alertRuleId, + Action: "Replace", + }) + + for _, change := range changes { + switch change.action { + case "Replace": + configs = append(configs, osmv1.RelabelConfig{ + SourceLabels: []osmv1.LabelName{k8s.AlertRuleLabelId}, + Regex: regexp.QuoteMeta(alertRuleId), + TargetLabel: change.targetLabel, + Replacement: change.value, + Action: "Replace", + }) + case "LabelDrop": + configs = 
append(configs, osmv1.RelabelConfig{ + Regex: change.sourceLabel, + Action: "LabelDrop", + }) + } + } + + return configs +} + +func upsertAlertRelabelConfig( + k8sClient k8s.Client, + ctx context.Context, + namespace string, + arcName string, + prName string, + alertName string, + alertRuleId string, + found bool, + existingArc *osmv1.AlertRelabelConfig, + relabelConfigs []osmv1.RelabelConfig, +) error { + if found { + arc := existingArc + arc.Spec = osmv1.AlertRelabelConfigSpec{Configs: relabelConfigs} + if arc.Labels == nil { + arc.Labels = map[string]string{} + } + arc.Labels[managementlabels.ARCLabelPrometheusRuleNameKey] = prName + arc.Labels[managementlabels.ARCLabelAlertNameKey] = alertName + if arc.Annotations == nil { + arc.Annotations = map[string]string{} + } + arc.Annotations[managementlabels.ARCAnnotationAlertRuleIDKey] = alertRuleId + if err := k8sClient.AlertRelabelConfigs().Update(ctx, *arc); err != nil { + return fmt.Errorf("failed to update AlertRelabelConfig %s/%s: %w", arc.Namespace, arc.Name, err) + } + return nil + } + + arc := &osmv1.AlertRelabelConfig{ + ObjectMeta: metav1.ObjectMeta{ + Name: arcName, + Namespace: namespace, + Labels: map[string]string{ + managementlabels.ARCLabelPrometheusRuleNameKey: prName, + managementlabels.ARCLabelAlertNameKey: alertName, + }, + Annotations: map[string]string{ + managementlabels.ARCAnnotationAlertRuleIDKey: alertRuleId, + }, + }, + Spec: osmv1.AlertRelabelConfigSpec{Configs: relabelConfigs}, + } + if _, err := k8sClient.AlertRelabelConfigs().Create(ctx, *arc); err != nil { + return fmt.Errorf("failed to create AlertRelabelConfig %s/%s: %w", arc.Namespace, arc.Name, err) + } + return nil +} + +// getOriginalPlatformRuleFromPR returns the original rule from a pre-fetched PrometheusRule. 
+func getOriginalPlatformRuleFromPR(pr *monitoringv1.PrometheusRule, namespace string, name string, alertRuleId string) (*monitoringv1.Rule, error) { + if pr == nil { + return nil, &NotFoundError{ + Resource: "PrometheusRule", + Id: alertRuleId, + AdditionalInfo: fmt.Sprintf("PrometheusRule %s/%s not found", namespace, name), + } + } + for groupIdx := range pr.Spec.Groups { + for ruleIdx := range pr.Spec.Groups[groupIdx].Rules { + rule := &pr.Spec.Groups[groupIdx].Rules[ruleIdx] + if ruleMatchesAlertRuleID(*rule, alertRuleId) { + return rule, nil + } + } + } + return nil, &NotFoundError{ + Resource: "AlertRule", + Id: alertRuleId, + AdditionalInfo: fmt.Sprintf("in PrometheusRule %s/%s", namespace, name), + } +} + +// ApplyDynamicClassification resolves the effective component and layer for an +// alert by applying _from indirection. If a rule carries a component_from or +// layer_from label, the corresponding alert label value is used instead of the +// static default. Unresolvable or empty lookups fall back to the supplied +// defaults. 
+func ApplyDynamicClassification(ruleLabels, alertLabels map[string]string, defaultComponent, defaultLayer string) (string, string) { + component := defaultComponent + layer := defaultLayer + + if ruleLabels != nil { + if fromKey := ruleLabels[k8s.AlertRuleClassificationComponentFromKey]; fromKey != "" { + if v, ok := alertLabels[fromKey]; ok && v != "" { + component = v + } + } + if fromKey := ruleLabels[k8s.AlertRuleClassificationLayerFromKey]; fromKey != "" { + if v, ok := alertLabels[fromKey]; ok && v != "" { + layer = strings.ToLower(v) + } + } + } + + return component, layer +} diff --git a/pkg/management/update_classification_test.go b/pkg/management/update_classification_test.go new file mode 100644 index 000000000..42ed908f3 --- /dev/null +++ b/pkg/management/update_classification_test.go @@ -0,0 +1,521 @@ +package management_test + +import ( + "context" + "errors" + "testing" + + monitoringv1 "github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring/v1" + metav1 "k8s.io/apimachinery/pkg/apis/meta/v1" + + alertrule "github.com/openshift/monitoring-plugin/pkg/alert_rule" + "github.com/openshift/monitoring-plugin/pkg/k8s" + "github.com/openshift/monitoring-plugin/pkg/management" + "github.com/openshift/monitoring-plugin/pkg/management/testutils" + "github.com/openshift/monitoring-plugin/pkg/managementlabels" +) + +var ( + platformOriginal = monitoringv1.Rule{ + Alert: "CannotRetrieveUpdates", + Labels: map[string]string{ + "severity": "warning", + }, + } + platformRuleId = alertrule.GetAlertingRuleId(&platformOriginal) + + userOriginal = monitoringv1.Rule{ + Alert: "HighLatency", + Labels: map[string]string{ + "severity": "warning", + }, + } + userRuleId = alertrule.GetAlertingRuleId(&userOriginal) + + clTestPlatformNamespace = "openshift-monitoring" + clTestUserNamespace = "my-app" + clTestRuleName = "my-rules" +) + +func makePlatformRelabeled() monitoringv1.Rule { + return monitoringv1.Rule{ + Alert: platformOriginal.Alert, + Labels: 
map[string]string{ + k8s.AlertRuleLabelId: platformRuleId, + k8s.PrometheusRuleLabelNamespace: clTestPlatformNamespace, + k8s.PrometheusRuleLabelName: clTestRuleName, + "severity": "warning", + }, + } +} + +func makeUserRelabeled() monitoringv1.Rule { + return monitoringv1.Rule{ + Alert: userOriginal.Alert, + Labels: map[string]string{ + k8s.AlertRuleLabelId: userRuleId, + k8s.PrometheusRuleLabelNamespace: clTestUserNamespace, + k8s.PrometheusRuleLabelName: clTestRuleName, + "severity": "warning", + }, + } +} + +func makeClassificationPR(ns, name string, rules ...monitoringv1.Rule) *monitoringv1.PrometheusRule { + return &monitoringv1.PrometheusRule{ + ObjectMeta: metav1.ObjectMeta{Namespace: ns, Name: name}, + Spec: monitoringv1.PrometheusRuleSpec{ + Groups: []monitoringv1.RuleGroup{{Name: "default", Rules: rules}}, + }, + } +} + +func newClassificationClient(t *testing.T) (management.Client, *testutils.MockClient) { + t.Helper() + mockK8s := &testutils.MockClient{} + return management.New(context.Background(), mockK8s), mockK8s +} + +func mockRelabeledRules(id string, rule monitoringv1.Rule) func() k8s.RelabeledRulesInterface { + return func() k8s.RelabeledRulesInterface { + return &testutils.MockRelabeledRulesInterface{ + GetFunc: func(_ context.Context, gotID string) (monitoringv1.Rule, bool) { + if gotID == id { + return rule, true + } + return monitoringv1.Rule{}, false + }, + } + } +} + +// --- Validation tests --- + +func TestUpdateAlertRuleClassification_EmptyRuleId(t *testing.T) { + client, _ := newClassificationClient(t) + err := client.UpdateAlertRuleClassification(context.Background(), management.UpdateRuleClassificationRequest{}) + if err == nil { + t.Fatal("expected error for empty ruleId") + } + var ve *management.ValidationError + if !errors.As(err, &ve) { + t.Errorf("expected ValidationError, got %T: %v", err, err) + } +} + +func TestUpdateAlertRuleClassification_InvalidLayer(t *testing.T) { + client, mockK8s := newClassificationClient(t) + rule 
:= makePlatformRelabeled() + mockK8s.RelabeledRulesFunc = mockRelabeledRules(platformRuleId, rule) + + bad := "invalid" + err := client.UpdateAlertRuleClassification(context.Background(), management.UpdateRuleClassificationRequest{ + RuleId: platformRuleId, + Layer: &bad, + LayerSet: true, + }) + if err == nil { + t.Fatal("expected ValidationError for invalid layer") + } + var ve *management.ValidationError + if !errors.As(err, &ve) { + t.Errorf("expected ValidationError, got %T: %v", err, err) + } +} + +func TestUpdateAlertRuleClassification_InvalidComponent(t *testing.T) { + client, mockK8s := newClassificationClient(t) + rule := makePlatformRelabeled() + mockK8s.RelabeledRulesFunc = mockRelabeledRules(platformRuleId, rule) + + empty := "" + err := client.UpdateAlertRuleClassification(context.Background(), management.UpdateRuleClassificationRequest{ + RuleId: platformRuleId, + Component: &empty, + ComponentSet: true, + }) + if err == nil { + t.Fatal("expected ValidationError for empty component") + } + var ve *management.ValidationError + if !errors.As(err, &ve) { + t.Errorf("expected ValidationError, got %T: %v", err, err) + } +} + +func TestUpdateAlertRuleClassification_InvalidComponentFrom(t *testing.T) { + client, mockK8s := newClassificationClient(t) + rule := makePlatformRelabeled() + mockK8s.RelabeledRulesFunc = mockRelabeledRules(platformRuleId, rule) + + bad := "bad-label" + err := client.UpdateAlertRuleClassification(context.Background(), management.UpdateRuleClassificationRequest{ + RuleId: platformRuleId, + ComponentFrom: &bad, + ComponentFromSet: true, + }) + if err == nil { + t.Fatal("expected ValidationError for invalid component_from") + } + var ve *management.ValidationError + if !errors.As(err, &ve) { + t.Errorf("expected ValidationError, got %T: %v", err, err) + } +} + +func TestUpdateAlertRuleClassification_InvalidLayerFrom(t *testing.T) { + client, mockK8s := newClassificationClient(t) + rule := makePlatformRelabeled() + 
mockK8s.RelabeledRulesFunc = mockRelabeledRules(platformRuleId, rule) + + bad := "1layer" + err := client.UpdateAlertRuleClassification(context.Background(), management.UpdateRuleClassificationRequest{ + RuleId: platformRuleId, + LayerFrom: &bad, + LayerFromSet: true, + }) + if err == nil { + t.Fatal("expected ValidationError for invalid layer_from") + } + var ve *management.ValidationError + if !errors.As(err, &ve) { + t.Errorf("expected ValidationError, got %T: %v", err, err) + } +} + +func TestUpdateAlertRuleClassification_NotFound(t *testing.T) { + client, mockK8s := newClassificationClient(t) + mockK8s.RelabeledRulesFunc = func() k8s.RelabeledRulesInterface { + return &testutils.MockRelabeledRulesInterface{ + GetFunc: func(_ context.Context, _ string) (monitoringv1.Rule, bool) { + return monitoringv1.Rule{}, false + }, + } + } + + val := "cluster" + err := client.UpdateAlertRuleClassification(context.Background(), management.UpdateRuleClassificationRequest{ + RuleId: "missing", + Layer: &val, + LayerSet: true, + }) + if err == nil { + t.Fatal("expected NotFoundError") + } + var nf *management.NotFoundError + if !errors.As(err, &nf) { + t.Fatalf("expected NotFoundError, got %T: %v", err, err) + } + if nf.Resource != "AlertRule" { + t.Errorf("expected Resource %q, got %q", "AlertRule", nf.Resource) + } +} + +func TestUpdateAlertRuleClassification_EmptyPayloadIsNoOp(t *testing.T) { + client, mockK8s := newClassificationClient(t) + rule := makePlatformRelabeled() + mockK8s.RelabeledRulesFunc = mockRelabeledRules(platformRuleId, rule) + + if err := client.UpdateAlertRuleClassification(context.Background(), management.UpdateRuleClassificationRequest{ + RuleId: platformRuleId, + }); err != nil { + t.Fatalf("expected no-op but got error: %v", err) + } +} + +// --- Platform rules (ARC path) --- + +func TestUpdateAlertRuleClassification_PlatformRule_CreatesARC(t *testing.T) { + client, mockK8s := newClassificationClient(t) + + mockK8s.NamespaceFunc = func()
k8s.NamespaceInterface { + return &testutils.MockNamespaceInterface{ + MonitoringNamespaces: map[string]bool{clTestPlatformNamespace: true}, + } + } + + relabeled := makePlatformRelabeled() + pr := makeClassificationPR(clTestPlatformNamespace, clTestRuleName, platformOriginal) + + mockK8s.RelabeledRulesFunc = mockRelabeledRules(platformRuleId, relabeled) + prStore := &testutils.MockPrometheusRuleInterface{ + PrometheusRules: map[string]*monitoringv1.PrometheusRule{ + clTestPlatformNamespace + "/" + clTestRuleName: pr, + }, + } + mockK8s.PrometheusRulesFunc = func() k8s.PrometheusRuleInterface { return prStore } + + arcStore := &testutils.MockAlertRelabelConfigInterface{} + mockK8s.AlertRelabelConfigsFunc = func() k8s.AlertRelabelConfigInterface { return arcStore } + + component := "networking" + layer := "cluster" + if err := client.UpdateAlertRuleClassification(context.Background(), management.UpdateRuleClassificationRequest{ + RuleId: platformRuleId, + Component: &component, + ComponentSet: true, + Layer: &layer, + LayerSet: true, + }); err != nil { + t.Fatalf("unexpected error: %v", err) + } + + if len(arcStore.AlertRelabelConfigs) != 1 { + t.Fatalf("expected 1 ARC, got %d", len(arcStore.AlertRelabelConfigs)) + } + + for _, arc := range arcStore.AlertRelabelConfigs { + if arc.Labels[managementlabels.ARCLabelPrometheusRuleNameKey] != clTestRuleName { + t.Errorf("ARC missing expected prometheus rule name label") + } + if arc.Labels[managementlabels.ARCLabelAlertNameKey] != "CannotRetrieveUpdates" { + t.Errorf("ARC missing expected alert name label") + } + if arc.Annotations[managementlabels.ARCAnnotationAlertRuleIDKey] != platformRuleId { + t.Errorf("ARC missing expected alert rule ID annotation") + } + + hasComponent, hasLayer, hasManagedBy := false, false, false + for _, rc := range arc.Spec.Configs { + if rc.Action == "Replace" && rc.TargetLabel == k8s.AlertRuleClassificationComponentKey { + if rc.Replacement != "networking" { + t.Errorf("expected component 
replacement %q, got %q", "networking", rc.Replacement) + } + hasComponent = true + } + if rc.Action == "Replace" && rc.TargetLabel == k8s.AlertRuleClassificationLayerKey { + if rc.Replacement != "cluster" { + t.Errorf("expected layer replacement %q, got %q", "cluster", rc.Replacement) + } + hasLayer = true + } + if rc.Action == "Replace" && rc.TargetLabel == managementlabels.ClassificationManagedByKey { + hasManagedBy = true + } + } + if !hasComponent { + t.Error("ARC should have component replace config") + } + if !hasLayer { + t.Error("ARC should have layer replace config") + } + if !hasManagedBy { + t.Error("ARC should have managed-by replace config") + } + } +} + +func TestUpdateAlertRuleClassification_PlatformRule_CreatesARCWithFromLabels(t *testing.T) { + client, mockK8s := newClassificationClient(t) + + mockK8s.NamespaceFunc = func() k8s.NamespaceInterface { + return &testutils.MockNamespaceInterface{ + MonitoringNamespaces: map[string]bool{clTestPlatformNamespace: true}, + } + } + + relabeled := makePlatformRelabeled() + pr := makeClassificationPR(clTestPlatformNamespace, clTestRuleName, platformOriginal) + + mockK8s.RelabeledRulesFunc = mockRelabeledRules(platformRuleId, relabeled) + prStore := &testutils.MockPrometheusRuleInterface{ + PrometheusRules: map[string]*monitoringv1.PrometheusRule{ + clTestPlatformNamespace + "/" + clTestRuleName: pr, + }, + } + mockK8s.PrometheusRulesFunc = func() k8s.PrometheusRuleInterface { return prStore } + + arcStore := &testutils.MockAlertRelabelConfigInterface{} + mockK8s.AlertRelabelConfigsFunc = func() k8s.AlertRelabelConfigInterface { return arcStore } + + componentFrom := "namespace" + layerFrom := "tier" + if err := client.UpdateAlertRuleClassification(context.Background(), management.UpdateRuleClassificationRequest{ + RuleId: platformRuleId, + ComponentFrom: &componentFrom, + ComponentFromSet: true, + LayerFrom: &layerFrom, + LayerFromSet: true, + }); err != nil { + t.Fatalf("unexpected error: %v", err) + } + + if 
len(arcStore.AlertRelabelConfigs) != 1 { + t.Fatalf("expected 1 ARC, got %d", len(arcStore.AlertRelabelConfigs)) + } + + for _, arc := range arcStore.AlertRelabelConfigs { + hasComponentFrom, hasLayerFrom := false, false + for _, rc := range arc.Spec.Configs { + if rc.Action == "Replace" && rc.TargetLabel == k8s.AlertRuleClassificationComponentFromKey { + if rc.Replacement != "namespace" { + t.Errorf("expected component_from replacement %q, got %q", "namespace", rc.Replacement) + } + hasComponentFrom = true + } + if rc.Action == "Replace" && rc.TargetLabel == k8s.AlertRuleClassificationLayerFromKey { + if rc.Replacement != "tier" { + t.Errorf("expected layer_from replacement %q, got %q", "tier", rc.Replacement) + } + hasLayerFrom = true + } + } + if !hasComponentFrom { + t.Error("ARC should have component_from replace config") + } + if !hasLayerFrom { + t.Error("ARC should have layer_from replace config") + } + } +} + +// --- User-defined rules --- + +func TestUpdateAlertRuleClassification_UserRule_NotAllowedWhenFlagDisabled(t *testing.T) { + client, mockK8s := newClassificationClient(t) + + mockK8s.NamespaceFunc = func() k8s.NamespaceInterface { + return &testutils.MockNamespaceInterface{ + MonitoringNamespaces: map[string]bool{clTestPlatformNamespace: true}, + } + } + + relabeled := makeUserRelabeled() + mockK8s.RelabeledRulesFunc = mockRelabeledRules(userRuleId, relabeled) + + component := "team_a" + err := client.UpdateAlertRuleClassification(context.Background(), management.UpdateRuleClassificationRequest{ + RuleId: userRuleId, + Component: &component, + ComponentSet: true, + }) + if err == nil { + t.Fatal("expected NotAllowedError when ENABLE_USER_WORKLOAD_ARCS is disabled") + } + var na *management.NotAllowedError + if !errors.As(err, &na) { + t.Errorf("expected NotAllowedError, got %T: %v", err, err) + } +} + +func TestUpdateAlertRuleClassification_UserRule_CreatesARCInUserWorkloadNamespace(t *testing.T) { + t.Setenv("ENABLE_USER_WORKLOAD_ARCS", "true") + 
// Recreate client to pick up the env var. + mockK8s := &testutils.MockClient{} + client := management.New(context.Background(), mockK8s) + + mockK8s.NamespaceFunc = func() k8s.NamespaceInterface { + return &testutils.MockNamespaceInterface{ + MonitoringNamespaces: map[string]bool{clTestPlatformNamespace: true}, + } + } + + relabeled := makeUserRelabeled() + pr := makeClassificationPR(clTestUserNamespace, clTestRuleName, userOriginal) + + mockK8s.RelabeledRulesFunc = mockRelabeledRules(userRuleId, relabeled) + prStore := &testutils.MockPrometheusRuleInterface{ + PrometheusRules: map[string]*monitoringv1.PrometheusRule{ + clTestUserNamespace + "/" + clTestRuleName: pr, + }, + } + mockK8s.PrometheusRulesFunc = func() k8s.PrometheusRuleInterface { return prStore } + + arcStore := &testutils.MockAlertRelabelConfigInterface{} + mockK8s.AlertRelabelConfigsFunc = func() k8s.AlertRelabelConfigInterface { return arcStore } + + component := "team_a" + layer := "namespace" + if err := client.UpdateAlertRuleClassification(context.Background(), management.UpdateRuleClassificationRequest{ + RuleId: userRuleId, + Component: &component, + ComponentSet: true, + Layer: &layer, + LayerSet: true, + }); err != nil { + t.Fatalf("unexpected error: %v", err) + } + + if len(arcStore.AlertRelabelConfigs) != 1 { + t.Fatalf("expected 1 ARC, got %d", len(arcStore.AlertRelabelConfigs)) + } + + for _, arc := range arcStore.AlertRelabelConfigs { + if arc.Namespace != k8s.UserWorkloadMonitoringNamespace { + t.Errorf("expected ARC namespace %q, got %q", k8s.UserWorkloadMonitoringNamespace, arc.Namespace) + } + if arc.Annotations[managementlabels.ARCAnnotationAlertRuleIDKey] != userRuleId { + t.Errorf("ARC missing expected alert rule ID annotation") + } + + hasComponent, hasLayer := false, false + for _, rc := range arc.Spec.Configs { + if rc.Action == "Replace" && rc.TargetLabel == k8s.AlertRuleClassificationComponentKey { + if rc.Replacement != "team_a" { + t.Errorf("expected component replacement 
%q, got %q", "team_a", rc.Replacement) + } + hasComponent = true + } + if rc.Action == "Replace" && rc.TargetLabel == k8s.AlertRuleClassificationLayerKey { + if rc.Replacement != "namespace" { + t.Errorf("expected layer replacement %q, got %q", "namespace", rc.Replacement) + } + hasLayer = true + } + } + if !hasComponent { + t.Error("ARC should have component replace config") + } + if !hasLayer { + t.Error("ARC should have layer replace config") + } + } +} + +// --- ApplyDynamicClassification tests --- + +func TestApplyDynamicClassification_DefaultsWhenNoFromLabels(t *testing.T) { + c, l := management.ApplyDynamicClassification(nil, nil, "comp", "cluster") + if c != "comp" { + t.Errorf("expected %q, got %q", "comp", c) + } + if l != "cluster" { + t.Errorf("expected %q, got %q", "cluster", l) + } +} + +func TestApplyDynamicClassification_UsesComponentFrom(t *testing.T) { + ruleLabels := map[string]string{ + k8s.AlertRuleClassificationComponentFromKey: "name", + } + alertLabels := map[string]string{ + "name": "dns", + } + c, _ := management.ApplyDynamicClassification(ruleLabels, alertLabels, "default", "cluster") + if c != "dns" { + t.Errorf("expected %q, got %q", "dns", c) + } +} + +func TestApplyDynamicClassification_UsesLayerFrom(t *testing.T) { + ruleLabels := map[string]string{ + k8s.AlertRuleClassificationLayerFromKey: "tier", + } + alertLabels := map[string]string{ + "tier": "Cluster", + } + _, l := management.ApplyDynamicClassification(ruleLabels, alertLabels, "comp", "namespace") + if l != "cluster" { + t.Errorf("expected %q (lowercased), got %q", "cluster", l) + } +} + +func TestApplyDynamicClassification_FallsBackWhenFromLabelMissing(t *testing.T) { + ruleLabels := map[string]string{ + k8s.AlertRuleClassificationComponentFromKey: "missing_label", + } + c, _ := management.ApplyDynamicClassification(ruleLabels, map[string]string{}, "fallback", "cluster") + if c != "fallback" { + t.Errorf("expected fallback %q, got %q", "fallback", c) + } +} diff --git 
a/pkg/managementlabels/management_labels.go b/pkg/managementlabels/management_labels.go index 757c00fdf..e700a06d2 100644 --- a/pkg/managementlabels/management_labels.go +++ b/pkg/managementlabels/management_labels.go @@ -26,3 +26,9 @@ const ( // ARCAnnotationAlertRuleIDKey stores the computed alert rule ID for cross-referencing. ARCAnnotationAlertRuleIDKey = "monitoring.openshift.io/alertRuleId" ) + +// Classification provenance labels +const ( + ClassificationManagedByKey = "openshift_io_alert_rule_classification_managed_by" + ClassificationManagedByValue = "monitoring-plugin" +)