Skip to content

feat(stack): allow controller metrics scraping#95

Open
gasarekubex wants to merge 20 commits into
masterfrom
gasare/feat/controller-metrics
Open

feat(stack): allow controller metrics scraping#95
gasarekubex wants to merge 20 commits into
masterfrom
gasare/feat/controller-metrics

Conversation

@gasarekubex
Copy link
Copy Markdown
Member

Summary

Add controller metrics exposure and scrape integration for both Prometheus Operator users and the bundled stack Prometheus path.

Make kubex-automation-stack chart able to discover the controller metrics service.

Document the new metrics scrape setup.

Testing

Prometheus query:

curl -s 'http://127.0.0.1:9090/api/v1/query?query=%7Bjob%3D%22kubernetes-service-endpointslice%22%2Cnamespace%3D%22kubex%22%2Cservice%3D%22controller-kubex-automation-engine-metrics-service%22%7D'

Result:
Verified up{service="controller-kubex-automation-engine-metrics-service"} == 1 in the local Kind cluster.

{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "up",
          "app_kubernetes_io_instance": "controller",
          "app_kubernetes_io_managed_by": "Helm",
          "app_kubernetes_io_name": "kubex-automation-engine",
          "app_kubernetes_io_version": "0.4.0",
          "control_plane": "controller-manager",
          "helm_sh_chart": "kubex-automation-engine-0.4.0",
          "instance": "10.244.0.16:8443",
          "job": "kubernetes-service-endpointslice",
          "namespace": "kubex",
          "node": "e2e-135-control-plane",
          "service": "controller-kubex-automation-engine-metrics-service"
        },
        "value": [1778090924.089, "1"
        ]
      },
      {
        "metric": {
          "__name__": "scrape_duration_seconds",
          "app_kubernetes_io_instance": "controller",
          "app_kubernetes_io_managed_by": "Helm",
          "app_kubernetes_io_name": "kubex-automation-engine",
          "app_kubernetes_io_version": "0.4.0",
          "control_plane": "controller-manager",
          "helm_sh_chart": "kubex-automation-engine-0.4.0",
          "instance": "10.244.0.16:8443",
          "job": "kubernetes-service-endpointslice",
          "namespace": "kubex",
          "node": "e2e-135-control-plane",
          "service": "controller-kubex-automation-engine-metrics-service"
        },
        "value": [1778090924.089, "0.003503145"
        ]
      },
      {
        "metric": {
          "__name__": "scrape_samples_scraped",
          "app_kubernetes_io_instance": "controller",
          "app_kubernetes_io_managed_by": "Helm",
          "app_kubernetes_io_name": "kubex-automation-engine",
          "app_kubernetes_io_version": "0.4.0",
          "control_plane": "controller-manager",
          "helm_sh_chart": "kubex-automation-engine-0.4.0",
          "instance": "10.244.0.16:8443",
          "job": "kubernetes-service-endpointslice",
          "namespace": "kubex",
          "node": "e2e-135-control-plane",
          "service": "controller-kubex-automation-engine-metrics-service"
        },
        "value": [1778090924.089, "699"
        ]
      },
      {
        "metric": {
          "__name__": "scrape_samples_post_metric_relabeling",
          "app_kubernetes_io_instance": "controller",
          "app_kubernetes_io_managed_by": "Helm",
          "app_kubernetes_io_name": "kubex-automation-engine",
          "app_kubernetes_io_version": "0.4.0",
          "control_plane": "controller-manager",
          "helm_sh_chart": "kubex-automation-engine-0.4.0",
          "instance": "10.244.0.16:8443",
          "job": "kubernetes-service-endpointslice",
          "namespace": "kubex",
          "node": "e2e-135-control-plane",
          "service": "controller-kubex-automation-engine-metrics-service"
        },
        "value": [1778090924.089, "0"
        ]
      },
      {
        "metric": {
          "__name__": "scrape_series_added",
          "app_kubernetes_io_instance": "controller",
          "app_kubernetes_io_managed_by": "Helm",
          "app_kubernetes_io_name": "kubex-automation-engine",
          "app_kubernetes_io_version": "0.4.0",
          "control_plane": "controller-manager",
          "helm_sh_chart": "kubex-automation-engine-0.4.0",
          "instance": "10.244.0.16:8443",
          "job": "kubernetes-service-endpointslice",
          "namespace": "kubex",
          "node": "e2e-135-control-plane",
          "service": "controller-kubex-automation-engine-metrics-service"
        },
        "value": [1778090924.089, "0"
        ]
      }
    ]
  }
}

@gasarekubex gasarekubex requested a review from a team as a code owner May 6, 2026 18:18
Copy link
Copy Markdown
Contributor

@kubexautomation kubexautomation Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CONTENT OF THIS REVIEW IS AI GENERATED

Overall assessment: This is a focused, low-risk single-line change that adds the kubex-automation-engine metrics service to the Prometheus endpointslice scrape allowlist. The intent is clear, the testing evidence is solid, and the change is confined to a single value. One minor regex hygiene issue and a missing chart version bump are worth addressing.

Risk level: Low

Critical issues:

  • None

Major issues:

  • None

Minor issues:

  1. Redundant trailing .* inside the new alternative (Location: charts/kubex-automation-stack/values.yaml, changed line — the new regex value)

    The new alternative is .*kubex-automation-engine.*metrics-service.*, but the outer regex already has a trailing .* applied to the entire group (…).*. The extra .* at the end of the inner alternative is redundant, adds noise, and makes the regex slightly harder to audit.

    Suggested fix — make the inner alternative consistent with its siblings:

    regex: '((kubex|densify)-(kube-state-metrics|prometheus-node-exporter|ephemeral-storage-collector)|.*dcgm|k8s-ephemeral-storage-metrics|.*kubex-automation-engine.*metrics-service).*'
  2. Over-broad leading wildcard (Confidence: Low — may be intentional)

    The leading .* in .*kubex-automation-engine.*metrics-service means any endpointslice name containing kubex-automation-engine and later metrics-service will be scraped, regardless of prefix. Contrast this with the other fixed-prefix alternatives (kubex-kube-state-metrics, densify-prometheus-node-exporter, etc.). If the Helm release name is always controller (as shown in the test output), a more anchored pattern such as controller-kubex-automation-engine-metrics-service would be less permissive. This is low-confidence feedback — the .* may be intentional to handle arbitrary release name prefixes.

  3. No chart version bump (Confidence: Medium)

    The charts/kubex-automation-stack/Chart.yaml does not appear to have been updated in this PR. Adding a new scrape target to values.yaml is a functional behaviour change for users of the bundled Prometheus stack, so a patch-version bump (e.g., 0.x.y → 0.x.(y+1)) and a CHANGELOG/NOTES.txt entry would follow standard Helm chart conventions and aid in upgradability tracking.

DRY improvement opportunities:

  • None in scope for this change.

Suggested next steps:

  1. Remove the redundant trailing .* from inside the new regex alternative.
  2. Consider whether the leading .* in .*kubex-automation-engine.*metrics-service should be anchored more tightly (e.g., use a known release-name prefix), or document in a comment why the wildcard prefix is required.
  3. Bump the kubex-automation-stack chart patch version in Chart.yaml to reflect the behavioural change in scrape configuration.

Copy link
Copy Markdown
Contributor

@kubexautomation kubexautomation Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CONTENT OF THIS REVIEW IS AI GENERATED


Overall assessment: This is a well-scoped, minimal change that adds the kubex-automation-engine controller metrics service to the Prometheus endpointslice scrape allowlist and correctly bumps the chart version (1.0.8 → 1.0.9). However, the change introduces a correctness risk around TLS handling and a likely silent data-loss issue due to the existing metric_relabel_configs allowlist that was not updated to admit the new service's metrics.

Risk level: Medium


Critical issues:

  • None

Major issues:

  1. scheme / TLS mismatch (values.yaml line 219)
    The test output shows the controller metrics run on port 8443 (HTTPS). The kubernetes-service-endpointslice job uses scheme: http unless the Service carries prometheus.io/scheme: https. If that annotation is absent, all scrapes will silently fail (up == 0). The PR should either confirm the annotation is present in the engine chart or add a dedicated HTTPS-capable scrape job.

Minor issues:

  1. metric_relabel_configs allowlist drops all controller metrics (values.yaml line 219)
    The existing metric_relabel_configs in the kubernetes-service-endpointslice job is a tightly-scoped allowlist for kube-state-metrics, node-exporter, DCGM, and ephemeral storage metrics. kubex-automation-engine metrics (controller-runtime, custom business metrics) will be scraped and then entirely dropped. If the intent is to expose controller metrics to users of the stack, the allowlist must be extended with the relevant metric name patterns — or a separate job without the allowlist should be used.

  2. Over-broad leading .* in the regex alternative (values.yaml line 219)
    The pattern .*kubex-automation-engine.*metrics-service can match unintended endpointslices. All other alternatives in the group use fixed prefixes. If the Helm release name is always controller, consider anchoring to controller-kubex-automation-engine-metrics-service. If arbitrary release names must be supported, document the reason with an inline comment.

DRY improvement opportunities:

  • The PR description mentions "Prometheus Operator users" but no ServiceMonitor CRD resource changes appear in the diff. If Prometheus Operator support was deferred, tracking it as a follow-up issue would be useful.

Suggested next steps:

  1. Confirm (or enforce via chart annotations) that the kubex-automation-engine metrics Service carries prometheus.io/scheme: https — or create a dedicated scrape job that hard-codes scheme: https and TLS config.
  2. Extend the metric_relabel_configs keep-regex to include kubex-automation-engine metric name patterns, or isolate the controller scrape in a separate job without the restrictive allowlist.
  3. Consider tightening the endpointslice regex to avoid inadvertent scraping of similarly-named services.

Comment thread charts/kubex-automation-stack/values.yaml Outdated
Comment thread charts/kubex-automation-stack/values.yaml Outdated
Comment thread charts/kubex-automation-stack/values.yaml Outdated
Comment thread charts/kubex-automation-stack/values.yaml Outdated
Copy link
Copy Markdown
Contributor

@kubexautomation kubexautomation Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CONTENT OF THIS REVIEW IS AI GENERATED

Overall assessment: This PR correctly extends the endpointslice scrape allowlist and updates the metric_relabel_configs keep-regex to admit controller metrics. The chart version bump (1.0.8 → 1.0.9) is present. However, two structural issues introduce correctness risk — the TLS/scheme dependency on a Service annotation that may not be set, and the introduction of very broad metric-name wildcards (go_.*, process_.*, workqueue_.*, etc.) into a shared, tightly-scoped allowlist.

Risk level: Medium


Critical issues:

  • None

Major issues:

  1. TLS scheme dependency on Service annotation (values.yaml line 219)
    The kubernetes-service-endpointslice job only uses HTTPS when the Service carries prometheus.io/scheme: https. The controller metrics run on port 8443 (HTTPS). If the engine chart does not set this annotation by default, all scrapes will silently fail (up == 0) in user environments even though the PR's local Kind test passed. A dedicated job with hard-coded scheme: https (like the kubernetes-apiservers job) removes this fragile annotation dependency.

  2. Broad metric-name wildcards in a shared allowlist (values.yaml line 228)
    The new patterns go_.*, process_.*, workqueue_.*, controller_runtime_.*, and rest_client_.* apply to all targets scraped by kubernetes-service-endpointslice — not just the controller metrics service. Any other Go-based workload added to the endpointslice allowlist in the future will have its Go runtime metrics retained without an explicit choice. A dedicated scrape job for the controller metrics keeps the allowlist intent intact and avoids this cross-contamination.

Minor issues:

  1. Over-broad leading .* in endpointslice regex (values.yaml line 219)
    The pattern .*kubex-automation-engine.*metrics-service is more permissive than all peer alternatives. Tightening to .*-kubex-automation-engine-metrics-service (or anchoring to the known release prefix controller-) reduces inadvertent matches. If the wildcard is intentional, a short inline comment stating why would prevent future regressions.

DRY improvement opportunities:

  • Both the endpointslice filter and the metric allowlist would be cleaner and safer if the controller metrics used a dedicated scrape job rather than being co-located in kubernetes-service-endpointslice. This is a design suggestion, not a blocker — but it resolves issues #1, #2, and #3 above in one move.

Suggested next steps:

  1. Confirm or enforce that the kubex-automation-engine metrics Service has prometheus.io/scheme: https. If that cannot be guaranteed, create a dedicated kubernetes-controller-metrics scrape job with scheme: https and in-cluster TLS config.
  2. Move the controller-specific metric-name patterns (controller_runtime_.*, go_.*, process_.*, workqueue_.*, rest_client_.*) out of the shared kubernetes-service-endpointslice allowlist — either into a dedicated job or guarded by a service-label condition.
  3. Tighten or document the leading .* in the endpointslice keep-regex.

Comment thread charts/kubex-automation-stack/values.yaml Outdated
Comment thread charts/kubex-automation-stack/values.yaml Outdated
Comment thread charts/kubex-automation-stack/values.yaml Outdated
Copy link
Copy Markdown
Contributor

@kubexautomation kubexautomation Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CONTENT OF THIS REVIEW IS AI GENERATED

Overall assessment: This PR correctly introduces a dedicated scrape job (kubex-automation-engine-metrics-endpointslice) for the controller metrics service rather than co-opting the shared kubernetes-service-endpointslice job — a sound architectural decision that avoids polluting the shared metric_relabel_configs allowlist. The new job also correctly omits metric_relabel_configs entirely, so all controller metrics pass through unfiltered. The main risk is that TLS and authentication are not hard-coded in the new job, making it fragile when Service annotations are absent.

Risk level: Medium


Critical issues:

  • None

Major issues:

  1. Missing TLS config and bearer token (values.yaml, line 232 — new job declaration)
    The new job relies entirely on the prometheus.io/scheme annotation to select HTTPS, and provides no bearer_token_file. The controller metrics endpoint runs on port 8443 (HTTPS, likely RBAC-protected). If the Service lacks prometheus.io/scheme: https, the job defaults to HTTP and scrapes fail silently. If the endpoint requires a bearer token (standard for controller-manager metrics), it will return 401 regardless of TLS. The kubernetes-apiservers job in the same file shows the correct pattern: hard-code scheme: https, tls_config, and bearer_token_file.

  2. Verbatim copy-paste of 10 relabel_configs rules (values.yaml, lines 237–268)
    The entire relabel_configs block is duplicated from kubernetes-service-endpointslice with only the keep-filter regex changed. Any future maintenance change to the shared relabelling logic must be applied in two places with no mechanical enforcement. A Helm named template in _helpers.tpl or a YAML anchor would eliminate this duplication.


Minor issues:

  1. Over-broad and redundant regex (values.yaml, line 260)
    '.*kubex-automation-engine.*metrics-service.*' has a leading .* that permits accidental matches in multi-tenant clusters, and a trailing .* that is functionally redundant. All peer alternatives use anchored prefixes. Consider tightening to 'controller-kubex-automation-engine-metrics-service' (or document why a wildcard prefix is required).

  2. Version skip: 1.0.8 → 1.0.10 (Chart.yaml, line 4)
    Skipping 1.0.9 creates a gap in the release history. If 1.0.9 was never published this is a safe sequential bump to 1.0.9. If it was published separately, document the reason for the skip in the PR description.


DRY improvement opportunities:

  • The shared relabel_configs logic (scheme override, path override, address rewrite, label-map, namespace/service/node label extraction) is now duplicated across two jobs. Extracting it into a Helm named template is the highest-value refactoring opportunity in this file.

Suggested next steps:

  1. Hard-code scheme: https, tls_config.ca_file, and bearer_token_file in the new job (mirroring the kubernetes-apiservers job), or confirm and document that the kubex-automation-engine metrics Service always carries prometheus.io/scheme: https.
  2. Extract the shared relabel_configs block into a Helm named template to eliminate the copy-paste duplication.
  3. Tighten the endpointslice keep-regex or add an inline comment explaining the wildcard prefix.
  4. Reconcile the version skip (1.0.8 → 1.0.10) with an explanation or correct to 1.0.9.

Comment thread charts/kubex-automation-stack/values.yaml Outdated
Comment thread charts/kubex-automation-stack/values.yaml Outdated
Comment thread charts/kubex-automation-stack/values.yaml
Comment thread charts/kubex-automation-stack/Chart.yaml Outdated
Copy link
Copy Markdown
Contributor

@kubexautomation kubexautomation Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CONTENT OF THIS REVIEW IS AI GENERATED

Overall assessment: This PR introduces a well-structured dedicated scrape job for the kubex-automation-engine controller metrics, correctly avoiding co-option of the shared kubernetes-service-endpointslice job and its tight metric_relabel_configs allowlist. The architecture is sound. However, one correctness risk remains around TLS/auth configuration, and there are minor issues with the endpointslice regex and the version bump.

Risk level: Medium


Critical issues:

  • None

Major issues:

  1. scheme: http vs observed HTTPS port + missing bearer token (values.yaml line 232)
    The inline comment says the controller exposes metrics over HTTP on port 8080, but the PR's own test output shows the scraped instance at 10.244.0.16:8443 — an HTTPS port. scheme: http combined with no bearer_token_file means that in any environment where the endpoint is actually HTTPS/RBAC-protected, every scrape will fail silently with a TLS error or a 401. The kubernetes-apiservers job in the same file shows the correct pattern: hard-coded scheme: https, tls_config.ca_file, and bearer_token_file. This should be reconciled before merging.

Minor issues:

  1. Over-broad and partially redundant endpointslice keep-regex (values.yaml line 258)
    '.*kubex-automation-engine.*metrics-service.*' has a leading .* that allows accidental matches in multi-tenant clusters (all peer alternatives use fixed prefixes), and a trailing .* that is redundant in Prometheus relabel action: keep regex matching. Consider anchoring to 'controller-kubex-automation-engine-metrics-service' or documenting why a wildcard prefix is required.

  2. Version skip: 1.0.8 → 1.0.11 (Chart.yaml line 4)
    The chart version jumps two patch levels with no explanation. If 1.0.9 and 1.0.10 were never published this should be 1.0.9. If they were published, the PR description should explain the skip.

DRY improvement opportunities:

  • The 10-rule relabel_configs block is copy-pasted verbatim from the kubernetes-service-endpointslice job. Extracting the shared rules into a Helm named template (_helpers.tpl) would eliminate the duplication and ensure the two jobs stay in sync during future maintenance.

Suggested next steps:

  1. Reconcile scheme: http with the observed HTTPS/8443 endpoint. If HTTPS, add scheme: https, tls_config.ca_file, and bearer_token_file (mirror the kubernetes-apiservers job). Update the inline comment accordingly.
  2. Tighten the endpointslice keep-regex or add an inline comment explaining why a wildcard prefix is required.
  3. Correct the chart version to 1.0.9 (or document the skip to 1.0.11).
  4. (Non-blocking) Extract shared relabel_configs into a Helm named template to eliminate copy-paste drift.

Comment thread charts/kubex-automation-stack/values.yaml Outdated
Comment thread charts/kubex-automation-stack/values.yaml
Comment thread charts/kubex-automation-stack/Chart.yaml Outdated
Comment thread charts/kubex-automation-stack/values.yaml
Copy link
Copy Markdown
Contributor

@kubexautomation kubexautomation Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CONTENT OF THIS REVIEW IS AI GENERATED

Overall assessment: This PR adds a new dedicated Prometheus scrape job for the kubex-automation-engine controller metrics endpoint. The approach is reasonable, but there is a critical factual contradiction in the code (HTTP vs HTTPS) that must be resolved before merging, and the absence of metric_relabel_configs represents an intentional (but undocumented) architectural departure from the rest of the file.

Risk level: Medium


Critical / Major issues:

  • HTTP/HTTPS contradiction (values.yaml:232): The new job hard-codes scheme: http and an inline comment says the port is 8080 (HTTP), but the PR's own verified test output shows the instance at 10.244.0.16:8443 — a standard HTTPS port. If the endpoint is HTTPS, every scrape will silently fail (up == 0) on a fresh install. Additionally, if the endpoint is RBAC-protected (common for controller-manager metrics), a bearer_token_file is required — see the existing kubernetes-apiservers job for the established pattern in this file. This must be clarified before merge.

  • No metric_relabel_configs (values.yaml:258 area): All other scrape jobs in this file use a tightly-scoped allowlist to control stored metric cardinality. This new job stores everything the controller emits (699 samples in test). That may be intentional for full controller observability, but it should be explicitly documented — either with a comment or by adding a scoped allowlist.


Minor issues:

  • Overly broad keep-regex (values.yaml:261): '.*kubex-automation-engine.*metrics-service.*' uses open-ended wildcards on both sides. All peer patterns in this file use fixed prefixes. The PR test output shows the name is controller-kubex-automation-engine-metrics-service; consider anchoring with .+-kubex-automation-engine-metrics-service or documenting the intent of the broad pattern. The trailing .* is also redundant.

  • Chart version (Chart.yaml:4): The version jumps from 1.0.8 to 1.0.9 in this PR, which is fine if 1.0.9 has not been published. If other changes are already staged for 1.0.9 or it has already been released, coordinate to avoid a version collision.


DRY improvement opportunity:

  • Copy-pasted relabel_configs (values.yaml:236–267): The 9-rule relabel block is a near-verbatim copy of the block in kubernetes-service-endpointslice. As a Helm chart, extracting shared rules into a _helpers.tpl named template would eliminate future drift. Non-blocking, but worth tracking as a follow-up.

Suggested next steps:

  1. Clarify and fix the HTTP vs HTTPS discrepancy — confirm the actual protocol/port and update scheme, tls_config, and bearer_token_file accordingly.
  2. Add a metric_relabel_configs block (or an explicit comment) documenting which controller metrics are intentionally stored.
  3. Tighten the endpointslice keep-regex to prevent unintended multi-tenant matches.
  4. (Follow-up) Extract shared relabel rules into a _helpers.tpl template.

Comment thread charts/kubex-automation-stack/values.yaml Outdated
Comment thread charts/kubex-automation-stack/values.yaml
Comment thread charts/kubex-automation-stack/values.yaml Outdated
Comment thread charts/kubex-automation-stack/values.yaml
Copy link
Copy Markdown
Contributor

@kubexautomation kubexautomation Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CONTENT OF THIS REVIEW IS AI GENERATED

Overall assessment: The PR makes a good structural improvement by introducing YAML anchors (&/*) to DRY up the shared relabel_configs between the two scrape jobs — this directly addresses duplication that was flagged in earlier review rounds. However, the refactoring introduced two behavioral regressions in the existing kubernetes-service-endpointslice job, and the new controller metrics job has an unresolved HTTP/HTTPS ambiguity that could cause silent scrape failures.

Risk level: Medium


Critical issues

None.


Major issues

  1. service and node labels silently dropped from kubernetes-service-endpointslice (regression)
    Location: charts/kubex-automation-stack/values.yaml (diff context around line 219)
    The two relabel rules that populated the service (from __meta_kubernetes_service_name) and node (from __meta_kubernetes_pod_node_name) labels were removed as part of the YAML anchor refactoring. These labels are not part of the shared anchors, so every time series scraped by the existing job (kube_*, node_*, DCGM, ephemeral storage metrics) will lose those labels after a chart upgrade. Any dashboard or alert filtering on {service="..."} or {node="..."} for those metrics will break silently.

  2. scheme: http hard-coded on new job, but test shows port 8443 (HTTPS)
    Location: charts/kubex-automation-stack/values.yaml:231
    The PR's test output shows instance: 10.244.0.16:8443. Port 8443 is universally HTTPS. The new job neither inherits the annotation-driven scheme rule (it was excluded from the shared anchors) nor provides TLS/bearer token config. If the endpoint is HTTPS, every scrape will fail with a TLS handshake error (up == 0). Needs explicit clarification and, if HTTPS, scheme: https + tls_config + bearer_token_file.

  3. prometheus.io/scheme annotation rule moved to after the keep filter in kubernetes-service-endpointslice
    Location: charts/kubex-automation-stack/values.yaml:222
    The scheme-detection rule was repositioned lower in the relabel_configs block. While functionally harmless in current Prometheus versions (since __scheme__ is consumed at scrape time), the new placement buries it after the keep filter and reduces clarity. It now appears after namespace assignment rather than as the first visible signal that HTTPS endpoints are supported. At minimum, an inline comment should explain the reordering.


Minor issues

  1. Undocumented .+ wildcard prefix in new job's keep regex
    Location: charts/kubex-automation-stack/values.yaml:248
    The regex .+-kubex-automation-engine-metrics-service uses .+ to accommodate any Helm release name prefix. All other patterns in the sibling job use fixed prefixes. Without a comment explaining the intent, future maintainers may tighten this inadvertently (or leave it too loose in multi-tenant clusters). Add an inline comment.

  2. No metric_relabel_configs on new job — intentional but underdocumented
    Location: charts/kubex-automation-stack/values.yaml — new job block (after line 248)
    The inline comment (# Store the controller-runtime, Go, process, and workqueue metrics as-is.) partially documents the intent. Consider enumerating the expected metric families (controller_runtime_, go_, process_, workqueue_) to make future audits easier and set a clear boundary for what's expected.


DRY improvement opportunities

  • The YAML anchor approach (&/*) is the right move and correctly DRYs up the shared relabel rules. No further DRY work is needed for this PR; the Helm _helpers.tpl approach suggested in earlier reviews is no longer necessary given the anchor solution.

Suggested next steps

  1. Confirm whether service and node label removal is intentional. If it is an accidental side-effect of the anchor refactoring, restore those two rules to the kubernetes-service-endpointslice job (they should not be part of the shared anchors since the new job doesn't need them).
  2. Resolve the HTTP vs HTTPS ambiguity for the controller metrics endpoint and update scheme: and TLS/auth config accordingly on the new job.
  3. Move or annotate the scheme rule in the existing job so its purpose remains obvious to readers.
  4. Add a comment on the .+ wildcard in the new job's keep regex explaining that it intentionally supports variable Helm release name prefixes.

Comment thread charts/kubex-automation-stack/values.yaml
Comment thread charts/kubex-automation-stack/values.yaml
Comment thread charts/kubex-automation-stack/values.yaml Outdated
Comment thread charts/kubex-automation-stack/values.yaml
Copy link
Copy Markdown
Contributor

@kubexautomation kubexautomation Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CONTENT OF THIS REVIEW IS AI GENERATED

Overall assessment: The PR introduces a well-structured dedicated scrape job for kubex-automation-engine controller metrics. The YAML anchor (&/*) approach cleanly solves the DRY problem that earlier review iterations flagged, and correctly preserves service and node labels in the existing kubernetes-service-endpointslice job via the anchors. The architecture is sound. One factual contradiction must be resolved before merging, and a missing authentication consideration should be addressed.

Risk level: Medium


Critical issues:

  • None

Major issues:

  1. HTTP/HTTPS contradiction — scheme: http vs port 8443 in test output (values.yaml line 240)
    The job hard-codes scheme: http and the inline comment says port 8080 (HTTP), but the PR's own verified test output shows instance: 10.244.0.16:8443 — an HTTPS port. Additionally, unlike the sibling kubernetes-service-endpointslice job, the new job's relabel_configs does not include the prometheus.io/scheme annotation-override rule (intentionally excluded from the shared anchors), so there is no annotation-based escape valve. If the endpoint is HTTPS, every scrape will fail with up == 0 on a fresh install, silently.
    Must be resolved before merge. Confirm the actual protocol/port and either keep scheme: http (with an accurate comment) or switch to scheme: https with tls_config + bearer_token_file, mirroring the kubernetes-apiservers job pattern.

  2. No bearer_token_file (values.yaml line 240)
    Controller-runtime/kubebuilder metrics endpoints are commonly RBAC-protected. Without bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token, Prometheus will receive 401 Unauthorized if the endpoint requires authentication. The existing kubernetes-apiservers job in this file always sets the bearer token. If the kubex-automation-engine metrics endpoint is explicitly unauthenticated, that should be confirmed with an inline comment; otherwise the token should be added.


Minor issues:

  1. Endpointslice keep-regex may match suffix variants (values.yaml line 260)
    .+-kubex-automation-engine-metrics-service uses .+ with implicit Prometheus full-string anchoring (Go regexp.MatchString with full-value matching for action: keep). The inline comment correctly explains the variable-prefix intent. No change strictly required, but explicitly documenting that Prometheus matches the full endpointslice name string (not a substring) would remove ambiguity for future maintainers.

  2. No documented metric cardinality boundary for the new job (values.yaml trailing comment on new job)
    The comment # Store the controller-runtime, Go, process, and workqueue metrics as-is. correctly signals the unbounded-storage intent, but names the metric families only partially. Enumerating the full expected set (controller_runtime_*, go_*, process_*, workqueue_*, rest_client_*) and noting where to add an allowlist if cardinality becomes a concern would make this a more durable guideline.


DRY improvement opportunities:

  • The YAML anchor approach is the right solution for this codebase and has been correctly implemented. No further DRY work is needed in this PR. The _helpers.tpl approach suggested in earlier reviews is superseded.

Suggested next steps:

  1. Clarify and fix the scheme/port discrepancy — confirm HTTP/8080 or HTTPS/8443, update the job config and inline comment to match reality.
  2. Add bearer_token_file (or explicitly document that the endpoint is unauthenticated).
  3. (Optional) Expand the trailing comment on the new job to enumerate all expected metric families as a cardinality guardrail.

Comment thread charts/kubex-automation-stack/values.yaml Outdated
Comment thread charts/kubex-automation-stack/values.yaml Outdated
Comment thread charts/kubex-automation-stack/values.yaml
Comment thread charts/kubex-automation-stack/values.yaml
Copy link
Copy Markdown
Contributor

@kubexautomation kubexautomation Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CONTENT OF THIS REVIEW IS AI GENERATED

Overall Assessment

This PR successfully introduces a dedicated Prometheus scrape job for kubex-automation-engine controller metrics and uses YAML anchors (&/*) to eliminate the previous copy-paste duplication of relabel_config rules — a genuine improvement over the earlier approach. The chart version bump (1.0.8 → 1.0.9) is sequential and correct. The keep filter reordering (moved after service/node label rules) is functionally safe since __meta_kubernetes_endpointslice_name is a discovery meta-label available from the start of relabelling.

The one outstanding blocker is an unresolved contradiction between the inline comment (port 8080, plain HTTP) and the PR's own verified test output (instance: 10.244.0.16:8443, an HTTPS port). This ambiguity must be resolved before merge to avoid silent scrape failures (up == 0) for chart consumers.

Risk level: Medium


Critical issues

None.

Major issues

  • scheme: http vs port 8443 contradiction (values.yaml line 240): The inline comment claims HTTP/8080 but the PR test output shows HTTPS/8443. Because the shared YAML anchors deliberately exclude the prometheus.io/scheme annotation-override rule, there is no runtime escape valve — http is hardwired for this job. If the endpoint is actually HTTPS, every scrape on a fresh install will silently fail.
  • Missing bearer_token_file (values.yaml line 241): No comment confirms the endpoint is unauthenticated. Controller-runtime operators typically RBAC-protect their /metrics port; if that is the case here, Prometheus will receive 401 Unauthorized. The existing kubernetes-apiservers job in this file sets bearer_token_file explicitly — the new job should either do the same or clearly document why it is not needed.

Minor issues

  • Regex end-of-string anchoring (values.yaml line 260): .+-kubex-automation-engine-metrics-service may match service names with unexpected suffixes (e.g., -canary, -v2) depending on whether Prometheus applies full-string or substring matching semantics. Adding a trailing $ or a confirming comment about Prometheus's implicit anchoring behaviour would eliminate the ambiguity.
  • Trailing comment lacks cardinality expectations (values.yaml line 261): All other jobs in this file use strict metric_relabel_configs allowlists. The new job intentionally omits them, but the comment should enumerate the expected metric families and note that a metric_relabel_configs block should be added if cardinality becomes a concern.

DRY improvement opportunities

  • The YAML anchor approach introduced in this PR is the right direction and correctly avoids the copy-paste problem flagged in earlier review rounds. No further DRY action is needed at this time.

Suggested next steps

  1. Clarify the protocol/port used by the kubex-automation-engine metrics Service. If it is HTTP/8080, update the PR test evidence or explain the 8443 instance. If it is HTTPS/8443, replace scheme: http with scheme: https + tls_config + bearer_token_file.
  2. Confirm or document authentication requirements for the /metrics endpoint in the kubex-automation-engine chart. If unauthenticated, say so explicitly in the comment.
  3. Optionally anchor the keep-regex with $ and expand the trailing comment to enumerate expected metric families for future maintainability.

Comment thread charts/kubex-automation-stack/values.yaml Outdated
Comment thread charts/kubex-automation-stack/values.yaml
Comment thread charts/kubex-automation-stack/values.yaml
Comment thread charts/kubex-automation-stack/values.yaml
@gasarekubex
Copy link
Copy Markdown
Member Author

Follow-up on the latest review items:

  • The controller metrics endpoint is intentionally plain HTTP on port 8080 in the current chart defaults.
  • The endpoint is intentionally unauthenticated, so no bearer_token_file is required for the dedicated kubex-automation-engine-metrics-endpointslice scrape job.
  • The dedicated controller scrape job is expected to store metrics as-is; the chart now documents the intended families: controller_runtime_*, go_*, process_*, workqueue_*, rest_client_*, and automation_controller_*.
  • The EndpointSlice keep regex comment now documents that the prefix is intentionally flexible for variable Helm release names and the generated EndpointSlice naming pattern.
  • The shared kubernetes-service-endpointslice job keeps its service/node relabels and its prometheus.io/scheme rule is back at the top for readability.

Validation rerun after these follow-ups:

  • helm lint charts/kubex-automation-stack --set container-optimization-data-forwarder.config.forwarder.densify.url.scheme=http --set container-optimization-data-forwarder.config.forwarder.densify.url.host=k8sengv.dom.densify.com --set container-optimization-data-forwarder.config.forwarder.densify.url.port=30748 --set container-optimization-data-forwarder.config.clusters[0].name=kubex-local --set stack.densify.username=test --set stack.densify.encrypted_password=test
  • helm lint charts/kubex-automation-engine --set kubex.url.host=k8sengv.dom.densify.com --set kubex.clusterName=kubex-local --set kubexCredentials.username=test --set kubexCredentials.epassword=test

All GitHub review threads on this PR are now resolved.

Copy link
Copy Markdown
Contributor

@kubexautomation kubexautomation Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CONTENT OF THIS REVIEW IS AI GENERATED

Overall assessment

This PR introduces a well-structured approach to reusing Prometheus relabel config blocks via YAML anchors/aliases, and the dedicated scrape job for controller metrics is the right architectural choice. However, there is a critical contradiction between the inline comment and the PR's own test evidence regarding HTTP vs HTTPS, plus a missing metric_relabel_configs allowlist that creates an open-ended cardinality risk in production.

Risk level: Medium


Critical issues

  • scheme: http vs HTTPS/8443 contradiction (values.yaml:240): The PR test output shows the scrape target at port 8443 (HTTPS), but the inline comment and scheme: http claim port 8080/HTTP. One of these is wrong. If the endpoint is actually HTTPS and requires TLS + a bearer token, this job will produce zero data silently.

Major issues

  • No metric_relabel_configs allowlist (values.yaml:265): Every other job in this file gates metrics through a strict allowlist. Leaving the new job completely open stores go_*, process_*, workqueue_*, and rest_client_* metrics unbounded — families that are high-cardinality and shared with many other in-cluster components.
  • Unanchored keep-regex (values.yaml:261): .+-kubex-automation-engine-metrics-service will match any EndpointSlice name ending with that suffix in any namespace, potentially scraping unintended targets in multi-tenant or multi-release deployments.

Minor issues

  • prometheus.io/port annotation dependency (values.yaml:243): The reused *kubexEndpointsliceAddress alias relies on the prometheus.io/port annotation to set the correct port. Verify the controller metrics service carries this annotation; if not, the port is inferred from the EndpointSlice and may not match the expected 8080.
  • Relabel rule reordering in the existing job (values.yaml:225): The service/node label assignments were moved ahead of the endpointslice_name keep-filter to enable anchor placement. Functionally safe, but worth confirming the intent is deliberate.

DRY improvement opportunities

  • The YAML anchor/alias pattern used here is a good improvement for this file, but anchors in values.yaml are YAML-native only — they do not survive helm template rendering for users who override values.yaml externally (e.g., via --set or a separate values.yaml file). This is an inherent limitation of the approach. A Helm _helpers.tpl named template would be more robust for cross-file reuse, but the YAML anchor approach is acceptable within a single values.yaml.

Suggested next steps

  1. Resolve the HTTP/HTTPS contradiction by checking the actual --metrics-bind-address flag on the controller deployment and matching scheme, tls_config, and bearer_token_file accordingly.
  2. Add a metric_relabel_configs allowlist scoped to automation_controller_* and the specific controller_runtime_* families you need, rather than leaving the job open.
  3. Tighten the EndpointSlice keep-regex or add a namespace filter to prevent cross-tenant scraping.

Comment thread charts/kubex-automation-stack/values.yaml Outdated
Comment thread charts/kubex-automation-stack/values.yaml
Comment thread charts/kubex-automation-stack/values.yaml
Comment thread charts/kubex-automation-stack/values.yaml
Comment thread charts/kubex-automation-stack/values.yaml
@gasarekubex
Copy link
Copy Markdown
Member Author

Fresh end-to-end validation on a new Kind cluster surfaced one real issue in the dedicated controller scrape job and confirmed the rest of the recent review concerns were either stale or intentional:

  • Fixed: the dedicated kubex-automation-engine-metrics-endpointslice job now keeps targets by __meta_kubernetes_service_name instead of the EndpointSlice name. In practice this cleanly selects the controller metrics Service and avoids noisy discovery across unrelated targets.
  • Confirmed current controller defaults from source + fresh install: metrics are exposed on plain HTTP on port 8080, unauthenticated.
  • Confirmed multi-release intent: the service-name regex intentionally supports multiple Helm release prefixes sharing the -kubex-automation-engine-metrics-service suffix.

Fresh validation results after the fix:

  • controller metrics Service annotations rendered as expected: prometheus.io/scheme=http, prometheus.io/port=8080, prometheus.io/path=/metrics
  • Prometheus target: job="kubex-automation-engine-metrics-endpointslice", instance="10.244.0.3:8080"
  • Query result: up{job="kubex-automation-engine-metrics-endpointslice"} == 1
  • Total series for that job: 703
  • Matching expected controller/go/process/workqueue/rest_client/automation families: 695

This also means the earlier :8443 evidence referenced in prior review rounds was stale and no longer reflects the current chart defaults.

Copy link
Copy Markdown
Contributor

@kubexautomation kubexautomation Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CONTENT OF THIS REVIEW IS AI GENERATED

Overall assessment: This PR cleanly adds controller metrics scraping via a new dedicated Prometheus scrape job and uses YAML anchors to reduce duplication across relabel configs — a solid improvement over copy-paste. The author has already addressed many issues from prior review rounds. A few residual concerns remain, primarily around honor_labels, a reordering in the existing job, and a missing regex anchor.

Risk level: Medium


Major issues:

  • honor_labels: true in the new scrape job (values.yaml ~line 235): Allows scraped metric labels to silently override Prometheus server-side labels (job, instance, namespace). This is unsafe for a direct controller-runtime endpoint and should be removed. See inline comment.

  • keep rule moved to the end of the existing kubernetes-service-endpointslice job's relabel_configs (values.yaml ~line 228): The action: keep rule was reordered to after the service and node label-replace rules, which deviates from the original design. While Prometheus semantics are preserved, this wastes relabeling cycles on targets that are ultimately dropped and is a regression risk for future edits. The keep rule should remain earlier in the pipeline.


Minor issues:

  • Missing $ end anchor in controller job keep regex (values.yaml ~line 252): .+-kubex-automation-engine-metrics-service should be .+-kubex-automation-engine-metrics-service$ to prevent unintended suffix matches. See inline comment.

  • scheme annotation rule asymmetry (values.yaml ~line 197): The prometheus.io/scheme relabel rule applies only to the original job; the new controller job hard-codes scheme: http. This is intentional but the comment doesn't make the asymmetry explicit, which could confuse future maintainers if the controller endpoint is ever moved to HTTPS. See inline comment.


DRY improvement opportunities:

  • The YAML anchor approach introduced here (&kubexEndpointsliceMetricsPath, etc.) is a good pattern. Consider also anchoring the prometheus.io/scheme rule if a third scrape job is added in the future, to keep the pattern consistent.

Suggested next steps:

  1. Remove honor_labels: true from kubex-automation-engine-metrics-endpointslice (or explicitly justify why it is needed).
  2. Move the action: keep endpointslice-name filter rule back to before the service/node label-replace rules in kubernetes-service-endpointslice.
  3. Add $ to the controller job's keep regex: '.+-kubex-automation-engine-metrics-service$'.
  4. Clarify the comment on the scheme annotation rule to document the intentional asymmetry between the two jobs.

Comment thread charts/kubex-automation-stack/values.yaml
Comment thread charts/kubex-automation-stack/values.yaml Outdated
Comment thread charts/kubex-automation-stack/values.yaml
Comment thread charts/kubex-automation-stack/values.yaml
Copy link
Copy Markdown
Contributor

@kubexautomation kubexautomation Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CONTENT OF THIS REVIEW IS AI GENERATED

Overall assessment: The PR correctly introduces YAML anchors to enable config reuse across Prometheus scrape jobs and adds a dedicated controller metrics scrape job. The approach is sound, but there are two correctness concerns in the new job: a latent port-selection dependency on an annotation, and an absence of namespace scoping that could lead to cross-namespace scraping in multi-tenant clusters.

Risk level: Medium

Critical issues:

  • None

Major issues:

  • Port selection depends on annotation being present (values.yaml new job): *kubexEndpointsliceAddress rewrites __address__ only when prometheus.io/port is annotated on the Service. If that annotation is absent or mismatched, Prometheus scrapes the wrong port silently. The job's fixed-port intent should be enforced with a hard-coded replacement rule, not an annotation-driven one.
  • No namespace scoping on the new job (values.yaml new job, keep regex): The .+-kubex-automation-engine-metrics-service$ keep regex matches in all namespaces discovered by kubernetes_sd_configs. In multi-tenant clusters this can cause Prometheus to scrape identically-named Services in unintended namespaces. A namespace filter should be added to kubernetes_sd_configs or as an explicit __meta_kubernetes_namespace keep rule.

Minor issues:

  • Reordering of action: keep in the existing job (kubernetes-service-endpointslice): Moving the endpointslice_name keep filter after the service/node relabels is an unannounced behavioral change to a pre-existing job. Prometheus still works correctly, but unnecessary relabeling work is performed on targets that are subsequently dropped, and the intent is not documented.
  • Reuse of annotation-driven anchors in a fixed-endpoint job: *kubexEndpointsliceMetricsPath and *kubexEndpointsliceParamLabels introduce annotation-driven override paths into a job explicitly designed to ignore annotation-driven scheme changes. This contradicts the documented intent and may surprise future maintainers.

DRY improvement opportunities:

  • The YAML anchor strategy (&kubexEndpointslice* / *kubexEndpointslice*) is a good pattern for this config. Consider whether future jobs (e.g., a ServiceMonitor path or additional controller releases) should also reuse these, and document the anchor conventions near the top of the scrape_configs block.

Suggested next steps:

  1. Add a hard-coded __address__ replacement rule in the new job (e.g., replacement: $1:8080) to enforce port 8080 independent of annotation state, and remove reliance on *kubexEndpointsliceAddress in this job.
  2. Add a namespace keep rule before the service-name keep rule in the new job, scoping discovery to the release namespace to prevent cross-namespace scraping.
  3. Consider restoring the endpointslice_name keep filter to its original position in kubernetes-service-endpointslice, or add a comment explaining why it was moved.

Comment thread charts/kubex-automation-stack/values.yaml Outdated
Comment thread charts/kubex-automation-stack/values.yaml
Comment thread charts/kubex-automation-stack/values.yaml
Comment thread charts/kubex-automation-stack/values.yaml
Copy link
Copy Markdown
Contributor

@kubexautomation kubexautomation Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CONTENT OF THIS REVIEW IS AI GENERATED

Overall Assessment

The PR's approach — adding YAML anchors to DRY-up shared relabels, introducing a dedicated scrape job for the controller metrics service, and keeping it separated from the shared allowlist — is architecturally sound. The main risk is a port/scheme mismatch in the diff that directly contradicts the author's own validated test results and will cause up == 0 in production if shipped as-is. The remaining issues are minor maintainability concerns.

Risk level: Medium (one correctness bug, two minor maintainability issues)


Critical issues

None.

Major issues

  • scheme: http + hardcoded :8443 in address replacement are contradictory (values.yaml, new controller job). Port 8443 is a TLS port; combining it with scheme: http will produce TLS handshake failures on every scrape (up == 0). The author's latest comment and validation evidence confirms the correct port is 8080, but the diff was not updated. This must be fixed before merge.

Minor issues

  • Anchors defined but selectively unused without explanation&kubexEndpointsliceMetricsPath, &kubexEndpointsliceAddress, and &kubexEndpointsliceParamLabels are defined in the shared job but not referenced in the new controller job. This is intentional (different behavior), but no comment explains which rules were deliberately omitted, creating confusion for future maintainers.
  • scheme: http is hardcoded with no Helm values override — operators who run the controller with TLS-enabled metrics have no way to change the scheme without modifying the chart. The sibling job handles this dynamically via annotations; the new job should at minimum expose a values.yaml key for scheme and port.

DRY improvement opportunities

  • The YAML anchor approach adopted in this PR is a good step. The &kubexEndpointsliceServiceLabels, &kubexEndpointsliceNamespace, &kubexEndpointsliceServiceName, and &kubexEndpointsliceNodeName anchors are correctly reused in the new job.
  • Consider also anchoring the __metrics_path__ override (replacement: /metrics) if more jobs of this type are added in future.

Suggested next steps

  1. Fix the port: Change replacement: '$1:8443'replacement: '$1:8080' in the new controller job's address relabel (or parameterize via values).
  2. Add a comment block at the top of the new job listing which shared relabels were intentionally omitted (scheme-annotation override, port-annotation override, param label map) and why — this makes future audits straightforward.
  3. (Optional but recommended) Add prometheus.controllerMetrics.scheme and prometheus.controllerMetrics.port to values.yaml so the job is operator-configurable without a chart release.

Comment thread charts/kubex-automation-stack/values.yaml Outdated
Comment thread charts/kubex-automation-stack/values.yaml
Comment thread charts/kubex-automation-stack/values.yaml
Copy link
Copy Markdown
Contributor

@kubexautomation kubexautomation Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CONTENT OF THIS REVIEW IS AI GENERATED

Overall assessment: The PR introduces a well-structured dedicated Prometheus scrape job for the controller metrics endpoint, with good use of YAML anchors to avoid duplication across the two jobs. However, there are two correctness bugs in the new job's relabel_configs — a regex missing an end-anchor that will produce garbage __address__ values, and a keep filter running after the address-rewrite mutation rather than before it — which will cause silent scrape failures in production. There are also minor maintainability concerns around unused anchors and the potential for future double-scraping overlap between the two jobs.

Risk level: Medium (the address-regex bug and filter-ordering issue will cause the dedicated job to produce up == 0 or wrong scrape targets in practice)


Critical issues

(none)

Major issues

  • Address-rewrite regex missing end anchor (values.yaml line ~254): The regex '(.+?)(?::\d+)?' uses a lazy quantifier with no $ anchor. For an address like 10.244.0.3:8080, Go's RE2 engine will satisfy the lazy match with just the first character, producing 1:8080 as the replacement instead of 10.244.0.3:8080. Use '(.+?)(?::\d+)?$' or '([^:]+)(?::\d+)?$'. See inline comment at line 254.

  • keep filter runs after the address mutation, not before (values.yaml line ~261): In the new job's relabel_configs, the __address__ port-rewrite ($1:8080) executes on every endpointslice in the cluster before the action: keep on __meta_kubernetes_service_name has narrowed the target set. This means all cluster endpoints get their port overwritten, then most are dropped. The keep filter should be moved above the address-rewrite. See inline comment at line 261.

Minor issues

  • Unused YAML anchors (values.yaml ~line 200): &kubexEndpointsliceMetricsPath, &kubexEndpointsliceAddress, and &kubexEndpointsliceParamLabels are defined but never aliased in the new job. This is either dead code or a maintenance trap (future authors completing the alias pattern would accidentally break the hard-coded port). Either remove the anchor markers or add a guard comment explaining they are intentionally not aliased. See inline comment at line 200.

  • Missing guard comment on asymmetric anchor usage (values.yaml ~line 247): The new job selectively aliases *kubexEndpointsliceServiceLabels, *kubexEndpointsliceNamespace, *kubexEndpointsliceServiceName, and *kubexEndpointsliceNodeName, but explicitly inlines the path and address rules. Without a comment, future authors may "complete the pattern" by adding *kubexEndpointsliceAddress, silently overriding the port. See inline comment at line 247.

  • Potential double-scrape overlap between the two jobs (values.yaml ~line 219): The shared kubernetes-service-endpointslice job has no explicit exclusion for the controller metrics service. If a future release changes the controller's EndpointSlice name to match the shared job's regex (which is partially wildcard-based via .*dcgm-style alternatives), both jobs will scrape the same endpoint and the shared job's allowlist will silently drop all controller metrics. Consider adding an explicit action: drop for .+-kubex-automation-engine-metrics-service$ to the shared job. See inline comment at line 219.

DRY improvement opportunities

  • The YAML anchor mechanism (&/*) is already well-applied for the six shared relabel rules. The three anchors that are currently unused (MetricsPath, Address, ParamLabels) could be removed to keep the anchor list minimal and honest.

Suggested next steps

  1. Fix the address-rewrite regex to add a $ end anchor (line 254).
  2. Move the action: keep filter above the __address__ rewrite (line 261).
  3. Either remove the three unused anchors (MetricsPath, Address, ParamLabels) or add explicit "do not alias" guard comments.
  4. Consider adding an explicit drop rule in the shared job to prevent future overlap with the controller metrics service.

regex: '(.+?)(?::\d+)?'
replacement: '$1:8080'
# Matches any Helm release name prefix (for example 'controller-' or 'prod-').
# This intentionally supports scraping multiple controller releases when their service names share this suffix.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CONTENT OF THIS REVIEW IS AI GENERATED

[Severity: Major] [Confidence: High]

Location: charts/kubex-automation-stack/values.yaml:261

Issue: The action: keep on __meta_kubernetes_service_name fires before any endpointslice-to-service name propagation from __meta_kubernetes_endpointslice_name, meaning that for multi-port services the address rewrite on line 254 ($1:8080) will execute on all discovered targets across the entire cluster before the keep filter narrows them — this is wasteful but, more critically, the address-rewrite rule (which overwrites __address__) runs at line 254 while the keep only runs at line 261. For Prometheus, all relabel_configs are evaluated in list order, so every endpoint in the cluster gets its port forced to 8080, then the keep fires. The side-effect is that if the same endpoint is also scraped by the sibling kubernetes-service-endpointslice job (because the service also matches that job's regex), it will be double-scraped; however the main concern is the ordering: move the keep filter to run before the __address__ rewrite so that the port rewrite only applies to targets that have already been selected.

Why it matters: The address-rewrite $1:8080 runs on every endpointslice target discovered cluster-wide before the keep filter narrows the set. For large clusters this inflates the set of active scrape connections (even temporarily) and can produce spurious scrape errors for unrelated services whose address was rewritten but the keep then drops. It also makes the configuration harder to reason about: readers expect the filter to come before the mutation.

Suggested fix: Re-order relabel_configs so that the service-name keep filter runs first, then the __address__ rewrite:

relabel_configs:
  - target_label: __metrics_path__
    replacement: /metrics
  - *kubexEndpointsliceServiceLabels
  - *kubexEndpointsliceNamespace
  # Keep filter FIRST — before any address mutation
  - source_labels: [__meta_kubernetes_service_name]
    action: keep
    regex: '.+-kubex-automation-engine-metrics-service$'
  # Port rewrite only runs for already-selected targets
  - source_labels: [__address__]
    action: replace
    target_label: __address__
    regex: '(.+?)(?::\d+)?'
    replacement: '$1:8080'
  - *kubexEndpointsliceServiceName
  - *kubexEndpointsliceNodeName

- target_label: __metrics_path__
replacement: /metrics
- *kubexEndpointsliceServiceLabels
- *kubexEndpointsliceNamespace
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CONTENT OF THIS REVIEW IS AI GENERATED

[Severity: Major] [Confidence: High]

Location: charts/kubex-automation-stack/values.yaml:254

Issue: The address-rewrite regex '(.+?)(?::\d+)?' is missing the end anchor $. Without it, (.+?) is lazy and (?::\d+)? is optional, so the overall match will succeed with (.+?) capturing just the first character of the IP, leaving the rest (including a real port) in the unmatched tail. For an input like 10.244.0.3:6443, the lazy (.+?) will match 1, satisfying the full regex, and the replacement $1:8080 produces 1:8080 rather than 10.244.0.3:8080.

Why it matters: The rewritten __address__ values will be garbage, causing every scrape by this job to fail with a connection error.

Suggested fix: Anchor the regex to the full string:

- source_labels: [__address__]
  action: replace
  target_label: __address__
  regex: '(.+?)(?::\d+)?$'
  replacement: '$1:8080'

Or use a non-lazy quantifier with the anchor:

regex: '([^:]+)(?::\d+)?$'
replacement: '$1:8080'

# The target is scraped by service-name suffix so multiple controller releases can be collected when needed.
# No bearer token is required for this /metrics endpoint.
scheme: http
kubernetes_sd_configs:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CONTENT OF THIS REVIEW IS AI GENERATED

[Severity: Major] [Confidence: High]

Location: charts/kubex-automation-stack/values.yaml:247

Issue: The &kubexEndpointsliceMetricsPath and &kubexEndpointsliceAddress and &kubexEndpointsliceParamLabels anchors are defined in the shared job but are never referenced (*) by the new dedicated job, even though they would be harmless no-ops. However, the &kubexEndpointsliceAddress rule specifically rewrites __address__ by combining the host with the prometheus.io/port annotation. If this anchor were accidentally aliased into the new job in a future edit, it would override the explicit $1:8080 port. More importantly right now: the new job deliberately replaces __address__ with a hard-coded port but sets target_label: __metrics_path__ via an inline rule (not via the *kubexEndpointsliceMetricsPath alias). This is consistent — but should be explicitly commented to prevent a future author from "completing the pattern" by adding *kubexEndpointsliceAddress, which would break the port rewrite. Consider adding a # Note: do NOT alias *kubexEndpointsliceAddress here guard comment.

Why it matters: The asymmetric use of anchors (some aliased, some re-inlined) creates a maintenance trap. A future author completing the pattern by adding the missing aliases will silently break the port override.

Suggested fix: Add a guard comment immediately before the first alias in the new job:

relabel_configs:
  # Fixed path/port — do NOT add *kubexEndpointsliceAddress or *kubexEndpointsliceMetricsPath here;
  # this job intentionally hard-codes port 8080 and path /metrics.
  - target_label: __metrics_path__
    replacement: /metrics
  - *kubexEndpointsliceServiceLabels
  ...

- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
action: replace
target_label: __scheme__
regex: (https?)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CONTENT OF THIS REVIEW IS AI GENERATED

[Severity: Minor] [Confidence: High]

Location: charts/kubex-automation-stack/values.yaml:197

Issue: The &kubexEndpointsliceMetricsPath, &kubexEndpointsliceAddress, and &kubexEndpointsliceParamLabels anchors are defined in the shared job but never consumed by the new job or any other section. Unused YAML anchors are dead weight and add cognitive overhead when readers scan for alias usages. If these three anchors are retained for future use, a comment should say so explicitly; otherwise they should be removed.

Why it matters: Unused anchors suggest incomplete refactoring and may mislead future readers into thinking there is a second consumer somewhere in the file.

Suggested fix: Either remove the &kubexEndpointsliceMetricsPath, &kubexEndpointsliceAddress, and &kubexEndpointsliceParamLabels anchor markers (keep the rules, just drop the &name labels), or add an inline comment:

- &kubexEndpointsliceAddress   # anchor reserved for future controller HTTPS variant
  source_labels: [__address__, ...]

action: labelmap
regex: __meta_kubernetes_service_label_(.+)
- source_labels: [__meta_kubernetes_namespace]
- &kubexEndpointsliceNamespace
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CONTENT OF THIS REVIEW IS AI GENERATED

[Severity: Minor] [Confidence: High]

Location: charts/kubex-automation-stack/values.yaml:219

Issue: The action: keep regex on __meta_kubernetes_endpointslice_name in the shared kubernetes-service-endpointslice job was not updated to exclude the new controller metrics service, even though a dedicated job now exists for it. If the controller's EndpointSlice name ever happens to match the shared job's regex (e.g., a future release where the name contains kubex), the shared job and the dedicated job will both scrape it — and the shared job's metric_relabel_configs allowlist will then silently discard all controller-runtime/go/workqueue metrics.

Why it matters: Double-scraping the same endpoint wastes resources; the shared job's allowlist then drops all controller metrics that don't match the kube_*/node_*/DCGM_*/openshift_* pattern, so users would see gaps if they ever routed the controller through the shared job.

Suggested fix: Add a negative lookahead (or a separate drop rule) to the shared job's endpointslice-name keep regex so it explicitly excludes the controller metrics service name suffix:

- source_labels: [__meta_kubernetes_endpointslice_name]
  action: keep
  regex: '((kubex|densify)-(kube-state-metrics|prometheus-node-exporter|ephemeral-storage-collector)|.*dcgm|k8s-ephemeral-storage-metrics).*'
# Optionally, add a drop to prevent accidental overlap with the dedicated job:
- source_labels: [__meta_kubernetes_service_name]
  action: drop
  regex: '.+-kubex-automation-engine-metrics-service$'

@gasarekubex
Copy link
Copy Markdown
Member Author

gasarekubex commented May 11, 2026

Follow-up on the remaining review threads:

  • Moved the dedicated controller metrics job’s keep filter before the fixed port rewrite.
  • Removed the unused shared-job anchor labels to reduce noise in the relabel config.
  • Kept the controller scrape intentionally fixed to plain HTTP on port 8080, with comments documenting the contract and the expected metric families.

Validation rerun:

  • helm lint charts/kubex-automation-stack --set container-optimization-data-forwarder.config.forwarder.densify.url.scheme=http --set container-optimization-data-forwarder.config.forwarder.densify.url.host=example.com --set container-optimization-data-forwarder.config.forwarder.densify.url.port=443 --set container-optimization-data-forwarder.config.clusters[0].name=test --set stack.densify.username=test --set stack.densify.encrypted_password=test

The chart now reflects the current controller scrape contract and the remaining bot concerns are either stale or intentional.

@gasarekubex
Copy link
Copy Markdown
Member Author

Follow-up applied to the controller metrics scrape config:

  • Added an explicit drop for *-kubex-automation-engine-metrics-service in the shared kubernetes-service-endpointslice job so the dedicated controller scrape does not overlap.
  • Kept the dedicated controller job fixed to plain HTTP on port 8080 and tightened the wording around the shared annotation-driven rules.
  • Validation reran with helm lint charts/kubex-automation-stack and the required chart values; lint passes.

This addresses the remaining review concerns around overlap and config clarity.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant