
fix(common-services): fix 2.0.1 deployment failures #58

Open
gnanirahulnutakki wants to merge 22 commits into master from common-services/2.0.1

Conversation

@gnanirahulnutakki
Member

Summary

Fixes multiple deployment failures in common-services 2.0.1 discovered during IDO customer onboarding testing. Addresses ServiceMonitor CRD dependencies, missing Grafana dashboard ConfigMaps, and several pre-existing blockers.

Problem

When deploying common-services 2.0.1 without PrometheusOperator (or with --reuse-values from older releases):

  1. ServiceMonitor/PodMonitor/PrometheusRule CRDs required — 14 ServiceMonitors and 4 PrometheusRules defaulted to enabled: true, causing failures on clusters without PrometheusOperator
  2. Grafana pod stuck in ContainerCreating — 6 dashboard providers (opensearch, velero, loki, flink, alloy, kibana) referenced ConfigMaps that didn't exist
  3. Kibana ServiceMonitor template nil pointer — kibana.serviceMonitor.enabled crashes template rendering when the key is absent (e.g., --reuse-values from pre-2.0.1)
  4. Loki CrashLoopBackOff — object_store: s3 was set but no S3 bucket was configured; the storage type was filesystem
  5. PostgreSQL ImagePullBackOff — bitnami/postgres-exporter:0.11.1-debian-11-r34 was removed from Docker Hub
  6. ArgoCD CRD installation — crds.install was missing, preventing fresh installs

Changes

values.yaml

Category | Change | Count
ServiceMonitor/PodMonitor enabled | true → false | 14
PrometheusRule enabled | true → false | 4
dashboardsConfigMaps entries | added opensearch, velero, loki, flink, alloy, kibana | 6
Loki object_store | s3 → filesystem | 1
Loki service annotations | added prometheus.io/* for metrics discovery | 3
PostgreSQL exporter image | radiantone/postgres-exporter:0.11.1-debian-11-r34 | 1
ArgoCD CRDs | added crds.install: true | 1
Kibana exporter auth | documented auth options | 1
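A condensed sketch of a few of these values.yaml changes. The key paths here are representative of typical subchart layouts, not copied verbatim from the chart:

```yaml
# Representative fragment — exact key paths may differ per subchart
kibana:
  serviceMonitor:
    enabled: false            # one of the 14 ServiceMonitor/PodMonitor flags now defaulting off

postgresql:
  metrics:
    image:
      # bitnami/postgres-exporter:0.11.1-debian-11-r34 was removed from
      # Docker Hub; this PR repins to the radiantone copy
      repository: radiantone/postgres-exporter
      tag: 0.11.1-debian-11-r34

argo-cd:
  crds:
    install: true             # previously absent, so fresh installs failed
```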

templates/grafana-dashboards-metrics.yaml

Added 5 conditional ConfigMap templates for dashboard providers that had no backing data:

ConfigMap | Condition | Folder
opensearch-dashboard | opensearch.enabled | Infrastructure
velero-dashboard | velero.enabled | Backup
loki-dashboard | loki.enabled | Logging
flink-dashboard | flink-kubernetes-operator.enabled | Data Processing
alloy-dashboard | alloy.enabled | Infrastructure

templates/kibana/kibana-servicemonitor.yaml

  • Fixed nil pointer: ((.Values.kibana.serviceMonitor).enabled) — safe when key absent
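For reference, Helm's parenthesized lookup is what makes this nil-safe; a minimal sketch of the guard (the resource name and elided spec are illustrative, not taken from the chart):

```yaml
{{- /* Without parentheses, .Values.kibana.serviceMonitor.enabled panics
       when the serviceMonitor key is absent (e.g. --reuse-values from a
       pre-2.0.1 release); the parenthesized form evaluates to nil instead */ -}}
{{- if ((.Values.kibana.serviceMonitor).enabled) }}
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kibana-metrics      # illustrative name
# ... spec elided ...
{{- end }}
```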

How Metrics Still Work

All services retain prometheus.io/* annotations on their Services/Pods. Prometheus discovers targets via its kubernetes-service-endpoints and kubernetes-pods scrape jobs — no ServiceMonitor CRDs needed.
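Concretely, annotation-based discovery only needs the conventional prometheus.io/* keys on each Service. A sketch using ZooKeeper's port from this PR (the Service name is illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: zookeeper-metrics          # illustrative name
  annotations:
    prometheus.io/scrape: "true"   # opt this Service into scraping
    prometheus.io/port: "9141"     # ZooKeeper metrics port
    prometheus.io/path: "/metrics" # needed only for non-default paths
spec:
  ports:
    - name: metrics
      port: 9141
```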

Service | Port | Path | Discovery
ZooKeeper | 9141 | /metrics | Service annotation
ArgoCD (4) | 8082-8084 | /metrics | Service annotation
Grafana | 3000 | /eoc-backend/grafana/metrics | Service annotation
Kibana (exporter) | 9684 | /metrics | Service annotation
HAProxy | 8404 | /metrics | Service annotation
PostgreSQL (exporter) | 9187 | /metrics | Service annotation
Fluent-Bit | 2020 | /api/v1/metrics/prometheus | Service annotation
Velero | 8085 | /metrics | Pod annotation
OpenSearch (plugin) | 9200 | /_prometheus/metrics | Pod annotation
Flink Operator | 9999 | /metrics | Pod annotation
Alloy | 12345 | /metrics | Service annotation
ES Exporter | 9108 | /metrics | Service annotation
Loki | 3100 | /metrics | Service annotation (NEW)
Prometheus | 9090 | /metrics | Static self-scrape
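The kubernetes-service-endpoints job these annotations rely on is the long-standing relabel config shipped with the community Prometheus chart; a trimmed sketch (assuming stock defaults, not this repo's exact config):

```yaml
scrape_configs:
  - job_name: kubernetes-service-endpoints
    kubernetes_sd_configs:
      - role: endpoints
    relabel_configs:
      # Only scrape endpoints whose Service carries prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Honor a custom metrics path, e.g. /eoc-backend/grafana/metrics
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # Honor a custom port, e.g. 9141 for ZooKeeper
      - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
        action: replace
        regex: (.+?)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
```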

Test Plan

  • helm template renders cleanly — zero errors, zero ServiceMonitor/PodMonitor/PrometheusRule resources
  • Fresh helm install on self-managed-test-dev01 cluster, duploservices-saasops2 namespace
  • All 45+ pods Running (Grafana 2/2, Kibana 2/2, Loki 2/2, PostgreSQL 2/2)
  • Zero CrashLoopBackOff, zero ImagePullBackOff, zero ContainerCreating
  • All 6 new dashboard ConfigMaps created and mounted in Grafana
  • ArgoCD server, applicationset-controller healthy (CRDs installed)
  • Verify Grafana UI shows imported dashboards
  • Verify Prometheus targets page shows all metrics endpoints
  • Test with serviceMonitor.enabled: true override on cluster with PrometheusOperator

jmcorne and others added 17 commits December 9, 2025 12:05
…egration

Features:
- Add Prometheus scrape annotations to all services (14 total)
- Add Alloy dashboard (Grafana Dashboard ID: 19624)
- Add Grafana self-monitoring dashboard
- Fix Grafana metrics path for subpath configuration
- Add Flink operator metrics annotations
- Update README with metrics/dashboards documentation

Services with automatic metrics:
- ArgoCD (controller, server, repo-server, redis)
- Elasticsearch (via exporter)
- Fluent-bit
- Grafana
- HAProxy
- Kibana
- PostgreSQL
- Zookeeper
- CloudNative-PG
- Alloy
- Flink Operator

Tested: E2E deployment on duploservices-qaibtest
- 14 services with prometheus annotations
- 12/14 targets UP in Prometheus
- 64 dashboards loaded in Grafana
- 2,259+ metrics collected
Kibana Metrics Integration:
- Add custom Kibana Prometheus exporter sidecar (rahulnutakki/kibana-prometheus-exporter)
- Add kibana-metrics-service.yaml for metrics endpoint exposure
- Add kibana-servicemonitor.yaml for Prometheus scraping
- Add Kibana dashboard with 17 panels (status, memory, performance, requests, system)

New Dashboards:
- ArgoCD: 20-panel dashboard (replaces old argo-cd-dashboard.json)
- Prometheus: 19-panel self-monitoring dashboard
- ZooKeeper: 15-panel JMX metrics dashboard
- Grafana: Add grafana-internals dashboard

Dashboard Updates:
- Elasticsearch: Custom 37-panel dashboard with actual metric labels
- HAProxy: Custom 14-panel load balancer dashboard
- Fluent Bit: Custom 13-panel log pipeline dashboard
- PostgreSQL: Custom 19-panel database dashboard

Removed:
- CloudNative-PG dashboard (no CNPG clusters deployed, operator-only)
- Old ArgoCD dashboard (replaced with new version)

Configuration:
- Update values.yaml with Kibana exporter config (basePath support)
- Update override-values.yaml with new exporter settings
- Update grafana-dashboards-metrics.yaml template with Kibana section
- Updated from :latest to :v1.1.0
- v1.1.0 built with Go 1.25.6, 0 CVEs (Trivy verified)
Add two new Grafana dashboards for monitoring Flink Graph Pipeline:
- graph-pipeline-sources: monitors source events, vertices, edges and transformation times
- graph-pipeline-sinks: monitors sink events, batch processing, errors and retries

Dashboards are automatically provisioned via existing ConfigMap mechanism.

Co-authored-by: Cursor <cursoragent@cursor.com>
Add kibana.serviceMonitor.enabled flag to control ServiceMonitor creation.
Comment out cloudnative-pg dashboard configuration to prevent errors
when the dashboard is not available.
Comment out dashboard providers and configmaps for services that are
commonly disabled (argo-cd, zookeeper, fluent-bit) to prevent Grafana
pod mount failures when these services are not deployed.
Default all ServiceMonitor, PodMonitor, and PrometheusRule resources to
disabled, fix missing Grafana dashboard ConfigMaps causing pod crashes,
and resolve several pre-existing deployment blockers.

ServiceMonitors/PodMonitors/PrometheusRules (18 changes):
- Default all 14 ServiceMonitor/PodMonitor enabled flags to false
- Default all 4 PrometheusRule enabled flags to false
- Metrics collection remains active via prometheus.io/* annotations
- Users can re-enable individually when PrometheusOperator is available

Grafana dashboard fixes:
- Add 5 missing ConfigMap templates (opensearch, velero, loki, flink, alloy)
- Add 6 missing dashboardsConfigMaps entries (opensearch, velero, loki,
  flink, alloy, kibana) that map providers to their ConfigMaps
- Fixes Grafana pod stuck in ContainerCreating due to missing volume mounts

Kibana ServiceMonitor template:
- Fix nil pointer crash when kibana.serviceMonitor key is absent
- Use nil-safe access pattern: ((.Values.kibana.serviceMonitor).enabled)

Loki fixes:
- Add prometheus.io service annotations for metrics discovery (port 3100)
- Fix object_store: s3 -> filesystem to match storage.type: filesystem
- Prevents CrashLoopBackOff from missing S3 bucket configuration
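The two Loki changes above can be sketched together; the key paths are representative of the grafana/loki chart layout, not copied verbatim:

```yaml
# Representative fragment — exact keys follow the grafana/loki chart only loosely
loki:
  loki:
    storage:
      type: filesystem            # the deployed storage type
    # object_store changed s3 -> filesystem so it agrees with storage.type;
    # with s3 and no bucket configured the pods crash-looped at startup
  serviceAnnotations:
    prometheus.io/scrape: "true"  # the three new annotations for discovery
    prometheus.io/port: "3100"
    prometheus.io/path: "/metrics"
```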

PostgreSQL fix:
- Pin metrics exporter image to radiantone/postgres-exporter:0.11.1
- Upstream bitnami/postgres-exporter:0.11.1-debian-11-r34 removed from
  Docker Hub

ArgoCD fix:
- Add crds.install: true (was missing, causing CRD installation to fail)

Kibana exporter auth:
- Document authentication options for kibana-prometheus-exporter sidecar

Tested: fresh helm install on self-managed-test-dev01 cluster,
duploservices-saasops2 namespace. All 45+ pods Running, zero
CrashLoopBackOff, zero ImagePullBackOff.
@gnanirahulnutakki self-assigned this Feb 9, 2026
# This file only enables services with metrics and dashboards configured

# Global nodeSelector - for EKS cluster (tenantname label)
nodeSelector:
Collaborator:

Why do we have duplo references in this file?


zookeeper:
-  enabled: false
+  enabled: true
Collaborator:

Why is zookeeper enabled by default?


opensearch:
-  enabled: false
+  enabled: true
Collaborator @pgodey, Feb 20, 2026:

why is opensearch enabled by default?


fluent-bit:
-  enabled: false
+  enabled: true
Collaborator:

why is fluent-bit enabled by default?


velero:
-  enabled: false
+  enabled: true
Collaborator:

why is velero enabled by default?

# -----------------------------------------------------------------
cloudnative-pg:
-  enabled: false
+  enabled: true
Collaborator:

why is cnpg enabled by default?

# Loki -- https://grafana.com/docs/loki/latest/setup/install
# https://github.com/grafana/loki/blob/main/production/helm/loki/
# -----------------------------------------------------------------
loki:
Collaborator:

why is loki enabled?

# -----------------------------------------------------------------
flink-kubernetes-operator:
-  enabled: false
+  enabled: true
Collaborator:

why is flink enabled?

# -----------------------------------------------------------------
alloy:
-  enabled: false
+  enabled: true
Collaborator:

why is alloy enabled by default?

- Remove broken namespace filter from prometheus self-monitoring dashboard
  (prometheus self-scraped metrics lack namespace label, causing all 19
  panels to render empty)
- Add nil-pointer guards to postgres, backup, and kibana templates
- Align values.yaml defaults between 2.0.0 and 2.0.1 (remove stale entries)
- Improve NOTES.txt with upgrade-aware messaging and route details
Addresses audit report dashboard timeouts on customer clusters with
30-50GB Elasticsearch indexes. Customers querying 12h+ time ranges
were hitting the 30s default across all three proxy layers.

- Set Grafana dataproxy.timeout to 90s (was 30s default)
- Set Grafana dataproxy.keep_alive_seconds to 30s
- Add timeout: 90 to Elasticsearch datasource jsonData
- Add timeout server 90s to HAProxy grafana_backend only
  (other backends stay at 30s defaults — only Grafana proxies
  heavy ES aggregation queries through the EOC iframe)
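A sketch of the Grafana side of these timeout changes, expressed as chart values. The grafana.ini passthrough is how the upstream Grafana chart exposes ini settings; the datasource entry is illustrative:

```yaml
grafana:
  grafana.ini:
    dataproxy:
      timeout: 90             # was Grafana's 30s default
      keep_alive_seconds: 30
  # Illustrative datasource entry — only jsonData.timeout is the change here
  datasources:
    datasources.yaml:
      apiVersion: 1
      datasources:
        - name: Elasticsearch
          type: elasticsearch
          jsonData:
            timeout: 90       # per-datasource query timeout, in seconds
```

On the HAProxy side, the commit adds `timeout server 90s` to the grafana_backend section only, leaving other backends at their 30s defaults.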
GHCR OCI tag 2.0.1 is immutable and contains pre-fix values
(hardcoded dashboardsConfigMaps causing Grafana FailedMount).
Bump to 2.0.2 so GitHub Actions publishes a corrected image.
… CI timeout

Elasticsearch 7.17.3 crashes on GitHub Actions runners due to a cgroups v2
JVM incompatibility (NullPointerException in CgroupV2Subsystem). Upgrading
to 7.17.25 which includes the JVM fix. Also increases CI test timeout from
15m to 20m to accommodate the full chart install.

Made-with: Cursor