
fix(common-services): fix 2.0.1 deployment failures #58

Open
gnanirahulnutakki wants to merge 22 commits into master from common-services/2.0.1

Conversation

@gnanirahulnutakki
Member

Summary

Fixes multiple deployment failures in common-services 2.0.1 discovered during IDO customer onboarding testing. Addresses ServiceMonitor CRD dependencies, missing Grafana dashboard ConfigMaps, and several pre-existing blockers.

Problem

When deploying common-services 2.0.1 without PrometheusOperator (or with --reuse-values from older releases):

  1. ServiceMonitor/PodMonitor/PrometheusRule CRDs required — 14 ServiceMonitors and 4 PrometheusRules defaulted to enabled: true, causing failures on clusters without PrometheusOperator
  2. Grafana pod stuck in ContainerCreating — 6 dashboard providers (opensearch, velero, loki, flink, alloy, kibana) referenced ConfigMaps that didn't exist
  3. Kibana ServiceMonitor template nil pointer — kibana.serviceMonitor.enabled crashes template rendering when the key is absent (e.g., --reuse-values from pre-2.0.1)
  4. Loki CrashLoopBackOff — object_store: s3 was set but no S3 bucket was configured; the storage type was filesystem
  5. PostgreSQL ImagePullBackOff — bitnami/postgres-exporter:0.11.1-debian-11-r34 was removed from Docker Hub
  6. ArgoCD CRD installation — crds.install was missing, preventing fresh installs

Changes

values.yaml

Category | Change | Count
ServiceMonitor/PodMonitor enabled | true → false | 14
PrometheusRule enabled | true → false | 4
dashboardsConfigMaps entries | added opensearch, velero, loki, flink, alloy, kibana | 6
Loki object_store | s3 → filesystem | 1
Loki service annotations | added prometheus.io/* for metrics discovery | 3
PostgreSQL exporter image | radiantone/postgres-exporter:0.11.1-debian-11-r34 | 1
ArgoCD CRDs | added crds.install: true | 1
Kibana exporter auth | documented auth options | 1
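A condensed sketch of a few of these values.yaml changes. The key paths here are representative of typical subchart layouts, not copied verbatim from the chart:

```yaml
# Representative fragment — exact key paths may differ per subchart
kibana:
  serviceMonitor:
    enabled: false            # one of the 14 ServiceMonitor/PodMonitor flags now defaulting off

postgresql:
  metrics:
    image:
      # bitnami/postgres-exporter:0.11.1-debian-11-r34 was removed from
      # Docker Hub; this PR repins to the radiantone copy
      repository: radiantone/postgres-exporter
      tag: 0.11.1-debian-11-r34

argo-cd:
  crds:
    install: true             # previously absent, so fresh installs failed
```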

templates/grafana-dashboards-metrics.yaml

Added 5 conditional ConfigMap templates for dashboard providers that had no backing data:

ConfigMap | Condition | Folder
opensearch-dashboard | opensearch.enabled | Infrastructure
velero-dashboard | velero.enabled | Backup
loki-dashboard | loki.enabled | Logging
flink-dashboard | flink-kubernetes-operator.enabled | Data Processing
alloy-dashboard | alloy.enabled | Infrastructure

templates/kibana/kibana-servicemonitor.yaml

  • Fixed nil pointer: ((.Values.kibana.serviceMonitor).enabled) — safe when key absent
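For reference, Helm's parenthesized lookup is what makes this nil-safe; a minimal sketch of the guard (the resource name and elided spec are illustrative, not taken from the chart):

```yaml
{{- /* Without parentheses, .Values.kibana.serviceMonitor.enabled panics
       when the serviceMonitor key is absent (e.g. --reuse-values from a
       pre-2.0.1 release); the parenthesized form evaluates to nil instead */ -}}
{{- if ((.Values.kibana.serviceMonitor).enabled) }}
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kibana-metrics      # illustrative name
# ... spec elided ...
{{- end }}
```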

How Metrics Still Work

All services retain prometheus.io/* annotations on their Services/Pods. Prometheus discovers targets via its kubernetes-service-endpoints and kubernetes-pods scrape jobs — no ServiceMonitor CRDs needed.
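Concretely, annotation-based discovery only needs the conventional prometheus.io/* keys on each Service. A sketch using ZooKeeper's port from this PR (the Service name is illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: zookeeper-metrics          # illustrative name
  annotations:
    prometheus.io/scrape: "true"   # opt this Service into scraping
    prometheus.io/port: "9141"     # ZooKeeper metrics port
    prometheus.io/path: "/metrics" # needed only for non-default paths
spec:
  ports:
    - name: metrics
      port: 9141
```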

Service | Port | Path | Discovery
ZooKeeper | 9141 | /metrics | Service annotation
ArgoCD (4) | 8082-8084 | /metrics | Service annotation
Grafana | 3000 | /eoc-backend/grafana/metrics | Service annotation
Kibana (exporter) | 9684 | /metrics | Service annotation
HAProxy | 8404 | /metrics | Service annotation
PostgreSQL (exporter) | 9187 | /metrics | Service annotation
Fluent-Bit | 2020 | /api/v1/metrics/prometheus | Service annotation
Velero | 8085 | /metrics | Pod annotation
OpenSearch (plugin) | 9200 | /_prometheus/metrics | Pod annotation
Flink Operator | 9999 | /metrics | Pod annotation
Alloy | 12345 | /metrics | Service annotation
ES Exporter | 9108 | /metrics | Service annotation
Loki | 3100 | /metrics | Service annotation (NEW)
Prometheus | 9090 | /metrics | Static self-scrape
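The kubernetes-service-endpoints job these annotations rely on is the long-standing relabel config shipped with the community Prometheus chart; a trimmed sketch (assuming stock defaults, not this repo's exact config):

```yaml
scrape_configs:
  - job_name: kubernetes-service-endpoints
    kubernetes_sd_configs:
      - role: endpoints
    relabel_configs:
      # Only scrape endpoints whose Service carries prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Honor a custom metrics path, e.g. /eoc-backend/grafana/metrics
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # Honor a custom port, e.g. 9141 for ZooKeeper
      - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
        action: replace
        regex: (.+?)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
```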

Test Plan

  • helm template renders cleanly — zero errors, zero ServiceMonitor/PodMonitor/PrometheusRule resources
  • Fresh helm install on self-managed-test-dev01 cluster, duploservices-saasops2 namespace
  • All 45+ pods Running (Grafana 2/2, Kibana 2/2, Loki 2/2, PostgreSQL 2/2)
  • Zero CrashLoopBackOff, zero ImagePullBackOff, zero ContainerCreating
  • All 6 new dashboard ConfigMaps created and mounted in Grafana
  • ArgoCD server, applicationset-controller healthy (CRDs installed)
  • Verify Grafana UI shows imported dashboards
  • Verify Prometheus targets page shows all metrics endpoints
  • Test with serviceMonitor.enabled: true override on cluster with PrometheusOperator

jmcorne and others added 17 commits December 9, 2025 12:05
…egration

Features:
- Add Prometheus scrape annotations to all services (14 total)
- Add Alloy dashboard (Grafana Dashboard ID: 19624)
- Add Grafana self-monitoring dashboard
- Fix Grafana metrics path for subpath configuration
- Add Flink operator metrics annotations
- Update README with metrics/dashboards documentation

Services with automatic metrics:
- ArgoCD (controller, server, repo-server, redis)
- Elasticsearch (via exporter)
- Fluent-bit
- Grafana
- HAProxy
- Kibana
- PostgreSQL
- Zookeeper
- CloudNative-PG
- Alloy
- Flink Operator

Tested: E2E deployment on duploservices-qaibtest
- 14 services with prometheus annotations
- 12/14 targets UP in Prometheus
- 64 dashboards loaded in Grafana
- 2,259+ metrics collected
Kibana Metrics Integration:
- Add custom Kibana Prometheus exporter sidecar (rahulnutakki/kibana-prometheus-exporter)
- Add kibana-metrics-service.yaml for metrics endpoint exposure
- Add kibana-servicemonitor.yaml for Prometheus scraping
- Add Kibana dashboard with 17 panels (status, memory, performance, requests, system)

New Dashboards:
- ArgoCD: 20-panel dashboard (replaces old argo-cd-dashboard.json)
- Prometheus: 19-panel self-monitoring dashboard
- ZooKeeper: 15-panel JMX metrics dashboard
- Grafana: Add grafana-internals dashboard

Dashboard Updates:
- Elasticsearch: Custom 37-panel dashboard with actual metric labels
- HAProxy: Custom 14-panel load balancer dashboard
- Fluent Bit: Custom 13-panel log pipeline dashboard
- PostgreSQL: Custom 19-panel database dashboard

Removed:
- CloudNative-PG dashboard (no CNPG clusters deployed, operator-only)
- Old ArgoCD dashboard (replaced with new version)

Configuration:
- Update values.yaml with Kibana exporter config (basePath support)
- Update override-values.yaml with new exporter settings
- Update grafana-dashboards-metrics.yaml template with Kibana section
- Updated from :latest to :v1.1.0
- v1.1.0 built with Go 1.25.6, 0 CVEs (Trivy verified)
Add two new Grafana dashboards for monitoring Flink Graph Pipeline:
- graph-pipeline-sources: monitors source events, vertices, edges and transformation times
- graph-pipeline-sinks: monitors sink events, batch processing, errors and retries

Dashboards are automatically provisioned via existing ConfigMap mechanism.

Co-authored-by: Cursor <cursoragent@cursor.com>
Add kibana.serviceMonitor.enabled flag to control ServiceMonitor creation.
Comment out cloudnative-pg dashboard configuration to prevent errors
when the dashboard is not available.
Comment out dashboard providers and configmaps for services that are
commonly disabled (argo-cd, zookeeper, fluent-bit) to prevent Grafana
pod mount failures when these services are not deployed.
Default all ServiceMonitor, PodMonitor, and PrometheusRule resources to
disabled, fix missing Grafana dashboard ConfigMaps causing pod crashes,
and resolve several pre-existing deployment blockers.

ServiceMonitors/PodMonitors/PrometheusRules (18 changes):
- Default all 14 ServiceMonitor/PodMonitor enabled flags to false
- Default all 4 PrometheusRule enabled flags to false
- Metrics collection remains active via prometheus.io/* annotations
- Users can re-enable individually when PrometheusOperator is available

Grafana dashboard fixes:
- Add 5 missing ConfigMap templates (opensearch, velero, loki, flink, alloy)
- Add 6 missing dashboardsConfigMaps entries (opensearch, velero, loki,
  flink, alloy, kibana) that map providers to their ConfigMaps
- Fixes Grafana pod stuck in ContainerCreating due to missing volume mounts

Kibana ServiceMonitor template:
- Fix nil pointer crash when kibana.serviceMonitor key is absent
- Use nil-safe access pattern: ((.Values.kibana.serviceMonitor).enabled)

Loki fixes:
- Add prometheus.io service annotations for metrics discovery (port 3100)
- Fix object_store: s3 -> filesystem to match storage.type: filesystem
- Prevents CrashLoopBackOff from missing S3 bucket configuration
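The two Loki changes above can be sketched together; the key paths are representative of the grafana/loki chart layout, not copied verbatim:

```yaml
# Representative fragment — exact keys follow the grafana/loki chart only loosely
loki:
  loki:
    storage:
      type: filesystem            # the deployed storage type
    # object_store changed s3 -> filesystem so it agrees with storage.type;
    # with s3 and no bucket configured the pods crash-looped at startup
  serviceAnnotations:
    prometheus.io/scrape: "true"  # the three new annotations for discovery
    prometheus.io/port: "3100"
    prometheus.io/path: "/metrics"
```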

PostgreSQL fix:
- Pin metrics exporter image to radiantone/postgres-exporter:0.11.1
- Upstream bitnami/postgres-exporter:0.11.1-debian-11-r34 removed from
  Docker Hub

ArgoCD fix:
- Add crds.install: true (was missing, causing CRD installation to fail)

Kibana exporter auth:
- Document authentication options for kibana-prometheus-exporter sidecar

Tested: fresh helm install on self-managed-test-dev01 cluster,
duploservices-saasops2 namespace. All 45+ pods Running, zero
CrashLoopBackOff, zero ImagePullBackOff.
@gnanirahulnutakki self-assigned this Feb 9, 2026
# This file only enables services with metrics and dashboards configured

# Global nodeSelector - for EKS cluster (tenantname label)
nodeSelector:
Collaborator:

Why do we have duplo references in this file?


zookeeper:
-  enabled: false
+  enabled: true
Collaborator:

Why is zookeeper enabled by default?


opensearch:
-  enabled: false
+  enabled: true
Collaborator @pgodey, Feb 20, 2026:

why is opensearch enabled by default?


fluent-bit:
-  enabled: false
+  enabled: true
Collaborator:

why is fluent-bit enabled by default?


velero:
-  enabled: false
+  enabled: true
Collaborator:

why is velero enabled by default?

# -----------------------------------------------------------------
cloudnative-pg:
-  enabled: false
+  enabled: true
Collaborator:

why is cnpg enabled by default?

# Loki -- https://grafana.com/docs/loki/latest/setup/install
# https://github.com/grafana/loki/blob/main/production/helm/loki/
# -----------------------------------------------------------------
loki:
Collaborator:

why is loki enabled?

# -----------------------------------------------------------------
flink-kubernetes-operator:
-  enabled: false
+  enabled: true
Collaborator:

why is flink enabled?

# -----------------------------------------------------------------
alloy:
-  enabled: false
+  enabled: true
Collaborator:

why is alloy enabled by default?

- Remove broken namespace filter from prometheus self-monitoring dashboard
  (prometheus self-scraped metrics lack namespace label, causing all 19
  panels to render empty)
- Add nil-pointer guards to postgres, backup, and kibana templates
- Align values.yaml defaults between 2.0.0 and 2.0.1 (remove stale entries)
- Improve NOTES.txt with upgrade-aware messaging and route details
Addresses audit report dashboard timeouts on customer clusters with
30-50GB Elasticsearch indexes. Customers querying 12h+ time ranges
were hitting the 30s default across all three proxy layers.

- Set Grafana dataproxy.timeout to 90s (was 30s default)
- Set Grafana dataproxy.keep_alive_seconds to 30s
- Add timeout: 90 to Elasticsearch datasource jsonData
- Add timeout server 90s to HAProxy grafana_backend only
  (other backends stay at 30s defaults — only Grafana proxies
  heavy ES aggregation queries through the EOC iframe)
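A sketch of the Grafana side of these timeout changes, expressed as chart values. The grafana.ini passthrough is how the upstream Grafana chart exposes ini settings; the datasource entry is illustrative:

```yaml
grafana:
  grafana.ini:
    dataproxy:
      timeout: 90             # was Grafana's 30s default
      keep_alive_seconds: 30
  # Illustrative datasource entry — only jsonData.timeout is the change here
  datasources:
    datasources.yaml:
      apiVersion: 1
      datasources:
        - name: Elasticsearch
          type: elasticsearch
          jsonData:
            timeout: 90       # per-datasource query timeout, in seconds
```

On the HAProxy side, the commit adds `timeout server 90s` to the grafana_backend section only, leaving other backends at their 30s defaults.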
GHCR OCI tag 2.0.1 is immutable and contains pre-fix values
(hardcoded dashboardsConfigMaps causing Grafana FailedMount).
Bump to 2.0.2 so GitHub Actions publishes a corrected image.
… CI timeout

Elasticsearch 7.17.3 crashes on GitHub Actions runners due to a cgroups v2
JVM incompatibility (NullPointerException in CgroupV2Subsystem). Upgrading
to 7.17.25 which includes the JVM fix. Also increases CI test timeout from
15m to 20m to accommodate the full chart install.

Made-with: Cursor