fix(common-services): fix 2.0.1 deployment failures #58
Open
gnanirahulnutakki wants to merge 22 commits into master from
Conversation
…s for ClusterImageCatalog
…egration Features:
- Add Prometheus scrape annotations to all services (14 total)
- Add Alloy dashboard (Grafana Dashboard ID: 19624)
- Add Grafana self-monitoring dashboard
- Fix Grafana metrics path for subpath configuration
- Add Flink operator metrics annotations
- Update README with metrics/dashboards documentation

Services with automatic metrics:
- ArgoCD (controller, server, repo-server, redis)
- Elasticsearch (via exporter)
- Fluent-bit
- Grafana
- HAProxy
- Kibana
- PostgreSQL
- Zookeeper
- CloudNative-PG
- Alloy
- Flink Operator

Tested: E2E deployment on duploservices-qaibtest
- 14 services with prometheus annotations
- 12/14 targets UP in Prometheus
- 64 dashboards loaded in Grafana
- 2,259+ metrics collected
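Annotation-driven scraping, as described in the commit above, relies on Prometheus's Kubernetes service discovery and relabeling rather than Operator CRDs. A minimal sketch of the Service-side annotations (the service name and port are illustrative examples, not the chart's actual values):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: haproxy                      # illustrative; any of the 14 annotated services
  annotations:
    prometheus.io/scrape: "true"     # opt this Service in to scraping
    prometheus.io/port: "8404"       # metrics port (example value)
    prometheus.io/path: "/metrics"   # scrape path; conventionally /metrics
spec:
  selector:
    app: haproxy
  ports:
    - name: metrics
      port: 8404
```

Prometheus's `kubernetes-service-endpoints` scrape job reads these annotations via relabeling rules, so targets appear as soon as the Service exists.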
Kibana Metrics Integration:
- Add custom Kibana Prometheus exporter sidecar (rahulnutakki/kibana-prometheus-exporter)
- Add kibana-metrics-service.yaml for metrics endpoint exposure
- Add kibana-servicemonitor.yaml for Prometheus scraping
- Add Kibana dashboard with 17 panels (status, memory, performance, requests, system)

New Dashboards:
- ArgoCD: 20-panel dashboard (replaces old argo-cd-dashboard.json)
- Prometheus: 19-panel self-monitoring dashboard
- ZooKeeper: 15-panel JMX metrics dashboard
- Grafana: Add grafana-internals dashboard

Dashboard Updates:
- Elasticsearch: Custom 37-panel dashboard with actual metric labels
- HAProxy: Custom 14-panel load balancer dashboard
- Fluent Bit: Custom 13-panel log pipeline dashboard
- PostgreSQL: Custom 19-panel database dashboard

Removed:
- CloudNative-PG dashboard (no CNPG clusters deployed, operator-only)
- Old ArgoCD dashboard (replaced with new version)

Configuration:
- Update values.yaml with Kibana exporter config (basePath support)
- Update override-values.yaml with new exporter settings
- Update grafana-dashboards-metrics.yaml template with Kibana section
- Updated from :latest to :v1.1.0
- v1.1.0 built with Go 1.25.6, 0 CVEs (Trivy verified)
Add two new Grafana dashboards for monitoring the Flink Graph Pipeline:
- graph-pipeline-sources: monitors source events, vertices, edges, and transformation times
- graph-pipeline-sinks: monitors sink events, batch processing, errors, and retries

Dashboards are automatically provisioned via the existing ConfigMap mechanism.

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
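The "existing ConfigMap mechanism" referenced above is typically the Grafana sidecar pattern: a labeled ConfigMap whose data keys are dashboard JSON files. A hypothetical sketch (the label key, ConfigMap name, and file name are assumptions for illustration, not confirmed chart values):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: graph-pipeline-sources-dashboard
  labels:
    grafana_dashboard: "1"   # assumed discovery label watched by the Grafana sidecar
data:
  graph-pipeline-sources.json: |
    {
      "title": "Graph Pipeline Sources",
      "panels": []
    }
```

When the sidecar sees a ConfigMap carrying its discovery label, it writes the JSON into Grafana's provisioning directory and the dashboard appears without a pod restart.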
Add kibana.serviceMonitor.enabled flag to control ServiceMonitor creation.
Comment out cloudnative-pg dashboard configuration to prevent errors when the dashboard is not available.
Comment out dashboard providers and configmaps for services that are commonly disabled (argo-cd, zookeeper, fluent-bit) to prevent Grafana pod mount failures when these services are not deployed.
Default all ServiceMonitor, PodMonitor, and PrometheusRule resources to disabled, fix missing Grafana dashboard ConfigMaps causing pod crashes, and resolve several pre-existing deployment blockers.

ServiceMonitors/PodMonitors/PrometheusRules (18 changes):
- Default all 14 ServiceMonitor/PodMonitor enabled flags to false
- Default all 4 PrometheusRule enabled flags to false
- Metrics collection remains active via prometheus.io/* annotations
- Users can re-enable individually when PrometheusOperator is available

Grafana dashboard fixes:
- Add 5 missing ConfigMap templates (opensearch, velero, loki, flink, alloy)
- Add 6 missing dashboardsConfigMaps entries (opensearch, velero, loki, flink, alloy, kibana) that map providers to their ConfigMaps
- Fixes Grafana pod stuck in ContainerCreating due to missing volume mounts

Kibana ServiceMonitor template:
- Fix nil pointer crash when kibana.serviceMonitor key is absent
- Use nil-safe access pattern: ((.Values.kibana.serviceMonitor).enabled)

Loki fixes:
- Add prometheus.io service annotations for metrics discovery (port 3100)
- Fix object_store: s3 -> filesystem to match storage.type: filesystem
- Prevents CrashLoopBackOff from missing S3 bucket configuration

PostgreSQL fix:
- Pin metrics exporter image to radiantone/postgres-exporter:0.11.1
- Upstream bitnami/postgres-exporter:0.11.1-debian-11-r34 removed from Docker Hub

ArgoCD fix:
- Add crds.install: true (was missing, causing CRD installation to fail)

Kibana exporter auth:
- Document authentication options for kibana-prometheus-exporter sidecar

Tested: fresh helm install on self-managed-test-dev01 cluster, duploservices-saasops2 namespace. All 45+ pods Running, zero CrashLoopBackOff, zero ImagePullBackOff.
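The nil-safe access pattern called out above leans on a Go-template behavior: a parenthesized field lookup on a missing map evaluates to nil (falsy) instead of aborting the render. A hypothetical excerpt from kibana-servicemonitor.yaml (the resource fields and selector are illustrative):

```yaml
{{- /* Plain .Values.kibana.serviceMonitor.enabled fails with a
       "nil pointer evaluating interface {}.enabled" error when
       serviceMonitor is absent, e.g. under --reuse-values from a
       pre-2.0.1 release; the parenthesized form is safe. */}}
{{- if ((.Values.kibana.serviceMonitor).enabled) }}
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: {{ .Release.Name }}-kibana
spec:
  selector:
    matchLabels:
      app: kibana       # illustrative selector
  endpoints:
    - port: metrics
      interval: 30s
{{- end }}
```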
pgodey
reviewed
Feb 20, 2026
# This file only enables services with metrics and dashboards configured
...
# Global nodeSelector - for EKS cluster (tenantname label)
nodeSelector:
Collaborator
Why do we have duplo references in this file?
pgodey
reviewed
Feb 20, 2026
charts/common-services/values.yaml
Outdated
 zookeeper:
-  enabled: false
+  enabled: true
Collaborator
Why is zookeeper enabled by default?
pgodey
reviewed
Feb 20, 2026
charts/common-services/values.yaml
Outdated
 opensearch:
-  enabled: false
+  enabled: true
Collaborator
Why is opensearch enabled by default?
pgodey
reviewed
Feb 20, 2026
charts/common-services/values.yaml
Outdated
 fluent-bit:
-  enabled: false
+  enabled: true
Collaborator
Why is fluent-bit enabled by default?
pgodey
reviewed
Feb 20, 2026
charts/common-services/values.yaml
Outdated
 velero:
-  enabled: false
+  enabled: true
Collaborator
Why is velero enabled by default?
pgodey
reviewed
Feb 20, 2026
charts/common-services/values.yaml
Outdated
 # -----------------------------------------------------------------
 cloudnative-pg:
-  enabled: false
+  enabled: true
Collaborator
Why is cnpg enabled by default?
pgodey
reviewed
Feb 20, 2026
# Loki -- https://grafana.com/docs/loki/latest/setup/install
# https://github.com/grafana/loki/blob/main/production/helm/loki/
# -----------------------------------------------------------------
loki:
pgodey
reviewed
Feb 20, 2026
charts/common-services/values.yaml
Outdated
 # -----------------------------------------------------------------
 flink-kubernetes-operator:
-  enabled: false
+  enabled: true
pgodey
reviewed
Feb 20, 2026
charts/common-services/values.yaml
Outdated
 # -----------------------------------------------------------------
 alloy:
-  enabled: false
+  enabled: true
Collaborator
Why is alloy enabled by default?
pgodey
requested changes
Feb 20, 2026
- Remove broken namespace filter from prometheus self-monitoring dashboard (prometheus self-scraped metrics lack a namespace label, causing all 19 panels to render empty)
- Add nil-pointer guards to postgres, backup, and kibana templates
- Align values.yaml defaults between 2.0.0 and 2.0.1 (remove stale entries)
- Improve NOTES.txt with upgrade-aware messaging and route details
Addresses audit report dashboard timeouts on customer clusters with 30-50GB Elasticsearch indexes. Customers querying 12h+ time ranges were hitting the 30s default across all three proxy layers.

- Set Grafana dataproxy.timeout to 90s (was 30s default)
- Set Grafana dataproxy.keep_alive_seconds to 30s
- Add timeout: 90 to the Elasticsearch datasource jsonData
- Add timeout server 90s to the HAProxy grafana_backend only (other backends stay at 30s defaults — only Grafana proxies heavy ES aggregation queries through the EOC iframe)
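Under the Grafana Helm chart's conventions, the three timeout changes above would land roughly as follows. The key paths follow the standard grafana.ini [dataproxy] section and the datasource provisioning format; the exact values.yaml layout and the Elasticsearch URL are assumptions, not the chart's confirmed structure:

```yaml
grafana:
  grafana.ini:
    dataproxy:
      timeout: 90              # raised from the 30s default
      keep_alive_seconds: 30
  datasources:
    datasources.yaml:
      apiVersion: 1
      datasources:
        - name: Elasticsearch
          type: elasticsearch
          url: http://elasticsearch-master:9200   # illustrative URL
          jsonData:
            timeout: 90        # per-datasource proxy timeout, in seconds

# And in the HAProxy configuration, for the Grafana backend only:
#   backend grafana_backend
#     timeout server 90s
```

All three layers must agree: raising only one still leaves the slowest 30s hop to cut off long ES aggregation queries.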
GHCR OCI tag 2.0.1 is immutable and contains pre-fix values (hardcoded dashboardsConfigMaps causing Grafana FailedMount). Bump to 2.0.2 so GitHub Actions publishes a corrected image.
This reverts commit 6cc4af6.
… CI timeout

Elasticsearch 7.17.3 crashes on GitHub Actions runners due to a cgroups v2 JVM incompatibility (NullPointerException in CgroupV2Subsystem). Upgrading to 7.17.25, which includes the JVM fix. Also increases the CI test timeout from 15m to 20m to accommodate the full chart install.

Made-with: Cursor
Summary
Fixes multiple deployment failures in common-services 2.0.1 discovered during IDO customer onboarding testing. Addresses ServiceMonitor CRD dependencies, missing Grafana dashboard ConfigMaps, and several pre-existing blockers.
Problem
When deploying common-services 2.0.1 without PrometheusOperator (or with --reuse-values from older releases):

- All ServiceMonitor/PodMonitor/PrometheusRule resources default to enabled: true, causing failures on clusters without PrometheusOperator
- kibana.serviceMonitor.enabled crashes the template render when the key is absent (e.g., --reuse-values from pre-2.0.1)
- Loki has object_store: s3 set but no S3 bucket configured; the storage type was filesystem
- bitnami/postgres-exporter:0.11.1-debian-11-r34 was removed from Docker Hub
- ArgoCD crds.install was missing, preventing fresh installs

Changes
values.yaml:
- All ServiceMonitor/PodMonitor enabled flags: true → false
- All PrometheusRule enabled flags: true → false
- Missing dashboardsConfigMaps entries added
- Loki object_store: s3 → filesystem
- Loki prometheus.io/* annotations for metrics discovery
- PostgreSQL exporter image pinned to radiantone/postgres-exporter:0.11.1-debian-11-r34
- ArgoCD crds.install: true added

templates/grafana-dashboards-metrics.yaml:
Added 5 conditional ConfigMap templates for dashboard providers that had no backing data:
- opensearch-dashboard (gated on opensearch.enabled)
- velero-dashboard (gated on velero.enabled)
- loki-dashboard (gated on loki.enabled)
- flink-dashboard (gated on flink-kubernetes-operator.enabled)
- alloy-dashboard (gated on alloy.enabled)

templates/kibana/kibana-servicemonitor.yaml:
- Use nil-safe access pattern ((.Values.kibana.serviceMonitor).enabled) — safe when the key is absent

How Metrics Still Work
All services retain prometheus.io/* annotations on their Services/Pods. Prometheus discovers targets via its kubernetes-service-endpoints and kubernetes-pods scrape jobs — no ServiceMonitor CRDs needed.

Test Plan
- helm template renders cleanly — zero errors, zero ServiceMonitor/PodMonitor/PrometheusRule resources
- Fresh helm install on self-managed-test-dev01 cluster, duploservices-saasops2 namespace
- serviceMonitor.enabled: true override verified on a cluster with PrometheusOperator