feat(deployment): Add OpenTelemetry Collector.#2278
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughAdds OpenTelemetry telemetry: controller toggles and OTEL env population, a new otel-collector service with config, OTEL env injection into fourteen services, and Compose-file wiring to enable the collector via a telemetry profile. ChangesOpenTelemetry Telemetry Integration
🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@components/clp-package-utils/clp_package_utils/controller.py`:
- Around line 1019-1045: The telemetry block should defensively handle collector
endpoint and normalize more architectures: when building env_vars (the same
place where OTEL_RESOURCE_ATTRIBUTES, OTEL_EXPORTER_OTLP_ENDPOINT and
TELEMETRY_ENDPOINT are set), 1) only set OTEL_EXPORTER_OTLP_ENDPOINT from a
configurable value (e.g., self._clp_config.telemetry.collector_endpoint) and/or
set SDK timeout/connection env vars (e.g., OTEL_EXPORTER_OTLP_TRACES_TIMEOUT or
a generic OTEL_EXPORTER_OTLP_TIMEOUT) so instrumented services won't hang if the
collector is unreachable; 2) expand the arch normalization that currently maps
platform.machine().lower() -> "amd64"/"arm64" in the controller to include
common values (i386, i686 -> "386"; armv7l -> "armv7"; ppc64le, s390x, etc.) and
keep a clear fallback for unknown values; and 3) only set TELEMETRY_ENDPOINT
from self._clp_config.telemetry.endpoint if it is present/non-empty. Update the
code around the env_vars assignments and the arch mapping logic to reference the
existing variables (_clp_config.telemetry, _clp_home, env_vars,
OTEL_RESOURCE_ATTRIBUTES, OTEL_EXPORTER_OTLP_ENDPOINT) so the changes are
localized to the telemetry block.
In `@tools/deployment/package/docker-compose-all.yaml`:
- Around line 715-726: Update the otel-collector service: change the image tag
from "otel/opentelemetry-collector-contrib:0.127.0" to a current stable version
(e.g., 0.152.0) and add a healthcheck using the existing *healthcheck_defaults
anchor so the service follows the same health pattern as other services; modify
the otel-collector service block (service name: otel-collector, image field) to
the new tag and append a healthcheck: *healthcheck_defaults entry under that
service.
In `@tools/deployment/package/otel-collector-config.yaml`:
- Around line 16-21: The config only defines a metrics pipeline
(service.pipelines.metrics with receivers: [otlp], processors: [batch],
exporters: [otlphttp]) so traces and logs will be dropped; add
service.pipelines.traces and service.pipelines.logs entries mirroring the
metrics pipeline (using the same receivers, processors and exporters or
appropriate variants) so all three OTEL signal types are handled, or explicitly
document why traces/logs should remain excluded.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 014a63d0-cea7-4ccf-9684-c9cf8b589f4a
📒 Files selected for processing (8)
components/clp-package-utils/clp_package_utils/controller.pytools/deployment/package/docker-compose-all.yamltools/deployment/package/docker-compose-base.yamltools/deployment/package/docker-compose-spider-base.yamltools/deployment/package/docker-compose-spider.yamltools/deployment/package/docker-compose.dev.yamltools/deployment/package/docker-compose.yamltools/deployment/package/otel-collector-config.yaml
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
tools/deployment/package/otel-collector-config.yaml (1)
20-26:⚠️ Potential issue | 🟠 Major | ⚡ Quick winAdd traces and logs pipelines to complete OpenTelemetry support.
The collector configuration only defines a
metricspipeline. OpenTelemetry supports three signal types: metrics, traces, and logs. If instrumented services emit traces or logs (which many OTEL SDKs do by default), that telemetry data will be dropped or result in errors.📊 Proposed fix to add all telemetry signals
service: extensions: [health_check] pipelines: metrics: receivers: [otlp] processors: [batch] exporters: [otlphttp] + traces: + receivers: [otlp] + processors: [batch] + exporters: [otlphttp] + logs: + receivers: [otlp] + processors: [batch] + exporters: [otlphttp]🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/deployment/package/otel-collector-config.yaml` around lines 20 - 26, The config only defines a metrics pipeline so traces and logs will be dropped; add pipelines named "traces" and "logs" under service.pipelines mirroring the existing metrics pipeline (use receivers: [otlp], processors: [batch], exporters: [otlphttp]) so that the OpenTelemetry collector handles all three signal types; ensure the new pipeline names match the OpenTelemetry signal keys ("traces" and "logs") and follow the same structure as the existing "metrics" block.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tools/deployment/package/docker-compose-all.yaml`:
- Around line 715-734: Apply the shared service defaults to the otel-collector
service by adding the YAML merge key for the existing service_defaults anchor
(i.e., include "<<: *service_defaults" under the otel-collector service), remove
the explicit restart: "${CLP_RESTART_POLICY...}" line (since restart is
inherited), and keep its existing healthcheck: block (which already uses
*healthcheck_defaults); this will ensure it inherits logging driver, user
(CLP_THIRD_PARTY_SERVICE_UID_GID), stop_grace_period, and extra_hosts like other
third-party services.
---
Duplicate comments:
In `@tools/deployment/package/otel-collector-config.yaml`:
- Around line 20-26: The config only defines a metrics pipeline so traces and
logs will be dropped; add pipelines named "traces" and "logs" under
service.pipelines mirroring the existing metrics pipeline (use receivers:
[otlp], processors: [batch], exporters: [otlphttp]) so that the OpenTelemetry
collector handles all three signal types; ensure the new pipeline names match
the OpenTelemetry signal keys ("traces" and "logs") and follow the same
structure as the existing "metrics" block.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 255f26d4-25a0-4fe6-a430-01cc93fdfe2c
📒 Files selected for processing (2)
tools/deployment/package/docker-compose-all.yamltools/deployment/package/otel-collector-config.yaml
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tools/deployment/package/otel-collector-config.yaml`:
- Around line 3-4: Duplicate YAML key "endpoint" under the health_check block
causes ambiguity; remove one of the entries so only a single endpoint remains
(e.g., keep endpoint: "0.0.0.0:13133" and delete the other). Locate the
health_check section and ensure only one endpoint key is present and correctly
formatted, then validate the YAML to confirm no duplicate keys remain.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: e7ca0525-caf7-4399-8df5-cffcb5310704
📒 Files selected for processing (1)
tools/deployment/package/otel-collector-config.yaml
This reverts commit 1dfd5a6.
There was a problem hiding this comment.
Actionable comments posted: 13
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@components/clp-package-utils/clp_package_utils/controller.py`:
- Around line 1176-1177: The function-level imports of json, time (and a
duplicate platform) should be moved to the module top: remove the in-function
import statements (the ones importing json and time around lines ~1176 and the
platform import around ~1217) and ensure json and time are imported alongside
the existing top-level platform import; update any functions that relied on
those local imports to use the module-level names (no other code changes
needed).
- Around line 1235-1257: The current code in controller.py spawns a docker
container via subprocess.run to POST metrics (the block that constructs the
docker run array referencing self._project_name and payload_str), which is
costly and fragile; replace that subprocess.run call with a direct HTTP POST
from the controller process (e.g., use requests.post or urllib) to
"http://otel-collector:4318/v1/metrics" with headers
{"Content-Type":"application/json"}, body payload_str, and timeout=5, preserve
the existing exception handling and logger.warning(e) behavior (the same message
used where you currently catch Exception as e) so failures are logged the same
way but without requiring the curl image or spawning containers.
- Around line 1213-1231: Duplicate logic: reading VERSION from
self._clp_home/VERSION and building resource_attrs is repeated between
set_up_env and _emit_topology_metrics; refactor by extracting this into a helper
or caching it on the instance. Specifically, in set_up_env populate
self._clp_version (from VERSION file) and self._resource_attrs (a dict/list
matching the current resource_attrs shape) and then update
_emit_topology_metrics to use self._clp_version and self._resource_attrs instead
of re-reading VERSION or reconstructing resource_attrs and appending directly to
payload["resourceMetrics"][0]["resource"]["attributes"]; ensure keys like
"clp.deployment.id", "service.version", "clp.deployment.method",
"clp.storage.engine", "os.type", "host.arch", and "service.name" are set once
and reused.
In `@tools/deployment/package-helm/templates/_helpers.tpl`:
- Around line 53-58: The clp.resourceAttributes template is incorrectly using
.Capabilities.KubeVersion for os.type (producing values like v1.28.0); update
the template (define "clp.resourceAttributes") to use a proper OS source such as
a new Values field or a static default, e.g., replace the os.type expression
with .Values.hostOS | default "linux" (or simply the literal "linux" if you
don't want a new value), ensuring the attribute reads os.type=<os> instead of
the Kubernetes version.
In `@tools/deployment/package-helm/templates/otel-collector-configmap.yaml`:
- Line 5: Replace the hardcoded ConfigMap name "otel-collector-config" with the
chart fullname helper so the resource uses {{ include "clp.fullname" . }} (i.e.,
set metadata.name to the clp.fullname helper) and update the corresponding
reference in otel-collector-deployment.yaml (the ConfigMap volume/envRef
reference at line ~34) to use the same include "clp.fullname" . helper so both
templates resolve to the same release-scoped name.
In `@tools/deployment/package-helm/templates/otel-collector-deployment.yaml`:
- Around line 6-7: Replace the hard-coded labels block in the otel-collector
Deployment template with the chart's standard Helm label helpers: use the
clp.labels helper for pod/metadata labels and clp.selectorLabels for the
selector/matchLabels so labels are consistent with the rest of the chart; update
the labels at the Deployment metadata and the pod template metadata (the blocks
currently using app: otel-collector) and the spec.selector.matchLabels to call
{{ include "clp.labels" . }} and {{ include "clp.selectorLabels" . }} (or the
chart's equivalent helper names) so selectors and labels remain in sync.
- Line 5: The Deployment name is hardcoded as "otel-collector"; change it to use
the chart helper so it follows the release-scoped naming pattern by replacing
the literal with the template helper call (use the same pattern as other
resources: {{ include "clp.fullname" . }}-otel-collector or similar) in the
otel-collector Deployment manifest and update any places that reference that
name (e.g., the corresponding Service and selector/ownerReferences) so all
references use the helper-derived name consistently; ensure you use the exact
helper symbol "clp.fullname" as used elsewhere in the chart.
- Around line 19-30: The otel-collector container (name: otel-collector) is
missing Kubernetes health probes; add both readinessProbe and livenessProbe to
the container spec using HTTP GET checks against the collector health endpoint
(e.g., path /health on port 13133) with sensible initialDelaySeconds,
periodSeconds and failureThreshold values; ensure your collector configuration
includes the health_check extension (or change the probe path/port to match your
config) so the probes succeed.
- Around line 21-22: The container command overrides the image ENTRYPOINT and
currently only supplies the "--config=/etc/otelcol-contrib/config.yaml" arg,
which prevents the otelcol binary from being executed; update the pod spec's
command array to include the binary name "/otelcontribcol" as the first element
followed by the existing "--config=/etc/otelcol-contrib/config.yaml" argument so
the process runs correctly (modify the command: array that currently contains
"--config=..." to be ["/otelcontribcol",
"--config=/etc/otelcol-contrib/config.yaml"]).
- Line 20: The template uses a guard checking .Values.clpConfig.telemetry but
then references .Values.telemetry.collector.image and
.Values.telemetry.collector.resources which are the wrong value paths; update
the image and resources references to use
.Values.clpConfig.telemetry.collector.image and
.Values.clpConfig.telemetry.collector.resources respectively so they match the
conditional (.Values.clpConfig.telemetry) and the configmap usage
(.Values.clpConfig.telemetry.endpoint).
In `@tools/deployment/package-helm/templates/otel-collector-service.yaml`:
- Around line 6-7: Replace the hardcoded labels and selector in the
otel-collector service template with the chart's standard label helpers: change
the literal "labels: app: otel-collector" block to use {{ include "clp.labels" .
}} for metadata.labels and replace the selector block to use {{ include
"clp.selectorLabels" . }} so the service uses the same controlled label/selector
helpers as the rest of the chart.
- Around line 10-12: The Service port block for the otel collector currently
lacks a name; add a descriptive name field (e.g., name: "otlp-http") to the port
definition where port: 4318 / protocol: TCP / targetPort: 4318 is declared so
the Service port is named (helps future multi-port exposure and follows k8s best
practices). Update the Service port entry in the otel-collector Service manifest
(the port definition with port: 4318, protocol: TCP, targetPort: 4318) to
include the name key.
- Line 5: The Service manifest currently hardcodes the name "otel-collector";
change it to use the chart helper clp.fullname so the Service name follows the
chart's naming convention and avoids collisions. Locate the Service resource
where name: otel-collector is set (the otel-collector Service template) and
replace the literal with the Helm template helper invocation that outputs the
release-scoped fullname (use the clp.fullname helper), ensuring any surrounding
quotes/templating match other templates in the chart.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 338898c2-d7e8-41c1-9d8c-544a0633c11a
📒 Files selected for processing (15)
components/clp-package-utils/clp_package_utils/controller.pytools/deployment/package-helm/Chart.yamltools/deployment/package-helm/templates/_helpers.tpltools/deployment/package-helm/templates/api-server-deployment.yamltools/deployment/package-helm/templates/compression-scheduler-deployment.yamltools/deployment/package-helm/templates/compression-worker-deployment.yamltools/deployment/package-helm/templates/log-ingestor-deployment.yamltools/deployment/package-helm/templates/otel-collector-configmap.yamltools/deployment/package-helm/templates/otel-collector-deployment.yamltools/deployment/package-helm/templates/otel-collector-service.yamltools/deployment/package-helm/templates/query-scheduler-deployment.yamltools/deployment/package-helm/templates/query-worker-deployment.yamltools/deployment/package-helm/templates/reducer-deployment.yamltools/deployment/package/docker-compose-all.yamltools/deployment/package/otel-collector-config.yaml
* Revert "Reapply "remove server"" This reverts commit c71e830. * use resourcedetectionprocessor * matching style quotes * use resource detection * move configmap.yaml * persisting uuid * follow deployment.yaml * imagepull policy * move file * replica count method * avoid hardcode * remove dev.yaml * healthchecks * remove telemetry-server
junhaoliao
left a comment
There was a problem hiding this comment.
very good. here are the very last batch of suggestions
for the title, how about simply:
feat(deployment): Add OpenTelemetry Collector.
Co-authored-by: Junhao Liao <junhao@junhao.ca>
There was a problem hiding this comment.
Pull request overview
This PR adds optional (opt-out) OpenTelemetry telemetry support to CLP deployments by introducing an OpenTelemetry Collector in both Docker Compose and Helm, wiring OTEL environment variables into services, and emitting initial “deployment topology” gauge metrics at startup.
Changes:
- Adds an
otel-collectorservice to Docker Compose deployments and a conditional collector Deployment/Service to the Helm chart. - Injects OpenTelemetry-related env vars (
OTEL_SERVICE_NAME,OTEL_EXPORTER_OTLP_ENDPOINT,OTEL_RESOURCE_ATTRIBUTES,CLP_DISABLE_TELEMETRY) into (some) components. - Emits topology metrics on Docker Compose startup and adds a Helm hook Job to emit a similar payload post-install/upgrade.
Reviewed changes
Copilot reviewed 23 out of 23 changed files in this pull request and generated 10 comments.
Show a summary per file
| File | Description |
|---|---|
| tools/deployment/package/docker-compose.yaml | Includes otel-collector in the full Docker Compose deployment. |
| tools/deployment/package/docker-compose-spider.yaml | Includes otel-collector in the spider full Docker Compose deployment. |
| tools/deployment/package/docker-compose-spider-base.yaml | Includes otel-collector in the spider base Docker Compose deployment. |
| tools/deployment/package/docker-compose-base.yaml | Includes otel-collector in the base Docker Compose deployment. |
| tools/deployment/package/docker-compose-all.yaml | Adds shared telemetry env defaults, OTEL env injection to services, and defines the otel-collector service. |
| tools/deployment/package-helm/values.yaml | Adds Helm values for collector image and telemetry config. |
| tools/deployment/package-helm/templates/topology-metrics-emitter-job.yaml | New Helm hook Job that POSTs topology metrics to the in-cluster collector. |
| tools/deployment/package-helm/templates/reducer-deployment.yaml | Injects OTEL env vars into reducer pod. |
| tools/deployment/package-helm/templates/query-worker-deployment.yaml | Injects OTEL env vars into query-worker pod. |
| tools/deployment/package-helm/templates/query-scheduler-deployment.yaml | Injects OTEL env vars into query-scheduler pod. |
| tools/deployment/package-helm/templates/otel-collector-service.yaml | New Service for the in-cluster OTEL collector (conditional). |
| tools/deployment/package-helm/templates/otel-collector-deployment.yaml | New Deployment for the in-cluster OTEL collector (conditional). |
| tools/deployment/package-helm/templates/log-ingestor-deployment.yaml | Injects OTEL env vars into log-ingestor pod. |
| tools/deployment/package-helm/templates/instance-id-configmap.yaml | New ConfigMap to persist a stable instance UUID across Helm upgrades. |
| tools/deployment/package-helm/templates/configmap.yaml | Adds collector config content into the chart ConfigMap (conditional). |
| tools/deployment/package-helm/templates/compression-worker-deployment.yaml | Injects OTEL env vars into compression-worker pod. |
| tools/deployment/package-helm/templates/compression-scheduler-deployment.yaml | Injects OTEL env vars into compression-scheduler pod. |
| tools/deployment/package-helm/templates/api-server-deployment.yaml | Injects OTEL env vars into api-server pod. |
| tools/deployment/package-helm/templates/_helpers.tpl | Adds helpers for instanceId persistence, resource attributes, topology payload generation, and telemetry env var injection. |
| tools/deployment/package-helm/Chart.yaml | Bumps chart version. |
| components/package-template/src/etc/otel-collector/config.yaml | Adds a default OTEL collector config file to the package template (Docker Compose path). |
| components/clp-package-utils/clp_package_utils/general.py | Adds a small HTTP helper with retries for emitting topology metrics. |
| components/clp-package-utils/clp_package_utils/controller.py | Wires telemetry env setup, collector configuration, and topology metrics emission into DockerComposeController startup. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| volumes: | ||
| - type: "bind" | ||
| source: "${CLP_OTEL_COLLECTOR_CONF_FILE_HOST:-./etc/otel-collector/config.yaml}" | ||
| target: "/etc/otelcol-contrib/config.yaml" | ||
| read_only: true | ||
| - type: "bind" | ||
| source: "/var/run/docker.sock" | ||
| target: "/var/run/docker.sock" |
There was a problem hiding this comment.
removing this access would mean needing env vars to pass in the system info. Not sure if the security benefit is worth the complexity of changing. @junhaoliao can you please advise
| {{/* | ||
| Provides environment variables for telemetry (except service.name). | ||
| */}} | ||
| {{- define "clp.telemetryEnv" -}} | ||
| {{- if and .Values.clpConfig.telemetry .Values.clpConfig.telemetry.disable }} | ||
| - name: CLP_DISABLE_TELEMETRY | ||
| value: "true" | ||
| {{- else }} | ||
| - name: OTEL_EXPORTER_OTLP_ENDPOINT | ||
| value: "http://{{ include "clp.fullname" . }}-otel-collector:4318" | ||
| - name: OTEL_RESOURCE_ATTRIBUTES | ||
| value: {{ include "clp.resourceAttributes" . | quote }} | ||
| {{- end }} | ||
| {{- end -}} |
There was a problem hiding this comment.
@junhaoliao we dont want to include anything not yet used right?
Description
telemetryprofile/config flag)OTEL_SERVICE_NAME,OTEL_EXPORTER_OTLP_ENDPOINT,OTEL_RESOURCE_ATTRIBUTES, andCLP_DISABLE_TELEMETRYenv vars into every service/component.DockerComposeControllerreads the CLP version, builds resource attributes (deployment ID, version, storage engine, OS, arch), and emits deployment topology gauges (worker replicas, concurrency) to the collector at startup.CLP_DISABLE_TELEMETRY=true(Docker Compose) orclpConfig.telemetry.disable: true(Helm) to disable.Checklist
breaking change.
Validation performed
Started CLP with the local telemetry backend (from the
clp-telemetry-serverrepo). Verifiedotel-collectorlogs showed incoming OTLP data.test end-to-end with the telemetry server:
1.
change build/clp-package/etc/clp-config.yaml's endpoint to http://172.17.0.1:4318
test docker compose with start-clp.sh.
check telemetry:
test helm uuid generation