feat(deployment): Add OpenTelemetry metrics for Rust #2286

Open
Nathan903 wants to merge 15 commits into y-scope:main from Nathan903:pr3submit

@Nathan903 Nathan903 commented May 16, 2026

Description

Adds telemetry metrics to CLP's Rust services (api-server, log-ingestor).

  • New telemetry module in clp-rust-utils with init_telemetry(), TelemetryGuard, and shutdown_telemetry()
  • api-server: emits clp.service.event counter on startup
  • log-ingestor: emits clp.ingest.bytes_total and clp.ingest.records_total counters per ingested object in Buffer::add
  • Bug fix: CLP_DISABLE_TELEMETRY now checks the env var's value against ["1", "true", "yes", "y"] instead of checking existence
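The value check described in the last bullet might look roughly like the sketch below. It is an illustration rather than the PR's code: the function name `is_telemetry_disabled`, the whitespace trimming, and the case-insensitive comparison are assumptions; only the list of accepted values comes from the description.

```rust
/// Values that disable telemetry, as listed in the PR description.
const DISABLE_VALUES: [&str; 4] = ["1", "true", "yes", "y"];

/// Returns true only when the variable is set to one of the accepted values.
/// `raw` is the variable's value if it is set, or `None` if it is unset.
/// Assumption: surrounding whitespace and letter case are ignored.
fn is_telemetry_disabled(raw: Option<&str>) -> bool {
    match raw {
        Some(value) => {
            let normalized = value.trim().to_ascii_lowercase();
            DISABLE_VALUES.contains(&normalized.as_str())
        }
        // An unset variable never disables telemetry.
        None => false,
    }
}
```

A caller could pass `std::env::var("CLP_DISABLE_TELEMETRY").ok().as_deref()`. Under an existence-only check, `CLP_DISABLE_TELEMETRY=0` would have disabled telemetry; with a value check it does not.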

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

Started CLP with the local telemetry backend (from the clp-telemetry-server repo). Verified otel-collector logs showed incoming OTLP data.

To test end-to-end with the telemetry server:

  1. Start the telemetry server:

     cd ~/clp/tools/deployment/telemetry_server
     docker compose up -d

  2. Change the endpoint in build/clp-package/etc/clp-config.yaml to http://172.17.0.1:4318.

  3. Start CLP under Docker Compose with start-clp.sh.

  4. Check that telemetry arrived:

     cd ~/clp/tools/deployment/telemetry_server

     docker exec telemetry_server-clickhouse-1 clickhouse-client \
       --user default --password clickhouse \
       --query "SELECT * FROM clp_telemetry.otel_metrics_sum ORDER BY TimeUnix DESC LIMIT 5 FORMAT Vertical"

Summary by CodeRabbit

  • New Features

    • Application-wide telemetry and metrics added: service startup event, ingest counters and service event counters.
    • Services initialize monitoring on startup and gracefully shut down telemetry on exit.
    • Telemetry can be disabled via configuration or the CLP_DISABLE_TELEMETRY environment variable.
  • Bug Fixes

    • OTLP exporter build failures are surfaced as errors instead of causing panics.


@Nathan903 Nathan903 requested a review from a team as a code owner May 16, 2026 10:39
coderabbitai Bot commented May 16, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

OpenTelemetry metrics added: clp-rust-utils exposes init/shutdown utilities and error variant; api-server and log-ingestor initialize telemetry, register meters and counters, emit startup metrics, and rely on guard drop for provider shutdown. Ingestion buffer records bytes and record counters.

Changes

OpenTelemetry Metrics Infrastructure

| Layer | File(s) | Summary |
| --- | --- | --- |
| Core telemetry utilities and infrastructure | components/clp-rust-utils/Cargo.toml, components/clp-rust-utils/src/lib.rs, components/clp-rust-utils/src/telemetry.rs, components/clp-rust-utils/src/error.rs | OpenTelemetry SDK, metrics, and OTLP exporter dependencies added. telemetry module exported. init_telemetry conditionally initializes an HTTP OTLP metric exporter and SdkMeterProvider, registers it globally, and returns Result<Option<SdkMeterProvider>, Error>. shutdown_telemetry shuts down the provider. New TelemetryExporterBuild(String) error variant added. |
| api-server telemetry integration | components/api-server/Cargo.toml, components/api-server/src/bin/api_server.rs, components/api-server/src/error.rs | api-server adds the opentelemetry dependency, calls init_telemetry during startup to obtain a provider, creates an api-server meter, builds a clp.service.event counter, records a type=start event, maps telemetry exporter build errors in From<clp_rust_utils::Error> for ClientError, and relies on a Drop guard to call shutdown_telemetry on exit. Server await handling captures and returns the serve result. |
| log-ingestor telemetry integration | components/log-ingestor/Cargo.toml, components/log-ingestor/src/bin/log_ingestor.rs | log-ingestor adds the opentelemetry dependency, calls init_telemetry during startup and retains the returned provider, adjusts server await/shutdown flow to call shutdown_telemetry after the server future completes, and returns the serve result. |
| ingest buffer metrics | components/log-ingestor/src/compression/buffer.rs | Adds opentelemetry imports and instrumentation in Buffer::add: creates a meter, registers clp.ingest.bytes_total and clp.ingest.records_total u64 counters, and increments them per ingested entry. |
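The three outcomes of init_telemetry summarized above (disabled, built successfully, exporter build failed) can be sketched without the OpenTelemetry crates. Everything below is a stand-in chosen for illustration: `TelemetryConfig`, its fields, `Provider`, and `build_exporter` are hypothetical, and the endpoint check merely simulates a build failure. Only the return shape and the `TelemetryExporterBuild(String)` variant name come from the walkthrough.

```rust
/// Stand-in for the telemetry section of the service config (hypothetical fields).
struct TelemetryConfig {
    disable: bool,
    endpoint: String,
}

/// Stand-in for `SdkMeterProvider`; it only records what it was built with.
#[derive(Debug, PartialEq)]
struct Provider {
    service_name: String,
    endpoint: String,
}

/// Mirrors the `TelemetryExporterBuild(String)` variant named in the walkthrough.
#[derive(Debug, PartialEq)]
enum Error {
    TelemetryExporterBuild(String),
}

/// Hypothetical exporter builder: returns an error instead of panicking.
fn build_exporter(endpoint: &str) -> Result<String, String> {
    if endpoint.starts_with("http://") || endpoint.starts_with("https://") {
        Ok(endpoint.to_string())
    } else {
        Err(format!("invalid OTLP endpoint: {endpoint:?}"))
    }
}

/// Returns `Ok(None)` when telemetry is disabled, `Ok(Some(provider))` on
/// success, and `Err(..)` when the exporter cannot be built.
fn init_telemetry(
    config: &TelemetryConfig,
    service_name: &str,
) -> Result<Option<Provider>, Error> {
    if config.disable {
        return Ok(None);
    }
    let endpoint = build_exporter(&config.endpoint).map_err(Error::TelemetryExporterBuild)?;
    Ok(Some(Provider {
        service_name: service_name.to_string(),
        endpoint,
    }))
}
```

Returning `Option` inside the `Result` lets callers distinguish "telemetry intentionally off" from "telemetry failed to start", so a service can keep running without metrics in the first case and report an error in the second.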

Sequence Diagram

sequenceDiagram
  participant Service as Service (api-server/log-ingestor)
  participant InitTel as init_telemetry()
  participant OTLPExp as HTTP OTLP Exporter
  participant Provider as SdkMeterProvider
  participant Global as global::set_meter_provider
  participant ShutdownTel as shutdown_telemetry()
  Service->>InitTel: call with telemetry config and service_name
  InitTel->>OTLPExp: create HTTP metric exporter
  InitTel->>Provider: build SdkMeterProvider with exporter
  InitTel->>Global: register as global meter provider
  InitTel-->>Service: return Some(provider)
  Service->>Service: create service meter and counters
  Service->>Service: emit startup/operational metrics
  Service->>ShutdownTel: call with provider on graceful exit (Drop)
  ShutdownTel->>Provider: shutdown()
  ShutdownTel-->>Service: return

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • y-scope/clp#2251: Introduces telemetry consent/config and environment plumbing (telemetry.disable, OTLP endpoint, opt-out flags) that control telemetry initialization.

Suggested reviewers

  • junhaoliao
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The pull request title accurately describes the main objective of adding OpenTelemetry metrics support for Rust components, which is clearly demonstrated across multiple file changes including telemetry module creation, metric initialization, and instrumentation. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@components/api-server/src/bin/api_server.rs`:
- Around line 82-87: The
axum::serve(...).with_graceful_shutdown(shutdown_signal()).await currently uses
? which returns early and skips
clp_rust_utils::telemetry::shutdown_telemetry(tel_provider); change this to
capture the result (e.g., let serve_res =
axum::serve(...).with_graceful_shutdown(...).await;), then call
clp_rust_utils::telemetry::shutdown_telemetry(tel_provider); and finally
propagate the serve result (return or use serve_res?); ensure you reference the
axum::serve call, shutdown_signal, and
clp_rust_utils::telemetry::shutdown_telemetry(tel_provider) so telemetry always
runs on both success and error paths.

In `@components/clp-rust-utils/src/telemetry.rs`:
- Around line 26-30: The telemetry initialization currently panics via expect
when building the OTLP metric exporter (the call to
opentelemetry_otlp::MetricExporter::builder().with_http().with_endpoint(endpoint).build().expect(...)),
so change the telemetry init function to return a Result (or return an Option)
and handle exporter build failures instead of panicking: replace the expect call
with proper error handling (e.g., let exporter = ...build()? or match the Result
and log the error and disable telemetry by returning Ok(None) / an appropriate
Err), update the telemetry init function signature to return
Result<TelemetryHandle, Error> (or Option) and propagate or gracefully handle
the error where the function is called (ensuring code paths can run without
telemetry).

In `@components/log-ingestor/src/bin/log_ingestor.rs`:
- Around line 85-90: The call to
axum::serve(...).with_graceful_shutdown(shutdown_signal()).await? can return
early and skip clp_rust_utils::telemetry::shutdown_telemetry(tel_provider);
change it to first capture the serve result (e.g., let serve_result =
axum::serve(...).with_graceful_shutdown(...).await;), then always call
clp_rust_utils::telemetry::shutdown_telemetry(tel_provider); and finally
propagate the serve outcome (e.g., use serve_result? or return serve_result) so
telemetry is shut down regardless of server error.
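The capture-then-shutdown ordering requested for both binaries can be shown in isolation. In this sketch the helper `serve_then_shutdown` is hypothetical, and plain closures stand in for the awaited `axum::serve(...)` future and for `shutdown_telemetry`; the real code is async, but the ordering concern is the same.

```rust
/// Runs `serve`, then always runs `shutdown`, then hands back the serve result.
/// If `serve()?` were used instead of the `let`, an error would return early
/// and `shutdown` would be skipped.
fn serve_then_shutdown<T, E>(
    serve: impl FnOnce() -> Result<T, E>,
    shutdown: impl FnOnce(),
) -> Result<T, E> {
    // Capture the outcome without propagating it yet.
    let serve_result = serve();
    // Cleanup runs on both the success and the error path.
    shutdown();
    // Only now surface the outcome to the caller.
    serve_result
}
```

This covers the path where the server itself fails. It does not cover early returns that happen before the server starts, which is what the later drop-guard comments address.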
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 9b53f282-438c-467d-b5b0-7ba36000e79d

📥 Commits

Reviewing files that changed from the base of the PR and between 137af2c and a06eeb1.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (7)
  • components/api-server/Cargo.toml
  • components/api-server/src/bin/api_server.rs
  • components/clp-rust-utils/Cargo.toml
  • components/clp-rust-utils/src/lib.rs
  • components/clp-rust-utils/src/telemetry.rs
  • components/log-ingestor/Cargo.toml
  • components/log-ingestor/src/bin/log_ingestor.rs

Comment thread components/api-server/src/bin/api_server.rs Outdated
Comment thread components/clp-rust-utils/src/telemetry.rs Outdated
Comment thread components/log-ingestor/src/bin/log_ingestor.rs Outdated
@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@components/api-server/src/bin/api_server.rs`:
- Around line 58-60: The startup metric is emitted too early; move the
startup_counter.add(1, &[opentelemetry::KeyValue::new("type", "start")]) call so
it runs only after all critical startup preconditions (binding/listening,
connecting dependencies, router initialization) have completed successfully.
Locate the meter/u64_counter creation (meter and startup_counter) and keep
creation where it is but remove the add() call from its current position and
place it immediately after the code path that confirms successful
bind/connect/router setup (the success/ready branch that actually starts the
server), ensuring any error paths do not call startup_counter.add.
- Around line 56-57: Wrap the result of
clp_rust_utils::telemetry::init_telemetry(&config.telemetry) (stored in
tel_provider) in a drop guard so telemetry::shutdown is always called on
unwind/early return; implement a small RAII guard type (e.g., TelemetryGuard)
whose Drop calls clp_rust_utils::telemetry::shutdown/tel_provider.shutdown and
store the guard next to tel_provider, or use the scopeguard crate to defer the
shutdown, and do the same for the second telemetry init (the later init at the
other call site) so shutdown runs on all exit paths, not just after serve.

In `@components/clp-rust-utils/src/telemetry.rs`:
- Around line 31-34: Replace the use of Resource::default() when building the
SdkMeterProvider so metrics carry an explicit service identity: build a Resource
via Resource::builder_empty().with_attributes(vec![KeyValue::new("service.name",
"<service-identifier>")]) and pass that resource to
SdkMeterProvider::builder().with_reader(reader).with_resource(...) before
.build(); use an appropriate concrete service identifier string in place of
"<service-identifier>" so SdkMeterProvider (and exported metrics) are
unambiguously attributed.

In `@components/log-ingestor/src/bin/log_ingestor.rs`:
- Around line 64-67: Counters _bytes_total and _records_total are created with
meter.u64_counter("clp.ingest.bytes_total") and
meter.u64_counter("clp.ingest.records_total") but never used; find the ingest
path (e.g., the function handling incoming records / the loop that
writes/forwards data) and call add(...) on those instruments after successful
processing—increment _bytes_total.add(n_bytes, &[...]) with the number of bytes
ingested and _records_total.add(1, &[...]) per record (include any relevant
attributes/tags you already use). Ensure you keep the variables in scope (remove
the leading underscore if necessary) and place the add(...) calls in the success
branch (not on errors) so the counters reflect actual ingested bytes and
records.
- Around line 62-63: The telemetry provider created by
clp_rust_utils::telemetry::init_telemetry (assigned to tel_provider) may be
skipped on early returns because main uses the ? operator; wrap tel_provider in
a RAII guard or use a defer-like Drop handler so shutdown always runs on any
return from main (including pre-serve failures). Concretely, create a small
guard type or use scopeguard::defer in main right after tel_provider is created
(or replace tel_provider with a guard that holds it) and call the telemetry
shutdown/flush in its Drop/closure so telemetry shutdown executes regardless of
where main returns (this should cover the code paths around the tel_provider
creation and the later return points around the serve/exit logic).
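The drop-guard approach suggested in the comments above can be sketched with the standard library alone. The `Provider` type here is a stand-in that just counts shutdown calls, and `start` is a hypothetical startup path; the guard is named `TelemetryGuard` after the type mentioned in the PR description, but its fields and behavior are assumptions.

```rust
use std::cell::Cell;
use std::rc::Rc;

/// Stand-in for the meter provider; counts how often it was shut down.
struct Provider {
    shutdowns: Rc<Cell<u32>>,
}

impl Provider {
    fn shutdown(&self) {
        self.shutdowns.set(self.shutdowns.get() + 1);
    }
}

/// Owns the provider (if telemetry is enabled) and shuts it down on drop.
struct TelemetryGuard {
    provider: Option<Provider>,
}

impl Drop for TelemetryGuard {
    fn drop(&mut self) {
        // `take` leaves `None` behind, so shutdown runs at most once.
        if let Some(provider) = self.provider.take() {
            provider.shutdown();
        }
    }
}

/// Simulates a startup path that can fail after telemetry is initialized.
fn start(shutdowns: Rc<Cell<u32>>, bind_ok: bool) -> Result<(), String> {
    let _guard = TelemetryGuard {
        provider: Some(Provider { shutdowns }),
    };
    if !bind_ok {
        // Early return: `_guard` is still dropped, so shutdown still runs.
        return Err("bind failed".to_string());
    }
    Ok(())
}
```

The guard must be bound to a named variable such as `_guard`. Writing `let _ = TelemetryGuard { .. }` would drop it immediately and shut telemetry down before the service starts.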
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: bebc6afd-7b87-4f08-83c0-076b758311d0

📥 Commits

Reviewing files that changed from the base of the PR and between a06eeb1 and 354334f.

📒 Files selected for processing (4)
  • components/api-server/src/bin/api_server.rs
  • components/clp-rust-utils/src/error.rs
  • components/clp-rust-utils/src/telemetry.rs
  • components/log-ingestor/src/bin/log_ingestor.rs

Comment thread components/api-server/src/bin/api_server.rs
Comment thread components/api-server/src/bin/api_server.rs Outdated
Comment thread components/clp-rust-utils/src/telemetry.rs
Comment thread components/log-ingestor/src/bin/log_ingestor.rs Outdated
Comment thread components/log-ingestor/src/bin/log_ingestor.rs Outdated
@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@components/log-ingestor/src/bin/log_ingestor.rs`:
- Line 62: The telemetry provider created by
clp_rust_utils::telemetry::init_telemetry (tel_provider) can be leaked on early
returns from TcpListener::bind, IngestionJobManagerState::from_config, or
create_router; wrap tel_provider in a guard (e.g., scopeguard::guard) or use an
RAII wrapper type from clp-rust-utils::telemetry that implements Drop to call
clp_rust_utils::telemetry::shutdown_telemetry automatically, and then remove the
explicit shutdown_telemetry call currently executed at the end of main so the
provider is always flushed and cleaned up on scope exit.

In `@components/log-ingestor/src/compression/buffer.rs`:
- Around line 81-82: The metrics calls to bytes_total.add(...) and
records_total.add(...) use empty attribute slices (&[]); if contextual labels
like job_id, source, or object_type are available in scope, construct an
attributes slice (e.g. using KeyValue::new(...) items) and pass it into both
bytes_total.add(entry.size, &attributes) and records_total.add(1, &attributes)
so the metrics become queryable; update the code around the bytes_total and
records_total uses to build and reuse the attributes array (or leave &[] if no
context is available).
- Around line 76-78: The meter and counters (meter, bytes_total, records_total)
are being created inside the hot add() path; move their creation out of add()
and into the Buffer struct as fields (e.g., add fields meter: Meter,
bytes_total: Counter<u64>, records_total: Counter<u64>, using the types from
opentelemetry::metrics) and initialize them once in Buffer::new (note Buffer::new
will no longer be const fn because instrument construction is runtime). Then
update add() to use these struct fields instead of rebuilding the instruments on
each call.
- Line 5: Remove the unused import opentelemetry::KeyValue from the top of the
file (it's unused because attributes are passed as empty slices around the code
where attributes are constructed at lines ~81–82); simply delete the `use
opentelemetry::KeyValue;` line to clean up imports and run a build/lint to
confirm no other references remain.
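The suggestion to build the instruments once and reuse them from add() can be illustrated as follows. This is a simplified sketch, not the actual `Buffer`: the `Counter` type is an atomic stand-in for an OpenTelemetry counter, and the `sizes` field and the `add` signature are hypothetical.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Stand-in for an OpenTelemetry u64 counter.
#[derive(Default)]
struct Counter {
    total: AtomicU64,
}

impl Counter {
    fn add(&self, value: u64) {
        self.total.fetch_add(value, Ordering::Relaxed);
    }

    fn get(&self) -> u64 {
        self.total.load(Ordering::Relaxed)
    }
}

/// Hypothetical, simplified buffer: the counters are fields built once in `new`.
struct Buffer {
    sizes: Vec<u64>,
    bytes_total: Counter,
    records_total: Counter,
}

impl Buffer {
    /// Instruments are constructed here, once per buffer.
    fn new() -> Self {
        Self {
            sizes: Vec::new(),
            bytes_total: Counter::default(),
            records_total: Counter::default(),
        }
    }

    /// Hot path: only increments the existing counters.
    fn add(&mut self, entry_size: u64) {
        self.sizes.push(entry_size);
        self.bytes_total.add(entry_size);
        self.records_total.add(1);
    }
}
```

Keeping the counters as fields means the per-entry cost is two increments, with no instrument lookup or construction on each call.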
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 88323138-4538-41c4-bfd5-2e420b819937

📥 Commits

Reviewing files that changed from the base of the PR and between d2f8e5e and 6b1fc22.

📒 Files selected for processing (2)
  • components/log-ingestor/src/bin/log_ingestor.rs
  • components/log-ingestor/src/compression/buffer.rs

Comment thread components/log-ingestor/src/bin/log_ingestor.rs
Comment thread components/log-ingestor/src/compression/buffer.rs Outdated
Comment thread components/log-ingestor/src/compression/buffer.rs Outdated
Comment thread components/log-ingestor/src/compression/buffer.rs Outdated
@coderabbitai coderabbitai Bot left a comment

♻️ Duplicate comments (1)
components/clp-rust-utils/src/telemetry.rs (1)

21-21: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Restore caller-provided service.name on the meter provider.

Resource::default() makes service identity depend on env/default fallback, so api-server and log-ingestor can both end up as unknown_service:* when OTEL_SERVICE_NAME is unset. Please thread service_name: &str through init_telemetry and build the Resource from that instead of using the default.

Based on learnings: In components/clp-rust-utils/src/telemetry.rs, init_telemetry is intended to take service_name: &str from callers and use it to populate the service.name Resource attribute on the SdkMeterProvider; it should not hardcode the service name inside the utility.

Also applies to: 36-39

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/clp-rust-utils/src/telemetry.rs` at line 21, The init_telemetry
function currently uses Resource::default(), which lets OTEL fallbacks produce
unknown_service:*; change init_telemetry(telemetry_config: &Telemetry) -> ... to
accept an extra service_name: &str parameter and construct the Resource
explicitly from that service_name (e.g.,
Resource::new(vec![KeyValue::new("service.name", service_name)]) or equivalent)
before building the SdkMeterProvider; update any callers to pass their concrete
service name so SdkMeterProvider contains the caller-provided service.name
instead of relying on env/default fallback.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: ae23d7ed-7b45-46b8-b671-949a8fe800c8

📥 Commits

Reviewing files that changed from the base of the PR and between d2f8e5e and d0b9ac1.

📒 Files selected for processing (4)
  • components/api-server/src/bin/api_server.rs
  • components/clp-rust-utils/src/telemetry.rs
  • components/log-ingestor/src/bin/log_ingestor.rs
  • components/log-ingestor/src/compression/buffer.rs

@LinZhihao-723 LinZhihao-723 self-requested a review May 19, 2026 20:54
Copilot AI left a comment

Pull request overview

Adds OpenTelemetry metrics support to CLP’s Rust services by introducing a shared telemetry initializer in clp-rust-utils, wiring it into api-server and log-ingestor, and emitting basic counters for service startup and ingestion volume.

Changes:

  • Introduce clp-rust-utils::telemetry with init_telemetry(), TelemetryGuard, and shutdown_telemetry() for OTLP metrics export.
  • Emit metrics:
    • api-server: clp.service.event counter on startup.
    • log-ingestor: clp.ingest.bytes_total / clp.ingest.records_total counters per ingested object in Buffer::add.
  • Update Rust telemetry disabling behavior to check env var values (vs. mere existence) and propagate exporter build failures as errors.

Reviewed changes

Copilot reviewed 10 out of 11 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| components/log-ingestor/src/compression/buffer.rs | Adds OpenTelemetry counters to ingestion buffering to track bytes/records ingested. |
| components/log-ingestor/src/bin/log_ingestor.rs | Initializes telemetry at startup and ensures telemetry guard lifetime covers server execution. |
| components/log-ingestor/Cargo.toml | Adds opentelemetry dependency for metrics instrumentation. |
| components/clp-rust-utils/src/telemetry.rs | New shared telemetry initialization/shutdown module for OTLP metric export. |
| components/clp-rust-utils/src/lib.rs | Exposes the new telemetry module. |
| components/clp-rust-utils/src/error.rs | Adds a dedicated error variant for OTLP exporter build failures. |
| components/clp-rust-utils/Cargo.toml | Adds OpenTelemetry SDK + OTLP exporter dependencies. |
| components/api-server/src/error.rs | Maps the new telemetry exporter build error into ClientError. |
| components/api-server/src/bin/api_server.rs | Initializes telemetry and emits a startup event metric. |
| components/api-server/Cargo.toml | Adds opentelemetry dependency for metrics instrumentation. |
| Cargo.lock | Locks new OpenTelemetry deps and updates transitive crate versions. |

Comment thread components/clp-rust-utils/src/telemetry.rs Outdated
Comment thread components/clp-rust-utils/src/telemetry.rs
Comment thread components/clp-rust-utils/Cargo.toml Outdated