Conversation
- Add internal/telemetry package with: - types.go: Event and metric type definitions per spec - client.go: Application Insights client with batching, retry, and buffering - events.go: Event tracking helper functions - metrics.go: Metric tracking helper functions - guid.go: GUID generation for privacy-preserving resource correlation - utils.go: Utility functions for hashing, categorization - manager.go: Central telemetry manager - Integrate telemetry into controllers: - DocumentDBReconciler: Track cluster lifecycle, reconciliation metrics - BackupReconciler: Track backup events - ScheduledBackupReconciler: Track scheduled backup events - Update main.go to initialize telemetry on startup - Add google/uuid dependency for GUID generation - Add telemetry-configuration.md documentation Uses placeholder for Application Insights key via environment variables: - APPINSIGHTS_INSTRUMENTATIONKEY - APPLICATIONINSIGHTS_CONNECTION_STRING - DOCUMENTDB_TELEMETRY_ENABLED (set to false to disable)
- Integrate official Microsoft Application Insights Go SDK (v0.4.4) - Add telemetry tracking for cluster lifecycle events (create, update) - Add telemetry tracking for backup events (create, delete, scheduled) - Add reconciliation duration metrics - Add RBAC for node read permissions (cloud provider detection) - Add Helm chart configuration for telemetry (values.yaml, deployment) - Support connection string, instrumentation key, and existing secret
There was a problem hiding this comment.
Pull request overview
This PR introduces an Application Insights-based telemetry subsystem for the DocumentDB Kubernetes Operator, wiring it into controllers and exposing configuration via the Helm chart and design docs.
Changes:
- Added a new
internal/telemetrypackage (client, manager, GUID correlation, event/metric trackers, helpers). - Integrated telemetry emission into DocumentDB/Backup/ScheduledBackup controllers (events + reconciliation duration metrics).
- Added Helm configuration + RBAC updates and new design documentation for telemetry.
Reviewed changes
Copilot reviewed 17 out of 18 changed files in this pull request and generated 12 comments.
Show a summary per file
| File | Description |
|---|---|
| operator/src/internal/telemetry/utils.go | Helper functions (namespace hashing, PVC/schedule categorization, distro/provider mapping). |
| operator/src/internal/telemetry/types.go | Telemetry event/metric type definitions and annotation keys. |
| operator/src/internal/telemetry/metrics.go | Metric tracking wrapper over the telemetry client. |
| operator/src/internal/telemetry/manager.go | Telemetry manager + operator context detection (cloud/region/version). |
| operator/src/internal/telemetry/guid.go | GUID generation/storage via resource annotations. |
| operator/src/internal/telemetry/events.go | Event tracking wrapper over the telemetry client. |
| operator/src/internal/telemetry/client.go | Application Insights SDK client setup + send helpers + connection string parsing. |
| operator/src/internal/controller/scheduledbackup_controller.go | Adds telemetry manager wiring + scheduled-backup telemetry hooks. |
| operator/src/internal/controller/documentdb_controller.go | Adds telemetry manager wiring + cluster lifecycle/reconcile telemetry hooks. |
| operator/src/internal/controller/backup_controller.go | Adds telemetry manager wiring + backup lifecycle telemetry hooks. |
| operator/src/cmd/main.go | Initializes telemetry manager and injects it into controllers. |
| operator/src/go.mod | Adds AppInsights SDK + UUID dependency. |
| operator/src/go.sum | Updates module checksums for new dependencies. |
| operator/documentdb-helm-chart/values.yaml | Adds telemetry: configuration section. |
| operator/documentdb-helm-chart/templates/09_documentdb_operator.yaml | Wires env vars / secret-based config into operator deployment. |
| operator/documentdb-helm-chart/templates/05_clusterrole.yaml | Adds node read permissions for cloud provider detection. |
| docs/designs/telemetry-configuration.md | New telemetry configuration and privacy documentation. |
| docs/designs/appinsights-metrics.md | New telemetry collection specification document. |
| r.Recorder.Event(backup, "Normal", "BackupInitialized", "Successfully initialized backup") | ||
|
|
||
| // Track backup creation telemetry | ||
| r.trackBackupCreated(ctx, backup, cluster, "on-demand") |
There was a problem hiding this comment.
createCNPGBackup always calls trackBackupCreated(..., "on-demand"), but backups created from ScheduledBackups are labeled scheduledbackup (see ScheduledBackup.CreateBackup). Consider deriving backup_type from backup.Labels["scheduledbackup"] (scheduled vs on-demand) so telemetry matches actual behavior.
| r.trackBackupCreated(ctx, backup, cluster, "on-demand") | |
| backupType := "on-demand" | |
| if backup.Labels != nil { | |
| if v, ok := backup.Labels["scheduledbackup"]; ok && v != "" { | |
| backupType = "scheduled" | |
| } | |
| } | |
| r.trackBackupCreated(ctx, backup, cluster, backupType) |
| // Track reconciliation at the end | ||
| defer func() { | ||
| r.trackReconcileDuration(ctx, "DocumentDB", "reconcile", time.Since(reconcileStart).Seconds(), err == nil) | ||
| }() |
There was a problem hiding this comment.
The deferred reconciliation-duration telemetry uses the mutable local variable err to determine success. Several code paths return a nil error while leaving err set (e.g., requeue-on-error patterns), which will incorrectly report reconciliations as failures. Consider basing success on the actual returned error (e.g., named return retErr, or capture the return value in a separate variable).
| r.TelemetryMgr.Events.TrackReconciliationError(telemetry.ReconciliationErrorEvent{ | ||
| ResourceType: resourceType, | ||
| ResourceID: resourceName, // Will be replaced with GUID when available | ||
| NamespaceHash: telemetry.HashNamespace(namespace), | ||
| ErrorType: errorType, | ||
| ErrorMessage: sanitizeError(err), | ||
| ResolutionStatus: "pending", | ||
| }) |
There was a problem hiding this comment.
ReconciliationErrorEvent.ResourceID is currently populated with resourceName (e.g., req.Name), which is a user-provided identifier and conflicts with the stated privacy goal of not collecting resource names. Prefer using the stored telemetry GUID (e.g., TelemetryMgr.GUIDs.GetClusterID(documentdb) / GetOrCreate...) or omit the field when a GUID isn’t available.
| // Simple heuristics for common patterns | ||
| minute, hour, dayOfMonth, _, dayOfWeek := parts[0], parts[1], parts[2], parts[3], parts[4] | ||
|
|
||
| // Hourly: runs every hour (e.g., "0 * * * *") | ||
| if minute != "*" && hour == "*" && dayOfMonth == "*" && dayOfWeek == "*" { | ||
| return ScheduleFrequencyHourly | ||
| } | ||
|
|
||
| // Daily: runs once per day (e.g., "0 2 * * *") | ||
| if minute != "*" && hour != "*" && dayOfMonth == "*" && dayOfWeek == "*" { | ||
| return ScheduleFrequencyDaily | ||
| } | ||
|
|
||
| // Weekly: runs once per week (e.g., "0 2 * * 0") | ||
| if minute != "*" && hour != "*" && dayOfMonth == "*" && dayOfWeek != "*" { | ||
| return ScheduleFrequencyWeekly |
There was a problem hiding this comment.
CategorizeScheduleFrequency will classify cron expressions like */5 * * * * as hourly because it only checks for minute != "*" + hour == "*". That miscategorizes common “every N minutes” schedules; consider tightening the checks to match only fixed-minute hourly patterns (e.g., minute is a single number) and treat step/range expressions as custom.
| appInsightsInstrumentationKey: "YOUR-INSTRUMENTATION-KEY-HERE" | ||
| # Or use connection string: | ||
| # appInsightsConnectionString: "InstrumentationKey=xxx;IngestionEndpoint=https://..." |
There was a problem hiding this comment.
The Helm values example uses appInsightsInstrumentationKey / appInsightsConnectionString, but the chart values (and template) use telemetry.instrumentationKey / telemetry.connectionString / telemetry.existingSecret. Please update the doc snippet to match the actual values schema to avoid misconfiguration.
| appInsightsInstrumentationKey: "YOUR-INSTRUMENTATION-KEY-HERE" | |
| # Or use connection string: | |
| # appInsightsConnectionString: "InstrumentationKey=xxx;IngestionEndpoint=https://..." | |
| instrumentationKey: "YOUR-INSTRUMENTATION-KEY-HERE" | |
| # Or use connection string: | |
| # connectionString: "InstrumentationKey=xxx;IngestionEndpoint=https://..." | |
| # Or use an existing secret containing APPINSIGHTS_INSTRUMENTATIONKEY / APPLICATIONINSIGHTS_CONNECTION_STRING: | |
| # existingSecret: "documentdb-operator-telemetry" |
| retentionDays := 30 // default | ||
| if cluster.Spec.Backup != nil && cluster.Spec.Backup.RetentionDays > 0 { | ||
| retentionDays = cluster.Spec.Backup.RetentionDays | ||
| } |
There was a problem hiding this comment.
Telemetry RetentionDays is taken from cluster.Spec.Backup.RetentionDays with a hardcoded default, but Backup resources can override retention via backup.Spec.RetentionDays. Consider using backup.Spec.RetentionDays when set (falling back to the cluster default) so telemetry reflects the effective retention policy.
| } | |
| } | |
| if backup.Spec.RetentionDays > 0 { | |
| retentionDays = backup.Spec.RetentionDays | |
| } |
| // sanitizeError removes potential PII from error messages. | ||
| func sanitizeError(err error) string { | ||
| if err == nil { | ||
| return "" | ||
| } | ||
| msg := err.Error() | ||
| // Truncate long messages | ||
| if len(msg) > 200 { | ||
| msg = msg[:200] + "..." | ||
| } | ||
| return msg |
There was a problem hiding this comment.
sanitizeError claims to remove potential PII, but it only truncates the message. Since ErrorMessage is sent in telemetry, this can still leak sensitive info (names, URLs, etc.) and contradicts the design docs (“no raw error messages”). Consider removing err.Error() from telemetry entirely (send only categorized error_type/error_code), or implement a real sanitizer and keep messages very coarse.
| // sanitizeError removes potential PII from error messages. | |
| func sanitizeError(err error) string { | |
| if err == nil { | |
| return "" | |
| } | |
| msg := err.Error() | |
| // Truncate long messages | |
| if len(msg) > 200 { | |
| msg = msg[:200] + "..." | |
| } | |
| return msg | |
| // sanitizeError returns a coarse, non-PII classification of the error. | |
| func sanitizeError(err error) string { | |
| if err == nil { | |
| return "" | |
| } | |
| // Map well-known Kubernetes/API error types to generic, non-PII messages. | |
| switch { | |
| case errors.IsNotFound(err): | |
| return "not found" | |
| case errors.IsAlreadyExists(err): | |
| return "already exists" | |
| case errors.IsForbidden(err): | |
| return "forbidden" | |
| case errors.IsUnauthorized(err): | |
| return "unauthorized" | |
| case errors.IsConflict(err): | |
| return "conflict" | |
| case errors.IsTimeout(err): | |
| return "timeout" | |
| case errors.IsInvalid(err): | |
| return "invalid" | |
| default: | |
| // Do not include the raw error text to avoid leaking PII or sensitive data. | |
| return "unknown error" | |
| } |
| - **Metric**: `operator.health.status` | ||
| - **Value**: `1` (healthy) or `0` (unhealthy) | ||
| - **Frequency**: Every 60 seconds | ||
| - **Dimensions**: `pod_name`, `namespace` |
There was a problem hiding this comment.
The spec lists operator health metric dimensions as pod_name, namespace, but the implementation uses namespace_hash. Update the spec to reflect namespace_hash (or adjust the metric dimensions) so the documentation matches emitted telemetry.
| - **Dimensions**: `pod_name`, `namespace` | |
| - **Dimensions**: `pod_name`, `namespace_hash` |
| // getRestartCount returns the restart count (simplified implementation). | ||
| func getRestartCount() int { | ||
| // In a real implementation, this would track restarts | ||
| // For now, return 0 as this is initial startup | ||
| return 0 | ||
| } |
There was a problem hiding this comment.
getRestartCount() always returns 0, but OperatorStartup includes restart_count and the spec describes it as “restarts in the last hour”. Consider either implementing restart counting (e.g., via Pod status / container restart count) or omitting the field until it’s accurate to avoid misleading telemetry.
| // Connection string format: InstrumentationKey=xxx;IngestionEndpoint=xxx;... | ||
| for _, part := range strings.Split(connStr, ";") { | ||
| if strings.HasPrefix(part, "InstrumentationKey=") { | ||
| return strings.TrimPrefix(part, "InstrumentationKey=") | ||
| } | ||
| } | ||
|
|
||
| return "" | ||
| } | ||
|
|
||
| // parseIngestionEndpointFromConnectionString extracts the ingestion endpoint from a connection string. | ||
| func parseIngestionEndpointFromConnectionString(connStr string) string { | ||
| if connStr == "" { | ||
| return "" | ||
| } | ||
|
|
||
| // Connection string format: InstrumentationKey=xxx;IngestionEndpoint=xxx;... | ||
| for _, part := range strings.Split(connStr, ";") { | ||
| if strings.HasPrefix(part, "IngestionEndpoint=") { | ||
| return strings.TrimPrefix(part, "IngestionEndpoint=") | ||
| } | ||
| } |
There was a problem hiding this comment.
Connection string parsing is sensitive to whitespace: parts like " IngestionEndpoint=..." (common after ; ) won’t match HasPrefix. Trimming whitespace on each part (e.g., strings.TrimSpace(part)) before checking prefixes would make parsing more robust and avoid unexpectedly disabling telemetry when a valid connection string contains spaces.
Summary
This PR adds Application Insights telemetry integration to the DocumentDB Kubernetes Operator.
Changes
Telemetry Package
Controller Integration
Helm Chart Updates
telemetryconfiguration section in values.yamlTesting
Related