Appinsights implementation by Ritvik-Jayaswal · Pull Request #237 · documentdb/documentdb-kubernetes-operator

Ritvik-Jayaswal · 2026-02-12T18:10:24Z

Summary

This PR adds Application Insights telemetry integration to the DocumentDB Kubernetes Operator.

Changes

Telemetry Package

Integrated official Microsoft Application Insights Go SDK (v0.4.4)
Rewrote telemetry client to use SDK with automatic batching (30s intervals, 100 items max)
Cloud provider detection from node labels/providerID

Controller Integration

DocumentDB Controller: Track ClusterCreated and ClusterUpdated events
Backup Controller: Track BackupCreated and BackupDeleted events
ScheduledBackup Controller: Track ScheduledBackupCreated and triggered events
Reconciliation duration metrics for all controllers

Helm Chart Updates

Added telemetry configuration section in values.yaml
Support for connection string, instrumentation key, or existing secret
Added POD_NAMESPACE environment variable for namespace hashing
Added node read RBAC permissions for cloud provider detection

Testing

Tested on Kind cluster and AKS (K8s v1.32.10)
Verified events transmitting to Application Insights (HTTP 200)
Confirmed cloud_provider detection works with proper RBAC

- Add internal/telemetry package with: - types.go: Event and metric type definitions per spec - client.go: Application Insights client with batching, retry, and buffering - events.go: Event tracking helper functions - metrics.go: Metric tracking helper functions - guid.go: GUID generation for privacy-preserving resource correlation - utils.go: Utility functions for hashing, categorization - manager.go: Central telemetry manager - Integrate telemetry into controllers: - DocumentDBReconciler: Track cluster lifecycle, reconciliation metrics - BackupReconciler: Track backup events - ScheduledBackupReconciler: Track scheduled backup events - Update main.go to initialize telemetry on startup - Add google/uuid dependency for GUID generation - Add telemetry-configuration.md documentation Uses placeholder for Application Insights key via environment variables: - APPINSIGHTS_INSTRUMENTATIONKEY - APPLICATIONINSIGHTS_CONNECTION_STRING - DOCUMENTDB_TELEMETRY_ENABLED (set to false to disable)

- Integrate official Microsoft Application Insights Go SDK (v0.4.4) - Add telemetry tracking for cluster lifecycle events (create, update) - Add telemetry tracking for backup events (create, delete, scheduled) - Add reconciliation duration metrics - Add RBAC for node read permissions (cloud provider detection) - Add Helm chart configuration for telemetry (values.yaml, deployment) - Support connection string, instrumentation key, and existing secret

Copilot

Pull request overview

This PR introduces an Application Insights-based telemetry subsystem for the DocumentDB Kubernetes Operator, wiring it into controllers and exposing configuration via the Helm chart and design docs.

Changes:

Added a new internal/telemetry package (client, manager, GUID correlation, event/metric trackers, helpers).
Integrated telemetry emission into DocumentDB/Backup/ScheduledBackup controllers (events + reconciliation duration metrics).
Added Helm configuration + RBAC updates and new design documentation for telemetry.

Reviewed changes

Copilot reviewed 17 out of 18 changed files in this pull request and generated 12 comments.

Show a summary per file

File	Description
operator/src/internal/telemetry/utils.go	Helper functions (namespace hashing, PVC/schedule categorization, distro/provider mapping).
operator/src/internal/telemetry/types.go	Telemetry event/metric type definitions and annotation keys.
operator/src/internal/telemetry/metrics.go	Metric tracking wrapper over the telemetry client.
operator/src/internal/telemetry/manager.go	Telemetry manager + operator context detection (cloud/region/version).
operator/src/internal/telemetry/guid.go	GUID generation/storage via resource annotations.
operator/src/internal/telemetry/events.go	Event tracking wrapper over the telemetry client.
operator/src/internal/telemetry/client.go	Application Insights SDK client setup + send helpers + connection string parsing.
operator/src/internal/controller/scheduledbackup_controller.go	Adds telemetry manager wiring + scheduled-backup telemetry hooks.
operator/src/internal/controller/documentdb_controller.go	Adds telemetry manager wiring + cluster lifecycle/reconcile telemetry hooks.
operator/src/internal/controller/backup_controller.go	Adds telemetry manager wiring + backup lifecycle telemetry hooks.
operator/src/cmd/main.go	Initializes telemetry manager and injects it into controllers.
operator/src/go.mod	Adds AppInsights SDK + UUID dependency.
operator/src/go.sum	Updates module checksums for new dependencies.
operator/documentdb-helm-chart/values.yaml	Adds `telemetry:` configuration section.
operator/documentdb-helm-chart/templates/09_documentdb_operator.yaml	Wires env vars / secret-based config into operator deployment.
operator/documentdb-helm-chart/templates/05_clusterrole.yaml	Adds node read permissions for cloud provider detection.
docs/designs/telemetry-configuration.md	New telemetry configuration and privacy documentation.
docs/designs/appinsights-metrics.md	New telemetry collection specification document.

Copilot · 2026-02-12T18:20:57Z

operator/src/internal/controller/backup_controller.go

 	r.Recorder.Event(backup, "Normal", "BackupInitialized", "Successfully initialized backup")

+	// Track backup creation telemetry
+	r.trackBackupCreated(ctx, backup, cluster, "on-demand")


createCNPGBackup always calls trackBackupCreated(..., "on-demand"), but backups created from ScheduledBackups are labeled scheduledbackup (see ScheduledBackup.CreateBackup). Consider deriving backup_type from backup.Labels["scheduledbackup"] (scheduled vs on-demand) so telemetry matches actual behavior.

Suggested change

r.trackBackupCreated(ctx, backup, cluster, "on-demand")

backupType := "on-demand"

if backup.Labels != nil {

if v, ok := backup.Labels["scheduledbackup"]; ok && v != "" {

backupType = "scheduled"

}

}

r.trackBackupCreated(ctx, backup, cluster, backupType)

Copilot · 2026-02-12T18:20:57Z

operator/src/internal/controller/documentdb_controller.go

+	// Track reconciliation at the end
+	defer func() {
+		r.trackReconcileDuration(ctx, "DocumentDB", "reconcile", time.Since(reconcileStart).Seconds(), err == nil)
+	}()


The deferred reconciliation-duration telemetry uses the mutable local variable err to determine success. Several code paths return a nil error while leaving err set (e.g., requeue-on-error patterns), which will incorrectly report reconciliations as failures. Consider basing success on the actual returned error (e.g., named return retErr, or capture the return value in a separate variable).

Copilot · 2026-02-12T18:20:57Z

operator/src/internal/controller/documentdb_controller.go

+	r.TelemetryMgr.Events.TrackReconciliationError(telemetry.ReconciliationErrorEvent{
+		ResourceType:     resourceType,
+		ResourceID:       resourceName, // Will be replaced with GUID when available
+		NamespaceHash:    telemetry.HashNamespace(namespace),
+		ErrorType:        errorType,
+		ErrorMessage:     sanitizeError(err),
+		ResolutionStatus: "pending",
+	})


ReconciliationErrorEvent.ResourceID is currently populated with resourceName (e.g., req.Name), which is a user-provided identifier and conflicts with the stated privacy goal of not collecting resource names. Prefer using the stored telemetry GUID (e.g., TelemetryMgr.GUIDs.GetClusterID(documentdb) / GetOrCreate...) or omit the field when a GUID isn’t available.

Copilot · 2026-02-12T18:20:58Z

operator/src/internal/telemetry/utils.go

+	// Simple heuristics for common patterns
+	minute, hour, dayOfMonth, _, dayOfWeek := parts[0], parts[1], parts[2], parts[3], parts[4]
+
+	// Hourly: runs every hour (e.g., "0 * * * *")
+	if minute != "*" && hour == "*" && dayOfMonth == "*" && dayOfWeek == "*" {
+		return ScheduleFrequencyHourly
+	}
+
+	// Daily: runs once per day (e.g., "0 2 * * *")
+	if minute != "*" && hour != "*" && dayOfMonth == "*" && dayOfWeek == "*" {
+		return ScheduleFrequencyDaily
+	}
+
+	// Weekly: runs once per week (e.g., "0 2 * * 0")
+	if minute != "*" && hour != "*" && dayOfMonth == "*" && dayOfWeek != "*" {
+		return ScheduleFrequencyWeekly


CategorizeScheduleFrequency will classify cron expressions like */5 * * * * as hourly because it only checks for minute != "*" + hour == "*". That miscategorizes common “every N minutes” schedules; consider tightening the checks to match only fixed-minute hourly patterns (e.g., minute is a single number) and treat step/range expressions as custom.

Copilot · 2026-02-12T18:20:58Z

docs/designs/telemetry-configuration.md

+  appInsightsInstrumentationKey: "YOUR-INSTRUMENTATION-KEY-HERE"
+  # Or use connection string:
+  # appInsightsConnectionString: "InstrumentationKey=xxx;IngestionEndpoint=https://..."


The Helm values example uses appInsightsInstrumentationKey / appInsightsConnectionString, but the chart values (and template) use telemetry.instrumentationKey / telemetry.connectionString / telemetry.existingSecret. Please update the doc snippet to match the actual values schema to avoid misconfiguration.

Suggested change

appInsightsInstrumentationKey: "YOUR-INSTRUMENTATION-KEY-HERE"

# Or use connection string:

# appInsightsConnectionString: "InstrumentationKey=xxx;IngestionEndpoint=https://..."

instrumentationKey: "YOUR-INSTRUMENTATION-KEY-HERE"

# Or use connection string:

# connectionString: "InstrumentationKey=xxx;IngestionEndpoint=https://..."

# Or use an existing secret containing APPINSIGHTS_INSTRUMENTATIONKEY / APPLICATIONINSIGHTS_CONNECTION_STRING:

# existingSecret: "documentdb-operator-telemetry"

Copilot · 2026-02-12T18:20:59Z

operator/src/internal/controller/backup_controller.go

+	retentionDays := 30 // default
+	if cluster.Spec.Backup != nil && cluster.Spec.Backup.RetentionDays > 0 {
+		retentionDays = cluster.Spec.Backup.RetentionDays
+	}


Telemetry RetentionDays is taken from cluster.Spec.Backup.RetentionDays with a hardcoded default, but Backup resources can override retention via backup.Spec.RetentionDays. Consider using backup.Spec.RetentionDays when set (falling back to the cluster default) so telemetry reflects the effective retention policy.

Suggested change

}

}

if backup.Spec.RetentionDays > 0 {

retentionDays = backup.Spec.RetentionDays

}

Copilot · 2026-02-12T18:20:59Z

operator/src/internal/controller/documentdb_controller.go

+// sanitizeError removes potential PII from error messages.
+func sanitizeError(err error) string {
+	if err == nil {
+		return ""
+	}
+	msg := err.Error()
+	// Truncate long messages
+	if len(msg) > 200 {
+		msg = msg[:200] + "..."
+	}
+	return msg


sanitizeError claims to remove potential PII, but it only truncates the message. Since ErrorMessage is sent in telemetry, this can still leak sensitive info (names, URLs, etc.) and contradicts the design docs (“no raw error messages”). Consider removing err.Error() from telemetry entirely (send only categorized error_type/error_code), or implement a real sanitizer and keep messages very coarse.

Suggested change

// sanitizeError removes potential PII from error messages.

func sanitizeError(err error) string {

if err == nil {

return ""

}

msg := err.Error()

// Truncate long messages

if len(msg) > 200 {

msg = msg[:200] + "..."

}

return msg

// sanitizeError returns a coarse, non-PII classification of the error.

func sanitizeError(err error) string {

if err == nil {

return ""

}

// Map well-known Kubernetes/API error types to generic, non-PII messages.

switch {

case errors.IsNotFound(err):

return "not found"

case errors.IsAlreadyExists(err):

return "already exists"

case errors.IsForbidden(err):

return "forbidden"

case errors.IsUnauthorized(err):

return "unauthorized"

case errors.IsConflict(err):

return "conflict"

case errors.IsTimeout(err):

return "timeout"

case errors.IsInvalid(err):

return "invalid"

default:

// Do not include the raw error text to avoid leaking PII or sensitive data.

return "unknown error"

}

Copilot · 2026-02-12T18:20:59Z

docs/designs/appinsights-metrics.md

+- **Metric**: `operator.health.status`
+- **Value**: `1` (healthy) or `0` (unhealthy)
+- **Frequency**: Every 60 seconds
+- **Dimensions**: `pod_name`, `namespace`


The spec lists operator health metric dimensions as pod_name, namespace, but the implementation uses namespace_hash. Update the spec to reflect namespace_hash (or adjust the metric dimensions) so the documentation matches emitted telemetry.

Suggested change

- **Dimensions**: `pod_name`, `namespace`

- **Dimensions**: `pod_name`, `namespace_hash`

Copilot · 2026-02-12T18:21:00Z

operator/src/internal/telemetry/manager.go

+// getRestartCount returns the restart count (simplified implementation).
+func getRestartCount() int {
+	// In a real implementation, this would track restarts
+	// For now, return 0 as this is initial startup
+	return 0
+}


getRestartCount() always returns 0, but OperatorStartup includes restart_count and the spec describes it as “restarts in the last hour”. Consider either implementing restart counting (e.g., via Pod status / container restart count) or omitting the field until it’s accurate to avoid misleading telemetry.

Copilot · 2026-02-12T18:21:00Z

operator/src/internal/telemetry/client.go

+	// Connection string format: InstrumentationKey=xxx;IngestionEndpoint=xxx;...
+	for _, part := range strings.Split(connStr, ";") {
+		if strings.HasPrefix(part, "InstrumentationKey=") {
+			return strings.TrimPrefix(part, "InstrumentationKey=")
+		}
+	}
+
+	return ""
+}
+
+// parseIngestionEndpointFromConnectionString extracts the ingestion endpoint from a connection string.
+func parseIngestionEndpointFromConnectionString(connStr string) string {
+	if connStr == "" {
+		return ""
+	}
+
+	// Connection string format: InstrumentationKey=xxx;IngestionEndpoint=xxx;...
+	for _, part := range strings.Split(connStr, ";") {
+		if strings.HasPrefix(part, "IngestionEndpoint=") {
+			return strings.TrimPrefix(part, "IngestionEndpoint=")
+		}
+	}


Connection string parsing is sensitive to whitespace: parts like " IngestionEndpoint=..." (common after ; ) won’t match HasPrefix. Trimming whitespace on each part (e.g., strings.TrimSpace(part)) before checking prefixes would make parsing more robust and avoid unexpectedly disabling telemetry when a valid connection string contains spaces.

rjayaswal added 4 commits January 8, 2026 13:06

added Appinsights doc

db6ad55

dixed PII

0165f5c

Ritvik-Jayaswal requested review from alaye-ms and xgerman as code owners February 12, 2026 18:10

Copilot AI review requested due to automatic review settings February 12, 2026 18:10

Ritvik-Jayaswal requested a review from hossain-rayhan as a code owner February 12, 2026 18:10

Copilot started reviewing on behalf of Ritvik-Jayaswal February 12, 2026 18:10 View session

Copilot AI reviewed Feb 12, 2026

View reviewed changes

Ritvik-Jayaswal marked this pull request as draft February 13, 2026 18:29

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Appinsights implementation#237

Appinsights implementation#237
Ritvik-Jayaswal wants to merge 4 commits intodocumentdb:mainfrom
Ritvik-Jayaswal:users/rjayaswal/appinsights-implementation

Ritvik-Jayaswal commented Feb 12, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Feb 12, 2026

Uh oh!

Copilot AI Feb 12, 2026

Uh oh!

Copilot AI Feb 12, 2026

Uh oh!

Copilot AI Feb 12, 2026

Uh oh!

Copilot AI Feb 12, 2026

Uh oh!

Copilot AI Feb 12, 2026

Uh oh!

Copilot AI Feb 12, 2026

Uh oh!

Copilot AI Feb 12, 2026

Uh oh!

Copilot AI Feb 12, 2026

Uh oh!

Copilot AI Feb 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

-	r.trackBackupCreated(ctx, backup, cluster, "on-demand")
+	backupType := "on-demand"
+	if backup.Labels != nil {
+		if v, ok := backup.Labels["scheduledbackup"]; ok && v != "" {
+			backupType = "scheduled"
+		}
+	}
+	r.trackBackupCreated(ctx, backup, cluster, backupType)

-	}
+	}
+	if backup.Spec.RetentionDays > 0 {
+		retentionDays = backup.Spec.RetentionDays
+	}

-// sanitizeError removes potential PII from error messages.
-func sanitizeError(err error) string {
-	if err == nil {
-		return ""
-	}
-	msg := err.Error()
-	// Truncate long messages
-	if len(msg) > 200 {
-		msg = msg[:200] + "..."
-	}
-	return msg
+// sanitizeError returns a coarse, non-PII classification of the error.
+func sanitizeError(err error) string {
+	if err == nil {
+		return ""
+	}
+	// Map well-known Kubernetes/API error types to generic, non-PII messages.
+	switch {
+	case errors.IsNotFound(err):
+		return "not found"
+	case errors.IsAlreadyExists(err):
+		return "already exists"
+	case errors.IsForbidden(err):
+		return "forbidden"
+	case errors.IsUnauthorized(err):
+		return "unauthorized"
+	case errors.IsConflict(err):
+		return "conflict"
+	case errors.IsTimeout(err):
+		return "timeout"
+	case errors.IsInvalid(err):
+		return "invalid"
+	default:
+		// Do not include the raw error text to avoid leaking PII or sensitive data.
+		return "unknown error"
+	}

	- Dimensions: `pod_name`, `namespace`
	- Dimensions: `pod_name`, `namespace_hash`

Conversation

Ritvik-Jayaswal commented Feb 12, 2026

Summary

Changes

Telemetry Package

Controller Integration

Helm Chart Updates

Testing

Related

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Feb 12, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 12, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 12, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 12, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 12, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 12, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 12, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 12, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 12, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 12, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants