Skip to content

Appinsights implementation#237

Draft
Ritvik-Jayaswal wants to merge 4 commits intodocumentdb:mainfrom
Ritvik-Jayaswal:users/rjayaswal/appinsights-implementation
Draft

Appinsights implementation#237
Ritvik-Jayaswal wants to merge 4 commits intodocumentdb:mainfrom
Ritvik-Jayaswal:users/rjayaswal/appinsights-implementation

Conversation

@Ritvik-Jayaswal
Copy link
Collaborator

Summary

This PR adds Application Insights telemetry integration to the DocumentDB Kubernetes Operator.

Changes

Telemetry Package

  • Integrated official Microsoft Application Insights Go SDK (v0.4.4)
  • Rewrote telemetry client to use SDK with automatic batching (30s intervals, 100 items max)
  • Cloud provider detection from node labels/providerID

Controller Integration

  • DocumentDB Controller: Track ClusterCreated and ClusterUpdated events
  • Backup Controller: Track BackupCreated and BackupDeleted events
  • ScheduledBackup Controller: Track ScheduledBackupCreated and triggered events
  • Reconciliation duration metrics for all controllers

Helm Chart Updates

  • Added telemetry configuration section in values.yaml
  • Support for connection string, instrumentation key, or existing secret
  • Added POD_NAMESPACE environment variable for namespace hashing
  • Added node read RBAC permissions for cloud provider detection

Testing

  • Tested on Kind cluster and AKS (K8s v1.32.10)
  • Verified events transmitting to Application Insights (HTTP 200)
  • Confirmed cloud_provider detection works with proper RBAC

Related

  • Spec document: docs/designs/appinsights-metrics.md

- Add internal/telemetry package with:
  - types.go: Event and metric type definitions per spec
  - client.go: Application Insights client with batching, retry, and buffering
  - events.go: Event tracking helper functions
  - metrics.go: Metric tracking helper functions
  - guid.go: GUID generation for privacy-preserving resource correlation
  - utils.go: Utility functions for hashing, categorization
  - manager.go: Central telemetry manager

- Integrate telemetry into controllers:
  - DocumentDBReconciler: Track cluster lifecycle, reconciliation metrics
  - BackupReconciler: Track backup events
  - ScheduledBackupReconciler: Track scheduled backup events

- Update main.go to initialize telemetry on startup

- Add google/uuid dependency for GUID generation

- Add telemetry-configuration.md documentation

Uses placeholder for Application Insights key via environment variables:
- APPINSIGHTS_INSTRUMENTATIONKEY
- APPLICATIONINSIGHTS_CONNECTION_STRING
- DOCUMENTDB_TELEMETRY_ENABLED (set to false to disable)
- Integrate official Microsoft Application Insights Go SDK (v0.4.4)
- Add telemetry tracking for cluster lifecycle events (create, update)
- Add telemetry tracking for backup events (create, delete, scheduled)
- Add reconciliation duration metrics
- Add RBAC for node read permissions (cloud provider detection)
- Add Helm chart configuration for telemetry (values.yaml, deployment)
- Support connection string, instrumentation key, and existing secret
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces an Application Insights-based telemetry subsystem for the DocumentDB Kubernetes Operator, wiring it into controllers and exposing configuration via the Helm chart and design docs.

Changes:

  • Added a new internal/telemetry package (client, manager, GUID correlation, event/metric trackers, helpers).
  • Integrated telemetry emission into DocumentDB/Backup/ScheduledBackup controllers (events + reconciliation duration metrics).
  • Added Helm configuration + RBAC updates and new design documentation for telemetry.

Reviewed changes

Copilot reviewed 17 out of 18 changed files in this pull request and generated 12 comments.

Show a summary per file
File Description
operator/src/internal/telemetry/utils.go Helper functions (namespace hashing, PVC/schedule categorization, distro/provider mapping).
operator/src/internal/telemetry/types.go Telemetry event/metric type definitions and annotation keys.
operator/src/internal/telemetry/metrics.go Metric tracking wrapper over the telemetry client.
operator/src/internal/telemetry/manager.go Telemetry manager + operator context detection (cloud/region/version).
operator/src/internal/telemetry/guid.go GUID generation/storage via resource annotations.
operator/src/internal/telemetry/events.go Event tracking wrapper over the telemetry client.
operator/src/internal/telemetry/client.go Application Insights SDK client setup + send helpers + connection string parsing.
operator/src/internal/controller/scheduledbackup_controller.go Adds telemetry manager wiring + scheduled-backup telemetry hooks.
operator/src/internal/controller/documentdb_controller.go Adds telemetry manager wiring + cluster lifecycle/reconcile telemetry hooks.
operator/src/internal/controller/backup_controller.go Adds telemetry manager wiring + backup lifecycle telemetry hooks.
operator/src/cmd/main.go Initializes telemetry manager and injects it into controllers.
operator/src/go.mod Adds AppInsights SDK + UUID dependency.
operator/src/go.sum Updates module checksums for new dependencies.
operator/documentdb-helm-chart/values.yaml Adds telemetry: configuration section.
operator/documentdb-helm-chart/templates/09_documentdb_operator.yaml Wires env vars / secret-based config into operator deployment.
operator/documentdb-helm-chart/templates/05_clusterrole.yaml Adds node read permissions for cloud provider detection.
docs/designs/telemetry-configuration.md New telemetry configuration and privacy documentation.
docs/designs/appinsights-metrics.md New telemetry collection specification document.

r.Recorder.Event(backup, "Normal", "BackupInitialized", "Successfully initialized backup")

// Track backup creation telemetry
r.trackBackupCreated(ctx, backup, cluster, "on-demand")
Copy link

Copilot AI Feb 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

createCNPGBackup always calls trackBackupCreated(..., "on-demand"), but backups created from ScheduledBackups are labeled scheduledbackup (see ScheduledBackup.CreateBackup). Consider deriving backup_type from backup.Labels["scheduledbackup"] (scheduled vs on-demand) so telemetry matches actual behavior.

Suggested change
r.trackBackupCreated(ctx, backup, cluster, "on-demand")
backupType := "on-demand"
if backup.Labels != nil {
if v, ok := backup.Labels["scheduledbackup"]; ok && v != "" {
backupType = "scheduled"
}
}
r.trackBackupCreated(ctx, backup, cluster, backupType)

Copilot uses AI. Check for mistakes.
Comment on lines +90 to +93
// Track reconciliation at the end
defer func() {
r.trackReconcileDuration(ctx, "DocumentDB", "reconcile", time.Since(reconcileStart).Seconds(), err == nil)
}()
Copy link

Copilot AI Feb 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The deferred reconciliation-duration telemetry uses the mutable local variable err to determine success. Several code paths return a nil error while leaving err set (e.g., requeue-on-error patterns), which will incorrectly report reconciliations as failures. Consider basing success on the actual returned error (e.g., named return retErr, or capture the return value in a separate variable).

Copilot uses AI. Check for mistakes.
Comment on lines +501 to +508
r.TelemetryMgr.Events.TrackReconciliationError(telemetry.ReconciliationErrorEvent{
ResourceType: resourceType,
ResourceID: resourceName, // Will be replaced with GUID when available
NamespaceHash: telemetry.HashNamespace(namespace),
ErrorType: errorType,
ErrorMessage: sanitizeError(err),
ResolutionStatus: "pending",
})
Copy link

Copilot AI Feb 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ReconciliationErrorEvent.ResourceID is currently populated with resourceName (e.g., req.Name), which is a user-provided identifier and conflicts with the stated privacy goal of not collecting resource names. Prefer using the stored telemetry GUID (e.g., TelemetryMgr.GUIDs.GetClusterID(documentdb) / GetOrCreate...) or omit the field when a GUID isn’t available.

Copilot uses AI. Check for mistakes.
Comment on lines +55 to +70
// Simple heuristics for common patterns
minute, hour, dayOfMonth, _, dayOfWeek := parts[0], parts[1], parts[2], parts[3], parts[4]

// Hourly: runs every hour (e.g., "0 * * * *")
if minute != "*" && hour == "*" && dayOfMonth == "*" && dayOfWeek == "*" {
return ScheduleFrequencyHourly
}

// Daily: runs once per day (e.g., "0 2 * * *")
if minute != "*" && hour != "*" && dayOfMonth == "*" && dayOfWeek == "*" {
return ScheduleFrequencyDaily
}

// Weekly: runs once per week (e.g., "0 2 * * 0")
if minute != "*" && hour != "*" && dayOfMonth == "*" && dayOfWeek != "*" {
return ScheduleFrequencyWeekly
Copy link

Copilot AI Feb 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CategorizeScheduleFrequency will classify cron expressions like */5 * * * * as hourly because it only checks for minute != "*" + hour == "*". That miscategorizes common “every N minutes” schedules; consider tightening the checks to match only fixed-minute hourly patterns (e.g., minute is a single number) and treat step/range expressions as custom.

Copilot uses AI. Check for mistakes.
Comment on lines +29 to +31
appInsightsInstrumentationKey: "YOUR-INSTRUMENTATION-KEY-HERE"
# Or use connection string:
# appInsightsConnectionString: "InstrumentationKey=xxx;IngestionEndpoint=https://..."
Copy link

Copilot AI Feb 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Helm values example uses appInsightsInstrumentationKey / appInsightsConnectionString, but the chart values (and template) use telemetry.instrumentationKey / telemetry.connectionString / telemetry.existingSecret. Please update the doc snippet to match the actual values schema to avoid misconfiguration.

Suggested change
appInsightsInstrumentationKey: "YOUR-INSTRUMENTATION-KEY-HERE"
# Or use connection string:
# appInsightsConnectionString: "InstrumentationKey=xxx;IngestionEndpoint=https://..."
instrumentationKey: "YOUR-INSTRUMENTATION-KEY-HERE"
# Or use connection string:
# connectionString: "InstrumentationKey=xxx;IngestionEndpoint=https://..."
# Or use an existing secret containing APPINSIGHTS_INSTRUMENTATIONKEY / APPLICATIONINSIGHTS_CONNECTION_STRING:
# existingSecret: "documentdb-operator-telemetry"

Copilot uses AI. Check for mistakes.
retentionDays := 30 // default
if cluster.Spec.Backup != nil && cluster.Spec.Backup.RetentionDays > 0 {
retentionDays = cluster.Spec.Backup.RetentionDays
}
Copy link

Copilot AI Feb 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Telemetry RetentionDays is taken from cluster.Spec.Backup.RetentionDays with a hardcoded default, but Backup resources can override retention via backup.Spec.RetentionDays. Consider using backup.Spec.RetentionDays when set (falling back to the cluster default) so telemetry reflects the effective retention policy.

Suggested change
}
}
if backup.Spec.RetentionDays > 0 {
retentionDays = backup.Spec.RetentionDays
}

Copilot uses AI. Check for mistakes.
Comment on lines +587 to +597
// sanitizeError removes potential PII from error messages.
func sanitizeError(err error) string {
if err == nil {
return ""
}
msg := err.Error()
// Truncate long messages
if len(msg) > 200 {
msg = msg[:200] + "..."
}
return msg
Copy link

Copilot AI Feb 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sanitizeError claims to remove potential PII, but it only truncates the message. Since ErrorMessage is sent in telemetry, this can still leak sensitive info (names, URLs, etc.) and contradicts the design docs (“no raw error messages”). Consider removing err.Error() from telemetry entirely (send only categorized error_type/error_code), or implement a real sanitizer and keep messages very coarse.

Suggested change
// sanitizeError removes potential PII from error messages.
func sanitizeError(err error) string {
if err == nil {
return ""
}
msg := err.Error()
// Truncate long messages
if len(msg) > 200 {
msg = msg[:200] + "..."
}
return msg
// sanitizeError returns a coarse, non-PII classification of the error.
func sanitizeError(err error) string {
if err == nil {
return ""
}
// Map well-known Kubernetes/API error types to generic, non-PII messages.
switch {
case errors.IsNotFound(err):
return "not found"
case errors.IsAlreadyExists(err):
return "already exists"
case errors.IsForbidden(err):
return "forbidden"
case errors.IsUnauthorized(err):
return "unauthorized"
case errors.IsConflict(err):
return "conflict"
case errors.IsTimeout(err):
return "timeout"
case errors.IsInvalid(err):
return "invalid"
default:
// Do not include the raw error text to avoid leaking PII or sensitive data.
return "unknown error"
}

Copilot uses AI. Check for mistakes.
- **Metric**: `operator.health.status`
- **Value**: `1` (healthy) or `0` (unhealthy)
- **Frequency**: Every 60 seconds
- **Dimensions**: `pod_name`, `namespace`
Copy link

Copilot AI Feb 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The spec lists operator health metric dimensions as pod_name, namespace, but the implementation uses namespace_hash. Update the spec to reflect namespace_hash (or adjust the metric dimensions) so the documentation matches emitted telemetry.

Suggested change
- **Dimensions**: `pod_name`, `namespace`
- **Dimensions**: `pod_name`, `namespace_hash`

Copilot uses AI. Check for mistakes.
Comment on lines +222 to +227
// getRestartCount returns the restart count (simplified implementation).
func getRestartCount() int {
// In a real implementation, this would track restarts
// For now, return 0 as this is initial startup
return 0
}
Copy link

Copilot AI Feb 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getRestartCount() always returns 0, but OperatorStartup includes restart_count and the spec describes it as “restarts in the last hour”. Consider either implementing restart counting (e.g., via Pod status / container restart count) or omitting the field until it’s accurate to avoid misleading telemetry.

Copilot uses AI. Check for mistakes.
Comment on lines +206 to +227
// Connection string format: InstrumentationKey=xxx;IngestionEndpoint=xxx;...
for _, part := range strings.Split(connStr, ";") {
if strings.HasPrefix(part, "InstrumentationKey=") {
return strings.TrimPrefix(part, "InstrumentationKey=")
}
}

return ""
}

// parseIngestionEndpointFromConnectionString extracts the ingestion endpoint from a connection string.
func parseIngestionEndpointFromConnectionString(connStr string) string {
if connStr == "" {
return ""
}

// Connection string format: InstrumentationKey=xxx;IngestionEndpoint=xxx;...
for _, part := range strings.Split(connStr, ";") {
if strings.HasPrefix(part, "IngestionEndpoint=") {
return strings.TrimPrefix(part, "IngestionEndpoint=")
}
}
Copy link

Copilot AI Feb 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Connection string parsing is sensitive to whitespace: parts like " IngestionEndpoint=..." (common after ; ) won’t match HasPrefix. Trimming whitespace on each part (e.g., strings.TrimSpace(part)) before checking prefixes would make parsing more robust and avoid unexpectedly disabling telemetry when a valid connection string contains spaces.

Copilot uses AI. Check for mistakes.
@Ritvik-Jayaswal Ritvik-Jayaswal marked this pull request as draft February 13, 2026 18:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants