📊 Monitoring

This guide shows you how to set up monitoring for the Secrets application using Prometheus and Grafana.

For a complete reference of all metrics, see the Metrics Reference.

Related guides:

Metrics Reference - Complete catalog of all metrics and Prometheus queries
Health Check Endpoints - Liveness and readiness probes for container orchestration
Incident Response - Troubleshooting production issues

Overview
Configuration
Quickstart (Prometheus + Grafana)
Prometheus Configuration
Grafana Dashboard
Alerting
Disabling Metrics
Troubleshooting
See Also

Overview

The Secrets application exposes Prometheus-compatible metrics at http://localhost:8081/metrics. This guide walks you through:

Configuring metrics collection
Setting up Prometheus to scrape metrics
Creating Grafana dashboards
Configuring alerts

For detailed information about available metrics, see the Metrics Reference.

The application uses OpenTelemetry for metrics instrumentation with a Prometheus-compatible export endpoint. Metrics can be enabled/disabled via configuration and cover two main areas:

HTTP Request Metrics - Request counts, durations, and status codes for all API endpoints
Business Operation Metrics - Domain-specific operation counters and durations (auth, secrets, transit, tokenization)

Health monitoring: For liveness and readiness probes, see Health Check Endpoints.

Configuration

Environment Variables

# Enable or disable metrics collection
METRICS_ENABLED=true  # default: true

# Namespace prefix for all metrics
METRICS_NAMESPACE=secrets  # default: secrets

Update your .env file:

# Metrics configuration
METRICS_ENABLED=true
METRICS_NAMESPACE=secrets

Quickstart (Prometheus + Grafana)

Use this minimal local stack to visualize Secrets metrics quickly:

Start Secrets with metrics enabled
Start Prometheus with a scrape config for http://host.docker.internal:8081/metrics
Open Grafana and create panels from Prometheus queries

Note: On Linux, replace host.docker.internal with the host IP reachable from your Docker network.

Minimal prometheus.yml:

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: "secrets-api"
    static_configs:
      - targets: ["host.docker.internal:8081"]
    metrics_path: "/metrics"

Quick run commands:

# Start Prometheus
docker run --rm -d --name prom \
  -p 9090:9090 \
  -v "$(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml:ro" \
  prom/prometheus

# Start Grafana
docker run --rm -d --name grafana \
  -p 3000:3000 \
  grafana/grafana

Suggested first panel query (requests/sec by route):

sum(rate(secrets_http_requests_total[5m])) by (method, path)

For more Prometheus queries, see the Prometheus Query Library in the Metrics Reference.

Prometheus Configuration

Scrape Configuration

Add the Secrets application to your prometheus.yml:

scrape_configs:
  - job_name: 'secrets-api'
    static_configs:
      - targets: ['localhost:8081']
    metrics_path: '/metrics'
    scrape_interval: 15s

Configuration notes:

Port 8081 - Metrics are exposed on a separate port from the main API (8080)
No authentication - The /metrics endpoint is public (standard Prometheus practice)
Scrape interval - 15-30 seconds is recommended for most deployments

For a complete list of available metrics, see the Metrics Reference.

Example Queries

This section provides commonly used queries for monitoring. For a comprehensive query library, see the Prometheus Query Library in the Metrics Reference.

Total requests per second (rate over 5 minutes):

rate(secrets_http_requests_total[5m])

95th percentile request latency:

histogram_quantile(0.95, rate(secrets_http_request_duration_seconds_bucket[5m]))

Error rate by domain:

rate(secrets_operations_total{status="error"}[5m]) / rate(secrets_operations_total[5m])

Slowest operations:

topk(5, rate(secrets_operation_duration_seconds_sum[5m]) / rate(secrets_operation_duration_seconds_count[5m]))

For more query examples including rate limiting, tokenization-specific queries, and SLO queries, see the Metrics Reference.

Rate Limiting and Security Queries

429 rate by route (5m):

sum(rate(secrets_http_requests_total{status_code="429"}[5m])) by (path)

429 ratio by route (5m):

sum(rate(secrets_http_requests_total{status_code="429"}[5m])) by (path)
/
sum(rate(secrets_http_requests_total[5m])) by (path)

Denied authorization rate (403) by route (5m):

sum(rate(secrets_http_requests_total{status_code="403"}[5m])) by (path)

Token endpoint 429 ratio (5m):

sum(rate(secrets_http_requests_total{path="/v1/token",status_code="429"}[5m]))
/
sum(rate(secrets_http_requests_total{path="/v1/token"}[5m]))

Token endpoint request rate by status (5m):

sum(rate(secrets_http_requests_total{path="/v1/token"}[5m])) by (status_code)

Additional rate limiting queries including 429 analysis, token endpoint health, and throttle pressure metrics are available in the Metrics Reference.

Grafana Dashboard

Dashboard Artifacts

Starter Grafana dashboard JSON artifacts are available for quick setup:

Available dashboards:

secrets-overview.json - Baseline request/error/latency view
secrets-rate-limiting.json - 429 behavior and throttle pressure view

Location: docs/operations/dashboards/

Import instructions:

Open Grafana (default: http://localhost:3000)
Go to Dashboards → Import
Click Upload JSON file
Select a dashboard file from docs/operations/dashboards/
Select your Prometheus datasource
Click Import

Notes:

Treat these dashboards as starter templates
Adjust panel thresholds and time windows for your traffic profile

For detailed panel recommendations and configuration tips, see the Grafana Dashboards section in the Metrics Reference.

Recommended Panels

When creating custom dashboards, consider including:

Panel Type	Metric	Purpose
Time Series	`rate(secrets_http_requests_total[5m])`	Request rate by route
Time Series	`histogram_quantile(0.95, ...)`	p95 latency by route
Stat	5xx error rate	Current server error rate
Gauge	API availability	Availability percentage with SLO threshold
Table	Top operations by volume	Identify hottest paths
Heatmap	Request duration buckets	Latency distribution visualization

Example panel query (request rate by route):

sum(rate(secrets_http_requests_total[5m])) by (method, path)

For more panel examples and configuration guidance, see the Metrics Reference.

Alerting

Recommended Alerts

This section provides production-ready alert rules for Prometheus Alertmanager.

Alert configuration tips:

Start with warning thresholds and tune based on your environment
Use appropriate for durations to avoid alert flapping
Include actionable descriptions in annotations
Link to runbooks or documentation in annotations

Alert Ownership and Escalation

Alert class	Default severity	Primary owner	Escalate if unresolved
API availability / 5xx surge	`critical`	Platform/on-call	10 minutes
Token issuance failures	`critical`	Platform + IAM owner	10 minutes
Sustained `429` ratio	`warning`	Service owner + platform	30 minutes
Elevated `403` denied rate	`warning`	Security + service owner	30 minutes
Metrics scrape failures	`warning`	Observability owner	30 minutes

Suggested escalation policy:

Page primary owner immediately for critical
Notify secondary owner if not acknowledged in 5 minutes
Escalate to incident commander when SLA/SLO at risk

High Error Rate

- alert: HighErrorRate
  expr: rate(secrets_operations_total{status="error"}[5m]) > 0.1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High error rate detected"
    description: "Error rate is {{ $value }} errors/sec"

High Latency

- alert: HighLatency
  expr: histogram_quantile(0.95, rate(secrets_http_request_duration_seconds_bucket[5m])) > 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High request latency"
    description: "95th percentile latency is {{ $value }}s"

Excessive 429 Ratio

- alert: ExcessiveRateLimit429Ratio
  expr: |
    (
      sum(rate(secrets_http_requests_total{status_code="429"}[10m]))
      /
      sum(rate(secrets_http_requests_total[10m]))
    ) > 0.05
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "High 429 ratio detected"
    description: "More than 5% of requests are being throttled"

429 Burst On Critical Routes

- alert: RateLimitBurstCriticalRoute
  expr: sum(rate(secrets_http_requests_total{status_code="429",path=~"/v1/secrets/.*|/v1/transit/.*"}[5m])) > 2
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Sustained 429 on secrets/transit routes"
    description: "Critical crypto routes are being throttled above threshold"

Token Endpoint 429 Ratio (Warning)

- alert: TokenEndpoint429RatioWarning
  expr: |
    (
      sum(rate(secrets_http_requests_total{path="/v1/token",status_code="429"}[10m]))
      /
      sum(rate(secrets_http_requests_total{path="/v1/token"}[10m]))
    ) > 0.05
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Token endpoint throttling elevated"
    description: "More than 5% of /v1/token requests are returning 429"

Token Endpoint 429 Ratio (Critical)

- alert: TokenEndpoint429RatioCritical
  expr: |
    (
      sum(rate(secrets_http_requests_total{path="/v1/token",status_code="429"}[10m]))
      /
      sum(rate(secrets_http_requests_total{path="/v1/token"}[10m]))
    ) > 0.20
  for: 10m
  labels:
    severity: critical
  annotations:
    summary: "Token endpoint throttling critical"
    description: "More than 20% of /v1/token requests are returning 429"

Disabling Metrics

To disable metrics collection, set METRICS_ENABLED=false in your environment:

export METRICS_ENABLED=false

When disabled:

The /metrics endpoint is not registered (requests return 404 Not Found)
No metrics are collected (zero overhead)
HTTP metrics middleware is not applied
Business metrics use a no-op implementation

Troubleshooting

Metrics endpoint returns 404

Check that METRICS_ENABLED=true in your environment
Verify that you are accessing the correct port (default is 8081, e.g., http://localhost:8081/metrics)
Verify the server is running and accessible

Missing metrics

Ensure Prometheus is scraping the /metrics endpoint
Check that operations are actually being executed
Verify the namespace matches your configuration (METRICS_NAMESPACE)

High memory usage

Review Prometheus scrape interval (recommend 15-30 seconds)
Check for high cardinality labels (should not occur with proper configuration)
Verify metrics are being scraped regularly

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

📊 Monitoring

Table of Contents

Overview

Configuration

Environment Variables

Quickstart (Prometheus + Grafana)

Prometheus Configuration

Scrape Configuration

Example Queries

Rate Limiting and Security Queries

Grafana Dashboard

Dashboard Artifacts

Recommended Panels

Alerting

Recommended Alerts

Alert Ownership and Escalation

High Error Rate

High Latency

Excessive 429 Ratio

429 Burst On Critical Routes

Token Endpoint 429 Ratio (Warning)

Token Endpoint 429 Ratio (Critical)

Disabling Metrics

Troubleshooting

Metrics endpoint returns 404

Missing metrics

High memory usage

See Also

FilesExpand file tree

monitoring.md

Latest commit

History

monitoring.md

File metadata and controls

📊 Monitoring

Table of Contents

Overview

Configuration

Environment Variables

Quickstart (Prometheus + Grafana)

Prometheus Configuration

Scrape Configuration

Example Queries

Rate Limiting and Security Queries

Grafana Dashboard

Dashboard Artifacts

Recommended Panels

Alerting

Recommended Alerts

Alert Ownership and Escalation

High Error Rate

High Latency

Excessive 429 Ratio

429 Burst On Critical Routes

Token Endpoint 429 Ratio (Warning)

Token Endpoint 429 Ratio (Critical)

Disabling Metrics

Troubleshooting

Metrics endpoint returns 404

Missing metrics

High memory usage

See Also