Display health indicators for CLP package services in the WebUI to help users identify when critical backend components are unavailable.
Problem
Users have experienced situations where the WebUI is accessible and appears functional, but queries are not being processed because the query job orchestration components (query-scheduler, query-worker, reducer, etc.) are down. While administrators can check service health via orchestrator tools (docker compose ps, kubectl get pods), there is no visibility into service health from user-facing interfaces like the WebUI, making it difficult for users to diagnose such issues.
Why this matters
User Experience: Users cannot easily determine why their queries are not processing
Debugging: Without health indicators, users have no visibility into which components might be failing
Operational Awareness: Administrators need to quickly identify service outages without manually checking container/pod status
Affected services
Based on the current architecture in tools/deployment/package/docker-compose-all.yaml and tools/deployment/package-helm/templates/, the services that need health monitoring include:
Note: Current health checks are defined per orchestrator:
Docker Compose: healthcheck blocks in docker-compose-all.yaml
Kubernetes (Helm): readinessProbe / livenessProbe in deployment templates
Core job orchestration services
| Service | Description | Port | Current Health Endpoint |
|---|---|---|---|
| query-scheduler | Schedules query jobs | 7000 | None (the port is only a TCP listener for reducers) |
| compression-scheduler | Schedules compression jobs | - | None |
| query-worker | Celery worker for executing queries | - | None (Celery process) |
| compression-worker | Celery worker for executing compression jobs | - | None (Celery process) |
| reducer | Aggregates query results | - | None |
Supporting services
| Service | Description | Port | Current Health Endpoint |
|---|---|---|---|
| api-server | REST API server | 3001 | GET /health |
| webui | Web interface | 4000 | TCP socket check |
| garbage-collector | Cleans up old archives and results | - | None |
| mcp-server | MCP server (optional) | 8000 | GET /health |
| log-ingestor | Ingestion service | 3002 | GET /health |
Third-party services
| Service | Description | Port | Health Check Method |
|---|---|---|---|
| database | MariaDB | 3306 | mysqladmin ping |
| queue | RabbitMQ | 5672 | rabbitmq-diagnostics check_running |
| redis | Redis | 6379 | redis-cli PING |
| results-cache | MongoDB | 27017 | mongosh ping |
Initialization jobs
| Service | Description | Health Check Method |
|---|---|---|
| db-table-creator | Creates database tables in MariaDB | Job completion status (one-time) |
| results-cache-indices-creator | Initializes MongoDB indices | Job completion status (one-time) |
Note: Most services depend on these initialization jobs completing successfully before starting.
Possible implementation
Two decisions need to be made:
How services report health — the mechanism for collecting health status from services
How health statuses are cached/exposed — how the API server stores and exposes aggregated health data
1. Alternative approaches for health reporting
How should services report their health status to a central aggregator?
| Option | Description | Advantages | Disadvantages |
|---|---|---|---|
| 1A: Orchestrator-based | Leverage Docker/Kubernetes APIs to get container/pod health status | • Uses existing health checks defined in compose/helm files | • Docker: requires exposing socket (security concern) • Kubernetes: requires additional RBAC permissions |
| 1B: Services send heartbeats to API server (recommended) | Services periodically POST health reports to API server | • Orchestrator agnostic • Services only need to make HTTP requests (simpler than serving) • Single aggregation point as source of truth | • Requires adding HTTP client to each service |
| 1C: API server scrapes services | API server periodically polls each service's health endpoint | • Similar to Prometheus model • Bypasses orchestrator | • Requires all services to expose HTTP endpoints (not all are HTTP servers) • Requires service discovery (API server needs to know hostnames assigned by Docker Compose / Kubernetes) |
Option 1B implementation details:
Add a POST /health endpoint to the API server that accepts health reports
Each service periodically (e.g., every 10 seconds) sends a report with:
Service name
Service instance ID
Optional timestamp (for debugging clock skew / network delays; API server's receive time is authoritative for health calculations)
Optional error message to explicitly mark as unhealthy (e.g., "failed to connect to database")
Optional status details (e.g., queue depth, active jobs)
API server also marks services as unhealthy if no heartbeat received within a threshold (e.g., 30 seconds)
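The report fields and the timeout rule above can be sketched as a small in-memory registry. This is only an illustration under assumptions: the names `HealthReport`, `HealthRegistry`, and the field names are hypothetical and not taken from the CLP codebase, and the API server itself may not be written in Python.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Tuple

UNHEALTHY_THRESHOLD_SECONDS = 30.0  # example threshold from the list above


@dataclass
class HealthReport:
    """Hypothetical report body; all field names are assumptions."""
    service: str
    instance_id: str
    timestamp: Optional[float] = None  # sender's clock, kept for debugging only
    error: Optional[str] = None  # set to explicitly report as unhealthy
    details: Dict[str, object] = field(default_factory=dict)


class HealthRegistry:
    """In-memory store keyed by (service, instance_id)."""

    def __init__(
        self,
        threshold: float = UNHEALTHY_THRESHOLD_SECONDS,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._threshold = threshold
        self._clock = clock
        self._entries: Dict[Tuple[str, str], Tuple[float, HealthReport]] = {}

    def record(self, report: HealthReport) -> None:
        # The receive time, not report.timestamp, drives health calculations.
        key = (report.service, report.instance_id)
        self._entries[key] = (self._clock(), report)

    def is_healthy(self, service: str, instance_id: str) -> bool:
        entry = self._entries.get((service, instance_id))
        if entry is None:
            return False  # never reported
        received_at, report = entry
        if report.error is not None:
            return False  # the service explicitly reported a failure
        return (self._clock() - received_at) <= self._threshold
```

Passing the clock in as a parameter keeps the staleness rule testable without sleeping; a real aggregator would likely also evict entries for instances that have been gone for a long time.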
2. Alternatives for health status storage/caching
Some entity (assumed here to be the API server) aggregates health statuses. Options for how it stores and exposes them:
| Option | Description | Advantages | Disadvantages |
|---|---|---|---|
| 2A: API server in-memory cache (recommended) | Cache in memory, expose via GET /health | • Simplest; no external storage • Health data is ephemeral by nature • Orchestrators can query for health checks | • WebUI must poll (no push updates) • Data lost on restart (acceptable) |
| 2B: MongoDB (results-cache) | Store in dedicated MongoDB collection | • WebUI can use CDC via Socket.IO for real-time updates (existing pattern) | • Additional complexity for ephemeral data |
| 2C: Redis | Store with TTL-based expiry | • Fast reads/writes • TTL auto-expires stale entries | • WebUI doesn't connect to Redis • Requires new infrastructure |
| 2D: MariaDB / MySQL (clp-db) | Store in heartbeat table | • WebUI already connects to clp-db • Transactional consistency | • WebUI must poll (no CDC) • Additional load on primary database |
Option 2A endpoints:
GET /health — returns health status of all services (for WebUI)
GET /health?service=<name>&instance=<id> — returns health status of a specific service instance (for container orchestrator health checks on services without their own endpoints)
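Caveat for orchestrator health checks: This creates a chicken-and-egg problem. A service whose health check goes through the API server cannot be reported healthy until the API server is running, but the API server currently has hard dependencies of its own (depends_on database / initContainers waiting for db-table-creator), so it can't start before other services without relaxing these dependencies.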
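One way the two GET variants could behave is sketched below. The response shape and the status codes (200 for healthy, 503 for unhealthy, 404 for an unknown service) are assumptions made for illustration; the issue does not specify them, and `get_health` is a hypothetical helper rather than an existing API server function.

```python
from typing import Dict, Optional, Tuple

# Maps (service, instance_id) to a healthy flag; a stand-in for the in-memory cache.
Snapshot = Dict[Tuple[str, str], bool]


def get_health(
    snapshot: Snapshot,
    service: Optional[str] = None,
    instance: Optional[str] = None,
) -> Tuple[int, dict]:
    """Return (HTTP status code, JSON-serializable body)."""
    if service is None:
        # GET /health: the full list, intended for the WebUI.
        services = [
            {"service": name, "instance": inst, "healthy": healthy}
            for (name, inst), healthy in sorted(snapshot.items())
        ]
        return 200, {"services": services}

    # GET /health?service=<name>&instance=<id>: intended for orchestrator checks.
    matches = {
        key: healthy
        for key, healthy in snapshot.items()
        if key[0] == service and (instance is None or key[1] == instance)
    }
    if not matches:
        return 404, {"error": "unknown service instance"}
    healthy = all(matches.values())
    # A non-2xx code lets a plain HTTP probe fail without parsing the body.
    return (200 if healthy else 503), {"service": service, "healthy": healthy}
```

Returning a non-2xx status for an unhealthy instance matters mainly for the orchestrator case, since HTTP-based probes typically look only at the status code.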
Recommended architecture (Option 1B + 2A)
flowchart LR
subgraph Services
QS[query-scheduler]
CS[compression-scheduler]
QW[query-worker]
CW[compression-worker]
R[reducer]
GC[garbage-collector]
end
subgraph Aggregator
API[API Server<br/>in-memory cache]
end
subgraph Frontend
WebUI[WebUI]
end
subgraph Orchestrator
DC[Docker Compose /<br/>Kubernetes]
end
QS -->|POST /health| API
CS -->|POST /health| API
QW -->|POST /health| API
CW -->|POST /health| API
R -->|POST /health| API
GC -->|POST /health| API
API -->|GET /health| WebUI
API -->|GET /health?service=X| DC
Implementation steps (Option 1B + 2A)
API server changes:
Add POST /health endpoint to accept service health reports
Add GET /health endpoint to return aggregated health status of all services (for WebUI)
Add GET /health?service=<name>&instance=<id> for querying specific service health (for container orchestrator health checks)
Add background task to mark services as unhealthy if no report received within threshold
Cache health statuses in memory
Service changes (first-party):
Add health report HTTP client to each long-running Python service (query-scheduler, compression-scheduler, reducer, garbage-collector)
Add health report mechanism to Celery workers (query-worker, compression-worker)
Configure report interval via environment variable or config
For initialization jobs (db-table-creator, results-cache-indices-creator): report completion status once upon success/failure
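For the long-running Python services, the reporting client could be a small background thread. The following is a minimal sketch, assuming a JSON body with `service` and `instance_id` keys (hypothetical names) and using only the standard library; how Celery workers would hook this in is not shown.

```python
import json
import threading
import urllib.request
from typing import Callable, Optional


def build_health_request(
    url: str, service: str, instance_id: str, error: Optional[str] = None
) -> urllib.request.Request:
    """Build the POST request; the payload keys are assumptions."""
    payload = {"service": service, "instance_id": instance_id}
    if error is not None:
        payload["error"] = error
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 2.0) -> None:
    """Send one report and discard the response body."""
    with urllib.request.urlopen(request, timeout=timeout):
        pass


def start_heartbeat(
    url: str,
    service: str,
    instance_id: str,
    interval: float = 10.0,
    sender: Callable[[urllib.request.Request], None] = send,
) -> threading.Event:
    """Start a daemon thread that reports periodically; set the returned event to stop it."""
    stop = threading.Event()

    def loop() -> None:
        while True:
            try:
                sender(build_health_request(url, service, instance_id))
            except Exception:
                # Best-effort: the service must keep working if the API server is down.
                pass
            if stop.wait(interval):
                break

    threading.Thread(target=loop, name="health-reporter", daemon=True).start()
    return stop
```

A service would call something like `start_heartbeat("http://api-server:3001/health", "query-scheduler", instance_id)` once at startup. Swallowing send failures is deliberate here, so that health reporting never becomes a hard dependency on the API server.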
Third-party service health reporting (optional, future enhancement):
Third-party services (database, queue, redis, results-cache) don't run our code, so they can't directly report health. Possible approaches:
Extend existing healthchecks: Append a curl command to existing healthcheck scripts, e.g., mysqladmin ping && curl -X POST http://api-server:3001/health -d '{"service":"database"}'
WebUI changes:
Poll GET /health endpoint periodically (e.g., every 5 seconds)
Add health status display component (e.g., status bar or dedicated page)
Visual indicators: green (healthy), red (unhealthy/missing); optionally yellow (degraded) in future
Container orchestrator changes (optional, requires more thought):
For services without their own health endpoints, configure health checks to use GET /health?service=<name>&instance=<id> on the API server (see caveat in Option 2A above)
Configuration options
# Example values.yaml additions
clpConfig:
  # Per-service config to optionally enable reporting (so services don't have a hard dependency on API server)
  query_scheduler:
    health_reporting:
      enabled: true  # optional; service continues to function if API server is unavailable
      interval: 10  # seconds between reports
      unhealthy_threshold: 30  # seconds without report before API server marks as unhealthy
  compression_scheduler:
    health_reporting:
      enabled: true
      interval: 10
      unhealthy_threshold: 30
  # ... similar for other services
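On the service side, these settings could be read with defaults and a sanity check that the threshold exceeds the interval. This is a sketch under assumptions: `HealthReportingConfig` and `load_health_reporting` are hypothetical names, and defaulting `enabled` to false when the block is absent is a choice made here, not something the proposal states.

```python
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass(frozen=True)
class HealthReportingConfig:
    enabled: bool = False  # assumed default: reporting is opt-in
    interval: float = 10.0  # seconds between reports
    unhealthy_threshold: float = 30.0  # seconds without a report before unhealthy


def load_health_reporting(service_config: Mapping[str, Any]) -> HealthReportingConfig:
    """Read the health_reporting block of one service's already-parsed config."""
    raw = service_config.get("health_reporting") or {}
    cfg = HealthReportingConfig(
        enabled=bool(raw.get("enabled", False)),
        interval=float(raw.get("interval", 10)),
        unhealthy_threshold=float(raw.get("unhealthy_threshold", 30)),
    )
    if cfg.interval <= 0:
        raise ValueError("interval must be positive")
    if cfg.unhealthy_threshold <= cfg.interval:
        # Otherwise a single delayed report would flip the service to unhealthy.
        raise ValueError("unhealthy_threshold must be greater than interval")
    return cfg
```

With the example values (10 and 30), a service can miss two consecutive reports before the API server marks it unhealthy.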
References
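Docker Compose health checks: tools/deployment/package/docker-compose-all.yaml
tools/deployment/package-helm/templates/
components/api-server/src/routes.rs
components/webui/server/src/plugins/app/socket/MongoSocketIoServer/
components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py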