Multi-agent SRE copilot system for automated production incident analysis and root-cause diagnosis.
- Ingests alerts via webhook and automatically triages production incidents
- Coordinates specialized LLM agents to analyze logs, metrics, and service topology
- Generates root-cause hypotheses with confidence scores using observability data
- Delivers structured incident summaries to Slack with actionable insights
┌──────────────┐   HTTP    ┌───────────────┐  LangGraph  ┌────────────────────┐
│ Alert Source │ ────────> │    FastAPI    │ ──────────> │     Agent Mesh     │
│  (webhook)   │           │ /ingest/alert │             │    Orchestrator    │
└──────────────┘           └───────────────┘             └──────────┬─────────┘
                                                                    │
       ┌───────────────┬───────────────┬───────────────┬────────────┴──┐
       │               │               │               │               │
       v               v               v               v               v
 ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐
 │ Sentinel │    │Topologist│    │Hypothesis│    │  Comms   │    │    UI    │
 │  Agent   │    │  Agent   │    │  Agent   │    │  Agent   │    │(Next.js) │
 └─────┬────┘    └─────┬────┘    └─────┬────┘    └─────┬────┘    └─────┬────┘
       │               │               │               │               │
       v               v               v               v               v
┌────────────────────────────────────────────────────────────────────────────┐
│                                 Data Layer                                 │
│       ┌─────────────┐        ┌─────────────┐        ┌─────────────┐        │
│       │  Postgres   │        │ ClickHouse  │        │    Redis    │        │
│       │ (incidents) │        │   (logs)    │        │   (cache)   │        │
│       └─────────────┘        └─────────────┘        └─────────────┘        │
└────────────────────────────────────────────────────────────────────────────┘

Agent Pipeline:
- Sentinel Agent: Deduplicates incoming alerts and seeds an initial blast-radius hypothesis
- Topologist Agent: Queries service dependency graph, identifies impacted downstream services
- Hypothesis Agent: Correlates logs/metrics using LLM to generate root-cause theories
- Comms Agent: Formats findings and posts structured summary to Slack
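
The four stages above can be pictured as a chain of plain functions that pass a shared state along. The following is a minimal sketch under stated assumptions, not the project's implementation: it deliberately avoids LangGraph and any LLM call so that it runs standalone. The names `State`, `sentinel`, `topologist`, `hypothesis`, `comms`, and `run_pipeline`, the `TOPOLOGY` dictionary, and the fixed confidence value are all hypothetical. The alert fields mirror the test alert used later in this README.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class State:
    """Shared state handed from one stage to the next (hypothetical shape)."""
    alert: dict
    duplicate: bool = False
    impacted: list = field(default_factory=list)
    hypotheses: list = field(default_factory=list)
    slack_payload: dict = field(default_factory=dict)


# Hypothetical dependency graph: service -> services that depend on it.
TOPOLOGY = {
    "checkout": ["cart", "orders"],
    "orders": ["notifications"],
    "cart": [],
    "notifications": [],
}

_seen_fingerprints = set()


def sentinel(state: State) -> State:
    """Flag alerts whose fingerprint has already been seen."""
    fingerprint = state.alert["fingerprint"]
    state.duplicate = fingerprint in _seen_fingerprints
    _seen_fingerprints.add(fingerprint)
    return state


def topologist(state: State) -> State:
    """Breadth-first walk over downstream dependents of the alerting service."""
    start = state.alert["service"]
    seen, queue = {start}, deque([start])
    while queue:
        for dependent in TOPOLOGY.get(queue.popleft(), []):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
                state.impacted.append(dependent)
    return state


def hypothesis(state: State) -> State:
    """Stand-in for the LLM step: emits a single placeholder theory."""
    state.hypotheses.append({
        "cause": f"unknown cause behind '{state.alert['symptom']}'",
        "confidence": 0.5,  # placeholder, not a calibrated score
    })
    return state


def comms(state: State) -> State:
    """Build a Slack incoming-webhook style payload (nothing is sent here)."""
    top = state.hypotheses[0]
    state.slack_payload = {"text": (
        f"[{state.alert['severity']}] {state.alert['service']}: {state.alert['symptom']}\n"
        f"Impacted downstream: {', '.join(state.impacted) or 'none'}\n"
        f"Top hypothesis ({top['confidence']:.0%}): {top['cause']}"
    )}
    return state


def run_pipeline(alert: dict) -> State:
    """Run the stages in order; duplicates stop after the Sentinel stage."""
    state = sentinel(State(alert=alert))
    if state.duplicate:
        return state
    for stage in (topologist, hypothesis, comms):
        state = stage(state)
    return state
```

Calling `run_pipeline` twice with the same alert marks the second call as a duplicate and skips the remaining stages, which is the behavior the Sentinel description implies.
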
Tech Stack:
- Language: Python 3.11
- API Framework: FastAPI
- Agent Framework: LangGraph (LangChain), used for multi-agent orchestration
- LLM Providers: OpenAI GPT-4 or Anthropic Claude 3.5 Sonnet
- Databases: PostgreSQL (incidents), ClickHouse (logs), Redis (cache)
- Frontend: Next.js 14, TypeScript, TailwindCSS
- Observability: OpenTelemetry (distributed tracing)
- Deployment: Docker, Docker Compose

Prerequisites:
- Docker 20.10+
- Docker Compose 1.29+
- OpenAI API key OR Anthropic API key (for LLM agent reasoning)
- (Optional) Slack webhook URL for notifications

# Clone repository
git clone <repo-url>
cd opsagent-mesh
# Configure environment
cp .env.example .env
# Edit .env and add your OPENAI_API_KEY or ANTHROPIC_API_KEY
# Run full demo (builds, starts services, seeds data, runs evaluation)
make demo

What `make demo` does:
- Builds Docker images for API and UI
- Starts PostgreSQL, ClickHouse, Redis, OpenTelemetry Collector
- Seeds synthetic incident and log data
- Runs 3 evaluation scenarios and displays agent reasoning
Access points:
- API: http://localhost:8000/docs
- UI Dashboard: http://localhost:3000
- Health check: http://localhost:8000/health
Create .env file from template:
cp .env.example .env

Required variables:
| Variable | Description | Example |
|---|---|---|
| MODEL_BACKEND | LLM provider (openai or anthropic) | openai |
| OPENAI_API_KEY | OpenAI API key (if using OpenAI) | sk-proj-... |
| ANTHROPIC_API_KEY | Anthropic API key (if using Anthropic) | sk-ant-... |

Optional variables:
| Variable | Description | Default |
|---|---|---|
| SLACK_WEBHOOK_URL | Slack webhook for notifications | (none) |
| POSTGRES_URL | PostgreSQL connection string | postgresql://postgres:postgres@postgres:5432/opsmesh |
| CLICKHOUSE_URL | ClickHouse HTTP endpoint | http://clickhouse:8123 |
| REDIS_URL | Redis connection string | redis://redis:6379/0 |
| OPENAI_MODEL | OpenAI model name | gpt-4o |
| ANTHROPIC_MODEL | Anthropic model name | claude-3-5-sonnet-20241022 |
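
To make the relationship between the required and optional variables concrete, here is a hedged sketch of how a settings loader could apply these defaults and check that the API key for the selected backend is present. The `Settings` dataclass and the `load_settings` function are hypothetical and are not taken from the project's code; only the variable names and default values come from the tables above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Defaults copied from the "Optional variables" table.
DEFAULTS = {
    "POSTGRES_URL": "postgresql://postgres:postgres@postgres:5432/opsmesh",
    "CLICKHOUSE_URL": "http://clickhouse:8123",
    "REDIS_URL": "redis://redis:6379/0",
    "OPENAI_MODEL": "gpt-4o",
    "ANTHROPIC_MODEL": "claude-3-5-sonnet-20241022",
}

# Which API key variable each backend requires.
KEY_FOR_BACKEND = {"openai": "OPENAI_API_KEY", "anthropic": "ANTHROPIC_API_KEY"}


@dataclass(frozen=True)
class Settings:
    model_backend: str
    api_key: str
    model: str
    postgres_url: str
    clickhouse_url: str
    redis_url: str
    slack_webhook_url: Optional[str]


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from `env`, failing early on a backend/key mismatch."""
    backend = env.get("MODEL_BACKEND", "").strip().lower()
    if backend not in KEY_FOR_BACKEND:
        raise ValueError("MODEL_BACKEND must be 'openai' or 'anthropic'")

    key_name = KEY_FOR_BACKEND[backend]
    api_key = env.get(key_name, "")
    if not api_key:
        raise ValueError(f"{key_name} is required when MODEL_BACKEND={backend}")

    def get(name: str) -> str:
        # Treat unset and empty values the same way: fall back to the default.
        return env.get(name) or DEFAULTS[name]

    return Settings(
        model_backend=backend,
        api_key=api_key,
        model=get(f"{backend.upper()}_MODEL"),
        postgres_url=get("POSTGRES_URL"),
        clickhouse_url=get("CLICKHOUSE_URL"),
        redis_url=get("REDIS_URL"),
        slack_webhook_url=env.get("SLACK_WEBHOOK_URL") or None,
    )
```

Passing a plain dictionary instead of `os.environ` makes the loader easy to exercise in tests, for example `load_settings({"MODEL_BACKEND": "anthropic", "ANTHROPIC_API_KEY": "sk-ant-..."})`.
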
# Start all services
make up
# Send test alert
curl -X POST http://localhost:8000/ingest/alert \
  -H "Content-Type: application/json" \
  -d '{
    "id": "alert-001",
    "service": "checkout",
    "severity": "high",
    "ts": "2024-08-01T12:00:00Z",
    "symptom": "p95 latency > 2s",
    "fingerprint": "checkout-latency-spike"
  }'
# View incident in UI
open http://localhost:3000

# Run evaluation harness (3 scenarios)
make eval
# Expected output:
# Scenario 1: Database connection pool exhaustion (latency spike)
# Scenario 2: Cascading failure from upstream auth service
# Scenario 3: Memory leak causing OOM in payment service

# Load synthetic logs and topology
make seed

# Run the test suite inside the API container
docker compose exec api python -m pytest tests/

opsagent-mesh/
├── Makefile                        # Build and orchestration commands
├── docker-compose.yml              # Service definitions
├── requirements.txt                # Python dependencies
├── .env.example                    # Environment template
├── apps/
│   ├── api/                        # FastAPI backend
│   │   ├── main.py                 # API routes, /ingest/alert endpoint
│   │   ├── database.py             # SQLAlchemy models
│   │   ├── models.py               # ORM models (Incident, Finding)
│   │   └── seed_synthetic.py       # Synthetic data generator
│   └── ui/                         # Next.js frontend
│       ├── pages/
│       │   ├── index.tsx           # Incident list dashboard
│       │   └── incidents/[id].tsx  # Incident detail view
│       └── package.json
├── orchestrator/
│   ├── graph.py                    # LangGraph workflow definition
│   ├── agents/
│   │   ├── sentinel.py             # Alert deduplication agent
│   │   ├── topologist.py           # Dependency graph agent
│   │   ├── hypothesis.py           # Root-cause reasoning agent
│   │   └── comms.py                # Slack notification agent
│   └── tools/
│       ├── metrics.py              # ClickHouse query tool
│       ├── tracing.py              # Distributed trace analysis
│       ├── logs.py                 # Log correlation tool
│       └── model.py                # LLM wrapper
├── infra/
│   ├── postgres/init.sql           # Schema initialization
│   ├── clickhouse/init.sql         # Log table schema
│   └── otel-collector.yaml         # OpenTelemetry config
├── data/synthetic/
│   ├── topology.json               # Service dependency graph
│   ├── metrics.json                # Sample metrics data
│   └── checkout_logs.parquet       # Synthetic log data
└── eval/
    └── run.py                      # Evaluation harness

Error: docker: Error response from daemon: Conflict
Fix: Clean up existing containers and volumes:
make clean
make setup
make up

Error: Agent completes but no Slack message received
Fix:
- Verify `SLACK_WEBHOOK_URL` is set in `.env`
- Test the webhook manually:
curl -X POST "$SLACK_WEBHOOK_URL" \
  -H 'Content-Type: application/json' \
  -d '{"text":"Test message"}'

Error: openai.AuthenticationError or anthropic.APIError
Fix:
- Verify the API key in `.env` is correct
- Check that `MODEL_BACKEND` matches the provider:
# For OpenAI
MODEL_BACKEND=openai
OPENAI_API_KEY=sk-proj-...
# For Anthropic
MODEL_BACKEND=anthropic
ANTHROPIC_API_KEY=sk-ant-...

Error: sqlalchemy.exc.OperationalError: could not connect to server
Fix: Ensure PostgreSQL is healthy:
docker compose logs postgres
docker compose restart postgres

Error: ClickHouseError: Code: 60. DB::Exception: Table doesn't exist
Fix: Reinitialize the ClickHouse schema. Note that `docker compose down -v` deletes the project's volumes, so any data stored in them is lost and must be re-seeded:
docker compose down -v
make up
make seed

Error: Module not found: Can't resolve...
Fix: Rebuild the UI container:
docker compose build ui
docker compose up -d ui

Roadmap:
- Add support for multi-region topology analysis
- Implement automated rollback recommendations based on deployment history
- Add custom metric anomaly detection using time-series forecasting
- Support for additional LLM providers (Cohere, Mistral)
- Implement agent fine-tuning on historical incident data
- Add Grafana integration for real-time dashboard embedding
- Support for Kubernetes pod-level topology mapping
- Implement confidence calibration for hypothesis scoring
- Add A/B testing framework for agent prompt optimization
- Create Terraform modules for cloud deployment (AWS, GCP, Azure)
TBD