OpsAgent Mesh

Multi-agent SRE copilot system for automated production incident analysis and root-cause diagnosis.

What It Does

Ingests alerts via webhook and automatically triages production incidents
Coordinates specialized LLM agents to analyze logs, metrics, and service topology
Generates root-cause hypotheses with confidence scores using observability data
Delivers structured incident summaries to Slack with actionable insights

Architecture / Key Components

┌──────────────┐   HTTP    ┌───────────────┐   LangGraph  ┌────────────────────┐
│ Alert Source │ ────────> │  FastAPI      │ ──────────> │  Agent Mesh        │
│  (webhook)   │           │  /ingest/alert│             │  Orchestrator      │
└──────────────┘           └───────────────┘             └──────────┬─────────┘
                                                                     │
         ┌───────────────────────────────────────────────────────────┴────────────┐
         │                      │                    │                   │         │
         v                      v                    v                   v         v
   ┌──────────┐          ┌──────────┐        ┌──────────┐        ┌──────────┐  ┌─────────┐
   │ Sentinel │          │Topologist│        │Hypothesis│        │   Comms  │  │   UI    │
   │  Agent   │          │  Agent   │        │  Agent   │        │  Agent   │  │(Next.js)│
   └────┬─────┘          └────┬─────┘        └────┬─────┘        └────┬─────┘  └────┬────┘
        │                     │                    │                   │             │
        v                     v                    v                   v             v
   ┌────────────────────────────────────────────────────────────────────────────────────┐
   │                          Data Layer                                                │
   │   ┌───────────┐       ┌───────────┐        ┌────────┐                             │
   │   │ Postgres  │       │ ClickHouse│        │ Redis  │                             │
   │   │(incidents)│       │  (logs)   │        │(cache) │                             │
   │   └───────────┘       └───────────┘        └────────┘                             │
   └────────────────────────────────────────────────────────────────────────────────────┘

Agent Pipeline:

Sentinel Agent: Deduplicates incoming alerts, seeds initial blast radius hypothesis
Topologist Agent: Queries service dependency graph, identifies impacted downstream services
Hypothesis Agent: Correlates logs/metrics using LLM to generate root-cause theories
Comms Agent: Formats findings and posts structured summary to Slack
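
The four stages above can be read as functions that pass a shared state along. The following is a minimal plain-Python sketch of that flow, not the LangGraph graph defined in orchestrator/graph.py. The in-memory SEEN_FINGERPRINTS set, the TOPOLOGY mapping, the state keys, and the canned hypothesis are all hypothetical stand-ins for Redis/PostgreSQL lookups, data/synthetic/topology.json, and the LLM call.

```python
from collections import deque
from typing import Any, Dict, List, Set

# Hypothetical in-memory stand-ins (assumptions, not the project's real stores).
SEEN_FINGERPRINTS: Set[str] = set()
TOPOLOGY: Dict[str, List[str]] = {
    # service -> services directly downstream of it
    "auth": ["checkout"],
    "checkout": ["payment", "notifications"],
    "payment": [],
    "notifications": [],
}


def sentinel(state: Dict[str, Any]) -> Dict[str, Any]:
    """Flag alerts whose fingerprint was already seen; seed the blast radius."""
    fingerprint = state["alert"]["fingerprint"]
    state["duplicate"] = fingerprint in SEEN_FINGERPRINTS
    SEEN_FINGERPRINTS.add(fingerprint)
    state["blast_radius"] = [state["alert"]["service"]]
    return state


def topologist(state: Dict[str, Any]) -> Dict[str, Any]:
    """Breadth-first walk of the dependency graph from the alerting service."""
    start = state["alert"]["service"]
    seen, queue = {start}, deque([start])
    while queue:
        for downstream in TOPOLOGY.get(queue.popleft(), []):
            if downstream not in seen:
                seen.add(downstream)
                queue.append(downstream)
    state["blast_radius"] = sorted(seen)
    return state


def hypothesis(state: Dict[str, Any]) -> Dict[str, Any]:
    """Placeholder for the LLM step: one canned theory with a confidence score."""
    service = state["alert"]["service"]
    state["hypotheses"] = [{"cause": f"regression in {service}", "confidence": 0.5}]
    return state


def comms(state: Dict[str, Any]) -> Dict[str, Any]:
    """Format the highest-confidence hypothesis into a one-line summary."""
    alert = state["alert"]
    top = max(state["hypotheses"], key=lambda h: h["confidence"])
    state["summary"] = (
        f"[{alert['severity']}] {alert['service']}: {alert['symptom']} | "
        f"impacted: {', '.join(state['blast_radius'])} | "
        f"top hypothesis: {top['cause']} ({top['confidence']:.0%})"
    )
    return state


def run_pipeline(alert: Dict[str, Any]) -> Dict[str, Any]:
    """Run Sentinel first and stop early on duplicates, then the other agents."""
    state = sentinel({"alert": alert})
    if state["duplicate"]:
        return state
    for step in (topologist, hypothesis, comms):
        state = step(state)
    return state
```

Stopping after Sentinel for a duplicate alert is a design choice of this sketch; the README does not say how the real orchestrator treats duplicates.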

Tech Stack

Language: Python 3.11
API Framework: FastAPI
Agent Framework: LangGraph (LangChain), used for multi-agent orchestration
LLM Providers: OpenAI (gpt-4o by default) or Anthropic (Claude 3.5 Sonnet by default)
Databases: PostgreSQL (incidents), ClickHouse (logs), Redis (cache)
Frontend: Next.js 14, TypeScript, TailwindCSS
Observability: OpenTelemetry (distributed tracing)
Deployment: Docker, Docker Compose

Prerequisites

Docker 20.10+
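Docker Compose 1.29+ (the commands below use the Compose v2 form, docker compose; with a standalone 1.x install, substitute docker-compose)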
OpenAI API key OR Anthropic API key (for LLM agent reasoning)
(Optional) Slack webhook URL for notifications

Quickstart

# Clone repository
git clone <repo-url>
cd opsagent-mesh

# Configure environment
cp .env.example .env
# Edit .env and add your OPENAI_API_KEY or ANTHROPIC_API_KEY

# Run full demo (builds, starts services, seeds data, runs evaluation)
make demo

What make demo does:

Builds Docker images for API and UI
Starts PostgreSQL, ClickHouse, Redis, OpenTelemetry Collector
Seeds synthetic incident and log data
Runs 3 evaluation scenarios and displays agent reasoning

Access points:

API: http://localhost:8000/docs
UI Dashboard: http://localhost:3000
Health check: http://localhost:8000/health
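
Services can take a while to come up after make demo or make up, so scripts that call the API may want to wait for the health check first. The sketch below polls the health URL listed above using only the standard library. It assumes the endpoint answers HTTP 200 once the API is ready, which the README does not state; the function name and its parameters are hypothetical.

```python
import time
import urllib.request


def wait_for_health(url="http://localhost:8000/health", timeout=60.0, interval=2.0,
                    opener=urllib.request.urlopen, sleep=time.sleep):
    """Poll `url` until it answers HTTP 200 or `timeout` seconds have elapsed.

    `opener` and `sleep` are injectable so the loop can be exercised without
    a running service.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            with opener(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            # urllib.error.URLError and HTTPError are OSError subclasses, so
            # this covers refused connections, timeouts and non-2xx answers.
            pass
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

Calling wait_for_health() with no arguments blocks for up to a minute and returns False if the API never becomes healthy.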

Configuration

Environment Variables

Create a .env file from the template:

cp .env.example .env

Required variables:

Variable	Description	Example
`MODEL_BACKEND`	LLM provider (`openai` or `anthropic`)	`openai`
`OPENAI_API_KEY`	OpenAI API key (if using OpenAI)	`sk-proj-...`
`ANTHROPIC_API_KEY`	Anthropic API key (if using Anthropic)	`sk-ant-...`

Optional variables:

Variable	Description	Default
`SLACK_WEBHOOK_URL`	Slack webhook for notifications	(none)
`POSTGRES_URL`	PostgreSQL connection string	`postgresql://postgres:postgres@postgres:5432/opsmesh`
`CLICKHOUSE_URL`	ClickHouse HTTP endpoint	`http://clickhouse:8123`
`REDIS_URL`	Redis connection string	`redis://redis:6379/0`
`OPENAI_MODEL`	OpenAI model name	`gpt-4o`
`ANTHROPIC_MODEL`	Anthropic model name	`claude-3-5-sonnet-20241022`
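
To make the relationship between these variables concrete, here is a minimal sketch of how they could be read and checked at startup. The variable names and defaults come from the tables above; the Settings class and load_settings function are hypothetical and are not taken from the repository.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Defaults copied from the "Optional variables" table.
DEFAULTS = {
    "POSTGRES_URL": "postgresql://postgres:postgres@postgres:5432/opsmesh",
    "CLICKHOUSE_URL": "http://clickhouse:8123",
    "REDIS_URL": "redis://redis:6379/0",
    "OPENAI_MODEL": "gpt-4o",
    "ANTHROPIC_MODEL": "claude-3-5-sonnet-20241022",
}
# Which API key each MODEL_BACKEND value needs.
API_KEY_VAR = {"openai": "OPENAI_API_KEY", "anthropic": "ANTHROPIC_API_KEY"}


@dataclass(frozen=True)
class Settings:  # hypothetical container, not a class from this repo
    model_backend: str
    api_key: str
    model: str
    postgres_url: str
    clickhouse_url: str
    redis_url: str
    slack_webhook_url: Optional[str]


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read settings from `env` (defaults to os.environ) and validate them."""
    env = os.environ if env is None else env
    backend = env.get("MODEL_BACKEND", "").strip().lower()
    if backend not in API_KEY_VAR:
        raise ValueError("MODEL_BACKEND must be 'openai' or 'anthropic'")
    key_var = API_KEY_VAR[backend]
    api_key = env.get(key_var, "")
    if not api_key:
        raise ValueError(f"{key_var} must be set when MODEL_BACKEND={backend}")

    def get(name: str) -> str:
        # Fall back to the documented default when unset or empty.
        return env.get(name) or DEFAULTS[name]

    return Settings(
        model_backend=backend,
        api_key=api_key,
        model=get(f"{backend.upper()}_MODEL"),
        postgres_url=get("POSTGRES_URL"),
        clickhouse_url=get("CLICKHOUSE_URL"),
        redis_url=get("REDIS_URL"),
        slack_webhook_url=env.get("SLACK_WEBHOOK_URL") or None,
    )
```

Failing fast on a backend/key mismatch surfaces the problem at startup rather than as an authentication error in the middle of an incident analysis.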

How to Run Tests

Manual Testing

# Start all services
make up

# Send test alert
curl -X POST http://localhost:8000/ingest/alert \
  -H "Content-Type: application/json" \
  -d '{
    "id": "alert-001",
    "service": "checkout",
    "severity": "high",
    "ts": "2024-08-01T12:00:00Z",
    "symptom": "p95 latency > 2s",
    "fingerprint": "checkout-latency-spike"
  }'

# View incident in UI
open http://localhost:3000   # macOS; on Linux use xdg-open
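
The payload in the curl example above carries six fields. As a rough illustration of that shape, the sketch below parses such a payload into a typed record with the standard library. The Alert class and parse_alert function are hypothetical; the README does not say how /ingest/alert validates its input or which severity values it accepts, so only field presence and the timestamp format are checked.

```python
from dataclasses import dataclass
from datetime import datetime

REQUIRED_FIELDS = ("id", "service", "severity", "ts", "symptom", "fingerprint")


@dataclass(frozen=True)
class Alert:  # hypothetical record mirroring the example payload
    id: str
    service: str
    severity: str
    ts: datetime
    symptom: str
    fingerprint: str


def parse_alert(payload: dict) -> Alert:
    """Check that all six fields are present and parse the ISO 8601 timestamp."""
    missing = [name for name in REQUIRED_FIELDS if name not in payload]
    if missing:
        raise ValueError(f"missing fields: {', '.join(missing)}")
    # datetime.fromisoformat only accepts a trailing "Z" from Python 3.11 on,
    # so normalise it to an explicit UTC offset first.
    ts = datetime.fromisoformat(str(payload["ts"]).replace("Z", "+00:00"))
    return Alert(
        id=str(payload["id"]),
        service=str(payload["service"]),
        severity=str(payload["severity"]),
        ts=ts,
        symptom=str(payload["symptom"]),
        fingerprint=str(payload["fingerprint"]),
    )
```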

Automated Evaluation

# Run evaluation harness (3 scenarios)
make eval

# Expected output:
# Scenario 1: Database connection pool exhaustion (latency spike)
# Scenario 2: Cascading failure from upstream auth service
# Scenario 3: Memory leak causing OOM in payment service

Seed Test Data

# Load synthetic logs and topology
make seed

Run Unit Tests

docker compose exec api python -m pytest tests/

Project Structure

opsagent-mesh/
├── Makefile                    # Build and orchestration commands
├── docker-compose.yml          # Service definitions
├── requirements.txt            # Python dependencies
├── .env.example                # Environment template
├── apps/
│   ├── api/                    # FastAPI backend
│   │   ├── main.py             # API routes, /ingest/alert endpoint
│   │   ├── database.py         # SQLAlchemy models
│   │   ├── models.py           # ORM models (Incident, Finding)
│   │   └── seed_synthetic.py   # Synthetic data generator
│   └── ui/                     # Next.js frontend
│       ├── pages/
│       │   ├── index.tsx       # Incident list dashboard
│       │   └── incidents/[id].tsx  # Incident detail view
│       └── package.json
├── orchestrator/
│   ├── graph.py                # LangGraph workflow definition
│   ├── agents/
│   │   ├── sentinel.py         # Alert deduplication agent
│   │   ├── topologist.py       # Dependency graph agent
│   │   ├── hypothesis.py       # Root-cause reasoning agent
│   │   └── comms.py            # Slack notification agent
│   └── tools/
│       ├── metrics.py          # ClickHouse query tool
│       ├── tracing.py          # Distributed trace analysis
│       ├── logs.py             # Log correlation tool
│       └── model.py            # LLM wrapper
├── infra/
│   ├── postgres/init.sql       # Schema initialization
│   ├── clickhouse/init.sql     # Log table schema
│   └── otel-collector.yaml     # OpenTelemetry config
├── data/synthetic/
│   ├── topology.json           # Service dependency graph
│   ├── metrics.json            # Sample metrics data
│   └── checkout_logs.parquet   # Synthetic log data
└── eval/
    └── run.py                  # Evaluation harness

Troubleshooting

Services won't start

Error: docker: Error response from daemon: Conflict

Fix: Clean up existing containers and volumes:

make clean
make setup
make up

No Slack notifications

Error: Agent completes but no Slack message received

Fix:

Verify SLACK_WEBHOOK_URL is set in .env
Test webhook manually:

# Requires SLACK_WEBHOOK_URL to be set in the current shell
curl -X POST "$SLACK_WEBHOOK_URL" \
  -H 'Content-Type: application/json' \
  -d '{"text":"Test message"}'
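
The same check can be made from Python, which is closer to how a notification step would send the message. This is a standard-library sketch with hypothetical function names; it is not the code in orchestrator/agents/comms.py, and it relies only on the fact that Slack incoming webhooks accept a JSON body with a text field.

```python
import json
import urllib.request


def build_slack_request(webhook_url: str, text: str) -> urllib.request.Request:
    """Build the POST request that mirrors the curl command above."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_to_slack(webhook_url: str, text: str, timeout: float = 10.0) -> int:
    """Send the message and return the HTTP status code.

    urllib raises urllib.error.HTTPError for 4xx/5xx answers, for example
    when the webhook URL has been revoked.
    """
    request = build_slack_request(webhook_url, text)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.status
```

Keeping request construction separate from sending makes the payload easy to inspect without contacting Slack.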

LLM API errors

Error: openai.AuthenticationError or anthropic.APIError

Fix:

Verify API key is correct in .env
Check MODEL_BACKEND matches the provider:

# For OpenAI
MODEL_BACKEND=openai
OPENAI_API_KEY=sk-proj-...

# For Anthropic
MODEL_BACKEND=anthropic
ANTHROPIC_API_KEY=sk-ant-...

Database connection refused

Error: sqlalchemy.exc.OperationalError: could not connect to server

Fix: Ensure PostgreSQL is healthy:

docker compose logs postgres
docker compose restart postgres

ClickHouse query failures

Error: ClickHouseError: Code: 60. DB::Exception: Table doesn't exist

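Fix: Reinitialize ClickHouse schema. Note that down -v removes every named volume declared in docker-compose.yml, so any other locally persisted data is reset as well: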

docker compose down -v
make up
make seed

Frontend build errors

Error: Module not found: Can't resolve...

Fix: Rebuild UI container:

docker compose build ui
docker compose up -d ui

Roadmap / Future Improvements

Add support for multi-region topology analysis
Implement automated rollback recommendations based on deployment history
Add custom metric anomaly detection using time-series forecasting
Support additional LLM providers (Cohere, Mistral)
Implement agent fine-tuning on historical incident data
Add Grafana integration for real-time dashboard embedding
Support Kubernetes pod-level topology mapping
Implement confidence calibration for hypothesis scoring
Add A/B testing framework for agent prompt optimization
Create Terraform modules for cloud deployment (AWS, GCP, Azure)

License

TBD
