shreshta-p/opsagent-mesh

OpsAgent Mesh

Multi-agent SRE copilot system for automated production incident analysis and root-cause diagnosis.

What It Does

  • Ingests alerts via webhook and automatically triages production incidents
  • Coordinates specialized LLM agents to analyze logs, metrics, and service topology
  • Generates root-cause hypotheses with confidence scores using observability data
  • Delivers structured incident summaries to Slack with actionable insights

Architecture / Key Components

┌──────────────┐   HTTP    ┌───────────────┐  LangGraph  ┌────────────────────┐
│ Alert Source │ ────────> │  FastAPI      │ ──────────> │  Agent Mesh        │
│  (webhook)   │           │  /ingest/alert│             │  Orchestrator      │
└──────────────┘           └───────────────┘             └──────────┬─────────┘
                                                                    │
        ┌─────────────────┬─────────────────┬─────────────────┬─────┴───────┐
        │                 │                 │                 │             │
        v                 v                 v                 v             v
   ┌──────────┐      ┌──────────┐      ┌──────────┐      ┌──────────┐  ┌─────────┐
   │ Sentinel │      │Topologist│      │Hypothesis│      │  Comms   │  │   UI    │
   │  Agent   │      │  Agent   │      │  Agent   │      │  Agent   │  │(Next.js)│
   └────┬─────┘      └────┬─────┘      └────┬─────┘      └────┬─────┘  └────┬────┘
        │                 │                 │                 │             │
        v                 v                 v                 v             v
   ┌─────────────────────────────────────────────────────────────────────────────┐
   │                                 Data Layer                                  │
   │   ┌───────────┐        ┌───────────┐        ┌────────┐                      │
   │   │ Postgres  │        │ ClickHouse│        │ Redis  │                      │
   │   │(incidents)│        │  (logs)   │        │(cache) │                      │
   │   └───────────┘        └───────────┘        └────────┘                      │
   └─────────────────────────────────────────────────────────────────────────────┘

Agent Pipeline:

  1. Sentinel Agent: Deduplicates incoming alerts and seeds an initial blast-radius hypothesis
  2. Topologist Agent: Queries the service dependency graph and identifies impacted downstream services
  3. Hypothesis Agent: Uses an LLM to correlate logs and metrics into root-cause theories
  4. Comms Agent: Formats the findings and posts a structured summary to Slack
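
The four steps above can be sketched as a chain of functions that pass a shared state along. This is only an illustration in plain Python, not the project's LangGraph workflow in orchestrator/graph.py: the state fields, the in-memory fingerprint set (standing in for a Redis cache), the checkout-to-payment edge, and the placeholder hypothesis are all assumptions made for the example.

```python
from typing import Callable, TypedDict


class IncidentState(TypedDict, total=False):
    """Shared state handed from agent to agent (field names are hypothetical)."""
    alert: dict
    duplicate: bool
    impacted: list[str]
    hypotheses: list[dict]
    summary: str


# Stand-in for a Redis-backed dedup cache (assumption for this sketch).
_seen_fingerprints: set[str] = set()


def sentinel(state: IncidentState) -> IncidentState:
    """Flag repeated fingerprints and seed the blast radius with the alerting service."""
    fingerprint = state["alert"]["fingerprint"]
    state["duplicate"] = fingerprint in _seen_fingerprints
    _seen_fingerprints.add(fingerprint)
    state["impacted"] = [state["alert"]["service"]]
    return state


def topologist(state: IncidentState) -> IncidentState:
    """Extend the blast radius using a made-up, hard-coded dependency edge."""
    downstream = {"checkout": ["payment"]}  # hypothetical topology
    for service in list(state["impacted"]):
        state["impacted"] += downstream.get(service, [])
    return state


def hypothesis(state: IncidentState) -> IncidentState:
    """A real agent would call an LLM over logs and metrics; this returns a placeholder."""
    state["hypotheses"] = [{"cause": "placeholder hypothesis", "confidence": 0.5}]
    return state


def comms(state: IncidentState) -> IncidentState:
    """Summarise the highest-confidence hypothesis as one line of text."""
    top = max(state["hypotheses"], key=lambda h: h["confidence"])
    service = state["alert"]["service"]
    state["summary"] = f"{service}: {top['cause']} ({top['confidence']:.0%})"
    return state


PIPELINE: list[Callable[[IncidentState], IncidentState]] = [
    sentinel, topologist, hypothesis, comms,
]


def run_pipeline(alert: dict) -> IncidentState:
    """Run the agents in order; duplicates stop right after the Sentinel step."""
    state: IncidentState = {"alert": alert}
    for step in PIPELINE:
        state = step(state)
        if state.get("duplicate"):
            break
    return state
```

Stopping early on duplicates is one possible design; the actual orchestrator may route duplicates differently.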

Tech Stack

  • Language: Python 3.11
  • API Framework: FastAPI
  • Agent Framework: LangGraph (multi-agent orchestration, part of the LangChain ecosystem)
  • LLM Providers: OpenAI (gpt-4o by default) or Anthropic (Claude 3.5 Sonnet by default)
  • Databases: PostgreSQL (incidents), ClickHouse (logs), Redis (cache)
  • Frontend: Next.js 14, TypeScript, TailwindCSS
  • Observability: OpenTelemetry (distributed tracing)
  • Deployment: Docker, Docker Compose

Prerequisites

  • Docker 20.10+
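  • Docker Compose 1.29+ (the commands in this README use the Compose V2 form, docker compose; with the standalone 1.x binary, substitute docker-compose)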
  • OpenAI API key OR Anthropic API key (for LLM agent reasoning)
  • (Optional) Slack webhook URL for notifications

Quickstart

# Clone repository
git clone <repo-url>
cd opsagent-mesh

# Configure environment
cp .env.example .env
# Edit .env and add your OPENAI_API_KEY or ANTHROPIC_API_KEY

# Run full demo (builds, starts services, seeds data, runs evaluation)
make demo

What make demo does:

  1. Builds Docker images for API and UI
  2. Starts PostgreSQL, ClickHouse, Redis, OpenTelemetry Collector
  3. Seeds synthetic incident and log data
  4. Runs 3 evaluation scenarios and displays agent reasoning

Access points:

  • API: http://localhost:8000 (alert ingestion at /ingest/alert)
  • UI: http://localhost:3000

Configuration

Environment Variables

Create .env file from template:

cp .env.example .env

Required variables (set MODEL_BACKEND plus the API key for the chosen provider):

Variable            Description                               Example
MODEL_BACKEND       LLM provider (openai or anthropic)        openai
OPENAI_API_KEY      OpenAI API key (if using OpenAI)          sk-proj-...
ANTHROPIC_API_KEY   Anthropic API key (if using Anthropic)    sk-ant-...

Optional variables:

Variable            Description                        Default
SLACK_WEBHOOK_URL   Slack webhook for notifications    (none)
POSTGRES_URL        PostgreSQL connection string       postgresql://postgres:postgres@postgres:5432/opsmesh
CLICKHOUSE_URL      ClickHouse HTTP endpoint           http://clickhouse:8123
REDIS_URL           Redis connection string            redis://redis:6379/0
OPENAI_MODEL        OpenAI model name                  gpt-4o
ANTHROPIC_MODEL     Anthropic model name               claude-3-5-sonnet-20241022
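
As a rough illustration of how these variables fit together, the sketch below loads them from the environment, applies the documented defaults, and rejects a MODEL_BACKEND that has no matching API key. The Settings class and load_settings function are hypothetical names for this example; the project's own configuration code may look different.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Defaults copied from the optional-variables table above.
DEFAULTS = {
    "POSTGRES_URL": "postgresql://postgres:postgres@postgres:5432/opsmesh",
    "CLICKHOUSE_URL": "http://clickhouse:8123",
    "REDIS_URL": "redis://redis:6379/0",
    "OPENAI_MODEL": "gpt-4o",
    "ANTHROPIC_MODEL": "claude-3-5-sonnet-20241022",
}

# Which API key each backend needs.
API_KEY_VAR = {"openai": "OPENAI_API_KEY", "anthropic": "ANTHROPIC_API_KEY"}


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the resolved configuration."""
    model_backend: str
    api_key: str
    model_name: str
    postgres_url: str
    clickhouse_url: str
    redis_url: str
    slack_webhook_url: Optional[str]


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Resolve settings from `env` (defaults to os.environ), validating the backend."""
    env = os.environ if env is None else env
    backend = env.get("MODEL_BACKEND", "").strip().lower()
    if backend not in API_KEY_VAR:
        raise ValueError("MODEL_BACKEND must be 'openai' or 'anthropic'")

    key_var = API_KEY_VAR[backend]
    api_key = env.get(key_var, "")
    if not api_key:
        raise ValueError(f"{key_var} is required when MODEL_BACKEND={backend}")

    def get(name: str) -> str:
        # Fall back to the documented default when the variable is unset or empty.
        return env.get(name) or DEFAULTS[name]

    return Settings(
        model_backend=backend,
        api_key=api_key,
        model_name=get(f"{backend.upper()}_MODEL"),
        postgres_url=get("POSTGRES_URL"),
        clickhouse_url=get("CLICKHOUSE_URL"),
        redis_url=get("REDIS_URL"),
        slack_webhook_url=env.get("SLACK_WEBHOOK_URL") or None,
    )
```

Passing a plain dict instead of os.environ makes the check easy to exercise, for example load_settings({"MODEL_BACKEND": "openai", "OPENAI_API_KEY": "sk-proj-..."}).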

How to Run Tests

Manual Testing

# Start all services
make up

# Send test alert
curl -X POST http://localhost:8000/ingest/alert \
  -H "Content-Type: application/json" \
  -d '{
    "id": "alert-001",
    "service": "checkout",
    "severity": "high",
    "ts": "2024-08-01T12:00:00Z",
    "symptom": "p95 latency > 2s",
    "fingerprint": "checkout-latency-spike"
  }'

# View incident in UI (open is macOS-only; on Linux use xdg-open or visit the URL in a browser)
open http://localhost:3000
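
To make the shape of the alert payload explicit, here is a small validation sketch using only the standard library. It assumes that all six fields in the sample request are required and that ts is an ISO 8601 timestamp; the README does not state either, and the Alert class and parse_alert function are hypothetical names rather than the API's actual models.

```python
from dataclasses import dataclass
from datetime import datetime

# Field names taken from the sample curl payload above.
REQUIRED_FIELDS = ("id", "service", "severity", "ts", "symptom", "fingerprint")


@dataclass(frozen=True)
class Alert:
    """Hypothetical typed view of the /ingest/alert payload."""
    id: str
    service: str
    severity: str
    ts: datetime
    symptom: str
    fingerprint: str


def parse_alert(payload: dict) -> Alert:
    """Validate a decoded JSON payload and convert it into an Alert."""
    missing = [name for name in REQUIRED_FIELDS if not payload.get(name)]
    if missing:
        raise ValueError(f"missing alert fields: {', '.join(missing)}")

    # Normalise a trailing "Z" so the timestamp also parses on Python < 3.11.
    ts = datetime.fromisoformat(str(payload["ts"]).replace("Z", "+00:00"))

    fields = {name: str(payload[name]) for name in REQUIRED_FIELDS if name != "ts"}
    return Alert(ts=ts, **fields)
```

Called with the sample payload, parse_alert returns an Alert whose ts is a timezone-aware datetime in UTC.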

Automated Evaluation

# Run evaluation harness (3 scenarios)
make eval

# Expected output:
# Scenario 1: Database connection pool exhaustion (latency spike)
# Scenario 2: Cascading failure from upstream auth service
# Scenario 3: Memory leak causing OOM in payment service

Seed Test Data

# Load synthetic logs and topology
make seed

Run Unit Tests

docker compose exec api python -m pytest tests/

Project Structure

opsagent-mesh/
├── Makefile                    # Build and orchestration commands
├── docker-compose.yml          # Service definitions
├── requirements.txt            # Python dependencies
├── .env.example                # Environment template
├── apps/
│   ├── api/                    # FastAPI backend
│   │   ├── main.py             # API routes, /ingest/alert endpoint
│   │   ├── database.py         # SQLAlchemy models
│   │   ├── models.py           # ORM models (Incident, Finding)
│   │   └── seed_synthetic.py   # Synthetic data generator
│   └── ui/                     # Next.js frontend
│       ├── pages/
│       │   ├── index.tsx       # Incident list dashboard
│       │   └── incidents/[id].tsx  # Incident detail view
│       └── package.json
├── orchestrator/
│   ├── graph.py                # LangGraph workflow definition
│   ├── agents/
│   │   ├── sentinel.py         # Alert deduplication agent
│   │   ├── topologist.py       # Dependency graph agent
│   │   ├── hypothesis.py       # Root-cause reasoning agent
│   │   └── comms.py            # Slack notification agent
│   └── tools/
│       ├── metrics.py          # ClickHouse query tool
│       ├── tracing.py          # Distributed trace analysis
│       ├── logs.py             # Log correlation tool
│       └── model.py            # LLM wrapper
├── infra/
│   ├── postgres/init.sql       # Schema initialization
│   ├── clickhouse/init.sql     # Log table schema
│   └── otel-collector.yaml     # OpenTelemetry config
├── data/synthetic/
│   ├── topology.json           # Service dependency graph
│   ├── metrics.json            # Sample metrics data
│   └── checkout_logs.parquet   # Synthetic log data
└── eval/
    └── run.py                  # Evaluation harness
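
The tree lists data/synthetic/topology.json as the service dependency graph that the Topologist agent works from. The README does not show that file's schema, so the sketch below assumes a simple mapping from each service to the services directly downstream of it, and walks it breadth-first to find everything an incident could reach. The function name and the example edges are hypothetical.

```python
from collections import deque


def impacted_services(downstream: dict[str, list[str]], start: str) -> list[str]:
    """Return every service reachable from `start`, in breadth-first order.

    `downstream` maps a service to the services directly downstream of it
    (an assumed format, not necessarily the one used by topology.json).
    """
    seen = {start}
    order: list[str] = []
    queue = deque([start])
    while queue:
        service = queue.popleft()
        for nxt in downstream.get(service, []):
            if nxt not in seen:  # guards against cycles and repeated visits
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


# Made-up example graph using service names that appear in this README.
EXAMPLE_GRAPH = {
    "auth": ["checkout", "search"],
    "checkout": ["payment"],
    "payment": [],
}
```

With the example graph, impacted_services(EXAMPLE_GRAPH, "auth") visits checkout and search first, then payment. The seen set keeps the walk finite even if the graph contains cycles.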

Troubleshooting

Services won't start

Error: docker: Error response from daemon: Conflict

Fix: Clean up existing containers and volumes:

make clean
make setup
make up

No Slack notifications

Error: Agent completes but no Slack message received

Fix:

  1. Verify SLACK_WEBHOOK_URL is set in .env
  2. Test webhook manually:
curl -X POST "$SLACK_WEBHOOK_URL" \
  -H 'Content-Type: application/json' \
  -d '{"text":"Test message"}'
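
The same webhook check can be written in Python with the standard library, which may be handy when debugging from inside the API container. This is a sketch and not the Comms agent's actual code: build_slack_request and send are hypothetical helpers, and the only Slack detail relied on is that an incoming webhook accepts a JSON body with a text field.

```python
import json
import urllib.request


def build_slack_request(webhook_url: str, text: str) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying a Slack text message."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 10.0) -> int:
    """Send the request and return the HTTP status code.

    urlopen raises urllib.error.HTTPError for 4xx/5xx responses, so a
    returned value means the server accepted the request.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Typical use would be send(build_slack_request(os.environ["SLACK_WEBHOOK_URL"], "Test message")) after importing os. Keeping the build step separate from the network call makes the payload easy to inspect without contacting Slack.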

LLM API errors

Error: openai.AuthenticationError or anthropic.APIError

Fix:

  1. Verify API key is correct in .env
  2. Check MODEL_BACKEND matches the provider:
# For OpenAI
MODEL_BACKEND=openai
OPENAI_API_KEY=sk-proj-...

# For Anthropic
MODEL_BACKEND=anthropic
ANTHROPIC_API_KEY=sk-ant-...

Database connection refused

Error: sqlalchemy.exc.OperationalError: could not connect to server

Fix: Ensure PostgreSQL is healthy:

docker compose logs postgres
docker compose restart postgres

ClickHouse query failures

Error: ClickHouseError: Code: 60. DB::Exception: Table doesn't exist

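Fix: Reinitialize the ClickHouse schema. Note that down -v removes the stack's volumes, so any data persisted by the other services (for example PostgreSQL incidents) is deleted as well: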

docker compose down -v
make up
make seed

Frontend build errors

Error: Module not found: Can't resolve...

Fix: Rebuild UI container:

docker compose build ui
docker compose up -d ui

Roadmap / Future Improvements

  • Add support for multi-region topology analysis
  • Implement automated rollback recommendations based on deployment history
  • Add custom metric anomaly detection using time-series forecasting
  • Support additional LLM providers (Cohere, Mistral)
  • Implement agent fine-tuning on historical incident data
  • Add Grafana integration for real-time dashboard embedding
  • Support Kubernetes pod-level topology mapping
  • Implement confidence calibration for hypothesis scoring
  • Add A/B testing framework for agent prompt optimization
  • Create Terraform modules for cloud deployment (AWS, GCP, Azure)

License

TBD
