
DevIR

An autonomous, graph-based AI agent that diagnoses runtime errors like a Junior SRE.

Python 3.9+ Licence: MIT LangGraph Docker Tests


The Problem

When prod goes down at 2am, engineers face the same repetitive process:

┌──────────────────┐   ┌────────────────┐   ┌────────────────────┐
│ Read stack trace │──>│ Diagnose issue │──>│ Google frantically │
└──────────────────┘   └────────────────┘   └────────────────────┘
         │
         ▼
┌──────────────────┐   ┌───────────────┐   ┌──────┐   ┌────────┐
│  Check the code  │──>│ Implement fix │──>│ Test │──>│ Deploy │
└──────────────────┘   └───────────────┘   └──────┘   └────────┘
         │
         ▼
   Still broken? ──> Loop back to the start

This workflow is predictable enough to automate, yet complex enough that simple scripts fail.

Traditional AI assistants treat this as a single-shot problem. You paste an error, get a response, and hope it fixes everything. But real debugging is iterative: you search, find partial answers, refine your understanding, and try again.

Project Overview

This project introduces a state machine architecture for incident response. Instead of a black-box agent, you get an observable, controllable workflow where every step is explicit. The agent can loop back to gather more information when its confidence is low, and it pauses for human approval before executing anything destructive.

Quick Start

from src.graph import IncidentResponder

# Initialise the agent
responder = IncidentResponder(verbose=True)

# Feed it an error
error_log = """
psycopg2.OperationalError: could not connect to server: Connection refused
    Is the server running on host "localhost" and accepting TCP/IP connections on port 5432?
"""

# Run the investigation
result = responder.investigate(error_log)

# Get the solution
print(result["proposed_solution"])
print(result["solution_steps"])

Or use the CLI:

# Interactive mode with sample errors
python -m src.main

# Direct error input
python -m src.main --error "OOMKilled: Container exceeded memory limit"

# From a log file
python -m src.main --file error.log

Architecture

flowchart TB
    subgraph Input
        A[Error Log] --> B[Diagnostician]
    end

    subgraph Investigation
        B --> C[Webscraper]
        C --> D[Code Auditor]
        D --> E[Solver]
    end

    subgraph Control Flow
        E -->|confidence < 0.3| C
        E -->|needs approval| F[Human Approval]
        E -->|ready| G[Output]
        F -->|approved| G
        F -->|rejected| H[Abort]
    end

    subgraph Tools
        C -.->|Tavily API| I[(Web Search)]
        D -.->|File Reader| J[(Codebase)]
    end

    style B fill:#e1f5fe
    style C fill:#fff3e0
    style D fill:#f3e5f5
    style E fill:#e8f5e9
    style F fill:#ffebee

classDiagram
    IncidentResponder --> StateGraph
    IncidentResponder --> NodeFactory
    NodeFactory --> BaseLLM
    NodeFactory --> TavilySearchTool
    NodeFactory --> FileReaderTool
    StateGraph --> AgentState

    class IncidentResponder {
        +llm: BaseLLM
        +search_tool: TavilySearchTool
        +file_tool: FileReaderTool
        +investigate(error_log) AgentState
    }

    class StateGraph {
        +nodes: Dict~str, Callable~
        +edges: List~Tuple~
        +compile() CompiledGraph
    }

    class AgentState {
        +error_log: str
        +error_type: str
        +research_findings: List~str~
        +proposed_solution: str
        +solution_confidence: float
        +needs_human_approval: bool
    }

    class NodeFactory {
        +diagnostician(state) Dict
        +webscraper(state) Dict
        +code_auditor(state) Dict
        +solver(state) Dict
        +human_approval(state) Dict
    }

    class BaseLLM {
        <<abstract>>
        +generate(prompt, system_prompt) str
    }

    class TavilySearchTool {
        +search(query) List~Result~
    }

    class FileReaderTool {
        +read_file(path) str
        +find_files(patterns) List~str~
    }

Key Concepts

State Machine Architecture

Unlike black-box agents, where you pray to the Cursor God after every prompt in hopes that the AI figures it out, this project uses LangGraph to define an explicit state machine. Every node is a function that reads from and writes to a typed state object. Every edge is a deliberate transition. You can trace exactly how information flows through the system.

from typing import List, TypedDict

class AgentState(TypedDict):
    error_log: str                # The input
    error_type: str               # Diagnosis result
    research_findings: List[str]  # What we learned
    proposed_solution: str        # The fix
    solution_confidence: float    # How sure we are (0.0 - 1.0)
    needs_human_approval: bool    # Safety flag

Cyclical Graph Flows

Real debugging is an iterative process. If a low-confidence answer is produced, the graph loops back to the Webscraper with a more refined query. This self-correction mechanism continues until confidence exceeds the threshold, max iterations are reached, or the proposed solution is accepted.

def check_solution_confidence(state: AgentState) -> str:
    if state["solution_confidence"] < 0.3:
        return "research"  # Loop back
    if state["needs_human_approval"]:
        return "human_approval"  # Checkpoint
    return "end"  # Done
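The router above omits the iteration cap mentioned earlier. A sketch of a bounded version, assuming a hypothetical iteration_count field that the real AgentState may track differently:

```python
from typing import TypedDict

MAX_ITERATIONS = 3  # assumption: the actual cap may live in graph config


class RouterState(TypedDict):
    solution_confidence: float
    needs_human_approval: bool
    iteration_count: int  # hypothetical field, not in the README's AgentState


def route(state: RouterState) -> str:
    # Loop back only while confidence is low AND the retry budget remains.
    if state["solution_confidence"] < 0.3 and state["iteration_count"] < MAX_ITERATIONS:
        return "research"
    if state["needs_human_approval"]:
        return "human_approval"
    return "end"
```

Once the budget is exhausted, the graph falls through to "end" rather than looping forever.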

Human-in-the-Loop

The agent pauses before executing destructive operations. If the proposed solution involves DROP TABLE, rm -rf, or similar commands, the workflow routes to a human approval checkpoint. The engineer sees exactly what the agent wants to do and why, and can approve or abort the action.

HUMAN APPROVAL REQUIRED
-----------------------------
The agent wants to perform:
  DROP TABLE users; -- Recreate with correct schema

Type 'yes' to approve or 'no' to abort:
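The destructive-command check that sets the safety flag can be sketched like this (the pattern list and function name are illustrative, not this repo's exact code):

```python
import re

# Illustrative patterns; the real agent may flag more operations.
DESTRUCTIVE_PATTERNS = [
    r"\bDROP\s+TABLE\b",
    r"\brm\s+-rf\b",
    r"\bDELETE\s+FROM\b",
    r"\bkubectl\s+delete\b",
]


def needs_approval(solution: str) -> bool:
    """Return True if the proposed solution contains a destructive command."""
    return any(re.search(p, solution, re.IGNORECASE) for p in DESTRUCTIVE_PATTERNS)
```

The solver node would set needs_human_approval from a check like this, and the approval node then blocks on user input before anything runs.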

Tool Integration

The agent has two primary tools:

  1. TavilySearchTool: Searches the web for StackOverflow answers, documentation, and blog posts. Tavily is optimised for AI agents and provides clean, structured results. As of writing, it offers a free plan with a limited number of monthly API calls, perfect for personal projects.

  2. FileReaderTool: Safely reads files from your codebase. Security guards block access to .env, secrets, and other sensitive files.
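The FileReaderTool guard might look roughly like the following; the blocked names and path-resolution logic here are assumptions, not the repo's exact implementation:

```python
from pathlib import Path

BLOCKED_NAMES = {".env", "secrets", "credentials", "id_rsa"}  # illustrative


def is_safe_path(path: str, root: str = ".") -> bool:
    """Reject sensitive files and paths that escape the project root."""
    resolved = Path(root, path).resolve()
    root_resolved = Path(root).resolve()
    # Block path traversal out of the project root.
    if root_resolved not in resolved.parents and resolved != root_resolved:
        return False
    # Block known-sensitive filenames anywhere in the path.
    return not any(part in BLOCKED_NAMES for part in resolved.parts)
```

Resolving the path before checking it is what defeats `../../` traversal tricks.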

Framework Agnostic Design

The LLM interface is abstracted behind BaseLLM. Swap providers without changing the graph logic. The default is Google Gemini (free tier since I'm just a broke uni student), but the architecture supports any LLM that can follow structured prompts.
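Per the class diagram, the interface boils down to a single generate(prompt, system_prompt) method; the stub subclass below is illustrative, not part of the repo:

```python
from abc import ABC, abstractmethod


class BaseLLM(ABC):
    """Provider-agnostic interface; the graph depends only on this."""

    @abstractmethod
    def generate(self, prompt: str, system_prompt: str = "") -> str:
        ...


class FakeLLM(BaseLLM):
    """Canned-response stub, handy for tests and offline demos."""

    def __init__(self, reply: str) -> None:
        self.reply = reply

    def generate(self, prompt: str, system_prompt: str = "") -> str:
        return self.reply
```

Swapping Gemini for another provider means writing one more subclass; the graph logic never changes.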

Installation

git clone https://github.com/bipplane/devops-incident-responder.git
cd devops-incident-responder

# Create virtual environment
python -m venv venv
venv\Scripts\activate  # Windows
# source venv/bin/activate  # Linux/Mac

# Install dependencies
pip install -r requirements.txt

Configure API Keys

  1. Copy the example environment file:
copy .env.example .env  # Windows
# cp .env.example .env  # Linux/Mac
  2. Edit .env and add your API keys:
# Required - Google Gemini (FREE)
# Get your key at: https://aistudio.google.com/app/apikey
GOOGLE_API_KEY=your_gemini_api_key

# Required - Tavily Search (FREE - 1000 searches/month)
# Get your key at: https://tavily.com
TAVILY_API_KEY=your_tavily_api_key

Note: The application automatically loads the .env file on startup using python-dotenv. Make sure the .env file is in the project root directory.

Run the Agent

python -m src.main

Docker

Run the agent in a container without installing Python dependencies locally.

Quick Start

# Build the image
docker build -t incident-responder .

# Run with an error (pass API keys via .env file)
docker run --env-file .env incident-responder --error "Connection refused on port 5432"

# Interactive mode
docker run -it --env-file .env incident-responder

Using Docker Compose

# Run the agent
docker compose run --rm agent

# Run with arguments
docker compose run --rm agent --error "OOMKilled: Container exceeded memory limit"

# Run tests in container
docker compose run --rm test

Image Details

| Property | Value |
|---|---|
| Base Image | python:3.11-slim |
| Image Size | ~340MB |
| User | Non-root (appuser) |
| Health Check | Verifies imports on startup |

The Dockerfile uses a multi-stage build to keep the final image lean. Build dependencies (gcc) are discarded after installing Python packages.

Testing Strategy

The project includes a comprehensive test suite demonstrating production-grade testing practices.

# Run all tests
pytest tests/ -v

# Run specific test class
pytest tests/test_agent.py::TestNodesMocked -v

| Test Category | Description |
|---|---|
| Unit Tests | State management, JSON/text parsing, routing logic |
| LLM Resilience | Handles "LLM drift": markdown blocks, thinking tags, malformed output |
| Mocked Nodes | Full node testing without API calls using unittest.mock |
| Security | Path traversal blocking, .env access prevention, credential protection |
| Integration | End-to-end workflow (skipped in CI/CD to preserve API credits) |

Why This Matters

Most AI projects skip testing because "the LLM is unpredictable." This project proves you can, in fact, isolate and test every layer:

  • Parse layer: Does parse_json_response handle garbage?
  • Logic layer: Does the routing function send low-confidence to refinement?
  • Security layer: Can the agent read .env files? (It cannot.)
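A mocked-node test in that spirit looks roughly like this (the toy node and JSON shape below are illustrative; the real nodes live in NodeFactory):

```python
import json
from unittest.mock import MagicMock


def diagnostician(llm, state):
    """Toy node mirroring the real one: ask the LLM, parse, return a state update."""
    raw = llm.generate(f"Classify this error:\n{state['error_log']}")
    return {"error_type": json.loads(raw)["error_type"]}


# The mock stands in for Gemini, so the test costs zero API credits.
mock_llm = MagicMock()
mock_llm.generate.return_value = '{"error_type": "database_connection"}'

update = diagnostician(mock_llm, {"error_log": "psycopg2.OperationalError: ..."})
```

The assertion then checks the parsed state update and that the LLM was called exactly once, without any network traffic.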

Project Structure

devops-incident-responder/
├── src/
│   ├── main.py           # CLI entry point
│   ├── graph.py          # LangGraph workflow definition
│   ├── nodes.py          # Node functions (the workers)
│   ├── state.py          # AgentState TypedDict
│   ├── prompts.py        # All LLM prompts
│   ├── llm.py            # LLM interface (Gemini)
│   └── tools/
│       ├── search_tool.py   # Tavily integration
│       └── file_tool.py     # Safe file reader
├── tests/
│   └── test_agent.py
├── .env.example
├── requirements.txt
└── README.md

Extending the Agent

Adding a New Node

Nodes are just functions that take state and return updates. Add your node to the factory, then wire it into the graph.

# In nodes.py
def log_pattern_analyser(self, state: AgentState) -> Dict[str, Any]:
    """Analyse logs for common patterns."""
    patterns = self._detect_patterns(state["error_log"])
    return {"detected_patterns": patterns}

# In graph.py
workflow.add_node("analyse_patterns", factory.log_pattern_analyser)
workflow.add_edge("diagnose", "analyse_patterns")
workflow.add_edge("analyse_patterns", "research")

Adding a New Tool

Tools are independent classes. Create one, inject it into the NodeFactory, and use it in your nodes.

# In tools/kubectl_tool.py
class KubectlTool:
    def get_pod_logs(self, pod_name: str, namespace: str = "default") -> str:
        # Implementation
        pass

    def describe_pod(self, pod_name: str) -> str:
        # Implementation
        pass

Sample Errors

The interactive mode includes pre-configured sample errors for testing:

| Error | Type | Root Cause |
|---|---|---|
| PostgreSQL Connection Timeout | Database | Docker networking misconfiguration |
| Docker OOMKilled | Infrastructure | Memory limits exceeded |
| Kubernetes CrashLoopBackOff | Deployment | Missing build artifact |
| AWS Lambda Timeout | Cloud | Cold start + slow database query |
| NGINX 502 Bad Gateway | Networking | Backend service unreachable |

Why This Architecture?

Compared to Single-Shot Prompting

Single-shot prompting gives you one chance. If the AI misunderstands, you start over. This state machine allows the agent to iterate, gather more context, and refine its answer.

Compared to Agent Frameworks (CrewAI, AutoGen)

Higher-level frameworks abstract away the control flow. You define agents and hope they collaborate. With LangGraph, you define the exact graph. Every edge is intentional. Every loop is explicit. This makes debugging the agent itself much easier.

Compared to RAG Pipelines

RAG is retrieval-focused. This is action-focused. The agent doesn't just find information, it also synthesises a solution, proposes code changes, and optionally executes them with human oversight.

Licence

MIT Licence - use this for your portfolio, interviews, or production systems.


Learning Outcomes

My takeaways from building this project:

  • Mocked node tests prove that the internal logic works without making API calls (CI/CD-friendly)
  • Security tests (tool safety) demonstrate DevSecOps awareness, preventing leakage of sensitive info such as .env files or secrets
  • API credits are precious, so integration tests are skipped in CI to preserve them

Built as a portfolio project demonstrating production-grade AI agent architecture.
