An autonomous, graph-based AI agent that diagnoses runtime errors like a Junior SRE.
When prod goes down at 2am, engineers face the same repetitive process:
```
┌────────────────────────────────────────────────────────────────┐
│                                                                │
▼                                                                │
Read stack trace ──> Diagnose issue ──> Google frantically       │
        │                                                        │
        ▼                                                        │
Check the code ──> Implement fix ──> Test ──> Deploy             │
        │                                                        │
        ▼                                                        │
Still broken? Loop back to start ────────────────────────────────┘
```
This workflow is predictable enough to automate, yet complex enough that simple scripts fail.
Traditional AI assistants treat this as a single-shot problem. You paste an error, get a response, and hope the problem is solved in one go. But real debugging is iterative: you search, find partial answers, refine your understanding, and try again.
This project introduces a state-machine architecture for incident response. Instead of a black-box agent, you get an observable, controllable workflow where every step is explicit. The agent can loop back to gather more information when its confidence is low, and pause for human approval before executing anything destructive.
```python
from src.graph import IncidentResponder

# Initialise the agent
responder = IncidentResponder(verbose=True)

# Feed it an error
error_log = """
psycopg2.OperationalError: could not connect to server: Connection refused
Is the server running on host "localhost" and accepting TCP/IP connections on port 5432?
"""

# Run the investigation
result = responder.investigate(error_log)

# Get the solution
print(result["proposed_solution"])
print(result["solution_steps"])
```

Or use the CLI:
```bash
# Interactive mode with sample errors
python -m src.main

# Direct error input
python -m src.main --error "OOMKilled: Container exceeded memory limit"

# From a log file
python -m src.main --file error.log
```

```mermaid
flowchart TB
    subgraph Input
        A[Error Log] --> B[Diagnostician]
    end
    subgraph Investigation
        B --> C[Webscraper]
        C --> D[Code Auditor]
        D --> E[Solver]
    end
    subgraph Control Flow
        E -->|confidence < 0.3| C
        E -->|needs approval| F[Human Approval]
        E -->|ready| G[Output]
        F -->|approved| G
        F -->|rejected| H[Abort]
    end
    subgraph Tools
        C -.->|Tavily API| I[(Web Search)]
        D -.->|File Reader| J[(Codebase)]
    end
    style B fill:#e1f5fe
    style C fill:#fff3e0
    style D fill:#f3e5f5
    style E fill:#e8f5e9
    style F fill:#ffebee
```
```mermaid
classDiagram
    IncidentResponder --> StateGraph
    IncidentResponder --> NodeFactory
    NodeFactory --> BaseLLM
    NodeFactory --> TavilySearchTool
    NodeFactory --> FileReaderTool
    StateGraph --> AgentState
    class IncidentResponder {
        +llm: BaseLLM
        +search_tool: TavilySearchTool
        +file_tool: FileReaderTool
        +investigate(error_log) AgentState
    }
    class StateGraph {
        +nodes: Dict~str, Callable~
        +edges: List~Tuple~
        +compile() CompiledGraph
    }
    class AgentState {
        +error_log: str
        +error_type: str
        +research_findings: List~str~
        +proposed_solution: str
        +solution_confidence: float
        +needs_human_approval: bool
    }
    class NodeFactory {
        +diagnostician(state) Dict
        +webscraper(state) Dict
        +code_auditor(state) Dict
        +solver(state) Dict
        +human_approval(state) Dict
    }
    class BaseLLM {
        <<abstract>>
        +generate(prompt, system_prompt) str
    }
    class TavilySearchTool {
        +search(query) List~Result~
    }
    class FileReaderTool {
        +read_file(path) str
        +find_files(patterns) List~str~
    }
```
Unlike black-box agents, where you pray to the Cursor God after every prompt in hopes that the AI figures it out, this project uses LangGraph to define an explicit state machine. Every node is a function that reads from and writes to a typed state object. Every edge is a deliberate transition. You can trace exactly how information flows through the system.
```python
from typing import List, TypedDict

class AgentState(TypedDict):
    error_log: str                # The input
    error_type: str               # Diagnosis result
    research_findings: List[str]  # What we learned
    proposed_solution: str        # The fix
    solution_confidence: float    # How sure we are (0.0 - 1.0)
    needs_human_approval: bool    # Safety flag
```

Real debugging is undoubtedly an iterative process. If a low-confidence answer is produced, the graph loops back to the Webscraper with a more refined query for better results. This self-correction mechanism continues until one of three things happens: confidence exceeds the threshold, max iterations are reached, or the proposed solution is accepted.
```python
def check_solution_confidence(state: AgentState) -> str:
    if state["solution_confidence"] < 0.3:
        return "research"        # Loop back
    if state["needs_human_approval"]:
        return "human_approval"  # Checkpoint
    return "end"                 # Done
```

The agent pauses before executing destructive operations. If the proposed solution involves `DROP TABLE`, `rm -rf`, or similar commands, the workflow routes to a human approval checkpoint. The engineer sees exactly what the agent wants to do and why, and can approve or abort the action.
```
HUMAN APPROVAL REQUIRED
-----------------------------
The agent wants to perform:

    DROP TABLE users; -- Recreate with correct schema

Type 'yes' to approve or 'no' to abort:
```
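A first pass at spotting such commands can be a handful of regexes. The `requires_approval` helper and its deny-list below are hypothetical, not the project's actual rule set:

```python
import re

# Commands that should force a human checkpoint (illustrative list only)
DESTRUCTIVE_PATTERNS = [
    r"\bDROP\s+(TABLE|DATABASE)\b",
    r"\brm\s+-rf\b",
    r"\bDELETE\s+FROM\b",
    r"\bTRUNCATE\b",
    r"\bkubectl\s+delete\b",
]

def requires_approval(proposed_solution: str) -> bool:
    """True if the proposed fix contains a destructive command."""
    return any(re.search(p, proposed_solution, re.IGNORECASE)
               for p in DESTRUCTIVE_PATTERNS)
```

A matcher like this is deliberately over-cautious: a false positive costs one extra "yes", while a false negative could drop a table.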
The agent has two primary tools:

- **TavilySearchTool**: Searches the web for Stack Overflow answers, documentation, and blog posts. Tavily is optimised for AI agents and returns clean, structured results. As of writing, it has a free plan with a limited number of monthly API calls, perfect for personal projects.
- **FileReaderTool**: Safely reads files from your codebase, with security guards that block access to `.env`, secrets, and other sensitive files.
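That kind of guard can be sketched in a few lines of `pathlib`. The `is_safe_path` helper and its deny-list here are illustrative; the real FileReaderTool may differ:

```python
from pathlib import Path

# Filenames that must never be read (illustrative deny-list)
BLOCKED_NAMES = {".env", "secrets", "credentials", "id_rsa"}

def is_safe_path(path: str, root: str = ".") -> bool:
    """Reject files outside the project root or with sensitive names."""
    resolved = Path(path).resolve()
    root_resolved = Path(root).resolve()
    # Block path traversal: the file must live under the project root
    if root_resolved not in resolved.parents:
        return False
    # Block sensitive filenames anywhere in the path
    return not any(part in BLOCKED_NAMES for part in resolved.parts)
```

Resolving the path first is the important step: it normalises `..` segments before any check runs, so `src/../../.ssh/id_rsa` cannot sneak past a naive prefix comparison.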
The LLM interface is abstracted behind BaseLLM. Swap providers without changing the graph logic. The default is Google Gemini (free tier since I'm just a broke uni student), but the architecture supports any LLM that can follow structured prompts.
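A minimal sketch of that abstraction, assuming the `generate(prompt, system_prompt)` signature shown in the class diagram (the `EchoLLM` stand-in is hypothetical, not shipped with the project):

```python
from abc import ABC, abstractmethod

class BaseLLM(ABC):
    """Provider-agnostic LLM interface consumed by every node."""

    @abstractmethod
    def generate(self, prompt: str, system_prompt: str = "") -> str:
        """Return the model's text completion for the given prompts."""

class EchoLLM(BaseLLM):
    """Trivial offline provider, handy for tests and demos."""

    def generate(self, prompt: str, system_prompt: str = "") -> str:
        return f"[echo] {prompt}"
```

Any node that accepts a `BaseLLM` then runs unchanged against Gemini, a local model, or this echo stub.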
```bash
git clone https://github.com/bipplane/devops-incident-responder.git
cd devops-incident-responder

# Create virtual environment
python -m venv venv
venv\Scripts\activate        # Windows
# source venv/bin/activate   # Linux/Mac

# Install dependencies
pip install -r requirements.txt
```

Copy the example environment file:

```bash
copy .env.example .env     # Windows
# cp .env.example .env     # Linux/Mac
```

Edit `.env` and add your API keys:

```
# Required - Google Gemini (FREE)
# Get your key at: https://aistudio.google.com/app/apikey
GOOGLE_API_KEY=your_gemini_api_key

# Required - Tavily Search (FREE - 1000 searches/month)
# Get your key at: https://tavily.com
TAVILY_API_KEY=your_tavily_api_key
```

Note: The application automatically loads the `.env` file on startup using `python-dotenv`. Make sure the `.env` file is in the project root directory.
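For intuition, the core of what `python-dotenv` does can be sketched in a few lines. `load_env_file` is a simplified stand-in; the real library also handles quoting, `export` prefixes, and variable interpolation:

```python
import os

def load_env_file(path: str = ".env") -> None:
    """Minimal .env loader: KEY=VALUE lines; '#' comments and blanks ignored.
    Existing environment variables win, mirroring python-dotenv's default."""
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())
```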
```bash
python -m src.main
```

Run the agent in a container without installing Python dependencies locally.
```bash
# Build the image
docker build -t incident-responder .

# Run with an error (pass API keys via .env file)
docker run --env-file .env incident-responder --error "Connection refused on port 5432"

# Interactive mode
docker run -it --env-file .env incident-responder
```

```bash
# Run the agent
docker compose run --rm agent

# Run with arguments
docker compose run --rm agent --error "OOMKilled: Container exceeded memory limit"

# Run tests in container
docker compose run --rm test
```

| Property | Value |
|---|---|
| Base Image | python:3.11-slim |
| Image Size | ~340MB |
| User | Non-root (appuser) |
| Health Check | Verifies imports on startup |
The Dockerfile uses a multi-stage build to keep the final image lean. Build dependencies (gcc) are discarded after installing Python packages.
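A multi-stage layout along these lines achieves that (stage names and paths are illustrative; check the repository's actual Dockerfile):

```dockerfile
# Stage 1: build wheels with gcc available
FROM python:3.11-slim AS builder
RUN apt-get update && apt-get install -y --no-install-recommends gcc
COPY requirements.txt .
RUN pip wheel --no-cache-dir -r requirements.txt -w /wheels

# Stage 2: runtime image without build tools
FROM python:3.11-slim
RUN useradd --create-home appuser
COPY --from=builder /wheels /wheels
RUN pip install --no-cache-dir /wheels/* && rm -rf /wheels
WORKDIR /app
COPY src/ src/
USER appuser
ENTRYPOINT ["python", "-m", "src.main"]
```

Because the second `FROM` starts fresh, gcc and the apt cache from stage 1 never reach the final image.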
The project includes a comprehensive test suite demonstrating production-grade testing practices.
```bash
# Run all tests
pytest tests/ -v

# Run specific test class
pytest tests/test_agent.py::TestNodesMocked -v
```

| Test Category | Description |
|---|---|
| Unit Tests | State management, JSON/text parsing, routing logic |
| LLM Resilience | Handles "LLM drift": markdown blocks, thinking tags, malformed output |
| Mocked Nodes | Full node testing without API calls, using unittest.mock |
| Security | Path traversal blocking, .env access prevention, credential protection |
| Integration | End-to-end workflow (skipped in CI/CD to preserve API credits) |
Most AI projects skip testing because "the LLM is unpredictable." This project shows you can, in fact, isolate and test every layer:
- Parse layer: Does `parse_json_response` handle garbage?
- Logic layer: Does the routing function send low-confidence solutions back to refinement?
- Security layer: Can the agent read `.env` files? (It cannot.)
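The mocked-node idea in miniature, using only `unittest.mock` (the `diagnostician` function here is a simplified stand-in for the project's NodeFactory method):

```python
from unittest.mock import MagicMock

def diagnostician(llm, state):
    """Simplified stand-in node: ask the LLM to classify the error."""
    reply = llm.generate(f"Classify this error: {state['error_log']}")
    return {"error_type": reply.strip()}

# Deterministic fake LLM: no API calls, no credits burned, CI-friendly
fake_llm = MagicMock()
fake_llm.generate.return_value = "  DatabaseConnectionError  "

update = diagnostician(fake_llm, {"error_log": "Connection refused on port 5432"})
```

The assertions then check two separate things: that the node normalises the LLM's messy output, and that it called the LLM exactly once.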
```
devops-incident-responder/
├── src/
│   ├── main.py              # CLI entry point
│   ├── graph.py             # LangGraph workflow definition
│   ├── nodes.py             # Node functions (the workers)
│   ├── state.py             # AgentState TypedDict
│   ├── prompts.py           # All LLM prompts
│   ├── llm.py               # LLM interface (Gemini)
│   └── tools/
│       ├── search_tool.py   # Tavily integration
│       └── file_tool.py     # Safe file reader
├── tests/
│   └── test_agent.py
├── .env.example
├── requirements.txt
└── README.md
```
Nodes are just functions that take state and return updates. Add your node to the factory, then wire it into the graph.
```python
# In nodes.py
def log_pattern_analyser(self, state: AgentState) -> Dict[str, Any]:
    """Analyse logs for common patterns."""
    patterns = self._detect_patterns(state["error_log"])
    return {"detected_patterns": patterns}
```

```python
# In graph.py
workflow.add_node("analyse_patterns", factory.log_pattern_analyser)
workflow.add_edge("diagnose", "analyse_patterns")
workflow.add_edge("analyse_patterns", "research")
```

Tools are independent classes. Create one, inject it into the NodeFactory, and use it in your nodes.
```python
# In tools/kubectl_tool.py
class KubectlTool:
    def get_pod_logs(self, pod_name: str, namespace: str = "default") -> str:
        # Implementation
        pass

    def describe_pod(self, pod_name: str) -> str:
        # Implementation
        pass
```

The interactive mode includes pre-configured sample errors for testing:
| Error | Type | Root Cause |
|---|---|---|
| PostgreSQL Connection Timeout | Database | Docker networking misconfiguration |
| Docker OOMKilled | Infrastructure | Memory limits exceeded |
| Kubernetes CrashLoopBackOff | Deployment | Missing build artifact |
| AWS Lambda Timeout | Cloud | Cold start + slow database query |
| NGINX 502 Bad Gateway | Networking | Backend service unreachable |
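Errors like these can often be bucketed with plain pattern matching before any LLM call. A hypothetical first-pass classifier, not part of the project's actual Diagnostician:

```python
import re

# Ordered (pattern, category) pairs; first match wins. Illustrative only.
ERROR_PATTERNS = [
    (re.compile(r"OperationalError|Connection refused.*5432", re.I), "Database"),
    (re.compile(r"OOMKilled|memory limit", re.I), "Infrastructure"),
    (re.compile(r"CrashLoopBackOff", re.I), "Deployment"),
    (re.compile(r"Task timed out|Lambda", re.I), "Cloud"),
    (re.compile(r"502 Bad Gateway|upstream", re.I), "Networking"),
]

def classify(error_log: str) -> str:
    """Cheap, deterministic triage before spending an LLM call."""
    for pattern, category in ERROR_PATTERNS:
        if pattern.search(error_log):
            return category
    return "Unknown"
```

A cheap pre-filter like this narrows the Diagnostician's prompt and gives the Webscraper a better starting query.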
Single-shot prompting gives you one chance. If the AI misunderstands, you start over. This state machine allows the agent to iterate, gather more context, and refine its answer.
Higher-level frameworks abstract away the control flow. You define agents and hope they collaborate. With LangGraph, you define the exact graph. Every edge is intentional. Every loop is explicit. This makes debugging the agent itself much easier.
RAG is retrieval-focused. This is action-focused. The agent doesn't just find information, it also synthesises a solution, proposes code changes, and optionally executes them with human oversight.
MIT Licence - use this for your portfolio, interviews, or production systems.
My takeaways from building this project:
- Mock node tests prove the internal logic works without making API calls (CI/CD-friendly)
- Security tests (tool safety) demonstrate DevSecOps awareness, preventing leakage of sensitive information such as `.env` files or secrets
- API credits are precious, so integration tests are skipped in CI to preserve them
Built as a portfolio project demonstrating production-grade AI agent architecture.
