This project is a simple SRE AI Agent designed for beginners to understand Site Reliability Engineering (SRE), incident detection, and AIOps fundamentals.
The agent runs as a single Python file and simulates how SRE teams detect incidents, assess severity, and suggest remediation actions.
To demonstrate how an SRE-style system:
- Collects service metrics
- Detects incidents using simple rules
- Explains what went wrong
- Suggests safe remediation steps
This project focuses on thinking like an SRE, not complex tooling.
- Simulates metrics such as latency, error rate, CPU, and traffic
- Detects incidents based on SRE thresholds
- Classifies severity
- Provides remediation suggestions
- Runs in recommendation-only mode (no auto-fix)
Metrics β Incident Detection β Decision Logic β Human-Readable Output
- Python 3.8 or higher
- Basic understanding of command line
- Git (optional)
Check Python version:
python3 --version
Clone the repository:
git clone git@github.com:shakirbhattt/sre-ai-agent.git
cd sre-ai-agent
python main.py
Example Output
π Metrics Collected:
- latency_p95: 1.8s
- error_rate: 12%
- cpu: 40%
- traffic_rpm: 1200
π¨ INCIDENT DETECTED Severity: HIGH Reason: High error rate and latency breach
Suggested Actions:
- Rollback last deployment
- Check downstream dependencies
- Scale service temporarily