KIRA is an autonomous AIOps system that monitors simulated cloud infrastructure, detects anomalies using machine learning, identifies root causes, and automatically executes self-healing actions, all without human intervention.
The system implements a closed-loop architecture:
Simulate → Detect → Diagnose → Decide → Heal
Every component is a real, running service, not pseudocode or a notebook demo.

    +-----------------------------------------------------------+
    |                        KIRA System                        |
    |                                                           |
    |  +--------------+   POST /metrics    +--------------+     |
    |  |  Simulation  | -----------------> |   FastAPI    |     |
    |  |   Service    |                    |   Backend    |     |
    |  | simulator.py |                    |    :8000     |     |
    |  +--------------+                    +------+-------+     |
    |                                             |             |
    |                              +--------------+----+        |
    |                              v                   v        |
    |                        +-----------+       +-----------+  |
    |                        |  Anomaly  |       |  Storage  |  |
    |                        | Detection |       |  (deque)  |  |
    |                        +-----+-----+       +-----------+  |
    |                              |                            |
    |                              v                            |
    |                        +-----------+                      |
    |                        | Decision  |                      |
    |                        |  Engine   |                      |
    |                        +-----+-----+                      |
    |                              |                            |
    |                              v                            |
    |                        +-----------+                      |
    |                        |  Action   |                      |
    |                        |  Engine   |                      |
    |                        +-----------+                      |
    |                                                           |
    |  +--------------+                    +--------------+     |
    |  |    React     |                    |   Grafana    |     |
    |  |  Dashboard   |                    |  Dashboard   |     |
    |  |    :5173     |                    |    :3000     |     |
    |  +--------------+                    +--------------+     |
    +-----------------------------------------------------------+

Cloud Deployments:
- Backend API: Railway (kira-backend-production.up.railway.app)
- ML Model: HF Spaces (huggingface.co/spaces/Pushkar264/kira-anomaly-detector)
- Source Code: GitHub (github.com/PushkarKanjani/kira-aiops)

1. `simulator.py` generates realistic metrics every 3 seconds:
   - CPU, memory, latency, error rate, requests/sec
   - Injects faults: CPU spike, memory leak, crash loop, high error rate
   - Fault states persist for 8-18 ticks to mimic real incidents
2. Metrics are POSTed to `http://localhost:8000/metrics` as JSON.
3. The FastAPI backend receives the payload and validates it against a Pydantic schema.
4. The metric is stored in a rolling-window deque (last 30 readings per service).
5. Anomaly detection runs on the full window.
6. The decision engine evaluates the anomaly result against priority-ordered rules.
7. The action engine executes the appropriate self-healing action.
8. The full pipeline result is stored and returned as the JSON response.
9. The Prometheus Pushgateway receives metric updates every tick.
10. Prometheus scrapes the Pushgateway every 5 seconds.
11. Grafana queries Prometheus and refreshes its dashboard every 5 seconds.
12. The React dashboard polls the `/metrics` endpoint every 3 seconds.
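
The first stages of this flow (fault persistence in the simulator and the rolling window in the backend) can be sketched roughly as follows. This is a minimal illustration, not KIRA's actual code: `FaultState`, `make_metric`, `store`, and the `windows` dictionary are hypothetical names, and the metric values are only loosely based on the ranges quoted in this README.

```python
import random
from collections import defaultdict, deque

WINDOW_SIZE = 30  # last 30 readings per service, as described above

# Hypothetical in-memory store: one bounded deque per service name.
windows = defaultdict(lambda: deque(maxlen=WINDOW_SIZE))


class FaultState:
    """Tracks whether a fault is active and for how many more ticks."""

    def __init__(self, rng):
        self.rng = rng
        self.remaining = 0  # ticks left in the current fault, 0 = healthy

    def start(self):
        # Faults persist for 8-18 ticks, per the list above.
        self.remaining = self.rng.randint(8, 18)

    def tick(self):
        """Advance one tick; return True if this tick is faulty."""
        if self.remaining > 0:
            self.remaining -= 1
            return True
        return False


def make_metric(rng, cpu_spike):
    """Build one reading; the value ranges here are illustrative only."""
    cpu = rng.uniform(85, 98) if cpu_spike else rng.uniform(20, 40)
    return {
        "cpu_percent": round(cpu, 1),
        "memory_percent": round(rng.uniform(40, 55), 1),
    }


def store(service, metric):
    """Append to the service's rolling window; the oldest reading drops off."""
    windows[service].append(metric)
    return len(windows[service])


rng = random.Random(42)
fault = FaultState(rng)
fault.start()
for _ in range(40):
    store("kira-sim-node", make_metric(rng, cpu_spike=fault.tick()))
```

Because the deque is created with `maxlen=30`, the window never grows past 30 readings, so no explicit eviction logic is needed.
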
Two detection methods run on every tick:

Z-score detector:
- Computes the mean and standard deviation of each feature over the rolling window
- Flags features whose latest value lies more than 2.5 standard deviations from the window mean
- Works from tick 5 onwards
- Fast and per-feature; catches obvious spikes immediately

Isolation Forest detector:
- Trains a fresh model on the rolling window (30 samples, 5 features)
- Uses contamination=0.08 (expects 8% anomalies)
- Scores the latest point; a more negative score means more anomalous
- Kicks in after 20 samples
- Catches subtle correlated anomalies that the Z-score detector misses

Combined result:
- Either detector flagging means the tick is an anomaly
- Final score = max(zscore_score, iforest_score)
- Score range: 0.0 (normal) to 1.0 (certain anomaly)

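
The Z-score check and the max-based combination could look roughly like the sketch below. It uses only the standard library and makes several assumptions that the README does not confirm: the latest reading is compared against the earlier readings in the window, the population standard deviation is used, and the mapping from a z value to a 0-1 score in `zscore_score` is invented for illustration. The Isolation Forest score is taken as a plain input.

```python
from statistics import mean, pstdev

Z_THRESHOLD = 2.5  # flag beyond 2.5 standard deviations
MIN_SAMPLES = 5    # the Z-score check is active from tick 5 onwards


def zscore_flags(window, latest):
    """Return {feature: z} for features of `latest` beyond the threshold.

    `window` is a list of earlier readings (dicts of feature -> float).
    """
    if len(window) < MIN_SAMPLES:
        return {}
    flagged = {}
    for feature, value in latest.items():
        history = [row[feature] for row in window]
        sigma = pstdev(history)
        if sigma == 0:
            continue  # flat history: z is undefined, skip the feature
        z = abs(value - mean(history)) / sigma
        if z > Z_THRESHOLD:
            flagged[feature] = round(z, 2)
    return flagged


def zscore_score(flagged):
    """Squash the largest z into [0, 1]; this mapping is an assumption."""
    if not flagged:
        return 0.0
    return min(1.0, max(flagged.values()) / (2 * Z_THRESHOLD))


def combine(z_score, iforest_score):
    """Either detector can raise the final score: take the maximum."""
    return max(z_score, iforest_score)


window = [{"cpu_percent": c} for c in (30, 32, 28, 31, 29, 30)]
flags = zscore_flags(window, {"cpu_percent": 91.2})
final = combine(zscore_score(flags), 0.3)
```

Taking the maximum means a strong signal from one detector is never diluted by a quiet result from the other.
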
| Feature | Normal Range | Anomaly Range |
|---|---|---|
| cpu_percent | 20-40% | >80% |
| memory_percent | 40-55% | >80% |
| request_latency_ms | 60-100ms | >300ms |
| error_rate_percent | 0-2% | >10% |
| requests_per_second | 100-140 | <30 |
Pure rule-based classification, priority ordered:
| Priority | Condition | Action | Severity |
|---|---|---|---|
| 1 | restarts >= 3 AND error > 20% | restart | critical |
| 2 | cpu > 80% AND latency > 300ms | scale_up | high |
| 3 | memory > 80% | restart | high |
| 4 | error_rate > 10% | throttle | medium |
| 5 | latency > 450ms | alert | medium |
| 6 | anomaly but no rule matched | alert | low |
Rules are checked in order, and the first match wins. No action is taken if no anomaly is detected.
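
A first-match-wins rule table like the one above can be sketched as an ordered list of predicates. The thresholds are copied from the table; the `Rule` class, the `decide` function, and the `restarts` field name are hypothetical, and the metric keys follow the feature names used elsewhere in this README.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Rule:
    priority: int
    matches: Callable[[dict], bool]
    action: str
    severity: str


# Thresholds from the table above; `restarts` is a hypothetical field name.
RULES = [
    Rule(1, lambda m: m.get("restarts", 0) >= 3 and m["error_rate_percent"] > 20,
         "restart", "critical"),
    Rule(2, lambda m: m["cpu_percent"] > 80 and m["request_latency_ms"] > 300,
         "scale_up", "high"),
    Rule(3, lambda m: m["memory_percent"] > 80, "restart", "high"),
    Rule(4, lambda m: m["error_rate_percent"] > 10, "throttle", "medium"),
    Rule(5, lambda m: m["request_latency_ms"] > 450, "alert", "medium"),
]


def decide(metric: dict, is_anomaly: bool) -> Optional[dict]:
    """Return the first matching rule's decision, or None when there is no anomaly."""
    if not is_anomaly:
        return None
    for rule in sorted(RULES, key=lambda r: r.priority):
        if rule.matches(metric):
            return {"action_type": rule.action, "severity": rule.severity}
    # Priority 6: an anomaly was detected but no specific rule matched.
    return {"action_type": "alert", "severity": "low"}


spike = {
    "cpu_percent": 91.2,
    "memory_percent": 53.4,
    "request_latency_ms": 438.0,
    "error_rate_percent": 2.1,
}
decision = decide(spike, is_anomaly=True)
```

With the CPU spike reading above, rule 2 is the first to match, so the decision is `scale_up` with severity `high`.
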
Four self-healing actions are implemented:

`restart`:
- Simulates sending SIGTERM to the container
- Simulates pulling the latest image and restarting
- Used for: crash loops, memory leaks

`scale_up`:
- Simulates requesting +1 replica from the orchestrator
- Used for: CPU overload causing latency degradation

`throttle`:
- Simulates setting a rate limit of 50 RPS on ingress
- Used for: elevated error rates

`alert`:
- Dispatches an alert with the reason and anomaly score
- Used for: unclassified anomalies, latency spikes

Cooldown:
- Each (service, action_type) pair has a 30-second cooldown
- Prevents action storms on persistent faults
- Cooldown state is tracked in memory

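
One way to implement a per-pair, in-memory cooldown is a dictionary keyed by `(service, action_type)`. The sketch below is an assumption about the shape of such a tracker, not KIRA's actual code: `CooldownTracker` and `try_acquire` are hypothetical names, and the clock is injectable so the behaviour can be checked without waiting 30 seconds.

```python
import time

COOLDOWN_SECONDS = 30.0


class CooldownTracker:
    """Remembers when each (service, action_type) pair last ran."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._last_run = {}  # (service, action_type) -> timestamp

    def try_acquire(self, service, action_type):
        """Return True and record the run if the pair is not cooling down."""
        key = (service, action_type)
        now = self._clock()
        last = self._last_run.get(key)
        if last is not None and now - last < COOLDOWN_SECONDS:
            return False  # still cooling down: skip the action
        self._last_run[key] = now
        return True


# Usage with a fake clock so the example runs instantly.
fake_now = [0.0]
tracker = CooldownTracker(clock=lambda: fake_now[0])
first = tracker.try_acquire("kira-sim-node", "scale_up")   # runs
fake_now[0] = 10.0
second = tracker.try_acquire("kira-sim-node", "scale_up")  # suppressed
other = tracker.try_acquire("kira-sim-node", "restart")    # different pair, runs
```

Keying on the pair rather than on the service alone lets a `restart` proceed while a `scale_up` for the same service is still cooling down.
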
Prerequisites:
- Python 3.11+
- Node.js 18+
- Docker Desktop

Start the Docker containers:

    docker start kira-prometheus pushgateway kira-grafana

Start the backend:

    cd kira-aiops
    .\kira-env\Scripts\Activate
    $env:PYTHONPATH = "D:\path\to\kira\aiops-system"
    uvicorn backend.main:app --reload --port 8000

Start the simulator:

    cd kira-aiops
    .\kira-env\Scripts\Activate
    $env:PYTHONPATH = "D:\path\to\kira\aiops-system"
    python -m simulation-service.simulator

Start the React dashboard:

    cd kira-aiops/dashboard
    npm run dev

| Service | URL |
|---|---|
| API docs | http://localhost:8000/docs |
| React dashboard | http://localhost:5173 |
| Grafana | http://localhost:3000 |
| Prometheus | http://localhost:9090 |
| Pushgateway | http://localhost:9091 |

Backend log:

    [NORMAL] kira-sim-node | score=0.012 | action=none | severity=low
    [ANOMALY] kira-sim-node | score=0.743 | action=scale_up | severity=high
    [ACTION] SCALE_UP → kira-sim-node
      Requesting +1 replica from orchestrator...
      New replica healthy and serving traffic.

Simulator log:

    [CPU_SPIKE] CPU= 91.2% MEM= 53.4% LAT= 438.0ms ERR= 2.1%
    [NORMAL] CPU= 27.3% MEM= 44.1% LAT= 71.2ms ERR= 0.4%

API response:

    {
      "metric": { "cpu_percent": 91.2, "memory_percent": 53.4 },
      "anomaly": { "is_anomaly": true, "anomaly_score": 0.743 },
      "decision": { "action_type": "scale_up", "severity": "high" },
      "action": { "success": true, "message": "Scaled up by 1 replica" }
    }

| Limitation | Reason | Future Fix |
|---|---|---|
| Simulated data only | No real cloud infra available | Connect to real AWS CloudWatch |
| In-memory storage | No persistence across restarts | Add PostgreSQL or Redis |
| Isolation Forest retrains each tick | Simple but inefficient at scale | Pre-train and cache model |
| Self-healing is simulated | No real Kubernetes cluster | Add kubectl/Railway API calls |
| No authentication | Phase 1 scope | Add JWT in Phase 2 |
| Single service monitored | Demo scope | Extend to multi-service topology |
| Component | Platform | URL |
|---|---|---|
| Backend API | Railway | https://kira-backend-production.up.railway.app |
| ML Model | Hugging Face | https://huggingface.co/spaces/Pushkar264/kira-anomaly-detector |
| Source Code | GitHub | https://github.com/PushkarKanjani/kira-aiops |
| Layer | Technology | Reason |
|---|---|---|
| Simulation | Python + NumPy | Realistic stateful fault injection |
| Backend | FastAPI | Lightweight, async, ML-friendly |
| Anomaly Detection | Isolation Forest + Z-score | Unsupervised, no labeled data needed |
| Storage | In-memory deque | Zero dependencies for Phase 1 |
| Metrics | Prometheus + Pushgateway | Industry standard observability |
| Dashboards | Grafana + React | Operational + portfolio views |
| Containerization | Docker | Reproducible environments |
| Cloud | Railway + Hugging Face | Free tier, no credit card |

KIRA (Kinetic Infrastructure Remediation & Autonomy): built as a Semester 6 AIOps Lab project.