KIRA: Phase 1 Walkthrough

Kinetic Infrastructure Remediation & Autonomy


1. What We Built

KIRA is an autonomous AIOps system that monitors simulated cloud infrastructure, detects anomalies using machine learning, identifies root causes, and automatically executes self-healing actions, all without human intervention.

The system implements a closed-loop architecture:

Simulate → Detect → Diagnose → Decide → Heal

Every component is a real, running service, not pseudocode or a notebook demo.


2. System Architecture

┌─────────────────────────────────────────────────────────┐
│                    KIRA System                          │
│                                                         │
│  ┌──────────────┐    POST /metrics    ┌──────────────┐  │
│  │  Simulation  │ ──────────────────► │   FastAPI    │  │
│  │  Service     │                     │   Backend    │  │
│  │  simulator.py│                     │   :8000      │  │
│  └──────────────┘                     └──────┬───────┘  │
│                                              │          │
│                              ┌───────────────┼──────┐   │
│                              ▼               ▼      ▼   │
│                        ┌──────────┐  ┌──────────┐       │
│                        │ Anomaly  │  │ Storage  │       │
│                        │Detection │  │ (deque)  │       │
│                        └────┬─────┘  └──────────┘       │
│                             │                           │
│                             ▼                           │
│                        ┌──────────┐                     │
│                        │Decision  │                     │
│                        │ Engine   │                     │
│                        └────┬─────┘                     │
│                             │                           │
│                             ▼                           │
│                        ┌──────────┐                     │
│                        │ Action   │                     │
│                        │ Engine   │                     │
│                        └──────────┘                     │
│                                                         │
│  ┌──────────────┐    ┌──────────────┐                   │
│  │    React     │    │   Grafana    │                   │
│  │  Dashboard   │    │  Dashboard   │                   │
│  │   :5173      │    │    :3000     │                   │
│  └──────────────┘    └──────────────┘                   │
└─────────────────────────────────────────────────────────┘

Cloud Deployments:
  Backend API  → Railway   (kira-backend-production.up.railway.app)
  ML Model     → HF Spaces (huggingface.co/spaces/Pushkar264/kira-anomaly-detector)
  Source Code  → GitHub    (github.com/PushkarKanjani/kira-aiops)

3. Data Flow: Step by Step

  1. simulator.py generates realistic metrics every 3 seconds

    • CPU, memory, latency, error rate, requests/sec
    • Injects faults: CPU spike, memory leak, crash loop, high error rate
    • Fault states persist for 8-18 ticks to mimic real incidents
  2. Metrics are POSTed to http://localhost:8000/metrics as JSON

  3. FastAPI backend receives the payload and validates it via Pydantic schema

  4. Metric is stored in a rolling window deque (last 30 readings per service)

  5. Anomaly detection runs on the full window

  6. Decision engine evaluates the anomaly result against priority-ordered rules

  7. Action engine executes the appropriate self-healing action

  8. Full pipeline result is stored and returned as JSON response

  9. Prometheus pushgateway receives metric updates every tick

  10. Prometheus scrapes pushgateway every 5 seconds

  11. Grafana queries Prometheus and updates dashboard every 5 seconds

  12. React dashboard polls /metrics endpoint every 3 seconds
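
Steps 1 and 4 above can be sketched in a few lines. This is a minimal illustration, not the code in simulator.py: the class name `FaultSimulator`, the helper `store`, the 5% fault-start probability, and the spike value ranges are all assumptions made for the example, and it uses the standard library `random` module rather than NumPy. Only the 8-18 tick fault duration, the 30-reading window, and the feature names come from this walkthrough.

```python
import random
from collections import deque

WINDOW_SIZE = 30          # last 30 readings per service (from the walkthrough)
FAULT_PROBABILITY = 0.05  # assumed chance of starting a fault on a healthy tick


class FaultSimulator:
    """Toy metric generator whose faults persist for 8-18 ticks."""

    def __init__(self, seed=None):
        self.rng = random.Random(seed)
        self.fault = None         # name of the active fault, or None
        self.ticks_remaining = 0  # ticks left before the fault clears

    def tick(self):
        # A new fault can only start once the previous one has run out.
        if self.ticks_remaining == 0:
            self.fault = None
            if self.rng.random() < FAULT_PROBABILITY:
                self.fault = "cpu_spike"
                self.ticks_remaining = self.rng.randint(8, 18)

        # Healthy baseline, using the "Normal Range" values from section 4.
        metric = {
            "cpu_percent": self.rng.uniform(20, 40),
            "memory_percent": self.rng.uniform(40, 55),
            "request_latency_ms": self.rng.uniform(60, 100),
            "error_rate_percent": self.rng.uniform(0, 2),
            "requests_per_second": self.rng.uniform(100, 140),
            "fault": self.fault or "normal",
        }
        if self.fault == "cpu_spike":
            # Assumed spike ranges, chosen to sit above the anomaly thresholds.
            metric["cpu_percent"] = self.rng.uniform(85, 98)
            metric["request_latency_ms"] = self.rng.uniform(320, 480)
            self.ticks_remaining -= 1
        return metric


# One rolling window per service; deque(maxlen=...) drops the oldest reading.
windows = {}


def store(service, metric):
    windows.setdefault(service, deque(maxlen=WINDOW_SIZE)).append(metric)
    return windows[service]
```

Because the fault state lives on the simulator object rather than being re-rolled every tick, a fault produces a run of consecutive bad readings, which is what lets the rolling window see a sustained incident instead of isolated noise.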


4. How Anomaly Detection Works

Two detection methods run on every tick:

Z-Score Detection

  • Computes mean and standard deviation of each feature over the rolling window
  • Flags features where the latest value lies more than 2.5 standard deviations from the window mean
  • Works from tick 5 onwards
  • Fast, per-feature, catches obvious spikes immediately
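
A rough sketch of this check, using only the standard library. The function name `zscore_flags` is hypothetical, and two details are assumptions rather than documented behaviour: the test is two-sided (so a drop in requests_per_second can be flagged), and the mean and standard deviation are computed from the readings before the latest one, so a single spike cannot inflate its own baseline.

```python
from statistics import mean, pstdev

Z_THRESHOLD = 2.5  # from the walkthrough
MIN_SAMPLES = 5    # detection "works from tick 5 onwards"
FEATURES = (
    "cpu_percent",
    "memory_percent",
    "request_latency_ms",
    "error_rate_percent",
    "requests_per_second",
)


def zscore_flags(window, features=FEATURES, threshold=Z_THRESHOLD):
    """Return {feature: z} for features whose latest value is an outlier.

    `window` is a sequence of metric dicts, oldest first, newest last.
    """
    window = list(window)
    if len(window) < MIN_SAMPLES:
        return {}
    history, latest = window[:-1], window[-1]
    flagged = {}
    for feature in features:
        values = [row[feature] for row in history]
        sigma = pstdev(values)
        if sigma == 0:
            continue  # constant history: the z-score is undefined
        z = (latest[feature] - mean(values)) / sigma
        if abs(z) > threshold:
            flagged[feature] = round(z, 2)
    return flagged
```

Returning the per-feature z values, rather than a single boolean, keeps enough information for a later stage to say which metric triggered the anomaly.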

Isolation Forest

  • Trains a fresh model on the rolling window (30 samples, 5 features)
  • Uses contamination=0.08 (expects 8% anomalies)
  • Scores the latest point: more negative = more anomalous
  • Kicks in after 20 samples
  • Catches subtle correlated anomalies Z-score misses
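
The retrain-and-score step might look like the sketch below, assuming scikit-learn's `IsolationForest` is the implementation in use (the walkthrough does not name the library). The function name `iforest_score`, the fixed `random_state`, and the synthetic demo data are illustrative assumptions; only `contamination=0.08`, the 30x5 window shape, and the 20-sample minimum come from the text above.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

MIN_SAMPLES = 20  # the walkthrough says the forest kicks in after 20 samples


def iforest_score(window):
    """Fit a fresh forest on the window and score its newest row.

    `window` is an (n_samples, 5) array, newest row last. Returns
    (is_anomaly, raw_score); a raw_score below zero means "anomalous".
    """
    X = np.asarray(window, dtype=float)
    if len(X) < MIN_SAMPLES:
        return False, 0.0
    # random_state is fixed here only to make the example reproducible.
    model = IsolationForest(contamination=0.08, random_state=0)
    model.fit(X)
    raw = float(model.decision_function(X[-1:])[0])
    return raw < 0, raw


# Synthetic demo: 29 healthy rows followed by one CPU/latency spike.
rng = np.random.default_rng(0)
normal = np.column_stack([
    rng.uniform(20, 40, 29),    # cpu_percent
    rng.uniform(40, 55, 29),    # memory_percent
    rng.uniform(60, 100, 29),   # request_latency_ms
    rng.uniform(0, 2, 29),      # error_rate_percent
    rng.uniform(100, 140, 29),  # requests_per_second
])
spike = np.array([[91.2, 53.4, 438.0, 2.1, 110.0]])
flagged, raw = iforest_score(np.vstack([normal, spike]))
```

`decision_function` is shifted so that zero is the contamination threshold, which is why the sign alone can serve as the flag. How KIRA maps this raw value onto its 0.0 to 1.0 score is not described here, so the sketch leaves it unscaled.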

Combined Result

  • Either detector flagging = anomaly
  • Final score = max(zscore_score, iforest_score)
  • Score range: 0.0 (normal) to 1.0 (certain anomaly)
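
The combination rule is small enough to show directly. In this sketch the `DetectionResult` dataclass and the `combine` function are hypothetical names, and both input scores are assumed to be already normalised to the 0.0 to 1.0 range; the walkthrough does not say how each detector's raw output is scaled.

```python
from dataclasses import dataclass


@dataclass
class DetectionResult:
    is_anomaly: bool
    anomaly_score: float  # 0.0 (normal) to 1.0 (certain anomaly)


def clamp01(x):
    """Keep a score inside the documented 0.0-1.0 range."""
    return max(0.0, min(1.0, x))


def combine(z_flag, zscore_score, if_flag, iforest_score):
    """Merge the two detectors: either flag wins, the score is the max."""
    return DetectionResult(
        is_anomaly=z_flag or if_flag,
        anomaly_score=clamp01(max(zscore_score, iforest_score)),
    )
```

Taking the maximum means a confident verdict from one detector is never diluted by a quiet result from the other.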

Features Used

| Feature | Normal Range | Anomaly Range |
| --- | --- | --- |
| cpu_percent | 20-40% | >80% |
| memory_percent | 40-55% | >80% |
| request_latency_ms | 60-100ms | >300ms |
| error_rate_percent | 0-2% | >10% |
| requests_per_second | 100-140 | <30 |

5. How the Decision Engine Works

Pure rule-based classification, priority ordered:

| Priority | Condition | Action | Severity |
| --- | --- | --- | --- |
| 1 | restarts >= 3 AND error_rate > 20% | restart | critical |
| 2 | cpu > 80% AND latency > 300ms | scale_up | high |
| 3 | memory > 80% | restart | high |
| 4 | error_rate > 10% | throttle | medium |
| 5 | latency > 450ms | alert | medium |
| 6 | anomaly but no rule matched | alert | low |

Rules are checked in order; the first match wins. No action is taken if no anomaly is detected.
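
A first-match-wins rule table like this one can be expressed as an ordered list of predicates. The sketch below is an illustration under assumptions, not the project's decision engine: the `Rule` dataclass, the `decide` function, and the `restarts` metric key are hypothetical names. The thresholds, actions, and severities are taken from the table.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rule:
    condition: Callable[[dict], bool]
    action: str
    severity: str


# Order matters: the list position is the rule's priority.
RULES = [
    Rule(lambda m: m.get("restarts", 0) >= 3 and m["error_rate_percent"] > 20,
         "restart", "critical"),
    Rule(lambda m: m["cpu_percent"] > 80 and m["request_latency_ms"] > 300,
         "scale_up", "high"),
    Rule(lambda m: m["memory_percent"] > 80, "restart", "high"),
    Rule(lambda m: m["error_rate_percent"] > 10, "throttle", "medium"),
    Rule(lambda m: m["request_latency_ms"] > 450, "alert", "medium"),
]


def decide(metric, is_anomaly):
    """Return (action, severity) for one metric reading."""
    if not is_anomaly:
        return ("none", "low")  # matches the "[NORMAL]" log line in section 8
    for rule in RULES:
        if rule.condition(metric):
            return (rule.action, rule.severity)
    return ("alert", "low")  # priority 6: anomaly but no rule matched
```

Keeping the rules as data makes the priority order visible at a glance and lets a new rule be added by inserting one line at the right position.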


6. How the Action Engine Works

Four self-healing actions are implemented:

restart

  • Simulates sending SIGTERM to the container
  • Simulates pulling latest image and restarting
  • Used for: crash loops, memory leaks

scale_up

  • Simulates requesting +1 replica from orchestrator
  • Used for: CPU overload causing latency degradation

throttle

  • Simulates setting rate limit to 50 RPS on ingress
  • Used for: elevated error rates

alert

  • Dispatches an alert with reason and anomaly score
  • Used for: unclassified anomalies, latency spikes

Cooldown System

  • Each (service, action_type) pair has a 30-second cooldown
  • Prevents action storms on persistent faults
  • Cooldown state is tracked in-memory
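
One way to implement such a cooldown is a dictionary keyed by (service, action_type) holding the time the action last fired. The class name `CooldownTracker`, the method `try_acquire`, and the injectable `clock` parameter are assumptions made for this sketch; the 30-second window and the in-memory state come from the description above.

```python
import time

COOLDOWN_SECONDS = 30.0  # from the walkthrough


class CooldownTracker:
    """In-memory per-(service, action_type) cooldown."""

    def __init__(self, cooldown=COOLDOWN_SECONDS, clock=time.monotonic):
        self.cooldown = cooldown
        self.clock = clock      # injectable so tests need not sleep
        self._last_fired = {}   # (service, action_type) -> timestamp

    def try_acquire(self, service, action_type):
        """Return True and start the cooldown if the action may run now."""
        key = (service, action_type)
        now = self.clock()
        last = self._last_fired.get(key)
        if last is not None and now - last < self.cooldown:
            return False  # still cooling down: skip this action
        self._last_fired[key] = now
        return True
```

Using `time.monotonic` rather than wall-clock time keeps the cooldown correct if the system clock is adjusted. As with the rest of Phase 1 storage, the state is lost on restart.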

7. How to Run the System Locally

Prerequisites

  • Python 3.11+
  • Node.js 18+
  • Docker Desktop

Start all infrastructure

docker start kira-prometheus pushgateway kira-grafana

Terminal 1: Backend

cd kira-aiops
.\kira-env\Scripts\Activate
$env:PYTHONPATH = "D:\path\to\kira\aiops-system"
uvicorn backend.main:app --reload --port 8000

Terminal 2: Simulator

cd kira-aiops
.\kira-env\Scripts\Activate
$env:PYTHONPATH = "D:\path\to\kira\aiops-system"
python -m simulation-service.simulator

Terminal 3: React Dashboard

cd kira-aiops/dashboard
npm run dev

Access Points

| Service | URL |
| --- | --- |
| API docs | http://localhost:8000/docs |
| React dashboard | http://localhost:5173 |
| Grafana | http://localhost:3000 |
| Prometheus | http://localhost:9090 |
| Pushgateway | http://localhost:9091 |

8. Expected Outputs

Backend terminal (normal state)

[NORMAL] kira-sim-node | score=0.012 | action=none | severity=low

Backend terminal (anomaly detected)

[ANOMALY] kira-sim-node | score=0.743 | action=scale_up | severity=high
  [ACTION] SCALE_UP → kira-sim-node
           Requesting +1 replica from orchestrator...
           New replica healthy and serving traffic.

Simulator terminal

[CPU_SPIKE]      CPU= 91.2%  MEM= 53.4%  LAT= 438.0ms  ERR=  2.1%
[NORMAL]         CPU= 27.3%  MEM= 44.1%  LAT=  71.2ms  ERR=  0.4%

API response (GET /metrics)

{
  "metric": { "cpu_percent": 91.2, "memory_percent": 53.4 },
  "anomaly": { "is_anomaly": true, "anomaly_score": 0.743 },
  "decision": { "action_type": "scale_up", "severity": "high" },
  "action": { "success": true, "message": "Scaled up by 1 replica" }
}

9. Known Limitations

| Limitation | Reason | Future Fix |
| --- | --- | --- |
| Simulated data only | No real cloud infra available | Connect to real AWS CloudWatch |
| In-memory storage | No persistence across restarts | Add PostgreSQL or Redis |
| Isolation Forest retrains each tick | Simple but inefficient at scale | Pre-train and cache model |
| Self-healing is simulated | No real Kubernetes cluster | Add kubectl/Railway API calls |
| No authentication | Phase 1 scope | Add JWT in Phase 2 |
| Single service monitored | Demo scope | Extend to multi-service topology |

10. Live Deployments

| Component | Platform | URL |
| --- | --- | --- |
| Backend API | Railway | https://kira-backend-production.up.railway.app |
| ML Model | Hugging Face | https://huggingface.co/spaces/Pushkar264/kira-anomaly-detector |
| Source Code | GitHub | https://github.com/PushkarKanjani/kira-aiops |

11. Technology Stack Summary

| Layer | Technology | Reason |
| --- | --- | --- |
| Simulation | Python + NumPy | Realistic stateful fault injection |
| Backend | FastAPI | Lightweight, async, ML-friendly |
| Anomaly Detection | Isolation Forest + Z-score | Unsupervised, no labeled data needed |
| Storage | In-memory deque | Zero dependencies for Phase 1 |
| Metrics | Prometheus + Pushgateway | Industry standard observability |
| Dashboards | Grafana + React | Operational + portfolio views |
| Containerization | Docker | Reproducible environments |
| Cloud | Railway + Hugging Face | Free tier, no credit card |

KIRA: built as a Semester 6 AIOps Lab Project
