KIRA — Phase 1 Walkthrough

Kinetic Infrastructure Remediation & Autonomy

1. What We Built

KIRA is an Autonomous AIOps system that monitors simulated cloud infrastructure, detects anomalies using machine learning, identifies root causes, and automatically executes self-healing actions — all without human intervention.

The system implements a closed-loop architecture:

Simulate → Detect → Diagnose → Decide → Heal

Every component is a real, running service — not pseudocode or a notebook demo.

2. System Architecture

┌─────────────────────────────────────────────────────────┐
│                    KIRA System                          │
│                                                         │
│  ┌──────────────┐    POST /metrics    ┌──────────────┐ │
│  │  Simulation  │ ──────────────────► │   FastAPI    │ │
│  │  Service     │                     │   Backend    │ │
│  │  simulator.py│                     │   :8000      │ │
│  └──────────────┘                     └──────┬───────┘ │
│                                              │          │
│                              ┌───────────────┼──────┐  │
│                              ▼               ▼      ▼  │
│                        ┌──────────┐  ┌──────────┐      │
│                        │ Anomaly  │  │ Storage  │      │
│                        │Detection │  │ (deque)  │      │
│                        └────┬─────┘  └──────────┘      │
│                             │                           │
│                             ▼                           │
│                        ┌──────────┐                     │
│                        │Decision  │                     │
│                        │ Engine   │                     │
│                        └────┬─────┘                     │
│                             │                           │
│                             ▼                           │
│                        ┌──────────┐                     │
│                        │ Action   │                     │
│                        │ Engine   │                     │
│                        └──────────┘                     │
│                                                         │
│  ┌──────────────┐    ┌──────────────┐                  │
│  │    React     │    │   Grafana    │                  │
│  │  Dashboard   │    │  Dashboard   │                  │
│  │   :5173      │    │    :3000     │                  │
│  └──────────────┘    └──────────────┘                  │
└─────────────────────────────────────────────────────────┘

Cloud Deployments:
  Backend API  → Railway   (kira-backend-production.up.railway.app)
  ML Model     → HF Spaces (huggingface.co/spaces/Pushkar264/kira-anomaly-detector)
  Source Code  → GitHub    (github.com/PushkarKanjani/kira-aiops)

3. Data Flow — Step by Step

1. simulator.py generates realistic metrics every 3 seconds
   - CPU, memory, latency, error rate, requests/sec
   - Injects faults: CPU spike, memory leak, crash loop, high error rate
   - Fault states persist for 8-18 ticks to mimic real incidents
2. Metrics are POST-ed to http://localhost:8000/metrics as JSON
3. FastAPI backend receives the payload and validates it via Pydantic schema
4. Metric is stored in a rolling window deque (last 30 readings per service)
5. Anomaly detection runs on the full window
6. Decision engine evaluates the anomaly result against priority-ordered rules
7. Action engine executes the appropriate self-healing action
8. Full pipeline result is stored and returned as JSON response
9. Prometheus pushgateway receives metric updates every tick
10. Prometheus scrapes pushgateway every 5 seconds
11. Grafana queries Prometheus and updates dashboard every 5 seconds
12. React dashboard polls /metrics endpoint every 3 seconds
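
The metric generation and rolling-window storage steps above can be sketched with the standard library alone. This is an illustrative sketch, not the project's simulator.py: the names `FaultSimulator` and `MetricStore`, the `fault_probability` parameter, the `restart_count` field, and the value ranges drawn during each fault are assumptions. Only the metric names, the 8-18 tick fault duration, and the 30-reading window come from the text.

```python
import random
from collections import defaultdict, deque

FAULTS = ("cpu_spike", "memory_leak", "crash_loop", "high_error_rate")
WINDOW_SIZE = 30  # last 30 readings per service


class FaultSimulator:
    """Emits one metric dict per tick; an injected fault persists for 8-18 ticks."""

    def __init__(self, service="kira-sim-node", fault_probability=0.05, seed=None):
        self.service = service
        self.fault_probability = fault_probability  # assumed value, not from the project
        self.rng = random.Random(seed)
        self.fault = None
        self.ticks_left = 0

    def tick(self):
        # A new fault can only start once the previous one has run out.
        if self.ticks_left == 0:
            self.fault = None
            if self.rng.random() < self.fault_probability:
                self.fault = self.rng.choice(FAULTS)
                self.ticks_left = self.rng.randint(8, 18)

        u = self.rng.uniform
        # Baseline values drawn from the "normal" ranges in the features table.
        metric = {
            "service": self.service,
            "fault": self.fault or "normal",
            "cpu_percent": u(20, 40),
            "memory_percent": u(40, 55),
            "request_latency_ms": u(60, 100),
            "error_rate_percent": u(0, 2),
            "requests_per_second": u(100, 140),
            "restart_count": 0,  # hypothetical field name
        }
        if self.fault is not None:
            self.ticks_left -= 1
            if self.fault == "cpu_spike":
                metric.update(cpu_percent=u(85, 98), request_latency_ms=u(320, 500))
            elif self.fault == "memory_leak":
                metric.update(memory_percent=u(82, 97))
            elif self.fault == "crash_loop":
                metric.update(error_rate_percent=u(22, 40),
                              requests_per_second=u(5, 30),
                              restart_count=self.rng.randint(3, 6))
            else:  # high_error_rate
                metric.update(error_rate_percent=u(11, 20))
        return metric


class MetricStore:
    """Keeps a bounded rolling window of readings per service."""

    def __init__(self, maxlen=WINDOW_SIZE):
        self._windows = defaultdict(lambda: deque(maxlen=maxlen))

    def add(self, metric):
        window = self._windows[metric["service"]]
        window.append(metric)  # deque(maxlen=...) evicts the oldest reading
        return list(window)


sim = FaultSimulator(seed=7)
store = MetricStore()
for _ in range(40):
    window = store.add(sim.tick())
print(len(window))  # 30: the ten oldest readings were evicted
```

Using `deque(maxlen=...)` keeps eviction implicit, so the storage layer never needs its own trimming logic.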

4. How Anomaly Detection Works

Two detection methods run on every tick:

Z-Score Detection

Computes mean and standard deviation of each feature over the rolling window
Flags a feature when its latest value lies more than 2.5 standard deviations from the window mean
Active once the window holds at least 5 readings (tick 5 onwards)
Fast, per-feature, catches obvious spikes immediately
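
A minimal sketch of the per-feature Z-score check, using only the standard library. The function name `zscore_detect` and the mapping from Z-score to a 0-1 score are assumptions. The sketch also computes the baseline from the readings before the latest one, so that a spike does not inflate its own standard deviation; the project may include the latest reading in the statistics instead.

```python
from statistics import mean, pstdev

FEATURES = (
    "cpu_percent",
    "memory_percent",
    "request_latency_ms",
    "error_rate_percent",
    "requests_per_second",
)
Z_THRESHOLD = 2.5
MIN_SAMPLES = 5


def zscore_detect(window, threshold=Z_THRESHOLD, min_samples=MIN_SAMPLES):
    """Return (flagged, score) for the latest reading in `window`.

    `flagged` maps feature name to its absolute Z-score.
    `score` is an assumed normalisation: 0.5 at the threshold, capped at 1.0.
    """
    rows = list(window)
    if len(rows) < min_samples:
        return {}, 0.0

    history, latest = rows[:-1], rows[-1]
    flagged = {}
    max_z = 0.0
    for name in FEATURES:
        values = [row[name] for row in history]
        sigma = pstdev(values)
        if sigma == 0:
            continue  # constant feature: Z-score is undefined, so skip it
        z = abs(latest[name] - mean(values)) / sigma
        max_z = max(max_z, z)
        if z > threshold:
            flagged[name] = z
    return flagged, min(1.0, max_z / (2 * threshold))
```

The function accepts any sequence of metric dicts, including a `deque`, and returns an empty result until enough readings have accumulated.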

Isolation Forest

Trains a fresh model on the rolling window each tick (up to 30 samples, 5 features)
Uses contamination=0.08 (assumes about 8% of the window is anomalous)
Scores the latest point; a more negative raw score means more anomalous
Kicks in once the window holds 20 samples
Catches subtle correlated anomalies that Z-score misses
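
The retrain-and-score step can be sketched with scikit-learn's `IsolationForest`. The function name `iforest_detect`, the fixed `seed`, and the mapping from the raw decision value to a 0-1 score are assumptions; the contamination value, the 20-sample minimum, and the five features come from the text.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

FEATURES = (
    "cpu_percent",
    "memory_percent",
    "request_latency_ms",
    "error_rate_percent",
    "requests_per_second",
)
MIN_SAMPLES = 20


def iforest_detect(window, contamination=0.08, min_samples=MIN_SAMPLES, seed=42):
    """Fit a fresh Isolation Forest on the window and score its latest reading."""
    rows = list(window)
    if len(rows) < min_samples:
        return False, 0.0

    X = np.array([[row[name] for name in FEATURES] for row in rows], dtype=float)
    model = IsolationForest(n_estimators=100, contamination=contamination,
                            random_state=seed)
    model.fit(X)

    # decision_function is negative for points the model treats as outliers.
    raw = float(model.decision_function(X[-1:])[0])
    is_anomaly = raw < 0
    # Assumed normalisation: raw == 0 maps to 0.5, more negative maps higher.
    score = min(1.0, max(0.0, 0.5 - raw))
    return is_anomaly, score


# Synthetic demo data: 29 normal readings followed by one extreme reading.
rng = np.random.default_rng(0)
demo = [
    dict(zip(FEATURES, (rng.uniform(20, 40), rng.uniform(40, 55),
                        rng.uniform(60, 100), rng.uniform(0, 2),
                        rng.uniform(100, 140))))
    for _ in range(29)
]
demo.append(dict(zip(FEATURES, (97.0, 95.0, 900.0, 45.0, 5.0))))
print(iforest_detect(demo))
```

Because the model is refit on every call, the cost grows with tick rate and window size, which is the inefficiency noted under Known Limitations.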

Combined Result

A reading is an anomaly if either detector flags it
Final score = max(zscore_score, iforest_score)
Score range: 0.0 (normal) to 1.0 (certain anomaly)
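
The combination rule is small enough to show in full. This sketch assumes both detector scores are already normalised to the 0.0-1.0 range; the function name `combine_results` is hypothetical, while the keys `is_anomaly` and `anomaly_score` match the sample API response in section 8.

```python
def combine_results(zscore_flagged, zscore_score, iforest_flagged, iforest_score):
    """Merge both detector outputs: either flag wins, the higher score is kept."""
    return {
        "is_anomaly": bool(zscore_flagged) or bool(iforest_flagged),
        "anomaly_score": round(max(zscore_score, iforest_score), 3),
    }


print(combine_results(False, 0.21, True, 0.743))
# {'is_anomaly': True, 'anomaly_score': 0.743}
```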

Features Used

Feature	Normal Range	Anomaly Range
cpu_percent	20-40%	>80%
memory_percent	40-55%	>80%
request_latency_ms	60-100ms	>300ms
error_rate_percent	0-2%	>10%
requests_per_second	100-140	<30

5. How the Decision Engine Works

Pure rule-based classification, priority ordered:

Priority	Condition	Action	Severity
1	restarts >= 3 AND error_rate > 20%	restart	critical
2	cpu > 80% AND latency > 300ms	scale_up	high
3	memory > 80%	restart	high
4	error_rate > 10%	throttle	medium
5	latency > 450ms	alert	medium
6	anomaly but no rule matched	alert	low

Rules are checked in order — first match wins. No action is taken if no anomaly is detected.
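
A first-match-wins rule table like the one above could be expressed as an ordered list of predicates. This is a sketch under stated assumptions: the `Rule` class, the `decide` function, and the `restart_count` field name are hypothetical, while the thresholds, actions, and severities are taken from the table.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rule:
    condition: Callable[[dict], bool]
    action: str
    severity: str


# Order matters: the list is scanned top to bottom and the first match wins.
RULES = [
    Rule(lambda m: m.get("restart_count", 0) >= 3 and m["error_rate_percent"] > 20,
         "restart", "critical"),
    Rule(lambda m: m["cpu_percent"] > 80 and m["request_latency_ms"] > 300,
         "scale_up", "high"),
    Rule(lambda m: m["memory_percent"] > 80, "restart", "high"),
    Rule(lambda m: m["error_rate_percent"] > 10, "throttle", "medium"),
    Rule(lambda m: m["request_latency_ms"] > 450, "alert", "medium"),
]


def decide(metric, is_anomaly):
    """Return the action and severity for one reading."""
    if not is_anomaly:
        return {"action_type": "none", "severity": "low"}
    for rule in RULES:
        if rule.condition(metric):
            return {"action_type": rule.action, "severity": rule.severity}
    # Priority 6: an anomaly was detected but no specific rule matched.
    return {"action_type": "alert", "severity": "low"}


reading = {
    "cpu_percent": 91.2,
    "memory_percent": 53.4,
    "request_latency_ms": 438.0,
    "error_rate_percent": 2.1,
}
print(decide(reading, is_anomaly=True))
# {'action_type': 'scale_up', 'severity': 'high'}
```

Keeping the rules in a plain list makes the priority order explicit and lets a new rule be added by inserting one entry at the right position.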

6. How the Action Engine Works

Four self-healing actions are implemented:

restart

Simulates sending SIGTERM to the container
Simulates pulling latest image and restarting
Used for: crash loops, memory leaks

scale_up

Simulates requesting +1 replica from orchestrator
Used for: CPU overload causing latency degradation

throttle

Simulates setting rate limit to 50 RPS on ingress
Used for: elevated error rates

alert

Dispatches an alert with reason and anomaly score
Used for: unclassified anomalies, latency spikes

Cooldown System

Each (service, action_type) pair has a 30-second cooldown
Prevents action storms on persistent faults
Cooldown state is tracked in-memory
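
One way to implement a per-pair cooldown is a dictionary keyed by (service, action_type) that stores the time of the last execution. The class name `CooldownTracker` and its `try_acquire` method are hypothetical; the injectable `clock` parameter is an assumption added so the behaviour can be checked without waiting 30 seconds.

```python
import time


class CooldownTracker:
    """In-memory cooldown per (service, action_type) pair."""

    def __init__(self, cooldown_seconds=30.0, clock=time.monotonic):
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock
        self._last_run = {}

    def try_acquire(self, service, action_type):
        """Return True and start the cooldown if the action may run now."""
        key = (service, action_type)
        now = self._clock()
        last = self._last_run.get(key)
        if last is not None and now - last < self.cooldown_seconds:
            return False  # still cooling down: suppress the action
        self._last_run[key] = now
        return True


# Demo with a fake clock so the example runs instantly.
fake_now = [0.0]
tracker = CooldownTracker(clock=lambda: fake_now[0])

print(tracker.try_acquire("kira-sim-node", "scale_up"))  # True
fake_now[0] = 10.0
print(tracker.try_acquire("kira-sim-node", "scale_up"))  # False (10 s < 30 s)
print(tracker.try_acquire("kira-sim-node", "restart"))   # True (different action)
fake_now[0] = 31.0
print(tracker.try_acquire("kira-sim-node", "scale_up"))  # True (cooldown elapsed)
```

`time.monotonic` is used as the default clock because it is unaffected by system clock adjustments. As with the rest of Phase 1, the state is lost when the process restarts.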

7. How to Run the System Locally

Prerequisites

Python 3.11+
Node.js 18+
Docker Desktop

Start all infrastructure

docker start kira-prometheus pushgateway kira-grafana

Terminal 1 — Backend

cd kira-aiops
.\kira-env\Scripts\Activate
$env:PYTHONPATH = "D:\path\to\kira\aiops-system"
uvicorn backend.main:app --reload --port 8000

Terminal 2 — Simulator

cd kira-aiops
.\kira-env\Scripts\Activate
$env:PYTHONPATH = "D:\path\to\kira\aiops-system"
python -m simulation-service.simulator

Terminal 3 — React Dashboard

cd kira-aiops/dashboard
npm run dev

Access Points

Service	URL
API docs	http://localhost:8000/docs
React dashboard	http://localhost:5173
Grafana	http://localhost:3000
Prometheus	http://localhost:9090
Pushgateway	http://localhost:9091

8. Expected Outputs

Backend terminal (normal state)

[NORMAL] kira-sim-node | score=0.012 | action=none | severity=low

Backend terminal (anomaly detected)

[ANOMALY] kira-sim-node | score=0.743 | action=scale_up | severity=high
  [ACTION] SCALE_UP → kira-sim-node
           Requesting +1 replica from orchestrator...
           New replica healthy and serving traffic.

Simulator terminal

[CPU_SPIKE]      CPU= 91.2%  MEM= 53.4%  LAT= 438.0ms  ERR=  2.1%
[NORMAL]         CPU= 27.3%  MEM= 44.1%  LAT=  71.2ms  ERR=  0.4%

API response (GET /metrics)

{
  "metric": { "cpu_percent": 91.2, "memory_percent": 53.4 },
  "anomaly": { "is_anomaly": true, "anomaly_score": 0.743 },
  "decision": { "action_type": "scale_up", "severity": "high" },
  "action": { "success": true, "message": "Scaled up by 1 replica" }
}

9. Known Limitations

Limitation	Reason	Future Fix
Simulated data only	No real cloud infra available	Connect to real AWS CloudWatch
In-memory storage	No persistence across restarts	Add PostgreSQL or Redis
Isolation Forest retrains each tick	Simple but inefficient at scale	Pre-train and cache model
Self-healing is simulated	No real Kubernetes cluster	Add kubectl/Railway API calls
No authentication	Phase 1 scope	Add JWT in Phase 2
Single service monitored	Demo scope	Extend to multi-service topology

10. Live Deployments

Component	Platform	URL
Backend API	Railway	https://kira-backend-production.up.railway.app
ML Model	Hugging Face	https://huggingface.co/spaces/Pushkar264/kira-anomaly-detector
Source Code	GitHub	https://github.com/PushkarKanjani/kira-aiops

11. Technology Stack Summary

Layer	Technology	Reason
Simulation	Python + NumPy	Realistic stateful fault injection
Backend	FastAPI	Lightweight, async, ML-friendly
Anomaly Detection	Isolation Forest + Z-score	Unsupervised, no labeled data needed
Storage	In-memory deque	Zero dependencies for Phase 1
Metrics	Prometheus + Pushgateway	Industry standard observability
Dashboards	Grafana + React	Operational + portfolio views
Containerization	Docker	Reproducible environments
Cloud	Railway + Hugging Face	Free tier, no credit card

KIRA: Built as Semester 6 AIOps Lab Project
Kinetic Infrastructure Remediation & Autonomy
