SRE & AI Engineer building resilient systems at scale. Creator of OpenSRE — an AI-powered SRE platform with episodic memory and knowledge graph for automated incident investigation.
I'm an SRE & AI Engineer with 10+ years of experience in infrastructure, platform engineering, and production reliability. I build systems that combine site reliability engineering with AI to automate what SRE teams do manually — from incident investigation to root cause analysis.
Currently building OpenSRE, an open-source AI SRE platform that learns from every production incident using episodic memory and Neo4j knowledge graphs.
- OpenSRE — AI SRE platform with episodic memory and knowledge graph. Investigates production incidents, correlates alerts, analyzes logs, and finds root causes automatically. 46 production skills, multi-provider LLM support, Slack/Teams integration.
- Portfolio — Personal portfolio and blog
- Site Reliability Engineering — production incident response, observability, SLO/SLI design, on-call operations
- AI Engineering — LLM agents, LangGraph orchestration, episodic memory systems, RAG pipelines
- Platform Engineering — Kubernetes, Terraform, ArgoCD, CI/CD pipelines
- Cloud Infrastructure — AWS, GCP, infrastructure as code, containerization
- Observability — Prometheus, Grafana, Elasticsearch, Datadog, PagerDuty



