Read-only multi-cluster SRE agent in your terminal. Ask plain-language
questions about Kubernetes / JVM / Python / GPU workloads across every
cluster you have credentials for, and get answers stitched together
from kubectl, Prometheus, Loki, jcmd, py-spy, nvidia-smi, perf, eBPF,
and friends — without typing any of them.
cloudy never mutates infrastructure. Every call is GET / LIST /
WATCH, enforced by three independent layers plus two hardening guards.
██████╗██╗ ██████╗ ██╗ ██╗██████╗ ██╗ ██╗
██╔════╝██║ ██╔═══██╗██║ ██║██╔══██╗╚██╗ ██╔╝
██║ ██║ ██║ ██║██║ ██║██║ ██║ ╚████╔╝
██║ ██║ ██║ ██║██║ ██║██║ ██║ ╚██╔╝
╚██████╗███████╗╚██████╔╝╚██████╔╝██████╔╝ ██║
╚═════╝╚══════╝ ╚═════╝ ╚═════╝ ╚═════╝ ╚═╝
⚙ /setup discover clusters & backends
? /help keyboard shortcuts
⏎ or just ask a question
You type:
Why did checkout-service p99 spike around 2am yesterday?
cloudy plans the investigation, runs the relevant read-only probes (metrics, logs, traces, profiles), and explains what it found. The agent picks tools from a typed registry — Kubernetes, Prometheus, Loki / ES, Tempo / Jaeger, pprof, async-profiler, py-spy, NVIDIA SMI, perf, eBPF — based on the question, not on a fixed script.
One-liner (macOS, Linux — amd64 + arm64):
curl -fsSL https://raw.githubusercontent.com/rlaope/cloudy/master/install.sh | sh
Drops the latest GitHub release into ~/.local/bin/cloudy, sets the
executable bit, and prints a PATH-setup hint if needed. Once the
installer finishes, the binary is reachable as plain cloudy from
any directory (no ./ prefix — it lives on $PATH, not in your
working directory). Re-run the same one-liner anytime to upgrade,
or use cloudy update from inside the TUI — the installer always
pulls whatever GitHub marks as latest.
Override the install location with CLOUDY_INSTALL_DIR:
curl -fsSL https://raw.githubusercontent.com/rlaope/cloudy/master/install.sh \
| CLOUDY_INSTALL_DIR=/usr/local/bin sh
Build from source (Windows, contributors, anything off the release matrix):
git clone https://github.com/rlaope/cloudy.git
cd cloudy
make build # produces ./cloudy in the repo root
./cloudy --version # quick smoke test from the build dir
sudo mv cloudy /usr/local/bin/ # or move it onto your PATH any other way
cloudy --version # now reachable as a bare command
Either install path leaves the binary reachable as plain cloudy
from any directory once it is on your PATH.
cloudy
The TUI opens. Two commands get you to the first question:
- /setup: scans your kubeconfig contexts, auto-discovers Prometheus / Loki / Elasticsearch / Tempo / Jaeger / Postgres / MySQL / Redis / pprof / V8 inspector endpoints, lets you pick which to enable inline, then writes ~/.cloudy/config.yaml plus a profile.yaml snapshot of the scan. No restart.
- /login: picks an LLM provider (Anthropic / OpenAI / Google / Moonshot) with arrow keys and saves the API key to ~/.cloudy/secrets (mode 0600). The chosen model is active immediately; /model <id> swaps mid-session.
Then ask:
> Why does the payments-api pod keep getting OOMKilled?
Headless / CI usage:
cloudy ask "Why is the checkout service slow right now?" # one-shot
cloudy setup # non-interactive setup
cloudy profile use payments-sre # activate a permission profile
cloudy profile cluster # show RBAC for current context
Three independent enforcement layers plus boot-time and runtime hardening. Defense in depth, not a single chokepoint.
- HTTP RoundTripper rejects every method other than GET / HEAD / OPTIONS before the request reaches the network. The K8s client honours this too: rest.Config.WrapTransport is set to the same wrapper, so apiserver calls share the HTTP whitelist end-to-end.
- Bundled ClusterRole (manifests/rbac/) only grants get / list / watch (plus the two narrow bastion verbs below) at the RBAC layer, so the cluster itself refuses anything else even if a guard in cloudy were bypassed.
- Bastion reachability verbs (services/proxy: get, pods/portforward: create) are the minimum required to reach HTTP and TCP backends through the apiserver and do not widen the mutation surface.
On top of those layers cloudy adds two hardening guards:
- The tools.Registry mutator-name assertion panics at boot if any registered tool name looks like a write (create_*, delete_*, patch_*, ...). Mutating tools (exec, delete, patch, write-mode port-forward) are never registered, so the LLM never sees them in its tool catalogue and cannot ask for them.
- A risk-rated approval gate sits in front of tools that are read-only but expensive enough to perturb the system they're observing: STW JVM pauses, attached eBPF probes, long profiling windows. The TUI surfaces a y/N banner; headless entry points refuse them with a clear message. See docs/SAFETY.md.
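A boot-time name assertion of the kind described in the first guard might look roughly like the following Go sketch. The Registry type, MustRegister, and looksLikeMutator are hypothetical names, and the prefix list contains only the three patterns named above; the real list is longer and is not reproduced here.

```go
package main

import (
	"fmt"
	"strings"
)

// mutatorPrefixes holds only the patterns named in the text
// (create_*, delete_*, patch_*); the real list is assumed to be longer.
var mutatorPrefixes = []string{"create_", "delete_", "patch_"}

// looksLikeMutator reports whether a tool name starts with a write-like prefix.
func looksLikeMutator(name string) bool {
	for _, prefix := range mutatorPrefixes {
		if strings.HasPrefix(name, prefix) {
			return true
		}
	}
	return false
}

// Registry is a hypothetical stand-in for a tool catalogue.
type Registry struct {
	tools map[string]struct{}
}

func NewRegistry() *Registry {
	return &Registry{tools: make(map[string]struct{})}
}

// MustRegister panics on write-looking names, so a bad build fails at
// startup instead of exposing the tool to the model.
func (r *Registry) MustRegister(name string) {
	if looksLikeMutator(name) {
		panic(fmt.Sprintf("tool %q looks like a mutator; refusing to register", name))
	}
	r.tools[name] = struct{}{}
}

func main() {
	r := NewRegistry()
	r.MustRegister("list_pods")

	// Recover only so the demo can print the failure instead of crashing.
	defer func() {
		if rec := recover(); rec != nil {
			fmt.Println("boot aborted:", rec)
		}
	}()
	r.MustRegister("delete_pod")
}
```

Panicking rather than returning an error is deliberate in this sketch: the check guards against a programming mistake, so failing loudly at boot is preferable to silently skipping the tool.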
| Domain | What it talks to |
|---|---|
| Kubernetes | apiserver (get / list / watch only) |
| Metrics | Prometheus, Thanos, VictoriaMetrics |
| Logs | Loki, Elasticsearch, OpenSearch |
| Traces | Tempo, Jaeger |
| JVM | jcmd, async-profiler (heap / cpu / alloc) |
| Python | py-spy (sampling / dump-stacks) |
| Ruby | rbspy (sampling) — registered as perf.rbspy_dump |
| GPU | NVIDIA SMI, DCGM |
| Kernel | perf, eBPF (read-only probes only) |
| Databases | Postgres / MySQL / Redis (read-only query subset) |
HTTP backends are reached via the K8s apiserver's services/proxy,
TCP backends via in-process SPDY port-forward. A single
kubectl-reachable cluster is enough — no VPN, no per-service
ingress.
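For HTTP backends, the apiserver's service proxy subresource is addressed by a path of the form /api/v1/namespaces/{namespace}/services/{scheme}:{name}:{port}/proxy/{path}. The Go sketch below only builds that path; the helper name serviceProxyPath and the example namespace, service, and port are hypothetical, and how cloudy itself assembles the request is not shown in this document.

```go
package main

import (
	"fmt"
	"strings"
)

// serviceProxyPath builds the apiserver path for the services/proxy
// subresource. scheme may be empty, "http", or "https".
func serviceProxyPath(namespace, scheme, service string, port int, subpath string) string {
	target := fmt.Sprintf("%s:%d", service, port)
	if scheme != "" {
		target = scheme + ":" + target
	}
	return fmt.Sprintf("/api/v1/namespaces/%s/services/%s/proxy/%s",
		namespace, target, strings.TrimPrefix(subpath, "/"))
}

func main() {
	// Hypothetical Prometheus service in a "monitoring" namespace.
	fmt.Println(serviceProxyPath("monitoring", "http", "prometheus", 9090, "/api/v1/query"))
	// prints /api/v1/namespaces/monitoring/services/http:prometheus:9090/proxy/api/v1/query
}
```

A GET against that path is answered by the backend service but authorised by the apiserver, which is why only the services/proxy get verb is needed to reach it.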
Every probe the agent can call is a typed tool with a JSON schema.
Tools self-register at boot — perf, eBPF, and DB groups also gate on
binary / driver presence. Type /tools in the TUI to see what's
wired in your environment.
| Group | Tools (count) |
|---|---|
| k8s (20) | list_pods, list_nodes, list_namespaces, describe_pod, events, logs, top_pods, top_nodes, list_deployments, list_statefulsets, list_daemonsets, list_jobs, list_cronjobs, list_services, list_ingresses, list_hpa, list_pdbs, list_networkpolicies, list_crds, list_cr (CRD-generic dynamic-client reader; unlocks Argo Rollouts, KEDA, cert-manager, Gateway API, Sloth SLOs, ServiceMonitor, etc. in one tool) |
| prom (4) | query, query_range, label_values, series |
| log (7) | loki_query_range, loki_labels, loki_label_values, loki_series, es_search, es_indices, es_cluster_health |
| trace (7) | tempo_get_trace, tempo_search, service_graph (Tempo metrics-generator service-graph edges), route_red (Tempo metrics-generator per-route RED), jaeger_services, jaeger_operations, jaeger_search_traces |
| alert (3) | list_active, list_silences (Alertmanager v2), list_rules (Prometheus rules API) |
| gitops (3) | argo_list_apps, argo_app_status, argo_app_history (Argo CD v1 API) |
| db (18) | Postgres: pg_version, pg_stat_activity, pg_stat_database, pg_stat_replication, pg_locks, pg_top_table_size. MySQL: mysql_version, mysql_processlist, mysql_global_status, mysql_global_variables, mysql_engine_innodb_status, mysql_top_table_size. Redis: redis_info, redis_dbsize, redis_scan, redis_inspect_key, redis_slowlog, redis_client_list |
| perf (4) | rbspy_dump (Ruby, always-on), go_pprof_cpu, linux_perf_record, v8_inspector_cpu_profile (last three conditional on host binaries) |
| jvm (4) | jstat_gc, jcmd_gc, jcmd_thread_dump, async_profile |
| py (2) | spy_dump, spy_top_snapshot |
| gpu (2) | nvidia_smi, dcgm_metrics |
| ebpf (5) | biolatency, tcptop, tcprtt, execsnoop, bpftrace_oneliner (all RiskHigh; gated by the approval banner) |
No ruby group. rbspy is registered as perf.rbspy_dump. If you are looking for Ruby profiling in /tools, search perf.
Skills are curated multi-step playbooks the agent picks when a
question matches their triggers. They live in
skills/ (mirrored under internal/skills/skills/ for
embedding); you can override or add by dropping a .md file into
~/.cloudy/skills/ — user files win on name conflicts.
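The override rule ("user files win on name conflicts") amounts to overlaying one name-keyed map on another. The Go sketch below assumes skills are keyed by name and represented as plain strings; mergeSkills and the kafka-lag skill are hypothetical, and how cloudy actually loads the .md files is not shown.

```go
package main

import (
	"fmt"
	"sort"
)

// mergeSkills overlays user skills on bundled ones. Keys are skill names,
// values are playbook bodies; user entries replace bundled ones on conflict.
func mergeSkills(bundled, user map[string]string) map[string]string {
	merged := make(map[string]string, len(bundled)+len(user))
	for name, body := range bundled {
		merged[name] = body
	}
	for name, body := range user {
		merged[name] = body // user file wins
	}
	return merged
}

func main() {
	bundled := map[string]string{
		"jvm-gc":  "bundled jvm-gc playbook",
		"py-perf": "bundled py-perf playbook",
	}
	user := map[string]string{
		"jvm-gc":    "team-specific jvm-gc playbook", // overrides the bundled skill
		"kafka-lag": "custom playbook",               // hypothetical new skill
	}

	merged := mergeSkills(bundled, user)
	names := make([]string, 0, len(merged))
	for name := range merged {
		names = append(names, name)
	}
	sort.Strings(names) // map iteration order is random; sort for stable output
	for _, name := range names {
		fmt.Printf("%s: %s\n", name, merged[name])
	}
}
```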
| Skill | When it fires |
|---|---|
| cluster-recon | "What's running in my cluster right now?" topology dump. |
| incident-context | "What's burning right now?" Cross-references firing alerts with recent Argo CD syncs and pod restarts. |
| k8s-incident | First-pass triage for CrashLoopBackOff / Pending / OOMKilled / Eviction. |
| crashloop-deep-dive | Beyond exit codes: previous-container logs, probe audit, init-container ordering, traces. |
| oom-killed-triage | Container-limit vs. node-level OOM, sawtooth-vs-plateau working-set pattern, JVM heap flag check. |
| log-spike-correlation | Joins a Loki / ES error spike to Prom anomalies and pod events. |
| trace-error-pivot | Walk a p99 / error-rate regression down to the slow span in Tempo or Jaeger and back to the pod. |
| db-latency-hunt | PostgreSQL / MySQL / Redis read-only forensics for slow upstream DB calls. |
| prom-explorer | Interactive PromQL composition without prior knowledge of the metric schema. |
| jvm-gc | GC pause / heap-exhaustion / old-gen growth diagnosis. |
| jvm-thread | Deadlock, blocked threads, pool exhaustion. |
| py-perf | GIL contention, async-loop stalls, CPU bottlenecks. |
| gpu-saturation | GPU OOM, low utilization, thermal throttling. |
Bring your own key. Picked at /login, swappable mid-session with
/model <id>.
| Provider | Env var | Model prefix |
|---|---|---|
| Anthropic | ANTHROPIC_API_KEY | claude-* |
| OpenAI | OPENAI_API_KEY | gpt-*, o1-* |
| Google Gemini | GOOGLE_API_KEY | gemini-* |
| Moonshot / Kimi | MOONSHOT_API_KEY | kimi-* |
| OpenAI-compatible | OPENAI_BASE_URL | any |
OpenAI-compatible covers Ollama, vLLM, LM Studio, OpenRouter, and any
in-network gateway that speaks the same wire format. LLM adapters
honor HTTP_PROXY / HTTPS_PROXY for corporate egress.
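Read as a lookup, the provider table maps a model ID prefix to a provider and its environment variable. The Go sketch below encodes just that table. The resolveProvider helper and the example model IDs are hypothetical, and the fallback to the OpenAI-compatible row for unmatched IDs is an assumption, since the table only says that row accepts any model.

```go
package main

import (
	"fmt"
	"strings"
)

// provider mirrors one row of the table above.
type provider struct {
	Name     string
	EnvVar   string
	Prefixes []string
}

var providers = []provider{
	{Name: "Anthropic", EnvVar: "ANTHROPIC_API_KEY", Prefixes: []string{"claude-"}},
	{Name: "OpenAI", EnvVar: "OPENAI_API_KEY", Prefixes: []string{"gpt-", "o1-"}},
	{Name: "Google Gemini", EnvVar: "GOOGLE_API_KEY", Prefixes: []string{"gemini-"}},
	{Name: "Moonshot / Kimi", EnvVar: "MOONSHOT_API_KEY", Prefixes: []string{"kimi-"}},
}

// resolveProvider returns the provider whose prefix matches the model ID.
// Assumption: IDs that match no prefix fall through to the
// OpenAI-compatible endpoint configured via OPENAI_BASE_URL.
func resolveProvider(model string) provider {
	for _, p := range providers {
		for _, prefix := range p.Prefixes {
			if strings.HasPrefix(model, prefix) {
				return p
			}
		}
	}
	return provider{Name: "OpenAI-compatible", EnvVar: "OPENAI_BASE_URL"}
}

func main() {
	// Placeholder model IDs, not real model names.
	for _, model := range []string{"claude-example", "o1-example", "my-local-model"} {
		p := resolveProvider(model)
		fmt.Printf("%-16s -> %s (%s)\n", model, p.Name, p.EnvVar)
	}
}
```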
cloudy resolves its state directory in this order: $CLOUDY_HOME →
$XDG_CONFIG_HOME/cloudy → $HOME/.cloudy. Layout:
| Path | What |
|---|---|
| config.yaml | Clusters, backends, model, safety limits. Generated by /setup; hand-editing supported. |
| profile.yaml | Snapshot of the last /setup scan (discovered endpoints + selection state). |
| secrets | Dotenv-format API keys (mode 0600). Written by /login. |
| profiles/<name>.yaml | Permission profile bundles: tool/namespace allow-deny rules and field masking (passwords, tokens). |
| active_profile | Pointer to the currently selected permission profile (managed by cloudy profile use). |
See docs/PERMISSION_PROFILES.md for the permission-profile schema.
- docs/SAFETY.md — read-only guards, risk-rated approval gate, threat model
- docs/AUTO_DISCOVERY.md — what /setup probes, where, and how findings map to config
- docs/BASTION.md — deploying cloudy on a shared bastion (per-user state, systemd, proxy)
- docs/PERMISSION_PROFILES.md — profile schema, masking rules, per-session limits
- CHANGELOG.md — release notes
Pre-1.0. Build from source. Public API and config schema may shift between minor versions; pin a tag if that matters for you.
MIT.