This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
Praetor is an infrastructure-as-code repository for Galaxy LAN infrastructure, managing Kubernetes clusters on Proxmox using Talos Linux. The repository contains three main projects: Terraform for cluster provisioning, Helm for Kubernetes bootstrap, and DNS automation via UniFi Controller.
```shell
# First-time setup (required before any operations)
make cluster-init           # Initialize XDG config directory (~/.config/praetor-terraform/)
make cluster-select         # Interactively choose active cluster

# Core operations
make init                   # Initialize OpenTofu
make plan                   # Plan changes for selected cluster
make apply                  # Apply changes (interactive)
make apply-auto             # Apply with auto-approve

# Cluster management
make kubeconfig             # Download kubeconfig to ./kubeconfig
make k8s-status             # Show Kubernetes cluster status
make cluster-health-check   # Check cluster health

# VM operations
make vm-status-all          # Show all VM statuses
make vm-start-all           # Start all cluster VMs
make vm-stop-all            # Stop all cluster VMs

# Pod monitoring
make pods-all               # All pods in all namespaces
make pods-unhealthy         # Only unhealthy pods
```

Each component has standardized scripts:
```shell
cd helm/<component>
./install.sh   # Install or upgrade
./delete.sh    # Uninstall (if available)
```

Bootstrap order matters:
1. `storage/` - Local Path Provisioner (required first)
2. `kube-vip/` - LoadBalancer provider
3. `metrics-server/` - Resource metrics
4. `prometheus/` - Monitoring (before Traefik for CRDs)
5. `traefik/` - Ingress controller
6. `cert-manager/` - TLS certificates
7. `k8s_gateway/` - Automatic DNS
8. `argocd/` - GitOps (optional, infra clusters only)
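The ordering above can be sketched as a loop over the standard per-component `install.sh` scripts. This is an illustrative helper, not a script from the repo; it assumes each `helm/<component>` directory ships an `install.sh`, and it omits the optional `argocd` step.

```shell
# Sketch: run the bootstrap components in dependency order.
# Assumes the standardized helm/<component>/install.sh layout.
bootstrap_all() {
  local base="${1:-helm}" component
  for component in storage kube-vip metrics-server prometheus \
                   traefik cert-manager k8s_gateway; do
    # install.sh is install-or-upgrade, so re-running is safe
    ( cd "$base/$component" && ./install.sh ) || return 1
  done
}
```

Stopping on the first failure matters here because later components (e.g., Traefik) depend on CRDs and storage installed by earlier ones.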
Required tools: `tofu` (OpenTofu), `kubectl`, `talosctl`, `helm`, `jq`, `curl`, `bc`
```shell
make check-tools   # Verify all tools installed
```

The system uses the XDG Base Directory specification for multi-cluster support:

- `~/.config/praetor-terraform/global.conf` - Proxmox API settings
- `~/.config/praetor-terraform/clusters.conf` - Cluster definitions
- `~/.config/praetor-terraform/cluster-config.mk` - Active cluster
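As a hypothetical sketch of how a script could read the active cluster back out of the Makefile include: `CLUSTER_NAME` is an assumed variable name here, not taken from the repo — check the real `cluster-config.mk` for the actual one.

```shell
# Sketch: extract the active cluster from the .mk include.
# CLUSTER_NAME and the := assignment style are assumptions.
active_cluster() {
  sed -n 's/^CLUSTER_NAME[[:space:]]*[:?]*=[[:space:]]*//p' \
    "${1:-$HOME/.config/praetor-terraform/cluster-config.mk}"
}
```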
Clusters are defined in `terraform/main.tf` as modules (e.g., `module "prod_cluster"`).
- Gateway: `10.120.208.1`
- VM Subnet: `10.120.208.0/20`
- LoadBalancer Pool: `.205-.253` (49 IPs, kube-vip managed)
- DNS Server IP: `.253` (k8s_gateway)
- Domain pattern: `<service>.<cluster>.galaxy.lan`
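For example, the domain pattern composes like this (illustrative helper, not part of the repo):

```shell
# Compose the FQDN that k8s_gateway serves for a service on a cluster,
# following the <service>.<cluster>.galaxy.lan pattern.
svc_fqdn() {
  printf '%s.%s.galaxy.lan\n' "$1" "$2"
}

svc_fqdn grafana k8s-prod   # → grafana.k8s-prod.galaxy.lan
```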
All Helm scripts use `helm/libs/k8s_utils.sh`, which provides:

- `k8s_print_color` - Colorized output
- `k8s_create_namespace` - Idempotent namespace creation
- `k8s_install_or_upgrade_helm_release` - Complete Helm workflow
- Auto-detection of the cluster name from the kubectl context and the subnet from node IPs
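As a rough sketch of the kind of helper this library provides, a colorized-output function might look like the following. The real implementation lives in `helm/libs/k8s_utils.sh` and may differ; the color names and escape codes here are assumptions.

```shell
# Sketch of a k8s_print_color-style helper: wrap a message in an
# ANSI color escape sequence. Not the repo's actual implementation.
k8s_print_color_sketch() {
  local color="$1"; shift
  local code
  case "$color" in
    red)   code='0;31' ;;
    green) code='0;32' ;;
    *)     code='0'    ;;
  esac
  printf '\033[%sm%s\033[0m\n' "$code" "$*"
}
```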
Sensitive files are encrypted with git-crypt (via `.gitattributes` rules):

- `*.tfvars`, `*.tfstate*` - Infrastructure credentials
- `kubeconfig*`, `talosconfig*` - Cluster configs
- All `*secret*` and `*credentials*` files
```shell
git-crypt unlock   # Decrypt with authorized GPG key
git-crypt status   # Verify encryption status before commits
```

Talos Linux security model:

- No SSH access by default
- API-only configuration via `talosctl`
- Immutable filesystem
- `terraform/main.tf` - Cluster definitions using the `terraform-proxmox-talos` module
- `terraform/providers.tf` - Proxmox, Talos, UniFi, and Kubernetes providers
- `terraform/Makefile` - 500+ lines of automation commands
- `helm/libs/k8s_utils.sh` - Shared Bash utilities for all Helm scripts
| Cluster | Nodes | Subnet | Purpose |
|---|---|---|---|
| k8s-prod | 3 CP + 3 workers | 10.120.208.x | Production workloads |
| k8s-infra | 3 CP + 2 workers | 10.100.20.x | ArgoCD, infrastructure services |
| k8s-dev | 3 CP + 3 workers | TBD | Development/testing |
```shell
# Check cluster status
kubectl --context=k8s-prod get nodes -o wide
kubectl --context=k8s-infra get nodes -o wide

# Check firing alerts
kubectl --context=k8s-prod -n monitoring exec -it deploy/kube-prometheus-stack-operator -- \
  wget -qO- 'http://kube-prometheus-stack-prometheus:9090/api/v1/alerts' | \
  jq -r '.data.alerts[] | select(.state=="firing") | "[\(.labels.severity)] \(.labels.alertname)"'

# Check unhealthy pods
kubectl --context=k8s-prod get pods -A | grep -v Running | grep -v Completed
```

The Prometheus stack is deployed in the `monitoring` namespace on both clusters:
- Prometheus: `kube-prometheus-stack-prometheus:9090`
- Alertmanager: `kube-prometheus-stack-alertmanager:9093`
- Grafana: `kube-prometheus-stack-grafana:80`
Common alerts to expect:
- `Watchdog` - Always firing; proves the alerting pipeline works
- `InfoInhibitor` - Inhibits noisy info alerts; expected
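Since those two alerts are always present, a filter that drops them leaves only actionable alerts. This is a sketch using the same `jq` shape as the alert query above (`IN` requires jq 1.5+); the function name is illustrative, not from the repo.

```shell
# Sketch: keep only firing alerts that are not the two always-on
# alerts (Watchdog, InfoInhibitor). Reads alert JSON on stdin.
filter_actionable_alerts() {
  jq -r '.data.alerts[]
         | select(.state == "firing")
         | select(.labels.alertname | IN("Watchdog", "InfoInhibitor") | not)
         | "[\(.labels.severity)] \(.labels.alertname)"'
}
```

Pipe the output of the Prometheus `/api/v1/alerts` query through this instead of the plain `select(.state=="firing")` filter.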
- Location: `/Users/james.maes/Git.Local/dmd/me-health-portal-cd-pipeline/`
- GitHub: `git@github.com:dmdbrands/me-health-portal-cd-pipeline.git`
- ArgoCD App: `me-health-portal-uat` (on k8s-infra)
Kustomize-based CD pipeline for the me-health-portal application. Key files:

- `base/` - Base manifests for all components
- `overlays/uat/` - UAT environment overrides
- `secrets/` - Secret templates (sealed with kubeseal)
| Component | Type | Notes |
|---|---|---|
| me-health-portal | Deployment | Main app, HPA fixed at 1 replica (needs Redis for sessions) |
| mh-opensearch | StatefulSet | Log storage, 10Gi PVC, 90-day ISM retention |
| mh-graylog | StatefulSet | Log aggregation |
| mh-fluentbit | DaemonSet | Log collection (excludes graylog/opensearch/fluentbit logs) |
| mh-rabbitmq | StatefulSet | Message queue, 1000m CPU limit |
| mh-postgres | StatefulSet | PostgreSQL cluster (3 replicas) |
| mh-redis-sentinel | StatefulSet | Redis (3 replicas) |
| mh-mongodb | StatefulSet | MongoDB |
| mh-authentik | Deployment | Authentication |
FluentBit Log Loop (RESOLVED)
- Problem: FluentBit collected Graylog's own logs and sent them back to Graylog, creating a feedback loop
- Symptom: 29M docs in 100 minutes, ~85GB/day ingestion
- Fix: Added `Exclude_Path` in the FluentBit config for graylog/opensearch/fluentbit containers
- Result: Reduced to ~318 docs/min (~92MB/day)
OpenSearch Retention
- ISM policy `graylog-retention` configured for 90-day retention
- Auto-deletes `graylog_*` indices older than 90 days
- 10Gi PVC is sufficient for ~100 days at the current ingest rate
HPA MaxedOut Alert
- me-health-portal HPA has minReplicas=maxReplicas=1 intentionally
- Cannot scale until Redis session caching is implemented
- Alert will persist until then
Session state is stored in `./docs/SESSION-STATE.md`. To continue from a previous session:

- Read `./docs/SESSION-STATE.md` for context
- Read `./docs/TODO.md` for pending tasks
- Check `git status` for any uncommitted work