
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

Praetor is an infrastructure-as-code repository for Galaxy LAN infrastructure, managing Kubernetes clusters on Proxmox using Talos Linux. The repository contains three main projects: Terraform for cluster provisioning, Helm for Kubernetes bootstrap, and DNS automation via UniFi Controller.

## Build & Development Commands

### Terraform (primary, in `/terraform/`)

```bash
# First-time setup (required before any operations)
make cluster-init       # Initialize XDG config directory (~/.config/praetor-terraform/)
make cluster-select     # Interactively choose active cluster

# Core operations
make init               # Initialize OpenTofu
make plan               # Plan changes for selected cluster
make apply              # Apply changes (interactive)
make apply-auto         # Apply with auto-approve

# Cluster management
make kubeconfig         # Download kubeconfig to ./kubeconfig
make k8s-status         # Show Kubernetes cluster status
make cluster-health-check  # Check cluster health

# VM operations
make vm-status-all      # Show all VM statuses
make vm-start-all       # Start all cluster VMs
make vm-stop-all        # Stop all cluster VMs

# Pod monitoring
make pods-all           # All pods in all namespaces
make pods-unhealthy     # Only unhealthy pods
```

### Helm (in `/helm/`)

Each component has standardized scripts:

```bash
cd helm/<component>
./install.sh            # Install or upgrade
./delete.sh             # Uninstall (if available)
```

Bootstrap order matters:

  1. `storage/` - Local Path Provisioner (required first)
  2. `kube-vip/` - LoadBalancer provider
  3. `metrics-server/` - Resource metrics
  4. `prometheus/` - Monitoring (deploy before Traefik so its CRDs are available)
  5. `traefik/` - Ingress controller
  6. `cert-manager/` - TLS certificates
  7. `k8s_gateway/` - Automatic DNS
  8. `argocd/` - GitOps (optional, infra clusters only)
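The ordering above can be expressed as a simple loop. This is a hypothetical wrapper, not a script that exists in the repo; the real per-component invocation is left commented out:

```bash
#!/usr/bin/env bash
# Hypothetical bootstrap wrapper -- the list mirrors the documented install order.
set -euo pipefail

components="storage kube-vip metrics-server prometheus traefik cert-manager k8s_gateway"

for c in $components; do
  echo "bootstrapping helm/$c"
  # (cd "helm/$c" && ./install.sh)   # real invocation, commented out in this sketch
done
```

`argocd/` is omitted from the loop since it is optional and infra-cluster-only.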

## Prerequisites

Required tools: `tofu` (OpenTofu), `kubectl`, `talosctl`, `helm`, `jq`, `curl`, `bc`

```bash
make check-tools        # Verify all tools installed
```

## Architecture

### Multi-Cluster Configuration

The system follows the XDG Base Directory specification for multi-cluster support:

  • `~/.config/praetor-terraform/global.conf` - Proxmox API settings
  • `~/.config/praetor-terraform/clusters.conf` - Cluster definitions
  • `~/.config/praetor-terraform/cluster-config.mk` - Active cluster

Clusters are defined in `terraform/main.tf` as modules (e.g., `module "prod_cluster"`).

### Network Layout

  • Gateway: 10.120.208.1
  • VM Subnet: 10.120.208.0/20
  • LoadBalancer Pool: .205-.253 (49 IPs, kube-vip managed)
  • DNS Server IP: .253 (k8s_gateway)
  • Domain pattern: `<service>.<cluster>.galaxy.lan`
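The domain pattern expands as shown below; the service and cluster names here are illustrative, and the `dig` query against the k8s_gateway DNS IP is left commented out:

```bash
service="traefik"
cluster="k8s-prod"
fqdn="${service}.${cluster}.galaxy.lan"
echo "$fqdn"   # traefik.k8s-prod.galaxy.lan
# dig +short "$fqdn" @10.120.208.253   # resolve via k8s_gateway
```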

### Helm Shared Utilities

All Helm scripts use `helm/libs/k8s_utils.sh`, which provides:

  • `k8s_print_color` - Colorized output
  • `k8s_create_namespace` - Idempotent namespace creation
  • `k8s_install_or_upgrade_helm_release` - Complete Helm workflow
  • Auto-detection of the cluster name from the kubectl context and the subnet from node IPs
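A component's `install.sh` typically follows the shape below. This is a sketch only: the helper signatures are assumptions, not verified against `k8s_utils.sh`, and stub definitions stand in for the shared library so the sketch runs on its own:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Real scripts source the shared library instead of defining stubs:
# source "$(dirname "$0")/../libs/k8s_utils.sh"

# Stubs standing in for the shared helpers (names from the docs, signatures assumed):
k8s_print_color() { echo "$2"; }                       # color, message
k8s_create_namespace() { echo "namespace $1 ready"; }  # idempotent create

NAMESPACE="traefik"   # hypothetical component namespace
k8s_print_color green "Installing into $NAMESPACE"
k8s_create_namespace "$NAMESPACE"
# k8s_install_or_upgrade_helm_release ...   # full Helm workflow (arguments assumed)
```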

## Security

### Git-Crypt Encrypted Files

Sensitive files are encrypted with git-crypt (rules defined in `.gitattributes`):

  • `*.tfvars`, `*.tfstate*` - Infrastructure credentials
  • `kubeconfig*`, `talosconfig*` - Cluster configs
  • All `*secret*` and `*credentials*` files

```bash
git-crypt unlock        # Decrypt with an authorized GPG key
git-crypt status        # Verify encryption status before commits
```
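As a pre-commit sanity check, you can grep the status output for sensitive files that are not encrypted. The sketch below substitutes sample text for the real command (commented out), and the exact status line format is an assumption:

```bash
# status_output="$(git-crypt status)"   # real invocation
status_output='    encrypted: terraform/prod.tfvars
not encrypted: README.md'

# Flag any file matching the sensitive patterns that shows as unencrypted:
if echo "$status_output" | grep -E 'not encrypted: .*(tfvars|tfstate|kubeconfig|talosconfig|secret|credentials)'; then
  echo "WARNING: sensitive file is unencrypted -- do not commit" >&2
fi
```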

### Talos Linux Security Model

  • No SSH access by default
  • API-only configuration via `talosctl`
  • Immutable filesystem

## Key Files

  • `terraform/main.tf` - Cluster definitions using the terraform-proxmox-talos module
  • `terraform/providers.tf` - Proxmox, Talos, UniFi, and Kubernetes providers
  • `terraform/Makefile` - 500+ lines of automation commands
  • `helm/libs/k8s_utils.sh` - Shared Bash utilities for all Helm scripts

## Cluster Operations

### Available Clusters

| Cluster | Nodes | Subnet | Purpose |
| --- | --- | --- | --- |
| k8s-prod | 3 CP + 3 workers | 10.120.208.x | Production workloads |
| k8s-infra | 3 CP + 2 workers | 10.100.20.x | ArgoCD, infrastructure services |
| k8s-dev | 3 CP + 3 workers | TBD | Development/testing |

### Quick Status Commands

```bash
# Check cluster status
kubectl --context=k8s-prod get nodes -o wide
kubectl --context=k8s-infra get nodes -o wide

# Check firing alerts
kubectl --context=k8s-prod -n monitoring exec -it deploy/kube-prometheus-stack-operator -- \
  wget -qO- 'http://kube-prometheus-stack-prometheus:9090/api/v1/alerts' | \
  jq -r '.data.alerts[] | select(.state=="firing") | "[\(.labels.severity)] \(.labels.alertname)"'

# Check unhealthy pods
kubectl --context=k8s-prod get pods -A | grep -v Running | grep -v Completed
```

### Prometheus/Alerting

The Prometheus stack is deployed in the `monitoring` namespace on both clusters:

  • Prometheus: `kube-prometheus-stack-prometheus:9090`
  • Alertmanager: `kube-prometheus-stack-alertmanager:9093`
  • Grafana: `kube-prometheus-stack-grafana:80`

Common alerts to expect:

  • `Watchdog` - Always firing; proves the alerting pipeline works
  • `InfoInhibitor` - Inhibits noisy info-level alerts; expected

## Related Repositories

### me-health-portal-cd-pipeline

  • Location: `/Users/james.maes/Git.Local/dmd/me-health-portal-cd-pipeline/`
  • GitHub: `git@github.com:dmdbrands/me-health-portal-cd-pipeline.git`
  • ArgoCD App: `me-health-portal-uat` (on k8s-infra)

Kustomize-based CD pipeline for me-health-portal application. Key files:

  • `base/` - Base manifests for all components
  • `overlays/uat/` - UAT environment overrides
  • `secrets/` - Secret templates (sealed with kubeseal)

### me-health-portal Components

| Component | Type | Notes |
| --- | --- | --- |
| me-health-portal | Deployment | Main app; HPA fixed at 1 replica (needs Redis for sessions) |
| mh-opensearch | StatefulSet | Log storage, 10Gi PVC, 90-day ISM retention |
| mh-graylog | StatefulSet | Log aggregation |
| mh-fluentbit | DaemonSet | Log collection (excludes graylog/opensearch/fluentbit logs) |
| mh-rabbitmq | StatefulSet | Message queue, 1000m CPU limit |
| mh-postgres | StatefulSet | PostgreSQL cluster (3 replicas) |
| mh-redis-sentinel | StatefulSet | Redis (3 replicas) |
| mh-mongodb | StatefulSet | MongoDB |
| mh-authentik | Deployment | Authentication |

## Known Issues & Solutions

### FluentBit Log Loop (RESOLVED)

  • Problem: FluentBit collected Graylog's own container logs and shipped them back to Graylog, creating a feedback loop
  • Symptom: 29M documents indexed in 100 minutes, ~85GB/day ingestion
  • Fix: Added `Exclude_Path` to the FluentBit config for the graylog/opensearch/fluentbit containers
  • Result: Reduced to ~318 docs/min (~92MB/day)
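The numbers above work out to roughly a 900x reduction in ingestion rate (integer arithmetic, order-of-magnitude only):

```bash
before=$((29000000 / 100))   # ~290,000 docs/min during the loop
after=318                    # docs/min after the Exclude_Path fix
factor=$((before / after))
echo "reduced by ~${factor}x"   # ~911x
```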

### OpenSearch Retention

  • ISM policy `graylog-retention` configured for 90-day retention
  • Auto-deletes `graylog_*` indices older than 90 days
  • 10Gi PVC sufficient for ~100 days at current rate
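The "~100 days" headroom claim checks out as back-of-envelope arithmetic at the post-fix rate of ~92MB/day:

```bash
mb_per_day=92
days=100
total_gb=$((mb_per_day * days / 1024))   # integer GB
echo "${total_gb} GB over ${days} days"  # ~8 GB, comfortably under the 10Gi PVC
```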

### HPA MaxedOut Alert

  • The me-health-portal HPA has `minReplicas=maxReplicas=1` intentionally
  • Cannot scale until Redis session caching is implemented
  • Alert will persist until then

## Session Continuity

Session state is stored in `./docs/SESSION-STATE.md`. To continue from a previous session:

  1. Read `./docs/SESSION-STATE.md` for context
  2. Read `./docs/TODO.md` for pending tasks
  3. Check `git status` for any uncommitted work
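The resume steps above can be sketched as a small shell check; the paths come from the list, and `git status` is left commented out so the sketch runs anywhere:

```bash
for f in ./docs/SESSION-STATE.md ./docs/TODO.md; do
  if [ -f "$f" ]; then
    echo "read: $f"
  else
    echo "missing: $f"
  fi
done
# git status --short   # then review any uncommitted work
```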