
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

Praetor is an infrastructure-as-code repository for Galaxy LAN infrastructure, managing Kubernetes clusters on Proxmox using Talos Linux. The repository contains three main projects: Terraform for cluster provisioning, Helm for Kubernetes bootstrap, and DNS automation via UniFi Controller.

## Build & Development Commands

### Terraform (primary, in `/terraform/`)

```bash
# First-time setup (required before any operations)
make cluster-init       # Initialize XDG config directory (~/.config/praetor-terraform/)
make cluster-select     # Interactively choose active cluster

# Core operations
make init               # Initialize OpenTofu
make plan               # Plan changes for selected cluster
make apply              # Apply changes (interactive)
make apply-auto         # Apply with auto-approve

# Cluster management
make kubeconfig         # Download kubeconfig to ./kubeconfig
make k8s-status         # Show Kubernetes cluster status
make cluster-health-check  # Check cluster health

# VM operations
make vm-status-all      # Show all VM statuses
make vm-start-all       # Start all cluster VMs
make vm-stop-all        # Stop all cluster VMs

# Pod monitoring
make pods-all           # All pods in all namespaces
make pods-unhealthy     # Only unhealthy pods
```

### Helm (in `/helm/`)

Each component has standardized scripts:

```bash
cd helm/<component>
./install.sh            # Install or upgrade
./delete.sh             # Uninstall (if available)
```

Bootstrap order matters:

  1. `storage/` - Local Path Provisioner (required first)
  2. `kube-vip/` - LoadBalancer provider
  3. `metrics-server/` - Resource metrics
  4. `prometheus/` - Monitoring (deploy before Traefik so its CRDs are available)
  5. `traefik/` - Ingress controller
  6. `cert-manager/` - TLS certificates
  7. `k8s_gateway/` - Automatic DNS
  8. `argocd/` - GitOps (optional, infra clusters only)
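The ordering above can be expressed as a simple loop. This is a hypothetical wrapper, not a script that exists in the repo; the real per-component invocation is left commented out:

```bash
#!/usr/bin/env bash
# Hypothetical bootstrap wrapper -- the list mirrors the documented install order.
set -euo pipefail

components="storage kube-vip metrics-server prometheus traefik cert-manager k8s_gateway"

for c in $components; do
  echo "bootstrapping helm/$c"
  # (cd "helm/$c" && ./install.sh)   # real invocation, commented out in this sketch
done
```

`argocd/` is omitted from the loop since it is optional and infra-cluster-only.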

## Prerequisites

Required tools: `tofu` (OpenTofu), `kubectl`, `talosctl`, `helm`, `jq`, `curl`, `bc`

```bash
make check-tools        # Verify all tools installed
```

## Architecture

### Multi-Cluster Configuration

The system follows the XDG Base Directory specification for multi-cluster support:

  • `~/.config/praetor-terraform/global.conf` - Proxmox API settings
  • `~/.config/praetor-terraform/clusters.conf` - Cluster definitions
  • `~/.config/praetor-terraform/cluster-config.mk` - Active cluster

Clusters are defined in `terraform/main.tf` as modules (e.g., `module "prod_cluster"`).

### Network Layout

  • Gateway: 10.120.208.1
  • VM Subnet: 10.120.208.0/20
  • LoadBalancer Pool: .205-.253 (49 IPs, kube-vip managed)
  • DNS Server IP: .253 (k8s_gateway)
  • Domain pattern: `<service>.<cluster>.galaxy.lan`
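The domain pattern expands as shown below; the service and cluster names here are illustrative, and the `dig` query against the k8s_gateway DNS IP is left commented out:

```bash
service="traefik"
cluster="k8s-prod"
fqdn="${service}.${cluster}.galaxy.lan"
echo "$fqdn"   # traefik.k8s-prod.galaxy.lan
# dig +short "$fqdn" @10.120.208.253   # resolve via k8s_gateway
```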

### Helm Shared Utilities

All Helm scripts use `helm/libs/k8s_utils.sh`, which provides:

  • `k8s_print_color` - Colorized output
  • `k8s_create_namespace` - Idempotent namespace creation
  • `k8s_install_or_upgrade_helm_release` - Complete Helm workflow
  • Auto-detection of the cluster name from the kubectl context and the subnet from node IPs
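A component's `install.sh` typically follows the shape below. This is a sketch only: the helper signatures are assumptions, not verified against `k8s_utils.sh`, and stub definitions stand in for the shared library so the sketch runs on its own:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Real scripts source the shared library instead of defining stubs:
# source "$(dirname "$0")/../libs/k8s_utils.sh"

# Stubs standing in for the shared helpers (names from the docs, signatures assumed):
k8s_print_color() { echo "$2"; }                       # color, message
k8s_create_namespace() { echo "namespace $1 ready"; }  # idempotent create

NAMESPACE="traefik"   # hypothetical component namespace
k8s_print_color green "Installing into $NAMESPACE"
k8s_create_namespace "$NAMESPACE"
# k8s_install_or_upgrade_helm_release ...   # full Helm workflow (arguments assumed)
```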

## Security

### Git-Crypt Encrypted Files

Sensitive files are encrypted with git-crypt (rules defined in `.gitattributes`):

  • `*.tfvars`, `*.tfstate*` - Infrastructure credentials
  • `kubeconfig*`, `talosconfig*` - Cluster configs
  • All `*secret*` and `*credentials*` files

```bash
git-crypt unlock        # Decrypt with an authorized GPG key
git-crypt status        # Verify encryption status before commits
```
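As a pre-commit sanity check, you can grep the status output for sensitive files that are not encrypted. The sketch below substitutes sample text for the real command (commented out), and the exact status line format is an assumption:

```bash
# status_output="$(git-crypt status)"   # real invocation
status_output='    encrypted: terraform/prod.tfvars
not encrypted: README.md'

# Flag any file matching the sensitive patterns that shows as unencrypted:
if echo "$status_output" | grep -E 'not encrypted: .*(tfvars|tfstate|kubeconfig|talosconfig|secret|credentials)'; then
  echo "WARNING: sensitive file is unencrypted -- do not commit" >&2
fi
```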

### Talos Linux Security Model

  • No SSH access by default
  • API-only configuration via `talosctl`
  • Immutable filesystem

## Key Files

  • `terraform/main.tf` - Cluster definitions using the terraform-proxmox-talos module
  • `terraform/providers.tf` - Proxmox, Talos, UniFi, and Kubernetes providers
  • `terraform/Makefile` - 500+ lines of automation commands
  • `helm/libs/k8s_utils.sh` - Shared Bash utilities for all Helm scripts

## Cluster Operations

### Available Clusters

| Cluster | Nodes | Subnet | Purpose |
| --- | --- | --- | --- |
| k8s-prod | 3 CP + 3 workers | 10.120.208.x | Production workloads |
| k8s-infra | 3 CP + 2 workers | 10.100.20.x | ArgoCD, infrastructure services |
| k8s-dev | 3 CP + 3 workers | TBD | Development/testing |

### Quick Status Commands

```bash
# Check cluster status
kubectl --context=k8s-prod get nodes -o wide
kubectl --context=k8s-infra get nodes -o wide

# Check firing alerts
kubectl --context=k8s-prod -n monitoring exec -it deploy/kube-prometheus-stack-operator -- \
  wget -qO- 'http://kube-prometheus-stack-prometheus:9090/api/v1/alerts' | \
  jq -r '.data.alerts[] | select(.state=="firing") | "[\(.labels.severity)] \(.labels.alertname)"'

# Check unhealthy pods
kubectl --context=k8s-prod get pods -A | grep -v Running | grep -v Completed
```

### Prometheus/Alerting

The Prometheus stack is deployed in the `monitoring` namespace on both clusters:

  • Prometheus: `kube-prometheus-stack-prometheus:9090`
  • Alertmanager: `kube-prometheus-stack-alertmanager:9093`
  • Grafana: `kube-prometheus-stack-grafana:80`

Common alerts to expect:

  • `Watchdog` - Always firing; proves the alerting pipeline works
  • `InfoInhibitor` - Inhibits noisy info-level alerts; expected

## Related Repositories

### me-health-portal-cd-pipeline

  • Location: `/Users/james.maes/Git.Local/dmd/me-health-portal-cd-pipeline/`
  • GitHub: `git@github.com:dmdbrands/me-health-portal-cd-pipeline.git`
  • ArgoCD App: `me-health-portal-uat` (on k8s-infra)

Kustomize-based CD pipeline for me-health-portal application. Key files:

  • `base/` - Base manifests for all components
  • `overlays/uat/` - UAT environment overrides
  • `secrets/` - Secret templates (sealed with kubeseal)

### me-health-portal Components

| Component | Type | Notes |
| --- | --- | --- |
| me-health-portal | Deployment | Main app; HPA fixed at 1 replica (needs Redis for sessions) |
| mh-opensearch | StatefulSet | Log storage, 10Gi PVC, 90-day ISM retention |
| mh-graylog | StatefulSet | Log aggregation |
| mh-fluentbit | DaemonSet | Log collection (excludes graylog/opensearch/fluentbit logs) |
| mh-rabbitmq | StatefulSet | Message queue, 1000m CPU limit |
| mh-postgres | StatefulSet | PostgreSQL cluster (3 replicas) |
| mh-redis-sentinel | StatefulSet | Redis (3 replicas) |
| mh-mongodb | StatefulSet | MongoDB |
| mh-authentik | Deployment | Authentication |

## Known Issues & Solutions

### FluentBit Log Loop (RESOLVED)

  • Problem: FluentBit collected Graylog's own container logs and shipped them back to Graylog, creating a feedback loop
  • Symptom: 29M documents indexed in 100 minutes, ~85GB/day ingestion
  • Fix: Added `Exclude_Path` to the FluentBit config for the graylog/opensearch/fluentbit containers
  • Result: Reduced to ~318 docs/min (~92MB/day)
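The numbers above work out to roughly a 900x reduction in ingestion rate (integer arithmetic, order-of-magnitude only):

```bash
before=$((29000000 / 100))   # ~290,000 docs/min during the loop
after=318                    # docs/min after the Exclude_Path fix
factor=$((before / after))
echo "reduced by ~${factor}x"   # ~911x
```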

### OpenSearch Retention

  • ISM policy `graylog-retention` configured for 90-day retention
  • Auto-deletes `graylog_*` indices older than 90 days
  • 10Gi PVC sufficient for ~100 days at current rate
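The "~100 days" headroom claim checks out as back-of-envelope arithmetic at the post-fix rate of ~92MB/day:

```bash
mb_per_day=92
days=100
total_gb=$((mb_per_day * days / 1024))   # integer GB
echo "${total_gb} GB over ${days} days"  # ~8 GB, comfortably under the 10Gi PVC
```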

### HPA MaxedOut Alert

  • The me-health-portal HPA has `minReplicas=maxReplicas=1` intentionally
  • Cannot scale until Redis session caching is implemented
  • Alert will persist until then

## Session Continuity

Session state is stored in `./docs/SESSION-STATE.md`. To continue from a previous session:

  1. Read `./docs/SESSION-STATE.md` for context
  2. Read `./docs/TODO.md` for pending tasks
  3. Check `git status` for any uncommitted work
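The resume steps above can be sketched as a small shell check; the paths come from the list, and `git status` is left commented out so the sketch runs anywhere:

```bash
for f in ./docs/SESSION-STATE.md ./docs/TODO.md; do
  if [ -f "$f" ]; then
    echo "read: $f"
  else
    echo "missing: $f"
  fi
done
# git status --short   # then review any uncommitted work
```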