A note at the top of this repository's README.md clarifies that most users are probably looking to install the full AMD Enterprise AI Suite rather than use this repository on its own. The note links to the official Enterprise on-premise installation guide.
Cluster-Forge is a Kubernetes platform automation tool that bundles third-party, community, and in-house components into a single, GitOps-managed stack deployable in Kubernetes clusters. It automates the deployment of a complete AI/ML compute platform built on AMD Enterprise AI Suite components, delivering consistent, production-ready clusters with all essential services pre-configured and integrated.
The platform uses ArgoCD's app-of-apps pattern with a sophisticated bootstrap process that establishes GitOps infrastructure (ArgoCD, Gitea, OpenBao) before deploying the complete application stack.
- AI/ML Engineers - Unified platform for model training, serving, and orchestration
- Platform Engineers - Infrastructure automation with GitOps patterns
- DevOps Engineers - Consistent deployment across environments
- Infrastructure Engineers - Multi-cluster management and operations
- Site Reliability Engineers - Observability and reliability tooling
- Research Teams - Ephemeral test clusters for experimentation
Cluster-Forge uses a three-phase bootstrap process that establishes GitOps infrastructure before deploying applications:
Phase 1: Pre-Cleanup
- Detects and removes previous installations when gitea-init-job completed successfully
- Deletes Gitea resources, OpenBao init jobs, and temporary files
- Ensures clean state for fresh deployments
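The detection step in Phase 1 could be approximated as follows. This is only a sketch: the helper name `previous_install_completed` is hypothetical, and placing gitea-init-job in the cf-gitea namespace is an assumption; the real bootstrap.sh may check this differently.

```shell
#!/bin/sh
# Hypothetical helper: decide whether pre-cleanup should run, given the
# value of .status.succeeded reported for the gitea-init-job.
previous_install_completed() {
    succeeded="$1"
    case "$succeeded" in
        ''|*[!0-9]*) return 1 ;;   # empty or non-numeric: treat as "not completed"
    esac
    [ "$succeeded" -ge 1 ]
}

# Example wiring (needs cluster access; the namespace is an assumption):
#   status=$(kubectl get job gitea-init-job -n cf-gitea \
#              -o jsonpath='{.status.succeeded}' 2>/dev/null)
#   if previous_install_completed "$status"; then echo "running pre-cleanup"; fi
```

Keeping the decision in a small function that takes the status as an argument makes it possible to exercise the logic without a live cluster.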
Phase 2: GitOps Foundation Bootstrap (Manual Helm Templates)
- ArgoCD (v8.3.5) - GitOps controller deployed via helm template + kubectl apply
- Gitea (v12.3.0) - Git server with init job to create cluster-forge and cluster-values repositories
Phase 3: App-of-Apps Deployment (ArgoCD-Managed)
- Creates cluster-forge Application pointing to root/ helm chart
- ArgoCD syncs all remaining applications including OpenBao from enabledApps list
- Applications deployed in wave order (-70 to 0) based on dependencies
- OpenBao (v0.18.2) deployed via ArgoCD with openbao-init job for vault configuration
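To illustrate wave ordering, the sketch below sorts a few applications by sync wave, lowest first, which is the order ArgoCD processes waves in. The wave numbers are the ones quoted in this document; the function name `order_by_sync_wave` and the "name wave" input format are assumptions made for the example.

```shell
#!/bin/sh
# Print app names in ascending sync-wave order.
# Input: lines of "<app> <syncWave>" on stdin.
order_by_sync_wave() {
    sort -k2,2n | awk '{ print $1 }'
}

printf '%s\n' \
    'airm 0' \
    'external-secrets -40' \
    'openbao -70' \
    'openbao-init -50' | order_by_sync_wave
# prints: openbao, openbao-init, external-secrets, airm (one per line)
```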
Cluster-Forge supports flexible GitOps repository configurations:
Local Mode (Default) - Self-contained cluster-native GitOps:
- clusterForge.repoUrl - Points to local Gitea (http://gitea-http.cf-gitea.svc:3000/cluster-org/cluster-forge.git)
- externalValues.enabled: true - Separate cluster-values repository for configuration
- Initialization handled by gitea-init-job, which clones and pushes repositories from the initial-cf-values ConfigMap
- Zero external dependencies once bootstrapped
External Mode - Traditional GitHub-based GitOps:
- Set clusterForge.repoUrl to an external GitHub repository
- Supports custom branch selection for testing and development
Three cluster profiles with inheritance-based resource optimization:
Small Clusters (1-5 users, dev/test):
- Single replica deployments (ArgoCD, Redis, etc.)
- Reduced resource limits (ArgoCD controller: 2 CPU, 2Gi RAM)
- Adds kyverno-policies-storage-local-path for RWX→RWO PVC mutation
- MinIO tenant: 2Ti storage, single server
- Mix of local-path and direct storage classes
- Suitable for: Local workstations, development environments
Medium Clusters (5-20 users, team production):
- Single replica with moderate resource allocation
- Same storage policies as small (local-path support)
- ArgoCD controller: 1 CPU, 2Gi RAM
- MinIO tenant: 2Ti storage
- Uses direct storage class consistently
- Suitable for: Small teams, staging environments
Large Clusters (10s-100s users, enterprise scale):
- OpenBao HA: 3 replicas with Raft consensus
- No local-path policies (assumes distributed storage like Longhorn)
- MinIO tenant: 500Gi storage
- Production-grade resource allocation
- Uses direct storage class for all persistent volumes
- Suitable for: Production deployments, multi-tenant environments
Size configurations use YAML merge semantics where size-specific values override base values.yaml settings.
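A minimal sketch of how a size name might be validated and mapped to its override file. The file names follow the root/ layout described in this document; the function name and the `CLUSTER_SIZE` environment variable are assumptions made for the example.

```shell
#!/bin/sh
# Map a cluster size to its override file; reject anything else.
values_file_for_size() {
    case "$1" in
        small|medium|large)
            printf 'root/values_%s.yaml\n' "$1" ;;
        *)
            echo "invalid cluster size: $1 (expected small|medium|large)" >&2
            return 1 ;;
    esac
}

# Fall back to medium when no size is given (CLUSTER_SIZE is hypothetical).
values_file_for_size "${CLUSTER_SIZE:-medium}"
```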
Cluster-Forge root chart generates ArgoCD Application manifests from:
- enabledApps[] - List of applications to deploy (defined in size-specific values files)
- apps.<name> - Configuration for each application including:
  - path - Relative path in sources/ directory
  - namespace - Target Kubernetes namespace
  - syncWave - Deployment order (-70 to 0)
  - valuesObject - Inline Helm values
  - helmParameters - Templated Helm parameters (e.g., domain injection)
  - ignoreDifferences - ArgoCD diff exclusions
Size-Specific Application Sets:
- Small clusters: 46 enabled applications including storage-local-path policies
- Medium clusters: 47 enabled applications including storage-local-path policies and openbao-init
- Large clusters: 45 enabled applications excluding storage-local-path policies
The cluster-forge Application uses ArgoCD's multi-source feature when externalValues.enabled=true:
- Source 1: cluster-forge repo (root/ helm chart)
- Source 2: cluster-values repo (custom values.yaml)
- Merges: base values.yaml + size values + external cluster-values/values.yaml
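The two-source layout can be pictured with the sketch below, which prints an ArgoCD Application manifest using the `sources` list and a `ref` source for the external values. It is an illustration rather than the chart's actual template: the function name, the cluster-values repository URL, the `main` revision, the `default` project, and the choice of values_medium.yaml are all assumptions.

```shell
#!/bin/sh
# Hypothetical renderer for a two-source ArgoCD Application.
# $1: cluster-forge repo URL, $2: cluster-values repo URL, $3: git revision.
render_cluster_forge_app() {
    cat <<EOF
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: cluster-forge
  namespace: argocd
spec:
  project: default
  sources:
    - repoURL: $1
      targetRevision: $3
      path: root
      helm:
        valueFiles:
          - values.yaml
          - values_medium.yaml
          - \$values/values.yaml
    - repoURL: $2
      targetRevision: $3
      ref: values
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
EOF
}

render_cluster_forge_app \
    http://gitea-http.cf-gitea.svc:3000/cluster-org/cluster-forge.git \
    http://gitea-http.cf-gitea.svc:3000/cluster-org/cluster-values.git \
    main
```

Helm applies value files in the order listed, so the external values.yaml (resolved through the `values` ref) overrides the size file, which in turn overrides the base file.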
Layer 1: GitOps Foundation (Bootstrap + Sync Wave -70 to -30)
- ArgoCD 8.3.5 - GitOps continuous deployment controller (bootstrap)
- Gitea 12.3.0 - Self-hosted Git server with SQLite backend (bootstrap)
- OpenBao 0.18.2 - Vault-compatible secrets management (ArgoCD-managed, sync wave -70)
- External Secrets 0.15.1 - Secrets synchronization operator (sync wave -40)
Layer 2: Core Infrastructure (Sync Wave -5 to -2)
Networking:
- Gateway API v1.3.0 - Kubernetes standard ingress API
- KGateway v2.1.0-main - Gateway API implementation with custom WebSocket support
- MetalLB v0.15.2 - Bare metal load balancer
- Cert-Manager v1.18.2 - Automated TLS certificate management
Policy & Security:
- Kyverno 3.5.1 - Policy engine for admission control
- Kyverno Config - OIDC integration, policy configurations
- Kyverno Policies Base - Core security policies
- Kyverno Policies Storage-Local-Path - Access mode mutation (small/medium only)
- Cluster-Auth 0.5.0 - Kubernetes RBAC integration
Storage & Database:
- CNPG Operator 0.26.0 - CloudNativePG PostgreSQL operator
- MinIO Operator 7.1.1 - S3-compatible object storage operator
- MinIO Tenant 7.1.1 - Tenant deployment with default-bucket and models buckets
Layer 3: Observability (Sync Wave -5 to -2)
- Prometheus Operator CRDs 23.0.0 - Metrics infrastructure
- OpenTelemetry Operator 0.93.1 - Telemetry collection with contrib collector
- OTEL-LGTM Stack v1.0.7 - Integrated observability (Loki, Grafana, Tempo, Mimir)
  - Storage: 50Gi each for tempo/loki/mimir, 10Gi grafana
  - Metrics collector: 8Gi RAM, 2 CPU
  - Logs collector daemonset: 2Gi RAM, 1 CPU
Layer 4: Identity & Access (Sync Wave -1 to 0)
- Keycloak (keycloak-old chart) - Enterprise IAM with AIRM realm
  - Custom extensions via init containers (SilogenExtensionPackage.jar)
  - Realm import with domain-group-authenticator
  - Client secrets for: AIRM, K8s, MinIO, Gitea, ArgoCD
Layer 5: AI/ML Compute Stack (Sync Wave -3 to 0)
GPU & Scheduling:
- AMD GPU Operator v1.4.1 - GPU device plugin and drivers
- KubeRay Operator 1.4.2 - Ray distributed computing framework
- Kueue 0.13.0 - Job queueing with multi-framework support
  - Integrations: batch/job, Ray, MPIJob, PyTorchJob, TensorFlow, Jobset, AppWrapper, Pod, Deployment
- AppWrapper v1.1.2 - Application-level resource scheduling
- KEDA 2.18.1 - Event-driven autoscaling
- Kedify-OTEL v0.0.6 - KEDA telemetry integration
ML Serving & Inference:
- KServe v0.16.0 - Model serving platform (Standard deployment mode)
- KServe CRDs v0.16.0 - Model serving custom resources
Workflow & Messaging:
- Kaiwo v0.2.0-rc11 - AI workload orchestration
- Kaiwo CRDs v0.2.0-rc11 - Workflow custom resources
- RabbitMQ v2.15.0 - Message broker for async processing
Layer 6: AIRM Application (Sync Wave 0)
- AIRM 0.3.5 - AMD Resource Manager application suite
  - Configurable image repositories for custom registries and air-gapped deployments via --airm-image-repository flag
- AIM Cluster Model Source - Cluster resource models for AIRM
cluster-forge/
├── scripts/
│   ├── bootstrap.sh                # Main bootstrap orchestration
│   ├── init-gitea-job/             # Helm chart for Gitea initialization
│   ├── init-openbao-job/           # Helm chart for OpenBao initialization
│   └── utils/                      # Backup/restore utilities
│       ├── export_databases.sh
│       ├── export_rabbitmq.sh
│       ├── import_databases.sh
│       ├── import_rabbitmq.sh
│       └── mirror_minio.sh
├── root/
│   ├── Chart.yaml                  # ClusterForge root helm chart metadata
│   ├── values.yaml                 # Base configuration
│   ├── values_small.yaml           # Small cluster overrides
│   ├── values_medium.yaml          # Medium cluster overrides
│   ├── values_large.yaml           # Large cluster overrides
│   └── templates/
│       ├── _helpers.yaml           # Template helper functions
│       ├── cluster-apps.yaml       # Generates ArgoCD Application per enabledApp
│       └── cluster-forge.yaml      # Self-managing ClusterForge Application
├── sources/                        # Versioned helm charts and configurations
│   ├── <component>/
│   │   ├── <version>/              # Upstream helm chart or Kustomize
│   │   ├── source.yaml             # Source metadata (optional)
│   │   └── values_ha.yaml          # HA overrides (optional)
│   └── <component-config>/         # Configuration helm charts
│       └── templates/              # ConfigMaps, Secrets, ExternalSecrets
├── docs/                           # Architecture and operational documentation
└── sbom/                           # Software bill of materials tooling
The bootstrap.sh script orchestrates complete cluster setup with flexible options:
./scripts/bootstrap.sh <domain> [options]
Available Options:
- --cluster-size=[small|medium|large] - Cluster size configuration (default: medium)
- --apps=APP1,APP2 - Deploy only specified components
  - Bootstrap apps: namespaces, argocd, openbao, gitea, cluster-forge
  - Child apps: Any app from enabledApps list (e.g., keycloak, keda, airm)
- --disabled-apps=APP1,GLOB* - Exclude apps from installation (supports * and ? wildcards)
  - Patterns are matched against each entry in enabledApps before the values configmap is pushed to Gitea
  - If an app appears in both --apps and --disabled-apps, it is skipped (disabled takes priority)
  - Example: --disabled-apps=airm,airm-infra-* skips airm, airm-infra-cnpg, airm-infra-external-secrets, airm-infra-rabbitmq
- --target-revision=BRANCH - cluster-forge git revision for ArgoCD (default: latest release tag)
- --template-only or -t - Output YAML manifests instead of applying to cluster
- --skip-deps - Skip dependency checking for advanced users
- --airm-image-repository=REPO - Custom AIRM container image repository for air-gapped deployments
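The wildcard matching and the "disabled takes priority" rule can be sketched in plain shell using `case` patterns, which support `*` and `?`. The function names `is_disabled` and `filter_apps` are hypothetical, and bootstrap.sh may implement the filter differently.

```shell
#!/bin/sh
# Return 0 if app name $1 matches any comma-separated glob pattern in $2.
is_disabled() {
    printf '%s\n' "$2" | tr ',' '\n' | (
        while IFS= read -r pattern; do
            # An unquoted $pattern in a case arm is matched as a glob
            # against the whole app name.
            case "$1" in
                $pattern) exit 0 ;;
            esac
        done
        exit 1
    )
}

# Print the apps from $1 (comma-separated) that no pattern in $2 disables.
filter_apps() {
    printf '%s\n' "$1" | tr ',' '\n' | while IFS= read -r candidate; do
        [ -n "$candidate" ] || continue
        is_disabled "$candidate" "$2" || printf '%s\n' "$candidate"
    done
}

filter_apps 'airm,airm-infra-cnpg,airm-infra-rabbitmq,keycloak' 'airm,airm-infra-*'
# prints: keycloak
```

Because each pattern must match the whole name, `airm` alone does not disable `airm-infra-cnpg`; that is why the documented example lists `airm-infra-*` separately.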
Bootstrap Process:
- Validation - Checks domain, cluster size, values files, required tool availability (kubectl, helm, yq with version checking)
- Pre-cleanup - Removes previous installations if gitea-init-job completed
- Values Merge - Combines base + size-specific values with domain injection
- Namespace Creation - Creates argocd, cf-gitea, openbao namespaces
- ArgoCD Deployment - helm template + kubectl apply with server-side apply using --field-manager=argocd-controller
- OpenBao Bootstrap - Separate bootstrap phase for secrets management foundation
- Gitea Deployment - helm template + kubectl apply, waits for rollout
- Gitea Init Job - Creates cluster-org, clones/pushes cluster-forge and cluster-values repos with AIRM image repository support
- ClusterForge App - Creates root Application that manages all remaining components via ArgoCD
- Component Deployment - ArgoCD syncs all enabledApps including secrets and application stack
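The "helm template + kubectl apply" step could look roughly like the sketch below. The helper name `apply_chart` and the `DRY_RUN` switch are assumptions added for illustration; only the server-side apply with `--field-manager=argocd-controller` and the sources/argocd/8.3.5 path come from this document. By default the sketch only prints the pipeline instead of touching a cluster.

```shell
#!/bin/sh
# Hypothetical helper: render a chart with helm and apply it server-side.
# With DRY_RUN=1 (the default here) it only prints the pipeline it would run.
apply_chart() {
    release="$1"; chart_dir="$2"; namespace="$3"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "helm template $release $chart_dir --namespace $namespace | kubectl apply --server-side --field-manager=argocd-controller -f -"
    else
        helm template "$release" "$chart_dir" --namespace "$namespace" \
            | kubectl apply --server-side --field-manager=argocd-controller -f -
    fi
}

apply_chart argocd sources/argocd/8.3.5 argocd
```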
The --apps flag enables targeted deployment for development and troubleshooting:
Bootstrap Components (deployed via helm template):
- namespaces - Core namespaces (argocd, cf-gitea)
- argocd - GitOps controller
- gitea - Local Git server
- cluster-forge - Root ArgoCD Application
Child Components (deployed via ArgoCD sync):
- Any application from enabledApps list
- Examples: openbao, openbao-init, keycloak, keda, kedify-otel
Usage Examples:
# Deploy only core GitOps foundation
./scripts/bootstrap.sh example.com --apps=namespaces,argocd,gitea,cluster-forge
# Deploy only secrets management
./scripts/bootstrap.sh example.com --apps=openbao,openbao-init,openbao-config
# Render manifests for debugging
./scripts/bootstrap.sh example.com --apps=keycloak --template-only
# Deploy with custom AIRM image repository for air-gapped environments
./scripts/bootstrap.sh example.com --airm-image-repository=registry.internal.com/airm
# Install everything except AIRM and all its infra dependencies
./scripts/bootstrap.sh example.com --disabled-apps=airm,airm-infra-*
# Combine --apps and --disabled-apps (disabled takes priority)
./scripts/bootstrap.sh example.com --apps=airm,keycloak --disabled-apps=airm
Once bootstrapped, the cluster is fully self-sufficient:
Local Git Server (Gitea):
- Stores cluster-forge repository (platform code)
- Stores cluster-values repository (environment-specific configuration)
- Provides Git UI at https://gitea.{domain}
- Admin credentials in gitea-admin-credentials secret
- SQLite backend for lightweight operation
Local Secrets Management (OpenBao):
- Vault-compatible secrets engine
- Initialized with policies for each component
- Kubernetes auth method configured
- External Secrets Operator integration
- Secrets for: Keycloak clients, AIRM, database credentials, API keys
Configuration as Code:
- All platform configuration in cluster-values repo
- Changes trigger ArgoCD sync automatically
- Full audit trail through Git history
- Rollback capability via Git revert
Three-layer configuration merge:
- Base Layer (values.yaml) - Common defaults for all sizes
- Size Layer (values_{size}.yaml) - Size-specific overrides
- External Layer (cluster-values/values.yaml) - Environment customization
# Bootstrap merges: base <- size <- external
VALUES=$(yq eval-all '. as $item ireduce ({}; . * $item)' \
  values.yaml values_medium.yaml cluster-values/values.yaml)
Size-Specific Behaviors:
Small/Medium are single-node and have storage class mutation policies:
enabledApps:
  - kyverno-policies-storage-local-path # RWX→RWO mutation for local-path
Large enables multi-node and HA components:
apps:
  openbao:
    valuesObject:
      server:
        ha:
          enabled: true
          replicas: 3
Versioned Sources Structure:
sources/argocd/
├── 8.3.5/ # Upstream helm chart
├── source.yaml # Source metadata (upstream repo, version)
└── values_ha.yaml # Optional HA overrides
Configuration Companions: Each major component has -config variant:
- argocd-config: OIDC integration, RBAC policies, ExternalSecrets
- gitea-config: Keycloak OAuth, repository templates
- openbao-config: Policy definitions, secret paths, initialization scripts
- minio-tenant-config: Bucket policies, user credentials, gateway routes
Three-Tier Secrets System:
- OpenBao (Source of Truth)
  - KV v2 secrets engine at secret/
  - Policies per namespace: argocd-policy, airm-policy, gitea-policy, etc.
  - Kubernetes auth method for pod authentication
- External Secrets Operator (Synchronization)
  - ExternalSecret resources in each namespace
  - SecretStore points to OpenBao with serviceAccountRef
  - Automatic sync from OpenBao → Kubernetes Secrets
  - Example: argocd-oidc-creds ExternalSecret → OIDC client secret
- Kubernetes Secrets (Consumption)
  - Standard Kubernetes Secret objects
  - Referenced by pods via env, volumeMounts
  - Automatically updated when OpenBao source changes
Bootstrap Secret Flow:
- bootstrap.sh generates initial passwords with openssl rand -hex 16
- ArgoCD deploys OpenBao via cluster-forge Application
- openbao-init-job (sync wave -50) writes secrets to OpenBao
- External Secrets Operator (sync wave -40) syncs to Kubernetes Secrets
- Applications consume via secret references
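As a rough illustration of the first step in this flow, the sketch below wraps `openssl rand -hex 16`, which emits 16 random bytes as 32 hexadecimal characters. The function name and the /dev/urandom fallback are assumptions added for the example, not something taken from bootstrap.sh.

```shell
#!/bin/sh
# Generate a 32-character hex password (16 random bytes).
generate_password() {
    if command -v openssl >/dev/null 2>&1; then
        openssl rand -hex 16
    else
        # Fallback (an assumption, not from bootstrap.sh):
        # read 16 bytes from /dev/urandom and hex-encode them.
        od -An -N16 -tx1 /dev/urandom | tr -d ' \n'
        echo
    fi
}

password=$(generate_password)
# Report only the length; the secret itself is never printed.
echo "generated ${#password} hex characters"
```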
Kyverno policies organized by concern:
Base Policies (kyverno-policies-base):
- Core security policies
- Resource quotas
- Label requirements
Storage Policies (kyverno-policies-storage-local-path):
- Access mode mutation: ReadWriteMany → ReadWriteOnce
- Only enabled for small/medium clusters with local-path storage
- Prevents PVC creation failures on non-distributed storage
Custom Policies:
- AIRM-specific policies included in airm chart
- Custom validations and mutations per application
Database Export/Import:
scripts/utils/export_databases.sh # PostgreSQL dumps from CNPG
scripts/utils/import_databases.sh # Restore PostgreSQL databases
Message Queue:
scripts/utils/export_rabbitmq.sh # RabbitMQ definitions and messages
scripts/utils/import_rabbitmq.sh # Restore queues and exchanges
Object Storage:
scripts/utils/mirror_minio.sh # MinIO bucket synchronization
Integrated LGTM Platform:
- Loki - Log aggregation with 50Gi storage
- Grafana - Visualization dashboards with 10Gi storage
- Tempo - Distributed tracing with 50Gi storage
- Mimir - Prometheus metrics with 50Gi storage
Automatic Collection:
- Metrics collector deployment: 8Gi RAM, 2 CPU limits
- Logs collector daemonset: 2Gi RAM, 1 CPU per node
- OpenTelemetry contrib collector for advanced telemetry
- Node exporter and kube-state-metrics enabled by default
Service Endpoints:
- Grafana UI: Port 3000
- OTLP gRPC: Port 4317
- OTLP HTTP: Port 4318
- Prometheus: Port 9090
- Loki: Port 3100
Multi-Framework Job Integration:
Kueue manages scheduling for:
- Kubernetes batch/job
- Ray (RayJob, RayCluster)
- Kubeflow (MPIJob, PyTorchJob, TFJob, XGBoostJob, JAXJob, PaddleJob)
- AppWrapper for multi-pod applications
- Pod, Deployment, StatefulSet
Resource Management:
- Kueue ClusterQueues for resource pools
- LocalQueues per namespace
- ResourceFlavors for GPU/CPU quotas
- Cohort sharing across teams
Model Serving:
- KServe Standard deployment mode
- InferenceService CRD for models
- Auto-scaling with KEDA
- S3 model storage via MinIO
GPU Support:
- AMD GPU Operator for device plugin
- Automatic driver installation
- GPU metrics in Prometheus
- Scheduling via Kueue resource flavors
Kubernetes Cluster:
- Kubernetes 1.33+ (configurable via bootstrap.sh KUBE_VERSION)
- kubectl with cluster-admin access
- Working storage class (local-path for small/medium, distributed for large)
- Sufficient resources per cluster size
Networking:
- Domain name or wildcard DNS (*.example.com or *.{ip}.nip.io)
- Ingress capability (Gateway API + KGateway deployed by ClusterForge)
- External LoadBalancer or MetalLB (deployed by ClusterForge)
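When no DNS name is available, a nip.io domain can be derived from a node address, since nip.io resolves any hostname that ends in `<ip>.nip.io` to that IP. The sketch below assumes a hypothetical helper name and uses a documentation-only example address; the Gitea URL follows the https://gitea.{domain} pattern given in this document.

```shell
#!/bin/sh
# Build a wildcard-friendly domain from a node IP using nip.io.
nip_domain() {
    printf '%s.nip.io\n' "$1"
}

domain=$(nip_domain 192.0.2.10)   # 192.0.2.10 is a documentation-only example address
echo "bootstrap domain: $domain"
echo "Gitea UI: https://gitea.$domain"
```

The resulting value is what would be passed as the `<domain>` argument to ./scripts/bootstrap.sh.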
TLS Certificates:
- cluster-tls secret in kgateway-system namespace
- Can be self-signed for development
- Production should use Cert-Manager with ACME
Required Tools:
- yq v4+ (YAML processor) with automatic version checking
- helm 3.0+ with automatic version checking
- kubectl with automatic version checking
- openssl (for password generation)
Bootstrap script provides comprehensive dependency validation with platform-specific installation instructions for missing tools.
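A simplified picture of the presence check is sketched below. It only tests whether each tool is on the PATH; the version checks and platform-specific install hints described above are not reproduced, and the function name `check_tools` is hypothetical.

```shell
#!/bin/sh
# Report every missing tool, then fail once if any were missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing required tool: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

if check_tools kubectl helm yq openssl; then
    echo "all required tools found"
else
    echo "install the missing tools and re-run" >&2
fi
```

Collecting all missing tools before failing lets a user fix everything in one pass instead of re-running once per tool.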
Small Cluster:
- single node
- 250Gi+ total storage
- Local-path or hostPath storage class
Medium Cluster:
- single node
- 500Gi+ total storage
- Local-path or distributed storage
Large Cluster:
- Multi-node, HA (3-node control plane)
- 1Ti+ total storage
- Distributed storage required (Storage appliances / cloud / Longhorn, Ceph, etc.)
FR1: AIRM Platform Delivery
- Deploy AMD Resource Manager (AIRM) 0.3.5 with UI and API
- Support configurable image repositories via --airm-image-repository bootstrap parameter
- Provide model serving with KServe v0.16.0
- Support distributed computing via KubeRay Operator 1.4.2
- Enable workflow orchestration through Kaiwo v0.2.0-rc11
- Integrate AMD GPU Operator v1.4.1 for GPU resources
FR2: GitOps Operations
- Bootstrap ArgoCD 8.3.5 with single command
- Manage 40+ components as ArgoCD Applications
- Support multi-source Applications for values separation
- Enable local Gitea 12.3.0 for cluster-native GitOps
FR3: Size-Aware Deployment
- Support small/medium/large configurations via --cluster-size flag
- Automatically merge size-specific values with base configuration
- Enable/disable components based on cluster size (e.g., HA modes)
- Apply appropriate policies per size (storage access modes)
FR4: Secrets Management
- Initialize OpenBao 0.18.2 with vault policies
- Configure External Secrets Operator 0.15.1 integration
- Generate and store all component credentials
- Sync secrets from OpenBao to Kubernetes automatically
FR5: Dependency Orchestration
- Deploy components in wave order (-70 to 0)
- Bootstrap foundation before app-of-apps (ArgoCD, OpenBao, Gitea)
- Wait for component health before proceeding
- Use ignoreDifferences for known drift patterns
Performance:
- Complete bootstrap in under 15 minutes (small cluster)
- ArgoCD sync time under 5 minutes for full stack
- Gitea init job completes in under 2 minutes
Reliability:
- OpenBao HA with 3 replicas and Raft (large clusters)
- ArgoCD automated sync with self-heal
- Server-side apply to prevent field manager conflicts
Maintainability:
- Single values file per cluster size
- DRY principle for configuration inheritance
- Versioned sources for reproducible deployments
- SBOM generation for supply chain security
Usability:
- Single-command deployment
- Helpful error messages with validation
- Progress indication during bootstrap
- Access URLs displayed on completion
- Add chart to sources/{component}/{version}/
- Define app configuration in values.yaml:
apps:
  my-component:
    path: my-component/1.0.0
    namespace: my-namespace
    syncWave: -1
    valuesObject:
      # component values
- Add to enabledApps list
Create cluster-values repository with custom values.yaml:
# Override any base configuration
global:
  domain: custom.example.com
apps:
  argocd:
    valuesObject:
      server:
        replicas: 3 # Custom override
Modify values_{size}.yaml to adjust resources:
- Change replica counts
- Adjust CPU/memory limits
- Enable/disable HA modes
- Add size-specific enabledApps
Detailed documentation in /docs:
- Bootstrap Guide - Deployment walkthrough
- Cluster Size Configuration - Size planning
- Values Inheritance Pattern - GitOps configuration
- Kyverno Modular Design - Policy architecture
- Kyverno Access Mode Policy - Storage policies
- Backup and Restore - Data protection
ClusterForge includes comprehensive SBOM tooling in /sbom:
SBOM Files:
- components.yaml - Canonical list of all components with versions, licenses, and metadata
- SBOM-QUICK-GUIDE.md - Guide for SBOM generation and validation
Validation Scripts:
- validate-components-sync.sh - Ensures components.yaml matches actual sources/
- validate-enabled-apps.sh - Validates enabledApps lists reference defined components
- validate-metadata.sh - Checks required metadata fields
- validate-sync.sh - Full validation suite
Generation Scripts:
- generate-sbom.sh - Generates SPDX/CycloneDX SBOM documents
- generate-compare-components.sh - Compares component versions
- update_licenses.sh - Updates license information
Current Release: v1.8.0
Key Component Versions:
- ArgoCD: 8.3.5
- Gitea: 12.3.0
- OpenBao: 0.18.2
- Keycloak: keycloak-old chart
- KServe: v0.16.0
- Kaiwo: v0.2.0-rc11
- AIRM: 0.3.5
- Kueue: 0.13.0
- AMD GPU Operator: v1.4.1
- OTEL-LGTM Stack: v1.0.7
Repository: https://github.com/silogen/cluster-forge
Issue Tracking: Use GitHub Issues for bug reports and feature requests
Maintainers: ClusterForge Team