10 changes: 9 additions & 1 deletion docs/integrator/recipe-development.md
@@ -115,7 +115,15 @@ spec:

Mixins use `kind: RecipeMixin` and carry only `constraints` and `componentRefs`. They live in `recipes/mixins/` and are applied after inheritance chain merging. See [Data Architecture](../contributor/data.md#mixin-composition) for details.

-A platform may split into multiple mixins when parts of the stack are independently opt-in. For example, `--platform slurm` resolves through two mixins: `platform-slurm` always contributes the SchedMD Slinky operator and CRDs, and `platform-slurm-cluster` is opt-in for the Slinky-managed Slurm cluster instance (Controller / LoginSet / NodeSet / RestApi). A leaf that wants operator-only composes just `platform-slurm`; a leaf that wants the cluster too composes both — see `recipes/overlays/h100-eks-ubuntu-training-slurm.yaml` for the latter.
+Some platforms declare their full component stack inline per leaf overlay rather than via a platform mixin. This is the case for `--platform slurm` and `--platform dynamo`, where each leaf carries hardware-specific tuning (GPU GRES strings, accelerator resource limits) that the mixin merge path cannot represent cleanly. Other platforms like `--platform kubeflow` and `--platform inference` still use the `platform-kubeflow` / `platform-inference` mixins shown above, since their leaf-specific tuning is minimal.
+
+For example, `--platform slurm` leaves inline three `componentRefs`:
+
+- `slinky-slurm-operator-crds` — SchedMD Slinky CRDs
+- `slinky-slurm-operator` — the operator and admission webhook
+- `slinky-slurm` — the Slinky-managed Slurm cluster instance (Controller / LoginSet / NodeSet / RestApi), with leaf-specific `overrides` (e.g. H100 GRES wiring on the `nodesets.slinky` map)
+
+This is the same shape `dynamo-platform` uses across the `*-inference-dynamo` leaves. See `recipes/overlays/h100-eks-ubuntu-training-slurm.yaml` for the full example.

When authoring a recipe targeting Talos (`criteria.os: talos`), append the `os-talos` mixin to your overlay's `spec.mixins` list (e.g. `spec.mixins: [os-talos]`, or `[platform-kubeflow, os-talos]` if you already mix in a non-OS fragment). OS-scoped mixins are mutually exclusive — combining `os-ubuntu` and `os-talos` in one overlay is a recipe authoring error, not a supported composition. The mixin overrides namespaces for affected components and supplies PSA-privileged Namespace manifests via `componentRefs[].preManifestFiles`, which are applied before each chart — see [Talos integration](talos-integration.md) for the component list and labels.
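
The mutual-exclusivity rule for OS-scoped mixins described above could be checked mechanically. The following is a minimal sketch, not the project's actual validation: the function names `osMixins` and `validateOSMixins` are hypothetical, and treating the `os-` name prefix as the marker of an OS-scoped mixin is an assumption drawn only from the names `os-ubuntu` and `os-talos`.

```go
package main

import (
	"fmt"
	"strings"
)

// osMixins returns the entries of mixins whose name starts with "os-".
// Assumption: the "os-" prefix identifies an OS-scoped mixin, based only
// on the names os-ubuntu and os-talos used in the documentation.
func osMixins(mixins []string) []string {
	var found []string
	for _, m := range mixins {
		if strings.HasPrefix(m, "os-") {
			found = append(found, m)
		}
	}
	return found
}

// validateOSMixins (hypothetical helper) returns an error when more than
// one OS-scoped mixin appears in the same spec.mixins list.
func validateOSMixins(mixins []string) error {
	if found := osMixins(mixins); len(found) > 1 {
		return fmt.Errorf("OS-scoped mixins are mutually exclusive, got %v", found)
	}
	return nil
}

func main() {
	// A single OS mixin next to a non-OS fragment is accepted.
	fmt.Println(validateOSMixins([]string{"platform-kubeflow", "os-talos"}))
	// Two OS mixins in one overlay are rejected.
	fmt.Println(validateOSMixins([]string{"os-ubuntu", "os-talos"}))
}
```

A check of this shape would surface the authoring error at recipe-load time instead of leaving it to produce a confusing merge result.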

1 change: 1 addition & 0 deletions docs/user/api-reference.md
@@ -366,6 +366,7 @@ Bundler names correspond to component names in [`recipes/registry.yaml`](https:/
| `nvsentinel` | GPU health monitoring and automated remediation |
| `prometheus-adapter` | Custom metrics for HPA scaling |
| `prometheus-operator-crds` | CRDs for the prometheus-operator (`Alertmanager`, `Prometheus`, `ServiceMonitor`, etc.) |
+| `slinky-slurm` | Slinky-managed Slurm cluster instance (Controller, LoginSet, NodeSet, RestApi); reconciled by `slinky-slurm-operator` |
| `slinky-slurm-operator` | SchedMD Slinky Slurm operator and admission webhook |
| `slinky-slurm-operator-crds` | CRDs for the SchedMD Slinky Slurm operator (`slinky.slurm.net`) |

4 changes: 2 additions & 2 deletions docs/user/component-catalog.md
@@ -35,7 +35,7 @@ The source of truth is [`recipes/registry.yaml`](https://github.com/NVIDIA/aicr/
| **kubeflow-trainer** | Kubeflow Training Operator for distributed training jobs (PyTorch, etc.). Manages multi-node training job lifecycle with JobSet integration. | [Kubeflow Trainer](https://github.com/kubeflow/trainer) |
| **slinky-slurm-operator-crds** | Custom Resource Definitions for the SchedMD Slinky Slurm operator. Installs the `slinky.slurm.net` CRDs (Controller, NodeSet, LoginSet, Accounting, RestApi, Token). Installed separately to support CRD lifecycle management. | [Slinky Slurm Operator](https://github.com/SlinkyProject/slurm-operator) |
| **slinky-slurm-operator** | SchedMD Slinky Slurm operator and admission webhook. Manages the lifecycle of Slurm clusters declared via Slinky CRs (Controller, NodeSet, LoginSet, Accounting, RestApi, Token). | [Slinky Slurm Operator](https://github.com/SlinkyProject/slurm-operator) |
-| **slinky-slurm** | Slinky-managed Slurm cluster instance: Controller (slurmctld) + LoginSet (sackd/sshd) + NodeSet (slurmd) + RestApi (slurmrestd). Reconciled by `slinky-slurm-operator`. Opt-in via the `platform-slurm-cluster` mixin (alongside `platform-slurm` for the operator). Accounting (slurmdbd) requires an external MariaDB and is disabled in defaults — see `recipes/components/slinky-slurm/values.yaml`. | [Slinky Slurm Cluster Chart](https://github.com/SlinkyProject/slurm-operator/tree/main/helm/slurm) |
+| **slinky-slurm** | Slinky-managed Slurm cluster instance: Controller (slurmctld) + LoginSet (sackd/sshd) + NodeSet (slurmd) + RestApi (slurmrestd). Reconciled by `slinky-slurm-operator`. Declared inline per slurm leaf overlay alongside `slinky-slurm-operator-crds` and `slinky-slurm-operator` (matching the dynamo-platform pattern) so each leaf can carry its own GPU/GRES tuning. Accounting (slurmdbd) requires an external MariaDB and is disabled in defaults — see `recipes/components/slinky-slurm/values.yaml`. | [Slinky Slurm Cluster Chart](https://github.com/SlinkyProject/slurm-operator/tree/main/helm/slurm) |

## How Components Are Selected

@@ -44,7 +44,7 @@ Not every component appears in every recipe. The recipe engine selects component
- **Base components** (cert-manager, kube-prometheus-stack) appear in most recipes.
- **Cloud-specific components** (aws-efa, aws-ebs-csi-driver) are added when the service matches.
- **Intent-specific components** (agentgateway, agentgateway-crds) are added based on workload intent (e.g., inference recipes include the inference gateway).
-- **Platform-specific components** (slinky-slurm-operator, slinky-slurm, kubeflow-trainer, dynamo-platform) are added when the recipe selects a matching `--platform`. For `--platform slurm`, the cluster (`slinky-slurm`) is opt-in via the `platform-slurm-cluster` mixin alongside the always-applied operator (`platform-slurm`); leaves that want operator-only compose just `platform-slurm`.
+- **Platform-specific components** (slinky-slurm-operator, slinky-slurm, kubeflow-trainer, dynamo-platform) are added when the recipe selects a matching `--platform`. For `--platform slurm`, all three Slinky pieces (`slinky-slurm-operator-crds`, `slinky-slurm-operator`, `slinky-slurm`) are declared inline per slurm leaf overlay — the same shape `dynamo-platform` uses across `*-inference-dynamo` leaves. Leaves that want the operator only inline the CRDs + operator and omit the `slinky-slurm` componentRef.
- **Accelerator/OS-specific tuning** (nodewright-customizations, nvidia-dra-driver-gpu) varies by hardware and OS combination.

### NFD Topology Updater
23 changes: 23 additions & 0 deletions pkg/recipe/deployment_order_guard_test.go
@@ -198,6 +198,29 @@ func TestDeploymentOrderGuards(t *testing.T) {
 				{"gpu-operator", "nvsentinel"},
 			},
 		},
+		{
+			name: "h100-gke-cos-training-slurm",
+			criteria: func() *Criteria {
+				c := NewCriteria()
+				c.Service = CriteriaServiceGKE
+				c.Accelerator = CriteriaAcceleratorH100
+				c.OS = CriteriaOSCOS
+				c.Intent = CriteriaIntentTraining
+				c.Platform = CriteriaPlatformSlurm
+				return c
+			},
+			requiredDeps: map[string][]string{
+				"slinky-slurm-operator": {"cert-manager", "slinky-slurm-operator-crds"},
+				"slinky-slurm":          {"slinky-slurm-operator", "slinky-slurm-operator-crds"},
+			},
+			requiredOrdering: [][2]string{
+				{"cert-manager", "slinky-slurm-operator"},
+				{"slinky-slurm-operator-crds", "slinky-slurm-operator"},
+				{"slinky-slurm-operator", "slinky-slurm"},
+				{"slinky-slurm-operator-crds", "slinky-slurm"},
+				{"gpu-operator", "nvsentinel"},
+			},
+		},
Comment thread
mchmarny marked this conversation as resolved.
 		{
 			name: "h100-kind-training-slurm",
 			criteria: func() *Criteria {
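
The `requiredDeps` and `requiredOrdering` fields in the test case above express two related properties: which components a component depends on, and which pairs must appear in a given relative order in the deployment plan. The sketch below shows one way such a guard could be evaluated. It is not the repository's implementation: `deploymentOrder` and `violatedOrdering` are hypothetical names, and the use of Kahn's algorithm with alphabetical tie-breaking is an assumption chosen only to make the example deterministic.

```go
package main

import (
	"fmt"
	"sort"
)

// deploymentOrder returns a deterministic topological order of components.
// deps[c] lists the components that must be deployed before c.
// Hypothetical helper: Kahn's algorithm with alphabetical tie-breaking.
func deploymentOrder(deps map[string][]string) ([]string, error) {
	indegree := map[string]int{}
	dependents := map[string][]string{}
	for c, ds := range deps {
		if _, ok := indegree[c]; !ok {
			indegree[c] = 0
		}
		for _, d := range ds {
			if _, ok := indegree[d]; !ok {
				indegree[d] = 0
			}
			indegree[c]++
			dependents[d] = append(dependents[d], c)
		}
	}
	var ready []string
	for c, n := range indegree {
		if n == 0 {
			ready = append(ready, c)
		}
	}
	sort.Strings(ready)
	var order []string
	for len(ready) > 0 {
		c := ready[0]
		ready = ready[1:]
		order = append(order, c)
		for _, next := range dependents[c] {
			indegree[next]--
			if indegree[next] == 0 {
				ready = append(ready, next)
			}
		}
		sort.Strings(ready)
	}
	if len(order) != len(indegree) {
		return nil, fmt.Errorf("dependency cycle detected")
	}
	return order, nil
}

// violatedOrdering returns every {before, after} pair that order does not
// satisfy, including pairs that mention a component missing from order.
func violatedOrdering(order []string, pairs [][2]string) [][2]string {
	pos := map[string]int{}
	for i, c := range order {
		pos[c] = i
	}
	var bad [][2]string
	for _, p := range pairs {
		before, okB := pos[p[0]]
		after, okA := pos[p[1]]
		if !okB || !okA || before >= after {
			bad = append(bad, p)
		}
	}
	return bad
}

func main() {
	deps := map[string][]string{
		"slinky-slurm-operator": {"cert-manager", "slinky-slurm-operator-crds"},
		"slinky-slurm":          {"slinky-slurm-operator", "slinky-slurm-operator-crds"},
	}
	order, err := deploymentOrder(deps)
	if err != nil {
		panic(err)
	}
	fmt.Println(order)
	fmt.Println(violatedOrdering(order, [][2]string{
		{"slinky-slurm-operator-crds", "slinky-slurm-operator"},
		{"slinky-slurm-operator", "slinky-slurm"},
	}))
}
```

Checking pairs against a computed order, rather than against the declared `dependencyRefs` directly, also catches orderings that only hold transitively.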
3 changes: 3 additions & 0 deletions pkg/recipe/metadata_test.go
@@ -1860,6 +1860,7 @@ func TestNFDTopologyUpdater_OverlayCoverage(t *testing.T) {
 		{"h100-eks-ubuntu-training", criteria{CriteriaServiceEKS, CriteriaAcceleratorH100, CriteriaOSUbuntu, CriteriaIntentTraining, ""}, true},
 		{"h100-eks-ubuntu-inference", criteria{CriteriaServiceEKS, CriteriaAcceleratorH100, CriteriaOSUbuntu, CriteriaIntentInference, ""}, true},
 		{"h100-eks-ubuntu-training-kubeflow", criteria{CriteriaServiceEKS, CriteriaAcceleratorH100, CriteriaOSUbuntu, CriteriaIntentTraining, CriteriaPlatformKubeflow}, true},
+		{"h100-eks-ubuntu-training-slurm", criteria{CriteriaServiceEKS, CriteriaAcceleratorH100, CriteriaOSUbuntu, CriteriaIntentTraining, CriteriaPlatformSlurm}, true},
 		{"h100-eks-ubuntu-inference-dynamo", criteria{CriteriaServiceEKS, CriteriaAcceleratorH100, CriteriaOSUbuntu, CriteriaIntentInference, CriteriaPlatformDynamo}, true},
 		{"h100-eks-ubuntu-inference-nim", criteria{CriteriaServiceEKS, CriteriaAcceleratorH100, CriteriaOSUbuntu, CriteriaIntentInference, CriteriaPlatformNIM}, true},
 		// H100 AKS Ubuntu variants
@@ -1869,6 +1870,7 @@ func TestNFDTopologyUpdater_OverlayCoverage(t *testing.T) {
 		{"h100-aks-ubuntu-inference-dynamo", criteria{CriteriaServiceAKS, CriteriaAcceleratorH100, CriteriaOSUbuntu, CriteriaIntentInference, CriteriaPlatformDynamo}, true},
 		// H100 GKE COS platform variants (GKE uses COS, no Ubuntu variant)
 		{"h100-gke-cos-training-kubeflow", criteria{CriteriaServiceGKE, CriteriaAcceleratorH100, CriteriaOSCOS, CriteriaIntentTraining, CriteriaPlatformKubeflow}, true},
+		{"h100-gke-cos-training-slurm", criteria{CriteriaServiceGKE, CriteriaAcceleratorH100, CriteriaOSCOS, CriteriaIntentTraining, CriteriaPlatformSlurm}, true},
 		{"h100-gke-cos-inference-dynamo", criteria{CriteriaServiceGKE, CriteriaAcceleratorH100, CriteriaOSCOS, CriteriaIntentInference, CriteriaPlatformDynamo}, true},
 		// GB200 EKS Ubuntu variants
 		{"gb200-eks-ubuntu-training", criteria{CriteriaServiceEKS, CriteriaAcceleratorGB200, CriteriaOSUbuntu, CriteriaIntentTraining, ""}, true},
@@ -1889,6 +1891,7 @@ func TestNFDTopologyUpdater_OverlayCoverage(t *testing.T) {
 		{"h100-kind-inference", criteria{CriteriaServiceKind, CriteriaAcceleratorH100, "", CriteriaIntentInference, ""}, false},
 		// Deeper kind leaves — platform variants must also stay OFF
 		{"h100-kind-training-kubeflow", criteria{CriteriaServiceKind, CriteriaAcceleratorH100, "", CriteriaIntentTraining, CriteriaPlatformKubeflow}, false},
+		{"h100-kind-training-slurm", criteria{CriteriaServiceKind, CriteriaAcceleratorH100, "", CriteriaIntentTraining, CriteriaPlatformSlurm}, false},
 		{"h100-kind-inference-dynamo", criteria{CriteriaServiceKind, CriteriaAcceleratorH100, "", CriteriaIntentInference, CriteriaPlatformDynamo}, false},
 	}

30 changes: 0 additions & 30 deletions recipes/mixins/platform-slurm-cluster.yaml

This file was deleted.

30 changes: 0 additions & 30 deletions recipes/mixins/platform-slurm.yaml

This file was deleted.

70 changes: 60 additions & 10 deletions recipes/overlays/h100-eks-ubuntu-training-slurm.yaml
@@ -19,9 +19,10 @@ metadata:

 spec:
   # H100 + EKS + Ubuntu + training with the Slinky operator and a
-  # Slinky-managed Slurm cluster. EKS-specific cluster tuning (gp3
-  # storage, GPU GRES, DCGM job-mapping) is layered at install time
-  # via `aicr bundle ... --set slurmcluster:...` or a valuesFile.
+  # Slinky-managed Slurm cluster. H100 GPU GRES is declared inline below
+  # (pod-side limit + slurmd-side Gres= line); remaining EKS-specific
+  # tuning (gp3 storage, DCGM job-mapping) is layered at install time
+  # via `aicr bundle ... --set slinkyslurm:...` or a valuesFile.
   base: h100-eks-ubuntu-training

   criteria:
@@ -33,18 +34,67 @@ spec:

   mixins:
     - os-ubuntu
-    - platform-slurm
-    - platform-slurm-cluster

   constraints:
     - name: K8s.server.version
       value: ">= 1.32.4"

-  # Mixin-contributed components cannot be overridden from a leaf; use
-  # `--set slurmcluster:...` or a valuesFile at install time instead.
+  # The Slinky operator (CRDs + operator + cluster instance) is declared
+  # inline per slurm leaf, mirroring the dynamo-platform pattern in
+  # h100-*-inference-dynamo leaves. Inlining lets each leaf carry its
+  # own GPU/GRES tuning without fighting the mixin-vs-leaf identity-field
+  # guard in mixinComponentRefSafeForMerge (pkg/recipe/metadata_store.go),
+  # and keeps base.yaml free of platform-specific components.
+  #
+  # GPU GRES on slinky-slurm must be declared in two places because the
+  # chart does not derive Gres= in slurm.conf from pod resource limits
+  # (see comment in components/slinky-slurm/values.yaml):
+  #   1. nodesets.slinky.extraConfMap.Gres — adds `Gres=gpu:h100:8` to
+  #      slurmd's --conf so slurmctld knows it has GPUs to allocate via
+  #      `srun --gres=gpu:N`.
+  #   2. nodesets.slinky.slurmd.resources.limits.nvidia.com/gpu — reserves
+  #      8 H100s on the slurmd pod so the NVIDIA device plugin injects
+  #      /dev/nvidia* into the container. Without this `gres.conf`'s
+  #      AutoDetect=nvidia finds nothing. `requests` is omitted: Kubernetes
+  #      auto-mirrors requests=limits for extended resources.
   # Accelerated nodeSelector/tolerations on slurmd are injected via the
   # registry's nodesets.slinky.podSpec.{nodeSelector,tolerations} paths.
-  componentRefs: []
+  componentRefs:
+    - name: slinky-slurm-operator-crds
+      type: Helm
Review comment (Member):

The double-declaration commentary (extraConfMap.Gres + nvidia.com/gpu limit, with the explanation of why both are needed) is exactly the kind of context that future-you will thank present-you for. Nice.

Two micro-suggestions to consider — neither blocking:

  1. The mars-prod filesystem path (manifests/clusters/mars-prod/mars-dgxc-k8s-tgr-bwi-prd1/...) is from an internal NVIDIA repo and will be opaque to anyone reading this in the public OSS tree. Either qualify with "(internal NVIDIA reference cluster)" or drop the parenthetical — the rationale (requests auto-mirror to limits for extended resources) stands on its own.
  2. The same ~20-line GRES rationale block is now copy-pasted verbatim in the GKE and EKS slurm leaves. If a future slurm leaf changes the wording in one and forgets the other they'll drift. Acceptable for two leaves; worth keeping an eye on if a third lands.

Reply (Contributor Author):

Dropped the mars-prod parenthetical from both EKS and GKE slurm leaves in 7782ff7. Agreed it reads as opaque in the OSS tree; the auto-mirror rationale stands on its own. On the duplication: keeping both copies for now (matches your "acceptable for two leaves" call); if a third slurm leaf lands we can extract the GRES rationale into a shared comment fragment or a leaf-template.

+      valuesFile: components/slinky-slurm-operator-crds/values.yaml

-  # Validation is inherited from h100-eks-training (operator-health,
-  # expected-resources, gpu-operator-version, check-nvidia-smi).
+    - name: slinky-slurm-operator
+      type: Helm
+      valuesFile: components/slinky-slurm-operator/values.yaml
+      dependencyRefs:
+        - cert-manager
+        - slinky-slurm-operator-crds
+
+    - name: slinky-slurm
+      type: Helm
+      valuesFile: components/slinky-slurm/values.yaml
+      dependencyRefs:
+        - slinky-slurm-operator
+        - slinky-slurm-operator-crds
+      overrides:
+        nodesets:
+          slinky:
+            extraConfMap:
+              Gres: "gpu:h100:8"
+            slurmd:
+              resources:
+                limits:
+                  nvidia.com/gpu: 8
+
+  # K8s-native nccl-all-reduce-bw is dropped on Slinky leaves: that
+  # check launches a Pod against the cluster scheduler, so on a
+  # Slinky-managed cluster it bypasses slurmd entirely and measures the
+  # wrong path. The equivalent signal here is a slurm-launched
+  # `srun nccl-tests/all_reduce_perf` that goes through slurmd + the
+  # EFA libfabric stack already present on the parent EKS leaf.
+  # Deployment and conformance checks are inherited unchanged.
+  validation:
+    performance:
+      checks: []
+      constraints: []
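
Because the overlay declares the GPU count in two places (the `Gres` string and the `nvidia.com/gpu` limit), the two values can drift apart when a leaf is edited. The sketch below illustrates a consistency check for that pair. It is an illustration only: `gresGPUCount` and `checkGRESMatchesLimit` are hypothetical names, not functions from the repository, and the parser assumes the GRES string always carries an explicit trailing count (`gpu:h100:8` or `gpu:8`), which is stricter than what Slurm itself accepts.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// gresGPUCount extracts the trailing count from a GPU GRES string such as
// "gpu:h100:8" (name:type:count) or "gpu:8" (name:count).
// Assumption: an explicit count is always present; this sketch rejects
// strings without one.
func gresGPUCount(gres string) (int, error) {
	parts := strings.Split(gres, ":")
	if len(parts) < 2 || parts[0] != "gpu" {
		return 0, fmt.Errorf("not a GPU GRES string with a count: %q", gres)
	}
	n, err := strconv.Atoi(parts[len(parts)-1])
	if err != nil || n <= 0 {
		return 0, fmt.Errorf("GRES %q has no positive trailing count", gres)
	}
	return n, nil
}

// checkGRESMatchesLimit (hypothetical helper) verifies that the GPU count
// advertised to Slurm equals the nvidia.com/gpu limit on the slurmd pod.
func checkGRESMatchesLimit(gres string, gpuLimit int) error {
	n, err := gresGPUCount(gres)
	if err != nil {
		return err
	}
	if n != gpuLimit {
		return fmt.Errorf("Gres advertises %d GPUs but the pod limit is %d", n, gpuLimit)
	}
	return nil
}

func main() {
	// Values taken from the overlay above.
	fmt.Println(checkGRESMatchesLimit("gpu:h100:8", 8))
	// A mismatched pair, as could happen if only one field is edited.
	fmt.Println(checkGRESMatchesLimit("gpu:h100:8", 4))
}
```

A check like this could run per slurm leaf in a table-driven test, which would also address the drift concern raised in the review thread if more slurm leaves are added.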