From 18576db7974b4d643af6a447b5875a8fed227a81 Mon Sep 17 00:00:00 2001 From: Fagani Hajizada Date: Thu, 21 May 2026 14:16:44 +0200 Subject: [PATCH 1/6] refactor(slinky-slurm): inline platform refs Move slinky-slurm-operator-crds, slinky-slurm-operator, and slinky-slurm from the platform-slurm / platform-slurm-cluster mixins into inline componentRefs on each slurm leaf, mirroring the dynamo-platform pattern used by *-inference-dynamo leaves. The mixinComponentRefSafeForMerge guard in pkg/recipe/metadata_store.go blocks leaf-level overrides on identity fields (source, version, valuesFile) of mixin-contributed components, which made it impossible to declare leaf-specific GPU GRES tuning (extraConfMap.Gres and slurmd.resources.limits.nvidia.com/gpu) from the leaf overlays. Inlining the three Slinky charts per slurm leaf moves them off the mixin merge path and onto the standard componentRef overlay path, unblocking the GRES wiring that srun --gres=gpu:N requires. Changes: - recipes/overlays/h100-eks-ubuntu-training-slurm.yaml: inline the three Slinky componentRefs with H100 GRES overrides on slinky-slurm - recipes/overlays/h100-kind-training-slurm.yaml: inline the three Slinky componentRefs (CPU-only, no GRES overrides) - recipes/mixins/platform-slurm.yaml, recipes/mixins/platform-slurm-cluster.yaml: removed; no remaining references in the live tree (only a historical CHANGELOG mention) - recipes/registry.yaml: update slinky-slurm comment to point to the inline pattern; defaults still drive source/version/namespace - docs/integrator/recipe-development.md, docs/user/component-catalog.md: rewrite the slurm-platform section to describe the inline pattern - docs/user/api-reference.md: add slinky-slurm to the bundler component table (pre-existing gap from when the cluster was mixin-opt-in) --- docs/integrator/recipe-development.md | 2 +- docs/user/api-reference.md | 1 + docs/user/component-catalog.md | 4 +- recipes/mixins/platform-slurm-cluster.yaml | 30 ---------- recipes/mixins/platform-slurm.yaml | 30 ---------- .../h100-eks-ubuntu-training-slurm.yaml | 60 +++++++++++++++---- .../overlays/h100-kind-training-slurm.yaml | 37 +++++++++--- recipes/registry.yaml | 4 +- 8 files changed, 86 insertions(+), 82 deletions(-) delete mode 100644 recipes/mixins/platform-slurm-cluster.yaml delete mode 100644 recipes/mixins/platform-slurm.yaml diff --git a/docs/integrator/recipe-development.md b/docs/integrator/recipe-development.md index 94200d3f1..4a25af9f2 100644 --- a/docs/integrator/recipe-development.md +++ b/docs/integrator/recipe-development.md @@ -115,7 +115,7 @@ spec: Mixins use `kind: RecipeMixin` and carry only `constraints` and `componentRefs`. They live in `recipes/mixins/` and are applied after inheritance chain merging. See [Data Architecture](../contributor/data.md#mixin-composition) for details. -A platform may split into multiple mixins when parts of the stack are independently opt-in. For example, `--platform slurm` resolves through two mixins: `platform-slurm` always contributes the SchedMD Slinky operator and CRDs, and `platform-slurm-cluster` is opt-in for the Slinky-managed Slurm cluster instance (Controller / LoginSet / NodeSet / RestApi). A leaf that wants operator-only composes just `platform-slurm`; a leaf that wants the cluster too composes both — see `recipes/overlays/h100-eks-ubuntu-training-slurm.yaml` for the latter. +A platform's full component stack is declared inline per leaf overlay rather than via a platform mixin, so each leaf can carry its own hardware-specific tuning (GPU GRES, resource limits, partition layout). For example, `--platform slurm` leaves inline the SchedMD Slinky operator CRDs, the operator itself, and the Slinky-managed Slurm cluster instance (Controller / LoginSet / NodeSet / RestApi) as three `componentRefs` entries — same shape `dynamo-platform` uses across the `*-inference-dynamo` leaves. A leaf that wants the operator only inlines `slinky-slurm-operator-crds` + `slinky-slurm-operator` and omits the `slinky-slurm` componentRef; a leaf that wants the full cluster adds all three with leaf-specific `overrides` on `slinky-slurm` — see `recipes/overlays/h100-eks-ubuntu-training-slurm.yaml` for the latter. When authoring a recipe targeting Talos (`criteria.os: talos`), append the `os-talos` mixin to your overlay's `spec.mixins` list (e.g. `spec.mixins: [os-talos]`, or `[platform-kubeflow, os-talos]` if you already mix in a non-OS fragment). OS-scoped mixins are mutually exclusive — combining `os-ubuntu` and `os-talos` in one overlay is a recipe authoring error, not a supported composition. The mixin overrides namespaces for affected components and supplies PSA-privileged Namespace manifests via `componentRefs[].preManifestFiles`, which are applied before each chart — see [Talos integration](talos-integration.md) for the component list and labels. diff --git a/docs/user/api-reference.md b/docs/user/api-reference.md index d38a2da5e..9312b48e1 100644 --- a/docs/user/api-reference.md +++ b/docs/user/api-reference.md @@ -366,6 +366,7 @@ Bundler names correspond to component names in [`recipes/registry.yaml`](https:/ | `nvsentinel` | GPU health monitoring and automated remediation | | `prometheus-adapter` | Custom metrics for HPA scaling | | `prometheus-operator-crds` | CRDs for the prometheus-operator (`Alertmanager`, `Prometheus`, `ServiceMonitor`, etc.) | +| `slinky-slurm` | Slinky-managed Slurm cluster instance (Controller, LoginSet, NodeSet, RestApi); reconciled by `slinky-slurm-operator` | | `slinky-slurm-operator` | SchedMD Slinky Slurm operator and admission webhook | | `slinky-slurm-operator-crds` | CRDs for the SchedMD Slinky Slurm operator (`slinky.slurm.net`) | diff --git a/docs/user/component-catalog.md b/docs/user/component-catalog.md index ddfb06ebc..72014f8f5 100644 --- a/docs/user/component-catalog.md +++ b/docs/user/component-catalog.md @@ -35,7 +35,7 @@ The source of truth is [`recipes/registry.yaml`](https://github.com/NVIDIA/aicr/ | **kubeflow-trainer** | Kubeflow Training Operator for distributed training jobs (PyTorch, etc.). Manages multi-node training job lifecycle with JobSet integration. | [Kubeflow Trainer](https://github.com/kubeflow/trainer) | | **slinky-slurm-operator-crds** | Custom Resource Definitions for the SchedMD Slinky Slurm operator. Installs the `slinky.slurm.net` CRDs (Controller, NodeSet, LoginSet, Accounting, RestApi, Token). Installed separately to support CRD lifecycle management. | [Slinky Slurm Operator](https://github.com/SlinkyProject/slurm-operator) | | **slinky-slurm-operator** | SchedMD Slinky Slurm operator and admission webhook. Manages the lifecycle of Slurm clusters declared via Slinky CRs (Controller, NodeSet, LoginSet, Accounting, RestApi, Token). | [Slinky Slurm Operator](https://github.com/SlinkyProject/slurm-operator) | -| **slinky-slurm** | Slinky-managed Slurm cluster instance: Controller (slurmctld) + LoginSet (sackd/sshd) + NodeSet (slurmd) + RestApi (slurmrestd). Reconciled by `slinky-slurm-operator`. Opt-in via the `platform-slurm-cluster` mixin (alongside `platform-slurm` for the operator). Accounting (slurmdbd) requires an external MariaDB and is disabled in defaults — see `recipes/components/slinky-slurm/values.yaml`. | [Slinky Slurm Cluster Chart](https://github.com/SlinkyProject/slurm-operator/tree/main/helm/slurm) | +| **slinky-slurm** | Slinky-managed Slurm cluster instance: Controller (slurmctld) + LoginSet (sackd/sshd) + NodeSet (slurmd) + RestApi (slurmrestd). Reconciled by `slinky-slurm-operator`. Declared inline per slurm leaf overlay alongside `slinky-slurm-operator-crds` and `slinky-slurm-operator` (matching the dynamo-platform pattern) so each leaf can carry its own GPU/GRES tuning. Accounting (slurmdbd) requires an external MariaDB and is disabled in defaults — see `recipes/components/slinky-slurm/values.yaml`. | [Slinky Slurm Cluster Chart](https://github.com/SlinkyProject/slurm-operator/tree/main/helm/slurm) | ## How Components Are Selected @@ -44,7 +44,7 @@ Not every component appears in every recipe. The recipe engine selects component - **Base components** (cert-manager, kube-prometheus-stack) appear in most recipes. - **Cloud-specific components** (aws-efa, aws-ebs-csi-driver) are added when the service matches. - **Intent-specific components** (agentgateway, agentgateway-crds) are added based on workload intent (e.g., inference recipes include the inference gateway). -- **Platform-specific components** (slinky-slurm-operator, slinky-slurm, kubeflow-trainer, dynamo-platform) are added when the recipe selects a matching `--platform`. For `--platform slurm`, the cluster (`slinky-slurm`) is opt-in via the `platform-slurm-cluster` mixin alongside the always-applied operator (`platform-slurm`); leaves that want operator-only compose just `platform-slurm`. +- **Platform-specific components** (slinky-slurm-operator, slinky-slurm, kubeflow-trainer, dynamo-platform) are added when the recipe selects a matching `--platform`. For `--platform slurm`, all three Slinky pieces (`slinky-slurm-operator-crds`, `slinky-slurm-operator`, `slinky-slurm`) are declared inline per slurm leaf overlay — the same shape `dynamo-platform` uses across `*-inference-dynamo` leaves. Leaves that want the operator only inline the CRDs + operator and omit the `slinky-slurm` componentRef. - **Accelerator/OS-specific tuning** (nodewright-customizations, nvidia-dra-driver-gpu) varies by hardware and OS combination. ### NFD Topology Updater diff --git a/recipes/mixins/platform-slurm-cluster.yaml b/recipes/mixins/platform-slurm-cluster.yaml deleted file mode 100644 index 71ac2ad8b..000000000 --- a/recipes/mixins/platform-slurm-cluster.yaml +++ /dev/null @@ -1,30 +0,0 @@ -# Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -# Opt-in mixin: a Slinky-managed Slurm cluster instance (Controller / -# LoginSet / NodeSet / RestApi) on top of the operator from -# platform-slurm. Leaves wanting only the operator compose -# platform-slurm alone; leaves wanting a runnable cluster compose both. -kind: RecipeMixin -apiVersion: aicr.nvidia.com/v1alpha1 -metadata: - name: platform-slurm-cluster -spec: - componentRefs: - - name: slinky-slurm - type: Helm - valuesFile: components/slinky-slurm/values.yaml - dependencyRefs: - - slinky-slurm-operator - - slinky-slurm-operator-crds diff --git a/recipes/mixins/platform-slurm.yaml b/recipes/mixins/platform-slurm.yaml deleted file mode 100644 index 972060416..000000000 --- a/recipes/mixins/platform-slurm.yaml +++ /dev/null @@ -1,30 +0,0 @@ -# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -kind: RecipeMixin -apiVersion: aicr.nvidia.com/v1alpha1 -metadata: - name: platform-slurm -spec: - componentRefs: - - name: slinky-slurm-operator-crds - type: Helm - valuesFile: components/slinky-slurm-operator-crds/values.yaml - - - name: slinky-slurm-operator - type: Helm - valuesFile: components/slinky-slurm-operator/values.yaml - dependencyRefs: - - cert-manager - - slinky-slurm-operator-crds diff --git a/recipes/overlays/h100-eks-ubuntu-training-slurm.yaml b/recipes/overlays/h100-eks-ubuntu-training-slurm.yaml index 0824fc35d..2420ae968 100644 --- a/recipes/overlays/h100-eks-ubuntu-training-slurm.yaml +++ b/recipes/overlays/h100-eks-ubuntu-training-slurm.yaml @@ -19,9 +19,10 @@ metadata: spec: # H100 + EKS + Ubuntu + training with the Slinky operator and a - # Slinky-managed Slurm cluster. EKS-specific cluster tuning (gp3 - # storage, GPU GRES, DCGM job-mapping) is layered at install time - # via `aicr bundle ... --set slurmcluster:...` or a valuesFile. + # Slinky-managed Slurm cluster. H100 GPU GRES is declared inline below + # (pod-side limit + slurmd-side Gres= line); remaining EKS-specific + # tuning (gp3 storage, DCGM job-mapping) is layered at install time + # via `aicr bundle ... --set slinkyslurm:...` or a valuesFile. base: h100-eks-ubuntu-training criteria: @@ -33,18 +34,57 @@ spec: mixins: - os-ubuntu - - platform-slurm - - platform-slurm-cluster constraints: - name: K8s.server.version value: ">= 1.32.4" - # Mixin-contributed components cannot be overridden from a leaf; use - # `--set slurmcluster:...` or a valuesFile at install time instead. + # The Slinky operator (CRDs + operator + cluster instance) is declared + # inline per slurm leaf, mirroring the dynamo-platform pattern in + # h100-*-inference-dynamo leaves. Inlining lets each leaf carry its + # own GPU/GRES tuning without fighting the mixin-vs-leaf identity-field + # guard in mixinComponentRefSafeForMerge (pkg/recipe/metadata_store.go), + # and keeps base.yaml free of platform-specific components. + # + # GPU GRES on slinky-slurm must be declared in two places because the + # chart does not derive Gres= in slurm.conf from pod resource limits + # (see comment in components/slinky-slurm/values.yaml): + # 1. nodesets.slinky.extraConfMap.Gres — adds `Gres=gpu:h100:8` to + # slurmd's --conf so slurmctld knows it has GPUs to allocate via + # `srun --gres=gpu:N`. + # 2. nodesets.slinky.slurmd.resources.limits.nvidia.com/gpu — reserves + # 8 H100s on the slurmd pod so the NVIDIA device plugin injects + # /dev/nvidia* into the container. Without this `gres.conf`'s + # AutoDetect=nvidia finds nothing. `requests` is omitted: Kubernetes + # auto-mirrors requests=limits for extended resources, matching the + # mars-prod reference (manifests/clusters/mars-prod/ + # mars-dgxc-k8s-tgr-bwi-prd1/components/slurm/values.yaml). # Accelerated nodeSelector/tolerations on slurmd are injected via the # registry's nodesets.slinky.podSpec.{nodeSelector,tolerations} paths. - componentRefs: [] + componentRefs: + - name: slinky-slurm-operator-crds + type: Helm + valuesFile: components/slinky-slurm-operator-crds/values.yaml - # Validation is inherited from h100-eks-training (operator-health, - # expected-resources, gpu-operator-version, check-nvidia-smi). + - name: slinky-slurm-operator + type: Helm + valuesFile: components/slinky-slurm-operator/values.yaml + dependencyRefs: + - cert-manager + - slinky-slurm-operator-crds + + - name: slinky-slurm + type: Helm + valuesFile: components/slinky-slurm/values.yaml + dependencyRefs: + - slinky-slurm-operator + - slinky-slurm-operator-crds + overrides: + nodesets: + slinky: + extraConfMap: + Gres: "gpu:h100:8" + slurmd: + resources: + limits: + nvidia.com/gpu: 8 diff --git a/recipes/overlays/h100-kind-training-slurm.yaml b/recipes/overlays/h100-kind-training-slurm.yaml index 60368a99f..0792876b1 100644 --- a/recipes/overlays/h100-kind-training-slurm.yaml +++ b/recipes/overlays/h100-kind-training-slurm.yaml @@ -29,17 +29,38 @@ spec: intent: training platform: slurm - mixins: - - platform-slurm - - platform-slurm-cluster - # DRA (GA in K8s 1.34) — restated from the parent for clarity. constraints: - name: K8s.server.version value: ">= 1.34" - # Mixin-contributed components cannot be overridden from a leaf; use - # `--set slurmcluster:...` or a valuesFile at install time instead. - componentRefs: [] + # The Slinky operator (CRDs + operator + cluster instance) is declared + # inline per slurm leaf, mirroring the dynamo-platform pattern in + # h100-*-inference-dynamo leaves. Inlining lets each leaf carry its + # own GPU/GRES tuning (Kind = CPU-only; H100 leaves = `Gres=gpu:h100:8` + # + `nvidia.com/gpu: 8`) without fighting the mixin-vs-leaf + # identity-field guard in mixinComponentRefSafeForMerge + # (pkg/recipe/metadata_store.go), and keeps base.yaml free of + # platform-specific components. + # + # Kind has no GPUs, so the NodeSet runs CPU-only: no `extraConfMap.Gres`, + # no `slurmd.resources.limits.nvidia.com/gpu`, no DCGM. This makes the + # leaf usable as a no-GPU CI smoke test for the operator + chart wiring. + componentRefs: + - name: slinky-slurm-operator-crds + type: Helm + valuesFile: components/slinky-slurm-operator-crds/values.yaml + + - name: slinky-slurm-operator + type: Helm + valuesFile: components/slinky-slurm-operator/values.yaml + dependencyRefs: + - cert-manager + - slinky-slurm-operator-crds - # Validation is inherited from h100-kind-training. + - name: slinky-slurm + type: Helm + valuesFile: components/slinky-slurm/values.yaml + dependencyRefs: + - slinky-slurm-operator + - slinky-slurm-operator-crds diff --git a/recipes/registry.yaml b/recipes/registry.yaml index 51ecbe1cb..71c50dae1 100644 --- a/recipes/registry.yaml +++ b/recipes/registry.yaml @@ -644,7 +644,9 @@ components: - name: slinky-slurm displayName: slinky-slurm - # Cluster instance chart; wired in via platform-slurm-cluster mixin. + # Cluster instance chart; declared inline per slurm leaf overlay so + # each leaf can carry its own GPU/GRES tuning (mirrors the + # dynamo-platform pattern in inference-dynamo leaves). valueOverrideKeys: - slinkyslurm - slurmcluster From 8538fc710f94e3a773d68cb4be0e873274d9f51c Mon Sep 17 00:00:00 2001 From: Fagani Hajizada Date: Thu, 21 May 2026 14:18:57 +0200 Subject: [PATCH 2/6] feat(recipes): add h100-gke-cos-training-slurm First-class GKE/COS leaf for H100 training with the Slinky-managed Slurm cluster (Controller, LoginSet, NodeSet, RestApi). Inherits from h100-gke-cos-training and inlines the slinky-slurm trio (slinky-slurm-operator-crds, slinky-slurm-operator, slinky-slurm) with the same H100 GRES overrides as the EKS slurm leaf, so srun --gres=gpu:N allocates real GPUs (Gres=gpu:h100:8 in slurm.conf, nvidia.com/gpu: 8 on the slurmd pod). Unlike the EKS slurm leaf, no os-* mixin is composed: gke-cos.yaml already disables GPU driver installation and pins DRA / nodewright paths for COS's read-only rootfs. Drops the K8s-native NCCL bandwidth perf check inherited from h100-gke-cos-training: on a Slinky cluster the equivalent validation is a slurm-launched srun job (e.g. nccl-tests/all_reduce_perf) that exercises slurmd + the GPUDirect TCPXO plugin from the parent leaf. Conformance and deployment checks are inherited unchanged. Adds a deployment-order guard test case mirroring the existing h100-kind-training-slurm and h100-eks-ubuntu-training-slurm cases, asserting CRDs precede the operator, the operator precedes the cluster, cert-manager precedes the operator, and gpu-operator precedes nvsentinel. --- pkg/recipe/deployment_order_guard_test.go | 23 ++++ .../overlays/h100-gke-cos-training-slurm.yaml | 101 ++++++++++++++++++ 2 files changed, 124 insertions(+) create mode 100644 recipes/overlays/h100-gke-cos-training-slurm.yaml diff --git a/pkg/recipe/deployment_order_guard_test.go b/pkg/recipe/deployment_order_guard_test.go index d374aa815..61ce26480 100644 --- a/pkg/recipe/deployment_order_guard_test.go +++ b/pkg/recipe/deployment_order_guard_test.go @@ -198,6 +198,29 @@ func TestDeploymentOrderGuards(t *testing.T) { {"gpu-operator", "nvsentinel"}, }, }, + { + name: "h100-gke-cos-training-slurm", + criteria: func() *Criteria { + c := NewCriteria() + c.Service = CriteriaServiceGKE + c.Accelerator = CriteriaAcceleratorH100 + c.OS = CriteriaOSCOS + c.Intent = CriteriaIntentTraining + c.Platform = CriteriaPlatformSlurm + return c + }, + requiredDeps: map[string][]string{ + "slinky-slurm-operator": {"cert-manager", "slinky-slurm-operator-crds"}, + "slinky-slurm": {"slinky-slurm-operator", "slinky-slurm-operator-crds"}, + }, + requiredOrdering: [][2]string{ + {"cert-manager", "slinky-slurm-operator"}, + {"slinky-slurm-operator-crds", "slinky-slurm-operator"}, + {"slinky-slurm-operator", "slinky-slurm"}, + {"slinky-slurm-operator-crds", "slinky-slurm"}, + {"gpu-operator", "nvsentinel"}, + }, + }, { name: "h100-kind-training-slurm", criteria: func() *Criteria { diff --git a/recipes/overlays/h100-gke-cos-training-slurm.yaml b/recipes/overlays/h100-gke-cos-training-slurm.yaml new file mode 100644 index 000000000..b1aca6301 --- /dev/null +++ b/recipes/overlays/h100-gke-cos-training-slurm.yaml @@ -0,0 +1,101 @@ +# Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +kind: RecipeMetadata +apiVersion: aicr.nvidia.com/v1alpha1 +metadata: + name: h100-gke-cos-training-slurm + +spec: + # H100 + GKE + COS + training with the Slinky operator and a + # Slinky-managed Slurm cluster. H100 GPU GRES is declared inline + # below; remaining GKE-specific tuning (single-replica login on RWO + # storage, DCGM job-mapping) is layered at install time via + # `aicr bundle ... --set slinkyslurm:...` or a valuesFile. + base: h100-gke-cos-training + + criteria: + service: gke + accelerator: h100 + os: cos + intent: training + platform: slurm + + # Unlike the EKS slurm leaf, no os-ubuntu / os-cos mixin is needed: + # gke-cos.yaml already disables GPU driver installation and pins + # DRA / nodewright paths for COS's read-only rootfs. + + constraints: + - name: K8s.server.version + value: ">= 1.32" + + # The Slinky operator (CRDs + operator + cluster instance) is declared + # inline per slurm leaf, mirroring the dynamo-platform pattern in + # h100-*-inference-dynamo leaves. Inlining lets each leaf carry its + # own GPU/GRES tuning without fighting the mixin-vs-leaf identity-field + # guard in mixinComponentRefSafeForMerge (pkg/recipe/metadata_store.go), + # and keeps base.yaml free of platform-specific components. + # + # GPU GRES on slinky-slurm must be declared in two places because the + # chart does not derive Gres= in slurm.conf from pod resource limits + # (see comment in components/slinky-slurm/values.yaml): + # 1. nodesets.slinky.extraConfMap.Gres — adds `Gres=gpu:h100:8` to + # slurmd's --conf so slurmctld knows it has GPUs to allocate via + # `srun --gres=gpu:N`. + # 2. nodesets.slinky.slurmd.resources.limits.nvidia.com/gpu — reserves + # 8 H100s on the slurmd pod so the NVIDIA device plugin injects + # /dev/nvidia* into the container. Without this `gres.conf`'s + # AutoDetect=nvidia finds nothing. `requests` is omitted: Kubernetes + # auto-mirrors requests=limits for extended resources, matching the + # mars-prod reference (manifests/clusters/mars-prod/ + # mars-dgxc-k8s-tgr-bwi-prd1/components/slurm/values.yaml). + # Accelerated nodeSelector/tolerations on slurmd are injected via the + # registry's nodesets.slinky.podSpec.{nodeSelector,tolerations} paths. + componentRefs: + - name: slinky-slurm-operator-crds + type: Helm + valuesFile: components/slinky-slurm-operator-crds/values.yaml + + - name: slinky-slurm-operator + type: Helm + valuesFile: components/slinky-slurm-operator/values.yaml + dependencyRefs: + - cert-manager + - slinky-slurm-operator-crds + + - name: slinky-slurm + type: Helm + valuesFile: components/slinky-slurm/values.yaml + dependencyRefs: + - slinky-slurm-operator + - slinky-slurm-operator-crds + overrides: + nodesets: + slinky: + extraConfMap: + Gres: "gpu:h100:8" + slurmd: + resources: + limits: + nvidia.com/gpu: 8 + + # K8s-native NCCL bandwidth check is dropped on slurm leaves: on a + # Slinky cluster the equivalent validation is a slurm-launched job + # (`srun nccl-tests/all_reduce_perf`) that exercises slurmd + the + # GPUDirect TCPXO plugin already deployed by the parent leaf via + # gke-nccl-tcpxo. Conformance checks are inherited unchanged. + validation: + performance: + checks: [] + constraints: [] From e51b4b16fc3fff6c41f22733bd970aa9779e47cb Mon Sep 17 00:00:00 2001 From: Fagani Hajizada Date: Thu, 21 May 2026 14:35:32 +0200 Subject: [PATCH 3/6] docs(recipes): split slurm inline-platform paragraph into list The single dense paragraph introduced in the slinky-slurm refactor was hard to scan. Break it into a short lead sentence, a bulleted list of the three inlined componentRefs (CRDs / operator / cluster), and a trailing line that points at dynamo-platform as the precedent and h100-eks-ubuntu-training-slurm.yaml as the worked example. Drops the speculative "partition layout" tuning example and the operator-only leaf variation: neither is exercised by any leaf in tree today, and documenting hypotheticals risks the same drift problem catalog-md hit before (NKX-10404). --- docs/integrator/recipe-development.md | 10 +++++++++- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/docs/integrator/recipe-development.md b/docs/integrator/recipe-development.md index 4a25af9f2..18b6c4699 100644 --- a/docs/integrator/recipe-development.md +++ b/docs/integrator/recipe-development.md @@ -115,7 +115,15 @@ spec: Mixins use `kind: RecipeMixin` and carry only `constraints` and `componentRefs`. They live in `recipes/mixins/` and are applied after inheritance chain merging. See [Data Architecture](../contributor/data.md#mixin-composition) for details. -A platform's full component stack is declared inline per leaf overlay rather than via a platform mixin, so each leaf can carry its own hardware-specific tuning (GPU GRES, resource limits, partition layout). For example, `--platform slurm` leaves inline the SchedMD Slinky operator CRDs, the operator itself, and the Slinky-managed Slurm cluster instance (Controller / LoginSet / NodeSet / RestApi) as three `componentRefs` entries — same shape `dynamo-platform` uses across the `*-inference-dynamo` leaves. A leaf that wants the operator only inlines `slinky-slurm-operator-crds` + `slinky-slurm-operator` and omits the `slinky-slurm` componentRef; a leaf that wants the full cluster adds all three with leaf-specific `overrides` on `slinky-slurm` — see `recipes/overlays/h100-eks-ubuntu-training-slurm.yaml` for the latter. +A platform's full component stack is declared inline per leaf overlay rather than via a platform mixin. This lets each leaf carry its own hardware-specific tuning — most commonly GPU GRES strings and accelerator resource limits — without going through the mixin merge path. + +For example, `--platform slurm` leaves inline three `componentRefs`: + +- `slinky-slurm-operator-crds` — SchedMD Slinky CRDs +- `slinky-slurm-operator` — the operator and admission webhook +- `slinky-slurm` — the Slinky-managed Slurm cluster instance (Controller / LoginSet / NodeSet / RestApi), with leaf-specific `overrides` (e.g. H100 GRES wiring on the `nodesets.slinky` map) + +This is the same shape `dynamo-platform` uses across the `*-inference-dynamo` leaves. See `recipes/overlays/h100-eks-ubuntu-training-slurm.yaml` for the full example. When authoring a recipe targeting Talos (`criteria.os: talos`), append the `os-talos` mixin to your overlay's `spec.mixins` list (e.g. `spec.mixins: [os-talos]`, or `[platform-kubeflow, os-talos]` if you already mix in a non-OS fragment). OS-scoped mixins are mutually exclusive — combining `os-ubuntu` and `os-talos` in one overlay is a recipe authoring error, not a supported composition. The mixin overrides namespaces for affected components and supplies PSA-privileged Namespace manifests via `componentRefs[].preManifestFiles`, which are applied before each chart — see [Talos integration](talos-integration.md) for the component list and labels. From 2d6614c47dc16081a8e3b22495ae6606de4aeecf Mon Sep 17 00:00:00 2001 From: Fagani Hajizada Date: Thu, 21 May 2026 17:20:17 +0200 Subject: [PATCH 4/6] fix(recipes): drop K8s NCCL perf check on EKS slurm leaf for parity MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The GKE slurm leaf already overrides validation.performance.checks to [] with the rationale that on a Slinky-managed cluster the K8s-native nccl-all-reduce-bw measures the wrong path — it launches a Pod past slurmd, so the bandwidth number reflects raw cluster networking, not what an `srun nccl-tests/all_reduce_perf` would see going through slurmd + the configured fabric (TCPXO on GKE, EFA on EKS). The argument generalizes to any Slinky cluster, so apply the same override on the EKS slurm leaf. Tighten the prose on both files: the old comment said "Conformance checks are inherited unchanged" but deployment checks are also inherited; replace with "Deployment and conformance checks are inherited unchanged" for precision. --- recipes/overlays/h100-eks-ubuntu-training-slurm.yaml | 12 ++++++++++++ recipes/overlays/h100-gke-cos-training-slurm.yaml | 11 +++++++---- 2 files changed, 19 insertions(+), 4 deletions(-) diff --git a/recipes/overlays/h100-eks-ubuntu-training-slurm.yaml b/recipes/overlays/h100-eks-ubuntu-training-slurm.yaml index 2420ae968..bb3d4c04d 100644 --- a/recipes/overlays/h100-eks-ubuntu-training-slurm.yaml +++ b/recipes/overlays/h100-eks-ubuntu-training-slurm.yaml @@ -88,3 +88,15 @@ spec: resources: limits: nvidia.com/gpu: 8 + + # K8s-native nccl-all-reduce-bw is dropped on Slinky leaves: that + # check launches a Pod against the cluster scheduler, so on a + # Slinky-managed cluster it bypasses slurmd entirely and measures the + # wrong path. The equivalent signal here is a slurm-launched + # `srun nccl-tests/all_reduce_perf` that goes through slurmd + the + # EFA libfabric stack already present on the parent EKS leaf. + # Deployment and conformance checks are inherited unchanged. + validation: + performance: + checks: [] + constraints: [] diff --git a/recipes/overlays/h100-gke-cos-training-slurm.yaml b/recipes/overlays/h100-gke-cos-training-slurm.yaml index b1aca6301..a2f12e53f 100644 --- a/recipes/overlays/h100-gke-cos-training-slurm.yaml +++ b/recipes/overlays/h100-gke-cos-training-slurm.yaml @@ -90,11 +90,14 @@ spec: limits: nvidia.com/gpu: 8 - # K8s-native NCCL bandwidth check is dropped on slurm leaves: on a - # Slinky cluster the equivalent validation is a slurm-launched job - # (`srun nccl-tests/all_reduce_perf`) that exercises slurmd + the + # K8s-native nccl-all-reduce-bw is dropped on Slinky leaves: that + # check launches a Pod against the cluster scheduler, so on a + # Slinky-managed cluster it bypasses slurmd entirely and measures the + # wrong path. The equivalent signal here is a slurm-launched + # `srun nccl-tests/all_reduce_perf` that goes through slurmd + the # GPUDirect TCPXO plugin already deployed by the parent leaf via - # gke-nccl-tcpxo. Conformance checks are inherited unchanged. + # gke-nccl-tcpxo. Deployment and conformance checks are inherited + # unchanged. validation: performance: checks: [] From 24093c095e08435f9a861c1a1277aca37a8828c5 Mon Sep 17 00:00:00 2001 From: Fagani Hajizada Date: Thu, 21 May 2026 17:23:47 +0200 Subject: [PATCH 5/6] test(recipe): extend NFD topology-updater coverage to slurm leaves MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit TestNFDTopologyUpdater_OverlayCoverage enumerates every GPU overlay variant to catch overlays that accidentally lose NRT publishing on real-cluster leaves or accidentally turn it on for kind-chain leaves. The three slurm leaves (h100-eks-ubuntu-training-slurm, h100-gke-cos-training-slurm, h100-kind-training-slurm) were missing from the table — a pre-existing gap that this PR is the natural moment to close, since it adds one new slurm leaf and refactors the other two off the deleted platform-slurm mixins. Expected pattern matches the parent inheritance: EKS+GKE slurm leaves inherit topologyUpdater.enable=true from their h100-*-training parents; the kind slurm leaf stays off (no kubelet podResources socket on KWOK). --- pkg/recipe/metadata_test.go | 3 +++ 1 file changed, 3 insertions(+) diff --git a/pkg/recipe/metadata_test.go b/pkg/recipe/metadata_test.go index fc4cf1679..8d51ec621 100644 --- a/pkg/recipe/metadata_test.go +++ b/pkg/recipe/metadata_test.go @@ -1860,6 +1860,7 @@ func TestNFDTopologyUpdater_OverlayCoverage(t *testing.T) { {"h100-eks-ubuntu-training", criteria{CriteriaServiceEKS, CriteriaAcceleratorH100, CriteriaOSUbuntu, CriteriaIntentTraining, ""}, true}, {"h100-eks-ubuntu-inference", criteria{CriteriaServiceEKS, CriteriaAcceleratorH100, CriteriaOSUbuntu, CriteriaIntentInference, ""}, true}, {"h100-eks-ubuntu-training-kubeflow", criteria{CriteriaServiceEKS, CriteriaAcceleratorH100, CriteriaOSUbuntu, CriteriaIntentTraining, CriteriaPlatformKubeflow}, true}, + {"h100-eks-ubuntu-training-slurm", criteria{CriteriaServiceEKS, CriteriaAcceleratorH100, CriteriaOSUbuntu, CriteriaIntentTraining, CriteriaPlatformSlurm}, true}, {"h100-eks-ubuntu-inference-dynamo", criteria{CriteriaServiceEKS, CriteriaAcceleratorH100, CriteriaOSUbuntu, CriteriaIntentInference, CriteriaPlatformDynamo}, true}, {"h100-eks-ubuntu-inference-nim", criteria{CriteriaServiceEKS, CriteriaAcceleratorH100, CriteriaOSUbuntu, CriteriaIntentInference, CriteriaPlatformNIM}, true}, // H100 AKS Ubuntu variants @@ -1869,6 +1870,7 @@ func TestNFDTopologyUpdater_OverlayCoverage(t *testing.T) { {"h100-aks-ubuntu-inference-dynamo", criteria{CriteriaServiceAKS, CriteriaAcceleratorH100, CriteriaOSUbuntu, CriteriaIntentInference, CriteriaPlatformDynamo}, true}, // H100 GKE COS platform variants (GKE uses COS, no Ubuntu variant) {"h100-gke-cos-training-kubeflow", criteria{CriteriaServiceGKE, CriteriaAcceleratorH100, CriteriaOSCOS, CriteriaIntentTraining, CriteriaPlatformKubeflow}, true}, + {"h100-gke-cos-training-slurm", criteria{CriteriaServiceGKE, CriteriaAcceleratorH100, CriteriaOSCOS, CriteriaIntentTraining, CriteriaPlatformSlurm}, true}, {"h100-gke-cos-inference-dynamo", criteria{CriteriaServiceGKE, CriteriaAcceleratorH100, CriteriaOSCOS, CriteriaIntentInference, CriteriaPlatformDynamo}, true}, // GB200 EKS Ubuntu variants {"gb200-eks-ubuntu-training", criteria{CriteriaServiceEKS, CriteriaAcceleratorGB200, CriteriaOSUbuntu, CriteriaIntentTraining, ""}, true}, @@ -1889,6 +1891,7 @@ func TestNFDTopologyUpdater_OverlayCoverage(t *testing.T) { {"h100-kind-inference", criteria{CriteriaServiceKind, CriteriaAcceleratorH100, "", CriteriaIntentInference, ""}, false}, // Deeper kind leaves — platform variants must also stay OFF {"h100-kind-training-kubeflow", criteria{CriteriaServiceKind, CriteriaAcceleratorH100, "", CriteriaIntentTraining, CriteriaPlatformKubeflow}, false}, + {"h100-kind-training-slurm", criteria{CriteriaServiceKind, CriteriaAcceleratorH100, "", CriteriaIntentTraining, CriteriaPlatformSlurm}, false}, {"h100-kind-inference-dynamo", criteria{CriteriaServiceKind, CriteriaAcceleratorH100, "", CriteriaIntentInference, CriteriaPlatformDynamo}, false}, } From 7782ff744df6c140972647492807eaaa9c8d7125 Mon Sep 17 00:00:00 2001 From: Fagani Hajizada Date: Fri, 22 May 2026 11:31:44 +0200 Subject: [PATCH 6/6] docs(recipes): address review nits on slurm leaves - recipe-development.md: scope the "inline-per-leaf platform stack" claim to slurm and dynamo. The prior wording was too broad: platform-kubeflow and platform-inference are still active mixins (shown in the example two paragraphs up), and the dynamo-platform precedent already lives inline in *-inference-dynamo leaves. The new wording calls out slurm/dynamo as inline-per-leaf and explicitly preserves the mixin status of kubeflow/inference. - h100-gke-cos-training-slurm.yaml: drop the verbatim restatement of K8s.server.version (>= 1.32) from the parent. Same-name constraints get overridden during merge, so restating without strengthening is noise. The EKS slurm leaf restates >= 1.32.4 to tighten its parent; the GKE leaf has no tighter floor, so we inherit and add a one-line comment documenting why. - h100-eks-ubuntu-training-slurm.yaml and h100-gke-cos-training-slurm.yaml: drop the parenthetical pointer to mars-prod (manifests/clusters/mars-prod/...). That path lives in an internal NVIDIA repo and reads as opaque in the OSS tree. The rationale (Kubernetes auto-mirrors requests=limits for extended resources) stands on its own. Per Mark's review on #997. --- docs/integrator/recipe-development.md | 2 +- .../overlays/h100-eks-ubuntu-training-slurm.yaml | 4 +--- recipes/overlays/h100-gke-cos-training-slurm.yaml | 13 ++++++------- 3 files changed, 8 insertions(+), 11 deletions(-) diff --git a/docs/integrator/recipe-development.md b/docs/integrator/recipe-development.md index 54eee750d..463215563 100644 --- a/docs/integrator/recipe-development.md +++ b/docs/integrator/recipe-development.md @@ -115,7 +115,7 @@ spec: Mixins use `kind: RecipeMixin` and carry only `constraints` and `componentRefs`. They live in `recipes/mixins/` and are applied after inheritance chain merging. See [Data Architecture](../contributor/data.md#mixin-composition) for details. -A platform's full component stack is declared inline per leaf overlay rather than via a platform mixin. This lets each leaf carry its own hardware-specific tuning — most commonly GPU GRES strings and accelerator resource limits — without going through the mixin merge path. +Some platforms declare their full component stack inline per leaf overlay rather than via a platform mixin. This is the case for `--platform slurm` and `--platform dynamo`, where each leaf carries hardware-specific tuning (GPU GRES strings, accelerator resource limits) that the mixin merge path cannot represent cleanly. Other platforms like `--platform kubeflow` and `--platform inference` still use the `platform-kubeflow` / `platform-inference` mixins shown above, since their leaf-specific tuning is minimal. For example, `--platform slurm` leaves inline three `componentRefs`: diff --git a/recipes/overlays/h100-eks-ubuntu-training-slurm.yaml b/recipes/overlays/h100-eks-ubuntu-training-slurm.yaml index bb3d4c04d..c15271b24 100644 --- a/recipes/overlays/h100-eks-ubuntu-training-slurm.yaml +++ b/recipes/overlays/h100-eks-ubuntu-training-slurm.yaml @@ -56,9 +56,7 @@ spec: # 8 H100s on the slurmd pod so the NVIDIA device plugin injects # /dev/nvidia* into the container. Without this `gres.conf`'s # AutoDetect=nvidia finds nothing. `requests` is omitted: Kubernetes - # auto-mirrors requests=limits for extended resources, matching the - # mars-prod reference (manifests/clusters/mars-prod/ - # mars-dgxc-k8s-tgr-bwi-prd1/components/slurm/values.yaml). + # auto-mirrors requests=limits for extended resources. # Accelerated nodeSelector/tolerations on slurmd are injected via the # registry's nodesets.slinky.podSpec.{nodeSelector,tolerations} paths. componentRefs: diff --git a/recipes/overlays/h100-gke-cos-training-slurm.yaml b/recipes/overlays/h100-gke-cos-training-slurm.yaml index a2f12e53f..688326312 100644 --- a/recipes/overlays/h100-gke-cos-training-slurm.yaml +++ b/recipes/overlays/h100-gke-cos-training-slurm.yaml @@ -35,10 +35,11 @@ spec: # Unlike the EKS slurm leaf, no os-ubuntu / os-cos mixin is needed: # gke-cos.yaml already disables GPU driver installation and pins # DRA / nodewright paths for COS's read-only rootfs. - - constraints: - - name: K8s.server.version - value: ">= 1.32" + # + # K8s.server.version (>= 1.32) is inherited from h100-gke-cos-training.yaml; + # Slinky on GKE has no tighter floor than the parent leaf, so we don't + # restate it here (cf. the EKS slurm leaf, which restates >= 1.32.4 to + # match its parent). # The Slinky operator (CRDs + operator + cluster instance) is declared # inline per slurm leaf, mirroring the dynamo-platform pattern in @@ -57,9 +58,7 @@ spec: # 8 H100s on the slurmd pod so the NVIDIA device plugin injects # /dev/nvidia* into the container. Without this `gres.conf`'s # AutoDetect=nvidia finds nothing. `requests` is omitted: Kubernetes - # auto-mirrors requests=limits for extended resources, matching the - # mars-prod reference (manifests/clusters/mars-prod/ - # mars-dgxc-k8s-tgr-bwi-prd1/components/slurm/values.yaml). + # auto-mirrors requests=limits for extended resources. # Accelerated nodeSelector/tolerations on slurmd are injected via the # registry's nodesets.slinky.podSpec.{nodeSelector,tolerations} paths. componentRefs: