diff --git a/docs/integrator/recipe-development.md b/docs/integrator/recipe-development.md index a13b19f2c..463215563 100644 --- a/docs/integrator/recipe-development.md +++ b/docs/integrator/recipe-development.md @@ -115,7 +115,15 @@ spec: Mixins use `kind: RecipeMixin` and carry only `constraints` and `componentRefs`. They live in `recipes/mixins/` and are applied after inheritance chain merging. See [Data Architecture](../contributor/data.md#mixin-composition) for details. -A platform may split into multiple mixins when parts of the stack are independently opt-in. For example, `--platform slurm` resolves through two mixins: `platform-slurm` always contributes the SchedMD Slinky operator and CRDs, and `platform-slurm-cluster` is opt-in for the Slinky-managed Slurm cluster instance (Controller / LoginSet / NodeSet / RestApi). A leaf that wants operator-only composes just `platform-slurm`; a leaf that wants the cluster too composes both — see `recipes/overlays/h100-eks-ubuntu-training-slurm.yaml` for the latter. +Some platforms declare their full component stack inline per leaf overlay rather than via a platform mixin. This is the case for `--platform slurm` and `--platform dynamo`, where each leaf carries hardware-specific tuning (GPU GRES strings, accelerator resource limits) that the mixin merge path cannot represent cleanly. Other platforms, such as `--platform kubeflow` and `--platform inference`, still use the `platform-kubeflow` / `platform-inference` mixins shown above, since their leaf-specific tuning is minimal. + +For example, each `--platform slurm` leaf declares three `componentRefs` inline: + +- `slinky-slurm-operator-crds` — SchedMD Slinky CRDs +- `slinky-slurm-operator` — the operator and admission webhook +- `slinky-slurm` — the Slinky-managed Slurm cluster instance (Controller / LoginSet / NodeSet / RestApi), with leaf-specific `overrides` (e.g.
H100 GRES wiring on the `nodesets.slinky` map) + +This is the same shape `dynamo-platform` uses across the `*-inference-dynamo` leaves. See `recipes/overlays/h100-eks-ubuntu-training-slurm.yaml` for the full example. When authoring a recipe targeting Talos (`criteria.os: talos`), append the `os-talos` mixin to your overlay's `spec.mixins` list (e.g. `spec.mixins: [os-talos]`, or `[platform-kubeflow, os-talos]` if you already mix in a non-OS fragment). OS-scoped mixins are mutually exclusive — combining `os-ubuntu` and `os-talos` in one overlay is a recipe authoring error, not a supported composition. The mixin overrides namespaces for affected components and supplies PSA-privileged Namespace manifests via `componentRefs[].preManifestFiles`, which are applied before each chart — see [Talos integration](talos-integration.md) for the component list and labels. diff --git a/docs/user/api-reference.md b/docs/user/api-reference.md index d38a2da5e..9312b48e1 100644 --- a/docs/user/api-reference.md +++ b/docs/user/api-reference.md @@ -366,6 +366,7 @@ Bundler names correspond to component names in [`recipes/registry.yaml`](https:/ | `nvsentinel` | GPU health monitoring and automated remediation | | `prometheus-adapter` | Custom metrics for HPA scaling | | `prometheus-operator-crds` | CRDs for the prometheus-operator (`Alertmanager`, `Prometheus`, `ServiceMonitor`, etc.) 
| +| `slinky-slurm` | Slinky-managed Slurm cluster instance (Controller, LoginSet, NodeSet, RestApi); reconciled by `slinky-slurm-operator` | | `slinky-slurm-operator` | SchedMD Slinky Slurm operator and admission webhook | | `slinky-slurm-operator-crds` | CRDs for the SchedMD Slinky Slurm operator (`slinky.slurm.net`) | diff --git a/docs/user/component-catalog.md b/docs/user/component-catalog.md index ddfb06ebc..72014f8f5 100644 --- a/docs/user/component-catalog.md +++ b/docs/user/component-catalog.md @@ -35,7 +35,7 @@ The source of truth is [`recipes/registry.yaml`](https://github.com/NVIDIA/aicr/ | **kubeflow-trainer** | Kubeflow Training Operator for distributed training jobs (PyTorch, etc.). Manages multi-node training job lifecycle with JobSet integration. | [Kubeflow Trainer](https://github.com/kubeflow/trainer) | | **slinky-slurm-operator-crds** | Custom Resource Definitions for the SchedMD Slinky Slurm operator. Installs the `slinky.slurm.net` CRDs (Controller, NodeSet, LoginSet, Accounting, RestApi, Token). Installed separately to support CRD lifecycle management. | [Slinky Slurm Operator](https://github.com/SlinkyProject/slurm-operator) | | **slinky-slurm-operator** | SchedMD Slinky Slurm operator and admission webhook. Manages the lifecycle of Slurm clusters declared via Slinky CRs (Controller, NodeSet, LoginSet, Accounting, RestApi, Token). | [Slinky Slurm Operator](https://github.com/SlinkyProject/slurm-operator) | -| **slinky-slurm** | Slinky-managed Slurm cluster instance: Controller (slurmctld) + LoginSet (sackd/sshd) + NodeSet (slurmd) + RestApi (slurmrestd). Reconciled by `slinky-slurm-operator`. Opt-in via the `platform-slurm-cluster` mixin (alongside `platform-slurm` for the operator). Accounting (slurmdbd) requires an external MariaDB and is disabled in defaults — see `recipes/components/slinky-slurm/values.yaml`. 
| [Slinky Slurm Cluster Chart](https://github.com/SlinkyProject/slurm-operator/tree/main/helm/slurm) | +| **slinky-slurm** | Slinky-managed Slurm cluster instance: Controller (slurmctld) + LoginSet (sackd/sshd) + NodeSet (slurmd) + RestApi (slurmrestd). Reconciled by `slinky-slurm-operator`. Declared inline per slurm leaf overlay alongside `slinky-slurm-operator-crds` and `slinky-slurm-operator` (matching the dynamo-platform pattern) so each leaf can carry its own GPU/GRES tuning. Accounting (slurmdbd) requires an external MariaDB and is disabled in defaults — see `recipes/components/slinky-slurm/values.yaml`. | [Slinky Slurm Cluster Chart](https://github.com/SlinkyProject/slurm-operator/tree/main/helm/slurm) | ## How Components Are Selected @@ -44,7 +44,7 @@ Not every component appears in every recipe. The recipe engine selects component - **Base components** (cert-manager, kube-prometheus-stack) appear in most recipes. - **Cloud-specific components** (aws-efa, aws-ebs-csi-driver) are added when the service matches. - **Intent-specific components** (agentgateway, agentgateway-crds) are added based on workload intent (e.g., inference recipes include the inference gateway). -- **Platform-specific components** (slinky-slurm-operator, slinky-slurm, kubeflow-trainer, dynamo-platform) are added when the recipe selects a matching `--platform`. For `--platform slurm`, the cluster (`slinky-slurm`) is opt-in via the `platform-slurm-cluster` mixin alongside the always-applied operator (`platform-slurm`); leaves that want operator-only compose just `platform-slurm`. +- **Platform-specific components** (slinky-slurm-operator, slinky-slurm, kubeflow-trainer, dynamo-platform) are added when the recipe selects a matching `--platform`. For `--platform slurm`, all three Slinky pieces (`slinky-slurm-operator-crds`, `slinky-slurm-operator`, `slinky-slurm`) are declared inline per slurm leaf overlay — the same shape `dynamo-platform` uses across `*-inference-dynamo` leaves. 
An operator-only leaf inlines just the CRDs and the operator, omitting the `slinky-slurm` componentRef. - **Accelerator/OS-specific tuning** (nodewright-customizations, nvidia-dra-driver-gpu) varies by hardware and OS combination. ### NFD Topology Updater diff --git a/pkg/recipe/deployment_order_guard_test.go b/pkg/recipe/deployment_order_guard_test.go index d374aa815..61ce26480 100644 --- a/pkg/recipe/deployment_order_guard_test.go +++ b/pkg/recipe/deployment_order_guard_test.go @@ -198,6 +198,29 @@ func TestDeploymentOrderGuards(t *testing.T) { {"gpu-operator", "nvsentinel"}, }, }, + { + name: "h100-gke-cos-training-slurm", + criteria: func() *Criteria { + c := NewCriteria() + c.Service = CriteriaServiceGKE + c.Accelerator = CriteriaAcceleratorH100 + c.OS = CriteriaOSCOS + c.Intent = CriteriaIntentTraining + c.Platform = CriteriaPlatformSlurm + return c + }, + requiredDeps: map[string][]string{ + "slinky-slurm-operator": {"cert-manager", "slinky-slurm-operator-crds"}, + "slinky-slurm": {"slinky-slurm-operator", "slinky-slurm-operator-crds"}, + }, + requiredOrdering: [][2]string{ + {"cert-manager", "slinky-slurm-operator"}, + {"slinky-slurm-operator-crds", "slinky-slurm-operator"}, + {"slinky-slurm-operator", "slinky-slurm"}, + {"slinky-slurm-operator-crds", "slinky-slurm"}, + {"gpu-operator", "nvsentinel"}, + }, + }, { name: "h100-kind-training-slurm", criteria: func() *Criteria { diff --git a/pkg/recipe/metadata_test.go b/pkg/recipe/metadata_test.go index fc4cf1679..8d51ec621 100644 --- a/pkg/recipe/metadata_test.go +++ b/pkg/recipe/metadata_test.go @@ -1860,6 +1860,7 @@ func TestNFDTopologyUpdater_OverlayCoverage(t *testing.T) { {"h100-eks-ubuntu-training", criteria{CriteriaServiceEKS, CriteriaAcceleratorH100, CriteriaOSUbuntu, CriteriaIntentTraining, ""}, true}, {"h100-eks-ubuntu-inference", criteria{CriteriaServiceEKS, CriteriaAcceleratorH100, CriteriaOSUbuntu, CriteriaIntentInference, ""}, true}, {"h100-eks-ubuntu-training-kubeflow",
criteria{CriteriaServiceEKS, CriteriaAcceleratorH100, CriteriaOSUbuntu, CriteriaIntentTraining, CriteriaPlatformKubeflow}, true}, + {"h100-eks-ubuntu-training-slurm", criteria{CriteriaServiceEKS, CriteriaAcceleratorH100, CriteriaOSUbuntu, CriteriaIntentTraining, CriteriaPlatformSlurm}, true}, {"h100-eks-ubuntu-inference-dynamo", criteria{CriteriaServiceEKS, CriteriaAcceleratorH100, CriteriaOSUbuntu, CriteriaIntentInference, CriteriaPlatformDynamo}, true}, {"h100-eks-ubuntu-inference-nim", criteria{CriteriaServiceEKS, CriteriaAcceleratorH100, CriteriaOSUbuntu, CriteriaIntentInference, CriteriaPlatformNIM}, true}, // H100 AKS Ubuntu variants @@ -1869,6 +1870,7 @@ func TestNFDTopologyUpdater_OverlayCoverage(t *testing.T) { {"h100-aks-ubuntu-inference-dynamo", criteria{CriteriaServiceAKS, CriteriaAcceleratorH100, CriteriaOSUbuntu, CriteriaIntentInference, CriteriaPlatformDynamo}, true}, // H100 GKE COS platform variants (GKE uses COS, no Ubuntu variant) {"h100-gke-cos-training-kubeflow", criteria{CriteriaServiceGKE, CriteriaAcceleratorH100, CriteriaOSCOS, CriteriaIntentTraining, CriteriaPlatformKubeflow}, true}, + {"h100-gke-cos-training-slurm", criteria{CriteriaServiceGKE, CriteriaAcceleratorH100, CriteriaOSCOS, CriteriaIntentTraining, CriteriaPlatformSlurm}, true}, {"h100-gke-cos-inference-dynamo", criteria{CriteriaServiceGKE, CriteriaAcceleratorH100, CriteriaOSCOS, CriteriaIntentInference, CriteriaPlatformDynamo}, true}, // GB200 EKS Ubuntu variants {"gb200-eks-ubuntu-training", criteria{CriteriaServiceEKS, CriteriaAcceleratorGB200, CriteriaOSUbuntu, CriteriaIntentTraining, ""}, true}, @@ -1889,6 +1891,7 @@ func TestNFDTopologyUpdater_OverlayCoverage(t *testing.T) { {"h100-kind-inference", criteria{CriteriaServiceKind, CriteriaAcceleratorH100, "", CriteriaIntentInference, ""}, false}, // Deeper kind leaves — platform variants must also stay OFF {"h100-kind-training-kubeflow", criteria{CriteriaServiceKind, CriteriaAcceleratorH100, "", CriteriaIntentTraining, 
CriteriaPlatformKubeflow}, false}, + {"h100-kind-training-slurm", criteria{CriteriaServiceKind, CriteriaAcceleratorH100, "", CriteriaIntentTraining, CriteriaPlatformSlurm}, false}, {"h100-kind-inference-dynamo", criteria{CriteriaServiceKind, CriteriaAcceleratorH100, "", CriteriaIntentInference, CriteriaPlatformDynamo}, false}, } diff --git a/recipes/mixins/platform-slurm-cluster.yaml b/recipes/mixins/platform-slurm-cluster.yaml deleted file mode 100644 index 71ac2ad8b..000000000 --- a/recipes/mixins/platform-slurm-cluster.yaml +++ /dev/null @@ -1,30 +0,0 @@ -# Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -# Opt-in mixin: a Slinky-managed Slurm cluster instance (Controller / -# LoginSet / NodeSet / RestApi) on top of the operator from -# platform-slurm. Leaves wanting only the operator compose -# platform-slurm alone; leaves wanting a runnable cluster compose both. -kind: RecipeMixin -apiVersion: aicr.nvidia.com/v1alpha1 -metadata: - name: platform-slurm-cluster -spec: - componentRefs: - - name: slinky-slurm - type: Helm - valuesFile: components/slinky-slurm/values.yaml - dependencyRefs: - - slinky-slurm-operator - - slinky-slurm-operator-crds diff --git a/recipes/mixins/platform-slurm.yaml b/recipes/mixins/platform-slurm.yaml deleted file mode 100644 index 972060416..000000000 --- a/recipes/mixins/platform-slurm.yaml +++ /dev/null @@ -1,30 +0,0 @@ -# Copyright (c) 2026, NVIDIA CORPORATION. 
All rights reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -kind: RecipeMixin -apiVersion: aicr.nvidia.com/v1alpha1 -metadata: - name: platform-slurm -spec: - componentRefs: - - name: slinky-slurm-operator-crds - type: Helm - valuesFile: components/slinky-slurm-operator-crds/values.yaml - - - name: slinky-slurm-operator - type: Helm - valuesFile: components/slinky-slurm-operator/values.yaml - dependencyRefs: - - cert-manager - - slinky-slurm-operator-crds diff --git a/recipes/overlays/h100-eks-ubuntu-training-slurm.yaml b/recipes/overlays/h100-eks-ubuntu-training-slurm.yaml index 0824fc35d..c15271b24 100644 --- a/recipes/overlays/h100-eks-ubuntu-training-slurm.yaml +++ b/recipes/overlays/h100-eks-ubuntu-training-slurm.yaml @@ -19,9 +19,10 @@ metadata: spec: # H100 + EKS + Ubuntu + training with the Slinky operator and a - # Slinky-managed Slurm cluster. EKS-specific cluster tuning (gp3 - # storage, GPU GRES, DCGM job-mapping) is layered at install time - # via `aicr bundle ... --set slurmcluster:...` or a valuesFile. + # Slinky-managed Slurm cluster. H100 GPU GRES is declared inline below + # (pod-side limit + slurmd-side Gres= line); remaining EKS-specific + # tuning (gp3 storage, DCGM job-mapping) is layered at install time + # via `aicr bundle ... --set slinkyslurm:...` or a valuesFile. 
base: h100-eks-ubuntu-training criteria: @@ -33,18 +34,67 @@ spec: mixins: - os-ubuntu - - platform-slurm - - platform-slurm-cluster constraints: - name: K8s.server.version value: ">= 1.32.4" - # Mixin-contributed components cannot be overridden from a leaf; use - # `--set slurmcluster:...` or a valuesFile at install time instead. + # The Slinky operator (CRDs + operator + cluster instance) is declared + # inline per slurm leaf, mirroring the dynamo-platform pattern in + # h100-*-inference-dynamo leaves. Inlining lets each leaf carry its + # own GPU/GRES tuning without fighting the mixin-vs-leaf identity-field + # guard in mixinComponentRefSafeForMerge (pkg/recipe/metadata_store.go), + # and keeps base.yaml free of platform-specific components. + # + # GPU GRES on slinky-slurm must be declared in two places because the + # chart does not derive Gres= in slurm.conf from pod resource limits + # (see comment in components/slinky-slurm/values.yaml): + # 1. nodesets.slinky.extraConfMap.Gres — adds `Gres=gpu:h100:8` to + # slurmd's --conf so slurmctld knows it has GPUs to allocate via + # `srun --gres=gpu:N`. + # 2. nodesets.slinky.slurmd.resources.limits.nvidia.com/gpu — reserves + # 8 H100s on the slurmd pod so the NVIDIA device plugin injects + # /dev/nvidia* into the container. Without this `gres.conf`'s + # AutoDetect=nvidia finds nothing. `requests` is omitted: Kubernetes + # auto-mirrors requests=limits for extended resources. # Accelerated nodeSelector/tolerations on slurmd are injected via the # registry's nodesets.slinky.podSpec.{nodeSelector,tolerations} paths. - componentRefs: [] + componentRefs: + - name: slinky-slurm-operator-crds + type: Helm + valuesFile: components/slinky-slurm-operator-crds/values.yaml - # Validation is inherited from h100-eks-training (operator-health, - # expected-resources, gpu-operator-version, check-nvidia-smi). 
+ - name: slinky-slurm-operator + type: Helm + valuesFile: components/slinky-slurm-operator/values.yaml + dependencyRefs: + - cert-manager + - slinky-slurm-operator-crds + + - name: slinky-slurm + type: Helm + valuesFile: components/slinky-slurm/values.yaml + dependencyRefs: + - slinky-slurm-operator + - slinky-slurm-operator-crds + overrides: + nodesets: + slinky: + extraConfMap: + Gres: "gpu:h100:8" + slurmd: + resources: + limits: + nvidia.com/gpu: 8 + + # K8s-native nccl-all-reduce-bw is dropped on Slinky leaves: that + # check launches a Pod against the cluster scheduler, so on a + # Slinky-managed cluster it bypasses slurmd entirely and measures the + # wrong path. The equivalent signal here is a slurm-launched + # `srun nccl-tests/all_reduce_perf` that goes through slurmd + the + # EFA libfabric stack already present on the parent EKS leaf. + # Deployment and conformance checks are inherited unchanged. + validation: + performance: + checks: [] + constraints: [] diff --git a/recipes/overlays/h100-gke-cos-training-slurm.yaml b/recipes/overlays/h100-gke-cos-training-slurm.yaml new file mode 100644 index 000000000..688326312 --- /dev/null +++ b/recipes/overlays/h100-gke-cos-training-slurm.yaml @@ -0,0 +1,103 @@ +# Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +kind: RecipeMetadata +apiVersion: aicr.nvidia.com/v1alpha1 +metadata: + name: h100-gke-cos-training-slurm + +spec: + # H100 + GKE + COS + training with the Slinky operator and a + # Slinky-managed Slurm cluster. H100 GPU GRES is declared inline + # below; remaining GKE-specific tuning (single-replica login on RWO + # storage, DCGM job-mapping) is layered at install time via + # `aicr bundle ... --set slinkyslurm:...` or a valuesFile. + base: h100-gke-cos-training + + criteria: + service: gke + accelerator: h100 + os: cos + intent: training + platform: slurm + + # Unlike the EKS slurm leaf, no os-ubuntu / os-cos mixin is needed: + # gke-cos.yaml already disables GPU driver installation and pins + # DRA / nodewright paths for COS's read-only rootfs. + # + # K8s.server.version (>= 1.32) is inherited from h100-gke-cos-training.yaml; + # Slinky on GKE has no tighter floor than the parent leaf, so we don't + # restate it here (cf. the EKS slurm leaf, which restates >= 1.32.4 to + # match its parent). + + # The Slinky operator (CRDs + operator + cluster instance) is declared + # inline per slurm leaf, mirroring the dynamo-platform pattern in + # h100-*-inference-dynamo leaves. Inlining lets each leaf carry its + # own GPU/GRES tuning without fighting the mixin-vs-leaf identity-field + # guard in mixinComponentRefSafeForMerge (pkg/recipe/metadata_store.go), + # and keeps base.yaml free of platform-specific components. + # + # GPU GRES on slinky-slurm must be declared in two places because the + # chart does not derive Gres= in slurm.conf from pod resource limits + # (see comment in components/slinky-slurm/values.yaml): + # 1. nodesets.slinky.extraConfMap.Gres — adds `Gres=gpu:h100:8` to + # slurmd's --conf so slurmctld knows it has GPUs to allocate via + # `srun --gres=gpu:N`. + # 2. nodesets.slinky.slurmd.resources.limits.nvidia.com/gpu — reserves + # 8 H100s on the slurmd pod so the NVIDIA device plugin injects + # /dev/nvidia* into the container. 
Without this `gres.conf`'s + # AutoDetect=nvidia finds nothing. `requests` is omitted: Kubernetes + # auto-mirrors requests=limits for extended resources. + # Accelerated nodeSelector/tolerations on slurmd are injected via the + # registry's nodesets.slinky.podSpec.{nodeSelector,tolerations} paths. + componentRefs: + - name: slinky-slurm-operator-crds + type: Helm + valuesFile: components/slinky-slurm-operator-crds/values.yaml + + - name: slinky-slurm-operator + type: Helm + valuesFile: components/slinky-slurm-operator/values.yaml + dependencyRefs: + - cert-manager + - slinky-slurm-operator-crds + + - name: slinky-slurm + type: Helm + valuesFile: components/slinky-slurm/values.yaml + dependencyRefs: + - slinky-slurm-operator + - slinky-slurm-operator-crds + overrides: + nodesets: + slinky: + extraConfMap: + Gres: "gpu:h100:8" + slurmd: + resources: + limits: + nvidia.com/gpu: 8 + + # K8s-native nccl-all-reduce-bw is dropped on Slinky leaves: that + # check launches a Pod against the cluster scheduler, so on a + # Slinky-managed cluster it bypasses slurmd entirely and measures the + # wrong path. The equivalent signal here is a slurm-launched + # `srun nccl-tests/all_reduce_perf` that goes through slurmd + the + # GPUDirect TCPXO plugin already deployed by the parent leaf via + # gke-nccl-tcpxo. Deployment and conformance checks are inherited + # unchanged. + validation: + performance: + checks: [] + constraints: [] diff --git a/recipes/overlays/h100-kind-training-slurm.yaml b/recipes/overlays/h100-kind-training-slurm.yaml index 60368a99f..0792876b1 100644 --- a/recipes/overlays/h100-kind-training-slurm.yaml +++ b/recipes/overlays/h100-kind-training-slurm.yaml @@ -29,17 +29,38 @@ spec: intent: training platform: slurm - mixins: - - platform-slurm - - platform-slurm-cluster - # DRA (GA in K8s 1.34) — restated from the parent for clarity. 
constraints: - name: K8s.server.version value: ">= 1.34" - # Mixin-contributed components cannot be overridden from a leaf; use - # `--set slurmcluster:...` or a valuesFile at install time instead. - componentRefs: [] + # The Slinky operator (CRDs + operator + cluster instance) is declared + # inline per slurm leaf, mirroring the dynamo-platform pattern in + # h100-*-inference-dynamo leaves. Inlining lets each leaf carry its + # own GPU/GRES tuning (Kind = CPU-only; H100 leaves = `Gres=gpu:h100:8` + # + `nvidia.com/gpu: 8`) without fighting the mixin-vs-leaf + # identity-field guard in mixinComponentRefSafeForMerge + # (pkg/recipe/metadata_store.go), and keeps base.yaml free of + # platform-specific components. + # + # Kind has no GPUs, so the NodeSet runs CPU-only: no `extraConfMap.Gres`, + # no `slurmd.resources.limits.nvidia.com/gpu`, no DCGM. This makes the + # leaf usable as a no-GPU CI smoke test for the operator + chart wiring. + componentRefs: + - name: slinky-slurm-operator-crds + type: Helm + valuesFile: components/slinky-slurm-operator-crds/values.yaml + + - name: slinky-slurm-operator + type: Helm + valuesFile: components/slinky-slurm-operator/values.yaml + dependencyRefs: + - cert-manager + - slinky-slurm-operator-crds - # Validation is inherited from h100-kind-training. + - name: slinky-slurm + type: Helm + valuesFile: components/slinky-slurm/values.yaml + dependencyRefs: + - slinky-slurm-operator + - slinky-slurm-operator-crds diff --git a/recipes/registry.yaml b/recipes/registry.yaml index 51ecbe1cb..71c50dae1 100644 --- a/recipes/registry.yaml +++ b/recipes/registry.yaml @@ -644,7 +644,9 @@ components: - name: slinky-slurm displayName: slinky-slurm - # Cluster instance chart; wired in via platform-slurm-cluster mixin. + # Cluster instance chart; declared inline per slurm leaf overlay so + # each leaf can carry its own GPU/GRES tuning (mirrors the + # dynamo-platform pattern in inference-dynamo leaves). valueOverrideKeys: - slinkyslurm - slurmcluster
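
Reviewer note on the operator-only case: the docs changes in this diff say a leaf that wants only the operator inlines the CRDs and the operator and omits the `slinky-slurm` componentRef, but no overlay in the change shows that shape. A minimal sketch of the `componentRefs` fragment such a leaf might carry follows. The component names, `type`, `valuesFile` paths, and `dependencyRefs` are copied from the slurm overlays in this diff; the indentation is assumed (the flattened diff does not preserve it), and no operator-only leaf is known to exist in the repository.

```yaml
# Hypothetical operator-only fragment; not part of this change.
# Field values are taken from the slurm leaf overlays in this diff.
spec:
  componentRefs:
    - name: slinky-slurm-operator-crds
      type: Helm
      valuesFile: components/slinky-slurm-operator-crds/values.yaml

    - name: slinky-slurm-operator
      type: Helm
      valuesFile: components/slinky-slurm-operator/values.yaml
      dependencyRefs:
        - cert-manager
        - slinky-slurm-operator-crds

    # No slinky-slurm componentRef here, so no Controller / LoginSet /
    # NodeSet / RestApi instance is declared by this leaf.
```

If such a leaf were added, a matching entry in `TestDeploymentOrderGuards` would presumably keep only the `cert-manager` and `slinky-slurm-operator-crds` ordering pairs for `slinky-slurm-operator`.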