diff --git a/docs/integrator/recipe-development.md b/docs/integrator/recipe-development.md
index 752564708..e27037e5c 100644
--- a/docs/integrator/recipe-development.md
+++ b/docs/integrator/recipe-development.md
@@ -115,6 +115,8 @@ spec:
 
 Mixins use `kind: RecipeMixin` and carry only `constraints` and `componentRefs`. They live in `recipes/mixins/` and are applied after inheritance chain merging. See [Data Architecture](../contributor/data.md#mixin-composition) for details.
 
+A platform may split into multiple mixins when parts of the stack are independently opt-in. For example, `--platform slurm` resolves through two mixins: `platform-slurm` always contributes the SchedMD Slinky operator and CRDs, and `platform-slurm-cluster` is opt-in for the Slinky-managed Slurm cluster instance (Controller / LoginSet / NodeSet / RestApi). A leaf that wants operator-only composes just `platform-slurm`; a leaf that wants the cluster too composes both — see `recipes/overlays/h100-eks-ubuntu-training-slurm.yaml` for the latter.
+
 When authoring a recipe targeting Talos (`criteria.os: talos`), append the `os-talos` mixin to your overlay's `spec.mixins` list (e.g. `spec.mixins: [os-talos]`, or `[platform-kubeflow, os-talos]` if you already mix in a non-OS fragment). OS-scoped mixins are mutually exclusive — combining `os-ubuntu` and `os-talos` in one overlay is a recipe authoring error, not a supported composition. The mixin overrides namespaces for affected components and supplies PSA-privileged Namespace manifests via `componentRefs[].preManifestFiles`, which are applied before each chart — see [Talos integration](talos-integration.md) for the component list and labels.
 
 **Cross-cutting overlays with wildcard criteria** apply across one criteria dimension without being referenced via `spec.base` or listed in `spec.mixins`. The resolver can return multiple independent maximal-leaf overlays for a single query, so a `service: any` overlay is picked up alongside the service-specific maximal leaf and its inheritance chain:
diff --git a/docs/user/component-catalog.md b/docs/user/component-catalog.md
index 48e42d86e..b31bfbd55 100644
--- a/docs/user/component-catalog.md
+++ b/docs/user/component-catalog.md
@@ -34,7 +34,8 @@ The source of truth is [`recipes/registry.yaml`](https://github.com/NVIDIA/aicr/
 | **kueue** | Kubernetes-native job queuing system. Manages quotas and admits jobs for batch and AI workloads. | [Kueue](https://github.com/kubernetes-sigs/kueue) |
 | **kubeflow-trainer** | Kubeflow Training Operator for distributed training jobs (PyTorch, etc.). Manages multi-node training job lifecycle with JobSet integration. | [Kubeflow Trainer](https://github.com/kubeflow/trainer) |
 | **slinky-slurm-operator-crds** | Custom Resource Definitions for the SchedMD Slinky Slurm operator. Installs the `slinky.slurm.net` CRDs (Controller, NodeSet, LoginSet, Accounting, RestApi, Token). Installed separately to support CRD lifecycle management. | [Slinky Slurm Operator](https://github.com/SlinkyProject/slurm-operator) |
-| **slinky-slurm-operator** | SchedMD Slinky Slurm operator and admission webhook. Manages the lifecycle of Slurm clusters declared via Slinky CRs. Cluster-instance CRs (Controller, NodeSet, LoginSet, ...) are user-authored — AICR ships only the operator, mirroring how dynamo-platform and kubeflow-trainer ship operator-only. | [Slinky Slurm Operator](https://github.com/SlinkyProject/slurm-operator) |
+| **slinky-slurm-operator** | SchedMD Slinky Slurm operator and admission webhook. Manages the lifecycle of Slurm clusters declared via Slinky CRs (Controller, NodeSet, LoginSet, Accounting, RestApi, Token). | [Slinky Slurm Operator](https://github.com/SlinkyProject/slurm-operator) |
+| **slinky-slurm** | Slinky-managed Slurm cluster instance: Controller (slurmctld) + LoginSet (sackd/sshd) + NodeSet (slurmd) + RestApi (slurmrestd). Reconciled by `slinky-slurm-operator`. Opt-in via the `platform-slurm-cluster` mixin (alongside `platform-slurm` for the operator). Accounting (slurmdbd) requires an external MariaDB and is disabled in defaults — see `recipes/components/slinky-slurm/values.yaml`. | [Slinky Slurm Cluster Chart](https://github.com/SlinkyProject/slurm-operator/tree/main/helm/slurm) |
 
 ## How Components Are Selected
 
@@ -43,7 +44,7 @@ Not every component appears in every recipe. The recipe engine selects component
 - **Base components** (cert-manager, kube-prometheus-stack) appear in most recipes.
 - **Cloud-specific components** (aws-efa, aws-ebs-csi-driver) are added when the service matches.
 - **Intent-specific components** (agentgateway, agentgateway-crds) are added based on workload intent (e.g., inference recipes include the inference gateway).
-- **Platform-specific components** (slinky-slurm-operator, kubeflow-trainer, dynamo-platform) are added when the recipe selects a matching `--platform`.
+- **Platform-specific components** (slinky-slurm-operator, slinky-slurm, kubeflow-trainer, dynamo-platform) are added when the recipe selects a matching `--platform`. For `--platform slurm`, the cluster (`slinky-slurm`) is opt-in via the `platform-slurm-cluster` mixin alongside the always-applied operator (`platform-slurm`); leaves that want operator-only compose just `platform-slurm`.
 - **Accelerator/OS-specific tuning** (nodewright-customizations, nvidia-dra-driver-gpu) varies by hardware and OS combination.
 
 ### NFD Topology Updater
diff --git a/docs/user/container-images.md b/docs/user/container-images.md
index e44182ca1..ab2dbc9a4 100644
--- a/docs/user/container-images.md
+++ b/docs/user/container-images.md
@@ -19,8 +19,8 @@ A machine-readable **CycloneDX 1.6 JSON** companion to this page is produced by
 
 ## Summary
 
-- Components: **25**
-- Unique images: **71**
+- Components: **26**
+- Unique images: **76**
 - Distinct registries: **11**
 
 Registries: `602401143452.dkr.ecr.us-west-2.amazonaws.com`, `cr.agentgateway.dev`, `docker.io`, `gcr.io`, `ghcr.io`, `gke.gcr.io`, `nvcr.io`, `public.ecr.aws`, `quay.io`, `registry.k8s.io`, `us-docker.pkg.dev`
@@ -52,6 +52,7 @@ Registries: `602401143452.dkr.ecr.us-west-2.amazonaws.com`, `cr.agentgateway.dev
 | nvsentinel | helm | nvsentinel | v1.3.0 | 6 |
 | prometheus-adapter | helm | prometheus-community/prometheus-adapter | 5.3.0 | 1 |
 | prometheus-operator-crds | helm | prometheus-community/prometheus-operator-crds | 28.0.1 | 0 |
+| slinky-slurm | helm | slurm | 1.1.0 | 5 |
 | slinky-slurm-operator | helm | slurm-operator | 1.1.0 | 2 |
 | slinky-slurm-operator-crds | helm | slurm-operator-crds | 1.1.0 | 0 |
 
@@ -197,6 +198,14 @@ _No images extracted._
 
 _No images extracted._
 
+### slinky-slurm
+
+- `docker.io/library/alpine:3.23.3`
+- `ghcr.io/slinkyproject/login:25.11-ubuntu24.04`
+- `ghcr.io/slinkyproject/slurmctld:25.11-ubuntu24.04`
+- `ghcr.io/slinkyproject/slurmd:25.11-ubuntu24.04`
+- `ghcr.io/slinkyproject/slurmrestd:25.11-ubuntu24.04`
+
 ### slinky-slurm-operator
 
 - `ghcr.io/slinkyproject/slurm-operator-webhook:1.1.0`
diff --git a/kwok/scripts/validate-scheduling.sh b/kwok/scripts/validate-scheduling.sh
index b44d48830..8c54adb31 100755
--- a/kwok/scripts/validate-scheduling.sh
+++ b/kwok/scripts/validate-scheduling.sh
@@ -311,21 +311,32 @@ generate_bundle() {
         exit 1
     fi
 
-    # Extract criteria from overlay
-    local service accelerator intent os
+    # Without --platform, *-slurm overlays resolve to their non-platform
+    # parent and the bundle omits the slinky-slurm operator/cluster.
+    # Scoped to slurm: kubeflow/dynamo are not yet validated under KWOK.
+    local service accelerator intent os platform
     service=$(yq eval '.spec.criteria.service // ""' "$recipe_overlay")
     accelerator=$(yq eval '.spec.criteria.accelerator // ""' "$recipe_overlay")
     intent=$(yq eval '.spec.criteria.intent // ""' "$recipe_overlay")
     os=$(yq eval '.spec.criteria.os // ""' "$recipe_overlay")
+    platform=$(yq eval '.spec.criteria.platform // ""' "$recipe_overlay")
 
-    log_info "Criteria: service=$service accelerator=$accelerator intent=$intent os=$os"
+    log_info "Criteria: service=$service accelerator=$accelerator intent=$intent os=$os platform=$platform"
 
-    # Build recipe command with available criteria
     local recipe_args=()
     [[ -n "$service" ]] && recipe_args+=(--service "$service")
     [[ -n "$accelerator" ]] && recipe_args+=(--accelerator "$accelerator")
     [[ -n "$intent" ]] && recipe_args+=(--intent "$intent")
     [[ -n "$os" ]] && recipe_args+=(--os "$os")
+    # Only forward --platform for platforms validated under KWOK. Other
+    # platforms (kubeflow, dynamo, nim) historically resolve to their
+    # non-platform parent here; preserve that behavior to avoid regressing
+    # existing matrix lanes. Extend as additional platforms are validated.
+    if [[ "$platform" == "slurm" ]]; then
+        recipe_args+=(--platform "$platform")
+    elif [[ -n "$platform" ]]; then
+        log_info "platform=$platform not yet validated under KWOK — resolving without --platform"
+    fi
 
     # Generate resolved recipe from criteria
     log_info "Generating resolved recipe..."
@@ -348,6 +359,19 @@ generate_bundle() {
     # Disable features not needed for scheduling validation:
     # - PrometheusRules and AlertManager (slow to create)
     # - Nodewright customization (creates CRs that depend on operator CRDs)
+    # - slinky-slurm-operator webhook + cert-manager wiring: the operator's
+    #   webhook validates Slurm CRs through a Service whose pod runs on a
+    #   KWOK fake (Ready without container). Both certManager.enabled and
+    #   webhook.enabled gate the cert-manager.io/Certificate submission
+    #   plus the ValidatingWebhookConfiguration. Disabling them skips
+    #   admission entirely; harmless under KWOK since no real Slurm CRs
+    #   are reconciled.
+    # - slinky-slurm controller persistence: the chart provisions a PVC
+    #   via the cluster's default StorageClass. Kind's local-path provisioner
+    #   binds with WaitForFirstConsumer, so the PVC is pinned to whichever
+    #   node the pod schedules on — and KWOK fakes can't actually back a
+    #   local-path volume, leaving the pod stuck Pending with NominatedNodeName
+    #   set. Disabling persistence lets the controller pod bind.
 
     log_info "Generating bundle..."
     local bundle_output
@@ -360,6 +384,9 @@ generate_bundle() {
         --accelerated-node-toleration "nvidia.com/gpu=present:NoSchedule" \
         --accelerated-node-toleration "kwok.x-k8s.io/node=fake:NoSchedule" \
         --set "certmanager:startupapicheck.enabled=false" \
+        --set "slinkyslurmoperator:webhook.enabled=false" \
+        --set "slinkyslurmoperator:certManager.enabled=false" \
+        --set "slurmcluster:controller.persistence.enabled=false" \
         --set "kubeprometheusstack:defaultRules.create=false" \
         --set "kubeprometheusstack:alertmanager.enabled=false" \
         --set "nodewright-customizations:enabled=false" \
diff --git a/pkg/recipe/components_test.go b/pkg/recipe/components_test.go
index b3135e39d..b642c3551 100644
--- a/pkg/recipe/components_test.go
+++ b/pkg/recipe/components_test.go
@@ -15,6 +15,7 @@
 package recipe
 
 import (
+	"maps"
 	"slices"
 	"strings"
 	"testing"
@@ -210,6 +211,78 @@ func TestComponentRegistry_NodeSchedulingPaths(t *testing.T) {
 	}
 }
 
+// Pins the `slinky` map-key choice for slinky-slurm on both sides:
+// the registry's nodeScheduling paths AND components/slinky-slurm/
+// values.yaml must reference the same key, or injected tolerations
+// land on a non-existent map entry.
+func TestComponentRegistry_SlinkySlurm_NodeSchedulingPaths(t *testing.T) {
+	registry, err := GetComponentRegistry()
+	if err != nil {
+		t.Fatalf("failed to load component registry: %v", err)
+	}
+
+	slurmCluster := registry.Get("slinky-slurm")
+	if slurmCluster == nil {
+		t.Fatal("slinky-slurm not found in registry")
+	}
+
+	wantSysToleration := []string{
+		"controller.podSpec.tolerations",
+		"restapi.podSpec.tolerations",
+		"loginsets.slinky.podSpec.tolerations",
+	}
+	gotSysToleration := slurmCluster.GetSystemTolerationPaths()
+	for _, p := range wantSysToleration {
+		if !slices.Contains(gotSysToleration, p) {
+			t.Errorf("slinky-slurm system toleration paths missing %q (got %v)", p, gotSysToleration)
+		}
+	}
+
+	wantSysSelector := []string{
+		"controller.podSpec.nodeSelector",
+		"restapi.podSpec.nodeSelector",
+		"loginsets.slinky.podSpec.nodeSelector",
+	}
+	gotSysSelector := slurmCluster.GetSystemNodeSelectorPaths()
+	for _, p := range wantSysSelector {
+		if !slices.Contains(gotSysSelector, p) {
+			t.Errorf("slinky-slurm system node selector paths missing %q (got %v)", p, gotSysSelector)
+		}
+	}
+
+	gotAccelSelector := slurmCluster.GetAcceleratedNodeSelectorPaths()
+	if !slices.Contains(gotAccelSelector, "nodesets.slinky.podSpec.nodeSelector") {
+		t.Errorf("slinky-slurm accelerated node selector paths missing %q (got %v)",
+			"nodesets.slinky.podSpec.nodeSelector", gotAccelSelector)
+	}
+	gotAccelToleration := slurmCluster.GetAcceleratedTolerationPaths()
+	if !slices.Contains(gotAccelToleration, "nodesets.slinky.podSpec.tolerations") {
+		t.Errorf("slinky-slurm accelerated toleration paths missing %q (got %v)",
+			"nodesets.slinky.podSpec.tolerations", gotAccelToleration)
+	}
+
+	const valuesPath = "components/slinky-slurm/values.yaml"
+	content, err := GetEmbeddedFS().ReadFile(valuesPath)
+	if err != nil {
+		t.Fatalf("failed to read %s: %v", valuesPath, err)
+	}
+	var values struct {
+		Nodesets  map[string]any `yaml:"nodesets"`
+		Loginsets map[string]any `yaml:"loginsets"`
+	}
+	if err := yaml.Unmarshal(content, &values); err != nil {
+		t.Fatalf("failed to parse %s: %v", valuesPath, err)
+	}
+	if _, ok := values.Nodesets["slinky"]; !ok {
+		t.Errorf("%s must define nodesets.slinky to match the registry's "+
+			"nodeScheduling paths (got nodesets keys: %v)", valuesPath, slices.Sorted(maps.Keys(values.Nodesets)))
+	}
+	if _, ok := values.Loginsets["slinky"]; !ok {
+		t.Errorf("%s must define loginsets.slinky to match the registry's "+
+			"nodeScheduling paths (got loginsets keys: %v)", valuesPath, slices.Sorted(maps.Keys(values.Loginsets)))
+	}
+}
+
 func TestComponentRegistry_TaintStrPaths(t *testing.T) {
 	registry, err := GetComponentRegistry()
 	if err != nil {
diff --git a/pkg/recipe/deployment_order_guard_test.go b/pkg/recipe/deployment_order_guard_test.go
index e653dddd0..d374aa815 100644
--- a/pkg/recipe/deployment_order_guard_test.go
+++ b/pkg/recipe/deployment_order_guard_test.go
@@ -175,6 +175,50 @@ func TestDeploymentOrderGuards(t *testing.T) {
 				{"gpu-operator", "nvsentinel"},
 			},
 		},
+		{
+			name: "h100-eks-ubuntu-training-slurm",
+			criteria: func() *Criteria {
+				c := NewCriteria()
+				c.Service = CriteriaServiceEKS
+				c.Accelerator = CriteriaAcceleratorH100
+				c.OS = CriteriaOSUbuntu
+				c.Intent = CriteriaIntentTraining
+				c.Platform = CriteriaPlatformSlurm
+				return c
+			},
+			requiredDeps: map[string][]string{
+				"slinky-slurm-operator": {"cert-manager", "slinky-slurm-operator-crds"},
+				"slinky-slurm":          {"slinky-slurm-operator", "slinky-slurm-operator-crds"},
+			},
+			requiredOrdering: [][2]string{
+				{"cert-manager", "slinky-slurm-operator"},
+				{"slinky-slurm-operator-crds", "slinky-slurm-operator"},
+				{"slinky-slurm-operator", "slinky-slurm"},
+				{"slinky-slurm-operator-crds", "slinky-slurm"},
+				{"gpu-operator", "nvsentinel"},
+			},
+		},
+		{
+			name: "h100-kind-training-slurm",
+			criteria: func() *Criteria {
+				c := NewCriteria()
+				c.Service = CriteriaServiceKind
+				c.Accelerator = CriteriaAcceleratorH100
+				c.Intent = CriteriaIntentTraining
+				c.Platform = CriteriaPlatformSlurm
+				return c
+			},
+			requiredDeps: map[string][]string{
+				"slinky-slurm-operator": {"cert-manager", "slinky-slurm-operator-crds"},
+				"slinky-slurm":          {"slinky-slurm-operator", "slinky-slurm-operator-crds"},
+			},
+			requiredOrdering: [][2]string{
+				{"cert-manager", "slinky-slurm-operator"},
+				{"slinky-slurm-operator-crds", "slinky-slurm-operator"},
+				{"slinky-slurm-operator", "slinky-slurm"},
+				{"slinky-slurm-operator-crds", "slinky-slurm"},
+			},
+		},
 	}
 
 	for _, tt := range tests {
diff --git a/recipes/checks/slinky-slurm-operator/health-check.yaml b/recipes/checks/slinky-slurm-operator/health-check.yaml
index cadc44126..1fcf997c4 100644
--- a/recipes/checks/slinky-slurm-operator/health-check.yaml
+++ b/recipes/checks/slinky-slurm-operator/health-check.yaml
@@ -1,4 +1,4 @@
-# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
+# Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
@@ -28,9 +28,6 @@ spec:
   steps:
     - name: validate-operator-deployment
       try:
-        # Guard against vacuous pass on empty namespace: verify the
-        # slurm-operator deployment exists and has at least one ready
-        # replica.
         - assert:
             resource:
               apiVersion: apps/v1
@@ -42,8 +39,7 @@ spec:
                 (availableReplicas > `0`): true
     - name: validate-webhook-deployment
       try:
-        # The webhook must be ready before any Slinky CRs (Controller,
-        # NodeSet, etc.) can be created, so assert it independently.
+        # Webhook must be ready before any Slinky CR can be created.
         - assert:
             resource:
               apiVersion: apps/v1
@@ -55,17 +51,10 @@ spec:
                 (availableReplicas > `0`): true
     - name: validate-all-pods-healthy
       try:
-        # Assert no pods are in unhealthy phases.
-        # Pods must be Running (long-lived) or Succeeded (completed jobs).
-        # This catches Pending (init containers, scheduling), Failed, and
-        # Unknown.
-        #
-        # chainsaw `error` assertions pass when no matching resource exists,
-        # which would let this step trivially pass on an empty namespace.
-        # The two preceding deployment-availability steps prevent that:
-        # they require both deployments to have at least one ready replica
-        # in `slinky`, which guarantees pods exist and are inspectable
-        # before this step runs.
+        # Catch Pending / Failed / Unknown phases. Chainsaw `error` passes
+        # vacuously when no resource matches, so the preceding deployment-
+        # availability asserts are load-bearing: they guarantee the pods
+        # exist before this step runs.
         - error:
             resource:
               apiVersion: v1
diff --git a/recipes/checks/slinky-slurm/health-check.yaml b/recipes/checks/slinky-slurm/health-check.yaml
new file mode 100644
index 000000000..a05752408
--- /dev/null
+++ b/recipes/checks/slinky-slurm/health-check.yaml
@@ -0,0 +1,145 @@
+# Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Slinky Slurm cluster health check.
+#
+# Names assume the AICR localformat release name `slinky-slurm`; the
+# trailing `-slinky` comes from the chart's default map-key for
+# loginsets/nodesets. Other deployers using a different release name
+# need a parameterized variant of this check.
+# TODO: parameterize CR/workload names once chainsaw supports name-pattern
+# matching (currently hardcoded; flux/argocd deployers will rewrite release
+# names and silently break these asserts).
+apiVersion: chainsaw.kyverno.io/v1alpha1
+kind: Test
+metadata:
+  name: slinky-slurm-health-check
+spec:
+  timeouts:
+    assert: 10m
+  steps:
+    - name: validate-controller-cr
+      try:
+        - assert:
+            resource:
+              apiVersion: slinky.slurm.net/v1beta1
+              kind: Controller
+              metadata:
+                name: slinky-slurm
+                namespace: slurm
+    - name: validate-loginset-cr
+      try:
+        - assert:
+            resource:
+              apiVersion: slinky.slurm.net/v1beta1
+              kind: LoginSet
+              metadata:
+                name: slinky-slurm-login-slinky
+                namespace: slurm
+    - name: validate-nodeset-cr
+      try:
+        - assert:
+            resource:
+              apiVersion: slinky.slurm.net/v1beta1
+              kind: NodeSet
+              metadata:
+                name: slinky-slurm-worker-slinky
+                namespace: slurm
+    - name: validate-restapi-cr
+      try:
+        - assert:
+            resource:
+              apiVersion: slinky.slurm.net/v1beta1
+              kind: RestApi
+              metadata:
+                name: slinky-slurm
+                namespace: slurm
+    # CR existence alone does not prove reconciliation into running
+    # pods: assert each workload reports a ready replica before the
+    # pod-phase guard runs against the namespace.
+    - name: validate-controller-statefulset-ready
+      try:
+        - assert:
+            resource:
+              apiVersion: apps/v1
+              kind: StatefulSet
+              metadata:
+                name: slinky-slurm-controller
+                namespace: slurm
+              status:
+                (availableReplicas > `0`): true
+    - name: validate-login-deployment-ready
+      try:
+        - assert:
+            resource:
+              apiVersion: apps/v1
+              kind: Deployment
+              metadata:
+                name: slinky-slurm-login-slinky
+                namespace: slurm
+              status:
+                (availableReplicas > `0`): true
+    - name: validate-restapi-deployment-ready
+      try:
+        - assert:
+            resource:
+              apiVersion: apps/v1
+              kind: Deployment
+              metadata:
+                name: slinky-slurm-restapi
+                namespace: slurm
+              status:
+                (availableReplicas > `0`): true
+    - name: validate-nodeset-ready
+      try:
+        # NodeSet exposes its own status; operator tracks per-node counts.
+        - assert:
+            resource:
+              apiVersion: slinky.slurm.net/v1beta1
+              kind: NodeSet
+              metadata:
+                name: slinky-slurm-worker-slinky
+                namespace: slurm
+              status:
+                (availableReplicas > `0`): true
+    - name: validate-all-pods-healthy
+      try:
+        # Catch Pending / Failed / Unknown phases. Chainsaw `error` passes
+        # vacuously when no resource matches, so the preceding workload-
+        # readiness asserts (StatefulSet / Deployments / NodeSet) are
+        # load-bearing: they guarantee the pods exist before this step runs.
+        - error:
+            resource:
+              apiVersion: v1
+              kind: Pod
+              metadata:
+                namespace: slurm
+              status:
+                phase: Pending
+        - error:
+            resource:
+              apiVersion: v1
+              kind: Pod
+              metadata:
+                namespace: slurm
+              status:
+                phase: Failed
+        - error:
+            resource:
+              apiVersion: v1
+              kind: Pod
+              metadata:
+                namespace: slurm
+              status:
+                phase: Unknown
diff --git a/recipes/components/slinky-slurm/values.yaml b/recipes/components/slinky-slurm/values.yaml
new file mode 100644
index 000000000..43ce9896b
--- /dev/null
+++ b/recipes/components/slinky-slurm/values.yaml
@@ -0,0 +1,179 @@
+# Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Slinky Slurm cluster Helm values.
+#
+# Overrides the chart's default `slinky` map-keys for `nodesets` /
+# `loginsets`: those keys carry the full required sub-tree (image,
+# slurmd, logfile sidecar, partition); any other key would have to
+# redefine it all. The cross-check in pkg/recipe TestComponentRegistry_
+# SlinkySlurm_NodeSchedulingPaths enforces this alignment with the
+# registry's nodeScheduling paths.
+#
+# Accounting (slurmdbd) and DCGM job-mapping are disabled by default:
+# accounting needs an external MariaDB AICR does not bundle; DCGM needs
+# dcgm-exporter on workers. Both opt in via valuesFile / --set.
+
+# Cgroup isolation + NVIDIA GPU autodetect (no-op without GPUs).
+configFiles:
+  cgroup.conf: |
+    CgroupPlugin=autodetect
+    IgnoreSystemd=yes
+    EnableControllers=yes
+    ConstrainCores=yes
+    ConstrainRAMSpace=yes
+    ConstrainDevices=yes
+    ConstrainSwapSpace=yes
+    AllowedRAMSpace=95.0
+    AllowedSwapSpace=100.0
+  gres.conf: |
+    AutoDetect=nvidia
+
+# File-based local users (no LDAP/AD provider bundled).
+sssd:
+  conf: |
+    [sssd]
+    config_file_version = 2
+    services = nss,pam
+    domains = LOCAL
+
+    [nss]
+    filter_groups = root,slurm
+    filter_users = root,slurm
+
+    [pam]
+
+    [domain/LOCAL]
+    id_provider = files
+    auth_provider = files
+
+controller:
+  # Pin the logfile sidecar to an immutable Alpine tag (chart default is
+  # :latest). Re-verify on defaultVersion bumps in case the chart pins
+  # it upstream.
+  logfile:
+    image:
+      tag: "3.23.3"
+  persistence:
+    enabled: true
+    storageClassName: null
+    accessModes:
+      - ReadWriteOnce
+    resources:
+      requests:
+        storage: 4Gi
+  # All values quoted as strings: extraConfMap is map[string]string and
+  # bare YAML 1.1 scalars (e.g. `no`) would coerce to boolean false.
+  # Accounting-only directives belong in the accounting opt-in valuesFile.
+  extraConfMap:
+    # select/cons_tres is required for GPU GRES to allocate per-resource
+    # rather than whole-node (Slurm's select/linear default). All other
+    # site-policy directives (priority weights, fairshare, QoS) are
+    # deliberately omitted — sites should add them in a leaf valuesFile.
+    SelectType: "select/cons_tres"
+    ScronParameters: "enable"
+  # /metrics endpoint on; ServiceMonitor stays at chart default (off)
+  # so the chart doesn't render a prometheus-operator CR on clusters
+  # without it.
+  metrics:
+    enabled: true
+
+restapi:
+  replicas: 1
+  slurmrestd:
+    resources:
+      requests:
+        cpu: 100m
+        memory: 128Mi
+      limits:
+        cpu: 500m
+        memory: 512Mi
+  # Chart wraps Service overrides as {metadata, spec}; bare service.type
+  # is rejected by the RestApi CRD.
+  service:
+    spec:
+      type: ClusterIP
+
+# Accounting (slurmdbd) is wrapped in `if .Values.accounting.enabled`
+# in the chart, so storageConfig below is inert until enabled. Values
+# mirror the chart defaults (Service named `mariadb`, Secret named
+# `mariadb-password` with key `password`) so operators see the expected
+# external MariaDB shape; this is also the wiring AICR will produce
+# when a MariaDB component is bundled.
+accounting:
+  enabled: false
+  storageConfig:
+    host: mariadb
+    port: 3306
+    database: slurm_acct_db
+    username: slurm
+    passwordKeyRef:
+      name: mariadb-password
+      key: password
+
+# Secure default: the login pod is intentionally unreachable via SSH
+# (empty rootSshAuthorizedKeys + PasswordAuthentication off) until an
+# operator supplies real keys. Do not "fix" without wiring key delivery.
+loginsets:
+  slinky:
+    enabled: true
+    replicas: 1
+    # Pin the initconf sidecar to an immutable Alpine tag (chart default
+    # is :latest).
+    initconf:
+      image:
+        tag: "3.23.3"
+    rootSshAuthorizedKeys: |
+      # PLACEHOLDER -- override via valuesFile or --set before deploy
+      # ssh-ed25519 AAAA... user@example.com
+    extraSshdConfig: |
+      PasswordAuthentication no
+      PermitEmptyPasswords no
+      ChallengeResponseAuthentication no
+
+# Single-tenant policy: one Default=YES partition only — slurmctld
+# refuses to start if an override valuesFile adds a second one.
+# Multi-tenant deployers should disable this partition and define
+# their own (e.g. per-team Default=NO + AllowGroups).
+#
+# The chart does NOT inject CPUs=/RealMemory=/Gres= into slurmd's
+# --conf line from pod resource limits — it only plumbs POD_CPUS /
+# POD_MEMORY env vars and lets the image entrypoint act on them. For
+# GPU clusters, declare Gres/Features explicitly via
+# `nodesets.slinky.extraConfMap` and pair with the matching
+# nvidia.com/gpu limit on `nodesets.slinky.slurmd.resources.limits`.
+nodesets:
+  slinky:
+    enabled: true
+    scalingMode: StatefulSet
+    replicas: 1
+    # Pin the logfile sidecar to an immutable Alpine tag (chart default
+    # is :latest).
+    logfile:
+      image:
+        tag: "3.23.3"
+    partition:
+      enabled: true
+      configMap:
+        State: UP
+        Default: "YES"
+        # Cap default walltime at 24h. UNLIMITED is unsafe as a default:
+        # a single stuck job can hold GPUs indefinitely with no operator
+        # recourse short of scancel. Override per-leaf if needed.
+        MaxTime: "24:00:00"
+
+vendor:
+  nvidia:
+    dcgm:
+      enabled: false
diff --git a/recipes/mixins/platform-slurm-cluster.yaml b/recipes/mixins/platform-slurm-cluster.yaml
new file mode 100644
index 000000000..71ac2ad8b
--- /dev/null
+++ b/recipes/mixins/platform-slurm-cluster.yaml
@@ -0,0 +1,30 @@
+# Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Opt-in mixin: a Slinky-managed Slurm cluster instance (Controller /
+# LoginSet / NodeSet / RestApi) on top of the operator from
+# platform-slurm. Leaves wanting only the operator compose
+# platform-slurm alone; leaves wanting a runnable cluster compose both.
+kind: RecipeMixin
+apiVersion: aicr.nvidia.com/v1alpha1
+metadata:
+  name: platform-slurm-cluster
+spec:
+  componentRefs:
+    - name: slinky-slurm
+      type: Helm
+      valuesFile: components/slinky-slurm/values.yaml
+      dependencyRefs:
+        - slinky-slurm-operator
+        - slinky-slurm-operator-crds
diff --git a/recipes/overlays/h100-eks-ubuntu-training-slurm.yaml b/recipes/overlays/h100-eks-ubuntu-training-slurm.yaml
new file mode 100644
index 000000000..0824fc35d
--- /dev/null
+++ b/recipes/overlays/h100-eks-ubuntu-training-slurm.yaml
@@ -0,0 +1,50 @@
+# Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+kind: RecipeMetadata
+apiVersion: aicr.nvidia.com/v1alpha1
+metadata:
+  name: h100-eks-ubuntu-training-slurm
+
+spec:
+  # H100 + EKS + Ubuntu + training with the Slinky operator and a
+  # Slinky-managed Slurm cluster. EKS-specific cluster tuning (gp3
+  # storage, GPU GRES, DCGM job-mapping) is layered at install time
+  # via `aicr bundle ... --set slurmcluster:...` or a valuesFile.
+  base: h100-eks-ubuntu-training
+
+  criteria:
+    service: eks
+    accelerator: h100
+    os: ubuntu
+    intent: training
+    platform: slurm
+
+  mixins:
+    - os-ubuntu
+    - platform-slurm
+    - platform-slurm-cluster
+
+  constraints:
+    - name: K8s.server.version
+      value: ">= 1.32.4"
+
+  # Mixin-contributed components cannot be overridden from a leaf; use
+  # `--set slurmcluster:...` or a valuesFile at install time instead.
+  # Accelerated nodeSelector/tolerations on slurmd are injected via the
+  # registry's nodesets.slinky.podSpec.{nodeSelector,tolerations} paths.
+  componentRefs: []
+
+  # Validation is inherited from h100-eks-training (operator-health,
+  # expected-resources, gpu-operator-version, check-nvidia-smi).
diff --git a/recipes/overlays/h100-kind-training-slurm.yaml b/recipes/overlays/h100-kind-training-slurm.yaml
new file mode 100644
index 000000000..60368a99f
--- /dev/null
+++ b/recipes/overlays/h100-kind-training-slurm.yaml
@@ -0,0 +1,45 @@
+# Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+kind: RecipeMetadata
+apiVersion: aicr.nvidia.com/v1alpha1
+metadata:
+  name: h100-kind-training-slurm
+
+spec:
+  # H100 + Kind + training with the Slinky operator and a Slinky-managed
+  # Slurm cluster. Kind has no GPUs, so the NodeSet runs CPU-only (no
+  # GRES, no nvidia.com/gpu, no DCGM); useful for no-GPU CI end-to-end.
+  base: h100-kind-training
+
+  criteria:
+    service: kind
+    accelerator: h100
+    intent: training
+    platform: slurm
+
+  mixins:
+    - platform-slurm
+    - platform-slurm-cluster
+
+  # DRA (GA in K8s 1.34) — restated from the parent for clarity.
+  constraints:
+    - name: K8s.server.version
+      value: ">= 1.34"
+
+  # Mixin-contributed components cannot be overridden from a leaf; use
+  # `--set slurmcluster:...` or a valuesFile at install time instead.
+  componentRefs: []
+
+  # Validation is inherited from h100-kind-training.
diff --git a/recipes/registry.yaml b/recipes/registry.yaml
index bf9ee4c1a..51ecbe1cb 100644
--- a/recipes/registry.yaml
+++ b/recipes/registry.yaml
@@ -606,10 +606,8 @@ components:
 
   - name: slinky-slurm-operator
     displayName: slinky-slurm-operator
-    # The short alias `slurm` is reserved here for the operator. When the Slinky `slurm`
-    # cluster chart (oci://ghcr.io/slinkyproject/charts/slurm) is added
-    # to AICR, we will assign it a distinct short alias (e.g.
-    # `slurm-cluster`) so `slurm` continues to route to the operator.
+    # Short alias `slurm` routes to the operator; the cluster chart
+    # (slinky-slurm below) uses the distinct alias `slurmcluster`.
     valueOverrideKeys:
       - slinkyslurmoperator
       - slurmoperator
@@ -643,3 +641,36 @@ components:
         tolerationPaths:
           - operator.tolerations
           - webhook.tolerations
+
+  - name: slinky-slurm
+    displayName: slinky-slurm
+    # Cluster instance chart; wired in via platform-slurm-cluster mixin.
+    valueOverrideKeys:
+      - slinkyslurm
+      - slurmcluster
+    healthCheck:
+      assertFile: checks/slinky-slurm/health-check.yaml
+    helm:
+      defaultRepository: oci://ghcr.io/slinkyproject/charts
+      defaultChart: slurm
+      # When bumping defaultVersion, re-verify the `slinky` default
+      # map-keys for nodesets/loginsets still exist (the cross-check in
+      # pkg/recipe TestComponentRegistry_SlinkySlurm_NodeSchedulingPaths
+      # enforces this, but the chart change must come first).
+      defaultVersion: "1.1.0"
+      defaultNamespace: slurm
+    nodeScheduling:
+      system:
+        nodeSelectorPaths:
+          - controller.podSpec.nodeSelector
+          - restapi.podSpec.nodeSelector
+          - loginsets.slinky.podSpec.nodeSelector
+        tolerationPaths:
+          - controller.podSpec.tolerations
+          - restapi.podSpec.tolerations
+          - loginsets.slinky.podSpec.tolerations
+      accelerated:
+        nodeSelectorPaths:
+          - nodesets.slinky.podSpec.nodeSelector
+        tolerationPaths:
+          - nodesets.slinky.podSpec.tolerations