Merged
17 commits
f5f3bb4
feat(slinky-slurm): add cluster chart and EKS/Kind leaves
faganihajizada May 18, 2026
88418d1
ci(kwok): log hint when --platform is unsupported
faganihajizada May 18, 2026
a35dc2b
test(slinky-slurm-operator): restore vacuous-pass note in health check
faganihajizada May 18, 2026
46dd144
test(slinky-slurm): restore vacuous-pass note in cluster health check
faganihajizada May 18, 2026
cee8186
test(slinky-slurm): note release-name fragility in health check
faganihajizada May 18, 2026
b839e97
feat(slinky-slurm): cap default MaxTime, correct slurmd conf note
faganihajizada May 18, 2026
f66991d
ci(kwok): untaint kind control-plane to accept system-tier pods
faganihajizada May 18, 2026
38de273
feat(slinky-slurm): drop multifactor priority defaults
faganihajizada May 18, 2026
c3eed64
ci(kwok): disable cert-manager webhook instead of pinning to CP
faganihajizada May 18, 2026
ac94a23
Revert "ci(kwok): disable cert-manager webhook instead of pinning to CP"
faganihajizada May 18, 2026
eb78d96
Revert "ci(kwok): untaint kind control-plane to accept system-tier pods"
faganihajizada May 18, 2026
54260ca
ci(kwok): disable slurm-operator webhook + cert-manager for KWOK
faganihajizada May 18, 2026
0651bfd
ci(kwok): wait for slurm controller pod before verify_pods
faganihajizada May 18, 2026
6fe4ad0
ci(kwok): disable slurm controller persistence for KWOK
faganihajizada May 18, 2026
9c8779d
Merge branch 'main' into feat/slinky-slurm-cluster-leaves
faganihajizada May 18, 2026
5bebf52
Merge branch 'main' into feat/slinky-slurm-cluster-leaves
faganihajizada May 19, 2026
a1746c6
Merge branch 'main' into feat/slinky-slurm-cluster-leaves
faganihajizada May 19, 2026
2 changes: 2 additions & 0 deletions docs/integrator/recipe-development.md
@@ -115,6 +115,8 @@ spec:

Mixins use `kind: RecipeMixin` and carry only `constraints` and `componentRefs`. They live in `recipes/mixins/` and are applied after inheritance chain merging. See [Data Architecture](../contributor/data.md#mixin-composition) for details.

A platform may split into multiple mixins when parts of the stack are independently opt-in. For example, `--platform slurm` resolves through two mixins: `platform-slurm` always contributes the SchedMD Slinky operator and CRDs, while `platform-slurm-cluster` opts in to the Slinky-managed Slurm cluster instance (Controller / LoginSet / NodeSet / RestApi). A leaf that wants only the operator composes just `platform-slurm`; a leaf that also wants the cluster composes both, as in `recipes/overlays/h100-eks-ubuntu-training-slurm.yaml`.

When authoring a recipe targeting Talos (`criteria.os: talos`), append the `os-talos` mixin to your overlay's `spec.mixins` list (e.g. `spec.mixins: [os-talos]`, or `[platform-kubeflow, os-talos]` if you already mix in a non-OS fragment). OS-scoped mixins are mutually exclusive — combining `os-ubuntu` and `os-talos` in one overlay is a recipe authoring error, not a supported composition. The mixin overrides namespaces for affected components and supplies PSA-privileged Namespace manifests via `componentRefs[].preManifestFiles`, which are applied before each chart — see [Talos integration](talos-integration.md) for the component list and labels.

**Cross-cutting overlays with wildcard criteria** apply across one criteria dimension without being referenced via `spec.base` or listed in `spec.mixins`. The resolver can return multiple independent maximal-leaf overlays for a single query, so a `service: any` overlay is picked up alongside the service-specific maximal leaf and its inheritance chain:
5 changes: 3 additions & 2 deletions docs/user/component-catalog.md
@@ -34,7 +34,8 @@ The source of truth is [`recipes/registry.yaml`](https://github.com/NVIDIA/aicr/
| **kueue** | Kubernetes-native job queuing system. Manages quotas and admits jobs for batch and AI workloads. | [Kueue](https://github.com/kubernetes-sigs/kueue) |
| **kubeflow-trainer** | Kubeflow Training Operator for distributed training jobs (PyTorch, etc.). Manages multi-node training job lifecycle with JobSet integration. | [Kubeflow Trainer](https://github.com/kubeflow/trainer) |
| **slinky-slurm-operator-crds** | Custom Resource Definitions for the SchedMD Slinky Slurm operator. Installs the `slinky.slurm.net` CRDs (Controller, NodeSet, LoginSet, Accounting, RestApi, Token). Installed separately to support CRD lifecycle management. | [Slinky Slurm Operator](https://github.com/SlinkyProject/slurm-operator) |
| **slinky-slurm-operator** | SchedMD Slinky Slurm operator and admission webhook. Manages the lifecycle of Slurm clusters declared via Slinky CRs. Cluster-instance CRs (Controller, NodeSet, LoginSet, ...) are user-authored — AICR ships only the operator, mirroring how dynamo-platform and kubeflow-trainer ship operator-only. | [Slinky Slurm Operator](https://github.com/SlinkyProject/slurm-operator) |
| **slinky-slurm-operator** | SchedMD Slinky Slurm operator and admission webhook. Manages the lifecycle of Slurm clusters declared via Slinky CRs (Controller, NodeSet, LoginSet, Accounting, RestApi, Token). | [Slinky Slurm Operator](https://github.com/SlinkyProject/slurm-operator) |
| **slinky-slurm** | Slinky-managed Slurm cluster instance: Controller (slurmctld) + LoginSet (sackd/sshd) + NodeSet (slurmd) + RestApi (slurmrestd). Reconciled by `slinky-slurm-operator`. Opt-in via the `platform-slurm-cluster` mixin (alongside `platform-slurm` for the operator). Accounting (slurmdbd) requires an external MariaDB and is disabled in defaults — see `recipes/components/slinky-slurm/values.yaml`. | [Slinky Slurm Cluster Chart](https://github.com/SlinkyProject/slurm-operator/tree/main/helm/slurm) |

## How Components Are Selected

@@ -43,7 +44,7 @@ Not every component appears in every recipe. The recipe engine selects component
- **Base components** (cert-manager, kube-prometheus-stack) appear in most recipes.
- **Cloud-specific components** (aws-efa, aws-ebs-csi-driver) are added when the service matches.
- **Intent-specific components** (agentgateway, agentgateway-crds) are added based on workload intent (e.g., inference recipes include the inference gateway).
- **Platform-specific components** (slinky-slurm-operator, kubeflow-trainer, dynamo-platform) are added when the recipe selects a matching `--platform`.
- **Platform-specific components** (slinky-slurm-operator, slinky-slurm, kubeflow-trainer, dynamo-platform) are added when the recipe selects a matching `--platform`. For `--platform slurm`, the cluster (`slinky-slurm`) is opt-in via the `platform-slurm-cluster` mixin alongside the always-applied operator (`platform-slurm`); leaves that want operator-only compose just `platform-slurm`.
- **Accelerator/OS-specific tuning** (nodewright-customizations, nvidia-dra-driver-gpu) varies by hardware and OS combination.

### NFD Topology Updater
13 changes: 11 additions & 2 deletions docs/user/container-images.md
@@ -19,8 +19,8 @@ A machine-readable **CycloneDX 1.6 JSON** companion to this page is produced by
<!-- BEGIN AICR-BOM -->
## Summary

- Components: **25**
- Unique images: **71**
- Components: **26**
- Unique images: **76**
- Distinct registries: **11**

Registries: `602401143452.dkr.ecr.us-west-2.amazonaws.com`, `cr.agentgateway.dev`, `docker.io`, `gcr.io`, `ghcr.io`, `gke.gcr.io`, `nvcr.io`, `public.ecr.aws`, `quay.io`, `registry.k8s.io`, `us-docker.pkg.dev`
@@ -52,6 +52,7 @@ Registries: `602401143452.dkr.ecr.us-west-2.amazonaws.com`, `cr.agentgateway.dev
| nvsentinel | helm | nvsentinel | v1.3.0 | 6 |
| prometheus-adapter | helm | prometheus-community/prometheus-adapter | 5.3.0 | 1 |
| prometheus-operator-crds | helm | prometheus-community/prometheus-operator-crds | 28.0.1 | 0 |
| slinky-slurm | helm | slurm | 1.1.0 | 5 |
| slinky-slurm-operator | helm | slurm-operator | 1.1.0 | 2 |
| slinky-slurm-operator-crds | helm | slurm-operator-crds | 1.1.0 | 0 |

@@ -197,6 +198,14 @@ _No images extracted._

_No images extracted._

### slinky-slurm

- `docker.io/library/alpine:3.23.3`
- `ghcr.io/slinkyproject/login:25.11-ubuntu24.04`
- `ghcr.io/slinkyproject/slurmctld:25.11-ubuntu24.04`
- `ghcr.io/slinkyproject/slurmd:25.11-ubuntu24.04`
- `ghcr.io/slinkyproject/slurmrestd:25.11-ubuntu24.04`

### slinky-slurm-operator

- `ghcr.io/slinkyproject/slurm-operator-webhook:1.1.0`
35 changes: 31 additions & 4 deletions kwok/scripts/validate-scheduling.sh
@@ -311,21 +311,32 @@ generate_bundle() {
exit 1
fi

# Extract criteria from overlay
local service accelerator intent os
# Without --platform, *-slurm overlays resolve to their non-platform
# parent and the bundle omits the slinky-slurm operator/cluster.
# Scoped to slurm: kubeflow/dynamo are not yet validated under KWOK.
local service accelerator intent os platform
service=$(yq eval '.spec.criteria.service // ""' "$recipe_overlay")
accelerator=$(yq eval '.spec.criteria.accelerator // ""' "$recipe_overlay")
intent=$(yq eval '.spec.criteria.intent // ""' "$recipe_overlay")
os=$(yq eval '.spec.criteria.os // ""' "$recipe_overlay")
platform=$(yq eval '.spec.criteria.platform // ""' "$recipe_overlay")

log_info "Criteria: service=$service accelerator=$accelerator intent=$intent os=$os"
log_info "Criteria: service=$service accelerator=$accelerator intent=$intent os=$os platform=$platform"

# Build recipe command with available criteria
local recipe_args=()
[[ -n "$service" ]] && recipe_args+=(--service "$service")
[[ -n "$accelerator" ]] && recipe_args+=(--accelerator "$accelerator")
[[ -n "$intent" ]] && recipe_args+=(--intent "$intent")
[[ -n "$os" ]] && recipe_args+=(--os "$os")
# Only forward --platform for platforms validated under KWOK. Other
# platforms (kubeflow, dynamo, nim) historically resolve to their
# non-platform parent here; preserve that behavior to avoid regressing
# existing matrix lanes. Extend as additional platforms are validated.
if [[ "$platform" == "slurm" ]]; then
recipe_args+=(--platform "$platform")
elif [[ -n "$platform" ]]; then
log_info "platform=$platform not yet validated under KWOK — resolving without --platform"
fi

# Generate resolved recipe from criteria
log_info "Generating resolved recipe..."
@@ -348,6 +359,19 @@ generate_bundle() {
# Disable features not needed for scheduling validation:
# - PrometheusRules and AlertManager (slow to create)
# - Nodewright customization (creates CRs that depend on operator CRDs)
# - slinky-slurm-operator webhook + cert-manager wiring: the operator's
# webhook validates Slurm CRs through a Service whose pod runs on a
# KWOK fake (Ready without container). Both certManager.enabled and
# webhook.enabled gate the cert-manager.io/Certificate submission
# plus the ValidatingWebhookConfiguration. Disabling them skips
# admission entirely; harmless under KWOK since no real Slurm CRs
# are reconciled.
# - slinky-slurm controller persistence: the chart provisions a PVC
# via the cluster's default StorageClass. Kind's local-path provisioner
# binds with WaitForFirstConsumer, so the PVC is pinned to whichever
# node the pod schedules on — and KWOK fakes can't actually back a
# local-path volume, leaving the pod stuck Pending with NominatedNodeName
# set. Disabling persistence lets the controller pod bind.
log_info "Generating bundle..."

local bundle_output
@@ -360,6 +384,9 @@ generate_bundle() {
--accelerated-node-toleration "nvidia.com/gpu=present:NoSchedule" \
--accelerated-node-toleration "kwok.x-k8s.io/node=fake:NoSchedule" \
--set "certmanager:startupapicheck.enabled=false" \
--set "slinkyslurmoperator:webhook.enabled=false" \
--set "slinkyslurmoperator:certManager.enabled=false" \
--set "slurmcluster:controller.persistence.enabled=false" \
--set "kubeprometheusstack:defaultRules.create=false" \
--set "kubeprometheusstack:alertmanager.enabled=false" \
--set "nodewright-customizations:enabled=false" \
73 changes: 73 additions & 0 deletions pkg/recipe/components_test.go
@@ -15,6 +15,7 @@
package recipe

import (
"maps"
"slices"
"strings"
"testing"
@@ -210,6 +211,78 @@ func TestComponentRegistry_NodeSchedulingPaths(t *testing.T) {
}
}

// Pins the `slinky` map-key choice for slinky-slurm on both sides:
// the registry's nodeScheduling paths AND components/slinky-slurm/
// values.yaml must reference the same key, or injected tolerations
// land on a non-existent map entry.
func TestComponentRegistry_SlinkySlurm_NodeSchedulingPaths(t *testing.T) {
registry, err := GetComponentRegistry()
if err != nil {
t.Fatalf("failed to load component registry: %v", err)
}

slurmCluster := registry.Get("slinky-slurm")
if slurmCluster == nil {
t.Fatal("slinky-slurm not found in registry")
}

wantSysToleration := []string{
"controller.podSpec.tolerations",
"restapi.podSpec.tolerations",
"loginsets.slinky.podSpec.tolerations",
}
gotSysToleration := slurmCluster.GetSystemTolerationPaths()
for _, p := range wantSysToleration {
if !slices.Contains(gotSysToleration, p) {
t.Errorf("slinky-slurm system toleration paths missing %q (got %v)", p, gotSysToleration)
}
}

wantSysSelector := []string{
"controller.podSpec.nodeSelector",
"restapi.podSpec.nodeSelector",
"loginsets.slinky.podSpec.nodeSelector",
}
gotSysSelector := slurmCluster.GetSystemNodeSelectorPaths()
for _, p := range wantSysSelector {
if !slices.Contains(gotSysSelector, p) {
t.Errorf("slinky-slurm system node selector paths missing %q (got %v)", p, gotSysSelector)
}
}

gotAccelSelector := slurmCluster.GetAcceleratedNodeSelectorPaths()
if !slices.Contains(gotAccelSelector, "nodesets.slinky.podSpec.nodeSelector") {
t.Errorf("slinky-slurm accelerated node selector paths missing %q (got %v)",
"nodesets.slinky.podSpec.nodeSelector", gotAccelSelector)
}
gotAccelToleration := slurmCluster.GetAcceleratedTolerationPaths()
if !slices.Contains(gotAccelToleration, "nodesets.slinky.podSpec.tolerations") {
t.Errorf("slinky-slurm accelerated toleration paths missing %q (got %v)",
"nodesets.slinky.podSpec.tolerations", gotAccelToleration)
}

const valuesPath = "components/slinky-slurm/values.yaml"
content, err := GetEmbeddedFS().ReadFile(valuesPath)
if err != nil {
t.Fatalf("failed to read %s: %v", valuesPath, err)
}
var values struct {
Nodesets map[string]any `yaml:"nodesets"`
Loginsets map[string]any `yaml:"loginsets"`
}
if err := yaml.Unmarshal(content, &values); err != nil {
t.Fatalf("failed to parse %s: %v", valuesPath, err)
}
if _, ok := values.Nodesets["slinky"]; !ok {
t.Errorf("%s must define nodesets.slinky to match the registry's "+
"nodeScheduling paths (got nodesets keys: %v)", valuesPath, slices.Sorted(maps.Keys(values.Nodesets)))
}
if _, ok := values.Loginsets["slinky"]; !ok {
t.Errorf("%s must define loginsets.slinky to match the registry's "+
"nodeScheduling paths (got loginsets keys: %v)", valuesPath, slices.Sorted(maps.Keys(values.Loginsets)))
}
}

func TestComponentRegistry_TaintStrPaths(t *testing.T) {
registry, err := GetComponentRegistry()
if err != nil {
44 changes: 44 additions & 0 deletions pkg/recipe/deployment_order_guard_test.go
@@ -175,6 +175,50 @@ func TestDeploymentOrderGuards(t *testing.T) {
{"gpu-operator", "nvsentinel"},
},
},
{
name: "h100-eks-ubuntu-training-slurm",
criteria: func() *Criteria {
c := NewCriteria()
c.Service = CriteriaServiceEKS
c.Accelerator = CriteriaAcceleratorH100
c.OS = CriteriaOSUbuntu
c.Intent = CriteriaIntentTraining
c.Platform = CriteriaPlatformSlurm
return c
},
requiredDeps: map[string][]string{
"slinky-slurm-operator": {"cert-manager", "slinky-slurm-operator-crds"},
"slinky-slurm": {"slinky-slurm-operator", "slinky-slurm-operator-crds"},
},
requiredOrdering: [][2]string{
{"cert-manager", "slinky-slurm-operator"},
{"slinky-slurm-operator-crds", "slinky-slurm-operator"},
{"slinky-slurm-operator", "slinky-slurm"},
{"slinky-slurm-operator-crds", "slinky-slurm"},
{"gpu-operator", "nvsentinel"},
},
},
{
name: "h100-kind-training-slurm",
criteria: func() *Criteria {
c := NewCriteria()
c.Service = CriteriaServiceKind
c.Accelerator = CriteriaAcceleratorH100
c.Intent = CriteriaIntentTraining
c.Platform = CriteriaPlatformSlurm
return c
},
requiredDeps: map[string][]string{
"slinky-slurm-operator": {"cert-manager", "slinky-slurm-operator-crds"},
"slinky-slurm": {"slinky-slurm-operator", "slinky-slurm-operator-crds"},
},
requiredOrdering: [][2]string{
{"cert-manager", "slinky-slurm-operator"},
{"slinky-slurm-operator-crds", "slinky-slurm-operator"},
{"slinky-slurm-operator", "slinky-slurm"},
{"slinky-slurm-operator-crds", "slinky-slurm"},
},
},
}
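
Each `requiredOrdering` pair reads as (must come first, must come later). Below is a minimal sketch of how such pairs could be checked against a resolved deployment order, assuming the order is available as a flat slice of component names. `violations` is a hypothetical helper, not the guard test's actual implementation, and the order in `main` is made up.

```go
package main

import "fmt"

// violations reports every (before, after) pair that the given order does
// not satisfy, including pairs where either component is missing.
func violations(order []string, pairs [][2]string) [][2]string {
	pos := make(map[string]int, len(order))
	for i, name := range order {
		pos[name] = i
	}
	var bad [][2]string
	for _, p := range pairs {
		before, okB := pos[p[0]]
		after, okA := pos[p[1]]
		if !okB || !okA || before >= after {
			bad = append(bad, p)
		}
	}
	return bad
}

func main() {
	// Hypothetical resolved order for a Slurm leaf.
	order := []string{
		"cert-manager",
		"slinky-slurm-operator-crds",
		"slinky-slurm-operator",
		"slinky-slurm",
	}
	pairs := [][2]string{
		{"cert-manager", "slinky-slurm-operator"},
		{"slinky-slurm-operator-crds", "slinky-slurm-operator"},
		{"slinky-slurm-operator", "slinky-slurm"},
		{"slinky-slurm-operator-crds", "slinky-slurm"},
	}
	fmt.Println("violations:", violations(order, pairs))
}
```

Treating a missing component as a violation is a design choice in this sketch: it keeps a pair from passing silently when a component drops out of the resolved recipe.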

for _, tt := range tests {
23 changes: 6 additions & 17 deletions recipes/checks/slinky-slurm-operator/health-check.yaml
@@ -1,4 +1,4 @@
# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
# Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -28,9 +28,6 @@ spec:
steps:
- name: validate-operator-deployment
try:
# Guard against vacuous pass on empty namespace: verify the
# slurm-operator deployment exists and has at least one ready
# replica.
- assert:
resource:
apiVersion: apps/v1
@@ -42,8 +39,7 @@ spec:
(availableReplicas > `0`): true
- name: validate-webhook-deployment
try:
# The webhook must be ready before any Slinky CRs (Controller,
# NodeSet, etc.) can be created, so assert it independently.
# Webhook must be ready before any Slinky CR can be created.
- assert:
resource:
apiVersion: apps/v1
@@ -55,17 +51,10 @@ spec:
(availableReplicas > `0`): true
- name: validate-all-pods-healthy
try:
# Assert no pods are in unhealthy phases.
# Pods must be Running (long-lived) or Succeeded (completed jobs).
# This catches Pending (init containers, scheduling), Failed, and
# Unknown.
#
# chainsaw `error` assertions pass when no matching resource exists,
# which would let this step trivially pass on an empty namespace.
# The two preceding deployment-availability steps prevent that:
# they require both deployments to have at least one ready replica
# in `slinky`, which guarantees pods exist and are inspectable
# before this step runs.
# Catch Pending / Failed / Unknown phases. Chainsaw `error` passes
# vacuously when no resource matches, so the preceding deployment-
# availability asserts are load-bearing: they guarantee the pods
# exist before this step runs.
- error:
resource:
apiVersion: v1