feat(slinky-slurm): add cluster chart and EKS/Kind leaves by faganihajizada · Pull Request #948 · NVIDIA/aicr

faganihajizada · 2026-05-18T10:09:52Z

Summary

Wires the Slinky slurm cluster chart into AICR as an opt-in component (platform-slurm-cluster mixin) and ships the first two leaf recipes that consume both the Slinky operator and cluster:

h100-eks-ubuntu-training-slurm
h100-kind-training-slurm

Changes

Component (recipes/components/slinky-slurm/values.yaml):

Overrides the chart's native slinky keys for nodesets / loginsets so user values merge with chart defaults (avoids the nodeset.logfile typed-merge failure observed with custom sub-keys).
Production-leaning defaults: priority/multifactor scheduler tuning via extraConfMap (strings explicitly quoted to dodge YAML 1.1 coercion), cgroup + gres config files, ClusterIP restapi.
Accounting (slurmdbd) and DCGM job-mapping are disabled by default — accounting needs an external MariaDB AICR does not bundle; DCGM needs dcgm-exporter on workers. Both opt in via valuesFile / --set. accounting.storageConfig is kept as an inert documented example (chart wraps the entire Accounting CR in if accounting.enabled).

Mixin (recipes/mixins/platform-slurm-cluster.yaml): opt-in slinky-slurm cluster with deps on slinky-slurm-operator + slinky-slurm-operator-crds.

Health check (recipes/checks/slinky-slurm/health-check.yaml): asserts Slinky CRs (Controller, LoginSet, NodeSet, RestApi at slinky.slurm.net/v1beta1 under the slinky-slurm release name) AND underlying workload readiness (Deployment / StatefulSet availableReplicas > 0) before the generic pod-health step. Standardized on availableReplicas across all sub-checks.

Registry: adds slinky-slurm with nodeScheduling:

system tier: nodeSelectorPaths + tolerationPaths for controller.podSpec.*, restapi.podSpec.*, loginsets.slinky.podSpec.*
accelerated tier: nodeSelectorPaths + tolerationPaths for nodesets.slinky.podSpec.*

Both --system-node-* and --accelerated-node-* CLI flags now steer the correct pods.

Leaves:

Both leaves intentionally have no validation: block — inherit the full set from their *-training parents (mirrors *-kubeflow leaves).

KWOK (kwok/scripts/validate-scheduling.sh): extracts the platform criterion so *-slurm overlays resolve to the slinky bundle they ship. Flag is intentionally scoped to slurm only — a full matrix run revealed the harness's sequential CRD cleanup is broken for NFD / KAI / run.ai (orphans block pre-flight on the second recipe onward), so widening to kubeflow/dynamo is deferred to a follow-up.

Tests:

pkg/recipe/deployment_order_guard_test.go: 2 new cases asserting cert-manager → slinky-slurm-operator-crds → slinky-slurm-operator → slinky-slurm.
pkg/recipe/components_test.go: new targeted test verifies registry node-scheduling paths line up with the actual slinky map-keys in components/slinky-slurm/values.yaml (prevents silent drift on key renames).

Docs: docs/integrator/recipe-development.md documents the two-mixin opt-in pattern; docs/user/container-images.md regenerated via make bom-docs.

Verified against upstream chart v1.1.0

All component values cross-checked against SlinkyProject/slurm-operator helm/slurm v1.1.0 and main. Notably:

accounting.storageConfig.passwordKeyRef matches chart defaults (name: mariadb-password, key: password).
All extraConfMap values quoted as strings — chart serializes the map as map[string]string and bare YAML 1.1 no would coerce to bool.

Test plan

make test — full unit suite (includes deployment-order guard and components_test).
make lint — go/yaml/license/agents-sync/docs-sidebar.
make qualify — end-to-end pre-PR pass.
aicr recipe + aicr bundle smoke for both leaves.
make check-health against live Kind cluster for both slinky-slurm-operator and slinky-slurm.
aicr validate against deployed Kind cluster.
KWOK scheduling validation for h100-kind-training-slurm and h100-eks-ubuntu-training-slurm.
CI: kwok-recipes, chainsaw, lint, build.

Notes for reviewers

The slurm-only KWOK guard is deliberate — widening to other platforms is blocked on harness CRD-cleanup fixes, not on the resolver behavior itself.
The slinky map-key alignment between registry and values is enforced by the new components_test to catch future renames at unit-test time rather than via opaque scheduling failures.

coderabbitai · 2026-05-18T10:15:30Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

Introduces an opt-in slinky-slurm cluster component and a platform-slurm-cluster mixin that composes it atop the always-applied platform-slurm operator. Adds Helm values for the slinky-slurm chart, registry entries, documentation updates (platform->mixin resolution), two recipe overlays (EKS and Kind H100 training with platform: slurm), a Chainsaw/Kyverno health check for the cluster, unit and ordering tests, and a conditional --platform slurm forward in validate-scheduling.sh.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

NVIDIA/aicr#878: Overlaps with container-images/BOM adjustments related to Slurm component image listings.
NVIDIA/aicr#866: Related to adding slurm platform plumbing that the validate-scheduling.sh change forwards to aicr recipe.

Suggested labels

enhancement, area/tests, size/L

Suggested reviewers

mchmarny
lockwobr
njhensley

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The PR title clearly and specifically summarizes the main changes: adding a Slinky Slurm cluster chart as an opt-in component and introducing two new leaf recipes for EKS and Kind.
Description check	✅ Passed	The PR description is comprehensive and directly related to the changeset, detailing all major changes including component values, mixin, health checks, registry updates, leaves, KWOK modifications, tests, and documentation updates.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/user/container-images.md`:
- Line 198: Replace the mutable image reference
`docker.io/library/alpine:latest` with a pinned tag or digest in the
slinky-slurm values (e.g., set the image to something like
`docker.io/library/alpine:3.18.4` or a sha256 digest) and then regenerate the
BOM/docs; locate the `docker.io/library/alpine:latest` string in the values
source for slinky-slurm, update it to the chosen immutable tag/digest, and run
the doc/BOM generation step so the docs reflect the pinned image.

In `@recipes/components/slinky-slurm/values.yaml`:
- Around line 59-60: The comment points out a mismatch between the documented
Secret name and the value used by accounting.storageConfig.passwordKeyRef;
update either the comment or the passwordKeyRef so both reference the same
Secret/key (e.g., make passwordKeyRef.secretName = "mariadb-password" and key =
"password" or change the comment to require Secret named "mariadb"), and apply
the same fix for the duplicate occurrence referenced around the other block
(accounting.storageConfig.passwordKeyRef at the later section) so the guidance
and defaults align and the accounting opt-in will not break.

In `@recipes/registry.yaml`:
- Around line 601-610: Add node selector path mappings for the system tier so
the --system-node-selector override is applied: in the same mapping that defines
system.tolerationPaths, add a system.nodeSelectorPaths array with entries
mirroring the toleration ones (e.g. controller.podSpec.nodeSelector,
restapi.podSpec.nodeSelector, loginsets.slinky.podSpec.nodeSelector) so
controller/restapi/loginset pods are steered by the system-node-selector
override.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: e72860d5-3271-4166-a452-b585d2891df9

📥 Commits

Reviewing files that changed from the base of the PR and between 66f8e7b and cda91eb.

📒 Files selected for processing (13)

docs/integrator/recipe-development.md
docs/user/component-catalog.md
docs/user/container-images.md
kwok/scripts/validate-scheduling.sh
pkg/recipe/components_test.go
pkg/recipe/deployment_order_guard_test.go
recipes/checks/slinky-slurm-operator/health-check.yaml
recipes/checks/slinky-slurm/health-check.yaml
recipes/components/slinky-slurm/values.yaml
recipes/mixins/platform-slurm-cluster.yaml
recipes/overlays/h100-eks-ubuntu-training-slurm.yaml
recipes/overlays/h100-kind-training-slurm.yaml
recipes/registry.yaml

github-actions · 2026-05-18T11:19:57Z

🌿 Preview your docs: https://nvidia-preview-feat-slinky-slurm-cluster-leaves.docs.buildwithfern.com/aicr

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@kwok/scripts/validate-scheduling.sh`:
- Around line 331-334: The script currently ignores non-empty platform values
that aren't "slurm", causing silent behavior; update the logic around the
platform variable and recipe_args so the script fails fast: if platform is
non-empty and not equal to "slurm" then emit a clear error message (including
the invalid $platform value) to stderr and exit non-zero, otherwise when
platform == "slurm" append --platform "$platform" to recipe_args as before
(modify the existing if block for platform or add an explicit else branch that
calls echo >&2 and exit 1). Ensure you reference the same variable name
"platform" and the array "recipe_args" so the change is localized.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: d0dc40d1-8f40-4107-b26c-7dfa73f3f5e6

📥 Commits

Reviewing files that changed from the base of the PR and between a3c854f and 27cadab.

📒 Files selected for processing (13)

docs/integrator/recipe-development.md
docs/user/component-catalog.md
docs/user/container-images.md
kwok/scripts/validate-scheduling.sh
pkg/recipe/components_test.go
pkg/recipe/deployment_order_guard_test.go
recipes/checks/slinky-slurm-operator/health-check.yaml
recipes/checks/slinky-slurm/health-check.yaml
recipes/components/slinky-slurm/values.yaml
recipes/mixins/platform-slurm-cluster.yaml
recipes/overlays/h100-eks-ubuntu-training-slurm.yaml
recipes/overlays/h100-kind-training-slurm.yaml
recipes/registry.yaml

mchmarny

Solid wiring overall — the mixin split (platform-slurm always-on operator vs opt-in platform-slurm-cluster) is the right shape, and the new TestComponentRegistry_SlinkySlurm_NodeSchedulingPaths cross-check between registry paths and values.yaml map-keys is exactly the kind of guard this codebase needs more of. Deployment-order test additions also look correct.

One blocker and a few non-blocking nits:

CI (blocker): both Tier 2: h100-kind-training-slurm and Tier 2: h100-eks-ubuntu-training-slurm KWOK lanes were cancelled at ~15min (every other lane finished in 5–7min), which is what failed KWOK Test Summary. Since those are the very recipes this PR adds, please confirm whether it's a runner timeout vs. OCI pull stall vs. resolver/bundle slowdown, and get the lanes green before merge.
Defaults posture (non-blocking): PriorityFlags: NO_NORMAL_ALL + raw priority weights + MaxTime: UNLIMITED are real site-policy choices baked into the shipped default. Reasonable for a research cluster; worth at least re-labeling these as "chosen policy" rather than generic "production-leaning defaults", or moving the multifactor / unlimited bits behind an opt-in valuesFile.
Health-check release-name pinning (non-blocking): the slinky-slurm/health-check.yaml resource names assume the localformat release name — fine for make check-health, but flux/argocd deployers will silently break this. The file flags it; just don't let "needs a parameterized variant" stay a comment forever.
KWOK platform allowlist (non-blocking): the platform == "slurm" allowlist will bit-rot the next time someone adds a leaf with a non-allowlisted criteria.platform — consider a log line so the next author gets a hint.
Comment trim regression (nit): the slinky-slurm-operator/health-check.yaml rewrite drops the non-obvious "chainsaw error passes vacuously on empty namespace" rationale.

Also, the branch is behind main — please rebase before re-pushing.

Wire the slinky `slurm` cluster chart into AICR as an opt-in component (`platform-slurm-cluster` mixin) and ship the first two leaf recipes that consume both the Slinky operator and cluster: `h100-eks-ubuntu- training-slurm` and `h100-kind-training-slurm`. Component values mirror the chart's native `slinky` keys for nodesets and loginsets so user overrides merge cleanly with chart defaults (avoids the `nodeset.logfile` typed-merge failure observed when using custom sub-keys). Production-leaning defaults: priority/multifactor scheduler tuning via `extraConfMap`, cgroup/gres config files, ClusterIP restapi. Accounting (slurmdbd) and DCGM job-mapping are disabled by default — accounting needs an external MariaDB AICR does not bundle, DCGM needs dcgm-exporter on workers; both opt in via valuesFile / --set. The accounting.storageConfig block is kept as an inert example because the chart wraps the entire Accounting CR in `if accounting.enabled`. Health check asserts both the Slinky CRs (Controller, LoginSet, NodeSet, RestApi at apiVersion `v1beta1` under the `slinky-slurm` release name) and the underlying workload readiness (Deployment/StatefulSet `availableReplicas > 0`) before the generic pod-health step. KWOK validate-scheduling.sh now extracts the `platform` criterion and passes `--platform slurm` so *-slurm overlays resolve to the slinky bundle they ship. The flag is intentionally scoped to slurm: a full matrix run showed the harness's sequential CRD cleanup is broken for NFD/KAI/run.ai (orphans block pre-flight on the second recipe onward), so widening to kubeflow/dynamo is deferred to a follow-up. Registry adds nodeScheduling paths for `nodesets.slinky.podSpec.*`, `loginsets.slinky.podSpec.*`, `controller.podSpec.*`, and `restapi.podSpec.*` so both `--system-node-*` and `--accelerated-node-*` flags steer the correct pods. Deployment-order guard gains the two new leaves; a targeted components test verifies the registry paths line up with the actual map-keys in components/slinky-slurm/values.yaml. EKS leaf intentionally has no `validation:` block so it inherits the full set from `h100-eks-training` (via `h100-eks-ubuntu-training`); the Kind leaf follows the same pattern. docs/user/container-images.md regenerated via `make bom-docs`.

Mark flagged that the slurm-only allowlist will silently bit-rot when a future *-kubeflow or *-dynamo leaf with criteria.platform lands here. Add a log_info so the next person sees the hint instead of mysterious bundle diffs. Cannot fail-closed yet: existing kubeflow/dynamo Tier 2 lanes carry criteria.platform today and have historically resolved to their non-platform parent under KWOK. Widening the allowlist is blocked on the harness CRD-cleanup bug; tracked as a follow-up.

Mark's earlier review explained that chainsaw `error` asserts pass vacuously when no resource matches — so the preceding deployment availability checks are load-bearing, not redundant. Previous trim of that comment lost the non-obvious invariant; restore it.

Mirror the wording fix applied to slinky-slurm-operator/health-check.yaml: spell out that chainsaw `error` passes vacuously on empty namespaces, so the preceding StatefulSet/Deployment/NodeSet availability asserts are load-bearing.

Mark called out that hardcoding `slinky-slurm-*` in the asserts silently breaks for deployers that override the Helm release name (flux/argocd path-based naming, multi-tenant installs). Add a TODO pointing at the parameterization gap so the next reader doesn't mistake this for a deliberate single-release contract.

Three related cluster-default fixes from review: 1. Single Default=YES partition is a hard slurmctld constraint, not style — call that out so multi-tenant deployers know they must disable this partition before adding their own. 2. MaxTime=UNLIMITED is unsafe as a default; a stuck job pins GPUs indefinitely. Cap at 24h. Leaves with longer-running workloads override per-overlay. 3. The prior "known upstream issue" framing for slurmd auto-registering pod resource limits was wrong. Verified against v1.1 chart sources: --conf only carries Features=<name> + user extraConf, while pod cpu/memory are plumbed as POD_CPUS/POD_MEMORY env vars for the image entrypoint. Reword the note to describe what the chart actually does and why Gres/Features must be set explicitly.

PR-948 CI: every Tier 1/2 lane went red after the cert-manager harness fix. Root cause: the previous commit pinned aicr.nvidia.com/node-type= system exclusively to the real Kind control-plane (KWOK fakes now carry =kwok-system). That correctly routes cert-manager — which tolerates the control-plane taint — but leaves untolerated charts (kai-scheduler, nvsentinel, prometheus-adapter) Pending because the CP still has node-role.kubernetes.io/control-plane:NoSchedule. Remove that taint. Production clusters either run dedicated, untainted system nodes or schedule these charts on workers; the harness should model the former.

Mark called out that PriorityType=priority/multifactor + the raw weight values + PriorityFlags=NO_NORMAL_ALL bake a specific site-policy into the AICR default. Most upstream Slurm clusters either run normalized (default) or omit PriorityFlags and let admins tune. These weights were inherited from an AWS reference config and have not been validated by any AICR-shipped leaf. Drop them — sites that want multifactor add it in their leaf valuesFile. Keep SelectType=select/cons_tres (required for GPU GRES) and ScronParameters=enable. Drop EnforcePartLimits=no (matches Slurm's own default).

The previous two harness fixes (label KWOK fakes kwok-system, pin real CP, untaint CP) cascaded: pinning system-tier workloads to a single real node caused Insufficient CPU pending across most lanes. Use the simpler fix the harness already employs for similar cases — disable the side-effect in the bundle. cert-manager's validating webhook is the only thing slinky-slurm-operator's install touched that needed a reachable endpoint; setting webhook.enabled=false skips admission entirely. KWOK doesn't execute workloads, so we lose nothing for scheduling validation, and every other chart returns to landing on KWOK fakes as before. Revert apply-nodes.sh and run-all-recipes.sh to upstream behavior.

This reverts commit c3eed64.

This reverts commit f66991d.

slinky-slurm-operator's chart gates both the cert-manager.io/Certificate submission and the ValidatingWebhookConfiguration on its own webhook.enabled / certManager.enabled toggles. Disable both for KWOK so admission isn't routed to unreachable fake-node pods. Harmless for scheduling validation; production recipes are unaffected. Verified locally: 43/43 pods scheduled.

EKS slurm lane snapshot caught slinky-slurm-controller-0 mid-bind (NOMINATED system-1, spec.nodeName empty) and counted it as unscheduled. slurm-operator reconciles the Controller CR into a StatefulSet AFTER Helm install completes, so the controller pod appears later than the script's existing 5s post-deploy sleep. Poll up to 60s for the controller pod's spec.nodeName, gated on the Controllers CRD existing so non-slurm lanes are unaffected.

The previous controller-pod poll fix was treating the symptom: the real failure is that slinky-slurm-controller is a StatefulSet with persistence.enabled=true by default, and Kind's local-path provisioner uses WaitForFirstConsumer binding. The pod gets NominatedNodeName=system-1 (a KWOK fake), the PVC tries to provision local-path on that fake, KWOK can't back local storage, and the pod sits Pending forever with no FailedScheduling event. Revert the controller-pod poll and disable controller persistence in the bundle via --set slurmcluster:controller.persistence.enabled=false. Verified locally: 43/43 pods scheduled.

faganihajizada requested review from a team as code owners May 18, 2026 10:09

github-actions Bot added area/recipes area/docs size/XL labels May 18, 2026

coderabbitai Bot reviewed May 18, 2026

View reviewed changes

Comment thread docs/user/container-images.md Outdated

Comment thread recipes/components/slinky-slurm/values.yaml Outdated

Comment thread recipes/registry.yaml

faganihajizada marked this pull request as draft May 18, 2026 11:07

faganihajizada force-pushed the feat/slinky-slurm-cluster-leaves branch 3 times, most recently from a3c854f to 27cadab Compare May 18, 2026 11:19

coderabbitai Bot reviewed May 18, 2026

View reviewed changes

Comment thread kwok/scripts/validate-scheduling.sh Outdated

faganihajizada force-pushed the feat/slinky-slurm-cluster-leaves branch 2 times, most recently from 7292e85 to f7c7964 Compare May 18, 2026 11:25

faganihajizada marked this pull request as ready for review May 18, 2026 11:25

faganihajizada force-pushed the feat/slinky-slurm-cluster-leaves branch from f7c7964 to 6d00132 Compare May 18, 2026 11:41

mchmarny assigned faganihajizada May 18, 2026

mchmarny requested changes May 18, 2026

View reviewed changes

faganihajizada force-pushed the feat/slinky-slurm-cluster-leaves branch from 02c2279 to f5f3bb4 Compare May 18, 2026 14:01

faganihajizada added 9 commits May 18, 2026 16:05

Revert "ci(kwok): disable cert-manager webhook instead of pinning to CP"

ac94a23

This reverts commit c3eed64.

faganihajizada added 2 commits May 18, 2026 16:38

Revert "ci(kwok): untaint kind control-plane to accept system-tier pods"

eb78d96

This reverts commit f66991d.

faganihajizada force-pushed the feat/slinky-slurm-cluster-leaves branch from 7272a8f to 54260ca Compare May 18, 2026 15:31

faganihajizada added 4 commits May 18, 2026 18:09

Merge branch 'main' into feat/slinky-slurm-cluster-leaves

9c8779d

Merge branch 'main' into feat/slinky-slurm-cluster-leaves

5bebf52

faganihajizada requested a review from mchmarny May 19, 2026 06:55

Merge branch 'main' into feat/slinky-slurm-cluster-leaves

a1746c6

mchmarny approved these changes May 19, 2026

View reviewed changes

mchmarny enabled auto-merge (squash) May 19, 2026 10:18

mchmarny merged commit cc08600 into NVIDIA:main May 19, 2026
91 checks passed

faganihajizada deleted the feat/slinky-slurm-cluster-leaves branch May 19, 2026 10:23

faganihajizada mentioned this pull request May 21, 2026

feat(slinky-slurm): add h100-gke-cos leaf and inline platform refs #997

Merged

25 tasks

Conversation

faganihajizada commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Verified against upstream chart v1.1.0

Test plan

Notes for reviewers

Uh oh!

coderabbitai Bot commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented May 18, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

mchmarny left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

faganihajizada commented May 18, 2026 •

edited

Loading

coderabbitai Bot commented May 18, 2026 •

edited

Loading