feat(slinky-slurm): add cluster chart and EKS/Kind leaves #948

Merged
mchmarny merged 17 commits into NVIDIA:main from faganihajizada:feat/slinky-slurm-cluster-leaves
May 19, 2026

@faganihajizada faganihajizada commented May 18, 2026

Summary

Wires the Slinky slurm cluster chart into AICR as an opt-in component (platform-slurm-cluster mixin) and ships the first two leaf recipes that consume both the Slinky operator and cluster:

  • h100-eks-ubuntu-training-slurm
  • h100-kind-training-slurm

Changes

Component (recipes/components/slinky-slurm/values.yaml):

  • Overrides the chart's native slinky keys for nodesets / loginsets so user values merge with chart defaults (avoids the nodeset.logfile typed-merge failure observed with custom sub-keys).
  • Production-leaning defaults: priority/multifactor scheduler tuning via extraConfMap (strings explicitly quoted to dodge YAML 1.1 coercion), cgroup + gres config files, ClusterIP restapi.
  • Accounting (slurmdbd) and DCGM job-mapping are disabled by default — accounting needs an external MariaDB AICR does not bundle; DCGM needs dcgm-exporter on workers. Both opt in via valuesFile / --set. accounting.storageConfig is kept as an inert documented example (chart wraps the entire Accounting CR in if accounting.enabled).

Mixin (recipes/mixins/platform-slurm-cluster.yaml): opt-in slinky-slurm cluster with deps on slinky-slurm-operator + slinky-slurm-operator-crds.

Health check (recipes/checks/slinky-slurm/health-check.yaml): asserts Slinky CRs (Controller, LoginSet, NodeSet, RestApi at slinky.slurm.net/v1beta1 under the slinky-slurm release name) AND underlying workload readiness (Deployment / StatefulSet availableReplicas > 0) before the generic pod-health step. Standardized on availableReplicas across all sub-checks.

Registry: adds slinky-slurm with nodeScheduling:

  • system tier: nodeSelectorPaths + tolerationPaths for controller.podSpec.*, restapi.podSpec.*, loginsets.slinky.podSpec.*
  • accelerated tier: nodeSelectorPaths + tolerationPaths for nodesets.slinky.podSpec.*

Both --system-node-* and --accelerated-node-* CLI flags now steer the correct pods.
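
One way to picture how tiered path entries like these let a CLI flag steer pods is the Python sketch below. It is an illustrative model under stated assumptions, not AICR's Go implementation: `REGISTRY`, `set_path`, `apply_node_selector`, and the label values are hypothetical, and only the dotted path prefixes come from the registry entries above.

```python
from copy import deepcopy

# Hypothetical registry excerpt: tier -> dotted Helm value paths.
REGISTRY = {
    "system": [
        "controller.podSpec.nodeSelector",
        "restapi.podSpec.nodeSelector",
        "loginsets.slinky.podSpec.nodeSelector",
    ],
    "accelerated": [
        "nodesets.slinky.podSpec.nodeSelector",
    ],
}


def set_path(values: dict, dotted: str, leaf) -> None:
    """Write leaf at a dotted path, creating intermediate maps as needed."""
    *parents, last = dotted.split(".")
    node = values
    for key in parents:
        node = node.setdefault(key, {})
    node[last] = leaf


def apply_node_selector(values: dict, tier: str, selector: dict) -> dict:
    """Return a copy of values with selector written to every path of the tier."""
    out = deepcopy(values)
    for path in REGISTRY[tier]:
        set_path(out, path, dict(selector))
    return out
```

In this model a system-tier selector touches only the controller, restapi, and login pods, while an accelerated-tier selector touches only the `nodesets.slinky` workers, which is the separation the two flag families rely on.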

Leaves:

  • Both leaves intentionally have no validation: block — inherit the full set from their *-training parents (mirrors *-kubeflow leaves).

KWOK (kwok/scripts/validate-scheduling.sh): extracts the platform criterion so *-slurm overlays resolve to the slinky bundle they ship. Flag is intentionally scoped to slurm only — a full matrix run revealed the harness's sequential CRD cleanup is broken for NFD / KAI / run.ai (orphans block pre-flight on the second recipe onward), so widening to kubeflow/dynamo is deferred to a follow-up.

Tests:

  • pkg/recipe/deployment_order_guard_test.go: 2 new cases asserting cert-manager → slinky-slurm-operator-crds → slinky-slurm-operator → slinky-slurm.
  • pkg/recipe/components_test.go: new targeted test verifies registry node-scheduling paths line up with the actual slinky map-keys in components/slinky-slurm/values.yaml (prevents silent drift on key renames).
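
The ordering asserted by the new deployment-order cases can be illustrated with a small Python sketch. This is not the Go test in `pkg/recipe/deployment_order_guard_test.go`; `EXPECTED_CHAIN` and `order_violations` are hypothetical names, and only the four component names and their relative order come from the description above.

```python
# Relative order taken from the PR description; other components may sit between.
EXPECTED_CHAIN = [
    "cert-manager",
    "slinky-slurm-operator-crds",
    "slinky-slurm-operator",
    "slinky-slurm",
]


def order_violations(deployment_order: list, chain: list) -> list:
    """Return one message per adjacent chain pair that is missing or out of order."""
    position = {name: i for i, name in enumerate(deployment_order)}
    problems = []
    for before, after in zip(chain, chain[1:]):
        if before not in position or after not in position:
            problems.append(f"missing: {before} or {after}")
        elif position[before] >= position[after]:
            problems.append(f"{before} must deploy before {after}")
    return problems
```

An empty result means every component in the chain is present and deploys after its prerequisite; unrelated components interleaved in the resolved order do not affect the check.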

Docs: docs/integrator/recipe-development.md documents the two-mixin opt-in pattern; docs/user/container-images.md regenerated via make bom-docs.

Verified against upstream chart v1.1.0

All component values cross-checked against SlinkyProject/slurm-operator helm/slurm v1.1.0 and main. Notably:

  • accounting.storageConfig.passwordKeyRef matches chart defaults (name: mariadb-password, key: password).
  • All extraConfMap values quoted as strings — chart serializes the map as map[string]string and bare YAML 1.1 no would coerce to bool.
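
The quoting concern can be illustrated with a short Python sketch that flags values which would stop being strings if written bare. The literal list follows the YAML 1.1 bool type definition; real parsers may implement only a subset of it, and the function names here are hypothetical.

```python
# Plain (unquoted) scalars that the YAML 1.1 bool type resolves to a boolean.
YAML11_BOOL_LITERALS = frozenset(
    "y Y yes Yes YES n N no No NO "
    "true True TRUE false False FALSE "
    "on On ON off Off OFF".split()
)


def coerces_to_bool(bare_scalar: str) -> bool:
    """True if an unquoted scalar would resolve to a bool under YAML 1.1."""
    return bare_scalar in YAML11_BOOL_LITERALS


def values_needing_quotes(conf: dict) -> list:
    """Sorted keys whose string values would change type if written unquoted."""
    return sorted(k for k, v in conf.items() if coerces_to_bool(v))
```

A value such as `no` is flagged while `select/cons_tres` is not. Numeric-looking values are outside this sketch, which is one reason quoting every value destined for a `map[string]string` is the simpler rule.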

Test plan

  • make test — full unit suite (includes deployment-order guard and components_test).
  • make lint — go/yaml/license/agents-sync/docs-sidebar.
  • make qualify — end-to-end pre-PR pass.
  • aicr recipe + aicr bundle smoke for both leaves.
  • make check-health against live Kind cluster for both slinky-slurm-operator and slinky-slurm.
  • aicr validate against deployed Kind cluster.
  • KWOK scheduling validation for h100-kind-training-slurm and h100-eks-ubuntu-training-slurm.
  • CI: kwok-recipes, chainsaw, lint, build.

Notes for reviewers

  • The slurm-only KWOK guard is deliberate — widening to other platforms is blocked on harness CRD-cleanup fixes, not on the resolver behavior itself.
  • The slinky map-key alignment between registry and values is enforced by the new components_test to catch future renames at unit-test time rather than via opaque scheduling failures.
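
The kind of drift this guard is meant to catch can be sketched in Python. This is a simplified model, not the Go test in `pkg/recipe/components_test.go`: `resolve` and `drifted_paths` are hypothetical helpers, and treating everything before `.podSpec.` as the owner that must exist in the values file is an assumption made for illustration.

```python
def resolve(values: dict, dotted: str):
    """Walk a dotted path through nested dicts; return None if any key is absent."""
    node = values
    for key in dotted.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def drifted_paths(values: dict, registry_paths: list) -> list:
    """Registry paths whose owner (the part before '.podSpec.') is not a map in values."""
    drifted = []
    for path in registry_paths:
        owner = path.split(".podSpec.")[0]
        if not isinstance(resolve(values, owner), dict):
            drifted.append(path)
    return drifted
```

If the values file renamed the `slinky` key under `loginsets`, the matching registry path would show up in the result at unit-test time instead of surfacing later as pods that ignore the scheduling flags.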

coderabbitai Bot commented May 18, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Introduces an opt-in slinky-slurm cluster component and a platform-slurm-cluster mixin that composes it atop the always-applied platform-slurm operator. Adds Helm values for the slinky-slurm chart, registry entries, documentation updates (platform->mixin resolution), two recipe overlays (EKS and Kind H100 training with platform: slurm), a Chainsaw/Kyverno health check for the cluster, unit and ordering tests, and a conditional --platform slurm forward in validate-scheduling.sh.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • NVIDIA/aicr#878: Overlaps with container-images/BOM adjustments related to Slurm component image listings.
  • NVIDIA/aicr#866: Related to adding slurm platform plumbing that the validate-scheduling.sh change forwards to aicr recipe.

Suggested labels

enhancement, area/tests, size/L

Suggested reviewers

  • mchmarny
  • lockwobr
  • njhensley
🚥 Pre-merge checks: ✅ 4 passed

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The PR title clearly and specifically summarizes the main changes: adding a Slinky Slurm cluster chart as an opt-in component and introducing two new leaf recipes for EKS and Kind. |
| Description check | ✅ Passed | The PR description is comprehensive and directly related to the changeset, detailing all major changes including component values, mixin, health checks, registry updates, leaves, KWOK modifications, tests, and documentation updates. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/user/container-images.md`:
- Line 198: Replace the mutable image reference
`docker.io/library/alpine:latest` with a pinned tag or digest in the
slinky-slurm values (e.g., set the image to something like
`docker.io/library/alpine:3.18.4` or a sha256 digest) and then regenerate the
BOM/docs; locate the `docker.io/library/alpine:latest` string in the values
source for slinky-slurm, update it to the chosen immutable tag/digest, and run
the doc/BOM generation step so the docs reflect the pinned image.

In `@recipes/components/slinky-slurm/values.yaml`:
- Around line 59-60: The comment points out a mismatch between the documented
Secret name and the value used by accounting.storageConfig.passwordKeyRef;
update either the comment or the passwordKeyRef so both reference the same
Secret/key (e.g., make passwordKeyRef.secretName = "mariadb-password" and key =
"password" or change the comment to require Secret named "mariadb"), and apply
the same fix for the duplicate occurrence referenced around the other block
(accounting.storageConfig.passwordKeyRef at the later section) so the guidance
and defaults align and the accounting opt-in will not break.

In `@recipes/registry.yaml`:
- Around line 601-610: Add node selector path mappings for the system tier so
the --system-node-selector override is applied: in the same mapping that defines
system.tolerationPaths, add a system.nodeSelectorPaths array with entries
mirroring the toleration ones (e.g. controller.podSpec.nodeSelector,
restapi.podSpec.nodeSelector, loginsets.slinky.podSpec.nodeSelector) so
controller/restapi/loginset pods are steered by the system-node-selector
override.

📥 Commits

Reviewing files that changed from the base of the PR and between 66f8e7b and cda91eb.

📒 Files selected for processing (13)
  • docs/integrator/recipe-development.md
  • docs/user/component-catalog.md
  • docs/user/container-images.md
  • kwok/scripts/validate-scheduling.sh
  • pkg/recipe/components_test.go
  • pkg/recipe/deployment_order_guard_test.go
  • recipes/checks/slinky-slurm-operator/health-check.yaml
  • recipes/checks/slinky-slurm/health-check.yaml
  • recipes/components/slinky-slurm/values.yaml
  • recipes/mixins/platform-slurm-cluster.yaml
  • recipes/overlays/h100-eks-ubuntu-training-slurm.yaml
  • recipes/overlays/h100-kind-training-slurm.yaml
  • recipes/registry.yaml

Comment thread docs/user/container-images.md Outdated
Comment thread recipes/components/slinky-slurm/values.yaml Outdated
Comment thread recipes/registry.yaml
@faganihajizada faganihajizada marked this pull request as draft May 18, 2026 11:07
@faganihajizada faganihajizada force-pushed the feat/slinky-slurm-cluster-leaves branch 3 times, most recently from a3c854f to 27cadab on May 18, 2026 11:19

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

Inline comments:
In `@kwok/scripts/validate-scheduling.sh`:
- Around line 331-334: The script currently ignores non-empty platform values
that aren't "slurm", causing silent behavior; update the logic around the
platform variable and recipe_args so the script fails fast: if platform is
non-empty and not equal to "slurm" then emit a clear error message (including
the invalid $platform value) to stderr and exit non-zero, otherwise when
platform == "slurm" append --platform "$platform" to recipe_args as before
(modify the existing if block for platform or add an explicit else branch that
calls echo >&2 and exit 1). Ensure you reference the same variable name
"platform" and the array "recipe_args" so the change is localized.

📥 Commits

Reviewing files that changed from the base of the PR and between a3c854f and 27cadab.

Comment thread kwok/scripts/validate-scheduling.sh Outdated
@faganihajizada faganihajizada force-pushed the feat/slinky-slurm-cluster-leaves branch 2 times, most recently from 7292e85 to f7c7964 on May 18, 2026 11:25
@faganihajizada faganihajizada marked this pull request as ready for review May 18, 2026 11:25
@faganihajizada faganihajizada force-pushed the feat/slinky-slurm-cluster-leaves branch from f7c7964 to 6d00132 on May 18, 2026 11:41

@mchmarny mchmarny left a comment

Solid wiring overall — the mixin split (platform-slurm always-on operator vs opt-in platform-slurm-cluster) is the right shape, and the new TestComponentRegistry_SlinkySlurm_NodeSchedulingPaths cross-check between registry paths and values.yaml map-keys is exactly the kind of guard this codebase needs more of. Deployment-order test additions also look correct.

One blocker and a few non-blocking nits:

  • CI (blocker): both Tier 2: h100-kind-training-slurm and Tier 2: h100-eks-ubuntu-training-slurm KWOK lanes were cancelled at ~15min (every other lane finished in 5–7min), which is what failed KWOK Test Summary. Since those are the very recipes this PR adds, please confirm whether it's a runner timeout vs. OCI pull stall vs. resolver/bundle slowdown, and get the lanes green before merge.
  • Defaults posture (non-blocking): PriorityFlags: NO_NORMAL_ALL + raw priority weights + MaxTime: UNLIMITED are real site-policy choices baked into the shipped default. Reasonable for a research cluster; worth at least re-labeling these as "chosen policy" rather than generic "production-leaning defaults", or moving the multifactor / unlimited bits behind an opt-in valuesFile.
  • Health-check release-name pinning (non-blocking): the slinky-slurm/health-check.yaml resource names assume the localformat release name — fine for make check-health, but flux/argocd deployers will silently break this. The file flags it; just don't let "needs a parameterized variant" stay a comment forever.
  • KWOK platform allowlist (non-blocking): the platform == "slurm" allowlist will bit-rot the next time someone adds a leaf with a non-allowlisted criteria.platform — consider a log line so the next author gets a hint.
  • Comment trim regression (nit): the slinky-slurm-operator/health-check.yaml rewrite drops the non-obvious "chainsaw error passes vacuously on empty namespace" rationale.

Also, the branch is behind main — please rebase before re-pushing.

Comment thread recipes/overlays/h100-kind-training-slurm.yaml
Comment thread recipes/checks/slinky-slurm-operator/health-check.yaml
Comment thread recipes/components/slinky-slurm/values.yaml Outdated
Comment thread recipes/components/slinky-slurm/values.yaml
Comment thread recipes/checks/slinky-slurm/health-check.yaml
Comment thread kwok/scripts/validate-scheduling.sh
Comment thread pkg/recipe/components_test.go
Wire the slinky `slurm` cluster chart into AICR as an opt-in component
(`platform-slurm-cluster` mixin) and ship the first two leaf recipes
that consume both the Slinky operator and cluster:
`h100-eks-ubuntu-training-slurm` and `h100-kind-training-slurm`.

Component values mirror the chart's native `slinky` keys for nodesets
and loginsets so user overrides merge cleanly with chart defaults
(avoids the `nodeset.logfile` typed-merge failure observed when using
custom sub-keys). Production-leaning defaults: priority/multifactor
scheduler tuning via `extraConfMap`, cgroup/gres config files, ClusterIP
restapi. Accounting (slurmdbd) and DCGM job-mapping are disabled by
default — accounting needs an external MariaDB AICR does not bundle,
DCGM needs dcgm-exporter on workers; both opt in via valuesFile / --set.
The accounting.storageConfig block is kept as an inert example because
the chart wraps the entire Accounting CR in `if accounting.enabled`.

Health check asserts both the Slinky CRs (Controller, LoginSet, NodeSet,
RestApi at apiVersion `v1beta1` under the `slinky-slurm` release name)
and the underlying workload readiness (Deployment/StatefulSet
`availableReplicas > 0`) before the generic pod-health step.

KWOK validate-scheduling.sh now extracts the `platform` criterion and
passes `--platform slurm` so *-slurm overlays resolve to the slinky
bundle they ship. The flag is intentionally scoped to slurm: a full
matrix run showed the harness's sequential CRD cleanup is broken for
NFD/KAI/run.ai (orphans block pre-flight on the second recipe onward),
so widening to kubeflow/dynamo is deferred to a follow-up.

Registry adds nodeScheduling paths for `nodesets.slinky.podSpec.*`,
`loginsets.slinky.podSpec.*`, `controller.podSpec.*`, and
`restapi.podSpec.*` so both `--system-node-*` and
`--accelerated-node-*` flags steer the correct pods. Deployment-order
guard gains the two new leaves; a targeted components test verifies the
registry paths line up with the actual map-keys in
components/slinky-slurm/values.yaml.

EKS leaf intentionally has no `validation:` block so it inherits the
full set from `h100-eks-training` (via `h100-eks-ubuntu-training`); the
Kind leaf follows the same pattern.

docs/user/container-images.md regenerated via `make bom-docs`.
@faganihajizada faganihajizada force-pushed the feat/slinky-slurm-cluster-leaves branch from 02c2279 to f5f3bb4 on May 18, 2026 14:01
Mark flagged that the slurm-only allowlist will silently bit-rot when
a future *-kubeflow or *-dynamo leaf with criteria.platform lands here.
Add a log_info so the next person sees the hint instead of mysterious
bundle diffs.

Cannot fail-closed yet: existing kubeflow/dynamo Tier 2 lanes carry
criteria.platform today and have historically resolved to their
non-platform parent under KWOK. Widening the allowlist is blocked on
the harness CRD-cleanup bug; tracked as a follow-up.
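
The behavior described in this commit message (forward `--platform` only for slurm, log a hint for anything else, and do not fail) can be modeled in a few lines of Python. The real logic lives in the shell script `kwok/scripts/validate-scheduling.sh`; `FORWARDED_PLATFORMS`, `platform_args`, and the hint wording are hypothetical.

```python
from typing import Optional

# Platforms whose criterion is forwarded to the recipe command (slurm only, per the text above).
FORWARDED_PLATFORMS = {"slurm"}


def platform_args(platform: str) -> tuple:
    """Return (extra recipe args, optional log hint) for a recipe's platform criterion."""
    hint: Optional[str] = None
    if not platform:
        # No platform criterion: nothing to forward, nothing to log.
        return [], hint
    if platform in FORWARDED_PLATFORMS:
        return ["--platform", platform], hint
    # Not fail-closed: the recipe still resolves, but the next author gets a hint.
    hint = (
        f"platform '{platform}' is not forwarded under KWOK; "
        "recipe resolves without --platform"
    )
    return [], hint
```

Returning the hint instead of raising mirrors the stated constraint that existing kubeflow/dynamo lanes already carry `criteria.platform` and must keep resolving as before.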

Mark's earlier review explained that chainsaw `error` asserts pass
vacuously when no resource matches — so the preceding deployment
availability checks are load-bearing, not redundant. Previous trim of
that comment lost the non-obvious invariant; restore it.

Mirror the wording fix applied to slinky-slurm-operator/health-check.yaml:
spell out that chainsaw `error` passes vacuously on empty namespaces, so
the preceding StatefulSet/Deployment/NodeSet availability asserts are
load-bearing.
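
Why the preceding availability asserts are load-bearing can be shown with a simplified Python model of the two step kinds. This is not Chainsaw's matching engine; `error_step_passes`, `assert_step_passes`, and the `phase` field are hypothetical stand-ins for a negative check ("no resource is in a bad state") and a positive check ("some resource is available").

```python
def error_step_passes(resources: list, bad_state: dict) -> bool:
    """Negative check: passes when no resource matches every field of bad_state."""
    return not any(
        all(res.get(key) == value for key, value in bad_state.items())
        for res in resources
    )


def assert_step_passes(resources: list, predicate) -> bool:
    """Positive check: passes only if at least one resource satisfies predicate."""
    return any(predicate(res) for res in resources)


def is_available(res: dict) -> bool:
    """Availability rule used in the health check: availableReplicas > 0."""
    return res.get("availableReplicas", 0) > 0
```

With an empty resource list the negative check passes because there is nothing to match, while the positive check fails. Only the positive check distinguishes a healthy namespace from an empty one.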

Mark called out that hardcoding `slinky-slurm-*` in the asserts
silently breaks for deployers that override the Helm release name
(flux/argocd path-based naming, multi-tenant installs). Add a TODO
pointing at the parameterization gap so the next reader doesn't
mistake this for a deliberate single-release contract.

Three related cluster-default fixes from review:

1. Single Default=YES partition is a hard slurmctld constraint, not
   style — call that out so multi-tenant deployers know they must
   disable this partition before adding their own.

2. MaxTime=UNLIMITED is unsafe as a default; a stuck job pins GPUs
   indefinitely. Cap at 24h. Leaves with longer-running workloads
   override per-overlay.

3. The prior "known upstream issue" framing for slurmd auto-registering
   pod resource limits was wrong. Verified against v1.1 chart sources:
   --conf only carries Features=<name> + user extraConf, while pod
   cpu/memory are plumbed as POD_CPUS/POD_MEMORY env vars for the
   image entrypoint. Reword the note to describe what the chart
   actually does and why Gres/Features must be set explicitly.

PR-948 CI: every Tier 1/2 lane went red after the cert-manager harness
fix. Root cause: the previous commit pinned aicr.nvidia.com/node-type=
system exclusively to the real Kind control-plane (KWOK fakes now
carry =kwok-system). That correctly routes cert-manager — which
tolerates the control-plane taint — but leaves untolerated charts
(kai-scheduler, nvsentinel, prometheus-adapter) Pending because the CP
still has node-role.kubernetes.io/control-plane:NoSchedule.

Remove that taint. Production clusters either run dedicated, untainted
system nodes or schedule these charts on workers; the harness should
model the former.

Mark called out that PriorityType=priority/multifactor + the raw weight
values + PriorityFlags=NO_NORMAL_ALL bake a specific site-policy into
the AICR default. Most upstream Slurm clusters either run normalized
(default) or omit PriorityFlags and let admins tune.

These weights were inherited from an AWS reference config and have not
been validated by any AICR-shipped leaf. Drop them — sites that want
multifactor add it in their leaf valuesFile.

Keep SelectType=select/cons_tres (required for GPU GRES) and
ScronParameters=enable. Drop EnforcePartLimits=no (matches Slurm's
own default).

The previous two harness fixes (label KWOK fakes kwok-system, pin
real CP, untaint CP) cascaded: pinning system-tier workloads to a
single real node caused Insufficient CPU pending across most lanes.

Use the simpler fix the harness already employs for similar cases —
disable the side-effect in the bundle. cert-manager's validating
webhook is the only thing slinky-slurm-operator's install touched
that needed a reachable endpoint; setting webhook.enabled=false skips
admission entirely. KWOK doesn't execute workloads, so we lose nothing
for scheduling validation, and every other chart returns to landing
on KWOK fakes as before.

Revert apply-nodes.sh and run-all-recipes.sh to upstream behavior.

slinky-slurm-operator's chart gates both the cert-manager.io/Certificate
submission and the ValidatingWebhookConfiguration on its own
webhook.enabled / certManager.enabled toggles. Disable both for KWOK
so admission isn't routed to unreachable fake-node pods. Harmless for
scheduling validation; production recipes are unaffected.

Verified locally: 43/43 pods scheduled.
@faganihajizada faganihajizada force-pushed the feat/slinky-slurm-cluster-leaves branch from 7272a8f to 54260ca on May 18, 2026 15:31
EKS slurm lane snapshot caught slinky-slurm-controller-0 mid-bind
(NOMINATED system-1, spec.nodeName empty) and counted it as
unscheduled. slurm-operator reconciles the Controller CR into a
StatefulSet AFTER Helm install completes, so the controller pod
appears later than the script's existing 5s post-deploy sleep.

Poll up to 60s for the controller pod's spec.nodeName, gated on
the Controllers CRD existing so non-slurm lanes are unaffected.

The previous controller-pod poll fix was treating the symptom: the
real failure is that slinky-slurm-controller is a StatefulSet with
persistence.enabled=true by default, and Kind's local-path
provisioner uses WaitForFirstConsumer binding. The pod gets
NominatedNodeName=system-1 (a KWOK fake), the PVC tries to provision
local-path on that fake, KWOK can't back local storage, and the pod
sits Pending forever with no FailedScheduling event.

Revert the controller-pod poll and disable controller persistence
in the bundle via --set slurmcluster:controller.persistence.enabled=false.

Verified locally: 43/43 pods scheduled.
@faganihajizada faganihajizada requested a review from mchmarny May 19, 2026 06:55
@mchmarny mchmarny enabled auto-merge (squash) May 19, 2026 10:18
@mchmarny mchmarny merged commit cc08600 into NVIDIA:main May 19, 2026
91 checks passed
@faganihajizada faganihajizada deleted the feat/slinky-slurm-cluster-leaves branch May 19, 2026 10:23