feat(slinky-slurm): add cluster chart and EKS/Kind leaves#948
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughIntroduces an opt-in Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/user/container-images.md`:
- Line 198: Replace the mutable image reference
`docker.io/library/alpine:latest` with a pinned tag or digest in the
slinky-slurm values (e.g., set the image to something like
`docker.io/library/alpine:3.18.4` or a sha256 digest) and then regenerate the
BOM/docs; locate the `docker.io/library/alpine:latest` string in the values
source for slinky-slurm, update it to the chosen immutable tag/digest, and run
the doc/BOM generation step so the docs reflect the pinned image.
In `@recipes/components/slinky-slurm/values.yaml`:
- Around line 59-60: The comment points out a mismatch between the documented
Secret name and the value used by accounting.storageConfig.passwordKeyRef;
update either the comment or the passwordKeyRef so both reference the same
Secret/key (e.g., make passwordKeyRef.secretName = "mariadb-password" and key =
"password" or change the comment to require Secret named "mariadb"), and apply
the same fix for the duplicate occurrence referenced around the other block
(accounting.storageConfig.passwordKeyRef at the later section) so the guidance
and defaults align and the accounting opt-in will not break.
In `@recipes/registry.yaml`:
- Around line 601-610: Add node selector path mappings for the system tier so
the --system-node-selector override is applied: in the same mapping that defines
system.tolerationPaths, add a system.nodeSelectorPaths array with entries
mirroring the toleration ones (e.g. controller.podSpec.nodeSelector,
restapi.podSpec.nodeSelector, loginsets.slinky.podSpec.nodeSelector) so
controller/restapi/loginset pods are steered by the system-node-selector
override.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: e72860d5-3271-4166-a452-b585d2891df9
📒 Files selected for processing (13)
docs/integrator/recipe-development.mddocs/user/component-catalog.mddocs/user/container-images.mdkwok/scripts/validate-scheduling.shpkg/recipe/components_test.gopkg/recipe/deployment_order_guard_test.gorecipes/checks/slinky-slurm-operator/health-check.yamlrecipes/checks/slinky-slurm/health-check.yamlrecipes/components/slinky-slurm/values.yamlrecipes/mixins/platform-slurm-cluster.yamlrecipes/overlays/h100-eks-ubuntu-training-slurm.yamlrecipes/overlays/h100-kind-training-slurm.yamlrecipes/registry.yaml
a3c854f to
27cadab
Compare
|
🌿 Preview your docs: https://nvidia-preview-feat-slinky-slurm-cluster-leaves.docs.buildwithfern.com/aicr |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@kwok/scripts/validate-scheduling.sh`:
- Around line 331-334: The script currently ignores non-empty platform values
that aren't "slurm", causing silent behavior; update the logic around the
platform variable and recipe_args so the script fails fast: if platform is
non-empty and not equal to "slurm" then emit a clear error message (including
the invalid $platform value) to stderr and exit non-zero, otherwise when
platform == "slurm" append --platform "$platform" to recipe_args as before
(modify the existing if block for platform or add an explicit else branch that
calls echo >&2 and exit 1). Ensure you reference the same variable name
"platform" and the array "recipe_args" so the change is localized.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: d0dc40d1-8f40-4107-b26c-7dfa73f3f5e6
📒 Files selected for processing (13)
docs/integrator/recipe-development.mddocs/user/component-catalog.mddocs/user/container-images.mdkwok/scripts/validate-scheduling.shpkg/recipe/components_test.gopkg/recipe/deployment_order_guard_test.gorecipes/checks/slinky-slurm-operator/health-check.yamlrecipes/checks/slinky-slurm/health-check.yamlrecipes/components/slinky-slurm/values.yamlrecipes/mixins/platform-slurm-cluster.yamlrecipes/overlays/h100-eks-ubuntu-training-slurm.yamlrecipes/overlays/h100-kind-training-slurm.yamlrecipes/registry.yaml
7292e85 to
f7c7964
Compare
f7c7964 to
6d00132
Compare
mchmarny
left a comment
There was a problem hiding this comment.
Solid wiring overall — the mixin split (platform-slurm always-on operator vs opt-in platform-slurm-cluster) is the right shape, and the new TestComponentRegistry_SlinkySlurm_NodeSchedulingPaths cross-check between registry paths and values.yaml map-keys is exactly the kind of guard this codebase needs more of. Deployment-order test additions also look correct.
One blocker and a few non-blocking nits:
- CI (blocker): both
Tier 2: h100-kind-training-slurmandTier 2: h100-eks-ubuntu-training-slurmKWOK lanes were cancelled at ~15min (every other lane finished in 5–7min), which is what failedKWOK Test Summary. Since those are the very recipes this PR adds, please confirm whether it's a runner timeout vs. OCI pull stall vs. resolver/bundle slowdown, and get the lanes green before merge. - Defaults posture (non-blocking):
PriorityFlags: NO_NORMAL_ALL+ raw priority weights +MaxTime: UNLIMITEDare real site-policy choices baked into the shipped default. Reasonable for a research cluster; worth at least re-labeling these as "chosen policy" rather than generic "production-leaning defaults", or moving the multifactor / unlimited bits behind an opt-in valuesFile. - Health-check release-name pinning (non-blocking): the
slinky-slurm/health-check.yamlresource names assume the localformat release name — fine formake check-health, but flux/argocd deployers will silently break this. The file flags it; just don't let "needs a parameterized variant" stay a comment forever. - KWOK platform allowlist (non-blocking): the
platform == "slurm"allowlist will bit-rot the next time someone adds a leaf with a non-allowlisted criteria.platform — consider a log line so the next author gets a hint. - Comment trim regression (nit): the
slinky-slurm-operator/health-check.yamlrewrite drops the non-obvious "chainsawerrorpasses vacuously on empty namespace" rationale.
Also, the branch is behind main — please rebase before re-pushing.
Wire the slinky `slurm` cluster chart into AICR as an opt-in component (`platform-slurm-cluster` mixin) and ship the first two leaf recipes that consume both the Slinky operator and cluster: `h100-eks-ubuntu- training-slurm` and `h100-kind-training-slurm`. Component values mirror the chart's native `slinky` keys for nodesets and loginsets so user overrides merge cleanly with chart defaults (avoids the `nodeset.logfile` typed-merge failure observed when using custom sub-keys). Production-leaning defaults: priority/multifactor scheduler tuning via `extraConfMap`, cgroup/gres config files, ClusterIP restapi. Accounting (slurmdbd) and DCGM job-mapping are disabled by default — accounting needs an external MariaDB AICR does not bundle, DCGM needs dcgm-exporter on workers; both opt in via valuesFile / --set. The accounting.storageConfig block is kept as an inert example because the chart wraps the entire Accounting CR in `if accounting.enabled`. Health check asserts both the Slinky CRs (Controller, LoginSet, NodeSet, RestApi at apiVersion `v1beta1` under the `slinky-slurm` release name) and the underlying workload readiness (Deployment/StatefulSet `availableReplicas > 0`) before the generic pod-health step. KWOK validate-scheduling.sh now extracts the `platform` criterion and passes `--platform slurm` so *-slurm overlays resolve to the slinky bundle they ship. The flag is intentionally scoped to slurm: a full matrix run showed the harness's sequential CRD cleanup is broken for NFD/KAI/run.ai (orphans block pre-flight on the second recipe onward), so widening to kubeflow/dynamo is deferred to a follow-up. Registry adds nodeScheduling paths for `nodesets.slinky.podSpec.*`, `loginsets.slinky.podSpec.*`, `controller.podSpec.*`, and `restapi.podSpec.*` so both `--system-node-*` and `--accelerated-node-*` flags steer the correct pods. Deployment-order guard gains the two new leaves; a targeted components test verifies the registry paths line up with the actual map-keys in components/slinky-slurm/values.yaml. EKS leaf intentionally has no `validation:` block so it inherits the full set from `h100-eks-training` (via `h100-eks-ubuntu-training`); the Kind leaf follows the same pattern. docs/user/container-images.md regenerated via `make bom-docs`.
02c2279 to
f5f3bb4
Compare
Mark flagged that the slurm-only allowlist will silently bit-rot when a future *-kubeflow or *-dynamo leaf with criteria.platform lands here. Add a log_info so the next person sees the hint instead of mysterious bundle diffs. Cannot fail-closed yet: existing kubeflow/dynamo Tier 2 lanes carry criteria.platform today and have historically resolved to their non-platform parent under KWOK. Widening the allowlist is blocked on the harness CRD-cleanup bug; tracked as a follow-up.
Mark's earlier review explained that chainsaw `error` asserts pass vacuously when no resource matches — so the preceding deployment availability checks are load-bearing, not redundant. Previous trim of that comment lost the non-obvious invariant; restore it.
Mirror the wording fix applied to slinky-slurm-operator/health-check.yaml: spell out that chainsaw `error` passes vacuously on empty namespaces, so the preceding StatefulSet/Deployment/NodeSet availability asserts are load-bearing.
Mark called out that hardcoding `slinky-slurm-*` in the asserts silently breaks for deployers that override the Helm release name (flux/argocd path-based naming, multi-tenant installs). Add a TODO pointing at the parameterization gap so the next reader doesn't mistake this for a deliberate single-release contract.
Three related cluster-default fixes from review: 1. Single Default=YES partition is a hard slurmctld constraint, not style — call that out so multi-tenant deployers know they must disable this partition before adding their own. 2. MaxTime=UNLIMITED is unsafe as a default; a stuck job pins GPUs indefinitely. Cap at 24h. Leaves with longer-running workloads override per-overlay. 3. The prior "known upstream issue" framing for slurmd auto-registering pod resource limits was wrong. Verified against v1.1 chart sources: --conf only carries Features=<name> + user extraConf, while pod cpu/memory are plumbed as POD_CPUS/POD_MEMORY env vars for the image entrypoint. Reword the note to describe what the chart actually does and why Gres/Features must be set explicitly.
PR-948 CI: every Tier 1/2 lane went red after the cert-manager harness fix. Root cause: the previous commit pinned aicr.nvidia.com/node-type= system exclusively to the real Kind control-plane (KWOK fakes now carry =kwok-system). That correctly routes cert-manager — which tolerates the control-plane taint — but leaves untolerated charts (kai-scheduler, nvsentinel, prometheus-adapter) Pending because the CP still has node-role.kubernetes.io/control-plane:NoSchedule. Remove that taint. Production clusters either run dedicated, untainted system nodes or schedule these charts on workers; the harness should model the former.
Mark called out that PriorityType=priority/multifactor + the raw weight values + PriorityFlags=NO_NORMAL_ALL bake a specific site-policy into the AICR default. Most upstream Slurm clusters either run normalized (default) or omit PriorityFlags and let admins tune. These weights were inherited from an AWS reference config and have not been validated by any AICR-shipped leaf. Drop them — sites that want multifactor add it in their leaf valuesFile. Keep SelectType=select/cons_tres (required for GPU GRES) and ScronParameters=enable. Drop EnforcePartLimits=no (matches Slurm's own default).
The previous two harness fixes (label KWOK fakes kwok-system, pin real CP, untaint CP) cascaded: pinning system-tier workloads to a single real node caused Insufficient CPU pending across most lanes. Use the simpler fix the harness already employs for similar cases — disable the side-effect in the bundle. cert-manager's validating webhook is the only thing slinky-slurm-operator's install touched that needed a reachable endpoint; setting webhook.enabled=false skips admission entirely. KWOK doesn't execute workloads, so we lose nothing for scheduling validation, and every other chart returns to landing on KWOK fakes as before. Revert apply-nodes.sh and run-all-recipes.sh to upstream behavior.
This reverts commit c3eed64.
This reverts commit f66991d.
slinky-slurm-operator's chart gates both the cert-manager.io/Certificate submission and the ValidatingWebhookConfiguration on its own webhook.enabled / certManager.enabled toggles. Disable both for KWOK so admission isn't routed to unreachable fake-node pods. Harmless for scheduling validation; production recipes are unaffected. Verified locally: 43/43 pods scheduled.
7272a8f to
54260ca
Compare
EKS slurm lane snapshot caught slinky-slurm-controller-0 mid-bind (NOMINATED system-1, spec.nodeName empty) and counted it as unscheduled. slurm-operator reconciles the Controller CR into a StatefulSet AFTER Helm install completes, so the controller pod appears later than the script's existing 5s post-deploy sleep. Poll up to 60s for the controller pod's spec.nodeName, gated on the Controllers CRD existing so non-slurm lanes are unaffected.
The previous controller-pod poll fix was treating the symptom: the real failure is that slinky-slurm-controller is a StatefulSet with persistence.enabled=true by default, and Kind's local-path provisioner uses WaitForFirstConsumer binding. The pod gets NominatedNodeName=system-1 (a KWOK fake), the PVC tries to provision local-path on that fake, KWOK can't back local storage, and the pod sits Pending forever with no FailedScheduling event. Revert the controller-pod poll and disable controller persistence in the bundle via --set slurmcluster:controller.persistence.enabled=false. Verified locally: 43/43 pods scheduled.
Summary
Wires the Slinky
slurmcluster chart into AICR as an opt-in component (platform-slurm-clustermixin) and ships the first two leaf recipes that consume both the Slinky operator and cluster:h100-eks-ubuntu-training-slurmh100-kind-training-slurmChanges
Component (
recipes/components/slinky-slurm/values.yaml):slinkykeys fornodesets/loginsetsso user values merge with chart defaults (avoids thenodeset.logfiletyped-merge failure observed with custom sub-keys).extraConfMap(strings explicitly quoted to dodge YAML 1.1 coercion), cgroup + gres config files, ClusterIPrestapi.dcgm-exporteron workers. Both opt in via valuesFile /--set.accounting.storageConfigis kept as an inert documented example (chart wraps the entireAccountingCR inif accounting.enabled).Mixin (
recipes/mixins/platform-slurm-cluster.yaml): opt-inslinky-slurmcluster with deps onslinky-slurm-operator+slinky-slurm-operator-crds.Health check (
recipes/checks/slinky-slurm/health-check.yaml): asserts Slinky CRs (Controller,LoginSet,NodeSet,RestApiatslinky.slurm.net/v1beta1under theslinky-slurmrelease name) AND underlying workload readiness (Deployment/StatefulSetavailableReplicas > 0) before the generic pod-health step. Standardized onavailableReplicasacross all sub-checks.Registry: adds
slinky-slurmwithnodeScheduling:nodeSelectorPaths+tolerationPathsforcontroller.podSpec.*,restapi.podSpec.*,loginsets.slinky.podSpec.*nodeSelectorPaths+tolerationPathsfornodesets.slinky.podSpec.*Both
--system-node-*and--accelerated-node-*CLI flags now steer the correct pods.Leaves:
validation:block — inherit the full set from their*-trainingparents (mirrors*-kubeflowleaves).KWOK (
kwok/scripts/validate-scheduling.sh): extracts theplatformcriterion so*-slurmoverlays resolve to the slinky bundle they ship. Flag is intentionally scoped toslurmonly — a full matrix run revealed the harness's sequential CRD cleanup is broken for NFD / KAI / run.ai (orphans block pre-flight on the second recipe onward), so widening to kubeflow/dynamo is deferred to a follow-up.Tests:
pkg/recipe/deployment_order_guard_test.go: 2 new cases assertingcert-manager → slinky-slurm-operator-crds → slinky-slurm-operator → slinky-slurm.pkg/recipe/components_test.go: new targeted test verifies registry node-scheduling paths line up with the actualslinkymap-keys incomponents/slinky-slurm/values.yaml(prevents silent drift on key renames).Docs:
docs/integrator/recipe-development.mddocuments the two-mixin opt-in pattern;docs/user/container-images.mdregenerated viamake bom-docs.Verified against upstream chart v1.1.0
All component values cross-checked against SlinkyProject/slurm-operator helm/slurm v1.1.0 and
main. Notably:accounting.storageConfig.passwordKeyRefmatches chart defaults (name: mariadb-password,key: password).extraConfMapvalues quoted as strings — chart serializes the map asmap[string]stringand bare YAML 1.1nowould coerce to bool.Test plan
make test— full unit suite (includes deployment-order guard and components_test).make lint— go/yaml/license/agents-sync/docs-sidebar.make qualify— end-to-end pre-PR pass.aicr recipe+aicr bundlesmoke for both leaves.make check-healthagainst live Kind cluster for bothslinky-slurm-operatorandslinky-slurm.aicr validateagainst deployed Kind cluster.h100-kind-training-slurmandh100-eks-ubuntu-training-slurm.Notes for reviewers
slinkymap-key alignment between registry and values is enforced by the new components_test to catch future renames at unit-test time rather than via opaque scheduling failures.