feat(slinky-slurm): add h100-gke-cos leaf and inline platform refs#997
feat(slinky-slurm): add h100-gke-cos leaf and inline platform refs#997faganihajizada wants to merge 9 commits into
Conversation
Move slinky-slurm-operator-crds, slinky-slurm-operator, and slinky-slurm from the platform-slurm / platform-slurm-cluster mixins into inline componentRefs on each slurm leaf, mirroring the dynamo-platform pattern used by *-inference-dynamo leaves. The mixinComponentRefSafeForMerge guard in pkg/recipe/metadata_store.go blocks leaf-level overrides on identity fields (source, version, valuesFile) of mixin-contributed components, which made it impossible to declare leaf-specific GPU GRES tuning (extraConfMap.Gres and slurmd.resources.limits.nvidia.com/gpu) from the leaf overlays. Inlining the three Slinky charts per slurm leaf moves them off the mixin merge path and onto the standard componentRef overlay path, unblocking the GRES wiring that srun --gres=gpu:N requires. Changes: - recipes/overlays/h100-eks-ubuntu-training-slurm.yaml: inline the three Slinky componentRefs with H100 GRES overrides on slinky-slurm - recipes/overlays/h100-kind-training-slurm.yaml: inline the three Slinky componentRefs (CPU-only, no GRES overrides) - recipes/mixins/platform-slurm.yaml, recipes/mixins/platform-slurm-cluster.yaml: removed; no remaining references in the live tree (only a historical CHANGELOG mention) - recipes/registry.yaml: update slinky-slurm comment to point to the inline pattern; defaults still drive source/version/namespace - docs/integrator/recipe-development.md, docs/user/component-catalog.md: rewrite the slurm-platform section to describe the inline pattern - docs/user/api-reference.md: add slinky-slurm to the bundler component table (pre-existing gap from when the cluster was mixin-opt-in)
First-class GKE/COS leaf for H100 training with the Slinky-managed Slurm cluster (Controller, LoginSet, NodeSet, RestApi). Inherits from h100-gke-cos-training and inlines the slinky-slurm trio (slinky-slurm-operator-crds, slinky-slurm-operator, slinky-slurm) with the same H100 GRES overrides as the EKS slurm leaf, so srun --gres=gpu:N allocates real GPUs (Gres=gpu:h100:8 in slurm.conf, nvidia.com/gpu: 8 on the slurmd pod). Unlike the EKS slurm leaf, no os-* mixin is composed: gke-cos.yaml already disables GPU driver installation and pins DRA / nodewright paths for COS's read-only rootfs. Drops the K8s-native NCCL bandwidth perf check inherited from h100-gke-cos-training: on a Slinky cluster the equivalent validation is a slurm-launched srun job (e.g. nccl-tests/all_reduce_perf) that exercises slurmd + the GPUDirect TCPXO plugin from the parent leaf. Conformance and deployment checks are inherited unchanged. Adds a deployment-order guard test case mirroring the existing h100-kind-training-slurm and h100-eks-ubuntu-training-slurm cases, asserting CRDs precede the operator, the operator precedes the cluster, cert-manager precedes the operator, and gpu-operator precedes nvsentinel.
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThis PR replaces centralized Slurm mixins with inline per-leaf Slinky Slurm componentRefs across overlays. Existing EKS and Kind Slurm overlays are updated to declare CRDs, operator, and cluster inline with leaf-specific GRES/override settings and DRA constraints. A new GKE/COS/H100 training Slurm overlay is added. Documentation, registry comments, and the API components list are updated to reflect the inline-per-leaf pattern. Tests are extended to cover NFD topology updater expectations and deployment-order guards for the new overlays. Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/integrator/recipe-development.md`:
- Line 118: Split the dense paragraph into shorter sentences or a bulleted list
to improve readability: explain each concept separately (1) that a platform's
full component stack is declared inline per leaf overlay rather than via a
platform mixin, (2) that leaves can carry hardware-specific tuning like GPU
GRES, resource limits, and partition layout, (3) the concrete example that
`--platform slurm` leaves inline the SchedMD Slinky operator CRDs and the
operator and the Slinky-managed Slurm cluster as three `componentRefs` (same
shape `dynamo-platform` uses across `*-inference-dynamo` leaves), (4) the
variations where a leaf can inline only `slinky-slurm-operator-crds` +
`slinky-slurm-operator` to get the operator only, or add `slinky-slurm` with
leaf-specific `overrides` to get the full cluster, and (5) reference the example
file `recipes/overlays/h100-eks-ubuntu-training-slurm.yaml`; convert each
numbered concept into its own short sentence or bullet so the paragraph is
scan-friendly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: ac0ba934-f154-4860-95ab-06efc451df91
📒 Files selected for processing (10)
docs/integrator/recipe-development.mddocs/user/api-reference.mddocs/user/component-catalog.mdpkg/recipe/deployment_order_guard_test.gorecipes/mixins/platform-slurm-cluster.yamlrecipes/mixins/platform-slurm.yamlrecipes/overlays/h100-eks-ubuntu-training-slurm.yamlrecipes/overlays/h100-gke-cos-training-slurm.yamlrecipes/overlays/h100-kind-training-slurm.yamlrecipes/registry.yaml
💤 Files with no reviewable changes (2)
- recipes/mixins/platform-slurm-cluster.yaml
- recipes/mixins/platform-slurm.yaml
The single dense paragraph introduced in the slinky-slurm refactor was hard to scan. Break it into a short lead sentence, a bulleted list of the three inlined componentRefs (CRDs / operator / cluster), and a trailing line that points at dynamo-platform as the precedent and h100-eks-ubuntu-training-slurm.yaml as the worked example. Drops the speculative "partition layout" tuning example and the operator-only leaf variation: neither is exercised by any leaf in tree today, and documenting hypotheticals risks the same drift problem catalog-md hit before (NKX-10404).
ArangoGutierrez
left a comment
There was a problem hiding this comment.
Approving — the refactor is well-motivated and well-documented.
Summary: The inline shape mirrors the established dynamo-platform precedent, the PR description traces the engine constraint (mixinComponentRefSafeForMerge identity-field guard) precisely, and the end-to-end evidence on aicr-eidosx (sinfo + 16 unique H100 UUIDs across both slurmd pods) is concrete. Nice catch on adding slinky-slurm to the docs/user/api-reference.md bundler table — that was a pre-existing gap from when it was opt-in via the deleted cluster mixin.
Verified locally at HEAD e51b4b16:
TestDeploymentOrderGuardspasses all 12 cases including the newh100-gke-cos-training-slurm.go test ./pkg/recipe/... -shortgreen.- All function references in the PR (
mergeMixins,mixinComponentRefSafeForMerge,ApplyRegistryDefaults) exist at the cited locations. - No orphaned consumers of the deleted mixins outside of stale planning docs in
docs/superpowers/.
One consistency question on the EKS vs GKE NCCL validation handling (inline), plus a non-blocking suggestion about extending TestNFDTopologyUpdater_OverlayCoverage to cover slurm leaves (pre-existing gap, not introduced by this PR).
The GKE slurm leaf already overrides validation.performance.checks to [] with the rationale that on a Slinky-managed cluster the K8s-native nccl-all-reduce-bw measures the wrong path — it launches a Pod past slurmd, so the bandwidth number reflects raw cluster networking, not what an `srun nccl-tests/all_reduce_perf` would see going through slurmd + the configured fabric (TCPXO on GKE, EFA on EKS). The argument generalizes to any Slinky cluster, so apply the same override on the EKS slurm leaf. Tighten the prose on both files: the old comment said "Conformance checks are inherited unchanged" but deployment checks are also inherited; replace with "Deployment and conformance checks are inherited unchanged" for precision.
TestNFDTopologyUpdater_OverlayCoverage enumerates every GPU overlay variant to catch overlays that accidentally lose NRT publishing on real-cluster leaves or accidentally turn it on for kind-chain leaves. The three slurm leaves (h100-eks-ubuntu-training-slurm, h100-gke-cos-training-slurm, h100-kind-training-slurm) were missing from the table — a pre-existing gap that this PR is the natural moment to close, since it adds one new slurm leaf and refactors the other two off the deleted platform-slurm mixins. Expected pattern matches the parent inheritance: EKS+GKE slurm leaves inherit topologyUpdater.enable=true from their h100-*-training parents; the kind slurm leaf stays off (no kubelet podResources socket on KWOK).
mchmarny
left a comment
There was a problem hiding this comment.
Clean refactor — moving slurm off the mixin merge path onto the same inline shape as dynamo-platform is the right call, the rationale in the description and YAML comments is thorough, and the live rehearsal (sinfo + srun -N2 nvidia-smi -L returning 16 H100 UUIDs) is solid evidence the GRES wiring actually works end-to-end. CI is green across all 142 checks.
Three findings, none blocking:
- Medium —
docs/integrator/recipe-development.mdline 118 overgeneralizes: claims "a platform's full component stack is declared inline … rather than via a platform mixin," butplatform-kubeflow/platform-inferenceare still active mixins and the example just above on the same page composesplatform-kubeflowviaspec.mixins. Scope the claim to slurm (and dynamo, if calling it out as precedent). - Nit — the GKE leaf's
K8s.server.version: ">= 1.32"constraint duplicates the parent. The EKS leaf's restatement is justified because it tightens to>= 1.32.4; here it's noise. - Nit — the ~20-line GRES rationale block is now duplicated verbatim across the EKS and GKE slurm leaves, and one of the references (
manifests/clusters/mars-prod/...) points at an internal NVIDIA repo that will read as opaque in the OSS tree.
Nothing here blocks merge — the medium finding is a doc accuracy issue, not a code defect.
| Mixins use `kind: RecipeMixin` and carry only `constraints` and `componentRefs`. They live in `recipes/mixins/` and are applied after inheritance chain merging. See [Data Architecture](../contributor/data.md#mixin-composition) for details. | ||
|
|
||
| A platform may split into multiple mixins when parts of the stack are independently opt-in. For example, `--platform slurm` resolves through two mixins: `platform-slurm` always contributes the SchedMD Slinky operator and CRDs, and `platform-slurm-cluster` is opt-in for the Slinky-managed Slurm cluster instance (Controller / LoginSet / NodeSet / RestApi). A leaf that wants operator-only composes just `platform-slurm`; a leaf that wants the cluster too composes both — see `recipes/overlays/h100-eks-ubuntu-training-slurm.yaml` for the latter. | ||
| A platform's full component stack is declared inline per leaf overlay rather than via a platform mixin. This lets each leaf carry its own hardware-specific tuning — most commonly GPU GRES strings and accelerator resource limits — without going through the mixin merge path. |
There was a problem hiding this comment.
This generalization is too broad — platform-kubeflow and platform-inference are still active mixins (used by h100-eks-ubuntu-training-kubeflow.yaml, eks-inference.yaml, etc.), and the example immediately above on this page (line ~107) shows platform-kubeflow being composed via spec.mixins. As written, the new paragraph contradicts that example and the actual state of recipes/mixins/.
Suggest scoping the claim to slurm (and dynamo, if you want to call it out as the precedent), e.g.:
Some platforms declare their full component stack inline per leaf overlay rather than via a platform mixin — most notably
--platform slurmand--platform dynamo, where each leaf carries hardware-specific tuning (GPU GRES strings, accelerator resource limits) that the mixin merge path can't represent cleanly. Other platforms (--platform kubeflow,--platform inference) still use mixins where leaf-specific tuning isn't required.
There was a problem hiding this comment.
Fixed in 7782ff7. Scoped the claim to slurm and dynamo, and explicitly preserved that kubeflow/inference still use platform-kubeflow / platform-inference mixins (referencing the example two paragraphs up). Wording is now consistent with the actual state of recipes/mixins/.
|
|
||
| constraints: | ||
| - name: K8s.server.version | ||
| value: ">= 1.32" |
There was a problem hiding this comment.
nit: this duplicates the parent h100-gke-cos-training.yaml's K8s.server.version: ">= 1.32". Same-name constraints get overridden during merge so it's harmless, but if it's not strengthening the floor it's noise. The EKS leaf bumps to >= 1.32.4 which justifies the restatement; here either drop it or tighten it.
There was a problem hiding this comment.
Dropped the restatement in 7782ff7. Replaced with a short comment noting the floor is inherited from h100-gke-cos-training.yaml (cf. the EKS leaf, which restates >= 1.32.4 to tighten its parent). Cleaner than leaving merge-noise in the constraint list.
| componentRefs: [] | ||
| componentRefs: | ||
| - name: slinky-slurm-operator-crds | ||
| type: Helm |
There was a problem hiding this comment.
The double-declaration commentary (extraConfMap.Gres + nvidia.com/gpu limit, with the explanation of why both are needed) is exactly the kind of context that future-you will thank present-you for. Nice.
Two micro-suggestions to consider — neither blocking:
- The
mars-prodfilesystem path (manifests/clusters/mars-prod/mars-dgxc-k8s-tgr-bwi-prd1/...) is from an internal NVIDIA repo and will be opaque to anyone reading this in the public OSS tree. Either qualify with "(internal NVIDIA reference cluster)" or drop the parenthetical — the rationale (requests auto-mirror to limits for extended resources) stands on its own. - The same ~20-line GRES rationale block is now copy-pasted verbatim in the GKE and EKS slurm leaves. If a future slurm leaf changes the wording in one and forgets the other they'll drift. Acceptable for two leaves; worth keeping an eye on if a third lands.
There was a problem hiding this comment.
Dropped the mars-prod parenthetical from both EKS and GKE slurm leaves in 7782ff7. Agreed it reads as opaque in the OSS tree; the auto-mirror rationale stands on its own. On the duplication: keeping both copies for now (matches your "acceptable for two leaves" call); if a third slurm leaf lands we can extract the GRES rationale into a shared comment fragment or a leaf-template.
- recipe-development.md: scope the "inline-per-leaf platform stack" claim to slurm and dynamo. The prior wording was too broad: platform-kubeflow and platform-inference are still active mixins (shown in the example two paragraphs up), and the dynamo-platform precedent already lives inline in *-inference-dynamo leaves. The new wording calls out slurm/dynamo as inline-per-leaf and explicitly preserves the mixin status of kubeflow/inference. - h100-gke-cos-training-slurm.yaml: drop the verbatim restatement of K8s.server.version (>= 1.32) from the parent. Same-name constraints get overridden during merge, so restating without strengthening is noise. The EKS slurm leaf restates >= 1.32.4 to tighten its parent; the GKE leaf has no tighter floor, so we inherit and add a one-line comment documenting why. - h100-eks-ubuntu-training-slurm.yaml and h100-gke-cos-training-slurm.yaml: drop the parenthetical pointer to mars-prod (manifests/clusters/mars-prod/...). That path lives in an internal NVIDIA repo and reads as opaque in the OSS tree. The rationale (Kubernetes auto-mirrors requests=limits for extended resources) stands on its own. Per Mark's review on NVIDIA#997.
Summary
Adds the first-class
h100-gke-cos-training-slurmleaf and refactors the existing EKS/Kind slurm leaves to inline the three Slinky components per leaf (mirroring thedynamo-platformpattern), unblocking leaf-level GPU GRES tuning that the oldplatform-slurm/platform-slurm-clustermixin shape made impractical.Motivation / Context
The original
platform-slurm/platform-slurm-clustermixins contributedslinky-slurm-operator-crds,slinky-slurm-operator, andslinky-slurmto every slurm leaf. Because mixins are merged after the inheritance chain (pkg/recipe/metadata_store.go:mergeMixins) and the mixin's own componentRefs set identity fields (type,valuesFile,dependencyRefs), any leaf that pre-declared the same component name to attachoverrideswould causemixinComponentRefSafeForMergeto reject the mixin's identity-field entry as a conflict against the chain-existing component.In practice this meant the only way to push leaf-specific GPU GRES (
extraConfMap.Gres=gpu:h100:8andslurmd.resources.limits.nvidia.com/gpu: 8) intonodesets.slinkywas at install time via--set slinkyslurm:..., which made the leaf incomplete:srun --gres=gpu:NsawGres=(null)from any straightaicr bundle.Switching to the same inline pattern used by
dynamo-platformacross*-inference-dynamoleaves moves Slinky off the mixin merge path and onto the standard inheritance/overlay path, so each leaf carries its own hardware-specific tuning as part of the recipe itself.This change was developed and rehearsed end-to-end on the
aicr-eidosxGKE cluster: full 17-component bundle deploy on 2x H100 nodes,sinfoshowing both NodeSet pods idle withgpu:h100:8, andsrun -N2 --gres=gpu:8 nvidia-smi -Lreturning 16 unique H100 UUIDs across both hosts.Fixes: N/A
Related: NKX-9586, #948 (initial slinky-slurm chart + EKS/Kind leaves)
Type of Change
Component(s) Affected
cmd/aicr,pkg/cli)cmd/aicrd,pkg/api,pkg/server)pkg/recipe)pkg/bundler,pkg/component/*)pkg/collector,pkg/snapshotter)pkg/validator)pkg/errors,pkg/k8s)docs/,examples/)recipes/overlays,recipes/mixins,recipes/registry.yaml)Implementation Notes
Why inline instead of relaxing the mixin guard? The
mixinComponentRefSafeForMergeguard exists for a reason (ADR-005's "no silent chart identity override" mitigation): mixin-contributed components must be replaceable wholesale, not partially merged, to keep the mixin contract simple. Thedynamo-platformfamily already established the precedent of declaring platform components inline per leaf — this PR brings slurm in line with that pattern rather than weakening the guard.Why omit
source/version/valuesFileon the slinkycomponentRefs? Unlikedynamo-platform, the Slinky charts are not declared inbase.yaml. They live only inregistry.yamlwithdefaultRepository/defaultVersion/defaultChart, andApplyRegistryDefaults(pkg/recipe/metadata.go:156-184) populates them at recipe-resolve time. The registry comment onslinky-slurmexplicitly documentsdefaultVersionas the single source of truth (one-line bumps instead of touching every leaf). This shape diverges from dynamo's inlining but is engine-supported and reduces leaf duplication.GRES wiring requires two fields, not one. The chart does not derive
Gres=inslurm.conffrom pod resource limits, so leaves must declare GRES in both places:nodesets.slinky.extraConfMap.Gres(tells slurmctld which GPUs slurmd has) andnodesets.slinky.slurmd.resources.limits.nvidia.com/gpu(reserves the GPUs from the device plugin sogres.conf'sAutoDetect=nvidiafinds them). Both are documented inline in the GKE leaf with a cross-reference to the mars-prod reference values.API reference docs gap. While updating component docs I noticed
slinky-slurm(the cluster instance) was missing fromdocs/user/api-reference.md's bundler table — a pre-existing gap from when it was opt-in via the deletedplatform-slurm-clustermixin. Added in the same refactor commit so the docs match the inline reality.Out of scope (logged separately): multi-node NCCL via Pyxis/Enroot requires swapping to
ghcr.io/slinkyproject/slurmd-pyxisand configuringplugstack.conf; not attempted here.Testing
End-to-end rehearsal on
aicr-eidosx(GKE, 2x H100):Risk Assessment
helm upgrade --installbut no manifest churn.Rollout notes: No data migration. The deleted mixins were always paired (no leaf ever used
platform-slurmwithoutplatform-slurm-cluster), so removing them does not regress any composition that was actually used. Any out-of-tree leaf still referencing the mixins will fail recipe validation with a clear error and needs a one-time port to the inline shape — see the EKS leaf for the template.Checklist
make testwith-race)make lint) — exceptlint-go(local Go-version build mismatch ongolangci-lint; CI runs pinned version)deployment_order_guard_test.go)recipe-development.md,component-catalog.md,api-reference.md)dynamo-platforminline pattern)git commit -S) — verify before pushing