feat(slinky-slurm): add h100-gke-cos leaf and inline platform refs #997
base: main
Changes from all commits
18576db
8538fc7
0b98bb7
e51b4b1
402a31d
2d6614c
24093c0
7782ff7
8af0ccd
This file was deleted.
This file was deleted.
@@ -19,9 +19,10 @@ metadata: | |
|
|
||
| spec: | ||
| # H100 + EKS + Ubuntu + training with the Slinky operator and a | ||
| # Slinky-managed Slurm cluster. EKS-specific cluster tuning (gp3 | ||
| # storage, GPU GRES, DCGM job-mapping) is layered at install time | ||
| # via `aicr bundle ... --set slurmcluster:...` or a valuesFile. | ||
| # Slinky-managed Slurm cluster. H100 GPU GRES is declared inline below | ||
| # (pod-side limit + slurmd-side Gres= line); remaining EKS-specific | ||
| # tuning (gp3 storage, DCGM job-mapping) is layered at install time | ||
| # via `aicr bundle ... --set slinkyslurm:...` or a valuesFile. | ||
| base: h100-eks-ubuntu-training | ||
|
|
||
| criteria: | ||
|
|
@@ -33,18 +34,67 @@ spec: | |
|
|
||
| mixins: | ||
| - os-ubuntu | ||
| - platform-slurm | ||
| - platform-slurm-cluster | ||
|
|
||
| constraints: | ||
| - name: K8s.server.version | ||
| value: ">= 1.32.4" | ||
|
|
||
| # Mixin-contributed components cannot be overridden from a leaf; use | ||
| # `--set slurmcluster:...` or a valuesFile at install time instead. | ||
| # The Slinky operator (CRDs + operator + cluster instance) is declared | ||
| # inline per slurm leaf, mirroring the dynamo-platform pattern in | ||
| # h100-*-inference-dynamo leaves. Inlining lets each leaf carry its | ||
| # own GPU/GRES tuning without fighting the mixin-vs-leaf identity-field | ||
| # guard in mixinComponentRefSafeForMerge (pkg/recipe/metadata_store.go), | ||
| # and keeps base.yaml free of platform-specific components. | ||
| # | ||
| # GPU GRES on slinky-slurm must be declared in two places because the | ||
| # chart does not derive Gres= in slurm.conf from pod resource limits | ||
| # (see comment in components/slinky-slurm/values.yaml): | ||
| # 1. nodesets.slinky.extraConfMap.Gres — adds `Gres=gpu:h100:8` to | ||
| # slurmd's --conf so slurmctld knows it has GPUs to allocate via | ||
| # `srun --gres=gpu:N`. | ||
| # 2. nodesets.slinky.slurmd.resources.limits.nvidia.com/gpu — reserves | ||
| # 8 H100s on the slurmd pod so the NVIDIA device plugin injects | ||
| # /dev/nvidia* into the container. Without this, `gres.conf`'s | ||
| # AutoDetect=nvidia finds nothing. `requests` is omitted: Kubernetes | ||
| # auto-mirrors requests=limits for extended resources. | ||
| # Accelerated nodeSelector/tolerations on slurmd are injected via the | ||
| # registry's nodesets.slinky.podSpec.{nodeSelector,tolerations} paths. | ||
| componentRefs: [] | ||
| componentRefs: | ||
| - name: slinky-slurm-operator-crds | ||
| type: Helm | ||
|
Member
The double-declaration commentary (extraConfMap.Gres + nvidia.com/gpu limit, with the explanation of why both are needed) is exactly the kind of context that future-you will thank present-you for. Nice. Two micro-suggestions to consider, neither blocking:
Contributor
Author
Dropped the |
||
| valuesFile: components/slinky-slurm-operator-crds/values.yaml | ||
|
|
||
| # Validation is inherited from h100-eks-training (operator-health, | ||
| # expected-resources, gpu-operator-version, check-nvidia-smi). | ||
| - name: slinky-slurm-operator | ||
| type: Helm | ||
| valuesFile: components/slinky-slurm-operator/values.yaml | ||
| dependencyRefs: | ||
| - cert-manager | ||
| - slinky-slurm-operator-crds | ||
|
|
||
| - name: slinky-slurm | ||
| type: Helm | ||
| valuesFile: components/slinky-slurm/values.yaml | ||
| dependencyRefs: | ||
| - slinky-slurm-operator | ||
| - slinky-slurm-operator-crds | ||
| overrides: | ||
| nodesets: | ||
| slinky: | ||
| extraConfMap: | ||
| Gres: "gpu:h100:8" | ||
| slurmd: | ||
| resources: | ||
| limits: | ||
| nvidia.com/gpu: 8 | ||
|
|
||
| # K8s-native nccl-all-reduce-bw is dropped on Slinky leaves: that | ||
| # check launches a Pod against the cluster scheduler, so on a | ||
| # Slinky-managed cluster it bypasses slurmd entirely and measures the | ||
| # wrong path. The equivalent signal here is a slurm-launched | ||
| # `srun nccl-tests/all_reduce_perf` that goes through slurmd + the | ||
| # EFA libfabric stack already present on the parent EKS leaf. | ||
| # Deployment and conformance checks are inherited unchanged. | ||
| validation: | ||
| performance: | ||
| checks: [] | ||
| constraints: [] | ||
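
Because the leaf declares GPU GRES in two places (`extraConfMap.Gres` and the `nvidia.com/gpu` limit on slurmd), the two counts can drift apart. The following is a minimal sketch of a consistency check, not part of this PR: the helper names `parse_gres` and `gres_matches_gpu_limit` are hypothetical, the dictionary simply mirrors the `overrides` block of the `slinky-slurm` componentRef, and the parser only handles the simple `name[:type]:count` form with an integer count.

```python
import re
from typing import Optional, Tuple

# Matches the simple Slurm GRES form name[:type]:count, e.g. "gpu:h100:8".
# Assumption: integer counts only; comma-separated lists and suffixes are not handled.
GRES_PATTERN = re.compile(
    r"^(?P<name>[^:]+)(?::(?P<type>[^:]+))?:(?P<count>\d+)$"
)


def parse_gres(gres: str) -> Tuple[str, Optional[str], int]:
    """Split a GRES string into (name, type, count); type may be None."""
    match = GRES_PATTERN.match(gres.strip())
    if match is None:
        raise ValueError(f"unsupported GRES string: {gres!r}")
    return match["name"], match["type"], int(match["count"])


def gres_matches_gpu_limit(overrides: dict) -> bool:
    """Return True when the Gres count equals the slurmd pod's GPU limit."""
    nodeset = overrides["nodesets"]["slinky"]
    _, _, gres_count = parse_gres(nodeset["extraConfMap"]["Gres"])
    gpu_limit = int(nodeset["slurmd"]["resources"]["limits"]["nvidia.com/gpu"])
    return gres_count == gpu_limit


# Mirrors the overrides declared on the slinky-slurm componentRef in this leaf.
leaf_overrides = {
    "nodesets": {
        "slinky": {
            "extraConfMap": {"Gres": "gpu:h100:8"},
            "slurmd": {"resources": {"limits": {"nvidia.com/gpu": 8}}},
        }
    }
}

if __name__ == "__main__":
    print(gres_matches_gpu_limit(leaf_overrides))
```

A check along these lines could run against the rendered values before install, so that a leaf changing one side of the declaration without the other is caught early.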