Skip to content

feat(slinky-slurm): add h100-gke-cos leaf and inline platform refs#997

Open
faganihajizada wants to merge 9 commits into
NVIDIA:mainfrom
faganihajizada:feat/slinky-slurm-gke-leaf-and-inline-refactor
Open

feat(slinky-slurm): add h100-gke-cos leaf and inline platform refs#997
faganihajizada wants to merge 9 commits into
NVIDIA:mainfrom
faganihajizada:feat/slinky-slurm-gke-leaf-and-inline-refactor

Conversation

@faganihajizada
Copy link
Copy Markdown
Contributor

@faganihajizada faganihajizada commented May 21, 2026

Summary

Adds the first-class h100-gke-cos-training-slurm leaf and refactors the existing EKS/Kind slurm leaves to inline the three Slinky components per leaf (mirroring the dynamo-platform pattern), unblocking leaf-level GPU GRES tuning that the old platform-slurm / platform-slurm-cluster mixin shape made impractical.

Motivation / Context

The original platform-slurm / platform-slurm-cluster mixins contributed slinky-slurm-operator-crds, slinky-slurm-operator, and slinky-slurm to every slurm leaf. Because mixins are merged after the inheritance chain (pkg/recipe/metadata_store.go:mergeMixins) and the mixin's own componentRefs set identity fields (type, valuesFile, dependencyRefs), any leaf that pre-declared the same component name to attach overrides would cause mixinComponentRefSafeForMerge to reject the mixin's identity-field entry as a conflict against the chain-existing component.

In practice this meant the only way to push leaf-specific GPU GRES (extraConfMap.Gres=gpu:h100:8 and slurmd.resources.limits.nvidia.com/gpu: 8) into nodesets.slinky was at install time via --set slinkyslurm:..., which made the leaf incomplete: srun --gres=gpu:N saw Gres=(null) from any straight aicr bundle.

Switching to the same inline pattern used by dynamo-platform across *-inference-dynamo leaves moves Slinky off the mixin merge path and onto the standard inheritance/overlay path, so each leaf carries its own hardware-specific tuning as part of the recipe itself.

This change was developed and rehearsed end-to-end on the aicr-eidosx GKE cluster: full 17-component bundle deploy on 2x H100 nodes, sinfo showing both NodeSet pods idle with gpu:h100:8, and srun -N2 --gres=gpu:8 nvidia-smi -L returning 16 unique H100 UUIDs across both hosts.

Fixes: N/A
Related: NKX-9586, #948 (initial slinky-slurm chart + EKS/Kind leaves)

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • Build/CI/tooling

Component(s) Affected

  • CLI (cmd/aicr, pkg/cli)
  • API server (cmd/aicrd, pkg/api, pkg/server)
  • Recipe engine / data (pkg/recipe)
  • Bundlers (pkg/bundler, pkg/component/*)
  • Collectors / snapshotter (pkg/collector, pkg/snapshotter)
  • Validator (pkg/validator)
  • Core libraries (pkg/errors, pkg/k8s)
  • Docs/examples (docs/, examples/)
  • Other: recipes (recipes/overlays, recipes/mixins, recipes/registry.yaml)

Implementation Notes

Why inline instead of relaxing the mixin guard? The mixinComponentRefSafeForMerge guard exists for a reason (ADR-005's "no silent chart identity override" mitigation): mixin-contributed components must be replaceable wholesale, not partially merged, to keep the mixin contract simple. The dynamo-platform family already established the precedent of declaring platform components inline per leaf — this PR brings slurm in line with that pattern rather than weakening the guard.

Why omit source / version / valuesFile on the slinky componentRefs? Unlike dynamo-platform, the Slinky charts are not declared in base.yaml. They live only in registry.yaml with defaultRepository / defaultVersion / defaultChart, and ApplyRegistryDefaults (pkg/recipe/metadata.go:156-184) populates them at recipe-resolve time. The registry comment on slinky-slurm explicitly documents defaultVersion as the single source of truth (one-line bumps instead of touching every leaf). This shape diverges from dynamo's inlining but is engine-supported and reduces leaf duplication.

GRES wiring requires two fields, not one. The chart does not derive Gres= in slurm.conf from pod resource limits, so leaves must declare GRES in both places: nodesets.slinky.extraConfMap.Gres (tells slurmctld which GPUs slurmd has) and nodesets.slinky.slurmd.resources.limits.nvidia.com/gpu (reserves the GPUs from the device plugin so gres.conf's AutoDetect=nvidia finds them). Both are documented inline in the GKE leaf with a cross-reference to the mars-prod reference values.

API reference docs gap. While updating component docs I noticed slinky-slurm (the cluster instance) was missing from docs/user/api-reference.md's bundler table — a pre-existing gap from when it was opt-in via the deleted platform-slurm-cluster mixin. Added in the same refactor commit so the docs match the inline reality.

Out of scope (logged separately): multi-node NCCL via Pyxis/Enroot requires swapping to ghcr.io/slinkyproject/slurmd-pyxis and configuring plugstack.conf; not attempted here.

Testing

# Recipe + ordering tests
go test ./pkg/recipe/... -count=1 -short
# ok  github.com/NVIDIA/aicr/pkg/recipe        1.188s
# ok  github.com/NVIDIA/aicr/pkg/recipe/oskind 0.389s

# Full short test suite (all pkg/...)
go test ./pkg/... -count=1 -short
# All 30+ packages: ok

# Lint sweep (lint-go skipped: local golangci-lint built with go1.25,
# module targets go 1.26.3; CI runs the pinned version)
make lint-yaml license check-docs-filenames check-docs-mdx \
     bom-pinning-check check-agents-sync bom-check
# All pass

End-to-end rehearsal on aicr-eidosx (GKE, 2x H100):

$ sinfo
PARTITION  AVAIL  TIMELIMIT  NODES  STATE NODELIST
all*          up   infinite      2   idle slinky-[0-1]

$ srun -N2 --gres=gpu:8 nvidia-smi -L | sort -u | wc -l
16   # 16 unique H100 UUIDs across both slurmd pods

Risk Assessment

  • Low — Isolated to slurm leaves and the platform-slurm mixins; cohabitant components in mixed bundles are untouched. Rehearsed in a live cluster against the full 17-component AICR bundle; cohabitant Helm releases received a revision bump from helm upgrade --install but no manifest churn.
  • Medium — Touches multiple components or has broader impact
  • High — Breaking change, affects critical paths, or complex rollout

Rollout notes: No data migration. The deleted mixins were always paired (no leaf ever used platform-slurm without platform-slurm-cluster), so removing them does not regress any composition that was actually used. Any out-of-tree leaf still referencing the mixins will fail recipe validation with a clear error and needs a one-time port to the inline shape — see the EKS leaf for the template.

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint) — except lint-go (local Go-version build mismatch on golangci-lint; CI runs pinned version)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality (deployment_order_guard_test.go)
  • I updated docs if user-facing behavior changed (recipe-development.md, component-catalog.md, api-reference.md)
  • Changes follow existing patterns in the codebase (mirrors dynamo-platform inline pattern)
  • Commits are cryptographically signed (git commit -S) — verify before pushing

Move slinky-slurm-operator-crds, slinky-slurm-operator, and slinky-slurm
from the platform-slurm / platform-slurm-cluster mixins into inline
componentRefs on each slurm leaf, mirroring the dynamo-platform pattern
used by *-inference-dynamo leaves.

The mixinComponentRefSafeForMerge guard in pkg/recipe/metadata_store.go
blocks leaf-level overrides on identity fields (source, version,
valuesFile) of mixin-contributed components, which made it impossible to
declare leaf-specific GPU GRES tuning (extraConfMap.Gres and
slurmd.resources.limits.nvidia.com/gpu) from the leaf overlays. Inlining
the three Slinky charts per slurm leaf moves them off the mixin merge
path and onto the standard componentRef overlay path, unblocking the
GRES wiring that srun --gres=gpu:N requires.

Changes:
- recipes/overlays/h100-eks-ubuntu-training-slurm.yaml: inline the three
  Slinky componentRefs with H100 GRES overrides on slinky-slurm
- recipes/overlays/h100-kind-training-slurm.yaml: inline the three
  Slinky componentRefs (CPU-only, no GRES overrides)
- recipes/mixins/platform-slurm.yaml,
  recipes/mixins/platform-slurm-cluster.yaml: removed; no remaining
  references in the live tree (only a historical CHANGELOG mention)
- recipes/registry.yaml: update slinky-slurm comment to point to the
  inline pattern; defaults still drive source/version/namespace
- docs/integrator/recipe-development.md, docs/user/component-catalog.md:
  rewrite the slurm-platform section to describe the inline pattern
- docs/user/api-reference.md: add slinky-slurm to the bundler component
  table (pre-existing gap from when the cluster was mixin-opt-in)
First-class GKE/COS leaf for H100 training with the Slinky-managed
Slurm cluster (Controller, LoginSet, NodeSet, RestApi). Inherits from
h100-gke-cos-training and inlines the slinky-slurm trio
(slinky-slurm-operator-crds, slinky-slurm-operator, slinky-slurm) with
the same H100 GRES overrides as the EKS slurm leaf, so srun --gres=gpu:N
allocates real GPUs (Gres=gpu:h100:8 in slurm.conf, nvidia.com/gpu: 8
on the slurmd pod).

Unlike the EKS slurm leaf, no os-* mixin is composed: gke-cos.yaml
already disables GPU driver installation and pins DRA / nodewright
paths for COS's read-only rootfs.

Drops the K8s-native NCCL bandwidth perf check inherited from
h100-gke-cos-training: on a Slinky cluster the equivalent validation is
a slurm-launched srun job (e.g. nccl-tests/all_reduce_perf) that
exercises slurmd + the GPUDirect TCPXO plugin from the parent leaf.
Conformance and deployment checks are inherited unchanged.

Adds a deployment-order guard test case mirroring the existing
h100-kind-training-slurm and h100-eks-ubuntu-training-slurm cases,
asserting CRDs precede the operator, the operator precedes the cluster,
cert-manager precedes the operator, and gpu-operator precedes nvsentinel.
@github-actions
Copy link
Copy Markdown
Contributor

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 21, 2026

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

This PR replaces centralized Slurm mixins with inline per-leaf Slinky Slurm componentRefs across overlays. Existing EKS and Kind Slurm overlays are updated to declare CRDs, operator, and cluster inline with leaf-specific GRES/override settings and DRA constraints. A new GKE/COS/H100 training Slurm overlay is added. Documentation, registry comments, and the API components list are updated to reflect the inline-per-leaf pattern. Tests are extended to cover NFD topology updater expectations and deployment-order guards for the new overlays.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

enhancement, size/M, area/tests, area/bundler

Suggested reviewers

  • mchmarny
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The PR title accurately captures the two main changes: adding the h100-gke-cos leaf and refactoring to inline platform refs, directly matching the changeset.
Description check ✅ Passed The description comprehensively explains the changeset, motivation, implementation details, testing, and risk assessment, all directly related to the code changes shown.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/integrator/recipe-development.md`:
- Line 118: Split the dense paragraph into shorter sentences or a bulleted list
to improve readability: explain each concept separately (1) that a platform's
full component stack is declared inline per leaf overlay rather than via a
platform mixin, (2) that leaves can carry hardware-specific tuning like GPU
GRES, resource limits, and partition layout, (3) the concrete example that
`--platform slurm` leaves inline the SchedMD Slinky operator CRDs and the
operator and the Slinky-managed Slurm cluster as three `componentRefs` (same
shape `dynamo-platform` uses across `*-inference-dynamo` leaves), (4) the
variations where a leaf can inline only `slinky-slurm-operator-crds` +
`slinky-slurm-operator` to get the operator only, or add `slinky-slurm` with
leaf-specific `overrides` to get the full cluster, and (5) reference the example
file `recipes/overlays/h100-eks-ubuntu-training-slurm.yaml`; convert each
numbered concept into its own short sentence or bullet so the paragraph is
scan-friendly.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: ac0ba934-f154-4860-95ab-06efc451df91

📥 Commits

Reviewing files that changed from the base of the PR and between 767b877 and 0b98bb7.

📒 Files selected for processing (10)
  • docs/integrator/recipe-development.md
  • docs/user/api-reference.md
  • docs/user/component-catalog.md
  • pkg/recipe/deployment_order_guard_test.go
  • recipes/mixins/platform-slurm-cluster.yaml
  • recipes/mixins/platform-slurm.yaml
  • recipes/overlays/h100-eks-ubuntu-training-slurm.yaml
  • recipes/overlays/h100-gke-cos-training-slurm.yaml
  • recipes/overlays/h100-kind-training-slurm.yaml
  • recipes/registry.yaml
💤 Files with no reviewable changes (2)
  • recipes/mixins/platform-slurm-cluster.yaml
  • recipes/mixins/platform-slurm.yaml

Comment thread docs/integrator/recipe-development.md Outdated
The single dense paragraph introduced in the slinky-slurm refactor was
hard to scan. Break it into a short lead sentence, a bulleted list of
the three inlined componentRefs (CRDs / operator / cluster), and a
trailing line that points at dynamo-platform as the precedent and
h100-eks-ubuntu-training-slurm.yaml as the worked example.

Drops the speculative "partition layout" tuning example and the
operator-only leaf variation: neither is exercised by any leaf in tree
today, and documenting hypotheticals risks the same drift problem
catalog-md hit before (NKX-10404).
Copy link
Copy Markdown
Contributor

@ArangoGutierrez ArangoGutierrez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving — the refactor is well-motivated and well-documented.

Summary: The inline shape mirrors the established dynamo-platform precedent, the PR description traces the engine constraint (mixinComponentRefSafeForMerge identity-field guard) precisely, and the end-to-end evidence on aicr-eidosx (sinfo + 16 unique H100 UUIDs across both slurmd pods) is concrete. Nice catch on adding slinky-slurm to the docs/user/api-reference.md bundler table — that was a pre-existing gap from when it was opt-in via the deleted cluster mixin.

Verified locally at HEAD e51b4b16:

  • TestDeploymentOrderGuards passes all 12 cases including the new h100-gke-cos-training-slurm.
  • go test ./pkg/recipe/... -short green.
  • All function references in the PR (mergeMixins, mixinComponentRefSafeForMerge, ApplyRegistryDefaults) exist at the cited locations.
  • No orphaned consumers of the deleted mixins outside of stale planning docs in docs/superpowers/.

One consistency question on the EKS vs GKE NCCL validation handling (inline), plus a non-blocking suggestion about extending TestNFDTopologyUpdater_OverlayCoverage to cover slurm leaves (pre-existing gap, not introduced by this PR).

Comment thread recipes/overlays/h100-gke-cos-training-slurm.yaml Outdated
Comment thread recipes/overlays/h100-gke-cos-training-slurm.yaml
Comment thread pkg/recipe/deployment_order_guard_test.go
The GKE slurm leaf already overrides validation.performance.checks to
[] with the rationale that on a Slinky-managed cluster the K8s-native
nccl-all-reduce-bw measures the wrong path — it launches a Pod past
slurmd, so the bandwidth number reflects raw cluster networking, not
what an `srun nccl-tests/all_reduce_perf` would see going through
slurmd + the configured fabric (TCPXO on GKE, EFA on EKS).

The argument generalizes to any Slinky cluster, so apply the same
override on the EKS slurm leaf. Tighten the prose on both files: the
old comment said "Conformance checks are inherited unchanged" but
deployment checks are also inherited; replace with "Deployment and
conformance checks are inherited unchanged" for precision.
TestNFDTopologyUpdater_OverlayCoverage enumerates every GPU overlay
variant to catch overlays that accidentally lose NRT publishing on
real-cluster leaves or accidentally turn it on for kind-chain leaves.
The three slurm leaves (h100-eks-ubuntu-training-slurm,
h100-gke-cos-training-slurm, h100-kind-training-slurm) were missing
from the table — a pre-existing gap that this PR is the natural moment
to close, since it adds one new slurm leaf and refactors the other
two off the deleted platform-slurm mixins.

Expected pattern matches the parent inheritance: EKS+GKE slurm leaves
inherit topologyUpdater.enable=true from their h100-*-training
parents; the kind slurm leaf stays off (no kubelet podResources socket
on KWOK).
Copy link
Copy Markdown
Member

@mchmarny mchmarny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clean refactor — moving slurm off the mixin merge path onto the same inline shape as dynamo-platform is the right call, the rationale in the description and YAML comments is thorough, and the live rehearsal (sinfo + srun -N2 nvidia-smi -L returning 16 H100 UUIDs) is solid evidence the GRES wiring actually works end-to-end. CI is green across all 142 checks.

Three findings, none blocking:

  • Mediumdocs/integrator/recipe-development.md line 118 overgeneralizes: claims "a platform's full component stack is declared inline … rather than via a platform mixin," but platform-kubeflow / platform-inference are still active mixins and the example just above on the same page composes platform-kubeflow via spec.mixins. Scope the claim to slurm (and dynamo, if calling it out as precedent).
  • Nit — the GKE leaf's K8s.server.version: ">= 1.32" constraint duplicates the parent. The EKS leaf's restatement is justified because it tightens to >= 1.32.4; here it's noise.
  • Nit — the ~20-line GRES rationale block is now duplicated verbatim across the EKS and GKE slurm leaves, and one of the references (manifests/clusters/mars-prod/...) points at an internal NVIDIA repo that will read as opaque in the OSS tree.

Nothing here blocks merge — the medium finding is a doc accuracy issue, not a code defect.

Comment thread docs/integrator/recipe-development.md Outdated
Mixins use `kind: RecipeMixin` and carry only `constraints` and `componentRefs`. They live in `recipes/mixins/` and are applied after inheritance chain merging. See [Data Architecture](../contributor/data.md#mixin-composition) for details.

A platform may split into multiple mixins when parts of the stack are independently opt-in. For example, `--platform slurm` resolves through two mixins: `platform-slurm` always contributes the SchedMD Slinky operator and CRDs, and `platform-slurm-cluster` is opt-in for the Slinky-managed Slurm cluster instance (Controller / LoginSet / NodeSet / RestApi). A leaf that wants operator-only composes just `platform-slurm`; a leaf that wants the cluster too composes both — see `recipes/overlays/h100-eks-ubuntu-training-slurm.yaml` for the latter.
A platform's full component stack is declared inline per leaf overlay rather than via a platform mixin. This lets each leaf carry its own hardware-specific tuning — most commonly GPU GRES strings and accelerator resource limits — without going through the mixin merge path.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This generalization is too broad — platform-kubeflow and platform-inference are still active mixins (used by h100-eks-ubuntu-training-kubeflow.yaml, eks-inference.yaml, etc.), and the example immediately above on this page (line ~107) shows platform-kubeflow being composed via spec.mixins. As written, the new paragraph contradicts that example and the actual state of recipes/mixins/.

Suggest scoping the claim to slurm (and dynamo, if you want to call it out as the precedent), e.g.:

Some platforms declare their full component stack inline per leaf overlay rather than via a platform mixin — most notably --platform slurm and --platform dynamo, where each leaf carries hardware-specific tuning (GPU GRES strings, accelerator resource limits) that the mixin merge path can't represent cleanly. Other platforms (--platform kubeflow, --platform inference) still use mixins where leaf-specific tuning isn't required.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 7782ff7. Scoped the claim to slurm and dynamo, and explicitly preserved that kubeflow/inference still use platform-kubeflow / platform-inference mixins (referencing the example two paragraphs up). Wording is now consistent with the actual state of recipes/mixins/.


constraints:
- name: K8s.server.version
value: ">= 1.32"
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this duplicates the parent h100-gke-cos-training.yaml's K8s.server.version: ">= 1.32". Same-name constraints get overridden during merge so it's harmless, but if it's not strengthening the floor it's noise. The EKS leaf bumps to >= 1.32.4 which justifies the restatement; here either drop it or tighten it.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dropped the restatement in 7782ff7. Replaced with a short comment noting the floor is inherited from h100-gke-cos-training.yaml (cf. the EKS leaf, which restates >= 1.32.4 to tighten its parent). Cleaner than leaving merge-noise in the constraint list.

componentRefs: []
componentRefs:
- name: slinky-slurm-operator-crds
type: Helm
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The double-declaration commentary (extraConfMap.Gres + nvidia.com/gpu limit, with the explanation of why both are needed) is exactly the kind of context that future-you will thank present-you for. Nice.

Two micro-suggestions to consider — neither blocking:

  1. The mars-prod filesystem path (manifests/clusters/mars-prod/mars-dgxc-k8s-tgr-bwi-prd1/...) is from an internal NVIDIA repo and will be opaque to anyone reading this in the public OSS tree. Either qualify with "(internal NVIDIA reference cluster)" or drop the parenthetical — the rationale (requests auto-mirror to limits for extended resources) stands on its own.
  2. The same ~20-line GRES rationale block is now copy-pasted verbatim in the GKE and EKS slurm leaves. If a future slurm leaf changes the wording in one and forgets the other they'll drift. Acceptable for two leaves; worth keeping an eye on if a third lands.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dropped the mars-prod parenthetical from both EKS and GKE slurm leaves in 7782ff7. Agreed it reads as opaque in the OSS tree; the auto-mirror rationale stands on its own. On the duplication: keeping both copies for now (matches your "acceptable for two leaves" call); if a third slurm leaf lands we can extract the GRES rationale into a shared comment fragment or a leaf-template.

- recipe-development.md: scope the "inline-per-leaf platform
  stack" claim to slurm and dynamo. The prior wording was
  too broad: platform-kubeflow and platform-inference are
  still active mixins (shown in the example two paragraphs
  up), and the dynamo-platform precedent already lives
  inline in *-inference-dynamo leaves. The new wording calls
  out slurm/dynamo as inline-per-leaf and explicitly
  preserves the mixin status of kubeflow/inference.

- h100-gke-cos-training-slurm.yaml: drop the verbatim
  restatement of K8s.server.version (>= 1.32) from the
  parent. Same-name constraints get overridden during merge,
  so restating without strengthening is noise. The EKS slurm
  leaf restates >= 1.32.4 to tighten its parent; the GKE
  leaf has no tighter floor, so we inherit and add a
  one-line comment documenting why.

- h100-eks-ubuntu-training-slurm.yaml and
  h100-gke-cos-training-slurm.yaml: drop the parenthetical
  pointer to mars-prod (manifests/clusters/mars-prod/...).
  That path lives in an internal NVIDIA repo and reads as
  opaque in the OSS tree. The rationale (Kubernetes
  auto-mirrors requests=limits for extended resources)
  stands on its own.

Per Mark's review on NVIDIA#997.
@faganihajizada faganihajizada requested a review from mchmarny May 22, 2026 12:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants