Skip to content

fix(recipes): migrate GB200 kernel-module-params to preManifestFiles#868

Merged
mchmarny merged 2 commits into
mainfrom
fix/859-migrate-manifestfiles-to-pre
May 13, 2026
Merged

fix(recipes): migrate GB200 kernel-module-params to preManifestFiles#868
mchmarny merged 2 commits into
mainfrom
fix/859-migrate-manifestfiles-to-pre

Conversation

@mchmarny
Copy link
Copy Markdown
Member

Summary

Moves the kernel-module-params.yaml entry on gb200-eks-{training,inference} from manifestFiles to preManifestFiles so it lands before helm install --wait gpu-operator instead of after.

Motivation / Context

manifestFiles always emits to a NNN-<name>-post/ folder applied after the chart. The nvidia-kernel-module-params ConfigMap on GB200/EKS is prereq-shaped — the gpu-operator chart's ClusterPolicy reconcile reads it during helm install --wait — so post-install ordering deadlocks the chart for ~52 min on every fresh cluster (5×10m retries, then terminal failure with ConfigMap "nvidia-kernel-module-params" not found).

PR #856 added the preManifestFiles field and the bundler -pre folder emission needed to fix this; this PR migrates the only two entries that issue #859 explicitly called out as broken.

Fixes #859
Related: #856 (introduced preManifestFiles), #706 (introduced the post-only regression), #668 (added the affected entry)

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • Build/CI/tooling

Component(s) Affected

  • CLI (cmd/aicr, pkg/cli)
  • API server (cmd/aicrd, pkg/api, pkg/server)
  • Recipe engine / data (pkg/recipe)
  • Bundlers (pkg/bundler, pkg/component/*)
  • Collectors / snapshotter (pkg/collector, pkg/snapshotter)
  • Validator (pkg/validator)
  • Core libraries (pkg/errors, pkg/k8s)
  • Docs/examples (docs/, examples/)
  • Other: recipes/overlays/gb200-eks-{training,inference}.yaml

Implementation Notes

Two-line YAML change per file: manifestFiles:preManifestFiles:, with a comment line that records why the ordering matters (so the next person who sees the entry doesn't move it back).

Audit of other manifestFiles call sites

Issue #859 asked for an audit of every other caller. Each shipped manifest was inspected; none need migration under current behavior:

Site Shipped kind(s) Classification
base.yaml / gke-cos.yaml gpu-operator → dcgm-exporter.yaml, gke-resource-quota.yaml ConfigMap, ResourceQuota Post — chart tolerates absent ConfigMap (image has default config); ResourceQuota is namespace-scope, post-install is fine. Has been shipping post since #706 without reports.
h100-eks-{training,inference}.yaml, h100-gke-cos-{training,inference}.yaml, gb200-eks-{training,inference}.yaml nodewright-customizations → tuning*.yaml Skyhook (skyhook.nvidia.com/v1alpha1) Post — Skyhook CRD is delivered by the nodewright-operator chart; CR must apply after the CRD exists.
h100-gke-cos-training.yaml gke-nccl-tcpxo → nccl-tcpxo-installer.yaml, nri-device-injector.yaml DaemonSet × 2 Post — core API, no chart-side reconcile dependency.
aks.yaml network-operator → nfd-network-rule.yaml, nic-cluster-policy-aks.yaml, ib-node-config-aks.yaml NodeFeatureRule, NicClusterPolicy, ConfigMap+DaemonSet Post — NicClusterPolicy CRD comes from the network-operator chart; the rest are core/nfd API with no reconcile dependency.
platform-inference.yaml kgateway-crds → gateway-api-crds.yaml, inference-extension-crds.yaml CustomResourceDefinition × N Post — these are extra CRDs alongside the chart's own; downstream kgateway chart serializes via dependencyRefs, not order within this component.
platform-inference.yaml kgateway → inference-gateway.yaml GatewayParameters, Gateway Post — depends on CRDs from kgateway-crds.
h100-gke-cos-training-kubeflow.yaml, platform-kubeflow.yaml kubeflow-trainer → torch-distributed-cluster-training-runtime.yaml ClusterTrainingRuntime Post — CRD comes from kubeflow-trainer chart.

If any of the "tolerant" entries (e.g., dcgm-exporter.yaml) is later found to deadlock a reconcile, the migration is the same one-field swap.

Verification

Built locally and generated a GB200/EKS bundle:

$ aicr bundle --recipe gb200-recipe.yaml -o /tmp/gb200-bundle
$ ls /tmp/gb200-bundle/
...
009-gpu-operator-pre        ← templates/kernel-module-params.yaml (NEW)
010-gpu-operator            ← upstream chart, helm install --wait
011-gpu-operator-post       ← unchanged
...

The ConfigMap now lands one folder before the chart's --wait, eliminating the deadlock.

Testing

unset GITLAB_TOKEN
make qualify

All chainsaw scenarios (20/20) pass, including cli-bundle-gpu-operator-templates and cli-bundle-variants which exercise the GB200 paths. Pure YAML change — no Go packages affected, so per-package coverage gate is N/A.

Risk Assessment

  • Low — Isolated change, well-tested, easy to revert
  • Medium — Touches multiple components or has broader impact
  • High — Breaking change, affects critical paths, or complex rollout

Rollout notes: Two-line YAML diff. Affects only gb200-eks-{training,inference} recipes. Revert is a single-commit revert; no schema migration, no data shape change, no flag flip. All non-GB200 recipes are byte-identical (the preManifestFiles field is opt-in and unused elsewhere).

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

The kernel-module-params ConfigMap on gb200-eks-{training,inference}
overlays is consumed by the gpu-operator chart's ClusterPolicy reconcile,
not after install — so emitting it under manifestFiles (post-install)
deadlocked 'helm install --wait' for ~52 min before terminal failure
on every fresh GB200/EKS cluster.

Moves the entry to preManifestFiles, introduced by #856, which places
the rendered ConfigMap in NNN-gpu-operator-pre/ ahead of the chart in
NNN-gpu-operator/. Verified locally:

  009-gpu-operator-pre/templates/kernel-module-params.yaml
  010-gpu-operator/   (helm install --wait)
  011-gpu-operator-post/

Audit of other manifestFiles call sites confirms the remaining entries
ship chart-CRD-dependent CRs (Skyhook, NicClusterPolicy, Gateway,
ClusterTrainingRuntime) or independent core-API resources, all of which
are correctly post-install today.

Fixes #859
@mchmarny mchmarny requested a review from a team as a code owner May 13, 2026 11:30
@mchmarny mchmarny self-assigned this May 13, 2026
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 13, 2026

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: a9135e51-fd54-45c9-bd8d-b57d9fe1e156

📥 Commits

Reviewing files that changed from the base of the PR and between 79b7f6e and 752d8df.

📒 Files selected for processing (2)
  • pkg/bundler/deployer/localformat/local_helm.go
  • pkg/bundler/deployer/localformat/writer.go

📝 Walkthrough

Walkthrough

The PR moves the GPU operator kernel-module ConfigMap in GB200 EKS training and inference recipes from post-install manifests to preManifestFiles so the ConfigMap is applied before the chart’s ClusterPolicy reconciliation during helm install --wait. It also adds detection of top-level Namespace templates when rendering local Helm charts and computes an effective create-namespace flag (createNamespace && !hasNamespaceTemplate) for generated install.sh, and changes injectAuxiliaryFolder to always request install.sh to create the namespace for auxiliary folders.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely describes the main change: moving GB200 kernel-module-params from manifestFiles to preManifestFiles to fix a deadlock issue.
Description check ✅ Passed The description is comprehensive and directly related to the changeset, explaining the motivation, implementation, testing, and risk assessment for the manifest file reorganization.
Linked Issues check ✅ Passed The PR directly addresses issue #859 by migrating GB200 kernel-module-params to preManifestFiles as required, implements the preManifestFiles mechanism from #856, and includes the comprehensive audit of other manifestFiles call sites requested.
Out of Scope Changes check ✅ Passed All code changes are in scope: two YAML recipe files are migrated per issue #859 requirements, and supporting bundler changes implement auto-detection of Namespace presence to determine --create-namespace behavior.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/859-migrate-manifestfiles-to-pre

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Copy link
Copy Markdown
Contributor

ghost commented May 13, 2026

Coverage Report ✅

Metric Value
Coverage 76.6%
Threshold 75%
Status Pass
Coverage Badge
![Coverage](https://img.shields.io/badge/coverage-76.6%25-green)

Merging this branch will increase overall coverage

Impacted Packages Coverage Δ 🤖
github.com/NVIDIA/aicr/pkg/bundler/deployer/localformat 73.48% (+0.13%) 👍

Coverage by file

Changed files (no unit tests)

Changed File Coverage Δ Total Covered Missed 🤖
github.com/NVIDIA/aicr/pkg/bundler/deployer/localformat/local_helm.go 75.00% (+1.79%) 60 (+4) 45 (+4) 15 👍
github.com/NVIDIA/aicr/pkg/bundler/deployer/localformat/writer.go 73.98% (-0.13%) 196 (-1) 145 (-1) 51 👎

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

…pace

KWOK e2e for the GB200 recipes (#868) surfaced a latent bug in PR #856:
pre-injection install.sh unconditionally omits --create-namespace, which
deadlocks any recipe that ships a bare-resource manifest (ConfigMap,
ResourceQuota, etc.) in preManifestFiles. The target namespace doesn't
exist yet — the primary chart's helm install --create-namespace hasn't
run — so the pre-pass install fails with 'namespaces "<n>" not found'.

The original 'pre = no --create-namespace' rule was correct for the
Talos case (chart ships a Namespace template; --create-namespace would
collide with Helm 3 ownership annotations) but wrong as a blanket rule.

Move the decision into writeLocalHelmFolder: scan rendered manifests
for a top-level 'kind: Namespace'; if one is present, downgrade
createNamespace to false; otherwise honor the caller's request. Callers
(injectAuxiliaryFolder, the primary local-helm path) now uniformly ask
for createNamespace=true and let detection decide.

Both call shapes are exercised:
- Talos: ships a Namespace → detection downgrades → install.sh omits
  --create-namespace (unchanged behavior; existing goldens pass)
- GB200 kernel-module-params: ships only a ConfigMap → detection leaves
  it true → install.sh creates 'gpu-operator' namespace before the
  ConfigMap, ahead of the primary chart's own --create-namespace which
  becomes a no-op against the existing namespace

Refs #859, follow-up to #856
@mchmarny mchmarny requested a review from a team as a code owner May 13, 2026 12:34
@mchmarny mchmarny enabled auto-merge (squash) May 13, 2026 12:43
@mchmarny mchmarny merged commit 139ed70 into main May 13, 2026
58 checks passed
@mchmarny mchmarny deleted the fix/859-migrate-manifestfiles-to-pre branch May 13, 2026 12:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bundler: manifestFiles always emitted as -post, breaking prereq ConfigMaps (GB200/EKS deploy hangs)

2 participants