fix(recipes): migrate GB200 kernel-module-params to preManifestFiles#868
Conversation
The kernel-module-params ConfigMap on gb200-eks-{training,inference}
overlays is consumed by the gpu-operator chart's ClusterPolicy reconcile,
not after install — so emitting it under manifestFiles (post-install)
deadlocked 'helm install --wait' for ~52 min before terminal failure
on every fresh GB200/EKS cluster.
Moves the entry to preManifestFiles, introduced by #856, which places
the rendered ConfigMap in NNN-gpu-operator-pre/ ahead of the chart in
NNN-gpu-operator/. Verified locally:
009-gpu-operator-pre/templates/kernel-module-params.yaml
010-gpu-operator/ (helm install --wait)
011-gpu-operator-post/
Audit of other manifestFiles call sites confirms the remaining entries
ship chart-CRD-dependent CRs (Skyhook, NicClusterPolicy, Gateway,
ClusterTrainingRuntime) or independent core-API resources, all of which
are correctly post-install today.
Fixes #859
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Enterprise Run ID: 📒 Files selected for processing (2)
📝 WalkthroughWalkthroughThe PR moves the GPU operator kernel-module ConfigMap in GB200 EKS training and inference recipes from post-install manifests to preManifestFiles so the ConfigMap is applied before the chart’s ClusterPolicy reconciliation during helm install --wait. It also adds detection of top-level Namespace templates when rendering local Helm charts and computes an effective create-namespace flag (createNamespace && !hasNamespaceTemplate) for generated install.sh, and changes injectAuxiliaryFolder to always request install.sh to create the namespace for auxiliary folders. Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
Coverage Report ✅
Coverage BadgeMerging this branch will increase overall coverage
Coverage by fileChanged files (no unit tests)
Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code. |
…pace KWOK e2e for the GB200 recipes (#868) surfaced a latent bug in PR #856: pre-injection install.sh unconditionally omits --create-namespace, which deadlocks any recipe that ships a bare-resource manifest (ConfigMap, ResourceQuota, etc.) in preManifestFiles. The target namespace doesn't exist yet — the primary chart's helm install --create-namespace hasn't run — so the pre-pass install fails with 'namespaces "<n>" not found'. The original 'pre = no --create-namespace' rule was correct for the Talos case (chart ships a Namespace template; --create-namespace would collide with Helm 3 ownership annotations) but wrong as a blanket rule. Move the decision into writeLocalHelmFolder: scan rendered manifests for a top-level 'kind: Namespace'; if one is present, downgrade createNamespace to false; otherwise honor the caller's request. Callers (injectAuxiliaryFolder, the primary local-helm path) now uniformly ask for createNamespace=true and let detection decide. Both call shapes are exercised: - Talos: ships a Namespace → detection downgrades → install.sh omits --create-namespace (unchanged behavior; existing goldens pass) - GB200 kernel-module-params: ships only a ConfigMap → detection leaves it true → install.sh creates 'gpu-operator' namespace before the ConfigMap, ahead of the primary chart's own --create-namespace which becomes a no-op against the existing namespace Refs #859, follow-up to #856
Summary
Moves the
kernel-module-params.yamlentry ongb200-eks-{training,inference}frommanifestFilestopreManifestFilesso it lands beforehelm install --wait gpu-operatorinstead of after.Motivation / Context
manifestFilesalways emits to aNNN-<name>-post/folder applied after the chart. Thenvidia-kernel-module-paramsConfigMap on GB200/EKS is prereq-shaped — the gpu-operator chart's ClusterPolicy reconcile reads it duringhelm install --wait— so post-install ordering deadlocks the chart for ~52 min on every fresh cluster (5×10m retries, then terminal failure withConfigMap "nvidia-kernel-module-params" not found).PR #856 added the
preManifestFilesfield and the bundler-prefolder emission needed to fix this; this PR migrates the only two entries that issue #859 explicitly called out as broken.Fixes #859
Related: #856 (introduced
preManifestFiles), #706 (introduced the post-only regression), #668 (added the affected entry)Type of Change
Component(s) Affected
cmd/aicr,pkg/cli)cmd/aicrd,pkg/api,pkg/server)pkg/recipe)pkg/bundler,pkg/component/*)pkg/collector,pkg/snapshotter)pkg/validator)pkg/errors,pkg/k8s)docs/,examples/)recipes/overlays/gb200-eks-{training,inference}.yamlImplementation Notes
Two-line YAML change per file:
manifestFiles:→preManifestFiles:, with a comment line that records why the ordering matters (so the next person who sees the entry doesn't move it back).Audit of other
manifestFilescall sitesIssue #859 asked for an audit of every other caller. Each shipped manifest was inspected; none need migration under current behavior:
base.yaml/gke-cos.yamlgpu-operator →dcgm-exporter.yaml,gke-resource-quota.yamlConfigMap,ResourceQuotah100-eks-{training,inference}.yaml,h100-gke-cos-{training,inference}.yaml,gb200-eks-{training,inference}.yamlnodewright-customizations →tuning*.yamlSkyhook(skyhook.nvidia.com/v1alpha1)nodewright-operatorchart; CR must apply after the CRD exists.h100-gke-cos-training.yamlgke-nccl-tcpxo →nccl-tcpxo-installer.yaml,nri-device-injector.yamlDaemonSet× 2aks.yamlnetwork-operator →nfd-network-rule.yaml,nic-cluster-policy-aks.yaml,ib-node-config-aks.yamlNodeFeatureRule,NicClusterPolicy,ConfigMap+DaemonSetNicClusterPolicyCRD comes from the network-operator chart; the rest are core/nfd API with no reconcile dependency.platform-inference.yamlkgateway-crds →gateway-api-crds.yaml,inference-extension-crds.yamlCustomResourceDefinition× Nkgatewaychart serializes viadependencyRefs, not order within this component.platform-inference.yamlkgateway →inference-gateway.yamlGatewayParameters,Gatewaykgateway-crds.h100-gke-cos-training-kubeflow.yaml,platform-kubeflow.yamlkubeflow-trainer →torch-distributed-cluster-training-runtime.yamlClusterTrainingRuntimeIf any of the "tolerant" entries (e.g.,
dcgm-exporter.yaml) is later found to deadlock a reconcile, the migration is the same one-field swap.Verification
Built locally and generated a GB200/EKS bundle:
The ConfigMap now lands one folder before the chart's
--wait, eliminating the deadlock.Testing
unset GITLAB_TOKEN make qualifyAll chainsaw scenarios (20/20) pass, including
cli-bundle-gpu-operator-templatesandcli-bundle-variantswhich exercise the GB200 paths. Pure YAML change — no Go packages affected, so per-package coverage gate is N/A.Risk Assessment
Rollout notes: Two-line YAML diff. Affects only
gb200-eks-{training,inference}recipes. Revert is a single-commit revert; no schema migration, no data shape change, no flag flip. All non-GB200 recipes are byte-identical (thepreManifestFilesfield is opt-in and unused elsewhere).Checklist
make testwith-race)make lint)git commit -S)