diff --git a/Makefile b/Makefile
index 042dd8b17..d8caee60c 100644
--- a/Makefile
+++ b/Makefile
@@ -147,6 +147,7 @@ LICENSE_IGNORES = \
 	-ignore 'bundles/**' \
 	-ignore 'dist/**' \
 	-ignore 'vendor/**' \
+	-ignore '**/testdata/**' \
 	-ignore 'site/public/**' \
 	-ignore 'site/resources/**' \
 	-ignore 'site/node_modules/**'
diff --git a/docs/contributor/component.md b/docs/contributor/component.md
index 270e02835..4a61706e5 100644
--- a/docs/contributor/component.md
+++ b/docs/contributor/component.md
@@ -16,6 +16,27 @@ The bundler system converts RecipeInput objects into deployment artifacts. Artif
 - **Node scheduling**: Registry defines paths for injecting node selectors and tolerations
 - **Structured errors**: Uses `pkg/errors` for error codes and wrapping
 
+### Local Format (Shared Bundle Layout)
+
+`pkg/bundler/deployer/localformat` writes the uniform numbered `NNN-<component>/` bundle layout consumed by every deployer. It owns per-folder content (Chart.yaml, values.yaml, cluster-values.yaml, install.sh, templates/, upstream.env). Deployers (`helm`, future `helmfile` per [#632](https://github.com/NVIDIA/aicr/issues/632), `argocd`, `argocd-helm`) call `localformat.Write()` and then add their own top-level orchestration files (deploy.sh, helmfile.yaml, Application CRs, etc.) — they never re-classify components or duplicate the per-folder writer.
+
+**Classification rule** (single source of truth, in `localformat.classify`):
+
+| Recipe shape | Folder kind | Notes |
+|---|---|---|
+| `helm.defaultRepository` set, no `manifestFiles` | `KindUpstreamHelm` | upstream chart referenced via `upstream.env`; no Chart.yaml |
+| `helm.defaultRepository` set + `manifestFiles` (mixed) | `KindUpstreamHelm` (primary) + `KindLocalHelm` (`-post` injected) | two adjacent folders; raw manifests deploy post-install |
+| `helm.defaultRepository == ""` + `manifestFiles` | `KindLocalHelm` | manifest-only wrapped chart |
+| `kustomize` (Tag/Path set) | `KindLocalHelm` | `kustomize build` at bundle time → `templates/manifest.yaml` |
+
+**Load-bearing invariants** (don't violate without changing the design):
+
+1. **`localformat` never writes deployer-specific files.** `deploy.sh`, `helmfile.yaml`, argocd `Application` CRs, Flux `HelmRelease`s — all produced by the respective deployer after `Write()` returns. This separation is what makes a single layout consumable by every deployer.
+2. **`install.sh` is never name-customized.** It is rendered from one of exactly two templates (`install-upstream-helm.sh.tmpl`, `install-local-helm.sh.tmpl`), parameterized only by data (name, namespace, upstream ref). Name-keyed quirks (kai-scheduler async timeout, nodewright-operator taint cleanup, DRA restart, orphan-CRD scan) stay in `deploy.sh` as name-matched blocks — not in `install.sh`. This is the structural barrier that prevents per-folder scripts from accumulating drift.
+3. **`Write` is deterministic and idempotent.** Same inputs → same on-disk bytes → same `Folder` slice. Map iteration is sorted; no timestamps or random suffixes are embedded.
+
+For the full classification table, base-format invariants, and the helm deployer's call site, see `pkg/bundler/deployer/localformat/doc.go` (godoc) and `pkg/bundler/deployer/helm/helm.go::Generate`. Further design history: ticket [#662](https://github.com/NVIDIA/aicr/issues/662).
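A minimal sketch of the classification rule in the table, for readers who want to see the four rows as code. It is not the `localformat.classify` implementation: the `shape` and `folder` types, the `classifySketch` function, the example repository URL, and the choice to test the kustomize row first are all assumptions made for illustration. Only the kind names and the `NNN-` / `-post` naming come from the text.

```go
package main

import "fmt"

// Kind mirrors the two folder kinds named in the table. The type is an
// illustrative stand-in, not the real localformat definition.
type Kind string

const (
	KindUpstreamHelm Kind = "upstream-helm"
	KindLocalHelm    Kind = "local-helm"
)

// shape is a hypothetical, simplified view of one recipe component.
type shape struct {
	Name          string
	Repository    string // helm.defaultRepository; empty when unset
	ManifestFiles int    // number of raw manifest files
	IsKustomize   bool   // kustomize Tag/Path set
}

// folder is a hypothetical (name, kind) pair for one emitted folder.
type folder struct {
	Name string
	Kind Kind
}

// classifySketch applies the table row by row and numbers folders 1-based,
// zero-padded to three digits. A mixed component yields two adjacent folders.
func classifySketch(components []shape) ([]folder, error) {
	var out []folder
	next := 1
	add := func(name string, k Kind) {
		out = append(out, folder{Name: fmt.Sprintf("%03d-%s", next, name), Kind: k})
		next++
	}
	for _, c := range components {
		switch {
		case c.IsKustomize: // assumption: kustomize takes precedence
			add(c.Name, KindLocalHelm)
		case c.Repository != "" && c.ManifestFiles == 0:
			add(c.Name, KindUpstreamHelm)
		case c.Repository != "" && c.ManifestFiles > 0:
			add(c.Name, KindUpstreamHelm)
			add(c.Name+"-post", KindLocalHelm)
		case c.ManifestFiles > 0:
			add(c.Name, KindLocalHelm)
		default:
			return nil, fmt.Errorf("component %q matches no row of the table", c.Name)
		}
	}
	return out, nil
}

func main() {
	folders, err := classifySketch([]shape{
		{Name: "cert-manager", Repository: "https://charts.example.invalid"},
		{Name: "gpu-operator", Repository: "https://charts.example.invalid", ManifestFiles: 1},
	})
	if err != nil {
		panic(err)
	}
	for _, f := range folders {
		fmt.Println(f.Name, f.Kind)
	}
}
```

With these two inputs the sketch yields three folders, the same shape as the `001-cert-manager`, `002-gpu-operator`, `003-gpu-operator-post` example in the CLI reference.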
+
 ## Quick Start
 
 ### Adding a New Component (Declarative Approach)
diff --git a/docs/user/cli-reference.md b/docs/user/cli-reference.md
index 80d76602b..65be95d62 100644
--- a/docs/user/cli-reference.md
+++ b/docs/user/cli-reference.md
@@ -1116,22 +1116,50 @@ bundles/
 ```
 bundles/
 ├── README.md                               # Deployment guide with ordered steps
-├── deploy.sh                               # One-command deployment script
+├── deploy.sh                               # Generic install loop + name-matched blocks
+├── undeploy.sh                             # Generic reverse loop
 ├── recipe.yaml                             # Recipe used to generate bundle
 ├── checksums.txt                           # SHA256 checksums
 ├── attestation/                            # Present when --attest is used
 │   ├── bundle-attestation.sigstore.json    # SLSA Build Provenance v1
 │   └── aicr-attestation.sigstore.json      # Binary SLSA provenance chain
-├── gpu-operator/
-│   ├── values.yaml                         # Component-specific Helm values
-│   ├── README.md                           # Per-component install/upgrade/uninstall
-│   └── manifests/                          # Additional manifests (if any)
-│       └── dcgm-exporter.yaml
-└── cert-manager/
+├── 001-cert-manager/                       # Upstream-helm folder: no Chart.yaml
+│   ├── install.sh                          # Rendered: helm upgrade --install ... --repo ${REPO}
+│   ├── values.yaml
+│   ├── cluster-values.yaml                 # Dynamic-path overrides (operator-edited)
+│   └── upstream.env                        # CHART, REPO, VERSION (sourced by install.sh)
+├── 002-gpu-operator/                       # Mixed component primary (upstream-helm)
+│   ├── install.sh
+│   ├── values.yaml
+│   ├── cluster-values.yaml
+│   └── upstream.env
+└── 003-gpu-operator-post/                  # Injected -post wrapped chart (mixed component's raw manifests)
+    ├── Chart.yaml                          # Local-helm folder: Chart.yaml + templates/ present
+    ├── install.sh                          # Rendered: helm upgrade --install ... ./
     ├── values.yaml
-    └── README.md
+    ├── cluster-values.yaml
+    └── templates/
+        └── dcgm-exporter.yaml
 ```
 
+**Folder layout rules:**
+
+- Folders are numbered `NNN-<component>/` (1-based, zero-padded). Numbering is regenerated on every bundle.
+- Each folder is one of two **kinds**, distinguished by the presence of `Chart.yaml`:
+  - **upstream-helm** — no `Chart.yaml`; `upstream.env` carries `CHART`/`REPO`/`VERSION`; `install.sh` installs the upstream chart.
+  - **local-helm** — `Chart.yaml` + `templates/`; `install.sh` installs the local chart (`helm upgrade --install ./`).
+- **Mixed components** (Helm chart + raw manifests) emit **two adjacent folders**: a primary upstream-helm `NNN-<component>/` and an injected `(NNN+1)-<component>-post/` local-helm wrapper carrying the raw manifests. Subsequent components shift by one.
+- Manifest-only components (no upstream Helm chart, just raw manifests) become a single local-helm wrapped chart.
+- Kustomize-typed components run `kustomize build` at bundle time; the output becomes a single `templates/manifest.yaml` inside a local-helm folder.
+
+**Breaking change vs. earlier releases:**
+
+Previous releases used a flat `<component>/` layout with `manifests/` siblings and a `--deployer helm` script that branched on component kind. The new format is uniform:
+
+- All folders carry a rendered `install.sh`. The top-level `deploy.sh` is a generic loop with no per-component branching — name-matched special-case blocks (nodewright-operator taint cleanup, kai-scheduler async timeout, orphan-CRD scan, DRA kubelet-plugin restart) live around the loop, not inside it.
+- Raw manifests for mixed components now apply **post-install only**, via the injected `-post` wrapped chart. The earlier pre-apply mechanism with a CRD-race retry wrapper is gone — Helm now owns CRD ordering for mixed components natively.
+- Tooling that parsed bundle paths by bare component name must account for the `NNN-` prefix.
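For tooling affected by the last point, one possible way to recover the component name from a numbered folder is sketched here. `splitNumberedFolder` is a hypothetical helper, not part of the bundler, and it assumes the prefix is always exactly three digits followed by a hyphen, as the `NNN-` notation suggests. Any `-post` suffix on an injected folder is left in the returned name.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// numbered matches a three-digit prefix, a hyphen, and a non-empty remainder.
var numbered = regexp.MustCompile(`^(\d{3})-(.+)$`)

// splitNumberedFolder is a hypothetical helper: it splits a folder name such
// as "002-gpu-operator" into its position and the part after the prefix.
// ok is false when the name does not carry an NNN- prefix.
func splitNumberedFolder(dir string) (index int, name string, ok bool) {
	m := numbered.FindStringSubmatch(dir)
	if m == nil {
		return 0, "", false
	}
	n, err := strconv.Atoi(m[1])
	if err != nil {
		return 0, "", false
	}
	return n, m[2], true
}

func main() {
	for _, dir := range []string{"002-gpu-operator", "003-gpu-operator-post", "gpu-operator"} {
		index, name, ok := splitNumberedFolder(dir)
		fmt.Printf("%q -> index=%d name=%q ok=%v\n", dir, index, name, ok)
	}
}
```

Because numbering is regenerated on every bundle, matching on the name part rather than on a fixed number is likely the safer choice for scripts.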
+
 **Argo CD bundle structure** (with `--deployer argocd`):
 ```
 bundles/
@@ -1273,11 +1301,11 @@ If you need to enforce specific install-time values (e.g., pinning `driver.versi
 
 ```shell
 # Navigate to bundle
-cd bundles/gpu-operator
+cd bundles
 
-# Review configuration
-cat values.yaml
+# Review root README and a component's values
 cat README.md
+cat 001-gpu-operator/values.yaml
 
 # Verify integrity
 sha256sum -c checksums.txt
@@ -1286,7 +1314,7 @@ sha256sum -c checksums.txt
 chmod +x deploy.sh && ./deploy.sh
 ```
 
-> **Note:** `deploy.sh` and `undeploy.sh` are convenience scripts — not the only deployment path. Each component subdirectory contains a `README.md` with the exact `helm upgrade --install` command for manual or pipeline-driven deployment.
+> **Note:** `deploy.sh` and `undeploy.sh` are convenience scripts — not the only deployment path. Each `NNN-<component>/` folder contains a rendered `install.sh` that runs the exact `helm upgrade --install` command for manual or pipeline-driven deployment.
 
 #### Deploy Script Behavior (`deploy.sh`)
 
diff --git a/kwok/scripts/validate-scheduling.sh b/kwok/scripts/validate-scheduling.sh
index 8c586fdde..be95e2172 100755
--- a/kwok/scripts/validate-scheduling.sh
+++ b/kwok/scripts/validate-scheduling.sh
@@ -381,8 +381,10 @@ generate_bundle() {
     # KWOK clusters use emptyDir for Prometheus storage (no PVC/StorageClass).
     # Cloud overlays (EKS, AKS) set emptyDir: null + volumeClaimTemplate which
     # the Prometheus CRD rejects. Restore emptyDir and remove PVC for KWOK.
-    local prom_values="${WORK_DIR}/bundle/kube-prometheus-stack/values.yaml"
-    if [[ -f "$prom_values" ]] && yq eval '.prometheus.prometheusSpec.storageSpec.emptyDir' "$prom_values" 2>/dev/null | grep -q 'null'; then
+    # Bundle layout uses NNN-prefixed folders, so glob to find the kube-prometheus-stack folder.
+ local prom_values + prom_values=$(ls -1 "${WORK_DIR}/bundle"/[0-9][0-9][0-9]-kube-prometheus-stack/values.yaml 2>/dev/null | head -1) + if [[ -n "$prom_values" && -f "$prom_values" ]] && yq eval '.prometheus.prometheusSpec.storageSpec.emptyDir' "$prom_values" 2>/dev/null | grep -q 'null'; then log_info "Fixing kube-prometheus-stack storageSpec for KWOK (emptyDir instead of PVC)" yq eval -i ' .prometheus.prometheusSpec.storageSpec.emptyDir = {"medium": "", "sizeLimit": "10Gi"} | diff --git a/pkg/bundler/bundler_test.go b/pkg/bundler/bundler_test.go index fd7b6901b..a9fe5c9f3 100644 --- a/pkg/bundler/bundler_test.go +++ b/pkg/bundler/bundler_test.go @@ -249,25 +249,25 @@ func TestMake_Success(t *testing.T) { } } - // Verify per-component directories - for _, comp := range []string{"gpu-operator", "network-operator"} { - valuesPath := filepath.Join(tmpDir, comp, "values.yaml") + // Verify per-component directories (numbered by deployment order) + componentDirs := map[string]string{ + "gpu-operator": "001-gpu-operator", + "network-operator": "002-network-operator", + } + for comp, dir := range componentDirs { + valuesPath := filepath.Join(tmpDir, dir, "values.yaml") if _, statErr := os.Stat(valuesPath); os.IsNotExist(statErr) { - t.Errorf("expected %s/values.yaml was not created", comp) - } - readmePath := filepath.Join(tmpDir, comp, "README.md") - if _, statErr := os.Stat(readmePath); os.IsNotExist(statErr) { - t.Errorf("expected %s/README.md was not created", comp) + t.Errorf("expected %s/values.yaml was not created (component %s)", dir, comp) } } - // No Chart.yaml should exist + // No Chart.yaml should exist at top level chartPath := filepath.Join(tmpDir, "Chart.yaml") if _, statErr := os.Stat(chartPath); !os.IsNotExist(statErr) { t.Error("Chart.yaml should not exist in per-component bundle") } - // Verify output summary (3 root + 2 components × 2 files = 7, +1 recipe.yaml = 8) + // Verify output summary (3 root + 2 components × multiple files >= 7) if 
output.TotalFiles < 7 { t.Errorf("expected at least 7 files, got %d", output.TotalFiles) } @@ -317,14 +317,16 @@ func TestMake_DisabledComponentsFiltered(t *testing.T) { t.Fatal("Make() returned nil output") } - // Enabled component should have a directory - if _, statErr := os.Stat(filepath.Join(tmpDir, "gpu-operator", "values.yaml")); os.IsNotExist(statErr) { - t.Error("expected gpu-operator/values.yaml to be created") + // Enabled component should have a directory (numbering reflects only enabled components) + if _, statErr := os.Stat(filepath.Join(tmpDir, "001-gpu-operator", "values.yaml")); os.IsNotExist(statErr) { + t.Error("expected 001-gpu-operator/values.yaml to be created") } - // Disabled component should NOT have a directory - if _, statErr := os.Stat(filepath.Join(tmpDir, "aws-ebs-csi-driver")); !os.IsNotExist(statErr) { - t.Error("expected aws-ebs-csi-driver directory to NOT be created") + // Disabled component should NOT have a directory (under any numbering) + for _, dir := range []string{"aws-ebs-csi-driver", "001-aws-ebs-csi-driver", "002-aws-ebs-csi-driver"} { + if _, statErr := os.Stat(filepath.Join(tmpDir, dir)); !os.IsNotExist(statErr) { + t.Errorf("expected %s directory to NOT be created", dir) + } } // deploy.sh should not reference the disabled component @@ -420,7 +422,10 @@ func TestMake_SetEnabledOverridesPrecedence(t *testing.T) { t.Fatalf("Make() error = %v", makeErr) } - _, statErr := os.Stat(filepath.Join(tmpDir, "aws-ebs-csi-driver")) + // When included, the component appears as the second numbered folder + // (gpu-operator is 001, aws-ebs-csi-driver is 002). The flat layout + // is gone in this PR — only assert against the numbered path. 
+ _, statErr := os.Stat(filepath.Join(tmpDir, "002-aws-ebs-csi-driver")) included := !os.IsNotExist(statErr) if included != tt.expectIncluded { @@ -461,7 +466,8 @@ func TestMake_SetEnabledNotLeakedToHelmValues(t *testing.T) { t.Fatalf("Make() error = %v", makeErr) } - valuesPath := filepath.Join(tmpDir, "aws-ebs-csi-driver", "values.yaml") + // aws-ebs-csi-driver is the 2nd component in deployment order (after gpu-operator) + valuesPath := filepath.Join(tmpDir, "002-aws-ebs-csi-driver", "values.yaml") valuesData, readErr := os.ReadFile(valuesPath) if readErr != nil { t.Fatalf("failed to read values.yaml: %v", readErr) @@ -519,10 +525,10 @@ func TestMake_WithValueOverrides(t *testing.T) { t.Fatal("Make() returned nil output") } - // Verify gpu-operator/values.yaml was created - valuesPath := filepath.Join(tmpDir, "gpu-operator", "values.yaml") + // Verify 001-gpu-operator/values.yaml was created (single component → 001) + valuesPath := filepath.Join(tmpDir, "001-gpu-operator", "values.yaml") if _, err := os.Stat(valuesPath); os.IsNotExist(err) { - t.Fatal("gpu-operator/values.yaml was not created") + t.Fatal("001-gpu-operator/values.yaml was not created") } } @@ -1190,19 +1196,17 @@ func TestMake_DisabledComponentWithDynamic(t *testing.T) { t.Fatalf("Make() error = %v", makeErr) } - // Disabled component should NOT have a directory at all - if _, statErr := os.Stat(filepath.Join(tmpDir, "aws-ebs-csi-driver")); !os.IsNotExist(statErr) { - t.Error("expected aws-ebs-csi-driver directory to NOT be created (component is disabled)") - } - - // Disabled component should NOT have cluster-values.yaml - if _, statErr := os.Stat(filepath.Join(tmpDir, "aws-ebs-csi-driver", "cluster-values.yaml")); !os.IsNotExist(statErr) { - t.Error("expected aws-ebs-csi-driver/cluster-values.yaml to NOT exist (component is disabled)") + // Disabled component should NOT have a directory at all (under any numbering). 
+ // The directory check implies cluster-values.yaml absence, so don't double-check. + for _, dir := range []string{"aws-ebs-csi-driver", "001-aws-ebs-csi-driver", "002-aws-ebs-csi-driver"} { + if _, statErr := os.Stat(filepath.Join(tmpDir, dir)); !os.IsNotExist(statErr) { + t.Errorf("expected %s directory to NOT be created (component is disabled)", dir) + } } - // Enabled component should still exist - if _, statErr := os.Stat(filepath.Join(tmpDir, "gpu-operator", "values.yaml")); os.IsNotExist(statErr) { - t.Error("expected gpu-operator/values.yaml to be created") + // Enabled component should still exist (gpu-operator is the only enabled → 001) + if _, statErr := os.Stat(filepath.Join(tmpDir, "001-gpu-operator", "values.yaml")); os.IsNotExist(statErr) { + t.Error("expected 001-gpu-operator/values.yaml to be created") } // deploy.sh should not reference the disabled component diff --git a/pkg/bundler/deployer/helm/doc.go b/pkg/bundler/deployer/helm/doc.go index d25bd092b..130df048e 100644 --- a/pkg/bundler/deployer/helm/doc.go +++ b/pkg/bundler/deployer/helm/doc.go @@ -14,13 +14,19 @@ // Package helm generates per-component Helm bundles from recipe results. 
 //
-// Generates a directory per component with individual values and install instructions:
+// Per-component folder layout (NNN-prefixed, written by pkg/bundler/deployer/localformat):
+//
+//   - NNN-<component>/install.sh: Per-folder install script
+//   - NNN-<component>/values.yaml: Static Helm values
+//   - NNN-<component>/cluster-values.yaml: Per-cluster dynamic values
+//   - NNN-<component>/upstream.env: CHART/REPO/VERSION (upstream-helm folders)
+//   - NNN-<component>/Chart.yaml + templates/: Local chart (local-helm folders)
+//
+// Top-level files (owned by this deployer):
 //
-//   - <component>/values.yaml: Helm values per component
-//   - <component>/README.md: Component install/upgrade/uninstall
-//   - <component>/manifests/: Optional manifest files
 //   - README.md: Root deployment guide with ordered steps
 //   - deploy.sh: Automation script (0755)
+//   - undeploy.sh: Reverse-order uninstall script (0755)
 //   - checksums.txt: SHA256 digests for verification (optional)
 //
 // Usage:
diff --git a/pkg/bundler/deployer/helm/helm.go b/pkg/bundler/deployer/helm/helm.go
index a71aa8ac8..3de09e4d9 100644
--- a/pkg/bundler/deployer/helm/helm.go
+++ b/pkg/bundler/deployer/helm/helm.go
@@ -20,25 +20,19 @@ import (
 	"fmt"
 	"log/slog"
 	"os"
-	"path/filepath"
-	"sort"
 	"strings"
 	"time"
 
 	"github.com/NVIDIA/aicr/pkg/bundler/checksum"
 	"github.com/NVIDIA/aicr/pkg/bundler/deployer"
-	"github.com/NVIDIA/aicr/pkg/component"
+	"github.com/NVIDIA/aicr/pkg/bundler/deployer/localformat"
 	"github.com/NVIDIA/aicr/pkg/errors"
-	"github.com/NVIDIA/aicr/pkg/manifest"
 	"github.com/NVIDIA/aicr/pkg/recipe"
 )
 
 //go:embed templates/README.md.tmpl
 var readmeTemplate string
 
-//go:embed templates/component-README.md.tmpl
-var componentReadmeTemplate string
-
 //go:embed templates/deploy.sh.tmpl
 var deployScriptTemplate string
 
@@ -48,20 +42,20 @@ var undeployScriptTemplate string
 // criteriaAny is the wildcard value for criteria fields.
 const criteriaAny = "any"
 
-// ComponentData contains data for rendering per-component templates.
+// ComponentData contains data for rendering per-component template blocks. +// The helm deployer no longer owns per-component folder content (localformat +// does). ComponentData now carries only the fields needed by the orchestration +// templates: README.md's component table and deploy.sh / undeploy.sh +// name-matched special-case blocks. type ComponentData struct { - Name string - Namespace string - Repository string - ChartName string - Version string // Original version string (preserves 'v' prefix) for helm install --version - ChartVersion string // Normalized version (no 'v' prefix) for chart metadata labels - HasManifests bool - HasChart bool - IsOCI bool - IsKustomize bool // True when the component uses Kustomize instead of Helm - Tag string // Git ref for Kustomize components (tag, branch, or commit) - Path string // Path within the repository to the kustomization + Name string + Namespace string + Repository string + ChartName string + Version string // Original version string (preserves 'v' prefix) for helm install --version + IsOCI bool + Tag string // Git ref for Kustomize-typed components (tag/branch/commit) + Path string // Path within the repository to the kustomization } // compile-time interface check @@ -97,6 +91,9 @@ type Generator struct { } // Generate creates a per-component Helm bundle from the configured generator fields. +// Per-component folder content (Chart.yaml, values.yaml, install.sh, templates/*) +// is delegated to pkg/bundler/deployer/localformat. The helm deployer owns only +// the top-level orchestration: README.md, deploy.sh, undeploy.sh, and checksums. 
func (g *Generator) Generate(ctx context.Context, outputDir string) (*deployer.Output, error) { start := time.Now() @@ -120,14 +117,36 @@ func (g *Generator) Generate(ctx context.Context, outputDir string) (*deployer.O return nil, err } - // Generate per-component directories - files, size, err := g.generateComponentDirectories(ctx, components, outputDir) + // Map ComponentData to localformat.Component and write per-component folders. + // localformat owns: folder naming, values.yaml/cluster-values.yaml split, + // Chart.yaml, templates/*, install.sh. The helm deployer just orchestrates. + lfComponents := toLocalformatComponents(components, g.ComponentValues, g.DynamicValues) + folders, err := localformat.Write(ctx, localformat.Options{ + OutputDir: outputDir, + Components: lfComponents, + ComponentManifests: g.ComponentManifests, + }) if err != nil { - return nil, errors.Wrap(errors.ErrCodeInternal, - "failed to generate component directories", err) + // localformat.Write returns StructuredErrors; propagate as-is. + return nil, err + } + for _, f := range folders { + // localformat returns paths relative to outputDir. Downstream consumers + // (checksum.WriteChecksums, output.TotalSize, deployment reporting) all + // expect absolute paths, so resolve each entry via SafeJoin before + // appending. SafeJoin also enforces containment. + for _, rel := range f.Files { + abs, joinErr := deployer.SafeJoin(outputDir, rel) + if joinErr != nil { + return nil, errors.Wrap(errors.ErrCodeInvalidRequest, + fmt.Sprintf("path from localformat escapes outputDir: %s", rel), joinErr) + } + output.Files = append(output.Files, abs) + if info, statErr := os.Stat(abs); statErr == nil { + output.TotalSize += info.Size() + } + } } - output.Files = append(output.Files, files...) 
- output.TotalSize += size // Generate root README.md readmePath, readmeSize, err := g.generateRootREADME(ctx, components, outputDir) @@ -188,12 +207,8 @@ func (g *Generator) Generate(ctx context.Context, outputDir string) (*deployer.O // buildComponentDataList builds a sorted list of ComponentData from the recipe. // It validates that all component names are safe for use as directory names. +// Only the fields consumed by the orchestration templates are populated. func (g *Generator) buildComponentDataList() ([]ComponentData, error) { - componentMap := make(map[string]recipe.ComponentRef) - for _, ref := range g.RecipeResult.ComponentRefs { - componentMap[ref.Name] = ref - } - // Sort by deployment order sorted := deployer.SortComponentRefsByDeploymentOrder( g.RecipeResult.ComponentRefs, @@ -207,166 +222,51 @@ func (g *Generator) buildComponentDataList() ([]ComponentData, error) { fmt.Sprintf("invalid component name %q: must not contain path separators or parent directory references", ref.Name)) } - hasManifests := false - if g.ComponentManifests != nil { - if m, ok := g.ComponentManifests[ref.Name]; ok && len(m) > 0 { - hasManifests = true - } - } - - isKustomize := ref.Type == recipe.ComponentTypeKustomize - chartName := ref.Chart if chartName == "" { chartName = ref.Name } - isOCI := strings.HasPrefix(ref.Source, "oci://") - // Preserve version string as-is for deploy.sh --version flag. - // Helm handles 'v' prefixes correctly via fuzzy matching. 
- version := ref.Version - components = append(components, ComponentData{ - Name: ref.Name, - Namespace: ref.Namespace, - Repository: ref.Source, - ChartName: chartName, - Version: version, - ChartVersion: deployer.NormalizeVersionWithDefault(ref.Version), - HasManifests: hasManifests, - HasChart: !isKustomize && ref.Source != "", - IsOCI: isOCI, - IsKustomize: isKustomize, - Tag: ref.Tag, - Path: ref.Path, + Name: ref.Name, + Namespace: ref.Namespace, + Repository: ref.Source, + ChartName: chartName, + Version: ref.Version, + IsOCI: strings.HasPrefix(ref.Source, "oci://"), + Tag: ref.Tag, + Path: ref.Path, }) } return components, nil } -// generateComponentDirectories creates per-component directories with values.yaml, README.md, and optional manifests. -func (g *Generator) generateComponentDirectories(ctx context.Context, components []ComponentData, outputDir string) ([]string, int64, error) { - files := make([]string, 0, len(components)*3) - var totalSize int64 - - for i, comp := range components { - select { - case <-ctx.Done(): - return nil, 0, errors.Wrap(errors.ErrCodeInternal, "context cancelled", ctx.Err()) - default: - } - - componentDir, err := deployer.SafeJoin(outputDir, comp.Name) - if err != nil { - return nil, 0, err - } - if mkdirErr := os.MkdirAll(componentDir, 0755); mkdirErr != nil { - return nil, 0, errors.Wrap(errors.ErrCodeInternal, - fmt.Sprintf("failed to create directory for %s", comp.Name), mkdirErr) - } - - // Deep-copy component values so writeClusterValuesFile can safely - // remove dynamic paths without mutating the caller's map. - values := component.DeepCopyMap(g.ComponentValues[comp.Name]) +// toLocalformatComponents maps the orchestration ComponentData list to the +// per-component inputs consumed by localformat.Write. Values and DynamicPaths +// are looked up by component name from the generator's maps. 
+func toLocalformatComponents( + components []ComponentData, + values map[string]map[string]any, + dynamic map[string][]string, +) []localformat.Component { - // Extract dynamic paths (if any) from values into cluster-values.yaml. - // Every component gets a cluster-values.yaml — dynamic paths are pre-populated, - // and users can add any additional overrides. deploy.sh always passes it. - clusterFiles, clusterSize, clusterErr := writeClusterValuesFile(values, g.DynamicValues[comp.Name], componentDir, comp.Name) - if clusterErr != nil { - return nil, 0, clusterErr - } - files = append(files, clusterFiles...) - totalSize += clusterSize - - valuesPath, valuesSize, err := deployer.WriteValuesFile(values, componentDir, "values.yaml") - if err != nil { - return nil, 0, errors.Wrap(errors.ErrCodeInternal, - fmt.Sprintf("failed to write values.yaml for %s", comp.Name), err) - } - files = append(files, valuesPath) - totalSize += valuesSize - - // Write component README.md - readmePath, readmeSize, err := deployer.GenerateFromTemplate(componentReadmeTemplate, comp, componentDir, "README.md") - if err != nil { - return nil, 0, errors.Wrap(errors.ErrCodeInternal, - fmt.Sprintf("failed to write README.md for %s", comp.Name), err) - } - files = append(files, readmePath) - totalSize += readmeSize - - // Write manifests if present - if g.ComponentManifests != nil { - if manifests, ok := g.ComponentManifests[comp.Name]; ok && len(manifests) > 0 { - manifestDir, manifestDirErr := deployer.SafeJoin(componentDir, "manifests") - if manifestDirErr != nil { - return nil, 0, manifestDirErr - } - if err := os.MkdirAll(manifestDir, 0755); err != nil { - return nil, 0, errors.Wrap(errors.ErrCodeInternal, - fmt.Sprintf("failed to create manifests directory for %s", comp.Name), err) - } - - // Sort manifest paths for deterministic output - manifestPaths := make([]string, 0, len(manifests)) - for p := range manifests { - manifestPaths = append(manifestPaths, p) - } - 
sort.Strings(manifestPaths) - - manifestsWritten := 0 - for _, manifestPath := range manifestPaths { - content := manifests[manifestPath] - filename := filepath.Base(manifestPath) - outputPath, pathErr := deployer.SafeJoin(manifestDir, filename) - if pathErr != nil { - return nil, 0, errors.New(errors.ErrCodeInvalidRequest, - fmt.Sprintf("invalid manifest filename %q in component %s", filename, comp.Name)) - } - - rendered, renderErr := manifest.Render(content, manifest.RenderInput{ - ComponentName: comp.Name, - Namespace: comp.Namespace, - ChartName: comp.ChartName, - ChartVersion: comp.ChartVersion, - Values: g.ComponentValues[comp.Name], - }) - if renderErr != nil { - return nil, 0, errors.WrapWithContext(errors.ErrCodeInternal, "failed to render manifest template", renderErr, - map[string]any{"component": comp.Name, "filename": filename}) - } - - if !hasYAMLObjects(rendered) { - slog.Debug("skipping empty manifest", "component", comp.Name, "filename", filename) - continue - } - - if err := os.WriteFile(outputPath, rendered, 0600); err != nil { - return nil, 0, errors.WrapWithContext(errors.ErrCodeInternal, "failed to write manifest", err, - map[string]any{"component": comp.Name, "filename": filename}) - } - - files = append(files, outputPath) - totalSize += int64(len(rendered)) - manifestsWritten++ - - slog.Debug("wrote manifest", "component", comp.Name, "filename", filename) - } - - // If no manifests had content, remove the empty directory and update flag - if manifestsWritten == 0 { - if rmErr := os.RemoveAll(manifestDir); rmErr != nil { - slog.Warn("failed to remove empty manifest directory", "dir", manifestDir, "error", rmErr) - } - components[i].HasManifests = false - } - } - } + out := make([]localformat.Component, 0, len(components)) + for _, c := range components { + out = append(out, localformat.Component{ + Name: c.Name, + Namespace: c.Namespace, + Repository: c.Repository, + ChartName: c.ChartName, + Version: c.Version, + IsOCI: c.IsOCI, + Tag: 
c.Tag, + Path: c.Path, + Values: values[c.Name], + DynamicPaths: dynamic[c.Name], + }) } - - return files, totalSize, nil + return out } // generateRootREADME creates the root README.md with deployment instructions. @@ -492,59 +392,17 @@ func reverseComponents(components []ComponentData) []ComponentData { return reversed } -// uniqueNamespaces returns deduplicated namespaces from Helm/Kustomize components, -// preserving order. Manifest-only components are excluded to match the previous -// behavior where namespace cleanup only occurred inside HasChart/IsKustomize branches. +// uniqueNamespaces returns deduplicated namespaces from all components, +// preserving order. Every component in the uniform local-chart format is a +// helm release with a namespace — no more per-kind filtering needed. func uniqueNamespaces(components []ComponentData) []string { seen := make(map[string]bool) var namespaces []string for _, c := range components { - if c.Namespace != "" && !seen[c.Namespace] && (c.HasChart || c.IsKustomize) { + if c.Namespace != "" && !seen[c.Namespace] { seen[c.Namespace] = true namespaces = append(namespaces, c.Namespace) } } return namespaces } - -// writeClusterValuesFile writes a cluster-values.yaml for per-cluster overrides. -// If dynamicPaths is non-empty, those paths are extracted from values and pre-populated. -// WARNING: This function mutates the values map in place (removes dynamic paths via -// RemoveValueByPath). Callers must pass a deep copy if the original map must be preserved. -// The file is always written — even when empty — so users can add any overrides. 
-func writeClusterValuesFile(values map[string]any, dynamicPaths []string, componentDir, componentName string) ([]string, int64, error) { - clusterValues := make(map[string]any) - for _, path := range dynamicPaths { - val, found := component.GetValueByPath(values, path) - if found { - component.RemoveValueByPath(values, path) - } else { - val = "" - slog.Warn("dynamic path not found in component values; introducing empty placeholder", - "component", componentName, "path", path) - } - component.SetValueByPath(clusterValues, path, val) - } - - clusterPath, clusterSize, err := deployer.WriteValuesFile(clusterValues, componentDir, "cluster-values.yaml") - if err != nil { - return nil, 0, errors.Wrap(errors.ErrCodeInternal, - fmt.Sprintf("failed to write cluster-values.yaml for %s", componentName), err) - } - - slog.Debug("wrote cluster-values.yaml", "component", componentName, "dynamic_paths", len(dynamicPaths)) - return []string{clusterPath}, clusterSize, nil -} - -// hasYAMLObjects returns true if content contains at least one YAML object -// (a non-comment, non-blank, non-separator line). -func hasYAMLObjects(content []byte) bool { - for _, line := range strings.Split(string(content), "\n") { - trimmed := strings.TrimSpace(line) - if trimmed == "" || strings.HasPrefix(trimmed, "#") || trimmed == "---" { - continue - } - return true - } - return false -} diff --git a/pkg/bundler/deployer/helm/helm_test.go b/pkg/bundler/deployer/helm/helm_test.go index 435948e8f..6b1b26b0e 100644 --- a/pkg/bundler/deployer/helm/helm_test.go +++ b/pkg/bundler/deployer/helm/helm_test.go @@ -17,9 +17,12 @@ package helm import ( "bytes" "context" + "flag" "os" "os/exec" "path/filepath" + "reflect" + "sort" "strings" "testing" "time" @@ -31,83 +34,15 @@ import ( "github.com/NVIDIA/aicr/pkg/recipe" ) +// update regenerates goldens under testdata/ when set via `go test -update`. 
+var update = flag.Bool("update", false, "update golden files") + // testDriverVersion is a test constant for driver version strings to satisfy goconst. const testDriverVersion = "570.86.16" -func TestGenerate_Success(t *testing.T) { - ctx := context.Background() - outputDir := t.TempDir() - - g := &Generator{ - RecipeResult: createTestRecipeResult(), - ComponentValues: map[string]map[string]any{ - "cert-manager": { - "crds": map[string]any{"enabled": true}, - }, - "gpu-operator": { - "driver": map[string]any{ - "enabled": true, - }, - }, - }, - Version: "v1.0.0", - } - - output, err := g.Generate(ctx, outputDir) - if err != nil { - t.Fatalf("Generate failed: %v", err) - } - - // Verify root files exist - rootFiles := []string{"README.md", "deploy.sh", "undeploy.sh"} - for _, f := range rootFiles { - path := filepath.Join(outputDir, f) - if _, statErr := os.Stat(path); os.IsNotExist(statErr) { - t.Errorf("expected root file %s does not exist", f) - } - } - - // Verify per-component directories - for _, comp := range []string{"cert-manager", "gpu-operator"} { - valuesPath := filepath.Join(outputDir, comp, "values.yaml") - if _, statErr := os.Stat(valuesPath); os.IsNotExist(statErr) { - t.Errorf("expected %s/values.yaml does not exist", comp) - } - readmePath := filepath.Join(outputDir, comp, "README.md") - if _, statErr := os.Stat(readmePath); os.IsNotExist(statErr) { - t.Errorf("expected %s/README.md does not exist", comp) - } - } - - // Verify cert-manager values contain crds.enabled - cmValues, err := os.ReadFile(filepath.Join(outputDir, "cert-manager", "values.yaml")) - if err != nil { - t.Fatalf("failed to read cert-manager values: %v", err) - } - if !strings.Contains(string(cmValues), "crds") { - t.Error("cert-manager/values.yaml missing crds section") - } - - // Verify gpu-operator values contain driver - gpuValues, err := os.ReadFile(filepath.Join(outputDir, "gpu-operator", "values.yaml")) - if err != nil { - t.Fatalf("failed to read gpu-operator values: 
%v", err) - } - if !strings.Contains(string(gpuValues), "driver") { - t.Error("gpu-operator/values.yaml missing driver") - } - - // No Chart.yaml should exist - chartPath := filepath.Join(outputDir, "Chart.yaml") - if _, statErr := os.Stat(chartPath); !os.IsNotExist(statErr) { - t.Error("Chart.yaml should not exist in per-component bundle") - } - - // Verify output has reasonable file count (3 root files + 2 component dirs × 2 files each = 7) - if len(output.Files) < 7 { - t.Errorf("expected at least 7 files, got %d", len(output.Files)) - } -} +// --------------------------------------------------------------------------- +// Smoke / basic Generate tests +// --------------------------------------------------------------------------- func TestGenerate_NilRecipeResult(t *testing.T) { ctx := context.Background() @@ -124,10 +59,10 @@ func TestGenerate_NilRecipeResult(t *testing.T) { func TestGenerate_ContextCancellation(t *testing.T) { ctx, cancel := context.WithCancel(context.Background()) - cancel() // Cancel immediately + cancel() // cancel before calling Generate g := &Generator{ - RecipeResult: createEmptyRecipeResult(), + RecipeResult: createTestRecipeResult(), ComponentValues: map[string]map[string]any{}, Version: "v1.0.0", } @@ -157,35 +92,30 @@ func TestGenerate_WithChecksums(t *testing.T) { t.Fatalf("Generate failed: %v", err) } - // Check checksums.txt exists checksumPath := filepath.Join(outputDir, "checksums.txt") if _, statErr := os.Stat(checksumPath); os.IsNotExist(statErr) { t.Error("checksums.txt does not exist") } - // Verify checksums.txt references per-component paths - checksumContent, err := os.ReadFile(checksumPath) + content, err := os.ReadFile(checksumPath) if err != nil { t.Fatalf("failed to read checksums.txt: %v", err) } - content := string(checksumContent) + str := string(content) - if !strings.Contains(content, "README.md") { - t.Error("checksums.txt missing README.md") - } - if !strings.Contains(content, "deploy.sh") { - 
t.Error("checksums.txt missing deploy.sh") - } - if !strings.Contains(content, "undeploy.sh") { - t.Error("checksums.txt missing undeploy.sh") - } - if !strings.Contains(content, filepath.Join("cert-manager", "values.yaml")) { - t.Error("checksums.txt missing cert-manager/values.yaml") + for _, want := range []string{ + "README.md", + "deploy.sh", + "undeploy.sh", + filepath.Join("001-cert-manager", "values.yaml"), + } { + if !strings.Contains(str, want) { + t.Errorf("checksums.txt missing %s", want) + } } - // Each line should have 64-char SHA256 hash - lines := strings.Split(strings.TrimSpace(content), "\n") - for _, line := range lines { + // Each line should carry a 64-char SHA256 hash. + for _, line := range strings.Split(strings.TrimSpace(str), "\n") { parts := strings.Split(line, " ") if len(parts) != 2 { t.Errorf("invalid checksum format: %s", line) @@ -196,137 +126,16 @@ func TestGenerate_WithChecksums(t *testing.T) { } } - // Verify checksums.txt is the last file (appended after generation) + // checksums.txt is appended last. 
lastFile := output.Files[len(output.Files)-1] if !strings.HasSuffix(lastFile, "checksums.txt") { t.Errorf("expected last file to be checksums.txt, got %s", lastFile) } } -func TestGenerate_WithManifests(t *testing.T) { - ctx := context.Background() - outputDir := t.TempDir() - - manifestContent := "apiVersion: v1\nkind: ConfigMap\nmetadata:\n namespace: {{ .Release.Namespace }}\n labels:\n helm.sh/chart: {{ .Chart.Name }}-{{ .Chart.Version }}\n" - - g := &Generator{ - RecipeResult: createTestRecipeResult(), - ComponentValues: map[string]map[string]any{ - "cert-manager": {}, - "gpu-operator": {}, - }, - Version: "v1.0.0", - ComponentManifests: map[string]map[string][]byte{ - "gpu-operator": { - "components/gpu-operator/manifests/dcgm-exporter.yaml": []byte(manifestContent), - }, - }, - } - - _, err := g.Generate(ctx, outputDir) - if err != nil { - t.Fatalf("Generate failed: %v", err) - } - - // Verify manifest was placed in component directory - manifestPath := filepath.Join(outputDir, "gpu-operator", "manifests", "dcgm-exporter.yaml") - if _, statErr := os.Stat(manifestPath); os.IsNotExist(statErr) { - t.Error("gpu-operator/manifests/dcgm-exporter.yaml does not exist") - } - - // Verify manifest content was rendered with ComponentData - content, err := os.ReadFile(manifestPath) - if err != nil { - t.Fatalf("failed to read manifest: %v", err) - } - rendered := string(content) - if !strings.Contains(rendered, "ConfigMap") { - t.Error("manifest missing ConfigMap kind") - } - if !strings.Contains(rendered, "namespace: gpu-operator") { - t.Errorf("manifest namespace not rendered, got: %s", rendered) - } - if !strings.Contains(rendered, "gpu-operator-25.3.3") { // normalizeVersion strips 'v' prefix for chart labels - t.Errorf("manifest chart label not rendered, got: %s", rendered) - } -} - -func TestHasYAMLObjects(t *testing.T) { - tests := []struct { - name string - content string - expected bool - }{ - {"empty", "", false}, - {"whitespace only", " \n \n", false}, - 
{"comments only", "# comment\n# another\n", false}, - {"separator only", "---\n", false}, - {"comments and separators", "# Copyright\n# License\n---\n# more comments\n", false}, - {"valid YAML", "apiVersion: v1\nkind: ConfigMap\n", true}, - {"comments then YAML", "# header\napiVersion: v1\n", true}, - {"separator then YAML", "---\napiVersion: v1\n", true}, - } - - for _, tt := range tests { - t.Run(tt.name, func(t *testing.T) { - result := hasYAMLObjects([]byte(tt.content)) - if result != tt.expected { - t.Errorf("hasYAMLObjects(%q) = %v, want %v", tt.content, result, tt.expected) - } - }) - } -} - -func TestGenerate_EmptyManifestsSkipped(t *testing.T) { - ctx := context.Background() - outputDir := t.TempDir() - - // Template that renders to empty when enabled=false - emptyTemplate := "# Comment\n{{- $cust := index .Values \"gpu-operator\" }}\n{{- if ne (toString (index $cust \"enabled\")) \"false\" }}\napiVersion: v1\nkind: ConfigMap\nmetadata:\n name: test\n{{- end }}\n" - - g := &Generator{ - RecipeResult: createTestRecipeResult(), - ComponentValues: map[string]map[string]any{ - "cert-manager": {}, - "gpu-operator": {"enabled": "false"}, - }, - Version: "v1.0.0", - ComponentManifests: map[string]map[string][]byte{ - "gpu-operator": { - "components/gpu-operator/manifests/test.yaml": []byte(emptyTemplate), - }, - }, - } - - output, err := g.Generate(ctx, outputDir) - if err != nil { - t.Fatalf("Generate failed: %v", err) - } - - // Manifest should not exist (rendered to empty) - manifestPath := filepath.Join(outputDir, "gpu-operator", "manifests", "test.yaml") - if _, statErr := os.Stat(manifestPath); !os.IsNotExist(statErr) { - t.Error("expected empty manifest to be skipped, but file exists") - } - - // Manifests dir should not exist (removed when empty) - manifestDir := filepath.Join(outputDir, "gpu-operator", "manifests") - if _, statErr := os.Stat(manifestDir); !os.IsNotExist(statErr) { - t.Error("expected empty manifests directory to be removed") - } - - // 
deploy.sh should NOT contain kubectl apply for gpu-operator manifests - deployPath := filepath.Join(outputDir, "deploy.sh") - deployContent, err := os.ReadFile(deployPath) - if err != nil { - t.Fatalf("failed to read deploy.sh: %v", err) - } - if strings.Contains(string(deployContent), "Applying manifests for gpu-operator") { - t.Error("deploy.sh should not contain manifest apply for disabled component") - } - - _ = output -} +// --------------------------------------------------------------------------- +// Deploy-script behavior tests +// --------------------------------------------------------------------------- func TestGenerate_DeployScriptExecutable(t *testing.T) { ctx := context.Background() @@ -351,170 +160,42 @@ func TestGenerate_DeployScriptExecutable(t *testing.T) { if os.IsNotExist(statErr) { t.Fatal("deploy.sh does not exist") } - - // Check executable permission (0755) - mode := info.Mode() - if mode&0111 == 0 { - t.Errorf("deploy.sh is not executable, mode: %o", mode) + if info.Mode()&0111 == 0 { + t.Errorf("deploy.sh is not executable, mode: %o", info.Mode()) } - // Verify shebang content, err := os.ReadFile(deployPath) if err != nil { t.Fatalf("failed to read deploy.sh: %v", err) } - if !strings.HasPrefix(string(content), "#!/usr/bin/env bash") { + script := string(content) + if !strings.HasPrefix(script, "#!/usr/bin/env bash") { t.Error("deploy.sh missing shebang") } - if !strings.Contains(string(content), "set -euo pipefail") { - t.Error("deploy.sh missing strict mode") - } - if !strings.Contains(string(content), "MAX_RETRIES=5") { - t.Error("deploy.sh missing default MAX_RETRIES") - } - if !strings.Contains(string(content), "backoff_seconds()") { - t.Error("deploy.sh missing backoff_seconds function") - } - if !strings.Contains(string(content), "retry()") { - t.Error("deploy.sh missing retry function") - } - if !strings.Contains(string(content), "helm_retry()") { - t.Error("deploy.sh missing helm_retry function") - } - if 
!strings.Contains(string(content), "cleanup_helm_hooks()") { - t.Error("deploy.sh missing cleanup_helm_hooks function") - } - if !strings.Contains(string(content), "HELM_TIMEOUT=") { - t.Error("deploy.sh missing HELM_TIMEOUT variable") - } - if !strings.Contains(string(content), "NO_WAIT=") { - t.Error("deploy.sh missing NO_WAIT variable") - } - if !strings.Contains(string(content), "--retries") { - t.Error("deploy.sh missing --retries flag handling") - } -} - -func TestGenerate_DeployScriptFinalReadinessNote(t *testing.T) { - ctx := context.Background() - outputDir := t.TempDir() - - g := &Generator{ - RecipeResult: createTestRecipeResult(), - ComponentValues: map[string]map[string]any{ - "cert-manager": {}, - "gpu-operator": {}, - }, - Version: "v1.0.0", - } - - if _, err := g.Generate(ctx, outputDir); err != nil { - t.Fatalf("Generate failed: %v", err) - } - - content, err := os.ReadFile(filepath.Join(outputDir, "deploy.sh")) - if err != nil { - t.Fatalf("failed to read deploy.sh: %v", err) - } - script := string(content) - - // The success-path status line must still be present — existing CI log - // matchers rely on it for full-success runs. - if !strings.Contains(script, `echo "Deployment complete."`) { - t.Error(`deploy.sh missing "Deployment complete." line`) - } - if !strings.Contains(script, `echo "Deployment completed with non-fatal errors (--best-effort)."`) { - t.Error(`deploy.sh missing partial-failure status line`) - } - // The final message must distinguish install completion from workload - // readiness so users don't read success as "ready for GPU workloads". 
- wantPhrases := []string{ - "The above status reflects Helm install and manifest apply results", - "not whether the cluster is ready for GPU workloads", - "cluster convergence may continue asynchronously", - "Nodewright", - "GPU operator operand rollout", - "DRA kubelet plugin", - } - for _, p := range wantPhrases { - if !strings.Contains(script, p) { - t.Errorf("deploy.sh final note missing phrase: %q", p) + // Assertions on the orchestration script's structural markers. The inner + // per-component `helm upgrade --install` now lives in each folder's + // install.sh (rendered by localformat), so the structural markers here are + // the generic install loop, retry helpers, and flag handling. + for _, want := range []string{ + "set -euo pipefail", + "MAX_RETRIES=5", + "backoff_seconds()", + "cleanup_helm_hooks()", + "HELM_TIMEOUT=", + "NO_WAIT=", + "--retries", + "ASYNC_COMPONENTS=", // async-skip policy lives here + "bash install.sh", // generic install loop invokes each folder's install.sh + } { + if !strings.Contains(script, want) { + t.Errorf("deploy.sh missing %q", want) } } - // Both final status lines must precede the shared readiness note. - doneIdx := strings.Index(script, `echo "Deployment complete."`) - bestEffortIdx := strings.Index(script, `echo "Deployment completed with non-fatal errors (--best-effort)."`) - noteIdx := strings.Index(script, "not whether the cluster is ready for GPU workloads") - if doneIdx < 0 || bestEffortIdx < 0 || noteIdx < 0 { - t.Fatal("unexpected: status line or readiness note missing indices") - } - if doneIdx >= noteIdx { - t.Error(`"Deployment complete." 
must come before the readiness note`) - } - if bestEffortIdx >= noteIdx { - t.Error(`partial-failure status line must come before the readiness note`) - } } -func TestGenerate_DeployScriptKaiSchedulerTimeout(t *testing.T) { - ctx := context.Background() - outputDir := t.TempDir() - - g := &Generator{ - RecipeResult: &recipe.RecipeResult{ - Kind: "RecipeResult", - APIVersion: "aicr.nvidia.com/v1alpha1", - ComponentRefs: []recipe.ComponentRef{ - { - Name: "kai-scheduler", - Namespace: "kai-scheduler", - Chart: "kai-scheduler", - Version: "v0.13.0", - Type: recipe.ComponentTypeHelm, - Source: "oci://ghcr.io/nvidia/kai-scheduler", - }, - }, - DeploymentOrder: []string{"kai-scheduler"}, - }, - ComponentValues: map[string]map[string]any{ - "kai-scheduler": {}, - }, - Version: "v1.0.0", - } - - _, err := g.Generate(ctx, outputDir) - if err != nil { - t.Fatalf("Generate failed: %v", err) - } - - content, err := os.ReadFile(filepath.Join(outputDir, "deploy.sh")) - if err != nil { - t.Fatalf("failed to read deploy.sh: %v", err) - } - script := string(content) - - // kai-scheduler should get a custom 20m timeout override - if !strings.Contains(script, `COMPONENT_HELM_TIMEOUT="20m"`) { - t.Error("deploy.sh missing kai-scheduler 20m timeout override") - } - // Other components should use the default HELM_TIMEOUT - if !strings.Contains(script, `COMPONENT_HELM_TIMEOUT="${HELM_TIMEOUT}"`) { - t.Error("deploy.sh missing default COMPONENT_HELM_TIMEOUT") - } - // kai-scheduler should use a reduced retry budget to fail faster on slow hooks - if !strings.Contains(script, `COMPONENT_MAX_RETRIES="1"`) { - t.Error("deploy.sh missing kai-scheduler retry override") - } - if !strings.Contains(script, `dump_kai_scheduler_helm_diagnostics "${namespace}"`) { - t.Error("deploy.sh missing kai-scheduler diagnostics hook") - } - if !strings.Contains(script, `kubectl get jobs -n "${namespace}"`) { - t.Error("deploy.sh missing job diagnostics") - } - if !strings.Contains(script, `kubectl describe 
pods -n "${namespace}"`) { - t.Error("deploy.sh missing pod diagnostics") - } -} +// --------------------------------------------------------------------------- +// Undeploy-script behavior tests +// --------------------------------------------------------------------------- func TestGenerate_UndeployScriptExecutable(t *testing.T) { ctx := context.Background() @@ -529,8 +210,7 @@ func TestGenerate_UndeployScriptExecutable(t *testing.T) { Version: "v1.0.0", } - _, err := g.Generate(ctx, outputDir) - if err != nil { + if _, err := g.Generate(ctx, outputDir); err != nil { t.Fatalf("Generate failed: %v", err) } @@ -539,19 +219,11 @@ func TestGenerate_UndeployScriptExecutable(t *testing.T) { if os.IsNotExist(statErr) { t.Fatal("undeploy.sh does not exist") } - - // Check executable permission (0755) - mode := info.Mode() - if mode&0111 == 0 { - t.Errorf("undeploy.sh is not executable, mode: %o", mode) + if info.Mode()&0111 == 0 { + t.Errorf("undeploy.sh is not executable, mode: %o", info.Mode()) } - // Verify content - content, err := os.ReadFile(undeployPath) - if err != nil { - t.Fatalf("failed to read undeploy.sh: %v", err) - } - script := string(content) + script := readFile(t, undeployPath) if !strings.HasPrefix(script, "#!/usr/bin/env bash") { t.Error("undeploy.sh missing shebang") @@ -563,70 +235,29 @@ func TestGenerate_UndeployScriptExecutable(t *testing.T) { t.Error("undeploy.sh missing helm uninstall command") } - // Verify reverse order: gpu-operator should appear before cert-manager - gpuIdx := strings.Index(script, "Uninstalling gpu-operator") - certIdx := strings.Index(script, "Uninstalling cert-manager") + // Verify reverse uninstall order: gpu-operator before cert-manager in the + // rendered "Uninstalling…" report lines. 
+ gpuIdx := strings.Index(script, "gpu-operator") + certIdx := strings.Index(script, "cert-manager") if gpuIdx < 0 || certIdx < 0 { - t.Fatal("undeploy.sh missing component uninstall lines") + t.Fatal("undeploy.sh missing component names") } if gpuIdx > certIdx { - t.Error("undeploy.sh components not in reverse order: gpu-operator should come before cert-manager") + t.Error("undeploy.sh: gpu-operator should appear before cert-manager (reverse order)") } - // Verify --delete-pvcs flag defaults to off + // --delete-pvcs flag defaults to off and is guarded. if !strings.Contains(script, "DELETE_PVCS=false") { t.Error("undeploy.sh missing DELETE_PVCS=false default") } if !strings.Contains(script, "--delete-pvcs") { t.Error("undeploy.sh missing --delete-pvcs flag handling") } - - // Verify PVC deletion is guarded by the flag if !strings.Contains(script, `"${DELETE_PVCS}" == "true"`) { t.Error("undeploy.sh PVC deletion not guarded by DELETE_PVCS flag") } - // Verify no unconditional PVC deletion inside per-component loop - // PVC deletion should only appear in the namespace cleanup section - lines := strings.Split(script, "\n") - inComponentLoop := false - for _, line := range lines { - trimmed := strings.TrimSpace(line) - if strings.Contains(trimmed, "Uninstalling") && strings.Contains(trimmed, "echo") { - inComponentLoop = true - } - if strings.Contains(trimmed, "Clean up namespaces") { - inComponentLoop = false - } - if inComponentLoop && strings.Contains(trimmed, "kubectl delete pvc") { - t.Error("undeploy.sh has unconditional PVC deletion inside per-component loop") - } - } - - // Verify webhook cleanup runs both before and after namespace deletion - nsCleanupIdx := strings.Index(script, "Clean up namespaces") - nsTermIdx := strings.Index(script, "Waiting for namespaces to terminate") - finalWebhookIdx := strings.Index(script, "Final webhook cleanup") - if nsCleanupIdx < 0 || nsTermIdx < 0 || finalWebhookIdx < 0 { - t.Fatal("undeploy.sh missing expected section 
markers") - } - - // Webhook cleanup should appear in namespace cleanup section (before delete_namespace) - betweenCleanupAndTerm := script[nsCleanupIdx:nsTermIdx] - if !strings.Contains(betweenCleanupAndTerm, "delete_orphaned_webhooks_for_ns") { - t.Error("undeploy.sh missing pre-namespace-deletion webhook cleanup") - } - - // Final webhook cleanup should appear after namespace termination wait - if finalWebhookIdx < nsTermIdx { - t.Error("undeploy.sh final webhook cleanup should run after namespace termination wait") - } - afterTermWait := script[nsTermIdx:] - if !strings.Contains(afterTermWait, "delete_orphaned_webhooks_for_ns") { - t.Error("undeploy.sh missing post-namespace-deletion webhook cleanup") - } - - // Verify jq is a hard requirement (not a soft check) + // jq is a hard requirement for CRD/finalizer inspection. if strings.Contains(script, "HAS_JQ") { t.Error("undeploy.sh should not use HAS_JQ soft check; jq must be a hard requirement") } @@ -634,9 +265,9 @@ func TestGenerate_UndeployScriptExecutable(t *testing.T) { t.Error("undeploy.sh missing jq availability check") } - // Verify pre-flight check exists and runs before component uninstall + // Pre-flight exists and runs before component uninstall. 
preflightIdx := strings.Index(script, "Pre-flight checks") - uninstallIdx := strings.Index(script, "Uninstall components in reverse order") + uninstallIdx := strings.Index(script, "Uninstall components in reverse install order") if preflightIdx < 0 { t.Fatal("undeploy.sh missing pre-flight checks section") } @@ -644,10 +275,9 @@ func TestGenerate_UndeployScriptExecutable(t *testing.T) { t.Fatal("undeploy.sh missing component uninstall section") } if preflightIdx > uninstallIdx { - t.Error("undeploy.sh pre-flight checks must run before component uninstall") + t.Error("undeploy.sh pre-flight must run before component uninstall") } - // Verify pre-flight uses functions and exits on failure preflightSection := script[preflightIdx:uninstallIdx] if !strings.Contains(preflightSection, "check_release_for_stuck_crds") { t.Error("undeploy.sh pre-flight should call check_release_for_stuck_crds") @@ -655,8 +285,6 @@ func TestGenerate_UndeployScriptExecutable(t *testing.T) { if !strings.Contains(preflightSection, "PREFLIGHT_DETAILS") || !strings.Contains(preflightSection, "exit 1") { t.Error("undeploy.sh pre-flight should detect stuck CRs and exit on failure") } - - // Verify pre-flight checks each Helm component with both release name and namespace args if !strings.Contains(preflightSection, `check_release_for_stuck_crds "gpu-operator" "gpu-operator"`) { t.Error("undeploy.sh pre-flight missing check for gpu-operator with namespace") } @@ -664,28 +292,26 @@ func TestGenerate_UndeployScriptExecutable(t *testing.T) { t.Error("undeploy.sh pre-flight missing check for cert-manager with namespace") } - // Verify helper functions exist and use helm get manifest - if !strings.Contains(script, "check_crd_for_stuck_resources()") { - t.Error("undeploy.sh missing check_crd_for_stuck_resources function") - } - if !strings.Contains(script, "check_release_for_stuck_crds()") { - t.Error("undeploy.sh missing check_release_for_stuck_crds function") - } - if !strings.Contains(script, "helm get 
manifest") { - t.Error("undeploy.sh should use helm get manifest for CRD discovery") + // Helper functions and Helm manifest discovery. + for _, want := range []string{ + "check_crd_for_stuck_resources()", + "check_release_for_stuck_crds()", + "helm get manifest", + "CRDs stuck in deleting state", + } { + if !strings.Contains(script, want) { + t.Errorf("undeploy.sh missing %q", want) + } } - // Verify stuck CRD handling warns instead of force-clearing + // Stuck CRDs: script must NOT silently force-clear finalizers. if strings.Contains(script, "Force-clearing finalizers on stuck CRD") { t.Error("undeploy.sh should warn about stuck CRDs, not silently force-clear finalizers") } - if !strings.Contains(script, "CRDs stuck in deleting state") { - t.Error("undeploy.sh missing warning about stuck CRDs") - } - // Verify API-group discovery is not reused for destructive cleanup. - // Without bundle-specific ownership metadata, deleting CRDs by group can - // remove another tenant's CRD on a shared cluster. + // API-group based destructive cleanup is forbidden; only ownership-safe paths. + // deploy.sh may build ORPHANED_CRD_GROUPS for read-only pre-flight warnings, + // but undeploy.sh must never use it for destructive deletion. 
if strings.Contains(script, "ORPHANED_CRD_GROUPS=") { t.Error("undeploy.sh should not build group-based CRD delete lists") } @@ -694,6 +320,10 @@ func TestGenerate_UndeployScriptExecutable(t *testing.T) { } } +// --------------------------------------------------------------------------- +// Property tests (helpers and data-shape preservation) +// --------------------------------------------------------------------------- + func TestUniqueNamespaces(t *testing.T) { tests := []struct { name string @@ -703,28 +333,28 @@ func TestUniqueNamespaces(t *testing.T) { { name: "deduplicates shared namespaces", components: []ComponentData{ - {Name: "prometheus-adapter", Namespace: "monitoring", HasChart: true}, - {Name: "k8s-ephemeral", Namespace: "monitoring", HasChart: true}, - {Name: "kube-prometheus", Namespace: "monitoring", HasChart: true}, - {Name: "gpu-operator", Namespace: "gpu-operator", HasChart: true}, + {Name: "prometheus-adapter", Namespace: "monitoring"}, + {Name: "k8s-ephemeral", Namespace: "monitoring"}, + {Name: "kube-prometheus", Namespace: "monitoring"}, + {Name: "gpu-operator", Namespace: "gpu-operator"}, }, expected: []string{"monitoring", "gpu-operator"}, }, { - name: "excludes manifest-only components", + name: "preserves order", components: []ComponentData{ - {Name: "my-manifests", Namespace: "custom-ns", HasManifests: true}, - {Name: "gpu-operator", Namespace: "gpu-operator", HasChart: true}, + {Name: "a", Namespace: "ns-a"}, + {Name: "b", Namespace: "ns-b"}, }, - expected: []string{"gpu-operator"}, + expected: []string{"ns-a", "ns-b"}, }, { - name: "includes kustomize components", + name: "drops empty namespaces", components: []ComponentData{ - {Name: "my-kustomize", Namespace: "kustomize-ns", IsKustomize: true}, - {Name: "gpu-operator", Namespace: "gpu-operator", HasChart: true}, + {Name: "no-ns", Namespace: ""}, + {Name: "with-ns", Namespace: "real"}, }, - expected: []string{"kustomize-ns", "gpu-operator"}, + expected: []string{"real"}, }, { name: 
"empty input", @@ -747,153 +377,53 @@ func TestUniqueNamespaces(t *testing.T) { } } -func TestNormalizeVersionWithDefault(t *testing.T) { +func TestReverseComponents(t *testing.T) { tests := []struct { - input string - expected string + name string + input []ComponentData + wantLen int + wantName string }{ - {"v1.0.0", "1.0.0"}, - {"1.0.0", "1.0.0"}, - {"v0.1.0-alpha", "0.1.0-alpha"}, - {"", "0.1.0"}, - } - - for _, tt := range tests { - t.Run(tt.input, func(t *testing.T) { - result := deployer.NormalizeVersionWithDefault(tt.input) - if result != tt.expected { - t.Errorf("NormalizeVersionWithDefault(%q) = %q, want %q", tt.input, result, tt.expected) - } - }) - } -} - -func TestSortComponentNamesByDeploymentOrder(t *testing.T) { - const ( - certManager = "cert-manager" - gpuOperator = "gpu-operator" - networkOperator = "network-operator" - ) - - t.Run("all in order map", func(t *testing.T) { - components := []string{gpuOperator, certManager, networkOperator} - deploymentOrder := []string{certManager, gpuOperator, networkOperator} - - sorted := deployer.SortComponentNamesByDeploymentOrder(components, deploymentOrder) - - if sorted[0] != certManager { - t.Errorf("expected first %s, got %s", certManager, sorted[0]) - } - if sorted[1] != gpuOperator { - t.Errorf("expected second %s, got %s", gpuOperator, sorted[1]) - } - if sorted[2] != networkOperator { - t.Errorf("expected third %s, got %s", networkOperator, sorted[2]) - } - }) - - t.Run("only one in order map", func(t *testing.T) { - // "alpha" is not in the order map, gpuOperator is. - // gpuOperator should come first (okI branch). 
- components := []string{"alpha", gpuOperator} - deploymentOrder := []string{gpuOperator} - - sorted := deployer.SortComponentNamesByDeploymentOrder(components, deploymentOrder) - if sorted[0] != gpuOperator { - t.Errorf("expected ordered component first, got %s", sorted[0]) - } - }) - - t.Run("only j in order map", func(t *testing.T) { - // "zebra" is not in the order map, certManager is. - // certManager should sort after "zebra" would normally, but since - // certManager is in the map and zebra is not, certManager gets priority=false (okJ branch). - components := []string{certManager, "zebra"} - deploymentOrder := []string{certManager} - - sorted := deployer.SortComponentNamesByDeploymentOrder(components, deploymentOrder) - if sorted[0] != certManager { - t.Errorf("expected ordered component first, got %s", sorted[0]) - } - }) - - t.Run("neither in order map", func(t *testing.T) { - // Both unknown — should fall back to alphabetical. - components := []string{"zebra", "alpha"} - deploymentOrder := []string{gpuOperator} - - sorted := deployer.SortComponentNamesByDeploymentOrder(components, deploymentOrder) - if sorted[0] != "alpha" { - t.Errorf("expected alphabetical first, got %s", sorted[0]) - } - if sorted[1] != "zebra" { - t.Errorf("expected alphabetical second, got %s", sorted[1]) - } - }) - - t.Run("empty deployment order", func(t *testing.T) { - components := []string{"b", "a"} - sorted := deployer.SortComponentNamesByDeploymentOrder(components, nil) - if sorted[0] != "b" { - t.Errorf("expected original order preserved with empty order, got %s", sorted[0]) - } - }) -} - -func TestIsSafePathComponent(t *testing.T) { - tests := []struct { - name string - input string - expected bool - }{ - {"valid component name", "gpu-operator", true}, - {"valid with dots", "cert-manager", true}, - {"empty string", "", false}, - {"path traversal", "../etc/passwd", false}, - {"double dot", "..", false}, - {"forward slash", "gpu/operator", false}, - {"backslash", 
"gpu\\operator", false}, - {"embedded double dot", "foo..bar", false}, - {"leading dot dot slash", "../foo", false}, + { + name: "empty", + input: []ComponentData{}, + wantLen: 0, + }, + { + name: "single", + input: []ComponentData{{Name: "a"}}, + wantLen: 1, + wantName: "a", + }, + { + name: "multiple", + input: []ComponentData{ + {Name: "a"}, + {Name: "b"}, + {Name: "c"}, + }, + wantLen: 3, + wantName: "c", + }, } for _, tt := range tests { t.Run(tt.name, func(t *testing.T) { - result := deployer.IsSafePathComponent(tt.input) - if result != tt.expected { - t.Errorf("deployer.IsSafePathComponent(%q) = %v, want %v", tt.input, result, tt.expected) - } - }) - } -} - -func TestSafeJoin(t *testing.T) { - baseDir := t.TempDir() + original := make([]ComponentData, len(tt.input)) + copy(original, tt.input) - tests := []struct { - name string - dir string - input string - wantErr bool - }{ - {"valid component", baseDir, "gpu-operator", false}, - {"valid with dots", baseDir, "cert-manager", false}, - {"path traversal", baseDir, "../etc/passwd", true}, - {"double dot", baseDir, "..", true}, - {"absolute path rejected", baseDir, "/etc/passwd", true}, - {"empty name", baseDir, "", false}, // empty joins to baseDir itself - {"relative base", ".", "gpu-operator", false}, - } + result := reverseComponents(tt.input) - for _, tt := range tests { - t.Run(tt.name, func(t *testing.T) { - result, err := deployer.SafeJoin(tt.dir, tt.input) - if (err != nil) != tt.wantErr { - t.Errorf("deployer.SafeJoin(%q, %q) error = %v, wantErr %v", tt.dir, tt.input, err, tt.wantErr) - return + if len(result) != tt.wantLen { + t.Fatalf("len = %d, want %d", len(result), tt.wantLen) + } + if tt.wantLen > 0 && result[0].Name != tt.wantName { + t.Errorf("first element = %q, want %q", result[0].Name, tt.wantName) } - if err == nil && result == "" { - t.Errorf("deployer.SafeJoin(%q, %q) returned empty path", tt.dir, tt.input) + for i, comp := range tt.input { + if comp.Name != original[i].Name { + 
t.Errorf("original[%d] mutated: got %q, want %q", i, comp.Name, original[i].Name) + } } }) } @@ -960,891 +490,254 @@ func TestBuildComponentDataList_NamespaceAndChart(t *testing.T) { } } -func TestGenerate_KustomizeOnly(t *testing.T) { +// TestNormalizeVersionWithDefault / TestSortComponentNamesByDeploymentOrder / +// TestIsSafePathComponent / TestSafeJoin live in pkg/bundler/deployer; not +// duplicated here. + +// --------------------------------------------------------------------------- +// Determinism and no-timestamp +// --------------------------------------------------------------------------- + +// TestGenerate_Reproducible verifies bundle generation is deterministic. +func TestGenerate_Reproducible(t *testing.T) { ctx := context.Background() - outputDir := t.TempDir() g := &Generator{ - RecipeResult: createKustomizeRecipeResult(), + RecipeResult: createTestRecipeResult(), ComponentValues: map[string]map[string]any{ - "my-kustomize-app": {}, + "cert-manager": {"crds": map[string]any{"enabled": true}}, + "gpu-operator": { + "driver": map[string]any{"enabled": true}, + }, }, Version: "v1.0.0", } - output, err := g.Generate(ctx, outputDir) - if err != nil { - t.Fatalf("Generate failed: %v", err) - } + var fileContents [2]map[string]string - // Verify root files exist - for _, f := range []string{"README.md", "deploy.sh", "undeploy.sh"} { - path := filepath.Join(outputDir, f) - if _, statErr := os.Stat(path); os.IsNotExist(statErr) { - t.Errorf("expected root file %s does not exist", f) + for i := 0; i < 2; i++ { + outputDir := t.TempDir() + if _, err := g.Generate(ctx, outputDir); err != nil { + t.Fatalf("iteration %d: Generate() error = %v", i, err) } - } - - // Verify component directory exists with README - readmePath := filepath.Join(outputDir, "my-kustomize-app", "README.md") - if _, statErr := os.Stat(readmePath); os.IsNotExist(statErr) { - t.Error("expected my-kustomize-app/README.md does not exist") - } - - // deploy.sh should contain kustomize build, 
NOT helm upgrade - deployContent, err := os.ReadFile(filepath.Join(outputDir, "deploy.sh")) - if err != nil { - t.Fatalf("failed to read deploy.sh: %v", err) - } - deployScript := string(deployContent) - - if !strings.Contains(deployScript, "kustomize build") { - t.Error("deploy.sh missing kustomize build command") - } - if strings.Contains(deployScript, "helm upgrade") { - t.Error("deploy.sh should not contain helm upgrade for kustomize-only bundle") - } - if !strings.Contains(deployScript, "via kustomize") { - t.Error("deploy.sh should indicate kustomize deployment") - } - if !strings.Contains(deployScript, "ref=v1.0.0") { - t.Error("deploy.sh should contain kustomize tag ref") - } - if !strings.Contains(deployScript, "deploy/production") { - t.Error("deploy.sh should contain kustomize path") - } - - // undeploy.sh should contain kustomize build for deletion - undeployContent, err := os.ReadFile(filepath.Join(outputDir, "undeploy.sh")) - if err != nil { - t.Fatalf("failed to read undeploy.sh: %v", err) - } - undeployScript := string(undeployContent) - if !strings.Contains(undeployScript, "kustomize build") { - t.Error("undeploy.sh missing kustomize build command") - } - if strings.Contains(undeployScript, "helm_force_uninstall \"my-kustomize-app\"") { - t.Error("undeploy.sh should not call helm_force_uninstall for kustomize-only bundle") + fileContents[i] = make(map[string]string) + err := filepath.Walk(outputDir, func(path string, info os.FileInfo, walkErr error) error { + if walkErr != nil { + return walkErr + } + if info.IsDir() { + return nil + } + content, readErr := os.ReadFile(path) + if readErr != nil { + return readErr + } + relPath, _ := filepath.Rel(outputDir, path) + fileContents[i][relPath] = string(content) + return nil + }) + if err != nil { + t.Fatalf("iteration %d: failed to walk directory: %v", i, err) + } } - // Component README should show kustomize instructions - compReadme, err := os.ReadFile(readmePath) - if err != nil { - t.Fatalf("failed 
to read component README: %v", err) - } - compReadmeStr := string(compReadme) - if !strings.Contains(compReadmeStr, "kustomize build") { - t.Error("component README should contain kustomize build instructions") - } - if strings.Contains(compReadmeStr, "helm upgrade") { - t.Error("component README should not contain helm commands for kustomize component") + if len(fileContents[0]) != len(fileContents[1]) { + t.Errorf("different number of files: iteration 1 has %d, iteration 2 has %d", + len(fileContents[0]), len(fileContents[1])) } - - if len(output.Files) < 4 { - t.Errorf("expected at least 4 files, got %d", len(output.Files)) + for filename, content1 := range fileContents[0] { + content2, exists := fileContents[1][filename] + if !exists { + t.Errorf("file %s exists in iteration 1 but not iteration 2", filename) + continue + } + if content1 != content2 { + t.Errorf("file %s has different content between iterations:\n--- iteration 1 ---\n%s\n--- iteration 2 ---\n%s", + filename, content1, content2) + } } } -func TestGenerate_MixedHelmAndKustomize(t *testing.T) { +// TestGenerate_NoTimestampInOutput verifies no timestamps are embedded. 
+func TestGenerate_NoTimestampInOutput(t *testing.T) { ctx := context.Background() outputDir := t.TempDir() g := &Generator{ - RecipeResult: createMixedRecipeResult(), + RecipeResult: createTestRecipeResult(), ComponentValues: map[string]map[string]any{ - "cert-manager": {"crds": map[string]any{"enabled": true}}, - "my-kustomize-app": {}, + "cert-manager": {}, + "gpu-operator": {}, }, Version: "v1.0.0", } - output, err := g.Generate(ctx, outputDir) - if err != nil { - t.Fatalf("Generate failed: %v", err) - } - - // Verify both component directories exist - for _, comp := range []string{"cert-manager", "my-kustomize-app"} { - readmePath := filepath.Join(outputDir, comp, "README.md") - if _, statErr := os.Stat(readmePath); os.IsNotExist(statErr) { - t.Errorf("expected %s/README.md does not exist", comp) - } - } - - // deploy.sh should contain BOTH helm and kustomize commands - deployContent, err := os.ReadFile(filepath.Join(outputDir, "deploy.sh")) - if err != nil { - t.Fatalf("failed to read deploy.sh: %v", err) + if _, err := g.Generate(ctx, outputDir); err != nil { + t.Fatalf("Generate() error = %v", err) } - deployScript := string(deployContent) - if !strings.Contains(deployScript, "helm upgrade") { - t.Error("deploy.sh missing helm upgrade for Helm component") - } - if !strings.Contains(deployScript, "kustomize build") { - t.Error("deploy.sh missing kustomize build for Kustomize component") + timestampPatterns := []string{ + "GeneratedAt:", + "generated_at:", + "timestamp:", + "Timestamp:", } - // undeploy.sh should contain BOTH helm and kustomize commands - undeployContent, err := os.ReadFile(filepath.Join(outputDir, "undeploy.sh")) + err := filepath.Walk(outputDir, func(path string, info os.FileInfo, walkErr error) error { + if walkErr != nil { + return walkErr + } + if info.IsDir() { + return nil + } + content, readErr := os.ReadFile(path) + if readErr != nil { + return readErr + } + s := string(content) + relPath, _ := filepath.Rel(outputDir, path) + for _, 
pattern := range timestampPatterns { + if strings.Contains(s, pattern) { + t.Errorf("file %s contains timestamp pattern %q", relPath, pattern) + } + } + return nil + }) if err != nil { - t.Fatalf("failed to read undeploy.sh: %v", err) + t.Fatalf("failed to walk directory: %v", err) } - undeployScript := string(undeployContent) +} - if !strings.Contains(undeployScript, "helm uninstall") { - t.Error("undeploy.sh missing helm uninstall for Helm component") - } - if !strings.Contains(undeployScript, "kustomize build") { - t.Error("undeploy.sh missing kustomize build for Kustomize component") - } +// --------------------------------------------------------------------------- +// Internal generators (generateDeployScript, generateUndeployScript) +// --------------------------------------------------------------------------- - // Root README should show both types - rootReadme, err := os.ReadFile(filepath.Join(outputDir, "README.md")) - if err != nil { - t.Fatalf("failed to read README.md: %v", err) - } - rootReadmeStr := string(rootReadme) - if !strings.Contains(rootReadmeStr, "Helm") { - t.Error("root README should indicate Helm type") - } - if !strings.Contains(rootReadmeStr, "Kustomize") { - t.Error("root README should indicate Kustomize type") +// TestGenerateDeployScript_ContextCanceled exercises the early-return +// ctx.Err() check inside generateDeployScript. Generate() short-circuits at +// localformat.Write before reaching the helpers, so the helper's own ctx +// guard requires a direct call to cover. 
+func TestGenerateDeployScript_ContextCanceled(t *testing.T) { + ctx, cancel := context.WithCancel(context.Background()) + cancel() + g := &Generator{Version: "v1.0.0"} + if _, _, err := g.generateDeployScript(ctx, nil, t.TempDir()); err == nil { + t.Fatal("expected error on canceled context") } +} - if len(output.Files) < 7 { - t.Errorf("expected at least 7 files, got %d", len(output.Files)) +// TestGenerateUndeployScript_ContextCanceled — counterpart for undeploy. +func TestGenerateUndeployScript_ContextCanceled(t *testing.T) { + ctx, cancel := context.WithCancel(context.Background()) + cancel() + g := &Generator{Version: "v1.0.0"} + if _, _, err := g.generateUndeployScript(ctx, nil, t.TempDir()); err == nil { + t.Fatal("expected error on canceled context") } } -func TestBuildComponentDataList_Kustomize(t *testing.T) { +// TestGenerate_InvalidOutputDir verifies Generate fails cleanly when the +// supplied outputDir cannot be created (parent directory does not exist +// and isn't writable). Other Generate-level error paths (nil RecipeResult, +// canceled context) are covered by their own focused tests. 
+func TestGenerate_InvalidOutputDir(t *testing.T) { g := &Generator{ - RecipeResult: &recipe.RecipeResult{ - ComponentRefs: []recipe.ComponentRef{ - { - Name: "my-kustomize-app", - Namespace: "my-app", - Type: recipe.ComponentTypeKustomize, - Source: "https://github.com/example/repo", - Tag: "v2.0.0", - Path: "deploy/production", - }, - }, + RecipeResult: createTestRecipeResult(), + ComponentValues: map[string]map[string]any{ + "cert-manager": {}, + "gpu-operator": {}, }, + Version: "v1.0.0", } - components, err := g.buildComponentDataList() - if err != nil { - t.Fatalf("buildComponentDataList failed: %v", err) - } - - if len(components) != 1 { - t.Fatalf("expected 1 component, got %d", len(components)) - } - - comp := components[0] - if !comp.IsKustomize { - t.Error("expected IsKustomize to be true") - } - if comp.HasChart { - t.Error("expected HasChart to be false for kustomize component") - } - if comp.Tag != "v2.0.0" { - t.Errorf("expected Tag v2.0.0, got %s", comp.Tag) - } - if comp.Path != "deploy/production" { - t.Errorf("expected Path deploy/production, got %s", comp.Path) - } - if comp.Repository != "https://github.com/example/repo" { - t.Errorf("expected Repository https://github.com/example/repo, got %s", comp.Repository) + // /nonexistent/path/... requires creating /nonexistent/ under /, which + // an unprivileged process cannot write to. 
+ _, err := g.Generate(context.Background(), "/nonexistent/path/that/does/not/exist") + if err == nil { + t.Fatal("expected error on uncreatable output directory, got nil") } } -func TestBuildComponentDataList_MixedTypes(t *testing.T) { - g := &Generator{ - RecipeResult: &recipe.RecipeResult{ - ComponentRefs: []recipe.ComponentRef{ - { - Name: "cert-manager", - Namespace: "cert-manager", - Chart: "cert-manager", - Type: recipe.ComponentTypeHelm, - Version: "v1.17.2", - Source: "https://charts.jetstack.io", - }, - { - Name: "my-kustomize-app", - Namespace: "my-app", - Type: recipe.ComponentTypeKustomize, - Source: "https://github.com/example/repo", - Tag: "v2.0.0", - Path: "deploy/production", - }, +// --------------------------------------------------------------------------- +// Dynamic values +// --------------------------------------------------------------------------- + +func TestGenerate_DynamicValues(t *testing.T) { + tests := []struct { + name string + dynamicValues map[string][]string + componentValues map[string]map[string]any + wantClusterContains string // substring expected in gpu-operator/cluster-values.yaml + wantValuesLacksPath string // substring that should NOT be in gpu-operator/values.yaml + }{ + { + name: "no dynamic values — cluster-values.yaml still generated (empty)", + dynamicValues: nil, + componentValues: map[string]map[string]any{ + "cert-manager": {"crds": map[string]any{"enabled": true}}, + "gpu-operator": {"driver": map[string]any{"version": testDriverVersion, "enabled": true}}, }, }, + { + name: "dynamic values present — extracted into cluster-values.yaml", + dynamicValues: map[string][]string{ + "gpu-operator": {"driver.version"}, + }, + componentValues: map[string]map[string]any{ + "cert-manager": {"crds": map[string]any{"enabled": true}}, + "gpu-operator": {"driver": map[string]any{"version": testDriverVersion, "enabled": true}}, + }, + wantClusterContains: "version", + wantValuesLacksPath: `version: "570.86.16"`, + }, + { + name: 
"dynamic path not in values", + dynamicValues: map[string][]string{ + "gpu-operator": {"nonexistent.path"}, + }, + componentValues: map[string]map[string]any{ + "cert-manager": {"crds": map[string]any{"enabled": true}}, + "gpu-operator": {"driver": map[string]any{"enabled": true}}, + }, + wantClusterContains: "nonexistent", + }, } - components, err := g.buildComponentDataList() - if err != nil { - t.Fatalf("buildComponentDataList failed: %v", err) - } - - if len(components) != 2 { - t.Fatalf("expected 2 components, got %d", len(components)) - } + for _, tt := range tests { + t.Run(tt.name, func(t *testing.T) { + ctx := context.Background() + outputDir := t.TempDir() - for _, comp := range components { - switch comp.Name { - case "cert-manager": - if comp.IsKustomize { - t.Error("cert-manager should not be kustomize") + g := &Generator{ + RecipeResult: createTestRecipeResult(), + ComponentValues: tt.componentValues, + Version: "v1.0.0", + DynamicValues: tt.dynamicValues, } - if !comp.HasChart { - t.Error("cert-manager should have HasChart=true") + if _, err := g.Generate(ctx, outputDir); err != nil { + t.Fatalf("Generate failed: %v", err) } - case "my-kustomize-app": - if !comp.IsKustomize { - t.Error("my-kustomize-app should be kustomize") + + gpuCluster := filepath.Join(outputDir, "002-gpu-operator", "cluster-values.yaml") + if _, err := os.Stat(gpuCluster); os.IsNotExist(err) { + t.Fatal("gpu-operator/cluster-values.yaml should always exist") } - if comp.HasChart { - t.Error("my-kustomize-app should have HasChart=false") + if tt.wantClusterContains != "" { + content := readFile(t, gpuCluster) + if !strings.Contains(content, tt.wantClusterContains) { + t.Errorf("cluster-values.yaml missing %q, got:\n%s", tt.wantClusterContains, content) + } } - } - } -} - -// Helper functions - -func createKustomizeRecipeResult() *recipe.RecipeResult { - return &recipe.RecipeResult{ - Kind: "RecipeResult", - APIVersion: "aicr.nvidia.com/v1alpha1", - Metadata: struct { - Version 
string `json:"version,omitempty" yaml:"version,omitempty"` - AppliedOverlays []string `json:"appliedOverlays,omitempty" yaml:"appliedOverlays,omitempty"` - ExcludedOverlays []recipe.ExcludedOverlay `json:"excludedOverlays,omitempty" yaml:"excludedOverlays,omitempty"` - ConstraintWarnings []recipe.ConstraintWarning `json:"constraintWarnings,omitempty" yaml:"constraintWarnings,omitempty"` - }{ - Version: "v0.1.0", - }, - ComponentRefs: []recipe.ComponentRef{ - { - Name: "my-kustomize-app", - Namespace: "my-app", - Type: recipe.ComponentTypeKustomize, - Source: "https://github.com/example/repo", - Tag: "v1.0.0", - Path: "deploy/production", - }, - }, - DeploymentOrder: []string{"my-kustomize-app"}, - } -} - -func createMixedRecipeResult() *recipe.RecipeResult { - return &recipe.RecipeResult{ - Kind: "RecipeResult", - APIVersion: "aicr.nvidia.com/v1alpha1", - Metadata: struct { - Version string `json:"version,omitempty" yaml:"version,omitempty"` - AppliedOverlays []string `json:"appliedOverlays,omitempty" yaml:"appliedOverlays,omitempty"` - ExcludedOverlays []recipe.ExcludedOverlay `json:"excludedOverlays,omitempty" yaml:"excludedOverlays,omitempty"` - ConstraintWarnings []recipe.ConstraintWarning `json:"constraintWarnings,omitempty" yaml:"constraintWarnings,omitempty"` - }{ - Version: "v0.1.0", - }, - Criteria: &recipe.Criteria{ - Service: "eks", - Accelerator: "h100", - Intent: "training", - }, - ComponentRefs: []recipe.ComponentRef{ - { - Name: "cert-manager", - Namespace: "cert-manager", - Chart: "cert-manager", - Type: recipe.ComponentTypeHelm, - Version: "v1.17.2", - Source: "https://charts.jetstack.io", - }, - { - Name: "my-kustomize-app", - Namespace: "my-app", - Type: recipe.ComponentTypeKustomize, - Source: "https://github.com/example/repo", - Tag: "v1.0.0", - Path: "deploy/production", - }, - }, - DeploymentOrder: []string{"cert-manager", "my-kustomize-app"}, - } -} - -func createTestRecipeResult() *recipe.RecipeResult { - return &recipe.RecipeResult{ - 
Kind: "RecipeResult", - APIVersion: "aicr.nvidia.com/v1alpha1", - Metadata: struct { - Version string `json:"version,omitempty" yaml:"version,omitempty"` - AppliedOverlays []string `json:"appliedOverlays,omitempty" yaml:"appliedOverlays,omitempty"` - ExcludedOverlays []recipe.ExcludedOverlay `json:"excludedOverlays,omitempty" yaml:"excludedOverlays,omitempty"` - ConstraintWarnings []recipe.ConstraintWarning `json:"constraintWarnings,omitempty" yaml:"constraintWarnings,omitempty"` - }{ - Version: "v0.1.0", - }, - Criteria: &recipe.Criteria{ - Service: "eks", - Accelerator: "h100", - Intent: "training", - }, - ComponentRefs: []recipe.ComponentRef{ - { - Name: "cert-manager", - Namespace: "cert-manager", - Chart: "cert-manager", - Version: "v1.17.2", - Source: "https://charts.jetstack.io", - }, - { - Name: "gpu-operator", - Namespace: "gpu-operator", - Chart: "gpu-operator", - Version: "v25.3.3", - Source: "https://helm.ngc.nvidia.com/nvidia", - }, - }, - DeploymentOrder: []string{"cert-manager", "gpu-operator"}, - } -} - -func createEmptyRecipeResult() *recipe.RecipeResult { - return &recipe.RecipeResult{ - Kind: "RecipeResult", - APIVersion: "aicr.nvidia.com/v1alpha1", - Metadata: struct { - Version string `json:"version,omitempty" yaml:"version,omitempty"` - AppliedOverlays []string `json:"appliedOverlays,omitempty" yaml:"appliedOverlays,omitempty"` - ExcludedOverlays []recipe.ExcludedOverlay `json:"excludedOverlays,omitempty" yaml:"excludedOverlays,omitempty"` - ConstraintWarnings []recipe.ConstraintWarning `json:"constraintWarnings,omitempty" yaml:"constraintWarnings,omitempty"` - }{ - Version: "v0.1.0", - }, - ComponentRefs: []recipe.ComponentRef{}, - DeploymentOrder: []string{}, - } -} - -// TestGenerate_Reproducible verifies that Helm bundle generation is deterministic. -// Running Generate() twice with the same input should produce identical output files. 
-func TestGenerate_Reproducible(t *testing.T) { - ctx := context.Background() - - g := &Generator{ - RecipeResult: createTestRecipeResult(), - ComponentValues: map[string]map[string]any{ - "cert-manager": { - "crds": map[string]any{"enabled": true}, - }, - "gpu-operator": { - "driver": map[string]any{ - "enabled": true, - }, - }, - }, - Version: "v1.0.0", - } - - // Generate twice in different directories - var fileContents [2]map[string]string - - for i := 0; i < 2; i++ { - outputDir := t.TempDir() - - _, err := g.Generate(ctx, outputDir) - if err != nil { - t.Fatalf("iteration %d: Generate() error = %v", i, err) - } - - // Read all generated files - fileContents[i] = make(map[string]string) - err = filepath.Walk(outputDir, func(path string, info os.FileInfo, walkErr error) error { - if walkErr != nil { - return walkErr - } - if info.IsDir() { - return nil - } - - content, readErr := os.ReadFile(path) - if readErr != nil { - return readErr - } - - relPath, _ := filepath.Rel(outputDir, path) - fileContents[i][relPath] = string(content) - return nil - }) - if err != nil { - t.Fatalf("iteration %d: failed to walk directory: %v", i, err) - } - } - - // Verify same files were generated - if len(fileContents[0]) != len(fileContents[1]) { - t.Errorf("different number of files: iteration 1 has %d, iteration 2 has %d", - len(fileContents[0]), len(fileContents[1])) - } - - // Verify file contents are identical - for filename, content1 := range fileContents[0] { - content2, exists := fileContents[1][filename] - if !exists { - t.Errorf("file %s exists in iteration 1 but not iteration 2", filename) - continue - } - if content1 != content2 { - t.Errorf("file %s has different content between iterations:\n--- iteration 1 ---\n%s\n--- iteration 2 ---\n%s", - filename, content1, content2) - } - } - - t.Logf("Helm reproducibility verified: both iterations produced %d identical files", len(fileContents[0])) -} - -// TestGenerate_NoTimestampInOutput verifies that generated files don't 
contain timestamps. -func TestGenerate_NoTimestampInOutput(t *testing.T) { - ctx := context.Background() - outputDir := t.TempDir() - - g := &Generator{ - RecipeResult: createTestRecipeResult(), - ComponentValues: map[string]map[string]any{ - "cert-manager": {}, - "gpu-operator": {}, - }, - Version: "v1.0.0", - } - - _, err := g.Generate(ctx, outputDir) - if err != nil { - t.Fatalf("Generate() error = %v", err) - } - - // Check that no files contain obvious timestamp patterns - timestampPatterns := []string{ - "GeneratedAt:", - "generated_at:", - "timestamp:", - "Timestamp:", - } - - err = filepath.Walk(outputDir, func(path string, info os.FileInfo, walkErr error) error { - if walkErr != nil { - return walkErr - } - if info.IsDir() { - return nil - } - - content, readErr := os.ReadFile(path) - if readErr != nil { - return readErr - } - - contentStr := string(content) - relPath, _ := filepath.Rel(outputDir, path) - - for _, pattern := range timestampPatterns { - if strings.Contains(contentStr, pattern) { - t.Errorf("file %s contains timestamp pattern %q", relPath, pattern) - } - } - return nil - }) - if err != nil { - t.Fatalf("failed to walk directory: %v", err) - } -} - -func TestGenerateDeployScript(t *testing.T) { - tests := []struct { - name string - cancelCtx bool - outputDir string - components []ComponentData - wantErr bool - }{ - { - name: "success", - outputDir: "", // filled per-test with t.TempDir() - components: []ComponentData{ - {Name: "cert-manager", Namespace: "cert-manager", Repository: "https://charts.jetstack.io", ChartName: "cert-manager", Version: "v1.17.2", ChartVersion: "1.17.2"}, - {Name: "gpu-operator", Namespace: "gpu-operator", Repository: "https://helm.ngc.nvidia.com/nvidia", ChartName: "gpu-operator", Version: "v25.3.3", ChartVersion: "25.3.3"}, - }, - }, - { - name: "cancelled context", - cancelCtx: true, - outputDir: "", // filled per-test - components: []ComponentData{ - {Name: "cert-manager"}, - }, - wantErr: true, - }, - { - name: 
"invalid output directory", - outputDir: "/nonexistent/path/that/does/not/exist", - components: []ComponentData{ - {Name: "cert-manager"}, - }, - wantErr: true, - }, - { - name: "empty components", - outputDir: "", - components: []ComponentData{}, - }, - } - - for _, tt := range tests { - t.Run(tt.name, func(t *testing.T) { - ctx := context.Background() - if tt.cancelCtx { - var cancel context.CancelFunc - ctx, cancel = context.WithCancel(ctx) - cancel() - } - - dir := tt.outputDir - if dir == "" { - dir = t.TempDir() - } - - g := &Generator{ - Version: "v1.0.0", - } - - path, size, err := g.generateDeployScript(ctx, tt.components, dir) - if (err != nil) != tt.wantErr { - t.Fatalf("error = %v, wantErr %v", err, tt.wantErr) - } - if tt.wantErr { - return - } - - if path == "" { - t.Fatal("expected non-empty path") - } - if size <= 0 { - t.Fatal("expected positive file size") - } - - info, statErr := os.Stat(path) - if statErr != nil { - t.Fatalf("stat(%s): %v", path, statErr) - } - if info.Mode()&0111 == 0 { - t.Errorf("deploy.sh not executable, mode: %o", info.Mode()) - } - }) - } -} - -func TestGenerateDeployScript_EmptyVersionOmitsFlag(t *testing.T) { - ctx := context.Background() - dir := t.TempDir() - - components := []ComponentData{ - { - Name: "gpu-operator", - Namespace: "gpu-operator", - Repository: "https://helm.ngc.nvidia.com/nvidia", - ChartName: "gpu-operator", - Version: "", // empty version — should not produce --version flag - HasChart: true, - }, - } - - g := &Generator{Version: "v1.0.0"} - path, _, err := g.generateDeployScript(ctx, components, dir) - if err != nil { - t.Fatalf("generateDeployScript failed: %v", err) - } - - content, err := os.ReadFile(path) - if err != nil { - t.Fatalf("reading deploy.sh: %v", err) - } - - script := string(content) - if strings.Contains(script, "--version") { - t.Errorf("deploy.sh should not contain --version when Version is empty, got:\n%s", script) - } - if !strings.Contains(script, "helm upgrade --install 
gpu-operator gpu-operator") { - t.Errorf("deploy.sh should contain helm install command for gpu-operator") - } -} - -func TestGenerateDeployScript_WithVersionIncludesFlag(t *testing.T) { - ctx := context.Background() - dir := t.TempDir() - - components := []ComponentData{ - { - Name: "cert-manager", - Namespace: "cert-manager", - Repository: "https://charts.jetstack.io", - ChartName: "cert-manager", - Version: "v1.17.2", - HasChart: true, - }, - } - - g := &Generator{Version: "v1.0.0"} - path, _, err := g.generateDeployScript(ctx, components, dir) - if err != nil { - t.Fatalf("generateDeployScript failed: %v", err) - } - - content, err := os.ReadFile(path) - if err != nil { - t.Fatalf("reading deploy.sh: %v", err) - } - - script := string(content) - if !strings.Contains(script, "--version v1.17.2") { - t.Errorf("deploy.sh should contain --version v1.17.2, got:\n%s", script) - } -} - -func TestGenerateUndeployScript(t *testing.T) { - tests := []struct { - name string - cancelCtx bool - outputDir string - components []ComponentData - wantErr bool - }{ - { - name: "success", - outputDir: "", - components: []ComponentData{ - {Name: "cert-manager", Namespace: "cert-manager", Repository: "https://charts.jetstack.io", ChartName: "cert-manager", Version: "v1.17.2", ChartVersion: "1.17.2"}, - {Name: "gpu-operator", Namespace: "gpu-operator", Repository: "https://helm.ngc.nvidia.com/nvidia", ChartName: "gpu-operator", Version: "v25.3.3", ChartVersion: "25.3.3"}, - }, - }, - { - name: "cancelled context", - cancelCtx: true, - outputDir: "", - components: []ComponentData{ - {Name: "cert-manager"}, - }, - wantErr: true, - }, - { - name: "invalid output directory", - outputDir: "/nonexistent/path/that/does/not/exist", - components: []ComponentData{ - {Name: "cert-manager"}, - }, - wantErr: true, - }, - { - name: "empty components", - outputDir: "", - components: []ComponentData{}, - }, - { - name: "reverses component order", - outputDir: "", - components: []ComponentData{ - 
{Name: "alpha", Namespace: "alpha", ChartName: "alpha", Version: "v1.0.0", ChartVersion: "1.0.0", HasChart: true}, - {Name: "beta", Namespace: "beta", ChartName: "beta", Version: "v2.0.0", ChartVersion: "2.0.0", HasChart: true}, - {Name: "gamma", Namespace: "gamma", ChartName: "gamma", Version: "v3.0.0", ChartVersion: "3.0.0", HasChart: true}, - }, - }, - } - - for _, tt := range tests { - t.Run(tt.name, func(t *testing.T) { - ctx := context.Background() - if tt.cancelCtx { - var cancel context.CancelFunc - ctx, cancel = context.WithCancel(ctx) - cancel() - } - - dir := tt.outputDir - if dir == "" { - dir = t.TempDir() - } - - g := &Generator{ - Version: "v1.0.0", - } - - path, size, err := g.generateUndeployScript(ctx, tt.components, dir) - if (err != nil) != tt.wantErr { - t.Fatalf("error = %v, wantErr %v", err, tt.wantErr) - } - if tt.wantErr { - return - } - - if path == "" { - t.Fatal("expected non-empty path") - } - if size <= 0 { - t.Fatal("expected positive file size") - } - - info, statErr := os.Stat(path) - if statErr != nil { - t.Fatalf("stat(%s): %v", path, statErr) - } - if info.Mode()&0111 == 0 { - t.Errorf("undeploy.sh not executable, mode: %o", info.Mode()) - } - - if tt.name == "reverses component order" { - content, readErr := os.ReadFile(path) - if readErr != nil { - t.Fatalf("read undeploy.sh: %v", readErr) - } - script := string(content) - gammaIdx := strings.Index(script, "Uninstalling gamma") - alphaIdx := strings.Index(script, "Uninstalling alpha") - if gammaIdx < 0 || alphaIdx < 0 { - t.Fatal("expected both gamma and alpha in undeploy.sh") - } - if gammaIdx > alphaIdx { - t.Error("undeploy.sh should have gamma before alpha (reverse order)") - } - } - }) - } -} - -func TestGenerate_DynamicValues(t *testing.T) { - tests := []struct { - name string - dynamicValues map[string][]string - componentValues map[string]map[string]any - wantClusterValues bool // whether cluster-values.yaml should exist for gpu-operator - wantClusterContains string // 
substring expected in cluster-values.yaml - wantValuesLacksPath string // dot path that should NOT be in values.yaml - wantDeployClusterValues bool // whether deploy.sh should contain cluster-values.yaml for gpu-operator - }{ - { - name: "no dynamic values — cluster-values.yaml still generated (empty)", - dynamicValues: nil, - componentValues: map[string]map[string]any{ - "cert-manager": {"crds": map[string]any{"enabled": true}}, - "gpu-operator": {"driver": map[string]any{"version": testDriverVersion, "enabled": true}}, - }, - wantClusterValues: true, - wantDeployClusterValues: true, - }, - { - name: "dynamic values present — extracted into cluster-values.yaml", - dynamicValues: map[string][]string{ - "gpu-operator": {"driver.version"}, - }, - componentValues: map[string]map[string]any{ - "cert-manager": {"crds": map[string]any{"enabled": true}}, - "gpu-operator": {"driver": map[string]any{"version": testDriverVersion, "enabled": true}}, - }, - wantClusterValues: true, - wantClusterContains: "version", - wantValuesLacksPath: "version: \"570.86.16\"", - wantDeployClusterValues: true, - }, - { - name: "dynamic path not in values", - dynamicValues: map[string][]string{ - "gpu-operator": {"nonexistent.path"}, - }, - componentValues: map[string]map[string]any{ - "cert-manager": {"crds": map[string]any{"enabled": true}}, - "gpu-operator": {"driver": map[string]any{"enabled": true}}, - }, - wantClusterValues: true, - wantClusterContains: "nonexistent", - wantDeployClusterValues: true, - }, - } - - for _, tt := range tests { - t.Run(tt.name, func(t *testing.T) { - ctx := context.Background() - outputDir := t.TempDir() - - g := &Generator{ - RecipeResult: createTestRecipeResult(), - ComponentValues: tt.componentValues, - Version: "v1.0.0", - DynamicValues: tt.dynamicValues, - } - - _, err := g.Generate(ctx, outputDir) - if err != nil { - t.Fatalf("Generate failed: %v", err) - } - - clusterValuesPath := filepath.Join(outputDir, "gpu-operator", "cluster-values.yaml") - _, 
statErr := os.Stat(clusterValuesPath) - clusterExists := !os.IsNotExist(statErr) - - if clusterExists != tt.wantClusterValues { - t.Errorf("cluster-values.yaml exists = %v, want %v", clusterExists, tt.wantClusterValues) - } - - if tt.wantClusterContains != "" && clusterExists { - content, readErr := os.ReadFile(clusterValuesPath) - if readErr != nil { - t.Fatalf("failed to read cluster-values.yaml: %v", readErr) - } - if !strings.Contains(string(content), tt.wantClusterContains) { - t.Errorf("cluster-values.yaml missing %q, got:\n%s", tt.wantClusterContains, string(content)) - } - } - if tt.wantValuesLacksPath != "" { - valuesContent, readErr := os.ReadFile(filepath.Join(outputDir, "gpu-operator", "values.yaml")) - if readErr != nil { - t.Fatalf("failed to read values.yaml: %v", readErr) - } - if strings.Contains(string(valuesContent), tt.wantValuesLacksPath) { - t.Errorf("values.yaml should not contain %q after dynamic split, got:\n%s", tt.wantValuesLacksPath, string(valuesContent)) - } - } - - // cert-manager should also have cluster-values.yaml (all components get one) - certClusterPath := filepath.Join(outputDir, "cert-manager", "cluster-values.yaml") - if _, certStatErr := os.Stat(certClusterPath); os.IsNotExist(certStatErr) { - t.Error("cert-manager should have cluster-values.yaml (all components get one)") - } - - // Verify deploy.sh content — all components always reference cluster-values.yaml - deployContent, readErr := os.ReadFile(filepath.Join(outputDir, "deploy.sh")) - if readErr != nil { - t.Fatalf("failed to read deploy.sh: %v", readErr) - } - deployScript := string(deployContent) - - gpuClusterRef := `gpu-operator/cluster-values.yaml` - if tt.wantDeployClusterValues { - if !strings.Contains(deployScript, gpuClusterRef) { - t.Error("deploy.sh should contain cluster-values.yaml reference for gpu-operator") + content := readFile(t, filepath.Join(outputDir, "002-gpu-operator", "values.yaml")) + if strings.Contains(content, tt.wantValuesLacksPath) { + 
t.Errorf("values.yaml should not contain %q after dynamic split, got:\n%s", tt.wantValuesLacksPath, content) } } - // All components always have cluster-values.yaml in deploy.sh - certClusterRef := `cert-manager/cluster-values.yaml` - if !strings.Contains(deployScript, certClusterRef) { - t.Error("deploy.sh should contain cluster-values.yaml reference for all components") + // cert-manager also always has cluster-values.yaml (every component gets one). + if _, err := os.Stat(filepath.Join(outputDir, "001-cert-manager", "cluster-values.yaml")); os.IsNotExist(err) { + t.Error("cert-manager should have cluster-values.yaml") } }) } @@ -1874,42 +767,27 @@ func TestGenerate_DynamicValuesContentVerification(t *testing.T) { }, } - _, err := g.Generate(ctx, outputDir) - if err != nil { + if _, err := g.Generate(ctx, outputDir); err != nil { t.Fatalf("Generate failed: %v", err) } - // Verify cluster-values.yaml has the extracted values - clusterContent, err := os.ReadFile(filepath.Join(outputDir, "gpu-operator", "cluster-values.yaml")) - if err != nil { - t.Fatalf("failed to read cluster-values.yaml: %v", err) - } - clusterStr := string(clusterContent) - - if !strings.Contains(clusterStr, testDriverVersion) { - t.Errorf("cluster-values.yaml missing driver.version value, got:\n%s", clusterStr) - } - if !strings.Contains(clusterStr, "1.17.4") { - t.Errorf("cluster-values.yaml missing toolkit.version value, got:\n%s", clusterStr) + cluster := readFile(t, filepath.Join(outputDir, "002-gpu-operator", "cluster-values.yaml")) + if !strings.Contains(cluster, testDriverVersion) { + t.Errorf("cluster-values.yaml missing driver.version, got:\n%s", cluster) } - - // Verify values.yaml no longer has the dynamic values - valuesContent, err := os.ReadFile(filepath.Join(outputDir, "gpu-operator", "values.yaml")) - if err != nil { - t.Fatalf("failed to read values.yaml: %v", err) + if !strings.Contains(cluster, "1.17.4") { + t.Errorf("cluster-values.yaml missing toolkit.version, got:\n%s", 
cluster) } - valuesStr := string(valuesContent) - if strings.Contains(valuesStr, testDriverVersion) { - t.Errorf("values.yaml should not contain driver version after dynamic split, got:\n%s", valuesStr) + values := readFile(t, filepath.Join(outputDir, "002-gpu-operator", "values.yaml")) + if strings.Contains(values, testDriverVersion) { + t.Errorf("values.yaml should not contain driver.version, got:\n%s", values) } - if strings.Contains(valuesStr, "1.17.4") { - t.Errorf("values.yaml should not contain toolkit version after dynamic split, got:\n%s", valuesStr) + if strings.Contains(values, "1.17.4") { + t.Errorf("values.yaml should not contain toolkit.version, got:\n%s", values) } - - // driver.enabled should still be in values.yaml - if !strings.Contains(valuesStr, "enabled") { - t.Errorf("values.yaml should still contain driver.enabled, got:\n%s", valuesStr) + if !strings.Contains(values, "enabled") { + t.Errorf("values.yaml should still contain driver.enabled, got:\n%s", values) } } @@ -1944,7 +822,6 @@ func TestSetNestedValue(t *testing.T) { t.Run(tt.name, func(t *testing.T) { m := make(map[string]any) component.SetValueByPath(m, tt.path, tt.value) - for _, key := range tt.wantKeys { if _, ok := m[key]; !ok { t.Errorf("missing key %q in result map", key) @@ -1953,30 +830,22 @@ func TestSetNestedValue(t *testing.T) { }) } - // Verify full structure for nested path t.Run("verify nested structure", func(t *testing.T) { m := make(map[string]any) component.SetValueByPath(m, "driver.version", testDriverVersion) - driver, ok := m["driver"].(map[string]any) if !ok { t.Fatal("driver should be a map") } - version, ok := driver["version"] - if !ok { - t.Fatal("driver.version should exist") - } - if version != testDriverVersion { - t.Errorf("driver.version = %v, want 570.86.16", version) + if driver["version"] != testDriverVersion { + t.Errorf("driver.version = %v, want 570.86.16", driver["version"]) } }) - // Verify multiple paths into same parent t.Run("multiple paths 
same parent", func(t *testing.T) { m := make(map[string]any) component.SetValueByPath(m, "driver.version", testDriverVersion) component.SetValueByPath(m, "driver.enabled", true) - driver, ok := m["driver"].(map[string]any) if !ok { t.Fatal("driver should be a map") @@ -2015,31 +884,20 @@ func TestGenerate_DynamicValuesDeeplyNested(t *testing.T) { }, } - _, err := g.Generate(ctx, outputDir) - if err != nil { + if _, err := g.Generate(ctx, outputDir); err != nil { t.Fatalf("Generate failed: %v", err) } - // Verify cluster-values.yaml was created with the deeply nested path - clusterContent, err := os.ReadFile(filepath.Join(outputDir, "gpu-operator", "cluster-values.yaml")) - if err != nil { - t.Fatalf("failed to read cluster-values.yaml: %v", err) - } - clusterStr := string(clusterContent) - - if !strings.Contains(clusterStr, "deep-value") { - t.Errorf("cluster-values.yaml missing deeply nested value, got:\n%s", clusterStr) + cluster := readFile(t, filepath.Join(outputDir, "002-gpu-operator", "cluster-values.yaml")) + if !strings.Contains(cluster, "deep-value") { + t.Errorf("cluster-values.yaml missing deeply nested value, got:\n%s", cluster) } - // Parse cluster-values.yaml and verify the YAML structure var clusterMap map[string]any - // Strip the header comment and --- separator - yamlContent := strings.TrimPrefix(clusterStr, "# Generated by Cloud Native Stack\n---\n") - if unmarshalErr := yaml.Unmarshal([]byte(yamlContent), &clusterMap); unmarshalErr != nil { - t.Fatalf("failed to parse cluster-values.yaml: %v", unmarshalErr) + if err := yaml.Unmarshal([]byte(cluster), &clusterMap); err != nil { + t.Fatalf("failed to parse cluster-values.yaml: %v", err) } - // Walk the nested path a.b.c.d a, ok := clusterMap["a"].(map[string]any) if !ok { t.Fatal("expected 'a' to be a map in cluster-values.yaml") @@ -2060,18 +918,12 @@ func TestGenerate_DynamicValuesDeeplyNested(t *testing.T) { t.Errorf("a.b.c.d = %v, want 'deep-value'", d) } - // Verify values.yaml no longer 
contains the extracted value - valuesContent, err := os.ReadFile(filepath.Join(outputDir, "gpu-operator", "values.yaml")) - if err != nil { - t.Fatalf("failed to read values.yaml: %v", err) + values := readFile(t, filepath.Join(outputDir, "002-gpu-operator", "values.yaml")) + if strings.Contains(values, "deep-value") { + t.Errorf("values.yaml should not contain deep-value after split, got:\n%s", values) } - if strings.Contains(string(valuesContent), "deep-value") { - t.Errorf("values.yaml should not contain deep-value after dynamic split, got:\n%s", string(valuesContent)) - } - - // driver.enabled should still be in values.yaml - if !strings.Contains(string(valuesContent), "enabled") { - t.Errorf("values.yaml should still contain driver.enabled, got:\n%s", string(valuesContent)) + if !strings.Contains(values, "enabled") { + t.Errorf("values.yaml should still contain driver.enabled, got:\n%s", values) } } @@ -2079,9 +931,6 @@ func TestGenerate_DynamicValuesWithSetOverride(t *testing.T) { ctx := context.Background() outputDir := t.TempDir() - // Simulate --set gpuoperator:driver.version=999.99.99 by providing the value - // in ComponentValues (--set is applied before dynamic extraction). - // Then --dynamic gpuoperator:driver.version should extract the --set value. 
g := &Generator{ RecipeResult: createTestRecipeResult(), ComponentValues: map[string]map[string]any{ @@ -2099,32 +948,21 @@ func TestGenerate_DynamicValuesWithSetOverride(t *testing.T) { }, } - _, err := g.Generate(ctx, outputDir) - if err != nil { + if _, err := g.Generate(ctx, outputDir); err != nil { t.Fatalf("Generate failed: %v", err) } - // cluster-values.yaml should contain the --set value - clusterContent, err := os.ReadFile(filepath.Join(outputDir, "gpu-operator", "cluster-values.yaml")) - if err != nil { - t.Fatalf("failed to read cluster-values.yaml: %v", err) - } - if !strings.Contains(string(clusterContent), "999.99.99") { - t.Errorf("cluster-values.yaml should contain --set value 999.99.99, got:\n%s", string(clusterContent)) + cluster := readFile(t, filepath.Join(outputDir, "002-gpu-operator", "cluster-values.yaml")) + if !strings.Contains(cluster, "999.99.99") { + t.Errorf("cluster-values.yaml should contain --set override 999.99.99, got:\n%s", cluster) } - // values.yaml should NOT contain the extracted value - valuesContent, err := os.ReadFile(filepath.Join(outputDir, "gpu-operator", "values.yaml")) - if err != nil { - t.Fatalf("failed to read values.yaml: %v", err) + values := readFile(t, filepath.Join(outputDir, "002-gpu-operator", "values.yaml")) + if strings.Contains(values, "999.99.99") { + t.Errorf("values.yaml should not contain 999.99.99 after dynamic split, got:\n%s", values) } - if strings.Contains(string(valuesContent), "999.99.99") { - t.Errorf("values.yaml should not contain 999.99.99 after dynamic split, got:\n%s", string(valuesContent)) - } - - // driver.enabled should still be in values.yaml - if !strings.Contains(string(valuesContent), "enabled") { - t.Errorf("values.yaml should still contain driver.enabled, got:\n%s", string(valuesContent)) + if !strings.Contains(values, "enabled") { + t.Errorf("values.yaml should still contain driver.enabled, got:\n%s", values) } } @@ -2132,36 +970,14 @@ func TestGenerate_DynamicValuesRoundTrip(t 
*testing.T) { ctx := context.Background() outputDir := t.TempDir() - originalValues := map[string]any{ - "driver": map[string]any{ - "version": testDriverVersion, - "enabled": true, - }, - "toolkit": map[string]any{ - "version": "1.17.4", - "enabled": true, - }, - "gds": map[string]any{ - "enabled": false, - }, - } - g := &Generator{ RecipeResult: createTestRecipeResult(), ComponentValues: map[string]map[string]any{ "cert-manager": {"crds": map[string]any{"enabled": true}}, "gpu-operator": { - "driver": map[string]any{ - "version": testDriverVersion, - "enabled": true, - }, - "toolkit": map[string]any{ - "version": "1.17.4", - "enabled": true, - }, - "gds": map[string]any{ - "enabled": false, - }, + "driver": map[string]any{"version": testDriverVersion, "enabled": true}, + "toolkit": map[string]any{"version": "1.17.4", "enabled": true}, + "gds": map[string]any{"enabled": false}, }, }, Version: "v1.0.0", @@ -2170,36 +986,23 @@ func TestGenerate_DynamicValuesRoundTrip(t *testing.T) { }, } - _, err := g.Generate(ctx, outputDir) - if err != nil { + if _, err := g.Generate(ctx, outputDir); err != nil { t.Fatalf("Generate failed: %v", err) } - // Read and parse values.yaml - valuesContent, err := os.ReadFile(filepath.Join(outputDir, "gpu-operator", "values.yaml")) - if err != nil { - t.Fatalf("failed to read values.yaml: %v", err) - } var staticValues map[string]any - if unmarshalErr := yaml.Unmarshal(valuesContent, &staticValues); unmarshalErr != nil { - t.Fatalf("failed to parse values.yaml: %v", unmarshalErr) + if err := yaml.Unmarshal([]byte(readFile(t, filepath.Join(outputDir, "002-gpu-operator", "values.yaml"))), &staticValues); err != nil { + t.Fatalf("failed to parse values.yaml: %v", err) } - // Read and parse cluster-values.yaml - clusterContent, err := os.ReadFile(filepath.Join(outputDir, "gpu-operator", "cluster-values.yaml")) - if err != nil { - t.Fatalf("failed to read cluster-values.yaml: %v", err) - } var dynamicValues map[string]any - if err := 
yaml.Unmarshal(clusterContent, &dynamicValues); err != nil { + if err := yaml.Unmarshal([]byte(readFile(t, filepath.Join(outputDir, "002-gpu-operator", "cluster-values.yaml"))), &dynamicValues); err != nil { t.Fatalf("failed to parse cluster-values.yaml: %v", err) } - // Merge static + dynamic values (simulate helm install -f values.yaml -f cluster-values.yaml) + // Merge simulates `helm install -f values.yaml -f cluster-values.yaml`. merged := deepMerge(staticValues, dynamicValues) - // Verify the merged result matches the original values - // Check driver.version was preserved through the round-trip driverMerged, ok := merged["driver"].(map[string]any) if !ok { t.Fatal("merged result missing 'driver' map") @@ -2211,7 +1014,6 @@ func TestGenerate_DynamicValuesRoundTrip(t *testing.T) { t.Errorf("merged driver.enabled = %v, want true", driverMerged["enabled"]) } - // Check toolkit.version was preserved toolkitMerged, ok := merged["toolkit"].(map[string]any) if !ok { t.Fatal("merged result missing 'toolkit' map") @@ -2223,98 +1025,35 @@ func TestGenerate_DynamicValuesRoundTrip(t *testing.T) { t.Errorf("merged toolkit.enabled = %v, want true", toolkitMerged["enabled"]) } - // Check gds.enabled was not affected (not a dynamic path) gdsMerged, ok := merged["gds"].(map[string]any) if !ok { t.Fatal("merged result missing 'gds' map") } if gdsMerged["enabled"] != false { t.Errorf("merged gds.enabled = %v, want false", gdsMerged["enabled"]) - } - - // Verify original values structure is fully recoverable - for key := range originalValues { - if _, exists := merged[key]; !exists { - t.Errorf("merged result missing top-level key %q", key) - } - } -} - -// deepMerge recursively merges src into dst. src values take precedence. -// This simulates Helm's behavior of merging multiple -f value files. 
-func deepMerge(dst, src map[string]any) map[string]any { - result := make(map[string]any) - for k, v := range dst { - result[k] = v - } - for k, v := range src { - if srcMap, ok := v.(map[string]any); ok { - if dstMap, ok := result[k].(map[string]any); ok { - result[k] = deepMerge(dstMap, srcMap) - continue - } - } - result[k] = v - } - return result -} - -func TestReverseComponents(t *testing.T) { - tests := []struct { - name string - input []ComponentData - wantLen int - wantName string // expected first element name after reverse - }{ - { - name: "empty", - input: []ComponentData{}, - wantLen: 0, - }, - { - name: "single", - input: []ComponentData{{Name: "a"}}, - wantLen: 1, - wantName: "a", - }, - { - name: "multiple", - input: []ComponentData{ - {Name: "a"}, - {Name: "b"}, - {Name: "c"}, - }, - wantLen: 3, - wantName: "c", - }, - } - - for _, tt := range tests { - t.Run(tt.name, func(t *testing.T) { - // Keep a copy of original order to verify non-mutation - original := make([]ComponentData, len(tt.input)) - copy(original, tt.input) - - result := reverseComponents(tt.input) + } +} - if len(result) != tt.wantLen { - t.Fatalf("len = %d, want %d", len(result), tt.wantLen) - } - if tt.wantLen > 0 && result[0].Name != tt.wantName { - t.Errorf("first element = %q, want %q", result[0].Name, tt.wantName) - } - // Verify original is unchanged - for i, comp := range tt.input { - if comp.Name != original[i].Name { - t.Errorf("original[%d] mutated: got %q, want %q", i, comp.Name, original[i].Name) - } +// deepMerge returns a new map with src recursively merged over dst, without mutating either input; src values take precedence. 
+func deepMerge(dst, src map[string]any) map[string]any { + result := make(map[string]any) + for k, v := range dst { + result[k] = v + } + for k, v := range src { + if srcMap, ok := v.(map[string]any); ok { + if dstMap, ok := result[k].(map[string]any); ok { + result[k] = deepMerge(dstMap, srcMap) + continue } - }) + } + result[k] = v } + return result } -// TestGenerate_DoesNotMutateComponentValues verifies that Generate deep-copies -// component values before extracting dynamic paths, so the input map is preserved. +// TestGenerate_DoesNotMutateComponentValues verifies Generate does not +// mutate the caller's ComponentValues map. func TestGenerate_DoesNotMutateComponentValues(t *testing.T) { ctx := context.Background() outputDir := t.TempDir() @@ -2334,12 +1073,10 @@ func TestGenerate_DoesNotMutateComponentValues(t *testing.T) { }, } - _, err := g.Generate(ctx, outputDir) - if err != nil { + if _, err := g.Generate(ctx, outputDir); err != nil { t.Fatalf("Generate() error = %v", err) } - // Original values should NOT be mutated — driver.version should still exist driver, ok := originalValues["gpu-operator"]["driver"].(map[string]any) if !ok { t.Fatal("original driver should still be a map") @@ -2349,17 +1086,19 @@ func TestGenerate_DoesNotMutateComponentValues(t *testing.T) { } } +// TestGenerate_DataFiles verifies external data files are included in +// checksums output and path traversal is rejected. func TestGenerate_DataFiles(t *testing.T) { t.Run("valid data file included in output", func(t *testing.T) { ctx := context.Background() outputDir := t.TempDir() - // Create a data file on disk so checksums can read it + // Create a data file on disk so checksum generation can read it. 
dataDir := filepath.Join(outputDir, "data") - if err := os.MkdirAll(dataDir, 0755); err != nil { + if err := os.MkdirAll(dataDir, 0o755); err != nil { t.Fatal(err) } - if err := os.WriteFile(filepath.Join(dataDir, "overrides.yaml"), []byte("key: value"), 0600); err != nil { + if err := os.WriteFile(filepath.Join(dataDir, "overrides.yaml"), []byte("key: value"), 0o600); err != nil { t.Fatal(err) } @@ -2377,7 +1116,6 @@ func TestGenerate_DataFiles(t *testing.T) { if err != nil { t.Fatalf("Generate() error = %v", err) } - found := false for _, f := range output.Files { if strings.HasSuffix(f, "data/overrides.yaml") { @@ -2414,20 +1152,14 @@ func TestGenerate_DataFiles(t *testing.T) { }) } -// TestUndeployScript_TransientFailureWarnsAndContinues asserts that the three -// post-uninstall cleanup pipelines tolerate a transient kubectl failure instead -// of letting set -euo pipefail kill the script. -// -// Sites covered (matching the warn-on-failure pattern added in this PR): -// - delete_release_cluster_resources (per-release per-kind cleanup helper) -// - force_clear_namespace_finalizers (last-resort namespace unstick helper) -// - per-component orphan-CRD cleanup loop in the script body -// -// Setup: stub `kubectl` to always exit non-zero (simulating a 502/timeout/auth -// hiccup). For each site, source the relevant section of the generated script, -// invoke it, and assert (a) the wrapper exits 0 — proving set -e was not -// triggered — and (b) the descriptive `Warning:` is on stderr — proving the -// failure was visible to the operator. 
+// --------------------------------------------------------------------------- +// Shell-behavior tests — preserved from the previous helm deployer +// --------------------------------------------------------------------------- + +// TestUndeployScript_TransientFailureWarnsAndContinues covers the three +// post-uninstall cleanup pipelines in undeploy.sh that must tolerate a +// transient kubectl failure instead of letting `set -euo pipefail` kill the +// script. func TestUndeployScript_TransientFailureWarnsAndContinues(t *testing.T) { if _, err := exec.LookPath("bash"); err != nil { t.Skip("bash not available; skipping shell-behavior test") @@ -2455,11 +1187,6 @@ func TestUndeployScript_TransientFailureWarnsAndContinues(t *testing.T) { } undeployPath := filepath.Join(outputDir, "undeploy.sh") - // Stub kubectl: `api-resources` succeeds with a minimal kind list (so the - // helpers reach the inner pipeline we want to exercise); every other - // invocation fails to simulate a transient API hiccup. Placed at the - // front of PATH so it shadows the real kubectl. jq is left alone — the - // pipelines pipe-fail at the kubectl stage either way. stubDir := t.TempDir() stubKubectl := filepath.Join(stubDir, "kubectl") stubScript := "#!/bin/sh\n" + @@ -2479,9 +1206,6 @@ func TestUndeployScript_TransientFailureWarnsAndContinues(t *testing.T) { wantStderr string }{ { - // L97-L103 in template: the helper's outer pipeline must end in `done || echo "Warning: ..." >&2`. - // sed+eval (not `source <(awk ...)` process substitution) for portability - // across bash environments where <(...) is flaky. 
name: "delete_release_cluster_resources", bashSnippet: ` snippet=$(sed -n '/^delete_release_cluster_resources()/,/^}/p' "$UNDEPLOY") @@ -2492,7 +1216,6 @@ func TestUndeployScript_TransientFailureWarnsAndContinues(t *testing.T) { wantStderr: "Warning: customresourcedefinitions cleanup pipeline for release gpu-operator/gpu-operator failed", }, { - // L150-L154 in template: same pattern in the namespace finalizer-unstick helper. name: "force_clear_namespace_finalizers", bashSnippet: ` snippet=$(sed -n '/^force_clear_namespace_finalizers()/,/^}/p' "$UNDEPLOY") @@ -2502,9 +1225,6 @@ func TestUndeployScript_TransientFailureWarnsAndContinues(t *testing.T) { wantStderr: "Warning: finalizer-clear pipeline for", }, { - // L296-L302 in template: the per-Helm-component orphan-CRD loop in the script body. - // Extract from the section header through (but not including) the - // manual-review note that now replaces the old group-delete block. name: "orphan_crd_inline_loop", bashSnippet: ` snippet=$(sed -n '/^# Clean up orphaned CRDs that were owned by this bundle/,/^# Intentionally skip automatic deletion of unannotated CRDs matched only by/p' "$UNDEPLOY" | sed '$d') @@ -2516,8 +1236,6 @@ func TestUndeployScript_TransientFailureWarnsAndContinues(t *testing.T) { for _, tt := range tests { t.Run(tt.name, func(t *testing.T) { - // Bound each bash invocation to 30s so a wedged subprocess (kubectl - // stub mis-fire, deadlocked pipeline, etc.) cannot hang `go test`. 
subCtx, cancel := context.WithTimeout(ctx, 30*time.Second) defer cancel() cmd := exec.CommandContext(subCtx, "bash", "-c", "set -euo pipefail\n"+tt.bashSnippet) @@ -2534,23 +1252,18 @@ func TestUndeployScript_TransientFailureWarnsAndContinues(t *testing.T) { err, stdout.String(), stderr.String()) } if !strings.Contains(stderr.String(), tt.wantStderr) { - t.Errorf("expected %q in stderr (proves operators get a visible signal on transient failure), got:\nstderr: %s", + t.Errorf("expected %q in stderr, got:\nstderr: %s", tt.wantStderr, stderr.String()) } }) } } -// TestUndeployScript_PreflightDiscoversExplicitExtraCRDs proves the simplified -// pre-flight still catches a small set of known crds/-installed operator CRDs -// without scanning whole API groups. Here helm get manifest is empty and the -// CRD is unannotated, so only extra_crds_for_release can surface it. +// TestUndeployScript_PreflightDiscoversExplicitExtraCRDs proves pre-flight +// catches a small set of known operator CRDs without scanning whole API +// groups. 
func TestUndeployScript_PreflightDiscoversExplicitExtraCRDs(t *testing.T) { - for _, bin := range []string{"bash", "awk", "sed", "jq"} { - if _, err := exec.LookPath(bin); err != nil { - t.Skipf("%s not available; skipping shell-behavior test", bin) - } - } + skipIfMissingBins(t, "bash", "awk", "sed", "jq") ctx := context.Background() outputDir := t.TempDir() @@ -2568,13 +1281,8 @@ func TestUndeployScript_PreflightDiscoversExplicitExtraCRDs(t *testing.T) { undeployPath := filepath.Join(outputDir, "undeploy.sh") stubDir := t.TempDir() - - helmStub := "#!/bin/sh\n# helm get manifest returns empty (no templated CRDs)\nexit 0\n" - if err := os.WriteFile(filepath.Join(stubDir, "helm"), []byte(helmStub), 0o755); err != nil { - t.Fatalf("write helm stub: %v", err) - } - - kubectlStub := `#!/bin/sh + writeStub(t, stubDir, "helm", "#!/bin/sh\nexit 0\n") + writeStub(t, stubDir, "kubectl", `#!/bin/sh case "$*" in "get crd -o json") echo '{"items":[{"metadata":{"name":"clusterpolicies.nvidia.com"},"spec":{"group":"nvidia.com","names":{"plural":"clusterpolicies"},"scope":"Cluster"}}]}' @@ -2583,96 +1291,38 @@ case "$*" in echo '{"spec":{"names":{"plural":"clusterpolicies"},"group":"nvidia.com","scope":"Cluster"}}' ;; "get clusterpolicies.nvidia.com -o json") - # One CR with an operator finalizer (non-kubernetes.io/*) — the exact - # case pre-flight is supposed to catch before the operator is removed. 
echo '{"items":[{"metadata":{"name":"cluster-policy","finalizers":["nvidia.com/clusterpolicy"]}}]}' ;; *) exit 0 ;; esac -` - if err := os.WriteFile(filepath.Join(stubDir, "kubectl"), []byte(kubectlStub), 0o755); err != nil { - t.Fatalf("write kubectl stub: %v", err) - } - - bashSnippet := ` - for fn in extra_crds_for_release capture_kubectl_json check_crd_for_stuck_resources check_release_for_stuck_crds; do - snippet=$(sed -n "/^${fn}()/,/^}/p" "$UNDEPLOY") - eval "$snippet" - done - PREFLIGHT_DETAILS=$(mktemp) - check_release_for_stuck_crds "gpu-operator" "gpu-operator" - cat "$PREFLIGHT_DETAILS" - rm -f "$PREFLIGHT_DETAILS" - ` +`) - subCtx, cancel := context.WithTimeout(ctx, 30*time.Second) - defer cancel() - cmd := exec.CommandContext(subCtx, "bash", "-c", "set -euo pipefail\n"+bashSnippet) - cmd.Env = append(os.Environ(), - "PATH="+stubDir+":"+os.Getenv("PATH"), - "UNDEPLOY="+undeployPath, - ) - var stdout, stderr bytes.Buffer - cmd.Stdout = &stdout - cmd.Stderr = &stderr - if err := cmd.Run(); err != nil { - t.Fatalf("script exited non-zero.\nerr: %v\nstdout: %s\nstderr: %s", - err, stdout.String(), stderr.String()) - } + stdout, stderr := runPreflightSnippet(t, ctx, stubDir, undeployPath, + `check_release_for_stuck_crds "gpu-operator" "gpu-operator"`) - if !strings.Contains(stdout.String(), "cluster-policy") { - t.Errorf("expected pre-flight to detect cluster-policy CR via explicit CRD overrides; got stdout: %q\nstderr: %q", - stdout.String(), stderr.String()) - } - if !strings.Contains(stdout.String(), "clusterpolicies.nvidia.com") { - t.Errorf("expected pre-flight output to name the source CRD clusterpolicies.nvidia.com; got: %q", stdout.String()) - } - if !strings.Contains(stdout.String(), "nvidia.com/clusterpolicy") { - t.Errorf("expected pre-flight to surface the unreconciled finalizer; got: %q", stdout.String()) + for _, want := range []string{"cluster-policy", "clusterpolicies.nvidia.com", "nvidia.com/clusterpolicy"} { + if !strings.Contains(stdout, 
want) { + t.Errorf("expected pre-flight to surface %q; stdout=%q stderr=%q", want, stdout, stderr) + } } } -// TestUndeployScript_PreflightDiscoversPrometheusExplicitCRDs proves the -// expanded exact-name override list now covers kube-prometheus-stack CRDs that -// are commonly installed outside Helm manifest/annotation discovery. +// TestUndeployScript_PreflightDiscoversPrometheusExplicitCRDs covers +// kube-prometheus-stack CRDs installed outside Helm manifest/annotation +// discovery. func TestUndeployScript_PreflightDiscoversPrometheusExplicitCRDs(t *testing.T) { - for _, bin := range []string{"bash", "awk", "sed", "jq"} { - if _, err := exec.LookPath(bin); err != nil { - t.Skipf("%s not available; skipping shell-behavior test", bin) - } - } + skipIfMissingBins(t, "bash", "awk", "sed", "jq") ctx := context.Background() outputDir := t.TempDir() g := &Generator{ - RecipeResult: &recipe.RecipeResult{ - Kind: "RecipeResult", - APIVersion: "aicr.nvidia.com/v1alpha1", - Metadata: struct { - Version string `json:"version,omitempty" yaml:"version,omitempty"` - AppliedOverlays []string `json:"appliedOverlays,omitempty" yaml:"appliedOverlays,omitempty"` - ExcludedOverlays []recipe.ExcludedOverlay `json:"excludedOverlays,omitempty" yaml:"excludedOverlays,omitempty"` - ConstraintWarnings []recipe.ConstraintWarning `json:"constraintWarnings,omitempty" yaml:"constraintWarnings,omitempty"` - }{ - Version: "v0.1.0", - }, - ComponentRefs: []recipe.ComponentRef{ - { - Name: "kube-prometheus-stack", - Namespace: "monitoring", - Chart: "prometheus-community/kube-prometheus-stack", - Version: "82.8.0", - Source: "https://prometheus-community.github.io/helm-charts", - }, - }, - DeploymentOrder: []string{"kube-prometheus-stack"}, - }, - ComponentValues: map[string]map[string]any{ - "kube-prometheus-stack": {}, - }, - Version: "v1.0.0", + RecipeResult: singleComponentRecipe("kube-prometheus-stack", "monitoring", + "prometheus-community/kube-prometheus-stack", "82.8.0", + 
"https://prometheus-community.github.io/helm-charts"), + ComponentValues: map[string]map[string]any{"kube-prometheus-stack": {}}, + Version: "v1.0.0", } if _, err := g.Generate(ctx, outputDir); err != nil { t.Fatalf("Generate failed: %v", err) @@ -2680,13 +1330,8 @@ func TestUndeployScript_PreflightDiscoversPrometheusExplicitCRDs(t *testing.T) { undeployPath := filepath.Join(outputDir, "undeploy.sh") stubDir := t.TempDir() - - helmStub := "#!/bin/sh\n# helm get manifest returns empty (no templated CRDs)\nexit 0\n" - if err := os.WriteFile(filepath.Join(stubDir, "helm"), []byte(helmStub), 0o755); err != nil { - t.Fatalf("write helm stub: %v", err) - } - - kubectlStub := `#!/bin/sh + writeStub(t, stubDir, "helm", "#!/bin/sh\nexit 0\n") + writeStub(t, stubDir, "kubectl", `#!/bin/sh case "$*" in "get crd -o json") echo '{"items":[{"metadata":{"name":"prometheuses.monitoring.coreos.com"},"spec":{"group":"monitoring.coreos.com","names":{"plural":"prometheuses"},"scope":"Namespaced"}}]}' @@ -2701,58 +1346,23 @@ case "$*" in exit 0 ;; esac -` - if err := os.WriteFile(filepath.Join(stubDir, "kubectl"), []byte(kubectlStub), 0o755); err != nil { - t.Fatalf("write kubectl stub: %v", err) - } - - bashSnippet := ` - for fn in extra_crds_for_release capture_kubectl_json check_crd_for_stuck_resources check_release_for_stuck_crds; do - snippet=$(sed -n "/^${fn}()/,/^}/p" "$UNDEPLOY") - eval "$snippet" - done - PREFLIGHT_DETAILS=$(mktemp) - check_release_for_stuck_crds "kube-prometheus-stack" "monitoring" - cat "$PREFLIGHT_DETAILS" - rm -f "$PREFLIGHT_DETAILS" - ` +`) - subCtx, cancel := context.WithTimeout(ctx, 30*time.Second) - defer cancel() - cmd := exec.CommandContext(subCtx, "bash", "-c", "set -euo pipefail\n"+bashSnippet) - cmd.Env = append(os.Environ(), - "PATH="+stubDir+":"+os.Getenv("PATH"), - "UNDEPLOY="+undeployPath, - ) - var stdout, stderr bytes.Buffer - cmd.Stdout = &stdout - cmd.Stderr = &stderr - if err := cmd.Run(); err != nil { - t.Fatalf("script exited 
non-zero.\nerr: %v\nstdout: %s\nstderr: %s", - err, stdout.String(), stderr.String()) - } + stdout, stderr := runPreflightSnippet(t, ctx, stubDir, undeployPath, + `check_release_for_stuck_crds "kube-prometheus-stack" "monitoring"`) - out := stdout.String() - if !strings.Contains(out, "prometheuses.monitoring.coreos.com") { - t.Errorf("expected pre-flight output to name the explicit CRD prometheuses.monitoring.coreos.com; got: %q", out) - } - if !strings.Contains(out, "monitoring/aicr-prometheus") { - t.Errorf("expected pre-flight to report the explicit CR instance; got: %q", out) - } - if !strings.Contains(out, "monitoring.coreos.com/operator") { - t.Errorf("expected pre-flight to surface the unreconciled finalizer; got: %q", out) + for _, want := range []string{"prometheuses.monitoring.coreos.com", "monitoring/aicr-prometheus", "monitoring.coreos.com/operator"} { + if !strings.Contains(stdout, want) { + t.Errorf("expected pre-flight to surface %q; stdout=%q stderr=%q", want, stdout, stderr) + } } } // TestUndeployScript_PreflightDiscoversAnnotatedCRDs proves the retained -// annotation-based discovery still catches release-owned CRDs even when -// helm get manifest is empty (e.g. chart stores CRDs outside templates/). +// annotation-based discovery still catches release-owned CRDs when helm get +// manifest is empty. 
func TestUndeployScript_PreflightDiscoversAnnotatedCRDs(t *testing.T) { - for _, bin := range []string{"bash", "awk", "sed", "jq"} { - if _, err := exec.LookPath(bin); err != nil { - t.Skipf("%s not available; skipping shell-behavior test", bin) - } - } + skipIfMissingBins(t, "bash", "awk", "sed", "jq") ctx := context.Background() outputDir := t.TempDir() @@ -2770,13 +1380,8 @@ func TestUndeployScript_PreflightDiscoversAnnotatedCRDs(t *testing.T) { undeployPath := filepath.Join(outputDir, "undeploy.sh") stubDir := t.TempDir() - - helmStub := "#!/bin/sh\nexit 0\n" - if err := os.WriteFile(filepath.Join(stubDir, "helm"), []byte(helmStub), 0o755); err != nil { - t.Fatalf("write helm stub: %v", err) - } - - kubectlStub := `#!/bin/sh + writeStub(t, stubDir, "helm", "#!/bin/sh\nexit 0\n") + writeStub(t, stubDir, "kubectl", `#!/bin/sh case "$*" in "get crd -o json") echo '{"items":[{"metadata":{"name":"challenges.acme.cert-manager.io","annotations":{"meta.helm.sh/release-name":"cert-manager","meta.helm.sh/release-namespace":"cert-manager"}},"spec":{"group":"acme.cert-manager.io","names":{"plural":"challenges"},"scope":"Namespaced"}}]}' @@ -2791,59 +1396,23 @@ case "$*" in exit 0 ;; esac -` - if err := os.WriteFile(filepath.Join(stubDir, "kubectl"), []byte(kubectlStub), 0o755); err != nil { - t.Fatalf("write kubectl stub: %v", err) - } - - bashSnippet := ` - for fn in extra_crds_for_release capture_kubectl_json check_crd_for_stuck_resources check_release_for_stuck_crds; do - snippet=$(sed -n "/^${fn}()/,/^}/p" "$UNDEPLOY") - eval "$snippet" - done - PREFLIGHT_DETAILS=$(mktemp) - check_release_for_stuck_crds "cert-manager" "cert-manager" - cat "$PREFLIGHT_DETAILS" - rm -f "$PREFLIGHT_DETAILS" - ` +`) - subCtx, cancel := context.WithTimeout(ctx, 30*time.Second) - defer cancel() - cmd := exec.CommandContext(subCtx, "bash", "-c", "set -euo pipefail\n"+bashSnippet) - cmd.Env = append(os.Environ(), - "PATH="+stubDir+":"+os.Getenv("PATH"), - "UNDEPLOY="+undeployPath, - ) - var 
stdout, stderr bytes.Buffer - cmd.Stdout = &stdout - cmd.Stderr = &stderr - if err := cmd.Run(); err != nil { - t.Fatalf("script exited non-zero.\nerr: %v\nstdout: %s\nstderr: %s", - err, stdout.String(), stderr.String()) - } + stdout, stderr := runPreflightSnippet(t, ctx, stubDir, undeployPath, + `check_release_for_stuck_crds "cert-manager" "cert-manager"`) - out := stdout.String() - if !strings.Contains(out, "challenges.acme.cert-manager.io") { - t.Errorf("expected pre-flight output to name the annotated CRD challenges.acme.cert-manager.io; got: %q", out) - } - if !strings.Contains(out, "cert-manager/test-challenge") { - t.Errorf("expected pre-flight to report the annotated CR instance; got: %q", out) - } - if !strings.Contains(out, "acme.cert-manager.io/finalizer") { - t.Errorf("expected pre-flight to surface the unreconciled finalizer; got: %q", out) + for _, want := range []string{"challenges.acme.cert-manager.io", "cert-manager/test-challenge", "acme.cert-manager.io/finalizer"} { + if !strings.Contains(stdout, want) { + t.Errorf("expected pre-flight to surface %q; stdout=%q stderr=%q", want, stdout, stderr) + } } } // TestUndeployScript_PreflightSkipListCoversManifestDeletedReleases keeps the -// simplified fix explicit: releases whose bundle-managed CRs are deleted from -// manifests before controller uninstall are skipped at pre-flight instead of -// reintroducing manifest parsing and ownership inference. +// explicit skip list for releases whose dependent CRs are deleted from +// manifests before controller uninstall. 
func TestUndeployScript_PreflightSkipListCoversManifestDeletedReleases(t *testing.T) { - for _, bin := range []string{"bash", "sed"} { - if _, err := exec.LookPath(bin); err != nil { - t.Skipf("%s not available; skipping shell-behavior test", bin) - } - } + skipIfMissingBins(t, "bash", "sed") ctx := context.Background() outputDir := t.TempDir() @@ -2856,31 +1425,11 @@ func TestUndeployScript_PreflightSkipListCoversManifestDeletedReleases(t *testin AppliedOverlays []string `json:"appliedOverlays,omitempty" yaml:"appliedOverlays,omitempty"` ExcludedOverlays []recipe.ExcludedOverlay `json:"excludedOverlays,omitempty" yaml:"excludedOverlays,omitempty"` ConstraintWarnings []recipe.ConstraintWarning `json:"constraintWarnings,omitempty" yaml:"constraintWarnings,omitempty"` - }{ - Version: "v0.1.0", - }, + }{Version: "v0.1.0"}, ComponentRefs: []recipe.ComponentRef{ - { - Name: "cert-manager", - Namespace: "cert-manager", - Chart: "cert-manager", - Version: "v1.17.2", - Source: "https://charts.jetstack.io", - }, - { - Name: "kgateway", - Namespace: "kgateway-system", - Chart: "kgateway", - Version: "v0.1.0", - Source: "https://example.invalid/charts", - }, - { - Name: "nodewright-operator", - Namespace: "skyhook", - Chart: "nodewright-operator", - Version: "v0.1.0", - Source: "https://example.invalid/charts", - }, + {Name: "cert-manager", Namespace: "cert-manager", Chart: "cert-manager", Version: "v1.17.2", Source: "https://charts.jetstack.io"}, + {Name: "kgateway", Namespace: "kgateway-system", Chart: "kgateway", Version: "v0.1.0", Source: "https://example.invalid/charts"}, + {Name: "nodewright-operator", Namespace: "skyhook", Chart: "nodewright-operator", Version: "v0.1.0", Source: "https://example.invalid/charts"}, }, DeploymentOrder: []string{"cert-manager", "kgateway", "nodewright-operator"}, }, @@ -2921,27 +1470,18 @@ func TestUndeployScript_PreflightSkipListCoversManifestDeletedReleases(t *testin } out := stdout.String() - if !strings.Contains(out, 
"skip:nodewright-operator") { - t.Errorf("expected skip list to include nodewright-operator; stdout: %q stderr: %q", out, stderr.String()) - } - if !strings.Contains(out, "skip:kgateway") { - t.Errorf("expected skip list to include kgateway; stdout: %q stderr: %q", out, stderr.String()) - } - if !strings.Contains(out, "check:cert-manager") { - t.Errorf("expected cert-manager to remain pre-flight checked; stdout: %q stderr: %q", out, stderr.String()) + for _, want := range []string{"skip:nodewright-operator", "skip:kgateway", "check:cert-manager"} { + if !strings.Contains(out, want) { + t.Errorf("expected %q in output; stdout=%q stderr=%q", want, out, stderr.String()) + } } } // TestUndeployScript_PreflightSkipsForeignAnnotatedExtraCRDs preserves the -// shared-cluster safety property for explicit CRD overrides: if a known CRD -// name is clearly annotated to a different Helm release, pre-flight must not -// scan its CRs for this release. +// shared-cluster safety property for explicit CRD overrides annotated to a +// different release. 
func TestUndeployScript_PreflightSkipsForeignAnnotatedExtraCRDs(t *testing.T) { - for _, bin := range []string{"bash", "awk", "sed", "jq"} { - if _, err := exec.LookPath(bin); err != nil { - t.Skipf("%s not available; skipping shell-behavior test", bin) - } - } + skipIfMissingBins(t, "bash", "awk", "sed", "jq") ctx := context.Background() outputDir := t.TempDir() @@ -2959,13 +1499,8 @@ func TestUndeployScript_PreflightSkipsForeignAnnotatedExtraCRDs(t *testing.T) { undeployPath := filepath.Join(outputDir, "undeploy.sh") stubDir := t.TempDir() - - helmStub := "#!/bin/sh\nexit 0\n" - if err := os.WriteFile(filepath.Join(stubDir, "helm"), []byte(helmStub), 0o755); err != nil { - t.Fatalf("write helm stub: %v", err) - } - - kubectlStub := `#!/bin/sh + writeStub(t, stubDir, "helm", "#!/bin/sh\nexit 0\n") + writeStub(t, stubDir, "kubectl", `#!/bin/sh case "$*" in "get crd -o json") echo '{"items":[{"metadata":{"name":"clusterpolicies.nvidia.com","annotations":{"meta.helm.sh/release-name":"other-release","meta.helm.sh/release-namespace":"other-ns"}},"spec":{"group":"nvidia.com","names":{"plural":"clusterpolicies"},"scope":"Cluster"}}]}' @@ -2980,57 +1515,24 @@ case "$*" in exit 0 ;; esac -` - if err := os.WriteFile(filepath.Join(stubDir, "kubectl"), []byte(kubectlStub), 0o755); err != nil { - t.Fatalf("write kubectl stub: %v", err) - } - - bashSnippet := ` - for fn in extra_crds_for_release capture_kubectl_json check_crd_for_stuck_resources check_release_for_stuck_crds; do - snippet=$(sed -n "/^${fn}()/,/^}/p" "$UNDEPLOY") - eval "$snippet" - done - PREFLIGHT_DETAILS=$(mktemp) - check_release_for_stuck_crds "gpu-operator" "gpu-operator" - cat "$PREFLIGHT_DETAILS" - rm -f "$PREFLIGHT_DETAILS" - ` +`) - subCtx, cancel := context.WithTimeout(ctx, 30*time.Second) - defer cancel() - cmd := exec.CommandContext(subCtx, "bash", "-c", "set -euo pipefail\n"+bashSnippet) - cmd.Env = append(os.Environ(), - "PATH="+stubDir+":"+os.Getenv("PATH"), - "UNDEPLOY="+undeployPath, - ) - var 
stdout, stderr bytes.Buffer - cmd.Stdout = &stdout - cmd.Stderr = &stderr - if err := cmd.Run(); err != nil { - t.Fatalf("script exited non-zero.\nerr: %v\nstdout: %s\nstderr: %s", - err, stdout.String(), stderr.String()) - } + stdout, stderr := runPreflightSnippet(t, ctx, stubDir, undeployPath, + `check_release_for_stuck_crds "gpu-operator" "gpu-operator"`) - out := stdout.String() - leaked := strings.Contains(out, "clusterpolicies.nvidia.com") || - strings.Contains(out, "foreign-policy") || - strings.Contains(out, "nvidia.com/clusterpolicy") + leaked := strings.Contains(stdout, "clusterpolicies.nvidia.com") || + strings.Contains(stdout, "foreign-policy") || + strings.Contains(stdout, "nvidia.com/clusterpolicy") if leaked { - t.Errorf("pre-flight scanned an explicit override CRD annotated to a different Helm release.\nstdout: %q\nstderr: %q", - out, stderr.String()) + t.Errorf("pre-flight scanned a CRD annotated to a different Helm release.\nstdout=%q stderr=%q", stdout, stderr) } } -// TestUndeployScript_PreflightFailsClosedOnKubectlError asserts that a -// transient `kubectl get crd` failure (API 502, auth hiccup, etc.) causes -// pre-flight to fail closed with a clear error message — NOT silently treat -// "API error" as "no CRDs to check" and proceed to uninstall the operator. +// TestUndeployScript_PreflightFailsClosedOnKubectlError asserts a transient +// `kubectl get crd` failure causes pre-flight to fail closed with a clear +// error message. 
func TestUndeployScript_PreflightFailsClosedOnKubectlError(t *testing.T) { - for _, bin := range []string{"bash", "awk", "sed", "jq"} { - if _, err := exec.LookPath(bin); err != nil { - t.Skipf("%s not available; skipping shell-behavior test", bin) - } - } + skipIfMissingBins(t, "bash", "awk", "sed", "jq") ctx := context.Background() outputDir := t.TempDir() @@ -3048,16 +1550,8 @@ func TestUndeployScript_PreflightFailsClosedOnKubectlError(t *testing.T) { undeployPath := filepath.Join(outputDir, "undeploy.sh") stubDir := t.TempDir() - - helmStub := "#!/bin/sh\nexit 0\n" - if err := os.WriteFile(filepath.Join(stubDir, "helm"), []byte(helmStub), 0o755); err != nil { - t.Fatalf("write helm stub: %v", err) - } - - // Simulate a transient API failure on the CRD list call that both retained - // discovery sources depend on. Before the fail-closed fix, this could - // silently return "" and let pre-flight fast-path to success. - kubectlStub := `#!/bin/sh + writeStub(t, stubDir, "helm", "#!/bin/sh\nexit 0\n") + writeStub(t, stubDir, "kubectl", `#!/bin/sh case "$*" in "get crd -o json") echo "error: the server is currently unable to handle the request (get customresourcedefinitions.apiextensions.k8s.io)" >&2 @@ -3067,10 +1561,7 @@ case "$*" in exit 0 ;; esac -` - if err := os.WriteFile(filepath.Join(stubDir, "kubectl"), []byte(kubectlStub), 0o755); err != nil { - t.Fatalf("write kubectl stub: %v", err) - } +`) bashSnippet := ` for fn in extra_crds_for_release capture_kubectl_json check_crd_for_stuck_resources check_release_for_stuck_crds; do @@ -3096,32 +1587,28 @@ esac err := cmd.Run() if err == nil { - t.Fatalf("regression: check_release_for_stuck_crds returned 0 despite kubectl error — pre-flight would silently pass on transient API failure.\nstdout: %q\nstderr: %q", + t.Fatalf("regression: check_release_for_stuck_crds returned 0 despite kubectl error.\nstdout=%q stderr=%q", stdout.String(), stderr.String()) } if strings.Contains(stderr.String(), "UNREACHABLE") { - 
t.Fatalf("regression: execution continued past check_release_for_stuck_crds on kubectl error.\nstderr: %q", stderr.String()) - } - if !strings.Contains(stderr.String(), "ERROR: Pre-flight could not list CRDs") { - t.Errorf("expected clear ERROR message about pre-flight failing closed; stderr: %q", stderr.String()) - } - if !strings.Contains(stderr.String(), "gpu-operator") { - t.Errorf("expected ERROR to name the release; stderr: %q", stderr.String()) + t.Fatalf("regression: execution continued past check_release_for_stuck_crds.\nstderr=%q", stderr.String()) } - if !strings.Contains(stderr.String(), "--skip-preflight") { - t.Errorf("expected ERROR to point the operator to --skip-preflight bypass; stderr: %q", stderr.String()) + for _, want := range []string{ + "ERROR: Pre-flight could not list CRDs", + "gpu-operator", + "--skip-preflight", + } { + if !strings.Contains(stderr.String(), want) { + t.Errorf("expected error text %q; stderr=%q", want, stderr.String()) + } } } // TestUndeployScript_PreflightPreservesJSONWhenKubectlWarnsOnStderr asserts -// that successful `kubectl ... -o json` calls remain parseable even when -// kubectl emits warnings on stderr. +// that successful kubectl output remains parseable even when warnings are on +// stderr. 
func TestUndeployScript_PreflightPreservesJSONWhenKubectlWarnsOnStderr(t *testing.T) { - for _, bin := range []string{"bash", "awk", "sed", "jq"} { - if _, err := exec.LookPath(bin); err != nil { - t.Skipf("%s not available; skipping shell-behavior test", bin) - } - } + skipIfMissingBins(t, "bash", "awk", "sed", "jq") ctx := context.Background() outputDir := t.TempDir() @@ -3139,13 +1626,8 @@ func TestUndeployScript_PreflightPreservesJSONWhenKubectlWarnsOnStderr(t *testin undeployPath := filepath.Join(outputDir, "undeploy.sh") stubDir := t.TempDir() - - helmStub := "#!/bin/sh\nexit 0\n" - if err := os.WriteFile(filepath.Join(stubDir, "helm"), []byte(helmStub), 0o755); err != nil { - t.Fatalf("write helm stub: %v", err) - } - - kubectlStub := `#!/bin/sh + writeStub(t, stubDir, "helm", "#!/bin/sh\nexit 0\n") + writeStub(t, stubDir, "kubectl", `#!/bin/sh case "$*" in "get crd -o json") echo "warning: cached discovery response" >&2 @@ -3154,94 +1636,41 @@ case "$*" in "get crd clusterpolicies.nvidia.com -o json") echo "warning: cached CRD read" >&2 echo '{"spec":{"names":{"plural":"clusterpolicies"},"group":"nvidia.com","scope":"Cluster"}}' - ;; - "get clusterpolicies.nvidia.com -o json") - echo "warning: cached CR list" >&2 - echo '{"items":[{"metadata":{"name":"cluster-policy","finalizers":["nvidia.com/clusterpolicy"]}}]}' - ;; - *) - exit 0 - ;; -esac -` - if err := os.WriteFile(filepath.Join(stubDir, "kubectl"), []byte(kubectlStub), 0o755); err != nil { - t.Fatalf("write kubectl stub: %v", err) - } - - bashSnippet := ` - for fn in extra_crds_for_release capture_kubectl_json check_crd_for_stuck_resources check_release_for_stuck_crds; do - snippet=$(sed -n "/^${fn}()/,/^}/p" "$UNDEPLOY") - eval "$snippet" - done - PREFLIGHT_DETAILS=$(mktemp) - check_release_for_stuck_crds "gpu-operator" "gpu-operator" - cat "$PREFLIGHT_DETAILS" - rm -f "$PREFLIGHT_DETAILS" - ` - - subCtx, cancel := context.WithTimeout(ctx, 30*time.Second) - defer cancel() - cmd := 
exec.CommandContext(subCtx, "bash", "-c", "set -euo pipefail\n"+bashSnippet) - cmd.Env = append(os.Environ(), - "PATH="+stubDir+":"+os.Getenv("PATH"), - "UNDEPLOY="+undeployPath, - ) - var stdout, stderr bytes.Buffer - cmd.Stdout = &stdout - cmd.Stderr = &stderr - if err := cmd.Run(); err != nil { - t.Fatalf("script exited non-zero.\nerr: %v\nstdout: %s\nstderr: %s", - err, stdout.String(), stderr.String()) - } + ;; + "get clusterpolicies.nvidia.com -o json") + echo "warning: cached CR list" >&2 + echo '{"items":[{"metadata":{"name":"cluster-policy","finalizers":["nvidia.com/clusterpolicy"]}}]}' + ;; + *) + exit 0 + ;; +esac +`) - if !strings.Contains(stdout.String(), "cluster-policy") { - t.Errorf("expected pre-flight to keep parsing JSON despite kubectl stderr warnings; got stdout: %q\nstderr: %q", - stdout.String(), stderr.String()) + stdout, stderr := runPreflightSnippet(t, ctx, stubDir, undeployPath, + `check_release_for_stuck_crds "gpu-operator" "gpu-operator"`) + + if !strings.Contains(stdout, "cluster-policy") { + t.Errorf("expected pre-flight to keep parsing JSON despite kubectl warnings; stdout=%q stderr=%q", stdout, stderr) } - if !strings.Contains(stderr.String(), "warning: cached discovery response") { - t.Errorf("expected kubectl stderr warning to remain visible to operators; stderr: %q", stderr.String()) + if !strings.Contains(stderr, "warning: cached discovery response") { + t.Errorf("expected kubectl stderr warning to remain visible; stderr=%q", stderr) } } -// TestUndeployScript_PostflightWarnsOnExplicitExtraCRDs proves post-flight now -// surfaces leftover exact-name CRDs from known crds/-installed releases even -// when they are unannotated and therefore invisible to the Helm-only warning. +// TestUndeployScript_PostflightWarnsOnExplicitExtraCRDs proves post-flight +// surfaces leftover exact-name CRDs from known installed releases. 
func TestUndeployScript_PostflightWarnsOnExplicitExtraCRDs(t *testing.T) { - for _, bin := range []string{"bash", "sed", "jq", "awk", "sort", "tr"} { - if _, err := exec.LookPath(bin); err != nil { - t.Skipf("%s not available; skipping shell-behavior test", bin) - } - } + skipIfMissingBins(t, "bash", "sed", "jq", "awk", "sort", "tr") ctx := context.Background() outputDir := t.TempDir() g := &Generator{ - RecipeResult: &recipe.RecipeResult{ - Kind: "RecipeResult", - APIVersion: "aicr.nvidia.com/v1alpha1", - Metadata: struct { - Version string `json:"version,omitempty" yaml:"version,omitempty"` - AppliedOverlays []string `json:"appliedOverlays,omitempty" yaml:"appliedOverlays,omitempty"` - ExcludedOverlays []recipe.ExcludedOverlay `json:"excludedOverlays,omitempty" yaml:"excludedOverlays,omitempty"` - ConstraintWarnings []recipe.ConstraintWarning `json:"constraintWarnings,omitempty" yaml:"constraintWarnings,omitempty"` - }{ - Version: "v0.1.0", - }, - ComponentRefs: []recipe.ComponentRef{ - { - Name: "kube-prometheus-stack", - Namespace: "monitoring", - Chart: "prometheus-community/kube-prometheus-stack", - Version: "82.8.0", - Source: "https://prometheus-community.github.io/helm-charts", - }, - }, - DeploymentOrder: []string{"kube-prometheus-stack"}, - }, - ComponentValues: map[string]map[string]any{ - "kube-prometheus-stack": {}, - }, - Version: "v1.0.0", + RecipeResult: singleComponentRecipe("kube-prometheus-stack", "monitoring", + "prometheus-community/kube-prometheus-stack", "82.8.0", + "https://prometheus-community.github.io/helm-charts"), + ComponentValues: map[string]map[string]any{"kube-prometheus-stack": {}}, + Version: "v1.0.0", } if _, err := g.Generate(ctx, outputDir); err != nil { t.Fatalf("Generate failed: %v", err) @@ -3249,8 +1678,7 @@ func TestUndeployScript_PostflightWarnsOnExplicitExtraCRDs(t *testing.T) { undeployPath := filepath.Join(outputDir, "undeploy.sh") stubDir := t.TempDir() - - kubectlStub := `#!/bin/sh + writeStub(t, stubDir, "kubectl", 
`#!/bin/sh case "$*" in "get crd -o json") echo '{"items":[{"metadata":{"name":"prometheuses.monitoring.coreos.com"},"spec":{"group":"monitoring.coreos.com","names":{"plural":"prometheuses"},"scope":"Namespaced"}}]}' @@ -3259,10 +1687,7 @@ case "$*" in exit 0 ;; esac -` - if err := os.WriteFile(filepath.Join(stubDir, "kubectl"), []byte(kubectlStub), 0o755); err != nil { - t.Fatalf("write kubectl stub: %v", err) - } +`) bashSnippet := ` for fn in extra_crds_for_release capture_kubectl_json; do @@ -3294,52 +1719,102 @@ esac } out := stdout.String() - if !strings.Contains(out, "WARNING: explicit CRDs from this bundle still present:") { - t.Errorf("expected post-flight explicit CRD warning; got stdout: %q stderr: %q", out, stderr.String()) - } - if !strings.Contains(out, "prometheuses.monitoring.coreos.com") { - t.Errorf("expected post-flight warning to name the leftover CRD; got stdout: %q stderr: %q", out, stderr.String()) + for _, want := range []string{ + "WARNING: explicit CRDs from this bundle still present:", + "prometheuses.monitoring.coreos.com", + } { + if !strings.Contains(out, want) { + t.Errorf("expected %q in post-flight output; stdout=%q stderr=%q", want, out, stderr.String()) + } } } -func TestUndeployScript_KustomizeOnlyBundleIsBashSyntaxValid(t *testing.T) { - if _, err := exec.LookPath("bash"); err != nil { - t.Skip("bash not available; skipping shell-syntax test") - } +// TestUndeployScript_DynamoPlatformOwnsExplicitGroveCRDs verifies the +// dynamo-platform release owns the Grove CRDs via extra_crds_for_release. 
+func TestUndeployScript_DynamoPlatformOwnsExplicitGroveCRDs(t *testing.T) { + skipIfMissingBins(t, "bash", "sed") ctx := context.Background() outputDir := t.TempDir() g := &Generator{ - RecipeResult: createKustomizeRecipeResult(), - ComponentValues: map[string]map[string]any{ - "my-kustomize-app": {}, - }, - Version: "v1.0.0", + RecipeResult: singleComponentRecipe("dynamo-platform", "dynamo-platform", + "oci://example.com/dynamo-platform", "0.9.1", + "oci://example.com"), + ComponentValues: map[string]map[string]any{"dynamo-platform": {}}, + Version: "v1.0.0", } if _, err := g.Generate(ctx, outputDir); err != nil { t.Fatalf("Generate failed: %v", err) } - undeployPath := filepath.Join(outputDir, "undeploy.sh") + + bashSnippet := ` + snippet=$(sed -n '/^extra_crds_for_release()/,/^}/p' "$UNDEPLOY") + eval "$snippet" + platform_crds=$(extra_crds_for_release "dynamo-platform") + printf '%s\n' "$platform_crds" + test -n "$platform_crds" + test -z "$(extra_crds_for_release "dynamo-crds")" + ` + subCtx, cancel := context.WithTimeout(ctx, 30*time.Second) defer cancel() - cmd := exec.CommandContext(subCtx, "bash", "-n", undeployPath) - output, err := cmd.CombinedOutput() - if err != nil { - t.Fatalf("generated undeploy.sh is not bash-syntax valid for Kustomize-only bundle.\nerr: %v\noutput: %s", - err, string(output)) + cmd := exec.CommandContext(subCtx, "bash", "-c", "set -euo pipefail\n"+bashSnippet) + cmd.Env = append(os.Environ(), "UNDEPLOY="+undeployPath) + var stdout, stderr bytes.Buffer + cmd.Stdout = &stdout + cmd.Stderr = &stderr + if err := cmd.Run(); err != nil { + t.Fatalf("script exited non-zero.\nerr: %v\nstdout: %s\nstderr: %s", + err, stdout.String(), stderr.String()) } -} -func TestUndeployScript_DynamoPlatformOwnsExplicitGroveCRDs(t *testing.T) { - for _, bin := range []string{"bash", "sed"} { - if _, err := exec.LookPath(bin); err != nil { - t.Skipf("%s not available; skipping shell-behavior test", bin) + for _, crd := range []string{ + 
"podcliques.grove.io", + "podcliquescalinggroups.grove.io", + "podcliquesets.grove.io", + "podgangs.scheduler.grove.io", + } { + if !strings.Contains(stdout.String(), crd) { + t.Errorf("expected dynamo-platform explicit CRD list to include %s; stdout=%q stderr=%q", + crd, stdout.String(), stderr.String()) } } +} - ctx := context.Background() - outputDir := t.TempDir() +// --------------------------------------------------------------------------- +// Golden-file bundle tests +// --------------------------------------------------------------------------- +// +// These tests assert full-tree equivalence of the rendered bundle against +// committed goldens under testdata//. Regenerate with: +// +// go test ./pkg/bundler/deployer/helm/... -run TestBundleGolden -update +// +// The goldens double as reference examples of what a rendered bundle looks +// like for each common shape: upstream-helm-only, manifest-only, mixed +// (upstream + raw manifests → primary + -post folder), kai-scheduler (async +// block), and nodewright-operator (pre-install taint cleanup block). 
+ +func TestBundleGolden_UpstreamHelmOnly(t *testing.T) { + outDir := t.TempDir() + g := &Generator{ + RecipeResult: singleComponentRecipe( + "cert-manager", "cert-manager", "cert-manager", "v1.17.2", + "https://charts.jetstack.io"), + ComponentValues: map[string]map[string]any{ + "cert-manager": {"crds": map[string]any{"enabled": true}}, + }, + Version: "v1.0.0", + } + if _, err := g.Generate(context.Background(), outDir); err != nil { + t.Fatalf("Generate: %v", err) + } + assertBundleGolden(t, outDir, "testdata/upstream_helm_only") +} + +func TestBundleGolden_ManifestOnly(t *testing.T) { + outDir := t.TempDir() g := &Generator{ RecipeResult: &recipe.RecipeResult{ Kind: "RecipeResult", @@ -3349,43 +1824,167 @@ func TestUndeployScript_DynamoPlatformOwnsExplicitGroveCRDs(t *testing.T) { AppliedOverlays []string `json:"appliedOverlays,omitempty" yaml:"appliedOverlays,omitempty"` ExcludedOverlays []recipe.ExcludedOverlay `json:"excludedOverlays,omitempty" yaml:"excludedOverlays,omitempty"` ConstraintWarnings []recipe.ConstraintWarning `json:"constraintWarnings,omitempty" yaml:"constraintWarnings,omitempty"` - }{ - Version: "v0.1.0", - }, + }{Version: "v0.1.0"}, ComponentRefs: []recipe.ComponentRef{ - { - Name: "dynamo-platform", - Namespace: "dynamo-platform", - Chart: "oci://example.com/dynamo-platform", - Version: "0.9.1", - Source: "oci://example.com", - }, + {Name: "skyhook-customizations", Namespace: "skyhook"}, + }, + DeploymentOrder: []string{"skyhook-customizations"}, + }, + ComponentValues: map[string]map[string]any{"skyhook-customizations": {}}, + ComponentManifests: map[string]map[string][]byte{ + "skyhook-customizations": { + "components/skyhook-customizations/manifests/customization.yaml": []byte(`# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. 
+ +apiVersion: v1 +kind: ConfigMap +metadata: + name: customization +`), }, - DeploymentOrder: []string{"dynamo-platform"}, }, + Version: "v1.0.0", + } + if _, err := g.Generate(context.Background(), outDir); err != nil { + t.Fatalf("Generate: %v", err) + } + assertBundleGolden(t, outDir, "testdata/manifest_only") +} + +func TestBundleGolden_MixedGPUOperator(t *testing.T) { + outDir := t.TempDir() + g := &Generator{ + RecipeResult: singleComponentRecipe( + "gpu-operator", "gpu-operator", "gpu-operator", "v25.3.3", + "https://helm.ngc.nvidia.com/nvidia"), ComponentValues: map[string]map[string]any{ - "dynamo-platform": {}, + "gpu-operator": {"driver": map[string]any{"enabled": true}}, + }, + ComponentManifests: map[string]map[string][]byte{ + "gpu-operator": { + "components/gpu-operator/manifests/dcgm-exporter.yaml": []byte(`# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. + +apiVersion: v1 +kind: Service +metadata: + name: dcgm-exporter +`), + }, }, Version: "v1.0.0", } - if _, err := g.Generate(ctx, outputDir); err != nil { - t.Fatalf("Generate failed: %v", err) + if _, err := g.Generate(context.Background(), outDir); err != nil { + t.Fatalf("Generate: %v", err) } - undeployPath := filepath.Join(outputDir, "undeploy.sh") + assertBundleGolden(t, outDir, "testdata/mixed_gpu_operator") +} + +func TestBundleGolden_KaiSchedulerPresent(t *testing.T) { + outDir := t.TempDir() + g := &Generator{ + RecipeResult: &recipe.RecipeResult{ + Kind: "RecipeResult", + APIVersion: "aicr.nvidia.com/v1alpha1", + Metadata: struct { + Version string `json:"version,omitempty" yaml:"version,omitempty"` + AppliedOverlays []string `json:"appliedOverlays,omitempty" yaml:"appliedOverlays,omitempty"` + ExcludedOverlays []recipe.ExcludedOverlay `json:"excludedOverlays,omitempty" yaml:"excludedOverlays,omitempty"` + ConstraintWarnings []recipe.ConstraintWarning `json:"constraintWarnings,omitempty" yaml:"constraintWarnings,omitempty"` + }{Version: "v0.1.0"}, + ComponentRefs: 
[]recipe.ComponentRef{ + { + Name: "kai-scheduler", + Namespace: "kai-scheduler", + Chart: "kai-scheduler", + Version: "v0.13.0", + Source: "oci://ghcr.io/nvidia/kai-scheduler", + }, + }, + DeploymentOrder: []string{"kai-scheduler"}, + }, + ComponentValues: map[string]map[string]any{"kai-scheduler": {}}, + Version: "v1.0.0", + } + if _, err := g.Generate(context.Background(), outDir); err != nil { + t.Fatalf("Generate: %v", err) + } + assertBundleGolden(t, outDir, "testdata/kai_scheduler_present") +} + +func TestBundleGolden_NodewrightPresent(t *testing.T) { + outDir := t.TempDir() + // Mirror the production registry: component name "nodewright-operator" + // but the upstream chart is still named "skyhook-operator". This shape + // is what real recipes have post-rename — the registry component name + // drives the name-matched taint cleanup blocks; the chart name drives + // helm install. + g := &Generator{ + RecipeResult: singleComponentRecipe( + "nodewright-operator", "skyhook", "skyhook-operator", "v0.1.0", + "https://example.invalid/charts"), + ComponentValues: map[string]map[string]any{"nodewright-operator": {}}, + Version: "v1.0.0", + } + if _, err := g.Generate(context.Background(), outDir); err != nil { + t.Fatalf("Generate: %v", err) + } + assertBundleGolden(t, outDir, "testdata/nodewright_present") +} + +// --------------------------------------------------------------------------- +// Helpers +// --------------------------------------------------------------------------- + +// readFile reads a file or fails the test with a clear message. +func readFile(t *testing.T, path string) string { + t.Helper() + b, err := os.ReadFile(path) + if err != nil { + t.Fatalf("read %s: %v", path, err) + } + return string(b) +} + +// skipIfMissingBins skips the test if any of the named binaries are missing. 
+func skipIfMissingBins(t *testing.T, bins ...string) { + t.Helper() + for _, b := range bins { + if _, err := exec.LookPath(b); err != nil { + t.Skipf("%s not available; skipping shell-behavior test", b) + } + } +} + +// writeStub creates an executable stub at stubDir/name. +func writeStub(t *testing.T, stubDir, name, content string) { + t.Helper() + if err := os.WriteFile(filepath.Join(stubDir, name), []byte(content), 0o755); err != nil { + t.Fatalf("write %s stub: %v", name, err) + } +} +// runPreflightSnippet sources the pre-flight helpers from undeployPath and +// runs the given snippet under bash with stubDir prepended to PATH. Returns +// captured stdout and stderr. +func runPreflightSnippet(t *testing.T, ctx context.Context, stubDir, undeployPath, call string) (string, string) { + t.Helper() bashSnippet := ` - snippet=$(sed -n '/^extra_crds_for_release()/,/^}/p' "$UNDEPLOY") - eval "$snippet" - platform_crds=$(extra_crds_for_release "dynamo-platform") - printf '%s\n' "$platform_crds" - test -n "$platform_crds" - test -z "$(extra_crds_for_release "dynamo-crds")" + for fn in extra_crds_for_release capture_kubectl_json check_crd_for_stuck_resources check_release_for_stuck_crds; do + snippet=$(sed -n "/^${fn}()/,/^}/p" "$UNDEPLOY") + eval "$snippet" + done + PREFLIGHT_DETAILS=$(mktemp) + ` + call + ` + cat "$PREFLIGHT_DETAILS" + rm -f "$PREFLIGHT_DETAILS" ` subCtx, cancel := context.WithTimeout(ctx, 30*time.Second) defer cancel() cmd := exec.CommandContext(subCtx, "bash", "-c", "set -euo pipefail\n"+bashSnippet) - cmd.Env = append(os.Environ(), "UNDEPLOY="+undeployPath) + cmd.Env = append(os.Environ(), + "PATH="+stubDir+":"+os.Getenv("PATH"), + "UNDEPLOY="+undeployPath, + ) var stdout, stderr bytes.Buffer cmd.Stdout = &stdout cmd.Stderr = &stderr @@ -3393,16 +1992,143 @@ func TestUndeployScript_DynamoPlatformOwnsExplicitGroveCRDs(t *testing.T) { t.Fatalf("script exited non-zero.\nerr: %v\nstdout: %s\nstderr: %s", err, stdout.String(), stderr.String()) } + 
return stdout.String(), stderr.String() +} - for _, crd := range []string{ - "podcliques.grove.io", - "podcliquescalinggroups.grove.io", - "podcliquesets.grove.io", - "podgangs.scheduler.grove.io", - } { - if !strings.Contains(stdout.String(), crd) { - t.Errorf("expected dynamo-platform explicit CRD list to include %s; stdout: %q stderr: %q", - crd, stdout.String(), stderr.String()) +// singleComponentRecipe builds a RecipeResult with exactly one Helm component. +func singleComponentRecipe(name, namespace, chart, version, source string) *recipe.RecipeResult { + return &recipe.RecipeResult{ + Kind: "RecipeResult", + APIVersion: "aicr.nvidia.com/v1alpha1", + Metadata: struct { + Version string `json:"version,omitempty" yaml:"version,omitempty"` + AppliedOverlays []string `json:"appliedOverlays,omitempty" yaml:"appliedOverlays,omitempty"` + ExcludedOverlays []recipe.ExcludedOverlay `json:"excludedOverlays,omitempty" yaml:"excludedOverlays,omitempty"` + ConstraintWarnings []recipe.ConstraintWarning `json:"constraintWarnings,omitempty" yaml:"constraintWarnings,omitempty"` + }{Version: "v0.1.0"}, + ComponentRefs: []recipe.ComponentRef{ + {Name: name, Namespace: namespace, Chart: chart, Version: version, Source: source}, + }, + DeploymentOrder: []string{name}, + } +} + +func createTestRecipeResult() *recipe.RecipeResult { + return &recipe.RecipeResult{ + Kind: "RecipeResult", + APIVersion: "aicr.nvidia.com/v1alpha1", + Metadata: struct { + Version string `json:"version,omitempty" yaml:"version,omitempty"` + AppliedOverlays []string `json:"appliedOverlays,omitempty" yaml:"appliedOverlays,omitempty"` + ExcludedOverlays []recipe.ExcludedOverlay `json:"excludedOverlays,omitempty" yaml:"excludedOverlays,omitempty"` + ConstraintWarnings []recipe.ConstraintWarning `json:"constraintWarnings,omitempty" yaml:"constraintWarnings,omitempty"` + }{Version: "v0.1.0"}, + Criteria: &recipe.Criteria{ + Service: "eks", + Accelerator: "h100", + Intent: "training", + }, + ComponentRefs: 
[]recipe.ComponentRef{ + { + Name: "cert-manager", + Namespace: "cert-manager", + Chart: "cert-manager", + Version: "v1.17.2", + Source: "https://charts.jetstack.io", + }, + { + Name: "gpu-operator", + Namespace: "gpu-operator", + Chart: "gpu-operator", + Version: "v25.3.3", + Source: "https://helm.ngc.nvidia.com/nvidia", + }, + }, + DeploymentOrder: []string{"cert-manager", "gpu-operator"}, + } +} + +// assertBundleGolden verifies outDir matches the committed bundle at goldenDir. +// With -update, overwrites goldenDir with the tree in outDir. Verifies both +// directions: every golden file exists in outDir, and vice versa. +func assertBundleGolden(t *testing.T, outDir, goldenDir string) { + t.Helper() + actual := listBundleFiles(t, outDir) + + if *update { + // Remove the prior golden tree so stale files don't linger. RemoveAll also + // deletes goldenDir itself; MkdirAll below recreates it as files are written. + if err := os.RemoveAll(goldenDir); err != nil { + t.Fatalf("remove golden dir: %v", err) + } + for _, rel := range actual { + src := filepath.Join(outDir, rel) + dst := filepath.Join(goldenDir, rel) + if err := os.MkdirAll(filepath.Dir(dst), 0o755); err != nil { + t.Fatalf("mkdir %s: %v", filepath.Dir(dst), err) + } + content, err := os.ReadFile(src) + if err != nil { + t.Fatalf("read actual %s: %v", src, err) + } + if err := os.WriteFile(dst, content, 0o644); err != nil { + t.Fatalf("write golden %s: %v", dst, err) + } + } + return + } + + // Compare file lists. + golden := listBundleFiles(t, goldenDir) + if !reflect.DeepEqual(actual, golden) { + t.Fatalf("bundle file tree differs from %s:\n actual: %v\n golden: %v\n(run with -update to regenerate)", + goldenDir, actual, golden) + } + + // Byte-compare each file. 
+ for _, rel := range actual { + got, err := os.ReadFile(filepath.Join(outDir, rel)) + if err != nil { + t.Fatalf("read actual %s: %v", rel, err) + } + want, err := os.ReadFile(filepath.Join(goldenDir, rel)) + if err != nil { + t.Fatalf("read golden %s: %v", rel, err) + } + if !bytes.Equal(got, want) { + t.Errorf("%s differs from golden:\n--- got ---\n%s\n--- want ---\n%s", rel, got, want) + } + } +} + +// listBundleFiles walks dir and returns sorted relative paths of regular files. +func listBundleFiles(t *testing.T, dir string) []string { + t.Helper() + var files []string + err := filepath.Walk(dir, func(path string, info os.FileInfo, walkErr error) error { + if walkErr != nil { + // Return an empty list if the root does not exist yet — in -update + // mode the golden dir may not be present on first run. + if os.IsNotExist(walkErr) && path == dir { + return filepath.SkipDir + } + return walkErr + } + if info.Mode().IsRegular() { + rel, err := filepath.Rel(dir, path) + if err != nil { + return err + } + files = append(files, rel) } + return nil + }) + if err != nil { + t.Fatalf("walk %s: %v", dir, err) } + sort.Strings(files) + return files } + +// Ensure deployer package is referenced so unused-import rules are satisfied. +var _ = deployer.SortComponentRefsByDeploymentOrder diff --git a/pkg/bundler/deployer/helm/templates/README.md.tmpl b/pkg/bundler/deployer/helm/templates/README.md.tmpl index 3c3e874f4..d5ef92b12 100644 --- a/pkg/bundler/deployer/helm/templates/README.md.tmpl +++ b/pkg/bundler/deployer/helm/templates/README.md.tmpl @@ -17,18 +17,14 @@ for GPU-accelerated Kubernetes workloads. ## Components -The following components are included (deployed in order): +The following components are included (deployed in order). 
Each component +lives in a numbered `NNN-/` folder and is installed as a Helm release +via its own `install.sh`: -| Component | Type | Version | Namespace | Source | -|-----------|------|---------|-----------|--------| +| Component | Version | Namespace | Source | +|-----------|---------|-----------|--------| {{ range .Components -}} -{{ if .IsKustomize -}} -| {{ .Name }} | Kustomize | {{ .Tag }} | {{ .Namespace }} | {{ .Repository }} ({{ .Path }}) | -{{ else if .HasChart -}} -| {{ .Name }} | Helm | {{ .Version }} | {{ .Namespace }} | {{ .ChartName }} ({{ .Repository }}) | -{{ else -}} -| {{ .Name }} | Manifests | | {{ .Namespace }} | | -{{ end -}} +| {{ .Name }} | {{ if .Version }}{{ .Version }}{{ else }}N/A{{ end }} | {{ .Namespace }} | {{ if .Repository }}{{ .ChartName }} ({{ .Repository }}){{ else }}local{{ end }} | {{ end }} {{ if .Constraints }} @@ -58,88 +54,51 @@ Use `--no-wait` to skip Helm chart-level waiting where AICR uses `--wait` (keeps ./deploy.sh --no-wait ``` -> **Note:** The deploy script's final status reflects install/apply results. If `--best-effort` was used, one or more components may still have failed; check warning lines and logs. This does **not** mean the cluster is ready for GPU workloads. On fresh GPU nodes, cluster convergence (Nodewright node tuning, GPU operator operand rollout, DRA kubelet plugin registration) continues asynchronously after the script exits. See the [AICR CLI Reference](https://github.com/NVIDIA/aicr/blob/main/docs/user/cli-reference.md#deploy-script-behavior-deploysh) for details. +> **Note:** The deploy script's final status reflects install/apply results. If `--best-effort` was used, one or more components may still have failed; check warning lines and logs. This does **not** guarantee the cluster is ready to schedule workloads — operator-driven cluster convergence (CRD reconciliation, node tuning, plugin registration, etc.) continues asynchronously after the script exits, in operator-specific ways. 
See the [AICR CLI Reference](https://github.com/NVIDIA/aicr/blob/main/docs/user/cli-reference.md#deploy-script-behavior-deploysh) for details. ## Manual Installation -Install components individually in order: +Each component folder contains an `install.sh` that runs `helm upgrade --install` +with the right arguments baked in. To install a single component manually: -{{ range .Components }} -### {{ .Name }} - -{{ if .IsKustomize -}} -```bash -kubectl create namespace {{ .Namespace }} --dry-run=client -o yaml | kubectl apply -f - -kustomize build '{{ .Repository }}//{{ .Path }}{{ if .Tag }}?ref={{ .Tag }}{{ end }}' \ - | kubectl apply -n {{ .Namespace }} -f - -``` -{{ else if .HasChart -}} -```bash -{{ if .IsOCI -}} -helm upgrade --install {{ .Name }} {{ .Repository }}/{{ .ChartName }} \ - --version {{ .Version }} \ - -n {{ .Namespace }} --create-namespace \ - -f {{ .Name }}/values.yaml \ - -f {{ .Name }}/cluster-values.yaml \ - --wait --timeout 10m -{{ else -}} -helm upgrade --install {{ .Name }} {{ .ChartName }} \ - --repo {{ .Repository }} \ - --version {{ .Version }} \ - -n {{ .Namespace }} --create-namespace \ - -f {{ .Name }}/values.yaml \ - -f {{ .Name }}/cluster-values.yaml \ - --wait --timeout 10m -{{ end -}} -``` -{{ end -}} -{{ if .HasManifests }} ```bash -kubectl apply -f {{ .Name }}/manifests/ +cd NNN- +bash install.sh ``` -{{ end }} -{{ end }} ## Customization -Each Helm component has two values files in its directory: - -- `values.yaml` — resolved configuration from the recipe. Edit to override defaults: - - ```bash - vim gpu-operator/values.yaml - ``` +Each component folder has its own `values.yaml` (static) and `cluster-values.yaml` +(dynamic, per-cluster). Edit either before deploying: -- `cluster-values.yaml` — install-time parameters. Any paths declared with - `aicr bundle --dynamic :` are pulled out of `values.yaml` - and placed here for you to fill in. 
The file is always created (empty if - no dynamic paths were declared) and passed to `helm upgrade --install` - alongside `values.yaml` by both `deploy.sh` and the per-component commands - in the "Manual Installation" section above. +```bash +vim NNN-/values.yaml +vim NNN-/cluster-values.yaml +``` ## Upgrade -To upgrade a specific Helm component: +Re-run the per-component install.sh to upgrade an already-installed release: ```bash -helm upgrade --version -n -f /values.yaml -f /cluster-values.yaml --wait --timeout 10m +cd NNN- +bash install.sh ``` ## Uninstall To remove components (reverse order): -{{ range .ComponentsReversed -}} -{{ if .IsKustomize -}} ```bash -kustomize build '{{ .Repository }}//{{ .Path }}{{ if .Tag }}?ref={{ .Tag }}{{ end }}' \ - | kubectl delete -n {{ .Namespace }} --ignore-not-found -f - +./undeploy.sh ``` -{{ else -}} + +Or remove a single release manually: + +{{ range .ComponentsReversed -}} ```bash helm uninstall {{ .Name }} -n {{ .Namespace }} ``` -{{ end -}} {{ end }} ## Troubleshooting @@ -147,22 +106,36 @@ helm uninstall {{ .Name }} -n {{ .Namespace }} ### Check deployment status ```bash -kubectl get pods -A | grep -E 'gpu-operator|network-operator|cert-manager' +kubectl get pods -A | grep -E '{{ range $i, $c := .Components }}{{ if $i }}|{{ end }}{{ $c.Name }}{{ end }}' ``` ### View component logs +Inspect a single component's pods (replace `` and `` +with one of the entries from the table above): + ```bash -kubectl logs -n gpu-operator -l app=gpu-operator +kubectl logs -n -l app.kubernetes.io/instance= ``` - +{{ $hasGPU := false }} +{{- range .Components -}}{{- if eq .Name "gpu-operator" -}}{{- $hasGPU = true -}}{{- end -}}{{- end -}} +{{- if $hasGPU }} ### Verify GPU access ```bash kubectl get nodes -o jsonpath='{.items[*].status.allocatable}' | jq '.["nvidia.com/gpu"]' ``` +{{ end }} ## References +- [AICR CLI Reference](https://github.com/NVIDIA/aicr/blob/main/docs/user/cli-reference.md) +{{- $hasGPUOp := false }} +{{- $hasNetOp := 
false }} +{{- range .Components }}{{ if eq .Name "gpu-operator" }}{{ $hasGPUOp = true }}{{ end }}{{ if eq .Name "network-operator" }}{{ $hasNetOp = true }}{{ end }}{{ end }} +{{- if $hasGPUOp }} - [GPU Operator Documentation](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/) +{{- end }} +{{- if $hasNetOp }} - [Network Operator Documentation](https://docs.nvidia.com/networking/display/cokan10/network+operator) +{{- end }} diff --git a/pkg/bundler/deployer/helm/templates/component-README.md.tmpl b/pkg/bundler/deployer/helm/templates/component-README.md.tmpl deleted file mode 100644 index 068bfcd28..000000000 --- a/pkg/bundler/deployer/helm/templates/component-README.md.tmpl +++ /dev/null @@ -1,94 +0,0 @@ -# {{ .Name }} - -{{ if .IsKustomize -}} -Source: {{ .Repository }} -Path: {{ .Path }} -{{ if .Tag -}}Tag: {{ .Tag }}{{ end }} -Namespace: {{ .Namespace }} - -## Install - -```bash -kubectl create namespace {{ .Namespace }} --dry-run=client -o yaml | kubectl apply -f - -kustomize build '{{ .Repository }}//{{ .Path }}{{ if .Tag }}?ref={{ .Tag }}{{ end }}' \ - | kubectl apply -n {{ .Namespace }} -f - -``` -{{ if .HasManifests }} -After the kustomization is applied, apply additional manifests: - -```bash -kubectl apply -f manifests/ -``` -{{ end }} -## Upgrade - -```bash -kustomize build '{{ .Repository }}//{{ .Path }}{{ if .Tag }}?ref={{ .Tag }}{{ end }}' \ - | kubectl apply -n {{ .Namespace }} -f - -``` - -## Uninstall - -```bash -kustomize build '{{ .Repository }}//{{ .Path }}{{ if .Tag }}?ref={{ .Tag }}{{ end }}' \ - | kubectl delete -n {{ .Namespace }} --ignore-not-found -f - -``` -{{ else -}} -Chart: {{ .Repository }}/{{ .ChartName }} -Version: {{ .Version }} -Namespace: {{ .Namespace }} - -## Install - -```bash -{{ if .IsOCI -}} -helm upgrade --install {{ .Name }} {{ .Repository }}/{{ .ChartName }} \ - --version {{ .Version }} \ - -n {{ .Namespace }} --create-namespace \ - -f values.yaml \ - -f cluster-values.yaml \ - --wait --timeout 10m -{{ 
else -}} -helm upgrade --install {{ .Name }} {{ .ChartName }} \ - --repo {{ .Repository }} \ - --version {{ .Version }} \ - -n {{ .Namespace }} --create-namespace \ - -f values.yaml \ - -f cluster-values.yaml \ - --wait --timeout 10m -{{ end -}} -``` -{{ if .HasManifests }} -After the chart is installed, apply additional manifests: - -```bash -kubectl apply -f manifests/ -``` -{{ end }} -## Upgrade - -```bash -{{ if .IsOCI -}} -helm upgrade {{ .Name }} {{ .Repository }}/{{ .ChartName }} \ - --version {{ .Version }} \ - -n {{ .Namespace }} \ - -f values.yaml \ - -f cluster-values.yaml \ - --wait --timeout 10m -{{ else -}} -helm upgrade {{ .Name }} {{ .ChartName }} \ - --repo {{ .Repository }} \ - --version {{ .Version }} \ - -n {{ .Namespace }} \ - -f values.yaml \ - -f cluster-values.yaml \ - --wait --timeout 10m -{{ end -}} -``` - -## Uninstall - -```bash -helm uninstall {{ .Name }} -n {{ .Namespace }} -``` -{{ end -}} diff --git a/pkg/bundler/deployer/helm/templates/deploy.sh.tmpl b/pkg/bundler/deployer/helm/templates/deploy.sh.tmpl index 0f83eb71c..8a6a084a8 100644 --- a/pkg/bundler/deployer/helm/templates/deploy.sh.tmpl +++ b/pkg/bundler/deployer/helm/templates/deploy.sh.tmpl @@ -1,4 +1,18 @@ #!/usr/bin/env bash +# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ set -euo pipefail # Cloud Native Stack Deployment Script @@ -9,19 +23,18 @@ set -euo pipefail # --best-effort Continue past individual component failures (log warnings) # --retries N Retry failed helm/kubectl operations N times with backoff (default: 5, 0 = fail-fast) # -# This script is optional — each component subdirectory has a README.md with -# manual install commands. For detailed behavior docs (CRD ordering, async -# components, error handling), see the AICR CLI Reference: +# This script is optional — each component subdirectory has its own install.sh +# with a single baked-in `helm upgrade --install` command. For detailed behavior +# docs (CRD ordering, async components, error handling), see the AICR CLI Reference: # https://github.com/NVIDIA/aicr/blob/main/docs/user/cli-reference.md#deploy-script-behavior-deploysh SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" # Run helm commands from a temp directory to prevent local chart directories -# (e.g., bundle/nodewright-operator/) from shadowing remote chart references. +# (e.g., bundle/001-nodewright-operator/) from shadowing remote chart references. HELM_WORKDIR="$(mktemp -d)" trap 'rm -rf "${HELM_WORKDIR}"; exit 130' INT TERM trap 'rm -rf "${HELM_WORKDIR}"' EXIT -cd "${HELM_WORKDIR}" HELM_TIMEOUT="10m" NO_WAIT=false @@ -41,6 +54,12 @@ while [[ $# -gt 0 ]]; do esac done +# Export env vars consumed by each folder's install.sh (rendered by localformat). +# DRY_RUN_FLAG / KUBECONFIG_FLAG / HELM_DEBUG_FLAG default to empty strings. 
+export DRY_RUN_FLAG="${DRY_RUN_FLAG:-}" +export KUBECONFIG_FLAG="${KUBECONFIG_FLAG:-}" +export HELM_DEBUG_FLAG="${HELM_DEBUG_FLAG:-}" + function helm_failed() { if [[ "${BEST_EFFORT}" == "true" ]]; then echo "WARNING: $1 install failed, continuing (--best-effort)" @@ -59,25 +78,6 @@ function backoff_seconds() { echo "${seconds}" } -function retry() { - local desc="$1"; shift - local attempt=0 - while true; do - if "$@"; then - return 0 - fi - attempt=$((attempt + 1)) - if [[ ${attempt} -gt ${MAX_RETRIES} ]]; then - echo "ERROR: ${desc} failed after ${attempt} attempts" - return 1 - fi - local wait_secs - wait_secs=$(backoff_seconds "${attempt}") - echo "RETRY: ${desc} failed (attempt ${attempt}/${MAX_RETRIES}), retrying in ${wait_secs}s..." - sleep "${wait_secs}" - done -} - # Clean up stale Helm hook Jobs before retrying. When a hook Job (e.g., # crd-upgrader) times out or fails, it stays in the namespace and blocks # subsequent install attempts with "Job not ready" errors. @@ -144,84 +144,6 @@ function dump_kai_scheduler_helm_diagnostics() { echo " --- End ${namespace} diagnostics ---" } -# helm_retry contract: -# helm_retry "" "" "" [args...] -# Callers must pass the retry budget as the third positional argument before the -# command to execute. This keeps per-component retry tuning explicit at the -# callsite instead of relying on the global MAX_RETRIES fallback. -function helm_retry() { - local desc="$1" - local namespace="$2" - local max_retries="$3" - shift 3 - local attempt=0 - while true; do - if "$@"; then - return 0 - fi - attempt=$((attempt + 1)) - dump_kai_scheduler_helm_diagnostics "${namespace}" - if [[ ${attempt} -gt ${max_retries} ]]; then - echo "ERROR: ${desc} failed after ${attempt} attempts" - return 1 - fi - cleanup_helm_hooks "${namespace}" - local wait_secs - wait_secs=$(backoff_seconds "${attempt}") - echo "RETRY: ${desc} failed (attempt ${attempt}/${max_retries}), retrying in ${wait_secs}s..." 
- sleep "${wait_secs}" - done -} - -# kubectl apply that tolerates "no matches for kind" errors. On first deploy, -# pre-install manifests may reference CRDs not yet registered — this is expected -# because post-install re-applies them after helm installs the CRDs. - -# Patterns that indicate ignorable CRD-race conditions -CRD_RACE_PATTERNS=( - "no matches for kind" - "ensure CRDs are installed first" -) - -# Patterns that indicate real errors (auth, webhook, timeout, etc.) -REAL_ERROR_PATTERNS=( - "^error:" - "^Error from server" - "forbidden" - "denied" - "timed out" - "unable to" - "failed to" - "invalid" -) - -function apply_ignoring_crd_race() { - local manifest_path="$1" - local output - if output=$(kubectl apply -f "${manifest_path}" 2>&1); then - echo "${output}" - return 0 - fi - # Build combined patterns from arrays - local crd_filter real_filter - crd_filter=$(IFS='|'; echo "${CRD_RACE_PATTERNS[*]}") - real_filter=$(IFS='|'; echo "${REAL_ERROR_PATTERNS[*]}") - # Strip CRD-race lines, then check if real errors remain - if echo "${output}" | grep -Eiv "${crd_filter}" | grep -Eiq "${real_filter}"; then - echo "${output}" >&2 - return 1 - fi - # Fail safe: if error output contains unrecognized lines, don't silently swallow - local remaining - remaining=$(echo "${output}" | grep -Eiv "${crd_filter}" | grep -v '^$' || true) - if [[ -n "${remaining}" ]]; then - echo "${output}" >&2 - return 1 - fi - echo " (skipped CRD-dependent resources — will re-apply after helm install)" - return 0 -} - # Components that use operator patterns with custom resources that reconcile # asynchronously. Helm --wait may time out waiting for CR readiness even though # all pods start successfully. These components are installed without --wait. @@ -323,13 +245,17 @@ if [[ "${nodewright_available}" == "0" || -z "${nodewright_available}" ]]; then # like "custom.io/gate=true:NoSchedule" — extract the key before the first "=". 
# If runtimeRequiredTaint is not set, use the default skyhook.nvidia.com. NODEWRIGHT_TAINT_KEY="skyhook.nvidia.com" - custom_taint_line=$(grep 'runtimeRequiredTaint:' "${SCRIPT_DIR}/nodewright-operator/values.yaml" 2>/dev/null || true) - if [[ -n "${custom_taint_line}" ]]; then - taint_value=$(echo "${custom_taint_line}" | sed 's/.*runtimeRequiredTaint:[[:space:]]*//' | tr -d '"' | tr -d "'") - if [[ -n "${taint_value}" ]]; then - # Handle both key=value:effect and key:effect formats - NODEWRIGHT_TAINT_KEY="${taint_value%%=*}" - NODEWRIGHT_TAINT_KEY="${NODEWRIGHT_TAINT_KEY%%:*}" + # Locate nodewright-operator's NNN-prefixed directory at runtime; `|| true` tolerates a missing directory under `set -euo pipefail`. + nodewright_dir="$(ls -d "${SCRIPT_DIR}"/[0-9][0-9][0-9]-nodewright-operator 2>/dev/null | head -1 || true)" + if [[ -n "${nodewright_dir}" && -f "${nodewright_dir}/values.yaml" ]]; then + custom_taint_line=$(grep 'runtimeRequiredTaint:' "${nodewright_dir}/values.yaml" 2>/dev/null || true) + if [[ -n "${custom_taint_line}" ]]; then + taint_value=$(echo "${custom_taint_line}" | sed 's/.*runtimeRequiredTaint:[[:space:]]*//' | tr -d '"' | tr -d "'") + if [[ -n "${taint_value}" ]]; then + # Handle both key=value:effect and key:effect formats + NODEWRIGHT_TAINT_KEY="${taint_value%%=*}" + NODEWRIGHT_TAINT_KEY="${NODEWRIGHT_TAINT_KEY%%:*}" + fi fi fi stale_nodewright=$(kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{" "}{range .spec.taints[*]}{.key}{" "}{end}{"\n"}{end}' 2>/dev/null | grep "${NODEWRIGHT_TAINT_KEY}" | awk '{print $1}' || true) @@ -354,98 +280,96 @@ fi echo "Pre-flight checks passed." echo "Deploying Cloud Native Stack components..." -# Install components in order -{{ range .Components -}} -{{ if .IsKustomize -}} -echo "Installing {{ .Name }} ({{ .Namespace }}) via kustomize..." 
-kubectl create namespace {{ .Namespace }} --dry-run=client -o yaml | kubectl apply -f - || helm_failed "{{ .Name }}" -retry "{{ .Name }} kustomize apply" bash -c \ - 'kustomize build '\''{{ .Repository }}//{{ .Path }}{{ if .Tag }}?ref={{ .Tag }}{{ end }}'\'' | kubectl apply -n {{ .Namespace }} -f -' \ - || helm_failed "{{ .Name }}" -{{ else if .HasChart -}} -echo "Installing {{ .Name }} ({{ .Namespace }})..." -{{ if .HasManifests -}} -echo "Applying pre-install manifests for {{ .Name }}..." -kubectl create namespace {{ .Namespace }} --dry-run=client -o yaml | kubectl apply -f - || helm_failed "{{ .Name }}" -retry "{{ .Name }} pre-install manifests" apply_ignoring_crd_race "${SCRIPT_DIR}/{{ .Name }}/manifests/" \ - || helm_failed "{{ .Name }}" -{{ end -}} -# Per-component timeout override. Most components use HELM_TIMEOUT (10m). -# Components with slow hooks (e.g., kai-scheduler crd-upgrader image pull -# on cold runners) get a longer timeout to avoid unnecessary retry cycles. -COMPONENT_HELM_TIMEOUT="${HELM_TIMEOUT}" -COMPONENT_MAX_RETRIES="${MAX_RETRIES}" -{{ if eq .Name "kai-scheduler" -}} -COMPONENT_HELM_TIMEOUT="20m" -COMPONENT_MAX_RETRIES="1" -{{ end -}} -# Derive wait args: global --wait/--no-wait behavior + component timeout. -if [[ "${NO_WAIT}" == "true" ]]; then - COMPONENT_WAIT_ARGS="--timeout ${COMPONENT_HELM_TIMEOUT}" -else - COMPONENT_WAIT_ARGS="--wait --timeout ${COMPONENT_HELM_TIMEOUT}" -fi -if echo "${ASYNC_COMPONENTS}" | grep -qw "{{ .Name }}"; then - # Skip --wait (no readiness check) but keep --timeout for hook completion. 
- COMPONENT_WAIT_ARGS="--timeout ${COMPONENT_HELM_TIMEOUT}" - echo " (async component — skipping --wait, keeping --timeout for hooks)" -fi -{{ if .IsOCI -}} -helm_retry "{{ .Name }} helm install" "{{ .Namespace }}" \ - "${COMPONENT_MAX_RETRIES}" \ - helm upgrade --install {{ .Name }} {{ .Repository }}/{{ .ChartName }} \ - {{ if .Version }}--version {{ .Version }} \ - {{ end -}} - -n {{ .Namespace }} --create-namespace \ - -f "${SCRIPT_DIR}/{{ .Name }}/values.yaml" \ - -f "${SCRIPT_DIR}/{{ .Name }}/cluster-values.yaml" \ - ${COMPONENT_WAIT_ARGS} \ - || helm_failed "{{ .Name }}" -{{ else -}} -helm_retry "{{ .Name }} helm install" "{{ .Namespace }}" \ - "${COMPONENT_MAX_RETRIES}" \ - helm upgrade --install {{ .Name }} {{ .ChartName }} \ - --repo {{ .Repository }} \ - {{ if .Version }}--version {{ .Version }} \ - {{ end -}} - -n {{ .Namespace }} --create-namespace \ - -f "${SCRIPT_DIR}/{{ .Name }}/values.yaml" \ - -f "${SCRIPT_DIR}/{{ .Name }}/cluster-values.yaml" \ - ${COMPONENT_WAIT_ARGS} \ - || helm_failed "{{ .Name }}" -{{ end -}} -{{ if .HasManifests -}} -echo "Applying post-install manifests for {{ .Name }}..." -retry "{{ .Name }} post-install manifests" kubectl apply -f "${SCRIPT_DIR}/{{ .Name }}/manifests/" \ - || helm_failed "{{ .Name }}" -{{ end -}} -{{ if eq .Name "nvidia-dra-driver-gpu" -}} -# Best-effort mitigation for kubelet DRA plugin registration drift. -# After uninstall/reinstall, kubelet's fsnotify watcher may not detect new -# registration sockets. Restarting the plugin DS forces fresh socket creation. -# This does NOT fix cases where kubelet itself has lost registration state — -# a node reboot is required for that. See docs/user/cli-reference.md. -DRA_DS=$(kubectl get daemonset -n {{ .Namespace }} -o name 2>/dev/null | awk '/kubelet-plugin/{print; exit}' || true) -if [[ -n "${DRA_DS}" ]]; then - echo " Restarting DRA kubelet plugin (${DRA_DS##*/}) to ensure registration..." - if ! 
kubectl rollout restart "${DRA_DS}" -n {{ .Namespace }}; then - echo " WARNING: failed to restart DRA kubelet plugin daemonset" - elif ! kubectl rollout status "${DRA_DS}" -n {{ .Namespace }} --timeout=120s; then - # Always wait for the DRA plugin rollout regardless of --no-wait. - # The restart+wait is a correctness gate (kubelet must re-register the - # DRA plugin socket), not a readiness convenience like --wait. - echo " WARNING: DRA kubelet plugin rollout did not complete within 120s" +# ============================================================================== +# Install loop +# ============================================================================== +# Generic install loop. Each folder's install.sh is rendered by localformat +# with the right helm command baked in — deploy.sh has no per-component +# knowledge here. Per-component special-case logic (async wait, DRA plugin +# restart) runs in the post-install blocks below, matched by component name. +cd "${HELM_WORKDIR}" +for dir in "${SCRIPT_DIR}"/[0-9][0-9][0-9]-*/; do + [[ -d "${dir}" ]] || continue + dir="${dir%/}" + base="${dir##*/}" + name="${base#[0-9][0-9][0-9]-}" + # Source the namespace from the folder's install.sh — not the folder + # basename — because the helm release name and its target namespace can + # differ (e.g. nodewright-operator → namespace skyhook; gpu-operator-post → + # namespace gpu-operator). cleanup_helm_hooks and the kai diagnostics + # both operate on the namespace. + namespace=$(awk '{ for (i=1;i/dev/null | awk '/kubelet-plugin/{print; exit}' || true) + if [[ -n "${DRA_DS}" ]]; then + echo " Restarting DRA kubelet plugin (${DRA_DS##*/}) to ensure registration..." + if ! kubectl rollout restart "${DRA_DS}" -n {{ .Namespace }}; then + echo " WARNING: failed to restart DRA kubelet plugin daemonset" + elif ! kubectl rollout status "${DRA_DS}" -n {{ .Namespace }} --timeout=120s; then + # Always wait for the DRA plugin rollout regardless of --no-wait. 
+ # The restart+wait is a correctness gate (kubelet must re-register the + # DRA plugin socket), not a readiness convenience like --wait. + echo " WARNING: DRA kubelet plugin rollout did not complete within 120s" + fi + else + echo " WARNING: no DRA kubelet plugin daemonset found in {{ .Namespace }}" + fi + fi + {{- end }} + {{- end }} +done + if [[ -n "${FAILED_COMPONENTS}" ]]; then echo "WARNING: the following components failed:${FAILED_COMPONENTS}" echo "Deployment completed with non-fatal errors (--best-effort)." diff --git a/pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl b/pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl index 931e4ab33..75ba2ee44 100644 --- a/pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl +++ b/pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl @@ -1,4 +1,18 @@ #!/usr/bin/env bash +# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + set -euo pipefail # Cloud Native Stack Undeployment Script @@ -43,22 +57,15 @@ fi echo "Undeploying Cloud Native Stack components (timeout: ${HELM_TIMEOUT}s)..." 
echo "" -# Color codes for component type tags (disabled when stdout is not a terminal) -C_HELM="" C_KUST="" C_MANI="" C_RST="" -if [[ -t 1 ]]; then - C_HELM="\033[36m" C_KUST="\033[33m" C_MANI="\033[35m" C_RST="\033[0m" -fi - -echo " The following components will be removed (in order):" -{{ range .ComponentsReversed -}} -{{ if .IsKustomize -}} -printf " ${C_KUST}%-12s${C_RST} %s (%s)\n" "[kustomize]" "{{ .Name }}" "{{ .Namespace }}" -{{ else if .HasChart -}} -printf " ${C_HELM}%-12s${C_RST} %s (%s)\n" "[helm]" "{{ .Name }}" "{{ .Namespace }}" -{{ else if .HasManifests -}} -printf " ${C_MANI}%-12s${C_RST} %s\n" "[manifests]" "{{ .Name }}" -{{ end -}} -{{ end }} +echo " The following components will be removed (in reverse install order):" +# List NNN-* folders in reverse numeric order +for dir in $(ls -1d "${SCRIPT_DIR}"/[0-9][0-9][0-9]-*/ 2>/dev/null | sort -r); do + [[ -d "${dir}" ]] || continue + base="${dir%/}" + base="${base##*/}" + name="${base#[0-9][0-9][0-9]-}" + printf " %-40s\n" "${name}" +done echo "" # System namespaces that must not be deleted @@ -238,10 +245,7 @@ extra_crds_for_release() { } # Skip pre-flight for releases whose bundle-managed custom resources are -# deleted from manifests before the controller is uninstalled. Those CRs may -# carry operator finalizers safely because undeploy removes them while the -# controller is still running, so scanning them here would false-positive the -# happy path. +# deleted from manifests before the controller is uninstalled. skip_preflight_for_release() { case "$1" in nodewright-operator|kgateway) return 0 ;; @@ -250,9 +254,6 @@ skip_preflight_for_release() { } # Run `kubectl ... -o json` while keeping stdout parseable for jq. -# On failure, stores kubectl stderr in the destination variable and returns non-zero. -# On success, replays any stderr warnings to the terminal without contaminating JSON. -# Args: $1 = destination variable name, $2.. 
= kubectl args capture_kubectl_json() { local out_var="$1" shift @@ -278,18 +279,12 @@ capture_kubectl_json() { return 0 } -# Check a single CRD for custom resource instances with active finalizers. -# Appends details and remediation commands to the preflight temp files. -# Args: $1 = CRD name, $2 = component name (for display) check_crd_for_stuck_resources() { local crd_name="$1" local component="$2" local crd_json plural group scope resource stuck stuck_json kubectl_err local item_name item_namespace item_finalizers - # Fail-closed on kubectl errors so a transient API/auth hiccup cannot let - # pre-flight silently pass. capture_kubectl_json keeps stdout parseable - # even when kubectl prints warnings on stderr during a successful call. if ! capture_kubectl_json crd_json get crd "${crd_name}" -o json; then kubectl_err="${crd_json}" echo "" >&2 @@ -304,8 +299,6 @@ check_crd_for_stuck_resources() { [[ -z "${plural}" || "${plural}" == "null" ]] && return 0 resource="${plural}.${group}" - # Filter out kubernetes.io/* finalizers — those are handled by K8s controllers - # and will be processed regardless of whether the operator is running. local jq_filter='[.metadata.finalizers // [] | .[] | select(startswith("kubernetes.io/") | not)]' stuck="" if [[ "${scope}" == "Namespaced" ]]; then @@ -320,7 +313,7 @@ check_crd_for_stuck_resources() { while IFS=$'\x1f' read -r item_namespace item_name item_finalizers; do stuck="${stuck} ${item_namespace}/${item_name} finalizers=[${item_finalizers}]"$'\n' done < <(echo "${stuck_json}" \ - | jq -r '.items[] | ('"${jq_filter}"') as $f | select(($f | length) > 0) | [(.metadata.namespace // ""), .metadata.name, ($f | join(","))] | join("\u001f")' 2>/dev/null || true) + | jq -r '.items[] | ('"${jq_filter}"') as $f | select(($f | length) > 0) | [(.metadata.namespace // ""), .metadata.name, ($f | join(","))] | join("")' 2>/dev/null || true) else if ! 
capture_kubectl_json stuck_json get "${resource}" -o json; then echo "" >&2 @@ -333,7 +326,7 @@ check_crd_for_stuck_resources() { while IFS=$'\x1f' read -r item_name item_finalizers; do stuck="${stuck} ${item_name} finalizers=[${item_finalizers}]"$'\n' done < <(echo "${stuck_json}" \ - | jq -r '.items[] | ('"${jq_filter}"') as $f | select(($f | length) > 0) | [.metadata.name, ($f | join(","))] | join("\u001f")' 2>/dev/null || true) + | jq -r '.items[] | ('"${jq_filter}"') as $f | select(($f | length) > 0) | [.metadata.name, ($f | join(","))] | join("")' 2>/dev/null || true) fi if [[ -n "${stuck}" ]]; then @@ -344,24 +337,11 @@ check_crd_for_stuck_resources() { fi } -# Check a Helm release for CRDs whose custom resources have active finalizers. -# Discovery intentionally stays narrow: -# 1. The chart's templates/ section, via `helm get manifest` -# 2. CRDs annotated with this Helm release -# 3. A tiny exact-name override list for known crds/-installed operator CRDs -# This keeps the guard understandable and avoids broad cluster-wide ownership -# inference while still catching the high-risk operator CRs that motivated the -# pre-flight in the first place. -# Args: $1 = release name, $2 = namespace check_release_for_stuck_crds() { local release="$1" local ns="$2" local manifest manifest_crds annotated_crds explicit_crds local all_crds_json kubectl_err crd_name - # `helm get manifest` is best-effort: if it fails (release missing, helm - # transient error), keep going and let the annotation- and exact-CRD - # discovery still run. Returning early here would re-introduce a false- - # negative path where a flaky helm hides CRDs that are still present. 
manifest=$(helm get manifest "${release}" -n "${ns}" 2>/dev/null || true) manifest_crds=$(echo "${manifest}" \ | awk '/^kind:/{kind=$2} /^ name:/ && kind=="CustomResourceDefinition"{print $2; kind=""}') @@ -401,14 +381,7 @@ check_release_for_stuck_crds() { ) | .metadata.name' 2>/dev/null || true)"$'\n' done < <(extra_crds_for_release "${release}") - # All three sources empty → release isn't installed and no leftover CRDs. - # (Distinct from the API-error path above: here kubectl succeeded and we - # genuinely observed no relevant CRDs.) [[ -z "${manifest_crds}" && -z "${annotated_crds}" && -z "${explicit_crds}" ]] && return 0 - # awk 'NF' drops empty lines without the exit-1-on-no-match behavior of - # `grep -v '^$'`, which would abort under `set -euo pipefail` when a release - # has no CRDs (e.g., chart ships none, installed with --skip-crds, already - # cleaned up). printf '%s\n%s\n%s\n' "${manifest_crds}" "${annotated_crds}" "${explicit_crds}" \ | awk 'NF' \ | sort -u \ @@ -443,14 +416,12 @@ else exit 1 fi {{ range .ComponentsReversed -}} - {{ if .HasChart -}} if skip_preflight_for_release "{{ .Name }}"; then echo " Skipping {{ .Name }} ({{ .Namespace }}): bundle deletes dependent manifests before controller uninstall." else echo " Checking {{ .Name }} ({{ .Namespace }})..." check_release_for_stuck_crds "{{ .Name }}" "{{ .Namespace }}" fi - {{ end -}} {{ end }} if [[ -s "${PREFLIGHT_DETAILS}" ]]; then echo "" @@ -472,40 +443,68 @@ else echo "Pre-flight checks passed." fi -# Uninstall components in reverse order -{{ range .ComponentsReversed -}} -{{ if .HasManifests -}} -echo "Deleting manifests for {{ .Name }}..." -kubectl delete -f "${SCRIPT_DIR}/{{ .Name }}/manifests/" --ignore-not-found || true -{{ end -}} -{{ if .IsKustomize -}} -echo "Uninstalling {{ .Name }} ({{ .Namespace }}) via kustomize..." 
-kustomize build '{{ .Repository }}//{{ .Path }}{{ if .Tag }}?ref={{ .Tag }}{{ end }}' \ - | kubectl delete -n {{ .Namespace }} --ignore-not-found -f - || true -{{ else if .HasChart -}} -echo "Uninstalling {{ .Name }} ({{ .Namespace }})..." -helm_force_uninstall "{{ .Name }}" "{{ .Namespace }}" -delete_release_cluster_resources "{{ .Name }}" "{{ .Namespace }}" -delete_orphaned_webhooks_for_ns "{{ .Namespace }}" -{{ end -}} -{{ end }} +# ============================================================================== +# Uninstall components in reverse install order +# ============================================================================== +# Generic reverse loop: every folder is a Helm release (local-helm or upstream-helm). +# `helm uninstall ` works uniformly for both kinds — that's one of the +# benefits of the uniform local-chart bundle format. +for dir in $(ls -1d "${SCRIPT_DIR}"/[0-9][0-9][0-9]-*/ 2>/dev/null | sort -r); do + [[ -d "${dir}" ]] || continue + base="${dir%/}" + base="${base##*/}" + name="${base#[0-9][0-9][0-9]-}" + # Derive the namespace from the component name via the per-component blocks + # below. We still need it for helm_force_uninstall and cluster-resource cleanup. + ns="" + {{- range .ComponentsReversed }} + if [[ "${name}" == "{{ .Name }}" ]]; then ns="{{ .Namespace }}"; fi + {{- end }} + # Injected mixed-component "-post" folders share their parent's namespace. + {{- range .ComponentsReversed }} + if [[ "${name}" == "{{ .Name }}-post" ]]; then ns="{{ .Namespace }}"; fi + {{- end }} + if [[ -z "${ns}" ]]; then + echo "Warning: no namespace known for ${name}; skipping uninstall" >&2 + continue + fi + echo "Uninstalling ${name} (${ns})..." + # For local-helm folders (Chart.yaml + templates/), kubectl-delete the + # rendered templates BEFORE helm uninstall. 
The templates may carry + # helm.sh/hook annotations (post-install/post-upgrade) which helm does + # not track or clean up on uninstall — so without this pre-delete, the + # operator gets removed but its hook-created CRs (and their finalizers) + # linger. Doing the delete while the controller is still running lets + # finalizers clear naturally. + if [[ -d "${dir}/templates" ]]; then + for tpl in "${dir}/templates/"*.yaml; do + [[ -f "${tpl}" ]] || continue + kubectl delete -n "${ns}" -f "${tpl}" --ignore-not-found --timeout="${HELM_TIMEOUT}s" || true + done + fi + helm_force_uninstall "${name}" "${ns}" + delete_release_cluster_resources "${name}" "${ns}" + delete_orphaned_webhooks_for_ns "${ns}" +done + # Remove nodewright node taints that persist after operator removal. # Nodewright taints nodes during kernel tuning. The taint key is configurable # via runtimeRequiredTaint (defaults to skyhook.nvidia.com). {{- range .ComponentsReversed }} {{- if eq .Name "nodewright-operator" }} -# Extract the taint key from bundle values. The YAML value is a taint string -# like "custom.io/gate=true:NoSchedule" — extract the key before the first "=". -# If runtimeRequiredTaint is not set, use the default skyhook.nvidia.com. NODEWRIGHT_TAINT_KEY="skyhook.nvidia.com" -custom_taint_line=$(grep 'runtimeRequiredTaint:' "${SCRIPT_DIR}/nodewright-operator/values.yaml" 2>/dev/null || true) -if [[ -n "${custom_taint_line}" ]]; then - # Extract value after "runtimeRequiredTaint:", then extract taint key before "=" - taint_value=$(echo "${custom_taint_line}" | sed 's/.*runtimeRequiredTaint:[[:space:]]*//' | tr -d '"' | tr -d "'") - if [[ -n "${taint_value}" ]]; then - # Handle both key=value:effect and key:effect formats - NODEWRIGHT_TAINT_KEY="${taint_value%%=*}" - NODEWRIGHT_TAINT_KEY="${NODEWRIGHT_TAINT_KEY%%:*}" +# Locate nodewright-operator's NNN-prefixed directory at runtime. 
+nodewright_dir="$(ls -d "${SCRIPT_DIR}"/[0-9][0-9][0-9]-nodewright-operator 2>/dev/null | head -1)" +if [[ -n "${nodewright_dir}" && -f "${nodewright_dir}/values.yaml" ]]; then + custom_taint_line=$(grep 'runtimeRequiredTaint:' "${nodewright_dir}/values.yaml" 2>/dev/null || true) + if [[ -n "${custom_taint_line}" ]]; then + # Extract value after "runtimeRequiredTaint:", then extract taint key before "=" + taint_value=$(echo "${custom_taint_line}" | sed 's/.*runtimeRequiredTaint:[[:space:]]*//' | tr -d '"' | tr -d "'") + if [[ -n "${taint_value}" ]]; then + # Handle both key=value:effect and key:effect formats + NODEWRIGHT_TAINT_KEY="${taint_value%%=*}" + NODEWRIGHT_TAINT_KEY="${NODEWRIGHT_TAINT_KEY%%:*}" + fi fi fi echo "Removing ${NODEWRIGHT_TAINT_KEY} node taints..." @@ -516,26 +515,17 @@ done {{- end }} # Clean up orphaned CRDs that were owned by this bundle's releases. -# Only delete CRDs whose Helm release annotation matches a component we just uninstalled, -# or that belong to operator-created groups from components in this bundle. -# This avoids deleting CRDs owned by other releases on shared clusters. -# Delete CRDs with resource-policy: keep that were owned by this bundle's releases. -# Match both release-name AND release-namespace to avoid deleting CRDs owned by -# another release with the same name in a different namespace. -{{- range .ComponentsReversed }}{{ if .HasChart }} +# Only delete CRDs whose Helm release annotation matches a component we just uninstalled. +{{- range .ComponentsReversed }} kubectl get crd -o json 2>/dev/null \ | jq -r --arg rel "{{ .Name }}" --arg ns "{{ .Namespace }}" \ '.items[] | select(.metadata.annotations["meta.helm.sh/release-name"]==$rel and .metadata.annotations["meta.helm.sh/release-namespace"]==$ns) | .metadata.name' 2>/dev/null \ | while read -r name; do echo "Deleting CRD ${name} (owned by {{ .Name }}/{{ .Namespace }})..." 
- # Per-CRD `|| echo Warning:` keeps the loop making best-effort progress - # across the remaining CRDs while surfacing each delete failure - # (RBAC, timeout) with the specific CRD name -- post-flight's - # Helm-annotation re-check catches any that still linger. kubectl delete crd "${name}" --ignore-not-found --wait=false \ || echo "Warning: failed to delete CRD ${name} (owned by {{ .Name }}/{{ .Namespace }}); leftovers will surface in post-flight" >&2 done || echo "Warning: orphan-CRD cleanup for {{ .Name }}/{{ .Namespace }} failed (kubectl get / jq error); leftovers will surface in post-flight" >&2 -{{- end }}{{ end }} +{{- end }} # Intentionally skip automatic deletion of unannotated CRDs matched only by # API group. On shared clusters, those CRDs may be serving another tenant's @@ -544,8 +534,6 @@ kubectl get crd -o json 2>/dev/null \ # Warn about CRDs stuck in deleting state (e.g., customresourcecleanup finalizer # can't be resolved because CR instances still have controller-managed finalizers). -# Force-clearing CRD finalizers can orphan CR data in etcd, so list them and let -# the user decide. stuck_crds=$(kubectl get crd -o json 2>/dev/null | jq -r '.items[] | select(.metadata.deletionTimestamp != null) | .metadata.name' 2>/dev/null || true) if [[ -n "${stuck_crds}" ]]; then echo "" @@ -567,8 +555,6 @@ if [[ -n "${stuck_crds}" ]]; then fi # Clean up namespaces after all components are uninstalled. -# PVC deletion is deferred here so that StatefulSet owners are removed first, -# preventing hangs on kubernetes.io/pvc-protection finalizers. {{ range .Namespaces -}} if [[ "${DELETE_PVCS}" == "true" ]] && ! echo " ${PROTECTED_NS} " | grep -q " {{ . }} "; then echo "Deleting PVCs in {{ . }}..." @@ -578,8 +564,12 @@ delete_orphaned_webhooks_for_ns "{{ . }}" delete_namespace "{{ . }}" {{ end }} # Clean up companion namespaces created at runtime by operators. -# Only includes namespaces that are always created by bundle components. 
+# Only emitted for components whose runtime creates them. +{{- range .ComponentsReversed }} +{{- if eq .Name "kai-scheduler" }} delete_namespace "kai-resource-reservation" +{{- end }} +{{- end }} # Wait for terminating namespaces to finish echo "Waiting for namespaces to terminate..." @@ -599,21 +589,16 @@ for i in $(seq 1 60); do sleep 1 done -# Final webhook cleanup: catch webhooks whose services were removed by namespace -# deletion above. Before namespace deletion, services still exist so the orphan -# check cannot trigger; this post-deletion pass closes that race. +# Final webhook cleanup pass. {{ range .Namespaces -}} delete_orphaned_webhooks_for_ns "{{ . }}" {{ end }} # ============================================================================== # Post-flight verification # ============================================================================== -# Verify the cluster is clean after undeployment. Warn about any stale -# resources that could block a subsequent deploy. postflight_issues=false -# Check for remaining terminating namespaces TERMINATING=$(kubectl get namespaces -o jsonpath='{range .items[?(@.status.phase=="Terminating")]}{.metadata.name}{" "}{end}' 2>/dev/null || true) if [[ -n "${TERMINATING}" ]]; then echo "WARNING: namespaces still terminating: ${TERMINATING}" @@ -621,8 +606,6 @@ if [[ -n "${TERMINATING}" ]]; then postflight_issues=true fi -# Delete stale webhooks whose backing service no longer exists. -# These can block pod creation (fail-closed) after undeployment. kubectl get mutatingwebhookconfigurations,validatingwebhookconfigurations -o json 2>/dev/null | \ jq -r '.items[] | .metadata.name as $wh | .webhooks[]? 
| select(.clientConfig.service != null) | [$wh, .clientConfig.service.namespace, .clientConfig.service.name] | @tsv' 2>/dev/null | \ sort -u | \ @@ -634,7 +617,6 @@ kubectl get mutatingwebhookconfigurations,validatingwebhookconfigurations -o jso fi done || true -# Check for stale API services stale_apis=$(kubectl get apiservices -o json 2>/dev/null | jq -r '.items[] | select(.status.conditions[]? | .type == "Available" and .status == "False") | .metadata.name' 2>/dev/null || true) if [[ -n "${stale_apis}" ]]; then echo "WARNING: unavailable API services found: ${stale_apis}" @@ -643,27 +625,20 @@ if [[ -n "${stale_apis}" ]]; then fi # Check for Helm-annotated CRDs from uninstalled releases. -# Mirrors the per-component CRD deletion loop above: if cleanup succeeded, -# every annotated CRD is gone and this check stays silent. Catches cases -# where the deletion loop logged a transient kubectl/jq warning and moved on. -# CRDs already being deleted (non-null .metadata.deletionTimestamp) are excluded -- -# the earlier `kubectl delete crd ... --wait=false` returns immediately, so a slow -# finalizer can leave the CRD still listed here even though it is being cleaned up -# normally. Only truly stuck (not-yet-terminating) CRDs should be surfaced. 
helm_orphaned_crds="" explicit_orphaned_crds="" postflight_all_crds_json="" if capture_kubectl_json postflight_all_crds_json get crd -o json; then : -{{- range .ComponentsReversed }}{{ if .HasChart }} +{{- range .ComponentsReversed }} remaining_helm_crds=$(echo "${postflight_all_crds_json}" \ | jq -r --arg rel "{{ .Name }}" --arg ns "{{ .Namespace }}" \ '.items[] | select(.metadata.annotations["meta.helm.sh/release-name"]==$rel and .metadata.annotations["meta.helm.sh/release-namespace"]==$ns and .metadata.deletionTimestamp==null) | .metadata.name' 2>/dev/null || true) if [[ -n "${remaining_helm_crds}" ]]; then helm_orphaned_crds="${helm_orphaned_crds} ${remaining_helm_crds}" fi -{{- end }}{{ end }} -{{- range .ComponentsReversed }}{{ if .HasChart }} +{{- end }} +{{- range .ComponentsReversed }} while read -r crd_name; do [[ -z "${crd_name}" ]] && continue remaining_explicit_crd=$(echo "${postflight_all_crds_json}" \ @@ -673,7 +648,7 @@ if capture_kubectl_json postflight_all_crds_json get crd -o json; then explicit_orphaned_crds="${explicit_orphaned_crds}${remaining_explicit_crd}"$'\n' fi done < <(extra_crds_for_release "{{ .Name }}") -{{- end }}{{ end }} +{{- end }} else echo "Warning: failed to enumerate post-flight CRDs; kubectl output: ${postflight_all_crds_json}" >&2 postflight_issues=true @@ -685,10 +660,6 @@ if [[ -n "${helm_orphaned_crds}" ]]; then postflight_issues=true fi -# Check for exact-name CRDs from known crds/-installed releases. -# These CRDs are intentionally tracked by explicit name instead of API group so -# post-flight can surface leftovers without risking cross-tenant cleanup on -# shared clusters. 
explicit_orphaned_crds=$(printf '%s' "${explicit_orphaned_crds}" | awk 'NF' | sort -u | tr '\n' ' ') if [[ -n "${explicit_orphaned_crds}" ]]; then echo "WARNING: explicit CRDs from this bundle still present: ${explicit_orphaned_crds}" diff --git a/pkg/bundler/deployer/helm/testdata/README.md b/pkg/bundler/deployer/helm/testdata/README.md new file mode 100644 index 000000000..9b893f7c7 --- /dev/null +++ b/pkg/bundler/deployer/helm/testdata/README.md @@ -0,0 +1,73 @@ +# helm deployer test fixtures + +Each subdirectory under `testdata/` is a golden-file snapshot of a complete +generated bundle for a representative recipe shape. The harness in +[`helm_test.go`](../helm_test.go) (`assertBundleGolden`) walks the freshly +generated tempdir bundle and byte-compares every file against the +checked-in tree. + +The same pattern is used in the sister package's +[`pkg/bundler/deployer/localformat/testdata/`](../../localformat/testdata). + +## Scenarios + +| Directory | What it exercises | +|---|---| +| `upstream_helm_only/` | Bundle with a single non-OCI Helm component (cert-manager) — no Chart.yaml in folder, `upstream.env` carries CHART/REPO/VERSION | +| `manifest_only/` | Component with `defaultRepository: ""` + `manifestFiles` — wrapped chart with synthesized Chart.yaml + templates/ | +| `mixed_gpu_operator/` | Mixed component (Helm chart + raw manifests) — primary `001-gpu-operator/` (upstream-helm) plus injected `002-gpu-operator-post/` (local-helm wrapping the manifests) | +| `kai_scheduler_present/` | OCI Helm component (`oci://...`) — `upstream.env` writes the full OCI URI to CHART, leaves REPO empty; `install.sh` uses `${REPO:+--repo "${REPO}"}` so `--repo` is omitted for OCI | +| `nodewright_present/` | Bundle containing nodewright-operator — exercises the name-matched node taint cleanup block in `deploy.sh` | + +## Regenerating the goldens + +After any change to the helm deployer, the templates, or `localformat`: + +```bash +go test -run "^TestBundleGolden_" 
./pkg/bundler/deployer/helm/ -update +``` + +This rewrites every file under `testdata//` to match the freshly +generated bundle. Inspect the diff carefully — every byte change is +reviewer-visible and that is the entire point. + +The rule is symmetric with `pkg/bundler/deployer/localformat`: + +```bash +go test ./pkg/bundler/deployer/localformat/... -update +``` + +## Why these aren't real bundles + +Goldens use **minimal synthetic** input where possible: a one-key +`ConfigMap`, a stub `Service` without a `spec.ports`, etc. They are NOT +meant to be installable into a real cluster. The harness asserts on +generated **bundle layout and rendered text**, not on the runtime +correctness of the manifests inside. Real-cluster runtime correctness is +covered by the chainsaw end-to-end tests under `tests/chainsaw/`. + +## Why no Apache license headers on the YAML/scripts here + +`testdata/**` is excluded from `make license` (see `Makefile`'s +`LICENSE_IGNORES` block). These files are test fixtures, not source +artifacts; running `addlicense` over them would corrupt the goldens by +prepending headers that the runtime generator does not emit, causing the +round-trip test to fail. Test-driven proof: removing the ignore would +break `TestBundleGolden_*` immediately on the next `make lint`. + +## Adding a new scenario + +1. Add a `TestBundleGolden_` test in `helm_test.go` mirroring the + existing examples — construct a `Generator`, call `Generate`, then + `assertBundleGolden(t, outDir, "testdata/")`. +2. Run `go test -run TestBundleGolden_ ./pkg/bundler/deployer/helm/ -update` + once to materialize the golden tree. +3. Inspect the generated tree on disk. If anything looks wrong, fix the + generator (not the golden) and re-`-update`. The golden should be a + faithful capture of what `Generate` actually produces. +4. Commit the test plus the entire `testdata//` directory. 
+ +The tree of goldens you check in becomes a reference catalog of "this is +what a bundle of shape X looks like" — a deliberate side effect of the +pattern that helps reviewers understand the deployer output without +running anything locally. diff --git a/pkg/bundler/deployer/helm/testdata/kai_scheduler_present/001-kai-scheduler/cluster-values.yaml b/pkg/bundler/deployer/helm/testdata/kai_scheduler_present/001-kai-scheduler/cluster-values.yaml new file mode 100644 index 000000000..9c936cbd8 --- /dev/null +++ b/pkg/bundler/deployer/helm/testdata/kai_scheduler_present/001-kai-scheduler/cluster-values.yaml @@ -0,0 +1,2 @@ +# Generated by Cloud Native Stack +--- diff --git a/pkg/bundler/deployer/helm/testdata/kai_scheduler_present/001-kai-scheduler/install.sh b/pkg/bundler/deployer/helm/testdata/kai_scheduler_present/001-kai-scheduler/install.sh new file mode 100644 index 000000000..c752474c7 --- /dev/null +++ b/pkg/bundler/deployer/helm/testdata/kai_scheduler_present/001-kai-scheduler/install.sh @@ -0,0 +1,29 @@ +#!/usr/bin/env bash +# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +set -euo pipefail +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +cd "${SCRIPT_DIR}" +# shellcheck source=/dev/null +source ./upstream.env + +# CHART carries the full OCI URI for OCI charts and just the chart name for +# HTTP/HTTPS charts. 
REPO is non-empty only for HTTP/HTTPS charts; the +# ${REPO:+--repo "${REPO}"} expansion adds --repo iff REPO is set. +helm upgrade --install kai-scheduler "${CHART}" \ + ${REPO:+--repo "${REPO}"} --version "${VERSION}" \ + --namespace kai-scheduler --create-namespace \ + -f values.yaml -f cluster-values.yaml \ + ${COMPONENT_WAIT_ARGS:-} ${DRY_RUN_FLAG:-} ${KUBECONFIG_FLAG:-} ${HELM_DEBUG_FLAG:-} diff --git a/pkg/bundler/deployer/helm/testdata/kai_scheduler_present/001-kai-scheduler/upstream.env b/pkg/bundler/deployer/helm/testdata/kai_scheduler_present/001-kai-scheduler/upstream.env new file mode 100644 index 000000000..29f65e84e --- /dev/null +++ b/pkg/bundler/deployer/helm/testdata/kai_scheduler_present/001-kai-scheduler/upstream.env @@ -0,0 +1,3 @@ +CHART='oci://ghcr.io/nvidia/kai-scheduler/kai-scheduler' +REPO='' +VERSION='v0.13.0' diff --git a/pkg/bundler/deployer/helm/testdata/kai_scheduler_present/001-kai-scheduler/values.yaml b/pkg/bundler/deployer/helm/testdata/kai_scheduler_present/001-kai-scheduler/values.yaml new file mode 100644 index 000000000..9c936cbd8 --- /dev/null +++ b/pkg/bundler/deployer/helm/testdata/kai_scheduler_present/001-kai-scheduler/values.yaml @@ -0,0 +1,2 @@ +# Generated by Cloud Native Stack +--- diff --git a/pkg/bundler/deployer/helm/testdata/kai_scheduler_present/README.md b/pkg/bundler/deployer/helm/testdata/kai_scheduler_present/README.md new file mode 100644 index 000000000..776d26421 --- /dev/null +++ b/pkg/bundler/deployer/helm/testdata/kai_scheduler_present/README.md @@ -0,0 +1,107 @@ +# Cloud Native Stack Deployment + +Recipe Version: v0.1.0 +Bundler Version: v1.0.0 + +Per-component bundle for deploying NVIDIA Cloud Native Stack components +for GPU-accelerated Kubernetes workloads. + +## Configuration + + + +## Components + +The following components are included (deployed in order). 
Each component +lives in a numbered `NNN-/` folder and is installed as a Helm release +via its own `install.sh`: + +| Component | Version | Namespace | Source | +|-----------|---------|-----------|--------| +| kai-scheduler | v0.13.0 | kai-scheduler | kai-scheduler (oci://ghcr.io/nvidia/kai-scheduler) | + + + + +## Quick Start + +Run the included deployment script: + +```bash +chmod +x deploy.sh +./deploy.sh +``` + +Use `--no-wait` to skip Helm chart-level waiting where AICR uses `--wait` (keeps `--timeout` for hooks): + +```bash +./deploy.sh --no-wait +``` + +> **Note:** The deploy script's final status reflects install/apply results. If `--best-effort` was used, one or more components may still have failed; check warning lines and logs. This does **not** guarantee the cluster is ready to schedule workloads — operator-driven cluster convergence (CRD reconciliation, node tuning, plugin registration, etc.) continues asynchronously after the script exits, in operator-specific ways. See the [AICR CLI Reference](https://github.com/NVIDIA/aicr/blob/main/docs/user/cli-reference.md#deploy-script-behavior-deploysh) for details. + +## Manual Installation + +Each component folder contains an `install.sh` that runs `helm upgrade --install` +with the right arguments baked in. To install a single component manually: + +```bash +cd NNN- +bash install.sh +``` + +## Customization + +Each component folder has its own `values.yaml` (static) and `cluster-values.yaml` +(dynamic, per-cluster). 
Edit either before deploying: + +```bash +vim NNN-/values.yaml +vim NNN-/cluster-values.yaml +``` + +## Upgrade + +Re-run the per-component install.sh to upgrade an already-installed release: + +```bash +cd NNN- +bash install.sh +``` + +## Uninstall + +To remove components (reverse order): + +```bash +./undeploy.sh +``` + +Or remove a single release manually: + +```bash +helm uninstall kai-scheduler -n kai-scheduler +``` + + +## Troubleshooting + +### Check deployment status + +```bash +kubectl get pods -A | grep -E 'kai-scheduler' +``` + +### View component logs + +Inspect a single component's pods (replace `` and `` +with one of the entries from the table above): + +```bash +kubectl logs -n -l app.kubernetes.io/instance= +``` + + +## References + +- [AICR CLI Reference](https://github.com/NVIDIA/aicr/blob/main/docs/user/cli-reference.md) diff --git a/pkg/bundler/deployer/helm/testdata/kai_scheduler_present/deploy.sh b/pkg/bundler/deployer/helm/testdata/kai_scheduler_present/deploy.sh new file mode 100644 index 000000000..952743d98 --- /dev/null +++ b/pkg/bundler/deployer/helm/testdata/kai_scheduler_present/deploy.sh @@ -0,0 +1,321 @@ +#!/usr/bin/env bash +# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +set -euo pipefail + +# Cloud Native Stack Deployment Script +# Generated by AICR Bundler v1.0.0 +# +# Usage: ./deploy.sh [--no-wait] [--best-effort] [--retries N] +# --no-wait Skip Helm chart-level wait where AICR uses --wait (keeps --timeout for hooks) +# --best-effort Continue past individual component failures (log warnings) +# --retries N Retry failed helm/kubectl operations N times with backoff (default: 5, 0 = fail-fast) +# +# This script is optional — each component subdirectory has its own install.sh +# with a single baked-in `helm upgrade --install` command. For detailed behavior +# docs (CRD ordering, async components, error handling), see the AICR CLI Reference: +# https://github.com/NVIDIA/aicr/blob/main/docs/user/cli-reference.md#deploy-script-behavior-deploysh + +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" + +# Run helm commands from a temp directory to prevent local chart directories +# (e.g., bundle/001-nodewright-operator/) from shadowing remote chart references. +HELM_WORKDIR="$(mktemp -d)" +trap 'rm -rf "${HELM_WORKDIR}"; exit 130' INT TERM +trap 'rm -rf "${HELM_WORKDIR}"' EXIT + +HELM_TIMEOUT="10m" +NO_WAIT=false +BEST_EFFORT=false +FAILED_COMPONENTS="" +MAX_RETRIES=5 + +while [[ $# -gt 0 ]]; do + case "$1" in + --no-wait) NO_WAIT=true; shift ;; + --best-effort) BEST_EFFORT=true; shift ;; + --retries) + if [[ $# -lt 2 ]]; then echo "Error: --retries requires a value"; exit 1; fi + if ! [[ "$2" =~ ^[0-9]+$ ]]; then echo "Error: --retries requires a non-negative integer"; exit 1; fi + MAX_RETRIES="$2"; shift 2 ;; + *) echo "Error: unknown option: $1"; echo "Usage: ./deploy.sh [--no-wait] [--best-effort] [--retries N]"; exit 1 ;; + esac +done + +# Export env vars consumed by each folder's install.sh (rendered by localformat). +# DRY_RUN_FLAG / KUBECONFIG_FLAG / HELM_DEBUG_FLAG default to empty strings. 
+export DRY_RUN_FLAG="${DRY_RUN_FLAG:-}" +export KUBECONFIG_FLAG="${KUBECONFIG_FLAG:-}" +export HELM_DEBUG_FLAG="${HELM_DEBUG_FLAG:-}" + +function helm_failed() { + if [[ "${BEST_EFFORT}" == "true" ]]; then + echo "WARNING: $1 install failed, continuing (--best-effort)" + FAILED_COMPONENTS="${FAILED_COMPONENTS} $1" + else + exit 1 + fi +} + +# Compute backoff delay from attempt number (1-indexed). +# Examples: attempt 1→5s, 2→20s, 3→45s, 4→80s, 5→120s (cap) +function backoff_seconds() { + local attempt=$1 + local seconds=$(( attempt * attempt * 5 )) + if [[ ${seconds} -gt 120 ]]; then seconds=120; fi + echo "${seconds}" +} + +# Clean up stale Helm hook Jobs before retrying. When a hook Job (e.g., +# crd-upgrader) times out or fails, it stays in the namespace and blocks +# subsequent install attempts with "Job not ready" errors. +# Helm hooks are identified by the helm.sh/hook *annotation* (not a label), +# so we list all non-succeeded Jobs and check each individually via JSON. +function cleanup_helm_hooks() { + local namespace="$1" + local job_names + job_names=$(kubectl get jobs -n "${namespace}" \ + --field-selector=status.successful=0 \ + -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' \ + 2>/dev/null || true) + if [[ -z "${job_names}" ]]; then + return + fi + while IFS= read -r name; do + [[ -z "${name}" ]] && continue + # Get the full Job JSON to reliably check annotations and status + local job_json + job_json=$(kubectl get job "${name}" -n "${namespace}" -o json 2>/dev/null || true) + [[ -z "${job_json}" ]] && continue + # Skip non-hook Jobs (no helm.sh/hook annotation) + local hook_val + hook_val=$(echo "${job_json}" | grep -o '"helm.sh/hook"' || true) + [[ -z "${hook_val}" ]] && continue + # Capture diagnostics before deleting. This helps diagnose transient hook + # failures (e.g., dynamo ssh-keygen) that are otherwise lost after cleanup. 
+ echo " --- Failed hook Job ${name} diagnostics ---" + kubectl describe job "${name}" -n "${namespace}" 2>/dev/null | tail -50 || true + local pod_names + pod_names=$(kubectl get pods -n "${namespace}" -l "job-name=${name}" \ + -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' 2>/dev/null || true) + for pod_name in ${pod_names}; do + echo " --- Hook pod ${pod_name} describe ---" + kubectl describe pod "${pod_name}" -n "${namespace}" 2>/dev/null | tail -50 || true + done + echo " --- End diagnostics for ${name} ---" + # Delete any non-succeeded hook Job. This function only runs after a Helm + # failure, so any hook Job without a successful completion is blocking the + # retry — whether it failed, is stuck Pending (timed out before the pod + # started), or is still active with a stuck container. + echo " Cleaning up stale Helm hook Job ${name} in ${namespace}..." + kubectl delete job "${name}" -n "${namespace}" --ignore-not-found 2>/dev/null || true + done <<< "${job_names}" +} + +function dump_kai_scheduler_helm_diagnostics() { + local namespace="$1" + if [[ "${namespace}" != "kai-scheduler" ]]; then + return + fi + + echo " --- ${namespace} diagnostics ---" + echo " Jobs:" + kubectl get jobs -n "${namespace}" 2>/dev/null || true + echo " Job descriptions:" + kubectl describe jobs -n "${namespace}" 2>/dev/null || true + echo " Pods:" + kubectl get pods -n "${namespace}" -o wide 2>/dev/null || true + echo " Pod descriptions:" + kubectl describe pods -n "${namespace}" 2>/dev/null || true + echo " Recent events:" + kubectl get events -n "${namespace}" --sort-by='.lastTimestamp' 2>/dev/null | tail -30 || true + echo " --- End ${namespace} diagnostics ---" +} + +# Components that use operator patterns with custom resources that reconcile +# asynchronously. Helm --wait may time out waiting for CR readiness even though +# all pods start successfully. These components are installed without --wait. 
+ASYNC_COMPONENTS="kai-scheduler" + +# ============================================================================== +# Pre-flight checks +# ============================================================================== +# Verify the cluster is clean before deploying. Stale webhooks, terminating +# namespaces, and orphaned API services from a previous install can block pod +# creation and namespace deletion, causing silent deployment failures. + +echo "Running pre-flight checks..." + +preflight_failed=false + +# Bundle namespace list (deduplicated) +BUNDLE_NAMESPACES=$(echo "kai-scheduler " | tr ' ' '\n' | sort -u | tr '\n' ' ') + +# Check for terminating namespaces that overlap with our components +for ns in ${BUNDLE_NAMESPACES}; do + phase=$(kubectl get ns "${ns}" -o jsonpath='{.status.phase}' 2>/dev/null || true) + if [[ "${phase}" == "Terminating" ]]; then + echo "ERROR: namespace '${ns}' is still terminating from a previous install." + echo " Wait for it to finish, or force-finalize with:" + echo " kubectl get ns ${ns} -o json | jq '.spec.finalizers=[]' | kubectl replace --raw /api/v1/namespaces/${ns}/finalize -f -" + preflight_failed=true + fi +done + +# Check for stale webhooks whose backing services no longer exist. +# Scoped to bundle namespaces only to avoid false positives from unrelated +# platform webhooks in shared clusters. 
+if command -v jq &>/dev/null; then + for kind in mutatingwebhookconfigurations validatingwebhookconfigurations; do + while IFS=$'\t' read -r wh_name svc_ns svc_name; do + # Only check webhooks pointing to our bundle namespaces + is_bundle_ns=false + for ns in ${BUNDLE_NAMESPACES}; do + [[ "${svc_ns}" == "${ns}" ]] && is_bundle_ns=true && break + done + [[ "${is_bundle_ns}" == "false" ]] && continue + + # Use explicit NotFound check to avoid false positives from transient errors + svc_check=$(kubectl get svc "${svc_name}" -n "${svc_ns}" 2>&1) || true + if echo "${svc_check}" | grep -q "NotFound\|not found"; then + echo "ERROR: ${kind} '${wh_name}' references non-existent service ${svc_ns}/${svc_name}." + echo " This will block pod/resource creation. Delete with: kubectl delete ${kind} ${wh_name}" + preflight_failed=true + fi + done < <(kubectl get "${kind}" -o json 2>/dev/null | \ + jq -r '.items[] | .metadata.name as $wh | .webhooks[]? | select(.clientConfig.service != null) | [$wh, .clientConfig.service.namespace, .clientConfig.service.name] | @tsv' 2>/dev/null || true) + done +else + echo "NOTE: jq not found — skipping webhook pre-flight checks. Install jq for full pre-flight validation." +fi + +# Check for stale API services (e.g., custom.metrics.k8s.io from prometheus-adapter) +if command -v jq &>/dev/null; then + for api_svc in $(kubectl get apiservices -o json 2>/dev/null | jq -r '.items[] | select(.status.conditions[]? | .type == "Available" and .status == "False") | .metadata.name' 2>/dev/null || true); do + echo "WARNING: API service '${api_svc}' is unavailable. This can block namespace deletion." + echo " Delete with: kubectl delete apiservice ${api_svc}" + # API service issues are warnings, not hard failures — they don't block deployment directly + done +else + echo "NOTE: jq not found — skipping API service pre-flight checks." +fi + +# Check for orphaned CRDs from previous deployments. 
+# Scoped to CRD groups belonging to components in this bundle to avoid +# false positives from unrelated platform installs on shared clusters. +ORPHANED_CRD_GROUPS="" ORPHANED_CRD_GROUPS="${ORPHANED_CRD_GROUPS} kai.scheduler scheduling.run.ai" +for group in ${ORPHANED_CRD_GROUPS}; do + orphaned=$(kubectl get crd -o name 2>/dev/null | grep "\.${group}$" || true) + if [[ -n "${orphaned}" ]]; then + echo "WARNING: orphaned CRDs from previous deployment: ${orphaned}" + echo " These may cause conflicts. Delete with: kubectl delete ${orphaned}" + fi +done + +# Check for stale nodewright node taints from a previous deployment. +# Only remove taints if nodewright-operator is NOT already running (i.e., fresh deploy). +# If the operator is running, taints are legitimate scheduling guards. + +if [[ "${preflight_failed}" == "true" ]]; then + echo "" + echo "Pre-flight checks failed. Fix the issues above before deploying." + echo "To skip pre-flight checks, run: ./undeploy.sh first, then retry." + exit 1 +fi + +echo "Pre-flight checks passed." +echo "Deploying Cloud Native Stack components..." + +# ============================================================================== +# Install loop +# ============================================================================== +# Generic install loop. Each folder's install.sh is rendered by localformat +# with the right helm command baked in — deploy.sh has no per-component +# knowledge here. Per-component special-case logic (async wait, DRA plugin +# restart) runs in the post-install blocks below, matched by component name. +cd "${HELM_WORKDIR}" +for dir in "${SCRIPT_DIR}"/[0-9][0-9][0-9]-*/; do + [[ -d "${dir}" ]] || continue + dir="${dir%/}" + base="${dir##*/}" + name="${base#[0-9][0-9][0-9]-}" + # Source the namespace from the folder's install.sh — not the folder + # basename — because the helm release name and its target namespace can + # differ (e.g. 
nodewright-operator → namespace skyhook; gpu-operator-post → + # namespace gpu-operator). cleanup_helm_hooks and the kai diagnostics + # both operate on the namespace. + namespace=$(awk '{ for (i=1;i/dev/null; then + echo "Error: jq is required but not found in PATH." + echo " Install jq: https://jqlang.github.io/jq/download/" + exit 1 +fi + +echo "Undeploying Cloud Native Stack components (timeout: ${HELM_TIMEOUT}s)..." +echo "" + +echo " The following components will be removed (in reverse install order):" +# List NNN-* folders in reverse numeric order +for dir in $(ls -1d "${SCRIPT_DIR}"/[0-9][0-9][0-9]-*/ 2>/dev/null | sort -r); do + [[ -d "${dir}" ]] || continue + base="${dir%/}" + base="${base##*/}" + name="${base#[0-9][0-9][0-9]-}" + printf " %-40s\n" "${name}" +done +echo "" + +# System namespaces that must not be deleted +PROTECTED_NS="kube-system kube-public kube-node-lease default" + +delete_namespace() { + local ns="$1" + if [[ "${KEEP_NS}" == "true" ]]; then return; fi + if echo " ${PROTECTED_NS} " | grep -q " ${ns} "; then return; fi + if ! kubectl get namespace "${ns}" &>/dev/null; then return; fi + echo "Deleting namespace ${ns}..." + kubectl delete namespace "${ns}" --ignore-not-found --wait=false +} + +# Uninstall a Helm release, handling stuck pending states from interrupted deploys. +# Try normal uninstall first; if it fails, retry with --no-hooks to force removal. +helm_force_uninstall() { + local release="$1" + local ns="$2" + if helm uninstall "${release}" -n "${ns}" --timeout "${HELM_TIMEOUT}s" --ignore-not-found 2>/dev/null; then + return + fi + echo " Retrying ${release} removal with --no-hooks..." + helm uninstall "${release}" -n "${ns}" --no-hooks --timeout "${HELM_TIMEOUT}s" --ignore-not-found || true +} + +# Delete cluster-scoped resources owned by a Helm release. 
+# These survive namespace deletion and can block subsequent deployments: +# - Webhooks block pod creation when their backing service is gone +# - CRDs with "helm.sh/resource-policy: keep" are retained after chart removal +delete_release_cluster_resources() { + local release="$1" + local ns="$2" + local selector="app.kubernetes.io/managed-by=Helm" + for kind in mutatingwebhookconfigurations validatingwebhookconfigurations customresourcedefinitions; do + kubectl get "${kind}" -l "${selector}" -o json 2>/dev/null \ + | jq -r --arg rel "${release}" --arg ns "${ns}" \ + '.items[] | select(.metadata.annotations["meta.helm.sh/release-name"]==$rel and .metadata.annotations["meta.helm.sh/release-namespace"]==$ns) | .metadata.name' 2>/dev/null \ + | while read -r name; do + echo "Deleting ${kind}/${name}..." + kubectl delete "${kind}" "${name}" --ignore-not-found --timeout="${HELM_TIMEOUT}s" || true + done || echo "Warning: ${kind} cleanup pipeline for release ${release}/${ns} failed (kubectl get / jq error); leftovers will surface in post-flight" >&2 + done +} + +# Delete webhooks whose backing service is in a specific namespace and no longer exists. +# Scoped to the given namespace to avoid touching unrelated platform webhooks. +# Operator-created webhooks (e.g., kai-scheduler admission) may not carry Helm labels, +# but once their service namespace is deleted, fail-closed webhooks block pod creation. +delete_orphaned_webhooks_for_ns() { + local ns="$1" + for kind in mutatingwebhookconfigurations validatingwebhookconfigurations; do + { kubectl get "${kind}" -o json 2>/dev/null \ + | jq -r --arg ns "${ns}" \ + '.items[] | .metadata.name as $wh | .webhooks[] | select(.clientConfig.service != null and .clientConfig.service.namespace == $ns) | [$wh, .clientConfig.service.name] | @tsv' 2>/dev/null \ + | sort -u || true; } \ + | while IFS=$'\t' read -r wh_name svc_name; do + # Delete when namespace is gone, terminating, or backing service is missing. 
+ # Skip on transient errors (auth, timeout, DNS) to avoid removing valid webhooks. + local ns_output ns_phase svc_output + ns_output=$(kubectl get ns "${ns}" 2>&1) || true + if echo "${ns_output}" | grep -q "NotFound"; then + echo "Deleting orphaned ${kind}/${wh_name} (namespace ${ns} not found)..." + kubectl delete "${kind}" "${wh_name}" --ignore-not-found + continue + fi + ns_phase=$(kubectl get ns "${ns}" -o jsonpath='{.status.phase}' 2>/dev/null) || true + if [[ "${ns_phase}" == "Terminating" ]]; then + echo "Deleting orphaned ${kind}/${wh_name} (namespace ${ns} terminating)..." + kubectl delete "${kind}" "${wh_name}" --ignore-not-found + continue + fi + svc_output=$(kubectl get svc "${svc_name}" -n "${ns}" 2>&1) || true + if echo "${svc_output}" | grep -q "NotFound"; then + echo "Deleting orphaned ${kind}/${wh_name} (service ${ns}/${svc_name} not found)..." + kubectl delete "${kind}" "${wh_name}" --ignore-not-found + fi + done + done +} + +# Force-clear finalizers on all namespaced resources to unstick a Terminating namespace. +force_clear_namespace_finalizers() { + local ns="$1" + echo "Force-removing finalizers in namespace ${ns}..." 
+ local kinds + kinds=$(kubectl api-resources --verbs=list --namespaced -o name 2>/dev/null) || { + echo "Warning: failed to enumerate namespaced resource kinds in ${ns}; namespace may stay Terminating" >&2 + return + } + for kind in ${kinds}; do + kubectl get "${kind}" -n "${ns}" -o json 2>/dev/null \ + | jq -r '.items[] | select(.metadata.finalizers // [] | length > 0) | .kind + "/" + .metadata.name' 2>/dev/null \ + | while read -r resource; do + kubectl patch "${resource}" -n "${ns}" --type merge -p '{"metadata":{"finalizers":null}}' 2>/dev/null || true + done || echo "Warning: finalizer-clear pipeline for ${kind} in ${ns} failed (kubectl get / jq error); namespace may stay Terminating" >&2 + done +} + +# Return a small, explicit list of known crds/-installed CRDs whose +# finalizer-bearing custom resources must still be caught before the operator +# is removed. Keep this list intentionally tiny and exact: if a release is not +# listed here, pre-flight relies on chart-manifest and Helm-annotation discovery +# only instead of trying to infer ownership across the whole cluster. 
+extra_crds_for_release() { + case "$1" in + gpu-operator) + printf '%s\n' \ + "clusterpolicies.nvidia.com" \ + "nvidiadrivers.nvidia.com" \ + "nodefeaturegroups.nfd.k8s-sigs.io" \ + "nodefeaturerules.nfd.k8s-sigs.io" \ + "nodefeatures.nfd.k8s-sigs.io" + ;; + kai-scheduler) + printf '%s\n' \ + "bindrequests.scheduling.run.ai" \ + "configs.kai.scheduler" \ + "podgroups.scheduling.run.ai" \ + "queues.scheduling.run.ai" \ + "schedulingshards.kai.scheduler" \ + "topologies.kai.scheduler" + ;; + k8s-nim-operator) + printf '%s\n' \ + "nemocustomizers.apps.nvidia.com" \ + "nemodatastores.apps.nvidia.com" \ + "nemoentitystores.apps.nvidia.com" \ + "nemoevaluators.apps.nvidia.com" \ + "nemoguardrails.apps.nvidia.com" \ + "nimbuilds.apps.nvidia.com" \ + "nimcaches.apps.nvidia.com" \ + "nimpipelines.apps.nvidia.com" \ + "nimservices.apps.nvidia.com" + ;; + kubeflow-trainer) + printf '%s\n' \ + "clustertrainingruntimes.trainer.kubeflow.org" \ + "trainjobs.trainer.kubeflow.org" \ + "trainingruntimes.trainer.kubeflow.org" \ + "jobsets.jobset.x-k8s.io" + ;; + kube-prometheus-stack) + printf '%s\n' \ + "alertmanagerconfigs.monitoring.coreos.com" \ + "alertmanagers.monitoring.coreos.com" \ + "podmonitors.monitoring.coreos.com" \ + "probes.monitoring.coreos.com" \ + "prometheusagents.monitoring.coreos.com" \ + "prometheuses.monitoring.coreos.com" \ + "prometheusrules.monitoring.coreos.com" \ + "scrapeconfigs.monitoring.coreos.com" \ + "servicemonitors.monitoring.coreos.com" \ + "thanosrulers.monitoring.coreos.com" + ;; + dynamo-platform) + printf '%s\n' \ + "podcliques.grove.io" \ + "podcliquescalinggroups.grove.io" \ + "podcliquesets.grove.io" \ + "podgangs.scheduler.grove.io" + ;; + network-operator) + # This explicit list matches the CRDs enabled by the bundled values: + # nfd=false, sriovNetworkOperator=false, maintenance-operator disabled. 
+ # Intentionally exclude networkattachmentdefinitions.k8s.cni.cncf.io: + # it is a broadly shared CRD, so surfacing or deleting it based only on + # this release would create cross-cluster noise. + printf '%s\n' \ + "nicclusterpolicies.mellanox.com" \ + "hostdevicenetworks.mellanox.com" \ + "ipoibnetworks.mellanox.com" \ + "macvlannetworks.mellanox.com" + ;; + *) ;; + esac +} + +# Skip pre-flight for releases whose bundle-managed custom resources are +# deleted from manifests before the controller is uninstalled. +skip_preflight_for_release() { + case "$1" in + nodewright-operator|kgateway) return 0 ;; + *) return 1 ;; + esac +} + +# Run `kubectl ... -o json` while keeping stdout parseable for jq. +capture_kubectl_json() { + local out_var="$1" + shift + local stderr_file output kubectl_err="" + + if ! stderr_file=$(mktemp "${TMPDIR:-/tmp}/.aicr_kubectl_stderr_XXXXXX"); then + printf -v "${out_var}" '%s' 'mktemp failed while capturing kubectl stderr' + return 1 + fi + + if ! output=$(kubectl "$@" 2>"${stderr_file}"); then + kubectl_err=$(cat "${stderr_file}" 2>/dev/null || true) + rm -f "${stderr_file}" + printf -v "${out_var}" '%s' "${kubectl_err}" + return 1 + fi + + if [[ -s "${stderr_file}" ]]; then + cat "${stderr_file}" >&2 || true + fi + rm -f "${stderr_file}" + printf -v "${out_var}" '%s' "${output}" + return 0 +} + +check_crd_for_stuck_resources() { + local crd_name="$1" + local component="$2" + local crd_json plural group scope resource stuck stuck_json kubectl_err + local item_name item_namespace item_finalizers + + if ! capture_kubectl_json crd_json get crd "${crd_name}" -o json; then + kubectl_err="${crd_json}" + echo "" >&2 + echo "ERROR: Pre-flight could not inspect CRD '${crd_name}' (release ${component})." >&2 + echo " kubectl output: ${kubectl_err}" >&2 + echo " Failing closed: if this is a transient API/auth error the script" >&2 + echo " may remove the operator before finalizers are reconciled." 
>&2 + echo " Re-run after the API recovers, or use --skip-preflight to bypass." >&2 + return 1 + fi + read -r plural group scope < <(echo "${crd_json}" | jq -r '[.spec.names.plural, .spec.group, .spec.scope] | @tsv' 2>/dev/null) || return 0 + [[ -z "${plural}" || "${plural}" == "null" ]] && return 0 + + resource="${plural}.${group}" + local jq_filter='[.metadata.finalizers // [] | .[] | select(startswith("kubernetes.io/") | not)]' + stuck="" + if [[ "${scope}" == "Namespaced" ]]; then + if ! capture_kubectl_json stuck_json get "${resource}" -A -o json; then + echo "" >&2 + echo "ERROR: Pre-flight could not list '${resource}' (release ${component})." >&2 + echo " kubectl output: ${stuck_json}" >&2 + echo " Failing closed so we do not silently miss a stuck CR." >&2 + echo " Re-run after the API recovers, or use --skip-preflight to bypass." >&2 + return 1 + fi + while IFS=$'\x1f' read -r item_namespace item_name item_finalizers; do + stuck="${stuck} ${item_namespace}/${item_name} finalizers=[${item_finalizers}]"$'\n' + done < <(echo "${stuck_json}" \ + | jq -r '.items[] | ('"${jq_filter}"') as $f | select(($f | length) > 0) | [(.metadata.namespace // ""), .metadata.name, ($f | join(","))] | join("\u001f")' 2>/dev/null || true) + else + if ! capture_kubectl_json stuck_json get "${resource}" -o json; then + echo "" >&2 + echo "ERROR: Pre-flight could not list '${resource}' (release ${component})." >&2 + echo " kubectl output: ${stuck_json}" >&2 + echo " Failing closed so we do not silently miss a stuck CR." >&2 + echo " Re-run after the API recovers, or use --skip-preflight to bypass."
>&2 + return 1 + fi + while IFS=$'\x1f' read -r item_name item_finalizers; do + stuck="${stuck} ${item_name} finalizers=[${item_finalizers}]"$'\n' + done < <(echo "${stuck_json}" \ + | jq -r '.items[] | ('"${jq_filter}"') as $f | select(($f | length) > 0) | [.metadata.name, ($f | join(","))] | join("\u001f")' 2>/dev/null || true) + fi + + if [[ -n "${stuck}" ]]; then + { + echo " ${component} — ${crd_name}:" + printf '%s' "${stuck}" + } >> "${PREFLIGHT_DETAILS}" + fi +} + +check_release_for_stuck_crds() { + local release="$1" + local ns="$2" + local manifest manifest_crds annotated_crds explicit_crds + local all_crds_json kubectl_err crd_name + manifest=$(helm get manifest "${release}" -n "${ns}" 2>/dev/null || true) + manifest_crds=$(echo "${manifest}" \ + | awk '/^kind:/{kind=$2} /^ name:/ && kind=="CustomResourceDefinition"{print $2; kind=""}') + + all_crds_json="${PREFLIGHT_ALL_CRDS_JSON:-}" + if [[ -z "${all_crds_json}" ]]; then + if ! capture_kubectl_json all_crds_json get crd -o json; then + kubectl_err="${all_crds_json}" + echo "" >&2 + echo "ERROR: Pre-flight could not list CRDs for release '${release}'." >&2 + echo " kubectl output: ${kubectl_err}" >&2 + echo " Failing closed: if this is a transient API/auth error the script" >&2 + echo " may remove the operator before finalizers are reconciled." >&2 + echo " Re-run after the API recovers, or use --skip-preflight to bypass."
>&2 + return 1 + fi + fi + + annotated_crds=$(echo "${all_crds_json}" \ + | jq -r --arg rel "${release}" --arg ns "${ns}" \ + '.items[] | select(.metadata.annotations["meta.helm.sh/release-name"]==$rel and .metadata.annotations["meta.helm.sh/release-namespace"]==$ns) | .metadata.name' 2>/dev/null || true) + explicit_crds="" + while read -r crd_name; do + [[ -z "${crd_name}" ]] && continue + explicit_crds="${explicit_crds}$(echo "${all_crds_json}" \ + | jq -r --arg name "${crd_name}" --arg rel "${release}" --arg ns "${ns}" \ + '.items[] + | select(.metadata.name == $name) + | select( + ((.metadata.annotations["meta.helm.sh/release-name"] // "") == "") + or + ( + .metadata.annotations["meta.helm.sh/release-name"] == $rel + and + .metadata.annotations["meta.helm.sh/release-namespace"] == $ns + ) + ) + | .metadata.name' 2>/dev/null || true)"$'\n' + done < <(extra_crds_for_release "${release}") + [[ -z "${manifest_crds}" && -z "${annotated_crds}" && -z "${explicit_crds}" ]] && return 0 + printf '%s\n%s\n%s\n' "${manifest_crds}" "${annotated_crds}" "${explicit_crds}" \ + | awk 'NF' \ + | sort -u \ + | while read -r crd_name; do + check_crd_for_stuck_resources "${crd_name}" "${release}" + done + return 0 +} + +# ============================================================================== +# Pre-flight checks +# ============================================================================== +# Verify no custom resources with active finalizers exist for CRDs owned by +# bundle operators. After helm uninstall removes the operator, CRs with +# finalizers cannot be reconciled — blocking CRD deletion indefinitely. + +if [[ "${SKIP_PREFLIGHT}" == "true" ]]; then + echo "Skipping pre-flight checks (--skip-preflight)." +else + echo "Running pre-flight checks..." + PREFLIGHT_DETAILS=$(mktemp "${TMPDIR:-/tmp}/.aicr_preflight_XXXXXX") + PREFLIGHT_ALL_CRDS_JSON="" + + if ! 
capture_kubectl_json PREFLIGHT_ALL_CRDS_JSON get crd -o json; then + echo "" >&2 + echo "ERROR: Pre-flight could not list CRDs." >&2 + echo " kubectl output: ${PREFLIGHT_ALL_CRDS_JSON}" >&2 + echo " Failing closed: if this is a transient API/auth error the script" >&2 + echo " may remove an operator before finalizers are reconciled." >&2 + echo " Re-run after the API recovers, or use --skip-preflight to bypass." >&2 + rm -f "${PREFLIGHT_DETAILS}" + exit 1 + fi + if skip_preflight_for_release "kai-scheduler"; then + echo " Skipping kai-scheduler (kai-scheduler): bundle deletes dependent manifests before controller uninstall." + else + echo " Checking kai-scheduler (kai-scheduler)..." + check_release_for_stuck_crds "kai-scheduler" "kai-scheduler" + fi + + if [[ -s "${PREFLIGHT_DETAILS}" ]]; then + echo "" + echo "ERROR: Found custom resources with active finalizers that will block undeploy." + echo " After the operator is removed, these finalizers cannot be processed —" + echo " causing an unrecoverable hang during CRD deletion." + echo "" + echo " Delete these resources while their controller is still running," + echo " then re-run ./undeploy.sh" + echo "" + cat "${PREFLIGHT_DETAILS}" + echo "" + echo " To skip this check: ./undeploy.sh --skip-preflight" + rm -f "${PREFLIGHT_DETAILS}" + exit 1 + fi + rm -f "${PREFLIGHT_DETAILS}" + + echo "Pre-flight checks passed." +fi + +# ============================================================================== +# Uninstall components in reverse install order +# ============================================================================== +# Generic reverse loop: every folder is a Helm release (local-helm or upstream-helm). +# `helm uninstall ` works uniformly for both kinds — that's one of the +# benefits of the uniform local-chart bundle format. 
+for dir in $(ls -1d "${SCRIPT_DIR}"/[0-9][0-9][0-9]-*/ 2>/dev/null | sort -r); do + [[ -d "${dir}" ]] || continue + base="${dir%/}" + base="${base##*/}" + name="${base#[0-9][0-9][0-9]-}" + # Derive the namespace from the component name via the per-component blocks + # below. We still need it for helm_force_uninstall and cluster-resource cleanup. + ns="" + if [[ "${name}" == "kai-scheduler" ]]; then ns="kai-scheduler"; fi + # Injected mixed-component "-post" folders share their parent's namespace. + if [[ "${name}" == "kai-scheduler-post" ]]; then ns="kai-scheduler"; fi + if [[ -z "${ns}" ]]; then + echo "Warning: no namespace known for ${name}; skipping uninstall" >&2 + continue + fi + echo "Uninstalling ${name} (${ns})..." + # For local-helm folders (Chart.yaml + templates/), kubectl-delete the + # rendered templates BEFORE helm uninstall. The templates may carry + # helm.sh/hook annotations (post-install/post-upgrade) which helm does + # not track or clean up on uninstall — so without this pre-delete, the + # operator gets removed but its hook-created CRs (and their finalizers) + # linger. Doing the delete while the controller is still running lets + # finalizers clear naturally. + if [[ -d "${dir}/templates" ]]; then + for tpl in "${dir}/templates/"*.yaml; do + [[ -f "${tpl}" ]] || continue + kubectl delete -n "${ns}" -f "${tpl}" --ignore-not-found --timeout="${HELM_TIMEOUT}s" || true + done + fi + helm_force_uninstall "${name}" "${ns}" + delete_release_cluster_resources "${name}" "${ns}" + delete_orphaned_webhooks_for_ns "${ns}" +done + +# Remove nodewright node taints that persist after operator removal. +# Nodewright taints nodes during kernel tuning. The taint key is configurable +# via runtimeRequiredTaint (defaults to skyhook.nvidia.com). + +# Clean up orphaned CRDs that were owned by this bundle's releases. +# Only delete CRDs whose Helm release annotation matches a component we just uninstalled. 
+kubectl get crd -o json 2>/dev/null \ + | jq -r --arg rel "kai-scheduler" --arg ns "kai-scheduler" \ + '.items[] | select(.metadata.annotations["meta.helm.sh/release-name"]==$rel and .metadata.annotations["meta.helm.sh/release-namespace"]==$ns) | .metadata.name' 2>/dev/null \ + | while read -r name; do + echo "Deleting CRD ${name} (owned by kai-scheduler/kai-scheduler)..." + kubectl delete crd "${name}" --ignore-not-found --wait=false \ + || echo "Warning: failed to delete CRD ${name} (owned by kai-scheduler/kai-scheduler); leftovers will surface in post-flight" >&2 + done || echo "Warning: orphan-CRD cleanup for kai-scheduler/kai-scheduler failed (kubectl get / jq error); leftovers will surface in post-flight" >&2 + +# Intentionally skip automatic deletion of unannotated CRDs matched only by +# API group. On shared clusters, those CRDs may be serving another tenant's +# release in the same group, and we do not have bundle-specific ownership +# metadata to distinguish "ours" from "theirs" safely. + +# Warn about CRDs stuck in deleting state (e.g., customresourcecleanup finalizer +# can't be resolved because CR instances still have controller-managed finalizers). +stuck_crds=$(kubectl get crd -o json 2>/dev/null | jq -r '.items[] | select(.metadata.deletionTimestamp != null) | .metadata.name' 2>/dev/null || true) +if [[ -n "${stuck_crds}" ]]; then + echo "" + echo "WARNING: CRDs stuck in deleting state:" + for crd in ${stuck_crds}; do + echo " ${crd}" + done + echo "" + echo " These CRDs have a deletionTimestamp but cannot complete deletion because" + echo " their customresourcecleanup finalizer is waiting for CR instances to be" + echo " removed. If you are certain no data will be lost, you can force-clear the" + echo " finalizers. Note: this may leave orphaned CR data in etcd that is no longer" + echo " accessible through the API." 
+ echo "" + echo " Commands to force-clear (review before running):" + for crd in ${stuck_crds}; do + echo " kubectl patch crd ${crd} --type merge -p '{\"metadata\":{\"finalizers\":null}}'" + done +fi + +# Clean up namespaces after all components are uninstalled. +if [[ "${DELETE_PVCS}" == "true" ]] && ! echo " ${PROTECTED_NS} " | grep -q " kai-scheduler "; then + echo "Deleting PVCs in kai-scheduler..." + kubectl delete pvc --all -n kai-scheduler --ignore-not-found || true +fi +delete_orphaned_webhooks_for_ns "kai-scheduler" +delete_namespace "kai-scheduler" + +# Clean up companion namespaces created at runtime by operators. +# Only emitted for components whose runtime creates them. +delete_namespace "kai-resource-reservation" + +# Wait for terminating namespaces to finish +echo "Waiting for namespaces to terminate..." +for i in $(seq 1 60); do + TERMINATING=$(kubectl get ns --no-headers 2>/dev/null | grep Terminating | awk '{print $1}' || true) + if [[ -z "${TERMINATING}" ]]; then + break + fi + if [[ $i -eq 60 ]]; then + echo "Warning: namespaces still terminating after 60s: ${TERMINATING}" + for ns in ${TERMINATING}; do + force_clear_namespace_finalizers "${ns}" + kubectl delete namespace "${ns}" --ignore-not-found --wait=false 2>/dev/null || true + done + break + fi + sleep 1 +done + +# Final webhook cleanup pass. +delete_orphaned_webhooks_for_ns "kai-scheduler" + +# ============================================================================== +# Post-flight verification +# ============================================================================== + +postflight_issues=false + +TERMINATING=$(kubectl get namespaces -o jsonpath='{range .items[?(@.status.phase=="Terminating")]}{.metadata.name}{" "}{end}' 2>/dev/null || true) +if [[ -n "${TERMINATING}" ]]; then + echo "WARNING: namespaces still terminating: ${TERMINATING}" + echo " A subsequent deploy.sh may fail. Wait or force-finalize these namespaces." 
+ postflight_issues=true +fi + +kubectl get mutatingwebhookconfigurations,validatingwebhookconfigurations -o json 2>/dev/null | \ + jq -r '.items[] | .metadata.name as $wh | .webhooks[]? | select(.clientConfig.service != null) | [$wh, .clientConfig.service.namespace, .clientConfig.service.name] | @tsv' 2>/dev/null | \ + sort -u | \ + while IFS=$'\t' read -r wh_name svc_ns svc_name; do + if ! kubectl get ns "${svc_ns}" &>/dev/null || ! kubectl get svc "${svc_name}" -n "${svc_ns}" &>/dev/null; then + echo "Deleting stale webhook ${wh_name} (service ${svc_ns}/${svc_name} missing)..." + kubectl delete mutatingwebhookconfiguration "${wh_name}" --ignore-not-found 2>/dev/null || true + kubectl delete validatingwebhookconfiguration "${wh_name}" --ignore-not-found 2>/dev/null || true + fi + done || true + +stale_apis=$(kubectl get apiservices -o json 2>/dev/null | jq -r '.items[] | select(.status.conditions[]? | .type == "Available" and .status == "False") | .metadata.name' 2>/dev/null || true) +if [[ -n "${stale_apis}" ]]; then + echo "WARNING: unavailable API services found: ${stale_apis}" + echo " These can block namespace deletion. Delete with: kubectl delete apiservice " + postflight_issues=true +fi + +# Check for Helm-annotated CRDs from uninstalled releases. 
+helm_orphaned_crds="" +explicit_orphaned_crds="" +postflight_all_crds_json="" +if capture_kubectl_json postflight_all_crds_json get crd -o json; then + : + remaining_helm_crds=$(echo "${postflight_all_crds_json}" \ + | jq -r --arg rel "kai-scheduler" --arg ns "kai-scheduler" \ + '.items[] | select(.metadata.annotations["meta.helm.sh/release-name"]==$rel and .metadata.annotations["meta.helm.sh/release-namespace"]==$ns and .metadata.deletionTimestamp==null) | .metadata.name' 2>/dev/null || true) + if [[ -n "${remaining_helm_crds}" ]]; then + helm_orphaned_crds="${helm_orphaned_crds} ${remaining_helm_crds}" + fi + while read -r crd_name; do + [[ -z "${crd_name}" ]] && continue + remaining_explicit_crd=$(echo "${postflight_all_crds_json}" \ + | jq -r --arg name "${crd_name}" \ + '.items[] | select(.metadata.name==$name and .metadata.deletionTimestamp==null) | .metadata.name' 2>/dev/null || true) + if [[ -n "${remaining_explicit_crd}" ]]; then + explicit_orphaned_crds="${explicit_orphaned_crds}${remaining_explicit_crd}"$'\n' + fi + done < <(extra_crds_for_release "kai-scheduler") +else + echo "Warning: failed to enumerate post-flight CRDs; kubectl output: ${postflight_all_crds_json}" >&2 + postflight_issues=true +fi +if [[ -n "${helm_orphaned_crds}" ]]; then + echo "WARNING: Helm-annotated CRDs from uninstalled releases still present:${helm_orphaned_crds}" + echo " Cleanup did not remove all CRDs owned by this bundle's releases." + echo " Delete with: kubectl delete crd " + postflight_issues=true +fi + +explicit_orphaned_crds=$(printf '%s' "${explicit_orphaned_crds}" | awk 'NF' | sort -u | tr '\n' ' ') +if [[ -n "${explicit_orphaned_crds}" ]]; then + echo "WARNING: explicit CRDs from this bundle still present: ${explicit_orphaned_crds}" + echo " These CRDs are installed outside Helm manifest/annotation discovery." 
+ echo " Delete with: kubectl delete crd " + postflight_issues=true +fi + +if [[ "${postflight_issues}" == "true" ]]; then + echo "" + echo "Post-flight: some stale resources remain. Run deploy.sh pre-flight checks to verify before redeploying." +else + echo "Post-flight: cluster is clean." +fi + +echo "Undeployment complete." diff --git a/pkg/bundler/deployer/helm/testdata/manifest_only/001-skyhook-customizations/Chart.yaml b/pkg/bundler/deployer/helm/testdata/manifest_only/001-skyhook-customizations/Chart.yaml new file mode 100644 index 000000000..98481b3f9 --- /dev/null +++ b/pkg/bundler/deployer/helm/testdata/manifest_only/001-skyhook-customizations/Chart.yaml @@ -0,0 +1,20 @@ +# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +apiVersion: v2 +name: skyhook-customizations +description: Generated wrapper chart for skyhook-customizations local content. 
+type: application +version: 0.1.0 +appVersion: "0.1.0" diff --git a/pkg/bundler/deployer/helm/testdata/manifest_only/001-skyhook-customizations/cluster-values.yaml b/pkg/bundler/deployer/helm/testdata/manifest_only/001-skyhook-customizations/cluster-values.yaml new file mode 100644 index 000000000..9c936cbd8 --- /dev/null +++ b/pkg/bundler/deployer/helm/testdata/manifest_only/001-skyhook-customizations/cluster-values.yaml @@ -0,0 +1,2 @@ +# Generated by Cloud Native Stack +--- diff --git a/pkg/bundler/deployer/helm/testdata/manifest_only/001-skyhook-customizations/install.sh b/pkg/bundler/deployer/helm/testdata/manifest_only/001-skyhook-customizations/install.sh new file mode 100644 index 000000000..5b55582b7 --- /dev/null +++ b/pkg/bundler/deployer/helm/testdata/manifest_only/001-skyhook-customizations/install.sh @@ -0,0 +1,23 @@ +#!/usr/bin/env bash +# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +set -euo pipefail +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +cd "${SCRIPT_DIR}" + +helm upgrade --install skyhook-customizations ./ \ + --namespace skyhook --create-namespace \ + -f values.yaml -f cluster-values.yaml \ + ${COMPONENT_WAIT_ARGS:-} ${DRY_RUN_FLAG:-} ${KUBECONFIG_FLAG:-} ${HELM_DEBUG_FLAG:-} diff --git a/pkg/bundler/deployer/helm/testdata/manifest_only/001-skyhook-customizations/templates/customization.yaml b/pkg/bundler/deployer/helm/testdata/manifest_only/001-skyhook-customizations/templates/customization.yaml new file mode 100644 index 000000000..4dee0babb --- /dev/null +++ b/pkg/bundler/deployer/helm/testdata/manifest_only/001-skyhook-customizations/templates/customization.yaml @@ -0,0 +1,6 @@ +# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. + +apiVersion: v1 +kind: ConfigMap +metadata: + name: customization diff --git a/pkg/bundler/deployer/helm/testdata/manifest_only/001-skyhook-customizations/values.yaml b/pkg/bundler/deployer/helm/testdata/manifest_only/001-skyhook-customizations/values.yaml new file mode 100644 index 000000000..9c936cbd8 --- /dev/null +++ b/pkg/bundler/deployer/helm/testdata/manifest_only/001-skyhook-customizations/values.yaml @@ -0,0 +1,2 @@ +# Generated by Cloud Native Stack +--- diff --git a/pkg/bundler/deployer/helm/testdata/manifest_only/README.md b/pkg/bundler/deployer/helm/testdata/manifest_only/README.md new file mode 100644 index 000000000..70e844961 --- /dev/null +++ b/pkg/bundler/deployer/helm/testdata/manifest_only/README.md @@ -0,0 +1,107 @@ +# Cloud Native Stack Deployment + +Recipe Version: v0.1.0 +Bundler Version: v1.0.0 + +Per-component bundle for deploying NVIDIA Cloud Native Stack components +for GPU-accelerated Kubernetes workloads. + +## Configuration + + + +## Components + +The following components are included (deployed in order). 
Each component +lives in a numbered `NNN-/` folder and is installed as a Helm release +via its own `install.sh`: + +| Component | Version | Namespace | Source | +|-----------|---------|-----------|--------| +| skyhook-customizations | N/A | skyhook | local | + + + + +## Quick Start + +Run the included deployment script: + +```bash +chmod +x deploy.sh +./deploy.sh +``` + +Use `--no-wait` to skip Helm chart-level waiting where AICR uses `--wait` (keeps `--timeout` for hooks): + +```bash +./deploy.sh --no-wait +``` + +> **Note:** The deploy script's final status reflects install/apply results. If `--best-effort` was used, one or more components may still have failed; check warning lines and logs. This does **not** guarantee the cluster is ready to schedule workloads — operator-driven cluster convergence (CRD reconciliation, node tuning, plugin registration, etc.) continues asynchronously after the script exits, in operator-specific ways. See the [AICR CLI Reference](https://github.com/NVIDIA/aicr/blob/main/docs/user/cli-reference.md#deploy-script-behavior-deploysh) for details. + +## Manual Installation + +Each component folder contains an `install.sh` that runs `helm upgrade --install` +with the right arguments baked in. To install a single component manually: + +```bash +cd NNN- +bash install.sh +``` + +## Customization + +Each component folder has its own `values.yaml` (static) and `cluster-values.yaml` +(dynamic, per-cluster). 
Edit either before deploying: + +```bash +vim NNN-/values.yaml +vim NNN-/cluster-values.yaml +``` + +## Upgrade + +Re-run the per-component install.sh to upgrade an already-installed release: + +```bash +cd NNN- +bash install.sh +``` + +## Uninstall + +To remove components (reverse order): + +```bash +./undeploy.sh +``` + +Or remove a single release manually: + +```bash +helm uninstall skyhook-customizations -n skyhook +``` + + +## Troubleshooting + +### Check deployment status + +```bash +kubectl get pods -A | grep -E 'skyhook-customizations' +``` + +### View component logs + +Inspect a single component's pods (replace `` and `` +with one of the entries from the table above): + +```bash +kubectl logs -n -l app.kubernetes.io/instance= +``` + + +## References + +- [AICR CLI Reference](https://github.com/NVIDIA/aicr/blob/main/docs/user/cli-reference.md) diff --git a/pkg/bundler/deployer/helm/testdata/manifest_only/deploy.sh b/pkg/bundler/deployer/helm/testdata/manifest_only/deploy.sh new file mode 100644 index 000000000..4b2bcd645 --- /dev/null +++ b/pkg/bundler/deployer/helm/testdata/manifest_only/deploy.sh @@ -0,0 +1,321 @@ +#!/usr/bin/env bash +# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +set -euo pipefail + +# Cloud Native Stack Deployment Script +# Generated by AICR Bundler v1.0.0 +# +# Usage: ./deploy.sh [--no-wait] [--best-effort] [--retries N] +# --no-wait Skip Helm chart-level wait where AICR uses --wait (keeps --timeout for hooks) +# --best-effort Continue past individual component failures (log warnings) +# --retries N Retry failed helm/kubectl operations N times with backoff (default: 5, 0 = fail-fast) +# +# This script is optional — each component subdirectory has its own install.sh +# with a single baked-in `helm upgrade --install` command. For detailed behavior +# docs (CRD ordering, async components, error handling), see the AICR CLI Reference: +# https://github.com/NVIDIA/aicr/blob/main/docs/user/cli-reference.md#deploy-script-behavior-deploysh + +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" + +# Run helm commands from a temp directory to prevent local chart directories +# (e.g., bundle/001-nodewright-operator/) from shadowing remote chart references. +HELM_WORKDIR="$(mktemp -d)" +trap 'rm -rf "${HELM_WORKDIR}"; exit 130' INT TERM +trap 'rm -rf "${HELM_WORKDIR}"' EXIT + +HELM_TIMEOUT="10m" +NO_WAIT=false +BEST_EFFORT=false +FAILED_COMPONENTS="" +MAX_RETRIES=5 + +while [[ $# -gt 0 ]]; do + case "$1" in + --no-wait) NO_WAIT=true; shift ;; + --best-effort) BEST_EFFORT=true; shift ;; + --retries) + if [[ $# -lt 2 ]]; then echo "Error: --retries requires a value"; exit 1; fi + if ! [[ "$2" =~ ^[0-9]+$ ]]; then echo "Error: --retries requires a non-negative integer"; exit 1; fi + MAX_RETRIES="$2"; shift 2 ;; + *) echo "Error: unknown option: $1"; echo "Usage: ./deploy.sh [--no-wait] [--best-effort] [--retries N]"; exit 1 ;; + esac +done + +# Export env vars consumed by each folder's install.sh (rendered by localformat). +# DRY_RUN_FLAG / KUBECONFIG_FLAG / HELM_DEBUG_FLAG default to empty strings. 
+export DRY_RUN_FLAG="${DRY_RUN_FLAG:-}" +export KUBECONFIG_FLAG="${KUBECONFIG_FLAG:-}" +export HELM_DEBUG_FLAG="${HELM_DEBUG_FLAG:-}" + +function helm_failed() { + if [[ "${BEST_EFFORT}" == "true" ]]; then + echo "WARNING: $1 install failed, continuing (--best-effort)" + FAILED_COMPONENTS="${FAILED_COMPONENTS} $1" + else + exit 1 + fi +} + +# Compute backoff delay from attempt number (1-indexed). +# Examples: attempt 1→5s, 2→20s, 3→45s, 4→80s, 5→120s (cap) +function backoff_seconds() { + local attempt=$1 + local seconds=$(( attempt * attempt * 5 )) + if [[ ${seconds} -gt 120 ]]; then seconds=120; fi + echo "${seconds}" +} + +# Clean up stale Helm hook Jobs before retrying. When a hook Job (e.g., +# crd-upgrader) times out or fails, it stays in the namespace and blocks +# subsequent install attempts with "Job not ready" errors. +# Helm hooks are identified by the helm.sh/hook *annotation* (not a label), +# so we list all non-succeeded Jobs and check each individually via JSON. +function cleanup_helm_hooks() { + local namespace="$1" + local job_names + job_names=$(kubectl get jobs -n "${namespace}" \ + --field-selector=status.successful=0 \ + -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' \ + 2>/dev/null || true) + if [[ -z "${job_names}" ]]; then + return + fi + while IFS= read -r name; do + [[ -z "${name}" ]] && continue + # Get the full Job JSON to reliably check annotations and status + local job_json + job_json=$(kubectl get job "${name}" -n "${namespace}" -o json 2>/dev/null || true) + [[ -z "${job_json}" ]] && continue + # Skip non-hook Jobs (no helm.sh/hook annotation) + local hook_val + hook_val=$(echo "${job_json}" | grep -o '"helm.sh/hook"' || true) + [[ -z "${hook_val}" ]] && continue + # Capture diagnostics before deleting. This helps diagnose transient hook + # failures (e.g., dynamo ssh-keygen) that are otherwise lost after cleanup. 
+ echo " --- Failed hook Job ${name} diagnostics ---" + kubectl describe job "${name}" -n "${namespace}" 2>/dev/null | tail -50 || true + local pod_names + pod_names=$(kubectl get pods -n "${namespace}" -l "job-name=${name}" \ + -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' 2>/dev/null || true) + for pod_name in ${pod_names}; do + echo " --- Hook pod ${pod_name} describe ---" + kubectl describe pod "${pod_name}" -n "${namespace}" 2>/dev/null | tail -50 || true + done + echo " --- End diagnostics for ${name} ---" + # Delete any non-succeeded hook Job. This function only runs after a Helm + # failure, so any hook Job without a successful completion is blocking the + # retry — whether it failed, is stuck Pending (timed out before the pod + # started), or is still active with a stuck container. + echo " Cleaning up stale Helm hook Job ${name} in ${namespace}..." + kubectl delete job "${name}" -n "${namespace}" --ignore-not-found 2>/dev/null || true + done <<< "${job_names}" +} + +function dump_kai_scheduler_helm_diagnostics() { + local namespace="$1" + if [[ "${namespace}" != "kai-scheduler" ]]; then + return + fi + + echo " --- ${namespace} diagnostics ---" + echo " Jobs:" + kubectl get jobs -n "${namespace}" 2>/dev/null || true + echo " Job descriptions:" + kubectl describe jobs -n "${namespace}" 2>/dev/null || true + echo " Pods:" + kubectl get pods -n "${namespace}" -o wide 2>/dev/null || true + echo " Pod descriptions:" + kubectl describe pods -n "${namespace}" 2>/dev/null || true + echo " Recent events:" + kubectl get events -n "${namespace}" --sort-by='.lastTimestamp' 2>/dev/null | tail -30 || true + echo " --- End ${namespace} diagnostics ---" +} + +# Components that use operator patterns with custom resources that reconcile +# asynchronously. Helm --wait may time out waiting for CR readiness even though +# all pods start successfully. These components are installed without --wait. 
+ASYNC_COMPONENTS="kai-scheduler" + +# ============================================================================== +# Pre-flight checks +# ============================================================================== +# Verify the cluster is clean before deploying. Stale webhooks, terminating +# namespaces, and orphaned API services from a previous install can block pod +# creation and namespace deletion, causing silent deployment failures. + +echo "Running pre-flight checks..." + +preflight_failed=false + +# Bundle namespace list (deduplicated) +BUNDLE_NAMESPACES=$(echo "skyhook " | tr ' ' '\n' | sort -u | tr '\n' ' ') + +# Check for terminating namespaces that overlap with our components +for ns in ${BUNDLE_NAMESPACES}; do + phase=$(kubectl get ns "${ns}" -o jsonpath='{.status.phase}' 2>/dev/null || true) + if [[ "${phase}" == "Terminating" ]]; then + echo "ERROR: namespace '${ns}' is still terminating from a previous install." + echo " Wait for it to finish, or force-finalize with:" + echo " kubectl get ns ${ns} -o json | jq '.spec.finalizers=[]' | kubectl replace --raw /api/v1/namespaces/${ns}/finalize -f -" + preflight_failed=true + fi +done + +# Check for stale webhooks whose backing services no longer exist. +# Scoped to bundle namespaces only to avoid false positives from unrelated +# platform webhooks in shared clusters. 
+if command -v jq &>/dev/null; then + for kind in mutatingwebhookconfigurations validatingwebhookconfigurations; do + while IFS=$'\t' read -r wh_name svc_ns svc_name; do + # Only check webhooks pointing to our bundle namespaces + is_bundle_ns=false + for ns in ${BUNDLE_NAMESPACES}; do + [[ "${svc_ns}" == "${ns}" ]] && is_bundle_ns=true && break + done + [[ "${is_bundle_ns}" == "false" ]] && continue + + # Use explicit NotFound check to avoid false positives from transient errors + svc_check=$(kubectl get svc "${svc_name}" -n "${svc_ns}" 2>&1) || true + if echo "${svc_check}" | grep -q "NotFound\|not found"; then + echo "ERROR: ${kind} '${wh_name}' references non-existent service ${svc_ns}/${svc_name}." + echo " This will block pod/resource creation. Delete with: kubectl delete ${kind} ${wh_name}" + preflight_failed=true + fi + done < <(kubectl get "${kind}" -o json 2>/dev/null | \ + jq -r '.items[] | .metadata.name as $wh | .webhooks[]? | select(.clientConfig.service != null) | [$wh, .clientConfig.service.namespace, .clientConfig.service.name] | @tsv' 2>/dev/null || true) + done +else + echo "NOTE: jq not found — skipping webhook pre-flight checks. Install jq for full pre-flight validation." +fi + +# Check for stale API services (e.g., custom.metrics.k8s.io from prometheus-adapter) +if command -v jq &>/dev/null; then + for api_svc in $(kubectl get apiservices -o json 2>/dev/null | jq -r '.items[] | select(.status.conditions[]? | .type == "Available" and .status == "False") | .metadata.name' 2>/dev/null || true); do + echo "WARNING: API service '${api_svc}' is unavailable. This can block namespace deletion." + echo " Delete with: kubectl delete apiservice ${api_svc}" + # API service issues are warnings, not hard failures — they don't block deployment directly + done +else + echo "NOTE: jq not found — skipping API service pre-flight checks." +fi + +# Check for orphaned CRDs from previous deployments. 
+# Scoped to CRD groups belonging to components in this bundle to avoid +# false positives from unrelated platform installs on shared clusters. +ORPHANED_CRD_GROUPS="" +for group in ${ORPHANED_CRD_GROUPS}; do + orphaned=$(kubectl get crd -o name 2>/dev/null | grep "\.${group}$" || true) + if [[ -n "${orphaned}" ]]; then + echo "WARNING: orphaned CRDs from previous deployment: ${orphaned}" + echo " These may cause conflicts. Delete with: kubectl delete ${orphaned}" + fi +done + +# Check for stale nodewright node taints from a previous deployment. +# Only remove taints if nodewright-operator is NOT already running (i.e., fresh deploy). +# If the operator is running, taints are legitimate scheduling guards. + +if [[ "${preflight_failed}" == "true" ]]; then + echo "" + echo "Pre-flight checks failed. Fix the issues above before deploying." + echo "To clean up a previous install, run ./undeploy.sh first, then retry." + exit 1 +fi + +echo "Pre-flight checks passed." +echo "Deploying Cloud Native Stack components..." + +# ============================================================================== +# Install loop +# ============================================================================== +# Generic install loop. Each folder's install.sh is rendered by localformat +# with the right helm command baked in — deploy.sh has no per-component +# knowledge here. Per-component special-case logic (async wait, DRA plugin +# restart) runs in the post-install blocks below, matched by component name. +cd "${HELM_WORKDIR}" +for dir in "${SCRIPT_DIR}"/[0-9][0-9][0-9]-*/; do + [[ -d "${dir}" ]] || continue + dir="${dir%/}" + base="${dir##*/}" + name="${base#[0-9][0-9][0-9]-}" + # Source the namespace from the folder's install.sh — not the folder + # basename — because the helm release name and its target namespace can + # differ (e.g. nodewright-operator → namespace skyhook; gpu-operator-post → + # namespace gpu-operator). 
cleanup_helm_hooks and the kai diagnostics + # both operate on the namespace. + namespace=$(awk '{ for (i=1;i/dev/null; then + echo "Error: jq is required but not found in PATH." + echo " Install jq: https://jqlang.github.io/jq/download/" + exit 1 +fi + +echo "Undeploying Cloud Native Stack components (timeout: ${HELM_TIMEOUT}s)..." +echo "" + +echo " The following components will be removed (in reverse install order):" +# List NNN-* folders in reverse numeric order +for dir in $(ls -1d "${SCRIPT_DIR}"/[0-9][0-9][0-9]-*/ 2>/dev/null | sort -r); do + [[ -d "${dir}" ]] || continue + base="${dir%/}" + base="${base##*/}" + name="${base#[0-9][0-9][0-9]-}" + printf " %-40s\n" "${name}" +done +echo "" + +# System namespaces that must not be deleted +PROTECTED_NS="kube-system kube-public kube-node-lease default" + +delete_namespace() { + local ns="$1" + if [[ "${KEEP_NS}" == "true" ]]; then return; fi + if echo " ${PROTECTED_NS} " | grep -q " ${ns} "; then return; fi + if ! kubectl get namespace "${ns}" &>/dev/null; then return; fi + echo "Deleting namespace ${ns}..." + kubectl delete namespace "${ns}" --ignore-not-found --wait=false +} + +# Uninstall a Helm release, handling stuck pending states from interrupted deploys. +# Try normal uninstall first; if it fails, retry with --no-hooks to force removal. +helm_force_uninstall() { + local release="$1" + local ns="$2" + if helm uninstall "${release}" -n "${ns}" --timeout "${HELM_TIMEOUT}s" --ignore-not-found 2>/dev/null; then + return + fi + echo " Retrying ${release} removal with --no-hooks..." + helm uninstall "${release}" -n "${ns}" --no-hooks --timeout "${HELM_TIMEOUT}s" --ignore-not-found || true +} + +# Delete cluster-scoped resources owned by a Helm release. 
+# These survive namespace deletion and can block subsequent deployments: +# - Webhooks block pod creation when their backing service is gone +# - CRDs with "helm.sh/resource-policy: keep" are retained after chart removal +delete_release_cluster_resources() { + local release="$1" + local ns="$2" + local selector="app.kubernetes.io/managed-by=Helm" + for kind in mutatingwebhookconfigurations validatingwebhookconfigurations customresourcedefinitions; do + kubectl get "${kind}" -l "${selector}" -o json 2>/dev/null \ + | jq -r --arg rel "${release}" --arg ns "${ns}" \ + '.items[] | select(.metadata.annotations["meta.helm.sh/release-name"]==$rel and .metadata.annotations["meta.helm.sh/release-namespace"]==$ns) | .metadata.name' 2>/dev/null \ + | while read -r name; do + echo "Deleting ${kind}/${name}..." + kubectl delete "${kind}" "${name}" --ignore-not-found --timeout="${HELM_TIMEOUT}s" || true + done || echo "Warning: ${kind} cleanup pipeline for release ${release}/${ns} failed (kubectl get / jq error); leftovers will surface in post-flight" >&2 + done +} + +# Delete webhooks whose backing service is in a specific namespace and no longer exists. +# Scoped to the given namespace to avoid touching unrelated platform webhooks. +# Operator-created webhooks (e.g., kai-scheduler admission) may not carry Helm labels, +# but once their service namespace is deleted, fail-closed webhooks block pod creation. +delete_orphaned_webhooks_for_ns() { + local ns="$1" + for kind in mutatingwebhookconfigurations validatingwebhookconfigurations; do + { kubectl get "${kind}" -o json 2>/dev/null \ + | jq -r --arg ns "${ns}" \ + '.items[] | .metadata.name as $wh | .webhooks[] | select(.clientConfig.service != null and .clientConfig.service.namespace == $ns) | [$wh, .clientConfig.service.name] | @tsv' 2>/dev/null \ + | sort -u || true; } \ + | while IFS=$'\t' read -r wh_name svc_name; do + # Delete when namespace is gone, terminating, or backing service is missing. 
+ # Skip on transient errors (auth, timeout, DNS) to avoid removing valid webhooks. + local ns_output ns_phase svc_output + ns_output=$(kubectl get ns "${ns}" 2>&1) || true + if echo "${ns_output}" | grep -q "NotFound"; then + echo "Deleting orphaned ${kind}/${wh_name} (namespace ${ns} not found)..." + kubectl delete "${kind}" "${wh_name}" --ignore-not-found + continue + fi + ns_phase=$(kubectl get ns "${ns}" -o jsonpath='{.status.phase}' 2>/dev/null) || true + if [[ "${ns_phase}" == "Terminating" ]]; then + echo "Deleting orphaned ${kind}/${wh_name} (namespace ${ns} terminating)..." + kubectl delete "${kind}" "${wh_name}" --ignore-not-found + continue + fi + svc_output=$(kubectl get svc "${svc_name}" -n "${ns}" 2>&1) || true + if echo "${svc_output}" | grep -q "NotFound"; then + echo "Deleting orphaned ${kind}/${wh_name} (service ${ns}/${svc_name} not found)..." + kubectl delete "${kind}" "${wh_name}" --ignore-not-found + fi + done + done +} + +# Force-clear finalizers on all namespaced resources to unstick a Terminating namespace. +force_clear_namespace_finalizers() { + local ns="$1" + echo "Force-removing finalizers in namespace ${ns}..." 
+ local kinds + kinds=$(kubectl api-resources --verbs=list --namespaced -o name 2>/dev/null) || { + echo "Warning: failed to enumerate namespaced resource kinds in ${ns}; namespace may stay Terminating" >&2 + return + } + for kind in ${kinds}; do + kubectl get "${kind}" -n "${ns}" -o json 2>/dev/null \ + | jq -r '.items[] | select(.metadata.finalizers // [] | length > 0) | .kind + "/" + .metadata.name' 2>/dev/null \ + | while read -r resource; do + kubectl patch "${resource}" -n "${ns}" --type merge -p '{"metadata":{"finalizers":null}}' 2>/dev/null || true + done || echo "Warning: finalizer-clear pipeline for ${kind} in ${ns} failed (kubectl get / jq error); namespace may stay Terminating" >&2 + done +} + +# Return a small, explicit list of known crds/-installed CRDs whose +# finalizer-bearing custom resources must still be caught before the operator +# is removed. Keep this list intentionally tiny and exact: if a release is not +# listed here, pre-flight relies on chart-manifest and Helm-annotation discovery +# only instead of trying to infer ownership across the whole cluster. 
+extra_crds_for_release() { + case "$1" in + gpu-operator) + printf '%s\n' \ + "clusterpolicies.nvidia.com" \ + "nvidiadrivers.nvidia.com" \ + "nodefeaturegroups.nfd.k8s-sigs.io" \ + "nodefeaturerules.nfd.k8s-sigs.io" \ + "nodefeatures.nfd.k8s-sigs.io" + ;; + kai-scheduler) + printf '%s\n' \ + "bindrequests.scheduling.run.ai" \ + "configs.kai.scheduler" \ + "podgroups.scheduling.run.ai" \ + "queues.scheduling.run.ai" \ + "schedulingshards.kai.scheduler" \ + "topologies.kai.scheduler" + ;; + k8s-nim-operator) + printf '%s\n' \ + "nemocustomizers.apps.nvidia.com" \ + "nemodatastores.apps.nvidia.com" \ + "nemoentitystores.apps.nvidia.com" \ + "nemoevaluators.apps.nvidia.com" \ + "nemoguardrails.apps.nvidia.com" \ + "nimbuilds.apps.nvidia.com" \ + "nimcaches.apps.nvidia.com" \ + "nimpipelines.apps.nvidia.com" \ + "nimservices.apps.nvidia.com" + ;; + kubeflow-trainer) + printf '%s\n' \ + "clustertrainingruntimes.trainer.kubeflow.org" \ + "trainjobs.trainer.kubeflow.org" \ + "trainingruntimes.trainer.kubeflow.org" \ + "jobsets.jobset.x-k8s.io" + ;; + kube-prometheus-stack) + printf '%s\n' \ + "alertmanagerconfigs.monitoring.coreos.com" \ + "alertmanagers.monitoring.coreos.com" \ + "podmonitors.monitoring.coreos.com" \ + "probes.monitoring.coreos.com" \ + "prometheusagents.monitoring.coreos.com" \ + "prometheuses.monitoring.coreos.com" \ + "prometheusrules.monitoring.coreos.com" \ + "scrapeconfigs.monitoring.coreos.com" \ + "servicemonitors.monitoring.coreos.com" \ + "thanosrulers.monitoring.coreos.com" + ;; + dynamo-platform) + printf '%s\n' \ + "podcliques.grove.io" \ + "podcliquescalinggroups.grove.io" \ + "podcliquesets.grove.io" \ + "podgangs.scheduler.grove.io" + ;; + network-operator) + # This explicit list matches the CRDs enabled by the bundled values: + # nfd=false, sriovNetworkOperator=false, maintenance-operator disabled. 
+ # Intentionally exclude networkattachmentdefinitions.k8s.cni.cncf.io: + # it is a broadly shared CRD, so surfacing or deleting it based only on + # this release would create cross-cluster noise. + printf '%s\n' \ + "nicclusterpolicies.mellanox.com" \ + "hostdevicenetworks.mellanox.com" \ + "ipoibnetworks.mellanox.com" \ + "macvlannetworks.mellanox.com" + ;; + *) ;; + esac +} + +# Skip pre-flight for releases whose bundle-managed custom resources are +# deleted from manifests before the controller is uninstalled. +skip_preflight_for_release() { + case "$1" in + nodewright-operator|kgateway) return 0 ;; + *) return 1 ;; + esac +} + +# Run `kubectl ... -o json` while keeping stdout parseable for jq. +capture_kubectl_json() { + local out_var="$1" + shift + local stderr_file output kubectl_err="" + + if ! stderr_file=$(mktemp "${TMPDIR:-/tmp}/.aicr_kubectl_stderr_XXXXXX"); then + printf -v "${out_var}" '%s' 'mktemp failed while capturing kubectl stderr' + return 1 + fi + + if ! output=$(kubectl "$@" 2>"${stderr_file}"); then + kubectl_err=$(cat "${stderr_file}" 2>/dev/null || true) + rm -f "${stderr_file}" + printf -v "${out_var}" '%s' "${kubectl_err}" + return 1 + fi + + if [[ -s "${stderr_file}" ]]; then + cat "${stderr_file}" >&2 || true + fi + rm -f "${stderr_file}" + printf -v "${out_var}" '%s' "${output}" + return 0 +} + +check_crd_for_stuck_resources() { + local crd_name="$1" + local component="$2" + local crd_json plural group scope resource stuck stuck_json kubectl_err + local item_name item_namespace item_finalizers + + if ! capture_kubectl_json crd_json get crd "${crd_name}" -o json; then + kubectl_err="${crd_json}" + echo "" >&2 + echo "ERROR: Pre-flight could not inspect CRD '${crd_name}' (release ${component})." >&2 + echo " kubectl output: ${kubectl_err}" >&2 + echo " Failing closed: if this is a transient API/auth error the script" >&2 + echo " may remove the operator before finalizers are reconciled." 
>&2 + echo " Re-run after the API recovers, or use --skip-preflight to bypass." >&2 + return 1 + fi + read -r plural group scope < <(echo "${crd_json}" | jq -r '[.spec.names.plural, .spec.group, .spec.scope] | @tsv' 2>/dev/null) || return 0 + [[ -z "${plural}" || "${plural}" == "null" ]] && return 0 + + resource="${plural}.${group}" + local jq_filter='[.metadata.finalizers // [] | .[] | select(startswith("kubernetes.io/") | not)]' + stuck="" + if [[ "${scope}" == "Namespaced" ]]; then + if ! capture_kubectl_json stuck_json get "${resource}" -A -o json; then + echo "" >&2 + echo "ERROR: Pre-flight could not list '${resource}' (release ${component})." >&2 + echo " kubectl output: ${stuck_json}" >&2 + echo " Failing closed so we do not silently miss a stuck CR." >&2 + echo " Re-run after the API recovers, or use --skip-preflight to bypass." >&2 + return 1 + fi + while IFS=$'\x1f' read -r item_namespace item_name item_finalizers; do + stuck="${stuck} ${item_namespace}/${item_name} finalizers=[${item_finalizers}]"$'\n' + done < <(echo "${stuck_json}" \ + | jq -r '.items[] | ('"${jq_filter}"') as $f | select(($f | length) > 0) | [(.metadata.namespace // ""), .metadata.name, ($f | join(","))] | join("\u001f")' 2>/dev/null || true) + else + if ! capture_kubectl_json stuck_json get "${resource}" -o json; then + echo "" >&2 + echo "ERROR: Pre-flight could not list '${resource}' (release ${component})." >&2 + echo " kubectl output: ${stuck_json}" >&2 + echo " Failing closed so we do not silently miss a stuck CR." >&2 + echo " Re-run after the API recovers, or use --skip-preflight to bypass." 
>&2 + return 1 + fi + while IFS=$'\x1f' read -r item_name item_finalizers; do + stuck="${stuck} ${item_name} finalizers=[${item_finalizers}]"$'\n' + done < <(echo "${stuck_json}" \ + | jq -r '.items[] | ('"${jq_filter}"') as $f | select(($f | length) > 0) | [.metadata.name, ($f | join(","))] | join("\u001f")' 2>/dev/null || true) + fi + + if [[ -n "${stuck}" ]]; then + { + echo " ${component} — ${crd_name}:" + printf '%s' "${stuck}" + } >> "${PREFLIGHT_DETAILS}" + fi +} + +check_release_for_stuck_crds() { + local release="$1" + local ns="$2" + local manifest manifest_crds annotated_crds explicit_crds + local all_crds_json kubectl_err crd_name + manifest=$(helm get manifest "${release}" -n "${ns}" 2>/dev/null || true) + manifest_crds=$(echo "${manifest}" \ + | awk '/^kind:/{kind=$2} /^ name:/ && kind=="CustomResourceDefinition"{print $2; kind=""}') + + all_crds_json="${PREFLIGHT_ALL_CRDS_JSON:-}" + if [[ -z "${all_crds_json}" ]]; then + if ! capture_kubectl_json all_crds_json get crd -o json; then + kubectl_err="${all_crds_json}" + echo "" >&2 + echo "ERROR: Pre-flight could not list CRDs for release '${release}'." >&2 + echo " kubectl output: ${kubectl_err}" >&2 + echo " Failing closed: if this is a transient API/auth error the script" >&2 + echo " may remove the operator before finalizers are reconciled." >&2 + echo " Re-run after the API recovers, or use --skip-preflight to bypass." 
>&2 + return 1 + fi + fi + + annotated_crds=$(echo "${all_crds_json}" \ + | jq -r --arg rel "${release}" --arg ns "${ns}" \ + '.items[] | select(.metadata.annotations["meta.helm.sh/release-name"]==$rel and .metadata.annotations["meta.helm.sh/release-namespace"]==$ns) | .metadata.name' 2>/dev/null || true) + explicit_crds="" + while read -r crd_name; do + [[ -z "${crd_name}" ]] && continue + explicit_crds="${explicit_crds}$(echo "${all_crds_json}" \ + | jq -r --arg name "${crd_name}" --arg rel "${release}" --arg ns "${ns}" \ + '.items[] + | select(.metadata.name == $name) + | select( + ((.metadata.annotations["meta.helm.sh/release-name"] // "") == "") + or + ( + .metadata.annotations["meta.helm.sh/release-name"] == $rel + and + .metadata.annotations["meta.helm.sh/release-namespace"] == $ns + ) + ) + | .metadata.name' 2>/dev/null || true)"$'\n' + done < <(extra_crds_for_release "${release}") + [[ -z "${manifest_crds}" && -z "${annotated_crds}" && -z "${explicit_crds}" ]] && return 0 + printf '%s\n%s\n%s\n' "${manifest_crds}" "${annotated_crds}" "${explicit_crds}" \ + | awk 'NF' \ + | sort -u \ + | while read -r crd_name; do + check_crd_for_stuck_resources "${crd_name}" "${release}" + done + return 0 +} + +# ============================================================================== +# Pre-flight checks +# ============================================================================== +# Verify no custom resources with active finalizers exist for CRDs owned by +# bundle operators. After helm uninstall removes the operator, CRs with +# finalizers cannot be reconciled — blocking CRD deletion indefinitely. + +if [[ "${SKIP_PREFLIGHT}" == "true" ]]; then + echo "Skipping pre-flight checks (--skip-preflight)." +else + echo "Running pre-flight checks..." + PREFLIGHT_DETAILS=$(mktemp "${TMPDIR:-/tmp}/.aicr_preflight_XXXXXX") + PREFLIGHT_ALL_CRDS_JSON="" + + if ! 
capture_kubectl_json PREFLIGHT_ALL_CRDS_JSON get crd -o json; then + echo "" >&2 + echo "ERROR: Pre-flight could not list CRDs." >&2 + echo " kubectl output: ${PREFLIGHT_ALL_CRDS_JSON}" >&2 + echo " Failing closed: if this is a transient API/auth error the script" >&2 + echo " may remove an operator before finalizers are reconciled." >&2 + echo " Re-run after the API recovers, or use --skip-preflight to bypass." >&2 + rm -f "${PREFLIGHT_DETAILS}" + exit 1 + fi + if skip_preflight_for_release "skyhook-customizations"; then + echo " Skipping skyhook-customizations (skyhook): bundle deletes dependent manifests before controller uninstall." + else + echo " Checking skyhook-customizations (skyhook)..." + check_release_for_stuck_crds "skyhook-customizations" "skyhook" + fi + + if [[ -s "${PREFLIGHT_DETAILS}" ]]; then + echo "" + echo "ERROR: Found custom resources with active finalizers that will block undeploy." + echo " After the operator is removed, these finalizers cannot be processed —" + echo " causing an unrecoverable hang during CRD deletion." + echo "" + echo " Delete these resources while their controller is still running," + echo " then re-run ./undeploy.sh" + echo "" + cat "${PREFLIGHT_DETAILS}" + echo "" + echo " To skip this check: ./undeploy.sh --skip-preflight" + rm -f "${PREFLIGHT_DETAILS}" + exit 1 + fi + rm -f "${PREFLIGHT_DETAILS}" + + echo "Pre-flight checks passed." +fi + +# ============================================================================== +# Uninstall components in reverse install order +# ============================================================================== +# Generic reverse loop: every folder is a Helm release (local-helm or upstream-helm). +# `helm uninstall ` works uniformly for both kinds — that's one of the +# benefits of the uniform local-chart bundle format. 
+for dir in $(ls -1d "${SCRIPT_DIR}"/[0-9][0-9][0-9]-*/ 2>/dev/null | sort -r); do + [[ -d "${dir}" ]] || continue + base="${dir%/}" + base="${base##*/}" + name="${base#[0-9][0-9][0-9]-}" + # Derive the namespace from the component name via the per-component blocks + # below. We still need it for helm_force_uninstall and cluster-resource cleanup. + ns="" + if [[ "${name}" == "skyhook-customizations" ]]; then ns="skyhook"; fi + # Injected mixed-component "-post" folders share their parent's namespace. + if [[ "${name}" == "skyhook-customizations-post" ]]; then ns="skyhook"; fi + if [[ -z "${ns}" ]]; then + echo "Warning: no namespace known for ${name}; skipping uninstall" >&2 + continue + fi + echo "Uninstalling ${name} (${ns})..." + # For local-helm folders (Chart.yaml + templates/), kubectl-delete the + # rendered templates BEFORE helm uninstall. The templates may carry + # helm.sh/hook annotations (post-install/post-upgrade) which helm does + # not track or clean up on uninstall — so without this pre-delete, the + # operator gets removed but its hook-created CRs (and their finalizers) + # linger. Doing the delete while the controller is still running lets + # finalizers clear naturally. + if [[ -d "${dir}/templates" ]]; then + for tpl in "${dir}/templates/"*.yaml; do + [[ -f "${tpl}" ]] || continue + kubectl delete -n "${ns}" -f "${tpl}" --ignore-not-found --timeout="${HELM_TIMEOUT}s" || true + done + fi + helm_force_uninstall "${name}" "${ns}" + delete_release_cluster_resources "${name}" "${ns}" + delete_orphaned_webhooks_for_ns "${ns}" +done + +# Remove nodewright node taints that persist after operator removal. +# Nodewright taints nodes during kernel tuning. The taint key is configurable +# via runtimeRequiredTaint (defaults to skyhook.nvidia.com). + +# Clean up orphaned CRDs that were owned by this bundle's releases. +# Only delete CRDs whose Helm release annotation matches a component we just uninstalled. 
+kubectl get crd -o json 2>/dev/null \ + | jq -r --arg rel "skyhook-customizations" --arg ns "skyhook" \ + '.items[] | select(.metadata.annotations["meta.helm.sh/release-name"]==$rel and .metadata.annotations["meta.helm.sh/release-namespace"]==$ns) | .metadata.name' 2>/dev/null \ + | while read -r name; do + echo "Deleting CRD ${name} (owned by skyhook-customizations/skyhook)..." + kubectl delete crd "${name}" --ignore-not-found --wait=false \ + || echo "Warning: failed to delete CRD ${name} (owned by skyhook-customizations/skyhook); leftovers will surface in post-flight" >&2 + done || echo "Warning: orphan-CRD cleanup for skyhook-customizations/skyhook failed (kubectl get / jq error); leftovers will surface in post-flight" >&2 + +# Intentionally skip automatic deletion of unannotated CRDs matched only by +# API group. On shared clusters, those CRDs may be serving another tenant's +# release in the same group, and we do not have bundle-specific ownership +# metadata to distinguish "ours" from "theirs" safely. + +# Warn about CRDs stuck in deleting state (e.g., customresourcecleanup finalizer +# can't be resolved because CR instances still have controller-managed finalizers). +stuck_crds=$(kubectl get crd -o json 2>/dev/null | jq -r '.items[] | select(.metadata.deletionTimestamp != null) | .metadata.name' 2>/dev/null || true) +if [[ -n "${stuck_crds}" ]]; then + echo "" + echo "WARNING: CRDs stuck in deleting state:" + for crd in ${stuck_crds}; do + echo " ${crd}" + done + echo "" + echo " These CRDs have a deletionTimestamp but cannot complete deletion because" + echo " their customresourcecleanup finalizer is waiting for CR instances to be" + echo " removed. If you are certain no data will be lost, you can force-clear the" + echo " finalizers. Note: this may leave orphaned CR data in etcd that is no longer" + echo " accessible through the API." 
+ echo "" + echo " Commands to force-clear (review before running):" + for crd in ${stuck_crds}; do + echo " kubectl patch crd ${crd} --type merge -p '{\"metadata\":{\"finalizers\":null}}'" + done +fi + +# Clean up namespaces after all components are uninstalled. +if [[ "${DELETE_PVCS}" == "true" ]] && ! echo " ${PROTECTED_NS} " | grep -q " skyhook "; then + echo "Deleting PVCs in skyhook..." + kubectl delete pvc --all -n skyhook --ignore-not-found || true +fi +delete_orphaned_webhooks_for_ns "skyhook" +delete_namespace "skyhook" + +# Clean up companion namespaces created at runtime by operators. +# Only emitted for components whose runtime creates them. + +# Wait for terminating namespaces to finish +echo "Waiting for namespaces to terminate..." +for i in $(seq 1 60); do + TERMINATING=$(kubectl get ns --no-headers 2>/dev/null | grep Terminating | awk '{print $1}' || true) + if [[ -z "${TERMINATING}" ]]; then + break + fi + if [[ $i -eq 60 ]]; then + echo "Warning: namespaces still terminating after 60s: ${TERMINATING}" + for ns in ${TERMINATING}; do + force_clear_namespace_finalizers "${ns}" + kubectl delete namespace "${ns}" --ignore-not-found --wait=false 2>/dev/null || true + done + break + fi + sleep 1 +done + +# Final webhook cleanup pass. +delete_orphaned_webhooks_for_ns "skyhook" + +# ============================================================================== +# Post-flight verification +# ============================================================================== + +postflight_issues=false + +TERMINATING=$(kubectl get namespaces -o jsonpath='{range .items[?(@.status.phase=="Terminating")]}{.metadata.name}{" "}{end}' 2>/dev/null || true) +if [[ -n "${TERMINATING}" ]]; then + echo "WARNING: namespaces still terminating: ${TERMINATING}" + echo " A subsequent deploy.sh may fail. Wait or force-finalize these namespaces." 
+ postflight_issues=true +fi + +kubectl get mutatingwebhookconfigurations,validatingwebhookconfigurations -o json 2>/dev/null | \ + jq -r '.items[] | .metadata.name as $wh | .webhooks[]? | select(.clientConfig.service != null) | [$wh, .clientConfig.service.namespace, .clientConfig.service.name] | @tsv' 2>/dev/null | \ + sort -u | \ + while IFS=$'\t' read -r wh_name svc_ns svc_name; do + if ! kubectl get ns "${svc_ns}" &>/dev/null || ! kubectl get svc "${svc_name}" -n "${svc_ns}" &>/dev/null; then + echo "Deleting stale webhook ${wh_name} (service ${svc_ns}/${svc_name} missing)..." + kubectl delete mutatingwebhookconfiguration "${wh_name}" --ignore-not-found 2>/dev/null || true + kubectl delete validatingwebhookconfiguration "${wh_name}" --ignore-not-found 2>/dev/null || true + fi + done || true + +stale_apis=$(kubectl get apiservices -o json 2>/dev/null | jq -r '.items[] | select(.status.conditions[]? | .type == "Available" and .status == "False") | .metadata.name' 2>/dev/null || true) +if [[ -n "${stale_apis}" ]]; then + echo "WARNING: unavailable API services found: ${stale_apis}" + echo " These can block namespace deletion. Delete with: kubectl delete apiservice " + postflight_issues=true +fi + +# Check for Helm-annotated CRDs from uninstalled releases. 
+helm_orphaned_crds="" +explicit_orphaned_crds="" +postflight_all_crds_json="" +if capture_kubectl_json postflight_all_crds_json get crd -o json; then + : + remaining_helm_crds=$(echo "${postflight_all_crds_json}" \ + | jq -r --arg rel "skyhook-customizations" --arg ns "skyhook" \ + '.items[] | select(.metadata.annotations["meta.helm.sh/release-name"]==$rel and .metadata.annotations["meta.helm.sh/release-namespace"]==$ns and .metadata.deletionTimestamp==null) | .metadata.name' 2>/dev/null || true) + if [[ -n "${remaining_helm_crds}" ]]; then + helm_orphaned_crds="${helm_orphaned_crds} ${remaining_helm_crds}" + fi + while read -r crd_name; do + [[ -z "${crd_name}" ]] && continue + remaining_explicit_crd=$(echo "${postflight_all_crds_json}" \ + | jq -r --arg name "${crd_name}" \ + '.items[] | select(.metadata.name==$name and .metadata.deletionTimestamp==null) | .metadata.name' 2>/dev/null || true) + if [[ -n "${remaining_explicit_crd}" ]]; then + explicit_orphaned_crds="${explicit_orphaned_crds}${remaining_explicit_crd}"$'\n' + fi + done < <(extra_crds_for_release "skyhook-customizations") +else + echo "Warning: failed to enumerate post-flight CRDs; kubectl output: ${postflight_all_crds_json}" >&2 + postflight_issues=true +fi +if [[ -n "${helm_orphaned_crds}" ]]; then + echo "WARNING: Helm-annotated CRDs from uninstalled releases still present:${helm_orphaned_crds}" + echo " Cleanup did not remove all CRDs owned by this bundle's releases." + echo " Delete with: kubectl delete crd " + postflight_issues=true +fi + +explicit_orphaned_crds=$(printf '%s' "${explicit_orphaned_crds}" | awk 'NF' | sort -u | tr '\n' ' ') +if [[ -n "${explicit_orphaned_crds}" ]]; then + echo "WARNING: explicit CRDs from this bundle still present: ${explicit_orphaned_crds}" + echo " These CRDs are installed outside Helm manifest/annotation discovery." 
+ echo " Delete with: kubectl delete crd " + postflight_issues=true +fi + +if [[ "${postflight_issues}" == "true" ]]; then + echo "" + echo "Post-flight: some stale resources remain. Run deploy.sh pre-flight checks to verify before redeploying." +else + echo "Post-flight: cluster is clean." +fi + +echo "Undeployment complete." diff --git a/pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/001-gpu-operator/cluster-values.yaml b/pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/001-gpu-operator/cluster-values.yaml new file mode 100644 index 000000000..9c936cbd8 --- /dev/null +++ b/pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/001-gpu-operator/cluster-values.yaml @@ -0,0 +1,2 @@ +# Generated by Cloud Native Stack +--- diff --git a/pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/001-gpu-operator/install.sh b/pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/001-gpu-operator/install.sh new file mode 100644 index 000000000..ef9daafa9 --- /dev/null +++ b/pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/001-gpu-operator/install.sh @@ -0,0 +1,29 @@ +#!/usr/bin/env bash +# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +set -euo pipefail +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +cd "${SCRIPT_DIR}" +# shellcheck source=/dev/null +source ./upstream.env + +# CHART carries the full OCI URI for OCI charts and just the chart name for +# HTTP/HTTPS charts. 
REPO is non-empty only for HTTP/HTTPS charts; the +# ${REPO:+--repo "${REPO}"} expansion adds --repo iff REPO is set. +helm upgrade --install gpu-operator "${CHART}" \ + ${REPO:+--repo "${REPO}"} --version "${VERSION}" \ + --namespace gpu-operator --create-namespace \ + -f values.yaml -f cluster-values.yaml \ + ${COMPONENT_WAIT_ARGS:-} ${DRY_RUN_FLAG:-} ${KUBECONFIG_FLAG:-} ${HELM_DEBUG_FLAG:-} diff --git a/pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/001-gpu-operator/upstream.env b/pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/001-gpu-operator/upstream.env new file mode 100644 index 000000000..d81c9a2dc --- /dev/null +++ b/pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/001-gpu-operator/upstream.env @@ -0,0 +1,3 @@ +CHART='gpu-operator' +REPO='https://helm.ngc.nvidia.com/nvidia' +VERSION='v25.3.3' diff --git a/pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/001-gpu-operator/values.yaml b/pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/001-gpu-operator/values.yaml new file mode 100644 index 000000000..317d910fb --- /dev/null +++ b/pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/001-gpu-operator/values.yaml @@ -0,0 +1,4 @@ +# Generated by Cloud Native Stack +--- +driver: + enabled: true diff --git a/pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/002-gpu-operator-post/Chart.yaml b/pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/002-gpu-operator-post/Chart.yaml new file mode 100644 index 000000000..2dc5b2a16 --- /dev/null +++ b/pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/002-gpu-operator-post/Chart.yaml @@ -0,0 +1,20 @@ +# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +apiVersion: v2 +name: gpu-operator-post +description: Generated wrapper chart for gpu-operator local content. +type: application +version: 0.1.0 +appVersion: "0.1.0" diff --git a/pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/002-gpu-operator-post/cluster-values.yaml b/pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/002-gpu-operator-post/cluster-values.yaml new file mode 100644 index 000000000..9c936cbd8 --- /dev/null +++ b/pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/002-gpu-operator-post/cluster-values.yaml @@ -0,0 +1,2 @@ +# Generated by Cloud Native Stack +--- diff --git a/pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/002-gpu-operator-post/install.sh b/pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/002-gpu-operator-post/install.sh new file mode 100644 index 000000000..ea8b750c2 --- /dev/null +++ b/pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/002-gpu-operator-post/install.sh @@ -0,0 +1,23 @@ +#!/usr/bin/env bash +# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +set -euo pipefail +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +cd "${SCRIPT_DIR}" + +helm upgrade --install gpu-operator-post ./ \ + --namespace gpu-operator --create-namespace \ + -f values.yaml -f cluster-values.yaml \ + ${COMPONENT_WAIT_ARGS:-} ${DRY_RUN_FLAG:-} ${KUBECONFIG_FLAG:-} ${HELM_DEBUG_FLAG:-} diff --git a/pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/002-gpu-operator-post/templates/dcgm-exporter.yaml b/pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/002-gpu-operator-post/templates/dcgm-exporter.yaml new file mode 100644 index 000000000..8bb1eebb7 --- /dev/null +++ b/pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/002-gpu-operator-post/templates/dcgm-exporter.yaml @@ -0,0 +1,6 @@ +# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. + +apiVersion: v1 +kind: Service +metadata: + name: dcgm-exporter diff --git a/pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/002-gpu-operator-post/values.yaml b/pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/002-gpu-operator-post/values.yaml new file mode 100644 index 000000000..317d910fb --- /dev/null +++ b/pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/002-gpu-operator-post/values.yaml @@ -0,0 +1,4 @@ +# Generated by Cloud Native Stack +--- +driver: + enabled: true diff --git a/pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/README.md b/pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/README.md new file mode 100644 index 000000000..e0a7f2d86 --- /dev/null +++ b/pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/README.md @@ -0,0 +1,114 @@ +# Cloud Native Stack Deployment + +Recipe Version: v0.1.0 +Bundler Version: v1.0.0 + +Per-component bundle for deploying NVIDIA Cloud Native Stack components +for GPU-accelerated Kubernetes workloads. 
+ +## Configuration + + + +## Components + +The following components are included (deployed in order). Each component +lives in a numbered `NNN-/` folder and is installed as a Helm release +via its own `install.sh`: + +| Component | Version | Namespace | Source | +|-----------|---------|-----------|--------| +| gpu-operator | v25.3.3 | gpu-operator | gpu-operator (https://helm.ngc.nvidia.com/nvidia) | + + + + +## Quick Start + +Run the included deployment script: + +```bash +chmod +x deploy.sh +./deploy.sh +``` + +Use `--no-wait` to skip Helm chart-level waiting where AICR uses `--wait` (keeps `--timeout` for hooks): + +```bash +./deploy.sh --no-wait +``` + +> **Note:** The deploy script's final status reflects install/apply results. If `--best-effort` was used, one or more components may still have failed; check warning lines and logs. This does **not** guarantee the cluster is ready to schedule workloads — operator-driven cluster convergence (CRD reconciliation, node tuning, plugin registration, etc.) continues asynchronously after the script exits, in operator-specific ways. See the [AICR CLI Reference](https://github.com/NVIDIA/aicr/blob/main/docs/user/cli-reference.md#deploy-script-behavior-deploysh) for details. + +## Manual Installation + +Each component folder contains an `install.sh` that runs `helm upgrade --install` +with the right arguments baked in. To install a single component manually: + +```bash +cd NNN- +bash install.sh +``` + +## Customization + +Each component folder has its own `values.yaml` (static) and `cluster-values.yaml` +(dynamic, per-cluster). 
Edit either before deploying: + +```bash +vim NNN-/values.yaml +vim NNN-/cluster-values.yaml +``` + +## Upgrade + +Re-run the per-component install.sh to upgrade an already-installed release: + +```bash +cd NNN- +bash install.sh +``` + +## Uninstall + +To remove components (reverse order): + +```bash +./undeploy.sh +``` + +Or remove a single release manually: + +```bash +helm uninstall gpu-operator -n gpu-operator +``` + + +## Troubleshooting + +### Check deployment status + +```bash +kubectl get pods -A | grep -E 'gpu-operator' +``` + +### View component logs + +Inspect a single component's pods (replace `` and `` +with one of the entries from the table above): + +```bash +kubectl logs -n -l app.kubernetes.io/instance= +``` + +### Verify GPU access + +```bash +kubectl get nodes -o jsonpath='{.items[*].status.allocatable}' | jq '.["nvidia.com/gpu"]' +``` + + +## References + +- [AICR CLI Reference](https://github.com/NVIDIA/aicr/blob/main/docs/user/cli-reference.md) +- [GPU Operator Documentation](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/) diff --git a/pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/deploy.sh b/pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/deploy.sh new file mode 100644 index 000000000..993f8b8b3 --- /dev/null +++ b/pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/deploy.sh @@ -0,0 +1,321 @@ +#!/usr/bin/env bash +# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +set -euo pipefail + +# Cloud Native Stack Deployment Script +# Generated by AICR Bundler v1.0.0 +# +# Usage: ./deploy.sh [--no-wait] [--best-effort] [--retries N] +# --no-wait Skip Helm chart-level wait where AICR uses --wait (keeps --timeout for hooks) +# --best-effort Continue past individual component failures (log warnings) +# --retries N Retry failed helm/kubectl operations N times with backoff (default: 5, 0 = fail-fast) +# +# This script is optional — each component subdirectory has its own install.sh +# with a single baked-in `helm upgrade --install` command. For detailed behavior +# docs (CRD ordering, async components, error handling), see the AICR CLI Reference: +# https://github.com/NVIDIA/aicr/blob/main/docs/user/cli-reference.md#deploy-script-behavior-deploysh + +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" + +# Run helm commands from a temp directory to prevent local chart directories +# (e.g., bundle/001-nodewright-operator/) from shadowing remote chart references. +HELM_WORKDIR="$(mktemp -d)" +trap 'rm -rf "${HELM_WORKDIR}"; exit 130' INT TERM +trap 'rm -rf "${HELM_WORKDIR}"' EXIT + +HELM_TIMEOUT="10m" +NO_WAIT=false +BEST_EFFORT=false +FAILED_COMPONENTS="" +MAX_RETRIES=5 + +while [[ $# -gt 0 ]]; do + case "$1" in + --no-wait) NO_WAIT=true; shift ;; + --best-effort) BEST_EFFORT=true; shift ;; + --retries) + if [[ $# -lt 2 ]]; then echo "Error: --retries requires a value"; exit 1; fi + if ! [[ "$2" =~ ^[0-9]+$ ]]; then echo "Error: --retries requires a non-negative integer"; exit 1; fi + MAX_RETRIES="$2"; shift 2 ;; + *) echo "Error: unknown option: $1"; echo "Usage: ./deploy.sh [--no-wait] [--best-effort] [--retries N]"; exit 1 ;; + esac +done + +# Export env vars consumed by each folder's install.sh (rendered by localformat). +# DRY_RUN_FLAG / KUBECONFIG_FLAG / HELM_DEBUG_FLAG default to empty strings. 
+export DRY_RUN_FLAG="${DRY_RUN_FLAG:-}" +export KUBECONFIG_FLAG="${KUBECONFIG_FLAG:-}" +export HELM_DEBUG_FLAG="${HELM_DEBUG_FLAG:-}" + +function helm_failed() { + if [[ "${BEST_EFFORT}" == "true" ]]; then + echo "WARNING: $1 install failed, continuing (--best-effort)" + FAILED_COMPONENTS="${FAILED_COMPONENTS} $1" + else + exit 1 + fi +} + +# Compute backoff delay from attempt number (1-indexed). +# Examples: attempt 1→5s, 2→20s, 3→45s, 4→80s, 5→120s (cap) +function backoff_seconds() { + local attempt=$1 + local seconds=$(( attempt * attempt * 5 )) + if [[ ${seconds} -gt 120 ]]; then seconds=120; fi + echo "${seconds}" +} + +# Clean up stale Helm hook Jobs before retrying. When a hook Job (e.g., +# crd-upgrader) times out or fails, it stays in the namespace and blocks +# subsequent install attempts with "Job not ready" errors. +# Helm hooks are identified by the helm.sh/hook *annotation* (not a label), +# so we list all non-succeeded Jobs and check each individually via JSON. +function cleanup_helm_hooks() { + local namespace="$1" + local job_names + job_names=$(kubectl get jobs -n "${namespace}" \ + --field-selector=status.successful=0 \ + -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' \ + 2>/dev/null || true) + if [[ -z "${job_names}" ]]; then + return + fi + while IFS= read -r name; do + [[ -z "${name}" ]] && continue + # Get the full Job JSON to reliably check annotations and status + local job_json + job_json=$(kubectl get job "${name}" -n "${namespace}" -o json 2>/dev/null || true) + [[ -z "${job_json}" ]] && continue + # Skip non-hook Jobs (no helm.sh/hook annotation) + local hook_val + hook_val=$(echo "${job_json}" | grep -o '"helm.sh/hook"' || true) + [[ -z "${hook_val}" ]] && continue + # Capture diagnostics before deleting. This helps diagnose transient hook + # failures (e.g., dynamo ssh-keygen) that are otherwise lost after cleanup. 
+ echo " --- Failed hook Job ${name} diagnostics ---" + kubectl describe job "${name}" -n "${namespace}" 2>/dev/null | tail -50 || true + local pod_names + pod_names=$(kubectl get pods -n "${namespace}" -l "job-name=${name}" \ + -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' 2>/dev/null || true) + for pod_name in ${pod_names}; do + echo " --- Hook pod ${pod_name} describe ---" + kubectl describe pod "${pod_name}" -n "${namespace}" 2>/dev/null | tail -50 || true + done + echo " --- End diagnostics for ${name} ---" + # Delete any non-succeeded hook Job. This function only runs after a Helm + # failure, so any hook Job without a successful completion is blocking the + # retry — whether it failed, is stuck Pending (timed out before the pod + # started), or is still active with a stuck container. + echo " Cleaning up stale Helm hook Job ${name} in ${namespace}..." + kubectl delete job "${name}" -n "${namespace}" --ignore-not-found 2>/dev/null || true + done <<< "${job_names}" +} + +function dump_kai_scheduler_helm_diagnostics() { + local namespace="$1" + if [[ "${namespace}" != "kai-scheduler" ]]; then + return + fi + + echo " --- ${namespace} diagnostics ---" + echo " Jobs:" + kubectl get jobs -n "${namespace}" 2>/dev/null || true + echo " Job descriptions:" + kubectl describe jobs -n "${namespace}" 2>/dev/null || true + echo " Pods:" + kubectl get pods -n "${namespace}" -o wide 2>/dev/null || true + echo " Pod descriptions:" + kubectl describe pods -n "${namespace}" 2>/dev/null || true + echo " Recent events:" + kubectl get events -n "${namespace}" --sort-by='.lastTimestamp' 2>/dev/null | tail -30 || true + echo " --- End ${namespace} diagnostics ---" +} + +# Components that use operator patterns with custom resources that reconcile +# asynchronously. Helm --wait may time out waiting for CR readiness even though +# all pods start successfully. These components are installed without --wait. 
+ASYNC_COMPONENTS="kai-scheduler" + +# ============================================================================== +# Pre-flight checks +# ============================================================================== +# Verify the cluster is clean before deploying. Stale webhooks, terminating +# namespaces, and orphaned API services from a previous install can block pod +# creation and namespace deletion, causing silent deployment failures. + +echo "Running pre-flight checks..." + +preflight_failed=false + +# Bundle namespace list (deduplicated) +BUNDLE_NAMESPACES=$(echo "gpu-operator " | tr ' ' '\n' | sort -u | tr '\n' ' ') + +# Check for terminating namespaces that overlap with our components +for ns in ${BUNDLE_NAMESPACES}; do + phase=$(kubectl get ns "${ns}" -o jsonpath='{.status.phase}' 2>/dev/null || true) + if [[ "${phase}" == "Terminating" ]]; then + echo "ERROR: namespace '${ns}' is still terminating from a previous install." + echo " Wait for it to finish, or force-finalize with:" + echo " kubectl get ns ${ns} -o json | jq '.spec.finalizers=[]' | kubectl replace --raw /api/v1/namespaces/${ns}/finalize -f -" + preflight_failed=true + fi +done + +# Check for stale webhooks whose backing services no longer exist. +# Scoped to bundle namespaces only to avoid false positives from unrelated +# platform webhooks in shared clusters. 
+if command -v jq &>/dev/null; then + for kind in mutatingwebhookconfigurations validatingwebhookconfigurations; do + while IFS=$'\t' read -r wh_name svc_ns svc_name; do + # Only check webhooks pointing to our bundle namespaces + is_bundle_ns=false + for ns in ${BUNDLE_NAMESPACES}; do + [[ "${svc_ns}" == "${ns}" ]] && is_bundle_ns=true && break + done + [[ "${is_bundle_ns}" == "false" ]] && continue + + # Use explicit NotFound check to avoid false positives from transient errors + svc_check=$(kubectl get svc "${svc_name}" -n "${svc_ns}" 2>&1) || true + if echo "${svc_check}" | grep -q "NotFound\|not found"; then + echo "ERROR: ${kind} '${wh_name}' references non-existent service ${svc_ns}/${svc_name}." + echo " This will block pod/resource creation. Delete with: kubectl delete ${kind} ${wh_name}" + preflight_failed=true + fi + done < <(kubectl get "${kind}" -o json 2>/dev/null | \ + jq -r '.items[] | .metadata.name as $wh | .webhooks[]? | select(.clientConfig.service != null) | [$wh, .clientConfig.service.namespace, .clientConfig.service.name] | @tsv' 2>/dev/null || true) + done +else + echo "NOTE: jq not found — skipping webhook pre-flight checks. Install jq for full pre-flight validation." +fi + +# Check for stale API services (e.g., custom.metrics.k8s.io from prometheus-adapter) +if command -v jq &>/dev/null; then + for api_svc in $(kubectl get apiservices -o json 2>/dev/null | jq -r '.items[] | select(.status.conditions[]? | .type == "Available" and .status == "False") | .metadata.name' 2>/dev/null || true); do + echo "WARNING: API service '${api_svc}' is unavailable. This can block namespace deletion." + echo " Delete with: kubectl delete apiservice ${api_svc}" + # API service issues are warnings, not hard failures — they don't block deployment directly + done +else + echo "NOTE: jq not found — skipping API service pre-flight checks." +fi + +# Check for orphaned CRDs from previous deployments. 
+# Scoped to CRD groups belonging to components in this bundle to avoid +# false positives from unrelated platform installs on shared clusters. +ORPHANED_CRD_GROUPS="" ORPHANED_CRD_GROUPS="${ORPHANED_CRD_GROUPS} nvidia.com nfd.k8s-sigs.io" +for group in ${ORPHANED_CRD_GROUPS}; do + orphaned=$(kubectl get crd -o name 2>/dev/null | grep "\.${group}$" || true) + if [[ -n "${orphaned}" ]]; then + echo "WARNING: orphaned CRDs from previous deployment: ${orphaned}" + echo " These may cause conflicts. Delete with: kubectl delete ${orphaned}" + fi +done + +# Check for stale nodewright node taints from a previous deployment. +# Only remove taints if nodewright-operator is NOT already running (i.e., fresh deploy). +# If the operator is running, taints are legitimate scheduling guards. + +if [[ "${preflight_failed}" == "true" ]]; then + echo "" + echo "Pre-flight checks failed. Fix the issues above before deploying." + echo "To skip pre-flight checks, run: ./undeploy.sh first, then retry." + exit 1 +fi + +echo "Pre-flight checks passed." +echo "Deploying Cloud Native Stack components..." + +# ============================================================================== +# Install loop +# ============================================================================== +# Generic install loop. Each folder's install.sh is rendered by localformat +# with the right helm command baked in — deploy.sh has no per-component +# knowledge here. Per-component special-case logic (async wait, DRA plugin +# restart) runs in the post-install blocks below, matched by component name. +cd "${HELM_WORKDIR}" +for dir in "${SCRIPT_DIR}"/[0-9][0-9][0-9]-*/; do + [[ -d "${dir}" ]] || continue + dir="${dir%/}" + base="${dir##*/}" + name="${base#[0-9][0-9][0-9]-}" + # Source the namespace from the folder's install.sh — not the folder + # basename — because the helm release name and its target namespace can + # differ (e.g. 
nodewright-operator → namespace skyhook; gpu-operator-post → + # namespace gpu-operator). cleanup_helm_hooks and the kai diagnostics + # both operate on the namespace. + namespace=$(awk '{ for (i=1;i/dev/null; then + echo "Error: jq is required but not found in PATH." + echo " Install jq: https://jqlang.github.io/jq/download/" + exit 1 +fi + +echo "Undeploying Cloud Native Stack components (timeout: ${HELM_TIMEOUT}s)..." +echo "" + +echo " The following components will be removed (in reverse install order):" +# List NNN-* folders in reverse numeric order +for dir in $(ls -1d "${SCRIPT_DIR}"/[0-9][0-9][0-9]-*/ 2>/dev/null | sort -r); do + [[ -d "${dir}" ]] || continue + base="${dir%/}" + base="${base##*/}" + name="${base#[0-9][0-9][0-9]-}" + printf " %-40s\n" "${name}" +done +echo "" + +# System namespaces that must not be deleted +PROTECTED_NS="kube-system kube-public kube-node-lease default" + +delete_namespace() { + local ns="$1" + if [[ "${KEEP_NS}" == "true" ]]; then return; fi + if echo " ${PROTECTED_NS} " | grep -q " ${ns} "; then return; fi + if ! kubectl get namespace "${ns}" &>/dev/null; then return; fi + echo "Deleting namespace ${ns}..." + kubectl delete namespace "${ns}" --ignore-not-found --wait=false +} + +# Uninstall a Helm release, handling stuck pending states from interrupted deploys. +# Try normal uninstall first; if it fails, retry with --no-hooks to force removal. +helm_force_uninstall() { + local release="$1" + local ns="$2" + if helm uninstall "${release}" -n "${ns}" --timeout "${HELM_TIMEOUT}s" --ignore-not-found 2>/dev/null; then + return + fi + echo " Retrying ${release} removal with --no-hooks..." + helm uninstall "${release}" -n "${ns}" --no-hooks --timeout "${HELM_TIMEOUT}s" --ignore-not-found || true +} + +# Delete cluster-scoped resources owned by a Helm release. 
+# These survive namespace deletion and can block subsequent deployments: +# - Webhooks block pod creation when their backing service is gone +# - CRDs with "helm.sh/resource-policy: keep" are retained after chart removal +delete_release_cluster_resources() { + local release="$1" + local ns="$2" + local selector="app.kubernetes.io/managed-by=Helm" + for kind in mutatingwebhookconfigurations validatingwebhookconfigurations customresourcedefinitions; do + kubectl get "${kind}" -l "${selector}" -o json 2>/dev/null \ + | jq -r --arg rel "${release}" --arg ns "${ns}" \ + '.items[] | select(.metadata.annotations["meta.helm.sh/release-name"]==$rel and .metadata.annotations["meta.helm.sh/release-namespace"]==$ns) | .metadata.name' 2>/dev/null \ + | while read -r name; do + echo "Deleting ${kind}/${name}..." + kubectl delete "${kind}" "${name}" --ignore-not-found --timeout="${HELM_TIMEOUT}s" || true + done || echo "Warning: ${kind} cleanup pipeline for release ${release}/${ns} failed (kubectl get / jq error); leftovers will surface in post-flight" >&2 + done +} + +# Delete webhooks whose backing service is in a specific namespace and no longer exists. +# Scoped to the given namespace to avoid touching unrelated platform webhooks. +# Operator-created webhooks (e.g., kai-scheduler admission) may not carry Helm labels, +# but once their service namespace is deleted, fail-closed webhooks block pod creation. +delete_orphaned_webhooks_for_ns() { + local ns="$1" + for kind in mutatingwebhookconfigurations validatingwebhookconfigurations; do + { kubectl get "${kind}" -o json 2>/dev/null \ + | jq -r --arg ns "${ns}" \ + '.items[] | .metadata.name as $wh | .webhooks[] | select(.clientConfig.service != null and .clientConfig.service.namespace == $ns) | [$wh, .clientConfig.service.name] | @tsv' 2>/dev/null \ + | sort -u || true; } \ + | while IFS=$'\t' read -r wh_name svc_name; do + # Delete when namespace is gone, terminating, or backing service is missing. 
+ # Skip on transient errors (auth, timeout, DNS) to avoid removing valid webhooks. + local ns_output ns_phase svc_output + ns_output=$(kubectl get ns "${ns}" 2>&1) || true + if echo "${ns_output}" | grep -q "NotFound"; then + echo "Deleting orphaned ${kind}/${wh_name} (namespace ${ns} not found)..." + kubectl delete "${kind}" "${wh_name}" --ignore-not-found + continue + fi + ns_phase=$(kubectl get ns "${ns}" -o jsonpath='{.status.phase}' 2>/dev/null) || true + if [[ "${ns_phase}" == "Terminating" ]]; then + echo "Deleting orphaned ${kind}/${wh_name} (namespace ${ns} terminating)..." + kubectl delete "${kind}" "${wh_name}" --ignore-not-found + continue + fi + svc_output=$(kubectl get svc "${svc_name}" -n "${ns}" 2>&1) || true + if echo "${svc_output}" | grep -q "NotFound"; then + echo "Deleting orphaned ${kind}/${wh_name} (service ${ns}/${svc_name} not found)..." + kubectl delete "${kind}" "${wh_name}" --ignore-not-found + fi + done + done +} + +# Force-clear finalizers on all namespaced resources to unstick a Terminating namespace. +force_clear_namespace_finalizers() { + local ns="$1" + echo "Force-removing finalizers in namespace ${ns}..." 
+ local kinds + kinds=$(kubectl api-resources --verbs=list --namespaced -o name 2>/dev/null) || { + echo "Warning: failed to enumerate namespaced resource kinds in ${ns}; namespace may stay Terminating" >&2 + return + } + for kind in ${kinds}; do + kubectl get "${kind}" -n "${ns}" -o json 2>/dev/null \ + | jq -r '.items[] | select(.metadata.finalizers // [] | length > 0) | .kind + "/" + .metadata.name' 2>/dev/null \ + | while read -r resource; do + kubectl patch "${resource}" -n "${ns}" --type merge -p '{"metadata":{"finalizers":null}}' 2>/dev/null || true + done || echo "Warning: finalizer-clear pipeline for ${kind} in ${ns} failed (kubectl get / jq error); namespace may stay Terminating" >&2 + done +} + +# Return a small, explicit list of known crds/-installed CRDs whose +# finalizer-bearing custom resources must still be caught before the operator +# is removed. Keep this list intentionally tiny and exact: if a release is not +# listed here, pre-flight relies on chart-manifest and Helm-annotation discovery +# only instead of trying to infer ownership across the whole cluster. 
+extra_crds_for_release() { + case "$1" in + gpu-operator) + printf '%s\n' \ + "clusterpolicies.nvidia.com" \ + "nvidiadrivers.nvidia.com" \ + "nodefeaturegroups.nfd.k8s-sigs.io" \ + "nodefeaturerules.nfd.k8s-sigs.io" \ + "nodefeatures.nfd.k8s-sigs.io" + ;; + kai-scheduler) + printf '%s\n' \ + "bindrequests.scheduling.run.ai" \ + "configs.kai.scheduler" \ + "podgroups.scheduling.run.ai" \ + "queues.scheduling.run.ai" \ + "schedulingshards.kai.scheduler" \ + "topologies.kai.scheduler" + ;; + k8s-nim-operator) + printf '%s\n' \ + "nemocustomizers.apps.nvidia.com" \ + "nemodatastores.apps.nvidia.com" \ + "nemoentitystores.apps.nvidia.com" \ + "nemoevaluators.apps.nvidia.com" \ + "nemoguardrails.apps.nvidia.com" \ + "nimbuilds.apps.nvidia.com" \ + "nimcaches.apps.nvidia.com" \ + "nimpipelines.apps.nvidia.com" \ + "nimservices.apps.nvidia.com" + ;; + kubeflow-trainer) + printf '%s\n' \ + "clustertrainingruntimes.trainer.kubeflow.org" \ + "trainjobs.trainer.kubeflow.org" \ + "trainingruntimes.trainer.kubeflow.org" \ + "jobsets.jobset.x-k8s.io" + ;; + kube-prometheus-stack) + printf '%s\n' \ + "alertmanagerconfigs.monitoring.coreos.com" \ + "alertmanagers.monitoring.coreos.com" \ + "podmonitors.monitoring.coreos.com" \ + "probes.monitoring.coreos.com" \ + "prometheusagents.monitoring.coreos.com" \ + "prometheuses.monitoring.coreos.com" \ + "prometheusrules.monitoring.coreos.com" \ + "scrapeconfigs.monitoring.coreos.com" \ + "servicemonitors.monitoring.coreos.com" \ + "thanosrulers.monitoring.coreos.com" + ;; + dynamo-platform) + printf '%s\n' \ + "podcliques.grove.io" \ + "podcliquescalinggroups.grove.io" \ + "podcliquesets.grove.io" \ + "podgangs.scheduler.grove.io" + ;; + network-operator) + # This explicit list matches the CRDs enabled by the bundled values: + # nfd=false, sriovNetworkOperator=false, maintenance-operator disabled. 
+ # Intentionally exclude networkattachmentdefinitions.k8s.cni.cncf.io: + # it is a broadly shared CRD, so surfacing or deleting it based only on + # this release would create cross-cluster noise. + printf '%s\n' \ + "nicclusterpolicies.mellanox.com" \ + "hostdevicenetworks.mellanox.com" \ + "ipoibnetworks.mellanox.com" \ + "macvlannetworks.mellanox.com" + ;; + *) ;; + esac +} + +# Skip pre-flight for releases whose bundle-managed custom resources are +# deleted from manifests before the controller is uninstalled. +skip_preflight_for_release() { + case "$1" in + nodewright-operator|kgateway) return 0 ;; + *) return 1 ;; + esac +} + +# Run `kubectl ... -o json` while keeping stdout parseable for jq. +capture_kubectl_json() { + local out_var="$1" + shift + local stderr_file output kubectl_err="" + + if ! stderr_file=$(mktemp "${TMPDIR:-/tmp}/.aicr_kubectl_stderr_XXXXXX"); then + printf -v "${out_var}" '%s' 'mktemp failed while capturing kubectl stderr' + return 1 + fi + + if ! output=$(kubectl "$@" 2>"${stderr_file}"); then + kubectl_err=$(cat "${stderr_file}" 2>/dev/null || true) + rm -f "${stderr_file}" + printf -v "${out_var}" '%s' "${kubectl_err}" + return 1 + fi + + if [[ -s "${stderr_file}" ]]; then + cat "${stderr_file}" >&2 || true + fi + rm -f "${stderr_file}" + printf -v "${out_var}" '%s' "${output}" + return 0 +} + +check_crd_for_stuck_resources() { + local crd_name="$1" + local component="$2" + local crd_json plural group scope resource stuck stuck_json kubectl_err + local item_name item_namespace item_finalizers + + if ! capture_kubectl_json crd_json get crd "${crd_name}" -o json; then + kubectl_err="${crd_json}" + echo "" >&2 + echo "ERROR: Pre-flight could not inspect CRD '${crd_name}' (release ${component})." >&2 + echo " kubectl output: ${kubectl_err}" >&2 + echo " Failing closed: if this is a transient API/auth error the script" >&2 + echo " may remove the operator before finalizers are reconciled." 
>&2 + echo " Re-run after the API recovers, or use --skip-preflight to bypass." >&2 + return 1 + fi + read -r plural group scope < <(echo "${crd_json}" | jq -r '[.spec.names.plural, .spec.group, .spec.scope] | @tsv' 2>/dev/null) || return 0 + [[ -z "${plural}" || "${plural}" == "null" ]] && return 0 + + resource="${plural}.${group}" + local jq_filter='[.metadata.finalizers // [] | .[] | select(startswith("kubernetes.io/") | not)]' + stuck="" + if [[ "${scope}" == "Namespaced" ]]; then + if ! capture_kubectl_json stuck_json get "${resource}" -A -o json; then + echo "" >&2 + echo "ERROR: Pre-flight could not list '${resource}' (release ${component})." >&2 + echo " kubectl output: ${stuck_json}" >&2 + echo " Failing closed so we do not silently miss a stuck CR." >&2 + echo " Re-run after the API recovers, or use --skip-preflight to bypass." >&2 + return 1 + fi + while IFS=$'\x1f' read -r item_namespace item_name item_finalizers; do + stuck="${stuck} ${item_namespace}/${item_name} finalizers=[${item_finalizers}]"$'\n' + done < <(echo "${stuck_json}" \ + | jq -r '.items[] | ('"${jq_filter}"') as $f | select(($f | length) > 0) | [(.metadata.namespace // ""), .metadata.name, ($f | join(","))] | join("\u001f")' 2>/dev/null || true) + else + if ! capture_kubectl_json stuck_json get "${resource}" -o json; then + echo "" >&2 + echo "ERROR: Pre-flight could not list '${resource}' (release ${component})." >&2 + echo " kubectl output: ${stuck_json}" >&2 + echo " Failing closed so we do not silently miss a stuck CR." >&2 + echo " Re-run after the API recovers, or use --skip-preflight to bypass."
>&2 + return 1 + fi + while IFS=$'\x1f' read -r item_name item_finalizers; do + stuck="${stuck} ${item_name} finalizers=[${item_finalizers}]"$'\n' + done < <(echo "${stuck_json}" \ + | jq -r '.items[] | ('"${jq_filter}"') as $f | select(($f | length) > 0) | [.metadata.name, ($f | join(","))] | join("\u001f")' 2>/dev/null || true) + fi + + if [[ -n "${stuck}" ]]; then + { + echo " ${component} — ${crd_name}:" + printf '%s' "${stuck}" + } >> "${PREFLIGHT_DETAILS}" + fi +} + +check_release_for_stuck_crds() { + local release="$1" + local ns="$2" + local manifest manifest_crds annotated_crds explicit_crds + local all_crds_json kubectl_err crd_name + manifest=$(helm get manifest "${release}" -n "${ns}" 2>/dev/null || true) + manifest_crds=$(echo "${manifest}" \ + | awk '/^kind:/{kind=$2} /^ name:/ && kind=="CustomResourceDefinition"{print $2; kind=""}') + + all_crds_json="${PREFLIGHT_ALL_CRDS_JSON:-}" + if [[ -z "${all_crds_json}" ]]; then + if ! capture_kubectl_json all_crds_json get crd -o json; then + kubectl_err="${all_crds_json}" + echo "" >&2 + echo "ERROR: Pre-flight could not list CRDs for release '${release}'." >&2 + echo " kubectl output: ${kubectl_err}" >&2 + echo " Failing closed: if this is a transient API/auth error the script" >&2 + echo " may remove the operator before finalizers are reconciled." >&2 + echo " Re-run after the API recovers, or use --skip-preflight to bypass."
>&2 + return 1 + fi + fi + + annotated_crds=$(echo "${all_crds_json}" \ + | jq -r --arg rel "${release}" --arg ns "${ns}" \ + '.items[] | select(.metadata.annotations["meta.helm.sh/release-name"]==$rel and .metadata.annotations["meta.helm.sh/release-namespace"]==$ns) | .metadata.name' 2>/dev/null || true) + explicit_crds="" + while read -r crd_name; do + [[ -z "${crd_name}" ]] && continue + explicit_crds="${explicit_crds}$(echo "${all_crds_json}" \ + | jq -r --arg name "${crd_name}" --arg rel "${release}" --arg ns "${ns}" \ + '.items[] + | select(.metadata.name == $name) + | select( + ((.metadata.annotations["meta.helm.sh/release-name"] // "") == "") + or + ( + .metadata.annotations["meta.helm.sh/release-name"] == $rel + and + .metadata.annotations["meta.helm.sh/release-namespace"] == $ns + ) + ) + | .metadata.name' 2>/dev/null || true)"$'\n' + done < <(extra_crds_for_release "${release}") + [[ -z "${manifest_crds}" && -z "${annotated_crds}" && -z "${explicit_crds}" ]] && return 0 + printf '%s\n%s\n%s\n' "${manifest_crds}" "${annotated_crds}" "${explicit_crds}" \ + | awk 'NF' \ + | sort -u \ + | while read -r crd_name; do + check_crd_for_stuck_resources "${crd_name}" "${release}" + done + return 0 +} + +# ============================================================================== +# Pre-flight checks +# ============================================================================== +# Verify no custom resources with active finalizers exist for CRDs owned by +# bundle operators. After helm uninstall removes the operator, CRs with +# finalizers cannot be reconciled — blocking CRD deletion indefinitely. + +if [[ "${SKIP_PREFLIGHT}" == "true" ]]; then + echo "Skipping pre-flight checks (--skip-preflight)." +else + echo "Running pre-flight checks..." + PREFLIGHT_DETAILS=$(mktemp "${TMPDIR:-/tmp}/.aicr_preflight_XXXXXX") + PREFLIGHT_ALL_CRDS_JSON="" + + if ! 
capture_kubectl_json PREFLIGHT_ALL_CRDS_JSON get crd -o json; then + echo "" >&2 + echo "ERROR: Pre-flight could not list CRDs." >&2 + echo " kubectl output: ${PREFLIGHT_ALL_CRDS_JSON}" >&2 + echo " Failing closed: if this is a transient API/auth error the script" >&2 + echo " may remove an operator before finalizers are reconciled." >&2 + echo " Re-run after the API recovers, or use --skip-preflight to bypass." >&2 + rm -f "${PREFLIGHT_DETAILS}" + exit 1 + fi + if skip_preflight_for_release "gpu-operator"; then + echo " Skipping gpu-operator (gpu-operator): bundle deletes dependent manifests before controller uninstall." + else + echo " Checking gpu-operator (gpu-operator)..." + check_release_for_stuck_crds "gpu-operator" "gpu-operator" + fi + + if [[ -s "${PREFLIGHT_DETAILS}" ]]; then + echo "" + echo "ERROR: Found custom resources with active finalizers that will block undeploy." + echo " After the operator is removed, these finalizers cannot be processed —" + echo " causing an unrecoverable hang during CRD deletion." + echo "" + echo " Delete these resources while their controller is still running," + echo " then re-run ./undeploy.sh" + echo "" + cat "${PREFLIGHT_DETAILS}" + echo "" + echo " To skip this check: ./undeploy.sh --skip-preflight" + rm -f "${PREFLIGHT_DETAILS}" + exit 1 + fi + rm -f "${PREFLIGHT_DETAILS}" + + echo "Pre-flight checks passed." +fi + +# ============================================================================== +# Uninstall components in reverse install order +# ============================================================================== +# Generic reverse loop: every folder is a Helm release (local-helm or upstream-helm). +# `helm uninstall ` works uniformly for both kinds — that's one of the +# benefits of the uniform local-chart bundle format. 
+for dir in $(ls -1d "${SCRIPT_DIR}"/[0-9][0-9][0-9]-*/ 2>/dev/null | sort -r); do + [[ -d "${dir}" ]] || continue + base="${dir%/}" + base="${base##*/}" + name="${base#[0-9][0-9][0-9]-}" + # Derive the namespace from the component name via the per-component blocks + # below. We still need it for helm_force_uninstall and cluster-resource cleanup. + ns="" + if [[ "${name}" == "gpu-operator" ]]; then ns="gpu-operator"; fi + # Injected mixed-component "-post" folders share their parent's namespace. + if [[ "${name}" == "gpu-operator-post" ]]; then ns="gpu-operator"; fi + if [[ -z "${ns}" ]]; then + echo "Warning: no namespace known for ${name}; skipping uninstall" >&2 + continue + fi + echo "Uninstalling ${name} (${ns})..." + # For local-helm folders (Chart.yaml + templates/), kubectl-delete the + # rendered templates BEFORE helm uninstall. The templates may carry + # helm.sh/hook annotations (post-install/post-upgrade) which helm does + # not track or clean up on uninstall — so without this pre-delete, the + # operator gets removed but its hook-created CRs (and their finalizers) + # linger. Doing the delete while the controller is still running lets + # finalizers clear naturally. + if [[ -d "${dir}/templates" ]]; then + for tpl in "${dir}/templates/"*.yaml; do + [[ -f "${tpl}" ]] || continue + kubectl delete -n "${ns}" -f "${tpl}" --ignore-not-found --timeout="${HELM_TIMEOUT}s" || true + done + fi + helm_force_uninstall "${name}" "${ns}" + delete_release_cluster_resources "${name}" "${ns}" + delete_orphaned_webhooks_for_ns "${ns}" +done + +# Remove nodewright node taints that persist after operator removal. +# Nodewright taints nodes during kernel tuning. The taint key is configurable +# via runtimeRequiredTaint (defaults to skyhook.nvidia.com). + +# Clean up orphaned CRDs that were owned by this bundle's releases. +# Only delete CRDs whose Helm release annotation matches a component we just uninstalled. 
+kubectl get crd -o json 2>/dev/null \ + | jq -r --arg rel "gpu-operator" --arg ns "gpu-operator" \ + '.items[] | select(.metadata.annotations["meta.helm.sh/release-name"]==$rel and .metadata.annotations["meta.helm.sh/release-namespace"]==$ns) | .metadata.name' 2>/dev/null \ + | while read -r name; do + echo "Deleting CRD ${name} (owned by gpu-operator/gpu-operator)..." + kubectl delete crd "${name}" --ignore-not-found --wait=false \ + || echo "Warning: failed to delete CRD ${name} (owned by gpu-operator/gpu-operator); leftovers will surface in post-flight" >&2 + done || echo "Warning: orphan-CRD cleanup for gpu-operator/gpu-operator failed (kubectl get / jq error); leftovers will surface in post-flight" >&2 + +# Intentionally skip automatic deletion of unannotated CRDs matched only by +# API group. On shared clusters, those CRDs may be serving another tenant's +# release in the same group, and we do not have bundle-specific ownership +# metadata to distinguish "ours" from "theirs" safely. + +# Warn about CRDs stuck in deleting state (e.g., customresourcecleanup finalizer +# can't be resolved because CR instances still have controller-managed finalizers). +stuck_crds=$(kubectl get crd -o json 2>/dev/null | jq -r '.items[] | select(.metadata.deletionTimestamp != null) | .metadata.name' 2>/dev/null || true) +if [[ -n "${stuck_crds}" ]]; then + echo "" + echo "WARNING: CRDs stuck in deleting state:" + for crd in ${stuck_crds}; do + echo " ${crd}" + done + echo "" + echo " These CRDs have a deletionTimestamp but cannot complete deletion because" + echo " their customresourcecleanup finalizer is waiting for CR instances to be" + echo " removed. If you are certain no data will be lost, you can force-clear the" + echo " finalizers. Note: this may leave orphaned CR data in etcd that is no longer" + echo " accessible through the API." 
+ echo "" + echo " Commands to force-clear (review before running):" + for crd in ${stuck_crds}; do + echo " kubectl patch crd ${crd} --type merge -p '{\"metadata\":{\"finalizers\":null}}'" + done +fi + +# Clean up namespaces after all components are uninstalled. +if [[ "${DELETE_PVCS}" == "true" ]] && ! echo " ${PROTECTED_NS} " | grep -q " gpu-operator "; then + echo "Deleting PVCs in gpu-operator..." + kubectl delete pvc --all -n gpu-operator --ignore-not-found || true +fi +delete_orphaned_webhooks_for_ns "gpu-operator" +delete_namespace "gpu-operator" + +# Clean up companion namespaces created at runtime by operators. +# Only emitted for components whose runtime creates them. + +# Wait for terminating namespaces to finish +echo "Waiting for namespaces to terminate..." +for i in $(seq 1 60); do + TERMINATING=$(kubectl get ns --no-headers 2>/dev/null | grep Terminating | awk '{print $1}' || true) + if [[ -z "${TERMINATING}" ]]; then + break + fi + if [[ $i -eq 60 ]]; then + echo "Warning: namespaces still terminating after 60s: ${TERMINATING}" + for ns in ${TERMINATING}; do + force_clear_namespace_finalizers "${ns}" + kubectl delete namespace "${ns}" --ignore-not-found --wait=false 2>/dev/null || true + done + break + fi + sleep 1 +done + +# Final webhook cleanup pass. +delete_orphaned_webhooks_for_ns "gpu-operator" + +# ============================================================================== +# Post-flight verification +# ============================================================================== + +postflight_issues=false + +TERMINATING=$(kubectl get namespaces -o jsonpath='{range .items[?(@.status.phase=="Terminating")]}{.metadata.name}{" "}{end}' 2>/dev/null || true) +if [[ -n "${TERMINATING}" ]]; then + echo "WARNING: namespaces still terminating: ${TERMINATING}" + echo " A subsequent deploy.sh may fail. Wait or force-finalize these namespaces." 
+ postflight_issues=true +fi + +kubectl get mutatingwebhookconfigurations,validatingwebhookconfigurations -o json 2>/dev/null | \ + jq -r '.items[] | .metadata.name as $wh | .webhooks[]? | select(.clientConfig.service != null) | [$wh, .clientConfig.service.namespace, .clientConfig.service.name] | @tsv' 2>/dev/null | \ + sort -u | \ + while IFS=$'\t' read -r wh_name svc_ns svc_name; do + if ! kubectl get ns "${svc_ns}" &>/dev/null || ! kubectl get svc "${svc_name}" -n "${svc_ns}" &>/dev/null; then + echo "Deleting stale webhook ${wh_name} (service ${svc_ns}/${svc_name} missing)..." + kubectl delete mutatingwebhookconfiguration "${wh_name}" --ignore-not-found 2>/dev/null || true + kubectl delete validatingwebhookconfiguration "${wh_name}" --ignore-not-found 2>/dev/null || true + fi + done || true + +stale_apis=$(kubectl get apiservices -o json 2>/dev/null | jq -r '.items[] | select(.status.conditions[]? | .type == "Available" and .status == "False") | .metadata.name' 2>/dev/null || true) +if [[ -n "${stale_apis}" ]]; then + echo "WARNING: unavailable API services found: ${stale_apis}" + echo " These can block namespace deletion. Delete with: kubectl delete apiservice " + postflight_issues=true +fi + +# Check for Helm-annotated CRDs from uninstalled releases. 
+helm_orphaned_crds="" +explicit_orphaned_crds="" +postflight_all_crds_json="" +if capture_kubectl_json postflight_all_crds_json get crd -o json; then + : + remaining_helm_crds=$(echo "${postflight_all_crds_json}" \ + | jq -r --arg rel "gpu-operator" --arg ns "gpu-operator" \ + '.items[] | select(.metadata.annotations["meta.helm.sh/release-name"]==$rel and .metadata.annotations["meta.helm.sh/release-namespace"]==$ns and .metadata.deletionTimestamp==null) | .metadata.name' 2>/dev/null || true) + if [[ -n "${remaining_helm_crds}" ]]; then + helm_orphaned_crds="${helm_orphaned_crds} ${remaining_helm_crds}" + fi + while read -r crd_name; do + [[ -z "${crd_name}" ]] && continue + remaining_explicit_crd=$(echo "${postflight_all_crds_json}" \ + | jq -r --arg name "${crd_name}" \ + '.items[] | select(.metadata.name==$name and .metadata.deletionTimestamp==null) | .metadata.name' 2>/dev/null || true) + if [[ -n "${remaining_explicit_crd}" ]]; then + explicit_orphaned_crds="${explicit_orphaned_crds}${remaining_explicit_crd}"$'\n' + fi + done < <(extra_crds_for_release "gpu-operator") +else + echo "Warning: failed to enumerate post-flight CRDs; kubectl output: ${postflight_all_crds_json}" >&2 + postflight_issues=true +fi +if [[ -n "${helm_orphaned_crds}" ]]; then + echo "WARNING: Helm-annotated CRDs from uninstalled releases still present:${helm_orphaned_crds}" + echo " Cleanup did not remove all CRDs owned by this bundle's releases." + echo " Delete with: kubectl delete crd " + postflight_issues=true +fi + +explicit_orphaned_crds=$(printf '%s' "${explicit_orphaned_crds}" | awk 'NF' | sort -u | tr '\n' ' ') +if [[ -n "${explicit_orphaned_crds}" ]]; then + echo "WARNING: explicit CRDs from this bundle still present: ${explicit_orphaned_crds}" + echo " These CRDs are installed outside Helm manifest/annotation discovery." 
+ echo " Delete with: kubectl delete crd " + postflight_issues=true +fi + +if [[ "${postflight_issues}" == "true" ]]; then + echo "" + echo "Post-flight: some stale resources remain. Run deploy.sh pre-flight checks to verify before redeploying." +else + echo "Post-flight: cluster is clean." +fi + +echo "Undeployment complete." diff --git a/pkg/bundler/deployer/helm/testdata/nodewright_present/001-nodewright-operator/cluster-values.yaml b/pkg/bundler/deployer/helm/testdata/nodewright_present/001-nodewright-operator/cluster-values.yaml new file mode 100644 index 000000000..9c936cbd8 --- /dev/null +++ b/pkg/bundler/deployer/helm/testdata/nodewright_present/001-nodewright-operator/cluster-values.yaml @@ -0,0 +1,2 @@ +# Generated by Cloud Native Stack +--- diff --git a/pkg/bundler/deployer/helm/testdata/nodewright_present/001-nodewright-operator/install.sh b/pkg/bundler/deployer/helm/testdata/nodewright_present/001-nodewright-operator/install.sh new file mode 100644 index 000000000..20b32d136 --- /dev/null +++ b/pkg/bundler/deployer/helm/testdata/nodewright_present/001-nodewright-operator/install.sh @@ -0,0 +1,29 @@ +#!/usr/bin/env bash +# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +set -euo pipefail +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +cd "${SCRIPT_DIR}" +# shellcheck source=/dev/null +source ./upstream.env + +# CHART carries the full OCI URI for OCI charts and just the chart name for +# HTTP/HTTPS charts. REPO is non-empty only for HTTP/HTTPS charts; the +# ${REPO:+--repo "${REPO}"} expansion adds --repo only if REPO is set and non-empty. +helm upgrade --install nodewright-operator "${CHART}" \ + ${REPO:+--repo "${REPO}"} --version "${VERSION}" \ + --namespace skyhook --create-namespace \ + -f values.yaml -f cluster-values.yaml \ + ${COMPONENT_WAIT_ARGS:-} ${DRY_RUN_FLAG:-} ${KUBECONFIG_FLAG:-} ${HELM_DEBUG_FLAG:-} diff --git a/pkg/bundler/deployer/helm/testdata/nodewright_present/001-nodewright-operator/upstream.env b/pkg/bundler/deployer/helm/testdata/nodewright_present/001-nodewright-operator/upstream.env new file mode 100644 index 000000000..0bb2f06f7 --- /dev/null +++ b/pkg/bundler/deployer/helm/testdata/nodewright_present/001-nodewright-operator/upstream.env @@ -0,0 +1,3 @@ +CHART='skyhook-operator' +REPO='https://example.invalid/charts' +VERSION='v0.1.0' diff --git a/pkg/bundler/deployer/helm/testdata/nodewright_present/001-nodewright-operator/values.yaml b/pkg/bundler/deployer/helm/testdata/nodewright_present/001-nodewright-operator/values.yaml new file mode 100644 index 000000000..9c936cbd8 --- /dev/null +++ b/pkg/bundler/deployer/helm/testdata/nodewright_present/001-nodewright-operator/values.yaml @@ -0,0 +1,2 @@ +# Generated by Cloud Native Stack +--- diff --git a/pkg/bundler/deployer/helm/testdata/nodewright_present/README.md b/pkg/bundler/deployer/helm/testdata/nodewright_present/README.md new file mode 100644 index 000000000..69467173d --- /dev/null +++ b/pkg/bundler/deployer/helm/testdata/nodewright_present/README.md @@ -0,0 +1,107 @@ +# Cloud Native Stack Deployment + +Recipe Version: v0.1.0 +Bundler Version: v1.0.0 + +Per-component bundle for deploying NVIDIA Cloud Native Stack components +for GPU-accelerated
Kubernetes workloads. + +## Configuration + + + +## Components + +The following components are included (deployed in order). Each component +lives in a numbered `NNN-/` folder and is installed as a Helm release +via its own `install.sh`: + +| Component | Version | Namespace | Source | +|-----------|---------|-----------|--------| +| nodewright-operator | v0.1.0 | skyhook | skyhook-operator (https://example.invalid/charts) | + + + + +## Quick Start + +Run the included deployment script: + +```bash +chmod +x deploy.sh +./deploy.sh +``` + +Use `--no-wait` to skip Helm chart-level waiting where AICR uses `--wait` (keeps `--timeout` for hooks): + +```bash +./deploy.sh --no-wait +``` + +> **Note:** The deploy script's final status reflects install/apply results. If `--best-effort` was used, one or more components may still have failed; check warning lines and logs. This does **not** guarantee the cluster is ready to schedule workloads — operator-driven cluster convergence (CRD reconciliation, node tuning, plugin registration, etc.) continues asynchronously after the script exits, in operator-specific ways. See the [AICR CLI Reference](https://github.com/NVIDIA/aicr/blob/main/docs/user/cli-reference.md#deploy-script-behavior-deploysh) for details. + +## Manual Installation + +Each component folder contains an `install.sh` that runs `helm upgrade --install` +with the right arguments baked in. To install a single component manually: + +```bash +cd NNN- +bash install.sh +``` + +## Customization + +Each component folder has its own `values.yaml` (static) and `cluster-values.yaml` +(dynamic, per-cluster). 
Edit either before deploying: + +```bash +vim NNN-/values.yaml +vim NNN-/cluster-values.yaml +``` + +## Upgrade + +Re-run the per-component install.sh to upgrade an already-installed release: + +```bash +cd NNN- +bash install.sh +``` + +## Uninstall + +To remove components (reverse order): + +```bash +./undeploy.sh +``` + +Or remove a single release manually: + +```bash +helm uninstall nodewright-operator -n skyhook +``` + + +## Troubleshooting + +### Check deployment status + +```bash +kubectl get pods -A | grep -E 'nodewright-operator' +``` + +### View component logs + +Inspect a single component's pods (replace `` and `` +with one of the entries from the table above): + +```bash +kubectl logs -n -l app.kubernetes.io/instance= +``` + + +## References + +- [AICR CLI Reference](https://github.com/NVIDIA/aicr/blob/main/docs/user/cli-reference.md) diff --git a/pkg/bundler/deployer/helm/testdata/nodewright_present/deploy.sh b/pkg/bundler/deployer/helm/testdata/nodewright_present/deploy.sh new file mode 100644 index 000000000..e1f29403c --- /dev/null +++ b/pkg/bundler/deployer/helm/testdata/nodewright_present/deploy.sh @@ -0,0 +1,351 @@ +#!/usr/bin/env bash +# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +set -euo pipefail + +# Cloud Native Stack Deployment Script +# Generated by AICR Bundler v1.0.0 +# +# Usage: ./deploy.sh [--no-wait] [--best-effort] [--retries N] +# --no-wait Skip Helm chart-level wait where AICR uses --wait (keeps --timeout for hooks) +# --best-effort Continue past individual component failures (log warnings) +# --retries N Retry failed helm/kubectl operations N times with backoff (default: 5, 0 = fail-fast) +# +# This script is optional — each component subdirectory has its own install.sh +# with a single baked-in `helm upgrade --install` command. For detailed behavior +# docs (CRD ordering, async components, error handling), see the AICR CLI Reference: +# https://github.com/NVIDIA/aicr/blob/main/docs/user/cli-reference.md#deploy-script-behavior-deploysh + +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" + +# Run helm commands from a temp directory to prevent local chart directories +# (e.g., bundle/001-nodewright-operator/) from shadowing remote chart references. +HELM_WORKDIR="$(mktemp -d)" +trap 'rm -rf "${HELM_WORKDIR}"; exit 130' INT TERM +trap 'rm -rf "${HELM_WORKDIR}"' EXIT + +HELM_TIMEOUT="10m" +NO_WAIT=false +BEST_EFFORT=false +FAILED_COMPONENTS="" +MAX_RETRIES=5 + +while [[ $# -gt 0 ]]; do + case "$1" in + --no-wait) NO_WAIT=true; shift ;; + --best-effort) BEST_EFFORT=true; shift ;; + --retries) + if [[ $# -lt 2 ]]; then echo "Error: --retries requires a value"; exit 1; fi + if ! [[ "$2" =~ ^[0-9]+$ ]]; then echo "Error: --retries requires a non-negative integer"; exit 1; fi + MAX_RETRIES="$2"; shift 2 ;; + *) echo "Error: unknown option: $1"; echo "Usage: ./deploy.sh [--no-wait] [--best-effort] [--retries N]"; exit 1 ;; + esac +done + +# Export env vars consumed by each folder's install.sh (rendered by localformat). +# DRY_RUN_FLAG / KUBECONFIG_FLAG / HELM_DEBUG_FLAG default to empty strings. 
+export DRY_RUN_FLAG="${DRY_RUN_FLAG:-}" +export KUBECONFIG_FLAG="${KUBECONFIG_FLAG:-}" +export HELM_DEBUG_FLAG="${HELM_DEBUG_FLAG:-}" + +function helm_failed() { + if [[ "${BEST_EFFORT}" == "true" ]]; then + echo "WARNING: $1 install failed, continuing (--best-effort)" + FAILED_COMPONENTS="${FAILED_COMPONENTS} $1" + else + exit 1 + fi +} + +# Compute backoff delay from attempt number (1-indexed). +# Examples: attempt 1→5s, 2→20s, 3→45s, 4→80s, 5→120s (cap) +function backoff_seconds() { + local attempt=$1 + local seconds=$(( attempt * attempt * 5 )) + if [[ ${seconds} -gt 120 ]]; then seconds=120; fi + echo "${seconds}" +} + +# Clean up stale Helm hook Jobs before retrying. When a hook Job (e.g., +# crd-upgrader) times out or fails, it stays in the namespace and blocks +# subsequent install attempts with "Job not ready" errors. +# Helm hooks are identified by the helm.sh/hook *annotation* (not a label), +# so we list all non-succeeded Jobs and check each individually via JSON. +function cleanup_helm_hooks() { + local namespace="$1" + local job_names + job_names=$(kubectl get jobs -n "${namespace}" \ + --field-selector=status.successful=0 \ + -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' \ + 2>/dev/null || true) + if [[ -z "${job_names}" ]]; then + return + fi + while IFS= read -r name; do + [[ -z "${name}" ]] && continue + # Get the full Job JSON to reliably check annotations and status + local job_json + job_json=$(kubectl get job "${name}" -n "${namespace}" -o json 2>/dev/null || true) + [[ -z "${job_json}" ]] && continue + # Skip non-hook Jobs (no helm.sh/hook annotation) + local hook_val + hook_val=$(echo "${job_json}" | grep -o '"helm.sh/hook"' || true) + [[ -z "${hook_val}" ]] && continue + # Capture diagnostics before deleting. This helps diagnose transient hook + # failures (e.g., dynamo ssh-keygen) that are otherwise lost after cleanup. 
+ echo " --- Failed hook Job ${name} diagnostics ---" + kubectl describe job "${name}" -n "${namespace}" 2>/dev/null | tail -50 || true + local pod_names + pod_names=$(kubectl get pods -n "${namespace}" -l "job-name=${name}" \ + -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' 2>/dev/null || true) + for pod_name in ${pod_names}; do + echo " --- Hook pod ${pod_name} describe ---" + kubectl describe pod "${pod_name}" -n "${namespace}" 2>/dev/null | tail -50 || true + done + echo " --- End diagnostics for ${name} ---" + # Delete any non-succeeded hook Job. This function only runs after a Helm + # failure, so any hook Job without a successful completion is blocking the + # retry — whether it failed, is stuck Pending (timed out before the pod + # started), or is still active with a stuck container. + echo " Cleaning up stale Helm hook Job ${name} in ${namespace}..." + kubectl delete job "${name}" -n "${namespace}" --ignore-not-found 2>/dev/null || true + done <<< "${job_names}" +} + +function dump_kai_scheduler_helm_diagnostics() { + local namespace="$1" + if [[ "${namespace}" != "kai-scheduler" ]]; then + return + fi + + echo " --- ${namespace} diagnostics ---" + echo " Jobs:" + kubectl get jobs -n "${namespace}" 2>/dev/null || true + echo " Job descriptions:" + kubectl describe jobs -n "${namespace}" 2>/dev/null || true + echo " Pods:" + kubectl get pods -n "${namespace}" -o wide 2>/dev/null || true + echo " Pod descriptions:" + kubectl describe pods -n "${namespace}" 2>/dev/null || true + echo " Recent events:" + kubectl get events -n "${namespace}" --sort-by='.lastTimestamp' 2>/dev/null | tail -30 || true + echo " --- End ${namespace} diagnostics ---" +} + +# Components that use operator patterns with custom resources that reconcile +# asynchronously. Helm --wait may time out waiting for CR readiness even though +# all pods start successfully. These components are installed without --wait. 
+ASYNC_COMPONENTS="kai-scheduler" + +# ============================================================================== +# Pre-flight checks +# ============================================================================== +# Verify the cluster is clean before deploying. Stale webhooks, terminating +# namespaces, and orphaned API services from a previous install can block pod +# creation and namespace deletion, causing silent deployment failures. + +echo "Running pre-flight checks..." + +preflight_failed=false + +# Bundle namespace list (deduplicated) +BUNDLE_NAMESPACES=$(echo "skyhook " | tr ' ' '\n' | sort -u | tr '\n' ' ') + +# Check for terminating namespaces that overlap with our components +for ns in ${BUNDLE_NAMESPACES}; do + phase=$(kubectl get ns "${ns}" -o jsonpath='{.status.phase}' 2>/dev/null || true) + if [[ "${phase}" == "Terminating" ]]; then + echo "ERROR: namespace '${ns}' is still terminating from a previous install." + echo " Wait for it to finish, or force-finalize with:" + echo " kubectl get ns ${ns} -o json | jq '.spec.finalizers=[]' | kubectl replace --raw /api/v1/namespaces/${ns}/finalize -f -" + preflight_failed=true + fi +done + +# Check for stale webhooks whose backing services no longer exist. +# Scoped to bundle namespaces only to avoid false positives from unrelated +# platform webhooks in shared clusters. 
+if command -v jq &>/dev/null; then + for kind in mutatingwebhookconfigurations validatingwebhookconfigurations; do + while IFS=$'\t' read -r wh_name svc_ns svc_name; do + # Only check webhooks pointing to our bundle namespaces + is_bundle_ns=false + for ns in ${BUNDLE_NAMESPACES}; do + [[ "${svc_ns}" == "${ns}" ]] && is_bundle_ns=true && break + done + [[ "${is_bundle_ns}" == "false" ]] && continue + + # Use explicit NotFound check to avoid false positives from transient errors + svc_check=$(kubectl get svc "${svc_name}" -n "${svc_ns}" 2>&1) || true + if echo "${svc_check}" | grep -q "NotFound\|not found"; then + echo "ERROR: ${kind} '${wh_name}' references non-existent service ${svc_ns}/${svc_name}." + echo " This will block pod/resource creation. Delete with: kubectl delete ${kind} ${wh_name}" + preflight_failed=true + fi + done < <(kubectl get "${kind}" -o json 2>/dev/null | \ + jq -r '.items[] | .metadata.name as $wh | .webhooks[]? | select(.clientConfig.service != null) | [$wh, .clientConfig.service.namespace, .clientConfig.service.name] | @tsv' 2>/dev/null || true) + done +else + echo "NOTE: jq not found — skipping webhook pre-flight checks. Install jq for full pre-flight validation." +fi + +# Check for stale API services (e.g., custom.metrics.k8s.io from prometheus-adapter) +if command -v jq &>/dev/null; then + for api_svc in $(kubectl get apiservices -o json 2>/dev/null | jq -r '.items[] | select(.status.conditions[]? | .type == "Available" and .status == "False") | .metadata.name' 2>/dev/null || true); do + echo "WARNING: API service '${api_svc}' is unavailable. This can block namespace deletion." + echo " Delete with: kubectl delete apiservice ${api_svc}" + # API service issues are warnings, not hard failures — they don't block deployment directly + done +else + echo "NOTE: jq not found — skipping API service pre-flight checks." +fi + +# Check for orphaned CRDs from previous deployments. 
+# Scoped to CRD groups belonging to components in this bundle to avoid +# false positives from unrelated platform installs on shared clusters. +ORPHANED_CRD_GROUPS="" +for group in ${ORPHANED_CRD_GROUPS}; do + orphaned=$(kubectl get crd -o name 2>/dev/null | grep "\.${group}$" || true) + if [[ -n "${orphaned}" ]]; then + echo "WARNING: orphaned CRDs from previous deployment: ${orphaned}" + echo " These may cause conflicts. Delete with: kubectl delete ${orphaned}" + fi +done + +# Check for stale nodewright node taints from a previous deployment. +# Only remove taints if nodewright-operator is NOT already running (i.e., fresh deploy). +# If the operator is running, taints are legitimate scheduling guards. +# Only remove taints if nodewright-operator is not running with available replicas. +# A crashlooping or scaled-to-zero operator still leaves stale taints. +nodewright_available=$(kubectl get deploy -n skyhook -l app.kubernetes.io/name=skyhook-operator -o jsonpath='{.items[0].status.availableReplicas}' 2>/dev/null || echo "0") +if [[ "${nodewright_available}" == "0" || -z "${nodewright_available}" ]]; then + # Extract the taint key from bundle values. The YAML value is a taint string + # like "custom.io/gate=true:NoSchedule" — extract the key before the first "=". + # If runtimeRequiredTaint is not set, use the default skyhook.nvidia.com. + NODEWRIGHT_TAINT_KEY="skyhook.nvidia.com" + # Locate nodewright-operator's NNN-prefixed directory at runtime. 
+ nodewright_dir="$(ls -d "${SCRIPT_DIR}"/[0-9][0-9][0-9]-nodewright-operator 2>/dev/null | head -1)" + if [[ -n "${nodewright_dir}" && -f "${nodewright_dir}/values.yaml" ]]; then + custom_taint_line=$(grep 'runtimeRequiredTaint:' "${nodewright_dir}/values.yaml" 2>/dev/null || true) + if [[ -n "${custom_taint_line}" ]]; then + taint_value=$(echo "${custom_taint_line}" | sed 's/.*runtimeRequiredTaint:[[:space:]]*//' | tr -d '"' | tr -d "'") + if [[ -n "${taint_value}" ]]; then + # Handle both key=value:effect and key:effect formats + NODEWRIGHT_TAINT_KEY="${taint_value%%=*}" + NODEWRIGHT_TAINT_KEY="${NODEWRIGHT_TAINT_KEY%%:*}" + fi + fi + fi + stale_nodewright=$(kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{" "}{range .spec.taints[*]}{.key}{" "}{end}{"\n"}{end}' 2>/dev/null | grep "${NODEWRIGHT_TAINT_KEY}" | awk '{print $1}' || true) + if [[ -n "${stale_nodewright}" ]]; then + echo "WARNING: nodes with stale ${NODEWRIGHT_TAINT_KEY} taints (no running nodewright-operator): ${stale_nodewright}" + echo " Removing stale taints to unblock scheduling..." + for node in ${stale_nodewright}; do + kubectl taint node "${node}" "${NODEWRIGHT_TAINT_KEY}-" 2>/dev/null || true + done + fi +fi + +if [[ "${preflight_failed}" == "true" ]]; then + echo "" + echo "Pre-flight checks failed. Fix the issues above before deploying." + echo "To skip pre-flight checks, run: ./undeploy.sh first, then retry." + exit 1 +fi + +echo "Pre-flight checks passed." +echo "Deploying Cloud Native Stack components..." + +# ============================================================================== +# Install loop +# ============================================================================== +# Generic install loop. Each folder's install.sh is rendered by localformat +# with the right helm command baked in — deploy.sh has no per-component +# knowledge here. 
Per-component special-case logic (async wait, DRA plugin +# restart) runs in the post-install blocks below, matched by component name. +cd "${HELM_WORKDIR}" +for dir in "${SCRIPT_DIR}"/[0-9][0-9][0-9]-*/; do + [[ -d "${dir}" ]] || continue + dir="${dir%/}" + base="${dir##*/}" + name="${base#[0-9][0-9][0-9]-}" + # Source the namespace from the folder's install.sh — not the folder + # basename — because the helm release name and its target namespace can + # differ (e.g. nodewright-operator → namespace skyhook; gpu-operator-post → + # namespace gpu-operator). cleanup_helm_hooks and the kai diagnostics + # both operate on the namespace. + namespace=$(awk '{ for (i=1;i/dev/null; then + echo "Error: jq is required but not found in PATH." + echo " Install jq: https://jqlang.github.io/jq/download/" + exit 1 +fi + +echo "Undeploying Cloud Native Stack components (timeout: ${HELM_TIMEOUT}s)..." +echo "" + +echo " The following components will be removed (in reverse install order):" +# List NNN-* folders in reverse numeric order +for dir in $(ls -1d "${SCRIPT_DIR}"/[0-9][0-9][0-9]-*/ 2>/dev/null | sort -r); do + [[ -d "${dir}" ]] || continue + base="${dir%/}" + base="${base##*/}" + name="${base#[0-9][0-9][0-9]-}" + printf " %-40s\n" "${name}" +done +echo "" + +# System namespaces that must not be deleted +PROTECTED_NS="kube-system kube-public kube-node-lease default" + +delete_namespace() { + local ns="$1" + if [[ "${KEEP_NS}" == "true" ]]; then return; fi + if echo " ${PROTECTED_NS} " | grep -q " ${ns} "; then return; fi + if ! kubectl get namespace "${ns}" &>/dev/null; then return; fi + echo "Deleting namespace ${ns}..." + kubectl delete namespace "${ns}" --ignore-not-found --wait=false +} + +# Uninstall a Helm release, handling stuck pending states from interrupted deploys. +# Try normal uninstall first; if it fails, retry with --no-hooks to force removal. 
+helm_force_uninstall() { + local release="$1" + local ns="$2" + if helm uninstall "${release}" -n "${ns}" --timeout "${HELM_TIMEOUT}s" --ignore-not-found 2>/dev/null; then + return + fi + echo " Retrying ${release} removal with --no-hooks..." + helm uninstall "${release}" -n "${ns}" --no-hooks --timeout "${HELM_TIMEOUT}s" --ignore-not-found || true +} + +# Delete cluster-scoped resources owned by a Helm release. +# These survive namespace deletion and can block subsequent deployments: +# - Webhooks block pod creation when their backing service is gone +# - CRDs with "helm.sh/resource-policy: keep" are retained after chart removal +delete_release_cluster_resources() { + local release="$1" + local ns="$2" + local selector="app.kubernetes.io/managed-by=Helm" + for kind in mutatingwebhookconfigurations validatingwebhookconfigurations customresourcedefinitions; do + kubectl get "${kind}" -l "${selector}" -o json 2>/dev/null \ + | jq -r --arg rel "${release}" --arg ns "${ns}" \ + '.items[] | select(.metadata.annotations["meta.helm.sh/release-name"]==$rel and .metadata.annotations["meta.helm.sh/release-namespace"]==$ns) | .metadata.name' 2>/dev/null \ + | while read -r name; do + echo "Deleting ${kind}/${name}..." + kubectl delete "${kind}" "${name}" --ignore-not-found --timeout="${HELM_TIMEOUT}s" || true + done || echo "Warning: ${kind} cleanup pipeline for release ${release}/${ns} failed (kubectl get / jq error); leftovers will surface in post-flight" >&2 + done +} + +# Delete webhooks whose backing service is in a specific namespace and no longer exists. +# Scoped to the given namespace to avoid touching unrelated platform webhooks. +# Operator-created webhooks (e.g., kai-scheduler admission) may not carry Helm labels, +# but once their service namespace is deleted, fail-closed webhooks block pod creation. 
+delete_orphaned_webhooks_for_ns() { + local ns="$1" + for kind in mutatingwebhookconfigurations validatingwebhookconfigurations; do + { kubectl get "${kind}" -o json 2>/dev/null \ + | jq -r --arg ns "${ns}" \ + '.items[] | .metadata.name as $wh | .webhooks[] | select(.clientConfig.service != null and .clientConfig.service.namespace == $ns) | [$wh, .clientConfig.service.name] | @tsv' 2>/dev/null \ + | sort -u || true; } \ + | while IFS=$'\t' read -r wh_name svc_name; do + # Delete when namespace is gone, terminating, or backing service is missing. + # Skip on transient errors (auth, timeout, DNS) to avoid removing valid webhooks. + local ns_output ns_phase svc_output + ns_output=$(kubectl get ns "${ns}" 2>&1) || true + if echo "${ns_output}" | grep -q "NotFound"; then + echo "Deleting orphaned ${kind}/${wh_name} (namespace ${ns} not found)..." + kubectl delete "${kind}" "${wh_name}" --ignore-not-found + continue + fi + ns_phase=$(kubectl get ns "${ns}" -o jsonpath='{.status.phase}' 2>/dev/null) || true + if [[ "${ns_phase}" == "Terminating" ]]; then + echo "Deleting orphaned ${kind}/${wh_name} (namespace ${ns} terminating)..." + kubectl delete "${kind}" "${wh_name}" --ignore-not-found + continue + fi + svc_output=$(kubectl get svc "${svc_name}" -n "${ns}" 2>&1) || true + if echo "${svc_output}" | grep -q "NotFound"; then + echo "Deleting orphaned ${kind}/${wh_name} (service ${ns}/${svc_name} not found)..." + kubectl delete "${kind}" "${wh_name}" --ignore-not-found + fi + done + done +} + +# Force-clear finalizers on all namespaced resources to unstick a Terminating namespace. +force_clear_namespace_finalizers() { + local ns="$1" + echo "Force-removing finalizers in namespace ${ns}..." 
+ local kinds + kinds=$(kubectl api-resources --verbs=list --namespaced -o name 2>/dev/null) || { + echo "Warning: failed to enumerate namespaced resource kinds in ${ns}; namespace may stay Terminating" >&2 + return + } + for kind in ${kinds}; do + kubectl get "${kind}" -n "${ns}" -o json 2>/dev/null \ + | jq -r '.items[] | select(.metadata.finalizers // [] | length > 0) | .kind + "/" + .metadata.name' 2>/dev/null \ + | while read -r resource; do + kubectl patch "${resource}" -n "${ns}" --type merge -p '{"metadata":{"finalizers":null}}' 2>/dev/null || true + done || echo "Warning: finalizer-clear pipeline for ${kind} in ${ns} failed (kubectl get / jq error); namespace may stay Terminating" >&2 + done +} + +# Return a small, explicit list of known crds/-installed CRDs whose +# finalizer-bearing custom resources must still be caught before the operator +# is removed. Keep this list intentionally tiny and exact: if a release is not +# listed here, pre-flight relies on chart-manifest and Helm-annotation discovery +# only instead of trying to infer ownership across the whole cluster. 
+extra_crds_for_release() { + case "$1" in + gpu-operator) + printf '%s\n' \ + "clusterpolicies.nvidia.com" \ + "nvidiadrivers.nvidia.com" \ + "nodefeaturegroups.nfd.k8s-sigs.io" \ + "nodefeaturerules.nfd.k8s-sigs.io" \ + "nodefeatures.nfd.k8s-sigs.io" + ;; + kai-scheduler) + printf '%s\n' \ + "bindrequests.scheduling.run.ai" \ + "configs.kai.scheduler" \ + "podgroups.scheduling.run.ai" \ + "queues.scheduling.run.ai" \ + "schedulingshards.kai.scheduler" \ + "topologies.kai.scheduler" + ;; + k8s-nim-operator) + printf '%s\n' \ + "nemocustomizers.apps.nvidia.com" \ + "nemodatastores.apps.nvidia.com" \ + "nemoentitystores.apps.nvidia.com" \ + "nemoevaluators.apps.nvidia.com" \ + "nemoguardrails.apps.nvidia.com" \ + "nimbuilds.apps.nvidia.com" \ + "nimcaches.apps.nvidia.com" \ + "nimpipelines.apps.nvidia.com" \ + "nimservices.apps.nvidia.com" + ;; + kubeflow-trainer) + printf '%s\n' \ + "clustertrainingruntimes.trainer.kubeflow.org" \ + "trainjobs.trainer.kubeflow.org" \ + "trainingruntimes.trainer.kubeflow.org" \ + "jobsets.jobset.x-k8s.io" + ;; + kube-prometheus-stack) + printf '%s\n' \ + "alertmanagerconfigs.monitoring.coreos.com" \ + "alertmanagers.monitoring.coreos.com" \ + "podmonitors.monitoring.coreos.com" \ + "probes.monitoring.coreos.com" \ + "prometheusagents.monitoring.coreos.com" \ + "prometheuses.monitoring.coreos.com" \ + "prometheusrules.monitoring.coreos.com" \ + "scrapeconfigs.monitoring.coreos.com" \ + "servicemonitors.monitoring.coreos.com" \ + "thanosrulers.monitoring.coreos.com" + ;; + dynamo-platform) + printf '%s\n' \ + "podcliques.grove.io" \ + "podcliquescalinggroups.grove.io" \ + "podcliquesets.grove.io" \ + "podgangs.scheduler.grove.io" + ;; + network-operator) + # This explicit list matches the CRDs enabled by the bundled values: + # nfd=false, sriovNetworkOperator=false, maintenance-operator disabled. 
+ # Intentionally exclude networkattachmentdefinitions.k8s.cni.cncf.io: + # it is a broadly shared CRD, so surfacing or deleting it based only on + # this release would create cross-cluster noise. + printf '%s\n' \ + "nicclusterpolicies.mellanox.com" \ + "hostdevicenetworks.mellanox.com" \ + "ipoibnetworks.mellanox.com" \ + "macvlannetworks.mellanox.com" + ;; + *) ;; + esac +} + +# Skip pre-flight for releases whose bundle-managed custom resources are +# deleted from manifests before the controller is uninstalled. +skip_preflight_for_release() { + case "$1" in + nodewright-operator|kgateway) return 0 ;; + *) return 1 ;; + esac +} + +# Run `kubectl ... -o json` while keeping stdout parseable for jq. +capture_kubectl_json() { + local out_var="$1" + shift + local stderr_file output kubectl_err="" + + if ! stderr_file=$(mktemp "${TMPDIR:-/tmp}/.aicr_kubectl_stderr_XXXXXX"); then + printf -v "${out_var}" '%s' 'mktemp failed while capturing kubectl stderr' + return 1 + fi + + if ! output=$(kubectl "$@" 2>"${stderr_file}"); then + kubectl_err=$(cat "${stderr_file}" 2>/dev/null || true) + rm -f "${stderr_file}" + printf -v "${out_var}" '%s' "${kubectl_err}" + return 1 + fi + + if [[ -s "${stderr_file}" ]]; then + cat "${stderr_file}" >&2 || true + fi + rm -f "${stderr_file}" + printf -v "${out_var}" '%s' "${output}" + return 0 +} + +check_crd_for_stuck_resources() { + local crd_name="$1" + local component="$2" + local crd_json plural group scope resource stuck stuck_json kubectl_err + local item_name item_namespace item_finalizers + + if ! capture_kubectl_json crd_json get crd "${crd_name}" -o json; then + kubectl_err="${crd_json}" + echo "" >&2 + echo "ERROR: Pre-flight could not inspect CRD '${crd_name}' (release ${component})." >&2 + echo " kubectl output: ${kubectl_err}" >&2 + echo " Failing closed: if this is a transient API/auth error the script" >&2 + echo " may remove the operator before finalizers are reconciled." 
>&2 + echo " Re-run after the API recovers, or use --skip-preflight to bypass." >&2 + return 1 + fi + read -r plural group scope < <(echo "${crd_json}" | jq -r '[.spec.names.plural, .spec.group, .spec.scope] | @tsv' 2>/dev/null) || return 0 + [[ -z "${plural}" || "${plural}" == "null" ]] && return 0 + + resource="${plural}.${group}" + local jq_filter='[.metadata.finalizers // [] | .[] | select(startswith("kubernetes.io/") | not)]' + stuck="" + if [[ "${scope}" == "Namespaced" ]]; then + if ! capture_kubectl_json stuck_json get "${resource}" -A -o json; then + echo "" >&2 + echo "ERROR: Pre-flight could not list '${resource}' (release ${component})." >&2 + echo " kubectl output: ${stuck_json}" >&2 + echo " Failing closed so we do not silently miss a stuck CR." >&2 + echo " Re-run after the API recovers, or use --skip-preflight to bypass." >&2 + return 1 + fi + while IFS=$'\x1f' read -r item_namespace item_name item_finalizers; do + stuck="${stuck} ${item_namespace}/${item_name} finalizers=[${item_finalizers}]"$'\n' + done < <(echo "${stuck_json}" \ + | jq -r '.items[] | ('"${jq_filter}"') as $f | select(($f | length) > 0) | [(.metadata.namespace // ""), .metadata.name, ($f | join(","))] | join("\u001f")' 2>/dev/null || true) + else + if ! capture_kubectl_json stuck_json get "${resource}" -o json; then + echo "" >&2 + echo "ERROR: Pre-flight could not list '${resource}' (release ${component})." >&2 + echo " kubectl output: ${stuck_json}" >&2 + echo " Failing closed so we do not silently miss a stuck CR." >&2 + echo " Re-run after the API recovers, or use --skip-preflight to bypass."
>&2 + return 1 + fi + while IFS=$'\x1f' read -r item_name item_finalizers; do + stuck="${stuck} ${item_name} finalizers=[${item_finalizers}]"$'\n' + done < <(echo "${stuck_json}" \ + | jq -r '.items[] | ('"${jq_filter}"') as $f | select(($f | length) > 0) | [.metadata.name, ($f | join(","))] | join("\u001f")' 2>/dev/null || true) + fi + + if [[ -n "${stuck}" ]]; then + { + echo " ${component} — ${crd_name}:" + printf '%s' "${stuck}" + } >> "${PREFLIGHT_DETAILS}" + fi +} + +check_release_for_stuck_crds() { + local release="$1" + local ns="$2" + local manifest manifest_crds annotated_crds explicit_crds + local all_crds_json kubectl_err crd_name + manifest=$(helm get manifest "${release}" -n "${ns}" 2>/dev/null || true) + manifest_crds=$(echo "${manifest}" \ + | awk '/^kind:/{kind=$2} /^ name:/ && kind=="CustomResourceDefinition"{print $2; kind=""}') + + all_crds_json="${PREFLIGHT_ALL_CRDS_JSON:-}" + if [[ -z "${all_crds_json}" ]]; then + if ! capture_kubectl_json all_crds_json get crd -o json; then + kubectl_err="${all_crds_json}" + echo "" >&2 + echo "ERROR: Pre-flight could not list CRDs for release '${release}'." >&2 + echo " kubectl output: ${kubectl_err}" >&2 + echo " Failing closed: if this is a transient API/auth error the script" >&2 + echo " may remove the operator before finalizers are reconciled." >&2 + echo " Re-run after the API recovers, or use --skip-preflight to bypass."
>&2 + return 1 + fi + fi + + annotated_crds=$(echo "${all_crds_json}" \ + | jq -r --arg rel "${release}" --arg ns "${ns}" \ + '.items[] | select(.metadata.annotations["meta.helm.sh/release-name"]==$rel and .metadata.annotations["meta.helm.sh/release-namespace"]==$ns) | .metadata.name' 2>/dev/null || true) + explicit_crds="" + while read -r crd_name; do + [[ -z "${crd_name}" ]] && continue + explicit_crds="${explicit_crds}$(echo "${all_crds_json}" \ + | jq -r --arg name "${crd_name}" --arg rel "${release}" --arg ns "${ns}" \ + '.items[] + | select(.metadata.name == $name) + | select( + ((.metadata.annotations["meta.helm.sh/release-name"] // "") == "") + or + ( + .metadata.annotations["meta.helm.sh/release-name"] == $rel + and + .metadata.annotations["meta.helm.sh/release-namespace"] == $ns + ) + ) + | .metadata.name' 2>/dev/null || true)"$'\n' + done < <(extra_crds_for_release "${release}") + [[ -z "${manifest_crds}" && -z "${annotated_crds}" && -z "${explicit_crds}" ]] && return 0 + printf '%s\n%s\n%s\n' "${manifest_crds}" "${annotated_crds}" "${explicit_crds}" \ + | awk 'NF' \ + | sort -u \ + | while read -r crd_name; do + check_crd_for_stuck_resources "${crd_name}" "${release}" + done + return 0 +} + +# ============================================================================== +# Pre-flight checks +# ============================================================================== +# Verify no custom resources with active finalizers exist for CRDs owned by +# bundle operators. After helm uninstall removes the operator, CRs with +# finalizers cannot be reconciled — blocking CRD deletion indefinitely. + +if [[ "${SKIP_PREFLIGHT}" == "true" ]]; then + echo "Skipping pre-flight checks (--skip-preflight)." +else + echo "Running pre-flight checks..." + PREFLIGHT_DETAILS=$(mktemp "${TMPDIR:-/tmp}/.aicr_preflight_XXXXXX") + PREFLIGHT_ALL_CRDS_JSON="" + + if ! 
capture_kubectl_json PREFLIGHT_ALL_CRDS_JSON get crd -o json; then + echo "" >&2 + echo "ERROR: Pre-flight could not list CRDs." >&2 + echo " kubectl output: ${PREFLIGHT_ALL_CRDS_JSON}" >&2 + echo " Failing closed: if this is a transient API/auth error the script" >&2 + echo " may remove an operator before finalizers are reconciled." >&2 + echo " Re-run after the API recovers, or use --skip-preflight to bypass." >&2 + rm -f "${PREFLIGHT_DETAILS}" + exit 1 + fi + if skip_preflight_for_release "nodewright-operator"; then + echo " Skipping nodewright-operator (skyhook): bundle deletes dependent manifests before controller uninstall." + else + echo " Checking nodewright-operator (skyhook)..." + check_release_for_stuck_crds "nodewright-operator" "skyhook" + fi + + if [[ -s "${PREFLIGHT_DETAILS}" ]]; then + echo "" + echo "ERROR: Found custom resources with active finalizers that will block undeploy." + echo " After the operator is removed, these finalizers cannot be processed —" + echo " causing an unrecoverable hang during CRD deletion." + echo "" + echo " Delete these resources while their controller is still running," + echo " then re-run ./undeploy.sh" + echo "" + cat "${PREFLIGHT_DETAILS}" + echo "" + echo " To skip this check: ./undeploy.sh --skip-preflight" + rm -f "${PREFLIGHT_DETAILS}" + exit 1 + fi + rm -f "${PREFLIGHT_DETAILS}" + + echo "Pre-flight checks passed." +fi + +# ============================================================================== +# Uninstall components in reverse install order +# ============================================================================== +# Generic reverse loop: every folder is a Helm release (local-helm or upstream-helm). +# `helm uninstall ` works uniformly for both kinds — that's one of the +# benefits of the uniform local-chart bundle format. 
+for dir in $(ls -1d "${SCRIPT_DIR}"/[0-9][0-9][0-9]-*/ 2>/dev/null | sort -r); do + [[ -d "${dir}" ]] || continue + base="${dir%/}" + base="${base##*/}" + name="${base#[0-9][0-9][0-9]-}" + # Derive the namespace from the component name via the per-component blocks + # below. We still need it for helm_force_uninstall and cluster-resource cleanup. + ns="" + if [[ "${name}" == "nodewright-operator" ]]; then ns="skyhook"; fi + # Injected mixed-component "-post" folders share their parent's namespace. + if [[ "${name}" == "nodewright-operator-post" ]]; then ns="skyhook"; fi + if [[ -z "${ns}" ]]; then + echo "Warning: no namespace known for ${name}; skipping uninstall" >&2 + continue + fi + echo "Uninstalling ${name} (${ns})..." + # For local-helm folders (Chart.yaml + templates/), kubectl-delete the + # rendered templates BEFORE helm uninstall. The templates may carry + # helm.sh/hook annotations (post-install/post-upgrade) which helm does + # not track or clean up on uninstall — so without this pre-delete, the + # operator gets removed but its hook-created CRs (and their finalizers) + # linger. Doing the delete while the controller is still running lets + # finalizers clear naturally. + if [[ -d "${dir}/templates" ]]; then + for tpl in "${dir}/templates/"*.yaml; do + [[ -f "${tpl}" ]] || continue + kubectl delete -n "${ns}" -f "${tpl}" --ignore-not-found --timeout="${HELM_TIMEOUT}s" || true + done + fi + helm_force_uninstall "${name}" "${ns}" + delete_release_cluster_resources "${name}" "${ns}" + delete_orphaned_webhooks_for_ns "${ns}" +done + +# Remove nodewright node taints that persist after operator removal. +# Nodewright taints nodes during kernel tuning. The taint key is configurable +# via runtimeRequiredTaint (defaults to skyhook.nvidia.com). +NODEWRIGHT_TAINT_KEY="skyhook.nvidia.com" +# Locate nodewright-operator's NNN-prefixed directory at runtime. 
+nodewright_dir="$(ls -d "${SCRIPT_DIR}"/[0-9][0-9][0-9]-nodewright-operator 2>/dev/null | head -1)" +if [[ -n "${nodewright_dir}" && -f "${nodewright_dir}/values.yaml" ]]; then + custom_taint_line=$(grep 'runtimeRequiredTaint:' "${nodewright_dir}/values.yaml" 2>/dev/null || true) + if [[ -n "${custom_taint_line}" ]]; then + # Extract value after "runtimeRequiredTaint:", then extract taint key before "=" + taint_value=$(echo "${custom_taint_line}" | sed 's/.*runtimeRequiredTaint:[[:space:]]*//' | tr -d '"' | tr -d "'") + if [[ -n "${taint_value}" ]]; then + # Handle both key=value:effect and key:effect formats + NODEWRIGHT_TAINT_KEY="${taint_value%%=*}" + NODEWRIGHT_TAINT_KEY="${NODEWRIGHT_TAINT_KEY%%:*}" + fi + fi +fi +echo "Removing ${NODEWRIGHT_TAINT_KEY} node taints..." +for node in $(kubectl get nodes -o jsonpath='{.items[*].metadata.name}' 2>/dev/null); do + kubectl taint node "${node}" "${NODEWRIGHT_TAINT_KEY}-" 2>/dev/null || true +done + +# Clean up orphaned CRDs that were owned by this bundle's releases. +# Only delete CRDs whose Helm release annotation matches a component we just uninstalled. +kubectl get crd -o json 2>/dev/null \ + | jq -r --arg rel "nodewright-operator" --arg ns "skyhook" \ + '.items[] | select(.metadata.annotations["meta.helm.sh/release-name"]==$rel and .metadata.annotations["meta.helm.sh/release-namespace"]==$ns) | .metadata.name' 2>/dev/null \ + | while read -r name; do + echo "Deleting CRD ${name} (owned by nodewright-operator/skyhook)..." + kubectl delete crd "${name}" --ignore-not-found --wait=false \ + || echo "Warning: failed to delete CRD ${name} (owned by nodewright-operator/skyhook); leftovers will surface in post-flight" >&2 + done || echo "Warning: orphan-CRD cleanup for nodewright-operator/skyhook failed (kubectl get / jq error); leftovers will surface in post-flight" >&2 + +# Intentionally skip automatic deletion of unannotated CRDs matched only by +# API group. 
On shared clusters, those CRDs may be serving another tenant's +# release in the same group, and we do not have bundle-specific ownership +# metadata to distinguish "ours" from "theirs" safely. + +# Warn about CRDs stuck in deleting state (e.g., customresourcecleanup finalizer +# can't be resolved because CR instances still have controller-managed finalizers). +stuck_crds=$(kubectl get crd -o json 2>/dev/null | jq -r '.items[] | select(.metadata.deletionTimestamp != null) | .metadata.name' 2>/dev/null || true) +if [[ -n "${stuck_crds}" ]]; then + echo "" + echo "WARNING: CRDs stuck in deleting state:" + for crd in ${stuck_crds}; do + echo " ${crd}" + done + echo "" + echo " These CRDs have a deletionTimestamp but cannot complete deletion because" + echo " their customresourcecleanup finalizer is waiting for CR instances to be" + echo " removed. If you are certain no data will be lost, you can force-clear the" + echo " finalizers. Note: this may leave orphaned CR data in etcd that is no longer" + echo " accessible through the API." + echo "" + echo " Commands to force-clear (review before running):" + for crd in ${stuck_crds}; do + echo " kubectl patch crd ${crd} --type merge -p '{\"metadata\":{\"finalizers\":null}}'" + done +fi + +# Clean up namespaces after all components are uninstalled. +if [[ "${DELETE_PVCS}" == "true" ]] && ! echo " ${PROTECTED_NS} " | grep -q " skyhook "; then + echo "Deleting PVCs in skyhook..." + kubectl delete pvc --all -n skyhook --ignore-not-found || true +fi +delete_orphaned_webhooks_for_ns "skyhook" +delete_namespace "skyhook" + +# Clean up companion namespaces created at runtime by operators. +# Only emitted for components whose runtime creates them. + +# Wait for terminating namespaces to finish +echo "Waiting for namespaces to terminate..." 
+for i in $(seq 1 60); do + TERMINATING=$(kubectl get ns --no-headers 2>/dev/null | grep Terminating | awk '{print $1}' || true) + if [[ -z "${TERMINATING}" ]]; then + break + fi + if [[ $i -eq 60 ]]; then + echo "Warning: namespaces still terminating after 60s: ${TERMINATING}" + for ns in ${TERMINATING}; do + force_clear_namespace_finalizers "${ns}" + kubectl delete namespace "${ns}" --ignore-not-found --wait=false 2>/dev/null || true + done + break + fi + sleep 1 +done + +# Final webhook cleanup pass. +delete_orphaned_webhooks_for_ns "skyhook" + +# ============================================================================== +# Post-flight verification +# ============================================================================== + +postflight_issues=false + +TERMINATING=$(kubectl get namespaces -o jsonpath='{range .items[?(@.status.phase=="Terminating")]}{.metadata.name}{" "}{end}' 2>/dev/null || true) +if [[ -n "${TERMINATING}" ]]; then + echo "WARNING: namespaces still terminating: ${TERMINATING}" + echo " A subsequent deploy.sh may fail. Wait or force-finalize these namespaces." + postflight_issues=true +fi + +kubectl get mutatingwebhookconfigurations,validatingwebhookconfigurations -o json 2>/dev/null | \ + jq -r '.items[] | .metadata.name as $wh | .webhooks[]? | select(.clientConfig.service != null) | [$wh, .clientConfig.service.namespace, .clientConfig.service.name] | @tsv' 2>/dev/null | \ + sort -u | \ + while IFS=$'\t' read -r wh_name svc_ns svc_name; do + if ! kubectl get ns "${svc_ns}" &>/dev/null || ! kubectl get svc "${svc_name}" -n "${svc_ns}" &>/dev/null; then + echo "Deleting stale webhook ${wh_name} (service ${svc_ns}/${svc_name} missing)..." 
+ kubectl delete mutatingwebhookconfiguration "${wh_name}" --ignore-not-found 2>/dev/null || true + kubectl delete validatingwebhookconfiguration "${wh_name}" --ignore-not-found 2>/dev/null || true + fi + done || true + +stale_apis=$(kubectl get apiservices -o json 2>/dev/null | jq -r '.items[] | select(.status.conditions[]? | .type == "Available" and .status == "False") | .metadata.name' 2>/dev/null || true) +if [[ -n "${stale_apis}" ]]; then + echo "WARNING: unavailable API services found: ${stale_apis}" + echo " These can block namespace deletion. Delete with: kubectl delete apiservice " + postflight_issues=true +fi + +# Check for Helm-annotated CRDs from uninstalled releases. +helm_orphaned_crds="" +explicit_orphaned_crds="" +postflight_all_crds_json="" +if capture_kubectl_json postflight_all_crds_json get crd -o json; then + : + remaining_helm_crds=$(echo "${postflight_all_crds_json}" \ + | jq -r --arg rel "nodewright-operator" --arg ns "skyhook" \ + '.items[] | select(.metadata.annotations["meta.helm.sh/release-name"]==$rel and .metadata.annotations["meta.helm.sh/release-namespace"]==$ns and .metadata.deletionTimestamp==null) | .metadata.name' 2>/dev/null || true) + if [[ -n "${remaining_helm_crds}" ]]; then + helm_orphaned_crds="${helm_orphaned_crds} ${remaining_helm_crds}" + fi + while read -r crd_name; do + [[ -z "${crd_name}" ]] && continue + remaining_explicit_crd=$(echo "${postflight_all_crds_json}" \ + | jq -r --arg name "${crd_name}" \ + '.items[] | select(.metadata.name==$name and .metadata.deletionTimestamp==null) | .metadata.name' 2>/dev/null || true) + if [[ -n "${remaining_explicit_crd}" ]]; then + explicit_orphaned_crds="${explicit_orphaned_crds}${remaining_explicit_crd}"$'\n' + fi + done < <(extra_crds_for_release "nodewright-operator") +else + echo "Warning: failed to enumerate post-flight CRDs; kubectl output: ${postflight_all_crds_json}" >&2 + postflight_issues=true +fi +if [[ -n "${helm_orphaned_crds}" ]]; then + echo "WARNING: 
Helm-annotated CRDs from uninstalled releases still present:${helm_orphaned_crds}" + echo " Cleanup did not remove all CRDs owned by this bundle's releases." + echo " Delete with: kubectl delete crd " + postflight_issues=true +fi + +explicit_orphaned_crds=$(printf '%s' "${explicit_orphaned_crds}" | awk 'NF' | sort -u | tr '\n' ' ') +if [[ -n "${explicit_orphaned_crds}" ]]; then + echo "WARNING: explicit CRDs from this bundle still present: ${explicit_orphaned_crds}" + echo " These CRDs are installed outside Helm manifest/annotation discovery." + echo " Delete with: kubectl delete crd " + postflight_issues=true +fi + +if [[ "${postflight_issues}" == "true" ]]; then + echo "" + echo "Post-flight: some stale resources remain. Run deploy.sh pre-flight checks to verify before redeploying." +else + echo "Post-flight: cluster is clean." +fi + +echo "Undeployment complete." diff --git a/pkg/bundler/deployer/helm/testdata/upstream_helm_only/001-cert-manager/cluster-values.yaml b/pkg/bundler/deployer/helm/testdata/upstream_helm_only/001-cert-manager/cluster-values.yaml new file mode 100644 index 000000000..9c936cbd8 --- /dev/null +++ b/pkg/bundler/deployer/helm/testdata/upstream_helm_only/001-cert-manager/cluster-values.yaml @@ -0,0 +1,2 @@ +# Generated by Cloud Native Stack +--- diff --git a/pkg/bundler/deployer/helm/testdata/upstream_helm_only/001-cert-manager/install.sh b/pkg/bundler/deployer/helm/testdata/upstream_helm_only/001-cert-manager/install.sh new file mode 100644 index 000000000..d4dd63bb7 --- /dev/null +++ b/pkg/bundler/deployer/helm/testdata/upstream_helm_only/001-cert-manager/install.sh @@ -0,0 +1,29 @@ +#!/usr/bin/env bash +# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +set -euo pipefail +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +cd "${SCRIPT_DIR}" +# shellcheck source=/dev/null +source ./upstream.env + +# CHART carries the full OCI URI for OCI charts and just the chart name for +# HTTP/HTTPS charts. REPO is non-empty only for HTTP/HTTPS charts; the +# ${REPO:+--repo "${REPO}"} expansion adds --repo iff REPO is set. +helm upgrade --install cert-manager "${CHART}" \ + ${REPO:+--repo "${REPO}"} --version "${VERSION}" \ + --namespace cert-manager --create-namespace \ + -f values.yaml -f cluster-values.yaml \ + ${COMPONENT_WAIT_ARGS:-} ${DRY_RUN_FLAG:-} ${KUBECONFIG_FLAG:-} ${HELM_DEBUG_FLAG:-} diff --git a/pkg/bundler/deployer/helm/testdata/upstream_helm_only/001-cert-manager/upstream.env b/pkg/bundler/deployer/helm/testdata/upstream_helm_only/001-cert-manager/upstream.env new file mode 100644 index 000000000..09437ac9c --- /dev/null +++ b/pkg/bundler/deployer/helm/testdata/upstream_helm_only/001-cert-manager/upstream.env @@ -0,0 +1,3 @@ +CHART='cert-manager' +REPO='https://charts.jetstack.io' +VERSION='v1.17.2' diff --git a/pkg/bundler/deployer/helm/testdata/upstream_helm_only/001-cert-manager/values.yaml b/pkg/bundler/deployer/helm/testdata/upstream_helm_only/001-cert-manager/values.yaml new file mode 100644 index 000000000..1aa29a1c0 --- /dev/null +++ b/pkg/bundler/deployer/helm/testdata/upstream_helm_only/001-cert-manager/values.yaml @@ -0,0 +1,4 @@ +# Generated by Cloud Native Stack +--- +crds: + enabled: true diff --git a/pkg/bundler/deployer/helm/testdata/upstream_helm_only/README.md 
b/pkg/bundler/deployer/helm/testdata/upstream_helm_only/README.md new file mode 100644 index 000000000..57404f66a --- /dev/null +++ b/pkg/bundler/deployer/helm/testdata/upstream_helm_only/README.md @@ -0,0 +1,107 @@ +# Cloud Native Stack Deployment + +Recipe Version: v0.1.0 +Bundler Version: v1.0.0 + +Per-component bundle for deploying NVIDIA Cloud Native Stack components +for GPU-accelerated Kubernetes workloads. + +## Configuration + + + +## Components + +The following components are included (deployed in order). Each component +lives in a numbered `NNN-/` folder and is installed as a Helm release +via its own `install.sh`: + +| Component | Version | Namespace | Source | +|-----------|---------|-----------|--------| +| cert-manager | v1.17.2 | cert-manager | cert-manager (https://charts.jetstack.io) | + + + + +## Quick Start + +Run the included deployment script: + +```bash +chmod +x deploy.sh +./deploy.sh +``` + +Use `--no-wait` to skip Helm chart-level waiting where AICR uses `--wait` (keeps `--timeout` for hooks): + +```bash +./deploy.sh --no-wait +``` + +> **Note:** The deploy script's final status reflects install/apply results. If `--best-effort` was used, one or more components may still have failed; check warning lines and logs. This does **not** guarantee the cluster is ready to schedule workloads — operator-driven cluster convergence (CRD reconciliation, node tuning, plugin registration, etc.) continues asynchronously after the script exits, in operator-specific ways. See the [AICR CLI Reference](https://github.com/NVIDIA/aicr/blob/main/docs/user/cli-reference.md#deploy-script-behavior-deploysh) for details. + +## Manual Installation + +Each component folder contains an `install.sh` that runs `helm upgrade --install` +with the right arguments baked in. 
To install a single component manually: + +```bash +cd NNN- +bash install.sh +``` + +## Customization + +Each component folder has its own `values.yaml` (static) and `cluster-values.yaml` +(dynamic, per-cluster). Edit either before deploying: + +```bash +vim NNN-/values.yaml +vim NNN-/cluster-values.yaml +``` + +## Upgrade + +Re-run the per-component install.sh to upgrade an already-installed release: + +```bash +cd NNN- +bash install.sh +``` + +## Uninstall + +To remove components (reverse order): + +```bash +./undeploy.sh +``` + +Or remove a single release manually: + +```bash +helm uninstall cert-manager -n cert-manager +``` + + +## Troubleshooting + +### Check deployment status + +```bash +kubectl get pods -A | grep -E 'cert-manager' +``` + +### View component logs + +Inspect a single component's pods (replace `` and `` +with one of the entries from the table above): + +```bash +kubectl logs -n -l app.kubernetes.io/instance= +``` + + +## References + +- [AICR CLI Reference](https://github.com/NVIDIA/aicr/blob/main/docs/user/cli-reference.md) diff --git a/pkg/bundler/deployer/helm/testdata/upstream_helm_only/deploy.sh b/pkg/bundler/deployer/helm/testdata/upstream_helm_only/deploy.sh new file mode 100644 index 000000000..7050af8e8 --- /dev/null +++ b/pkg/bundler/deployer/helm/testdata/upstream_helm_only/deploy.sh @@ -0,0 +1,321 @@ +#!/usr/bin/env bash +# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +set -euo pipefail + +# Cloud Native Stack Deployment Script +# Generated by AICR Bundler v1.0.0 +# +# Usage: ./deploy.sh [--no-wait] [--best-effort] [--retries N] +# --no-wait Skip Helm chart-level wait where AICR uses --wait (keeps --timeout for hooks) +# --best-effort Continue past individual component failures (log warnings) +# --retries N Retry failed helm/kubectl operations N times with backoff (default: 5, 0 = fail-fast) +# +# This script is optional — each component subdirectory has its own install.sh +# with a single baked-in `helm upgrade --install` command. For detailed behavior +# docs (CRD ordering, async components, error handling), see the AICR CLI Reference: +# https://github.com/NVIDIA/aicr/blob/main/docs/user/cli-reference.md#deploy-script-behavior-deploysh + +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" + +# Run helm commands from a temp directory to prevent local chart directories +# (e.g., bundle/001-nodewright-operator/) from shadowing remote chart references. +HELM_WORKDIR="$(mktemp -d)" +trap 'rm -rf "${HELM_WORKDIR}"; exit 130' INT TERM +trap 'rm -rf "${HELM_WORKDIR}"' EXIT + +HELM_TIMEOUT="10m" +NO_WAIT=false +BEST_EFFORT=false +FAILED_COMPONENTS="" +MAX_RETRIES=5 + +while [[ $# -gt 0 ]]; do + case "$1" in + --no-wait) NO_WAIT=true; shift ;; + --best-effort) BEST_EFFORT=true; shift ;; + --retries) + if [[ $# -lt 2 ]]; then echo "Error: --retries requires a value"; exit 1; fi + if ! [[ "$2" =~ ^[0-9]+$ ]]; then echo "Error: --retries requires a non-negative integer"; exit 1; fi + MAX_RETRIES="$2"; shift 2 ;; + *) echo "Error: unknown option: $1"; echo "Usage: ./deploy.sh [--no-wait] [--best-effort] [--retries N]"; exit 1 ;; + esac +done + +# Export env vars consumed by each folder's install.sh (rendered by localformat). +# DRY_RUN_FLAG / KUBECONFIG_FLAG / HELM_DEBUG_FLAG default to empty strings. 
+export DRY_RUN_FLAG="${DRY_RUN_FLAG:-}" +export KUBECONFIG_FLAG="${KUBECONFIG_FLAG:-}" +export HELM_DEBUG_FLAG="${HELM_DEBUG_FLAG:-}" + +function helm_failed() { + if [[ "${BEST_EFFORT}" == "true" ]]; then + echo "WARNING: $1 install failed, continuing (--best-effort)" + FAILED_COMPONENTS="${FAILED_COMPONENTS} $1" + else + exit 1 + fi +} + +# Compute backoff delay from attempt number (1-indexed). +# Examples: attempt 1→5s, 2→20s, 3→45s, 4→80s, 5→120s (cap) +function backoff_seconds() { + local attempt=$1 + local seconds=$(( attempt * attempt * 5 )) + if [[ ${seconds} -gt 120 ]]; then seconds=120; fi + echo "${seconds}" +} + +# Clean up stale Helm hook Jobs before retrying. When a hook Job (e.g., +# crd-upgrader) times out or fails, it stays in the namespace and blocks +# subsequent install attempts with "Job not ready" errors. +# Helm hooks are identified by the helm.sh/hook *annotation* (not a label), +# so we list all non-succeeded Jobs and check each individually via JSON. +function cleanup_helm_hooks() { + local namespace="$1" + local job_names + job_names=$(kubectl get jobs -n "${namespace}" \ + --field-selector=status.successful=0 \ + -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' \ + 2>/dev/null || true) + if [[ -z "${job_names}" ]]; then + return + fi + while IFS= read -r name; do + [[ -z "${name}" ]] && continue + # Get the full Job JSON to reliably check annotations and status + local job_json + job_json=$(kubectl get job "${name}" -n "${namespace}" -o json 2>/dev/null || true) + [[ -z "${job_json}" ]] && continue + # Skip non-hook Jobs (no helm.sh/hook annotation) + local hook_val + hook_val=$(echo "${job_json}" | grep -o '"helm.sh/hook"' || true) + [[ -z "${hook_val}" ]] && continue + # Capture diagnostics before deleting. This helps diagnose transient hook + # failures (e.g., dynamo ssh-keygen) that are otherwise lost after cleanup. 
+ echo " --- Failed hook Job ${name} diagnostics ---" + kubectl describe job "${name}" -n "${namespace}" 2>/dev/null | tail -50 || true + local pod_names + pod_names=$(kubectl get pods -n "${namespace}" -l "job-name=${name}" \ + -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' 2>/dev/null || true) + for pod_name in ${pod_names}; do + echo " --- Hook pod ${pod_name} describe ---" + kubectl describe pod "${pod_name}" -n "${namespace}" 2>/dev/null | tail -50 || true + done + echo " --- End diagnostics for ${name} ---" + # Delete any non-succeeded hook Job. This function only runs after a Helm + # failure, so any hook Job without a successful completion is blocking the + # retry — whether it failed, is stuck Pending (timed out before the pod + # started), or is still active with a stuck container. + echo " Cleaning up stale Helm hook Job ${name} in ${namespace}..." + kubectl delete job "${name}" -n "${namespace}" --ignore-not-found 2>/dev/null || true + done <<< "${job_names}" +} + +function dump_kai_scheduler_helm_diagnostics() { + local namespace="$1" + if [[ "${namespace}" != "kai-scheduler" ]]; then + return + fi + + echo " --- ${namespace} diagnostics ---" + echo " Jobs:" + kubectl get jobs -n "${namespace}" 2>/dev/null || true + echo " Job descriptions:" + kubectl describe jobs -n "${namespace}" 2>/dev/null || true + echo " Pods:" + kubectl get pods -n "${namespace}" -o wide 2>/dev/null || true + echo " Pod descriptions:" + kubectl describe pods -n "${namespace}" 2>/dev/null || true + echo " Recent events:" + kubectl get events -n "${namespace}" --sort-by='.lastTimestamp' 2>/dev/null | tail -30 || true + echo " --- End ${namespace} diagnostics ---" +} + +# Components that use operator patterns with custom resources that reconcile +# asynchronously. Helm --wait may time out waiting for CR readiness even though +# all pods start successfully. These components are installed without --wait. 
+ASYNC_COMPONENTS="kai-scheduler" + +# ============================================================================== +# Pre-flight checks +# ============================================================================== +# Verify the cluster is clean before deploying. Stale webhooks, terminating +# namespaces, and orphaned API services from a previous install can block pod +# creation and namespace deletion, causing silent deployment failures. + +echo "Running pre-flight checks..." + +preflight_failed=false + +# Bundle namespace list (deduplicated) +BUNDLE_NAMESPACES=$(echo "cert-manager " | tr ' ' '\n' | sort -u | tr '\n' ' ') + +# Check for terminating namespaces that overlap with our components +for ns in ${BUNDLE_NAMESPACES}; do + phase=$(kubectl get ns "${ns}" -o jsonpath='{.status.phase}' 2>/dev/null || true) + if [[ "${phase}" == "Terminating" ]]; then + echo "ERROR: namespace '${ns}' is still terminating from a previous install." + echo " Wait for it to finish, or force-finalize with:" + echo " kubectl get ns ${ns} -o json | jq '.spec.finalizers=[]' | kubectl replace --raw /api/v1/namespaces/${ns}/finalize -f -" + preflight_failed=true + fi +done + +# Check for stale webhooks whose backing services no longer exist. +# Scoped to bundle namespaces only to avoid false positives from unrelated +# platform webhooks in shared clusters. 
+if command -v jq &>/dev/null; then + for kind in mutatingwebhookconfigurations validatingwebhookconfigurations; do + while IFS=$'\t' read -r wh_name svc_ns svc_name; do + # Only check webhooks pointing to our bundle namespaces + is_bundle_ns=false + for ns in ${BUNDLE_NAMESPACES}; do + [[ "${svc_ns}" == "${ns}" ]] && is_bundle_ns=true && break + done + [[ "${is_bundle_ns}" == "false" ]] && continue + + # Use explicit NotFound check to avoid false positives from transient errors + svc_check=$(kubectl get svc "${svc_name}" -n "${svc_ns}" 2>&1) || true + if echo "${svc_check}" | grep -q "NotFound\|not found"; then + echo "ERROR: ${kind} '${wh_name}' references non-existent service ${svc_ns}/${svc_name}." + echo " This will block pod/resource creation. Delete with: kubectl delete ${kind} ${wh_name}" + preflight_failed=true + fi + done < <(kubectl get "${kind}" -o json 2>/dev/null | \ + jq -r '.items[] | .metadata.name as $wh | .webhooks[]? | select(.clientConfig.service != null) | [$wh, .clientConfig.service.namespace, .clientConfig.service.name] | @tsv' 2>/dev/null || true) + done +else + echo "NOTE: jq not found — skipping webhook pre-flight checks. Install jq for full pre-flight validation." +fi + +# Check for stale API services (e.g., custom.metrics.k8s.io from prometheus-adapter) +if command -v jq &>/dev/null; then + for api_svc in $(kubectl get apiservices -o json 2>/dev/null | jq -r '.items[] | select(.status.conditions[]? | .type == "Available" and .status == "False") | .metadata.name' 2>/dev/null || true); do + echo "WARNING: API service '${api_svc}' is unavailable. This can block namespace deletion." + echo " Delete with: kubectl delete apiservice ${api_svc}" + # API service issues are warnings, not hard failures — they don't block deployment directly + done +else + echo "NOTE: jq not found — skipping API service pre-flight checks." +fi + +# Check for orphaned CRDs from previous deployments. 
+# Scoped to CRD groups belonging to components in this bundle to avoid +# false positives from unrelated platform installs on shared clusters. +ORPHANED_CRD_GROUPS="" +for group in ${ORPHANED_CRD_GROUPS}; do + orphaned=$(kubectl get crd -o name 2>/dev/null | grep "\.${group}$" || true) + if [[ -n "${orphaned}" ]]; then + echo "WARNING: orphaned CRDs from previous deployment: ${orphaned}" + echo " These may cause conflicts. Delete with: kubectl delete ${orphaned}" + fi +done + +# Check for stale nodewright node taints from a previous deployment. +# Only remove taints if nodewright-operator is NOT already running (i.e., fresh deploy). +# If the operator is running, taints are legitimate scheduling guards. + +if [[ "${preflight_failed}" == "true" ]]; then + echo "" + echo "Pre-flight checks failed. Fix the issues above before deploying." + echo "To skip pre-flight checks, run: ./undeploy.sh first, then retry." + exit 1 +fi + +echo "Pre-flight checks passed." +echo "Deploying Cloud Native Stack components..." + +# ============================================================================== +# Install loop +# ============================================================================== +# Generic install loop. Each folder's install.sh is rendered by localformat +# with the right helm command baked in — deploy.sh has no per-component +# knowledge here. Per-component special-case logic (async wait, DRA plugin +# restart) runs in the post-install blocks below, matched by component name. +cd "${HELM_WORKDIR}" +for dir in "${SCRIPT_DIR}"/[0-9][0-9][0-9]-*/; do + [[ -d "${dir}" ]] || continue + dir="${dir%/}" + base="${dir##*/}" + name="${base#[0-9][0-9][0-9]-}" + # Source the namespace from the folder's install.sh — not the folder + # basename — because the helm release name and its target namespace can + # differ (e.g. nodewright-operator → namespace skyhook; gpu-operator-post → + # namespace gpu-operator). 
cleanup_helm_hooks and the kai diagnostics + # both operate on the namespace. + namespace=$(awk '{ for (i=1;i/dev/null; then + echo "Error: jq is required but not found in PATH." + echo " Install jq: https://jqlang.github.io/jq/download/" + exit 1 +fi + +echo "Undeploying Cloud Native Stack components (timeout: ${HELM_TIMEOUT}s)..." +echo "" + +echo " The following components will be removed (in reverse install order):" +# List NNN-* folders in reverse numeric order +for dir in $(ls -1d "${SCRIPT_DIR}"/[0-9][0-9][0-9]-*/ 2>/dev/null | sort -r); do + [[ -d "${dir}" ]] || continue + base="${dir%/}" + base="${base##*/}" + name="${base#[0-9][0-9][0-9]-}" + printf " %-40s\n" "${name}" +done +echo "" + +# System namespaces that must not be deleted +PROTECTED_NS="kube-system kube-public kube-node-lease default" + +delete_namespace() { + local ns="$1" + if [[ "${KEEP_NS}" == "true" ]]; then return; fi + if echo " ${PROTECTED_NS} " | grep -q " ${ns} "; then return; fi + if ! kubectl get namespace "${ns}" &>/dev/null; then return; fi + echo "Deleting namespace ${ns}..." + kubectl delete namespace "${ns}" --ignore-not-found --wait=false +} + +# Uninstall a Helm release, handling stuck pending states from interrupted deploys. +# Try normal uninstall first; if it fails, retry with --no-hooks to force removal. +helm_force_uninstall() { + local release="$1" + local ns="$2" + if helm uninstall "${release}" -n "${ns}" --timeout "${HELM_TIMEOUT}s" --ignore-not-found 2>/dev/null; then + return + fi + echo " Retrying ${release} removal with --no-hooks..." + helm uninstall "${release}" -n "${ns}" --no-hooks --timeout "${HELM_TIMEOUT}s" --ignore-not-found || true +} + +# Delete cluster-scoped resources owned by a Helm release. 
+# These survive namespace deletion and can block subsequent deployments: +# - Webhooks block pod creation when their backing service is gone +# - CRDs with "helm.sh/resource-policy: keep" are retained after chart removal +delete_release_cluster_resources() { + local release="$1" + local ns="$2" + local selector="app.kubernetes.io/managed-by=Helm" + for kind in mutatingwebhookconfigurations validatingwebhookconfigurations customresourcedefinitions; do + kubectl get "${kind}" -l "${selector}" -o json 2>/dev/null \ + | jq -r --arg rel "${release}" --arg ns "${ns}" \ + '.items[] | select(.metadata.annotations["meta.helm.sh/release-name"]==$rel and .metadata.annotations["meta.helm.sh/release-namespace"]==$ns) | .metadata.name' 2>/dev/null \ + | while read -r name; do + echo "Deleting ${kind}/${name}..." + kubectl delete "${kind}" "${name}" --ignore-not-found --timeout="${HELM_TIMEOUT}s" || true + done || echo "Warning: ${kind} cleanup pipeline for release ${release}/${ns} failed (kubectl get / jq error); leftovers will surface in post-flight" >&2 + done +} + +# Delete webhooks whose backing service is in a specific namespace and no longer exists. +# Scoped to the given namespace to avoid touching unrelated platform webhooks. +# Operator-created webhooks (e.g., kai-scheduler admission) may not carry Helm labels, +# but once their service namespace is deleted, fail-closed webhooks block pod creation. +delete_orphaned_webhooks_for_ns() { + local ns="$1" + for kind in mutatingwebhookconfigurations validatingwebhookconfigurations; do + { kubectl get "${kind}" -o json 2>/dev/null \ + | jq -r --arg ns "${ns}" \ + '.items[] | .metadata.name as $wh | .webhooks[] | select(.clientConfig.service != null and .clientConfig.service.namespace == $ns) | [$wh, .clientConfig.service.name] | @tsv' 2>/dev/null \ + | sort -u || true; } \ + | while IFS=$'\t' read -r wh_name svc_name; do + # Delete when namespace is gone, terminating, or backing service is missing. 
+ # Skip on transient errors (auth, timeout, DNS) to avoid removing valid webhooks. + local ns_output ns_phase svc_output + ns_output=$(kubectl get ns "${ns}" 2>&1) || true + if echo "${ns_output}" | grep -q "NotFound"; then + echo "Deleting orphaned ${kind}/${wh_name} (namespace ${ns} not found)..." + kubectl delete "${kind}" "${wh_name}" --ignore-not-found + continue + fi + ns_phase=$(kubectl get ns "${ns}" -o jsonpath='{.status.phase}' 2>/dev/null) || true + if [[ "${ns_phase}" == "Terminating" ]]; then + echo "Deleting orphaned ${kind}/${wh_name} (namespace ${ns} terminating)..." + kubectl delete "${kind}" "${wh_name}" --ignore-not-found + continue + fi + svc_output=$(kubectl get svc "${svc_name}" -n "${ns}" 2>&1) || true + if echo "${svc_output}" | grep -q "NotFound"; then + echo "Deleting orphaned ${kind}/${wh_name} (service ${ns}/${svc_name} not found)..." + kubectl delete "${kind}" "${wh_name}" --ignore-not-found + fi + done + done +} + +# Force-clear finalizers on all namespaced resources to unstick a Terminating namespace. +force_clear_namespace_finalizers() { + local ns="$1" + echo "Force-removing finalizers in namespace ${ns}..." 
+ local kinds + kinds=$(kubectl api-resources --verbs=list --namespaced -o name 2>/dev/null) || { + echo "Warning: failed to enumerate namespaced resource kinds in ${ns}; namespace may stay Terminating" >&2 + return + } + for kind in ${kinds}; do + kubectl get "${kind}" -n "${ns}" -o json 2>/dev/null \ + | jq -r '.items[] | select(.metadata.finalizers // [] | length > 0) | .kind + "/" + .metadata.name' 2>/dev/null \ + | while read -r resource; do + kubectl patch "${resource}" -n "${ns}" --type merge -p '{"metadata":{"finalizers":null}}' 2>/dev/null || true + done || echo "Warning: finalizer-clear pipeline for ${kind} in ${ns} failed (kubectl get / jq error); namespace may stay Terminating" >&2 + done +} + +# Return a small, explicit list of known crds/-installed CRDs whose +# finalizer-bearing custom resources must still be caught before the operator +# is removed. Keep this list intentionally tiny and exact: if a release is not +# listed here, pre-flight relies on chart-manifest and Helm-annotation discovery +# only instead of trying to infer ownership across the whole cluster. 
+extra_crds_for_release() { + case "$1" in + gpu-operator) + printf '%s\n' \ + "clusterpolicies.nvidia.com" \ + "nvidiadrivers.nvidia.com" \ + "nodefeaturegroups.nfd.k8s-sigs.io" \ + "nodefeaturerules.nfd.k8s-sigs.io" \ + "nodefeatures.nfd.k8s-sigs.io" + ;; + kai-scheduler) + printf '%s\n' \ + "bindrequests.scheduling.run.ai" \ + "configs.kai.scheduler" \ + "podgroups.scheduling.run.ai" \ + "queues.scheduling.run.ai" \ + "schedulingshards.kai.scheduler" \ + "topologies.kai.scheduler" + ;; + k8s-nim-operator) + printf '%s\n' \ + "nemocustomizers.apps.nvidia.com" \ + "nemodatastores.apps.nvidia.com" \ + "nemoentitystores.apps.nvidia.com" \ + "nemoevaluators.apps.nvidia.com" \ + "nemoguardrails.apps.nvidia.com" \ + "nimbuilds.apps.nvidia.com" \ + "nimcaches.apps.nvidia.com" \ + "nimpipelines.apps.nvidia.com" \ + "nimservices.apps.nvidia.com" + ;; + kubeflow-trainer) + printf '%s\n' \ + "clustertrainingruntimes.trainer.kubeflow.org" \ + "trainjobs.trainer.kubeflow.org" \ + "trainingruntimes.trainer.kubeflow.org" \ + "jobsets.jobset.x-k8s.io" + ;; + kube-prometheus-stack) + printf '%s\n' \ + "alertmanagerconfigs.monitoring.coreos.com" \ + "alertmanagers.monitoring.coreos.com" \ + "podmonitors.monitoring.coreos.com" \ + "probes.monitoring.coreos.com" \ + "prometheusagents.monitoring.coreos.com" \ + "prometheuses.monitoring.coreos.com" \ + "prometheusrules.monitoring.coreos.com" \ + "scrapeconfigs.monitoring.coreos.com" \ + "servicemonitors.monitoring.coreos.com" \ + "thanosrulers.monitoring.coreos.com" + ;; + dynamo-platform) + printf '%s\n' \ + "podcliques.grove.io" \ + "podcliquescalinggroups.grove.io" \ + "podcliquesets.grove.io" \ + "podgangs.scheduler.grove.io" + ;; + network-operator) + # This explicit list matches the CRDs enabled by the bundled values: + # nfd=false, sriovNetworkOperator=false, maintenance-operator disabled. 
+ # Intentionally exclude networkattachmentdefinitions.k8s.cni.cncf.io: + # it is a broadly shared CRD, so surfacing or deleting it based only on + # this release would create cross-cluster noise. + printf '%s\n' \ + "nicclusterpolicies.mellanox.com" \ + "hostdevicenetworks.mellanox.com" \ + "ipoibnetworks.mellanox.com" \ + "macvlannetworks.mellanox.com" + ;; + *) ;; + esac +} + +# Skip pre-flight for releases whose bundle-managed custom resources are +# deleted from manifests before the controller is uninstalled. +skip_preflight_for_release() { + case "$1" in + nodewright-operator|kgateway) return 0 ;; + *) return 1 ;; + esac +} + +# Run `kubectl ... -o json` while keeping stdout parseable for jq. +capture_kubectl_json() { + local out_var="$1" + shift + local stderr_file output kubectl_err="" + + if ! stderr_file=$(mktemp "${TMPDIR:-/tmp}/.aicr_kubectl_stderr_XXXXXX"); then + printf -v "${out_var}" '%s' 'mktemp failed while capturing kubectl stderr' + return 1 + fi + + if ! output=$(kubectl "$@" 2>"${stderr_file}"); then + kubectl_err=$(cat "${stderr_file}" 2>/dev/null || true) + rm -f "${stderr_file}" + printf -v "${out_var}" '%s' "${kubectl_err}" + return 1 + fi + + if [[ -s "${stderr_file}" ]]; then + cat "${stderr_file}" >&2 || true + fi + rm -f "${stderr_file}" + printf -v "${out_var}" '%s' "${output}" + return 0 +} + +check_crd_for_stuck_resources() { + local crd_name="$1" + local component="$2" + local crd_json plural group scope resource stuck stuck_json kubectl_err + local item_name item_namespace item_finalizers + + if ! capture_kubectl_json crd_json get crd "${crd_name}" -o json; then + kubectl_err="${crd_json}" + echo "" >&2 + echo "ERROR: Pre-flight could not inspect CRD '${crd_name}' (release ${component})." >&2 + echo " kubectl output: ${kubectl_err}" >&2 + echo " Failing closed: if this is a transient API/auth error the script" >&2 + echo " may remove the operator before finalizers are reconciled."
>&2 + echo " Re-run after the API recovers, or use --skip-preflight to bypass." >&2 + return 1 + fi + read -r plural group scope < <(echo "${crd_json}" | jq -r '[.spec.names.plural, .spec.group, .spec.scope] | @tsv' 2>/dev/null) || return 0 + [[ -z "${plural}" || "${plural}" == "null" ]] && return 0 + + resource="${plural}.${group}" + local jq_filter='[.metadata.finalizers // [] | .[] | select(startswith("kubernetes.io/") | not)]' + stuck="" + if [[ "${scope}" == "Namespaced" ]]; then + if ! capture_kubectl_json stuck_json get "${resource}" -A -o json; then + echo "" >&2 + echo "ERROR: Pre-flight could not list '${resource}' (release ${component})." >&2 + echo " kubectl output: ${stuck_json}" >&2 + echo " Failing closed so we do not silently miss a stuck CR." >&2 + echo " Re-run after the API recovers, or use --skip-preflight to bypass." >&2 + return 1 + fi + while IFS=$'\x1f' read -r item_namespace item_name item_finalizers; do + stuck="${stuck} ${item_namespace}/${item_name} finalizers=[${item_finalizers}]"$'\n' + done < <(echo "${stuck_json}" \ + | jq -r '.items[] | ('"${jq_filter}"') as $f | select(($f | length) > 0) | [(.metadata.namespace // ""), .metadata.name, ($f | join(","))] | join("\u001f")' 2>/dev/null || true) + else + if ! capture_kubectl_json stuck_json get "${resource}" -o json; then + echo "" >&2 + echo "ERROR: Pre-flight could not list '${resource}' (release ${component})." >&2 + echo " kubectl output: ${stuck_json}" >&2 + echo " Failing closed so we do not silently miss a stuck CR." >&2 + echo " Re-run after the API recovers, or use --skip-preflight to bypass."
>&2 + return 1 + fi + while IFS=$'\x1f' read -r item_name item_finalizers; do + stuck="${stuck} ${item_name} finalizers=[${item_finalizers}]"$'\n' + done < <(echo "${stuck_json}" \ + | jq -r '.items[] | ('"${jq_filter}"') as $f | select(($f | length) > 0) | [.metadata.name, ($f | join(","))] | join("\u001f")' 2>/dev/null || true) + fi + + if [[ -n "${stuck}" ]]; then + { + echo " ${component} — ${crd_name}:" + printf '%s' "${stuck}" + } >> "${PREFLIGHT_DETAILS}" + fi +} + +check_release_for_stuck_crds() { + local release="$1" + local ns="$2" + local manifest manifest_crds annotated_crds explicit_crds + local all_crds_json kubectl_err crd_name + manifest=$(helm get manifest "${release}" -n "${ns}" 2>/dev/null || true) + manifest_crds=$(echo "${manifest}" \ + | awk '/^kind:/{kind=$2} /^ name:/ && kind=="CustomResourceDefinition"{print $2; kind=""}') + + all_crds_json="${PREFLIGHT_ALL_CRDS_JSON:-}" + if [[ -z "${all_crds_json}" ]]; then + if ! capture_kubectl_json all_crds_json get crd -o json; then + kubectl_err="${all_crds_json}" + echo "" >&2 + echo "ERROR: Pre-flight could not list CRDs for release '${release}'." >&2 + echo " kubectl output: ${kubectl_err}" >&2 + echo " Failing closed: if this is a transient API/auth error the script" >&2 + echo " may remove the operator before finalizers are reconciled." >&2 + echo " Re-run after the API recovers, or use --skip-preflight to bypass."
>&2 + return 1 + fi + fi + + annotated_crds=$(echo "${all_crds_json}" \ + | jq -r --arg rel "${release}" --arg ns "${ns}" \ + '.items[] | select(.metadata.annotations["meta.helm.sh/release-name"]==$rel and .metadata.annotations["meta.helm.sh/release-namespace"]==$ns) | .metadata.name' 2>/dev/null || true) + explicit_crds="" + while read -r crd_name; do + [[ -z "${crd_name}" ]] && continue + explicit_crds="${explicit_crds}$(echo "${all_crds_json}" \ + | jq -r --arg name "${crd_name}" --arg rel "${release}" --arg ns "${ns}" \ + '.items[] + | select(.metadata.name == $name) + | select( + ((.metadata.annotations["meta.helm.sh/release-name"] // "") == "") + or + ( + .metadata.annotations["meta.helm.sh/release-name"] == $rel + and + .metadata.annotations["meta.helm.sh/release-namespace"] == $ns + ) + ) + | .metadata.name' 2>/dev/null || true)"$'\n' + done < <(extra_crds_for_release "${release}") + [[ -z "${manifest_crds}" && -z "${annotated_crds}" && -z "${explicit_crds}" ]] && return 0 + printf '%s\n%s\n%s\n' "${manifest_crds}" "${annotated_crds}" "${explicit_crds}" \ + | awk 'NF' \ + | sort -u \ + | while read -r crd_name; do + check_crd_for_stuck_resources "${crd_name}" "${release}" + done + return 0 +} + +# ============================================================================== +# Pre-flight checks +# ============================================================================== +# Verify no custom resources with active finalizers exist for CRDs owned by +# bundle operators. After helm uninstall removes the operator, CRs with +# finalizers cannot be reconciled — blocking CRD deletion indefinitely. + +if [[ "${SKIP_PREFLIGHT}" == "true" ]]; then + echo "Skipping pre-flight checks (--skip-preflight)." +else + echo "Running pre-flight checks..." + PREFLIGHT_DETAILS=$(mktemp "${TMPDIR:-/tmp}/.aicr_preflight_XXXXXX") + PREFLIGHT_ALL_CRDS_JSON="" + + if ! 
capture_kubectl_json PREFLIGHT_ALL_CRDS_JSON get crd -o json; then + echo "" >&2 + echo "ERROR: Pre-flight could not list CRDs." >&2 + echo " kubectl output: ${PREFLIGHT_ALL_CRDS_JSON}" >&2 + echo " Failing closed: if this is a transient API/auth error the script" >&2 + echo " may remove an operator before finalizers are reconciled." >&2 + echo " Re-run after the API recovers, or use --skip-preflight to bypass." >&2 + rm -f "${PREFLIGHT_DETAILS}" + exit 1 + fi + if skip_preflight_for_release "cert-manager"; then + echo " Skipping cert-manager (cert-manager): bundle deletes dependent manifests before controller uninstall." + else + echo " Checking cert-manager (cert-manager)..." + check_release_for_stuck_crds "cert-manager" "cert-manager" + fi + + if [[ -s "${PREFLIGHT_DETAILS}" ]]; then + echo "" + echo "ERROR: Found custom resources with active finalizers that will block undeploy." + echo " After the operator is removed, these finalizers cannot be processed —" + echo " causing an unrecoverable hang during CRD deletion." + echo "" + echo " Delete these resources while their controller is still running," + echo " then re-run ./undeploy.sh" + echo "" + cat "${PREFLIGHT_DETAILS}" + echo "" + echo " To skip this check: ./undeploy.sh --skip-preflight" + rm -f "${PREFLIGHT_DETAILS}" + exit 1 + fi + rm -f "${PREFLIGHT_DETAILS}" + + echo "Pre-flight checks passed." +fi + +# ============================================================================== +# Uninstall components in reverse install order +# ============================================================================== +# Generic reverse loop: every folder is a Helm release (local-helm or upstream-helm). +# `helm uninstall ` works uniformly for both kinds — that's one of the +# benefits of the uniform local-chart bundle format. 
+for dir in $(ls -1d "${SCRIPT_DIR}"/[0-9][0-9][0-9]-*/ 2>/dev/null | sort -r); do + [[ -d "${dir}" ]] || continue + base="${dir%/}" + base="${base##*/}" + name="${base#[0-9][0-9][0-9]-}" + # Derive the namespace from the component name via the per-component blocks + # below. We still need it for helm_force_uninstall and cluster-resource cleanup. + ns="" + if [[ "${name}" == "cert-manager" ]]; then ns="cert-manager"; fi + # Injected mixed-component "-post" folders share their parent's namespace. + if [[ "${name}" == "cert-manager-post" ]]; then ns="cert-manager"; fi + if [[ -z "${ns}" ]]; then + echo "Warning: no namespace known for ${name}; skipping uninstall" >&2 + continue + fi + echo "Uninstalling ${name} (${ns})..." + # For local-helm folders (Chart.yaml + templates/), kubectl-delete the + # rendered templates BEFORE helm uninstall. The templates may carry + # helm.sh/hook annotations (post-install/post-upgrade) which helm does + # not track or clean up on uninstall — so without this pre-delete, the + # operator gets removed but its hook-created CRs (and their finalizers) + # linger. Doing the delete while the controller is still running lets + # finalizers clear naturally. + if [[ -d "${dir}/templates" ]]; then + for tpl in "${dir}/templates/"*.yaml; do + [[ -f "${tpl}" ]] || continue + kubectl delete -n "${ns}" -f "${tpl}" --ignore-not-found --timeout="${HELM_TIMEOUT}s" || true + done + fi + helm_force_uninstall "${name}" "${ns}" + delete_release_cluster_resources "${name}" "${ns}" + delete_orphaned_webhooks_for_ns "${ns}" +done + +# Remove nodewright node taints that persist after operator removal. +# Nodewright taints nodes during kernel tuning. The taint key is configurable +# via runtimeRequiredTaint (defaults to skyhook.nvidia.com). + +# Clean up orphaned CRDs that were owned by this bundle's releases. +# Only delete CRDs whose Helm release annotation matches a component we just uninstalled. 
+kubectl get crd -o json 2>/dev/null \ + | jq -r --arg rel "cert-manager" --arg ns "cert-manager" \ + '.items[] | select(.metadata.annotations["meta.helm.sh/release-name"]==$rel and .metadata.annotations["meta.helm.sh/release-namespace"]==$ns) | .metadata.name' 2>/dev/null \ + | while read -r name; do + echo "Deleting CRD ${name} (owned by cert-manager/cert-manager)..." + kubectl delete crd "${name}" --ignore-not-found --wait=false \ + || echo "Warning: failed to delete CRD ${name} (owned by cert-manager/cert-manager); leftovers will surface in post-flight" >&2 + done || echo "Warning: orphan-CRD cleanup for cert-manager/cert-manager failed (kubectl get / jq error); leftovers will surface in post-flight" >&2 + +# Intentionally skip automatic deletion of unannotated CRDs matched only by +# API group. On shared clusters, those CRDs may be serving another tenant's +# release in the same group, and we do not have bundle-specific ownership +# metadata to distinguish "ours" from "theirs" safely. + +# Warn about CRDs stuck in deleting state (e.g., customresourcecleanup finalizer +# can't be resolved because CR instances still have controller-managed finalizers). +stuck_crds=$(kubectl get crd -o json 2>/dev/null | jq -r '.items[] | select(.metadata.deletionTimestamp != null) | .metadata.name' 2>/dev/null || true) +if [[ -n "${stuck_crds}" ]]; then + echo "" + echo "WARNING: CRDs stuck in deleting state:" + for crd in ${stuck_crds}; do + echo " ${crd}" + done + echo "" + echo " These CRDs have a deletionTimestamp but cannot complete deletion because" + echo " their customresourcecleanup finalizer is waiting for CR instances to be" + echo " removed. If you are certain no data will be lost, you can force-clear the" + echo " finalizers. Note: this may leave orphaned CR data in etcd that is no longer" + echo " accessible through the API." 
+ echo "" + echo " Commands to force-clear (review before running):" + for crd in ${stuck_crds}; do + echo " kubectl patch crd ${crd} --type merge -p '{\"metadata\":{\"finalizers\":null}}'" + done +fi + +# Clean up namespaces after all components are uninstalled. +if [[ "${DELETE_PVCS}" == "true" ]] && ! echo " ${PROTECTED_NS} " | grep -q " cert-manager "; then + echo "Deleting PVCs in cert-manager..." + kubectl delete pvc --all -n cert-manager --ignore-not-found || true +fi +delete_orphaned_webhooks_for_ns "cert-manager" +delete_namespace "cert-manager" + +# Clean up companion namespaces created at runtime by operators. +# Only emitted for components whose runtime creates them. + +# Wait for terminating namespaces to finish +echo "Waiting for namespaces to terminate..." +for i in $(seq 1 60); do + TERMINATING=$(kubectl get ns --no-headers 2>/dev/null | grep Terminating | awk '{print $1}' || true) + if [[ -z "${TERMINATING}" ]]; then + break + fi + if [[ $i -eq 60 ]]; then + echo "Warning: namespaces still terminating after 60s: ${TERMINATING}" + for ns in ${TERMINATING}; do + force_clear_namespace_finalizers "${ns}" + kubectl delete namespace "${ns}" --ignore-not-found --wait=false 2>/dev/null || true + done + break + fi + sleep 1 +done + +# Final webhook cleanup pass. +delete_orphaned_webhooks_for_ns "cert-manager" + +# ============================================================================== +# Post-flight verification +# ============================================================================== + +postflight_issues=false + +TERMINATING=$(kubectl get namespaces -o jsonpath='{range .items[?(@.status.phase=="Terminating")]}{.metadata.name}{" "}{end}' 2>/dev/null || true) +if [[ -n "${TERMINATING}" ]]; then + echo "WARNING: namespaces still terminating: ${TERMINATING}" + echo " A subsequent deploy.sh may fail. Wait or force-finalize these namespaces." 
+ postflight_issues=true +fi + +kubectl get mutatingwebhookconfigurations,validatingwebhookconfigurations -o json 2>/dev/null | \ + jq -r '.items[] | .metadata.name as $wh | .webhooks[]? | select(.clientConfig.service != null) | [$wh, .clientConfig.service.namespace, .clientConfig.service.name] | @tsv' 2>/dev/null | \ + sort -u | \ + while IFS=$'\t' read -r wh_name svc_ns svc_name; do + if ! kubectl get ns "${svc_ns}" &>/dev/null || ! kubectl get svc "${svc_name}" -n "${svc_ns}" &>/dev/null; then + echo "Deleting stale webhook ${wh_name} (service ${svc_ns}/${svc_name} missing)..." + kubectl delete mutatingwebhookconfiguration "${wh_name}" --ignore-not-found 2>/dev/null || true + kubectl delete validatingwebhookconfiguration "${wh_name}" --ignore-not-found 2>/dev/null || true + fi + done || true + +stale_apis=$(kubectl get apiservices -o json 2>/dev/null | jq -r '.items[] | select(.status.conditions[]? | .type == "Available" and .status == "False") | .metadata.name' 2>/dev/null || true) +if [[ -n "${stale_apis}" ]]; then + echo "WARNING: unavailable API services found: ${stale_apis}" + echo " These can block namespace deletion. Delete with: kubectl delete apiservice " + postflight_issues=true +fi + +# Check for Helm-annotated CRDs from uninstalled releases. 
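Reviewer note: the namespace wait earlier in this script is a bounded polling loop: check a condition, give up after a fixed number of attempts, then fall through to forced cleanup. A generic sketch of that pattern is below. The `wait_until` helper and its arguments are hypothetical, and `true`/`false` stand in for the `kubectl get ns` check so the sketch runs without a cluster.

```bash
#!/usr/bin/env bash
# Sketch only: wait_until is an illustrative helper, not part of undeploy.sh.
set -eu

# wait_until ATTEMPTS DELAY_SECONDS COMMAND...
# Runs COMMAND until it succeeds or ATTEMPTS checks have failed.
# Returns 0 on success, 1 on timeout.
wait_until() {
  attempts="$1"
  delay="$2"
  shift 2
  i=1
  while [ "${i}" -le "${attempts}" ]; do
    if "$@"; then
      return 0
    fi
    i=$((i + 1))
    sleep "${delay}"
  done
  return 1
}

if wait_until 3 0 true; then
  echo "condition met"
fi
if wait_until 2 0 false; then
  echo "unexpected success"
else
  echo "timed out; fall back to forced cleanup"
fi
```

Keeping the timeout path explicit, rather than looping forever, is what lets the script continue to the post-flight checks and report leftovers instead of hanging.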
+helm_orphaned_crds="" +explicit_orphaned_crds="" +postflight_all_crds_json="" +if capture_kubectl_json postflight_all_crds_json get crd -o json; then + : + remaining_helm_crds=$(echo "${postflight_all_crds_json}" \ + | jq -r --arg rel "cert-manager" --arg ns "cert-manager" \ + '.items[] | select(.metadata.annotations["meta.helm.sh/release-name"]==$rel and .metadata.annotations["meta.helm.sh/release-namespace"]==$ns and .metadata.deletionTimestamp==null) | .metadata.name' 2>/dev/null || true) + if [[ -n "${remaining_helm_crds}" ]]; then + helm_orphaned_crds="${helm_orphaned_crds} ${remaining_helm_crds}" + fi + while read -r crd_name; do + [[ -z "${crd_name}" ]] && continue + remaining_explicit_crd=$(echo "${postflight_all_crds_json}" \ + | jq -r --arg name "${crd_name}" \ + '.items[] | select(.metadata.name==$name and .metadata.deletionTimestamp==null) | .metadata.name' 2>/dev/null || true) + if [[ -n "${remaining_explicit_crd}" ]]; then + explicit_orphaned_crds="${explicit_orphaned_crds}${remaining_explicit_crd}"$'\n' + fi + done < <(extra_crds_for_release "cert-manager") +else + echo "Warning: failed to enumerate post-flight CRDs; kubectl output: ${postflight_all_crds_json}" >&2 + postflight_issues=true +fi +if [[ -n "${helm_orphaned_crds}" ]]; then + echo "WARNING: Helm-annotated CRDs from uninstalled releases still present:${helm_orphaned_crds}" + echo " Cleanup did not remove all CRDs owned by this bundle's releases." + echo " Delete with: kubectl delete crd " + postflight_issues=true +fi + +explicit_orphaned_crds=$(printf '%s' "${explicit_orphaned_crds}" | awk 'NF' | sort -u | tr '\n' ' ') +if [[ -n "${explicit_orphaned_crds}" ]]; then + echo "WARNING: explicit CRDs from this bundle still present: ${explicit_orphaned_crds}" + echo " These CRDs are installed outside Helm manifest/annotation discovery." 
+ echo " Delete with: kubectl delete crd " + postflight_issues=true +fi + +if [[ "${postflight_issues}" == "true" ]]; then + echo "" + echo "Post-flight: some stale resources remain. Run deploy.sh pre-flight checks to verify before redeploying." +else + echo "Post-flight: cluster is clean." +fi + +echo "Undeployment complete." diff --git a/pkg/bundler/deployer/localformat/doc.go b/pkg/bundler/deployer/localformat/doc.go new file mode 100644 index 000000000..34e45fce4 --- /dev/null +++ b/pkg/bundler/deployer/localformat/doc.go @@ -0,0 +1,101 @@ +// Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +// Package localformat writes the uniform numbered local-chart bundle layout. +// Currently consumed by the helm deployer (--deployer helm). Designed to be +// consumable by additional deployers (e.g. helmfile per #632, argocd, Flux) +// without per-deployer changes to the writer; those integrations are not yet +// wired in this package. +// +// # Layout +// +// Each emitted folder is named NNN-/ where NNN is a zero-padded +// 1-based index. Folders are one of two kinds, distinguished solely by the +// presence or absence of Chart.yaml: +// +// - KindUpstreamHelm — no Chart.yaml. The folder carries values.yaml, +// cluster-values.yaml, upstream.env (CHART/REPO/VERSION), and a rendered +// install.sh that installs the upstream chart via `helm upgrade --install`. 
+// +// - KindLocalHelm — Chart.yaml + templates/ present. The folder is a +// self-contained Helm chart; install.sh installs `./` as a local chart. +// +// The Chart.yaml presence rule is the sole branch point for consumers. No +// component-kind metadata is re-read at deploy time. This is deliberate: +// a previous design branched deploy.sh on Helm/Kustomize/raw-manifest kinds, +// which bled component-type classification into every deployer. Chart.yaml +// presence reduces that to a single on-disk signal every deployer honors. +// +// # Classification +// +// Recipe shape determines the folder kind: +// +// Helm repository set, no manifests → KindUpstreamHelm +// Helm repository set, with raw manifests → KindUpstreamHelm primary + +// KindLocalHelm "-post" injection +// Helm repository empty, manifests only → KindLocalHelm (wrapped) +// Kustomize (Tag/Path set) → KindLocalHelm (kustomize build +// output wrapped as templates/manifest.yaml) +// +// # Mixed components and the "-post" injection +// +// When a single recipe component declares both an upstream Helm chart and raw +// manifests, Write emits two adjacent folders: the primary NNN-/ as +// KindUpstreamHelm, immediately followed by (NNN+1)--post/ as +// KindLocalHelm wrapping the raw manifests. Subsequent components shift by +// one. The "mixed" concept does not appear in the recipe types, deployment +// order, or bundle result — it exists only in the bundle layout. +// +// The -post folder deploys after the upstream chart, so raw manifests that +// reference the chart's CRDs apply against a cluster where those CRDs already +// exist. This is what makes the earlier pre-apply-with-retry mechanism (which +// applied raw manifests before the chart and retried on "CRD not found" +// errors) structurally unnecessary. +// +// # Base-format invariants +// +// These are load-bearing contracts. Callers and contributors should not +// violate them without changing the design: +// +// 1. 
localformat never writes deployer-specific files. deploy.sh, +// helmfile.yaml, argocd Application CRs, Flux HelmReleases, and the like +// are produced by the respective deployer after Write returns. Write +// owns per-folder content; deployers own top-level orchestration files. +// This separation is what makes a single folder layout consumable by +// every deployer without localformat growing per-deployer branches. +// +// 2. install.sh is never name-customized. Rendered from one of exactly two +// templates (upstream-helm, local-helm), parameterized only by data +// (name, namespace, upstream ref). Name-keyed component quirks +// (kai-scheduler async skip, skyhook taint cleanup, DRA restart, orphan +// CRD scan) stay in deploy.sh as name-matched blocks — not in install.sh. +// This is the structural barrier that keeps per-folder scripts from +// accumulating drift the way deploy.sh's branching did. +// +// 3. Write is deterministic and idempotent. Same Options in, same on-disk +// bytes and same Folder slice out. Map iteration is sorted; no +// timestamps or random suffixes are embedded in generated content. +// +// # Caller contract +// +// Callers pass an ordered Components slice (sorted by deployment order) +// and a ComponentManifests map (name → path → rendered bytes) that drives +// both the -post injection for mixed components and the template contents +// for manifest-only wrapped charts. Write returns a []Folder manifest so +// deployers can generate their own orchestration files without +// re-classifying or re-reading disk. +// +// Further detail: ticket #662 carries the original design discussion and +// alternatives considered. +package localformat diff --git a/pkg/bundler/deployer/localformat/folder.go b/pkg/bundler/deployer/localformat/folder.go new file mode 100644 index 000000000..79ede3e78 --- /dev/null +++ b/pkg/bundler/deployer/localformat/folder.go @@ -0,0 +1,59 @@ +// Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. 
+// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +package localformat + +// FolderKind classifies a written folder by the presence/absence of Chart.yaml. +type FolderKind int + +const ( + // KindUpstreamHelm: folder contains no Chart.yaml; install.sh references + // an upstream Helm chart via upstream.env. + KindUpstreamHelm FolderKind = iota + // KindLocalHelm: folder contains a generated Chart.yaml + templates/; + // install.sh installs ./ as a local chart. + KindLocalHelm +) + +// String returns the stable textual name for the kind. Used by logs and +// golden-file diagnostics so diffs show kind names rather than integers. +func (k FolderKind) String() string { + switch k { + case KindUpstreamHelm: + return "upstream-helm" + case KindLocalHelm: + return "local-helm" + default: + return "unknown" + } +} + +// Upstream holds upstream chart reference fields written to upstream.env. +type Upstream struct { + Chart string + Repo string + Version string +} + +// Folder describes one written folder. Returned by Write so callers +// (deployers) can generate orchestration files without re-classifying. +type Folder struct { + Index int // 1-based; rendered as zero-padded 3-digit prefix in Dir + Dir string // e.g. 
"001-nfd" + Kind FolderKind + Name string // component name, or "-post" for injected + Parent string // component this folder belongs to (== Name for primary) + Upstream *Upstream // set iff Kind == KindUpstreamHelm + Files []string // relative paths (to OutputDir) of files written in this folder +} diff --git a/pkg/bundler/deployer/localformat/kustomize.go b/pkg/bundler/deployer/localformat/kustomize.go new file mode 100644 index 000000000..27e7135c8 --- /dev/null +++ b/pkg/bundler/deployer/localformat/kustomize.go @@ -0,0 +1,53 @@ +// Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +package localformat + +import ( + "context" + + "sigs.k8s.io/kustomize/api/krusty" + "sigs.k8s.io/kustomize/kyaml/filesys" + + "github.com/NVIDIA/aicr/pkg/errors" +) + +// buildKustomize runs `kustomize build` against path using the in-process +// kustomize Go library, returning the flattened single-YAML-document output. +// Uses filesys.MakeFsOnDisk so relative resource refs inside the kustomization +// resolve as they would on the command line. +// +// Cancellation is best-effort: krusty.Kustomizer.Run does not accept a +// context, so we honor ctx by checking ctx.Err() before invocation. A +// cancellation that fires mid-build is observed only after Run returns; +// for bundle overlays Run completes in milliseconds, so this is acceptable. 
+// +// Returns ErrCodeInternal on kustomize build or YAML marshal failure; +// ErrCodeTimeout if the context is canceled before invocation. +func buildKustomize(ctx context.Context, path string) ([]byte, error) { + if err := ctx.Err(); err != nil { + return nil, errors.Wrap(errors.ErrCodeTimeout, "context cancelled", err) + } + fs := filesys.MakeFsOnDisk() + k := krusty.MakeKustomizer(krusty.MakeDefaultOptions()) + rm, err := k.Run(fs, path) + if err != nil { + return nil, errors.Wrap(errors.ErrCodeInternal, "kustomize build failed", err) + } + out, err := rm.AsYaml() + if err != nil { + return nil, errors.Wrap(errors.ErrCodeInternal, "kustomize YAML marshal", err) + } + return out, nil +} diff --git a/pkg/bundler/deployer/localformat/local_helm.go b/pkg/bundler/deployer/localformat/local_helm.go new file mode 100644 index 000000000..f59e5ff22 --- /dev/null +++ b/pkg/bundler/deployer/localformat/local_helm.go @@ -0,0 +1,174 @@ +// Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +package localformat + +import ( + "embed" + "fmt" + "log/slog" + "os" + "path/filepath" + "sort" + "strings" + "text/template" + + "github.com/NVIDIA/aicr/pkg/bundler/deployer" + "github.com/NVIDIA/aicr/pkg/errors" + "github.com/NVIDIA/aicr/pkg/manifest" +) + +// hasYAMLObjects returns true if content contains at least one YAML object +// (a non-comment, non-blank, non-separator line). 
Used to skip writing +// fully-conditional manifests that rendered to nothing once values were +// applied. Mirrors the helper that lived in the old helm deployer. +func hasYAMLObjects(content []byte) bool { + for _, line := range strings.Split(string(content), "\n") { + trimmed := strings.TrimSpace(line) + if trimmed == "" || strings.HasPrefix(trimmed, "#") || trimmed == "---" { + continue + } + return true + } + return false +} + +//go:embed templates/install-local-helm.sh.tmpl templates/chart.yaml.tmpl +var localHelmTemplates embed.FS + +var ( + localHelmInstallTmpl = template.Must( + template.ParseFS(localHelmTemplates, "templates/install-local-helm.sh.tmpl"), + ) + localHelmChartTmpl = template.Must( + template.ParseFS(localHelmTemplates, "templates/chart.yaml.tmpl"), + ) +) + +// writeLocalHelmFolder writes Chart.yaml + templates/* + values.yaml + +// cluster-values.yaml + install.sh into outputDir/dir. name is the folder +// and release name ("" for primary, "-post" for injected mixed); +// parent is the originating component name (== name for primary). manifests +// is the per-path rendered bytes map for this folder; may be empty (a +// manifest-only component with no manifests still yields a valid empty +// templates/ directory). Returns the Folder manifest with Files listed in +// deterministic order. 
+func writeLocalHelmFolder( + outputDir, dir string, idx int, c Component, + manifests map[string][]byte, renderInput manifest.RenderInput, + name, parent string, +) (Folder, error) { + + folderDir, err := deployer.SafeJoin(outputDir, dir) + if err != nil { + return Folder{}, errors.Wrap(errors.ErrCodeInvalidRequest, "folder path unsafe", err) + } + if err = os.MkdirAll(folderDir, 0o755); err != nil { + return Folder{}, errors.Wrap(errors.ErrCodeInternal, + fmt.Sprintf("create folder %s", dir), err) + } + templatesDir, err := deployer.SafeJoin(folderDir, "templates") + if err != nil { + return Folder{}, errors.Wrap(errors.ErrCodeInvalidRequest, "templates dir path unsafe", err) + } + if err = os.MkdirAll(templatesDir, 0o755); err != nil { + return Folder{}, errors.Wrap(errors.ErrCodeInternal, + fmt.Sprintf("create templates dir for %s", dir), err) + } + + // Chart.yaml + chartData := struct { + Name string + Parent string + }{name, parent} + if err = renderTemplateToFile(localHelmChartTmpl, chartData, folderDir, "Chart.yaml", 0o644); err != nil { + return Folder{}, err + } + + // values.yaml + cluster-values.yaml + if err = writeValueFiles(folderDir, c); err != nil { + return Folder{}, err + } + + // templates/* from manifests (sorted for determinism; basename-collision is + // an error so two manifests with the same file name cannot silently overwrite). 
+ sortedPaths := make([]string, 0, len(manifests)) + for p := range manifests { + sortedPaths = append(sortedPaths, p) + } + sort.Strings(sortedPaths) + + seen := make(map[string]string, len(sortedPaths)) + templateRelPaths := make([]string, 0, len(sortedPaths)) + for _, p := range sortedPaths { + baseName := filepath.Base(p) + if prev, ok := seen[baseName]; ok { + return Folder{}, errors.New(errors.ErrCodeInvalidRequest, + fmt.Sprintf("manifest basename collision in component %q: %q and %q both resolve to %q", + c.Name, prev, p, baseName)) + } + seen[baseName] = p + + rendered, rerr := manifest.Render(manifests[p], renderInput) + if rerr != nil { + return Folder{}, errors.Wrap(errors.ErrCodeInternal, + fmt.Sprintf("render manifest %s for %s", p, c.Name), rerr) + } + // Skip writing if the rendered output has no YAML objects (only + // comments / blanks / separators) — typical for fully-conditional + // manifests when the relevant value was set false at bundle time. + // Mirrors the OLD helm deployer's hasYAMLObjects check. 
+ if !hasYAMLObjects(rendered) { + slog.Debug("skipping empty manifest after render", + "component", c.Name, "manifest", baseName) + continue + } + outPath, jerr := deployer.SafeJoin(templatesDir, baseName) + if jerr != nil { + return Folder{}, errors.Wrap(errors.ErrCodeInvalidRequest, + fmt.Sprintf("template file path unsafe: %s", baseName), jerr) + } + if werr := writeFile(outPath, rendered, 0o644); werr != nil { + return Folder{}, werr + } + templateRelPaths = append(templateRelPaths, filepath.Join(dir, "templates", baseName)) + } + + // install.sh + installData := struct { + Name string + Namespace string + }{name, c.Namespace} + if err = renderTemplateToFile(localHelmInstallTmpl, installData, folderDir, "install.sh", 0o755); err != nil { + return Folder{}, err + } + + files := make([]string, 0, 3+len(templateRelPaths)+1) + files = append(files, + filepath.Join(dir, "Chart.yaml"), + filepath.Join(dir, "values.yaml"), + filepath.Join(dir, "cluster-values.yaml"), + ) + files = append(files, templateRelPaths...) + files = append(files, filepath.Join(dir, "install.sh")) + + return Folder{ + Index: idx, + Dir: dir, + Kind: KindLocalHelm, + Name: name, + Parent: parent, + Files: files, + }, nil +} diff --git a/pkg/bundler/deployer/localformat/templates/chart.yaml.tmpl b/pkg/bundler/deployer/localformat/templates/chart.yaml.tmpl new file mode 100644 index 000000000..66e152ecf --- /dev/null +++ b/pkg/bundler/deployer/localformat/templates/chart.yaml.tmpl @@ -0,0 +1,20 @@ +# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +apiVersion: v2 +name: {{ .Name }} +description: Generated wrapper chart for {{ .Parent }} local content. +type: application +version: 0.1.0 +appVersion: "0.1.0" diff --git a/pkg/bundler/deployer/localformat/templates/install-local-helm.sh.tmpl b/pkg/bundler/deployer/localformat/templates/install-local-helm.sh.tmpl new file mode 100644 index 000000000..63be5f816 --- /dev/null +++ b/pkg/bundler/deployer/localformat/templates/install-local-helm.sh.tmpl @@ -0,0 +1,23 @@ +#!/usr/bin/env bash +# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +set -euo pipefail +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +cd "${SCRIPT_DIR}" + +helm upgrade --install {{ .Name }} ./ \ + --namespace {{ .Namespace }} --create-namespace \ + -f values.yaml -f cluster-values.yaml \ + ${COMPONENT_WAIT_ARGS:-} ${DRY_RUN_FLAG:-} ${KUBECONFIG_FLAG:-} ${HELM_DEBUG_FLAG:-} diff --git a/pkg/bundler/deployer/localformat/templates/install-upstream-helm.sh.tmpl b/pkg/bundler/deployer/localformat/templates/install-upstream-helm.sh.tmpl new file mode 100644 index 000000000..4128053d0 --- /dev/null +++ b/pkg/bundler/deployer/localformat/templates/install-upstream-helm.sh.tmpl @@ -0,0 +1,29 @@ +#!/usr/bin/env bash +# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +set -euo pipefail +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +cd "${SCRIPT_DIR}" +# shellcheck source=/dev/null +source ./upstream.env + +# CHART carries the full OCI URI for OCI charts and just the chart name for +# HTTP/HTTPS charts. REPO is non-empty only for HTTP/HTTPS charts; the +# ${REPO:+--repo "${REPO}"} expansion adds --repo iff REPO is set. 
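Reviewer note: the `${REPO:+--repo "${REPO}"}` expansion described in the comment above can be checked in isolation. The sketch below prints the resulting arguments one per line so the word boundaries are visible. The `build_args` helper and the chart, repository, and version values are made up for illustration.

```bash
#!/usr/bin/env bash
# Sketch only: build_args is a hypothetical helper; chart values are made up.
set -eu

# Print the chart-related arguments one per line.
build_args() {
  CHART="$1"
  REPO="$2"
  VERSION="$3"
  # ${REPO:+...} expands to nothing when REPO is unset or empty,
  # and to two words (--repo and the URL) otherwise.
  printf '%s\n' "${CHART}" ${REPO:+--repo "${REPO}"} --version "${VERSION}"
}

echo "HTTP/HTTPS chart:"
build_args "demo-chart" "https://example.invalid/charts" "1.2.3"
echo "OCI chart:"
build_args "oci://example.invalid/charts/demo-chart" "" "1.2.3"
```

The HTTP/HTTPS case yields five arguments including `--repo`, while the OCI case yields three, so a single template line serves both chart sources without a branch.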
+helm upgrade --install {{ .Name }} "${CHART}" \ + ${REPO:+--repo "${REPO}"} --version "${VERSION}" \ + --namespace {{ .Namespace }} --create-namespace \ + -f values.yaml -f cluster-values.yaml \ + ${COMPONENT_WAIT_ARGS:-} ${DRY_RUN_FLAG:-} ${KUBECONFIG_FLAG:-} ${HELM_DEBUG_FLAG:-} diff --git a/pkg/bundler/deployer/localformat/testdata/README.md b/pkg/bundler/deployer/localformat/testdata/README.md new file mode 100644 index 000000000..68a809017 --- /dev/null +++ b/pkg/bundler/deployer/localformat/testdata/README.md @@ -0,0 +1,22 @@ +# localformat test fixtures + +Golden-file fixtures for `localformat.Write`'s per-folder output. Each +subdirectory captures the bytes a single bundle folder of a particular +kind looks like, and the harness in +[`writer_test.go`](../writer_test.go) (`assertGolden`) byte-compares. + +| Directory | Folder kind under test | +|---|---| +| `upstream_helm_only/` | `KindUpstreamHelm` — folder with `install.sh` + `upstream.env`, no `Chart.yaml` | +| `local_helm_manifest_only/` | `KindLocalHelm` for a manifest-only component — `Chart.yaml` + `templates/` + values | +| `kustomize_input/` | Input to `buildKustomize` (kustomization.yaml + resource); not a golden — fed into the kustomize build path | + +For background on the pattern, regen command, and why these files don't +carry Apache license headers, see +[`pkg/bundler/deployer/helm/testdata/README.md`](../../helm/testdata/README.md). + +## Regenerate + +```bash +go test ./pkg/bundler/deployer/localformat/... -update +``` diff --git a/pkg/bundler/deployer/localformat/testdata/kustomize_input/cm.yaml b/pkg/bundler/deployer/localformat/testdata/kustomize_input/cm.yaml new file mode 100644 index 000000000..e68416071 --- /dev/null +++ b/pkg/bundler/deployer/localformat/testdata/kustomize_input/cm.yaml @@ -0,0 +1,20 @@ +# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +apiVersion: v1 +kind: ConfigMap +metadata: + name: example +data: + key: value diff --git a/pkg/bundler/deployer/localformat/testdata/kustomize_input/kustomization.yaml b/pkg/bundler/deployer/localformat/testdata/kustomize_input/kustomization.yaml new file mode 100644 index 000000000..931b14e47 --- /dev/null +++ b/pkg/bundler/deployer/localformat/testdata/kustomize_input/kustomization.yaml @@ -0,0 +1,18 @@ +# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +apiVersion: kustomize.config.k8s.io/v1beta1 +kind: Kustomization +resources: + - cm.yaml diff --git a/pkg/bundler/deployer/localformat/testdata/local_helm_manifest_only/001-skyhook-customizations/Chart.yaml b/pkg/bundler/deployer/localformat/testdata/local_helm_manifest_only/001-skyhook-customizations/Chart.yaml new file mode 100644 index 000000000..98481b3f9 --- /dev/null +++ b/pkg/bundler/deployer/localformat/testdata/local_helm_manifest_only/001-skyhook-customizations/Chart.yaml @@ -0,0 +1,20 @@ +# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +apiVersion: v2 +name: skyhook-customizations +description: Generated wrapper chart for skyhook-customizations local content. +type: application +version: 0.1.0 +appVersion: "0.1.0" diff --git a/pkg/bundler/deployer/localformat/testdata/local_helm_manifest_only/001-skyhook-customizations/install.sh b/pkg/bundler/deployer/localformat/testdata/local_helm_manifest_only/001-skyhook-customizations/install.sh new file mode 100644 index 000000000..5b55582b7 --- /dev/null +++ b/pkg/bundler/deployer/localformat/testdata/local_helm_manifest_only/001-skyhook-customizations/install.sh @@ -0,0 +1,23 @@ +#!/usr/bin/env bash +# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +set -euo pipefail +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +cd "${SCRIPT_DIR}" + +helm upgrade --install skyhook-customizations ./ \ + --namespace skyhook --create-namespace \ + -f values.yaml -f cluster-values.yaml \ + ${COMPONENT_WAIT_ARGS:-} ${DRY_RUN_FLAG:-} ${KUBECONFIG_FLAG:-} ${HELM_DEBUG_FLAG:-} diff --git a/pkg/bundler/deployer/localformat/testdata/local_helm_manifest_only/001-skyhook-customizations/templates/customization.yaml b/pkg/bundler/deployer/localformat/testdata/local_helm_manifest_only/001-skyhook-customizations/templates/customization.yaml new file mode 100644 index 000000000..69b32512b --- /dev/null +++ b/pkg/bundler/deployer/localformat/testdata/local_helm_manifest_only/001-skyhook-customizations/templates/customization.yaml @@ -0,0 +1,18 @@ +# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +apiVersion: v1 +kind: ConfigMap +metadata: + name: x diff --git a/pkg/bundler/deployer/localformat/testdata/upstream_helm_only/001-nfd/install.sh b/pkg/bundler/deployer/localformat/testdata/upstream_helm_only/001-nfd/install.sh new file mode 100644 index 000000000..351339e48 --- /dev/null +++ b/pkg/bundler/deployer/localformat/testdata/upstream_helm_only/001-nfd/install.sh @@ -0,0 +1,29 @@ +#!/usr/bin/env bash +# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +set -euo pipefail +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +cd "${SCRIPT_DIR}" +# shellcheck source=/dev/null +source ./upstream.env + +# CHART carries the full OCI URI for OCI charts and just the chart name for +# HTTP/HTTPS charts. REPO is non-empty only for HTTP/HTTPS charts; the +# ${REPO:+--repo "${REPO}"} expansion adds --repo iff REPO is set. 
+helm upgrade --install nfd "${CHART}" \ + ${REPO:+--repo "${REPO}"} --version "${VERSION}" \ + --namespace node-feature-discovery --create-namespace \ + -f values.yaml -f cluster-values.yaml \ + ${COMPONENT_WAIT_ARGS:-} ${DRY_RUN_FLAG:-} ${KUBECONFIG_FLAG:-} ${HELM_DEBUG_FLAG:-} diff --git a/pkg/bundler/deployer/localformat/testdata/upstream_helm_only/001-nfd/upstream.env b/pkg/bundler/deployer/localformat/testdata/upstream_helm_only/001-nfd/upstream.env new file mode 100644 index 000000000..bf2aaef15 --- /dev/null +++ b/pkg/bundler/deployer/localformat/testdata/upstream_helm_only/001-nfd/upstream.env @@ -0,0 +1,3 @@ +CHART='node-feature-discovery' +REPO='https://kubernetes-sigs.github.io/node-feature-discovery/charts' +VERSION='v0.16.1' diff --git a/pkg/bundler/deployer/localformat/upstream_helm.go b/pkg/bundler/deployer/localformat/upstream_helm.go new file mode 100644 index 000000000..7a440dac1 --- /dev/null +++ b/pkg/bundler/deployer/localformat/upstream_helm.go @@ -0,0 +1,113 @@ +// Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. 
+ +package localformat + +import ( + "embed" + "fmt" + "os" + "path/filepath" + "strings" + "text/template" + + "github.com/NVIDIA/aicr/pkg/bundler/deployer" + "github.com/NVIDIA/aicr/pkg/errors" +) + +//go:embed templates/install-upstream-helm.sh.tmpl +var upstreamHelmTemplates embed.FS + +// shellSingleQuote wraps s in single quotes for safe inclusion in a shell +// `source`-able file (e.g. KEY='value'). Embedded single quotes are escaped +// with the canonical close-escape-reopen sequence `'\''`. +// +// Single quotes (vs double quotes) are required: inside double quotes, +// `$()`, backticks, and `$VAR` still expand, so a malicious or +// pathological value could execute when sourced. +func shellSingleQuote(s string) string { + return "'" + strings.ReplaceAll(s, "'", `'\''`) + "'" +} + +var upstreamHelmTmpl = template.Must( + template.ParseFS(upstreamHelmTemplates, "templates/install-upstream-helm.sh.tmpl"), +) + +// writeUpstreamHelmFolder writes values.yaml + cluster-values.yaml + upstream.env + install.sh +// into outputDir/dir. Returns the Folder manifest (Files are all relative to outputDir). +func writeUpstreamHelmFolder(outputDir, dir string, idx int, c Component) (Folder, error) { + folderDir, err := deployer.SafeJoin(outputDir, dir) + if err != nil { + return Folder{}, errors.Wrap(errors.ErrCodeInvalidRequest, "folder path unsafe", err) + } + if err = os.MkdirAll(folderDir, 0o755); err != nil { + return Folder{}, errors.Wrap(errors.ErrCodeInternal, + fmt.Sprintf("create folder %s", dir), err) + } + + if err = writeValueFiles(folderDir, c); err != nil { + return Folder{}, err + } + + // For OCI charts, helm wants the full URI as the chart argument (no --repo). + // For HTTP/HTTPS charts, helm wants the chart name + a separate --repo flag. + // Encode that into upstream.env so install.sh can be branch-free. 
+ chart, repo := c.ChartName, c.Repository + if c.IsOCI { + chart = strings.TrimRight(c.Repository, "/") + "/" + c.ChartName + repo = "" + } + // install.sh sources upstream.env; values must be shell-safe so a value + // containing a single quote, $(...), or backticks can't escape into + // command execution. shellSingleQuote wraps the value in single quotes + // and replaces any embedded single quote with the four-character + // sequence '\'' — a closed quote, an escaped quote, then a re-opened + // one — which is the only POSIX-safe escape inside single quotes. + envContent := fmt.Sprintf("CHART=%s\nREPO=%s\nVERSION=%s\n", + shellSingleQuote(chart), shellSingleQuote(repo), shellSingleQuote(c.Version)) + envPath, err := deployer.SafeJoin(folderDir, "upstream.env") + if err != nil { + return Folder{}, errors.Wrap(errors.ErrCodeInvalidRequest, "upstream.env path unsafe", err) + } + if err = writeFile(envPath, []byte(envContent), 0o600); err != nil { + return Folder{}, err + } + + installData := struct { + Name string + Namespace string + }{c.Name, c.Namespace} + if err = renderTemplateToFile(upstreamHelmTmpl, installData, folderDir, "install.sh", 0o755); err != nil { + return Folder{}, err + } + + return Folder{ + Index: idx, + Dir: dir, + Kind: KindUpstreamHelm, + Name: c.Name, + Parent: c.Name, + Upstream: &Upstream{ + Chart: c.ChartName, + Repo: c.Repository, + Version: c.Version, + }, + Files: []string{ + filepath.Join(dir, "values.yaml"), + filepath.Join(dir, "cluster-values.yaml"), + filepath.Join(dir, "upstream.env"), + filepath.Join(dir, "install.sh"), + }, + }, nil +} diff --git a/pkg/bundler/deployer/localformat/writer.go b/pkg/bundler/deployer/localformat/writer.go new file mode 100644 index 000000000..0d54c6aca --- /dev/null +++ b/pkg/bundler/deployer/localformat/writer.go @@ -0,0 +1,372 @@ +// Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. 
+// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +package localformat + +import ( + "bytes" + "context" + "fmt" + "log/slog" + "os" + "text/template" + + "github.com/NVIDIA/aicr/pkg/bundler/deployer" + "github.com/NVIDIA/aicr/pkg/component" + "github.com/NVIDIA/aicr/pkg/errors" + "github.com/NVIDIA/aicr/pkg/manifest" +) + +// Component is the per-component input for Write. Fields mirror the subset of +// pkg/bundler/deployer/helm.ComponentData that localformat needs. +type Component struct { + Name string + Namespace string + // Helm upstream ref (empty for manifest-only components) + Repository string + ChartName string + Version string + IsOCI bool + // Kustomize (empty for helm components) + Tag string + Path string + // Values hydrated by the component bundler + Values map[string]any + DynamicPaths []string // paths moved from values.yaml into cluster-values.yaml +} + +// Options configures Write. +type Options struct { + OutputDir string + Components []Component // ordered per DeploymentOrder + ComponentManifests map[string]map[string][]byte // name → path → rendered bytes +} + +// renderInputFor builds the per-component manifest.RenderInput. The Helm +// templates inside ComponentManifests reference ".Values[componentName]" and +// ".Release.Namespace" / ".Chart.{Name,Version}" — those all derive from the +// Component itself, so we construct it here rather than asking callers to +// pre-build N separate RenderInputs in lockstep with Components. 
+func renderInputFor(c Component) manifest.RenderInput { + chart := c.ChartName + if chart == "" { + chart = c.Name + } + return manifest.RenderInput{ + ComponentName: c.Name, + Namespace: c.Namespace, + ChartName: chart, + ChartVersion: deployer.NormalizeVersionWithDefault(c.Version), + Values: c.Values, + } +} + +// Write emits the numbered folder layout. Deterministic and idempotent. +// +// Removes any pre-existing NNN-* folders under OutputDir before writing, so +// reusing the same --output across recipe regenerations does not leave stale +// component folders that the deployer's loop would later install. Top-level +// orchestration files (deploy.sh, undeploy.sh, README.md, attestation/) are +// left intact; only files under [0-9][0-9][0-9]-* are removed. +func Write(ctx context.Context, opts Options) ([]Folder, error) { + // Honor cancellation before any filesystem mutation. + if err := ctx.Err(); err != nil { + return nil, errors.Wrap(errors.ErrCodeTimeout, "context cancelled", err) + } + // Fail fast if the layout's three-digit prefix can't accommodate the + // component count. Mixed components inject a second folder per + // component, so the upper bound is 2*len(Components). The deploy/undeploy + // templates glob [0-9][0-9][0-9]-*/, so a 4-digit prefix would be + // silently skipped. 
+ if 2*len(opts.Components) > 999 { + return nil, errors.New(errors.ErrCodeInvalidRequest, + fmt.Sprintf("too many components (%d): NNN- folder prefix supports at most 999 entries", + len(opts.Components))) + } + if err := os.MkdirAll(opts.OutputDir, 0o755); err != nil { + return nil, errors.Wrap(errors.ErrCodeInternal, "create output dir", err) + } + if err := pruneStaleFolders(opts.OutputDir); err != nil { + return nil, err + } + + // Detect -post collisions up front: if a recipe declares both a + // mixed component "foo" (Helm + manifests) and a separate component + // "foo-post", the injection rule would synthesize a second "foo-post" + // folder/release that collides with the explicitly-declared one. + declared := make(map[string]struct{}, len(opts.Components)) + for _, c := range opts.Components { + declared[c.Name] = struct{}{} + } + for _, c := range opts.Components { + if len(opts.ComponentManifests[c.Name]) == 0 { + continue + } + // Mixed component (helm + manifests) → would inject "-post". + if c.Repository == "" { + continue // manifest-only doesn't inject; already a single local-helm folder + } + if _, clash := declared[c.Name+"-post"]; clash { + return nil, errors.New(errors.ErrCodeInvalidRequest, + fmt.Sprintf("component %q is mixed (helm + manifests) and would inject %q-post, but a component named %q-post is already declared in the recipe — rename one to avoid collision", + c.Name, c.Name, c.Name)) + } + } + + folders := make([]Folder, 0, len(opts.Components)) + idx := 1 + for _, c := range opts.Components { + if err := ctx.Err(); err != nil { + return nil, errors.Wrap(errors.ErrCodeTimeout, "context cancelled", err) + } + if !deployer.IsSafePathComponent(c.Name) { + return nil, errors.New(errors.ErrCodeInvalidRequest, + fmt.Sprintf("invalid component name %q", c.Name)) + } + + // Reject kustomize + raw manifests: each recipe component must declare + // EITHER kustomize (Tag/Path) OR raw manifests, not both. 
The bundle + // shape can only wrap one primary source into the local chart. + if (c.Tag != "" || c.Path != "") && len(opts.ComponentManifests[c.Name]) > 0 { + return nil, errors.New(errors.ErrCodeInvalidRequest, + fmt.Sprintf("component %q has both kustomize (Tag/Path) and raw manifests; use one", c.Name)) + } + + kind := classify(c, opts.ComponentManifests[c.Name]) + dir := fmt.Sprintf("%03d-%s", idx, c.Name) + + switch kind { + case KindUpstreamHelm: + f, err := writeUpstreamHelmFolder(opts.OutputDir, dir, idx, c) + if err != nil { + return nil, err + } + folders = append(folders, f) + slog.Info("wrote local chart folder", "index", idx, "dir", dir, "kind", kind.String(), "parent", c.Name) + idx++ + + // Mixed component: upstream chart + raw manifests. + // Emit an injected -post wrapped chart immediately after the primary so + // raw manifests apply post-install (after helm has registered the chart's CRDs). + // The "mixed" concept lives only here at the bundle layer — no recipe metadata involved. + if manifests := opts.ComponentManifests[c.Name]; len(manifests) > 0 { + postName := c.Name + "-post" + postDir := fmt.Sprintf("%03d-%s", idx, postName) + postFolder, postErr := writeLocalHelmFolder( + opts.OutputDir, postDir, idx, c, + manifests, renderInputFor(c), + postName, c.Name, + ) + if postErr != nil { + return nil, postErr + } + folders = append(folders, postFolder) + slog.Info("wrote local chart folder", "index", idx, "dir", postDir, "kind", KindLocalHelm.String(), "parent", c.Name) + idx++ + } + case KindLocalHelm: + manifests := opts.ComponentManifests[c.Name] + if c.Tag != "" || c.Path != "" { + // Kustomize-typed: materialize the overlay output to a single + // templates/manifest.yaml inside the wrapped chart. + // + // Path is required (kustomize needs somewhere to build from); + // Tag is only meaningful with a git Repository. 
Reject the + // incomplete combinations explicitly so a recipe author sees + // the misconfiguration rather than a silent empty build. + if c.Path == "" { + return nil, errors.New(errors.ErrCodeInvalidRequest, + fmt.Sprintf("kustomize component %q has Tag but no Path; Path is required", c.Name)) + } + if c.Tag != "" && c.Repository == "" { + return nil, errors.New(errors.ErrCodeInvalidRequest, + fmt.Sprintf("kustomize component %q has Tag but no Repository; Tag is only meaningful with a git Repository", c.Name)) + } + // Build target: git URL form for git-sourced kustomizations + // (matches the original deploy.sh.tmpl convention), or local + // filesystem path otherwise. Only append ?ref= when Tag is + // non-empty — kustomize distinguishes `repo//path` (no ref, + // HEAD) from `repo//path?ref=` (empty ref, error). + target := c.Path + if c.Repository != "" { + target = fmt.Sprintf("%s//%s", c.Repository, c.Path) + if c.Tag != "" { + target += "?ref=" + c.Tag + } + } + rendered, kerr := buildKustomize(ctx, target) + if kerr != nil { + return nil, kerr + } + manifests = map[string][]byte{"manifest.yaml": rendered} + } + f, err := writeLocalHelmFolder(opts.OutputDir, dir, idx, c, + manifests, renderInputFor(c), + c.Name, c.Name) + if err != nil { + return nil, err + } + folders = append(folders, f) + slog.Info("wrote local chart folder", "index", idx, "dir", dir, "kind", kind.String(), "parent", c.Name) + idx++ + } + } + return folders, nil +} + +// valueSplit carries the results of splitting component values into static +// (values.yaml) and dynamic (cluster-values.yaml) maps. +type valueSplit struct { + static map[string]any + dynamic map[string]any +} + +// splitDynamicPaths deep-copies values and moves the named dot-paths into a +// separate dynamic map. Paths not present in values are still added to the +// dynamic map with an empty-string value so cluster-values.yaml carries the +// full set of dynamic keys for operators to fill in at install time. 
+// +// Unexported because lifting it into pkg/bundler/deployer would create an +// import cycle (deployer → component → checksum → deployer); keeping it +// in this leaf subpackage avoids that. +func splitDynamicPaths(values map[string]any, dynamicPaths []string) valueSplit { + static := component.DeepCopyMap(values) + dynamic := make(map[string]any) + for _, path := range dynamicPaths { + val, found := component.GetValueByPath(static, path) + if found { + component.RemoveValueByPath(static, path) + } else { + val = "" + } + component.SetValueByPath(dynamic, path, val) + } + return valueSplit{static: static, dynamic: dynamic} +} + +// classify determines the primary folder kind for a component. +func classify(c Component, manifests map[string][]byte) FolderKind { + if c.Tag != "" || c.Path != "" { + // Kustomize-typed: Write runs the kustomize build and wraps the output in a local chart. + return KindLocalHelm + } + if c.Repository == "" && len(manifests) > 0 { + return KindLocalHelm + } + return KindUpstreamHelm +} + +// writeValueFiles writes values.yaml (static) and cluster-values.yaml (dynamic) +// into folderDir, splitting c.Values via c.DynamicPaths. Extracted because +// both per-folder writers (upstream-helm and local-helm) perform this +// identical dance; keeping it in one place preserves the single-source-of- +// truth guarantee around dynamic-values semantics. 
+func writeValueFiles(folderDir string, c Component) error { + split := splitDynamicPaths(c.Values, c.DynamicPaths) + if _, _, err := deployer.WriteValuesFile(split.static, folderDir, "values.yaml"); err != nil { + return errors.Wrap(errors.ErrCodeInternal, + fmt.Sprintf("write values.yaml for %s", c.Name), err) + } + if _, _, err := deployer.WriteValuesFile(split.dynamic, folderDir, "cluster-values.yaml"); err != nil { + return errors.Wrap(errors.ErrCodeInternal, + fmt.Sprintf("write cluster-values.yaml for %s", c.Name), err) + } + return nil +} + +// renderTemplateToFile executes tmpl against data, SafeJoin-checks the output +// path, and writes the rendered bytes with mode. Extracted because the +// render-then-write dance is repeated for install.sh in both writers and for +// Chart.yaml in the local-helm writer — three call sites, identical shape. +func renderTemplateToFile(tmpl *template.Template, data any, + folderDir, filename string, mode os.FileMode, +) error { + + var buf bytes.Buffer + if err := tmpl.Execute(&buf, data); err != nil { + return errors.Wrap(errors.ErrCodeInternal, + fmt.Sprintf("render %s", filename), err) + } + outPath, err := deployer.SafeJoin(folderDir, filename) + if err != nil { + return errors.Wrap(errors.ErrCodeInvalidRequest, + fmt.Sprintf("%s path unsafe", filename), err) + } + return writeFile(outPath, buf.Bytes(), mode) +} + +// pruneStaleFolders removes pre-existing NNN-/ directories under +// outputDir so a reused output directory cannot accumulate components from +// a previous recipe generation. Only directories matching the strict +// `[0-9][0-9][0-9]-*` pattern are removed; top-level orchestration files +// (deploy.sh, README.md, etc.) and any other directories are left alone. 
+func pruneStaleFolders(outputDir string) error { + entries, err := os.ReadDir(outputDir) + if err != nil { + if os.IsNotExist(err) { + return nil + } + return errors.Wrap(errors.ErrCodeInternal, "read output dir", err) + } + for _, e := range entries { + if !e.IsDir() { + continue + } + name := e.Name() + // Must be NNN-: 3 digits then a hyphen. + if len(name) < 4 || name[3] != '-' { + continue + } + ok := true + for i := 0; i < 3; i++ { + if name[i] < '0' || name[i] > '9' { + ok = false + break + } + } + if !ok { + continue + } + full, joinErr := deployer.SafeJoin(outputDir, name) + if joinErr != nil { + return errors.Wrap(errors.ErrCodeInternal, "prune stale folder unsafe", joinErr) + } + if rmErr := os.RemoveAll(full); rmErr != nil { + return errors.Wrap(errors.ErrCodeInternal, + fmt.Sprintf("remove stale folder %s", name), rmErr) + } + } + return nil +} + +// writeFile writes contents to path with the given mode, returning any +// error from write or close. Close errors on writable handles are captured +// so buffered-flush failures are not silently dropped (see CLAUDE.md rule). +// Returns StructuredErrors with ErrCodeInternal so callers can propagate +// with "return err" — no double-wrapping. 
+func writeFile(path string, contents []byte, mode os.FileMode) error { + f, err := os.OpenFile(path, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, mode) + if err != nil { + return errors.Wrap(errors.ErrCodeInternal, fmt.Sprintf("open %s", path), err) + } + _, writeErr := f.Write(contents) + closeErr := f.Close() + if writeErr != nil { + return errors.Wrap(errors.ErrCodeInternal, fmt.Sprintf("write %s", path), writeErr) + } + if closeErr != nil { + return errors.Wrap(errors.ErrCodeInternal, fmt.Sprintf("close %s", path), closeErr) + } + return nil +} diff --git a/pkg/bundler/deployer/localformat/writer_test.go b/pkg/bundler/deployer/localformat/writer_test.go new file mode 100644 index 000000000..7de941bec --- /dev/null +++ b/pkg/bundler/deployer/localformat/writer_test.go @@ -0,0 +1,539 @@ +// Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. 
+ +package localformat_test + +import ( + "context" + stderrors "errors" + "flag" + "os" + "path/filepath" + "reflect" + "sort" + "strings" + "testing" + + "github.com/NVIDIA/aicr/pkg/bundler/deployer/localformat" + "github.com/NVIDIA/aicr/pkg/errors" +) + +var update = flag.Bool("update", false, "update golden files") + +func TestWrite_UpstreamHelmOnly(t *testing.T) { + outDir := t.TempDir() + + folders, err := localformat.Write(context.Background(), localformat.Options{ + OutputDir: outDir, + Components: []localformat.Component{{ + Name: "nfd", + Namespace: "node-feature-discovery", + Repository: "https://kubernetes-sigs.github.io/node-feature-discovery/charts", + ChartName: "node-feature-discovery", + Version: "v0.16.1", + Values: map[string]any{"image": map[string]any{"tag": "v0.16.1"}}, + }}, + }) + if err != nil { + t.Fatalf("Write: %v", err) + } + + if len(folders) != 1 { + t.Fatalf("want 1 folder, got %d", len(folders)) + } + if got, want := folders[0].Dir, "001-nfd"; got != want { + t.Errorf("folders[0].Dir = %q, want %q", got, want) + } + if got, want := folders[0].Kind, localformat.KindUpstreamHelm; got != want { + t.Errorf("folders[0].Kind = %v, want %v", got, want) + } + + // Files written on disk + for _, rel := range []string{"install.sh", "values.yaml", "cluster-values.yaml", "upstream.env"} { + if _, err := os.Stat(filepath.Join(outDir, "001-nfd", rel)); err != nil { + t.Errorf("missing file %s: %v", rel, err) + } + } + // No Chart.yaml for upstream-helm + if _, err := os.Stat(filepath.Join(outDir, "001-nfd", "Chart.yaml")); !os.IsNotExist(err) { + t.Errorf("Chart.yaml must not exist for upstream-helm folder") + } + + // Golden-file compare for install.sh + upstream.env + assertGolden(t, outDir, "testdata/upstream_helm_only", "001-nfd/install.sh") + assertGolden(t, outDir, "testdata/upstream_helm_only", "001-nfd/upstream.env") +} + +func TestWrite_LocalHelmManifestOnly(t *testing.T) { + outDir := t.TempDir() + + folders, err := 
localformat.Write(context.Background(), localformat.Options{ + OutputDir: outDir, + Components: []localformat.Component{{ + Name: "skyhook-customizations", + Namespace: "skyhook", + Repository: "", // empty: manifest-only + }}, + ComponentManifests: map[string]map[string][]byte{ + "skyhook-customizations": { + // Realistic input: project recipe manifests carry a license header + // (see recipes/components/gpu-operator/manifests/dcgm-exporter.yaml). + "components/skyhook-customizations/manifests/customization.yaml": []byte(`# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +apiVersion: v1 +kind: ConfigMap +metadata: + name: x +`), + }, + }, + }) + if err != nil { + t.Fatalf("Write: %v", err) + } + if len(folders) != 1 || folders[0].Kind != localformat.KindLocalHelm { + t.Fatalf("want 1 local-helm folder, got %d folders: %+v", len(folders), folders) + } + + for _, rel := range []string{"install.sh", "values.yaml", "cluster-values.yaml", "Chart.yaml", "templates/customization.yaml"} { + if _, err := os.Stat(filepath.Join(outDir, "001-skyhook-customizations", rel)); err != nil { + t.Errorf("missing file %s: %v", rel, err) + } + } + // upstream.env MUST NOT exist for local-helm + if _, err := os.Stat(filepath.Join(outDir, "001-skyhook-customizations", "upstream.env")); !os.IsNotExist(err) { + t.Errorf("upstream.env must not exist for local-helm folder") + } + + assertGolden(t, outDir, "testdata/local_helm_manifest_only", "001-skyhook-customizations/install.sh") + assertGolden(t, outDir, "testdata/local_helm_manifest_only", "001-skyhook-customizations/Chart.yaml") + assertGolden(t, outDir, "testdata/local_helm_manifest_only", "001-skyhook-customizations/templates/customization.yaml") +} + +func TestWrite_Mixed(t *testing.T) { + outDir := t.TempDir() + + folders, err := localformat.Write(context.Background(), localformat.Options{ + OutputDir: outDir, + Components: []localformat.Component{{ + Name: "gpu-operator", + Namespace: "gpu-operator", + Repository: "https://nvidia.github.io/gpu-operator", + ChartName: "nvidia/gpu-operator", + Version: "v24.9.1", + }}, + ComponentManifests: map[string]map[string][]byte{ + "gpu-operator": { + // Realistic: real project manifests carry a license header. + "components/gpu-operator/manifests/dcgm-exporter.yaml": []byte(`# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +apiVersion: v1 +kind: Service +metadata: + name: dcgm +`), + }, + }, + }) + if err != nil { + t.Fatalf("Write: %v", err) + } + + if len(folders) != 2 { + t.Fatalf("want 2 folders, got %d", len(folders)) + } + if folders[0].Dir != "001-gpu-operator" || folders[0].Kind != localformat.KindUpstreamHelm { + t.Errorf("folders[0] = %+v, want 001-gpu-operator / upstream-helm", folders[0]) + } + if folders[1].Dir != "002-gpu-operator-post" || folders[1].Kind != localformat.KindLocalHelm { + t.Errorf("folders[1] = %+v, want 002-gpu-operator-post / local-helm", folders[1]) + } + if folders[1].Parent != "gpu-operator" { + t.Errorf("folders[1].Parent = %q, want gpu-operator", folders[1].Parent) + } + if folders[1].Name != "gpu-operator-post" { + t.Errorf("folders[1].Name = %q, want gpu-operator-post", folders[1].Name) + } + + // Primary has NO Chart.yaml (upstream-helm) + if _, err := os.Stat(filepath.Join(outDir, "001-gpu-operator", "Chart.yaml")); !os.IsNotExist(err) { + t.Errorf("primary must not have Chart.yaml") + } + // Post HAS Chart.yaml + templates/dcgm-exporter.yaml + if _, err := os.Stat(filepath.Join(outDir, "002-gpu-operator-post", "Chart.yaml")); err != nil { + t.Errorf("post must have Chart.yaml: %v", err) + } + if _, err := os.Stat(filepath.Join(outDir, "002-gpu-operator-post", "templates", "dcgm-exporter.yaml")); err != nil { + t.Errorf("post must have templates/dcgm-exporter.yaml: %v", err) + } + + // Post's upstream.env MUST NOT exist (wrapped chart, not upstream ref) + if _, err := os.Stat(filepath.Join(outDir, "002-gpu-operator-post", 
"upstream.env")); !os.IsNotExist(err) { + t.Errorf("post must not have upstream.env") + } +} + +func TestWrite_Ordering(t *testing.T) { + outDir := t.TempDir() + mk := func(name, repo string) localformat.Component { + return localformat.Component{ + Name: name, + Namespace: name, + Repository: repo, + ChartName: name, + Version: "v1.0.0", + } + } + + // b is mixed: helm repo set + manifests → emits b primary + b-post injected + folders, err := localformat.Write(context.Background(), localformat.Options{ + OutputDir: outDir, + Components: []localformat.Component{ + mk("a", "https://a.example"), + mk("b", "https://b.example"), + mk("c", "https://c.example"), + }, + ComponentManifests: map[string]map[string][]byte{ + "b": { + "b/manifests/x.yaml": []byte(`# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +apiVersion: v1 +kind: ConfigMap +metadata: + name: x +`), + }, + }, + }) + if err != nil { + t.Fatalf("Write: %v", err) + } + + got := make([]string, 0, len(folders)) + for _, f := range folders { + got = append(got, f.Dir) + } + want := []string{"001-a", "002-b", "003-b-post", "004-c"} + if !reflect.DeepEqual(got, want) { + t.Errorf("folder order = %v, want %v", got, want) + } + + // Verify the primary/post relationship on b + if folders[1].Kind != localformat.KindUpstreamHelm { + t.Errorf("folders[1] (b) = %v, want KindUpstreamHelm", folders[1].Kind) + } + if folders[2].Kind != localformat.KindLocalHelm || folders[2].Parent != "b" || folders[2].Name != "b-post" { + t.Errorf("folders[2] (b-post) = %+v, want KindLocalHelm parent=b name=b-post", folders[2]) + } + + // Verify subsequent indices are correct on the Folder struct itself (not just the Dir) + wantIndices := []int{1, 2, 3, 4} + for i, f := range folders { + if f.Index != wantIndices[i] { + t.Errorf("folders[%d].Index = %d, want %d (dir=%s)", i, f.Index, wantIndices[i], f.Dir) + } + } +} + +func TestWrite_Kustomize(t *testing.T) { + outDir := t.TempDir() + + // Absolute path to the kustomize fixture. `filepath.Abs` resolves the + // test-relative "testdata/kustomize_input" to something buildKustomize + // can feed to kustomize's on-disk filesystem. + kustomizePath, err := filepath.Abs(filepath.Join("testdata", "kustomize_input")) + if err != nil { + t.Fatalf("abs path: %v", err) + } + + folders, err := localformat.Write(context.Background(), localformat.Options{ + OutputDir: outDir, + Components: []localformat.Component{{ + Name: "my-kustomize", + Namespace: "mk", + // Local kustomize: Path only. Tag/Repository are only meaningful + // for git-sourced kustomizations and are validated as a pair by + // Write — a Tag without Repository would (correctly) be rejected. 
+ Path: kustomizePath, + }}, + }) + if err != nil { + t.Fatalf("Write: %v", err) + } + if len(folders) != 1 || folders[0].Kind != localformat.KindLocalHelm { + t.Fatalf("want 1 local-helm folder (kustomize wrapped), got %d folders: %+v", len(folders), folders) + } + + // manifest.yaml is the single flattened output of kustomize build + manifestPath := filepath.Join(outDir, "001-my-kustomize", "templates", "manifest.yaml") + if _, err := os.Stat(manifestPath); err != nil { + t.Errorf("missing templates/manifest.yaml: %v", err) + } + // Chart.yaml should still exist (wrapped chart) + if _, err := os.Stat(filepath.Join(outDir, "001-my-kustomize", "Chart.yaml")); err != nil { + t.Errorf("missing Chart.yaml: %v", err) + } +} + +func TestWrite_Deterministic(t *testing.T) { + kustomizePath, err := filepath.Abs(filepath.Join("testdata", "kustomize_input")) + if err != nil { + t.Fatalf("abs path: %v", err) + } + opts := func(dir string) localformat.Options { + return localformat.Options{ + OutputDir: dir, + Components: []localformat.Component{ + { + Name: "a", + Namespace: "a", + Repository: "https://a.example", + ChartName: "a", + Version: "v1", + Values: map[string]any{"image": map[string]any{"tag": "v1"}}, + }, + { + Name: "b", + Namespace: "b", + Repository: "https://b.example", + ChartName: "b", + Version: "v1", + }, + { + // Kustomize component to lock determinism on the + // kustomize build path (manifest.yaml ordering, etc.). 
+ Name: "k", + Namespace: "k", + Path: kustomizePath, + }, + }, + // b is mixed — exercise the -post injection path in the determinism check + ComponentManifests: map[string]map[string][]byte{ + "b": { + // Two manifests with distinct basenames to exercise sorted iteration + "b/manifests/m1.yaml": []byte("---\napiVersion: v1\nkind: ConfigMap\nmetadata:\n name: m1\n"), + "b/manifests/m2.yaml": []byte("---\napiVersion: v1\nkind: ConfigMap\nmetadata:\n name: m2\n"), + }, + }, + } + } + d1, d2 := t.TempDir(), t.TempDir() + if _, err := localformat.Write(context.Background(), opts(d1)); err != nil { + t.Fatalf("Write 1: %v", err) + } + if _, err := localformat.Write(context.Background(), opts(d2)); err != nil { + t.Fatalf("Write 2: %v", err) + } + assertDirsEqual(t, d1, d2) +} + +func TestWrite_KustomizeWithManifestsRejected(t *testing.T) { + // Point Path at the existing kustomize fixture so Tag/Path are set + // realistically, but attach raw manifests alongside — bundle must refuse. + kustomizePath, err := filepath.Abs(filepath.Join("testdata", "kustomize_input")) + if err != nil { + t.Fatalf("abs path: %v", err) + } + + _, err = localformat.Write(context.Background(), localformat.Options{ + OutputDir: t.TempDir(), + Components: []localformat.Component{{ + Name: "busted-component", + Namespace: "ns", + Tag: "v1.0.0", + Path: kustomizePath, + }}, + ComponentManifests: map[string]map[string][]byte{ + "busted-component": { + "extra/m.yaml": []byte("apiVersion: v1\nkind: ConfigMap\nmetadata:\n name: x\n"), + }, + }, + }) + if err == nil { + t.Fatalf("want error rejecting kustomize + raw manifests, got nil") + } + // Must be a structured error with ErrCodeInvalidRequest + var structErr *errors.StructuredError + if !stderrors.As(err, &structErr) { + t.Fatalf("expected *errors.StructuredError, got %T: %v", err, err) + } + if structErr.Code != errors.ErrCodeInvalidRequest { + t.Errorf("error code = %s, want %s (full error: %v)", structErr.Code, errors.ErrCodeInvalidRequest, 
err) + } + // Message should name the component and reference the conflict + msg := err.Error() + if !strings.Contains(msg, "busted-component") || !strings.Contains(msg, "kustomize") || !strings.Contains(msg, "manifests") { + t.Errorf("error message should mention component name + conflict; got: %s", msg) + } +} + +func TestWrite_PathContainment(t *testing.T) { + _, err := localformat.Write(context.Background(), localformat.Options{ + OutputDir: t.TempDir(), + Components: []localformat.Component{{ + Name: "../escape", + Repository: "https://example.com", + }}, + }) + if err == nil { + t.Fatalf("want error rejecting unsafe component name, got nil") + } + var structErr *errors.StructuredError + if !stderrors.As(err, &structErr) { + t.Fatalf("expected *errors.StructuredError, got %T: %v", err, err) + } + if structErr.Code != errors.ErrCodeInvalidRequest { + t.Errorf("code = %v, want ErrCodeInvalidRequest", structErr.Code) + } + if !strings.Contains(err.Error(), "../escape") { + t.Errorf("error should name the offending component; got: %v", err) + } +} + +func TestWrite_ContextCancellation(t *testing.T) { + ctx, cancel := context.WithCancel(context.Background()) + cancel() // cancel before calling Write + + _, err := localformat.Write(ctx, localformat.Options{ + OutputDir: t.TempDir(), + Components: []localformat.Component{{ + Name: "a", + Repository: "https://a.example", + ChartName: "a", + Version: "v1", + }}, + }) + if err == nil { + t.Fatalf("want error on cancelled context, got nil") + } + var structErr *errors.StructuredError + if !stderrors.As(err, &structErr) { + t.Fatalf("expected *errors.StructuredError, got %T: %v", err, err) + } + if structErr.Code != errors.ErrCodeTimeout { + t.Errorf("code = %v, want ErrCodeTimeout", structErr.Code) + } +} + +// assertDirsEqual walks d1 and compares each file to the corresponding file +// in d2 (same relative path). Fails on missing files, extra files, or content +// mismatch. 
Path-relative compare — absolute TempDir prefix is stripped. +func assertDirsEqual(t *testing.T, d1, d2 string) { + t.Helper() + files1 := listFiles(t, d1) + files2 := listFiles(t, d2) + if !reflect.DeepEqual(files1, files2) { + t.Fatalf("file trees differ:\n d1=%v\n d2=%v", files1, files2) + } + for _, rel := range files1 { + b1, err := os.ReadFile(filepath.Join(d1, rel)) + if err != nil { + t.Fatalf("read %s from d1: %v", rel, err) + } + b2, err := os.ReadFile(filepath.Join(d2, rel)) + if err != nil { + t.Fatalf("read %s from d2: %v", rel, err) + } + if string(b1) != string(b2) { + t.Errorf("content differs at %s:\n--- d1 ---\n%s\n--- d2 ---\n%s", rel, b1, b2) + } + } +} + +// listFiles returns sorted relative paths of all regular files under dir. +func listFiles(t *testing.T, dir string) []string { + t.Helper() + var files []string + err := filepath.Walk(dir, func(path string, info os.FileInfo, walkErr error) error { + if walkErr != nil { + return walkErr + } + if info.Mode().IsRegular() { + rel, err := filepath.Rel(dir, path) + if err != nil { + return err + } + files = append(files, rel) + } + return nil + }) + if err != nil { + t.Fatalf("walk %s: %v", dir, err) + } + sort.Strings(files) + return files +} + +// assertGolden reads outDir/relPath and diffs it against goldenDir/relPath. +// With -update, writes the actual content to the golden path. 
+func assertGolden(t *testing.T, outDir, goldenDir, relPath string) { + t.Helper() + got, err := os.ReadFile(filepath.Join(outDir, relPath)) + if err != nil { + t.Fatalf("read actual %s: %v", relPath, err) + } + goldenPath := filepath.Join(goldenDir, relPath) + if *update { + if err = os.MkdirAll(filepath.Dir(goldenPath), 0o755); err != nil { + t.Fatalf("mkdir golden: %v", err) + } + if err = os.WriteFile(goldenPath, got, 0o644); err != nil { + t.Fatalf("write golden: %v", err) + } + return + } + want, err := os.ReadFile(goldenPath) + if err != nil { + t.Fatalf("read golden %s: %v (run with -update to regenerate)", goldenPath, err) + } + if string(got) != string(want) { + t.Errorf("%s differs from golden:\n--- got ---\n%s\n--- want ---\n%s", relPath, got, want) + } +} diff --git a/pkg/bundler/doc.go b/pkg/bundler/doc.go index 96908c100..b5fb3d9f5 100644 --- a/pkg/bundler/doc.go +++ b/pkg/bundler/doc.go @@ -56,10 +56,13 @@ Components are defined in recipes/registry.yaml: Helm (default): - README.md: Root deployment guide with ordered steps - deploy.sh: Automation script (0755) + - undeploy.sh: Reverse-order uninstall script (0755) - recipe.yaml: Copy of the input recipe - - /values.yaml: Helm values per component - - /README.md: Component install/upgrade/uninstall - - /manifests/: Optional manifest files + - NNN-/install.sh: Per-folder install script + - NNN-/values.yaml: Static Helm values + - NNN-/cluster-values.yaml: Per-cluster dynamic values + - NNN-/upstream.env: CHART/REPO/VERSION (upstream-helm folders) + - NNN-/Chart.yaml + templates/: Local chart (local-helm folders) Argo CD: - app-of-apps.yaml: Parent Argo CD Application diff --git a/pkg/bundler/handler.go b/pkg/bundler/handler.go index 2c8794258..82c8655f0 100644 --- a/pkg/bundler/handler.go +++ b/pkg/bundler/handler.go @@ -58,9 +58,11 @@ const DefaultBundleTimeout = defaults.BundleHandlerTimeout // The response is a zip archive containing the Helm per-component bundle: // - README.md: Root deployment 
guide // - deploy.sh: Automation script +// - undeploy.sh: Reverse-order uninstall script // - recipe.yaml: Copy of the input recipe -// - /values.yaml: Helm values per component -// - /README.md: Component install/upgrade/uninstall +// - NNN-/install.sh: Per-folder install script +// - NNN-/values.yaml: Static Helm values +// - NNN-/cluster-values.yaml: Per-cluster dynamic values // - checksums.txt: SHA256 checksums of generated files // // Example: diff --git a/pkg/bundler/handler_test.go b/pkg/bundler/handler_test.go index 6fcbe5cad..bc6561dbe 100644 --- a/pkg/bundler/handler_test.go +++ b/pkg/bundler/handler_test.go @@ -286,7 +286,7 @@ func TestBundleEndpointValidRequest(t *testing.T) { if _, ok := expectedFiles[f.Name]; ok { expectedFiles[f.Name] = true } - if f.Name == "gpu-operator/values.yaml" { + if f.Name == "001-gpu-operator/values.yaml" { foundGPUValues = true } } @@ -297,7 +297,7 @@ func TestBundleEndpointValidRequest(t *testing.T) { } } if !foundGPUValues { - t.Error("expected gpu-operator/values.yaml not found in zip archive") + t.Error("expected 001-gpu-operator/values.yaml not found in zip archive") } // Log files for debugging @@ -457,7 +457,7 @@ func TestZipResponseContainsExpectedFiles(t *testing.T) { if _, ok := expectedFiles[f.Name]; ok { expectedFiles[f.Name] = true } - if f.Name == "gpu-operator/values.yaml" { + if f.Name == "001-gpu-operator/values.yaml" { foundGPUValues = true } } @@ -468,7 +468,7 @@ func TestZipResponseContainsExpectedFiles(t *testing.T) { } } if !foundGPUValues { - t.Error("expected gpu-operator/values.yaml not found in zip") + t.Error("expected 001-gpu-operator/values.yaml not found in zip") } t.Log("Files in zip:") diff --git a/tests/chainsaw/ai-conformance/offline/chainsaw-test.yaml b/tests/chainsaw/ai-conformance/offline/chainsaw-test.yaml index 55784e410..139e6d30a 100644 --- a/tests/chainsaw/ai-conformance/offline/chainsaw-test.yaml +++ b/tests/chainsaw/ai-conformance/offline/chainsaw-test.yaml @@ -82,6 +82,10 @@ 
spec: test -x "${WORK}/bundle/deploy.sh" test -f "${WORK}/bundle/recipe.yaml" # Verify all 15 component directories exist (aws-ebs-csi-driver excluded: disabled by default in EKS overlay) + # Helm deployer now produces NNN-/ numbered folders + # (#662). Mixed components (helm + raw manifests) emit two + # adjacent folders — the primary plus an injected "-post" + # wrapper — so glob to the primary only by excluding "-post". for component in \ aws-efa cert-manager \ dynamo-crds dynamo-platform gpu-operator \ @@ -89,7 +93,8 @@ spec: kgateway kgateway-crds kube-prometheus-stack \ nvidia-dra-driver-gpu nvsentinel prometheus-adapter \ nodewright-customizations nodewright-operator; do - test -d "${WORK}/bundle/${component}" || { echo "missing: ${component}"; exit 1; } + match=$(ls -d "${WORK}"/bundle/[0-9][0-9][0-9]-"${component}" 2>/dev/null | head -1) + [ -n "${match}" ] || { echo "missing: ${component}"; exit 1; } done check: ($error == null): true diff --git a/tests/chainsaw/bundle-templates/gpu-operator/chainsaw-test.yaml b/tests/chainsaw/bundle-templates/gpu-operator/chainsaw-test.yaml index b84070753..3a5f1aa85 100644 --- a/tests/chainsaw/bundle-templates/gpu-operator/chainsaw-test.yaml +++ b/tests/chainsaw/bundle-templates/gpu-operator/chainsaw-test.yaml @@ -61,8 +61,13 @@ spec: - script: content: | WORK="/tmp/chainsaw-bundle-gpu-operator-templates" + # gpu-operator is mixed (helm + manifests): manifests live in the + # sibling -post folder under templates/. Glob the indexed prefix + # since the deployment-order index can shift across recipes. 
+ MANIFEST=$(ls "${WORK}"/bundle-defaults/[0-9][0-9][0-9]-gpu-operator-post/templates/dcgm-exporter.yaml 2>/dev/null | head -1) + [ -n "${MANIFEST}" ] || { echo "dcgm-exporter.yaml not found in any *-gpu-operator-post/templates/" >&2; exit 1; } chainsaw assert \ - --resource "${WORK}/bundle-defaults/gpu-operator/manifests/dcgm-exporter.yaml" \ + --resource "${MANIFEST}" \ --file ./assert-dcgm-defaults.yaml \ --no-color --timeout 5s @@ -87,10 +92,10 @@ spec: - script: content: | WORK="/tmp/chainsaw-bundle-gpu-operator-templates" - MANIFEST="${WORK}/bundle-dcgm-disabled/gpu-operator/manifests/dcgm-exporter.yaml" ## When create=false, the template gate prevents any YAML objects ## from being rendered. The bundler removes empty manifests. - if [ -f "${MANIFEST}" ]; then + MATCHES=$(ls "${WORK}"/bundle-dcgm-disabled/[0-9][0-9][0-9]-gpu-operator-post/templates/dcgm-exporter.yaml 2>/dev/null | wc -l) + if [ "${MATCHES}" -ne 0 ]; then echo "FAIL: dcgm-exporter.yaml should not exist when create=false" exit 1 fi diff --git a/tests/chainsaw/bundle-templates/kgateway/chainsaw-test.yaml b/tests/chainsaw/bundle-templates/kgateway/chainsaw-test.yaml index 93aa95391..c3f7390b6 100644 --- a/tests/chainsaw/bundle-templates/kgateway/chainsaw-test.yaml +++ b/tests/chainsaw/bundle-templates/kgateway/chainsaw-test.yaml @@ -61,12 +61,15 @@ spec: - script: content: | WORK="/tmp/chainsaw-bundle-kgateway-templates" + ## kgateway is a mixed component (upstream Helm + raw manifests), + ## so its raw manifests render into an injected NNN-kgateway-post/ + ## wrapped chart's templates/ folder under #662's layout. + MANIFEST=$(ls "${WORK}"/bundle-defaults/[0-9][0-9][0-9]-kgateway-post/templates/inference-gateway.yaml 2>/dev/null | head -1) + [ -n "${MANIFEST}" ] || { echo "kgateway inference-gateway.yaml not found" >&2; exit 1; } ## The manifest contains two resources (GatewayParameters + Gateway). ## Verify both are present. 
- grep -q 'kind: GatewayParameters' \ - "${WORK}/bundle-defaults/kgateway/manifests/inference-gateway.yaml" - grep -q 'kind: Gateway' \ - "${WORK}/bundle-defaults/kgateway/manifests/inference-gateway.yaml" + grep -q 'kind: GatewayParameters' "${MANIFEST}" + grep -q 'kind: Gateway' "${MANIFEST}" ## ── With system node scheduling ──────────────────────────────────── @@ -89,7 +92,8 @@ spec: - script: content: | WORK="/tmp/chainsaw-bundle-kgateway-templates" - MANIFEST="${WORK}/bundle-scheduling/kgateway/manifests/inference-gateway.yaml" + MANIFEST=$(ls "${WORK}"/bundle-scheduling/[0-9][0-9][0-9]-kgateway-post/templates/inference-gateway.yaml 2>/dev/null | head -1) + [ -n "${MANIFEST}" ] || { echo "kgateway inference-gateway.yaml not found in scheduling bundle" >&2; exit 1; } ## Verify nodeSelector was injected into GatewayParameters grep -q 'nodeSelector:' "${MANIFEST}" grep -q 'nodeGroup: system-pool' "${MANIFEST}" diff --git a/tests/chainsaw/bundle-templates/nodewright-customizations/chainsaw-test.yaml b/tests/chainsaw/bundle-templates/nodewright-customizations/chainsaw-test.yaml index 193483733..109ac4508 100644 --- a/tests/chainsaw/bundle-templates/nodewright-customizations/chainsaw-test.yaml +++ b/tests/chainsaw/bundle-templates/nodewright-customizations/chainsaw-test.yaml @@ -65,8 +65,12 @@ spec: - script: content: | WORK="/tmp/chainsaw-bundle-nodewright-template" + TUNING=$(ls "${WORK}"/bundle-defaults/[0-9][0-9][0-9]-nodewright-customizations/templates/tuning.yaml 2>/dev/null | head -1) + + [ -n "${TUNING}" ] || { echo "tuning.yaml not found" >&2; exit 1; } + chainsaw assert \ - --resource "${WORK}/bundle-defaults/nodewright-customizations/manifests/tuning.yaml" \ + --resource "${TUNING}" \ --file ./assert-tuning-defaults.yaml \ --no-color --timeout 5s @@ -91,8 +95,12 @@ spec: - script: content: | WORK="/tmp/chainsaw-bundle-nodewright-template" + TUNING=$(ls "${WORK}"/bundle-no-autotaint/[0-9][0-9][0-9]-nodewright-customizations/templates/tuning.yaml 
2>/dev/null | head -1) + + [ -n "${TUNING}" ] || { echo "tuning.yaml not found" >&2; exit 1; } + chainsaw assert \ - --resource "${WORK}/bundle-no-autotaint/nodewright-customizations/manifests/tuning.yaml" \ + --resource "${TUNING}" \ --file ./assert-tuning-no-autotaint.yaml \ --no-color --timeout 5s @@ -117,8 +125,12 @@ spec: - script: content: | WORK="/tmp/chainsaw-bundle-nodewright-template" + TUNING=$(ls "${WORK}"/bundle-garbage/[0-9][0-9][0-9]-nodewright-customizations/templates/tuning.yaml 2>/dev/null | head -1) + + [ -n "${TUNING}" ] || { echo "tuning.yaml not found" >&2; exit 1; } + chainsaw assert \ - --resource "${WORK}/bundle-garbage/nodewright-customizations/manifests/tuning.yaml" \ + --resource "${TUNING}" \ --file ./assert-tuning-defaults.yaml \ --no-color --timeout 5s @@ -145,8 +157,12 @@ spec: - script: content: | WORK="/tmp/chainsaw-bundle-nodewright-template" + TUNING=$(ls "${WORK}"/bundle-scheduling/[0-9][0-9][0-9]-nodewright-customizations/templates/tuning.yaml 2>/dev/null | head -1) + + [ -n "${TUNING}" ] || { echo "tuning.yaml not found" >&2; exit 1; } + chainsaw assert \ - --resource "${WORK}/bundle-scheduling/nodewright-customizations/manifests/tuning.yaml" \ + --resource "${TUNING}" \ --file ./assert-tuning-scheduling.yaml \ --no-color --timeout 5s cleanup: diff --git a/tests/chainsaw/cli/bundle-dynamic/chainsaw-test.yaml b/tests/chainsaw/cli/bundle-dynamic/chainsaw-test.yaml index 48f846ac0..f1618a646 100644 --- a/tests/chainsaw/cli/bundle-dynamic/chainsaw-test.yaml +++ b/tests/chainsaw/cli/bundle-dynamic/chainsaw-test.yaml @@ -45,7 +45,12 @@ spec: WORK="/tmp/chainsaw-bundle-dynamic" ${AICR_BIN} bundle -r "${WORK}/recipe.yaml" -o "${WORK}/helm-dynamic" \ --dynamic gpuoperator:driver.version - test -f "${WORK}/helm-dynamic/gpu-operator/cluster-values.yaml" + # Components live in NNN-/ folders (per-component bundle layout). + # gpu-operator is mixed (helm + manifests): the upstream-helm half holds + # cluster-values.yaml. 
Index can shift, so glob the prefix. + CV=$(ls "${WORK}"/helm-dynamic/[0-9][0-9][0-9]-gpu-operator/cluster-values.yaml 2>/dev/null \ + | grep -v -- '-gpu-operator-post' | head -1) + [ -n "${CV}" ] || { echo "cluster-values.yaml not found for gpu-operator" >&2; exit 1; } - name: assert-cluster-values-structure description: cluster-values.yaml has the dynamic path at the correct YAML location. @@ -53,8 +58,11 @@ spec: - script: content: | WORK="/tmp/chainsaw-bundle-dynamic" + CV=$(ls "${WORK}"/helm-dynamic/[0-9][0-9][0-9]-gpu-operator/cluster-values.yaml 2>/dev/null \ + | grep -v -- '-gpu-operator-post' | head -1) + [ -n "${CV}" ] || { echo "cluster-values.yaml not found for gpu-operator" >&2; exit 1; } printf 'apiVersion: helm/v1\nkind: Values\n' > "${WORK}/cluster-values-check.yaml" - sed '/^---$/d' "${WORK}/helm-dynamic/gpu-operator/cluster-values.yaml" >> "${WORK}/cluster-values-check.yaml" + sed '/^---$/d' "${CV}" >> "${WORK}/cluster-values-check.yaml" chainsaw assert \ --resource "${WORK}/cluster-values-check.yaml" \ --file ./assert-cluster-values.yaml \ @@ -66,23 +74,32 @@ spec: - script: content: | WORK="/tmp/chainsaw-bundle-dynamic" - for dir in "${WORK}"/helm-dynamic/*/; do + for dir in "${WORK}"/helm-dynamic/[0-9]*/; do comp=$(basename "$dir") test -f "${dir}cluster-values.yaml" \ || { echo "${comp} missing cluster-values.yaml" >&2; exit 1; } done # Only gpu-operator's should have dynamic content - grep -q "driver" "${WORK}/helm-dynamic/gpu-operator/cluster-values.yaml" + CV=$(ls "${WORK}"/helm-dynamic/[0-9][0-9][0-9]-gpu-operator/cluster-values.yaml 2>/dev/null \ + | grep -v -- '-gpu-operator-post' | head -1) + grep -q "driver" "${CV}" check: ($error == null): true - - name: helm-deploy-script-references-cluster-values - description: deploy.sh includes -f cluster-values.yaml for dynamic components. + - name: helm-install-scripts-reference-cluster-values + description: | + Each per-component install.sh references cluster-values.yaml (via -f). 
+ deploy.sh is now a generic loop that delegates to install.sh; the + cluster-values.yaml reference moved into install.sh in #662. try: - script: content: | WORK="/tmp/chainsaw-bundle-dynamic" - grep -q "cluster-values.yaml" "${WORK}/helm-dynamic/deploy.sh" + # Spot-check at least one folder's install.sh references cluster-values.yaml + INSTALL=$(ls "${WORK}"/helm-dynamic/[0-9][0-9][0-9]-gpu-operator/install.sh 2>/dev/null \ + | grep -v -- "-post/" | head -1) + [ -n "${INSTALL}" ] || { echo "gpu-operator install.sh not found" >&2; exit 1; } + grep -q "cluster-values.yaml" "${INSTALL}" check: ($error == null): true @@ -95,12 +112,16 @@ spec: WORK="/tmp/chainsaw-bundle-dynamic" ${AICR_BIN} bundle -r "${WORK}/recipe.yaml" -o "${WORK}/helm-nodynamic" # Every component should have cluster-values.yaml - for dir in "${WORK}"/helm-nodynamic/*/; do + for dir in "${WORK}"/helm-nodynamic/[0-9]*/; do test -f "${dir}cluster-values.yaml" \ || { echo "$(basename "$dir") missing cluster-values.yaml" >&2; exit 1; } done - # deploy.sh should always reference cluster-values.yaml - grep -q "cluster-values.yaml" "${WORK}/helm-nodynamic/deploy.sh" + # Each per-component install.sh should reference cluster-values.yaml + # (deploy.sh is generic; the reference now lives in install.sh per #662). 
+ for sh in "${WORK}"/helm-nodynamic/[0-9]*/install.sh; do + grep -q "cluster-values.yaml" "${sh}" \ + || { echo "$(dirname "${sh}") install.sh missing cluster-values.yaml reference" >&2; exit 1; } + done check: ($error == null): true @@ -194,8 +215,13 @@ spec: ${AICR_BIN} bundle -r "${WORK}/recipe.yaml" -o "${WORK}/helm-combined" \ --dynamic gpuoperator:driver.version \ --set gpuoperator:gds.enabled=true - test -f "${WORK}/helm-combined/gpu-operator/cluster-values.yaml" - grep -q "gds" "${WORK}/helm-combined/gpu-operator/values.yaml" + CV=$(ls "${WORK}"/helm-combined/[0-9][0-9][0-9]-gpu-operator/cluster-values.yaml 2>/dev/null \ + | grep -v -- '-gpu-operator-post' | head -1) + [ -n "${CV}" ] || { echo "cluster-values.yaml not found for gpu-operator" >&2; exit 1; } + VAL=$(ls "${WORK}"/helm-combined/[0-9][0-9][0-9]-gpu-operator/values.yaml 2>/dev/null \ + | grep -v -- '-gpu-operator-post' | head -1) + [ -n "${VAL}" ] || { echo "values.yaml not found for gpu-operator" >&2; exit 1; } + grep -q "gds" "${VAL}" check: ($error == null): true @@ -210,13 +236,16 @@ spec: --dynamic gpuoperator:driver.version \ --dynamic certmanager:crds.enabled # Both components should have cluster-values.yaml - test -f "${WORK}/helm-multi-dynamic/gpu-operator/cluster-values.yaml" - test -f "${WORK}/helm-multi-dynamic/cert-manager/cluster-values.yaml" + GPU_CV=$(ls "${WORK}"/helm-multi-dynamic/[0-9][0-9][0-9]-gpu-operator/cluster-values.yaml 2>/dev/null \ + | grep -v -- '-gpu-operator-post' | head -1) + [ -n "${GPU_CV}" ] || { echo "gpu-operator cluster-values.yaml missing" >&2; exit 1; } + CM_CV=$(ls "${WORK}"/helm-multi-dynamic/[0-9][0-9][0-9]-cert-manager/cluster-values.yaml 2>/dev/null | head -1) + [ -n "${CM_CV}" ] || { echo "cert-manager cluster-values.yaml missing" >&2; exit 1; } # Dynamic components should have pre-populated content - grep -q "driver" "${WORK}/helm-multi-dynamic/gpu-operator/cluster-values.yaml" - grep -q "crds" 
"${WORK}/helm-multi-dynamic/cert-manager/cluster-values.yaml" + grep -q "driver" "${GPU_CV}" + grep -q "crds" "${CM_CV}" # All components should have cluster-values.yaml (empty for non-dynamic) - for dir in "${WORK}"/helm-multi-dynamic/*/; do + for dir in "${WORK}"/helm-multi-dynamic/[0-9]*/; do test -f "${dir}cluster-values.yaml" \ || { echo "$(basename "$dir") missing cluster-values.yaml" >&2; exit 1; } done @@ -232,11 +261,12 @@ spec: WORK="/tmp/chainsaw-bundle-dynamic" ${AICR_BIN} bundle -r "${WORK}/recipe.yaml" -o "${WORK}/helm-nonexistent" \ --dynamic gpuoperator:nonexistent.field - test -f "${WORK}/helm-nonexistent/gpu-operator/cluster-values.yaml" - grep -q "nonexistent" "${WORK}/helm-nonexistent/gpu-operator/cluster-values.yaml" + CV=$(ls "${WORK}"/helm-nonexistent/[0-9][0-9][0-9]-gpu-operator/cluster-values.yaml 2>/dev/null \ + | grep -v -- '-gpu-operator-post' | head -1) + [ -n "${CV}" ] || { echo "gpu-operator cluster-values.yaml missing" >&2; exit 1; } + grep -q "nonexistent" "${CV}" # The nonexistent path should be stubbed to empty string - grep -q 'field: ""' "${WORK}/helm-nonexistent/gpu-operator/cluster-values.yaml" \ - || grep -q "field: ''" "${WORK}/helm-nonexistent/gpu-operator/cluster-values.yaml" + grep -q 'field: ""' "${CV}" || grep -q "field: ''" "${CV}" check: ($error == null): true @@ -250,14 +280,15 @@ spec: ${AICR_BIN} bundle -r "${WORK}/recipe.yaml" -o "${WORK}/helm-disabled-dynamic" \ --set awsebscsidriver:enabled=false \ --dynamic awsebscsidriver:controller.replicaCount - # Disabled component directory should NOT exist - ! test -d "${WORK}/helm-disabled-dynamic/aws-ebs-csi-driver" - # No cluster-values.yaml for disabled component - ! 
test -f "${WORK}/helm-disabled-dynamic/aws-ebs-csi-driver/cluster-values.yaml" + # Disabled component directory should NOT exist (any NNN-aws-ebs-csi-driver folder) + MATCHES=$(ls -d "${WORK}"/helm-disabled-dynamic/[0-9][0-9][0-9]-aws-ebs-csi-driver 2>/dev/null | wc -l) + [ "${MATCHES}" -eq 0 ] || { echo "disabled aws-ebs-csi-driver folder still exists" >&2; exit 1; } # deploy.sh should not mention disabled component ! grep -q "aws-ebs-csi-driver" "${WORK}/helm-disabled-dynamic/deploy.sh" # Enabled component should still work - test -f "${WORK}/helm-disabled-dynamic/gpu-operator/values.yaml" + VAL=$(ls "${WORK}"/helm-disabled-dynamic/[0-9][0-9][0-9]-gpu-operator/values.yaml 2>/dev/null \ + | grep -v -- '-gpu-operator-post' | head -1) + [ -n "${VAL}" ] || { echo "gpu-operator values.yaml missing" >&2; exit 1; } check: ($error == null): true @@ -267,8 +298,12 @@ spec: - script: content: | WORK="/tmp/chainsaw-bundle-dynamic" - VALUES="${WORK}/helm-dynamic/gpu-operator/values.yaml" - CLUSTER="${WORK}/helm-dynamic/gpu-operator/cluster-values.yaml" + VALUES=$(ls "${WORK}"/helm-dynamic/[0-9][0-9][0-9]-gpu-operator/values.yaml 2>/dev/null \ + | grep -v -- '-gpu-operator-post' | head -1) + CLUSTER=$(ls "${WORK}"/helm-dynamic/[0-9][0-9][0-9]-gpu-operator/cluster-values.yaml 2>/dev/null \ + | grep -v -- '-gpu-operator-post' | head -1) + [ -n "${VALUES}" ] && [ -n "${CLUSTER}" ] \ + || { echo "gpu-operator values.yaml or cluster-values.yaml missing" >&2; exit 1; } # 1. driver section should still exist in values.yaml (driver.enabled remains) grep -q "driver:" "${VALUES}" # 2. 
driver.version must NOT appear anywhere in values.yaml diff --git a/tests/chainsaw/cli/bundle-scheduling/chainsaw-test.yaml b/tests/chainsaw/cli/bundle-scheduling/chainsaw-test.yaml index 6c0a61b15..f662995f2 100644 --- a/tests/chainsaw/cli/bundle-scheduling/chainsaw-test.yaml +++ b/tests/chainsaw/cli/bundle-scheduling/chainsaw-test.yaml @@ -74,8 +74,12 @@ spec: - script: content: | WORK="/tmp/chainsaw-bundle-scheduling" + # gpu-operator's primary upstream-helm folder (not the -post one). + VAL=$(ls "${WORK}"/bundle/[0-9][0-9][0-9]-gpu-operator/values.yaml 2>/dev/null \ + | grep -v -- "-gpu-operator-post" | head -1) + [ -n "${VAL}" ] || { echo "gpu-operator values.yaml not found" >&2; exit 1; } printf 'apiVersion: helm/v1\nkind: Values\n' > "${WORK}/gpu-operator-values.yaml" - sed '/^---$/d' "${WORK}/bundle/gpu-operator/values.yaml" >> "${WORK}/gpu-operator-values.yaml" + sed '/^---$/d' "${VAL}" >> "${WORK}/gpu-operator-values.yaml" chainsaw assert \ --resource "${WORK}/gpu-operator-values.yaml" \ --file ./assert-gpu-operator-values.yaml \ diff --git a/tests/chainsaw/cli/bundle-variants/chainsaw-test.yaml b/tests/chainsaw/cli/bundle-variants/chainsaw-test.yaml index a69e0f296..e502a19c6 100644 --- a/tests/chainsaw/cli/bundle-variants/chainsaw-test.yaml +++ b/tests/chainsaw/cli/bundle-variants/chainsaw-test.yaml @@ -49,8 +49,10 @@ spec: test -f "${WORK}/basic/deploy.sh" test -f "${WORK}/basic/README.md" test -f "${WORK}/basic/recipe.yaml" - ls "${WORK}"/basic/*/values.yaml >/dev/null 2>&1 - test -d "${WORK}/basic/gpu-operator" + ls "${WORK}"/basic/[0-9]*/values.yaml >/dev/null 2>&1 + # gpu-operator lives under NNN-gpu-operator/ (helm half) in the new layout + ls -d "${WORK}"/basic/[0-9][0-9][0-9]-gpu-operator >/dev/null 2>&1 \ + || { echo "no NNN-gpu-operator folder found" >&2; exit 1; } check: ($error == null): true diff --git a/tests/chainsaw/cli/cuj1-training/chainsaw-test.yaml b/tests/chainsaw/cli/cuj1-training/chainsaw-test.yaml index b6782be46..45f82408a 
100644 --- a/tests/chainsaw/cli/cuj1-training/chainsaw-test.yaml +++ b/tests/chainsaw/cli/cuj1-training/chainsaw-test.yaml @@ -109,7 +109,8 @@ spec: test -f "${WORK}/bundle/README.md" test -f "${WORK}/bundle/deploy.sh" test -f "${WORK}/bundle/recipe.yaml" - ls "${WORK}"/bundle/*/values.yaml >/dev/null 2>&1 + # Components live in NNN-/ folders (per-component bundle layout). + ls "${WORK}"/bundle/[0-9]*/values.yaml >/dev/null 2>&1 check: ($error == null): true @@ -119,8 +120,13 @@ spec: - script: content: | WORK="/tmp/chainsaw-cuj1-training" + # gpu-operator is mixed (helm + manifests): values.yaml lives in the + # upstream-helm half (NNN-gpu-operator), not the -post sibling. + GPU_VALUES=$(ls "${WORK}"/bundle/[0-9][0-9][0-9]-gpu-operator/values.yaml 2>/dev/null \ + | grep -v -- '-gpu-operator-post' | head -1) + [ -n "${GPU_VALUES}" ] || { echo "gpu-operator values.yaml not found" >&2; exit 1; } printf 'apiVersion: helm/v1\nkind: Values\n' > "${WORK}/gpu-operator-values.yaml" - sed '/^---$/d' "${WORK}/bundle/gpu-operator/values.yaml" >> "${WORK}/gpu-operator-values.yaml" + sed '/^---$/d' "${GPU_VALUES}" >> "${WORK}/gpu-operator-values.yaml" chainsaw assert \ --resource "${WORK}/gpu-operator-values.yaml" \ --file ./assert-bundle-scheduling.yaml \ @@ -132,8 +138,10 @@ spec: - script: content: | WORK="/tmp/chainsaw-cuj1-training" + NFD_VALUES=$(ls "${WORK}"/bundle/[0-9][0-9][0-9]-nfd/values.yaml 2>/dev/null | head -1) + [ -n "${NFD_VALUES}" ] || { echo "nfd values.yaml not found" >&2; exit 1; } printf 'apiVersion: helm/v1\nkind: Values\n' > "${WORK}/nfd-values.yaml" - sed '/^---$/d' "${WORK}/bundle/nfd/values.yaml" >> "${WORK}/nfd-values.yaml" + sed '/^---$/d' "${NFD_VALUES}" >> "${WORK}/nfd-values.yaml" chainsaw assert \ --resource "${WORK}/nfd-values.yaml" \ --file ./assert-bundle-scheduling-nfd.yaml \ diff --git a/vendor/github.com/go-openapi/strfmt/CONTRIBUTORS.md b/vendor/github.com/go-openapi/strfmt/CONTRIBUTORS.md index e49700d4d..a5d5ed6e6 100644 --- 
a/vendor/github.com/go-openapi/strfmt/CONTRIBUTORS.md +++ b/vendor/github.com/go-openapi/strfmt/CONTRIBUTORS.md @@ -4,12 +4,12 @@ | Total Contributors | Total Contributions | | --- | --- | -| 40 | 225 | +| 40 | 231 | | Username | All Time Contribution Count | All Commits | | --- | --- | --- | | @casualjim | 88 | | -| @fredbi | 57 | | +| @fredbi | 63 | | | @youyuanwu | 13 | | | @jlambatl | 9 | | | @GlenDC | 5 | | diff --git a/vendor/github.com/go-openapi/strfmt/README.md b/vendor/github.com/go-openapi/strfmt/README.md index a0cf64275..4afef4373 100644 --- a/vendor/github.com/go-openapi/strfmt/README.md +++ b/vendor/github.com/go-openapi/strfmt/README.md @@ -16,14 +16,6 @@ Golang support for string formats defined by JSON Schema and OpenAPI. ## Announcements -* **2025-12-19** : new community chat on discord - * a new discord community channel is available to be notified of changes and support users - * our venerable Slack channel remains open, and will be eventually discontinued on **2026-03-31** - -You may join the discord community by clicking the invite link on the discord badge (also above). [![Discord Channel][discord-badge]][discord-url] - -Or join our Slack channel: [![Slack Channel][slack-logo]![slack-badge]][slack-url] - * **2026-03-07** : v0.26.0 **dropped dependency to the mongodb driver** * mongodb users can still use this package without any change * however, we have frozen the back-compatible support for mongodb driver at v2.5.0 @@ -177,9 +169,9 @@ This library ships under the [SPDX-License-Identifier: Apache-2.0](./LICENSE). 
## Other documentation * [All-time contributors](./CONTRIBUTORS.md) -* [Contributing guidelines](.github/CONTRIBUTING.md) -* [Maintainers documentation](docs/MAINTAINERS.md) -* [Code style](docs/STYLE.md) +* [Contributing guidelines][contributing-doc-site] +* [Maintainers documentation][maintainers-doc-site] +* [Code style][style-doc-site] ## Cutting a new release @@ -214,9 +206,6 @@ Maintainers can cut a new release by either: [doc-url]: https://goswagger.io/go-openapi [godoc-badge]: https://pkg.go.dev/badge/github.com/go-openapi/strfmt [godoc-url]: http://pkg.go.dev/github.com/go-openapi/strfmt -[slack-logo]: https://a.slack-edge.com/e6a93c1/img/icons/favicon-32.png -[slack-badge]: https://img.shields.io/badge/slack-blue?link=https%3A%2F%2Fgoswagger.slack.com%2Farchives%2FC04R30YM -[slack-url]: https://goswagger.slack.com/archives/C04R30YMU [discord-badge]: https://img.shields.io/discord/1446918742398341256?logo=discord&label=discord&color=blue [discord-url]: https://discord.gg/FfnFYaC3k5 @@ -228,3 +217,7 @@ Maintainers can cut a new release by either: [goversion-url]: https://github.com/go-openapi/strfmt/blob/master/go.mod [top-badge]: https://img.shields.io/github/languages/top/go-openapi/strfmt [commits-badge]: https://img.shields.io/github/commits-since/go-openapi/strfmt/latest + +[contributing-doc-site]: https://go-openapi.github.io/doc-site/contributing/contributing/index.html +[maintainers-doc-site]: https://go-openapi.github.io/doc-site/maintainers/index.html +[style-doc-site]: https://go-openapi.github.io/doc-site/contributing/style/index.html