Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions Makefile
Original file line number Diff line number Diff line change
Expand Up @@ -147,6 +147,7 @@ LICENSE_IGNORES = \
-ignore 'bundles/**' \
-ignore 'dist/**' \
-ignore 'vendor/**' \
-ignore '**/testdata/**' \
Comment thread
lockwobr marked this conversation as resolved.
-ignore 'site/public/**' \
-ignore 'site/resources/**' \
-ignore 'site/node_modules/**'
Expand Down
21 changes: 21 additions & 0 deletions docs/contributor/component.md
Original file line number Diff line number Diff line change
Expand Up @@ -16,6 +16,27 @@ The bundler system converts RecipeInput objects into deployment artifacts. Artif
- **Node scheduling**: Registry defines paths for injecting node selectors and tolerations
- **Structured errors**: Uses `pkg/errors` for error codes and wrapping

### Local Format (Shared Bundle Layout)

`pkg/bundler/deployer/localformat` writes the uniform numbered `NNN-<component>/` bundle layout consumed by every deployer. It owns per-folder content (Chart.yaml, values.yaml, cluster-values.yaml, install.sh, templates/, upstream.env). Deployers (`helm`, future `helmfile` per [#632](https://github.com/NVIDIA/aicr/issues/632), `argocd`, `argocd-helm`) call `localformat.Write()` and then add their own top-level orchestration files (deploy.sh, helmfile.yaml, Application CRs, etc.) — they never re-classify components or duplicate the per-folder writer.
Comment thread
coderabbitai[bot] marked this conversation as resolved.

**Classification rule** (single source of truth, in `localformat.classify`):

| Recipe shape | Folder kind | Notes |
|---|---|---|
| `helm.defaultRepository` set, no `manifestFiles` | `KindUpstreamHelm` | upstream chart referenced via `upstream.env`; no Chart.yaml |
| `helm.defaultRepository` set + `manifestFiles` (mixed) | `KindUpstreamHelm` (primary) + `KindLocalHelm` (`-post` injected) | two adjacent folders; raw manifests deploy post-install |
| `helm.defaultRepository == ""` + `manifestFiles` | `KindLocalHelm` | manifest-only wrapped chart |
| `kustomize` (Tag/Path set) | `KindLocalHelm` | `kustomize build` at bundle time → `templates/manifest.yaml` |

**Load-bearing invariants** (don't violate without changing the design):

1. **`localformat` never writes deployer-specific files.** `deploy.sh`, `helmfile.yaml`, argocd `Application` CRs, Flux `HelmRelease`s — all produced by the respective deployer after `Write()` returns. This separation is what makes a single layout consumable by every deployer.
2. **`install.sh` is never name-customized.** It is rendered from one of exactly two templates (`install-upstream-helm.sh.tmpl`, `install-local-helm.sh.tmpl`), parameterized only by data (name, namespace, upstream ref). Name-keyed quirks (kai-scheduler async timeout, nodewright-operator taint cleanup, DRA restart, orphan-CRD scan) stay in `deploy.sh` as name-matched blocks — not in `install.sh`. This is the structural barrier that prevents per-folder scripts from accumulating drift.
3. **`Write` is deterministic and idempotent.** Same inputs → same on-disk bytes → same `Folder` slice. Map iteration is sorted; no timestamps or random suffixes are embedded.

For the full classification table, base-format invariants, and the helm deployer's call site, see `pkg/bundler/deployer/localformat/doc.go` (godoc) and `pkg/bundler/deployer/helm/helm.go::Generate`. Further design history: ticket [#662](https://github.com/NVIDIA/aicr/issues/662).

## Quick Start

### Adding a New Component (Declarative Approach)
Expand Down
52 changes: 40 additions & 12 deletions docs/user/cli-reference.md
Original file line number Diff line number Diff line change
Expand Up @@ -1116,22 +1116,50 @@ bundles/
```
bundles/
├── README.md # Deployment guide with ordered steps
├── deploy.sh # One-command deployment script
├── deploy.sh # Generic install loop + name-matched blocks
├── undeploy.sh # Generic reverse loop
├── recipe.yaml # Recipe used to generate bundle
├── checksums.txt # SHA256 checksums
├── attestation/ # Present when --attest is used
│ ├── bundle-attestation.sigstore.json # SLSA Build Provenance v1
│ └── aicr-attestation.sigstore.json # Binary SLSA provenance chain
├── gpu-operator/
│ ├── values.yaml # Component-specific Helm values
│ ├── README.md # Per-component install/upgrade/uninstall
│ └── manifests/ # Additional manifests (if any)
│ └── dcgm-exporter.yaml
└── cert-manager/
├── 001-cert-manager/ # Upstream-helm folder: no Chart.yaml
│ ├── install.sh # Rendered: helm upgrade --install ... --repo ${REPO}
│ ├── values.yaml
│ ├── cluster-values.yaml # Dynamic-path overrides (operator-edited)
│ └── upstream.env # CHART, REPO, VERSION (sourced by install.sh)
├── 002-gpu-operator/ # Mixed component primary (upstream-helm)
│ ├── install.sh
│ ├── values.yaml
│ ├── cluster-values.yaml
│ └── upstream.env
└── 003-gpu-operator-post/ # Injected -post wrapped chart (mixed component's raw manifests)
├── Chart.yaml # Local-helm folder: Chart.yaml + templates/ present
├── install.sh # Rendered: helm upgrade --install ... ./
├── values.yaml
└── README.md
├── cluster-values.yaml
└── templates/
└── dcgm-exporter.yaml
```

**Folder layout rules:**

- Folders are numbered `NNN-<component>/` (1-based, zero-padded). Numbering is regenerated on every bundle.
- Each folder is one of two **kinds**, distinguished by the presence of `Chart.yaml`:
- **upstream-helm** — no `Chart.yaml`; `upstream.env` carries `CHART`/`REPO`/`VERSION`; `install.sh` installs the upstream chart.
- **local-helm** — `Chart.yaml` + `templates/`; `install.sh` installs the local chart (`helm upgrade --install <name> ./`).
- **Mixed components** (Helm chart + raw manifests) emit **two adjacent folders**: a primary upstream-helm `NNN-<name>/` and an injected `(NNN+1)-<name>-post/` local-helm wrapper carrying the raw manifests. Subsequent components shift by one.
- Manifest-only components (no upstream Helm chart, just raw manifests) become a single local-helm wrapped chart.
Comment thread
lockwobr marked this conversation as resolved.
- Kustomize-typed components run `kustomize build` at bundle time; the output becomes a single `templates/manifest.yaml` inside a local-helm folder.

**Breaking change vs. earlier releases:**

Previous releases used a flat `<component>/` layout with `manifests/` siblings and a `--deployer helm` script that branched on component kind. The new format is uniform:

- All folders carry a rendered `install.sh`. The top-level `deploy.sh` is a generic loop with no per-component branching — name-matched special-case blocks (nodewright-operator taint cleanup, kai-scheduler async timeout, orphan-CRD scan, DRA kubelet-plugin restart) live around the loop, not inside it.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This new layout section correctly says folders carry install.sh, but the later deployment note still says each component subdirectory contains a README.md with the manual Helm command. That is no longer true in the NNN-* layout; please update the note around line 1317 to point at per-folder install.sh instead.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in the latest push. docs/user/cli-reference.md:1300-1317 now describes the NNN-prefixed layout and points at per-folder install.sh:

# Navigate to bundle
cd bundles

# Review root README and a component's values
cat README.md
cat 001-gpu-operator/values.yaml
...

Note: deploy.sh and undeploy.sh are convenience scripts — not the only deployment path. Each NNN-<component>/ folder contains a rendered install.sh that runs the exact helm upgrade --install command for manual or pipeline-driven deployment.

- Raw manifests for mixed components now apply **post-install only**, via the injected `-post` wrapped chart. The earlier pre-apply mechanism with a CRD-race retry wrapper is gone — Helm now owns CRD ordering for mixed components natively.
- Tooling that parsed bundle paths by bare component name must account for the `NNN-` prefix.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This path-format change still has an in-tree consumer using the old layout: kwok/scripts/validate-scheduling.sh:384 still points at ${WORK_DIR}/bundle/kube-prometheus-stack/values.yaml. Under the new layout it should glob ${WORK_DIR}/bundle/[0-9][0-9][0-9]-kube-prometheus-stack/values.yaml; otherwise the KWOK storage patch is silently skipped.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in the latest push. kwok/scripts/validate-scheduling.sh:381-394 now globs the NNN-prefixed folder:

prom_values=$(ls -1 "${WORK_DIR}/bundle"/[0-9][0-9][0-9]-kube-prometheus-stack/values.yaml 2>/dev/null | head -1)
if [[ -n "$prom_values" && -f "$prom_values" ]] && yq eval ...

This restores the KWOK storage patch under the new layout.


**Argo CD bundle structure** (with `--deployer argocd`):
```
bundles/
Expand Down Expand Up @@ -1273,11 +1301,11 @@ If you need to enforce specific install-time values (e.g., pinning `driver.versi

```shell
# Navigate to bundle
cd bundles/gpu-operator
cd bundles

# Review configuration
cat values.yaml
# Review root README and a component's values
cat README.md
cat 001-gpu-operator/values.yaml

# Verify integrity
sha256sum -c checksums.txt
Expand All @@ -1286,7 +1314,7 @@ sha256sum -c checksums.txt
chmod +x deploy.sh && ./deploy.sh
```

> **Note:** `deploy.sh` and `undeploy.sh` are convenience scripts — not the only deployment path. Each component subdirectory contains a `README.md` with the exact `helm upgrade --install` command for manual or pipeline-driven deployment.
> **Note:** `deploy.sh` and `undeploy.sh` are convenience scripts — not the only deployment path. Each `NNN-<component>/` folder contains a rendered `install.sh` that runs the exact `helm upgrade --install` command for manual or pipeline-driven deployment.

#### Deploy Script Behavior (`deploy.sh`)

Expand Down
6 changes: 4 additions & 2 deletions kwok/scripts/validate-scheduling.sh
Original file line number Diff line number Diff line change
Expand Up @@ -381,8 +381,10 @@ generate_bundle() {
# KWOK clusters use emptyDir for Prometheus storage (no PVC/StorageClass).
# Cloud overlays (EKS, AKS) set emptyDir: null + volumeClaimTemplate which
# the Prometheus CRD rejects. Restore emptyDir and remove PVC for KWOK.
local prom_values="${WORK_DIR}/bundle/kube-prometheus-stack/values.yaml"
if [[ -f "$prom_values" ]] && yq eval '.prometheus.prometheusSpec.storageSpec.emptyDir' "$prom_values" 2>/dev/null | grep -q 'null'; then
# Bundle layout uses NNN-prefixed folders, so glob to find the kube-prometheus-stack folder.
local prom_values
prom_values=$(ls -1 "${WORK_DIR}/bundle"/[0-9][0-9][0-9]-kube-prometheus-stack/values.yaml 2>/dev/null | head -1)
if [[ -n "$prom_values" && -f "$prom_values" ]] && yq eval '.prometheus.prometheusSpec.storageSpec.emptyDir' "$prom_values" 2>/dev/null | grep -q 'null'; then
log_info "Fixing kube-prometheus-stack storageSpec for KWOK (emptyDir instead of PVC)"
yq eval -i '
.prometheus.prometheusSpec.storageSpec.emptyDir = {"medium": "", "sizeLimit": "10Gi"} |
Expand Down
68 changes: 36 additions & 32 deletions pkg/bundler/bundler_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -249,25 +249,25 @@ func TestMake_Success(t *testing.T) {
}
}

// Verify per-component directories
for _, comp := range []string{"gpu-operator", "network-operator"} {
valuesPath := filepath.Join(tmpDir, comp, "values.yaml")
// Verify per-component directories (numbered by deployment order)
componentDirs := map[string]string{
"gpu-operator": "001-gpu-operator",
"network-operator": "002-network-operator",
}
for comp, dir := range componentDirs {
valuesPath := filepath.Join(tmpDir, dir, "values.yaml")
if _, statErr := os.Stat(valuesPath); os.IsNotExist(statErr) {
t.Errorf("expected %s/values.yaml was not created", comp)
}
readmePath := filepath.Join(tmpDir, comp, "README.md")
if _, statErr := os.Stat(readmePath); os.IsNotExist(statErr) {
t.Errorf("expected %s/README.md was not created", comp)
t.Errorf("expected %s/values.yaml was not created (component %s)", dir, comp)
}
}

// No Chart.yaml should exist
// No Chart.yaml should exist at top level
chartPath := filepath.Join(tmpDir, "Chart.yaml")
if _, statErr := os.Stat(chartPath); !os.IsNotExist(statErr) {
t.Error("Chart.yaml should not exist in per-component bundle")
}

// Verify output summary (3 root + 2 components × 2 files = 7, +1 recipe.yaml = 8)
// Verify output summary (3 root + 2 components × multiple files >= 7)
if output.TotalFiles < 7 {
t.Errorf("expected at least 7 files, got %d", output.TotalFiles)
}
Expand Down Expand Up @@ -317,14 +317,16 @@ func TestMake_DisabledComponentsFiltered(t *testing.T) {
t.Fatal("Make() returned nil output")
}

// Enabled component should have a directory
if _, statErr := os.Stat(filepath.Join(tmpDir, "gpu-operator", "values.yaml")); os.IsNotExist(statErr) {
t.Error("expected gpu-operator/values.yaml to be created")
// Enabled component should have a directory (numbering reflects only enabled components)
if _, statErr := os.Stat(filepath.Join(tmpDir, "001-gpu-operator", "values.yaml")); os.IsNotExist(statErr) {
t.Error("expected 001-gpu-operator/values.yaml to be created")
}

// Disabled component should NOT have a directory
if _, statErr := os.Stat(filepath.Join(tmpDir, "aws-ebs-csi-driver")); !os.IsNotExist(statErr) {
t.Error("expected aws-ebs-csi-driver directory to NOT be created")
// Disabled component should NOT have a directory (under any numbering)
for _, dir := range []string{"aws-ebs-csi-driver", "001-aws-ebs-csi-driver", "002-aws-ebs-csi-driver"} {
if _, statErr := os.Stat(filepath.Join(tmpDir, dir)); !os.IsNotExist(statErr) {
t.Errorf("expected %s directory to NOT be created", dir)
}
}

// deploy.sh should not reference the disabled component
Expand Down Expand Up @@ -420,7 +422,10 @@ func TestMake_SetEnabledOverridesPrecedence(t *testing.T) {
t.Fatalf("Make() error = %v", makeErr)
}

_, statErr := os.Stat(filepath.Join(tmpDir, "aws-ebs-csi-driver"))
// When included, the component appears as the second numbered folder
// (gpu-operator is 001, aws-ebs-csi-driver is 002). The flat layout
// is gone in this PR — only assert against the numbered path.
_, statErr := os.Stat(filepath.Join(tmpDir, "002-aws-ebs-csi-driver"))
included := !os.IsNotExist(statErr)

Comment thread
coderabbitai[bot] marked this conversation as resolved.
if included != tt.expectIncluded {
Expand Down Expand Up @@ -461,7 +466,8 @@ func TestMake_SetEnabledNotLeakedToHelmValues(t *testing.T) {
t.Fatalf("Make() error = %v", makeErr)
}

valuesPath := filepath.Join(tmpDir, "aws-ebs-csi-driver", "values.yaml")
// aws-ebs-csi-driver is the 2nd component in deployment order (after gpu-operator)
valuesPath := filepath.Join(tmpDir, "002-aws-ebs-csi-driver", "values.yaml")
valuesData, readErr := os.ReadFile(valuesPath)
if readErr != nil {
t.Fatalf("failed to read values.yaml: %v", readErr)
Expand Down Expand Up @@ -519,10 +525,10 @@ func TestMake_WithValueOverrides(t *testing.T) {
t.Fatal("Make() returned nil output")
}

// Verify gpu-operator/values.yaml was created
valuesPath := filepath.Join(tmpDir, "gpu-operator", "values.yaml")
// Verify 001-gpu-operator/values.yaml was created (single component → 001)
valuesPath := filepath.Join(tmpDir, "001-gpu-operator", "values.yaml")
if _, err := os.Stat(valuesPath); os.IsNotExist(err) {
t.Fatal("gpu-operator/values.yaml was not created")
t.Fatal("001-gpu-operator/values.yaml was not created")
}
}

Expand Down Expand Up @@ -1190,19 +1196,17 @@ func TestMake_DisabledComponentWithDynamic(t *testing.T) {
t.Fatalf("Make() error = %v", makeErr)
}

// Disabled component should NOT have a directory at all
if _, statErr := os.Stat(filepath.Join(tmpDir, "aws-ebs-csi-driver")); !os.IsNotExist(statErr) {
t.Error("expected aws-ebs-csi-driver directory to NOT be created (component is disabled)")
}

// Disabled component should NOT have cluster-values.yaml
if _, statErr := os.Stat(filepath.Join(tmpDir, "aws-ebs-csi-driver", "cluster-values.yaml")); !os.IsNotExist(statErr) {
t.Error("expected aws-ebs-csi-driver/cluster-values.yaml to NOT exist (component is disabled)")
// Disabled component should NOT have a directory at all (under any numbering).
// The directory check implies cluster-values.yaml absence, so don't double-check.
for _, dir := range []string{"aws-ebs-csi-driver", "001-aws-ebs-csi-driver", "002-aws-ebs-csi-driver"} {
if _, statErr := os.Stat(filepath.Join(tmpDir, dir)); !os.IsNotExist(statErr) {
t.Errorf("expected %s directory to NOT be created (component is disabled)", dir)
}
}

// Enabled component should still exist
if _, statErr := os.Stat(filepath.Join(tmpDir, "gpu-operator", "values.yaml")); os.IsNotExist(statErr) {
t.Error("expected gpu-operator/values.yaml to be created")
// Enabled component should still exist (gpu-operator is the only enabled → 001)
if _, statErr := os.Stat(filepath.Join(tmpDir, "001-gpu-operator", "values.yaml")); os.IsNotExist(statErr) {
t.Error("expected 001-gpu-operator/values.yaml to be created")
}

// deploy.sh should not reference the disabled component
Expand Down
14 changes: 10 additions & 4 deletions pkg/bundler/deployer/helm/doc.go
Original file line number Diff line number Diff line change
Expand Up @@ -14,13 +14,19 @@

// Package helm generates per-component Helm bundles from recipe results.
//
// Generates a directory per component with individual values and install instructions:
// Per-component folder layout (NNN-prefixed, written by pkg/bundler/deployer/localformat):
//
// - NNN-<component>/install.sh: Per-folder install script
// - NNN-<component>/values.yaml: Static Helm values
// - NNN-<component>/cluster-values.yaml: Per-cluster dynamic values
// - NNN-<component>/upstream.env: CHART/REPO/VERSION (upstream-helm folders)
// - NNN-<component>/Chart.yaml + templates/: Local chart (local-helm folders)
//
// Top-level files (owned by this deployer):
//
// - <component>/values.yaml: Helm values per component
// - <component>/README.md: Component install/upgrade/uninstall
// - <component>/manifests/: Optional manifest files
// - README.md: Root deployment guide with ordered steps
// - deploy.sh: Automation script (0755)
// - undeploy.sh: Reverse-order uninstall script (0755)
// - checksums.txt: SHA256 digests for verification (optional)
//
// Usage:
Expand Down
Loading
Loading