From afc102635248371c76df7ceee4c128ae905be84f Mon Sep 17 00:00:00 2001
From: Yuan Chen
Date: Wed, 29 Apr 2026 17:41:55 -0700
Subject: [PATCH] fix(recipes): use Helm manifest-only pattern for gke-nccl-tcpxo
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The registry entry for `gke-nccl-tcpxo` declared its namespace under a
top-level `manifest:` block:

    - name: gke-nccl-tcpxo
      ...
      manifest:
        defaultNamespace: kube-system

`manifest:` is **not** a parsed field on the registry's `ComponentConfig`
struct (only `helm:` and `kustomize:` are recognized), so
`manifest.defaultNamespace` was silently ignored. The established
manifest-only Helm-wrapper pattern (used today by
`nodewright-customizations`) is to declare the component as `helm:` with
an empty `defaultRepository`:

    helm:
      defaultRepository: ""
      defaultNamespace: kube-system

Bug surfacing timeline:

- Pre-#706, manifest-only components were installed by the root
  `deploy.sh` via raw `kubectl apply -f .../manifests/`. Those manifests
  carry inline `metadata.namespace: kube-system`, so the empty registry
  default was harmless; `kubectl apply` did not need
  `ComponentRef.Namespace` for routing.

- #706 (`feat(bundler)!: uniform NNN-folder bundle layout via
  localformat`) wraps every component, manifest-only included, as a
  local Helm chart. The generated `install.sh` now always emits
  `helm upgrade --install <name> ./ --namespace <namespace> --create-namespace`,
  which requires `ComponentRef.Namespace`. With the unparsed `manifest:`
  block, that field is empty, producing:

      helm upgrade --install gke-nccl-tcpxo ./ \
        --namespace --create-namespace \

  With nothing left between the two flags, Helm parses the literal
  `--create-namespace` as the namespace name and fails with:

      Error: create: failed to create: namespaces "--create-namespace" not found

- The first KWOK GPU run after #706 was cancelled, and earlier runs used
  the pre-#706 deployer path where the empty namespace was inert. PR #715
  is one of the first post-#706 runs to actually complete the H100
  GKE-COS training jobs (its registry/base.yaml changes auto-promote the
  GKE-COS Tier-2 KWOK matrix), and it surfaced the failure.

Fix: switch `gke-nccl-tcpxo` to the existing manifest-only Helm pattern,
matching `nodewright-customizations`.

Verified locally:

    $ aicr recipe --service gke --accelerator h100 \
        --intent training --os cos -o /tmp/recipe.yaml
    $ aicr bundle -r /tmp/recipe.yaml -o /tmp/bundle
    $ grep "helm upgrade" /tmp/bundle/*-gke-nccl-tcpxo/install.sh
    helm upgrade --install gke-nccl-tcpxo ./ \
      --namespace kube-system --create-namespace \

Refs: #706, #715
---
 recipes/registry.yaml | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/recipes/registry.yaml b/recipes/registry.yaml
index 8a0bc6c65..bc7b73c31 100644
--- a/recipes/registry.yaml
+++ b/recipes/registry.yaml
@@ -112,7 +112,9 @@ components:
     displayName: gke-nccl-tcpxo
     valueOverrideKeys:
       - gkenccltcpxo
-    manifest:
+    helm:
+      # Manifest-only component - no external Helm chart, uses manifestFiles
+      defaultRepository: ""
       defaultNamespace: kube-system
 
   - name: aws-efa
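
Reviewer note: the argument collapse described in the commit message can be reproduced without Helm or a cluster. The following is a minimal sketch, not the bundler's actual output. It assumes the namespace reaches the command line as an empty, unquoted value; the variable `NS` and the function `show_namespace_arg` are hypothetical stand-ins introduced only for illustration.

```shell
#!/bin/sh
# Hypothetical stand-in for helm: reports how many arguments it received
# and which word immediately follows --namespace.
show_namespace_arg() {
  printf 'argc=%s\n' "$#"
  while [ "$#" -gt 0 ]; do
    if [ "$1" = "--namespace" ]; then
      printf 'namespace=%s\n' "${2-}"
      return 0
    fi
    shift
  done
}

# Assumption: an ignored `manifest:` block leaves the namespace empty.
NS=""
# Unquoted on purpose: an empty expansion yields no argument at all.
broken=$(show_namespace_arg upgrade --install gke-nccl-tcpxo ./ \
  --namespace $NS --create-namespace)

# With the registry fix, the namespace is populated.
NS="kube-system"
fixed=$(show_namespace_arg upgrade --install gke-nccl-tcpxo ./ \
  --namespace $NS --create-namespace)

printf '%s\n' "$broken" "$fixed"
```

In the empty case the stand-in receives six arguments and sees `--create-namespace` as the namespace value; with `kube-system` it receives seven and sees the intended namespace. This matches the failure mode and the verified output quoted in the message.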