(WIP) ci: harden H100 GPU tests on slow runners by yuanchen8911 · Pull Request #687 · NVIDIA/aicr

yuanchen8911 · 2026-04-25T15:26:30Z

Summary

Hardens the H100 nvkind training and inference CI paths for slow/cold runners by consolidating the duplicated workflows, adding bounded control-plane and runtime component gates, tuning kind-only monitoring behavior, and improving diagnostics/artifact collection for failures.

Motivation / Context

Recent H100 Kind runs exposed related reliability issues: slow image pulls and kind image loads consumed most of the job budget, fragile single-node control planes could leave downstream workloads unreconciled after a transient restart, kube-prometheus-operator could later crash after Helm reported success, and Chainsaw cleanup/API calls could obscure the actual failed assertion.

Fixes: N/A
Related: #675

Type of Change

Bug fix (non-breaking change that fixes an issue)
New feature (non-breaking change that adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)
Documentation update
Refactoring (no functional changes)
Build/CI/tooling

Component(s) Affected

CLI (cmd/aicr, pkg/cli)
API server (cmd/aicrd, pkg/api, pkg/server)
Recipe engine / data (pkg/recipe)
Bundlers (pkg/bundler, pkg/component/*)
Collectors / snapshotter (pkg/collector, pkg/snapshotter)
Validator (pkg/validator)
Core libraries (pkg/errors, pkg/k8s)
Docs/examples (docs/, examples/)
Other: GitHub Actions GPU CI and Chainsaw health checks

Implementation Notes

Collapse H100 training and inference into a shared reusable workflow, with only intent/platform/Chainsaw path differing between Kubeflow training and Dynamo/kgateway inference.
Keep PR GPU execution on the pull-request/<number> push mirror after ok-to-test; remove the direct pull_request H100 path because the self-hosted GPU runners reject it before checkout.
Pin H100 nvkind inputs through .settings.yaml, including the nvkind git ref and H100-specific kind node image.
Add runner and cluster setup hardening: stale same-name kind cluster cleanup, H100/Docker/GPU/disk/memory preflight checks, kind node image warming, inotify tuning, Docker GPU validation, and bounded Docker/system diagnostics.
Add opt-in control-plane static-pod resource patches and leader-election tuning for kind H100 jobs only.
Add a reusable control-plane health action with bounded kubectl/docker calls, restart diagnostics, stability-window checks for new restarts, and optional recovery for unhealthy control-plane static pods.
Split aicr-build so the runtime install builds the CLI and snapshot agent first, while conformance validator images are built later only when needed.
Keep runtime bundle installs strict in H100 jobs (wait: true, best_effort: false) while preserving default behavior for other callers.
Add a fast post-runtime component health gate for monitoring, KAI, and the job-specific platform components: Kubeflow Trainer for training, Dynamo plus kgateway for inference.
Add a pre-Chainsaw kube-prometheus-operator rollout gate with operator logs, previous logs, deployment describe output, and monitoring events.
Run Chainsaw with --skip-delete in the H100 health phase so cleanup API slowness does not mask the failed assertion; the workflow-level kind cleanup still runs afterward.
Add kind-only kube-prometheus-stack overrides: disable Grafana, skip default rule creation, scope operator-owned monitoring CR watches to the monitoring namespace, and tune prometheus-operator probes/resources for slow kind control planes. These overrides are not applied to EKS/GKE/AKS/OKE or other real cluster recipes.
Tune slow cold-run components: KAI uses a 30-minute timeout with a capped retry budget, Dynamo uses a 20-minute timeout with --server-side=false when supported by Helm, and kube-prometheus-stack keeps the default timeout/retry budget because retries help more than a longer single Helm wait when Grafana/operator startup stalls.
Extract bounded GPU diagnostics into composite/scripts so failures capture control-plane, monitoring, GPU operator, KAI, Kubeflow, Dynamo, kgateway, and kind/Docker state without unbounded kubectl hangs.

Testing

bash -n \
  .github/actions/gpu-cluster-setup/configure-nvidia-container-toolkit.sh \
  .github/actions/gpu-test-cleanup/collect-debug-artifacts.sh \
  .github/scripts/gpu-chainsaw-health.sh \
  .github/scripts/gpu-debug-diagnostics.sh \
  .github/scripts/gpu-runtime-component-health.sh \
  .github/scripts/gpu-smoke-run-nvidia-smi.sh \
  .github/scripts/gpu-smoke-show-nvidia-smi-log.sh \
  .github/scripts/gpu-validate-conformance.sh

shellcheck \
  .github/actions/gpu-cluster-setup/configure-nvidia-container-toolkit.sh \
  .github/actions/gpu-test-cleanup/collect-debug-artifacts.sh \
  .github/scripts/gpu-runtime-component-health.sh \
  .github/scripts/gpu-smoke-show-nvidia-smi-log.sh

yamllint \
  .github/actions/check-control-plane-health/action.yml \
  .github/actions/gpu-debug-diagnostics/action.yml \
  .github/actions/gpu-smoke-nvidia-smi/action.yml \
  .github/workflows/gpu-h100-kind-runtime-test.yaml \
  .github/workflows/gpu-h100-training-test.yaml \
  .github/workflows/gpu-h100-inference-test.yaml \
  recipes/overlays/kind.yaml \
  tests/chainsaw/ai-conformance/kind-common/assert-monitoring.yaml \
  tests/chainsaw/ai-conformance/kind-training-kubeflow/chainsaw-test.yaml \
  tests/chainsaw/ai-conformance/kind-inference-dynamo/chainsaw-test.yaml

actionlint \
  .github/workflows/gpu-h100-kind-runtime-test.yaml \
  .github/workflows/gpu-h100-training-test.yaml \
  .github/workflows/gpu-h100-inference-test.yaml

go test ./pkg/bundler/deployer/helm

git diff --check

Targeted checks passed. Full make qualify was not run locally because this PR is scoped to CI workflows/actions, shell helpers, kind recipe overrides, Chainsaw assertions, and Helm deploy-template behavior; the targeted checks above cover the changed surfaces without launching the full project qualification suite.

Risk Assessment

Low - Isolated change, well-tested, easy to revert
Medium - Touches multiple components or has broader impact
High - Breaking change, affects critical paths, or complex rollout

Rollout notes: Runtime strictness, control-plane resource patches, leader-election tuning, kind-only prometheus-operator tuning, Grafana disablement, Chainsaw --skip-delete, and control-plane recovery are scoped to the H100/kind CI path or are opt-in for other callers. Real cluster overlays such as EKS, GKE, AKS, and OKE are not changed by the kind-only monitoring overrides.

Checklist

Tests pass locally (make test with -race)
Linter passes (make lint)
I did not skip/disable tests to make CI green
I added/updated tests for new functionality
I updated docs if user-facing behavior changed
Changes follow existing patterns in the codebase
Commits are cryptographically signed (git commit -S) - GPG signing info

coderabbitai · 2026-04-25T20:29:18Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

Adds a new composite GitHub Action to check Kind control-plane health with extensive stability, probe, lease, and recovery options, plus its executable script. Refactors many composite actions to call external scripts (build, install, deploy, validate, diagnostics, cleanup), expands action inputs/outputs (timeouts, versions, cluster tuning, artifact controls, nvkind/h100 images), introduces numerous new CI/cluster helper scripts and reusable workflows, updates Helm deploy templates/tests for per-component timeout/retry/server-side logic, and adds test assertions, monitoring variants, and .settings.yaml entries for nvkind and H100 node image.

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Description check	⚠️ Warning	The pull request has no author-provided description; only empty/missing content was found in the description field.	Add a meaningful pull request description that explains the purpose, scope, and key changes of this PR for reviewers.

✅ Passed checks (3 passed)

Check name	Status	Explanation
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Title check	✅ Passed	The PR title clearly summarizes the main change: hardening H100 GPU tests on slow runners through CI improvements, configuration, and new health checks.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/gpu-h100-inference-test.yaml:
- Around line 138-142: Multiple workflow steps invoke the composite action
./.github/actions/check-control-plane-health but only the first step ("Check
control plane health after runtime install") sets wait_timeout; update every
invocation of that action (including the other control-plane health gate steps
referenced) to pass an explicit wait_timeout input so they no longer rely on the
action default. Locate each step that uses uses:
./.github/actions/check-control-plane-health and add a wait_timeout: value
(consistent across gates, e.g., "PT5M" or your chosen timeout) alongside the
existing inputs such as cluster_name to make the timeout behavior explicit and
predictable.

In @.github/workflows/gpu-h100-training-test.yaml:
- Around line 135-139: Multiple control-plane health check steps call the local
action ./.github/actions/check-control-plane-health but only the first step
("Check control plane health after runtime install") sets wait_timeout: 120s
while later invocations fall back to 60s; update every invocation of the
check-control-plane-health action to include the wait_timeout input set to
"120s" (i.e., add or change wait_timeout: 120s in each step that uses
./.github/actions/check-control-plane-health) so all checkpoints use the same
timeout.

In `@pkg/bundler/deployer/helm/helm_test.go`:
- Around line 614-616: The test currently only checks for the literal `-gt 2`,
which misses other caps; update the assertion in helm_test.go (where `script` is
inspected) to detect any retry cap by searching for patterns like comparisons
(`-gt <number>`) or direct numeric assignments to `COMPONENT_MAX_RETRIES` (e.g.,
`COMPONENT_MAX_RETRIES="1"`). Replace the simple contains check with a
regex-based check that fails if `COMPONENT_MAX_RETRIES` is followed by any
`-gt\s*\d+` or is assigned a fixed numeric value, and update the t.Error message
accordingly so the test reliably ensures kube-prometheus-stack retries are not
capped.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 1688b843-61f0-45b4-a69c-ec0594443a68

📥 Commits

Reviewing files that changed from the base of the PR and between 3c9f6ec and 2cd8d2c.

📒 Files selected for processing (11)

.github/actions/check-control-plane-health/action.yml
.github/actions/gpu-operator-install/action.yml
.github/actions/gpu-snapshot-validate/action.yml
.github/actions/gpu-test-cleanup/action.yml
.github/actions/install-karpenter-kwok/action.yml
.github/workflows/gpu-h100-inference-test.yaml
.github/workflows/gpu-h100-training-test.yaml
kwok/scripts/install-karpenter-kwok.sh
pkg/bundler/deployer/helm/helm_test.go
pkg/bundler/deployer/helm/templates/deploy.sh.tmpl
tests/chainsaw/ai-conformance/kind-training-kubeflow/chainsaw-test.yaml

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/actions/check-control-plane-health/action.yml:
- Around line 34-37: The MAX_RESTARTS input (from max_restarts) must be
validated as a non-negative integer before any arithmetic (e.g., comparisons at
line 95) to avoid crashes or incorrect semantics; update the action run script
to read the max_restarts input, trim it, check it matches a non-negative integer
regex (^[0-9]+$), reject empty or negative values with a clear error (exit
non-zero) or normalize to a documented default, and only then convert to an
integer for comparisons and arithmetic where MAX_RESTARTS is used.

In `@pkg/bundler/deployer/helm/helm_test.go`:
- Around line 523-628: There are two near-duplicate tests
(TestGenerate_DeployScriptDynamoPlatformTimeout and
TestGenerate_DeployScriptKubePrometheusStackTimeout) that should be consolidated
into a single table-driven test to avoid duplication; create a new test (e.g.,
TestGenerate_DeployScriptComponentTimeouts) that iterates over a slice of cases
describing the component name, namespace, chart/version/source, expected
COMPONENT_HELM_TIMEOUT string, expected COMPONENT_MAX_RETRIES string or comment,
and whether the retry-cap block should be present. For each case construct a
Generator with the appropriate RecipeResult and ComponentValues, call
Generator.Generate(ctx, outputDir), read deploy.sh and assert expectations (use
existing checks like strings.Contains, strings.Index and the retryCapPattern
regex) per-case; factor common assertions into small helper closures if helpful.
Ensure you reference and reuse Generator.Generate, the deploy.sh file parsing,
the regex retryCapPattern and the original assertion strings (e.g.,
`COMPONENT_HELM_TIMEOUT="..."`, `COMPONENT_MAX_RETRIES="..."`) so the new
table-driven test covers both previous scenarios and is easy to extend.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: c7308148-2dfc-4358-b1c3-9682057633a7

📥 Commits

Reviewing files that changed from the base of the PR and between 2cd8d2c and 7639a85.

📒 Files selected for processing (11)

.github/actions/check-control-plane-health/action.yml
.github/actions/gpu-operator-install/action.yml
.github/actions/gpu-snapshot-validate/action.yml
.github/actions/gpu-test-cleanup/action.yml
.github/actions/install-karpenter-kwok/action.yml
.github/workflows/gpu-h100-inference-test.yaml
.github/workflows/gpu-h100-training-test.yaml
kwok/scripts/install-karpenter-kwok.sh
pkg/bundler/deployer/helm/helm_test.go
pkg/bundler/deployer/helm/templates/deploy.sh.tmpl
tests/chainsaw/ai-conformance/kind-training-kubeflow/chainsaw-test.yaml

coderabbitai

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

.github/workflows/gpu-h100-training-test.yaml (1)
42-43: 🧹 Nitpick | 🔵 Trivial

Disable persisted git credentials in check-paths.

This job only checks out the workspace so dorny/paths-filter can diff locally; it never needs to push. Set persist-credentials: false here too for defense in depth.

Based on learnings, in GitHub Actions workflows that only need the workspace/files, "set persist-credentials: false on actions/checkout to avoid persisting a token and to provide defense-in-depth against unintended git authentication."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/gpu-h100-training-test.yaml around lines 42 - 43, The
checkout step currently uses actions/checkout (actions/checkout@de0fac2...) and
leaves credentials persisted; update the checkout invocation used before
dorny/paths-filter to include persist-credentials: false so the job does not
store a token (this is the actions/checkout call referenced in the diff) while
keeping the existing dorny/paths-filter step unchanged.

♻️ Duplicate comments (1)

.github/actions/gpu-test-cleanup/action.yml (1)

75-119: ⚠️ Potential issue | 🟠 Major

Bound the remaining Docker calls in diagnostics and cleanup.

docker ps, docker rm -f, and the prune calls are still unbounded here. If the daemon is wedged on a slow runner, this action can still sit until the job timeout after a failure/cancel, which defeats the new bounded-diagnostics path.

⏱️ Suggested fix

+        docker_timeout() {
+          local limit="$1"; shift
+          timeout "${limit}" docker "$@"
+        }
+
-        docker ps --filter "label=io.x-k8s.kind.cluster=${KIND_CLUSTER_NAME}" \
+        docker_timeout 30s ps --filter "label=io.x-k8s.kind.cluster=${KIND_CLUSTER_NAME}" \
           --format '{{.Names}}' | sort | while read -r node_container; do
             ...
           done

-        remaining_containers=$(docker ps -aq --filter "label=${kind_cluster_label}")
+        remaining_containers=$(docker_timeout 30s ps -aq --filter "label=${kind_cluster_label}")
         if [[ -n "${remaining_containers}" ]]; then
           echo "Removing leftover kind containers for ${KIND_CLUSTER_NAME}:"
-          docker ps -a --filter "label=${kind_cluster_label}"
-          docker rm -f ${remaining_containers} || true
+          docker_timeout 30s ps -a --filter "label=${kind_cluster_label}"
+          docker_timeout 30s rm -f ${remaining_containers} || true
         fi
-        timeout 60s docker builder prune -f --filter "until=24h" || true
-        timeout 60s docker system prune -f --filter "until=24h" || true
+        docker_timeout 60s builder prune -f --filter "until=24h" || true
+        docker_timeout 60s system prune -f --filter "until=24h" || true

Also applies to: 136-145

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In @.github/actions/gpu-test-cleanup/action.yml around lines 75 - 119, The
diagnostics loop still runs unbounded Docker commands (notably the initial
docker ps that feeds the while-read, docker rm -f and any docker prune calls
referenced elsewhere), so wrap those calls with a timeout and fail-safe to
prevent hanging; specifically, add a timeout (e.g., timeout 30s or 120s as
appropriate) around the docker ps invocation that generates node_container, and
around any docker rm -f and docker system prune/docker image prune/docker
container prune invocations, and ensure the while read -r node_container loop
guards against empty input (already done via [[ -z "${node_container}" ]] &&
continue) so keep that check; update the unique places in this file where docker
ps, docker rm -f, and prune calls occur to use timeout and "|| true" so the
action cannot block the job.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/actions/check-control-plane-health/action.yml:
- Around line 296-299: The lease lookups use a hardcoded namespace "kube-system"
instead of the configurable NAMESPACE; update the kubectl lease commands that
reference kube-system (the lines invoking kubectl_kind -n kube-system get lease
"${component}" -o yaml) to use the action variable ${NAMESPACE} instead, and
make the same change for the other occurrence around the block that handles
COMPONENTS (also referenced in dump_static_pod_runtime_diagnostics and
kubectl_kind calls) so lease diagnostics run against the configured namespace.

In @.github/actions/gpu-cluster-setup/action.yml:
- Around line 566-572: The current block captures the exit code of the timed
nvkind invocation in create_status but treats all non-zero codes the same;
update the post-run handling (the variables and commands around timeout
"${CLUSTER_CREATE_TIMEOUT}" nvkind cluster create "${CREATE_ARGS[@]}" and
create_status) to distinguish a timeout (exit code 124) from other failures: if
create_status == 124, keep the current warning and continue to post-create
checks; for any other non-zero create_status, exit immediately (fail the action)
so real nvkind errors surface instead of being masked.

In `@docs/user/cli-reference.md`:
- Around line 1325-1326: Update the compound-duration wording in the CLI
reference text for the two install descriptions: replace the unhyphenated "30
minute" with "30-minute" in the kai-scheduler sentence and replace "20 minute"
with "20-minute" in the dynamo-platform sentence so the compound adjectives are
hyphenated correctly; look for the sentences mentioning "kai-scheduler" and
"dynamo-platform" and change those two duration tokens.

In `@pkg/bundler/deployer/helm/templates/deploy.sh.tmpl`:
- Around line 147-172: The diagnostics in dump_dynamo_platform_helm_diagnostics
execute kubectl commands with no request timeout and therefore can block inside
helm_retry; update each kubectl invocation in
dump_dynamo_platform_helm_diagnostics (including get
deployments/jobs/pods/events, describe, and all kubectl logs calls) to include a
bounded request timeout (e.g. --request-timeout=10s) or use a
KUBECTL_REQUEST_TIMEOUT variable and apply it to every kubectl call so these
diagnostics will time out and allow helm_retry backoff and retry budget to
proceed.
- Around line 415-418: Add a Helm version preflight: before appending the
--server-side=false flag to COMPONENT_HELM_APPLY_ARGS, detect whether the
installed helm client supports that flag (e.g., run a lightweight version check
or attempt `helm upgrade --help` parsing) and only push --server-side=false when
supported; if unsupported, leave COMPONENT_HELM_APPLY_ARGS without that flag and
emit a clear warning. Update the generated README to explicitly state that
bundles require a Helm client that supports --server-side=false (Helm v4+ or
compatible) and document the preflight check behavior and fallback warning.

---

Outside diff comments:
In @.github/workflows/gpu-h100-training-test.yaml:
- Around line 42-43: The checkout step currently uses actions/checkout
(actions/checkout@de0fac2...) and leaves credentials persisted; update the
checkout invocation used before dorny/paths-filter to include
persist-credentials: false so the job does not store a token (this is the
actions/checkout call referenced in the diff) while keeping the existing
dorny/paths-filter step unchanged.

---

Duplicate comments:
In @.github/actions/gpu-test-cleanup/action.yml:
- Around line 75-119: The diagnostics loop still runs unbounded Docker commands
(notably the initial docker ps that feeds the while-read, docker rm -f and any
docker prune calls referenced elsewhere), so wrap those calls with a timeout and
fail-safe to prevent hanging; specifically, add a timeout (e.g., timeout 30s or
120s as appropriate) around the docker ps invocation that generates
node_container, and around any docker rm -f and docker system prune/docker image
prune/docker container prune invocations, and ensure the while read -r
node_container loop guards against empty input (already done via [[ -z
"${node_container}" ]] && continue) so keep that check; update the unique places
in this file where docker ps, docker rm -f, and prune calls occur to use timeout
and "|| true" so the action cannot block the job.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 87b104b5-3f6c-41f6-bd3f-838be471a608

📥 Commits

Reviewing files that changed from the base of the PR and between 73dfd1e and 22304c9.

📒 Files selected for processing (17)

.github/actions/check-control-plane-health/action.yml
.github/actions/gpu-cluster-setup/action.yml
.github/actions/gpu-operator-install/action.yml
.github/actions/gpu-test-cleanup/action.yml
.github/actions/install-karpenter-kwok/action.yml
.github/workflows/gpu-h100-inference-test.yaml
.github/workflows/gpu-h100-training-test.yaml
.github/workflows/gpu-smoke-test.yaml
docs/user/cli-reference.md
pkg/bundler/deployer/helm/helm_test.go
pkg/bundler/deployer/helm/templates/README.md.tmpl
pkg/bundler/deployer/helm/templates/component-README.md.tmpl
pkg/bundler/deployer/helm/templates/deploy.sh.tmpl
tests/chainsaw/ai-conformance/README.md
tests/chainsaw/ai-conformance/kind-common/assert-monitoring.yaml
tests/chainsaw/ai-conformance/kind-inference-dynamo/chainsaw-test.yaml
tests/chainsaw/ai-conformance/kind-training-kubeflow/chainsaw-test.yaml

yuanchen8911 · 2026-04-26T22:15:03Z

Also fixed the non-inline review notes in ff99d6a: the GPU path checkouts now set persist-credentials: false, GPU cleanup Docker calls are bounded with timeouts, and the large GPU-specific shell bodies were moved into checked-in scripts so CI logs no longer print the full script source. The setup verifier also now falls back to the static pod manifest when live mirror pod command/args are empty, which should address the latest H100 setup failure around --leader-elect-lease-duration=300s.

coderabbitai

Actionable comments posted: 25

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

.github/actions/gpu-test-cleanup/collect-debug-artifacts.sh (1)
105-105: 🧹 Nitpick | 🔵 Trivial

Add trailing newline.

POSIX convention; some tools behave unexpectedly without it.
♻️ Proposed fix
   done || true
+
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/actions/gpu-test-cleanup/collect-debug-artifacts.sh at line 105, The
file collect-debug-artifacts.sh is missing a trailing newline at EOF; open
.github/actions/gpu-test-cleanup/collect-debug-artifacts.sh and add a single
newline character at the end of the file (ensure the last line ends with '\n')
so the script follows POSIX convention and avoids tools that expect a trailing
newline.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/actions/aicr-build/build-snapshot-agent.sh:
- Around line 6-10: The unbounded docker build invocation "docker build -t
ko.local:smoke-test -f - . <<'DOCKERFILE'" can hang CI; wrap this command with a
timeout (e.g., using the GNU timeout utility) to bound the build (select a
sensible duration such as 10m and make it fail-fast on timeout) so the script
exits non‑zero when the build exceeds the limit; update the line to invoke
timeout against that docker build invocation and ensure the script
checks/propagates the exit code.

In @.github/actions/aicr-build/build-validator-images.sh:
- Around line 1-3: Add the missing Apache 2.0 license header to this script for
consistency: insert the same Apache-2.0 comment block used in the other scripts
(e.g., delete-stale-kind-cluster.sh, runner-preflight.sh) immediately after the
existing shebang (#!/usr/bin/env bash) and before the set -euo pipefail line so
the license is present while preserving the script's execution settings.
- Around line 21-22: The loop that unconditionally runs mkdir -p for each phase
(using PHASES and the loop variable phase) can mask missing test fixtures;
change it to check whether validators/${phase}/testdata already exists and if
not emit a warning mentioning the phase before creating it (or skip creation if
you want to force explicit testdata presence), so update the for loop that calls
mkdir -p "validators/${phase}/testdata" to perform an existence test and log a
warning referencing the phase when the directory is absent prior to creation.

In @.github/actions/gpu-cluster-setup/check-runner-capacity.sh:
- Around line 17-20: The current check reads available space in GiB
(free_disk_gb) which loses precision near the threshold; change to byte-accurate
comparison by reading bytes from df (use df -B1 --output=avail / or similar)
into a variable like free_disk_bytes, compute the threshold in bytes by
multiplying MIN_FREE_DISK_GB by 1024^3 (e.g., MIN_FREE_DISK_BYTES) and compare
free_disk_bytes < MIN_FREE_DISK_BYTES, then update the error message that
references free_disk_gb/MIN_FREE_DISK_GB to report both bytes and a
human-readable GiB value for clarity; update references to free_disk_gb and the
comparison branch accordingly so the script gates on exact bytes.

In @.github/actions/gpu-cluster-setup/configure-nvidia-container-toolkit.sh:
- Line 21: The single-line restart "sudo systemctl restart docker" can hang and
must be bounded and verified before continuing; replace or wrap that call in
logic that applies a restart timeout and then polls the daemon health (e.g., use
"systemctl restart" with a watchdog/timeout and then loop using "systemctl
is-active docker" or "docker info" with backoff) to confirm the service is
"active" within a defined timeout, and if it does not become healthy, log the
error and exit non‑zero. Update the script around the existing restart
invocation to implement the timeout, polling/retry, and explicit failure
handling so later steps only run when Docker is confirmed healthy.

In @.github/actions/gpu-cluster-setup/create-gpu-kind-cluster.sh:
- Around line 96-99: Add an explicit cleanup trap to remove the temporary
resources created by mktemp: after creating patch_dir and config_template,
register a trap that runs on EXIT (and optionally on INT/TERM) to rm -rf
"$patch_dir" "$config_template"; ensure the trap only attempts removal if the
variables are non-empty to avoid accidental deletes (e.g., test -n "$patch_dir"
&& rm -rf "$patch_dir"), and place the trap definition immediately after the
mktemp calls so cleanup always runs even on errors in functions referencing
patch_dir and config_template.

In @.github/actions/gpu-cluster-setup/delete-stale-kind-cluster.sh:
- Around line 27-32: The docker rm invocation uses an unquoted variable
remaining_containers which can suffer word-splitting/glob issues; update the
removal to safely handle the list by quoting the variable (e.g., docker rm -f
"${remaining_containers}") or switch to a safe alternative like piping docker ps
-q into xargs; change the line that calls docker rm -f ${remaining_containers}
and reference remaining_containers, kind_cluster_label and KIND_CLUSTER_NAME
when making the adjustment.

In @.github/actions/gpu-cluster-setup/install-nvkind.sh:
- Around line 18-19: Validate that NVKIND_VERSION is set and non-empty before
running go install (fail early with a clear error if missing), and after
installing invoke the nvkind binary via its explicit installed path (e.g., "$(go
env GOPATH)/bin/nvkind" or "$GOPATH/bin/nvkind") rather than relying solely on
PATH; reference NVKIND_VERSION for the version check and use the explicit nvkind
executable path for the --help invocation so failures are diagnosed even if PATH
assumptions from actions/setup-go change.

In @.github/actions/gpu-cluster-setup/warm-kind-node-image.sh:
- Around line 24-27: The current check reads GiB-rounded available space into
free_disk_gb and can misjudge values near the threshold; change to read
byte-accurate available bytes (use df --output=avail -B1 / and capture the
integer into avail_bytes), compute required_bytes by multiplying
MIN_FREE_DISK_GB by 1024^3 (e.g. required_bytes=$((MIN_FREE_DISK_GB * 1024 *
1024 * 1024))), then compare integers (if (( avail_bytes < required_bytes ));
then echo the error using a human-friendly GiB representation while including
the exact byte values) — update the variables free_disk_gb → avail_bytes (or add
avail_bytes) and the comparison that currently references MIN_FREE_DISK_GB so
the script uses byte-accurate comparison.

In @.github/actions/gpu-operator-install/generate-bundle.sh:
- Around line 18-23: The bundle generation can be contaminated by leftover files
in the bundle/ directory; before running "./aicr bundle --recipe recipe.yaml ...
--output bundle" ensure you clean or recreate the output directory (e.g., check
for bundle/ and rm -rf its contents or remove and mkdir bundle) so the
subsequent "ls -la bundle/" and produced artifacts are deterministic and contain
only newly generated files.

In @.github/actions/gpu-operator-install/install-gpu-operator-helm.sh:
- Around line 18-29: The helm commands should be made idempotent and the chart
version pinned: change the `helm repo add nvidia
https://helm.ngc.nvidia.com/nvidia` invocation to include `--force-update` so
repeated runs don't fail, and modify the `helm upgrade -i ... gpu-operator
nvidia/gpu-operator` invocation to include a specific chart version via
`--version=<SEMVER>` (replace <SEMVER> with the desired pinned version) so the
GPU Operator chart cannot drift; keep the same kube context variable
`KIND_CLUSTER_NAME` and namespace `gpu-operator` flags when applying these
changes.

In @.github/actions/gpu-operator-install/wait-gpu-operands-bundle.sh:
- Around line 23-38: The kubectl calls in the GPU operand wait script can hang;
update the three invocations that use kubectl
--context="kind-${KIND_CLUSTER_NAME}" -n gpu-operator to include a client-side
request timeout so CI can't block: add --request-timeout=10s (or 15s) to the get
daemonset used to compute count, add --request-timeout=300s (matching rollout)
to the kubectl rollout status command, and add --request-timeout=10s (or 15s) to
the final kubectl get pods call so all three commands (the get daemonset/count,
rollout status, and get pods) are bounded. Ensure you modify the exact commands
shown (the kubectl get daemonset used to set the count, the kubectl rollout
status daemonset -l app=nvidia-device-plugin-daemonset --timeout=300s, and the
kubectl get pods) to include the --request-timeout flag.

In @.github/actions/gpu-operator-install/wait-gpu-operands-helm.sh:
- Around line 18-22: Before calling "kubectl ... rollout status daemonset -l
app=nvidia-device-plugin-daemonset", add a loop that polls for the DaemonSet
existence using "kubectl --context=\"kind-${KIND_CLUSTER_NAME}\" -n gpu-operator
get daemonset -l app=nvidia-device-plugin-daemonset" with a total timeout
(≈300s) and short sleep intervals; only proceed to run "rollout status" once the
get returns a result, and if the DaemonSet never appears within the timeout exit
with a non-zero status or print a clear error—this mirrors the wait logic used
in wait-gpu-operands-bundle.sh and prevents immediate rollout status failures
when the controller hasn't created the DaemonSet yet.

In @.github/actions/gpu-snapshot-validate/validate-snapshot-gpu.sh:
- Around line 19-28: The script reads GPU_COUNT from snapshot.yaml and then does
a numeric comparison, but if GPU_COUNT is missing or non-numeric the arithmetic
test in the if using "${GPU_COUNT}" will raise a shell error; update the code
around the GPU_COUNT extraction and the comparison (referencing the GPU_COUNT
variable and the if that tests -lt ${MIN_GPU_COUNT}) to first validate that
GPU_COUNT is a non-empty integer (e.g., using a regex like '^[0-9]+$' or an
explicit numeric cast check), emit a clear ::error:: and exit if it is
missing/invalid, and only then perform the numeric -lt comparison against
MIN_GPU_COUNT so the CI shows a helpful error rather than a shell arithmetic
failure.

In @.github/actions/install-karpenter-kwok/install-karpenter-kwok.sh:
- Around line 27-35: The validator validate_seconds_input currently allows "0";
update it to reject zero by adding a numeric-range check after the integer regex
test: keep the integer check for input_value, then if [[ "${input_value}" -le 0
]]; then echo an ::error:: message referencing the input_name (e.g.
KO_BUILD_TIMEOUT) and exit 1. Ensure this change is applied inside
validate_seconds_input so any callers (like KO_BUILD_TIMEOUT) will fail on 0.
- Around line 40-41: Replace the unbounded kubectl apply call with a bounded
invocation by prefixing with timeout 30s and adding the kubectl flag
--request-timeout=10s; specifically update the command kubectl
--context="kind-${KIND_CLUSTER_NAME}" apply -f
kwok/manifests/karpenter/nodepool.yaml to use timeout 30s and
--request-timeout=10s so the step will fail fast on slow or unhealthy clusters.

In @.github/scripts/gpu-chainsaw-health.sh:
- Around line 4-10: The script relies on implicit argument handling; add
explicit validation for the positional argument test_dir (check if $1 is empty,
print a clear usage message and exit 1) and introduce an optional global timeout
variable (e.g., GLOBAL_TIMEOUT or CHAINSAW_TIMEOUT) that defaults to 120s and is
used to set both --cleanup-timeout and --delete-timeout on the chainsaw test
command; update references to the test_dir variable and replace the hard-coded
120s values with the timeout variable so CI failures show a clear error and
configurable timeout.

In @.github/scripts/gpu-debug-diagnostics.sh:
- Around line 1-8: Add a short explanatory comment near the top (above the set
-o pipefail line) explaining why -e and -u are intentionally omitted (so the
diagnostic script continues running and can collect partial/failure outputs
rather than aborting) and reference the script's behavior via the mode variable
and kubectl_kind() helper; mention that we still use pipefail to detect pipeline
failures while allowing individual commands to fail for diagnostics.
- Around line 86-105: The training case duplicates a subset of GPU checks but
omits the shared diagnostics helper; update the training branch to call
print_common_gpu_diagnostics (same helper used by smoke and inference) instead
of repeating partial checks so you include the full common diagnostics (e.g.,
ClusterPolicy YAML) and then keep only training-specific checks like
kubectl_kind -n kubeflow get deployment kubeflow-trainer-controller-manager and
the Kubeflow-specific pod/CRD/webhook lines; remove the inline duplicated
node/GPU operator checks or ensure they run after print_common_gpu_diagnostics
to avoid duplication.

In @.github/scripts/gpu-smoke-run-nvidia-smi.sh:
- Around line 4-24: The script currently creates a fixed-named Pod
"gpu-smoke-test" with kubectl apply which can reuse an old Succeeded Pod; change
the manifest metadata to use metadata.generateName: gpu-smoke-test- and add a
label like metadata.labels.app: gpu-smoke-test, then create the Pod with kubectl
--context="kind-${KIND_CLUSTER_NAME}" create -f - (not apply) so each run makes
a fresh Pod; finally update the two wait calls that reference pod/gpu-smoke-test
to first resolve the actual pod name (e.g. POD=$(kubectl
--context="kind-${KIND_CLUSTER_NAME}" get pods -l app=gpu-smoke-test -o
jsonpath='{.items[0].metadata.name}') ) and use that variable in the kubectl
wait commands instead of the fixed "gpu-smoke-test".

In @.github/scripts/gpu-validate-conformance.sh:
- Around line 1-2: Add the missing Apache-2.0 license header to this script to
match other PR-added scripts: insert the standard Apache 2.0 comment block
(exact text used in other scripts in this PR) immediately after the existing
shebang line "#!/usr/bin/env bash" and before the "set -euo pipefail" line so
the license appears at the top but the shebang remains the first line.
- Around line 4-15: The --image flag currently uses an incorrect tag
(--image=ko.local:smoke-test); update the image reference to match the validator
build output from build-validator-images.sh by using the registry and path
AICR_VALIDATOR_IMAGE_REGISTRY/ aicr-validators/<phase>:latest (for this script
set the image to ko.local/aicr-validators/conformance:latest), i.e. change the
--image value passed to ./aicr validate to
ko.local/aicr-validators/conformance:latest so it matches the created artifact
and AICR_VALIDATOR_IMAGE_REGISTRY variable.

In `@docs/user/cli-reference.md`:
- Around line 1323-1327: Update the `dynamo-platform` bullet to document the
conditional fallback: state that `deploy.sh` attempts to add
`--server-side=false` for client-side apply but only does so when the Helm
client supports that flag; if the flag is unsupported the script logs a warning
and proceeds without `--server-side=false`, so client-side apply is not
guaranteed. Reference `dynamo-platform`, `deploy.sh`, and the
`--server-side=false` flag in the text and keep the distinction between the
attempted mitigation and the actual behavior when the Helm client lacks support.

In `@pkg/bundler/deployer/helm/templates/deploy.sh.tmpl`:
- Line 27: The retry diagnostics call used by helm_retry currently invokes
dump_kai_scheduler_helm_diagnostics which runs kubectl without a timeout; update
dump_kai_scheduler_helm_diagnostics (and any kubectl calls it makes) to pass the
KUBECTL_REQUEST_TIMEOUT variable so kubectl invocations are bounded, and ensure
helm_retry still calls the helper (no behavior change) but the helper now uses
KUBECTL_REQUEST_TIMEOUT; apply the same change for the other occurrence
referenced at lines ~197 so all kai-scheduler kubectl calls use the
KUBECTL_REQUEST_TIMEOUT variable.
- Around line 423-427: The current check using helm_supports_server_side_flag is
insufficient because it only inspects help text; update that detection to
enforce a minimum Helm version (>= v4.0.5) or explicitly verify the bugfix by
running a targeted behavior check, then only set
COMPONENT_HELM_APPLY_ARGS=(--server-side=false) when the check passes.
Concretely, modify the helm_supports_server_side_flag implementation to parse
helm version output (e.g. helm version --short or --template '{{.Version}}') and
compare semver >= 4.0.5, or perform an executable regression test that runs a
controlled helm upgrade --install --server-side=false against a throwaway chart
to confirm the flag is honored; keep the rest of the branch (echo warning and
skipping the flag) unchanged when the version/test fails.

---

Outside diff comments:
In @.github/actions/gpu-test-cleanup/collect-debug-artifacts.sh:
- Line 105: The file collect-debug-artifacts.sh is missing a trailing newline at
EOF; open .github/actions/gpu-test-cleanup/collect-debug-artifacts.sh and add a
single newline character at the end of the file (ensure the last line ends with
'\n') so the script follows POSIX convention and avoids tools that expect a
trailing newline.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: f1a01b81-5620-4db8-8991-dd88b9807e41

📥 Commits

Reviewing files that changed from the base of the PR and between 553051c and ff99d6a.

📒 Files selected for processing (48)

.github/actions/aicr-build/action.yml
.github/actions/aicr-build/build-cli.sh
.github/actions/aicr-build/build-snapshot-agent.sh
.github/actions/aicr-build/build-validator-images.sh
.github/actions/aicr-build/stage-cli.sh
.github/actions/check-control-plane-health/action.yml
.github/actions/check-control-plane-health/check-control-plane-health.sh
.github/actions/gpu-cluster-setup/action.yml
.github/actions/gpu-cluster-setup/check-runner-capacity.sh
.github/actions/gpu-cluster-setup/configure-nvidia-container-toolkit.sh
.github/actions/gpu-cluster-setup/create-gpu-kind-cluster.sh
.github/actions/gpu-cluster-setup/delete-stale-kind-cluster.sh
.github/actions/gpu-cluster-setup/increase-inotify-limits.sh
.github/actions/gpu-cluster-setup/install-nvkind.sh
.github/actions/gpu-cluster-setup/runner-preflight.sh
.github/actions/gpu-cluster-setup/validate-docker-gpu-access.sh
.github/actions/gpu-cluster-setup/validate-env.sh
.github/actions/gpu-cluster-setup/warm-kind-node-image.sh
.github/actions/gpu-operator-install/action.yml
.github/actions/gpu-operator-install/generate-bundle.sh
.github/actions/gpu-operator-install/generate-recipe.sh
.github/actions/gpu-operator-install/install-bundle.sh
.github/actions/gpu-operator-install/install-gpu-operator-helm.sh
.github/actions/gpu-operator-install/wait-gpu-operands-bundle.sh
.github/actions/gpu-operator-install/wait-gpu-operands-helm.sh
.github/actions/gpu-snapshot-validate/action.yml
.github/actions/gpu-snapshot-validate/debug-snapshot-job.sh
.github/actions/gpu-snapshot-validate/run-snapshot.sh
.github/actions/gpu-snapshot-validate/validate-snapshot-gpu.sh
.github/actions/gpu-test-cleanup/action.yml
.github/actions/gpu-test-cleanup/cleanup-kind-cluster.sh
.github/actions/gpu-test-cleanup/collect-debug-artifacts.sh
.github/actions/gpu-test-cleanup/export-kind-logs.sh
.github/actions/install-karpenter-kwok/action.yml
.github/actions/install-karpenter-kwok/install-karpenter-kwok.sh
.github/actions/install-karpenter-kwok/resolve-versions.sh
.github/scripts/gpu-chainsaw-health.sh
.github/scripts/gpu-debug-diagnostics.sh
.github/scripts/gpu-smoke-run-nvidia-smi.sh
.github/scripts/gpu-validate-conformance.sh
.github/workflows/gpu-h100-inference-test.yaml
.github/workflows/gpu-h100-training-test.yaml
.github/workflows/gpu-smoke-test.yaml
docs/user/cli-reference.md
pkg/bundler/deployer/helm/helm_test.go
pkg/bundler/deployer/helm/templates/README.md.tmpl
pkg/bundler/deployer/helm/templates/component-README.md.tmpl
pkg/bundler/deployer/helm/templates/deploy.sh.tmpl

coderabbitai

Actionable comments posted: 12

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

.github/workflows/gpu-smoke-test.yaml (1)
110-129: 🛠️ Refactor suggestion | 🟠 Major

Move these script invocations behind local composite actions.

These Layer 3 workflow steps are calling repo scripts directly instead of going through Layer 2 composites. That makes the smoke pod flow and the shared diagnostics runner bypass the repository’s required workflow/action boundary.

As per coding guidelines, .github/workflows/**/*.yaml: GitHub Actions must use three-layer composite architecture: Layer 1 primitives, Layer 2 composed actions, Layer 3 workflows.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/gpu-smoke-test.yaml around lines 110 - 129, The workflow
is calling repository scripts directly (bash
.github/scripts/gpu-smoke-run-nvidia-smi.sh, bash
.github/scripts/gpu-smoke-show-nvidia-smi-log.sh, bash
.github/scripts/gpu-debug-diagnostics.sh) instead of using Layer 2 composite
actions; update the steps named "Run nvidia-smi in a pod", "Show nvidia-smi
output", and the conditional "Debug diagnostics" to invoke the corresponding
Layer 2 composites (or create them if missing) so the workflow uses the repo's
three-layer action boundary, passing through the same inputs/env (e.g.,
GPU_TEST_DIAGNOSTIC_MODE for the debug step) and preserving the "Snapshot and
validate GPU" step that already uses ./.github/actions/gpu-snapshot-validate.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/actions/aicr-build/build-cli.sh:
- Around line 19-22: The script skips building if a pre-existing dist/aicr
binary exists, which allows stale or injected binaries; update build-cli.sh to
not trust workspace state by removing the early-exit check for dist/aicr and
always run the build step for ./cmd/aicr (producing/overwriting dist/aicr), or
alternatively validate the binary's provenance before skipping; locate the if [[
-x dist/aicr ]] block and replace it with logic that unconditionally builds
(e.g., invoke go build for ./cmd/aicr to produce dist/aicr) so CI always
produces a fresh binary.

In @.github/actions/aicr-build/build-validator-images.sh:
- Around line 18-27: The script reads VALIDATOR_PHASES directly which will
trigger an "unbound variable" under set -u; change the checks to use safe
expansions (e.g. use ${VALIDATOR_PHASES:-} when testing emptiness) and/or
explicitly guard the variable existence (use [[ -v VALIDATOR_PHASES ]] or a
separate required-var check) before expanding it, then assign PHASES using the
safe expansion; apply the same hardening to the other occurrence referenced by
the review (the second VALIDATOR_PHASES read around line 52) so all env reads
are safe under strict mode.

In @.github/actions/gpu-cluster-setup/configure-nvidia-container-toolkit.sh:
- Line 24: The health check currently runs "docker info" with no timeout in the
condition "if systemctl is-active --quiet docker && docker info >/dev/null 2>&1;
then", which can block the 30-attempt loop; change the condition to run docker
info under a bounded timeout (e.g., use the timeout command like "timeout 5s
docker info >/dev/null 2>&1") and/or introduce a DOCKER_INFO_TIMEOUT variable
used in that condition so a hung docker info call fails quickly and the loop
continues; keep the rest of the logic the same but ensure the command's exit
status is used exactly as before.
- Line 21: The `timeout 120s sudo systemctl restart docker` can abort the script
under set -e and skip the diagnostics; change it to run with set +e, capture its
exit code (e.g., store $? in a variable), then re-enable set -e and
conditionally run the existing diagnostics block when the captured exit code is
non‑zero, finally exit with that code; locate the restart invocation string
"timeout 120s sudo systemctl restart docker" and update the surrounding logic
accordingly so diagnostics always run on failure.

In @.github/actions/gpu-cluster-setup/create-gpu-kind-cluster.sh:
- Around line 424-433: The code currently treats a static pod manifest match
(static_pod_manifest_contains_arg) as a successful runtime verification even
when neither the live mirror pod nor the running CRI container show the arg;
change the logic so that static_pod_manifest_contains_arg does NOT cause an
early success/return. Instead, only accept the manifest match when the earlier
running_static_pod_container_contains_arg branch indicated the live mirror pod
omitted args (i.e., the special “live mirror pod omitted it” case), otherwise
treat a manifest-only match as a non-success: either emit the warning but fail
the check (return non-zero) or trigger a retry/wait loop that re-checks
control_plane_command_args and running_static_pod_container_contains_arg until
the running container shows the arg; update the branches around
control_plane_command_args, running_static_pod_container_contains_arg, and
static_pod_manifest_contains_arg accordingly.
- Around line 291-314: The post-create probe commands (the kubectl/docker calls
executed after nvkind cluster create, e.g. the kubectl
wait/cluster-info/get/describe block and the docker inspect loop shown around
the kubectl/docker calls) must be bounded with timeouts and kube client request
timeouts; wrap each long-running shell call with timeout (and add kubectl's
--request-timeout) to avoid indefinite hangs, and apply the same pattern inside
the helper functions referenced in the diff (the helper blocks around lines
326–417). Also change assert_control_plane_arg() so it fails the script on
manifest mismatch instead of emitting a warning and returning — require a live
verification of the static pod or running container (e.g., check the
control-plane pod status or container existence and exit non-zero if not
matching) so the workflow cannot proceed before the static pod/container picks
up the patch.

In @.github/actions/gpu-cluster-setup/delete-stale-kind-cluster.sh:
- Line 31: The mapfile usage around the docker_timeout calls (used to fill
remaining_containers) hides failures because mapfile reads an empty stream on
docker_timeout timeout/exit; extract that logic into a helper (e.g.,
run_docker_cmd or get_docker_output) that runs the command, captures
stdout/stderr, checks the exit status, and returns non-zero or aborts with a
clear error if docker_timeout failed; then use the helper's stdout to populate
remaining_containers (via mapfile/readarray from the captured output) in both
places where mapfile < <(docker_timeout ps -aq --filter
"label=${kind_cluster_label}") is used so container-query failures are detected
and handled explicitly.

In @.github/actions/gpu-operator-install/wait-gpu-operands-bundle.sh:
- Around line 22-31: The kubectl call inside the retry loop can cause the whole
script to exit under set -euo pipefail on transient errors; change the loop so
you run kubectl in a way that captures/guards failures (e.g., run kubectl and
capture its stdout/stderr into a variable and handle a non-zero exit by treating
count as zero rather than allowing the script to exit) when evaluating count for
the label app=nvidia-device-plugin-daemonset in namespace gpu-operator (use
KIND_CLUSTER_NAME for context); after the loop add an explicit failure path that
logs a clear error/diagnostics and exits non-zero if no DaemonSet was found
after all attempts.

In @.github/actions/gpu-operator-install/wait-gpu-operands-helm.sh:
- Around line 34-35: The rollout wait for the NVIDIA device-plugin DaemonSet is
masking failures by appending "|| true" to the kubectl rollout status command;
remove the trailing "|| true" from the rollout status invocation that targets
the daemonset with label "app=nvidia-device-plugin-daemonset" (the kubectl
--request-timeout=300s ... rollout status daemonset -l
app=nvidia-device-plugin-daemonset --timeout=300s command) so that the step
fails if the DaemonSet never becomes ready, allowing CI to surface the error
instead of passing silently.

In @.github/actions/gpu-test-cleanup/collect-debug-artifacts.sh:
- Around line 19-20: The artifact directory is only created but not cleared, so
stale files on persistent runners can be reused; update the script at the
location that currently runs mkdir -p /tmp/debug-artifacts to first remove or
empty the directory (e.g., delete all contents or recreate the directory) before
creating it, ensuring /tmp/debug-artifacts is clean prior to collecting
artifacts; keep the rest of the script (including CONTROL_PLANE_COMPONENTS)
unchanged.

In @.github/actions/install-karpenter-kwok/install-karpenter-kwok.sh:
- Around line 27-39: The validate_seconds_input function currently uses Bash
arithmetic (( input_value <= 0 )) which treats numbers with leading zeros as
octal and errors on values like "08"; update the validation to force decimal
parsing by using 10# when performing numeric comparisons (e.g., use
10#${input_value} in the arithmetic test) and ensure the regex check still runs
first to guarantee digits-only input before converting; apply this change inside
validate_seconds_input so error messages remain the same but comparisons never
interpret inputs as octal.

In `@pkg/bundler/deployer/helm/templates/deploy.sh.tmpl`:
- Around line 72-89: The current version parsing strips prerelease info by doing
version="${version%%-*}", which turns "v4.0.5-rc.1" into "4.0.5" and lets RCs
pass; remove that truncation and instead parse the full SemVer (keep the
prerelease part) with a regex that captures major/minor/patch and an optional
prerelease, then explicitly reject when a prerelease token is present (e.g. when
BASH_REMATCH captures a prerelease), or replace the ad-hoc logic with a
semver-aware comparison routine; update the code around the version variable,
the regex check that populates BASH_REMATCH, and the major/minor/patch
comparisons to incorporate this prerelease rejection so only true releases >=
4.0.5 pass.

---

Outside diff comments:
In @.github/workflows/gpu-smoke-test.yaml:
- Around line 110-129: The workflow is calling repository scripts directly (bash
.github/scripts/gpu-smoke-run-nvidia-smi.sh, bash
.github/scripts/gpu-smoke-show-nvidia-smi-log.sh, bash
.github/scripts/gpu-debug-diagnostics.sh) instead of using Layer 2 composite
actions; update the steps named "Run nvidia-smi in a pod", "Show nvidia-smi
output", and the conditional "Debug diagnostics" to invoke the corresponding
Layer 2 composites (or create them if missing) so the workflow uses the repo's
three-layer action boundary, passing through the same inputs/env (e.g.,
GPU_TEST_DIAGNOSTIC_MODE for the debug step) and preserving the "Snapshot and
validate GPU" step that already uses ./.github/actions/gpu-snapshot-validate.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 5421c998-74b3-4d63-829a-d4e04eeec949

📥 Commits

Reviewing files that changed from the base of the PR and between ff99d6a and 1d55329.

📒 Files selected for processing (29)

.github/actions/README.md
.github/actions/aicr-build/build-cli.sh
.github/actions/aicr-build/build-snapshot-agent.sh
.github/actions/aicr-build/build-validator-images.sh
.github/actions/aicr-build/stage-cli.sh
.github/actions/gpu-cluster-setup/check-runner-capacity.sh
.github/actions/gpu-cluster-setup/configure-nvidia-container-toolkit.sh
.github/actions/gpu-cluster-setup/create-gpu-kind-cluster.sh
.github/actions/gpu-cluster-setup/delete-stale-kind-cluster.sh
.github/actions/gpu-cluster-setup/install-nvkind.sh
.github/actions/gpu-cluster-setup/warm-kind-node-image.sh
.github/actions/gpu-operator-install/generate-bundle.sh
.github/actions/gpu-operator-install/install-gpu-operator-helm.sh
.github/actions/gpu-operator-install/wait-gpu-operands-bundle.sh
.github/actions/gpu-operator-install/wait-gpu-operands-helm.sh
.github/actions/gpu-snapshot-validate/validate-snapshot-gpu.sh
.github/actions/gpu-test-cleanup/collect-debug-artifacts.sh
.github/actions/install-karpenter-kwok/install-karpenter-kwok.sh
.github/scripts/gpu-chainsaw-health.sh
.github/scripts/gpu-debug-diagnostics.sh
.github/scripts/gpu-smoke-run-nvidia-smi.sh
.github/scripts/gpu-smoke-show-nvidia-smi-log.sh
.github/scripts/gpu-validate-conformance.sh
.github/workflows/gpu-smoke-test.yaml
docs/user/cli-reference.md
pkg/bundler/deployer/helm/helm_test.go
pkg/bundler/deployer/helm/templates/README.md.tmpl
pkg/bundler/deployer/helm/templates/component-README.md.tmpl
pkg/bundler/deployer/helm/templates/deploy.sh.tmpl

coderabbitai

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

recipes/overlays/kind.yaml (1)

167-175: 🧹 Nitpick | 🔵 Trivial

Dead configuration: Grafana resource limits are defined but Grafana is disabled.

Lines 169-175 define resources for Grafana, but enabled: false on line 168 means these values are never applied.

♻️ Suggested cleanup

         grafana:
           enabled: false
-          resources:
-            requests:
-              cpu: 100m
-              memory: 128Mi
-            limits:
-              cpu: 500m
-              memory: 512Mi

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@recipes/overlays/kind.yaml` around lines 167 - 175, The grafana block
currently sets enabled: false while still declaring resources, which is dead
configuration; either remove the resources subtree (resources ->
requests/limits) under the grafana entry or make grafana enabled if you intended
to apply those values—specifically edit the grafana section to either delete the
resources key(s) (cpu/memory requests and limits) or change grafana.enabled to
true (or refactor into a conditional/values-driven template) so the resources
are only present when grafana is enabled.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/actions/gpu-cluster-setup/configure-nvidia-container-toolkit.sh:
- Line 28: The journalctl invocation in configure-nvidia-container-toolkit.sh
may fail under non-root CI users; update the command "journalctl -u docker
--since \"10 minutes ago\" --no-pager || true" to run through sudo (i.e., prefix
with sudo) so diagnostics aren't lost, keeping the existing options and the "||
true" fallback intact so the script still continues if journalctl fails.

In @.github/actions/gpu-test-cleanup/collect-debug-artifacts.sh:
- Around line 98-145: The while-read loop that iterates over kind node
containers (starts at the pipeline: docker_timeout 30s ps ... | while read -r
node_container; do) has no overall timeout or elapsed-time guard, so a sequence
of slow/hanging docker_timeout exec calls can make the entire job run for
unbounded time; wrap the whole loop with an overall timeout (e.g., run the
pipeline under timeout or record start time and break when elapsed exceeds a
configured MAX_LOOP_SECONDS) and ensure you still capture partial artifacts and
exit cleanly when the timeout is reached; modify the invocation that begins with
docker_timeout 30s ps ... | while read -r node_container to use the chosen
global timeout mechanism and check/stop using node_container references so
per-node reporting files (e.g., "${node_file}-control-plane-cri.txt") are
produced for processed nodes before aborting.

In @.github/scripts/gpu-runtime-component-health.sh:
- Around line 63-73: The wait_for_required_object function currently does a
single kubectl_kind get and fails immediately; change it to poll until
COMPONENT_HEALTH_TIMEOUT elapses: inside wait_for_required_object (and using
kubectl_kind) implement a retry loop that attempts kubectl_kind get
"${resource}" repeatedly (e.g., sleep 1 between attempts), tracking elapsed time
(using SECONDS or timestamp) and returning 0 on success; if the timeout is
reached, emit the existing "::error::Required object is missing: ${resource}"
plus additional diagnostics (kubectl_kind describe/get -o yaml for the resource
or related CRD/webhook Deployment status) and return 1. Ensure the function uses
the COMPONENT_HEALTH_TIMEOUT variable for the timeout value and keeps existing
return codes.

In @.github/scripts/gpu-smoke-show-nvidia-smi-log.sh:
- Around line 18-43: The script currently removes POD_NAME_FILE after running
kubectl_kind logs, but if kubectl_kind logs fails under set -e the script exits
early and the cache file remains; add a cleanup that always runs by registering
a trap to remove POD_NAME_FILE on EXIT (referencing POD_NAME_FILE) or
alternatively run kubectl_kind logs without aborting (capture its exit code into
a variable) and then unconditionally rm -f "${POD_NAME_FILE}" and exit with that
captured code; ensure you reference kubectl_kind, pod_name and POD_NAME_FILE
when making the change so the cache is always cleared even if log collection
fails.

In `@pkg/bundler/deployer/helm/helm_test.go`:
- Around line 591-594: The test currently checks rejectSnippets only against
componentBlock, which omits the helper function and thus fails to verify that
the prerelease guard appears in helm_supports_server_side_false_install; update
the test to scan the full generated script (variable/script content) for the
rejectSnippets or add a separate assertion that specifically inspects the
helm_supports_server_side_false_install() helper body to ensure the prerelease
guard lines (`local prerelease`, `if [[ -n "${prerelease}" ]]`, etc.) are
present; reference the existing test variables rejectSnippets, componentBlock
and the helper name helm_supports_server_side_false_install to locate and change
the assertion accordingly.

In `@pkg/bundler/deployer/helm/templates/deploy.sh.tmpl`:
- Around line 176-180: The function dump_dynamo_platform_helm_diagnostics
currently only runs when namespace == "dynamo-system", which skips diagnostics
if the dynamo-platform component is installed to a different namespace; change
the gating to detect the component identity instead of the hardcoded namespace
by either (A) checking the component name/ID (e.g., use the component identifier
passed into helm_retry or a global/component variable) inside
dump_dynamo_platform_helm_diagnostics, or (B) modify helm_retry to accept both
component name and namespace and call dump_dynamo_platform_helm_diagnostics
based on the component name while still passing the rendered namespace through
for kubectl invocations; update the calls around helm_retry and the conditional
in dump_dynamo_platform_helm_diagnostics (also at the similar check around lines
220-221) so diagnostics run whenever the component is dynamo-platform regardless
of the installation namespace.

---

Outside diff comments:
In `@recipes/overlays/kind.yaml`:
- Around line 167-175: The grafana block currently sets enabled: false while
still declaring resources, which is dead configuration; either remove the
resources subtree (resources -> requests/limits) under the grafana entry or make
grafana enabled if you intended to apply those values—specifically edit the
grafana section to either delete the resources key(s) (cpu/memory requests and
limits) or change grafana.enabled to true (or refactor into a
conditional/values-driven template) so the resources are only present when
grafana is enabled.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 128343c2-55b0-420f-9bb1-5713a0a0f894

📥 Commits

Reviewing files that changed from the base of the PR and between 1d55329 and e1e931a.

📒 Files selected for processing (26)

.github/actions/aicr-build/build-cli.sh
.github/actions/aicr-build/build-validator-images.sh
.github/actions/check-control-plane-health/action.yml
.github/actions/check-control-plane-health/check-control-plane-health.sh
.github/actions/gpu-cluster-setup/configure-nvidia-container-toolkit.sh
.github/actions/gpu-cluster-setup/create-gpu-kind-cluster.sh
.github/actions/gpu-cluster-setup/delete-stale-kind-cluster.sh
.github/actions/gpu-debug-diagnostics/action.yml
.github/actions/gpu-operator-install/wait-gpu-operands-bundle.sh
.github/actions/gpu-operator-install/wait-gpu-operands-helm.sh
.github/actions/gpu-smoke-nvidia-smi/action.yml
.github/actions/gpu-test-cleanup/action.yml
.github/actions/gpu-test-cleanup/collect-debug-artifacts.sh
.github/actions/install-karpenter-kwok/install-karpenter-kwok.sh
.github/scripts/gpu-chainsaw-health.sh
.github/scripts/gpu-debug-diagnostics.sh
.github/scripts/gpu-runtime-component-health.sh
.github/scripts/gpu-smoke-run-nvidia-smi.sh
.github/scripts/gpu-smoke-show-nvidia-smi-log.sh
.github/workflows/gpu-h100-inference-test.yaml
.github/workflows/gpu-h100-kind-runtime-test.yaml
.github/workflows/gpu-h100-training-test.yaml
.github/workflows/gpu-smoke-test.yaml
pkg/bundler/deployer/helm/helm_test.go
pkg/bundler/deployer/helm/templates/deploy.sh.tmpl
recipes/overlays/kind.yaml

yuanchen8911 added the bug label Apr 25, 2026

github-actions Bot added area/ci area/bundler and removed bug labels Apr 25, 2026

yuanchen8911 mentioned this pull request Apr 25, 2026

ci: improve GPU test reliability #675

Closed

25 tasks

github-actions Bot added the size/M label Apr 25, 2026

yuanchen8911 added the bug label Apr 25, 2026

yuanchen8911 force-pushed the ci-gpu-kind-fail-fast branch 3 times, most recently from 4dcf95f to 35fc3ee Compare April 25, 2026 18:27

github-actions Bot added area/tests size/L and removed size/M labels Apr 25, 2026

yuanchen8911 force-pushed the ci-gpu-kind-fail-fast branch from 35fc3ee to 2cd8d2c Compare April 25, 2026 19:36

yuanchen8911 marked this pull request as ready for review April 25, 2026 20:25

yuanchen8911 requested review from a team as code owners April 25, 2026 20:25

coderabbitai Bot reviewed Apr 25, 2026

View reviewed changes

Comment thread .github/workflows/gpu-h100-inference-test.yaml Outdated

Comment thread .github/workflows/gpu-h100-training-test.yaml Outdated

Comment thread pkg/bundler/deployer/helm/helm_test.go Outdated

yuanchen8911 force-pushed the ci-gpu-kind-fail-fast branch from 2cd8d2c to 7639a85 Compare April 25, 2026 20:33

coderabbitai Bot reviewed Apr 25, 2026

View reviewed changes

Comment thread .github/actions/check-control-plane-health/action.yml

Comment thread pkg/bundler/deployer/helm/helm_test.go Outdated

ci: harden H100 GPU qualification

40abdad

yuanchen8911 force-pushed the ci-gpu-kind-fail-fast branch from 7639a85 to 40abdad Compare April 25, 2026 20:44

yuanchen8911 marked this pull request as draft April 25, 2026 20:55

yuanchen8911 added 3 commits April 25, 2026 18:06

ci: split aicr build artifacts

ae0b717

ci: harden gpu cluster setup

33417c3

ci: tune H100 workflow reliability

b7ffe6b

github-actions Bot added size/XL and removed size/L labels Apr 26, 2026

yuanchen8911 marked this pull request as ready for review April 26, 2026 01:17

ci: harden H100 control plane and Dynamo retries

23b3d5f

yuanchen8911 force-pushed the ci-gpu-kind-fail-fast branch from 6e5500b to 23b3d5f Compare April 26, 2026 18:07

yuanchen8911 added 2 commits April 26, 2026 14:36

Stabilize H100 GPU CI checks

0f308d6

Address H100 CI review feedback

22304c9

yuanchen8911 marked this pull request as ready for review April 26, 2026 21:46

yuanchen8911 added the run-gpu-tests Trigger GPU CI tests on PR label Apr 26, 2026

Avoid pull request events for GPU runners

553051c

yuanchen8911 marked this pull request as draft April 26, 2026 21:54

coderabbitai Bot reviewed Apr 26, 2026

View reviewed changes

yuanchen8911 marked this pull request as ready for review April 26, 2026 22:12

Address GPU CI review feedback

ff99d6a

coderabbitai Bot reviewed Apr 26, 2026

View reviewed changes

ci: address GPU workflow review hardening

1d55329

yuanchen8911 commented Apr 26, 2026

View reviewed changes

Comment thread .github/actions/install-karpenter-kwok/install-karpenter-kwok.sh

coderabbitai Bot reviewed Apr 26, 2026

View reviewed changes

yuanchen8911 marked this pull request as draft April 26, 2026 22:53

yuanchen8911 marked this pull request as ready for review April 27, 2026 02:46

ci: harden H100 kind runtime workflows

e1e931a

github-actions Bot added the area/recipes label Apr 27, 2026

coderabbitai Bot reviewed Apr 27, 2026

View reviewed changes

ci: address follow-up GPU review feedback

727f1b0

yuanchen8911 marked this pull request as draft April 27, 2026 03:03

yuanchen8911 changed the title ~~ci: harden H100 GPU tests on slow runners~~ (WIP) ci: harden H100 GPU tests on slow runners Apr 27, 2026

Merge branch 'main' into ci-gpu-kind-fail-fast

4219435

mchmarny assigned yuanchen8911 Apr 27, 2026

This was referenced Apr 27, 2026

Refactor and harden H100 GPU CI workflow #694

Merged

H100 Kind CI unstable on GPU runners #696

Closed

yuanchen8911 removed the run-gpu-tests Trigger GPU CI tests on PR label Apr 27, 2026

yuanchen8911 closed this Apr 28, 2026

Conversation

yuanchen8911 commented Apr 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Motivation / Context

Type of Change

Component(s) Affected

Implementation Notes

Testing

Risk Assessment

Checklist

Uh oh!

coderabbitai Bot commented Apr 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Estimated code review effort

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

yuanchen8911 commented Apr 26, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

yuanchen8911 commented Apr 25, 2026 •

edited

Loading

coderabbitai Bot commented Apr 25, 2026 •

edited

Loading