Refactor and harden H100 GPU CI workflow#694
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThis pull request substantially refactors GPU cluster setup, runtime installation, and test workflows. It extracts inline GitHub Actions logic into standalone Bash scripts (including cluster creation, control-plane health checking, GPU validation, and debug diagnostics), introduces new composite Actions wrapping these scripts, removes the Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes 🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 12
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
.github/workflows/gpu-smoke-test.yaml (1)
47-67:⚠️ Potential issue | 🟠 MajorRestore push-mirror coverage for runtime Helm/version changes.
This filter no longer matches
.settings.yamlorpkg/bundler/deployer/helm/**, even though this workflow still resolves pinned versions and installs the runtime with Helm on Lines 110-114. A PR that only changes version pins or the Helm deployer will now skip the push-mirror smoke run.🔧 Suggested filter additions
matched: - '.github/workflows/gpu-smoke-test.yaml' + - '.settings.yaml' - '.github/actions/gpu-cluster-setup/**' - '.github/actions/runtime-install/**' - '.github/actions/aicr-build/**' - '.github/actions/gpu-debug-diagnostics/**' - '.github/actions/gpu-test-cleanup/**' - '.github/actions/gpu-smoke-nvidia-smi/**' - '.github/actions/load-versions/**' - '.github/scripts/gpu-debug-diagnostics.sh' - '.github/scripts/gpu-smoke-run-nvidia-smi.sh' - '.github/scripts/gpu-smoke-show-nvidia-smi-log.sh' + - 'pkg/bundler/deployer/helm/**' - 'pkg/collector/**'🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/gpu-smoke-test.yaml around lines 47 - 67, The workflow's filters in gpu-smoke-test.yaml no longer cover changes to .settings.yaml and the Helm deployer, so add the missing patterns (e.g., '.settings.yaml' and 'pkg/bundler/deployer/helm/**') to the matched list in the filters block so pushes that only change version pins or Helm deployer files still trigger the push-mirror smoke run; update the matched array alongside the existing entries to ensure the runtime Helm/version resolution steps (around the workflow's Helm install section) are covered.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/actions/aicr-build/action.yml:
- Around line 19-25: The action currently always runs build-cli.sh when
build_snapshot_agent == 'true', so the input build_cli is ineffective; update
the composite action so the build-cli.sh invocation is gated by the build_cli
input (e.g., run build-cli.sh only if inputs.build_cli == 'true') rather than by
build_snapshot_agent, or if the intended behavior is staging-only, rename and
document the input (e.g., inputs.stage_cli) and update all references; change
the conditional that triggers build-cli.sh to reference inputs.build_cli (or the
new name) and adjust any related checks in the composite action steps that
currently use build_snapshot_agent to ensure callers can truly skip CLI
building.
In @.github/actions/check-control-plane-health/check-control-plane-health.sh:
- Around line 47-52: The current check unconditionally treats MAX_RESTARTS as
deprecated; adjust the conditional to preserve the legacy absolute ceiling when
the stability window is zero. Specifically, in the block that trims MAX_RESTARTS
and emits the deprecation warning, change the logic to only ignore/warn about
MAX_RESTARTS when STABILITY_WINDOW is not "0s" (i.e., when stability_window is
non-zero or unset); when STABILITY_WINDOW == "0s" keep MAX_RESTARTS enforced (do
not downgrade or ignore it). Make the same corresponding change where the same
deprecation check appears (the other occurrence referenced around the 471-473
region) so both places respect STABILITY_WINDOW == "0s".
In @.github/actions/gpu-test-cleanup/collect-debug-artifacts.sh:
- Around line 23-24: Validate that MAX_KIND_NODE_ARTIFACT_SECONDS contains an
integer right after it's defined in collect-debug-artifacts.sh; if it doesn't
match a digits-only pattern, reset it to the default (600) and emit a warning to
stderr so subsequent arithmetic uses a safe integer value. Specifically, add a
check using a digits regex (e.g. '^[0-9]+$') against
MAX_KIND_NODE_ARTIFACT_SECONDS in the script and coerce/assign the safe integer
before any arithmetic or comparisons that use MAX_KIND_NODE_ARTIFACT_SECONDS or
the related COLLECT_NODE_RUNTIME_ARTIFACTS logic.
In @.github/actions/gpu-test-cleanup/export-kind-logs.sh:
- Line 19: The script currently silences failures from the timeout 300s kind
export logs /tmp/kind-logs --name "${KIND_CLUSTER_NAME}" || true command; change
it to run that command normally, capture its exit code, and if non‑zero emit a
warning to stderr (including the exit code and the cluster name) so diagnostics
are visible, then continue (non‑blocking). Locate the exact command string
timeout 300s kind export logs /tmp/kind-logs --name "${KIND_CLUSTER_NAME}" and
replace the trailing "|| true" with logic that checks $? and prints a clear
warning if it failed.
In @.github/actions/install-karpenter-kwok/install-karpenter-kwok.sh:
- Around line 45-47: Add a precondition check for the environment variable
KIND_CLUSTER_NAME in install-karpenter-kwok.sh before it is dereferenced in the
kubectl command; if KIND_CLUSTER_NAME is empty or unset (under set -u) emit a
clear error and exit non‑zero so the script fails with a helpful message instead
of an unbound-variable error, then proceed to run the existing timeout ...
kubectl --context="kind-${KIND_CLUSTER_NAME}" apply ... command once the
variable is validated.
In @.github/actions/install-karpenter-kwok/resolve-versions.sh:
- Line 21: Replace the current host-toolchain lookup that sets the `go` variable
(the line that echoes/uses `go=$(go env GOVERSION)` in resolve-versions.sh) with
logic that parses the pinned Go version from the repository’s `.settings.yaml`
and assigns that value to `go` before echoing; update the script to extract the
correct key from `.settings.yaml` (the repo’s centralized Go/tool version entry)
so both local `make tools-setup` and GitHub Actions use the same source of
truth.
In @.github/actions/runtime-install/action.yml:
- Around line 18-21: The action silently no-ops for any inputs.method not equal
to "helm" or "bundle"; add input validation so unsupported values fail fast: in
the action.yml inputs.method block add an allowed-values: ["helm","bundle"]
entry (and/or add an explicit step early that uses an if expression to check
contains(['helm','bundle'], inputs.method) and calls exit 1 or uses fail action
when false) so invalid inputs immediately cause the job to error instead of
skipping the steps referenced later (see inputs.method and the conditional logic
controlling steps ~47-86).
In @.github/actions/runtime-install/generate-recipe.sh:
- Around line 24-25: Update the optional env var check to be defensive against
unset variables: in generate-recipe.sh change the conditional and the expansion
that reference AICR_PLATFORM to use the default-empty expansion
${AICR_PLATFORM:-} (i.e., use [[ -n "${AICR_PLATFORM:-}" ]] and when appending
RECIPE_ARGS use --platform "${AICR_PLATFORM:-}") so the script is safe to run
with set -u even if AICR_PLATFORM is not defined; this affects the AICR_PLATFORM
check and the RECIPE_ARGS+=(--platform "...") addition.
In @.github/scripts/gpu-debug-diagnostics.sh:
- Around line 43-49: The remaining host diagnostic commands (docker info, docker
version, nvidia-smi -L, nvidia-smi, kind get clusters) are unbounded and can
hang; wrap each invocation with a hard timeout (use the timeout utility, e.g.,
timeout 10s <command>) and preserve the existing non-failing behavior by
appending || true so diagnostics never block or fail the script; update all
occurrences of these commands in the script to use timeout and keep their
stdout/stderr redirection as before.
In @.github/scripts/gpu-runtime-component-health.sh:
- Around line 23-24: Validate the mode variable immediately after it's read (the
mode="$1" assignment) and before any monitoring or kai-scheduler wait logic
runs: add an explicit check comparing mode against the allowed values and if it
doesn't match, log/echo the "unknown runtime component health mode" error and
exit non‑zero so the script doesn't proceed to run the monitoring/kai-scheduler
waits and consume the full timeout; apply the same early validation to the other
mode-handling block around the monitoring/kai-scheduler waits (the section
covering lines ~115-133) so both entry points reject invalid modes before
performing waits.
In @.github/scripts/gpu-smoke-run-nvidia-smi.sh:
- Around line 39-45: The pod spec uses container name "nvidia-smi" and runs
command ["nvidia-smi"] but the image is set to "ubuntu:22.04" which lacks the
nvidia-smi binary; update the container image to a CUDA image that includes
nvidia-smi (for example "nvidia/cuda:12.9.0-base-ubuntu22.04" or
"nvidia/cuda:12.9.0-base-ubuntu24.04") so the "nvidia-smi" command executes
successfully, matching the pattern used in dra-gpu-test.yaml, hpa-gpu-test.yaml,
and gang-scheduling-test.yaml.
In @.github/workflows/gpu-h100-kind-runtime-test.yaml:
- Around line 114-116: Replace the direct shell invocation in the workflow step
titled "Check runtime component health" that runs bash
.github/scripts/gpu-runtime-component-health.sh "${{ inputs.intent }}" with a
Layer-2 composite action: create a new action in .github/actions (e.g.,
gpu-runtime-component-health) that wraps the script (exposes the same input
"intent" and runs the script), commit that composite action, and update the
workflow step to call it via uses:
./.github/actions/gpu-runtime-component-health with appropriate with: intent:
${{ inputs.intent }}; apply the same pattern for the other direct runs mentioned
(lines around 150-152 and 161-165) so all scripts are invoked via Layer-2
composed actions instead of run:.
---
Outside diff comments:
In @.github/workflows/gpu-smoke-test.yaml:
- Around line 47-67: The workflow's filters in gpu-smoke-test.yaml no longer
cover changes to .settings.yaml and the Helm deployer, so add the missing
patterns (e.g., '.settings.yaml' and 'pkg/bundler/deployer/helm/**') to the
matched list in the filters block so pushes that only change version pins or
Helm deployer files still trigger the push-mirror smoke run; update the matched
array alongside the existing entries to ensure the runtime Helm/version
resolution steps (around the workflow's Helm install section) are covered.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 338f5fbe-d6d1-4b6e-b673-73ea8714dcee
📒 Files selected for processing (71)
.coderabbit.yaml.github/actions/README.md.github/actions/aicr-build/action.yml.github/actions/aicr-build/build-cli.sh.github/actions/aicr-build/build-snapshot-agent.sh.github/actions/aicr-build/build-validator-images.sh.github/actions/aicr-build/stage-cli.sh.github/actions/check-control-plane-health/action.yml.github/actions/check-control-plane-health/check-control-plane-health.sh.github/actions/gpu-cluster-setup/action.yml.github/actions/gpu-cluster-setup/check-runner-capacity.sh.github/actions/gpu-cluster-setup/configure-nvidia-container-toolkit.sh.github/actions/gpu-cluster-setup/create-gpu-kind-cluster.sh.github/actions/gpu-cluster-setup/delete-stale-kind-cluster.sh.github/actions/gpu-cluster-setup/increase-inotify-limits.sh.github/actions/gpu-cluster-setup/install-nvkind.sh.github/actions/gpu-cluster-setup/runner-preflight.sh.github/actions/gpu-cluster-setup/validate-docker-gpu-access.sh.github/actions/gpu-cluster-setup/validate-env.sh.github/actions/gpu-cluster-setup/warm-kind-node-image.sh.github/actions/gpu-debug-diagnostics/action.yml.github/actions/gpu-operator-install/action.yml.github/actions/gpu-smoke-nvidia-smi/action.yml.github/actions/gpu-snapshot-validate/action.yml.github/actions/gpu-snapshot-validate/debug-snapshot-job.sh.github/actions/gpu-snapshot-validate/run-snapshot.sh.github/actions/gpu-snapshot-validate/validate-snapshot-gpu.sh.github/actions/gpu-test-cleanup/action.yml.github/actions/gpu-test-cleanup/cleanup-kind-cluster.sh.github/actions/gpu-test-cleanup/collect-debug-artifacts.sh.github/actions/gpu-test-cleanup/export-kind-logs.sh.github/actions/install-karpenter-kwok/action.yml.github/actions/install-karpenter-kwok/install-karpenter-kwok.sh.github/actions/install-karpenter-kwok/resolve-versions.sh.github/actions/load-versions/action.yml.github/actions/runtime-install/action.yml.github/actions/runtime-install/generate-bundle.sh.github/actions/runtime-install/generate-recipe.sh.github/actions/runtime-install/install-bundle.sh.github/actions/runtime-install/install-gpu-operator-helm.sh.github/actions/runtime-install/wait-gpu-operands-bundle.sh.github/actions/runtime-install/wait-gpu-operands-helm.sh.github/scripts/gpu-chainsaw-health.sh.github/scripts/gpu-debug-diagnostics.sh.github/scripts/gpu-runtime-component-health.sh.github/scripts/gpu-smoke-run-nvidia-smi.sh.github/scripts/gpu-smoke-show-nvidia-smi-log.sh.github/scripts/gpu-validate-conformance.sh.github/workflows/build-attested.yaml.github/workflows/gpu-h100-inference-test.yaml.github/workflows/gpu-h100-kind-runtime-test.yaml.github/workflows/gpu-h100-training-test.yaml.github/workflows/gpu-smoke-test.yaml.github/workflows/packaging.yaml.settings.yamldocs/user/cli-reference.mdkwok/scripts/install-karpenter-kwok.shkwok/scripts/run-all-recipes.shpkg/cli/doc.gopkg/cli/root.gopkg/cli/skill.gopkg/cli/skill_claude_code.gopkg/cli/skill_codex.gopkg/cli/skill_generator.gopkg/cli/skill_test.gorecipes/overlays/base.yamlrecipes/overlays/kind.yamltests/chainsaw/ai-conformance/README.mdtests/chainsaw/ai-conformance/kind-common/assert-monitoring.yamltests/chainsaw/ai-conformance/kind-inference-dynamo/chainsaw-test.yamltests/chainsaw/ai-conformance/kind-training-kubeflow/chainsaw-test.yaml
💤 Files with no reviewable changes (1)
- .github/actions/gpu-operator-install/action.yml
There was a problem hiding this comment.
Actionable comments posted: 8
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/actions/check-control-plane-health/check-control-plane-health.sh:
- Around line 188-190: The warning message printed when restart_count > 0 is
misleading when STABILITY_WINDOW is "0s" because historical restarts may later
trigger a hard fail under MAX_RESTARTS_LIMIT; update the echo in the branch that
checks restart_count (the code referencing variables component and
restart_count) to explicitly state that historical restarts are being noted and
that when stability_window (STABILITY_WINDOW or variable stability_window)
equals 0s those historical restarts may be enforced as hard failures per
MAX_RESTARTS_LIMIT, e.g. include the stability_window value and mention
MAX_RESTARTS_LIMIT in the message so triage knows the restarts are not purely
diagnostic.
In @.github/actions/gpu-cluster-setup/action.yml:
- Around line 183-187: The "Print GPUs (nvkind)" step is unintentionally
overriding KIND_CLUSTER_NAME with an undeclared input (inputs.cluster_name);
remove that env override or change it to reference the existing environment
variable so the step uses the job-level value. Edit the "Print GPUs (nvkind)"
step to either delete the env: KIND_CLUSTER_NAME line entirely or set it to
KIND_CLUSTER_NAME: ${{ env.KIND_CLUSTER_NAME }}, ensuring the nvkind cluster
print-gpus --name="${KIND_CLUSTER_NAME}" invocation uses the correct cluster
name.
In @.github/actions/gpu-test-cleanup/cleanup-kind-cluster.sh:
- Around line 25-30: The deletion block that gathers remaining_containers and
calls docker_timeout ... rm -f is swallowing failures with "|| true" and never
verifies removal; change it so after calling docker_timeout 30s rm -f
"${remaining_containers[@]}" you capture its exit status (do not unconditionally
swallow it), then re-run docker_timeout 30s ps -aq --filter
"label=${kind_cluster_label}" to rebuild remaining_containers and, if any
containers still exist, log an error including ${KIND_CLUSTER_NAME} and the
container IDs and exit non‑zero (or retry removal a limited number of times) so
the script does not silently succeed when docker fails to remove labeled
containers. Ensure you reference the same variables used here:
remaining_containers, kind_cluster_label, docker_timeout, and KIND_CLUSTER_NAME.
In @.github/actions/runtime-install/action.yml:
- Around line 71-78: The "Generate recipe" step forwards AICR_ACCELERATOR to
generate-recipe.sh unconditionally when inputs.method == 'bundle', so add a
bundle-only guard that requires a non-empty accelerator; update the step's if to
require both inputs.method == 'bundle' and inputs.accelerator != '' (or add an
earlier check step that fails with a clear message if inputs.method == 'bundle'
and inputs.accelerator is empty) so that generate-recipe.sh is only invoked when
the accelerator input is present and valid; refer to the step name "Generate
recipe", the input variable inputs.accelerator, the condition inputs.method ==
'bundle', and the env var AICR_ACCELERATOR/generate-recipe.sh when implementing
the change.
In @.github/scripts/gpu-chainsaw-health.sh:
- Around line 28-29: Validate CHAINSAW_TEST_TIMEOUT before it is passed into the
timeout command: add the same upfront duration validation used for
MONITORING_READY_TIMEOUT/MONITORING_READY_TIMEOUT parsing (e.g., check
CHAINSAW_TEST_TIMEOUT matches the expected duration pattern like ^[0-9]+[smh]$
or numeric seconds) and if invalid either normalize it to a safe default (30m)
or exit with a clear error. Locate the assignment and the place where timeout is
invoked (references: CHAINSAW_TEST_TIMEOUT and the timeout invocation) and
ensure you perform the validation/normalization immediately after the
CHAINSAW_TEST_TIMEOUT defaulting so malformed overrides are caught before
calling timeout.
In @.github/scripts/gpu-debug-diagnostics.sh:
- Around line 21-25: Add an upfront guard to fail fast when KIND_CLUSTER_NAME is
not set: before the kubectl_kind function, check if KIND_CLUSTER_NAME is empty
(e.g., test -z "${KIND_CLUSTER_NAME:-}") and, if so, print a clear error to
stderr and exit with non‑zero status; this will make failures explicit rather
than allowing an unbound-variable error later inside kubectl_kind which uses
"kind-${KIND_CLUSTER_NAME}".
In @.github/scripts/gpu-smoke-run-nvidia-smi.sh:
- Around line 55-56: The Ready wait for the one-shot nvidia-smi pod is too long;
update the kubectl_kind_wait call that references pod/${pod_name} (the line with
--for=condition=Ready --timeout=120s) to either remove the Ready wait entirely
or shorten the timeout to a small value (e.g., 5–10s) so the smoke test doesn't
incur unnecessary 120s latency and real failures surface earlier.
In `@kwok/scripts/install-karpenter-kwok.sh`:
- Around line 78-85: The kwok-stage-fast Helm release invocation (helm_kind
upgrade --install kwok-stage-fast kwok/stage-fast) is missing the readiness wait
and timeout flags; update that command to include the same --wait --timeout
"${KWOK_HELM_TIMEOUT}" flags used by the kwok-controller release so the install
waits for resources to be ready (use the existing KWOK_HELM_TIMEOUT variable for
consistency).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 1ae104c3-310b-48df-8a34-de046ff7a110
📒 Files selected for processing (24)
.github/actions/aicr-build/action.yml.github/actions/check-control-plane-health/check-control-plane-health.sh.github/actions/gpu-chainsaw-health/action.yml.github/actions/gpu-cluster-setup/action.yml.github/actions/gpu-runtime-component-health/action.yml.github/actions/gpu-test-cleanup/cleanup-kind-cluster.sh.github/actions/gpu-test-cleanup/collect-debug-artifacts.sh.github/actions/gpu-test-cleanup/export-kind-logs.sh.github/actions/gpu-validate-conformance/action.yml.github/actions/install-karpenter-kwok/install-karpenter-kwok.sh.github/actions/install-karpenter-kwok/resolve-versions.sh.github/actions/runtime-install/action.yml.github/actions/runtime-install/generate-recipe.sh.github/actions/runtime-install/install-bundle.sh.github/actions/runtime-install/wait-gpu-operands-bundle.sh.github/actions/runtime-install/wait-gpu-operands-helm.sh.github/scripts/gpu-chainsaw-health.sh.github/scripts/gpu-debug-diagnostics.sh.github/scripts/gpu-runtime-component-health.sh.github/scripts/gpu-smoke-run-nvidia-smi.sh.github/workflows/gpu-h100-inference-test.yaml.github/workflows/gpu-h100-kind-runtime-test.yaml.github/workflows/gpu-h100-training-test.yamlkwok/scripts/install-karpenter-kwok.sh
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/gpu-h100-inference-test.yaml:
- Around line 23-29: The workflow hard-codes run_conformance: true which ignores
the new workflow_dispatch input; update the places that set run_conformance:
true to use the dispatch input expression instead (e.g. replace run_conformance:
true with run_conformance: ${{ github.event.inputs.run_conformance == 'true'
}}), and make the same change for the other occurrence referenced (lines
104-112) so the manual "run_conformance" toggle is honored.
In @.github/workflows/gpu-h100-kind-runtime-test.yaml:
- Around line 54-57: The reusable workflow is still configured for x2 GPUs:
change the job identifier gpu-h100-kind-runtime-test to target a single-H100
topology by updating the runs-on value from linux-amd64-gpu-h100-latest-2 to the
single-GPU label (e.g., linux-amd64-gpu-h100-latest) and make the corresponding
changes to both preflight and snapshot validation blocks by lowering
min_gpu_count from '2' to '1' in each place (the two occurrences referenced in
the review). Ensure all three locations (runs-on and both min_gpu_count
settings) are changed together so the workflow will select single-GPU hosts.
In @.github/workflows/gpu-h100-training-test.yaml:
- Around line 23-29: The workflow_dispatch input run_conformance is ignored
because the reusable workflow call hardcodes run_conformance: true; update the
reusable workflow invocation(s) that currently set run_conformance: true (the
uses: ... reusable workflow calls) to pass the workflow_dispatch input instead,
e.g. set run_conformance: ${{ github.event.inputs.run_conformance || false }}
(or the appropriate expression for your runner) so scheduled/push/manual runs
honor the new toggle.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 6ec4bb1e-78b2-4db7-ba91-fc43598d1db4
📒 Files selected for processing (3)
.github/workflows/gpu-h100-inference-test.yaml.github/workflows/gpu-h100-kind-runtime-test.yaml.github/workflows/gpu-h100-training-test.yaml
There was a problem hiding this comment.
Actionable comments posted: 3
♻️ Duplicate comments (1)
.github/workflows/gpu-h100-kind-runtime-test.yaml (1)
56-57:⚠️ Potential issue | 🟠 MajorSingle-H100 test objective is not applied in this workflow yet.
This job is still x2-configured (
runs-onand bothmin_gpu_countvalues), and timeout is still150instead of120, so it won’t exercise the intended single-H100 runtime path.Suggested patch
- runs-on: linux-amd64-gpu-h100-latest-2 - timeout-minutes: 150 + runs-on: linux-amd64-gpu-h100-latest + timeout-minutes: 120 @@ - min_gpu_count: '2' + min_gpu_count: '1' @@ - min_gpu_count: '2' + min_gpu_count: '1'Also applies to: 75-80, 129-134
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/gpu-h100-kind-runtime-test.yaml around lines 56 - 57, The job still uses the x2/multi-H100 configuration; update the workflow to use the single-H100 runtime by changing the runs-on value from "linux-amd64-gpu-h100-latest-2" to the single-H100 runner (the single-H100 runner name used in this repo), set both min_gpu_count entries to 1, and reduce timeout-minutes from 150 to 120; apply the same changes for the other job blocks that contain runs-on, min_gpu_count, and timeout-minutes (the blocks around the other occurrences reported).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/actions/gpu-cluster-setup/action.yml:
- Around line 23-26: The action's input "min_gpu_count" currently supplies a
permissive default '1'; remove that default and require callers to explicitly
pass a value instead. In .github/actions/gpu-cluster-setup/action.yml update the
"min_gpu_count" input (the min_gpu_count symbol) by deleting the default field
and making the input required (or leaving required: true) so no implicit '1' is
used; callers/tests that need a lower threshold should set it in their workflow
invocation rather than relying on the action default.
- Around line 43-94: The Kubernetes quantity inputs (api_server_cpu_request,
api_server_memory_request, controller_manager_cpu_request,
controller_manager_memory_request, scheduler_cpu_request,
scheduler_memory_request, etcd_cpu_request, etcd_memory_request) lack
pre-validation; add explicit validation for each of these inputs using a
Kubernetes quantity regex (e.g. to accept millicpu like ^[0-9]+m$ and binary SI
memory like ^[0-9]+[Gg]i$ or other allowed forms) before embedding them into
templates, and when validation fails emit a clear ::error:: message and stop the
action; update the action input handling code where these inputs are read to
perform the regex checks and output the errors.
In @.github/scripts/gpu-smoke-run-nvidia-smi.sh:
- Around line 18-20: The POD_NAME_FILE derivation interpolates KIND_CLUSTER_NAME
which can cause an unbound-variable error if KIND_CLUSTER_NAME is unset and
writing POD_NAME_FILE can fail if its parent directory doesn't exist; add an
explicit check that KIND_CLUSTER_NAME is set (e.g., test KIND_CLUSTER_NAME and
exit with a clear error) before computing POD_NAME_FILE, derive POD_NAME_FILE
afterwards using the validated KIND_CLUSTER_NAME (referencing the POD_NAME_FILE
and KIND_CLUSTER_NAME symbols), and before any write to POD_NAME_FILE ensure the
parent directory exists by creating it (use dirname on POD_NAME_FILE and mkdir
-p) so writes are resilient.
---
Duplicate comments:
In @.github/workflows/gpu-h100-kind-runtime-test.yaml:
- Around line 56-57: The job still uses the x2/multi-H100 configuration; update
the workflow to use the single-H100 runtime by changing the runs-on value from
"linux-amd64-gpu-h100-latest-2" to the single-H100 runner (the single-H100
runner name used in this repo), set both min_gpu_count entries to 1, and reduce
timeout-minutes from 150 to 120; apply the same changes for the other job blocks
that contain runs-on, min_gpu_count, and timeout-minutes (the blocks around the
other occurrences reported).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: d83a0cc3-3aa3-4534-bcf3-9f153797258f
📒 Files selected for processing (9)
.github/actions/check-control-plane-health/check-control-plane-health.sh.github/actions/gpu-cluster-setup/action.yml.github/actions/gpu-test-cleanup/cleanup-kind-cluster.sh.github/actions/runtime-install/action.yml.github/scripts/gpu-chainsaw-health.sh.github/scripts/gpu-debug-diagnostics.sh.github/scripts/gpu-smoke-run-nvidia-smi.sh.github/workflows/gpu-h100-kind-runtime-test.yamlkwok/scripts/install-karpenter-kwok.sh
8f359f1 to
7ed10b9
Compare
yuanchen8911
left a comment
There was a problem hiding this comment.
Submitting pending inline review comments.
yuanchen8911
left a comment
There was a problem hiding this comment.
Submitting pending inline review comments.
729aee6 to
98ab2d6
Compare
af3b3a0 to
e83d061
Compare
3223469 to
6ca27d4
Compare
mchmarny
left a comment
There was a problem hiding this comment.
Solid refactor — the deduplication via the shared reusable workflow and extraction of inline shell into discrete scripts is a big maintainability win. CI checks are all green.
7 inline comments, all medium or low severity. The main themes: (1) two hardcoded version pins (GPU_OPERATOR_CHART_VERSION, CUDA base image tag) that should live in .settings.yaml for consistency with the project's single-source-of-truth principle; (2) upload-artifact v6/v7 version mismatch in the same composite action; (3) --skip-delete on chainsaw that could leave stale resources for the conformance phase; (4) one github.workspace reference where every other action uses github.action_path.
Nothing here blocks merge — these are consistency and robustness improvements that could be addressed in this PR or a fast follow-up.
6ca27d4 to
b274e63
Compare
b274e63 to
2041bdb
Compare
Summary
The major change in this PR is splitting H100 GPU Kind CI into two explicit validation modes:
pull-request/*pushes: set up the H100 Kind cluster, build the AICR CLI, install the runtime bundle, and run Chainsaw runtime health validation.main: run the minimum validation path plus GPU snapshot validation and CNCF AI Conformance validation.Manual dispatch defaults to the minimum path and can opt into full validation with
run_full_validation=true.Everything else supports that split: the PR deduplicates the training and inference workflows through a reusable H100 workflow, extracts large inline shell blocks into composite actions/scripts, and applies targeted CI/runtime tweaks to reduce fragility on slow GPU runners.
Motivation / Context
AICR H100 Kind CI has had large runtime variance and repeated failures on slow or unhealthy GPU runners. On affected runners, jobs often ran for more than two hours and failed with Kubernetes symptoms such as API server timeouts, control-plane instability, readiness probe timeouts, runtime components not becoming Available, and cleanup/debug steps timing out.
The failures looked different at the Kubernetes layer, but the common signal was runner/host overload. Recent investigation showed a strong correlation between slow/canceled/failed jobs and the NVIDIA-managed runner
NodeUIDprinted by runner bootstrap logs.Problematic recent NodeUID:
507dc070-3563-4487-8a0c-d411242a11a1Recent slow/canceled/failed jobs on that NodeUID used different ephemeral runner names and even different H100 runner labels:
24ba-l-amd-g-h100-l-2-f45nb-runner-pdpwm24ba-l-amd-g-h100-l-2-f45nb-runner-rdn4z609e-l-amd-g-h100-l-1-9rdw2-runner-cb5rwRepresentative slow/failure examples:
24ba-l-amd-g-h100-l-2-f45nb-runner-pdpwm507dc070-3563-4487-8a0c-d411242a11a1kai-scheduler/admission; the 2-hour timeout only ended the already-stuck run.24ba-l-amd-g-h100-l-2-f45nb-runner-rdn4z507dc070-3563-4487-8a0c-d411242a11a1129.20,127.54,112.7541CPUs and241GiBmemory.The same workflows completed quickly on other NodeUIDs. Recent successful H100 x1 runs finished in roughly 8-10 minutes, and prior full-validation runs also passed on healthy runners. Runner reliability is tracked separately in #696 and the IPP follow-up; this PR focuses on keeping the AICR CI signal useful and maintainable while that infrastructure issue is investigated.
Known-good NodeUIDs from this branch's successful H100 x1 runs:
3510eba6-bc52-4e81-9bb3-4f7ab58ebc5179c6057e-fa67-45e9-8708-3f0b4bee7a81a3705e58-e420-4f74-a0e4-75de9fb9489b4d3ecc15-d83e-4c0f-ac97-d5d8cb0812c013beb141-0db5-420e-a05a-34d4802ee108Fixes: N/A
Related: #687, #696
Type of Change
Component(s) Affected
cmd/aicr,pkg/cli)cmd/aicrd,pkg/api,pkg/server)pkg/recipe)pkg/bundler,pkg/component/*)pkg/collector,pkg/snapshotter)pkg/validator)pkg/errors,pkg/k8s)docs/,examples/)Implementation Notes
Validation Modes
This is the central behavior change. PRs get the minimum runtime signal by default, while nightly scheduled runs keep the full validation coverage.
Minimum validation is the default for PR-triggered
pushevents onpull-request/*branches and for manual dispatch whenrun_full_validation=false:linux-amd64-gpu-h100-latest-1.Full validation is the default for scheduled runs on
mainand is available through manual dispatch withrun_full_validation=true:aicr snapshot, and validate GPU model/count from the snapshot.The actual wiring is:
run_full_validation=falserun_full_validation=truefalse, opt-intrue00:15 UTC06:15 UTCRefactor and Deduplication
gpu-h100-kind-runtime-test.yamlreusable workflow.gpu-operator-installaction toruntime-installbecause it installs the full AICR runtime bundle, not only GPU Operator.run:blocks into composite actions and scripts with explicit inputs, timeout handling, and clearer failure messages.CI Hardening and Runtime Tweaks
linux-amd64-gpu-h100-latest-1andmin_gpu_count: 1for H100 training/inference CI while the x2 runner pool issue is investigated.Kind Overlay Scope
The
recipes/overlays/kind.yamlmonitoring changes apply to all kind-based local and CI flows, not only H100 GPU CI. They intentionally reduce the kind monitoring footprint by disabling Grafana and default upstream alerting rules, tuning Prometheus Operator probes/resources, and keeping operator-owned monitoring custom resources in themonitoringnamespace.These changes do not affect EKS, GKE, AKS, or OKE overlays.
Deferred Follow-Ups
testing_tools.nvkindis pinned to a commit SHA for this branch because the H100 Kind workflow needs a specific nvkind revision. Move it to an upstream release tag once a tagged release contains the same support.Testing
Scoped CI checks passed locally. Full
make qualifywas not run because this is a CI-focused test PR and the touched Go package was covered by targeted unit, lint, and coverage checks.Coverage:
validators/conformance:15.6%->15.8%(+0.2%)make test-coverage:75.3%total coverage, above the70%thresholdRecent GitHub Actions validation:
Risk Assessment
CI-only workflow refactor: The main behavioral impact is limited to GPU CI workflows and composite actions. The runtime deployment logic for real clusters is not changed by this PR.
Kind overlay scope: The monitoring changes in
recipes/overlays/kind.yamlare intentionally kind-wide. They reduce local/CI kind monitoring overhead and are not applied to real-cluster overlays.Runner label: The H100 workflows currently run on
linux-amd64-gpu-h100-latest-1withmin_gpu_count: 1while the x2 runner pool issue is investigated. Workflow and job display names areH100 x1.nvkind pin:
testing_tools.nvkindis a commit SHA on this branch. Replace it with a release tag before merge if a suitable upstream tag is available.Conformance signal: PR-push H100 runs default to the minimum runtime validation path to keep PR feedback bounded. Scheduled main-branch H100 runs exercise full validation, including snapshot validation and CNCF AI Conformance. Manual dispatch can enable
run_full_validationfor full-path triage.Rollout notes: The minimum and full H100 x1 flows have passed on healthy runners. Keep the restored debug artifact collection because it is the main triage path for slow-runner failures tracked by #696. Roll back by reverting this CI workflow refactor if it destabilizes PR checks.
Checklist
make testwith-race)make lint)git commit -S) — GPG signing info