Skip to content

Refactor and harden H100 GPU CI workflow#694

Merged
mchmarny merged 1 commit into
NVIDIA:mainfrom
yuanchen8911:ci-gpu-kind-h100-x1-test
Apr 28, 2026
Merged

Refactor and harden H100 GPU CI workflow#694
mchmarny merged 1 commit into
NVIDIA:mainfrom
yuanchen8911:ci-gpu-kind-h100-x1-test

Conversation

@yuanchen8911
Copy link
Copy Markdown
Contributor

@yuanchen8911 yuanchen8911 commented Apr 27, 2026

Summary

The major change in this PR is splitting H100 GPU Kind CI into two explicit validation modes:

  • Minimum validation for PR-triggered pull-request/* pushes: set up the H100 Kind cluster, build the AICR CLI, install the runtime bundle, and run Chainsaw runtime health validation.
  • Full validation for nightly scheduled runs on main: run the minimum validation path plus GPU snapshot validation and CNCF AI Conformance validation.

Manual dispatch defaults to the minimum path and can opt into full validation with run_full_validation=true.

Everything else supports that split: the PR deduplicates the training and inference workflows through a reusable H100 workflow, extracts large inline shell blocks into composite actions/scripts, and applies targeted CI/runtime tweaks to reduce fragility on slow GPU runners.

Motivation / Context

AICR H100 Kind CI has had large runtime variance and repeated failures on slow or unhealthy GPU runners. On affected runners, jobs often ran for more than two hours and failed with Kubernetes symptoms such as API server timeouts, control-plane instability, readiness probe timeouts, runtime components not becoming Available, and cleanup/debug steps timing out.

The failures looked different at the Kubernetes layer, but the common signal was runner/host overload. Recent investigation showed a strong correlation between slow/canceled/failed jobs and the NVIDIA-managed runner NodeUID printed by runner bootstrap logs.

Problematic recent NodeUID:

  • 507dc070-3563-4487-8a0c-d411242a11a1

Recent slow/canceled/failed jobs on that NodeUID used different ephemeral runner names and even different H100 runner labels:

  • 24ba-l-amd-g-h100-l-2-f45nb-runner-pdpwm
  • 24ba-l-amd-g-h100-l-2-f45nb-runner-rdn4z
  • 609e-l-amd-g-h100-l-1-9rdw2-runner-cb5rw

Representative slow/failure examples:

The same workflows completed quickly on other NodeUIDs. Recent successful H100 x1 runs finished in roughly 8-10 minutes, and prior full-validation runs also passed on healthy runners. Runner reliability is tracked separately in #696 and the IPP follow-up; this PR focuses on keeping the AICR CI signal useful and maintainable while that infrastructure issue is investigated.

Known-good NodeUIDs from this branch's successful H100 x1 runs:

  • 3510eba6-bc52-4e81-9bb3-4f7ab58ebc51
  • 79c6057e-fa67-45e9-8708-3f0b4bee7a81
  • a3705e58-e420-4f74-a0e4-75de9fb9489b
  • 4d3ecc15-d83e-4c0f-ac97-d5d8cb0812c0
  • 13beb141-0db5-420e-a05a-34d4802ee108

Fixes: N/A
Related: #687, #696

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • Build/CI/tooling

Component(s) Affected

  • CLI (cmd/aicr, pkg/cli)
  • API server (cmd/aicrd, pkg/api, pkg/server)
  • Recipe engine / data (pkg/recipe)
  • Bundlers (pkg/bundler, pkg/component/*)
  • Collectors / snapshotter (pkg/collector, pkg/snapshotter)
  • Validator (pkg/validator)
  • Core libraries (pkg/errors, pkg/k8s)
  • Docs/examples (docs/, examples/)
  • Other: GPU CI workflows and composite actions; CPU-only gang-scheduling conformance check; kind monitoring overlay

Implementation Notes

Validation Modes

This is the central behavior change. PRs get the minimum runtime signal by default, while nightly scheduled runs keep the full validation coverage.

Minimum validation is the default for PR-triggered push events on pull-request/* branches and for manual dispatch when run_full_validation=false:

  1. Check out the repository.
  2. Prepare the GPU workflow by capturing an early runner preflight snapshot and loading pinned tool/image versions.
  3. Set up a one-H100 Kind cluster on linux-amd64-gpu-h100-latest-1.
  4. Build the AICR CLI only.
  5. Install the AICR runtime bundle for the selected intent/platform.
  6. Install Chainsaw and run the intent-specific runtime health checks.
  7. Run unified cleanup. On failure/cancel, collect debug diagnostics and upload debug artifacts; always delete the Kind cluster.

Full validation is the default for scheduled runs on main and is available through manual dispatch with run_full_validation=true:

  1. Run the same checkout, prepare, cluster setup, CLI build, and runtime bundle install steps as the minimum path.
  2. Snapshot and validate GPU state: build the snapshot-agent image, run aicr snapshot, and validate GPU model/count from the snapshot.
  3. Install Chainsaw and run the same intent-specific runtime health checks as the minimum path.
  4. Run CNCF AI Conformance: install Karpenter + KWOK, build the conformance validator image, and run conformance validation.
  5. Run unified cleanup. Upload conformance artifacts when conformance ran; on failure/cancel, collect debug diagnostics and upload debug artifacts; always delete the Kind cluster.

The actual wiring is:

  • PR push: run_full_validation=false
  • Scheduled main run: run_full_validation=true
  • Manual dispatch: default false, opt-in true
  • Training schedule: daily at 00:15 UTC
  • Inference schedule: daily at 06:15 UTC

Refactor and Deduplication

  • Replaces duplicated H100 training/inference workflow bodies with a shared gpu-h100-kind-runtime-test.yaml reusable workflow.
  • Uses thin training and inference caller workflows that pass only intent-specific values: cluster name, runtime intent/platform, Chainsaw path, and artifact prefix.
  • Renames the old gpu-operator-install action to runtime-install because it installs the full AICR runtime bundle, not only GPU Operator.
  • Extracts large inline run: blocks into composite actions and scripts with explicit inputs, timeout handling, and clearer failure messages.
  • Consolidates post-run behavior into one GPU test cleanup action: debug diagnostics, validation artifact upload, debug artifact collection/upload, and Kind cluster deletion.
  • Uses per-intent H100 concurrency keys so newer training pushes supersede older training work and newer inference pushes supersede older inference work, without training cancelling inference.

CI Hardening and Runtime Tweaks

  • Uses linux-amd64-gpu-h100-latest-1 and min_gpu_count: 1 for H100 training/inference CI while the x2 runner pool issue is investigated.
  • Uses a 150-minute timeout for minimum validation and a 180-minute timeout for full validation.
  • Gives runtime bundle installation an 80-minute step timeout.
  • Removes the post-setup control-plane health check and standalone runtime component health check; Chainsaw remains the runtime health contract after bundle install.
  • Moves Chainsaw installation into the GPU Chainsaw health action.
  • Moves snapshot-agent image build into the GPU snapshot validation action.
  • Moves Karpenter + KWOK install and conformance validator image build into the GPU conformance action.
  • Uses CPU-only KAI PodGroup workers for the gang scheduling conformance check so full validation can run on one H100. DRA/GPU allocation behavior remains covered by the DRA support and secure accelerator access conformance checks.
  • Restores failure/cancellation debug artifact collection in the H100 reusable workflow while keeping expensive node-runtime artifacts opt-in.
  • Adds lightweight runner and Kind-node pressure diagnostics, including load average, memory/disk, and top CPU/memory processes.
  • Keeps the historical GPU toleration for tainted GPU-worker compatibility.

Kind Overlay Scope

The recipes/overlays/kind.yaml monitoring changes apply to all kind-based local and CI flows, not only H100 GPU CI. They intentionally reduce the kind monitoring footprint by disabling Grafana and default upstream alerting rules, tuning Prometheus Operator probes/resources, and keeping operator-owned monitoring custom resources in the monitoring namespace.

These changes do not affect EKS, GKE, AKS, or OKE overlays.

Deferred Follow-Ups

  • Once the 2-H100 runner pool is stable, switch the H100 workflow back to 2-H100 runners and restore GPU-backed gang-scheduling workloads.
  • Add an explicit Kind CPU worker to better mimic the production cluster shape: control-plane node for Kubernetes control-plane components, CPU worker for runtime controllers/operators, and GPU worker(s) for GPU-specific workloads. A prototype exposed a GPU Operator operand placement issue: NFD can label the extra worker with GPU features even though its containerd lacks the NVIDIA runtime. The follow-up should first wire per-component GPU Operator node selectors for operands such as devicePlugin, dcgmExporter, GFD, validator, and related daemonsets.
  • testing_tools.nvkind is pinned to a commit SHA for this branch because the H100 Kind workflow needs a specific nvkind revision. Move it to an upstream release tag once a tagged release contains the same support.

Testing

actionlint .github/workflows/gpu-h100-kind-runtime-test.yaml .github/workflows/gpu-smoke-test.yaml .github/workflows/gpu-h100-training-test.yaml .github/workflows/gpu-h100-inference-test.yaml
yamllint .github/actions/gpu-test-cleanup/action.yml .github/actions/gpu-snapshot-validate/action.yml .github/actions/gpu-validate-conformance/action.yml .github/workflows/gpu-h100-kind-runtime-test.yaml .github/workflows/gpu-h100-training-test.yaml .github/workflows/gpu-h100-inference-test.yaml .github/workflows/gpu-smoke-test.yaml recipes/validators/catalog.yaml
git diff --check
go test ./validators/conformance
golangci-lint run -c .golangci.yaml ./validators/conformance
go test -coverprofile=/tmp/aicr/validators-conformance-cover.out ./validators/conformance
make test-coverage

Scoped CI checks passed locally. Full make qualify was not run because this is a CI-focused test PR and the touched Go package was covered by targeted unit, lint, and coverage checks.

Coverage:

  • validators/conformance: 15.6% -> 15.8% (+0.2%)
  • make test-coverage: 75.3% total coverage, above the 70% threshold

Recent GitHub Actions validation:

Risk Assessment

  • Low — Isolated test PR, easy to revert
  • Medium — Touches multiple components or has broader impact
  • High — Breaking change, affects critical paths, or complex rollout

CI-only workflow refactor: The main behavioral impact is limited to GPU CI workflows and composite actions. The runtime deployment logic for real clusters is not changed by this PR.

Kind overlay scope: The monitoring changes in recipes/overlays/kind.yaml are intentionally kind-wide. They reduce local/CI kind monitoring overhead and are not applied to real-cluster overlays.

Runner label: The H100 workflows currently run on linux-amd64-gpu-h100-latest-1 with min_gpu_count: 1 while the x2 runner pool issue is investigated. Workflow and job display names are H100 x1.

nvkind pin: testing_tools.nvkind is a commit SHA on this branch. Replace it with a release tag before merge if a suitable upstream tag is available.

Conformance signal: PR-push H100 runs default to the minimum runtime validation path to keep PR feedback bounded. Scheduled main-branch H100 runs exercise full validation, including snapshot validation and CNCF AI Conformance. Manual dispatch can enable run_full_validation for full-path triage.

Rollout notes: The minimum and full H100 x1 flows have passed on healthy runners. Keep the restored debug artifact collection because it is the main triage path for slow-runner failures tracked by #696. Roll back by reverting this CI workflow refactor if it destabilizes PR checks.

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S) — GPG signing info

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 27, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

This pull request substantially refactors GPU cluster setup, runtime installation, and test workflows. It extracts inline GitHub Actions logic into standalone Bash scripts (including cluster creation, control-plane health checking, GPU validation, and debug diagnostics), introduces new composite Actions wrapping these scripts, removes the gpu-operator-install action in favor of a new runtime-install action supporting both Helm and bundle-based installation methods, adds a reusable workflow template for H100 GPU runtime testing, extends configuration inputs across multiple actions for timeouts and resource tuning, updates workflow triggers and paths filters, and adds new monitoring assertions for Prometheus stacks without Grafana.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Title check ✅ Passed The title 'Refactor and harden H100 GPU CI workflow' accurately reflects the main change: refactoring H100 GPU CI workflows into a reusable pattern with improved reliability.
Description check ✅ Passed The PR description provides a comprehensive explanation of the changes, including motivation, validation modes, refactoring approach, CI hardening, and risk assessment.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 12

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
.github/workflows/gpu-smoke-test.yaml (1)

47-67: ⚠️ Potential issue | 🟠 Major

Restore push-mirror coverage for runtime Helm/version changes.

This filter no longer matches .settings.yaml or pkg/bundler/deployer/helm/**, even though this workflow still resolves pinned versions and installs the runtime with Helm on Lines 110-114. A PR that only changes version pins or the Helm deployer will now skip the push-mirror smoke run.

🔧 Suggested filter additions
           matched:
             - '.github/workflows/gpu-smoke-test.yaml'
+            - '.settings.yaml'
             - '.github/actions/gpu-cluster-setup/**'
             - '.github/actions/runtime-install/**'
             - '.github/actions/aicr-build/**'
             - '.github/actions/gpu-debug-diagnostics/**'
             - '.github/actions/gpu-test-cleanup/**'
             - '.github/actions/gpu-smoke-nvidia-smi/**'
             - '.github/actions/load-versions/**'
             - '.github/scripts/gpu-debug-diagnostics.sh'
             - '.github/scripts/gpu-smoke-run-nvidia-smi.sh'
             - '.github/scripts/gpu-smoke-show-nvidia-smi-log.sh'
+            - 'pkg/bundler/deployer/helm/**'
             - 'pkg/collector/**'
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/gpu-smoke-test.yaml around lines 47 - 67, The workflow's
filters in gpu-smoke-test.yaml no longer cover changes to .settings.yaml and the
Helm deployer, so add the missing patterns (e.g., '.settings.yaml' and
'pkg/bundler/deployer/helm/**') to the matched list in the filters block so
pushes that only change version pins or Helm deployer files still trigger the
push-mirror smoke run; update the matched array alongside the existing entries
to ensure the runtime Helm/version resolution steps (around the workflow's Helm
install section) are covered.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/actions/aicr-build/action.yml:
- Around line 19-25: The action currently always runs build-cli.sh when
build_snapshot_agent == 'true', so the input build_cli is ineffective; update
the composite action so the build-cli.sh invocation is gated by the build_cli
input (e.g., run build-cli.sh only if inputs.build_cli == 'true') rather than by
build_snapshot_agent, or if the intended behavior is staging-only, rename and
document the input (e.g., inputs.stage_cli) and update all references; change
the conditional that triggers build-cli.sh to reference inputs.build_cli (or the
new name) and adjust any related checks in the composite action steps that
currently use build_snapshot_agent to ensure callers can truly skip CLI
building.

In @.github/actions/check-control-plane-health/check-control-plane-health.sh:
- Around line 47-52: The current check unconditionally treats MAX_RESTARTS as
deprecated; adjust the conditional to preserve the legacy absolute ceiling when
the stability window is zero. Specifically, in the block that trims MAX_RESTARTS
and emits the deprecation warning, change the logic to only ignore/warn about
MAX_RESTARTS when STABILITY_WINDOW is not "0s" (i.e., when stability_window is
non-zero or unset); when STABILITY_WINDOW == "0s" keep MAX_RESTARTS enforced (do
not downgrade or ignore it). Make the same corresponding change where the same
deprecation check appears (the other occurrence referenced around the 471-473
region) so both places respect STABILITY_WINDOW == "0s".

In @.github/actions/gpu-test-cleanup/collect-debug-artifacts.sh:
- Around line 23-24: Validate that MAX_KIND_NODE_ARTIFACT_SECONDS contains an
integer right after it's defined in collect-debug-artifacts.sh; if it doesn't
match a digits-only pattern, reset it to the default (600) and emit a warning to
stderr so subsequent arithmetic uses a safe integer value. Specifically, add a
check using a digits regex (e.g. '^[0-9]+$') against
MAX_KIND_NODE_ARTIFACT_SECONDS in the script and coerce/assign the safe integer
before any arithmetic or comparisons that use MAX_KIND_NODE_ARTIFACT_SECONDS or
the related COLLECT_NODE_RUNTIME_ARTIFACTS logic.

In @.github/actions/gpu-test-cleanup/export-kind-logs.sh:
- Line 19: The script currently silences failures from the timeout 300s kind
export logs /tmp/kind-logs --name "${KIND_CLUSTER_NAME}" || true command; change
it to run that command normally, capture its exit code, and if non‑zero emit a
warning to stderr (including the exit code and the cluster name) so diagnostics
are visible, then continue (non‑blocking). Locate the exact command string
timeout 300s kind export logs /tmp/kind-logs --name "${KIND_CLUSTER_NAME}" and
replace the trailing "|| true" with logic that checks $? and prints a clear
warning if it failed.

In @.github/actions/install-karpenter-kwok/install-karpenter-kwok.sh:
- Around line 45-47: Add a precondition check for the environment variable
KIND_CLUSTER_NAME in install-karpenter-kwok.sh before it is dereferenced in the
kubectl command; if KIND_CLUSTER_NAME is empty or unset (under set -u) emit a
clear error and exit non‑zero so the script fails with a helpful message instead
of an unbound-variable error, then proceed to run the existing timeout ...
kubectl --context="kind-${KIND_CLUSTER_NAME}" apply ... command once the
variable is validated.

In @.github/actions/install-karpenter-kwok/resolve-versions.sh:
- Line 21: Replace the current host-toolchain lookup that sets the `go` variable
(the line that echoes/uses `go=$(go env GOVERSION)` in resolve-versions.sh) with
logic that parses the pinned Go version from the repository’s `.settings.yaml`
and assigns that value to `go` before echoing; update the script to extract the
correct key from `.settings.yaml` (the repo’s centralized Go/tool version entry)
so both local `make tools-setup` and GitHub Actions use the same source of
truth.

In @.github/actions/runtime-install/action.yml:
- Around line 18-21: The action silently no-ops for any inputs.method not equal
to "helm" or "bundle"; add input validation so unsupported values fail fast: in
the action.yml inputs.method block add an allowed-values: ["helm","bundle"]
entry (and/or add an explicit step early that uses an if expression to check
contains(['helm','bundle'], inputs.method) and calls exit 1 or uses fail action
when false) so invalid inputs immediately cause the job to error instead of
skipping the steps referenced later (see inputs.method and the conditional logic
controlling steps ~47-86).

In @.github/actions/runtime-install/generate-recipe.sh:
- Around line 24-25: Update the optional env var check to be defensive against
unset variables: in generate-recipe.sh change the conditional and the expansion
that reference AICR_PLATFORM to use the default-empty expansion
${AICR_PLATFORM:-} (i.e., use [[ -n "${AICR_PLATFORM:-}" ]] and when appending
RECIPE_ARGS use --platform "${AICR_PLATFORM:-}") so the script is safe to run
with set -u even if AICR_PLATFORM is not defined; this affects the AICR_PLATFORM
check and the RECIPE_ARGS+=(--platform "...") addition.

In @.github/scripts/gpu-debug-diagnostics.sh:
- Around line 43-49: The remaining host diagnostic commands (docker info, docker
version, nvidia-smi -L, nvidia-smi, kind get clusters) are unbounded and can
hang; wrap each invocation with a hard timeout (use the timeout utility, e.g.,
timeout 10s <command>) and preserve the existing non-failing behavior by
appending || true so diagnostics never block or fail the script; update all
occurrences of these commands in the script to use timeout and keep their
stdout/stderr redirection as before.

In @.github/scripts/gpu-runtime-component-health.sh:
- Around line 23-24: Validate the mode variable immediately after it's read (the
mode="$1" assignment) and before any monitoring or kai-scheduler wait logic
runs: add an explicit check comparing mode against the allowed values and if it
doesn't match, log/echo the "unknown runtime component health mode" error and
exit non‑zero so the script doesn't proceed to run the monitoring/kai-scheduler
waits and consume the full timeout; apply the same early validation to the other
mode-handling block around the monitoring/kai-scheduler waits (the section
covering lines ~115-133) so both entry points reject invalid modes before
performing waits.

In @.github/scripts/gpu-smoke-run-nvidia-smi.sh:
- Around line 39-45: The pod spec uses container name "nvidia-smi" and runs
command ["nvidia-smi"] but the image is set to "ubuntu:22.04" which lacks the
nvidia-smi binary; update the container image to a CUDA image that includes
nvidia-smi (for example "nvidia/cuda:12.9.0-base-ubuntu22.04" or
"nvidia/cuda:12.9.0-base-ubuntu24.04") so the "nvidia-smi" command executes
successfully, matching the pattern used in dra-gpu-test.yaml, hpa-gpu-test.yaml,
and gang-scheduling-test.yaml.

In @.github/workflows/gpu-h100-kind-runtime-test.yaml:
- Around line 114-116: Replace the direct shell invocation in the workflow step
titled "Check runtime component health" that runs bash
.github/scripts/gpu-runtime-component-health.sh "${{ inputs.intent }}" with a
Layer-2 composite action: create a new action in .github/actions (e.g.,
gpu-runtime-component-health) that wraps the script (exposes the same input
"intent" and runs the script), commit that composite action, and update the
workflow step to call it via uses:
./.github/actions/gpu-runtime-component-health with appropriate with: intent:
${{ inputs.intent }}; apply the same pattern for the other direct runs mentioned
(lines around 150-152 and 161-165) so all scripts are invoked via Layer-2
composed actions instead of run:.

---

Outside diff comments:
In @.github/workflows/gpu-smoke-test.yaml:
- Around line 47-67: The workflow's filters in gpu-smoke-test.yaml no longer
cover changes to .settings.yaml and the Helm deployer, so add the missing
patterns (e.g., '.settings.yaml' and 'pkg/bundler/deployer/helm/**') to the
matched list in the filters block so pushes that only change version pins or
Helm deployer files still trigger the push-mirror smoke run; update the matched
array alongside the existing entries to ensure the runtime Helm/version
resolution steps (around the workflow's Helm install section) are covered.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 338f5fbe-d6d1-4b6e-b673-73ea8714dcee

📥 Commits

Reviewing files that changed from the base of the PR and between 8d0168e and 031e12a.

📒 Files selected for processing (71)
  • .coderabbit.yaml
  • .github/actions/README.md
  • .github/actions/aicr-build/action.yml
  • .github/actions/aicr-build/build-cli.sh
  • .github/actions/aicr-build/build-snapshot-agent.sh
  • .github/actions/aicr-build/build-validator-images.sh
  • .github/actions/aicr-build/stage-cli.sh
  • .github/actions/check-control-plane-health/action.yml
  • .github/actions/check-control-plane-health/check-control-plane-health.sh
  • .github/actions/gpu-cluster-setup/action.yml
  • .github/actions/gpu-cluster-setup/check-runner-capacity.sh
  • .github/actions/gpu-cluster-setup/configure-nvidia-container-toolkit.sh
  • .github/actions/gpu-cluster-setup/create-gpu-kind-cluster.sh
  • .github/actions/gpu-cluster-setup/delete-stale-kind-cluster.sh
  • .github/actions/gpu-cluster-setup/increase-inotify-limits.sh
  • .github/actions/gpu-cluster-setup/install-nvkind.sh
  • .github/actions/gpu-cluster-setup/runner-preflight.sh
  • .github/actions/gpu-cluster-setup/validate-docker-gpu-access.sh
  • .github/actions/gpu-cluster-setup/validate-env.sh
  • .github/actions/gpu-cluster-setup/warm-kind-node-image.sh
  • .github/actions/gpu-debug-diagnostics/action.yml
  • .github/actions/gpu-operator-install/action.yml
  • .github/actions/gpu-smoke-nvidia-smi/action.yml
  • .github/actions/gpu-snapshot-validate/action.yml
  • .github/actions/gpu-snapshot-validate/debug-snapshot-job.sh
  • .github/actions/gpu-snapshot-validate/run-snapshot.sh
  • .github/actions/gpu-snapshot-validate/validate-snapshot-gpu.sh
  • .github/actions/gpu-test-cleanup/action.yml
  • .github/actions/gpu-test-cleanup/cleanup-kind-cluster.sh
  • .github/actions/gpu-test-cleanup/collect-debug-artifacts.sh
  • .github/actions/gpu-test-cleanup/export-kind-logs.sh
  • .github/actions/install-karpenter-kwok/action.yml
  • .github/actions/install-karpenter-kwok/install-karpenter-kwok.sh
  • .github/actions/install-karpenter-kwok/resolve-versions.sh
  • .github/actions/load-versions/action.yml
  • .github/actions/runtime-install/action.yml
  • .github/actions/runtime-install/generate-bundle.sh
  • .github/actions/runtime-install/generate-recipe.sh
  • .github/actions/runtime-install/install-bundle.sh
  • .github/actions/runtime-install/install-gpu-operator-helm.sh
  • .github/actions/runtime-install/wait-gpu-operands-bundle.sh
  • .github/actions/runtime-install/wait-gpu-operands-helm.sh
  • .github/scripts/gpu-chainsaw-health.sh
  • .github/scripts/gpu-debug-diagnostics.sh
  • .github/scripts/gpu-runtime-component-health.sh
  • .github/scripts/gpu-smoke-run-nvidia-smi.sh
  • .github/scripts/gpu-smoke-show-nvidia-smi-log.sh
  • .github/scripts/gpu-validate-conformance.sh
  • .github/workflows/build-attested.yaml
  • .github/workflows/gpu-h100-inference-test.yaml
  • .github/workflows/gpu-h100-kind-runtime-test.yaml
  • .github/workflows/gpu-h100-training-test.yaml
  • .github/workflows/gpu-smoke-test.yaml
  • .github/workflows/packaging.yaml
  • .settings.yaml
  • docs/user/cli-reference.md
  • kwok/scripts/install-karpenter-kwok.sh
  • kwok/scripts/run-all-recipes.sh
  • pkg/cli/doc.go
  • pkg/cli/root.go
  • pkg/cli/skill.go
  • pkg/cli/skill_claude_code.go
  • pkg/cli/skill_codex.go
  • pkg/cli/skill_generator.go
  • pkg/cli/skill_test.go
  • recipes/overlays/base.yaml
  • recipes/overlays/kind.yaml
  • tests/chainsaw/ai-conformance/README.md
  • tests/chainsaw/ai-conformance/kind-common/assert-monitoring.yaml
  • tests/chainsaw/ai-conformance/kind-inference-dynamo/chainsaw-test.yaml
  • tests/chainsaw/ai-conformance/kind-training-kubeflow/chainsaw-test.yaml
💤 Files with no reviewable changes (1)
  • .github/actions/gpu-operator-install/action.yml

Comment thread .github/actions/aicr-build/action.yml
Comment thread .github/actions/gpu-test-cleanup/collect-debug-artifacts.sh
Comment thread .github/actions/gpu-test-cleanup/export-kind-logs.sh Outdated
Comment thread .github/actions/install-karpenter-kwok/install-karpenter-kwok.sh
Comment thread .github/actions/runtime-install/generate-recipe.sh Outdated
Comment thread .github/scripts/gpu-debug-diagnostics.sh Outdated
Comment thread .github/scripts/gpu-runtime-component-health.sh Outdated
Comment thread .github/scripts/gpu-smoke-run-nvidia-smi.sh
Comment thread .github/workflows/gpu-h100-kind-runtime-test.yaml Outdated
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 8

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/actions/check-control-plane-health/check-control-plane-health.sh:
- Around line 188-190: The warning message printed when restart_count > 0 is
misleading when STABILITY_WINDOW is "0s" because historical restarts may later
trigger a hard fail under MAX_RESTARTS_LIMIT; update the echo in the branch that
checks restart_count (the code referencing variables component and
restart_count) to explicitly state that historical restarts are being noted and
that when stability_window (STABILITY_WINDOW or variable stability_window)
equals 0s those historical restarts may be enforced as hard failures per
MAX_RESTARTS_LIMIT, e.g. include the stability_window value and mention
MAX_RESTARTS_LIMIT in the message so triage knows the restarts are not purely
diagnostic.

In @.github/actions/gpu-cluster-setup/action.yml:
- Around line 183-187: The "Print GPUs (nvkind)" step is unintentionally
overriding KIND_CLUSTER_NAME with an undeclared input (inputs.cluster_name);
remove that env override or change it to reference the existing environment
variable so the step uses the job-level value. Edit the "Print GPUs (nvkind)"
step to either delete the env: KIND_CLUSTER_NAME line entirely or set it to
KIND_CLUSTER_NAME: ${{ env.KIND_CLUSTER_NAME }}, ensuring the nvkind cluster
print-gpus --name="${KIND_CLUSTER_NAME}" invocation uses the correct cluster
name.

In @.github/actions/gpu-test-cleanup/cleanup-kind-cluster.sh:
- Around line 25-30: The deletion block that gathers remaining_containers and
calls docker_timeout ... rm -f is swallowing failures with "|| true" and never
verifies removal; change it so after calling docker_timeout 30s rm -f
"${remaining_containers[@]}" you capture its exit status (do not unconditionally
swallow it), then re-run docker_timeout 30s ps -aq --filter
"label=${kind_cluster_label}" to rebuild remaining_containers and, if any
containers still exist, log an error including ${KIND_CLUSTER_NAME} and the
container IDs and exit non‑zero (or retry removal a limited number of times) so
the script does not silently succeed when docker fails to remove labeled
containers. Ensure you reference the same variables used here:
remaining_containers, kind_cluster_label, docker_timeout, and KIND_CLUSTER_NAME.

In @.github/actions/runtime-install/action.yml:
- Around line 71-78: The "Generate recipe" step forwards AICR_ACCELERATOR to
generate-recipe.sh unconditionally when inputs.method == 'bundle', so add a
bundle-only guard that requires a non-empty accelerator; update the step's if to
require both inputs.method == 'bundle' and inputs.accelerator != '' (or add an
earlier check step that fails with a clear message if inputs.method == 'bundle'
and inputs.accelerator is empty) so that generate-recipe.sh is only invoked when
the accelerator input is present and valid; refer to the step name "Generate
recipe", the input variable inputs.accelerator, the condition inputs.method ==
'bundle', and the env var AICR_ACCELERATOR/generate-recipe.sh when implementing
the change.

In @.github/scripts/gpu-chainsaw-health.sh:
- Around line 28-29: Validate CHAINSAW_TEST_TIMEOUT before it is passed into the
timeout command: add the same upfront duration validation used for
MONITORING_READY_TIMEOUT/MONITORING_READY_TIMEOUT parsing (e.g., check
CHAINSAW_TEST_TIMEOUT matches the expected duration pattern like ^[0-9]+[smh]$
or numeric seconds) and if invalid either normalize it to a safe default (30m)
or exit with a clear error. Locate the assignment and the place where timeout is
invoked (references: CHAINSAW_TEST_TIMEOUT and the timeout invocation) and
ensure you perform the validation/normalization immediately after the
CHAINSAW_TEST_TIMEOUT defaulting so malformed overrides are caught before
calling timeout.

In @.github/scripts/gpu-debug-diagnostics.sh:
- Around line 21-25: Add an upfront guard to fail fast when KIND_CLUSTER_NAME is
not set: before the kubectl_kind function, check if KIND_CLUSTER_NAME is empty
(e.g., test -z "${KIND_CLUSTER_NAME:-}") and, if so, print a clear error to
stderr and exit with non‑zero status; this will make failures explicit rather
than allowing an unbound-variable error later inside kubectl_kind which uses
"kind-${KIND_CLUSTER_NAME}".

In @.github/scripts/gpu-smoke-run-nvidia-smi.sh:
- Around line 55-56: The Ready wait for the one-shot nvidia-smi pod is too long;
update the kubectl_kind_wait call that references pod/${pod_name} (the line with
--for=condition=Ready --timeout=120s) to either remove the Ready wait entirely
or shorten the timeout to a small value (e.g., 5–10s) so the smoke test doesn't
incur unnecessary 120s latency and real failures surface earlier.

In `@kwok/scripts/install-karpenter-kwok.sh`:
- Around line 78-85: The kwok-stage-fast Helm release invocation (helm_kind
upgrade --install kwok-stage-fast kwok/stage-fast) is missing the readiness wait
and timeout flags; update that command to include the same --wait --timeout
"${KWOK_HELM_TIMEOUT}" flags used by the kwok-controller release so the install
waits for resources to be ready (use the existing KWOK_HELM_TIMEOUT variable for
consistency).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 1ae104c3-310b-48df-8a34-de046ff7a110

📥 Commits

Reviewing files that changed from the base of the PR and between 031e12a and 46af9d5.

📒 Files selected for processing (24)
  • .github/actions/aicr-build/action.yml
  • .github/actions/check-control-plane-health/check-control-plane-health.sh
  • .github/actions/gpu-chainsaw-health/action.yml
  • .github/actions/gpu-cluster-setup/action.yml
  • .github/actions/gpu-runtime-component-health/action.yml
  • .github/actions/gpu-test-cleanup/cleanup-kind-cluster.sh
  • .github/actions/gpu-test-cleanup/collect-debug-artifacts.sh
  • .github/actions/gpu-test-cleanup/export-kind-logs.sh
  • .github/actions/gpu-validate-conformance/action.yml
  • .github/actions/install-karpenter-kwok/install-karpenter-kwok.sh
  • .github/actions/install-karpenter-kwok/resolve-versions.sh
  • .github/actions/runtime-install/action.yml
  • .github/actions/runtime-install/generate-recipe.sh
  • .github/actions/runtime-install/install-bundle.sh
  • .github/actions/runtime-install/wait-gpu-operands-bundle.sh
  • .github/actions/runtime-install/wait-gpu-operands-helm.sh
  • .github/scripts/gpu-chainsaw-health.sh
  • .github/scripts/gpu-debug-diagnostics.sh
  • .github/scripts/gpu-runtime-component-health.sh
  • .github/scripts/gpu-smoke-run-nvidia-smi.sh
  • .github/workflows/gpu-h100-inference-test.yaml
  • .github/workflows/gpu-h100-kind-runtime-test.yaml
  • .github/workflows/gpu-h100-training-test.yaml
  • kwok/scripts/install-karpenter-kwok.sh

Comment thread .github/actions/gpu-cluster-setup/action.yml
Comment thread .github/actions/gpu-test-cleanup/cleanup-kind-cluster.sh
Comment thread .github/actions/runtime-install/action.yml
Comment thread .github/scripts/gpu-chainsaw-health.sh
Comment thread .github/scripts/gpu-debug-diagnostics.sh
Comment thread .github/scripts/gpu-smoke-run-nvidia-smi.sh Outdated
Comment thread kwok/scripts/install-karpenter-kwok.sh Outdated
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/gpu-h100-inference-test.yaml:
- Around line 23-29: The workflow hard-codes run_conformance: true which ignores
the new workflow_dispatch input; update the places that set run_conformance:
true to use the dispatch input expression instead (e.g. replace run_conformance:
true with run_conformance: ${{ github.event.inputs.run_conformance == 'true'
}}), and make the same change for the other occurrence referenced (lines
104-112) so the manual "run_conformance" toggle is honored.

In @.github/workflows/gpu-h100-kind-runtime-test.yaml:
- Around line 54-57: The reusable workflow is still configured for x2 GPUs:
change the job identifier gpu-h100-kind-runtime-test to target a single-H100
topology by updating the runs-on value from linux-amd64-gpu-h100-latest-2 to the
single-GPU label (e.g., linux-amd64-gpu-h100-latest) and make the corresponding
changes to both preflight and snapshot validation blocks by lowering
min_gpu_count from '2' to '1' in each place (the two occurrences referenced in
the review). Ensure all three locations (runs-on and both min_gpu_count
settings) are changed together so the workflow will select single-GPU hosts.

In @.github/workflows/gpu-h100-training-test.yaml:
- Around line 23-29: The workflow_dispatch input run_conformance is ignored
because the reusable workflow call hardcodes run_conformance: true; update the
reusable workflow invocation(s) that currently set run_conformance: true (the
uses: ... reusable workflow calls) to pass the workflow_dispatch input instead,
e.g. set run_conformance: ${{ github.event.inputs.run_conformance || false }}
(or the appropriate expression for your runner) so scheduled/push/manual runs
honor the new toggle.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 6ec4bb1e-78b2-4db7-ba91-fc43598d1db4

📥 Commits

Reviewing files that changed from the base of the PR and between 46af9d5 and 75d6fae.

📒 Files selected for processing (3)
  • .github/workflows/gpu-h100-inference-test.yaml
  • .github/workflows/gpu-h100-kind-runtime-test.yaml
  • .github/workflows/gpu-h100-training-test.yaml

Comment thread .github/workflows/gpu-h100-inference-test.yaml
Comment thread .github/workflows/gpu-h100-kind-runtime-test.yaml Outdated
Comment thread .github/workflows/gpu-h100-training-test.yaml
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

♻️ Duplicate comments (1)
.github/workflows/gpu-h100-kind-runtime-test.yaml (1)

56-57: ⚠️ Potential issue | 🟠 Major

Single-H100 test objective is not applied in this workflow yet.

This job is still x2-configured (runs-on and both min_gpu_count values), and timeout is still 150 instead of 120, so it won’t exercise the intended single-H100 runtime path.

Suggested patch
-    runs-on: linux-amd64-gpu-h100-latest-2
-    timeout-minutes: 150
+    runs-on: linux-amd64-gpu-h100-latest
+    timeout-minutes: 120
@@
-          min_gpu_count: '2'
+          min_gpu_count: '1'
@@
-          min_gpu_count: '2'
+          min_gpu_count: '1'

Also applies to: 75-80, 129-134

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/gpu-h100-kind-runtime-test.yaml around lines 56 - 57, The
job still uses the x2/multi-H100 configuration; update the workflow to use the
single-H100 runtime by changing the runs-on value from
"linux-amd64-gpu-h100-latest-2" to the single-H100 runner (the single-H100
runner name used in this repo), set both min_gpu_count entries to 1, and reduce
timeout-minutes from 150 to 120; apply the same changes for the other job blocks
that contain runs-on, min_gpu_count, and timeout-minutes (the blocks around the
other occurrences reported).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/actions/gpu-cluster-setup/action.yml:
- Around line 23-26: The action's input "min_gpu_count" currently supplies a
permissive default '1'; remove that default and require callers to explicitly
pass a value instead. In .github/actions/gpu-cluster-setup/action.yml update the
"min_gpu_count" input (the min_gpu_count symbol) by deleting the default field
and making the input required (or leaving required: true) so no implicit '1' is
used; callers/tests that need a lower threshold should set it in their workflow
invocation rather than relying on the action default.
- Around line 43-94: The Kubernetes quantity inputs (api_server_cpu_request,
api_server_memory_request, controller_manager_cpu_request,
controller_manager_memory_request, scheduler_cpu_request,
scheduler_memory_request, etcd_cpu_request, etcd_memory_request) lack
pre-validation; add explicit validation for each of these inputs using a
Kubernetes quantity regex (e.g. to accept millicpu like ^[0-9]+m$ and binary SI
memory like ^[0-9]+[Gg]i$ or other allowed forms) before embedding them into
templates, and when validation fails emit a clear ::error:: message and stop the
action; update the action input handling code where these inputs are read to
perform the regex checks and output the errors.

In @.github/scripts/gpu-smoke-run-nvidia-smi.sh:
- Around line 18-20: The POD_NAME_FILE derivation interpolates KIND_CLUSTER_NAME
which can cause an unbound-variable error if KIND_CLUSTER_NAME is unset and
writing POD_NAME_FILE can fail if its parent directory doesn't exist; add an
explicit check that KIND_CLUSTER_NAME is set (e.g., test KIND_CLUSTER_NAME and
exit with a clear error) before computing POD_NAME_FILE, derive POD_NAME_FILE
afterwards using the validated KIND_CLUSTER_NAME (referencing the POD_NAME_FILE
and KIND_CLUSTER_NAME symbols), and before any write to POD_NAME_FILE ensure the
parent directory exists by creating it (use dirname on POD_NAME_FILE and mkdir
-p) so writes are resilient.

---

Duplicate comments:
In @.github/workflows/gpu-h100-kind-runtime-test.yaml:
- Around line 56-57: The job still uses the x2/multi-H100 configuration; update
the workflow to use the single-H100 runtime by changing the runs-on value from
"linux-amd64-gpu-h100-latest-2" to the single-H100 runner (the single-H100
runner name used in this repo), set both min_gpu_count entries to 1, and reduce
timeout-minutes from 150 to 120; apply the same changes for the other job blocks
that contain runs-on, min_gpu_count, and timeout-minutes (the blocks around the
other occurrences reported).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: d83a0cc3-3aa3-4534-bcf3-9f153797258f

📥 Commits

Reviewing files that changed from the base of the PR and between 75d6fae and 8f359f1.

📒 Files selected for processing (9)
  • .github/actions/check-control-plane-health/check-control-plane-health.sh
  • .github/actions/gpu-cluster-setup/action.yml
  • .github/actions/gpu-test-cleanup/cleanup-kind-cluster.sh
  • .github/actions/runtime-install/action.yml
  • .github/scripts/gpu-chainsaw-health.sh
  • .github/scripts/gpu-debug-diagnostics.sh
  • .github/scripts/gpu-smoke-run-nvidia-smi.sh
  • .github/workflows/gpu-h100-kind-runtime-test.yaml
  • kwok/scripts/install-karpenter-kwok.sh

Comment thread .github/actions/gpu-cluster-setup/action.yml Outdated
Comment thread .github/actions/gpu-cluster-setup/action.yml
Comment thread .github/scripts/gpu-smoke-run-nvidia-smi.sh
Copy link
Copy Markdown
Contributor Author

@yuanchen8911 yuanchen8911 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Submitting pending inline review comments.

Comment thread .github/scripts/gpu-chainsaw-health.sh
Comment thread .github/workflows/gpu-h100-training-test.yaml
Comment thread .github/workflows/gpu-h100-kind-runtime-test.yaml Outdated
Comment thread .github/workflows/gpu-h100-inference-test.yaml
Copy link
Copy Markdown
Contributor Author

@yuanchen8911 yuanchen8911 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Submitting pending inline review comments.

Comment thread kwok/scripts/install-karpenter-kwok.sh Outdated
@yuanchen8911 yuanchen8911 changed the title ci: test single H100 kind runtime flow ci: test H100 x2 conformance runtime flow Apr 27, 2026
@yuanchen8911 yuanchen8911 changed the title ci: test H100 x2 conformance runtime flow ci: test new GPU CI workflow Apr 27, 2026
@yuanchen8911 yuanchen8911 changed the title ci: test new GPU CI workflow ci: harden and test new GPU CI workflow Apr 27, 2026
coderabbitai[bot]

This comment was marked as resolved.

@yuanchen8911 yuanchen8911 changed the title ci: harden and test new GPU CI workflow ci: harden and test new GPU CI workflow (WIP, DO NOT MERGE) Apr 27, 2026
@yuanchen8911 yuanchen8911 force-pushed the ci-gpu-kind-h100-x1-test branch from 729aee6 to 98ab2d6 Compare April 28, 2026 00:49
@yuanchen8911 yuanchen8911 changed the title ci: harden and test new GPU CI workflow (WIP, DO NOT MERGE) Harden and test new GPU CI workflow (WIP, DO NOT MERGE) Apr 28, 2026
@yuanchen8911 yuanchen8911 marked this pull request as ready for review April 28, 2026 02:31
@yuanchen8911 yuanchen8911 requested review from a team as code owners April 28, 2026 02:31
@yuanchen8911 yuanchen8911 changed the title Harden and test new GPU CI workflow (WIP, DO NOT MERGE) Refactor and harden H100 GPU CI workflow Apr 28, 2026
@yuanchen8911 yuanchen8911 force-pushed the ci-gpu-kind-h100-x1-test branch from af3b3a0 to e83d061 Compare April 28, 2026 03:55
@yuanchen8911 yuanchen8911 force-pushed the ci-gpu-kind-h100-x1-test branch 2 times, most recently from 3223469 to 6ca27d4 Compare April 28, 2026 04:01
@yuanchen8911 yuanchen8911 requested review from mchmarny and xdu31 April 28, 2026 04:20
Copy link
Copy Markdown
Member

@mchmarny mchmarny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solid refactor — the deduplication via the shared reusable workflow and extraction of inline shell into discrete scripts is a big maintainability win. CI checks are all green.

7 inline comments, all medium or low severity. The main themes: (1) two hardcoded version pins (GPU_OPERATOR_CHART_VERSION, CUDA base image tag) that should live in .settings.yaml for consistency with the project's single-source-of-truth principle; (2) upload-artifact v6/v7 version mismatch in the same composite action; (3) --skip-delete on chainsaw that could leave stale resources for the conformance phase; (4) one github.workspace reference where every other action uses github.action_path.

Nothing here blocks merge — these are consistency and robustness improvements that could be addressed in this PR or a fast follow-up.

Comment thread .github/actions/gpu-test-cleanup/action.yml Outdated
Comment thread .github/actions/runtime-install/install-gpu-operator-helm.sh Outdated
Comment thread .github/scripts/gpu-chainsaw-health.sh
Comment thread .github/actions/gpu-debug-diagnostics/action.yml Outdated
Comment thread .github/actions/aicr-build/build-snapshot-agent.sh Outdated
Comment thread .github/workflows/gpu-h100-kind-runtime-test.yaml Outdated
Comment thread recipes/overlays/kind.yaml
@yuanchen8911 yuanchen8911 force-pushed the ci-gpu-kind-h100-x1-test branch from 6ca27d4 to b274e63 Compare April 28, 2026 16:36
@yuanchen8911 yuanchen8911 requested review from a team and lockwobr April 28, 2026 16:45
@yuanchen8911 yuanchen8911 force-pushed the ci-gpu-kind-h100-x1-test branch from b274e63 to 2041bdb Compare April 28, 2026 16:48
@yuanchen8911 yuanchen8911 requested a review from mchmarny April 28, 2026 16:48
Copy link
Copy Markdown
Member

@mchmarny mchmarny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/LGTM

@mchmarny mchmarny merged commit 8ffea23 into NVIDIA:main Apr 28, 2026
52 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants