
ci: improve GPU test reliability #675

Closed
yuanchen8911 wants to merge 2 commits into NVIDIA:main from yuanchen8911:ci-bundle-install-diagnostics

Conversation

@yuanchen8911
Contributor

@yuanchen8911 yuanchen8911 commented Apr 24, 2026

Summary

Improve GPU CI reliability and diagnostics for slow/cold H100 runners and hard-to-triage failures.

Motivation / Context

Recent GPU CI runs exposed several reliability gaps:

  • runtime bundle installation can consume most of the previous 120-minute job budget on cold runners
  • when smoke image loading or the cert-manager, kube-prometheus-stack, KAI Scheduler, and Dynamo installs were slow, it was hard to tell whether image pulls, Helm hooks, or readiness waits were responsible
  • inference snapshot runs can create the AICR Job but never get a Job pod before the default 5-minute timeout
  • snapshot failures cleaned up the agent Job before the debug step could inspect pods/events/logs
  • gang-scheduling conformance failures timed out without preserving enough KAI/DRA state to explain whether scheduling, DRA claims, or controller lag was responsible
  • post-failure resource collection could continue after snapshot failure and consume the remaining job budget

Fixes: N/A
Related: #670

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • Build/CI/tooling

Component(s) Affected

  • CLI (cmd/aicr, pkg/cli)
  • API server (cmd/aicrd, pkg/api, pkg/server)
  • Recipe engine / data (pkg/recipe)
  • Bundlers (pkg/bundler, pkg/component/*)
  • Collectors / snapshotter (pkg/collector, pkg/snapshotter)
  • Validator (pkg/validator, validators/*)
  • Core libraries (pkg/errors, pkg/k8s)
  • Docs/examples (docs/, examples/)
  • Other: GPU CI workflows and composite actions

Implementation Notes

  • Renames the GPU runtime install composite action from gpu-operator-install to runtime-install to match its bundle/runtime role.
  • Adds explicit timing around the smoke-test go build, Docker image build, image size inspection, and kind load docker-image.
  • Adds elapsed-time logging for every Helm component install in generated deploy.sh.
  • Dumps full pods/jobs/events diagnostics on Helm failure and on slow successful Helm installs; fast successful installs keep the lighter recent-events dump.
  • Keeps KAI Scheduler in the normal bundle install order to avoid double-install side effects while preserving deploy.sh retry behavior.
  • Restores the Dynamo chart ssh-keygen hook for Kind inference parity with standalone Dynamo and gives dynamo-platform a 20-minute Helm hook timeout on cold runners.
  • Extends H100 inference and training workflow timeouts from 120 to 180 minutes because cold self-hosted H100 runners have exceeded the previous ceiling before diagnostics could complete.
  • Extends the GPU snapshot wait from 5 minutes to 10 minutes to absorb transient Job-controller lag on slow runners.
  • Preserves snapshot agent resources with --no-cleanup until failure diagnostics run, then cleans up the Job/RBAC explicitly.
  • Adds snapshot failure diagnostics for cluster-wide events, pods, nodes, Jobs, quotas/limits, admission webhooks, APIService health, API server livez/readyz, control-plane leases, kube-system pods, kube-apiserver/controller-manager/scheduler/etcd logs, and Kind control-plane container logs.
  • Skips post-run resource collection when snapshot validation fails, so the snapshot-specific debug step is not starved by the workflow timeout.
  • Adds gang-scheduling failure artifacts for gang pods, PodGroups, ResourceClaims, ResourceSlices, namespace events, all KAI component pods/logs, and NVIDIA DRA driver pods/logs before cleanup.
  • Adds an executable generated-script test that verifies helm_retry preserves the failing command's exit code in logs.
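
As a rough illustration of the elapsed-time logging and exit-code preservation described above, a retry wrapper can time each attempt and still return the failing command's exit code. This is a minimal sketch of the general technique, not the `helm_retry` function from the generated deploy.sh: the name `retry_with_timing`, its argument order, and its log format are all hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical sketch; not the helm_retry function from the generated deploy.sh.

# retry_with_timing MAX_ATTEMPTS CMD [ARGS...]
# Runs CMD up to MAX_ATTEMPTS times. Each attempt's elapsed seconds and, on
# failure, its exit code are logged to stderr. Returns 0 on the first success,
# otherwise the exit code of the last failed attempt.
retry_with_timing() {
  local max_attempts=$1
  shift
  local attempt=1 rc=0 start elapsed

  while (( attempt <= max_attempts )); do
    start=$SECONDS
    # The && / || list captures the command's own exit code in rc and keeps
    # "set -e" from aborting the script before the failure is logged.
    "$@" && rc=0 || rc=$?
    elapsed=$(( SECONDS - start ))

    if (( rc == 0 )); then
      echo "ok: '$*' succeeded on attempt ${attempt} after ${elapsed}s" >&2
      return 0
    fi

    echo "fail: '$*' exited ${rc} on attempt ${attempt} after ${elapsed}s" >&2
    attempt=$(( attempt + 1 ))
  done

  # Propagate the failing command's exit code rather than a generic 1.
  return "$rc"
}
```

A call might look like `retry_with_timing 3 helm upgrade --install my-release ./chart --wait`, where the release name and chart path are placeholders. Capturing `rc` immediately after the command matters because any later command, including the `echo` that logs the failure, would overwrite `$?`.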

Testing

git diff --check
GOCACHE=/tmp/aicr-gocache GOFLAGS=-mod=vendor go test ./pkg/bundler/deployer/helm -count=1
GOCACHE=/tmp/aicr-gocache GOLANGCI_LINT_CACHE=/tmp/aicr-golangci-cache golangci-lint run -c .golangci.yaml ./pkg/bundler/deployer/helm/...
go test ./validators/conformance -count=1
golangci-lint run -c .golangci.yaml ./validators/conformance/...
GOFLAGS=-mod=vendor go test -coverprofile=/tmp/aicr/conformance-cover.out ./validators/conformance/...
make test-coverage
bash -n <(yq eval '.runs.steps[] | select(.name == "Build snapshot agent image and load into kind") | .run' .github/actions/aicr-build/action.yml)
bash -n <(yq eval '.runs.steps[] | select(.name == "Run aicr snapshot") | .run' .github/actions/gpu-snapshot-validate/action.yml)
bash -n <(yq eval '.runs.steps[] | select(.name == "Debug snapshot Job") | .run' .github/actions/gpu-snapshot-validate/action.yml)
bash -n <(yq eval '.runs.steps[] | select(.name == "Cleanup snapshot Job") | .run' .github/actions/gpu-snapshot-validate/action.yml)
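
The `bash -n` commands above parse each extracted `run` script without executing it. A minimal sketch of the same check, reading the script from stdin so that it does not depend on `yq` or the action files, could look like this; the function name `check_script_syntax` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper; illustrates "bash -n" only, not the PR's actual checks.

# check_script_syntax reads a shell script on stdin and returns 0 if it
# parses, non-zero otherwise. "bash -n" reads commands without running them,
# so the script under test has no side effects.
check_script_syntax() {
  bash -n 2>/dev/null
}

# A well-formed script parses cleanly; "echo hi" is never executed.
printf 'echo hi\n' | check_script_syntax && echo "valid script: syntax ok"

# An unterminated "if" block is rejected by the parser.
printf 'if true; then\n' | check_script_syntax || echo "broken script: syntax error"
```

This catches only parse errors such as unbalanced quotes or unterminated blocks. It does not detect undefined variables, missing commands, or wrong flags.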

Coverage notes:

  • validators/conformance: 15.3% -> 16.6% (+1.3 percentage points) compared with origin/main
  • make test-coverage: 74.9% total coverage, threshold 70%, passed

Full make qualify has not been rerun after the latest diagnostic changes; this PR remains a draft while we validate GPU CI behavior.

Risk Assessment

  • Low — Isolated diagnostic/CI change, easy to revert
  • Medium — Touches multiple components or has broader impact
  • High — Breaking change, affects critical paths, or complex rollout

Rollout notes: Draft PR is intended to validate GPU CI behavior before marking ready.

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S) — GPG signing info

@yuanchen8911 yuanchen8911 added the enhancement New feature or request label Apr 24, 2026
@yuanchen8911 yuanchen8911 force-pushed the ci-bundle-install-diagnostics branch 2 times, most recently from 9375f81 to 7f5cf57 April 24, 2026 20:37
@yuanchen8911 yuanchen8911 changed the title from "WIP: ci: add runtime bundle install diagnostics" to "WIP: ci: add GPU bundle install diagnostics" Apr 24, 2026
@yuanchen8911 yuanchen8911 force-pushed the ci-bundle-install-diagnostics branch from 7f5cf57 to eb9bc30 April 25, 2026 01:14
@github-actions github-actions Bot added size/L and removed size/M labels Apr 25, 2026
@yuanchen8911 yuanchen8911 changed the title from "WIP: ci: add GPU bundle install diagnostics" to "ci: improve GPU test reliability" Apr 25, 2026
@yuanchen8911 yuanchen8911 force-pushed the ci-bundle-install-diagnostics branch from eb9bc30 to 8722d67 April 25, 2026 03:29
@github-actions github-actions Bot added size/XL and removed size/L labels Apr 25, 2026
@yuanchen8911 yuanchen8911 force-pushed the ci-bundle-install-diagnostics branch 3 times, most recently from 5c1b523 to 014d9f6 April 25, 2026 04:29
@yuanchen8911 yuanchen8911 force-pushed the ci-bundle-install-diagnostics branch from 014d9f6 to 906b19a April 25, 2026 04:44
Add runtime bundle timing and diagnostics, extend slow GPU CI timeouts, preserve snapshot artifacts for debugging, and collect KAI gang-scheduling failure evidence.

Restore Kind Dynamo ssh-keygen parity while extending KAI and Dynamo Helm hook timeouts for cold runners.
@yuanchen8911 yuanchen8911 force-pushed the ci-bundle-install-diagnostics branch from 906b19a to b6edf9d April 25, 2026 05:21
@mchmarny mchmarny added the P2 label Apr 25, 2026
@github-actions github-actions Bot removed the P2 label Apr 25, 2026
@yuanchen8911
Contributor Author

Closing in favor of #687, which trims the diagnostic branch down to the minimal H100 CI hardening changes.


2 participants