ci: improve GPU test reliability #675
Closed
yuanchen8911 wants to merge 2 commits into
Add runtime bundle timing and diagnostics, extend slow GPU CI timeouts, preserve snapshot artifacts for debugging, and collect KAI gang-scheduling failure evidence. Restore Kind Dynamo ssh-keygen parity while extending KAI and Dynamo Helm hook timeouts for cold runners.
Closing in favor of #687, which trims the diagnostic branch down to the minimal H100 CI hardening changes.
Summary
Improve GPU CI reliability and diagnostics for slow/cold H100 runners and hard-to-triage failures.
Motivation / Context
Recent GPU CI runs exposed several reliability gaps:
Fixes: N/A
Related: #670
Type of Change
Component(s) Affected
cmd/aicr, pkg/cli
cmd/aicrd, pkg/api, pkg/server
pkg/recipe
pkg/bundler, pkg/component/*
pkg/collector, pkg/snapshotter
pkg/validator, validators/*
pkg/errors, pkg/k8s
docs/, examples/
Implementation Notes
gpu-operator-install to runtime-install to match its bundle/runtime role.
go build, Docker image build, image size inspection, and kind load docker-image.
deploy.sh.
--no-cleanup until failure diagnostics run, then cleans up the Job/RBAC explicitly.
helm_retry preserves the failing command's exit code in logs.
Testing
Coverage notes:
validators/conformance: 15.3% -> 16.6% (+1.3%) compared with origin/main
make test-coverage: 74.9% total coverage, threshold 70%, passed
Full make qualify has not been rerun after the latest diagnostic changes; this PR remains draft while we validate GPU CI behavior.
Risk Assessment
Rollout notes: Draft PR is intended to validate GPU CI behavior before marking ready.
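For readers unfamiliar with the step-timing idea mentioned in the Implementation Notes (go build, Docker image build, kind load docker-image), here is a minimal sketch of how a shell step could be timed. The function name `timed` and the log format are hypothetical and are not taken from this PR; the actual scripts may measure and report timings differently.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not the PR's implementation.
# timed LABEL COMMAND [ARGS...]
# Runs COMMAND, logs its wall-clock duration in whole seconds to stderr,
# and returns the command's own exit code unchanged.
timed() {
  local label=$1
  shift
  local start end rc=0
  start=$(date +%s)
  "$@" || rc=$?          # capture the exit code without tripping `set -e`
  end=$(date +%s)
  echo "[timing] ${label}: $((end - start))s (exit ${rc})" >&2
  return "$rc"
}
```

A caller might wrap a slow step as `timed "kind load" kind load docker-image "$IMAGE"`, where `IMAGE` is a placeholder variable. Logging to stderr keeps the timing line out of any stdout the wrapped command produces.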
Checklist
make test with -race
make lint
git commit -S
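The Implementation Notes say that helm_retry preserves the failing command's exit code in logs. The sketch below illustrates that general pattern only; `retry_cmd`, its arguments, and its log wording are assumptions made for illustration and are not the helm_retry defined in this PR.

```shell
#!/usr/bin/env bash
# Hypothetical retry wrapper, not the PR's helm_retry.
# retry_cmd MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it succeeds or MAX_ATTEMPTS is reached.
# Each failure is logged with the command's real exit code, and the
# final failure returns that same code to the caller.
retry_cmd() {
  local max_attempts=$1 delay=$2
  shift 2
  local attempt=1 rc=0
  while true; do
    rc=0
    "$@" || rc=$?        # record the exit code before anything can overwrite $?
    if [ "$rc" -eq 0 ]; then
      return 0
    fi
    echo "attempt ${attempt}/${max_attempts} failed with exit code ${rc}: $*" >&2
    if [ "$attempt" -ge "$max_attempts" ]; then
      return "$rc"
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}
```

Capturing the code with `|| rc=$?` directly after the command matters: once another command (such as `echo`) runs, `$?` no longer reflects the original failure, which is the kind of loss the note appears to address.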