fix(ci): centralize GPU CI runtime pins by yuanchen8911 · Pull Request #710 · NVIDIA/aicr

yuanchen8911 · 2026-04-28T19:31:59Z

Summary

Centralizes the GPU CI runtime pins added in #694 behind the shared load-versions action. The GPU Operator chart version and snapshot-agent CUDA image remain in .settings.yaml, and the consuming scripts now receive those values from composite-action inputs/env instead of reading .settings.yaml directly.

Motivation / Context

This is a small follow-up to merged PR #694. During that review, we moved the GPU Operator chart version and snapshot-agent CUDA image into .settings.yaml; this PR finishes the cleanup by routing those pins through the same load-versions path as the other CI tool/image pins.

This keeps .settings.yaml as the single source for the pinned values and .github/actions/load-versions as the shared read path.

Fixes: N/A
Related: #694

Type of Change

Bug fix (non-breaking change that fixes an issue)
New feature (non-breaking change that adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)
Documentation update
Refactoring (no functional changes)
Build/CI/tooling

Component(s) Affected

CLI (cmd/aicr, pkg/cli)
API server (cmd/aicrd, pkg/api, pkg/server)
Recipe engine / data (pkg/recipe)
Bundlers (pkg/bundler, pkg/component/*)
Collectors / snapshotter (pkg/collector, pkg/snapshotter)
Validator (pkg/validator)
Core libraries (pkg/errors, pkg/k8s)
Docs/examples (docs/, examples/)
Other: GPU CI composite actions and .settings.yaml

Implementation Notes

Adds gpu_operator_chart_version and snapshot_agent_cuda_image outputs to .github/actions/load-versions.
Updates runtime-install to load the GPU Operator chart version through load-versions for Helm-mode installs and pass it to install-gpu-operator-helm.sh via GPU_OPERATOR_CHART_VERSION.
Updates aicr-build to load the snapshot-agent CUDA image through load-versions when build_snapshot_agent=true and pass it to build-snapshot-agent.sh via SNAPSHOT_AGENT_CUDA_IMAGE.
Keeps optional action inputs for both values so future callers can override them explicitly if needed.
Leaves the actual pinned values unchanged.
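
For illustration only, the env-based hand-off described above could be guarded on the script side roughly as follows. This is a minimal sketch, not the code in this PR: the helper name require_pin and its error message are hypothetical. It assumes the parent action exports the value (for example GPU_OPERATOR_CHART_VERSION) and treats both an empty value and the literal string "null" (what yq prints for a missing key) as missing.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the PR): validate a pin passed in by the parent action.
#   $1: name of the variable, used only in the error message
#   $2: value received from the environment (may be empty)
require_pin() {
  # An empty value or the literal "null" both mean the pin was not provided.
  if [ -z "${2:-}" ] || [ "${2:-}" = "null" ]; then
    echo "error: $1 must be provided by the calling action" >&2
    return 1
  fi
  # Echo the validated value so callers can capture it.
  printf '%s\n' "$2"
}

# Example use inside a consuming script (variable name taken from the notes above):
#   chart_version="$(require_pin GPU_OPERATOR_CHART_VERSION "${GPU_OPERATOR_CHART_VERSION:-}")" || exit 1
```

Passing the value as an argument rather than looking it up by name keeps the helper free of indirect expansion, so the same function can check SNAPSHOT_AGENT_CUDA_IMAGE without changes.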

Testing

bash -n .github/actions/runtime-install/install-gpu-operator-helm.sh .github/actions/aicr-build/build-snapshot-agent.sh
yamllint .settings.yaml .github/actions/load-versions/action.yml .github/actions/runtime-install/action.yml .github/actions/aicr-build/action.yml
actionlint .github/workflows/gpu-smoke-test.yaml .github/workflows/gpu-h100-kind-runtime-test.yaml .github/workflows/gpu-h100-training-test.yaml .github/workflows/gpu-h100-inference-test.yaml
git diff --check

Scoped CI checks passed locally. Full make qualify was not run because this is a CI-only composite-action wiring change with no Go, recipe, or user-facing behavior changes.

Risk Assessment

Low — Isolated change, well-tested, easy to revert
Medium — Touches multiple components or has broader impact
High — Breaking change, affects critical paths, or complex rollout

Rollout notes: Existing GPU CI callers keep using the same pinned values. The scripts now fail clearly if their parent action does not provide the required env value.

Checklist

Tests pass locally (make test with -race)
Linter passes (make lint)
I did not skip/disable tests to make CI green
I added/updated tests for new functionality
I updated docs if user-facing behavior changed
Changes follow existing patterns in the codebase
Commits are cryptographically signed (git commit -S) — GPG signing info

coderabbitai · 2026-04-28T19:34:40Z

Walkthrough

This change refactors how configuration values are passed through GitHub Actions workflows. The load-versions action now extracts gpu_operator_chart_version and snapshot_agent_cuda_image from .settings.yaml and exposes them as outputs. The aicr-build and runtime-install actions accept these values as optional inputs and use conditional steps to load defaults when not provided. The corresponding shell scripts (build-snapshot-agent.sh and install-gpu-operator-helm.sh) are updated to validate that these values are set in environment variables rather than reading directly from .settings.yaml with yq. A new configuration entry is added to .settings.yaml for gpu_operator_chart_version.
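
As a rough model of the flow summarized above, the two shell functions below sketch how a value might be exposed as a step output and how an optional override could take precedence over the loaded default. Both function names are hypothetical and the sample value is a placeholder, not a real pin. The only GitHub Actions mechanism relied on is appending name=value lines to the file named by GITHUB_OUTPUT. The actions in this PR reportedly use conditional steps rather than a shell function for the fallback, so resolve_pin only illustrates the resulting precedence.

```shell
#!/usr/bin/env bash
# Hypothetical helper: expose a value as a step output.
# GitHub Actions reads "name=value" lines from the file named by $GITHUB_OUTPUT.
emit_output() {
  printf '%s=%s\n' "$1" "$2" >> "${GITHUB_OUTPUT:?GITHUB_OUTPUT is not set}"
}

# Hypothetical helper: prefer an explicit override, otherwise use the loaded default.
#   $1: value from the optional action input (may be empty)
#   $2: default loaded through the shared read path
resolve_pin() {
  if [ -n "${1:-}" ]; then
    printf '%s\n' "$1"
  else
    printf '%s\n' "${2:-}"
  fi
}

# Example (placeholder value, assumed for illustration):
#   emit_output gpu_operator_chart_version "v0.0.0-example"
#   version="$(resolve_pin "$INPUT_OVERRIDE" "$LOADED_DEFAULT")"
```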

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

Check name	Status	Explanation
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description check	✅ Passed	The pull request description clearly explains the purpose of centralizing GPU CI runtime pins behind the load-versions action, provides motivation from a preceding PR, and details the implementation changes across multiple files.
Title check	✅ Passed	The title 'fix(ci): centralize GPU CI runtime pins' accurately reflects the main change: moving GPU-related configuration pins to a centralized location via the load-versions action.


mchmarny

Clean follow-up to #694. Centralizes the two GPU CI pins through load-versions while keeping .settings.yaml as the source. The optional action inputs preserve override flexibility, the env-based pass-through is consistent with how other pinned values flow today, and the script-side guards correctly handle both empty and literal "null" from yq. End-to-end signal is strong: GPU smoke, H100 inference, and H100 training all green on this commit. LGTM.

ci: centralize GPU CI runtime pins

376af14

yuanchen8911 requested a review from a team as a code owner April 28, 2026 19:32

github-actions Bot added area/ci size/M labels Apr 28, 2026

yuanchen8911 requested a review from mchmarny April 28, 2026 19:37

yuanchen8911 changed the title ~~ci: centralize GPU CI runtime pins~~ fix(ci): centralize GPU CI runtime pins Apr 28, 2026

mchmarny approved these changes Apr 28, 2026


mchmarny merged commit f94c66c into NVIDIA:main Apr 28, 2026
30 checks passed

mchmarny merged 1 commit into NVIDIA:main from yuanchen8911:ci-centralize-gpu-ci-pins
