fix(ci): centralize GPU CI runtime pins #710

Merged: mchmarny merged 1 commit into NVIDIA:main from yuanchen8911:ci-centralize-gpu-ci-pins on Apr 28, 2026

@yuanchen8911 (Contributor) commented Apr 28, 2026

Summary

Centralizes the GPU CI runtime pins added in #694 behind the shared load-versions action. The GPU Operator chart version and snapshot-agent CUDA image remain in .settings.yaml, and the consuming scripts now receive those values from composite-action inputs/env instead of reading .settings.yaml directly.

Motivation / Context

This is a small follow-up to merged PR #694. In review, we moved the GPU Operator chart version and snapshot-agent CUDA image into .settings.yaml; this PR finishes that cleanup by making those pins follow the same load-versions path as other CI tool/image pins.

This keeps .settings.yaml as the source and .github/actions/load-versions as the shared read path.
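
As a rough illustration of that read path, a load-versions step could publish each pin as a step output by appending NAME=VALUE lines to the file named by $GITHUB_OUTPUT. This is only a sketch: the emit_output helper and the yq key paths in the comments are hypothetical, since the layout of .settings.yaml is not shown in this PR.

```shell
#!/usr/bin/env bash
set -euo pipefail

# emit_output NAME VALUE: append one step output in the NAME=VALUE form
# that GitHub Actions reads from the file named by $GITHUB_OUTPUT.
# (Hypothetical helper, not taken from the repository.)
emit_output() {
  local name="$1" value="$2"
  printf '%s=%s\n' "$name" "$value" >> "${GITHUB_OUTPUT:?GITHUB_OUTPUT is not set}"
}

# Possible usage inside the load-versions step. The yq key paths below are
# assumptions for illustration only:
#   emit_output gpu_operator_chart_version "$(yq -r '.gpu_operator.chart_version' .settings.yaml)"
#   emit_output snapshot_agent_cuda_image "$(yq -r '.snapshot_agent.cuda_image' .settings.yaml)"
```

Consumers would then read the values as step outputs of load-versions instead of parsing .settings.yaml themselves.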

Fixes: N/A
Related: #694

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • Build/CI/tooling

Component(s) Affected

  • CLI (cmd/aicr, pkg/cli)
  • API server (cmd/aicrd, pkg/api, pkg/server)
  • Recipe engine / data (pkg/recipe)
  • Bundlers (pkg/bundler, pkg/component/*)
  • Collectors / snapshotter (pkg/collector, pkg/snapshotter)
  • Validator (pkg/validator)
  • Core libraries (pkg/errors, pkg/k8s)
  • Docs/examples (docs/, examples/)
  • Other: GPU CI composite actions and .settings.yaml

Implementation Notes

  • Adds gpu_operator_chart_version and snapshot_agent_cuda_image outputs to .github/actions/load-versions.
  • Updates runtime-install to load the GPU Operator chart version through load-versions for Helm-mode installs and pass it to install-gpu-operator-helm.sh via GPU_OPERATOR_CHART_VERSION.
  • Updates aicr-build to load the snapshot-agent CUDA image through load-versions when build_snapshot_agent=true and pass it to build-snapshot-agent.sh via SNAPSHOT_AGENT_CUDA_IMAGE.
  • Keeps optional action inputs for both values so future callers can override them explicitly if needed.
  • Leaves the actual pinned values unchanged.
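
The override behavior in the notes above can be sketched in shell. In the PR itself the choice is made with conditional steps in the composite actions, so this is only an illustration of the logic; resolve_pin, INPUT_CHART_VERSION, and LOADED_CHART_VERSION are hypothetical names.

```shell
#!/usr/bin/env bash
set -euo pipefail

# resolve_pin OVERRIDE DEFAULT: print OVERRIDE when the caller supplied one,
# otherwise print DEFAULT (the value loaded from .settings.yaml).
# (Hypothetical helper, not taken from the repository.)
resolve_pin() {
  local override="$1" default="$2"
  if [[ -n "$override" ]]; then
    printf '%s\n' "$override"
  else
    printf '%s\n' "$default"
  fi
}

# Possible usage: hand the resolved pin to the install script through its
# environment. INPUT_CHART_VERSION and LOADED_CHART_VERSION are assumed names.
#   GPU_OPERATOR_CHART_VERSION="$(resolve_pin "${INPUT_CHART_VERSION:-}" "$LOADED_CHART_VERSION")" \
#     .github/actions/runtime-install/install-gpu-operator-helm.sh
```

An explicit input wins, so existing callers that pass nothing keep getting the pinned default.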

Testing

bash -n .github/actions/runtime-install/install-gpu-operator-helm.sh .github/actions/aicr-build/build-snapshot-agent.sh
yamllint .settings.yaml .github/actions/load-versions/action.yml .github/actions/runtime-install/action.yml .github/actions/aicr-build/action.yml
actionlint .github/workflows/gpu-smoke-test.yaml .github/workflows/gpu-h100-kind-runtime-test.yaml .github/workflows/gpu-h100-training-test.yaml .github/workflows/gpu-h100-inference-test.yaml
git diff --check

Scoped CI checks passed locally. Full make qualify was not run because this is a CI-only composite-action wiring change with no Go, recipe, or user-facing behavior changes.

Risk Assessment

  • Low — Isolated change, well-tested, easy to revert
  • Medium — Touches multiple components or has broader impact
  • High — Breaking change, affects critical paths, or complex rollout

Rollout notes: Existing GPU CI callers keep using the same pinned values. The scripts now fail clearly if their parent action does not provide the required env value.
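
A minimal sketch of such a guard, assuming it treats an unset, empty, or literal "null" value (which yq prints for a missing key) as an error. The function name require_pin and the message text are hypothetical; the actual checks in the scripts may differ.

```shell
#!/usr/bin/env bash
set -euo pipefail

# require_pin VAR_NAME: return non-zero with a clear message when the named
# environment variable is unset, empty, or the literal string "null".
# (Hypothetical helper, not taken from the repository.)
require_pin() {
  local name="$1"
  local value="${!name:-}"   # indirect expansion; empty when the variable is unset
  if [[ -z "$value" || "$value" == "null" ]]; then
    echo "error: $name must be provided by the calling action" >&2
    return 1
  fi
}

# Possible usage near the top of a consuming script:
#   require_pin GPU_OPERATOR_CHART_VERSION
#   require_pin SNAPSHOT_AGENT_CUDA_IMAGE
```

With set -e in effect, a failed check stops the script before it runs Helm or a container build with an empty pin.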

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S) — GPG signing info

@yuanchen8911 requested a review from a team as a code owner April 28, 2026 19:32

@coderabbitai (Bot) commented Apr 28, 2026

Walkthrough

This change refactors how configuration values are passed through GitHub Actions workflows. The load-versions action now extracts gpu_operator_chart_version and snapshot_agent_cuda_image from .settings.yaml and exposes them as outputs. The aicr-build and runtime-install actions accept these values as optional inputs and use conditional steps to load defaults when not provided. The corresponding shell scripts (build-snapshot-agent.sh and install-gpu-operator-helm.sh) are updated to validate that these values are set in environment variables rather than reading directly from .settings.yaml with yq. A new configuration entry is added to .settings.yaml for gpu_operator_chart_version.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks: ✅ 4 passed

Check name | Status | Explanation
Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Description check | ✅ Passed | The pull request description clearly explains the purpose of centralizing GPU CI runtime pins behind the load-versions action, provides motivation from a preceding PR, and details the implementation changes across multiple files.
Title check | ✅ Passed | The title 'fix(ci): centralize GPU CI runtime pins' accurately reflects the main change: moving GPU-related configuration pins to a centralized location via the load-versions action.


@yuanchen8911 requested a review from mchmarny April 28, 2026 19:37
@yuanchen8911 changed the title from "ci: centralize GPU CI runtime pins" to "fix(ci): centralize GPU CI runtime pins" Apr 28, 2026

@mchmarny (Member) left a comment

Clean follow-up to #694. Centralizes the two GPU CI pins through load-versions while keeping .settings.yaml as the source. The optional action inputs preserve override flexibility, the env-based pass-through is consistent with how other pinned values flow today, and the script-side guards correctly handle both empty and literal "null" from yq. End-to-end signal is strong: GPU smoke, H100 inference, and H100 training all green on this commit. LGTM.

@mchmarny merged commit f94c66c into NVIDIA:main Apr 28, 2026
30 checks passed