10 changes: 9 additions & 1 deletion .github/actions/README.md
@@ -4,6 +4,13 @@ This directory contains a modular, reusable GitHub Actions architecture optimize

## Composite Actions

### Script Conventions

Composite action helper scripts in this directory are intentionally portable
across checkout modes: keep them mode `0644` and invoke them as
`bash path/to/script.sh` from workflows or `action.yml` files. Do not rely on
executable bits or `./script.sh` invocation.
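A minimal sketch of why this convention holds, assuming a throwaway script in a temporary directory (the file name `hello.sh` is hypothetical and not part of this PR): a mode `0644` file cannot be started directly, but it runs normally when passed to `bash`, which only needs read permission.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper script, created only for this demonstration.
workdir="$(mktemp -d)"
trap 'rm -rf "${workdir}"' EXIT
printf '%s\n' '#!/usr/bin/env bash' 'echo "hello from helper"' > "${workdir}/hello.sh"
chmod 0644 "${workdir}/hello.sh"

# Direct execution depends on the executable bit, so it fails for mode 0644.
if "${workdir}/hello.sh" >/dev/null 2>&1; then
  direct_result="ran"
else
  direct_result="failed"
fi

# Passing the file to bash only requires that the file be readable.
bash_output="$(bash "${workdir}/hello.sh")"

echo "direct invocation: ${direct_result}"
echo "bash invocation:   ${bash_output}"
```

This is the same reason the `action.yml` steps in this PR call `bash "${{ github.action_path }}/<script>.sh"` rather than executing the script path directly.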

### Core CI/CD Actions

#### `security-scan/`
@@ -50,7 +57,8 @@ This action runs `tools/setup-tools --skip-go --skip-docker` in auto mode, which
**When to use**: When you need version values in workflow steps
**Outputs**:
- `go`, `goreleaser`, `ko`, `crane`, `golangci_lint`, `yamllint`, `addlicense`
- `grype`, `kubectl`, `kind`, `ctlptl`, `tilt`, `helm`
- `grype`, `kubectl`, `kind`, `nvkind`, `ctlptl`, `tilt`, `helm`
- `kind_node_image`, `h100_kind_node_image`

**Example**:
```yaml
# (rest of the example is collapsed in the diff view)
```
94 changes: 25 additions & 69 deletions .github/actions/aicr-build/action.yml
@@ -13,9 +13,17 @@
# limitations under the License.

name: 'AICR Build'
description: 'Builds the aicr validator image (via Dockerfile) and CLI binary, and loads the image into kind.'
description: 'Builds the aicr CLI and optional snapshot/validator images, and loads requested images into kind.'

inputs:
build_cli:
description: 'Build and stage the standalone aicr CLI binary at the repository root'
required: false
default: 'true'
build_snapshot_agent:
description: 'Build the CUDA-based snapshot agent image and load it into kind'
required: false
default: 'true'
build_validators:
description: 'Deprecated: use validator_phases instead. Ignored when validator_phases is set.'
required: false
@@ -28,86 +36,34 @@ inputs:
runs:
using: 'composite'
steps:

- name: Install ko
- name: Build standalone aicr CLI binary
if: inputs.build_cli == 'true'
shell: bash
run: |
KO_VERSION=$(yq eval '.build_tools.ko' .settings.yaml)
GOFLAGS= go install "github.com/google/ko@${KO_VERSION}"
env:
GOFLAGS: -mod=vendor
run: bash "${{ github.action_path }}/build-cli.sh"

- name: Build snapshot agent image and load into kind
- name: Build snapshot agent CLI binary
if: inputs.build_cli != 'true' && inputs.build_snapshot_agent == 'true'
shell: bash
env:
GOFLAGS: -mod=vendor
run: |
# Build snapshot agent image with CUDA base (provides nvidia-smi for GPU detection).
# Uses cuda:base (~250MB) instead of cuda:runtime (~1.8GB) — only nvidia-smi is needed.
# GPU test workflows use --image=ko.local:smoke-test for aicr snapshot.
CGO_ENABLED=0 go build -trimpath -o dist/aicr ./cmd/aicr
docker build -t ko.local:smoke-test -f - . <<'DOCKERFILE'
FROM nvcr.io/nvidia/cuda:13.1.0-base-ubuntu24.04
COPY dist/aicr /usr/local/bin/aicr
ENTRYPOINT ["/usr/local/bin/aicr"]
DOCKERFILE
run: bash "${{ github.action_path }}/build-cli.sh"

# Load onto all nodes. The snapshot agent requests nvidia.com/gpu but
# does not set a node selector, so it can land on any GPU-capable node
# including the control-plane (e.g., T4 smoke test).
#
# Timeout is intentionally generous (900s per attempt). H100 self-hosted
# runners transfer images over a shared Docker-in-Docker bridge; large
# CUDA base images (~250MB compressed) combined with I/O contention from
# parallel GPU operator pods regularly exceed the previous 600s limit.
timeout 900 kind load docker-image ko.local:smoke-test --name "${KIND_CLUSTER_NAME}" || {
echo "::warning::kind load attempt 1 failed for ko.local:smoke-test, retrying..."
timeout 900 kind load docker-image ko.local:smoke-test --name "${KIND_CLUSTER_NAME}"
}
- name: Build snapshot agent image and load into kind
if: inputs.build_snapshot_agent == 'true'
shell: bash
run: bash "${{ github.action_path }}/build-snapshot-agent.sh"

- name: Build validator images and load into kind
if: "!(inputs.validator_phases == 'none' || (inputs.validator_phases == '' && inputs.build_validators == 'false'))"
shell: bash
env:
GOFLAGS: -mod=vendor
run: |
# Determine which validator phases to build.
# validator_phases takes precedence; build_validators is a deprecated fallback.
if [[ -n "${{ inputs.validator_phases }}" ]]; then
if [[ "${{ inputs.validator_phases }}" == "none" ]]; then
echo "Skipping validator builds (validator_phases=none)"
exit 0
fi
PHASES="${{ inputs.validator_phases }}"
else
# Default: build all phases (backwards compatible)
PHASES="deployment,performance,conformance"
fi

# Compile only the requested validator binaries.
mkdir -p dist/validator
for phase in ${PHASES//,/ }; do
echo "Building validator binary: ${phase}"
CGO_ENABLED=0 go build -trimpath -o "dist/validator/${phase}" "./validators/${phase}"
done

for phase in ${PHASES//,/ }; do
mkdir -p "validators/${phase}/testdata"
docker build -t "ko.local/aicr-validators/${phase}:latest" -f - . <<DOCKERFILE
FROM gcr.io/distroless/static-debian12:nonroot
COPY dist/validator/${phase} /${phase}
COPY validators/${phase}/testdata /app/testdata
WORKDIR /app
USER nonroot
ENTRYPOINT ["/${phase}"]
DOCKERFILE
# Validator images are small (~30MB distroless), but share the same
# Docker-in-Docker bridge as the smoke-test load above. 600s per
# attempt accommodates I/O queuing behind concurrent image pulls.
timeout 600 kind load docker-image "ko.local/aicr-validators/${phase}:latest" --name "${KIND_CLUSTER_NAME}" || {
echo "::warning::kind load attempt 1 failed for ko.local/aicr-validators/${phase}:latest, retrying..."
timeout 600 kind load docker-image "ko.local/aicr-validators/${phase}:latest" --name "${KIND_CLUSTER_NAME}"
}
done
VALIDATOR_PHASES: ${{ inputs.validator_phases }}
run: bash "${{ github.action_path }}/build-validator-images.sh"

- name: Stage aicr binary at repo root
if: inputs.build_cli == 'true'
shell: bash
run: cp dist/aicr ./aicr
run: bash "${{ github.action_path }}/stage-cli.sh"
19 changes: 19 additions & 0 deletions .github/actions/aicr-build/build-cli.sh
@@ -0,0 +1,19 @@
#!/usr/bin/env bash
# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

set -euo pipefail

mkdir -p dist
CGO_ENABLED=0 go build -trimpath -o dist/aicr ./cmd/aicr
51 changes: 51 additions & 0 deletions .github/actions/aicr-build/build-snapshot-agent.sh
@@ -0,0 +1,51 @@
#!/usr/bin/env bash
# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

set -euo pipefail

if ! command -v yq >/dev/null 2>&1; then
echo "::error::yq is required to read testing.snapshot_agent_cuda_image from .settings.yaml"
exit 1
fi

SNAPSHOT_AGENT_CUDA_IMAGE="$(yq eval '.testing.snapshot_agent_cuda_image // ""' .settings.yaml)"
if [[ -z "${SNAPSHOT_AGENT_CUDA_IMAGE}" || "${SNAPSHOT_AGENT_CUDA_IMAGE}" == "null" ]]; then
echo "::error::testing.snapshot_agent_cuda_image must be set in .settings.yaml"
exit 1
fi

if [[ ! -f dist/aicr ]]; then
echo "::error::dist/aicr not found; build the AICR CLI before building the snapshot agent image"
exit 1
fi

# Build snapshot agent image with CUDA base (provides nvidia-smi for GPU detection).
# Uses cuda:base (~250MB) instead of cuda:runtime (~1.8GB) because only nvidia-smi is needed.
timeout 900s docker build \
--build-arg SNAPSHOT_AGENT_CUDA_IMAGE="${SNAPSHOT_AGENT_CUDA_IMAGE}" \
-t ko.local:smoke-test -f - . <<'DOCKERFILE'
ARG SNAPSHOT_AGENT_CUDA_IMAGE
FROM ${SNAPSHOT_AGENT_CUDA_IMAGE}
COPY dist/aicr /usr/local/bin/aicr
ENTRYPOINT ["/usr/local/bin/aicr"]
DOCKERFILE

# Load onto all nodes. The snapshot agent requests nvidia.com/gpu but does not
# set a node selector, so it can land on any GPU-capable node including the
# control-plane in the L40G smoke test.
timeout 900 kind load docker-image ko.local:smoke-test --name "${KIND_CLUSTER_NAME}" || {
echo "::warning::kind load attempt 1 failed for ko.local:smoke-test, retrying..."
timeout 900 kind load docker-image ko.local:smoke-test --name "${KIND_CLUSTER_NAME}"
}
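The load-then-retry pattern at the end of this script could be factored into a small helper. The following is a sketch under assumptions, not code from the PR: `retry_with_timeout` is a hypothetical function name, and the demonstration uses a stand-in command rather than `kind load docker-image`. It relies on the same `timeout` utility (GNU coreutils) that the script already uses.

```shell
#!/usr/bin/env bash
set -euo pipefail

# retry_with_timeout SECONDS ATTEMPTS CMD [ARGS...]
# Hypothetical helper: runs CMD under `timeout` and retries until it
# succeeds or ATTEMPTS runs have been used up.
retry_with_timeout() {
  local seconds="$1" attempts="$2"
  shift 2
  local attempt
  for ((attempt = 1; attempt <= attempts; attempt++)); do
    if timeout "${seconds}" "$@"; then
      return 0
    fi
    if ((attempt < attempts)); then
      echo "::warning::attempt ${attempt} failed for: $*, retrying..." >&2
    fi
  done
  return 1
}

# Stand-in for a flaky command: fails on the first call, succeeds on the second.
# It must be an external command because `timeout` cannot run shell functions.
state_file="$(mktemp)"
flaky_cmd=(bash -c 'if [[ -s "$1" ]]; then exit 0; else echo tried > "$1"; exit 1; fi' _ "${state_file}")

retry_with_timeout 5 2 "${flaky_cmd[@]}" && retry_status=0 || retry_status=$?

# A command that always fails exhausts both attempts.
retry_with_timeout 5 2 false && fail_status=0 || fail_status=$?

rm -f "${state_file}"
echo "flaky command status: ${retry_status}"
echo "always-failing command status: ${fail_status}"
```

With such a helper, the load step above would read `retry_with_timeout 900 2 kind load docker-image ko.local:smoke-test --name "${KIND_CLUSTER_NAME}"`, keeping the 900-second limit per attempt.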
59 changes: 59 additions & 0 deletions .github/actions/aicr-build/build-validator-images.sh
@@ -0,0 +1,59 @@
#!/usr/bin/env bash
# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

set -euo pipefail

VALIDATOR_PHASES="${VALIDATOR_PHASES:-}"
if [[ -n "${VALIDATOR_PHASES}" ]]; then
if [[ "${VALIDATOR_PHASES}" == "none" ]]; then
echo "Skipping validator builds (validator_phases=none)"
exit 0
fi
PHASES="${VALIDATOR_PHASES}"
else
# Default: build all phases (backwards compatible).
PHASES="deployment,performance,conformance"
fi

: "${KIND_CLUSTER_NAME:?KIND_CLUSTER_NAME must be set}"

mkdir -p dist/validator
for phase in ${PHASES//,/ }; do
if ! [[ "${phase}" =~ ^[a-z][a-z0-9_-]*$ ]]; then
echo "::error::invalid validator phase '${phase}'; expected ^[a-z][a-z0-9_-]*$"
exit 1
fi
echo "Building validator binary: ${phase}"
CGO_ENABLED=0 go build -trimpath -o "dist/validator/${phase}" "./validators/${phase}"
done

for phase in ${PHASES//,/ }; do
if [[ ! -d "validators/${phase}/testdata" ]]; then
echo "::error::validators/${phase}/testdata is missing"
exit 1
fi
docker build -t "ko.local/aicr-validators/${phase}:latest" -f - . <<DOCKERFILE
FROM gcr.io/distroless/static-debian12:nonroot
COPY dist/validator/${phase} /${phase}
COPY validators/${phase}/testdata /app/testdata
WORKDIR /app
USER nonroot
ENTRYPOINT ["/${phase}"]
DOCKERFILE
timeout 600 kind load docker-image "ko.local/aicr-validators/${phase}:latest" --name "${KIND_CLUSTER_NAME}" || {
echo "::warning::kind load attempt 1 failed for ko.local/aicr-validators/${phase}:latest, retrying..."
timeout 600 kind load docker-image "ko.local/aicr-validators/${phase}:latest" --name "${KIND_CLUSTER_NAME}"
}
done
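The phase handling in this script (comma splitting, a default of all phases, and name validation before the value reaches a path or image tag) can be isolated for a quick check. This is a sketch, assuming a hypothetical `parse_phases` function that is not part of the PR; it reuses the script's regular expression and default list, and leaves the `none` short-circuit to the caller.

```shell
#!/usr/bin/env bash
set -euo pipefail

# parse_phases LIST
# Hypothetical helper: prints one validated phase per line. An empty LIST
# falls back to all phases; the first invalid name makes the function fail.
parse_phases() {
  local list="${1:-deployment,performance,conformance}"
  local phase
  for phase in ${list//,/ }; do
    if ! [[ "${phase}" =~ ^[a-z][a-z0-9_-]*$ ]]; then
      echo "::error::invalid validator phase '${phase}'" >&2
      return 1
    fi
    printf '%s\n' "${phase}"
  done
}

default_phases="$(parse_phases "")"
subset_phases="$(parse_phases "deployment,conformance")"

# A name containing path characters is rejected before it is ever used.
if parse_phases "deployment,../etc" >/dev/null 2>&1; then
  traversal_result="accepted"
else
  traversal_result="rejected"
fi

echo "default: ${default_phases//$'\n'/ }"
echo "subset:  ${subset_phases//$'\n'/ }"
echo "path-like phase: ${traversal_result}"
```

Validating each name first matters because `${phase}` is interpolated into the `go build` package path, the image tag, and the unquoted heredoc that forms the Dockerfile.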
18 changes: 18 additions & 0 deletions .github/actions/aicr-build/stage-cli.sh
@@ -0,0 +1,18 @@
#!/usr/bin/env bash
# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

set -euo pipefail

cp dist/aicr ./aicr
80 changes: 80 additions & 0 deletions .github/actions/check-control-plane-health/action.yml
@@ -0,0 +1,80 @@
# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

name: 'Check Control Plane Health'
description: 'Fails if Kind control-plane static pods are missing, unready, or unstable.'

inputs:
cluster_name:
description: 'Kind cluster name'
required: true
namespace:
description: 'Namespace that contains the control-plane pods'
required: false
default: kube-system
components:
description: 'Space-separated component label values to check'
required: false
default: kube-apiserver kube-controller-manager kube-scheduler etcd
wait_timeout:
description: 'Timeout for each component readiness wait'
required: false
default: 60s
max_restarts:
description: 'Compatibility input; with stability_window=0s, fail if historical restartCount exceeds this ceiling'
required: false
default: '1'
stability_window:
description: 'Optional duration to watch for new control-plane restarts after pods are Ready'
required: false
default: '0s'
stability_probe_interval:
description: 'Interval for active API server probes during the stability window'
required: false
default: '10s'
stability_probe_failure_threshold:
description: 'Consecutive active stability probe failures allowed before failing'
required: false
default: '2'
lease_components:
description: 'Space-separated leader election lease names to check for freshness'
required: false
default: kube-controller-manager kube-scheduler
lease_stale_timeout:
description: 'Maximum allowed leader election lease age at the end of a stability window'
required: false
default: '120s'
runtime_diagnostics:
description: 'Collect expensive kind node runtime diagnostics such as docker stats, crictl, and journalctl on failure'
required: false
default: 'false'

runs:
using: 'composite'
steps:
- name: Check control-plane pods
shell: bash
env:
KIND_CLUSTER_NAME: ${{ inputs.cluster_name }}
NAMESPACE: ${{ inputs.namespace }}
COMPONENTS: ${{ inputs.components }}
WAIT_TIMEOUT: ${{ inputs.wait_timeout }}
MAX_RESTARTS: ${{ inputs.max_restarts }}
STABILITY_WINDOW: ${{ inputs.stability_window }}
STABILITY_PROBE_INTERVAL: ${{ inputs.stability_probe_interval }}
STABILITY_PROBE_FAILURE_THRESHOLD: ${{ inputs.stability_probe_failure_threshold }}
LEASE_COMPONENTS: ${{ inputs.lease_components }}
LEASE_STALE_TIMEOUT: ${{ inputs.lease_stale_timeout }}
RUNTIME_DIAGNOSTICS: ${{ inputs.runtime_diagnostics }}
run: bash "${{ github.action_path }}/check-control-plane-health.sh"