Skip to content

chore(recipes): bump aws-efa to v0.5.26 keeping AICR's hardened securityContext#872

Merged
mchmarny merged 2 commits into
NVIDIA:mainfrom
yuanchen8911:chore/aws-efa-v0.5.26
May 14, 2026
Merged

chore(recipes): bump aws-efa to v0.5.26 keeping AICR's hardened securityContext#872
mchmarny merged 2 commits into
NVIDIA:mainfrom
yuanchen8911:chore/aws-efa-v0.5.26

Conversation

@yuanchen8911
Copy link
Copy Markdown
Contributor

Summary

Bumps the EFA device-plugin chart from v0.5.3 (2023) → v0.5.26 / app v0.5.18 (current upstream latest). AICR keeps its existing hardened securityContext override rather than accepting the chart's new upstream-default of privileged: true.

Motivation / Context

The audit in #698 flagged aws-efa as blocked on a security review because chart v0.5.4+ switched the device-plugin DaemonSet's default to privileged: true + runAsUser: 0, conflicting with AICR's narrow override (allowPrivilegeEscalation: false, capabilities.drop: [ALL], runAsNonRoot: false).

Live-cluster testing on EKS aicr2 (see Testing below) shows the narrow posture continues to work on AICR's primary target — the device-plugin binary doesn't actually require elevated privileges on Ubuntu 24.04 / kernel 6.14 / p5. This unlocks a third path beyond "accept privileged: true" or "stay on v0.5.3": bump the chart, keep the narrow override, get 23 patches of upstream fixes without expanding the security blast radius.

Fixes: follow-up #4 from #698 (aws-efa v0.5.3 → v0.5.26)
Related: #698, #715

Type of Change

  • Bug fix / enhancement (component version refresh)
  • Build/CI/tooling (regenerated BOM doc)

Component(s) Affected

  • Recipe engine / data (recipes/registry.yaml, recipes/components/aws-efa/values.yaml)
  • Docs (docs/user/container-images.md regenerated via make bom-docs)

Implementation Notes

Three changes only:

  1. recipes/registry.yamldefaultVersion: v0.5.3v0.5.26
  2. recipes/components/aws-efa/values.yamlimage.tag: v0.5.3v0.5.18 (chart's bundled appVersion at v0.5.26), plus a long-form comment documenting the intentional divergence from upstream's privileged: true default
  3. docs/user/container-images.md — regenerated via make bom-docs

No change to the securityContext block itself — AICR's existing override is preserved verbatim:

securityContext:
  allowPrivilegeEscalation: false
  capabilities:
    drop:
      - ALL
  runAsNonRoot: false

Why upstream went privileged. Starting in chart v0.5.4 AWS switched the daemonset to privileged: true + runAsUser: 0 to cover the full matrix of EFA driver / kernel / OS combinations they support. Their default optimizes for "works everywhere" at the cost of blast radius. AICR optimizes for the narrower OS surface we actually ship, so we can keep narrow caps and still get the fixes.

Revisit triggers documented inline in the values file:

  • Plugin logs return operation not permitted on a future EFA driver version
  • AICR adds an AL2023 / Bottlerocket / GB200 GPU recipe (the test below only covers Ubuntu)

Testing

make qualify    # all Go unit tests + 20/20 chainsaw + lint + scan: green
make bom-docs   # docs/user/container-images.md regenerated cleanly

Live cluster regression test on EKS aicr2 (account 615299774277, us-east-1, EKS 1.35):

Baseline (chart v0.5.3):

  • p5.48xlarge GPU node (ip-10-0-185-161.ec2.internal), Ubuntu 24.04.4, kernel 6.14.0-1018-aws
  • DaemonSet aws-efa-k8s-device-plugin 1/1 Running for 13d under AICR's narrow securityContext
  • Node advertising vpc.amazonaws.com/efa: 32 in both Capacity and Allocatable

Test procedure:

helm upgrade aws-efa eks/aws-efa-k8s-device-plugin \
  --version v0.5.26 --reuse-values --set image.tag=v0.5.18 -n kube-system

Result:

  • New pod came up 1/1 Running in ~30s
  • kubectl get pod -o jsonpath='{.spec.containers[0].securityContext}' confirmed AICR's narrow override preserved: {"allowPrivilegeEscalation":false,"capabilities":{"drop":["ALL"]},"runAsNonRoot":false} (no privileged: true)
  • Pod logs showed clean EFA enumeration without operation not permitted:
    EfaDeviceTopology initialized successfully with 8 GPUs and 32 EFAs
    GPU topology initialized successfully.
    Starting to serve on /var/lib/kubelet/device-plugins/aws-efa-device-plugin.sock
    Registered device plugin with Kubelet
    
  • Node continued advertising vpc.amazonaws.com/efa: 32 in both Capacity and Allocatable

Rolled back via helm rollback aws-efa 1 to restore v0.5.3 — cluster returned to baseline state.

Test scope caveats (also documented in the values file comment):

  • Validated only on Ubuntu 24.04 / kernel 6.14 / p5. AICR ships no *-eks-amazonlinux-* or Bottlerocket+EFA GPU recipe today, so a broader OS matrix wasn't tested.
  • Test exercised plugin enumeration + Kubelet registration. Actual workload EFA usage (a pod requesting vpc.amazonaws.com/efa: N and doing RDMA traffic) was not validated in this PR — that path uses the workload's securityContext, not the plugin's, so it's out of scope for the chart bump.
  • If/when an AL2023 EKS GPU recipe is added, that PR's review should re-validate aws-efa's narrow-cap posture on AL2023 and either confirm the override holds or accept upstream's privileged: true default at that time.

Risk Assessment

  • Low — Single component bump, narrow blast radius. Live regression test on the actual deployment target with a documented rollback path. Existing AICR EKS bundles regenerated against this version will run the same securityContext as today, just with a newer binary that picks up 15 app-version increments of upstream fixes.

Rollout notes: Existing AICR-deployed EKS clusters running aws-efa v0.5.3 can helm-upgrade in place by regenerating the bundle and running deploy.sh. The DaemonSet rotates one pod per EFA-capable node (~30s downtime per node for that node's EFA resource advertisement; running pods that already hold EFA allocations are unaffected).

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality (n/a — version bump only)
  • I updated docs if user-facing behavior changed (docs/user/container-images.md regenerated; values-file comment added)
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 13, 2026

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: ddb4cdd5-b59e-4b21-a064-94419a27bad0

📥 Commits

Reviewing files that changed from the base of the PR and between a4b2147 and 8f39d75.

📒 Files selected for processing (7)
  • docs/user/container-images.md
  • pkg/bom/extract_test.go
  • recipes/components/aws-efa/values.yaml
  • recipes/overlays/eks.yaml
  • recipes/registry.yaml
  • tests/chainsaw/ai-conformance/cluster/assert-kube-system.yaml
  • tests/chainsaw/cli/cuj1-training/mock-snapshot.yaml

📝 Walkthrough

Walkthrough

This PR regenerates the container image inventory and updates AWS EFA from chart v0.5.3→v0.5.26 with its device plugin image tag v0.5.3→v0.5.18 (values.yaml), including an added securityContext comment and explicit fields. The EKS overlay and registry default were bumped to v0.5.26. Documentation was updated: BOM counts (22→24 components, 69→71 images), added slinky-slurm-operator and slinky-slurm-operator-crds, and refreshed kubeflow-trainer and network-operator image references.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely describes the main change: bumping aws-efa to v0.5.26 while maintaining AICR's hardened securityContext, which aligns with the primary objective.
Description check ✅ Passed The description is comprehensive and directly related to the changeset, covering motivation, implementation details, testing, and risk assessment for the aws-efa version bump.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

  • Generate code and open pull requests
  • Plan features and break down work
  • Investigate incidents and troubleshoot customer tickets together
  • Automate recurring tasks and respond to alerts with triggers
  • Summarize progress and report instantly

Built for teams:

  • Shared memory across your entire org—no repeating context
  • Per-thread sandboxes to safely plan and execute work
  • Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started


Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@recipes/registry.yaml`:
- Line 145: defaultVersion is set to a non-existent chart version v0.5.26;
change it to v0.5.25. Open the registry entry containing the defaultVersion key
(look for the line "defaultVersion: v0.5.26") and replace the value v0.5.26 with
v0.5.25 so the registry points to the verified upstream chart release.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 6d3b6502-d8be-4718-9938-ab571b0ac337

📥 Commits

Reviewing files that changed from the base of the PR and between 6dfbb5d and c641209.

📒 Files selected for processing (3)
  • docs/user/container-images.md
  • recipes/components/aws-efa/values.yaml
  • recipes/registry.yaml

Comment thread recipes/registry.yaml
Copy link
Copy Markdown
Member

@mchmarny mchmarny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right call on the divergence — keeping the narrow securityContext is the correct posture for AICR's EKS surface, and the values-file comment is one of the better divergence rationales I've seen. CI is fully green (all Tier 2 e2e + KWOK + grype + ClamAV).

One real concern: the live cluster test used helm upgrade --reuse-values, which bypasses chart-default merging — so the new explicit privileged: false and runAsUser: null keys this PR adds weren't actually exercised on cluster. Worth a fresh-install validation before merge since that's the path AICR users actually take. Other notes are minor (BOM doc carries unrelated drift; runAsUser: null literal handling worth asserting in health check).

Comment thread recipes/components/aws-efa/values.yaml Outdated
Comment thread recipes/components/aws-efa/values.yaml
Comment thread recipes/components/aws-efa/values.yaml
Comment thread docs/user/container-images.md
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/chainsaw/ai-conformance/cluster/assert-kube-system.yaml (1)

22-31: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add conformance assertions for the hardened container securityContext.

This check currently proves only DaemonSet readiness; it won’t catch regressions back to privileged defaults. Please assert privileged: false, allowPrivilegeEscalation: false, capabilities.drop: [ALL], and runAsUser unset semantics in the DaemonSet template.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/chainsaw/ai-conformance/cluster/assert-kube-system.yaml` around lines
22 - 31, Add conformance assertions to the aws-efa-k8s-device-plugin DaemonSet
template to ensure the hardened container securityContext is enforced: assert
that securityContext.privileged is false,
securityContext.allowPrivilegeEscalation is false,
securityContext.capabilities.drop includes "ALL", and that runAsUser is not set
(or explicitly tests the unset semantics) in the DaemonSet pod spec/template for
the DaemonSet named aws-efa-k8s-device-plugin in namespace kube-system.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@tests/chainsaw/ai-conformance/cluster/assert-kube-system.yaml`:
- Around line 22-31: Add conformance assertions to the aws-efa-k8s-device-plugin
DaemonSet template to ensure the hardened container securityContext is enforced:
assert that securityContext.privileged is false,
securityContext.allowPrivilegeEscalation is false,
securityContext.capabilities.drop includes "ALL", and that runAsUser is not set
(or explicitly tests the unset semantics) in the DaemonSet pod spec/template for
the DaemonSet named aws-efa-k8s-device-plugin in namespace kube-system.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: f0116c7a-2451-4328-a682-f667ef332dfc

📥 Commits

Reviewing files that changed from the base of the PR and between 637ae02 and a4b2147.

📒 Files selected for processing (7)
  • docs/user/container-images.md
  • pkg/bom/extract_test.go
  • recipes/components/aws-efa/values.yaml
  • recipes/overlays/eks.yaml
  • recipes/registry.yaml
  • tests/chainsaw/ai-conformance/cluster/assert-kube-system.yaml
  • tests/chainsaw/cli/cuj1-training/mock-snapshot.yaml

…ityContext

Bumps the EFA device-plugin chart from v0.5.3 (2023) through 23 patch
releases to v0.5.26 / app v0.5.18 (current upstream latest). Picks up
upstream bug fixes, dependency bumps, and broader EFA-capable instance
type coverage in `supportedInstanceLabels`.

AICR intentionally diverges from the upstream chart's `privileged: true`
default. Starting in v0.5.4 AWS switched the device-plugin DaemonSet to
fully privileged + runAsUser:0 to cover the full matrix of EFA driver /
kernel / OS combinations they ship. AICR keeps the narrow posture
(allowPrivilegeEscalation: false, capabilities.drop: [ALL],
runAsNonRoot: false, plus explicit privileged: false and runAsUser:
null) because the device-plugin binary does not actually require
elevated privileges on AICR's primary EKS target.

Critical correctness fix vs initial patch: Helm's value-merge layer
deep-merges the user-supplied `securityContext` map onto the chart's
defaults, so any field omitted in AICR's values inherits the chart's
`privileged: true` and `runAsUser: 0`. The narrow override must
therefore set those fields explicitly (privileged: false, runAsUser:
null), not just rely on omission. Verified with `helm template` and a
live deploy on aicr2: the rendered DaemonSet container securityContext
is now `{"allowPrivilegeEscalation":false,"capabilities":{"drop":
["ALL"]},"privileged":false,"runAsNonRoot":false}` — privileged: true
is explicitly overridden, runAsUser: 0 is absent.

Validated on EKS cluster aicr2 (account 615299774277, us-east-1, EKS
1.35) with the following profile:
- p5.48xlarge GPU node, Ubuntu 24.04, kernel 6.14.0-1018-aws
- EFA driver baked into the EKS AMI

Test procedure: `helm upgrade aws-efa eks/aws-efa-k8s-device-plugin
--version v0.5.26 -f recipes/components/aws-efa/values.yaml ...`. The
new DaemonSet pod came up 1/1 Running in ~23s under the explicit narrow
securityContext (no privileged, no runAsUser:0). Plugin logs showed
clean enumeration ("EfaDeviceTopology initialized successfully with 8
GPUs and 32 EFAs", "Registered device plugin with Kubelet"), and the
node continued to advertise `vpc.amazonaws.com/efa: 32` in both
Capacity and Allocatable. Rolled back to v0.5.3 to leave the cluster
in its baseline state.

Also bumps the EKS overlay's aws-efa pin in `recipes/overlays/eks.yaml`
from v0.5.3 to v0.5.26. `ApplyRegistryDefaults` only fills `version`
when empty, so the registry-level bump alone would not affect
generated EKS recipes — the overlay pin takes precedence. Surfaced by
Codex review.

A long-form comment in `recipes/components/aws-efa/values.yaml`
documents the divergence, explains why explicit privileged: false /
runAsUser: null are required (Helm value-merge semantics), names the
test profile, and lists the revisit triggers (operation-not-permitted
in plugin logs, new AL2023/Bottlerocket/GB200 GPU recipes added
without re-validation).

Scope caveat: validated only for the Ubuntu / kernel 6.14 / p5 target.
AICR ships no `*-eks-amazonlinux-*` or Bottlerocket+EFA GPU recipe
today, so a broader matrix wasn't tested. If/when an AL2023 EKS GPU
recipe is added, the new recipe's PR should re-validate the EFA
posture on that combination and either confirm the narrow override
holds or accept upstream's `privileged: true` default at that time.

Closes follow-up NVIDIA#4 from issue NVIDIA#698.

Validation:
- `make qualify`: all Go unit tests pass with race detector, 20/20
  chainsaw tests pass, golangci-lint + yamllint clean, vulnerability
  scan + license headers OK
- `make bom-docs`: regenerated docs/user/container-images.md;
  aws-efa entry updated to chart v0.5.26 / image v0.5.18
- `helm template` against the rendered chart confirms securityContext
  is narrow with privileged: false explicit
- Live cluster regression test on aicr2 (see above)
@yuanchen8911
Copy link
Copy Markdown
Contributor Author

Heads up on the BOM diff Mark flagged: I've split the catch-up out into #878 (pure make bom-docs regen — slinky-slurm BOM rows that #866 missed, plus upstream pytorch + busybox image drift). Will rebase this PR after #878 merges; the BOM diff here will then be just aws-efa rows. Both PRs CI green.

yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 14, 2026
Two coupled changes:

1. Pure BOM regen of docs/user/container-images.md via `make bom-docs`
   to reflect current registry + chart state. No code changes, no
   registry pins moved; this picks up drift accumulated since main's
   last BOM commit:
   - slinky-slurm-operator and slinky-slurm-operator-crds BOM rows
     added (registry entries landed in NVIDIA#866 without a corresponding
     BOM regen)
   - kubeflow-trainer's bundled pytorch image bumped upstream
   - busybox bumped upstream in some chart's rendered templates

2. Add the BOM-regen rule to contributor and agent docs, surfaced by
   the same NVIDIA#872 review that flagged the drift:
   - .claude/CLAUDE.md (auto-synced to AGENTS.md) — short rule in the
     "Common Tasks" section: after any change to recipes/registry.yaml,
     a component values file, or a chart pin, run `make bom-docs` and
     commit the regenerated container-images.md in the same PR
   - docs/contributor/data.md — new "Regenerating the BOM" subsection
     under Component Configuration covering when to run it, how it can
     surface upstream chart drift, and how to handle unrelated drift
     (split into a catch-up PR vs land it together)

Surfaced as an unrelated diff in NVIDIA#872 (aws-efa v0.5.26 bump) when
`make bom-docs` was rerun against the rebased branch. Splitting this
catch-up out so NVIDIA#872's diff is scoped to aws-efa content.

Open question for follow-up (not blocking this PR): how did NVIDIA#866 merge
with a stale BOM? `make qualify` enforces BOM-up-to-dateness locally,
so either the CI check is path-conditional and skipped on that PR, or
there's another loophole. Worth a separate audit.

Test plan:
- `make qualify` passes (Go tests, golangci-lint + yamllint, BOM
  staleness check, agents-sync check, chart-pin verification, 20/20
  chainsaw, vulnerability scan, license headers)
- No code changes, no chart pins moved, no behavior change
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 14, 2026
Two coupled changes:

1. Pure BOM regen of docs/user/container-images.md via `make bom-docs`
   to reflect current registry + chart state. No code changes, no
   registry pins moved; this picks up drift accumulated since main's
   last BOM commit:
   - slinky-slurm-operator and slinky-slurm-operator-crds BOM rows
     added (registry entries landed in NVIDIA#866 without a corresponding
     BOM regen)
   - kubeflow-trainer's bundled pytorch image bumped upstream
   - busybox bumped upstream in some chart's rendered templates

2. Add the BOM-regen rule to contributor and agent docs, surfaced by
   the same NVIDIA#872 review that flagged the drift:
   - .claude/CLAUDE.md (auto-synced to AGENTS.md) — short rule in the
     "Common Tasks" section: after any change to recipes/registry.yaml,
     a component values file, or a chart pin, run `make bom-docs` and
     commit the regenerated container-images.md in the same PR
   - docs/contributor/data.md — new "Regenerating the BOM" subsection
     under Component Configuration covering when to run it, how it can
     surface upstream chart drift, and how to handle unrelated drift
     (split into a catch-up PR vs land it together)

Surfaced as an unrelated diff in NVIDIA#872 (aws-efa v0.5.26 bump) when
`make bom-docs` was rerun against the rebased branch. Splitting this
catch-up out so NVIDIA#872's diff is scoped to aws-efa content.

Open question for follow-up (not blocking this PR): how did NVIDIA#866 merge
with a stale BOM? `make qualify` enforces BOM-up-to-dateness locally,
so either the CI check is path-conditional and skipped on that PR, or
there's another loophole. Worth a separate audit.

Test plan:
- `make qualify` passes (Go tests, golangci-lint + yamllint, BOM
  staleness check, agents-sync check, chart-pin verification, 20/20
  chainsaw, vulnerability scan, license headers)
- No code changes, no chart pins moved, no behavior change
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 14, 2026
Two coupled changes:

1. Pure BOM regen of docs/user/container-images.md via `make bom-docs`
   to reflect current registry + chart state. No code changes, no
   registry pins moved; this picks up drift accumulated since main's
   last BOM commit:
   - slinky-slurm-operator and slinky-slurm-operator-crds BOM rows
     added (registry entries landed in NVIDIA#866 without a corresponding
     BOM regen)
   - kubeflow-trainer's bundled pytorch image bumped upstream
   - busybox bumped upstream in some chart's rendered templates

2. Add the BOM-regen rule to contributor and agent docs, surfaced by
   the same NVIDIA#872 review that flagged the drift:
   - .claude/CLAUDE.md (auto-synced to AGENTS.md) — short rule in the
     "Common Tasks" section: after any change to recipes/registry.yaml,
     a component values file, or a chart pin, run `make bom-docs` and
     commit the regenerated container-images.md in the same PR
   - docs/contributor/data.md — new "Regenerating the BOM" subsection
     under Component Configuration covering when to run it, how it can
     surface upstream chart drift, and how to handle unrelated drift
     (split into a catch-up PR vs land it together)

Surfaced as an unrelated diff in NVIDIA#872 (aws-efa v0.5.26 bump) when
`make bom-docs` was rerun against the rebased branch. Splitting this
catch-up out so NVIDIA#872's diff is scoped to aws-efa content.

Open question for follow-up (not blocking this PR): how did NVIDIA#866 merge
with a stale BOM? `make qualify` enforces BOM-up-to-dateness locally,
so either the CI check is path-conditional and skipped on that PR, or
there's another loophole. Worth a separate audit.

Test plan:
- `make qualify` passes (Go tests, golangci-lint + yamllint, BOM
  staleness check, agents-sync check, chart-pin verification, 20/20
  chainsaw, vulnerability scan, license headers)
- No code changes, no chart pins moved, no behavior change
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 14, 2026
Two coupled changes:

1. Pure BOM regen of docs/user/container-images.md via `make bom-docs`
   to reflect current registry + chart state. No code changes, no
   registry pins moved; this picks up drift accumulated since main's
   last BOM commit:
   - slinky-slurm-operator and slinky-slurm-operator-crds BOM rows
     added (registry entries landed in NVIDIA#866 without a corresponding
     BOM regen)
   - kubeflow-trainer's bundled pytorch image bumped upstream
   - busybox bumped upstream in some chart's rendered templates

2. Add the BOM-regen rule to contributor and agent docs, surfaced by
   the same NVIDIA#872 review that flagged the drift:
   - .claude/CLAUDE.md (auto-synced to AGENTS.md) — short rule in the
     "Common Tasks" section: after any change to recipes/registry.yaml,
     a component values file, or a chart pin, run `make bom-docs` and
     commit the regenerated container-images.md in the same PR
   - docs/contributor/data.md — new "Regenerating the BOM" subsection
     under Component Configuration covering when to run it, how it can
     surface upstream chart drift, and how to handle unrelated drift
     (split into a catch-up PR vs land it together)
   - Makefile — corrected the bom-check target's help text from
     "CI gate, opt-in locally" to "opt-in; not wired into qualify/
     lint/merge gate" to match the actual enforcement story

Surfaced as an unrelated diff in NVIDIA#872 (aws-efa v0.5.26 bump) when
`make bom-docs` was rerun against the rebased branch. Splitting this
catch-up out so NVIDIA#872's diff is scoped to aws-efa content.

Honest enforcement story: an earlier revision of these docs (and an
earlier revision of this commit message) claimed `make qualify` and
CI catch stale BOMs. That was incorrect — verified directly against
the Makefile:
- `qualify` depends on test-coverage / lint / e2e / scan / license-
  check, none of which run `bom-check`
- `lint` runs `bom-pinning-check` (chart-pin verification per ADR-006),
  not `bom-check` (BOM doc freshness)
- The merge gate has no PR-time BOM-staleness check
The corrected docs say so explicitly. Wiring `bom-check` into the gate
is a desirable follow-up but intentionally out of scope here — it
would change CI behavior for every PR and likely block unrelated PRs
that happen to expose accumulated upstream drift, which deserves its
own discussion.

This also explains how NVIDIA#866 merged with a stale BOM: nothing in the
existing gate would have caught it.

Test plan:
- `make qualify` passes (Go tests, golangci-lint + yamllint, agents-
  sync check, chart-pin verification, 20/20 chainsaw, vulnerability
  scan, license headers)
- No code changes, no chart pins moved, no behavior change
@yuanchen8911 yuanchen8911 requested a review from mchmarny May 14, 2026 03:11
@mchmarny mchmarny enabled auto-merge (squash) May 14, 2026 12:12
@mchmarny mchmarny merged commit adddfe0 into NVIDIA:main May 14, 2026
90 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants