chore(recipes): bump aws-efa to v0.5.26 keeping AICR's hardened securityContext#872
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Enterprise Run ID: 📒 Files selected for processing (7)
📝 WalkthroughWalkthroughThis PR regenerates the container image inventory and updates AWS EFA from chart v0.5.3→v0.5.26 with its device plugin image tag v0.5.3→v0.5.18 (values.yaml), including an added securityContext comment and explicit fields. The EKS overlay and registry default were bumped to v0.5.26. Documentation was updated: BOM counts (22→24 components, 69→71 images), added slinky-slurm-operator and slinky-slurm-operator-crds, and refreshed kubeflow-trainer and network-operator image references. Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@recipes/registry.yaml`:
- Line 145: defaultVersion is set to a non-existent chart version v0.5.26;
change it to v0.5.25. Open the registry entry containing the defaultVersion key
(look for the line "defaultVersion: v0.5.26") and replace the value v0.5.26 with
v0.5.25 so the registry points to the verified upstream chart release.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 6d3b6502-d8be-4718-9938-ab571b0ac337
📒 Files selected for processing (3)
docs/user/container-images.mdrecipes/components/aws-efa/values.yamlrecipes/registry.yaml
c641209 to
637ae02
Compare
mchmarny
left a comment
There was a problem hiding this comment.
Right call on the divergence — keeping the narrow securityContext is the correct posture for AICR's EKS surface, and the values-file comment is one of the better divergence rationales I've seen. CI is fully green (all Tier 2 e2e + KWOK + grype + ClamAV).
One real concern: the live cluster test used helm upgrade --reuse-values, which bypasses chart-default merging — so the new explicit privileged: false and runAsUser: null keys this PR adds weren't actually exercised on cluster. Worth a fresh-install validation before merge since that's the path AICR users actually take. Other notes are minor (BOM doc carries unrelated drift; runAsUser: null literal handling worth asserting in health check).
637ae02 to
a4b2147
Compare
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/chainsaw/ai-conformance/cluster/assert-kube-system.yaml (1)
22-31:⚠️ Potential issue | 🟠 Major | ⚡ Quick winAdd conformance assertions for the hardened container
securityContext.This check currently proves only DaemonSet readiness; it won’t catch regressions back to privileged defaults. Please assert
privileged: false,allowPrivilegeEscalation: false,capabilities.drop: [ALL], andrunAsUserunset semantics in the DaemonSet template.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/chainsaw/ai-conformance/cluster/assert-kube-system.yaml` around lines 22 - 31, Add conformance assertions to the aws-efa-k8s-device-plugin DaemonSet template to ensure the hardened container securityContext is enforced: assert that securityContext.privileged is false, securityContext.allowPrivilegeEscalation is false, securityContext.capabilities.drop includes "ALL", and that runAsUser is not set (or explicitly tests the unset semantics) in the DaemonSet pod spec/template for the DaemonSet named aws-efa-k8s-device-plugin in namespace kube-system.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@tests/chainsaw/ai-conformance/cluster/assert-kube-system.yaml`:
- Around line 22-31: Add conformance assertions to the aws-efa-k8s-device-plugin
DaemonSet template to ensure the hardened container securityContext is enforced:
assert that securityContext.privileged is false,
securityContext.allowPrivilegeEscalation is false,
securityContext.capabilities.drop includes "ALL", and that runAsUser is not set
(or explicitly tests the unset semantics) in the DaemonSet pod spec/template for
the DaemonSet named aws-efa-k8s-device-plugin in namespace kube-system.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: f0116c7a-2451-4328-a682-f667ef332dfc
📒 Files selected for processing (7)
docs/user/container-images.mdpkg/bom/extract_test.gorecipes/components/aws-efa/values.yamlrecipes/overlays/eks.yamlrecipes/registry.yamltests/chainsaw/ai-conformance/cluster/assert-kube-system.yamltests/chainsaw/cli/cuj1-training/mock-snapshot.yaml
…ityContext
Bumps the EFA device-plugin chart from v0.5.3 (2023) through 23 patch
releases to v0.5.26 / app v0.5.18 (current upstream latest). Picks up
upstream bug fixes, dependency bumps, and broader EFA-capable instance
type coverage in `supportedInstanceLabels`.
AICR intentionally diverges from the upstream chart's `privileged: true`
default. Starting in v0.5.4 AWS switched the device-plugin DaemonSet to
fully privileged + runAsUser:0 to cover the full matrix of EFA driver /
kernel / OS combinations they ship. AICR keeps the narrow posture
(allowPrivilegeEscalation: false, capabilities.drop: [ALL],
runAsNonRoot: false, plus explicit privileged: false and runAsUser:
null) because the device-plugin binary does not actually require
elevated privileges on AICR's primary EKS target.
Critical correctness fix vs initial patch: Helm's value-merge layer
deep-merges the user-supplied `securityContext` map onto the chart's
defaults, so any field omitted in AICR's values inherits the chart's
`privileged: true` and `runAsUser: 0`. The narrow override must
therefore set those fields explicitly (privileged: false, runAsUser:
null), not just rely on omission. Verified with `helm template` and a
live deploy on aicr2: the rendered DaemonSet container securityContext
is now `{"allowPrivilegeEscalation":false,"capabilities":{"drop":
["ALL"]},"privileged":false,"runAsNonRoot":false}` — privileged: true
is explicitly overridden, runAsUser: 0 is absent.
Validated on EKS cluster aicr2 (account 615299774277, us-east-1, EKS
1.35) with the following profile:
- p5.48xlarge GPU node, Ubuntu 24.04, kernel 6.14.0-1018-aws
- EFA driver baked into the EKS AMI
Test procedure: `helm upgrade aws-efa eks/aws-efa-k8s-device-plugin
--version v0.5.26 -f recipes/components/aws-efa/values.yaml ...`. The
new DaemonSet pod came up 1/1 Running in ~23s under the explicit narrow
securityContext (no privileged, no runAsUser:0). Plugin logs showed
clean enumeration ("EfaDeviceTopology initialized successfully with 8
GPUs and 32 EFAs", "Registered device plugin with Kubelet"), and the
node continued to advertise `vpc.amazonaws.com/efa: 32` in both
Capacity and Allocatable. Rolled back to v0.5.3 to leave the cluster
in its baseline state.
Also bumps the EKS overlay's aws-efa pin in `recipes/overlays/eks.yaml`
from v0.5.3 to v0.5.26. `ApplyRegistryDefaults` only fills `version`
when empty, so the registry-level bump alone would not affect
generated EKS recipes — the overlay pin takes precedence. Surfaced by
Codex review.
A long-form comment in `recipes/components/aws-efa/values.yaml`
documents the divergence, explains why explicit privileged: false /
runAsUser: null are required (Helm value-merge semantics), names the
test profile, and lists the revisit triggers (operation-not-permitted
in plugin logs, new AL2023/Bottlerocket/GB200 GPU recipes added
without re-validation).
Scope caveat: validated only for the Ubuntu / kernel 6.14 / p5 target.
AICR ships no `*-eks-amazonlinux-*` or Bottlerocket+EFA GPU recipe
today, so a broader matrix wasn't tested. If/when an AL2023 EKS GPU
recipe is added, the new recipe's PR should re-validate the EFA
posture on that combination and either confirm the narrow override
holds or accept upstream's `privileged: true` default at that time.
Closes follow-up NVIDIA#4 from issue NVIDIA#698.
Validation:
- `make qualify`: all Go unit tests pass with race detector, 20/20
chainsaw tests pass, golangci-lint + yamllint clean, vulnerability
scan + license headers OK
- `make bom-docs`: regenerated docs/user/container-images.md;
aws-efa entry updated to chart v0.5.26 / image v0.5.18
- `helm template` against the rendered chart confirms securityContext
is narrow with privileged: false explicit
- Live cluster regression test on aicr2 (see above)
a4b2147 to
8f39d75
Compare
Two coupled changes:
1. Pure BOM regen of docs/user/container-images.md via `make bom-docs`
to reflect current registry + chart state. No code changes, no
registry pins moved; this picks up drift accumulated since main's
last BOM commit:
- slinky-slurm-operator and slinky-slurm-operator-crds BOM rows
added (registry entries landed in NVIDIA#866 without a corresponding
BOM regen)
- kubeflow-trainer's bundled pytorch image bumped upstream
- busybox bumped upstream in some chart's rendered templates
2. Add the BOM-regen rule to contributor and agent docs, surfaced by
the same NVIDIA#872 review that flagged the drift:
- .claude/CLAUDE.md (auto-synced to AGENTS.md) — short rule in the
"Common Tasks" section: after any change to recipes/registry.yaml,
a component values file, or a chart pin, run `make bom-docs` and
commit the regenerated container-images.md in the same PR
- docs/contributor/data.md — new "Regenerating the BOM" subsection
under Component Configuration covering when to run it, how it can
surface upstream chart drift, and how to handle unrelated drift
(split into a catch-up PR vs land it together)
Surfaced as an unrelated diff in NVIDIA#872 (aws-efa v0.5.26 bump) when
`make bom-docs` was rerun against the rebased branch. Splitting this
catch-up out so NVIDIA#872's diff is scoped to aws-efa content.
Open question for follow-up (not blocking this PR): how did NVIDIA#866 merge
with a stale BOM? `make qualify` enforces BOM-up-to-dateness locally,
so either the CI check is path-conditional and skipped on that PR, or
there's another loophole. Worth a separate audit.
Test plan:
- `make qualify` passes (Go tests, golangci-lint + yamllint, BOM
staleness check, agents-sync check, chart-pin verification, 20/20
chainsaw, vulnerability scan, license headers)
- No code changes, no chart pins moved, no behavior change
Two coupled changes:
1. Pure BOM regen of docs/user/container-images.md via `make bom-docs`
to reflect current registry + chart state. No code changes, no
registry pins moved; this picks up drift accumulated since main's
last BOM commit:
- slinky-slurm-operator and slinky-slurm-operator-crds BOM rows
added (registry entries landed in NVIDIA#866 without a corresponding
BOM regen)
- kubeflow-trainer's bundled pytorch image bumped upstream
- busybox bumped upstream in some chart's rendered templates
2. Add the BOM-regen rule to contributor and agent docs, surfaced by
the same NVIDIA#872 review that flagged the drift:
- .claude/CLAUDE.md (auto-synced to AGENTS.md) — short rule in the
"Common Tasks" section: after any change to recipes/registry.yaml,
a component values file, or a chart pin, run `make bom-docs` and
commit the regenerated container-images.md in the same PR
- docs/contributor/data.md — new "Regenerating the BOM" subsection
under Component Configuration covering when to run it, how it can
surface upstream chart drift, and how to handle unrelated drift
(split into a catch-up PR vs land it together)
Surfaced as an unrelated diff in NVIDIA#872 (aws-efa v0.5.26 bump) when
`make bom-docs` was rerun against the rebased branch. Splitting this
catch-up out so NVIDIA#872's diff is scoped to aws-efa content.
Open question for follow-up (not blocking this PR): how did NVIDIA#866 merge
with a stale BOM? `make qualify` enforces BOM-up-to-dateness locally,
so either the CI check is path-conditional and skipped on that PR, or
there's another loophole. Worth a separate audit.
Test plan:
- `make qualify` passes (Go tests, golangci-lint + yamllint, BOM
staleness check, agents-sync check, chart-pin verification, 20/20
chainsaw, vulnerability scan, license headers)
- No code changes, no chart pins moved, no behavior change
Two coupled changes:
1. Pure BOM regen of docs/user/container-images.md via `make bom-docs`
to reflect current registry + chart state. No code changes, no
registry pins moved; this picks up drift accumulated since main's
last BOM commit:
- slinky-slurm-operator and slinky-slurm-operator-crds BOM rows
added (registry entries landed in NVIDIA#866 without a corresponding
BOM regen)
- kubeflow-trainer's bundled pytorch image bumped upstream
- busybox bumped upstream in some chart's rendered templates
2. Add the BOM-regen rule to contributor and agent docs, surfaced by
the same NVIDIA#872 review that flagged the drift:
- .claude/CLAUDE.md (auto-synced to AGENTS.md) — short rule in the
"Common Tasks" section: after any change to recipes/registry.yaml,
a component values file, or a chart pin, run `make bom-docs` and
commit the regenerated container-images.md in the same PR
- docs/contributor/data.md — new "Regenerating the BOM" subsection
under Component Configuration covering when to run it, how it can
surface upstream chart drift, and how to handle unrelated drift
(split into a catch-up PR vs land it together)
Surfaced as an unrelated diff in NVIDIA#872 (aws-efa v0.5.26 bump) when
`make bom-docs` was rerun against the rebased branch. Splitting this
catch-up out so NVIDIA#872's diff is scoped to aws-efa content.
Open question for follow-up (not blocking this PR): how did NVIDIA#866 merge
with a stale BOM? `make qualify` enforces BOM-up-to-dateness locally,
so either the CI check is path-conditional and skipped on that PR, or
there's another loophole. Worth a separate audit.
Test plan:
- `make qualify` passes (Go tests, golangci-lint + yamllint, BOM
staleness check, agents-sync check, chart-pin verification, 20/20
chainsaw, vulnerability scan, license headers)
- No code changes, no chart pins moved, no behavior change
Two coupled changes:
1. Pure BOM regen of docs/user/container-images.md via `make bom-docs`
to reflect current registry + chart state. No code changes, no
registry pins moved; this picks up drift accumulated since main's
last BOM commit:
- slinky-slurm-operator and slinky-slurm-operator-crds BOM rows
added (registry entries landed in NVIDIA#866 without a corresponding
BOM regen)
- kubeflow-trainer's bundled pytorch image bumped upstream
- busybox bumped upstream in some chart's rendered templates
2. Add the BOM-regen rule to contributor and agent docs, surfaced by
the same NVIDIA#872 review that flagged the drift:
- .claude/CLAUDE.md (auto-synced to AGENTS.md) — short rule in the
"Common Tasks" section: after any change to recipes/registry.yaml,
a component values file, or a chart pin, run `make bom-docs` and
commit the regenerated container-images.md in the same PR
- docs/contributor/data.md — new "Regenerating the BOM" subsection
under Component Configuration covering when to run it, how it can
surface upstream chart drift, and how to handle unrelated drift
(split into a catch-up PR vs land it together)
- Makefile — corrected the bom-check target's help text from
"CI gate, opt-in locally" to "opt-in; not wired into qualify/
lint/merge gate" to match the actual enforcement story
Surfaced as an unrelated diff in NVIDIA#872 (aws-efa v0.5.26 bump) when
`make bom-docs` was rerun against the rebased branch. Splitting this
catch-up out so NVIDIA#872's diff is scoped to aws-efa content.
Honest enforcement story: an earlier revision of these docs (and an
earlier revision of this commit message) claimed `make qualify` and
CI catch stale BOMs. That was incorrect — verified directly against
the Makefile:
- `qualify` depends on test-coverage / lint / e2e / scan / license-
check, none of which run `bom-check`
- `lint` runs `bom-pinning-check` (chart-pin verification per ADR-006),
not `bom-check` (BOM doc freshness)
- The merge gate has no PR-time BOM-staleness check
The corrected docs say so explicitly. Wiring `bom-check` into the gate
is a desirable follow-up but intentionally out of scope here — it
would change CI behavior for every PR and likely block unrelated PRs
that happen to expose accumulated upstream drift, which deserves its
own discussion.
This also explains how NVIDIA#866 merged with a stale BOM: nothing in the
existing gate would have caught it.
Test plan:
- `make qualify` passes (Go tests, golangci-lint + yamllint, agents-
sync check, chart-pin verification, 20/20 chainsaw, vulnerability
scan, license headers)
- No code changes, no chart pins moved, no behavior change
Summary
Bumps the EFA device-plugin chart from
v0.5.3(2023) →v0.5.26/ appv0.5.18(current upstream latest). AICR keeps its existing hardenedsecurityContextoverride rather than accepting the chart's new upstream-default ofprivileged: true.Motivation / Context
The audit in #698 flagged aws-efa as blocked on a security review because chart
v0.5.4+switched the device-plugin DaemonSet's default toprivileged: true+runAsUser: 0, conflicting with AICR's narrow override (allowPrivilegeEscalation: false,capabilities.drop: [ALL],runAsNonRoot: false).Live-cluster testing on EKS
aicr2(see Testing below) shows the narrow posture continues to work on AICR's primary target — the device-plugin binary doesn't actually require elevated privileges on Ubuntu 24.04 / kernel 6.14 / p5. This unlocks a third path beyond "acceptprivileged: true" or "stay on v0.5.3": bump the chart, keep the narrow override, get 23 patches of upstream fixes without expanding the security blast radius.Fixes: follow-up #4 from #698 (aws-efa
v0.5.3 → v0.5.26)Related: #698, #715
Type of Change
Component(s) Affected
recipes/registry.yaml,recipes/components/aws-efa/values.yaml)docs/user/container-images.mdregenerated viamake bom-docs)Implementation Notes
Three changes only:
recipes/registry.yaml—defaultVersion: v0.5.3→v0.5.26recipes/components/aws-efa/values.yaml—image.tag: v0.5.3→v0.5.18(chart's bundled appVersion at v0.5.26), plus a long-form comment documenting the intentional divergence from upstream'sprivileged: truedefaultdocs/user/container-images.md— regenerated viamake bom-docsNo change to the securityContext block itself — AICR's existing override is preserved verbatim:
Why upstream went privileged. Starting in chart
v0.5.4AWS switched the daemonset toprivileged: true+runAsUser: 0to cover the full matrix of EFA driver / kernel / OS combinations they support. Their default optimizes for "works everywhere" at the cost of blast radius. AICR optimizes for the narrower OS surface we actually ship, so we can keep narrow caps and still get the fixes.Revisit triggers documented inline in the values file:
operation not permittedon a future EFA driver versionTesting
Live cluster regression test on EKS
aicr2(account 615299774277, us-east-1, EKS 1.35):Baseline (chart
v0.5.3):ip-10-0-185-161.ec2.internal), Ubuntu 24.04.4, kernel6.14.0-1018-awsaws-efa-k8s-device-plugin1/1 Running for 13d under AICR's narrowsecurityContextvpc.amazonaws.com/efa: 32in both Capacity and AllocatableTest procedure:
Result:
kubectl get pod -o jsonpath='{.spec.containers[0].securityContext}'confirmed AICR's narrow override preserved:{"allowPrivilegeEscalation":false,"capabilities":{"drop":["ALL"]},"runAsNonRoot":false}(noprivileged: true)operation not permitted:vpc.amazonaws.com/efa: 32in both Capacity and AllocatableRolled back via
helm rollback aws-efa 1to restore v0.5.3 — cluster returned to baseline state.Test scope caveats (also documented in the values file comment):
*-eks-amazonlinux-*or Bottlerocket+EFA GPU recipe today, so a broader OS matrix wasn't tested.vpc.amazonaws.com/efa: Nand doing RDMA traffic) was not validated in this PR — that path uses the workload's securityContext, not the plugin's, so it's out of scope for the chart bump.privileged: truedefault at that time.Risk Assessment
Rollout notes: Existing AICR-deployed EKS clusters running
aws-efav0.5.3 can helm-upgrade in place by regenerating the bundle and runningdeploy.sh. The DaemonSet rotates one pod per EFA-capable node (~30s downtime per node for that node's EFA resource advertisement; running pods that already hold EFA allocations are unaffected).Checklist
make testwith-race)make lint)docs/user/container-images.mdregenerated; values-file comment added)git commit -S)