Skip to content

recipe: add gatekeeper as optional component#821

Open
JaydipGabani wants to merge 3 commits into
NVIDIA:mainfrom
JaydipGabani:feat/gatekeeper-component
Open

recipe: add gatekeeper as optional component#821
JaydipGabani wants to merge 3 commits into
NVIDIA:mainfrom
JaydipGabani:feat/gatekeeper-component

Conversation

@JaydipGabani
Copy link
Copy Markdown

@JaydipGabani JaydipGabani commented May 8, 2026

Summary

Adds Gatekeeper as an optional AICR recipe component for clusters that need admission-controller support for AI workload guardrails.

Motivation / Context

The direct Slack discussion in #aicr started from the Gatekeeper maintainers asking whether defining AI workload Kubernetes policies belongs in AICR. The thread identified three possible integration shapes: policy-as-component, validation integration through gator policy and --no-cluster, and conformance evidence.

Reviewers in the thread asked for concrete AI-specific policy examples and for the distinction from AICR/Chainsaw checks. The discussion converged on workload-shape guardrails such as nvidia.com/gpu request/limit parity, memory request/limit parity for GPU pods, CPU request requirements, training shared-memory requirements, and optional max-GPU caps for shared clusters. The resulting split was that AICR and Chainsaw cover platform correctness and component health, while Gatekeeper and gator cover reusable workload guardrails through offline policy evaluation or admission-time enforcement.

This PR lands only the deployable Gatekeeper component slice. It does not ship policy bundles, run gator, connect Gatekeeper to AICR validation, or capture policy results as conformance evidence. The broader Slack-derived rationale and follow-up integration paths are tracked in #860.

Fixes: N/A
Related: #860
Related discussion: https://kubernetes.slack.com/archives/C0AQMPP1BK7/p1775874775990329

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • Build/CI/tooling

Component(s) Affected

  • CLI (cmd/aicr, pkg/cli)
  • API server (cmd/aicrd, pkg/api, pkg/server)
  • Recipe engine / data (pkg/recipe)
  • Bundlers (pkg/bundler, pkg/component/*)
  • Collectors / snapshotter (pkg/collector, pkg/snapshotter)
  • Validator (pkg/validator)
  • Core libraries (pkg/errors, pkg/k8s)
  • Docs/examples (docs/, examples/)
  • Other: New optional Gatekeeper component

Implementation Notes

  • Registers Gatekeeper as an optional Helm component in recipes/registry.yaml.
  • Pins the upstream Gatekeeper chart to 3.22.2 and installs it into gatekeeper-system when recipes opt in.
  • Wires only system-tier node selector and toleration paths because Gatekeeper is an admission controller, not a GPU workload.
  • Adds base Helm values for controller-manager and audit resources, disables mutation by default, and leaves the component out of base overlays.
  • Adds a Chainsaw health check that asserts both Gatekeeper deployments exist with available replicas and that no Gatekeeper pods are in unhealthy phases.
  • Defers policy bundle components, gator validation, AICR validator integration, and conformance evidence integration to follow-up work tracked in Epic: Track Gatekeeper integration for AI workload guardrails #860.

Testing

make qualify

Passed locally. This covers race-enabled unit tests, coverage enforcement, linting, Chainsaw CLI e2e tests, vulnerability scan, license checks, and repository-specific documentation checks.

Risk Assessment

  • Low - Isolated optional component, not referenced by base overlays, and easy to revert
  • Medium - Touches multiple components or has broader impact
  • High - Breaking change, affects critical paths, or complex rollout

Rollout notes: Gatekeeper remains opt-in. Existing recipes are unchanged unless they explicitly reference the component, and this PR does not install or enforce any workload policies.

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S) - GPG signing info

@JaydipGabani JaydipGabani requested a review from a team as a code owner May 8, 2026 17:29
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented May 8, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 8, 2026

Welcome to AICR, @JaydipGabani! Thanks for your first pull request.

Before review, please ensure:

  • All commits are signed off per the DCO
  • CI checks pass (tests, lint, security scan)
  • The PR description explains the why behind your changes

A maintainer will review this soon.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 8, 2026

Review Change Stack

📝 Walkthrough

Walkthrough

This PR adds Gatekeeper as an optional system-tier component: a registry entry (Helm chart defaults, defaultVersion 3.22.2, defaultNamespace gatekeeper-system, valueOverride key, and node-scheduling paths), a Helm values file with conservative defaults (replicas, resources, disableMutation: true, auditInterval: 60, constraintViolationsLimit: 20), and a Kyverno Chainsaw health-check that asserts both controller-manager and audit Deployments have available replicas and no Pods are in Pending/Failed/Unknown.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely summarizes the main change: adding Gatekeeper as an optional component to the project.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description check ✅ Passed The PR description comprehensively explains the addition of Gatekeeper as an optional component, detailing motivation, implementation, scope, and deferral of follow-up work.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

  • Generate code and open pull requests
  • Plan features and break down work
  • Investigate incidents and troubleshoot customer tickets together
  • Automate recurring tasks and respond to alerts with triggers
  • Summarize progress and report instantly

Built for teams:

  • Shared memory across your entire org—no repeating context
  • Per-thread sandboxes to safely plan and execute work
  • Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started


Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@recipes/components/gatekeeper/values.yaml`:
- Around line 52-55: Update the misleading comment: instead of referencing the
unrelated flag "gatekeeper:disableMutation=false", instruct users to re-enable
the PDB via the actual Helm value for this key (podDisruptionBudget.enabled) —
e.g., tell them to use "--set gatekeeper.podDisruptionBudget.enabled=true" (or
the equivalent overlay) so the comment correctly points to the
podDisruptionBudget.enabled setting.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 15dfa5e4-9d35-4c3a-856d-2fe6258734e2

📥 Commits

Reviewing files that changed from the base of the PR and between 2f22e2b and 9e2361a.

📒 Files selected for processing (3)
  • recipes/checks/gatekeeper/health-check.yaml
  • recipes/components/gatekeeper/values.yaml
  • recipes/registry.yaml

Comment thread recipes/components/gatekeeper/values.yaml Outdated
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
recipes/components/gatekeeper/values.yaml (1)

53-53: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix the Helm override example syntax for PDB re-enable.

Line 53 uses --set gatekeeper:podDisruptionBudget.enabled=true; Helm --set expects dot-separated nested paths. This example can fail to apply the intended value.

Suggested fix
-# Production overlays can re-enable via --set gatekeeper:podDisruptionBudget.enabled=true.
+# Production overlays can re-enable via --set gatekeeper.podDisruptionBudget.enabled=true.
What is the correct Helm --set syntax for nested values, and is "gatekeeper:podDisruptionBudget.enabled=true" valid compared to "gatekeeper.podDisruptionBudget.enabled=true"?
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@recipes/components/gatekeeper/values.yaml` at line 53, The Helm override
example uses invalid colon notation; change the example to use Helm's
dot-separated nested path syntax so the dashboard PDB value is set correctly
(replace "gatekeeper:podDisruptionBudget.enabled=true" with
"gatekeeper.podDisruptionBudget.enabled=true"), and update the comment in
values.yaml accordingly so users run the correct --set flag.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@recipes/components/gatekeeper/values.yaml`:
- Line 53: The Helm override example uses invalid colon notation; change the
example to use Helm's dot-separated nested path syntax so the dashboard PDB
value is set correctly (replace "gatekeeper:podDisruptionBudget.enabled=true"
with "gatekeeper.podDisruptionBudget.enabled=true"), and update the comment in
values.yaml accordingly so users run the correct --set flag.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 22d83453-7fc3-4806-a9fd-adf8d04b420b

📥 Commits

Reviewing files that changed from the base of the PR and between 9e2361a and ea24241.

📒 Files selected for processing (1)
  • recipes/components/gatekeeper/values.yaml

@mchmarny
Copy link
Copy Markdown
Member

mchmarny commented May 9, 2026

Hi @JaydipGabani — thanks for kicking this off and for the Slack discussion behind it.

Before we dig into the YAML, can we get the PR description filled in to match the PR template? Right now the Summary, Motivation / Context, and Implementation Notes sections are empty, the Type of Change boxes aren't checked, and there's no Fixes: issue reference — the only context is a Slack link, which isn't durable for future readers (search, audit, anyone outside the workspace).

Concretely, what would help reviewers most is an issue (filed in this repo) that captures the parts of the Slack thread that justify landing Gatekeeper in AICR — specifically:

  • What problem does this solve, and for whom? @yuanchen8911 and @lockwobr both asked for concrete AI-workload policy examples; your nvidia.com/gpu request==limit / memory==limit / training-shm examples from the thread are exactly the right framing and should live in the issue.
  • Why AICR? The thread converged on a useful split — AICR + Chainsaw cover platform correctness; Gatekeeper covers workload guardrails — and on three integration shapes (policy-as-component, validation integration, conformance evidence). Pinning down which shape this PR is delivering, and which are deferred, would give reviewers a frame for what's in scope here.
  • Why now / why this slice? @yuanchen8911 also flagged that gator (CLI tooling) isn't a deployable, only the policy bundle and Gatekeeper itself are — so this PR appears to be the "Gatekeeper as optional component" piece. Stating that explicitly avoids reviewers re-litigating the larger design in this PR.

Once an issue exists, please link it via Fixes: #N (or Related: #N if it's part of a larger initiative), fill in the rest of the template (Implementation Notes, Testing — make qualify, Component(s) Affected = Recipes), and we'll take a proper review pass. Happy to help shape the issue if useful.

Also worth noting that the branch is mergeable_state: behind — a rebase onto current main would be good to do alongside the description fill-in.

Copy link
Copy Markdown
Member

@mchmarny mchmarny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs context

@JaydipGabani
Copy link
Copy Markdown
Author

JaydipGabani commented May 13, 2026

Addressed the durable-context and template gaps:

  • Filed Epic: Track Gatekeeper integration for AI workload guardrails #860 with the Slack-derived problem statement, AI workload policy examples, the AICR/Chainsaw vs. Gatekeeper scope split, and the deferred integration shapes.
  • Updated this PR description to the project template with Summary, Motivation / Context, Type of Change, Component(s) Affected, Implementation Notes, Testing, Risk Assessment, and Checklist.
  • Linked the issue as Related: #860 because it tracks the broader Gatekeeper integration work while this PR stays scoped to the optional deployable component.

Signed-off-by: Jaydip Gabani <gabanijaydip@gmail.com>
Signed-off-by: Jaydip Gabani <gabanijaydip@gmail.com>
Signed-off-by: Jaydip Gabani <gabanijaydip@gmail.com>
@JaydipGabani JaydipGabani force-pushed the feat/gatekeeper-component branch from 2e1b03a to c45a9e1 Compare May 13, 2026 00:48
@JaydipGabani JaydipGabani requested a review from mchmarny May 13, 2026 03:25
@JaydipGabani
Copy link
Copy Markdown
Author

Added contex

Copy link
Copy Markdown
Member

@mchmarny mchmarny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's hold on on this PR for now to align on the scope. See: #860 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants