Openshift Container Platform (OCP) initial support - simplified by kaponco · Pull Request #857 · NVIDIA/aicr

kaponco · 2026-05-12T18:22:04Z

Summary

Add comprehensive OpenShift Container Platform (OCP) support with Operator Lifecycle Manager (OLM)-based deployments for GPU-accelerated workloads. (#566)

Motivation / Context

OpenShift requires a different deployment model. Rather than directly deploying resources via Helm, OCP uses OLM to manage operator lifecycles through Subscriptions and Custom Resources. This PR adds first-class OCP support to AICR.

Key differences addressed:

OLM-based operator installation (Subscriptions, OperatorGroups, CSVs)
Custom Resource-driven configuration vs. direct manifest deployment
Fixes: N/A
Related: N/A

Type of Change

Bug fix (non-breaking change that fixes an issue)
New feature (non-breaking change that adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)
Documentation update
Refactoring (no functional changes)
Build/CI/tooling

Component(s) Affected

CLI (cmd/aicr, pkg/cli)
API server (cmd/aicrd, pkg/api, pkg/server)
Recipe engine / data (pkg/recipe)
Bundlers (pkg/bundler, pkg/component/*)
Collectors / snapshotter (pkg/collector, pkg/snapshotter)
Validator (pkg/validator)
Core libraries (pkg/errors, pkg/k8s)
Docs/examples (docs/, examples/)
Other: ____________

Implementation Notes

OCP Support

New Recipe Criteria:

Added ocp as a valid service type
Add kubectl apply file (direct) option. No ref to ext repo
deploy and undeploy relate to openshift csv phase

Not included:

overlays on top the ocp basic overlay
additional validations
argoCD deployer (for ocp - will be added in later PR)

Testing

# Commands run (prefer `make qualify` for non-trivial changes)
make qualify

Commands: make qualify

Results:

Unit tests: All passed (73.1% coverage, threshold: 70%)
E2E tests: All 18 passed
golangci-lint: 0 issues
yamllint: passed

Risk Assessment

Low — Isolated change, well-tested, easy to revert
Medium — Touches multiple components or has broader impact
High — Breaking change, affects critical paths, or complex rollout

The rest of the services are untouched. Minimal code changes

Checklist

Tests pass locally (make test with -race)
Linter passes (make lint)
I did not skip/disable tests to make CI green
I added/updated tests for new functionality
I updated docs if user-facing behavior changed
Changes follow existing patterns in the codebase
Commits are cryptographically signed (git commit -S) — [GPG signing info]

(https://docs.github.com/en/authentication/managing-commit-signature-verification)

Signed-off-by: Shai Kapon <skapon@redhat.com>

Add OpenShift (ocp) as a new service type with Operator Lifecycle Manager support. Changes: - Add CriteriaServiceOCP to service enumeration - Implement ComponentTypeDirect for static Kubernetes manifests - Add recipes/overlays/ocp.yaml with dual-component pattern: * nfd-olm + nfd (OLM Subscription + NodeFeatureDiscovery CR) * gpu-operator-olm + gpu-operator (OLM Subscription + ClusterPolicy CR) - Create OLM installation manifests and CR configuration files - Register gpu-operator-olm and nfd-olm components in registry.yaml - Update API spec and documentation to include "ocp" service type Signed-off-by: Shai Kapon <skapon@redhat.com>

Add Direct component support for kubectl-based deployments (OLM Subscriptions, CRs).

…uninstall (NVIDIA#566) - Add Direct component type for static YAML manifests (OLM Subscriptions, CRs) - Generate install.sh with kubectl apply and namespace auto-creation - Generate uninstall.sh with kubectl delete for Direct components - Update undeploy.sh to detect and handle Direct vs Helm components - Configure OCP overlay with Direct components (nfd-olm, gpu-operator-olm, nfd, gpu-operator) - Skip value extraction optimization for Direct components in bundler

…ion (NVIDIA#566) Implement Direct deployment type for static YAML manifests with embedded OLM operator lifecycle management. This enables native OpenShift operator installation through Operator Lifecycle Manager with automatic CSV readiness verification. - Add Direct deployment type alongside Helm and Kustomize with embedded CSV wait logic for OLM components - Implement component registry validation enforcing exactly one deployment type (helm, kustomize, or direct) per component - Add gpu-operator-olm and nfd-olm components with OpenShift OLM Subscription/OperatorGroup manifests - Generate install/uninstall scripts with conditional CSV deletion logic for OLM-managed operators - Update ComponentRef merge logic to clear type-incompatible fields when deployment type changes - Add comprehensive OpenShift deployment documentation with OLM principles and workflow examples Signed-off-by: Shai Kapon <skapon@redhat.com>

…VIDIA#566) - Add unit tests for Direct deployment type (coverage improved from 68.2% to 73.3%) - Add OLM timeout constants to pkg/defaults with env var override capability (AICR_OLM_CSV_TIMEOUT, AICR_OLM_CSV_INTERVAL) - Update Direct deployment install scripts to use centralized timeout constants - Fix deploy.sh status message to reflect actual deployment method (OLM vs Helm vs kubectl) - Remove redundant empty source field from Direct component recipe YAML output - Add test coverage for OCP service type parsing Signed-off-by: Shai Kapon <skapon@redhat.com>

…NVIDIA#566) Signed-off-by: Shai Kapon <skapon@redhat.com>

copy-pr-bot · 2026-05-12T18:22:08Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai · 2026-05-12T18:26:49Z

📝 Walkthrough

Walkthrough

This pull request introduces OpenShift Operator Lifecycle Manager (OLM) integration for AICR by adding platform recognition for OpenShift (ocp service type), implementing a new Direct deployment type for static YAML manifests, and integrating CSV readiness polling into deployment scripts. The changes span criteria enumerations, component metadata schemas, bundler logic, deployment templates, and concrete recipe data for GPU Operator and Node Feature Discovery on OpenShift, along with comprehensive design and operational documentation.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly identifies the main feature being added: OpenShift Container Platform (OCP) initial support with the keyword 'simplified' indicating scope reduction.
Description check	✅ Passed	The description comprehensively explains OCP/OLM support additions, motivation, affected components, testing results, and risk assessment directly aligned with the changeset.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

Generate code and open pull requests
Plan features and break down work
Investigate incidents and troubleshoot customer tickets together
Automate recurring tasks and respond to alerts with triggers
Summarize progress and report instantly

Built for teams:

Shared memory across your entire org—no repeating context
Per-thread sandboxes to safely plan and execute work
Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 13

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/contributor/api-server.md`:
- Around line 287-292: The GET `/v1/recipe` parameter docs and the earlier enum
list for `service` disagree about the `any` value; pick one canonical enum set
and make both places match. Update the `service` enum in the GET `/v1/recipe`
parameter list to include `any` if it is allowed, or remove `any` from the
earlier `service` enum bullet if it is no longer supported, and ensure both
references (the `service` enum block and the GET `/v1/recipe` parameters) use
the same values so the `service` parameter definition is consistent across docs.

In `@docs/design/009-openshift-integration.md`:
- Line 1: The document heading "ADR-007: OpenShift Integration" in the file
named 009-openshift-integration.md is inconsistent with the filename; change the
ADR identifier in the top-level heading from "ADR-007" to "ADR-009" (update the
line that currently reads "ADR-007: OpenShift Integration") and scan the file
for any other occurrences of "ADR-007" to replace them with "ADR-009" so
cross-references and identifiers remain consistent.
- Around line 325-329: The paragraph currently implies ArgoCD is the primary
deployment target; update the wording to clearly distinguish current behavior
from future work by stating that the wait scripts apply only to the install.sh
reference implementation and that ArgoCD/OCP ArgoCD deployer support is out of
scope for this PR (i.e., the PR does not add ArgoCD integration or OCP ArgoCD
deployer support), while noting that ArgoCD is a planned/target deployment
mechanism for later work; reference the existing terms "install.sh", "wait
scripts", and "ArgoCD/OCP ArgoCD deployer support" so reviewers can locate and
change the sentence accordingly.
- Around line 38-304: The markdown has multiple fenced code blocks missing fence
languages and surrounding blank lines (e.g., the initial triple-backtick block,
the registry.yaml example, file-structure block, bundle output block,
shell/deployment flow blocks, and the YAML examples for
recipes/components/gpu-operator-olm/direct/olm.yaml and
recipes/components/gpu-operator/direct/clusterpolicy.yaml); fix by adding an
explicit fence language (e.g., ```bash, ```yaml, or ```text as appropriate) to
each fenced block and ensure there is a blank line before and after each fence
so markdownlint rules MD031/MD040 are satisfied, updating the blocks shown in
the doc (the flow snippet, "Registry entry", "File structure", "Bundle output",
"Deployment flow", and the two YAML examples) accordingly.

In `@docs/integrator/data-flow.md`:
- Line 901: Replace the vague line "Enum validation (eks, gke, aks, ocp, etc.)"
with an explicit, comma-separated list of every accepted value for the service
config (the exact members of the ServiceType enum or the values returned/checked
by validateService/validateIntegratorConfig), ensuring "ocp" is included; also
add one short clause stating the failure behavior when an unknown value is
provided (e.g., validation error and which error is thrown/HTTP response).
Ensure you pull the canonical values from the ServiceType enum / validateService
implementation and mirror them exactly in the docs for the service config key.

In `@docs/integrator/openshift.md`:
- Around line 47-59: The fenced code block that starts with "User creates
Subscription" is missing a language identifier which fails markdown lint; update
the opening fence from ``` to include a language such as ```text (or ```mermaid
if you prefer rendered diagrams) so the block becomes a labeled code block and
the linter passes.

In `@pkg/bundler/deployer/argocd/argocd.go`:
- Around line 541-545: The switch branch handling localformat.KindDirect
currently only comments that direct components are unsupported but continues
generation; change the case for localformat.KindDirect to fail fast by returning
a clear error (e.g., fmt.Errorf or wrapped error) from the enclosing function
instead of falling through so ArgoCD artifacts are not emitted; update the case
in the same switch (the case localformat.KindDirect block) to produce and return
that error so callers of the function immediately receive a failure when a
Direct component is encountered.

In `@pkg/bundler/deployer/localformat/templates/install-direct.sh.tmpl`:
- Around line 35-40: The loop currently reads CSV phase from .items[0] which can
pick the wrong CSV; instead resolve the exact CSV name from the target
Subscription's status.installedCSV and poll that CSV. Concretely: use kubectl to
read the Subscription resource (e.g., kubectl get subscription <name> -n {{
.Namespace }} -o jsonpath='{.status.installedCSV}') into a variable (before or
inside the loop), then replace the kubectl get csv -o
jsonpath='{.items[0].status.phase}' call with kubectl get csv <resolvedCSVName>
-o jsonpath='{.status.phase}' (honoring ${KUBECONFIG_FLAG:-}) and continue using
CSV_PHASE, TIMEOUT, and ELAPSED as before so the loop only observes the intended
CSV.

In `@pkg/bundler/deployer/localformat/templates/uninstall-direct.sh.tmpl`:
- Around line 23-27: The script currently picks the first CSV via
csv_name=$(kubectl get csv ... | head -1) which is unsafe; instead, enumerate
the relevant Subscription(s) and delete only their associated CSVs by reading
each Subscription's status.installedCSV. Replace the single csv_name/head -1
logic with a loop that runs kubectl get subscription -n {{ .Namespace }}
(filtering the same Subscription selector you create elsewhere), for each
subscription extract the .status.installedCSV via jsonpath, skip empty values,
and run kubectl delete "<installedCSV>" -n {{ .Namespace }} --ignore-not-found
--timeout="${HELM_TIMEOUT:-120}s" (preserving the || true behavior); reference
the csv_name variable only as the per-subscription installedCSV and use kubectl
get subscription / .status.installedCSV to locate the correct CSVs.

In `@pkg/recipe/components.go`:
- Around line 332-334: The Helm detection condition incorrectly omits
HelmConfig.DefaultVersion causing components with only DefaultVersion set to be
missed; update the hasHelm boolean to include comp.Helm.DefaultVersion != ""
alongside the existing checks (i.e., set hasHelm to true if
comp.Helm.DefaultRepository != "" || comp.Helm.DefaultChart != "" ||
comp.Helm.DefaultNamespace != "" || comp.Helm.DefaultVersion != "") so the
Helm-only-by-version case is correctly classified (refer to the hasHelm variable
and comp.Helm.DefaultVersion/DefaultRepository/DefaultChart/DefaultNamespace
identifiers).

In `@recipes/components/gpu-operator-olm/direct/olm.yaml`:
- Around line 21-36: The OperatorGroup (metadata.name:
nvidia-gpu-operator-group) and Subscription (metadata.name:
gpu-operator-certified) lack explicit namespaces; add metadata.namespace:
nvidia-gpu-operator to each resource's metadata block so both the OperatorGroup
and the Subscription are deterministically bound to the nvidia-gpu-operator
namespace and no longer rely on external kubectl -n behavior.

In `@recipes/components/nfd/direct/nodefeaturediscovery.yaml`:
- Around line 21-24: The NodeFeatureDiscovery custom resource (metadata.name:
nfd-instance) lacks an explicit namespace and relies on caller context; add
metadata.namespace: openshift-nfd under the metadata block of the
NodeFeatureDiscovery manifest (alongside metadata.name: nfd-instance) so the CR
is always created in the openshift-nfd namespace and cannot be accidentally
applied to the wrong namespace.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: bdbf65c5-30c9-401c-966a-d700bbe8678b

📥 Commits

Reviewing files that changed from the base of the PR and between ee2e7ae and 04ed2a9.

📒 Files selected for processing (45)

.github/ISSUE_TEMPLATE/bug_report.yml
api/aicr/v1/server.yaml
docs/README.md
docs/contributor/api-server.md
docs/contributor/cli.md
docs/contributor/data.md
docs/contributor/validations.md
docs/design/009-openshift-integration.md
docs/integrator/data-flow.md
docs/integrator/openshift.md
docs/user/api-reference.md
docs/user/cli-reference.md
pkg/bundler/bundler.go
pkg/bundler/deployer/argocd/argocd.go
pkg/bundler/deployer/helm/helm.go
pkg/bundler/deployer/helm/templates/deploy.sh.tmpl
pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl
pkg/bundler/deployer/helm/testdata/kai_scheduler_present/undeploy.sh
pkg/bundler/deployer/helm/testdata/manifest_only/undeploy.sh
pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/undeploy.sh
pkg/bundler/deployer/helm/testdata/nodewright_present/undeploy.sh
pkg/bundler/deployer/helm/testdata/upstream_helm_only/undeploy.sh
pkg/bundler/deployer/localformat/direct.go
pkg/bundler/deployer/localformat/folder.go
pkg/bundler/deployer/localformat/templates/install-direct.sh.tmpl
pkg/bundler/deployer/localformat/templates/uninstall-direct.sh.tmpl
pkg/bundler/deployer/localformat/writer.go
pkg/bundler/deployer/localformat/writer_test.go
pkg/defaults/timeouts.go
pkg/recipe/components.go
pkg/recipe/components_test.go
pkg/recipe/criteria.go
pkg/recipe/criteria_test.go
pkg/recipe/metadata.go
pkg/recipe/metadata_test.go
pkg/recipe/yaml_test.go
recipes/components/gpu-operator-olm/direct/olm.yaml
recipes/components/gpu-operator/direct/clusterpolicy.yaml
recipes/components/nfd-olm/direct/olm.yaml
recipes/components/nfd/direct/nodefeaturediscovery.yaml
recipes/data.go
recipes/overlays/ocp.yaml
recipes/registry.yaml
site/.vitepress/config.ts
validators/performance/nccl_all_reduce_bw_constraint.go

…ion (NVIDIA#566) - Fix OLM CSV wait logic to resolve CSV name from Subscription.status.installedCSV instead of non-deterministic .items[0] query, ensuring correct CSV polling in multi-operator namespaces - Fix OLM CSV deletion to enumerate all Subscriptions and delete only their associated CSVs rather than using unsafe head -1 selection - Add fail-fast validation in ArgoCD deployer to reject Direct components with clear error message instead of silently generating unusable artifacts - Fix Helm deployment type detection to include DefaultVersion field, preventing components with only version configured from being misclassified - Fix ComponentRef merge logic to clear Olm field when transitioning from Direct to Helm/Kustomize types and apply Olm overlay to Direct components - Add explicit metadata.namespace to all OLM and CR manifests (gpu-operator-olm, nfd-olm, nfd CR) for deterministic namespace targeting Signed-off-by: Shai Kapon <skapon@redhat.com>

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/integrator/openshift.md`:
- Line 492: The ADR reference is wrong: update the link text and target that
currently read "ADR-007: OpenShift Integration" to use the correct ADR number
and filename (ADR-009) so the link points to
../design/009-openshift-integration.md and the display text reflects "ADR-009:
OpenShift Integration"; edit the markdown line that contains the link text
"ADR-007: OpenShift Integration" to the corrected ADR-009 variant.

In `@pkg/bundler/deployer/localformat/templates/install-direct.sh.tmpl`:
- Around line 35-57: The current loop resolves the installed CSV using an
non-deterministic query (kubectl get subscription -n {{ .Namespace }} -o
jsonpath='{.items[0].status.installedCSV}') which can pick the wrong
Subscription; change the template to accept a concrete selector and use it when
computing CSV_NAME: add a parameter like .SubscriptionName or
.SubscriptionLabelSelector and replace the generic kubectl get subscription call
that populates CSV_NAME with a targeted command that either (1) queries the
specific Subscription by name (kubectl get subscription <name> -n {{ .Namespace
}} -o jsonpath='{.status.installedCSV}') when .SubscriptionName is provided, or
(2) queries by label selector (kubectl get subscription -n {{ .Namespace }} -l
"<label>=<value>" -o jsonpath='{.items[0].status.installedCSV}') when
.SubscriptionLabelSelector is provided; update the template documentation to
mention the new parameter and preserve the existing fallback only if neither
parameter is set.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 96975350-a756-44e8-8f9a-71053c323385

📥 Commits

Reviewing files that changed from the base of the PR and between 04ed2a9 and 8e06e88.

📒 Files selected for processing (13)

docs/design/009-openshift-integration.md
docs/integrator/data-flow.md
docs/integrator/openshift.md
pkg/bundler/deployer/argocd/argocd.go
pkg/bundler/deployer/localformat/templates/install-direct.sh.tmpl
pkg/bundler/deployer/localformat/templates/uninstall-direct.sh.tmpl
pkg/bundler/deployer/localformat/writer_test.go
pkg/recipe/components.go
pkg/recipe/components_test.go
pkg/recipe/metadata.go
pkg/recipe/metadata_test.go
recipes/components/gpu-operator-olm/direct/olm.yaml
recipes/components/nfd/direct/nodefeaturediscovery.yaml

coderabbitai · 2026-05-13T06:01:42Z

+
+- [NVIDIA GPU Operator on OpenShift](https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/install-gpu-ocp.html)
+- [Understanding OLM](https://docs.redhat.com/en/documentation/openshift_container_platform/4.2/html/operators/understanding-the-operator-lifecycle-manager-olm)
+- [ADR-007: OpenShift Integration](../design/007-openshift-integration.md)


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Incorrect ADR reference.

Line 492 references ADR-007: OpenShift Integration but the design document is numbered ADR-009 (filename: 009-openshift-integration.md).

📝 Suggested fix

-- [ADR-007: OpenShift Integration](../design/007-openshift-integration.md) +- [ADR-009: OpenShift Integration](../design/009-openshift-integration.md)

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

- [ADR-007: OpenShift Integration](../design/007-openshift-integration.md)

- [ADR-009: OpenShift Integration](../design/009-openshift-integration.md)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/integrator/openshift.md` at line 492, The ADR reference is wrong: update the link text and target that currently read "ADR-007: OpenShift Integration" to use the correct ADR number and filename (ADR-009) so the link points to ../design/009-openshift-integration.md and the display text reflects "ADR-009: OpenShift Integration"; edit the markdown line that contains the link text "ADR-007: OpenShift Integration" to the corrected ADR-009 variant.

coderabbitai · 2026-05-13T06:01:42Z

+while [ $ELAPSED -lt $TIMEOUT ]; do
+  # Resolve the installed CSV name from the Subscription
+  CSV_NAME=$(kubectl get subscription -n {{ .Namespace }} -o jsonpath='{.items[0].status.installedCSV}' ${KUBECONFIG_FLAG:-} 2>/dev/null || echo "")
+
+  if [ -z "$CSV_NAME" ]; then
+    echo "Waiting for Subscription to report installedCSV... (${ELAPSED}s/${TIMEOUT}s)"
+    sleep $INTERVAL
+    ELAPSED=$((ELAPSED + INTERVAL))
+    continue
+  fi
+
+  # Check the specific CSV's phase
+  CSV_PHASE=$(kubectl get csv "$CSV_NAME" -n {{ .Namespace }} -o jsonpath='{.status.phase}' ${KUBECONFIG_FLAG:-} 2>/dev/null || echo "")
+
+  if [ "$CSV_PHASE" = "Succeeded" ]; then
+    echo "CSV ${CSV_NAME} reached Succeeded phase"
+    exit 0
+  fi
+
+  echo "CSV ${CSV_NAME} phase: ${CSV_PHASE:-<not found>}, waiting... (${ELAPSED}s/${TIMEOUT}s)"
+  sleep $INTERVAL
+  ELAPSED=$((ELAPSED + INTERVAL))
+done


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

CSV wait logic still uses .items[0] which is non-deterministic.

Line 37 resolves the installed CSV name from .items[0].status.installedCSV, which picks the first Subscription in the namespace. In namespaces with multiple operators, this can poll the wrong CSV and report incorrect installation status.

The past review recommended resolving the CSV from "the target Subscription," but the current implementation still selects the first Subscription found in the namespace.

Recommendation:

If the OLM manifest contains a known Subscription name, resolve that specific Subscription instead of .items[0]. If the Subscription name is dynamic or unknown at template time, consider:

Add a template parameter for the expected Subscription name/selector

Use a label selector to identify the target Subscription

Document that this wait logic assumes single-operator namespaces

For now, this may be acceptable if all OLM Direct components use dedicated single-operator namespaces (e.g., nvidia-gpu-operator, openshift-nfd), but it's fragile if namespace conventions change.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/bundler/deployer/localformat/templates/install-direct.sh.tmpl` around lines 35 - 57, The current loop resolves the installed CSV using an non-deterministic query (kubectl get subscription -n {{ .Namespace }} -o jsonpath='{.items[0].status.installedCSV}') which can pick the wrong Subscription; change the template to accept a concrete selector and use it when computing CSV_NAME: add a parameter like .SubscriptionName or .SubscriptionLabelSelector and replace the generic kubectl get subscription call that populates CSV_NAME with a targeted command that either (1) queries the specific Subscription by name (kubectl get subscription <name> -n {{ .Namespace }} -o jsonpath='{.status.installedCSV}') when .SubscriptionName is provided, or (2) queries by label selector (kubectl get subscription -n {{ .Namespace }} -l "<label>=<value>" -o jsonpath='{.items[0].status.installedCSV}') when .SubscriptionLabelSelector is provided; update the template documentation to mention the new parameter and preserve the existing fallback only if neither parameter is set.

github-actions · 2026-05-14T16:14:30Z

@kaponco this PR now has merge conflicts with main. Please rebase to resolve them.

lockwobr · 2026-05-15T05:37:35Z

Thanks for the work here — the OCP modeling (criteria, overlay, dual-component operator/CR pattern) looks good and I'd like OpenShift support to land. Before you go further on this PR, a couple of architectural questions worth working through together.

On the new direct deployment type

docs/contributor/index.md ("What AICR Is Not" / "Architectural Boundaries") draws a line on this: AICR generates artifacts, deployment belongs to the tool that consumes them — and it specifically calls out a "direct" kubectl apply deployer with bespoke wait/uninstall scripts as out of scope.

To be fair, the existing helm deployer already carries some of this shape — install/uninstall shell scripts — so the line isn't clean today. But we've been actively trying to shrink that footprint, not grow it: adding the helmfile deployer (#899) was a step in that direction, with the eventual goal of leaning on a community tool instead of maintaining our own scripts. I'd rather find a route that fits our current paradigms — the new localformat/direct.go + install-direct.sh.tmpl / uninstall-direct.sh.tmpl (with embedded CSV polling) is pretty close to the shape the doc discourages.

On reusing existing output adapters

I'd like to understand the rationale for a new adapter vs. emitting the OLM resources through an existing path — e.g., a small in-tree chart per component containing the Subscription / OperatorGroup (and similarly for the CR component). That would let an existing bundler render them, an existing install path drive them, and CSV readiness could ride on a post-install hook. We just opened #904 (replaces #610) to make chainsaw-driven readiness an optional first-class convention; OCP/OLM looks like a great early use case for it.

Possible split

What if we split this? Land the non-deployer parts now — ocp criteria, overlay, dual-component recipe modeling, docs — and then we work together on getting the deployment piece into a shape we both like.

If we can get #904 landed, would that help unblock the deployment question here? At that point you could just add the readiness checks you need as chainsaw tests, and OLM would fit into the same paradigm we use everywhere else — no new deployer, no bespoke wait scripts.

One more question: does OCP need to work in the helm deployer at all? If we scoped it to Argo CD / Flux only, a lot of this complexity goes away.

kaponco · 2026-05-15T12:12:19Z

@lockwobr, thanks for your input, your points are clear.

PR #904 will be a big help in removing the OLM explicit waits. Once it's in, the "direct" deploy.sh will be simplified to a minimal "kubectl apply".

I think keeping install-direct script in this PR is necessary. It provides a helpful example for early validation of the OCP-to-AICR integration and aligns with other service types.

Regarding the in-tree Helm solution: what will the output in the bundle directory look like? If it's a Helm chart, it might confuse OpenShift users, who traditionally don't use Helm scripts to manage artifacts.

Also, happy to change the "direct" keyword if we want something clearer.

mchmarny · 2026-05-20T14:49:59Z

@kaponco appears to be very new account (43 days old) with minimal established reputation across all dimensions. The account shows zero merged PRs and two closed PRs, no followers, and no engagement signals or community involvement, resulting in a low overall trust score: https://devtrace.thingz.io/score/kaponco

The lack of successful contributions, community presence, and engagement history indicates this is a bot. Given the scope of that PR and existing misalignment with established AICR patterns/principles documented by @lockwobr I recommend we reject it.

kaponco · 2026-05-21T06:58:32Z

@mchmarny please sync with @lockwobr we had a discussion about the design. The design of this project leans heavily on helm while the openshift approach is different. I didnt see any attempt from you to bridge the gap. All I am doing is proposing solutions. If you think this is a personal issue - I dont mind asking my company for a replacement.

haarchri · 2026-05-21T08:03:01Z

On EKS/AKS/GKE, GPU Operator and NFD are installed through Helm, same as the other components.

For OCP, this PR introduces a separate deployment model based on:

OLM Subscriptions + OperatorGroups
kubectl apply flows
custom CSV polling/wait logic

That means the bundler, deploy/undeploy scripts, and validation paths now need to support two different installation mechanisms.

My main question is: is OLM actually required here, or just preferred?

The GPU Operator Helm chart already works on OpenShift, and NVIDIA documentation describes both approaches. If Helm is a supported path, then a lot of the additional complexity introduced here may not be necessary:

the Direct deployer becomes unnecessary
custom install/uninstall templates are no longer needed
CSV polling/wait logic goes away
existing deployers (Helm, ArgoCD, Flux) continue to work without special handling

If there is a hard requirement for OLM (for example OperatorHub certification, compliance requirements, or customer constraints), I think that rationale should be explicitly documented in both the design doc and PR description. Right now the reason for introducing a parallel install path is unclear.

I also noticed that when generating bundles for all four services using:

--accelerator h100 --intent training

the OCP recipe currently installs only:

GPU Operator
NFD

Compared to EKS/AKS/GKE, the following components are disabled:

monitoring stack
prometheus adapter
gang scheduling
node tuning
DRA
health/ops components
cert-manager

A few questions on scope and intent:

Is this meant to be a phase-1 baseline, or the final intended OCP scope?
Which of these components actually cannot run on OCP?
Which components are simply not integrated yet?

For example, cert-manager, Prometheus, kai-scheduler, and nodewright already support OpenShift, so it would help to clarify whether these omissions are platform limitations or just incomplete wiring.

One thing I still do not fully understand is the deployer direction.

The PR notes mention that ArgoCD support for OCP will come in a later PR, but if components were installed through Helm (like the other platforms), then ArgoCD/Flux support would already work automatically?!

That makes the OLM-specific deployment path harder to justify from an architecture and maintenance perspective - please answer the question around the requirements for OLM

kaponco added 7 commits May 12, 2026 18:09

docs(design): add ADR-007 for OpenShift OLM integration

3f110f1

Signed-off-by: Shai Kapon <skapon@redhat.com>

feat(bundler): implement Direct deployment type (NVIDIA#566)

1bf6683

Add Direct component support for kubectl-based deployments (OLM Subscriptions, CRs).

fix(gpu-operator): disable RDMA in OpenShift ClusterPolicy by default (…

04ed2a9

…NVIDIA#566) Signed-off-by: Shai Kapon <skapon@redhat.com>

kaponco requested review from a team as code owners May 12, 2026 18:22

github-actions Bot added area/recipes area/ci area/docs area/bundler area/api size/XL labels May 12, 2026

coderabbitai Bot reviewed May 12, 2026

View reviewed changes

mchmarny assigned kaponco May 12, 2026

coderabbitai Bot reviewed May 13, 2026

View reviewed changes

github-actions Bot added the needs-rebase label May 14, 2026

	- [ADR-007: OpenShift Integration](../design/007-openshift-integration.md)
	- [ADR-009: OpenShift Integration](../design/009-openshift-integration.md)

Conversation

kaponco commented May 12, 2026

Summary

Motivation / Context

Type of Change

Component(s) Affected

Implementation Notes

Testing

Risk Assessment

Checklist

Uh oh!

copy-pr-bot Bot commented May 12, 2026

Uh oh!

coderabbitai Bot commented May 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Estimated code review effort

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 13, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 13, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented May 14, 2026

Uh oh!

lockwobr commented May 15, 2026

Uh oh!

kaponco commented May 15, 2026

Uh oh!

mchmarny commented May 20, 2026

Uh oh!

kaponco commented May 21, 2026

Uh oh!

haarchri commented May 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

coderabbitai Bot commented May 12, 2026 •

edited

Loading