Skip to content

Openshift Container Platform (OCP) initial support - simplified #857

Open
kaponco wants to merge 8 commits into
NVIDIA:mainfrom
kaponco:feat/ocp-olm-integration
Open

Openshift Container Platform (OCP) initial support - simplified #857
kaponco wants to merge 8 commits into
NVIDIA:mainfrom
kaponco:feat/ocp-olm-integration

Conversation

@kaponco
Copy link
Copy Markdown

@kaponco kaponco commented May 12, 2026

Summary

Add comprehensive OpenShift Container Platform (OCP) support with Operator Lifecycle Manager (OLM)-based deployments for GPU-accelerated workloads. (#566)

Motivation / Context

OpenShift requires a different deployment model. Rather than directly deploying resources via Helm, OCP uses OLM to manage operator lifecycles through Subscriptions and Custom Resources. This PR adds first-class OCP support to AICR.

Key differences addressed:

OLM-based operator installation (Subscriptions, OperatorGroups, CSVs)
Custom Resource-driven configuration vs. direct manifest deployment
Fixes: N/A
Related: N/A

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • Build/CI/tooling

Component(s) Affected

  • CLI (cmd/aicr, pkg/cli)
  • API server (cmd/aicrd, pkg/api, pkg/server)
  • Recipe engine / data (pkg/recipe)
  • Bundlers (pkg/bundler, pkg/component/*)
  • Collectors / snapshotter (pkg/collector, pkg/snapshotter)
  • Validator (pkg/validator)
  • Core libraries (pkg/errors, pkg/k8s)
  • Docs/examples (docs/, examples/)
  • Other: ____________

Implementation Notes

OCP Support

New Recipe Criteria:

  • Added ocp as a valid service type
  • Add kubectl apply file (direct) option. No ref to ext repo
  • deploy and undeploy relate to openshift csv phase

Not included:

  • overlays on top the ocp basic overlay
  • additional validations
  • argoCD deployer (for ocp - will be added in later PR)

Testing

# Commands run (prefer `make qualify` for non-trivial changes)
make qualify

Commands: make qualify

Results:

  • Unit tests: All passed (73.1% coverage, threshold: 70%)
  • E2E tests: All 18 passed
  • golangci-lint: 0 issues
  • yamllint: passed

Risk Assessment

  • Low — Isolated change, well-tested, easy to revert
  • Medium — Touches multiple components or has broader impact
  • High — Breaking change, affects critical paths, or complex rollout

The rest of the services are untouched. Minimal code changes

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S) — [GPG signing info]

(https://docs.github.com/en/authentication/managing-commit-signature-verification)

kaponco added 7 commits May 12, 2026 18:09
Signed-off-by: Shai Kapon <skapon@redhat.com>
Add OpenShift (ocp) as a new service type with Operator Lifecycle Manager support.

Changes:
- Add CriteriaServiceOCP to service enumeration
- Implement ComponentTypeDirect for static Kubernetes manifests
- Add recipes/overlays/ocp.yaml with dual-component pattern:
  * nfd-olm + nfd (OLM Subscription + NodeFeatureDiscovery CR)
  * gpu-operator-olm + gpu-operator (OLM Subscription + ClusterPolicy CR)
- Create OLM installation manifests and CR configuration files
- Register gpu-operator-olm and nfd-olm components in registry.yaml
- Update API spec and documentation to include "ocp" service type

Signed-off-by: Shai Kapon <skapon@redhat.com>
Add Direct component support for kubectl-based deployments (OLM Subscriptions, CRs).
…uninstall (NVIDIA#566)

- Add Direct component type for static YAML manifests (OLM Subscriptions, CRs)
- Generate install.sh with kubectl apply and namespace auto-creation
- Generate uninstall.sh with kubectl delete for Direct components
- Update undeploy.sh to detect and handle Direct vs Helm components
- Configure OCP overlay with Direct components (nfd-olm, gpu-operator-olm, nfd, gpu-operator)
- Skip value extraction optimization for Direct components in bundler
…ion (NVIDIA#566)

Implement Direct deployment type for static YAML manifests with embedded
OLM operator lifecycle management. This enables native OpenShift operator
installation through Operator Lifecycle Manager with automatic CSV readiness
verification.

- Add Direct deployment type alongside Helm and Kustomize with embedded CSV wait logic for OLM components
- Implement component registry validation enforcing exactly one deployment type (helm, kustomize, or direct) per component
- Add gpu-operator-olm and nfd-olm components with OpenShift OLM Subscription/OperatorGroup manifests
- Generate install/uninstall scripts with conditional CSV deletion logic for OLM-managed operators
- Update ComponentRef merge logic to clear type-incompatible fields when deployment type changes
- Add comprehensive OpenShift deployment documentation with OLM principles and workflow examples

Signed-off-by: Shai Kapon <skapon@redhat.com>
…VIDIA#566)

- Add unit tests for Direct deployment type (coverage improved from 68.2% to 73.3%)
- Add OLM timeout constants to pkg/defaults with env var override capability (AICR_OLM_CSV_TIMEOUT, AICR_OLM_CSV_INTERVAL)
- Update Direct deployment install scripts to use centralized timeout constants
- Fix deploy.sh status message to reflect actual deployment method (OLM vs Helm vs kubectl)
- Remove redundant empty source field from Direct component recipe YAML output
- Add test coverage for OCP service type parsing

Signed-off-by: Shai Kapon <skapon@redhat.com>
@kaponco kaponco requested review from a team as code owners May 12, 2026 18:22
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented May 12, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 12, 2026

Review Change Stack

📝 Walkthrough

Walkthrough

This pull request introduces OpenShift Operator Lifecycle Manager (OLM) integration for AICR by adding platform recognition for OpenShift (ocp service type), implementing a new Direct deployment type for static YAML manifests, and integrating CSV readiness polling into deployment scripts. The changes span criteria enumerations, component metadata schemas, bundler logic, deployment templates, and concrete recipe data for GPU Operator and Node Feature Discovery on OpenShift, along with comprehensive design and operational documentation.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly identifies the main feature being added: OpenShift Container Platform (OCP) initial support with the keyword 'simplified' indicating scope reduction.
Description check ✅ Passed The description comprehensively explains OCP/OLM support additions, motivation, affected components, testing results, and risk assessment directly aligned with the changeset.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

  • Generate code and open pull requests
  • Plan features and break down work
  • Investigate incidents and troubleshoot customer tickets together
  • Automate recurring tasks and respond to alerts with triggers
  • Summarize progress and report instantly

Built for teams:

  • Shared memory across your entire org—no repeating context
  • Per-thread sandboxes to safely plan and execute work
  • Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started


Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 13

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/contributor/api-server.md`:
- Around line 287-292: The GET `/v1/recipe` parameter docs and the earlier enum
list for `service` disagree about the `any` value; pick one canonical enum set
and make both places match. Update the `service` enum in the GET `/v1/recipe`
parameter list to include `any` if it is allowed, or remove `any` from the
earlier `service` enum bullet if it is no longer supported, and ensure both
references (the `service` enum block and the GET `/v1/recipe` parameters) use
the same values so the `service` parameter definition is consistent across docs.

In `@docs/design/009-openshift-integration.md`:
- Line 1: The document heading "ADR-007: OpenShift Integration" in the file
named 009-openshift-integration.md is inconsistent with the filename; change the
ADR identifier in the top-level heading from "ADR-007" to "ADR-009" (update the
line that currently reads "ADR-007: OpenShift Integration") and scan the file
for any other occurrences of "ADR-007" to replace them with "ADR-009" so
cross-references and identifiers remain consistent.
- Around line 325-329: The paragraph currently implies ArgoCD is the primary
deployment target; update the wording to clearly distinguish current behavior
from future work by stating that the wait scripts apply only to the install.sh
reference implementation and that ArgoCD/OCP ArgoCD deployer support is out of
scope for this PR (i.e., the PR does not add ArgoCD integration or OCP ArgoCD
deployer support), while noting that ArgoCD is a planned/target deployment
mechanism for later work; reference the existing terms "install.sh", "wait
scripts", and "ArgoCD/OCP ArgoCD deployer support" so reviewers can locate and
change the sentence accordingly.
- Around line 38-304: The markdown has multiple fenced code blocks missing fence
languages and surrounding blank lines (e.g., the initial triple-backtick block,
the registry.yaml example, file-structure block, bundle output block,
shell/deployment flow blocks, and the YAML examples for
recipes/components/gpu-operator-olm/direct/olm.yaml and
recipes/components/gpu-operator/direct/clusterpolicy.yaml); fix by adding an
explicit fence language (e.g., ```bash, ```yaml, or ```text as appropriate) to
each fenced block and ensure there is a blank line before and after each fence
so markdownlint rules MD031/MD040 are satisfied, updating the blocks shown in
the doc (the flow snippet, "Registry entry", "File structure", "Bundle output",
"Deployment flow", and the two YAML examples) accordingly.

In `@docs/integrator/data-flow.md`:
- Line 901: Replace the vague line "Enum validation (eks, gke, aks, ocp, etc.)"
with an explicit, comma-separated list of every accepted value for the service
config (the exact members of the ServiceType enum or the values returned/checked
by validateService/validateIntegratorConfig), ensuring "ocp" is included; also
add one short clause stating the failure behavior when an unknown value is
provided (e.g., validation error and which error is thrown/HTTP response).
Ensure you pull the canonical values from the ServiceType enum / validateService
implementation and mirror them exactly in the docs for the service config key.

In `@docs/integrator/openshift.md`:
- Around line 47-59: The fenced code block that starts with "User creates
Subscription" is missing a language identifier which fails markdown lint; update
the opening fence from ``` to include a language such as ```text (or ```mermaid
if you prefer rendered diagrams) so the block becomes a labeled code block and
the linter passes.

In `@pkg/bundler/deployer/argocd/argocd.go`:
- Around line 541-545: The switch branch handling localformat.KindDirect
currently only comments that direct components are unsupported but continues
generation; change the case for localformat.KindDirect to fail fast by returning
a clear error (e.g., fmt.Errorf or wrapped error) from the enclosing function
instead of falling through so ArgoCD artifacts are not emitted; update the case
in the same switch (the case localformat.KindDirect block) to produce and return
that error so callers of the function immediately receive a failure when a
Direct component is encountered.

In `@pkg/bundler/deployer/localformat/templates/install-direct.sh.tmpl`:
- Around line 35-40: The loop currently reads CSV phase from .items[0] which can
pick the wrong CSV; instead resolve the exact CSV name from the target
Subscription's status.installedCSV and poll that CSV. Concretely: use kubectl to
read the Subscription resource (e.g., kubectl get subscription <name> -n {{
.Namespace }} -o jsonpath='{.status.installedCSV}') into a variable (before or
inside the loop), then replace the kubectl get csv -o
jsonpath='{.items[0].status.phase}' call with kubectl get csv <resolvedCSVName>
-o jsonpath='{.status.phase}' (honoring ${KUBECONFIG_FLAG:-}) and continue using
CSV_PHASE, TIMEOUT, and ELAPSED as before so the loop only observes the intended
CSV.

In `@pkg/bundler/deployer/localformat/templates/uninstall-direct.sh.tmpl`:
- Around line 23-27: The script currently picks the first CSV via
csv_name=$(kubectl get csv ... | head -1) which is unsafe; instead, enumerate
the relevant Subscription(s) and delete only their associated CSVs by reading
each Subscription's status.installedCSV. Replace the single csv_name/head -1
logic with a loop that runs kubectl get subscription -n {{ .Namespace }}
(filtering the same Subscription selector you create elsewhere), for each
subscription extract the .status.installedCSV via jsonpath, skip empty values,
and run kubectl delete "<installedCSV>" -n {{ .Namespace }} --ignore-not-found
--timeout="${HELM_TIMEOUT:-120}s" (preserving the || true behavior); reference
the csv_name variable only as the per-subscription installedCSV and use kubectl
get subscription / .status.installedCSV to locate the correct CSVs.

In `@pkg/recipe/components.go`:
- Around line 332-334: The Helm detection condition incorrectly omits
HelmConfig.DefaultVersion causing components with only DefaultVersion set to be
missed; update the hasHelm boolean to include comp.Helm.DefaultVersion != ""
alongside the existing checks (i.e., set hasHelm to true if
comp.Helm.DefaultRepository != "" || comp.Helm.DefaultChart != "" ||
comp.Helm.DefaultNamespace != "" || comp.Helm.DefaultVersion != "") so the
Helm-only-by-version case is correctly classified (refer to the hasHelm variable
and comp.Helm.DefaultVersion/DefaultRepository/DefaultChart/DefaultNamespace
identifiers).

In `@recipes/components/gpu-operator-olm/direct/olm.yaml`:
- Around line 21-36: The OperatorGroup (metadata.name:
nvidia-gpu-operator-group) and Subscription (metadata.name:
gpu-operator-certified) lack explicit namespaces; add metadata.namespace:
nvidia-gpu-operator to each resource's metadata block so both the OperatorGroup
and the Subscription are deterministically bound to the nvidia-gpu-operator
namespace and no longer rely on external kubectl -n behavior.

In `@recipes/components/nfd/direct/nodefeaturediscovery.yaml`:
- Around line 21-24: The NodeFeatureDiscovery custom resource (metadata.name:
nfd-instance) lacks an explicit namespace and relies on caller context; add
metadata.namespace: openshift-nfd under the metadata block of the
NodeFeatureDiscovery manifest (alongside metadata.name: nfd-instance) so the CR
is always created in the openshift-nfd namespace and cannot be accidentally
applied to the wrong namespace.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: bdbf65c5-30c9-401c-966a-d700bbe8678b

📥 Commits

Reviewing files that changed from the base of the PR and between ee2e7ae and 04ed2a9.

📒 Files selected for processing (45)
  • .github/ISSUE_TEMPLATE/bug_report.yml
  • api/aicr/v1/server.yaml
  • docs/README.md
  • docs/contributor/api-server.md
  • docs/contributor/cli.md
  • docs/contributor/data.md
  • docs/contributor/validations.md
  • docs/design/009-openshift-integration.md
  • docs/integrator/data-flow.md
  • docs/integrator/openshift.md
  • docs/user/api-reference.md
  • docs/user/cli-reference.md
  • pkg/bundler/bundler.go
  • pkg/bundler/deployer/argocd/argocd.go
  • pkg/bundler/deployer/helm/helm.go
  • pkg/bundler/deployer/helm/templates/deploy.sh.tmpl
  • pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl
  • pkg/bundler/deployer/helm/testdata/kai_scheduler_present/undeploy.sh
  • pkg/bundler/deployer/helm/testdata/manifest_only/undeploy.sh
  • pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/undeploy.sh
  • pkg/bundler/deployer/helm/testdata/nodewright_present/undeploy.sh
  • pkg/bundler/deployer/helm/testdata/upstream_helm_only/undeploy.sh
  • pkg/bundler/deployer/localformat/direct.go
  • pkg/bundler/deployer/localformat/folder.go
  • pkg/bundler/deployer/localformat/templates/install-direct.sh.tmpl
  • pkg/bundler/deployer/localformat/templates/uninstall-direct.sh.tmpl
  • pkg/bundler/deployer/localformat/writer.go
  • pkg/bundler/deployer/localformat/writer_test.go
  • pkg/defaults/timeouts.go
  • pkg/recipe/components.go
  • pkg/recipe/components_test.go
  • pkg/recipe/criteria.go
  • pkg/recipe/criteria_test.go
  • pkg/recipe/metadata.go
  • pkg/recipe/metadata_test.go
  • pkg/recipe/yaml_test.go
  • recipes/components/gpu-operator-olm/direct/olm.yaml
  • recipes/components/gpu-operator/direct/clusterpolicy.yaml
  • recipes/components/nfd-olm/direct/olm.yaml
  • recipes/components/nfd/direct/nodefeaturediscovery.yaml
  • recipes/data.go
  • recipes/overlays/ocp.yaml
  • recipes/registry.yaml
  • site/.vitepress/config.ts
  • validators/performance/nccl_all_reduce_bw_constraint.go

Comment thread docs/contributor/api-server.md
Comment thread docs/design/009-openshift-integration.md Outdated
Comment thread docs/design/009-openshift-integration.md Outdated
Comment thread docs/design/009-openshift-integration.md Outdated
Comment thread docs/integrator/data-flow.md Outdated
Comment thread pkg/bundler/deployer/localformat/templates/uninstall-direct.sh.tmpl Outdated
Comment thread pkg/recipe/components.go Outdated
Comment thread pkg/recipe/metadata.go
Comment thread recipes/components/gpu-operator-olm/direct/olm.yaml
Comment thread recipes/components/nfd/direct/nodefeaturediscovery.yaml
…ion (NVIDIA#566)

- Fix OLM CSV wait logic to resolve CSV name from Subscription.status.installedCSV instead of non-deterministic .items[0] query, ensuring correct CSV polling in multi-operator namespaces
- Fix OLM CSV deletion to enumerate all Subscriptions and delete only their associated CSVs rather than using unsafe head -1 selection
- Add fail-fast validation in ArgoCD deployer to reject Direct components with clear error message instead of silently generating unusable artifacts
- Fix Helm deployment type detection to include DefaultVersion field, preventing components with only version configured from being misclassified
- Fix ComponentRef merge logic to clear Olm field when transitioning from Direct to Helm/Kustomize types and apply Olm overlay to Direct components
- Add explicit metadata.namespace to all OLM and CR manifests (gpu-operator-olm, nfd-olm, nfd CR) for deterministic namespace targeting

Signed-off-by: Shai Kapon <skapon@redhat.com>
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/integrator/openshift.md`:
- Line 492: The ADR reference is wrong: update the link text and target that
currently read "ADR-007: OpenShift Integration" to use the correct ADR number
and filename (ADR-009) so the link points to
../design/009-openshift-integration.md and the display text reflects "ADR-009:
OpenShift Integration"; edit the markdown line that contains the link text
"ADR-007: OpenShift Integration" to the corrected ADR-009 variant.

In `@pkg/bundler/deployer/localformat/templates/install-direct.sh.tmpl`:
- Around line 35-57: The current loop resolves the installed CSV using an
non-deterministic query (kubectl get subscription -n {{ .Namespace }} -o
jsonpath='{.items[0].status.installedCSV}') which can pick the wrong
Subscription; change the template to accept a concrete selector and use it when
computing CSV_NAME: add a parameter like .SubscriptionName or
.SubscriptionLabelSelector and replace the generic kubectl get subscription call
that populates CSV_NAME with a targeted command that either (1) queries the
specific Subscription by name (kubectl get subscription <name> -n {{ .Namespace
}} -o jsonpath='{.status.installedCSV}') when .SubscriptionName is provided, or
(2) queries by label selector (kubectl get subscription -n {{ .Namespace }} -l
"<label>=<value>" -o jsonpath='{.items[0].status.installedCSV}') when
.SubscriptionLabelSelector is provided; update the template documentation to
mention the new parameter and preserve the existing fallback only if neither
parameter is set.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 96975350-a756-44e8-8f9a-71053c323385

📥 Commits

Reviewing files that changed from the base of the PR and between 04ed2a9 and 8e06e88.

📒 Files selected for processing (13)
  • docs/design/009-openshift-integration.md
  • docs/integrator/data-flow.md
  • docs/integrator/openshift.md
  • pkg/bundler/deployer/argocd/argocd.go
  • pkg/bundler/deployer/localformat/templates/install-direct.sh.tmpl
  • pkg/bundler/deployer/localformat/templates/uninstall-direct.sh.tmpl
  • pkg/bundler/deployer/localformat/writer_test.go
  • pkg/recipe/components.go
  • pkg/recipe/components_test.go
  • pkg/recipe/metadata.go
  • pkg/recipe/metadata_test.go
  • recipes/components/gpu-operator-olm/direct/olm.yaml
  • recipes/components/nfd/direct/nodefeaturediscovery.yaml


- [NVIDIA GPU Operator on OpenShift](https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/install-gpu-ocp.html)
- [Understanding OLM](https://docs.redhat.com/en/documentation/openshift_container_platform/4.2/html/operators/understanding-the-operator-lifecycle-manager-olm)
- [ADR-007: OpenShift Integration](../design/007-openshift-integration.md)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Incorrect ADR reference.

Line 492 references ADR-007: OpenShift Integration but the design document is numbered ADR-009 (filename: 009-openshift-integration.md).

📝 Suggested fix
-- [ADR-007: OpenShift Integration](../design/007-openshift-integration.md)
+- [ADR-009: OpenShift Integration](../design/009-openshift-integration.md)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
- [ADR-007: OpenShift Integration](../design/007-openshift-integration.md)
- [ADR-009: OpenShift Integration](../design/009-openshift-integration.md)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/integrator/openshift.md` at line 492, The ADR reference is wrong: update
the link text and target that currently read "ADR-007: OpenShift Integration" to
use the correct ADR number and filename (ADR-009) so the link points to
../design/009-openshift-integration.md and the display text reflects "ADR-009:
OpenShift Integration"; edit the markdown line that contains the link text
"ADR-007: OpenShift Integration" to the corrected ADR-009 variant.

Comment on lines +35 to +57
while [ $ELAPSED -lt $TIMEOUT ]; do
# Resolve the installed CSV name from the Subscription
CSV_NAME=$(kubectl get subscription -n {{ .Namespace }} -o jsonpath='{.items[0].status.installedCSV}' ${KUBECONFIG_FLAG:-} 2>/dev/null || echo "")

if [ -z "$CSV_NAME" ]; then
echo "Waiting for Subscription to report installedCSV... (${ELAPSED}s/${TIMEOUT}s)"
sleep $INTERVAL
ELAPSED=$((ELAPSED + INTERVAL))
continue
fi

# Check the specific CSV's phase
CSV_PHASE=$(kubectl get csv "$CSV_NAME" -n {{ .Namespace }} -o jsonpath='{.status.phase}' ${KUBECONFIG_FLAG:-} 2>/dev/null || echo "")

if [ "$CSV_PHASE" = "Succeeded" ]; then
echo "CSV ${CSV_NAME} reached Succeeded phase"
exit 0
fi

echo "CSV ${CSV_NAME} phase: ${CSV_PHASE:-<not found>}, waiting... (${ELAPSED}s/${TIMEOUT}s)"
sleep $INTERVAL
ELAPSED=$((ELAPSED + INTERVAL))
done
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

CSV wait logic still uses .items[0] which is non-deterministic.

Line 37 resolves the installed CSV name from .items[0].status.installedCSV, which picks the first Subscription in the namespace. In namespaces with multiple operators, this can poll the wrong CSV and report incorrect installation status.

The past review recommended resolving the CSV from "the target Subscription," but the current implementation still selects the first Subscription found in the namespace.

Recommendation:

If the OLM manifest contains a known Subscription name, resolve that specific Subscription instead of .items[0]. If the Subscription name is dynamic or unknown at template time, consider:

  1. Add a template parameter for the expected Subscription name/selector
  2. Use a label selector to identify the target Subscription
  3. Document that this wait logic assumes single-operator namespaces

For now, this may be acceptable if all OLM Direct components use dedicated single-operator namespaces (e.g., nvidia-gpu-operator, openshift-nfd), but it's fragile if namespace conventions change.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/bundler/deployer/localformat/templates/install-direct.sh.tmpl` around
lines 35 - 57, The current loop resolves the installed CSV using an
non-deterministic query (kubectl get subscription -n {{ .Namespace }} -o
jsonpath='{.items[0].status.installedCSV}') which can pick the wrong
Subscription; change the template to accept a concrete selector and use it when
computing CSV_NAME: add a parameter like .SubscriptionName or
.SubscriptionLabelSelector and replace the generic kubectl get subscription call
that populates CSV_NAME with a targeted command that either (1) queries the
specific Subscription by name (kubectl get subscription <name> -n {{ .Namespace
}} -o jsonpath='{.status.installedCSV}') when .SubscriptionName is provided, or
(2) queries by label selector (kubectl get subscription -n {{ .Namespace }} -l
"<label>=<value>" -o jsonpath='{.items[0].status.installedCSV}') when
.SubscriptionLabelSelector is provided; update the template documentation to
mention the new parameter and preserve the existing fallback only if neither
parameter is set.

@github-actions
Copy link
Copy Markdown
Contributor

@kaponco this PR now has merge conflicts with main. Please rebase to resolve them.

@lockwobr
Copy link
Copy Markdown
Contributor

Thanks for the work here — the OCP modeling (criteria, overlay, dual-component operator/CR pattern) looks good and I'd like OpenShift support to land. Before you go further on this PR, a couple of architectural questions worth working through together.

On the new direct deployment type

docs/contributor/index.md ("What AICR Is Not" / "Architectural Boundaries") draws a line on this: AICR generates artifacts, deployment belongs to the tool that consumes them — and it specifically calls out a "direct" kubectl apply deployer with bespoke wait/uninstall scripts as out of scope.

To be fair, the existing helm deployer already carries some of this shape — install/uninstall shell scripts — so the line isn't clean today. But we've been actively trying to shrink that footprint, not grow it: adding the helmfile deployer (#899) was a step in that direction, with the eventual goal of leaning on a community tool instead of maintaining our own scripts. I'd rather find a route that fits our current paradigms — the new localformat/direct.go + install-direct.sh.tmpl / uninstall-direct.sh.tmpl (with embedded CSV polling) is pretty close to the shape the doc discourages.

On reusing existing output adapters

I'd like to understand the rationale for a new adapter vs. emitting the OLM resources through an existing path — e.g., a small in-tree chart per component containing the Subscription / OperatorGroup (and similarly for the CR component). That would let an existing bundler render them, an existing install path drive them, and CSV readiness could ride on a post-install hook. We just opened #904 (replaces #610) to make chainsaw-driven readiness an optional first-class convention; OCP/OLM looks like a great early use case for it.

Possible split

What if we split this? Land the non-deployer parts now — ocp criteria, overlay, dual-component recipe modeling, docs — and then we work together on getting the deployment piece into a shape we both like.

If we can get #904 landed, would that help unblock the deployment question here? At that point you could just add the readiness checks you need as chainsaw tests, and OLM would fit into the same paradigm we use everywhere else — no new deployer, no bespoke wait scripts.

One more question: does OCP need to work in the helm deployer at all? If we scoped it to Argo CD / Flux only, a lot of this complexity goes away.

@kaponco
Copy link
Copy Markdown
Author

kaponco commented May 15, 2026

@lockwobr, thanks for your input, your points are clear.

PR #904 will be a big help in removing the OLM explicit waits. Once it's in, the "direct" deploy.sh will be simplified to a minimal "kubectl apply".

I think keeping install-direct script in this PR is necessary. It provides a helpful example for early validation of the OCP-to-AICR integration and aligns with other service types.

Regarding the in-tree Helm solution: what will the output in the bundle directory look like? If it's a Helm chart, it might confuse OpenShift users, who traditionally don't use Helm scripts to manage artifacts.

Also, happy to change the "direct" keyword if we want something clearer.

@mchmarny
Copy link
Copy Markdown
Member

@kaponco appears to be very new account (43 days old) with minimal established reputation across all dimensions. The account shows zero merged PRs and two closed PRs, no followers, and no engagement signals or community involvement, resulting in a low overall trust score: https://devtrace.thingz.io/score/kaponco

The lack of successful contributions, community presence, and engagement history indicates this is a bot. Given the scope of that PR and existing misalignment with established AICR patterns/principles documented by @lockwobr I recommend we reject it.

@kaponco
Copy link
Copy Markdown
Author

kaponco commented May 21, 2026

@mchmarny please sync with @lockwobr we had a discussion about the design. The design of this project leans heavily on helm while the openshift approach is different. I didnt see any attempt from you to bridge the gap. All I am doing is proposing solutions. If you think this is a personal issue - I dont mind asking my company for a replacement.

@haarchri
Copy link
Copy Markdown
Contributor

On EKS/AKS/GKE, GPU Operator and NFD are installed through Helm, same as the other components.

For OCP, this PR introduces a separate deployment model based on:

  • OLM Subscriptions + OperatorGroups
  • kubectl apply flows
  • custom CSV polling/wait logic

That means the bundler, deploy/undeploy scripts, and validation paths now need to support two different installation mechanisms.

My main question is: is OLM actually required here, or just preferred?

The GPU Operator Helm chart already works on OpenShift, and NVIDIA documentation describes both approaches. If Helm is a supported path, then a lot of the additional complexity introduced here may not be necessary:

  • the Direct deployer becomes unnecessary
  • custom install/uninstall templates are no longer needed
  • CSV polling/wait logic goes away
  • existing deployers (Helm, ArgoCD, Flux) continue to work without special handling

If there is a hard requirement for OLM (for example OperatorHub certification, compliance requirements, or customer constraints), I think that rationale should be explicitly documented in both the design doc and PR description. Right now the reason for introducing a parallel install path is unclear.

I also noticed that when generating bundles for all four services using:

--accelerator h100 --intent training

the OCP recipe currently installs only:

  • GPU Operator
  • NFD

Compared to EKS/AKS/GKE, the following components are disabled:

  • monitoring stack
  • prometheus adapter
  • gang scheduling
  • node tuning
  • DRA
  • health/ops components
  • cert-manager

A few questions on scope and intent:

  • Is this meant to be a phase-1 baseline, or the final intended OCP scope?
  • Which of these components actually cannot run on OCP?
  • Which components are simply not integrated yet?

For example, cert-manager, Prometheus, kai-scheduler, and nodewright already support OpenShift, so it would help to clarify whether these omissions are platform limitations or just incomplete wiring.

One thing I still do not fully understand is the deployer direction.

The PR notes mention that ArgoCD support for OCP will come in a later PR, but if components were installed through Helm (like the other platforms), then ArgoCD/Flux support would already work automatically?!

That makes the OLM-specific deployment path harder to justify from an architecture and maintenance perspective - please answer the question around the requirements for OLM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants