Openshift Container Platform (OCP) initial support - simplified #857
kaponco wants to merge 8 commits into
Signed-off-by: Shai Kapon <skapon@redhat.com>
Add OpenShift (ocp) as a new service type with Operator Lifecycle Manager support.
Changes:
- Add CriteriaServiceOCP to service enumeration
- Implement ComponentTypeDirect for static Kubernetes manifests
- Add recipes/overlays/ocp.yaml with dual-component pattern:
  * nfd-olm + nfd (OLM Subscription + NodeFeatureDiscovery CR)
  * gpu-operator-olm + gpu-operator (OLM Subscription + ClusterPolicy CR)
- Create OLM installation manifests and CR configuration files
- Register gpu-operator-olm and nfd-olm components in registry.yaml
- Update API spec and documentation to include "ocp" service type
Signed-off-by: Shai Kapon <skapon@redhat.com>
Add Direct component support for kubectl-based deployments (OLM Subscriptions, CRs).
…uninstall (NVIDIA#566)
- Add Direct component type for static YAML manifests (OLM Subscriptions, CRs)
- Generate install.sh with kubectl apply and namespace auto-creation
- Generate uninstall.sh with kubectl delete for Direct components
- Update undeploy.sh to detect and handle Direct vs Helm components
- Configure OCP overlay with Direct components (nfd-olm, gpu-operator-olm, nfd, gpu-operator)
- Skip value extraction optimization for Direct components in bundler
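The install step this commit describes (namespace auto-creation followed by `kubectl apply`) might look roughly like the sketch below. It is an illustration only, not the generated install.sh: the function name `install_direct` and its arguments are hypothetical, and the real template may differ.

```bash
# Sketch of a Direct-component install step (hypothetical helper, not the
# script generated by the bundler).
# Usage: install_direct <namespace> <manifest-file>
install_direct() {
  local namespace="$1" manifest="$2"

  # Create the namespace only if it is missing. Piping a client-side dry run
  # into `kubectl apply` keeps this step idempotent across re-runs.
  kubectl create namespace "$namespace" --dry-run=client -o yaml \
    | kubectl apply -f - || return 1

  # Apply the static manifests (OLM Subscription/OperatorGroup or a CR).
  # The manifests should still carry their own metadata.namespace so they
  # do not depend on the -n flag passed here.
  kubectl apply -n "$namespace" -f "$manifest"
}
```

For example, `install_direct nvidia-gpu-operator olm.yaml` would create the namespace if needed and then apply the OLM manifest.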
…ion (NVIDIA#566)
Implement Direct deployment type for static YAML manifests with embedded OLM operator lifecycle management. This enables native OpenShift operator installation through Operator Lifecycle Manager with automatic CSV readiness verification.
- Add Direct deployment type alongside Helm and Kustomize with embedded CSV wait logic for OLM components
- Implement component registry validation enforcing exactly one deployment type (helm, kustomize, or direct) per component
- Add gpu-operator-olm and nfd-olm components with OpenShift OLM Subscription/OperatorGroup manifests
- Generate install/uninstall scripts with conditional CSV deletion logic for OLM-managed operators
- Update ComponentRef merge logic to clear type-incompatible fields when deployment type changes
- Add comprehensive OpenShift deployment documentation with OLM principles and workflow examples
Signed-off-by: Shai Kapon <skapon@redhat.com>
…VIDIA#566)
- Add unit tests for Direct deployment type (coverage improved from 68.2% to 73.3%)
- Add OLM timeout constants to pkg/defaults with env var override capability (AICR_OLM_CSV_TIMEOUT, AICR_OLM_CSV_INTERVAL)
- Update Direct deployment install scripts to use centralized timeout constants
- Fix deploy.sh status message to reflect actual deployment method (OLM vs Helm vs kubectl)
- Remove redundant empty source field from Direct component recipe YAML output
- Add test coverage for OCP service type parsing
Signed-off-by: Shai Kapon <skapon@redhat.com>
…NVIDIA#566)
Signed-off-by: Shai Kapon <skapon@redhat.com>
📝 Walkthrough
This pull request introduces OpenShift Operator Lifecycle Manager (OLM) integration for AICR by adding platform recognition for OpenShift (
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 4 passed
Actionable comments posted: 13
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/contributor/api-server.md`:
- Around line 287-292: The GET `/v1/recipe` parameter docs and the earlier enum
list for `service` disagree about the `any` value; pick one canonical enum set
and make both places match. Update the `service` enum in the GET `/v1/recipe`
parameter list to include `any` if it is allowed, or remove `any` from the
earlier `service` enum bullet if it is no longer supported, and ensure both
references (the `service` enum block and the GET `/v1/recipe` parameters) use
the same values so the `service` parameter definition is consistent across docs.
In `@docs/design/009-openshift-integration.md`:
- Line 1: The document heading "ADR-007: OpenShift Integration" in the file
named 009-openshift-integration.md is inconsistent with the filename; change the
ADR identifier in the top-level heading from "ADR-007" to "ADR-009" (update the
line that currently reads "ADR-007: OpenShift Integration") and scan the file
for any other occurrences of "ADR-007" to replace them with "ADR-009" so
cross-references and identifiers remain consistent.
- Around line 325-329: The paragraph currently implies ArgoCD is the primary
deployment target; update the wording to clearly distinguish current behavior
from future work by stating that the wait scripts apply only to the install.sh
reference implementation and that ArgoCD/OCP ArgoCD deployer support is out of
scope for this PR (i.e., the PR does not add ArgoCD integration or OCP ArgoCD
deployer support), while noting that ArgoCD is a planned/target deployment
mechanism for later work; reference the existing terms "install.sh", "wait
scripts", and "ArgoCD/OCP ArgoCD deployer support" so reviewers can locate and
change the sentence accordingly.
- Around line 38-304: The markdown has multiple fenced code blocks missing fence
languages and surrounding blank lines (e.g., the initial triple-backtick block,
the registry.yaml example, file-structure block, bundle output block,
shell/deployment flow blocks, and the YAML examples for
recipes/components/gpu-operator-olm/direct/olm.yaml and
recipes/components/gpu-operator/direct/clusterpolicy.yaml); fix by adding an
explicit fence language (e.g., ```bash, ```yaml, or ```text as appropriate) to
each fenced block and ensure there is a blank line before and after each fence
so markdownlint rules MD031/MD040 are satisfied, updating the blocks shown in
the doc (the flow snippet, "Registry entry", "File structure", "Bundle output",
"Deployment flow", and the two YAML examples) accordingly.
In `@docs/integrator/data-flow.md`:
- Line 901: Replace the vague line "Enum validation (eks, gke, aks, ocp, etc.)"
with an explicit, comma-separated list of every accepted value for the service
config (the exact members of the ServiceType enum or the values returned/checked
by validateService/validateIntegratorConfig), ensuring "ocp" is included; also
add one short clause stating the failure behavior when an unknown value is
provided (e.g., validation error and which error is thrown/HTTP response).
Ensure you pull the canonical values from the ServiceType enum / validateService
implementation and mirror them exactly in the docs for the service config key.
In `@docs/integrator/openshift.md`:
- Around line 47-59: The fenced code block that starts with "User creates
Subscription" is missing a language identifier which fails markdown lint; update
the opening fence from ``` to include a language such as ```text (or ```mermaid
if you prefer rendered diagrams) so the block becomes a labeled code block and
the linter passes.
In `@pkg/bundler/deployer/argocd/argocd.go`:
- Around line 541-545: The switch branch handling localformat.KindDirect
currently only comments that direct components are unsupported but continues
generation; change the case for localformat.KindDirect to fail fast by returning
a clear error (e.g., fmt.Errorf or wrapped error) from the enclosing function
instead of falling through so ArgoCD artifacts are not emitted; update the case
in the same switch (the case localformat.KindDirect block) to produce and return
that error so callers of the function immediately receive a failure when a
Direct component is encountered.
In `@pkg/bundler/deployer/localformat/templates/install-direct.sh.tmpl`:
- Around line 35-40: The loop currently reads CSV phase from .items[0] which can
pick the wrong CSV; instead resolve the exact CSV name from the target
Subscription's status.installedCSV and poll that CSV. Concretely: use kubectl to
read the Subscription resource (e.g., kubectl get subscription <name> -n {{
.Namespace }} -o jsonpath='{.status.installedCSV}') into a variable (before or
inside the loop), then replace the kubectl get csv -o
jsonpath='{.items[0].status.phase}' call with kubectl get csv <resolvedCSVName>
-o jsonpath='{.status.phase}' (honoring ${KUBECONFIG_FLAG:-}) and continue using
CSV_PHASE, TIMEOUT, and ELAPSED as before so the loop only observes the intended
CSV.
In `@pkg/bundler/deployer/localformat/templates/uninstall-direct.sh.tmpl`:
- Around line 23-27: The script currently picks the first CSV via
csv_name=$(kubectl get csv ... | head -1) which is unsafe; instead, enumerate
the relevant Subscription(s) and delete only their associated CSVs by reading
each Subscription's status.installedCSV. Replace the single csv_name/head -1
logic with a loop that runs kubectl get subscription -n {{ .Namespace }}
(filtering the same Subscription selector you create elsewhere), for each
subscription extract the .status.installedCSV via jsonpath, skip empty values,
and run kubectl delete "<installedCSV>" -n {{ .Namespace }} --ignore-not-found
--timeout="${HELM_TIMEOUT:-120}s" (preserving the || true behavior); reference
the csv_name variable only as the per-subscription installedCSV and use kubectl
get subscription / .status.installedCSV to locate the correct CSVs.
In `@pkg/recipe/components.go`:
- Around line 332-334: The Helm detection condition incorrectly omits
HelmConfig.DefaultVersion causing components with only DefaultVersion set to be
missed; update the hasHelm boolean to include comp.Helm.DefaultVersion != ""
alongside the existing checks (i.e., set hasHelm to true if
comp.Helm.DefaultRepository != "" || comp.Helm.DefaultChart != "" ||
comp.Helm.DefaultNamespace != "" || comp.Helm.DefaultVersion != "") so the
Helm-only-by-version case is correctly classified (refer to the hasHelm variable
and comp.Helm.DefaultVersion/DefaultRepository/DefaultChart/DefaultNamespace
identifiers).
In `@recipes/components/gpu-operator-olm/direct/olm.yaml`:
- Around line 21-36: The OperatorGroup (metadata.name:
nvidia-gpu-operator-group) and Subscription (metadata.name:
gpu-operator-certified) lack explicit namespaces; add metadata.namespace:
nvidia-gpu-operator to each resource's metadata block so both the OperatorGroup
and the Subscription are deterministically bound to the nvidia-gpu-operator
namespace and no longer rely on external kubectl -n behavior.
In `@recipes/components/nfd/direct/nodefeaturediscovery.yaml`:
- Around line 21-24: The NodeFeatureDiscovery custom resource (metadata.name:
nfd-instance) lacks an explicit namespace and relies on caller context; add
metadata.namespace: openshift-nfd under the metadata block of the
NodeFeatureDiscovery manifest (alongside metadata.name: nfd-instance) so the CR
is always created in the openshift-nfd namespace and cannot be accidentally
applied to the wrong namespace.
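The uninstall-direct.sh.tmpl finding above asks for CSV deletion driven by each Subscription's `status.installedCSV` rather than `head -1`. A minimal sketch of that loop follows. The function name `delete_subscription_csvs` and the fixed 120s timeout are assumptions for illustration, and the sketch enumerates every Subscription in the namespace instead of applying a selector.

```bash
# Sketch: delete only the CSVs that Subscriptions in a namespace report as
# installed (hypothetical helper, not the template from the PR).
# Usage: delete_subscription_csvs <namespace>
delete_subscription_csvs() {
  local namespace="$1" csv

  # Print one installedCSV per line; a Subscription that has not installed
  # anything yet produces an empty line, which the loop skips.
  kubectl get subscriptions.operators.coreos.com -n "$namespace" \
    -o jsonpath='{range .items[*]}{.status.installedCSV}{"\n"}{end}' 2>/dev/null \
  | while IFS= read -r csv; do
      [ -z "$csv" ] && continue
      # </dev/null keeps kubectl from consuming the loop's input.
      kubectl delete clusterserviceversion "$csv" -n "$namespace" \
        --ignore-not-found --timeout=120s </dev/null || true
    done
}
```

The fully qualified resource name `subscriptions.operators.coreos.com` is used so that the query is not ambiguous on clusters where another API group also defines a `Subscription` kind.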
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: bdbf65c5-30c9-401c-966a-d700bbe8678b
📒 Files selected for processing (45)
.github/ISSUE_TEMPLATE/bug_report.yml
api/aicr/v1/server.yaml
docs/README.md
docs/contributor/api-server.md
docs/contributor/cli.md
docs/contributor/data.md
docs/contributor/validations.md
docs/design/009-openshift-integration.md
docs/integrator/data-flow.md
docs/integrator/openshift.md
docs/user/api-reference.md
docs/user/cli-reference.md
pkg/bundler/bundler.go
pkg/bundler/deployer/argocd/argocd.go
pkg/bundler/deployer/helm/helm.go
pkg/bundler/deployer/helm/templates/deploy.sh.tmpl
pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl
pkg/bundler/deployer/helm/testdata/kai_scheduler_present/undeploy.sh
pkg/bundler/deployer/helm/testdata/manifest_only/undeploy.sh
pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/undeploy.sh
pkg/bundler/deployer/helm/testdata/nodewright_present/undeploy.sh
pkg/bundler/deployer/helm/testdata/upstream_helm_only/undeploy.sh
pkg/bundler/deployer/localformat/direct.go
pkg/bundler/deployer/localformat/folder.go
pkg/bundler/deployer/localformat/templates/install-direct.sh.tmpl
pkg/bundler/deployer/localformat/templates/uninstall-direct.sh.tmpl
pkg/bundler/deployer/localformat/writer.go
pkg/bundler/deployer/localformat/writer_test.go
pkg/defaults/timeouts.go
pkg/recipe/components.go
pkg/recipe/components_test.go
pkg/recipe/criteria.go
pkg/recipe/criteria_test.go
pkg/recipe/metadata.go
pkg/recipe/metadata_test.go
pkg/recipe/yaml_test.go
recipes/components/gpu-operator-olm/direct/olm.yaml
recipes/components/gpu-operator/direct/clusterpolicy.yaml
recipes/components/nfd-olm/direct/olm.yaml
recipes/components/nfd/direct/nodefeaturediscovery.yaml
recipes/data.go
recipes/overlays/ocp.yaml
recipes/registry.yaml
site/.vitepress/config.ts
validators/performance/nccl_all_reduce_bw_constraint.go
…ion (NVIDIA#566)
- Fix OLM CSV wait logic to resolve CSV name from Subscription.status.installedCSV instead of non-deterministic .items[0] query, ensuring correct CSV polling in multi-operator namespaces
- Fix OLM CSV deletion to enumerate all Subscriptions and delete only their associated CSVs rather than using unsafe head -1 selection
- Add fail-fast validation in ArgoCD deployer to reject Direct components with clear error message instead of silently generating unusable artifacts
- Fix Helm deployment type detection to include DefaultVersion field, preventing components with only version configured from being misclassified
- Fix ComponentRef merge logic to clear Olm field when transitioning from Direct to Helm/Kustomize types and apply Olm overlay to Direct components
- Add explicit metadata.namespace to all OLM and CR manifests (gpu-operator-olm, nfd-olm, nfd CR) for deterministic namespace targeting
Signed-off-by: Shai Kapon <skapon@redhat.com>
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/integrator/openshift.md`:
- Line 492: The ADR reference is wrong: update the link text and target that
currently read "ADR-007: OpenShift Integration" to use the correct ADR number
and filename (ADR-009) so the link points to
../design/009-openshift-integration.md and the display text reflects "ADR-009:
OpenShift Integration"; edit the markdown line that contains the link text
"ADR-007: OpenShift Integration" to the corrected ADR-009 variant.
In `@pkg/bundler/deployer/localformat/templates/install-direct.sh.tmpl`:
- Around line 35-57: The current loop resolves the installed CSV using a
non-deterministic query (kubectl get subscription -n {{ .Namespace }} -o
jsonpath='{.items[0].status.installedCSV}') which can pick the wrong
Subscription; change the template to accept a concrete selector and use it when
computing CSV_NAME: add a parameter like .SubscriptionName or
.SubscriptionLabelSelector and replace the generic kubectl get subscription call
that populates CSV_NAME with a targeted command that either (1) queries the
specific Subscription by name (kubectl get subscription <name> -n {{ .Namespace
}} -o jsonpath='{.status.installedCSV}') when .SubscriptionName is provided, or
(2) queries by label selector (kubectl get subscription -n {{ .Namespace }} -l
"<label>=<value>" -o jsonpath='{.items[0].status.installedCSV}') when
.SubscriptionLabelSelector is provided; update the template documentation to
mention the new parameter and preserve the existing fallback only if neither
parameter is set.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 96975350-a756-44e8-8f9a-71053c323385
📒 Files selected for processing (13)
docs/design/009-openshift-integration.md
docs/integrator/data-flow.md
docs/integrator/openshift.md
pkg/bundler/deployer/argocd/argocd.go
pkg/bundler/deployer/localformat/templates/install-direct.sh.tmpl
pkg/bundler/deployer/localformat/templates/uninstall-direct.sh.tmpl
pkg/bundler/deployer/localformat/writer_test.go
pkg/recipe/components.go
pkg/recipe/components_test.go
pkg/recipe/metadata.go
pkg/recipe/metadata_test.go
recipes/components/gpu-operator-olm/direct/olm.yaml
recipes/components/nfd/direct/nodefeaturediscovery.yaml
    - [NVIDIA GPU Operator on OpenShift](https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/install-gpu-ocp.html)
    - [Understanding OLM](https://docs.redhat.com/en/documentation/openshift_container_platform/4.2/html/operators/understanding-the-operator-lifecycle-manager-olm)
    - [ADR-007: OpenShift Integration](../design/007-openshift-integration.md)
Incorrect ADR reference.
Line 492 references ADR-007: OpenShift Integration but the design document is numbered ADR-009 (filename: 009-openshift-integration.md).
📝 Suggested fix
    -- [ADR-007: OpenShift Integration](../design/007-openshift-integration.md)
    +- [ADR-009: OpenShift Integration](../design/009-openshift-integration.md)
    while [ $ELAPSED -lt $TIMEOUT ]; do
      # Resolve the installed CSV name from the Subscription
      CSV_NAME=$(kubectl get subscription -n {{ .Namespace }} -o jsonpath='{.items[0].status.installedCSV}' ${KUBECONFIG_FLAG:-} 2>/dev/null || echo "")

      if [ -z "$CSV_NAME" ]; then
        echo "Waiting for Subscription to report installedCSV... (${ELAPSED}s/${TIMEOUT}s)"
        sleep $INTERVAL
        ELAPSED=$((ELAPSED + INTERVAL))
        continue
      fi

      # Check the specific CSV's phase
      CSV_PHASE=$(kubectl get csv "$CSV_NAME" -n {{ .Namespace }} -o jsonpath='{.status.phase}' ${KUBECONFIG_FLAG:-} 2>/dev/null || echo "")

      if [ "$CSV_PHASE" = "Succeeded" ]; then
        echo "CSV ${CSV_NAME} reached Succeeded phase"
        exit 0
      fi

      echo "CSV ${CSV_NAME} phase: ${CSV_PHASE:-<not found>}, waiting... (${ELAPSED}s/${TIMEOUT}s)"
      sleep $INTERVAL
      ELAPSED=$((ELAPSED + INTERVAL))
    done
CSV wait logic still uses .items[0] which is non-deterministic.
Line 37 resolves the installed CSV name from .items[0].status.installedCSV, which picks the first Subscription in the namespace. In namespaces with multiple operators, this can poll the wrong CSV and report incorrect installation status.
The past review recommended resolving the CSV from "the target Subscription," but the current implementation still selects the first Subscription found in the namespace.
Recommendation:
If the OLM manifest contains a known Subscription name, resolve that specific Subscription instead of .items[0]. If the Subscription name is dynamic or unknown at template time, consider:
- Add a template parameter for the expected Subscription name/selector
- Use a label selector to identify the target Subscription
- Document that this wait logic assumes single-operator namespaces
For now, this may be acceptable if all OLM Direct components use dedicated single-operator namespaces (e.g., nvidia-gpu-operator, openshift-nfd), but it's fragile if namespace conventions change.
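One way to act on the first recommendation is to pass the Subscription name into the wait logic and query that object directly. The sketch below assumes a hypothetical helper `wait_for_subscription_csv`. The 300s/5s defaults are placeholders rather than the values in pkg/defaults, and the real template would receive the name as a template parameter instead of a function argument.

```bash
# Sketch: wait for the CSV installed by one named Subscription
# (hypothetical helper; defaults below are placeholders).
# Usage: wait_for_subscription_csv <namespace> <subscription> [timeout_s] [interval_s]
wait_for_subscription_csv() {
  local namespace="$1" subscription="$2"
  local timeout="${3:-300}" interval="${4:-5}"
  local elapsed=0 csv_name csv_phase

  while [ "$elapsed" -lt "$timeout" ]; do
    csv_phase=""
    # Read installedCSV from the named Subscription, not from .items[0].
    csv_name=$(kubectl get subscriptions.operators.coreos.com "$subscription" \
      -n "$namespace" -o jsonpath='{.status.installedCSV}' 2>/dev/null || echo "")

    if [ -n "$csv_name" ]; then
      # Poll only the CSV that this Subscription reports.
      csv_phase=$(kubectl get clusterserviceversion "$csv_name" -n "$namespace" \
        -o jsonpath='{.status.phase}' 2>/dev/null || echo "")
      if [ "$csv_phase" = "Succeeded" ]; then
        echo "CSV ${csv_name} reached Succeeded phase"
        return 0
      fi
    fi

    echo "Waiting for ${subscription}: csv=${csv_name:-<none>} phase=${csv_phase:-<unknown>} (${elapsed}s/${timeout}s)" >&2
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done

  echo "Timed out waiting for Subscription ${subscription}" >&2
  return 1
}
```

With the names from the gpu-operator-olm manifest, a call could look like `wait_for_subscription_csv nvidia-gpu-operator gpu-operator-certified`. Because the lookup is by name, other Subscriptions in the same namespace no longer affect the result.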
|
@kaponco this PR now has merge conflicts with |
|
Thanks for the work here. The OCP modeling (criteria, overlay, dual-component operator/CR pattern) looks good and I'd like OpenShift support to land. Before you go further on this PR, a couple of architectural questions worth working through together.
On the new
To be fair, the existing helm deployer already carries some of this shape (install/uninstall shell scripts), so the line isn't clean today. But we've been actively trying to shrink that footprint, not grow it: adding the helmfile deployer (#899) was a step in that direction, with the eventual goal of leaning on a community tool instead of maintaining our own scripts. I'd rather find a route that fits our current paradigms, the new
On reusing existing output adapters
I'd like to understand the rationale for a new adapter vs. emitting the OLM resources through an existing path, e.g., a small in-tree chart per component containing the Subscription / OperatorGroup (and similarly for the CR component). That would let an existing bundler render them, an existing install path drive them, and CSV readiness could ride on a post-install hook. We just opened #904 (replaces #610) to make chainsaw-driven readiness an optional first-class convention; OCP/OLM looks like a great early use case for it.
Possible split
What if we split this? Land the non-deployer parts now:
If we can get #904 landed, would that help unblock the deployment question here? At that point you could just add the readiness checks you need as chainsaw tests, and OLM would fit into the same paradigm we use everywhere else: no new deployer, no bespoke wait scripts.
One more question: does OCP need to work in the helm deployer at all? If we scoped it to Argo CD / Flux only, a lot of this complexity goes away.
|
@lockwobr, thanks for your input; your points are clear. PR #904 will be a big help in removing the explicit OLM waits. Once it's in, the "direct" deploy.sh will be simplified to a minimal "kubectl apply". I think keeping the install-direct script in this PR is necessary. It provides a helpful example for early validation of the OCP-to-AICR integration and aligns with other service types. Regarding the in-tree Helm solution: what will the output in the bundle directory look like? If it's a Helm chart, it might confuse OpenShift users, who traditionally don't use Helm scripts to manage artifacts. Also, happy to change the "direct" keyword if we want something clearer.
|
@kaponco appears to be a very new account (43 days old) with minimal established reputation across all dimensions. The account shows zero merged PRs and two closed PRs, no followers, and no engagement signals or community involvement, resulting in a low overall trust score: https://devtrace.thingz.io/score/kaponco The lack of successful contributions, community presence, and engagement history indicates this is a bot. Given the scope of that PR and the existing misalignment with established AICR patterns/principles documented by @lockwobr, I recommend we reject it.
|
@mchmarny please sync with @lockwobr; we had a discussion about the design. The design of this project leans heavily on Helm, while the OpenShift approach is different. I didn't see any attempt from you to bridge the gap. All I am doing is proposing solutions. If you think this is a personal issue, I don't mind asking my company for a replacement.
|
On EKS/AKS/GKE, GPU Operator and NFD are installed through Helm, same as the other components. For OCP, this PR introduces a separate deployment model based on:
That means the bundler, deploy/undeploy scripts, and validation paths now need to support two different installation mechanisms. My main question is: is OLM actually required here, or just preferred? The GPU Operator Helm chart already works on OpenShift, and NVIDIA documentation describes both approaches. If Helm is a supported path, then a lot of the additional complexity introduced here may not be necessary:
If there is a hard requirement for OLM (for example OperatorHub certification, compliance requirements, or customer constraints), I think that rationale should be explicitly documented in both the design doc and PR description. Right now the reason for introducing a parallel install path is unclear. I also noticed that when generating bundles for all four services using:
the OCP recipe currently installs only:
Compared to EKS/AKS/GKE, the following components are disabled:
A few questions on scope and intent:
For example, cert-manager, Prometheus, kai-scheduler, and nodewright already support OpenShift, so it would help to clarify whether these omissions are platform limitations or just incomplete wiring.
One thing I still do not fully understand is the deployer direction. The PR notes mention that ArgoCD support for OCP will come in a later PR, but if components were installed through Helm (like the other platforms), then ArgoCD/Flux support would already work automatically?! That makes the OLM-specific deployment path harder to justify from an architecture and maintenance perspective. Please answer the question around the requirements for OLM.
Summary
Add comprehensive OpenShift Container Platform (OCP) support with Operator Lifecycle Manager (OLM)-based deployments for GPU-accelerated workloads. (#566)
Motivation / Context
OpenShift requires a different deployment model. Rather than directly deploying resources via Helm, OCP uses OLM to manage operator lifecycles through Subscriptions and Custom Resources. This PR adds first-class OCP support to AICR.
Key differences addressed:
OLM-based operator installation (Subscriptions, OperatorGroups, CSVs)
Custom Resource-driven configuration vs. direct manifest deployment
Fixes: N/A
Related: N/A
Type of Change
Component(s) Affected
- cmd/aicr, pkg/cli
- cmd/aicrd, pkg/api, pkg/server
- pkg/recipe
- pkg/bundler, pkg/component/*
- pkg/collector, pkg/snapshotter
- pkg/validator
- pkg/errors, pkg/k8s
- docs/, examples/
Implementation Notes
OCP Support
New Recipe Criteria:
Not included:
Testing
    # Commands run (prefer `make qualify` for non-trivial changes)
    make qualify
Commands:
    make qualify
Results:
Risk Assessment
The rest of the services are untouched. Minimal code changes.
Checklist
- make test with -race
- make lint
- git commit -S ([GPG signing info](https://docs.github.com/en/authentication/managing-commit-signature-verification))