Skip to content

feat(e2e): add user-override E2E tests and fix OCP-specific test issues#228

Open
rlobillo wants to merge 5 commits into
openshift-virtualization:mainfrom
rlobillo:e2e-user-override
Open

feat(e2e): add user-override E2E tests and fix OCP-specific test issues#228
rlobillo wants to merge 5 commits into
openshift-virtualization:mainfrom
rlobillo:e2e-user-override

Conversation

@rlobillo

@rlobillo rlobillo commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

Short Description

Add comprehensive E2E tests for the user-override feature (patch, ignore-fields, unmanaged modes) and fix OCP-specific test issues.

More details

This PR introduces a full E2E test suite for the user-override feature and addresses issues uncovered during testing on OpenShift clusters. It also strengthens metric-cleanup assertions for the selective-activation feature.

What this PR does / why we need it

1. User-override E2E tests (CNV-89267, CNV-89805)

New test file user_override_test.go covering three override modes:

  • Patch mode: concurrent application with drift detection/correction, security blocks on sensitive kinds, forbidden patch paths, invalid patch syntax
  • Ignore-fields mode: active field modification without reconciliation
  • Unmanaged mode: full drift preservation and re-management

Supporting changes:

  • Extend assets_test.go with UserOverrideFieldSpec (per-asset real operator-controlled fields), sensitiveKinds auto-derivation via initAssets(), and 3 observability assets (Service, ServiceMonitor, PrometheusRule)
  • Extract shared helpers (setAnnotation, removeAnnotation, setLabel, tamperField, readOverrideFieldValue, findCustomizationMetric, pollResourceField) and generic metric lookup functions into helpers_test.go

2. Skip PrometheusRule drift tests on OCP (fix)

The anti-thrashing and alert suites set PrometheusRule to unmanaged mode so they can patch alert "for" durations without the operator reverting them. This causes all drift-detection tests for PrometheusRule to time out on OpenShift. Added skipIfUnmanagedOnOCP() to skip these specs on OCP clusters.

3. Static KubeDescheduler asset for namespace guard tests (fix)

The namespace guard test was iterating over all assetsUnderTest and deleting each asset's namespace. After adding observability assets in openshift-cnv, this destroyed the operator's own namespace. Replaced with a single static KubeDescheduler asset which is safe to delete.

4. Wait for MachineConfigPools to stabilize (fix)

On OCP clusters, patching the autopilot annotation can trigger MachineConfig rollouts that leave nodes cordoned/updating. Tests that run before the rollout completes hit spurious failures. Added waitForMCPStable() in BeforeSuite and after every autopilot patch to block until all MCPs report Updated=True, Updating=False, Degraded=False. On non-OCP clusters (e.g. Kind) the function returns immediately.

5. Explicit metric-cleanup assertions for selective activation (CNV-89268)

Strengthened the selective-activation E2E tests to explicitly verify that per-asset metrics (compliance_status, paused_resources, reconcile_duration, customization_info) are deleted (series value = -1) when an asset leaves the allowlist. The previous test only compared metrics at two points after exclusion, which would pass even without the fix.

Other

  • Increased Ginkgo timeout from 10m to 30m in run-e2e.sh to accommodate the expanded test suite

rlobillo and others added 3 commits July 1, 2026 15:38
The namespace guard test was iterating over all assetsUnderTest and
deleting each asset's namespace. After adding Service, ServiceMonitor,
and PrometheusRule assets in openshift-cnv, this destroyed the operator's
own namespace, killing the autopilot pod with no way to recover.

Replace the dynamic loop with a single static KubeDescheduler asset
(namespace openshift-kube-descheduler-operator) which is safe to delete
without affecting the operator.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…nmanaged

The anti-thrashing and alert suites set PrometheusRule to unmanaged mode
(platform.kubevirt.io/mode=unmanaged) so they can patch alert "for"
durations without the operator reverting them. This makes the operator
skip PrometheusRule entirely during reconciliation, causing all drift-
detection tests for that asset to time out on OpenShift.

Add skipIfUnmanagedOnOCP() to skip PrometheusRule specs on OCP clusters
where unmanaged mode is active. On kind this is a non-issue because the
unmanaged block is never set. Also removes the informer cache sleep
workaround from touchHCO() which is no longer needed now that the real
root cause is addressed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…unmanaged mode (CNV-89267, CNV-89805)

Add a comprehensive E2E test suite for the user-override feature covering
three override modes: patch (concurrent application with drift
detection/correction), ignore-fields (active field modification without
reconciliation), and unmanaged (full drift preservation and
re-management).

Key changes:

- New test file user_override_test.go with tests for concurrent patch
  application, security blocks on sensitive kinds, forbidden patch paths,
  invalid patch syntax, ignore-fields with active field tampering, and
  unmanaged mode with drift preservation and re-management.

- Extend assets_test.go with UserOverrideFieldSpec (per-asset real
  operator-controlled fields), sensitiveKinds auto-derivation via
  initAssets(), and 3 observability assets (Service, ServiceMonitor,
  PrometheusRule).

- Extract shared helpers (setAnnotation, removeAnnotation, setLabel,
  tamperField, readOverrideFieldValue, findCustomizationMetric,
  pollResourceField) and generic metric lookup functions into
  helpers_test.go for reuse across all test suites.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@openshift-ci

openshift-ci Bot commented Jul 1, 2026

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign fabiand for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@github-actions

github-actions Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

Generated Files Verification Failed

One or more generated files in this PR are out of sync:

  • CRDs: Run make update-crds if CRD verification failed
  • RBAC: Run make generate-rbac if RBAC verification failed

Please regenerate the files locally and commit the changes.

rlobillo and others added 2 commits July 2, 2026 12:00
On OCP clusters, patching the autopilot annotation can trigger
MachineConfig rollouts that leave nodes cordoned/updating. Tests that
run before the rollout completes hit spurious failures. Add
waitForMCPStable() in BeforeSuite and after every autopilot patch to
block until all MCPs report Updated=True, Updating=False,
Degraded=False.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…t (CNV-89268)

The existing test only checked that metrics didn't change between two
points after the allowlist was narrowed, which would pass even without
the fix. Now explicitly assert that compliance_status, paused_resources,
reconcile_duration, and customization_info are -1 (series deleted) after
the asset is excluded from the allowlist.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant