[upstream CP] Optimize VSC handle readiness polling for VSS backups #517

Open
sseago wants to merge 2 commits into openshift:oadp-dev from sseago:vss-oadp-dev

Conversation

@sseago sseago commented Jun 23, 2026

Thank you for contributing to Velero!

Please add a summary of your change

When waiting for the CSI Snapshot to complete, the CSI plugin checks for the SnapHandle every 5 seconds up until csiSnapshotTimeout (default 10min) is reached. This is a problem for workloads that use Microsoft VSS because VSS will unfreeze the filesystem after 10 seconds (which is not configurable). If a workload has 2 volumes, the 5 second polling interval will almost always result in a forced unfreeze before the post hook runs and likely before the last PVC's snapshot is done.

See the VSS doc here: https://learn.microsoft.com/en-us/windows/win32/vss/overview-of-processing-a-backup-under-vss
Note that the 10-second unfreeze is not configurable.

This PR refactors the wait to poll every second for the first 10 seconds and then fall back to the previous 5-second interval until the snapshot timeout is reached. The new behavior applies only when the csi-snapshot-early-frequent-polling flag is true.

related: https://redhat.atlassian.net/browse/OADP-8232

Summary by CodeRabbit

  • New Features
    • Added installation flag --csi-snapshot-early-frequent-polling to enable accelerated polling for CSI snapshot handle readiness, improving performance of volume snapshot operations during backups.

sseago and others added 2 commits June 23, 2026 13:53
Co-authored-by: aider (gemini/gemini-2.5-pro) <aider@aider.chat>
Signed-off-by: Scott Seago <sseago@redhat.com>
Co-authored-by: aider (gemini/gemini-2.5-pro) <aider@aider.chat>
Signed-off-by: Scott Seago <sseago@redhat.com>
coderabbitai Bot commented Jun 23, 2026

Walkthrough

Adds a --csi-snapshot-early-frequent-polling CLI install flag that propagates through VeleroOptions and podTemplateConfig to inject a CSI_SNAPSHOT_EARLY_FREQUENT_POLLING=true environment variable into the Velero deployment. WaitUntilVSCHandleIsReady reads this variable to run an optional early 1s-interval/10s-duration polling phase before falling back to the standard 5s-interval loop.

Changes

CSI Snapshot Early Frequent Polling

| Layer / File(s) | Summary |
|---|---|
| Two-phase polling logic: pkg/util/csi/volume_snapshot.go | Rewrites WaitUntilVSCHandleIsReady into a shared pollFunc plus an optional early phase (1s interval, 10s cap) controlled by CSI_SNAPSHOT_EARLY_FREQUENT_POLLING, falling back to the original 5s-interval loop up to csiSnapshotTimeout. Adds os and strconv imports. |
| Deployment env var injection and install resource wiring: pkg/install/deployment.go, pkg/install/resources.go | Adds csiSnapshotEarlyFrequentPolling to podTemplateConfig and VeleroOptions, introduces WithCSISnapshotEarlyFrequentPolling option function, injects the CSI_SNAPSHOT_EARLY_FREQUENT_POLLING=true container env var when enabled, and conditionally appends the option in AllResources. |
| CLI install flag, propagation, and changelog: pkg/cmd/cli/install/install.go, changelogs/unreleased/9629-sseago | Adds CSISnapshotEarlyFrequentPolling bool to install Options, wires --csi-snapshot-early-frequent-polling in BindFlags, defaults to false, propagates into VeleroOptions via AsVeleroOptions, and adds the changelog entry. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 12 | ❌ 3

❌ Failed checks (3 warnings)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 42.86% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Test Structure And Quality | ⚠️ Warning | New feature adds WithCSISnapshotEarlyFrequentPolling() function and CSI_SNAPSHOT_EARLY_FREQUENT_POLLING environment variable logic, but PR includes zero tests for this functionality across all thre... | Add Ginkgo/test cases to verify: (1) WithCSISnapshotEarlyFrequentPolling correctly sets environment variable in deployment, (2) CSISnapshotEarlyFrequentPolling field in VeleroOptions enables early polling, (3) CSI_SNAPSHOT_EARLY_FREQUENT... |
| Description check | ⚠️ Warning | The pull request description provides a clear summary of the change addressing a VSS timing issue, but it omits required checklist items for DCO, changelog, and documentation updates. | Complete all checklist items: confirm DCO acceptance, verify changelog was created, and ensure documentation updates are included or mark changelog-not-required if not needed. |
✅ Passed checks (12 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and concisely summarizes the main change: optimizing VSC handle readiness polling specifically for VSS backups, which is the core objective of the PR. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | PR does not add or modify Ginkgo test files. Check for stable test names in Ginkgo tests is not applicable to this PR. |
| Microshift Test Compatibility | ✅ Passed | No new Ginkgo e2e tests are added in this PR. The changes are purely functional (CLI flags, deployment options, polling logic) with no test code additions. |
| Single Node Openshift (Sno) Test Compatibility | ✅ Passed | No Ginkgo e2e tests are added in this PR. All modifications are backend implementation changes (install options, deployment config, and polling logic) with no test additions. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | PR introduces no topology-unfriendly scheduling constraints. Changes are limited to adding an environment variable for CSI snapshot polling optimization with no deployment manifests, affinity rules... |
| Ote Binary Stdout Contract | ✅ Passed | PR modifies only library packages (install, csi). No process-level code (main/init), test files, or stdout writes (fmt.Print, klog) introduced. All logging uses proper logger parameters. |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | No new Ginkgo e2e tests were added in this PR. The changes are library/utility code for CSI polling (pkg/install/ and pkg/util/csi/), CLI flags, and a changelog entry. The check does not apply. |
| No-Weak-Crypto | ✅ Passed | No weak cryptographic algorithms (MD5, SHA1, DES, RC4, 3DES, Blowfish, ECB), custom crypto implementations, or non-constant-time secret comparisons found in any changed files. |
| Container-Privileges | ✅ Passed | The PR adds a CSI snapshot polling optimization feature via environment variable (CSI_SNAPSHOT_EARLY_FREQUENT_POLLING) without introducing any container privilege settings like privileged mode, hos... |
| No-Sensitive-Data-In-Logs | ✅ Passed | PR contains no sensitive data exposure in logs. Logged data includes only Kubernetes resource identifiers (namespace/name), polling intervals, and CSI driver error messages—none matching sensitive... |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Comment @coderabbitai help to get the list of available commands.

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 23, 2026
openshift-ci Bot commented Jun 23, 2026

@sseago: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/util/csi/volume_snapshot.go`:
- Around line 675-679: The second polling phase starting around line 691-697 is
using the full csiSnapshotTimeout duration instead of accounting for time
already spent in the first polling phase that begins at line 675. To fix this,
instead of passing a fresh context.Background() and the full timeout duration to
the second wait.PollUntilContextTimeout call, calculate the remaining time
budget by subtracting the elapsed time from the first polling phase from the
original csiSnapshotTimeout, then use that remaining duration for the fallback
polling loop. Alternatively, create a deadline-based context at the start of the
function and reuse it for both polling phases to ensure the total wait time
respects the overall timeout budget.
- Around line 655-657: The code at line 657 dereferences
vsc.Status.Error.Message without first checking if the Message pointer is nil,
which will cause a panic. This same issue also exists in the similar error
logging block around lines 708-711. Add a nil check for the Message field before
dereferencing it, following the same defensive pattern used in the diagnostic
functions around lines 737 and 764 where both the Error and Message fields are
properly checked before access.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 3b145ba4-7937-4d20-9efa-9f01b0a9c198

📥 Commits

Reviewing files that changed from the base of the PR and between 82b0e3e and 99738e0.

📒 Files selected for processing (5)
  • changelogs/unreleased/9629-sseago
  • pkg/cmd/cli/install/install.go
  • pkg/install/deployment.go
  • pkg/install/resources.go
  • pkg/util/csi/volume_snapshot.go

Comment on lines +655 to +657
vsc.Status.Error != nil {
log.Warnf("VolumeSnapshotContent %s has error: %v",
vsc.Name, *vsc.Status.Error.Message)

🩺 Stability & Availability | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, find and read the relevant file around the flagged lines
cat -n pkg/util/csi/volume_snapshot.go | sed -n '650,665p'

Repository: openshift/velero

Length of output: 664


🏁 Script executed:

# Also check the second location mentioned
cat -n pkg/util/csi/volume_snapshot.go | sed -n '700,715p'

Repository: openshift/velero

Length of output: 782


🏁 Script executed:

# Look at the type definition to understand if Message is optional
rg "VolumeSnapshotError" pkg/util/csi/volume_snapshot.go -A 5 -B 2

Repository: openshift/velero

Length of output: 154


🏁 Script executed:

# Search for the import of VolumeSnapshotError to find the API package
rg "import" pkg/util/csi/volume_snapshot.go | head -20

Repository: openshift/velero

Length of output: 163


🏁 Script executed:

# Search for type definitions and uses to understand the Message field
ast-grep outline pkg/util/csi/volume_snapshot.go

Repository: openshift/velero

Length of output: 1735


🏁 Script executed:

# Look for VolumeSnapshotError struct definition or where it's used with Message
rg "Message" pkg/util/csi/volume_snapshot.go

Repository: openshift/velero

Length of output: 1024


🏁 Script executed:

# Get the full import section
head -50 pkg/util/csi/volume_snapshot.go

Repository: openshift/velero

Length of output: 1807


🏁 Script executed:

# Get more context around the problematic lines to see the full conditions
sed -n '597,720p' pkg/util/csi/volume_snapshot.go | cat -n

Repository: openshift/velero

Length of output: 4518


🏁 Script executed:

# Look at how Message is safely used elsewhere in the codebase
sed -n '724,790p' pkg/util/csi/volume_snapshot.go | cat -n

Repository: openshift/velero

Length of output: 2485


Guard VolumeSnapshotError.Message before dereferencing.

Lines 657 and 708-711 dereference *vsc.Status.Error.Message without checking if Message is nil. The Message field is a pointer and can be nil even when Error is not, as shown in the diagnostic functions at lines 737 and 764 which correctly check both conditions. This will panic on error/timeout paths.

Suggested fix
+	getVSCErrMsg := func(vscErr *snapshotv1api.VolumeSnapshotError) string {
+		if vscErr == nil || vscErr.Message == nil || *vscErr.Message == "" {
+			return "<empty error message>"
+		}
+		return *vscErr.Message
+	}
+
 	if vsc.Status != nil &&
 		vsc.Status.Error != nil {
 		log.Warnf("VolumeSnapshotContent %s has error: %v",
-			vsc.Name, *vsc.Status.Error.Message)
+			vsc.Name, getVSCErrMsg(vsc.Status.Error))
 	}
@@
 			log.Errorf(
 				"Timed out awaiting reconciliation of VolumeSnapshot, VolumeSnapshotContent %s has error: %v",
-				vsc.Name, *vsc.Status.Error.Message)
+				vsc.Name, getVSCErrMsg(vsc.Status.Error))
 			return nil,
 				errors.Errorf("CSI got timed out with error: %v",
-					*vsc.Status.Error.Message)
+					getVSCErrMsg(vsc.Status.Error))

Also applies to: 708-711

Comment on lines +675 to +679
err = wait.PollUntilContextTimeout(
context.Background(),
interval,
timeout,
true,

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Second polling phase resets the timeout budget instead of using the remainder.

Line 693 starts the fallback loop with full csiSnapshotTimeout even after spending up to 10s in the early phase, so total wait can exceed the configured timeout.

💡 Suggested fix
+	remainingTimeout := csiSnapshotTimeout
+
 	if err == nil && frequentPolling {
@@
-		timeout := 10 * time.Second
+		timeout := 10 * time.Second
+		if timeout > remainingTimeout {
+			timeout = remainingTimeout
+		}
 		err = wait.PollUntilContextTimeout(
@@
 		if !wait.Interrupted(err) {
 			return nil, err
 		}
+		remainingTimeout -= timeout
+		if remainingTimeout <= 0 {
+			return nil, err
+		}
 	}
@@
 	err = wait.PollUntilContextTimeout(
 		context.Background(),
 		interval,
-		csiSnapshotTimeout,
+		remainingTimeout,
 		true,
 		pollFunc,
 	)

Also applies to: 691-697

openshift-ci Bot commented Jun 23, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: shubham-pampattiwar, sseago

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files.

2 participants