[upstream CP] Optimize VSC handle readiness polling for VSS backups by sseago · Pull Request #517 · openshift/velero

sseago · 2026-06-23T17:54:54Z

Thank you for contributing to Velero!

Please add a summary of your change

When waiting for the CSI Snapshot to complete, the CSI plugin checks for the SnapHandle every 5 seconds up until csiSnapshotTimeout (default 10min) is reached. This is a problem for workloads that use Microsoft VSS because VSS will unfreeze the filesystem after 10 seconds (which is not configurable). If a workload has 2 volumes, the 5 second polling interval will almost always result in a forced unfreeze before the post hook runs and likely before the last PVC's snapshot is done.

See the VSS doc here: https://learn.microsoft.com/en-us/windows/win32/vss/overview-of-processing-a-backup-under-vss
Note that the 10-second unfreeze is not configurable.
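The timing argument above can be checked with a small back-of-the-envelope sketch. It assumes snapshot handles are awaited one volume at a time and that each handle becomes ready just after a poll, so every volume can add up to one full interval of polling delay. The helper names `worstCasePollingDelay` and `fitsFreezeWindow` are hypothetical and are not part of Velero.

```go
package main

import (
	"fmt"
	"time"
)

// worstCasePollingDelay is a hypothetical helper: an upper bound on the delay
// added by polling alone when handles for `volumes` volumes are awaited one
// after another and each becomes ready just after a poll (assumption).
func worstCasePollingDelay(volumes int, interval time.Duration) time.Duration {
	return time.Duration(volumes) * interval
}

// fitsFreezeWindow reports whether that worst case stays strictly inside the
// freeze window (10 seconds for VSS, per the description above).
func fitsFreezeWindow(volumes int, interval, window time.Duration) bool {
	return worstCasePollingDelay(volumes, interval) < window
}

func main() {
	const vssWindow = 10 * time.Second
	for _, interval := range []time.Duration{5 * time.Second, time.Second} {
		fmt.Printf("interval=%v worst-case=%v fits=%v\n",
			interval,
			worstCasePollingDelay(2, interval),
			fitsFreezeWindow(2, interval, vssWindow))
	}
}
```

Under these assumptions, two volumes at a 5-second interval can consume the entire 10-second window on polling alone, while a 1-second interval leaves most of it for the snapshots themselves.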

This PR refactors the wait so that, when the csi-snapshot-early-frequent-polling flag is true, it polls every second for the first 10 seconds and then falls back to the previous 5-second interval until the snapshot timeout is reached. With the flag unset, the existing 5-second polling is unchanged.
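As a rough illustration of the resulting schedule (not the PR's actual code, which polls in real time), the sketch below only computes the offsets at which polls would fire. The function name `pollOffsets` and the 30-second timeout used in `main` are assumptions chosen for the example.

```go
package main

import (
	"fmt"
	"time"
)

// pollOffsets is a hypothetical helper that lists the offsets from the start
// of the wait at which a poll would fire: every fastInterval while inside
// fastWindow, then every slowInterval until timeout.
func pollOffsets(fastInterval, fastWindow, slowInterval, timeout time.Duration) []time.Duration {
	if fastInterval <= 0 || slowInterval <= 0 {
		return nil // avoid an endless loop on nonsensical intervals
	}
	var offsets []time.Duration
	for t := time.Duration(0); t <= timeout; {
		offsets = append(offsets, t)
		if t < fastWindow {
			t += fastInterval
		} else {
			t += slowInterval
		}
	}
	return offsets
}

func main() {
	// 1s polls for the first 10s, then 5s polls, with a shortened 30s timeout.
	offsets := pollOffsets(time.Second, 10*time.Second, 5*time.Second, 30*time.Second)
	fmt.Println(offsets)
}
```

The schedule is a simplification: it ignores how long each check takes and how the real implementation hands over between the two phases.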

related: https://redhat.atlassian.net/browse/OADP-8232

Summary by CodeRabbit

New Features
- Added installation flag --csi-snapshot-early-frequent-polling to enable accelerated polling for CSI snapshot handle readiness, improving performance of volume snapshot operations during backups.

Co-authored-by: aider (gemini/gemini-2.5-pro) <aider@aider.chat> Signed-off-by: Scott Seago <sseago@redhat.com>

coderabbitai · 2026-06-23T17:55:14Z

Walkthrough

Adds a --csi-snapshot-early-frequent-polling CLI install flag that propagates through VeleroOptions and podTemplateConfig to inject a CSI_SNAPSHOT_EARLY_FREQUENT_POLLING=true environment variable into the Velero deployment. WaitUntilVSCHandleIsReady reads this variable to run an optional early 1s-interval/10s-duration polling phase before falling back to the standard 5s-interval loop.

Changes

CSI Snapshot Early Frequent Polling

Layer / File(s)	Summary
Two-phase polling logic `pkg/util/csi/volume_snapshot.go`	Rewrites `WaitUntilVSCHandleIsReady` into a shared `pollFunc` plus an optional early phase (1s interval, 10s cap) controlled by `CSI_SNAPSHOT_EARLY_FREQUENT_POLLING`, falling back to the original 5s-interval loop up to `csiSnapshotTimeout`. Adds `os` and `strconv` imports.
Deployment env var injection and install resource wiring `pkg/install/deployment.go`, `pkg/install/resources.go`	Adds `csiSnapshotEarlyFrequentPolling` to `podTemplateConfig` and `VeleroOptions`, introduces `WithCSISnapshotEarlyFrequentPolling` option function, injects the `CSI_SNAPSHOT_EARLY_FREQUENT_POLLING=true` container env var when enabled, and conditionally appends the option in `AllResources`.
CLI install flag, propagation, and changelog `pkg/cmd/cli/install/install.go`, `changelogs/unreleased/9629-sseago`	Adds `CSISnapshotEarlyFrequentPolling bool` to install `Options`, wires `--csi-snapshot-early-frequent-polling` in `BindFlags`, defaults to `false`, propagates into `VeleroOptions` via `AsVeleroOptions`, and adds the changelog entry.
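For readers following the wiring, here is a minimal sketch of how the environment variable could be read on the server side. The walkthrough only says the variable is read and that `os` and `strconv` were imported, so the use of `strconv.ParseBool`, the helper names `parseEarlyPolling` and `earlyPollingEnabled`, and the choice to treat unset or unparseable values as disabled are all assumptions.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// earlyPollingEnv is the variable name given in the walkthrough above.
const earlyPollingEnv = "CSI_SNAPSHOT_EARLY_FREQUENT_POLLING"

// parseEarlyPolling is a hypothetical helper: it treats empty or unparseable
// values as "disabled" so a missing variable keeps the default behavior.
func parseEarlyPolling(value string) bool {
	enabled, err := strconv.ParseBool(value)
	return err == nil && enabled
}

// earlyPollingEnabled reads the variable from the process environment.
func earlyPollingEnabled() bool {
	return parseEarlyPolling(os.Getenv(earlyPollingEnv))
}

func main() {
	fmt.Println("early frequent polling enabled:", earlyPollingEnabled())
}
```

Keeping the parsing in a pure function makes the default-off behavior easy to unit test without touching the process environment.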

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 12 | ❌ 3

❌ Failed checks (3 warnings)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 42.86% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Test Structure And Quality	⚠️ Warning	New feature adds WithCSISnapshotEarlyFrequentPolling() function and CSI_SNAPSHOT_EARLY_FREQUENT_POLLING environment variable logic, but PR includes zero tests for this functionality across all thre...	Add Ginkgo/test cases to verify: (1) WithCSISnapshotEarlyFrequentPolling correctly sets environment variable in deployment, (2) CSISnapshotEarlyFrequentPolling field in VeleroOptions enables early polling, (3) CSI_SNAPSHOT_EARLY_FREQUENT...
Description check	⚠️ Warning	The pull request description provides a clear summary of the change addressing a VSS timing issue, but it omits required checklist items for DCO, changelog, and documentation updates.	Complete all checklist items: confirm DCO acceptance, verify changelog was created, and ensure documentation updates are included or mark changelog-not-required if not needed.

✅ Passed checks (12 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and concisely summarizes the main change: optimizing VSC handle readiness polling specifically for VSS backups, which is the core objective of the PR.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names	✅ Passed	PR does not add or modify Ginkgo test files. Check for stable test names in Ginkgo tests is not applicable to this PR.
Microshift Test Compatibility	✅ Passed	No new Ginkgo e2e tests are added in this PR. The changes are purely functional (CLI flags, deployment options, polling logic) with no test code additions.
Single Node Openshift (Sno) Test Compatibility	✅ Passed	No Ginkgo e2e tests are added in this PR. All modifications are backend implementation changes (install options, deployment config, and polling logic) with no test additions.
Topology-Aware Scheduling Compatibility	✅ Passed	PR introduces no topology-unfriendly scheduling constraints. Changes are limited to adding an environment variable for CSI snapshot polling optimization with no deployment manifests, affinity rules...
Ote Binary Stdout Contract	✅ Passed	PR modifies only library packages (install, csi). No process-level code (main/init), test files, or stdout writes (fmt.Print, klog) introduced. All logging uses proper logger parameters.
Ipv6 And Disconnected Network Test Compatibility	✅ Passed	No new Ginkgo e2e tests were added in this PR. The changes are library/utility code for CSI polling (pkg/install/ and pkg/util/csi/), CLI flags, and a changelog entry. The check does not apply.
No-Weak-Crypto	✅ Passed	No weak cryptographic algorithms (MD5, SHA1, DES, RC4, 3DES, Blowfish, ECB), custom crypto implementations, or non-constant-time secret comparisons found in any changed files.
Container-Privileges	✅ Passed	The PR adds a CSI snapshot polling optimization feature via environment variable (CSI_SNAPSHOT_EARLY_FREQUENT_POLLING) without introducing any container privilege settings like privileged mode, hos...
No-Sensitive-Data-In-Logs	✅ Passed	PR contains no sensitive data exposure in logs. Logged data includes only Kubernetes resource identifiers (namespace/name), polling intervals, and CSI driver error messages—none matching sensitive...

openshift-ci · 2026-06-23T18:05:25Z

@sseago: all tests passed!

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/util/csi/volume_snapshot.go`:
- Around line 675-679: The second polling phase starting around line 691-697 is
using the full csiSnapshotTimeout duration instead of accounting for time
already spent in the first polling phase that begins at line 675. To fix this,
instead of passing a fresh context.Background() and the full timeout duration to
the second wait.PollUntilContextTimeout call, calculate the remaining time
budget by subtracting the elapsed time from the first polling phase from the
original csiSnapshotTimeout, then use that remaining duration for the fallback
polling loop. Alternatively, create a deadline-based context at the start of the
function and reuse it for both polling phases to ensure the total wait time
respects the overall timeout budget.
- Around line 655-657: The code at line 657 dereferences
vsc.Status.Error.Message without first checking if the Message pointer is nil,
which will cause a panic. This same issue also exists in the similar error
logging block around lines 708-711. Add a nil check for the Message field before
dereferencing it, following the same defensive pattern used in the diagnostic
functions around lines 737 and 764 where both the Error and Message fields are
properly checked before access.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 3b145ba4-7937-4d20-9efa-9f01b0a9c198

📥 Commits

Reviewing files that changed from the base of the PR and between 82b0e3e and 99738e0.

📒 Files selected for processing (5)

changelogs/unreleased/9629-sseago
pkg/cmd/cli/install/install.go
pkg/install/deployment.go
pkg/install/resources.go
pkg/util/csi/volume_snapshot.go

coderabbitai · 2026-06-23T18:06:11Z

+				vsc.Status.Error != nil {
+				log.Warnf("VolumeSnapshotContent %s has error: %v",
+					vsc.Name, *vsc.Status.Error.Message)


🩺 Stability & Availability | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, find and read the relevant file around the flagged lines
cat -n pkg/util/csi/volume_snapshot.go | sed -n '650,665p'

Repository: openshift/velero

Length of output: 664

🏁 Script executed:

# Also check the second location mentioned
cat -n pkg/util/csi/volume_snapshot.go | sed -n '700,715p'

Repository: openshift/velero

Length of output: 782

🏁 Script executed:

# Look at the type definition to understand if Message is optional
rg "VolumeSnapshotError" pkg/util/csi/volume_snapshot.go -A 5 -B 2

Repository: openshift/velero

Length of output: 154

🏁 Script executed:

# Search for the import of VolumeSnapshotError to find the API package
rg "import" pkg/util/csi/volume_snapshot.go | head -20

Repository: openshift/velero

Length of output: 163

🏁 Script executed:

# Search for type definitions and uses to understand the Message field
ast-grep outline pkg/util/csi/volume_snapshot.go

Repository: openshift/velero

Length of output: 1735

🏁 Script executed:

# Look for VolumeSnapshotError struct definition or where it's used with Message
rg "Message" pkg/util/csi/volume_snapshot.go

Repository: openshift/velero

Length of output: 1024

🏁 Script executed:

# Get the full import section
head -50 pkg/util/csi/volume_snapshot.go

Repository: openshift/velero

Length of output: 1807

🏁 Script executed:

# Get more context around the problematic lines to see the full conditions
sed -n '597,720p' pkg/util/csi/volume_snapshot.go | cat -n

Repository: openshift/velero

Length of output: 4518

🏁 Script executed:

# Look at how Message is safely used elsewhere in the codebase
sed -n '724,790p' pkg/util/csi/volume_snapshot.go | cat -n

Repository: openshift/velero

Length of output: 2485

Guard VolumeSnapshotError.Message before dereferencing.

Lines 657 and 708-711 dereference *vsc.Status.Error.Message without checking if Message is nil. The Message field is a pointer and can be nil even when Error is not, as shown in the diagnostic functions at lines 737 and 764 which correctly check both conditions. This will panic on error/timeout paths.

Suggested fix

+	getVSCErrMsg := func(vscErr *snapshotv1api.VolumeSnapshotError) string {
+		if vscErr == nil || vscErr.Message == nil || *vscErr.Message == "" {
+			return "<empty error message>"
+		}
+		return *vscErr.Message
+	}
+
 	if vsc.Status != nil && vsc.Status.Error != nil {
 		log.Warnf("VolumeSnapshotContent %s has error: %v",
-			vsc.Name, *vsc.Status.Error.Message)
+			vsc.Name, getVSCErrMsg(vsc.Status.Error))
 	}
@@
 	log.Errorf(
 		"Timed out awaiting reconciliation of VolumeSnapshot, VolumeSnapshotContent %s has error: %v",
-		vsc.Name, *vsc.Status.Error.Message)
+		vsc.Name, getVSCErrMsg(vsc.Status.Error))
 	return nil, errors.Errorf("CSI got timed out with error: %v",
-		*vsc.Status.Error.Message)
+		getVSCErrMsg(vsc.Status.Error))

Also applies to: 708-711
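To make the failure mode concrete outside the Velero tree, here is a self-contained sketch of the same guard. The `snapshotError` type is a local stand-in for the snapshot API's error type (which, per the finding above, carries a pointer-valued `Message`), and `errMessage` is a hypothetical name.

```go
package main

import "fmt"

// snapshotError is a local stand-in for the external snapshot error type;
// only the optional, pointer-valued Message field matters here.
type snapshotError struct {
	Message *string
}

// errMessage returns a printable message without ever dereferencing a nil
// pointer, falling back to a placeholder when nothing usable is set.
func errMessage(e *snapshotError) string {
	if e == nil || e.Message == nil || *e.Message == "" {
		return "<empty error message>"
	}
	return *e.Message
}

func main() {
	msg := "driver reported a failure"
	fmt.Println(errMessage(nil))                           // no error object at all
	fmt.Println(errMessage(&snapshotError{}))              // error object with nil Message
	fmt.Println(errMessage(&snapshotError{Message: &msg})) // populated Message
}
```

Writing `*e.Message` directly in the second case would panic, which is the situation the review comment describes on the error and timeout paths.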

coderabbitai · 2026-06-23T18:06:11Z

+		err = wait.PollUntilContextTimeout(
+			context.Background(),
+			interval,
+			timeout,
+			true,


🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Second polling phase resets the timeout budget instead of using the remainder.

Line 693 starts the fallback loop with full csiSnapshotTimeout even after spending up to 10s in the early phase, so total wait can exceed the configured timeout.

💡 Suggested fix

+	remainingTimeout := csiSnapshotTimeout
+
 	if err == nil && frequentPolling {
@@
-		timeout := 10 * time.Second
+		timeout := 10 * time.Second
+		if timeout > remainingTimeout {
+			timeout = remainingTimeout
+		}
 		err = wait.PollUntilContextTimeout(
@@
 		if !wait.Interrupted(err) {
 			return nil, err
 		}
+		remainingTimeout -= timeout
+		if remainingTimeout <= 0 {
+			return nil, err
+		}
 	}
@@
 	err = wait.PollUntilContextTimeout(
 		context.Background(),
 		interval,
-		csiSnapshotTimeout,
+		remainingTimeout,
 		true,
 		pollFunc,
 	)

Also applies to: 691-697
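An alternative to tracking the remaining budget by hand is to put one deadline on a parent context and run both phases under it. The sketch below uses only the standard library rather than `wait.PollUntilContextTimeout`; `pollUntil`, `waitWithSharedDeadline`, and the durations in `main` are hypothetical and chosen only to keep the example fast.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// condition reports whether the awaited state has been reached.
type condition func() (bool, error)

// pollUntil checks cond immediately, then once per interval, until cond
// reports done, cond fails, or ctx expires.
func pollUntil(ctx context.Context, interval time.Duration, cond condition) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		done, err := cond()
		if err != nil {
			return err
		}
		if done {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}

// waitWithSharedDeadline runs a fast phase and a slow phase under one parent
// deadline, so time spent in the fast phase is not granted again afterwards.
func waitWithSharedDeadline(timeout, fastInterval, fastWindow, slowInterval time.Duration, cond condition) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	// The fast phase gets its own shorter deadline derived from the parent.
	fastCtx, cancelFast := context.WithTimeout(ctx, fastWindow)
	err := pollUntil(fastCtx, fastInterval, cond)
	cancelFast()
	if !errors.Is(err, context.DeadlineExceeded) {
		return err // nil when ready, or a real error from cond
	}

	// The slow phase reuses the parent context and its original deadline.
	return pollUntil(ctx, slowInterval, cond)
}

func main() {
	start := time.Now()
	err := waitWithSharedDeadline(200*time.Millisecond, 10*time.Millisecond,
		50*time.Millisecond, 25*time.Millisecond,
		func() (bool, error) { return time.Since(start) > 80*time.Millisecond, nil })
	fmt.Println("err:", err, "elapsed:", time.Since(start).Round(10*time.Millisecond))
}
```

Because the second phase inherits the parent deadline, the total wait stays bounded by the configured timeout regardless of how long the early phase ran.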

openshift-ci · 2026-06-23T18:12:05Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: shubham-pampattiwar, sseago

Needs approval from an approver in each of these files:

~~DOWNSTREAM_OWNERS~~ [shubham-pampattiwar,sseago]

sseago and others added 2 commits June 23, 2026 13:53

refactor: Optimize VSC handle readiness polling for VSS backups

7d2aa0d

Co-authored-by: aider (gemini/gemini-2.5-pro) <aider@aider.chat> Signed-off-by: Scott Seago <sseago@redhat.com>

feat: Implement early frequent polling for CSI snapshots

99738e0

Co-authored-by: aider (gemini/gemini-2.5-pro) <aider@aider.chat> Signed-off-by: Scott Seago <sseago@redhat.com>

openshift-ci Bot requested review from savitharaghunathan and shubham-pampattiwar June 23, 2026 17:55

openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 23, 2026

coderabbitai Bot reviewed Jun 23, 2026

shubham-pampattiwar approved these changes Jun 23, 2026

sseago wants to merge 2 commits into openshift:oadp-dev from sseago:vss-oadp-dev

sseago commented Jun 23, 2026 • edited by weshayutin
