[upstream CP] Optimize VSC handle readiness polling for VSS backups#517
[upstream CP] Optimize VSC handle readiness polling for VSS backups#517sseago wants to merge 2 commits into
Conversation
Co-authored-by: aider (gemini/gemini-2.5-pro) <aider@aider.chat> Signed-off-by: Scott Seago <sseago@redhat.com>
Co-authored-by: aider (gemini/gemini-2.5-pro) <aider@aider.chat> Signed-off-by: Scott Seago <sseago@redhat.com>
WalkthroughAdds a ChangesCSI Snapshot Early Frequent Polling
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 12 | ❌ 3❌ Failed checks (3 warnings)
✅ Passed checks (12 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
@sseago: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/util/csi/volume_snapshot.go`:
- Around line 675-679: The second polling phase starting around line 691-697 is
using the full csiSnapshotTimeout duration instead of accounting for time
already spent in the first polling phase that begins at line 675. To fix this,
instead of passing a fresh context.Background() and the full timeout duration to
the second wait.PollUntilContextTimeout call, calculate the remaining time
budget by subtracting the elapsed time from the first polling phase from the
original csiSnapshotTimeout, then use that remaining duration for the fallback
polling loop. Alternatively, create a deadline-based context at the start of the
function and reuse it for both polling phases to ensure the total wait time
respects the overall timeout budget.
- Around line 655-657: The code at line 657 dereferences
vsc.Status.Error.Message without first checking if the Message pointer is nil,
which will cause a panic. This same issue also exists in the similar error
logging block around lines 708-711. Add a nil check for the Message field before
dereferencing it, following the same defensive pattern used in the diagnostic
functions around lines 737 and 764 where both the Error and Message fields are
properly checked before access.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 3b145ba4-7937-4d20-9efa-9f01b0a9c198
📒 Files selected for processing (5)
changelogs/unreleased/9629-sseagopkg/cmd/cli/install/install.gopkg/install/deployment.gopkg/install/resources.gopkg/util/csi/volume_snapshot.go
| vsc.Status.Error != nil { | ||
| log.Warnf("VolumeSnapshotContent %s has error: %v", | ||
| vsc.Name, *vsc.Status.Error.Message) |
There was a problem hiding this comment.
🩺 Stability & Availability | 🔴 Critical
🧩 Analysis chain
🏁 Script executed:
# First, find and read the relevant file around the flagged lines
cat -n pkg/util/csi/volume_snapshot.go | sed -n '650,665p'Repository: openshift/velero
Length of output: 664
🏁 Script executed:
# Also check the second location mentioned
cat -n pkg/util/csi/volume_snapshot.go | sed -n '700,715p'Repository: openshift/velero
Length of output: 782
🏁 Script executed:
# Look at the type definition to understand if Message is optional
rg "VolumeSnapshotError" pkg/util/csi/volume_snapshot.go -A 5 -B 2Repository: openshift/velero
Length of output: 154
🏁 Script executed:
# Search for the import of VolumeSnapshotError to find the API package
rg "import" pkg/util/csi/volume_snapshot.go | head -20Repository: openshift/velero
Length of output: 163
🏁 Script executed:
# Search for type definitions and uses to understand the Message field
ast-grep outline pkg/util/csi/volume_snapshot.goRepository: openshift/velero
Length of output: 1735
🏁 Script executed:
# Look for VolumeSnapshotError struct definition or where it's used with Message
rg "Message" pkg/util/csi/volume_snapshot.goRepository: openshift/velero
Length of output: 1024
🏁 Script executed:
# Get the full import section
head -50 pkg/util/csi/volume_snapshot.goRepository: openshift/velero
Length of output: 1807
🏁 Script executed:
# Get more context around the problematic lines to see the full conditions
sed -n '597,720p' pkg/util/csi/volume_snapshot.go | cat -nRepository: openshift/velero
Length of output: 4518
🏁 Script executed:
# Look at how Message is safely used elsewhere in the codebase
sed -n '724,790p' pkg/util/csi/volume_snapshot.go | cat -nRepository: openshift/velero
Length of output: 2485
Guard VolumeSnapshotError.Message before dereferencing.
Lines 657 and 708-711 dereference *vsc.Status.Error.Message without checking if Message is nil. The Message field is a pointer and can be nil even when Error is not, as shown in the diagnostic functions at lines 737 and 764 which correctly check both conditions. This will panic on error/timeout paths.
Suggested fix
+ getVSCErrMsg := func(vscErr *snapshotv1api.VolumeSnapshotError) string {
+ if vscErr == nil || vscErr.Message == nil || *vscErr.Message == "" {
+ return "<empty error message>"
+ }
+ return *vscErr.Message
+ }
+
if vsc.Status != nil &&
vsc.Status.Error != nil {
log.Warnf("VolumeSnapshotContent %s has error: %v",
- vsc.Name, *vsc.Status.Error.Message)
+ vsc.Name, getVSCErrMsg(vsc.Status.Error))
}
@@
log.Errorf(
"Timed out awaiting reconciliation of VolumeSnapshot, VolumeSnapshotContent %s has error: %v",
- vsc.Name, *vsc.Status.Error.Message)
+ vsc.Name, getVSCErrMsg(vsc.Status.Error))
return nil,
errors.Errorf("CSI got timed out with error: %v",
- *vsc.Status.Error.Message)
+ getVSCErrMsg(vsc.Status.Error))Also applies to: 708-711
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pkg/util/csi/volume_snapshot.go` around lines 655 - 657, The code at line 657
dereferences vsc.Status.Error.Message without first checking if the Message
pointer is nil, which will cause a panic. This same issue also exists in the
similar error logging block around lines 708-711. Add a nil check for the
Message field before dereferencing it, following the same defensive pattern used
in the diagnostic functions around lines 737 and 764 where both the Error and
Message fields are properly checked before access.
| err = wait.PollUntilContextTimeout( | ||
| context.Background(), | ||
| interval, | ||
| timeout, | ||
| true, |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
Second polling phase resets the timeout budget instead of using the remainder.
Line 693 starts the fallback loop with full csiSnapshotTimeout even after spending up to 10s in the early phase, so total wait can exceed the configured timeout.
💡 Suggested fix
+ remainingTimeout := csiSnapshotTimeout
+
if err == nil && frequentPolling {
@@
- timeout := 10 * time.Second
+ timeout := 10 * time.Second
+ if timeout > remainingTimeout {
+ timeout = remainingTimeout
+ }
err = wait.PollUntilContextTimeout(
@@
if !wait.Interrupted(err) {
return nil, err
}
+ remainingTimeout -= timeout
+ if remainingTimeout <= 0 {
+ return nil, err
+ }
}
@@
err = wait.PollUntilContextTimeout(
context.Background(),
interval,
- csiSnapshotTimeout,
+ remainingTimeout,
true,
pollFunc,
)Also applies to: 691-697
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pkg/util/csi/volume_snapshot.go` around lines 675 - 679, The second polling
phase starting around line 691-697 is using the full csiSnapshotTimeout duration
instead of accounting for time already spent in the first polling phase that
begins at line 675. To fix this, instead of passing a fresh context.Background()
and the full timeout duration to the second wait.PollUntilContextTimeout call,
calculate the remaining time budget by subtracting the elapsed time from the
first polling phase from the original csiSnapshotTimeout, then use that
remaining duration for the fallback polling loop. Alternatively, create a
deadline-based context at the start of the function and reuse it for both
polling phases to ensure the total wait time respects the overall timeout
budget.
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: shubham-pampattiwar, sseago The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Thank you for contributing to Velero!
Please add a summary of your change
When waiting for the CSI Snapshot to complete, the CSI plugin checks for the SnapHandle every 5 seconds up until csiSnapshotTimeout (default 10min) is reached. This is a problem for workloads that use Microsoft VSS because VSS will unfreeze the filesystem after 10 seconds (which is not configurable). If a workload has 2 volumes, the 5 second polling interval will almost always result in a forced unfreeze before the post hook runs and likely before the last PVC's snapshot is done.
See the VSS doc here: https://learn.microsoft.com/en-us/windows/win32/vss/overview-of-processing-a-backup-under-vss
Note that that the 10-second unfreeze is not configurable.
This PR refactors this to poll every second for the first 10 seconds, followed by the previous "every 5 seconds" until the snapshot timeout is reached if the csi-snapshot-early-frequent-polling flag is true
related: https://redhat.atlassian.net/browse/OADP-8232
Summary by CodeRabbit
--csi-snapshot-early-frequent-pollingto enable accelerated polling for CSI snapshot handle readiness, improving performance of volume snapshot operations during backups.