virt: extend post-migration SSH wait and log connectivity duration#5446
virt: extend post-migration SSH wait and log connectivity duration#5446geetikakay wants to merge 3 commits into
Conversation
Add ssh_timeout to migrate_vm_and_verify and verify_vm_migrated defaulting to TIMEOUT_3MIN to match upgrade tests. Signed-off-by: Geetika Kapoor <gkapoor@redhat.com>
Log elapsed time and attempt count when wait_for_ssh_connectivity succeeds to collect CI data for post-migration SSH flake triage and timeout tuning. Signed-off-by: Geetika Kapoor <gkapoor@redhat.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (1)
📜 Recent review details🧰 Additional context used📓 Path-based instructions (3)**/*.py📄 CodeRabbit inference engine (AGENTS.md)
Files:
utilities/virt.py📄 CodeRabbit inference engine (AGENTS.md)
Files:
**⚙️ CodeRabbit configuration file
Files:
⚙️ CodeRabbit configuration file
Files:
🧠 Learnings (30)📚 Learning: 2026-01-12T11:24:13.825ZApplied to files:
📚 Learning: 2026-01-12T14:25:05.723ZApplied to files:
📚 Learning: 2026-01-20T01:03:13.139ZApplied to files:
📚 Learning: 2026-01-21T21:26:41.805ZApplied to files:
📚 Learning: 2026-01-25T13:18:21.675ZApplied to files:
📚 Learning: 2026-02-18T06:35:39.536ZApplied to files:
📚 Learning: 2026-02-23T16:33:22.070ZApplied to files:
📚 Learning: 2026-03-17T01:32:02.617ZApplied to files:
📚 Learning: 2026-03-17T01:32:02.617ZApplied to files:
📚 Learning: 2026-05-04T13:45:29.122ZApplied to files:
📚 Learning: 2026-05-04T13:45:33.892ZApplied to files:
📚 Learning: 2026-05-05T17:01:15.294ZApplied to files:
📚 Learning: 2026-05-08T12:49:20.694ZApplied to files:
📚 Learning: 2026-05-12T05:10:24.601ZApplied to files:
📚 Learning: 2026-05-13T19:23:09.603ZApplied to files:
📚 Learning: 2026-05-18T06:30:56.781ZApplied to files:
📚 Learning: 2026-05-18T06:31:12.015ZApplied to files:
📚 Learning: 2026-05-18T06:31:15.083ZApplied to files:
📚 Learning: 2026-05-18T06:31:20.848ZApplied to files:
📚 Learning: 2026-05-18T09:09:09.479ZApplied to files:
📚 Learning: 2026-05-19T07:48:17.119ZApplied to files:
📚 Learning: 2026-05-19T07:48:17.119ZApplied to files:
📚 Learning: 2026-05-26T15:52:31.613ZApplied to files:
📚 Learning: 2026-06-21T20:28:07.727ZApplied to files:
📚 Learning: 2026-02-16T10:34:49.439ZApplied to files:
📚 Learning: 2026-03-09T18:30:09.993ZApplied to files:
📚 Learning: 2026-02-18T06:34:38.042ZApplied to files:
📚 Learning: 2026-05-03T15:38:09.624ZApplied to files:
📚 Learning: 2026-05-15T18:42:02.504ZApplied to files:
📚 Learning: 2026-06-15T10:56:21.758ZApplied to files:
🪛 Ruff (0.15.20)utilities/virt.py[warning] 1683-1683: Logging statement uses f-string (G004) 🔇 Additional comments (4)
📝 WalkthroughWalkthroughAdds a ChangesSSH connectivity timeout tracking and propagation
Estimated code review effort: 1 (Trivial) | ~5 minutes Related Issues: None specified. Related PRs: None specified. Suggested labels: enhancement, utilities Suggested reviewers: None specified. Listen up, reviewer — CRITICAL callout time: this diff doesn't break anything, so relax! 🐇 It's a small, well-contained timeout plumbing job — HIGH priority for readability since default parameters like 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Warning Review ran into problems🔥 ProblemsLinked repositories: Your configuration references 1 linked repositories, but your current plan allows 0. Analyzed ``, skipped Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
Report bugs in Issues Welcome! 🎉This pull request will be automatically processed with the following features: 🔄 Automatic Actions
📋 Available CommandsPR Status Management
Review & Approval
Testing & Validation
Container Operations
Cherry-pick Operations
Branch Management
Custom Commands
Label Management
✅ Merge RequirementsThis PR will be automatically approved when the following conditions are met:
📊 Review ProcessApprovers and ReviewersApprovers:
Reviewers:
Available Labels
AI Features
Security Checks
💡 Tips
For more information, please refer to the project documentation or contact the maintainers. |
for more information, see https://pre-commit.ci
|
/verified |
Test execution plan request detailsCRITICAL: You MUST post an inline review comment on the first changed line of the first file. As an expert software testing engineer, analyze all modified files in this PR and create a targeted test execution plan. Analysis Requirements:
Output rules: Your deliverable: Test Execution Plan
Use these formats:
Real test commands (MANDATORY when changes affect session/runtime code): When the affected code runs at session/collection time (conftest fixtures, pytest plugins,
Example output for a session-startup fix: Guidelines:
Hardware-Related Checks (SR-IOV, GPU, DPDK): When PR modifies fixtures for hardware-specific resources:
CRITICAL WORKFLOW COMPLETION RULES: When responding to this test execution plan request, you MUST follow these rules EXACTLY:
ABSOLUTE PROHIBITIONS (violating these creates empty/meaningless reviews):
Remember: The pull request review is visible to users. Additional comments are redundant noise. CRITICAL — Verification rules for smoke and gating decisions: |
|
Oops, something went wrong! Please try again later. 🐰 💔 |
|
/test-plan |
Test execution plan request detailsCRITICAL: You MUST post an inline review comment on the first changed line of the first file. As an expert software testing engineer, analyze all modified files in this PR and create a targeted test execution plan. Analysis Requirements:
Output rules: Your deliverable: Test Execution Plan
Use these formats:
Real test commands (MANDATORY when changes affect session/runtime code): When the affected code runs at session/collection time (conftest fixtures, pytest plugins,
Example output for a session-startup fix: Guidelines:
Hardware-Related Checks (SR-IOV, GPU, DPDK): When PR modifies fixtures for hardware-specific resources:
CRITICAL WORKFLOW COMPLETION RULES: When responding to this test execution plan request, you MUST follow these rules EXACTLY:
ABSOLUTE PROHIBITIONS (violating these creates empty/meaningless reviews):
Remember: The pull request review is visible to users. Additional comments are redundant noise. CRITICAL — Verification rules for smoke and gating decisions: |
|
Tip For best results, initiate chat on the files or code changes. 🧩 Analysis chain🏁 Script executed: #!/bin/bash
set -euo pipefail
# Confirm the diff for utilities/virt.py
git diff HEAD~3 --unified=0 -- utilities/virt.py | head -100
echo "---- direct callers of migrate_vm_and_verify ----"
rg -n 'migrate_vm_and_verify' --type=py -g '!utilities/virt.py' | head -80
echo "---- direct callers of verify_vm_migrated ----"
rg -n 'verify_vm_migrated' --type=py -g '!utilities/virt.py' | head -50
echo "---- direct callers of wait_for_ssh_connectivity ----"
rg -n 'wait_for_ssh_connectivity' --type=py -g '!utilities/virt.py' | head -80
echo "---- smoke marked files ----"
rg -l '`@pytest.mark.smoke`' tests/ | head -50
echo "---- gating marked files ----"
rg -l '`@pytest.mark.gating`' tests/ | head -50Length of output: 2260 🏁 Script executed: #!/bin/bash
set +e
echo "---- direct callers of migrate_vm_and_verify (tests/) ----"
rg -n 'migrate_vm_and_verify' --type=py tests/
echo "---- direct callers of migrate_vm_and_verify (whole repo, excluding utilities/virt.py) ----"
rg -n 'migrate_vm_and_verify' --type=py -g '!utilities/virt.py'
echo "---- direct callers of verify_vm_migrated ----"
rg -n 'verify_vm_migrated' --type=py -g '!utilities/virt.py'
echo "---- direct callers of wait_for_ssh_connectivity ----"
rg -n 'wait_for_ssh_connectivity' --type=py -g '!utilities/virt.py'
echo "---- smoke marked files ----"
rg -l '`@pytest.mark.smoke`' tests/
echo "---- gating marked files ----"
rg -l '`@pytest.mark.gating`' tests/
echo "---- done ----"Length of output: 21762 shell#!/bin/bash echo "---- smoke marked files ----" echo "---- gating marked files ----" echo "---- wait_for_ssh_connectivity direct callers (excluding utilities/virt.py) ----" echo "---- verify_vm_migrated direct callers (excluding utilities/virt.py) ----" echo "done" |
| LOGGER.info( | ||
| f"SSH connectivity to {vm.name} established after {timeout - timeout_watch.remaining_time():.1f}s." | ||
| ) |
There was a problem hiding this comment.
this log is not needed. we have a log when the sampler starts and a log when it is completed (Elapsed time)
There was a problem hiding this comment.
there are so many elapsed logs and they just don't deliver which vm is it and if it's for ssh
timeout_sampler �[32mINFO�[0m Elapsed time: 0.00011849403381347656 [0:00:00.000118]�[0m
Just trying to make it human friendly
There was a problem hiding this comment.
Another motivation to add clear logging along with VM name is agents can look for right message and not all elapsed lines, less tokens needed.
| timeout: int = TIMEOUT_12MIN, | ||
| wait_for_interfaces: bool = True, | ||
| check_ssh_connectivity: bool = False, | ||
| ssh_timeout: int = TIMEOUT_3MIN, |
There was a problem hiding this comment.
isn't this a bug? waiting 3 minutes for ssh connectivity after migration?
@vsibirsk @servolkov would like to hear your thoughts here
There was a problem hiding this comment.
@rnetser saw it 1-2 times on rdu clusters. not everytime and never on tlv ones
There was a problem hiding this comment.
as for me that is a signal worth investigating why SSH takes that long on RDU clusters. 3 minutes is very generous, it may hide real issues.
There was a problem hiding this comment.
Tests pass on TLV but flake on RDU with SSH timeouts after migration pointing to RDU-specific slowness rather than a broken code path. Raising the ceiling to 3 minutes avoids false failures while we collect data(polling period is same so will not impact +ve scenarios). The success log is human-readable and includes the VM name (SSH connectivity to established after Xs) so we can grep per VM and compare RDU vs TLV durations. Once we have enough samples, we either tighten the timeout or file an env follow-up for consistently high durations , the goal is to fine-tune this timeout with real data. I added same intention in my PR description also
What this PR does / why we need it:
Extend post-migration SSH wait from 2 to 3 minutes for intermittent ssh flakes and log SSH connect duration on success for CI triage.
with this fix, we don't change default timeout of wait_for_ssh_connectivity but making it tunable for post migration cases.callers of migrate_vm_and_verify can make a choice now.
Which issue(s) this PR fixes:
SSH comes up quickly on some clusters and slowly on others, so we need real data to set a safe timeout. Logging the success duration gives us that data and we can calibrate ssh_timeout from actual CI runs instead of guessing, understand why results vary across clusters under similar load and reduce flakiness.
Now we have this data : SSH connectivity to rhel10-hotplug-instance-type-vm-1782904035-2483807 established after 20.6s.
Special notes for reviewer:
jira-ticket:
failure in CI lanes seen rarely
Summary by CodeRabbit