-
Notifications
You must be signed in to change notification settings - Fork 361
add time.sleep before creating ssh master files #6238
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Seeing some network delays and ssh failing because of it. Adding sleep(10) to induce some wait period. Signed-off-by: Vaishnavi Bhat <vaishnavi@linux.vnet.ibm.com>
WalkthroughThis pull request adds a Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Areas for extra attention:
Pre-merge checks❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 Pylint (4.0.3)avocado/utils/ssh.pyTip 📝 Customizable high-level summaries are now available in beta!You can now customize how CodeRabbit generates the high-level summary in your pull requests — including its content, structure, tone, and formatting.
Example instruction:
Note: This feature is currently in beta for Pro-tier users, and pricing will be announced later. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 4
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
avocado/utils/ssh.py(4 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (42)
- GitHub Check: rpm-build:fedora-41-ppc64le
- GitHub Check: rpm-build:fedora-43-x86_64
- GitHub Check: rpm-build:fedora-rawhide-x86_64
- GitHub Check: rpm-build:epel-9-x86_64
- GitHub Check: rpm-build:fedora-41-aarch64
- GitHub Check: rpm-build:fedora-42-x86_64
- GitHub Check: rpm-build:centos-stream-9-x86_64
- GitHub Check: rpm-build:fedora-41-x86_64
- GitHub Check: rpm-build:fedora-41-s390x
- GitHub Check: rpm-build:fedora-42-x86_64
- GitHub Check: rpm-build:fedora-41-s390x
- GitHub Check: rpm-build:centos-stream-9-x86_64
- GitHub Check: rpm-build:fedora-41-aarch64
- GitHub Check: rpm-build:fedora-43-x86_64
- GitHub Check: rpm-build:epel-9-x86_64
- GitHub Check: rpm-build:fedora-41-ppc64le
- GitHub Check: rpm-build:fedora-rawhide-x86_64
- GitHub Check: rpm-build:fedora-41-x86_64
- GitHub Check: Linux with Python 3.12
- GitHub Check: Linux with Python 3.13
- GitHub Check: Linux with Python 3.11
- GitHub Check: Linux with Python 3.9
- GitHub Check: Linux with Python 3.10
- GitHub Check: rpm-build:fedora-41-aarch64
- GitHub Check: rpm-build:centos-stream-9-x86_64
- GitHub Check: rpm-build:fedora-42-x86_64
- GitHub Check: rpm-build:epel-9-x86_64
- GitHub Check: rpm-build:fedora-41-s390x
- GitHub Check: rpm-build:fedora-41-x86_64
- GitHub Check: rpm-build:fedora-43-x86_64
- GitHub Check: rpm-build:fedora-rawhide-x86_64
- GitHub Check: rpm-build:fedora-41-ppc64le
- GitHub Check: Fedora selftests
- GitHub Check: Version task fedora:41
- GitHub Check: Podman spawner with 3rd party runner plugin
- GitHub Check: Egg task fedora:40
- GitHub Check: Version task ubi:8.8
- GitHub Check: Version task debian:12.4
- GitHub Check: Version task ubuntu:22.04
- GitHub Check: Static checks
- GitHub Check: macOS with Python 3.11
- GitHub Check: Code Coverage (3.11)
| import subprocess | ||
| import sys | ||
| import tempfile | ||
| import time |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Architectural concern: Fixed delays are not the right solution for network reliability.
The PR description mentions addressing "network delays that cause SSH failures," but adding fixed sleeps is a workaround that:
- Doesn't address the root cause
- Impacts all users with 30+ seconds of total delay per session
- Provides no diagnostic information about failures
A robust solution should:
- Diagnose the root cause: Are these connection timeouts? DNS resolution delays? Master connection race conditions?
- Use SSH timeout options: Add
ConnectTimeout,ServerAliveInterval,ServerAliveCountMaxtoDEFAULT_OPTIONS - Implement proper retry logic: Retry failed operations with exponential backoff
- Make timeouts configurable: Allow users to adjust based on their environment
- Add logging: Help diagnose when and why SSH operations fail
Would you like me to open an issue with a detailed implementation plan for proper SSH reliability improvements?
Also applies to: 104-104, 130-130, 193-193
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Create a GitHub issue for SSH reliability improvements
gh issue create \
--repo avocado-framework/avocado \
--title "Improve SSH reliability with proper timeout configuration and retry logic" \
--assignee clebergnu \
--body "## Context
This issue is a follow-up to PR #6238 and the discussion at:
- PR: https://github.com/avocado-framework/avocado/pull/6238
- Comment: https://github.com/avocado-framework/avocado/pull/6238#discussion_r2540644060
- Requested by: @clebergnu
## Problem
PR #6238 proposes adding fixed 10-second sleeps in three locations within \`avocado/utils/ssh.py\` to address \"network delays that cause SSH failures.\" While this may work around immediate symptoms, it has several drawbacks:
1. **Root cause unknown**: The delays don't address why SSH connections are failing
2. **Performance impact**: All users pay a 30+ second penalty per SSH session, even with fast networks
3. **No diagnostics**: No logging or error information to help debug failures
4. **Arbitrary timeout**: 10 seconds may be too short for some networks, too long for others
5. **Not maintainable**: Hardcoded sleeps are difficult to adjust and test
## Proposed Solution
Implement a comprehensive SSH reliability improvement that addresses the root causes:
### 1. Add SSH Timeout Configuration Options
Add timeout-related SSH options to \`DEFAULT_OPTIONS\`:
- \`ConnectTimeout\`: Maximum time to establish connection (e.g., 30 seconds)
- \`ServerAliveInterval\`: Send keepalive messages every N seconds (e.g., 15)
- \`ServerAliveCountMax\`: Number of keepalive messages before disconnect (e.g., 3)
### 2. Implement Retry Logic with Exponential Backoff
For operations prone to transient failures:
- Retry failed master connection establishment
- Use exponential backoff (e.g., 1s, 2s, 4s, 8s)
- Make retry count configurable
- Add maximum total timeout
### 3. Add Comprehensive Logging
Add logging to help diagnose issues:
- Log connection attempts and results
- Log retry attempts with reasons
- Log SSH command execution details
- Use Python's \`logging\` module with appropriate levels
### 4. Make Timeouts Configurable
Allow users to customize timeouts based on their environment:
- Add optional parameters to \`Session.__init__\`
- Support environment variables for defaults
- Document recommended values for different scenarios
### 5. Improve Error Messages
Enhance error reporting to help users diagnose issues:
- Distinguish between connection timeout vs. authentication failure
- Include timing information in error messages
- Suggest configuration changes based on error type
## Implementation Steps
1. Investigate root cause of SSH failures mentioned in PR #6238
2. Add timeout options to SSH configuration
3. Implement retry logic for master connection
4. Add logging framework
5. Make timeouts configurable via constructor parameters
6. Add tests for retry logic and timeout handling
7. Update documentation with troubleshooting guide
## Benefits
- **Better reliability**: Proper timeouts and retries handle transient failures
- **Better performance**: No fixed delays for fast networks
- **Better diagnostics**: Logging helps identify and fix issues
- **Better flexibility**: Users can tune for their environment
- **Better maintainability**: Clear, testable code vs. magic sleeps
## Additional Context
The current implementation adds sleeps in:
- \`_master_connection()\`: Before establishing master connection
- \`_master_command()\`: Before executing master commands
- \`control_master\` property: When accessing control master path
These locations suggest the issue may be related to master connection timing or file system synchronization delays."
Length of output: 56
@clebergnu Perfect! I've created issue #6251 with a comprehensive implementation plan for improving SSH reliability:
🔗 #6251
The issue includes:
- Context and backlinks to this PR and discussion
- Detailed problem description (why fixed sleeps are problematic)
- Proposed solutions covering timeout configuration, retry logic, logging, and error handling
- Step-by-step implementation plan
- Benefits of the proper approach
The issue is assigned to you and ready for the team to work on. This should provide a much more robust solution than the fixed sleep approach in the current PR.
_
/ \
/ _ \
| / \ |
|| ||
|\_v_/|
/ \
/ ( ) \
/_____
| ___ |
\ (. .) /
_/
Happy to help! 🐰
🧬 Code Graph Analysis Results
Filename: avocado/utils/path.py, lines 30-51
class CmdNotFoundError(Exception):
"""Indicates that the command was not found in the system after a search.
:param cmd: String with the command.
:param paths: List of paths where we looked after.
"""
def __init__(self, cmd, paths): # pylint: disable=W0231
super()
self.cmd = cmd
self.paths = paths
def __str__(self):
"""String representation of the exception.
:return: A string describing the missing command and the paths searched.
:rtype: str
"""
return (
f"Command '{self.cmd}' could not be found in any "
f"of the PATH dirs: {self.paths}"
)
Filename: avocado/utils/path.py, lines 90-133
def find_command(cmd, default=None, check_exec=True):
"""Try to find a command in the PATH, paranoid version.
:param cmd: Command to be found.
:type cmd: str
:param default: Command path to use as a fallback if not found
in the standard directories.
:type default: str or None
:param check_exec: if a check for permissions that render the command
executable by the current user should be performed.
:type check_exec: bool
:raise avocado.utils.path.CmdNotFoundError: in case the
command was not found and no default was given.
:return: Returns an absolute path to the command or the default
value if the command is not found
:rtype: str
"""
common_bin_paths = [
"/usr/libexec",
"/usr/local/sbin",
"/usr/local/bin",
"/usr/sbin",
"/usr/bin",
"/sbin",
"/bin",
]
try:
path_paths = os.environ["PATH"].split(":")
except KeyError:
path_paths = []
path_paths = list(set(common_bin_paths + path_paths))
for dir_path in path_paths:
cmd_path = os.path.join(dir_path, cmd)
if os.path.isfile(cmd_path):
if check_exec:
if not os.access(cmd_path, os.R_OK | os.X_OK):
continue
return os.path.abspath(cmd_path)
if default is not None:
return default
path_paths.sort()
raise CmdNotFoundError(cmd, path_paths)
Filename: avocado/utils/process.py, lines 44-58
class CmdError(Exception):
def __init__(
self, command=None, result=None, additional_text=None
): # pylint: disable=W0231
self.command = command
self.result = result
self.additional_text = additional_text
def __str__(self):
return (
f"Command '{self.command}' failed.\nstdout: "
f"{self.result.stdout!r}\nstderr: "
f"{self.result.stderr!r}\nadditional_info: "
f"{self.additional_text}"
)
Filename: avocado/utils/process.py, lines 952-1019
def run(
cmd,
timeout=None,
verbose=True,
ignore_status=False,
shell=False,
env=None,
sudo=False,
ignore_bg_processes=False,
encoding=None,
logger=None,
):
"""
Run a subprocess, returning a CmdResult object.
:param cmd: Command line to run.
:type cmd: str
:param timeout: Time limit in seconds before attempting to kill the
running process. This function will take a few seconds
longer than 'timeout' to complete if it has to kill the
process.
:type timeout: float
:param verbose: Whether to log the command run and stdout/stderr.
:type verbose: bool
:param ignore_status: Whether to raise an exception when command returns
=! 0 (False), or not (True).
:type ignore_status: bool
:param shell: Whether to run the command on a subshell
:type shell: bool
:param env: Use extra environment variables
:type env: dict
:param sudo: Whether the command requires admin privileges to run,
so that sudo will be prepended to the command.
The assumption here is that the user running the command
has a sudo configuration such that a password won't be
prompted. If that's not the case, the command will
straight out fail.
:param encoding: the encoding to use for the text representation
of the command result stdout and stderr, by default
:data:`avocado.utils.astring.ENCODING`
:type encoding: str
:param logger: User's custom logger, which will be logging the subprocess
outputs. When this parameter is not set, the
`avocado.utils.process` logger will be used.
:type logger: logging.Logger
:return: An :class:`CmdResult` object.
:raise: :class:`CmdError`, if ``ignore_status=False``.
"""
if not cmd:
raise CmdInputError("Invalid empty command")
if encoding is None:
encoding = astring.ENCODING
sp = SubProcess(
cmd=cmd,
verbose=verbose,
shell=shell,
env=env,
sudo=sudo,
ignore_bg_processes=ignore_bg_processes,
encoding=encoding,
logger=logger,
)
cmd_result = sp.run(timeout=timeout)
fail_condition = cmd_result.exit_status or cmd_result.interrupted
if fail_condition and not ignore_status:
raise CmdError(cmd, sp.result)
return cmd_result
| return cmd | ||
|
|
||
| def _master_connection(self): | ||
| time.sleep(10) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Replace fixed delay with retry logic and configurable timeouts.
A fixed 10-second sleep before establishing the master connection is problematic:
- Sleeping before the connection attempt doesn't mitigate network delays that occur during the connection
- The arbitrary 10-second delay will impact all users, even those without network issues
- It's not configurable and provides no feedback about what's happening
Consider these alternatives:
- Add SSH connection timeout options (e.g.,
ConnectTimeout=10) - Implement retry logic with exponential backoff for connection failures
- Make any delay configurable via parameters or class attributes
- Add logging to diagnose connection failures
Would you like me to generate an implementation with retry logic and configurable timeouts?
🤖 Prompt for AI Agents
In avocado/utils/ssh.py around line 104, replace the hardcoded time.sleep(10)
with configurable connection and retry behavior: add parameters or class
attributes for ssh_connect_timeout (e.g., ConnectTimeout), max_retries,
base_backoff_seconds, and max_backoff_seconds; attempt the master connection in
a loop that catches connection-related exceptions, logs each failure with
context, and retries with exponential backoff (jitter) until success or
max_retries is reached; ensure the SSH command includes the ConnectTimeout
option and surface a clear error/exception if all retries fail so callers can
handle it.
| return path | ||
|
|
||
| def _master_command(self, command): | ||
| time.sleep(10) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed delay in _master_command multiplies performance impact.
This method is called by _check (line 136) and quit (line 259), meaning the 10-second delay will occur multiple times per session. Similar to the delay in _master_connection, sleeping before command execution doesn't address network issues that occur during execution.
If master commands are timing out, address this through:
- SSH timeout options in the command construction
- Retry logic for failed commands with specific exit codes
- Configurable timeout parameters
🤖 Prompt for AI Agents
In avocado/utils/ssh.py around line 130, the hardcoded time.sleep(10) in
_master_command imposes repeated 10s delays and should be removed; instead make
command timeouts configurable and enforce them at execution (e.g., add a timeout
parameter or use SSH/remote command timeout flags), implement retry logic for
transient failures (retry on specific exit codes or exceptions with a capped
retry count and backoff), and expose the timeout and retry settings via
configuration so callers (_check and quit) don’t incur unnecessary fixed delays.
|
|
||
| @property | ||
| def control_master(self): | ||
| time.sleep(10) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove the sleep from control_master property - it performs no network operations.
This property only expands a file path and checks if it exists on the local filesystem (lines 194-198). There are no network operations here, so a 10-second sleep makes no sense and will severely impact performance since this property is accessed multiple times:
- In
cleanup_master(line 140) - In
connectvia_check(lines 157, 189)
This sleep should be removed entirely.
🤖 Prompt for AI Agents
In avocado/utils/ssh.py around line 193, remove the unnecessary time.sleep(10)
in the control_master property: this property only expands and checks a local
filesystem path and performs no network I/O, so delete the sleep call and ensure
the property simply returns the expanded path or existence result as before;
update any dependent logic if it assumed a delay (none should), and run tests to
confirm no behavioral changes besides improved performance.
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #6238 +/- ##
==========================================
- Coverage 73.47% 73.46% -0.01%
==========================================
Files 206 206
Lines 22494 22498 +4
==========================================
+ Hits 16528 16529 +1
- Misses 5966 5969 +3 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
richtja
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @vaishnavibhat, can you please better describe your current issue with the ssh utility so we could disscus possible solutions?
IMO adding fixed wait to the ssh connection is not right way how to solve any connection issues. It will slow dow all the other connection in enviroments without connectivity issues also there might be evrimoments where 10 sec won't be sufficient and it won't solv the issue. Thereofre I think we need to come up with exaption handling and retry logic indsted of hardcoded timouts.
clebergnu
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @vaishnavibhat,
I think I understand how this race condition could happen, and thanks for looking into fixing it. But, the 10 seconds sleep is just not acceptable because it can introduce major performance regressions.
Please take a look at the issue that CodeRabbit created: #6251
Among its point, the most important one is to:
Implement Retry Logic with Exponential Backoff
For operations prone to transient failures:
* Retry failed master connection establishment
* Use exponential backoff (e.g., 1s, 2s, 4s, 8s)
* Make retry count configurable
* Add maximum total timeout
Let me know if this makes sense to you.
Seeing some network delays and ssh failing because of it. Adding sleep(10) to induce some wait period.
Summary by CodeRabbit