add time.sleep before creating ssh master files #6238

vaishnavibhat · 2025-11-19T05:53:36Z

Seeing some network delays and ssh failing because of it. Adding sleep(10) to induce some wait period.

Summary by CodeRabbit

Bug Fixes
- Enhanced SSH connection reliability through timing adjustments applied to connection establishment, command execution, and control path management.

Seeing some network delays and ssh failing because of it. Adding sleep(10) to induce some wait period. Signed-off-by: Vaishnavi Bhat <vaishnavi@linux.vnet.ibm.com>

coderabbitai · 2025-11-19T05:53:45Z

Walkthrough

This pull request adds a time module import and introduces three fixed 10-second delays (time.sleep(10)) at different points in the SSH utility module: before establishing a master connection, before executing a master command, and when accessing or creating the control master path. No changes are made to function signatures, error handling logic, or control flow beyond the inserted sleep calls.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Areas for extra attention:

Rationale for delays: Reviewers should verify the justification for the fixed 10-second delays and confirm they address a specific timing or race condition issue
Performance impact: Consider whether these delays will noticeably increase SSH operation latency and test execution time, particularly in scenarios where connections are created or commands are executed frequently
Alternative approaches: Evaluate whether retry logic, exponential backoff, or polling-based solutions might be more appropriate than fixed delays

Pre-merge checks

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 20.00% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The PR title 'add time.sleep before creating ssh master files' clearly and specifically describes the main change: introducing a sleep delay before SSH master file creation to address network delays.

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 Pylint (4.0.3)

avocado/utils/ssh.py

Tip

📝 Customizable high-level summaries are now available in beta!

You can now customize how CodeRabbit generates the high-level summary in your pull requests — including its content, structure, tone, and formatting.

Provide your own instructions using the high_level_summary_instructions setting.
Format the summary however you like (bullet lists, tables, multi-section layouts, contributor stats, etc.).
Use high_level_summary_in_walkthrough to move the summary from the description to the walkthrough section.

Example instruction:

"Divide the high-level summary into five sections:

📝 Description — Summarize the main change in 50–60 words, explaining what was done.

📓 References — List relevant issues, discussions, documentation, or related PRs.

📦 Dependencies & Requirements — Mention any new/updated dependencies, environment variable changes, or configuration updates.

📊 Contributor Summary — Include a Markdown table showing contributions:
| Contributor | Lines Added | Lines Removed | Files Changed |

✔️ Additional Notes — Add any extra reviewer context.
Keep each section concise (under 200 words) and use bullet or numbered lists for clarity."

Note: This feature is currently in beta for Pro-tier users, and pricing will be announced later.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 4

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 50bcca0 and ab660f6.

📒 Files selected for processing (1)

avocado/utils/ssh.py (4 hunks)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (42)

GitHub Check: rpm-build:fedora-41-ppc64le
GitHub Check: rpm-build:fedora-43-x86_64
GitHub Check: rpm-build:fedora-rawhide-x86_64
GitHub Check: rpm-build:epel-9-x86_64
GitHub Check: rpm-build:fedora-41-aarch64
GitHub Check: rpm-build:fedora-42-x86_64
GitHub Check: rpm-build:centos-stream-9-x86_64
GitHub Check: rpm-build:fedora-41-x86_64
GitHub Check: rpm-build:fedora-41-s390x
GitHub Check: rpm-build:fedora-42-x86_64
GitHub Check: rpm-build:fedora-41-s390x
GitHub Check: rpm-build:centos-stream-9-x86_64
GitHub Check: rpm-build:fedora-41-aarch64
GitHub Check: rpm-build:fedora-43-x86_64
GitHub Check: rpm-build:epel-9-x86_64
GitHub Check: rpm-build:fedora-41-ppc64le
GitHub Check: rpm-build:fedora-rawhide-x86_64
GitHub Check: rpm-build:fedora-41-x86_64
GitHub Check: Linux with Python 3.12
GitHub Check: Linux with Python 3.13
GitHub Check: Linux with Python 3.11
GitHub Check: Linux with Python 3.9
GitHub Check: Linux with Python 3.10
GitHub Check: rpm-build:fedora-41-aarch64
GitHub Check: rpm-build:centos-stream-9-x86_64
GitHub Check: rpm-build:fedora-42-x86_64
GitHub Check: rpm-build:epel-9-x86_64
GitHub Check: rpm-build:fedora-41-s390x
GitHub Check: rpm-build:fedora-41-x86_64
GitHub Check: rpm-build:fedora-43-x86_64
GitHub Check: rpm-build:fedora-rawhide-x86_64
GitHub Check: rpm-build:fedora-41-ppc64le
GitHub Check: Fedora selftests
GitHub Check: Version task fedora:41
GitHub Check: Podman spawner with 3rd party runner plugin
GitHub Check: Egg task fedora:40
GitHub Check: Version task ubi:8.8
GitHub Check: Version task debian:12.4
GitHub Check: Version task ubuntu:22.04
GitHub Check: Static checks
GitHub Check: macOS with Python 3.11
GitHub Check: Code Coverage (3.11)

coderabbitai · 2025-11-19T05:57:15Z

avocado/utils/ssh.py

 import subprocess
 import sys
 import tempfile
+import time


⚠️ Potential issue | 🟠 Major

Architectural concern: Fixed delays are not the right solution for network reliability.

The PR description mentions addressing "network delays that cause SSH failures," but adding fixed sleeps is a workaround that:

Doesn't address the root cause

Impacts all users with 30+ seconds of total delay per session

Provides no diagnostic information about failures

A robust solution should:

Diagnose the root cause: Are these connection timeouts? DNS resolution delays? Master connection race conditions?

Use SSH timeout options: Add ConnectTimeout, ServerAliveInterval, ServerAliveCountMax to DEFAULT_OPTIONS

Implement proper retry logic: Retry failed operations with exponential backoff

Make timeouts configurable: Allow users to adjust based on their environment

Add logging: Help diagnose when and why SSH operations fail

Would you like me to open an issue with a detailed implementation plan for proper SSH reliability improvements?

Also applies to: 104-104, 130-130, 193-193

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Create a GitHub issue for SSH reliability improvements gh issue create \ --repo avocado-framework/avocado \ --title "Improve SSH reliability with proper timeout configuration and retry logic" \ --assignee clebergnu \ --body "## Context This issue is a follow-up to PR #6238 and the discussion at: - PR: https://github.com/avocado-framework/avocado/pull/6238 - Comment: https://github.com/avocado-framework/avocado/pull/6238#discussion_r2540644060 - Requested by: @clebergnu ## Problem PR #6238 proposes adding fixed 10-second sleeps in three locations within \`avocado/utils/ssh.py\` to address \"network delays that cause SSH failures.\" While this may work around immediate symptoms, it has several drawbacks: 1. **Root cause unknown**: The delays don't address why SSH connections are failing 2. **Performance impact**: All users pay a 30+ second penalty per SSH session, even with fast networks 3. **No diagnostics**: No logging or error information to help debug failures 4. **Arbitrary timeout**: 10 seconds may be too short for some networks, too long for others 5. **Not maintainable**: Hardcoded sleeps are difficult to adjust and test ## Proposed Solution Implement a comprehensive SSH reliability improvement that addresses the root causes: ### 1. Add SSH Timeout Configuration Options Add timeout-related SSH options to \`DEFAULT_OPTIONS\`: - \`ConnectTimeout\`: Maximum time to establish connection (e.g., 30 seconds) - \`ServerAliveInterval\`: Send keepalive messages every N seconds (e.g., 15) - \`ServerAliveCountMax\`: Number of keepalive messages before disconnect (e.g., 3) ### 2. Implement Retry Logic with Exponential Backoff For operations prone to transient failures: - Retry failed master connection establishment - Use exponential backoff (e.g., 1s, 2s, 4s, 8s) - Make retry count configurable - Add maximum total timeout ### 3. Add Comprehensive Logging Add logging to help diagnose issues: - Log connection attempts and results - Log retry attempts with reasons - Log SSH command execution details - Use Python's \`logging\` module with appropriate levels ### 4. Make Timeouts Configurable Allow users to customize timeouts based on their environment: - Add optional parameters to \`Session.__init__\` - Support environment variables for defaults - Document recommended values for different scenarios ### 5. Improve Error Messages Enhance error reporting to help users diagnose issues: - Distinguish between connection timeout vs. authentication failure - Include timing information in error messages - Suggest configuration changes based on error type ## Implementation Steps 1. Investigate root cause of SSH failures mentioned in PR #6238 2. Add timeout options to SSH configuration 3. Implement retry logic for master connection 4. Add logging framework 5. Make timeouts configurable via constructor parameters 6. Add tests for retry logic and timeout handling 7. Update documentation with troubleshooting guide ## Benefits - **Better reliability**: Proper timeouts and retries handle transient failures - **Better performance**: No fixed delays for fast networks - **Better diagnostics**: Logging helps identify and fix issues - **Better flexibility**: Users can tune for their environment - **Better maintainability**: Clear, testable code vs. magic sleeps ## Additional Context The current implementation adds sleeps in: - \`_master_connection()\`: Before establishing master connection - \`_master_command()\`: Before executing master commands - \`control_master\` property: When accessing control master path These locations suggest the issue may be related to master connection timing or file system synchronization delays."

Length of output: 56

@clebergnu Perfect! I've created issue #6251 with a comprehensive implementation plan for improving SSH reliability:

🔗 #6251

The issue includes:

Context and backlinks to this PR and discussion

Detailed problem description (why fixed sleeps are problematic)

Proposed solutions covering timeout configuration, retry logic, logging, and error handling

Step-by-step implementation plan

Benefits of the proper approach

The issue is assigned to you and ready for the team to work on. This should provide a much more robust solution than the fixed sleep approach in the current PR.

_ / \ / _ \ | / \ | || || |\_v_/| / \ / ( ) \

/_____
| ___ |
\ (. .) /
_/
Happy to help! 🐰

🧬 Code Graph Analysis Results

Filename: avocado/utils/path.py, lines 30-51

class CmdNotFoundError(Exception): """Indicates that the command was not found in the system after a search. :param cmd: String with the command. :param paths: List of paths where we looked after. """ def __init__(self, cmd, paths): # pylint: disable=W0231 super() self.cmd = cmd self.paths = paths def __str__(self): """String representation of the exception. :return: A string describing the missing command and the paths searched. :rtype: str """ return ( f"Command '{self.cmd}' could not be found in any " f"of the PATH dirs: {self.paths}" )

Filename: avocado/utils/path.py, lines 90-133

def find_command(cmd, default=None, check_exec=True): """Try to find a command in the PATH, paranoid version. :param cmd: Command to be found. :type cmd: str :param default: Command path to use as a fallback if not found in the standard directories. :type default: str or None :param check_exec: if a check for permissions that render the command executable by the current user should be performed. :type check_exec: bool :raise avocado.utils.path.CmdNotFoundError: in case the command was not found and no default was given. :return: Returns an absolute path to the command or the default value if the command is not found :rtype: str """ common_bin_paths = [ "/usr/libexec", "/usr/local/sbin", "/usr/local/bin", "/usr/sbin", "/usr/bin", "/sbin", "/bin", ] try: path_paths = os.environ["PATH"].split(":") except KeyError: path_paths = [] path_paths = list(set(common_bin_paths + path_paths)) for dir_path in path_paths: cmd_path = os.path.join(dir_path, cmd) if os.path.isfile(cmd_path): if check_exec: if not os.access(cmd_path, os.R_OK | os.X_OK): continue return os.path.abspath(cmd_path) if default is not None: return default path_paths.sort() raise CmdNotFoundError(cmd, path_paths)

Filename: avocado/utils/process.py, lines 44-58

class CmdError(Exception): def __init__( self, command=None, result=None, additional_text=None ): # pylint: disable=W0231 self.command = command self.result = result self.additional_text = additional_text def __str__(self): return ( f"Command '{self.command}' failed.\nstdout: " f"{self.result.stdout!r}\nstderr: " f"{self.result.stderr!r}\nadditional_info: " f"{self.additional_text}" )

Filename: avocado/utils/process.py, lines 952-1019

def run( cmd, timeout=None, verbose=True, ignore_status=False, shell=False, env=None, sudo=False, ignore_bg_processes=False, encoding=None, logger=None, ): """ Run a subprocess, returning a CmdResult object. :param cmd: Command line to run. :type cmd: str :param timeout: Time limit in seconds before attempting to kill the running process. This function will take a few seconds longer than 'timeout' to complete if it has to kill the process. :type timeout: float :param verbose: Whether to log the command run and stdout/stderr. :type verbose: bool :param ignore_status: Whether to raise an exception when command returns =! 0 (False), or not (True). :type ignore_status: bool :param shell: Whether to run the command on a subshell :type shell: bool :param env: Use extra environment variables :type env: dict :param sudo: Whether the command requires admin privileges to run, so that sudo will be prepended to the command. The assumption here is that the user running the command has a sudo configuration such that a password won't be prompted. If that's not the case, the command will straight out fail. :param encoding: the encoding to use for the text representation of the command result stdout and stderr, by default :data:`avocado.utils.astring.ENCODING` :type encoding: str :param logger: User's custom logger, which will be logging the subprocess outputs. When this parameter is not set, the `avocado.utils.process` logger will be used. :type logger: logging.Logger :return: An :class:`CmdResult` object. :raise: :class:`CmdError`, if ``ignore_status=False``. """ if not cmd: raise CmdInputError("Invalid empty command") if encoding is None: encoding = astring.ENCODING sp = SubProcess( cmd=cmd, verbose=verbose, shell=shell, env=env, sudo=sudo, ignore_bg_processes=ignore_bg_processes, encoding=encoding, logger=logger, ) cmd_result = sp.run(timeout=timeout) fail_condition = cmd_result.exit_status or cmd_result.interrupted if fail_condition and not ignore_status: raise CmdError(cmd, sp.result) return cmd_result

coderabbitai · 2025-11-19T05:57:15Z

avocado/utils/ssh.py

        return cmd

    def _master_connection(self):
+        time.sleep(10)


⚠️ Potential issue | 🟠 Major

Replace fixed delay with retry logic and configurable timeouts.

A fixed 10-second sleep before establishing the master connection is problematic:

Sleeping before the connection attempt doesn't mitigate network delays that occur during the connection

The arbitrary 10-second delay will impact all users, even those without network issues

It's not configurable and provides no feedback about what's happening

Consider these alternatives:

Add SSH connection timeout options (e.g., ConnectTimeout=10)

Implement retry logic with exponential backoff for connection failures

Make any delay configurable via parameters or class attributes

Add logging to diagnose connection failures

Would you like me to generate an implementation with retry logic and configurable timeouts?

🤖 Prompt for AI Agents

In avocado/utils/ssh.py around line 104, replace the hardcoded time.sleep(10) with configurable connection and retry behavior: add parameters or class attributes for ssh_connect_timeout (e.g., ConnectTimeout), max_retries, base_backoff_seconds, and max_backoff_seconds; attempt the master connection in a loop that catches connection-related exceptions, logs each failure with context, and retries with exponential backoff (jitter) until success or max_retries is reached; ensure the SSH command includes the ConnectTimeout option and surface a clear error/exception if all retries fail so callers can handle it.

coderabbitai · 2025-11-19T05:57:15Z

avocado/utils/ssh.py

        return path

    def _master_command(self, command):
+        time.sleep(10)


⚠️ Potential issue | 🟠 Major

Fixed delay in _master_command multiplies performance impact.

This method is called by _check (line 136) and quit (line 259), meaning the 10-second delay will occur multiple times per session. Similar to the delay in _master_connection, sleeping before command execution doesn't address network issues that occur during execution.

If master commands are timing out, address this through:

SSH timeout options in the command construction

Retry logic for failed commands with specific exit codes

Configurable timeout parameters

🤖 Prompt for AI Agents

In avocado/utils/ssh.py around line 130, the hardcoded time.sleep(10) in _master_command imposes repeated 10s delays and should be removed; instead make command timeouts configurable and enforce them at execution (e.g., add a timeout parameter or use SSH/remote command timeout flags), implement retry logic for transient failures (retry on specific exit codes or exceptions with a capped retry count and backoff), and expose the timeout and retry settings via configuration so callers (_check and quit) don’t incur unnecessary fixed delays.

coderabbitai · 2025-11-19T05:57:15Z

avocado/utils/ssh.py


    @property
    def control_master(self):
+        time.sleep(10)


⚠️ Potential issue | 🔴 Critical

Remove the sleep from control_master property - it performs no network operations.

This property only expands a file path and checks if it exists on the local filesystem (lines 194-198). There are no network operations here, so a 10-second sleep makes no sense and will severely impact performance since this property is accessed multiple times:

In cleanup_master (line 140)

In connect via _check (lines 157, 189)

This sleep should be removed entirely.

🤖 Prompt for AI Agents

In avocado/utils/ssh.py around line 193, remove the unnecessary time.sleep(10) in the control_master property: this property only expands and checks a local filesystem path and performs no network I/O, so delete the sleep call and ensure the property simply returns the expanded path or existence result as before; update any dependent logic if it assumed a delay (none should), and run tests to confirm no behavioral changes besides improved performance.

codecov · 2025-11-19T06:09:35Z

Codecov Report

❌ Patch coverage is 50.00000% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.46%. Comparing base (50bcca0) to head (ab660f6).

Files with missing lines	Patch %	Lines
avocado/utils/ssh.py	50.00%	2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #6238      +/-   ##
==========================================
- Coverage   73.47%   73.46%   -0.01%     
==========================================
  Files         206      206              
  Lines       22494    22498       +4     
==========================================
+ Hits        16528    16529       +1     
- Misses       5966     5969       +3

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

richtja

Hi @vaishnavibhat, can you please better describe your current issue with the ssh utility so we could disscus possible solutions?

IMO adding fixed wait to the ssh connection is not right way how to solve any connection issues. It will slow dow all the other connection in enviroments without connectivity issues also there might be evrimoments where 10 sec won't be sufficient and it won't solv the issue. Thereofre I think we need to come up with exaption handling and retry logic indsted of hardcoded timouts.

clebergnu

Hi @vaishnavibhat,

I think I understand how this race condition could happen, and thanks for looking into fixing it. But, the 10 seconds sleep is just not acceptable because it can introduce major performance regressions.

Please take a look at the issue that CodeRabbit created: #6251

Among its point, the most important one is to:

Implement Retry Logic with Exponential Backoff

For operations prone to transient failures:
* Retry failed master connection establishment
* Use exponential backoff (e.g., 1s, 2s, 4s, 8s)
* Make retry count configurable
* Add maximum total timeout

Let me know if this makes sense to you.

add time.sleep before creating ssh master files

ab660f6

Seeing some network delays and ssh failing because of it. Adding sleep(10) to induce some wait period. Signed-off-by: Vaishnavi Bhat <vaishnavi@linux.vnet.ibm.com>

mr-avocado bot added this to Default project Nov 19, 2025

mr-avocado bot moved this to Review Requested in Default project Nov 19, 2025

coderabbitai bot reviewed Nov 19, 2025

View reviewed changes

clebergnu requested a review from richtja November 25, 2025 15:14

richtja requested changes Nov 28, 2025

View reviewed changes

coderabbitai bot mentioned this pull request Dec 4, 2025

Improve SSH reliability with proper timeout configuration and retry logic #6251

Closed

clebergnu requested changes Dec 4, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

add time.sleep before creating ssh master files #6238

add time.sleep before creating ssh master files #6238

vaishnavibhat commented Nov 19, 2025 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Nov 19, 2025 •

edited

Loading

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot Nov 19, 2025 •

edited

Loading

Uh oh!

clebergnu Dec 4, 2025

Uh oh!

coderabbitai bot Dec 4, 2025

Uh oh!

coderabbitai bot Nov 19, 2025

Uh oh!

coderabbitai bot Nov 19, 2025

Uh oh!

coderabbitai bot Nov 19, 2025

Uh oh!

codecov bot commented Nov 19, 2025

Uh oh!

richtja left a comment

Uh oh!

clebergnu left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

add time.sleep before creating ssh master files #6238

Are you sure you want to change the base?

add time.sleep before creating ssh master files #6238

Conversation

vaishnavibhat commented Nov 19, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Nov 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Estimated code review effort

Pre-merge checks

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Nov 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

clebergnu Dec 4, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Dec 4, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Nov 19, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Nov 19, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Nov 19, 2025

Choose a reason for hiding this comment

Uh oh!

codecov bot commented Nov 19, 2025

Codecov Report

Uh oh!

richtja left a comment

Choose a reason for hiding this comment

Uh oh!

clebergnu left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

vaishnavibhat commented Nov 19, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Nov 19, 2025 •

edited

Loading

coderabbitai bot Nov 19, 2025 •

edited

Loading