Skip to content

Conversation

@vaishnavibhat
Copy link
Contributor

@vaishnavibhat vaishnavibhat commented Nov 19, 2025

Seeing some network delays and ssh failing because of it. Adding sleep(10) to induce some wait period.

Summary by CodeRabbit

  • Bug Fixes
    • Enhanced SSH connection reliability through timing adjustments applied to connection establishment, command execution, and control path management.

Seeing some network delays and ssh failing because of it.
Adding sleep(10) to induce some wait period.

Signed-off-by: Vaishnavi Bhat <vaishnavi@linux.vnet.ibm.com>
@coderabbitai
Copy link

coderabbitai bot commented Nov 19, 2025

Walkthrough

This pull request adds a time module import and introduces three fixed 10-second delays (time.sleep(10)) at different points in the SSH utility module: before establishing a master connection, before executing a master command, and when accessing or creating the control master path. No changes are made to function signatures, error handling logic, or control flow beyond the inserted sleep calls.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Areas for extra attention:

  • Rationale for delays: Reviewers should verify the justification for the fixed 10-second delays and confirm they address a specific timing or race condition issue
  • Performance impact: Consider whether these delays will noticeably increase SSH operation latency and test execution time, particularly in scenarios where connections are created or commands are executed frequently
  • Alternative approaches: Evaluate whether retry logic, exponential backoff, or polling-based solutions might be more appropriate than fixed delays

Pre-merge checks

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 20.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title 'add time.sleep before creating ssh master files' clearly and specifically describes the main change: introducing a sleep delay before SSH master file creation to address network delays.

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 Pylint (4.0.3)
avocado/utils/ssh.py

Tip

📝 Customizable high-level summaries are now available in beta!

You can now customize how CodeRabbit generates the high-level summary in your pull requests — including its content, structure, tone, and formatting.

  • Provide your own instructions using the high_level_summary_instructions setting.
  • Format the summary however you like (bullet lists, tables, multi-section layouts, contributor stats, etc.).
  • Use high_level_summary_in_walkthrough to move the summary from the description to the walkthrough section.

Example instruction:

"Divide the high-level summary into five sections:

  1. 📝 Description — Summarize the main change in 50–60 words, explaining what was done.
  2. 📓 References — List relevant issues, discussions, documentation, or related PRs.
  3. 📦 Dependencies & Requirements — Mention any new/updated dependencies, environment variable changes, or configuration updates.
  4. 📊 Contributor Summary — Include a Markdown table showing contributions:
    | Contributor | Lines Added | Lines Removed | Files Changed |
  5. ✔️ Additional Notes — Add any extra reviewer context.
    Keep each section concise (under 200 words) and use bullet or numbered lists for clarity."

Note: This feature is currently in beta for Pro-tier users, and pricing will be announced later.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@mr-avocado mr-avocado bot moved this to Review Requested in Default project Nov 19, 2025
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 50bcca0 and ab660f6.

📒 Files selected for processing (1)
  • avocado/utils/ssh.py (4 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (42)
  • GitHub Check: rpm-build:fedora-41-ppc64le
  • GitHub Check: rpm-build:fedora-43-x86_64
  • GitHub Check: rpm-build:fedora-rawhide-x86_64
  • GitHub Check: rpm-build:epel-9-x86_64
  • GitHub Check: rpm-build:fedora-41-aarch64
  • GitHub Check: rpm-build:fedora-42-x86_64
  • GitHub Check: rpm-build:centos-stream-9-x86_64
  • GitHub Check: rpm-build:fedora-41-x86_64
  • GitHub Check: rpm-build:fedora-41-s390x
  • GitHub Check: rpm-build:fedora-42-x86_64
  • GitHub Check: rpm-build:fedora-41-s390x
  • GitHub Check: rpm-build:centos-stream-9-x86_64
  • GitHub Check: rpm-build:fedora-41-aarch64
  • GitHub Check: rpm-build:fedora-43-x86_64
  • GitHub Check: rpm-build:epel-9-x86_64
  • GitHub Check: rpm-build:fedora-41-ppc64le
  • GitHub Check: rpm-build:fedora-rawhide-x86_64
  • GitHub Check: rpm-build:fedora-41-x86_64
  • GitHub Check: Linux with Python 3.12
  • GitHub Check: Linux with Python 3.13
  • GitHub Check: Linux with Python 3.11
  • GitHub Check: Linux with Python 3.9
  • GitHub Check: Linux with Python 3.10
  • GitHub Check: rpm-build:fedora-41-aarch64
  • GitHub Check: rpm-build:centos-stream-9-x86_64
  • GitHub Check: rpm-build:fedora-42-x86_64
  • GitHub Check: rpm-build:epel-9-x86_64
  • GitHub Check: rpm-build:fedora-41-s390x
  • GitHub Check: rpm-build:fedora-41-x86_64
  • GitHub Check: rpm-build:fedora-43-x86_64
  • GitHub Check: rpm-build:fedora-rawhide-x86_64
  • GitHub Check: rpm-build:fedora-41-ppc64le
  • GitHub Check: Fedora selftests
  • GitHub Check: Version task fedora:41
  • GitHub Check: Podman spawner with 3rd party runner plugin
  • GitHub Check: Egg task fedora:40
  • GitHub Check: Version task ubi:8.8
  • GitHub Check: Version task debian:12.4
  • GitHub Check: Version task ubuntu:22.04
  • GitHub Check: Static checks
  • GitHub Check: macOS with Python 3.11
  • GitHub Check: Code Coverage (3.11)

import subprocess
import sys
import tempfile
import time
Copy link

@coderabbitai coderabbitai bot Nov 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Architectural concern: Fixed delays are not the right solution for network reliability.

The PR description mentions addressing "network delays that cause SSH failures," but adding fixed sleeps is a workaround that:

  • Doesn't address the root cause
  • Impacts all users with 30+ seconds of total delay per session
  • Provides no diagnostic information about failures

A robust solution should:

  1. Diagnose the root cause: Are these connection timeouts? DNS resolution delays? Master connection race conditions?
  2. Use SSH timeout options: Add ConnectTimeout, ServerAliveInterval, ServerAliveCountMax to DEFAULT_OPTIONS
  3. Implement proper retry logic: Retry failed operations with exponential backoff
  4. Make timeouts configurable: Allow users to adjust based on their environment
  5. Add logging: Help diagnose when and why SSH operations fail

Would you like me to open an issue with a detailed implementation plan for proper SSH reliability improvements?

Also applies to: 104-104, 130-130, 193-193

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Create a GitHub issue for SSH reliability improvements

gh issue create \
  --repo avocado-framework/avocado \
  --title "Improve SSH reliability with proper timeout configuration and retry logic" \
  --assignee clebergnu \
  --body "## Context

This issue is a follow-up to PR #6238 and the discussion at:
- PR: https://github.com/avocado-framework/avocado/pull/6238
- Comment: https://github.com/avocado-framework/avocado/pull/6238#discussion_r2540644060
- Requested by: @clebergnu

## Problem

PR #6238 proposes adding fixed 10-second sleeps in three locations within \`avocado/utils/ssh.py\` to address \"network delays that cause SSH failures.\" While this may work around immediate symptoms, it has several drawbacks:

1. **Root cause unknown**: The delays don't address why SSH connections are failing
2. **Performance impact**: All users pay a 30+ second penalty per SSH session, even with fast networks
3. **No diagnostics**: No logging or error information to help debug failures
4. **Arbitrary timeout**: 10 seconds may be too short for some networks, too long for others
5. **Not maintainable**: Hardcoded sleeps are difficult to adjust and test

## Proposed Solution

Implement a comprehensive SSH reliability improvement that addresses the root causes:

### 1. Add SSH Timeout Configuration Options

Add timeout-related SSH options to \`DEFAULT_OPTIONS\`:
- \`ConnectTimeout\`: Maximum time to establish connection (e.g., 30 seconds)
- \`ServerAliveInterval\`: Send keepalive messages every N seconds (e.g., 15)
- \`ServerAliveCountMax\`: Number of keepalive messages before disconnect (e.g., 3)

### 2. Implement Retry Logic with Exponential Backoff

For operations prone to transient failures:
- Retry failed master connection establishment
- Use exponential backoff (e.g., 1s, 2s, 4s, 8s)
- Make retry count configurable
- Add maximum total timeout

### 3. Add Comprehensive Logging

Add logging to help diagnose issues:
- Log connection attempts and results
- Log retry attempts with reasons
- Log SSH command execution details
- Use Python's \`logging\` module with appropriate levels

### 4. Make Timeouts Configurable

Allow users to customize timeouts based on their environment:
- Add optional parameters to \`Session.__init__\`
- Support environment variables for defaults
- Document recommended values for different scenarios

### 5. Improve Error Messages

Enhance error reporting to help users diagnose issues:
- Distinguish between connection timeout vs. authentication failure
- Include timing information in error messages
- Suggest configuration changes based on error type

## Implementation Steps

1. Investigate root cause of SSH failures mentioned in PR #6238
2. Add timeout options to SSH configuration
3. Implement retry logic for master connection
4. Add logging framework
5. Make timeouts configurable via constructor parameters
6. Add tests for retry logic and timeout handling
7. Update documentation with troubleshooting guide

## Benefits

- **Better reliability**: Proper timeouts and retries handle transient failures
- **Better performance**: No fixed delays for fast networks
- **Better diagnostics**: Logging helps identify and fix issues
- **Better flexibility**: Users can tune for their environment
- **Better maintainability**: Clear, testable code vs. magic sleeps

## Additional Context

The current implementation adds sleeps in:
- \`_master_connection()\`: Before establishing master connection
- \`_master_command()\`: Before executing master commands
- \`control_master\` property: When accessing control master path

These locations suggest the issue may be related to master connection timing or file system synchronization delays."

Length of output: 56


@clebergnu Perfect! I've created issue #6251 with a comprehensive implementation plan for improving SSH reliability:

🔗 #6251

The issue includes:

  • Context and backlinks to this PR and discussion
  • Detailed problem description (why fixed sleeps are problematic)
  • Proposed solutions covering timeout configuration, retry logic, logging, and error handling
  • Step-by-step implementation plan
  • Benefits of the proper approach

The issue is assigned to you and ready for the team to work on. This should provide a much more robust solution than the fixed sleep approach in the current PR.

    _
   / \
  / _ \
 | / \ |
 ||   || 
 |\_v_/|
 /     \
/ (  )  \

/_____
| ___ |
\ (. .) /
_
/
Happy to help! 🐰


🧬 Code Graph Analysis Results

Filename: avocado/utils/path.py, lines 30-51

class CmdNotFoundError(Exception):
    """Indicates that the command was not found in the system after a search.

    :param cmd: String with the command.
    :param paths: List of paths where we looked after.
    """

    def __init__(self, cmd, paths):  # pylint: disable=W0231
        super()
        self.cmd = cmd
        self.paths = paths

    def __str__(self):
        """String representation of the exception.

        :return: A string describing the missing command and the paths searched.
        :rtype: str
        """
        return (
            f"Command '{self.cmd}' could not be found in any "
            f"of the PATH dirs: {self.paths}"
        )

Filename: avocado/utils/path.py, lines 90-133

def find_command(cmd, default=None, check_exec=True):
    """Try to find a command in the PATH, paranoid version.

    :param cmd: Command to be found.
    :type cmd: str
    :param default: Command path to use as a fallback if not found
                    in the standard directories.
    :type default: str or None
    :param check_exec: if a check for permissions that render the command
                       executable by the current user should be performed.
    :type check_exec: bool
    :raise avocado.utils.path.CmdNotFoundError: in case the
            command was not found and no default was given.
    :return: Returns an absolute path to the command or the default
            value if the command is not found
    :rtype: str
    """
    common_bin_paths = [
        "/usr/libexec",
        "/usr/local/sbin",
        "/usr/local/bin",
        "/usr/sbin",
        "/usr/bin",
        "/sbin",
        "/bin",
    ]
    try:
        path_paths = os.environ["PATH"].split(":")
    except KeyError:
        path_paths = []
    path_paths = list(set(common_bin_paths + path_paths))

    for dir_path in path_paths:
        cmd_path = os.path.join(dir_path, cmd)
        if os.path.isfile(cmd_path):
            if check_exec:
                if not os.access(cmd_path, os.R_OK | os.X_OK):
                    continue
            return os.path.abspath(cmd_path)

    if default is not None:
        return default
    path_paths.sort()
    raise CmdNotFoundError(cmd, path_paths)

Filename: avocado/utils/process.py, lines 44-58

class CmdError(Exception):
    def __init__(
        self, command=None, result=None, additional_text=None
    ):  # pylint: disable=W0231
        self.command = command
        self.result = result
        self.additional_text = additional_text

    def __str__(self):
        return (
            f"Command '{self.command}' failed.\nstdout: "
            f"{self.result.stdout!r}\nstderr: "
            f"{self.result.stderr!r}\nadditional_info: "
            f"{self.additional_text}"
        )

Filename: avocado/utils/process.py, lines 952-1019

def run(
    cmd,
    timeout=None,
    verbose=True,
    ignore_status=False,
    shell=False,
    env=None,
    sudo=False,
    ignore_bg_processes=False,
    encoding=None,
    logger=None,
):
    """
    Run a subprocess, returning a CmdResult object.

    :param cmd: Command line to run.
    :type cmd: str
    :param timeout: Time limit in seconds before attempting to kill the
                    running process. This function will take a few seconds
                    longer than 'timeout' to complete if it has to kill the
                    process.
    :type timeout: float
    :param verbose: Whether to log the command run and stdout/stderr.
    :type verbose: bool
    :param ignore_status: Whether to raise an exception when command returns
                          =! 0 (False), or not (True).
    :type ignore_status: bool
    :param shell: Whether to run the command on a subshell
    :type shell: bool
    :param env: Use extra environment variables
    :type env: dict
    :param sudo: Whether the command requires admin privileges to run,
                 so that sudo will be prepended to the command.
                 The assumption here is that the user running the command
                 has a sudo configuration such that a password won't be
                 prompted. If that's not the case, the command will
                 straight out fail.
    :param encoding: the encoding to use for the text representation
                     of the command result stdout and stderr, by default
                     :data:`avocado.utils.astring.ENCODING`
    :type encoding: str
    :param logger: User's custom logger, which will be logging the subprocess
                   outputs. When this parameter is not set, the
                   `avocado.utils.process` logger will be used.
    :type logger: logging.Logger

    :return: An :class:`CmdResult` object.
    :raise: :class:`CmdError`, if ``ignore_status=False``.
    """
    if not cmd:
        raise CmdInputError("Invalid empty command")
    if encoding is None:
        encoding = astring.ENCODING
    sp = SubProcess(
        cmd=cmd,
        verbose=verbose,
        shell=shell,
        env=env,
        sudo=sudo,
        ignore_bg_processes=ignore_bg_processes,
        encoding=encoding,
        logger=logger,
    )
    cmd_result = sp.run(timeout=timeout)
    fail_condition = cmd_result.exit_status or cmd_result.interrupted
    if fail_condition and not ignore_status:
        raise CmdError(cmd, sp.result)
    return cmd_result

return cmd

def _master_connection(self):
time.sleep(10)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Replace fixed delay with retry logic and configurable timeouts.

A fixed 10-second sleep before establishing the master connection is problematic:

  • Sleeping before the connection attempt doesn't mitigate network delays that occur during the connection
  • The arbitrary 10-second delay will impact all users, even those without network issues
  • It's not configurable and provides no feedback about what's happening

Consider these alternatives:

  • Add SSH connection timeout options (e.g., ConnectTimeout=10)
  • Implement retry logic with exponential backoff for connection failures
  • Make any delay configurable via parameters or class attributes
  • Add logging to diagnose connection failures

Would you like me to generate an implementation with retry logic and configurable timeouts?

🤖 Prompt for AI Agents
In avocado/utils/ssh.py around line 104, replace the hardcoded time.sleep(10)
with configurable connection and retry behavior: add parameters or class
attributes for ssh_connect_timeout (e.g., ConnectTimeout), max_retries,
base_backoff_seconds, and max_backoff_seconds; attempt the master connection in
a loop that catches connection-related exceptions, logs each failure with
context, and retries with exponential backoff (jitter) until success or
max_retries is reached; ensure the SSH command includes the ConnectTimeout
option and surface a clear error/exception if all retries fail so callers can
handle it.

return path

def _master_command(self, command):
time.sleep(10)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Fixed delay in _master_command multiplies performance impact.

This method is called by _check (line 136) and quit (line 259), meaning the 10-second delay will occur multiple times per session. Similar to the delay in _master_connection, sleeping before command execution doesn't address network issues that occur during execution.

If master commands are timing out, address this through:

  • SSH timeout options in the command construction
  • Retry logic for failed commands with specific exit codes
  • Configurable timeout parameters
🤖 Prompt for AI Agents
In avocado/utils/ssh.py around line 130, the hardcoded time.sleep(10) in
_master_command imposes repeated 10s delays and should be removed; instead make
command timeouts configurable and enforce them at execution (e.g., add a timeout
parameter or use SSH/remote command timeout flags), implement retry logic for
transient failures (retry on specific exit codes or exceptions with a capped
retry count and backoff), and expose the timeout and retry settings via
configuration so callers (_check and quit) don’t incur unnecessary fixed delays.


@property
def control_master(self):
time.sleep(10)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Remove the sleep from control_master property - it performs no network operations.

This property only expands a file path and checks if it exists on the local filesystem (lines 194-198). There are no network operations here, so a 10-second sleep makes no sense and will severely impact performance since this property is accessed multiple times:

  • In cleanup_master (line 140)
  • In connect via _check (lines 157, 189)

This sleep should be removed entirely.

🤖 Prompt for AI Agents
In avocado/utils/ssh.py around line 193, remove the unnecessary time.sleep(10)
in the control_master property: this property only expands and checks a local
filesystem path and performs no network I/O, so delete the sleep call and ensure
the property simply returns the expanded path or existence result as before;
update any dependent logic if it assumed a delay (none should), and run tests to
confirm no behavioral changes besides improved performance.

@codecov
Copy link

codecov bot commented Nov 19, 2025

Codecov Report

❌ Patch coverage is 50.00000% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.46%. Comparing base (50bcca0) to head (ab660f6).

Files with missing lines Patch % Lines
avocado/utils/ssh.py 50.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6238      +/-   ##
==========================================
- Coverage   73.47%   73.46%   -0.01%     
==========================================
  Files         206      206              
  Lines       22494    22498       +4     
==========================================
+ Hits        16528    16529       +1     
- Misses       5966     5969       +3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@clebergnu clebergnu requested a review from richtja November 25, 2025 15:14
Copy link
Contributor

@richtja richtja left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @vaishnavibhat, can you please better describe your current issue with the ssh utility so we could disscus possible solutions?

IMO adding fixed wait to the ssh connection is not right way how to solve any connection issues. It will slow dow all the other connection in enviroments without connectivity issues also there might be evrimoments where 10 sec won't be sufficient and it won't solv the issue. Thereofre I think we need to come up with exaption handling and retry logic indsted of hardcoded timouts.

Copy link
Contributor

@clebergnu clebergnu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @vaishnavibhat,

I think I understand how this race condition could happen, and thanks for looking into fixing it. But, the 10 seconds sleep is just not acceptable because it can introduce major performance regressions.

Please take a look at the issue that CodeRabbit created: #6251

Among its point, the most important one is to:

Implement Retry Logic with Exponential Backoff

For operations prone to transient failures:
* Retry failed master connection establishment
* Use exponential backoff (e.g., 1s, 2s, 4s, 8s)
* Make retry count configurable
* Add maximum total timeout

Let me know if this makes sense to you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Review Requested

Development

Successfully merging this pull request may close these issues.

3 participants