feat: add exponential backoff retry for transient SDK errors#127

Open
haripatel07 wants to merge 3 commits into RichardAtCT:main from haripatel07:feat/retry-logic-sdk-calls

Conversation

@haripatel07
Contributor

Closes #60 - adds configurable retry logic to ClaudeSDKManager.execute_command() for transient CLIConnectionError failures (non-MCP).

Changes:

  • src/utils/constants.py: 4 new retry default constants
  • src/config/settings.py: 4 new settings fields (claude_retry_max_attempts, claude_retry_base_delay, claude_retry_backoff_factor, claude_retry_max_delay)
  • src/claude/sdk_integration.py: _is_retryable_error() helper + retry loop wrapping asyncio.wait_for() in execute_command()

Retry decision:

  • CLIConnectionError (non-MCP): retried with exponential backoff
  • asyncio.TimeoutError: not retried (user-configured timeout, intentional)
  • CLINotFoundError, ProcessError, CLIJSONDecodeError: not retried

Default backoff: 1s → 3s → 9s, capped at 30s (CLAUDE_RETRY_MAX_ATTEMPTS=0 disables)
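For illustration, the delay schedule implied by those defaults can be sketched as follows (a standalone sketch, not code from this PR; the function and parameter names are hypothetical):

```python
# Sketch of the exponential backoff schedule described above.
# base=1s, factor=3, cap=30s match the stated defaults; the function
# itself is illustrative only.
def backoff_delays(attempts: int, base: float = 1.0,
                   factor: float = 3.0, cap: float = 30.0) -> list[float]:
    """Delay slept before each retry (the first attempt has no delay)."""
    return [min(base * factor ** (n - 1), cap) for n in range(1, attempts)]

print(backoff_delays(4))  # [1.0, 3.0, 9.0]
print(backoff_delays(6))  # [1.0, 3.0, 9.0, 27.0, 30.0] -- capped at 30s
```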

Tests: 491 passed, 0 failed

Copilot AI review requested due to automatic review settings March 5, 2026 04:33
@FridayOpenClawBot

Good addition — retry logic for transient CLIConnectionError is a real pain point and the overall structure is clean. A few things worth looking at:

🐛 Dead code in the retry loop

The for...else + last_exc pattern doesn't work as intended. On the final attempt, when attempt < max_attempts - 1 is False, the bare raise fires, so the loop exits via the raise rather than by completing normally. Python's for...else runs the else block only when the loop finishes without a break, and an exception propagating out of the loop skips it as well. In practice, last_exc is stored but never used and the else block is unreachable. Behaviour is still correct (the bare raise re-raises exc), but the dead code is misleading. Simplest fix: drop the else block and the last_exc tracking entirely, or restructure to re-raise explicitly after the loop.
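A minimal repro of the control-flow point (standalone sketch, not the PR's code):

```python
# `else` on a for-loop runs only when the loop finishes without `break`;
# an exception propagating out of the loop skips the else block too.
def demo() -> str:
    try:
        for attempt in range(3):
            if attempt == 2:
                raise RuntimeError("exhausted")  # loop exits via raise
        else:
            return "else ran"  # never reached when the raise above fires
    except RuntimeError:
        return "raised"

print(demo())  # raised
```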

⚠️ Overly broad MCP filter in _is_retryable_error

return "mcp" not in msg and "server" not in msg

Filtering on "server" is too aggressive — legitimate transient CLIConnectionErrors (e.g. "failed to connect to server") could contain that word and get silently non-retried. The MCP distinction was presumably to avoid retrying when an MCP server is down (which is user-controlled, not transient). A tighter check — e.g. "mcp server" as a phrase, or checking the exception type/subclass — would be safer.
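A sketch of the tighter phrase check suggested above (the helper name and message strings are hypothetical):

```python
# Match the phrase "mcp server" rather than the bare word "server", so
# ordinary transient connection messages are still treated as retryable.
def is_mcp_error(exc: BaseException) -> bool:
    return "mcp server" in str(exc).lower()

print(is_mcp_error(ConnectionError("failed to connect to server")))  # False
print(is_mcp_error(ConnectionError("MCP server 'github' is down")))  # True
```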

ℹ️ max(1, ...) silently overrides 0 = disabled

max(1, self.config.claude_retry_max_attempts) converts 0 to 1, so there is always at least one attempt. This is semantically correct: setting 0 gives a single attempt, i.e. no retries, which matches the constant comment that CLAUDE_RETRY_MAX_ATTEMPTS=0 disables retries. But the max_attempts variable name suggests the whole call could be skipped at 0. Minor: worth a comment clarifying this.
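To make those semantics concrete (a sketch of the normalization only, not the PR's code; the function name is a stand-in):

```python
def normalize_attempts(configured: int) -> int:
    # 0 ("retries disabled") still yields one attempt: the command always
    # runs once; what is disabled is the retrying, not the call itself.
    return max(1, configured)

assert normalize_attempts(0) == 1  # disabled: single attempt, no retries
assert normalize_attempts(3) == 3
```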

Otherwise the settings split (constants → settings → sdk) is clean, and the test coverage sounds solid.

Friday, AI assistant to @RichardAtCT


Copilot AI left a comment


Pull request overview

Adds configurable exponential-backoff retry behavior around Claude SDK command execution to better handle transient (non-MCP) connection failures without immediately surfacing errors to users.

Changes:

  • Introduces default retry/backoff constants (attempts/base delay/factor/max delay).
  • Adds corresponding Settings fields to configure retry behavior via environment/config.
  • Wraps ClaudeSDKManager.execute_command() execution in a retry loop gated by a new _is_retryable_error() helper.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File                           Description
src/utils/constants.py         Adds default values for retry/backoff configuration.
src/config/settings.py         Exposes retry/backoff configuration via Pydantic settings.
src/claude/sdk_integration.py  Implements retry decision helper and retry loop around asyncio.wait_for() for transient CLIConnectionError.


Comment on lines +302 to +337
max_attempts = max(1, self.config.claude_retry_max_attempts)
last_exc: Optional[BaseException] = None

for attempt in range(max_attempts):
    if attempt > 0:
        delay = min(
            self.config.claude_retry_base_delay
            * (self.config.claude_retry_backoff_factor ** (attempt - 1)),
            self.config.claude_retry_max_delay,
        )
        logger.warning(
            "Retrying Claude SDK command",
            attempt=attempt + 1,
            max_attempts=max_attempts,
            delay_seconds=delay,
        )
        await asyncio.sleep(delay)
    try:
        await asyncio.wait_for(
            _run_client(),
            timeout=self.config.claude_timeout_seconds,
        )
        break  # success — exit retry loop
    except CLIConnectionError as exc:
        if self._is_retryable_error(exc) and attempt < max_attempts - 1:
            last_exc = exc
            logger.warning(
                "Transient connection error, will retry",
                attempt=attempt + 1,
                error=str(exc),
            )
            continue
        raise  # non-retryable or attempts exhausted
else:
    if last_exc is not None:
        raise last_exc

Copilot AI Mar 5, 2026


The retry loop’s for ... else block is effectively unreachable: max_attempts is forced to be >= 1, and on the final failed attempt the code raises inside the except, so the loop will never fall through to else. This also makes last_exc unnecessary and potentially misleading. Consider simplifying by removing last_exc/else, or restructuring to only raise after the loop completes.

Comment on lines +131 to +142
    description="Max retry attempts for transient SDK errors (0 = disabled)",
)
claude_retry_base_delay: float = Field(
    DEFAULT_RETRY_BASE_DELAY, description="Base delay in seconds between retries"
)
claude_retry_backoff_factor: float = Field(
    DEFAULT_RETRY_BACKOFF_FACTOR, description="Exponential backoff multiplier"
)
claude_retry_max_delay: float = Field(
    DEFAULT_RETRY_MAX_DELAY, description="Maximum delay cap in seconds"
)


Copilot AI Mar 5, 2026


These retry-related settings are user-configurable but currently have no bounds validation. Negative values (e.g., base_delay/max_delay) can lead to runtime failures (e.g., asyncio.sleep() with a negative delay) or unexpected backoff behavior. Add appropriate ge/gt constraints (and potentially cross-field validation like max_delay >= base_delay) for these fields.

Suggested change
    ge=0,
    description="Max retry attempts for transient SDK errors (0 = disabled)",
)
claude_retry_base_delay: float = Field(
    DEFAULT_RETRY_BASE_DELAY,
    ge=0,
    description="Base delay in seconds between retries",
)
claude_retry_backoff_factor: float = Field(
    DEFAULT_RETRY_BACKOFF_FACTOR,
    gt=0,
    description="Exponential backoff multiplier",
)
claude_retry_max_delay: float = Field(
    DEFAULT_RETRY_MAX_DELAY,
    ge=0,
    description="Maximum delay cap in seconds",
)

@model_validator(mode="after")
def validate_retry_delays(self) -> "Settings":
    """Ensure retry delay configuration is internally consistent."""
    if self.claude_retry_max_delay < self.claude_retry_base_delay:
        raise ValueError(
            "claude_retry_max_delay must be greater than or equal to "
            "claude_retry_base_delay"
        )
    return self

Comment on lines +149 to +157
def _is_retryable_error(self, exc: BaseException) -> bool:
    """Return True for transient errors that warrant a retry.

    asyncio.TimeoutError is intentional (user-configured timeout) — not retried.
    Only non-MCP CLIConnectionError is considered transient.
    """
    if isinstance(exc, CLIConnectionError):
        msg = str(exc).lower()
        return "mcp" not in msg and "server" not in msg
    return False

Copilot AI Mar 5, 2026


_is_retryable_error() determines MCP vs non-MCP by substring matching on str(exc), which is brittle and duplicates the MCP-detection logic used later when translating CLIConnectionError into ClaudeMCPError. To reduce the chance of misclassification and keep behavior consistent, consider centralizing this classification (single helper used for both retry decision and final exception mapping), or using structured attributes from CLIConnectionError if available.

Comment on lines +301 to +334
# Execute with timeout, retrying on transient CLIConnectionError
max_attempts = max(1, self.config.claude_retry_max_attempts)
last_exc: Optional[BaseException] = None

for attempt in range(max_attempts):
    if attempt > 0:
        delay = min(
            self.config.claude_retry_base_delay
            * (self.config.claude_retry_backoff_factor ** (attempt - 1)),
            self.config.claude_retry_max_delay,
        )
        logger.warning(
            "Retrying Claude SDK command",
            attempt=attempt + 1,
            max_attempts=max_attempts,
            delay_seconds=delay,
        )
        await asyncio.sleep(delay)
    try:
        await asyncio.wait_for(
            _run_client(),
            timeout=self.config.claude_timeout_seconds,
        )
        break  # success — exit retry loop
    except CLIConnectionError as exc:
        if self._is_retryable_error(exc) and attempt < max_attempts - 1:
            last_exc = exc
            logger.warning(
                "Transient connection error, will retry",
                attempt=attempt + 1,
                error=str(exc),
            )
            continue
        raise  # non-retryable or attempts exhausted

Copilot AI Mar 5, 2026


There are existing unit tests for ClaudeSDKManager.execute_command(), but no tests cover the new retry behavior (e.g., a transient non-MCP CLIConnectionError that succeeds on a subsequent attempt, and that claude_retry_max_attempts=0/1 results in no retries). Adding tests here would help prevent regressions in retry/backoff and logging behavior.
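The kind of test being asked for might look like this sketch; the retry helper here is a simplified stand-in, not the project's real execute_command() API:

```python
import asyncio

class CLIConnectionError(Exception):
    """Stand-in for the SDK's connection error."""

async def execute_with_retry(call, max_attempts: int) -> str:
    # Simplified mirror of the PR's loop: retry until attempts run out.
    for attempt in range(max(1, max_attempts)):
        try:
            return await call()
        except CLIConnectionError:
            if attempt >= max_attempts - 1:
                raise
    raise AssertionError("unreachable")

async def main() -> None:
    calls = 0

    async def flaky() -> str:
        nonlocal calls
        calls += 1
        if calls == 1:
            raise CLIConnectionError("transient")
        return "ok"

    assert await execute_with_retry(flaky, max_attempts=3) == "ok"
    assert calls == 2  # failed once, succeeded on the first retry

asyncio.run(main())
print("retry behaviour test passed")
```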

@FridayOpenClawBot

PR Review
Reviewed head: 4cf8e56f1b7733d12c957c31d4af2a4f13852e7a

Summary

  • Adds configurable exponential backoff retry to execute_command() for transient CLIConnectionError (non-MCP)
  • Retry is gated by _is_retryable_error() which excludes MCP errors and asyncio.TimeoutError
  • Config via 4 new settings fields; defaults: 3 attempts, 1s base, ×3 backoff, 30s cap

What looks good

  • Clean separation of _is_retryable_error() — easy to extend later without touching the loop
  • MCP errors correctly excluded; timeout correctly not retried
  • Good test coverage: transient success, MCP no-retry, disabled (0 attempts), and the helper unit tests

Issues / questions

  1. [Important] sdk_integration.py — _run_client() is defined inside execute_command() and closes over mutable state (e.g. result_messages, session objects). On retry, a fresh _run_client() coroutine is created each iteration, but verify that any shared closure variables are properly reset between attempts. If result_messages or similar accumulate across retries, you may get doubled output or incorrect cost attribution.

  2. [Important] sdk_integration.py — The retry loop catches CLIConnectionError but re-raises on the last attempt. However, if asyncio.wait_for raises asyncio.TimeoutError mid-retry-loop (not caught by the except CLIConnectionError clause), it will propagate immediately — which is the intended behaviour. Just worth a comment in the code to make the intent explicit, since the mixed exception types in the loop aren't obvious.

  3. [Nit] settings.py — claude_retry_backoff_factor uses gt=0 (correct — can't be zero), but claude_retry_base_delay and claude_retry_max_delay use ge=0 which permits 0. A delay of 0 is technically valid (instant retry) but worth a docstring note so operators know what they're opting into.

Suggested tests (if needed)

  • A test that verifies closure variables (e.g. result collection) don't accumulate across retry attempts — would confirm the inner _run_client() is stateless per call.
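The closure hazard described in point 1 can be reproduced in isolation (names hypothetical, mirroring the shape of the PR's inner coroutine):

```python
import asyncio

async def demo() -> list[str]:
    messages: list[str] = []  # shared accumulator, captured by the closure

    async def _run_client() -> None:
        # Without this clear(), three attempts would leave three chunks.
        messages.clear()
        messages.append("chunk")

    for _ in range(3):  # simulate three retry attempts
        await _run_client()
    return messages

print(asyncio.run(demo()))  # ['chunk'] -- reset per attempt, no accumulation
```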

Verdict
⚠️ Merge after fixes — the closure state concern (#1) is worth a quick audit before shipping; the rest is straightforward.

Friday, AI assistant to @RichardAtCT

@FridayOpenClawBot

PR Review — Updated head: 0896eac6cc455186a17f1ca588d1200d407d74c4

Previous review raised three main concerns — here's how they landed:

✅ Closure variable reset (#1 — was Important)
messages.clear() at the top of each iteration is the right fix. The late-binding closure means every _run_client() call now starts with an empty accumulator. Good.

✅ Timeout comment (#2 — was Important)
The comment above the try block clearly documents that asyncio.TimeoutError bypasses the retry loop intentionally. That's exactly what was needed.

✅ MCP filter tightened (#3 from first review)
Dropping "server" from the filter was the right call — "mcp" not in msg is precise enough for the use case without the false-positive risk.

ge=0 semantics documented
Both claude_retry_base_delay and claude_retry_max_delay now have docstring notes explaining what 0 means. Clear.

One minor observation
test_retry_disabled_when_max_attempts_zero tests the max(1, ...) normalization in isolation rather than as an integration test. It passes, but it's asserting on the expression itself rather than on observable behaviour (i.e. that a command with max_attempts=0 only calls the SDK once). Not a blocker — just worth knowing the test is more of a unit assertion on a formula than a behaviour test.

Verdict
✅ LGTM — all the flagged issues are addressed cleanly. Good work iterating on this.

Friday, AI assistant to @RichardAtCT


Development

Successfully merging this pull request may close these issues.

Add retry logic for transient network errors in Claude SDK calls

3 participants