feat: add exponential backoff retry for transient SDK errors#127

Open
haripatel07 wants to merge 3 commits into RichardAtCT:main from haripatel07:feat/retry-logic-sdk-calls

Conversation

@haripatel07
Contributor

Closes #60 - adds configurable retry logic to ClaudeSDKManager.execute_command() for transient CLIConnectionError failures (non-MCP).

Changes:

  • src/utils/constants.py: 4 new retry default constants
  • src/config/settings.py: 4 new settings fields (claude_retry_max_attempts, claude_retry_base_delay, claude_retry_backoff_factor, claude_retry_max_delay)
  • src/claude/sdk_integration.py: _is_retryable_error() helper + retry loop wrapping asyncio.wait_for() in execute_command()

Retry decision:

  • CLIConnectionError (non-MCP): retried with exponential backoff
  • asyncio.TimeoutError: not retried (user-configured timeout, intentional)
  • CLINotFoundError, ProcessError, CLIJSONDecodeError: not retried

Default backoff: 1s → 3s → 9s, capped at 30s (CLAUDE_RETRY_MAX_ATTEMPTS=0 disables)
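For illustration, the delay schedule implied by those defaults can be sketched as follows (a standalone sketch, not code from this PR; the function and parameter names are hypothetical):

```python
# Sketch of the exponential backoff schedule described above.
# base=1s, factor=3, cap=30s match the stated defaults; the function
# itself is illustrative only.
def backoff_delays(attempts: int, base: float = 1.0,
                   factor: float = 3.0, cap: float = 30.0) -> list[float]:
    """Delay slept before each retry (the first attempt has no delay)."""
    return [min(base * factor ** (n - 1), cap) for n in range(1, attempts)]

print(backoff_delays(4))  # [1.0, 3.0, 9.0]
print(backoff_delays(6))  # [1.0, 3.0, 9.0, 27.0, 30.0] -- capped at 30s
```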

Tests: 491 passed, 0 failed

Copilot AI review requested due to automatic review settings March 5, 2026 04:33
@FridayOpenClawBot

Good addition — retry logic for transient CLIConnectionError is a real pain point and the overall structure is clean. A few things worth looking at:

🐛 Dead code in the retry loop

The for...else + last_exc pattern doesn't work as intended. On the final attempt, when attempt < max_attempts - 1 is False, the bare raise fires, so the loop exits via the raise rather than by completing normally. Python's for...else runs the else block only when the loop finishes without a break, and an exception propagating out of the loop skips it as well. In practice, last_exc is stored but never used and the else block is unreachable. Behaviour is still correct (the bare raise re-raises exc), but the dead code is misleading. Simplest fix: drop the else block and the last_exc tracking entirely, or restructure to re-raise explicitly after the loop.
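A minimal repro of the control-flow point (standalone sketch, not the PR's code):

```python
# `else` on a for-loop runs only when the loop finishes without `break`;
# an exception propagating out of the loop skips the else block too.
def demo() -> str:
    try:
        for attempt in range(3):
            if attempt == 2:
                raise RuntimeError("exhausted")  # loop exits via raise
        else:
            return "else ran"  # never reached when the raise above fires
    except RuntimeError:
        return "raised"

print(demo())  # raised
```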

⚠️ Overly broad MCP filter in _is_retryable_error

return "mcp" not in msg and "server" not in msg

Filtering on "server" is too aggressive — legitimate transient CLIConnectionErrors (e.g. "failed to connect to server") could contain that word and get silently non-retried. The MCP distinction was presumably to avoid retrying when an MCP server is down (which is user-controlled, not transient). A tighter check — e.g. "mcp server" as a phrase, or checking the exception type/subclass — would be safer.
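A sketch of the tighter phrase check suggested above (the helper name and message strings are hypothetical):

```python
# Match the phrase "mcp server" rather than the bare word "server", so
# ordinary transient connection messages are still treated as retryable.
def is_mcp_error(exc: BaseException) -> bool:
    return "mcp server" in str(exc).lower()

print(is_mcp_error(ConnectionError("failed to connect to server")))  # False
print(is_mcp_error(ConnectionError("MCP server 'github' is down")))  # True
```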

ℹ️ max(1, ...) silently overrides 0 = disabled

max(1, self.config.claude_retry_max_attempts) converts 0 to 1, so there is always at least one attempt. This is semantically correct: setting 0 gives a single attempt, i.e. no retries, which matches the constant comment that CLAUDE_RETRY_MAX_ATTEMPTS=0 disables retries. But the max_attempts variable name suggests the whole call could be skipped at 0. Minor: worth a comment clarifying this.
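To make those semantics concrete (a sketch of the normalization only, not the PR's code; the function name is a stand-in):

```python
def normalize_attempts(configured: int) -> int:
    # 0 ("retries disabled") still yields one attempt: the command always
    # runs once; what is disabled is the retrying, not the call itself.
    return max(1, configured)

assert normalize_attempts(0) == 1  # disabled: single attempt, no retries
assert normalize_attempts(3) == 3
```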

Otherwise the settings split (constants → settings → sdk) is clean, and the test coverage sounds solid.

Friday, AI assistant to @RichardAtCT


Copilot AI left a comment


Pull request overview

Adds configurable exponential-backoff retry behavior around Claude SDK command execution to better handle transient (non-MCP) connection failures without immediately surfacing errors to users.

Changes:

  • Introduces default retry/backoff constants (attempts/base delay/factor/max delay).
  • Adds corresponding Settings fields to configure retry behavior via environment/config.
  • Wraps ClaudeSDKManager.execute_command() execution in a retry loop gated by a new _is_retryable_error() helper.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File                           Description
src/utils/constants.py         Adds default values for retry/backoff configuration.
src/config/settings.py         Exposes retry/backoff configuration via Pydantic settings.
src/claude/sdk_integration.py  Implements retry decision helper and retry loop around asyncio.wait_for() for transient CLIConnectionError.


Comment on lines +302 to +337
max_attempts = max(1, self.config.claude_retry_max_attempts)
last_exc: Optional[BaseException] = None

for attempt in range(max_attempts):
    if attempt > 0:
        delay = min(
            self.config.claude_retry_base_delay
            * (self.config.claude_retry_backoff_factor ** (attempt - 1)),
            self.config.claude_retry_max_delay,
        )
        logger.warning(
            "Retrying Claude SDK command",
            attempt=attempt + 1,
            max_attempts=max_attempts,
            delay_seconds=delay,
        )
        await asyncio.sleep(delay)
    try:
        await asyncio.wait_for(
            _run_client(),
            timeout=self.config.claude_timeout_seconds,
        )
        break  # success — exit retry loop
    except CLIConnectionError as exc:
        if self._is_retryable_error(exc) and attempt < max_attempts - 1:
            last_exc = exc
            logger.warning(
                "Transient connection error, will retry",
                attempt=attempt + 1,
                error=str(exc),
            )
            continue
        raise  # non-retryable or attempts exhausted
else:
    if last_exc is not None:
        raise last_exc

Copilot AI Mar 5, 2026


The retry loop’s for ... else block is effectively unreachable: max_attempts is forced to be >= 1, and on the final failed attempt the code raises inside the except, so the loop will never fall through to else. This also makes last_exc unnecessary and potentially misleading. Consider simplifying by removing last_exc/else, or restructuring to only raise after the loop completes.

Comment on lines +131 to +142
    description="Max retry attempts for transient SDK errors (0 = disabled)",
)
claude_retry_base_delay: float = Field(
    DEFAULT_RETRY_BASE_DELAY, description="Base delay in seconds between retries"
)
claude_retry_backoff_factor: float = Field(
    DEFAULT_RETRY_BACKOFF_FACTOR, description="Exponential backoff multiplier"
)
claude_retry_max_delay: float = Field(
    DEFAULT_RETRY_MAX_DELAY, description="Maximum delay cap in seconds"
)


Copilot AI Mar 5, 2026


These retry-related settings are user-configurable but currently have no bounds validation. Negative values (e.g., base_delay/max_delay) can lead to runtime failures (e.g., asyncio.sleep() with a negative delay) or unexpected backoff behavior. Add appropriate ge/gt constraints (and potentially cross-field validation like max_delay >= base_delay) for these fields.

Suggested change
    ge=0,
    description="Max retry attempts for transient SDK errors (0 = disabled)",
)
claude_retry_base_delay: float = Field(
    DEFAULT_RETRY_BASE_DELAY,
    ge=0,
    description="Base delay in seconds between retries",
)
claude_retry_backoff_factor: float = Field(
    DEFAULT_RETRY_BACKOFF_FACTOR,
    gt=0,
    description="Exponential backoff multiplier",
)
claude_retry_max_delay: float = Field(
    DEFAULT_RETRY_MAX_DELAY,
    ge=0,
    description="Maximum delay cap in seconds",
)

@model_validator(mode="after")
def validate_retry_delays(self) -> "Settings":
    """Ensure retry delay configuration is internally consistent."""
    if self.claude_retry_max_delay < self.claude_retry_base_delay:
        raise ValueError(
            "claude_retry_max_delay must be greater than or equal to "
            "claude_retry_base_delay"
        )
    return self

Comment on lines +149 to +157
def _is_retryable_error(self, exc: BaseException) -> bool:
    """Return True for transient errors that warrant a retry.

    asyncio.TimeoutError is intentional (user-configured timeout) — not retried.
    Only non-MCP CLIConnectionError is considered transient.
    """
    if isinstance(exc, CLIConnectionError):
        msg = str(exc).lower()
        return "mcp" not in msg and "server" not in msg
    return False

Copilot AI Mar 5, 2026


_is_retryable_error() determines MCP vs non-MCP by substring matching on str(exc), which is brittle and duplicates the MCP-detection logic used later when translating CLIConnectionError into ClaudeMCPError. To reduce the chance of misclassification and keep behavior consistent, consider centralizing this classification (single helper used for both retry decision and final exception mapping), or using structured attributes from CLIConnectionError if available.

Comment on lines +301 to +334
# Execute with timeout, retrying on transient CLIConnectionError
max_attempts = max(1, self.config.claude_retry_max_attempts)
last_exc: Optional[BaseException] = None

for attempt in range(max_attempts):
    if attempt > 0:
        delay = min(
            self.config.claude_retry_base_delay
            * (self.config.claude_retry_backoff_factor ** (attempt - 1)),
            self.config.claude_retry_max_delay,
        )
        logger.warning(
            "Retrying Claude SDK command",
            attempt=attempt + 1,
            max_attempts=max_attempts,
            delay_seconds=delay,
        )
        await asyncio.sleep(delay)
    try:
        await asyncio.wait_for(
            _run_client(),
            timeout=self.config.claude_timeout_seconds,
        )
        break  # success — exit retry loop
    except CLIConnectionError as exc:
        if self._is_retryable_error(exc) and attempt < max_attempts - 1:
            last_exc = exc
            logger.warning(
                "Transient connection error, will retry",
                attempt=attempt + 1,
                error=str(exc),
            )
            continue
        raise  # non-retryable or attempts exhausted

Copilot AI Mar 5, 2026


There are existing unit tests for ClaudeSDKManager.execute_command(), but no tests cover the new retry behavior (e.g., a transient non-MCP CLIConnectionError that succeeds on a subsequent attempt, and that claude_retry_max_attempts=0/1 results in no retries). Adding tests here would help prevent regressions in retry/backoff and logging behavior.
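The kind of test being asked for might look like this sketch; the retry helper here is a simplified stand-in, not the project's real execute_command() API:

```python
import asyncio

class CLIConnectionError(Exception):
    """Stand-in for the SDK's connection error."""

async def execute_with_retry(call, max_attempts: int) -> str:
    # Simplified mirror of the PR's loop: retry until attempts run out.
    for attempt in range(max(1, max_attempts)):
        try:
            return await call()
        except CLIConnectionError:
            if attempt >= max_attempts - 1:
                raise
    raise AssertionError("unreachable")

async def main() -> None:
    calls = 0

    async def flaky() -> str:
        nonlocal calls
        calls += 1
        if calls == 1:
            raise CLIConnectionError("transient")
        return "ok"

    assert await execute_with_retry(flaky, max_attempts=3) == "ok"
    assert calls == 2  # failed once, succeeded on the first retry

asyncio.run(main())
print("retry behaviour test passed")
```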

@FridayOpenClawBot

PR Review
Reviewed head: 4cf8e56f1b7733d12c957c31d4af2a4f13852e7a

Summary

  • Adds configurable exponential backoff retry to execute_command() for transient CLIConnectionError (non-MCP)
  • Retry is gated by _is_retryable_error() which excludes MCP errors and asyncio.TimeoutError
  • Config via 4 new settings fields; defaults: 3 attempts, 1s base, ×3 backoff, 30s cap

What looks good

  • Clean separation of _is_retryable_error() — easy to extend later without touching the loop
  • MCP errors correctly excluded; timeout correctly not retried
  • Good test coverage: transient success, MCP no-retry, disabled (0 attempts), and the helper unit tests

Issues / questions

  1. [Important] sdk_integration.py — _run_client() is defined inside execute_command() and closes over mutable state (e.g. result_messages, session objects). On retry, a fresh _run_client() coroutine is created each iteration, but verify that any shared closure variables are properly reset between attempts. If result_messages or similar accumulate across retries, you may get doubled output or incorrect cost attribution.

  2. [Important] sdk_integration.py — The retry loop catches CLIConnectionError but re-raises on the last attempt. However, if asyncio.wait_for raises asyncio.TimeoutError mid-retry-loop (not caught by the except CLIConnectionError clause), it will propagate immediately — which is the intended behaviour. Just worth a comment in the code to make the intent explicit, since the mixed exception types in the loop aren't obvious.

  3. [Nit] settings.py — claude_retry_backoff_factor uses gt=0 (correct — can't be zero), but claude_retry_base_delay and claude_retry_max_delay use ge=0 which permits 0. A delay of 0 is technically valid (instant retry) but worth a docstring note so operators know what they're opting into.

Suggested tests (if needed)

  • A test that verifies closure variables (e.g. result collection) don't accumulate across retry attempts — would confirm the inner _run_client() is stateless per call.
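The closure hazard described in point 1 can be reproduced in isolation (names hypothetical, mirroring the shape of the PR's inner coroutine):

```python
import asyncio

async def demo() -> list[str]:
    messages: list[str] = []  # shared accumulator, captured by the closure

    async def _run_client() -> None:
        # Without this clear(), three attempts would leave three chunks.
        messages.clear()
        messages.append("chunk")

    for _ in range(3):  # simulate three retry attempts
        await _run_client()
    return messages

print(asyncio.run(demo()))  # ['chunk'] -- reset per attempt, no accumulation
```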

Verdict
⚠️ Merge after fixes — the closure state concern (#1) is worth a quick audit before shipping; the rest is straightforward.

Friday, AI assistant to @RichardAtCT

@FridayOpenClawBot

PR Review — Updated head: 0896eac6cc455186a17f1ca588d1200d407d74c4

Previous review raised three main concerns — here's how they landed:

✅ Closure variable reset (#1 — was Important)
messages.clear() at the top of each iteration is the right fix. The late-binding closure means every _run_client() call now starts with an empty accumulator. Good.

✅ Timeout comment (#2 — was Important)
The comment above the try block clearly documents that asyncio.TimeoutError bypasses the retry loop intentionally. That's exactly what was needed.

✅ MCP filter tightened (#3 from first review)
Dropping "server" from the filter was the right call — "mcp" not in msg is precise enough for the use case without the false-positive risk.

ge=0 semantics documented
Both claude_retry_base_delay and claude_retry_max_delay now have docstring notes explaining what 0 means. Clear.

One minor observation
test_retry_disabled_when_max_attempts_zero tests the max(1, ...) normalization in isolation rather than as an integration test. It passes, but it's asserting on the expression itself rather than on observable behaviour (i.e. that a command with max_attempts=0 only calls the SDK once). Not a blocker — just worth knowing the test is more of a unit assertion on a formula than a behaviour test.

Verdict
✅ LGTM — all the flagged issues are addressed cleanly. Good work iterating on this.

Friday, AI assistant to @RichardAtCT


Development

Successfully merging this pull request may close these issues.

Add retry logic for transient network errors in Claude SDK calls

3 participants