feat: add exponential backoff retry for transient SDK errors #127
haripatel07 wants to merge 3 commits into RichardAtCT:main
Conversation
Closes RichardAtCT#60 - adds configurable retry logic to `ClaudeSDKManager.execute_command()` for transient `CLIConnectionError` failures (non-MCP).

Changes:
- `src/utils/constants.py`: 4 new retry default constants
- `src/config/settings.py`: 4 new settings fields (`claude_retry_max_attempts`, `claude_retry_base_delay`, `claude_retry_backoff_factor`, `claude_retry_max_delay`)
- `src/claude/sdk_integration.py`: `_is_retryable_error()` helper + retry loop wrapping `asyncio.wait_for()` in `execute_command()`

Retry decision:
- `CLIConnectionError` (non-MCP): retried with exponential backoff
- `asyncio.TimeoutError`: not retried (user-configured timeout, intentional)
- `CLINotFoundError`, `ProcessError`, `CLIJSONDecodeError`: not retried

Default backoff: 1s → 3s → 9s, capped at 30s (`CLAUDE_RETRY_MAX_ATTEMPTS=0` disables)

Tests: 491 passed, 0 failed
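The documented backoff schedule is easy to sanity-check with a small sketch. The defaults (base delay 1s, factor 3, cap 30s) come from this PR's description; the function name itself is illustrative, not the PR's actual code:

```python
def backoff_delay(attempt: int, base_delay: float = 1.0,
                  backoff_factor: float = 3.0, max_delay: float = 30.0) -> float:
    """Delay before retry number `attempt` (1-based), capped at max_delay."""
    return min(base_delay * (backoff_factor ** (attempt - 1)), max_delay)

# Matches the documented 1s -> 3s -> 9s progression, then the 30s cap kicks in.
print([backoff_delay(a) for a in range(1, 6)])  # [1.0, 3.0, 9.0, 27.0, 30.0]
```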
Good addition — retry logic for transient failures is worth having. Two notes:

🐛 Dead code in the retry loop: the `for ... else` branch can never be reached, because the final failed attempt raises inside the `except` block.

ℹ️ `return "mcp" not in msg and "server" not in msg` — filtering on message substrings is brittle; "server" in particular can appear in non-MCP connection errors.

Otherwise the settings split (constants → settings → sdk) is clean, and the test coverage sounds solid. — Friday, AI assistant to @RichardAtCT
Pull request overview
Adds configurable exponential-backoff retry behavior around Claude SDK command execution to better handle transient (non-MCP) connection failures without immediately surfacing errors to users.
Changes:
- Introduces default retry/backoff constants (attempts/base delay/factor/max delay).
- Adds corresponding `Settings` fields to configure retry behavior via environment/config.
- Wraps `ClaudeSDKManager.execute_command()` execution in a retry loop gated by a new `_is_retryable_error()` helper.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| src/utils/constants.py | Adds default values for retry/backoff configuration. |
| src/config/settings.py | Exposes retry/backoff configuration via Pydantic settings. |
| src/claude/sdk_integration.py | Implements retry decision helper and retry loop around asyncio.wait_for() for transient CLIConnectionError. |
src/claude/sdk_integration.py
Outdated
```python
max_attempts = max(1, self.config.claude_retry_max_attempts)
last_exc: Optional[BaseException] = None

for attempt in range(max_attempts):
    if attempt > 0:
        delay = min(
            self.config.claude_retry_base_delay
            * (self.config.claude_retry_backoff_factor ** (attempt - 1)),
            self.config.claude_retry_max_delay,
        )
        logger.warning(
            "Retrying Claude SDK command",
            attempt=attempt + 1,
            max_attempts=max_attempts,
            delay_seconds=delay,
        )
        await asyncio.sleep(delay)
    try:
        await asyncio.wait_for(
            _run_client(),
            timeout=self.config.claude_timeout_seconds,
        )
        break  # success — exit retry loop
    except CLIConnectionError as exc:
        if self._is_retryable_error(exc) and attempt < max_attempts - 1:
            last_exc = exc
            logger.warning(
                "Transient connection error, will retry",
                attempt=attempt + 1,
                error=str(exc),
            )
            continue
        raise  # non-retryable or attempts exhausted
else:
    if last_exc is not None:
        raise last_exc
```
The retry loop’s for ... else block is effectively unreachable: max_attempts is forced to be >= 1, and on the final failed attempt the code raises inside the except, so the loop will never fall through to else. This also makes last_exc unnecessary and potentially misleading. Consider simplifying by removing last_exc/else, or restructuring to only raise after the loop completes.
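The suggested restructuring can be sketched generically. This is a minimal, runnable illustration of the shape the reviewer proposes (raise directly in the `except` on the last attempt, no `last_exc` bookkeeping, no `for ... else`); `TransientError`, `execute_with_retry`, and `flaky` are invented stand-ins, not the PR's actual symbols:

```python
import asyncio

class TransientError(Exception):
    """Stand-in for a retryable, non-MCP CLIConnectionError."""

async def execute_with_retry(run, max_attempts: int, base_delay: float,
                             factor: float, max_delay: float):
    # No for/else and no last_exc: the final failed attempt re-raises
    # inside the except block, so the loop can never fall through.
    for attempt in range(max_attempts):
        if attempt > 0:
            delay = min(base_delay * factor ** (attempt - 1), max_delay)
            await asyncio.sleep(delay)
        try:
            return await run()
        except TransientError:
            if attempt >= max_attempts - 1:
                raise  # attempts exhausted

calls = {"n": 0}

async def flaky():
    # Fails twice, then succeeds, to exercise the retry path.
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("connection dropped")
    return "ok"

print(asyncio.run(execute_with_retry(flaky, 3, 0.01, 3.0, 0.05)))  # prints: ok
```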
```python
    description="Max retry attempts for transient SDK errors (0 = disabled)",
)
claude_retry_base_delay: float = Field(
    DEFAULT_RETRY_BASE_DELAY, description="Base delay in seconds between retries"
)
claude_retry_backoff_factor: float = Field(
    DEFAULT_RETRY_BACKOFF_FACTOR, description="Exponential backoff multiplier"
)
claude_retry_max_delay: float = Field(
    DEFAULT_RETRY_MAX_DELAY, description="Maximum delay cap in seconds"
)
```
These retry-related settings are user-configurable but currently have no bounds validation. Negative values (e.g., base_delay/max_delay) can lead to runtime failures (e.g., asyncio.sleep() with a negative delay) or unexpected backoff behavior. Add appropriate ge/gt constraints (and potentially cross-field validation like max_delay >= base_delay) for these fields.
Suggested change (adds `ge`/`gt` constraints and a cross-field validator):

```python
    ge=0,
    description="Max retry attempts for transient SDK errors (0 = disabled)",
)
claude_retry_base_delay: float = Field(
    DEFAULT_RETRY_BASE_DELAY,
    ge=0,
    description="Base delay in seconds between retries",
)
claude_retry_backoff_factor: float = Field(
    DEFAULT_RETRY_BACKOFF_FACTOR,
    gt=0,
    description="Exponential backoff multiplier",
)
claude_retry_max_delay: float = Field(
    DEFAULT_RETRY_MAX_DELAY,
    ge=0,
    description="Maximum delay cap in seconds",
)

@model_validator(mode="after")
def validate_retry_delays(self) -> "Settings":
    """Ensure retry delay configuration is internally consistent."""
    if self.claude_retry_max_delay < self.claude_retry_base_delay:
        raise ValueError(
            "claude_retry_max_delay must be greater than or equal to "
            "claude_retry_base_delay"
        )
    return self
```
```python
def _is_retryable_error(self, exc: BaseException) -> bool:
    """Return True for transient errors that warrant a retry.

    asyncio.TimeoutError is intentional (user-configured timeout) — not retried.
    Only non-MCP CLIConnectionError is considered transient.
    """
    if isinstance(exc, CLIConnectionError):
        msg = str(exc).lower()
        return "mcp" not in msg and "server" not in msg
    return False
```
_is_retryable_error() determines MCP vs non-MCP by substring matching on str(exc), which is brittle and duplicates the MCP-detection logic used later when translating CLIConnectionError into ClaudeMCPError. To reduce the chance of misclassification and keep behavior consistent, consider centralizing this classification (single helper used for both retry decision and final exception mapping), or using structured attributes from CLIConnectionError if available.
```python
# Execute with timeout, retrying on transient CLIConnectionError
max_attempts = max(1, self.config.claude_retry_max_attempts)
last_exc: Optional[BaseException] = None

for attempt in range(max_attempts):
    if attempt > 0:
        delay = min(
            self.config.claude_retry_base_delay
            * (self.config.claude_retry_backoff_factor ** (attempt - 1)),
            self.config.claude_retry_max_delay,
        )
        logger.warning(
            "Retrying Claude SDK command",
            attempt=attempt + 1,
            max_attempts=max_attempts,
            delay_seconds=delay,
        )
        await asyncio.sleep(delay)
    try:
        await asyncio.wait_for(
            _run_client(),
            timeout=self.config.claude_timeout_seconds,
        )
        break  # success — exit retry loop
    except CLIConnectionError as exc:
        if self._is_retryable_error(exc) and attempt < max_attempts - 1:
            last_exc = exc
            logger.warning(
                "Transient connection error, will retry",
                attempt=attempt + 1,
                error=str(exc),
            )
            continue
        raise  # non-retryable or attempts exhausted
```
There are existing unit tests for ClaudeSDKManager.execute_command(), but no tests cover the new retry behavior (e.g., a transient non-MCP CLIConnectionError that succeeds on a subsequent attempt, and that claude_retry_max_attempts=0/1 results in no retries). Adding tests here would help prevent regressions in retry/backoff and logging behavior.
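The missing tests could look roughly like the following. This is a self-contained sketch against a minimal stand-in for the retry loop (the real tests would mock `_run_client()` on `ClaudeSDKManager`); `run_with_retry` and the test names are invented for illustration:

```python
import asyncio

class CLIConnectionError(Exception):
    """Stand-in for the SDK exception."""

async def run_with_retry(run, max_attempts: int):
    # Minimal stand-in for execute_command()'s retry loop (delays omitted).
    for attempt in range(max_attempts):
        try:
            return await run()
        except CLIConnectionError:
            if attempt >= max_attempts - 1:
                raise

def test_transient_error_then_success():
    attempts = []
    async def flaky():
        attempts.append(1)
        if len(attempts) < 2:
            raise CLIConnectionError("transient")
        return "done"
    assert asyncio.run(run_with_retry(flaky, 3)) == "done"
    assert len(attempts) == 2  # exactly one retry

def test_single_attempt_does_not_retry():
    attempts = []
    async def always_fails():
        attempts.append(1)
        raise CLIConnectionError("transient")
    try:
        asyncio.run(run_with_retry(always_fails, 1))
        raise AssertionError("expected CLIConnectionError")
    except CLIConnectionError:
        pass
    assert len(attempts) == 1  # no retries when max_attempts is 1

test_transient_error_then_success()
test_single_attempt_does_not_retry()
print("tests passed")
```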
…d validation, retry tests
PR Review Summary

What looks good

Issues / questions

Suggested tests (if needed)

Verdict

— Friday, AI assistant to @RichardAtCT
PR Review — Updated head

Previous review raised three main concerns — here's how they landed:

- ✅ Closure variable reset (#1 — was Important)
- ✅ Timeout comment (#2 — was Important)
- ✅ MCP filter tightened (#3 from first review)
- ✅ One minor observation

Verdict

— Friday, AI assistant to @RichardAtCT