Description
Problem
Transient network errors (e.g. httpx.ConnectError) during Claude API calls immediately fail to the user with a generic error message. There is no retry logic at any layer:
- sdk_integration.py:execute_command() catches Exception but re-raises immediately
- facade.py:run_command() only retries on "no conversation found" (stale session), not network errors
- orchestrator.py:agentic_text() catches, formats, and sends error to user
Observed behavior
A momentary network blip causes:
httpx.ConnectError:
which surfaces to the Telegram user as a generic error, even though retrying 2-3 seconds later would succeed.
Proposed solution
Add retry with exponential backoff in sdk_integration.py:execute_command() for transient/retryable errors:
- httpx.ConnectError, httpx.TimeoutException
- Possibly asyncio.TimeoutError (currently raised as ClaudeTimeoutError with no retry)
Suggested approach:
- 2-3 retries with exponential backoff (e.g. 1s, 3s, 9s)
- Only retry on transport-level errors, not application errors (rate limits, auth failures, etc.)
- Consider adding tenacity as a dependency, or implement a simple async retry loop
- Log each retry attempt at warning level
This is the right layer because execute_command() is the single chokepoint for all Claude API calls. Retrying higher up risks duplicating session side-effects.
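For discussion, here is a minimal sketch of the "simple async retry loop" option. It is not taken from the codebase: the helper name retry_transient and its parameters (retryable, max_retries, base_delay, factor) are hypothetical, and the exception types are passed in by the caller so the sketch does not depend on httpx.

```python
import asyncio
import logging
from typing import Awaitable, Callable, Tuple, Type, TypeVar

logger = logging.getLogger(__name__)

T = TypeVar("T")


async def retry_transient(
    call: Callable[[], Awaitable[T]],
    retryable: Tuple[Type[BaseException], ...],
    max_retries: int = 3,
    base_delay: float = 1.0,
    factor: float = 3.0,
) -> T:
    """Await call(), retrying only on the given transport-level exceptions.

    Hypothetical helper, not part of the existing code. Delays grow as
    base_delay * factor**attempt, i.e. 1s, 3s, 9s with the defaults.
    Any exception not listed in `retryable` propagates immediately.
    """
    for attempt in range(max_retries + 1):
        try:
            return await call()
        except retryable as exc:
            if attempt == max_retries:
                raise  # retries exhausted; surface the original error
            delay = base_delay * factor**attempt
            logger.warning(
                "Transient error (%s), retry %d/%d in %.1fs",
                type(exc).__name__, attempt + 1, max_retries, delay,
            )
            await asyncio.sleep(delay)
    # Only reachable if max_retries is negative.
    raise ValueError("max_retries must be >= 0")
```

Inside execute_command(), the existing SDK call could be wrapped in a zero-argument coroutine function and passed as call, with retryable=(httpx.ConnectError, httpx.TimeoutException). Rate-limit and auth errors would not be in that tuple, so they would still fail on the first attempt.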