Add retry logic for transient network errors in Claude SDK calls #60

@RichardAtCT

Problem

Transient network errors (e.g. httpx.ConnectError) during Claude API calls are surfaced to the user immediately as a generic error message. No layer retries them:

  • sdk_integration.py:execute_command() — catches Exception but re-raises immediately
  • facade.py:run_command() — only retries on "no conversation found" (stale session), not network errors
  • orchestrator.py:agentic_text() — catches, formats, and sends error to user

Observed behavior

A momentary network blip causes:

httpx.ConnectError: 

which surfaces to the Telegram user as a generic error, even though a retry 2-3 seconds later would most likely succeed.

Proposed solution

Add retry with exponential backoff in sdk_integration.py:execute_command() for transient/retryable errors:

  • httpx.ConnectError, httpx.TimeoutException
  • Possibly asyncio.TimeoutError (currently raised as ClaudeTimeoutError with no retry)

Suggested approach:

  • 2-3 retries with exponential backoff (e.g. 1s, 3s, 9s)
  • Only retry on transport-level errors, not application errors (rate limits, auth failures, etc.)
  • Consider adding tenacity as a dependency, or implement a simple async retry loop
  • Log each retry attempt at warning level

This is the right layer because execute_command() is the single chokepoint for all Claude API calls. Retrying higher up risks duplicating session side-effects.
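
To make the suggested approach concrete, here is a minimal sketch of the "simple async retry loop" option, using only the standard library. The helper name `retry_transient` and its parameters are hypothetical and not taken from the codebase. The demo uses the built-in `ConnectionError` as a stand-in so it runs without httpx installed. In `execute_command()` the `retryable` tuple would presumably be `(httpx.ConnectError, httpx.TimeoutException)`.

```python
import asyncio
import logging
from typing import Awaitable, Callable, Tuple, Type, TypeVar

logger = logging.getLogger(__name__)
T = TypeVar("T")


async def retry_transient(
    call: Callable[[], Awaitable[T]],
    retryable: Tuple[Type[BaseException], ...],
    max_retries: int = 3,
    base_delay: float = 1.0,
    factor: float = 3.0,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> T:
    """Await call(), retrying only on the exception types in `retryable`.

    Hypothetical helper: any other exception (rate limit, auth failure,
    etc.) propagates on the first attempt without a retry.
    """
    for attempt in range(max_retries + 1):
        try:
            return await call()
        except retryable as exc:
            if attempt == max_retries:
                raise  # retries exhausted: surface the original error
            # Exponential backoff: 1s, 3s, 9s with the default arguments.
            delay = base_delay * factor ** attempt
            logger.warning(
                "Transient error (%s); retry %d/%d in %.1fs",
                type(exc).__name__, attempt + 1, max_retries, delay,
            )
            await sleep(delay)
    raise AssertionError("unreachable")  # the loop always returns or raises


if __name__ == "__main__":
    attempts = {"n": 0}

    async def flaky_call() -> str:
        # Stand-in for the SDK call: fails once, then succeeds.
        attempts["n"] += 1
        if attempts["n"] == 1:
            raise ConnectionError("simulated network blip")
        return "ok"

    print(asyncio.run(retry_transient(flaky_call, (ConnectionError,))))
```

The `sleep` parameter is injectable only so the backoff can be tested without real delays. Whether `asyncio.TimeoutError` belongs in the retryable tuple is left open here, as it is in the list above.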
