feat: add support for retrying LLM requests on 429 ratelimits (#697) #733
raheelshahzad wants to merge 3 commits into katanemo:main
Conversation
Thanks a lot for putting this change together @raheelshahzad . Please join our discord channel too. Overall looks good!
I left some comments in the PR and have some additional suggestions on the overall change:
- we should do exponential backoff on retries
- how do we ensure that we have not exceeded the overall request timeout?
- max_retries should be defined somewhere in config.yaml; probably not in this PR, but we should let developers define that var
- this code change needs an update to the docs
- I think we should allow retry to the same provider, or at least let developers define whether they want to retry on a different provider. Consider the following example:
```yaml
model_providers:
  - model: openai/gpt-4o
    base_url: https://dsna-oai.openai.azure.com
    access_key: $OPENAI_API_KEY
    retry_on_ratelimit: true      # new feature
    retry_to_same_provider: true  # this flag should only allow retry to the same provider; otherwise we should retry randomly across all models
  - model: openai/gpt-5
    base_url: https://dsna-oai.openai.azure.com
    access_key: $OPENAI_API_KEY
```
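The exponential-backoff point above could be sketched roughly as follows. The helper name and the 25ms/250ms defaults are illustrative assumptions, not part of this PR:

```rust
use std::time::Duration;

/// Compute the delay before the nth retry (0-based): double a base
/// interval each attempt and clamp it at a maximum. Illustrative only.
fn backoff_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    let factor = 2u32.saturating_pow(attempt);
    base.saturating_mul(factor).min(max)
}

fn main() {
    let base = Duration::from_millis(25);
    let max = Duration::from_millis(250);
    for attempt in 0..5u32 {
        // 25ms, 50ms, 100ms, 200ms, then clamped at 250ms
        println!("retry {attempt}: wait {:?}", backoff_delay(attempt, base, max));
    }
}
```

In practice the computed delay would be passed to `tokio::time::sleep` inside the retry loop; adding jitter on top is a common refinement.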
crates/common/src/llm_providers.rs (outdated)
```rust
self.providers.iter().find_map(|(key, provider)| {
    if provider.internal != Some(true)
        && provider.name != current_name
        && key == &provider.name
    {
        Some(Arc::clone(provider))
    } else {
        None
    }
})
```
should pick random model
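A minimal sketch of what random fallback selection could look like. All names here are illustrative (the real code works over the `LlmProviders` map), and the index is caller-supplied to keep the example deterministic; in production it would come from an RNG such as the `rand` crate:

```rust
/// Pick an eligible fallback from the provider list, skipping the provider
/// that just rate-limited us. `pick` is an arbitrary index source so the
/// choice is testable; production code would draw it from an RNG.
fn pick_alternative<'a>(providers: &[&'a str], current: &str, pick: usize) -> Option<&'a str> {
    let eligible: Vec<&'a str> = providers
        .iter()
        .copied()
        .filter(|p| *p != current)
        .collect();
    if eligible.is_empty() {
        None
    } else {
        Some(eligible[pick % eligible.len()])
    }
}

fn main() {
    let providers = ["gpt-4o", "gpt-5", "claude-sonnet-4-0"];
    println!("{:?}", pick_alternative(&providers, "gpt-4o", 1));
}
```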
```rust
if res.status() == StatusCode::TOO_MANY_REQUESTS && attempts < max_attempts {
    let providers = llm_providers.read().await;
    if let Some(provider) = providers.get(&current_resolved_model) {
        if provider.retry_on_ratelimit == Some(true) {
            if let Some(alt_provider) = providers.get_alternative(&current_resolved_model) {
                info!(
                    request_id = %request_id,
                    current_model = %current_resolved_model,
                    alt_model = %alt_provider.name,
                    "429 received, retrying with alternative model"
                );
                current_resolved_model = alt_provider.name.clone();
                continue;
            }
        }
    }
}
```
we need to add exponential backoff
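One way to add backoff here while also addressing the earlier request-timeout concern is to check the remaining budget before sleeping. This is a sketch with hypothetical names, not the PR's actual code:

```rust
use std::time::{Duration, Instant};

/// How much of the overall request budget remains, if any.
fn time_left(start: Instant, timeout: Duration) -> Option<Duration> {
    timeout.checked_sub(start.elapsed())
}

/// Only retry if the backoff delay still fits inside the remaining budget.
fn should_retry(start: Instant, timeout: Duration, backoff: Duration) -> bool {
    matches!(time_left(start, timeout), Some(left) if left > backoff)
}

fn main() {
    let start = Instant::now();
    // Plenty of budget left: retry is worthwhile.
    assert!(should_retry(start, Duration::from_secs(30), Duration::from_millis(100)));
    // Budget already exhausted: give up instead of sleeping past the deadline.
    assert!(!should_retry(start, Duration::ZERO, Duration::from_millis(100)));
    println!("ok");
}
```

In the handler this check would sit just before a `tokio::time::sleep(backoff)` in the retry loop.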
```rust
let mut current_resolved_model = resolved_model.clone();
let mut current_client_request = client_request;
let mut attempts = 0;
let max_attempts = 2; // Original + 1 retry
```
this should be configurable
```rust
);
// Capture start time right before sending request to upstream
let request_start_time = std::time::Instant::now();
let _request_start_system_time = std::time::SystemTime::now();
```
I looked through envoy retry semantics (https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/route/v3/route_components.proto#envoy-v3-api-field-config-route-v3-routeaction-retry-policy) and I think we should lean toward this design for retries. We don't have to implement it completely, but we should implement the bare minimum while following similar semantics / config. Thoughts?
…mo#697)
- Added 'retry_on_ratelimit' configuration to LlmProvider.
- Implemented a retry loop in the LLM handler to automatically failover to an alternative model when a 429 status is received.
- Added comprehensive unit tests for fallback selection and failover logic.
- Ensured default behavior is unchanged when the feature is disabled.
…andom fallback selection
d1aa3ac to ca903d2
raheelshahzad left a comment
- Exponential backoff with configurable base and max intervals.
- Configurable `max_retries`.
- `retry_to_same_provider` option.
- Random alternative selection when failing over to a different model.
- Documentation updates in the reference configuration.
- Comprehensive unit tests for all the above.
Thanks a lot Raheel for continuing to make plano better. We are getting there. This may be a slightly better way to specify retries:

```yaml
model_providers:
  - model: openai/gpt-4o
    access_key: $OPENAI_API_KEY
    default: true
    retry_policy:
      num_retries: 2
      # retry_on: [429]            # default
      # back_off:
      #   base_interval: 25ms      # default
      #   max_interval: 250ms      # default (10x base)
      # failover:
      #   strategy: same_provider  # default

  # Need more control
  - model: anthropic/claude-sonnet-4-0
    access_key: $ANTHROPIC_API_KEY
    retry_policy:
      num_retries: 3
      failover:
        strategy: any

  # Full control
  - model: openai/gpt-4o-mini
    access_key: $OPENAI_API_KEY
    retry_policy:
      num_retries: 2
      retry_on: [429, 503]
      back_off:
        base_interval: 100ms
        max_interval: 2000ms
      failover:
        providers:
          - anthropic/claude-sonnet-4-0

  # No retries (default; just omit retry_policy)
  - model: mistral/ministral-3b-latest
    access_key: $MISTRAL_API_KEY
```
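For illustration, the proposed retry_policy block might map onto a Rust struct along these lines, with the commented defaults baked into `Default`. This is a sketch under assumed names, not the PR's actual types:

```rust
use std::time::Duration;

/// Hypothetical shape for a provider's `retry_policy` config section.
/// Field names mirror the YAML sketch above; defaults are the ones the
/// comments suggest (retry on 429, 25ms base, 250ms max interval).
#[derive(Debug, Clone, PartialEq)]
struct RetryPolicy {
    num_retries: u32,
    retry_on: Vec<u16>,
    base_interval: Duration,
    max_interval: Duration,
}

impl Default for RetryPolicy {
    fn default() -> Self {
        RetryPolicy {
            num_retries: 0, // omitting retry_policy means no retries
            retry_on: vec![429],
            base_interval: Duration::from_millis(25),
            max_interval: Duration::from_millis(250), // 10x base
        }
    }
}

fn main() {
    let policy = RetryPolicy::default();
    println!("{policy:?}");
}
```

In the real config this would presumably derive `serde::Deserialize` with `#[serde(default)]` on each field, so every key stays optional.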
fixes #697