The `get_inference_profile_info` function has a race condition in its caching logic. When multiple concurrent requests arrive, they can all pass the cache check before any of them populates the cache, causing a thundering herd of duplicate `GetInferenceProfile` API calls to Bedrock.
To reproduce:
- Add a custom model using an application inference profile ID to the Bedrock plugin
- In Studio, create a new chatbot app and set the newly imported model as the default
- Publish the app
- Send 100+ concurrent requests to the app's endpoint (e.g., http://localhost/v1/chat-messages)
The expected behavior is that `GetInferenceProfile` is called only once and all remaining requests use the cached data. Instead, CloudTrail shows 50-100 invocations of `GetInferenceProfile`.
`GetInferenceProfile` is a control plane API with a fairly low TPS limit (double digits, based on my testing) that cannot be increased. The race condition can therefore cause `ThrottlingException` errors during batch operations or with enough concurrent users, resulting in HTTP 500 errors.
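One possible mitigation is to guard the cache fill with a lock and re-check the cache once the lock is held (double-checked locking). The sketch below is only an illustration under assumptions: `get_profile_locked`, `fake_get_inference_profile`, and the module-level dict cache are hypothetical names, and it covers threads within a single process only.

```python
import threading
import time

_cache = {}
_cache_lock = threading.Lock()
fetch_calls = 0


def fake_get_inference_profile(profile_id):
    """Hypothetical stand-in for the Bedrock GetInferenceProfile call."""
    global fetch_calls
    fetch_calls += 1  # only ever invoked while _cache_lock is held
    time.sleep(0.05)  # simulated network latency (assumed value)
    return {"inferenceProfileId": profile_id}


def get_profile_locked(profile_id):
    # Fast path: no locking once the entry is cached.
    info = _cache.get(profile_id)
    if info is not None:
        return info
    with _cache_lock:
        # Re-check under the lock: another thread may have filled the
        # cache while this one was waiting to acquire it.
        info = _cache.get(profile_id)
        if info is None:
            info = fake_get_inference_profile(profile_id)
            _cache[profile_id] = info
        return info


def run_concurrently(fn, n):
    barrier = threading.Barrier(n)  # release all workers at the same moment

    def worker():
        barrier.wait()
        fn("example-profile")

    threads = [threading.Thread(target=worker) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


run_concurrently(get_profile_locked, 20)
print(f"simulated API calls with locking: {fetch_calls}")
```

A single global lock also serializes fetches for different profile IDs. If that matters, a per-profile lock would be a natural refinement. Deployments that run several worker processes would still make one call per process unless the cache is shared.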