Implement dual-rate billing for Gemini image-output models by adambalogh · Pull Request #90 · OpenGradient/tee-gateway

adambalogh · 2026-06-03T19:21:42Z

Summary

Google bills output tokens from Gemini image-output models (gemini-2.5-flash-image, gemini-3.1-flash-image) at two different rates: image-modality tokens at a premium rate ($30/MTok for gemini-2.5-flash-image, $60/MTok for gemini-3.1-flash-image) and text/thinking tokens at the standard rate ($1.50/MTok and $3/MTok, respectively). This PR implements split-rate billing by:

- Adding image_output_price_usd field to ModelConfig for dual-rate models
- Extracting and surfacing reasoning_tokens (thinking) through the usage pipeline
- Splitting output token billing in compute_session_cost() based on reasoning count
- Updating Gemini image model pricing to reflect Google's actual dual-rate structure

Key Changes

model_registry.py:
- Added image_output_price_usd optional field to ModelConfig for image-modality token pricing
- Updated GEMINI_2_5_FLASH_IMAGE and GEMINI_3_1_FLASH_IMAGE to use correct dual-rate pricing (text/thinking at output_price_usd, images at image_output_price_usd)
llm_backend.py:
- Modified extract_usage() to extract and return reasoning_tokens from output_token_details nested in usage metadata
pricing.py:
- Implemented dual-rate billing logic in compute_session_cost(): when image_output_price_usd is set, reasoning tokens are billed at output_price_usd and remaining output tokens (images + captions) at image_output_price_usd
- Conservative approach: the split never undercharges image tokens, and the total stays well below what the previous behavior charged by billing all output at the image rate
chat_controller.py:
- Modified _create_non_streaming_response() to surface only the standard OpenAI usage triple (prompt/completion/total tokens) while passing reasoning tokens separately to the cost calculator
- Updated streaming response handling in generate() to extract and accumulate reasoning tokens from output_token_details in both non-streaming and streaming paths
- Pass reasoning_tokens to compute_session_cost() via a separate cost_usage dict to avoid polluting the OpenAI-compatible response
test_image_billing.py:
- Updated test documentation to reflect dual-rate billing model
- Modified _usage_dict() to use extract_usage() which now carries reasoning tokens
- Renamed test_generated_image_is_charged_as_output_tokens() → test_generated_image_is_charged_at_image_rate() with updated assertions
- Added test_thinking_tokens_billed_at_text_rate() to verify thinking tokens use the cheaper text rate
- Added test_thinking_is_cheaper_than_billing_all_at_image_rate() regression test ensuring the fix doesn't revert to the old buggy behavior
test_price_feed.py:
- Updated mock model config to include image_output=False and image_output_price_usd=None for single-rate test models
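
The split described under pricing.py can be sketched roughly as follows. This is a minimal illustration, not the gateway's actual code: the field names follow the PR description, but the per-million-token unit, the `output_cost_usd` helper name, and the clamping of the reasoning count are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelConfig:
    # Assumed unit: USD per million tokens (the real registry may differ).
    output_price_usd: float                          # text/thinking rate
    image_output_price_usd: Optional[float] = None   # image rate; None = single-rate model


def output_cost_usd(cfg: ModelConfig, output_tokens: int, reasoning_tokens: int = 0) -> float:
    """Hypothetical helper: cost of the output side of one response."""
    if cfg.image_output_price_usd is None:
        # Single-rate model: all output tokens use output_price_usd.
        return output_tokens * cfg.output_price_usd / 1_000_000

    # Clamp so a malformed reasoning count can never exceed the total.
    reasoning = min(max(reasoning_tokens, 0), output_tokens)
    remainder = output_tokens - reasoning  # image tokens plus any text caption

    return (
        reasoning * cfg.output_price_usd
        + remainder * cfg.image_output_price_usd
    ) / 1_000_000


# Rates quoted in this PR for gemini-2.5-flash-image; token counts are made up.
flash_image = ModelConfig(output_price_usd=1.50, image_output_price_usd=30.0)
print(output_cost_usd(flash_image, output_tokens=2090, reasoning_tokens=800))
```

When `reasoning_tokens` is 0 or missing, the dual-rate branch charges every output token at the image rate, which matches the previous behavior and so cannot undercharge.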

Implementation Details

- Reasoning token extraction: LangChain breaks out thinking tokens in output_token_details.reasoning but also folds them into the main output_tokens count. The billing split uses this breakdown to charge thinking at the text rate and the remainder (image + any caption) at the image rate.
- Conservative billing: The split never undercharges image tokens, so we don't lose revenue on image generation, and it charges far less than the previous behavior of billing all output at the image rate.
- OpenAI compatibility: The response surface keeps the standard OpenAI usage triple; reasoning tokens ride along to the cost calculator only and are not exposed in the API response.
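
As a rough sketch of the extraction and the response/cost separation described above, the following works on a plain dict shaped like LangChain's usage metadata (`input_tokens`, `output_tokens`, `total_tokens`, `output_token_details.reasoning`). The function names `extract_usage_sketch` and `split_for_response` are hypothetical, and the real `extract_usage()` may differ.

```python
from typing import Any, Dict, Mapping, Optional, Tuple


def extract_usage_sketch(usage_metadata: Optional[Mapping[str, Any]]) -> Dict[str, int]:
    """Hypothetical stand-in for extract_usage(): flatten usage metadata."""
    meta = usage_metadata or {}
    details = meta.get("output_token_details") or {}
    prompt = int(meta.get("input_tokens") or 0)
    completion = int(meta.get("output_tokens") or 0)
    return {
        "prompt_tokens": prompt,
        "completion_tokens": completion,
        "total_tokens": int(meta.get("total_tokens") or prompt + completion),
        # Already included in completion_tokens; kept separately for billing.
        "reasoning_tokens": int(details.get("reasoning") or 0),
    }


def split_for_response(usage: Mapping[str, int]) -> Tuple[Dict[str, int], Dict[str, int]]:
    """Return (OpenAI-style usage triple, usage dict for the cost calculator)."""
    openai_usage = {
        key: usage[key]
        for key in ("prompt_tokens", "completion_tokens", "total_tokens")
    }
    cost_usage = dict(usage)  # keeps reasoning_tokens for the billing split
    return openai_usage, cost_usage


meta = {
    "input_tokens": 12,
    "output_tokens": 2090,
    "total_tokens": 2102,
    "output_token_details": {"reasoning": 800},
}
openai_usage, cost_usage = split_for_response(extract_usage_sketch(meta))
print(openai_usage)
print(cost_usage)
```

Keeping two dicts means the client-facing payload never gains a non-standard key, while the cost path still sees the reasoning count.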

https://claude.ai/code/session_01GDGKRki93xtXCFUNDkcEyq

Google bills nano banana / nano banana 2 output at two rates: image-modality tokens at a high rate ($30/MTok for 2.5-flash-image, $60/MTok for 3.1-flash-image) and text + thinking tokens at a much lower rate ($1.50 and $3/MTok). The gateway previously billed the entire output_tokens count (image + text + thinking) at the single image rate, overcharging thinking tokens by up to 20x and inflating the cost of a typical generation with reasoning by ~50-70%.

Add image_output_price_usd to ModelConfig and split billing: reasoning tokens (broken out by LangChain via output_token_details) are charged at output_price_usd (the text/thinking rate), the remainder at image_output_price_usd. LangChain folds image + text + thinking into one output_tokens count and does not expose the per-modality breakdown, so the small text caption rides the image rate. This is conservative (it never undercharges) and strictly cheaper than the previous behavior.

Plumb reasoning_tokens through extract_usage and both streaming paths; keep the OpenAI usage triple on responses clean.
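
The overcharge figures can be checked with a short calculation. The rates below are the ones quoted for 2.5-flash-image; the token counts are hypothetical and chosen only to illustrate a generation with a moderate amount of thinking.

```python
# Rates in USD per million tokens, as quoted above for 2.5-flash-image.
IMAGE_RATE = 30.0
TEXT_RATE = 1.50

# Hypothetical token counts for one generation (not taken from the PR).
image_tokens = 1290
thinking_tokens = 800

# Previous behavior: every output token billed at the image rate.
old_cost = (image_tokens + thinking_tokens) * IMAGE_RATE / 1_000_000

# Split billing: thinking at the text rate, the remainder at the image rate.
new_cost = (image_tokens * IMAGE_RATE + thinking_tokens * TEXT_RATE) / 1_000_000

per_token_overcharge = IMAGE_RATE / TEXT_RATE
inflation_pct = (old_cost / new_cost - 1) * 100

print(f"old: ${old_cost:.4f}  new: ${new_cost:.4f}")
print(f"thinking tokens were overcharged {per_token_overcharge:.0f}x")
print(f"total was inflated by {inflation_pct:.0f}%")
```

With these counts the old total is about 57% higher than the split total; the exact inflation depends on how many thinking tokens a given request produces.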

For gemini-3.1-flash-image, output_price_usd is now the text/thinking rate ($3/MTok); the image rate ($60/MTok) moved to image_output_price_usd.

claude added 2 commits June 3, 2026 19:20

Update gemini-3.1-flash-image pricing assertion for dual-rate split

43c69ce


adambalogh wants to merge 2 commits into main from claude/practical-mccarthy-Bs3vD
