Implement dual-rate billing for Gemini image-output models#90

Draft
adambalogh wants to merge 2 commits into main from claude/practical-mccarthy-Bs3vD

Conversation

@adambalogh
Contributor

Summary

Gemini image-output models (gemini-2.5-flash-image, gemini-3.1-flash-image) bill output tokens at two different rates: image-modality tokens at a premium rate ($30/MTok for gemini-2.5-flash-image, $60/MTok for gemini-3.1-flash-image) and text/thinking tokens at the standard rate ($1.50/MTok and $3/MTok, respectively). This PR implements split-rate billing by:

  1. Adding image_output_price_usd field to ModelConfig for dual-rate models
  2. Extracting and surfacing reasoning_tokens (thinking) through the usage pipeline
  3. Splitting output token billing in compute_session_cost() based on reasoning count
  4. Updating Gemini image model pricing to reflect Google's actual dual-rate structure
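
As a rough illustration of items 1 and 4, the registry entries might look like the sketch below. Only `output_price_usd`, `image_output`, `image_output_price_usd`, and the two constant names come from this PR; the dataclass shape, the `name` field, and the `is_dual_rate()` helper are assumptions made for the example, not the actual contents of model_registry.py. Prices are USD per million tokens, taken from the commit message.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelConfig:
    # Hypothetical, trimmed-down shape; the real ModelConfig has more fields.
    name: str
    output_price_usd: float  # text/thinking output rate, USD per MTok
    image_output: bool = False
    image_output_price_usd: Optional[float] = None  # image-modality rate, USD per MTok


GEMINI_2_5_FLASH_IMAGE = ModelConfig(
    name="gemini-2.5-flash-image",
    output_price_usd=1.50,
    image_output=True,
    image_output_price_usd=30.0,
)

GEMINI_3_1_FLASH_IMAGE = ModelConfig(
    name="gemini-3.1-flash-image",
    output_price_usd=3.0,
    image_output=True,
    image_output_price_usd=60.0,
)


def is_dual_rate(cfg: ModelConfig) -> bool:
    """Hypothetical helper: a model is dual-rate when the image rate is set."""
    return cfg.image_output_price_usd is not None
```

Leaving `image_output_price_usd` as `None` keeps single-rate models on their existing billing path.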

Key Changes

  • model_registry.py:

    • Added image_output_price_usd optional field to ModelConfig for image-modality token pricing
    • Updated GEMINI_2_5_FLASH_IMAGE and GEMINI_3_1_FLASH_IMAGE to use correct dual-rate pricing (text/thinking at output_price_usd, images at image_output_price_usd)
  • llm_backend.py:

    • Modified extract_usage() to extract and return reasoning_tokens from output_token_details nested in usage metadata
  • pricing.py:

    • Implemented dual-rate billing logic in compute_session_cost(): when image_output_price_usd is set, reasoning tokens are billed at output_price_usd and remaining output tokens (images + captions) at image_output_price_usd
    • Conservative approach: never undercharges image tokens, and whenever reasoning tokens are present the bill is lower than under the previous behavior of billing all output at the image rate
  • chat_controller.py:

    • Modified _create_non_streaming_response() to surface only the standard OpenAI usage triple (prompt/completion/total tokens) while passing reasoning tokens separately to the cost calculator
    • Updated generate() to extract and accumulate reasoning tokens from output_token_details in both the non-streaming and streaming paths
    • Pass reasoning_tokens to compute_session_cost() via a separate cost_usage dict to avoid polluting the OpenAI-compatible response
  • test_image_billing.py:

    • Updated test documentation to reflect dual-rate billing model
    • Modified _usage_dict() to use extract_usage() which now carries reasoning tokens
    • Renamed test_generated_image_is_charged_as_output_tokens() to test_generated_image_is_charged_at_image_rate() with updated assertions
    • Added test_thinking_tokens_billed_at_text_rate() to verify thinking tokens use the cheaper text rate
    • Added test_thinking_is_cheaper_than_billing_all_at_image_rate() regression test ensuring the fix doesn't revert to the old buggy behavior
  • test_price_feed.py:

    • Updated mock model config to include image_output=False and image_output_price_usd=None for single-rate test models
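
The split described for pricing.py can be sketched as a small standalone function. This is a minimal sketch under stated assumptions, not the body of compute_session_cost(): the function name `output_cost_usd`, its signature, and the clamping of `reasoning_tokens` to the range `[0, output_tokens]` are all hypothetical choices made for the example.

```python
from typing import Optional

TOKENS_PER_MTOK = 1_000_000


def output_cost_usd(
    output_tokens: int,
    reasoning_tokens: int,
    output_price_usd: float,
    image_output_price_usd: Optional[float] = None,
) -> float:
    """Cost of the output side of one response, in USD (hypothetical helper)."""
    if image_output_price_usd is None:
        # Single-rate model: every output token bills at output_price_usd.
        return output_tokens * output_price_usd / TOKENS_PER_MTOK

    # Assumption: clamp so a bad reasoning count can never produce a
    # negative remainder. reasoning_tokens is already included in output_tokens.
    reasoning = max(0, min(reasoning_tokens, output_tokens))
    remainder = output_tokens - reasoning  # image tokens plus any caption text

    return (
        reasoning * output_price_usd + remainder * image_output_price_usd
    ) / TOKENS_PER_MTOK
```

With 2,000 output tokens of which 500 are reasoning, at $3 and $60 per MTok, this bills 500 tokens at the text rate and 1,500 at the image rate.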

Implementation Details

  • Reasoning token extraction: LangChain breaks out thinking tokens in output_token_details.reasoning but folds them into the main output_tokens count. The billing split uses this breakdown to charge thinking at the text rate and the remainder (image + any caption) at the image rate.
  • Conservative billing: The split never undercharges image tokens, so we don't lose revenue on image generation, and whenever reasoning tokens are present the bill is lower than under the previous behavior of billing all output at the image rate.
  • OpenAI compatibility: The response surface maintains the standard OpenAI usage triple; reasoning tokens ride along to the cost calculator only, not exposed in the API response.
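
The extraction and the separation of the two usage dicts could look roughly like this. It assumes the usage metadata arrives as a plain dict with LangChain's `input_tokens`, `output_tokens`, `total_tokens`, and an optional `output_token_details` mapping carrying a `reasoning` count. The `split_usage()` helper and the exact return shapes are hypothetical and are not taken from llm_backend.py or chat_controller.py.

```python
from typing import Any, Dict, Tuple

OPENAI_USAGE_KEYS = ("prompt_tokens", "completion_tokens", "total_tokens")


def extract_usage(usage_metadata: Dict[str, Any]) -> Dict[str, int]:
    """Map LangChain-style usage metadata to OpenAI-style keys plus reasoning."""
    details = usage_metadata.get("output_token_details") or {}
    prompt = usage_metadata.get("input_tokens", 0)
    completion = usage_metadata.get("output_tokens", 0)
    return {
        "prompt_tokens": prompt,
        "completion_tokens": completion,  # already includes reasoning tokens
        "total_tokens": usage_metadata.get("total_tokens", prompt + completion),
        "reasoning_tokens": details.get("reasoning", 0) or 0,
    }


def split_usage(usage: Dict[str, int]) -> Tuple[Dict[str, int], Dict[str, int]]:
    """Hypothetical helper: return (openai_usage, cost_usage)."""
    # Only the standard triple goes into the API response.
    openai_usage = {key: usage[key] for key in OPENAI_USAGE_KEYS}
    # The cost calculator receives everything, reasoning count included.
    cost_usage = dict(usage)
    return openai_usage, cost_usage
```

Keeping two dicts means the response schema never gains a non-standard key, while the cost calculator still sees the reasoning count.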

https://claude.ai/code/session_01GDGKRki93xtXCFUNDkcEyq

claude added 2 commits June 3, 2026 19:20
Google bills nano banana / nano banana 2 output at two rates: image-modality
tokens at a high rate ($30/MTok for 2.5-flash-image, $60/MTok for 3.1-flash-image)
and text + thinking tokens at a much lower rate ($1.50 and $3/MTok). The gateway
previously billed the entire output_tokens count (image + text + thinking) at the
single image rate, overcharging thinking tokens up to 20x and inflating a typical
generation with reasoning by ~50-70%.
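
To make the overcharge concrete, here is a worked example at the 2.5-flash-image rates. The token counts (1,290 image tokens and 800 thinking tokens) are illustrative assumptions, not measurements from the gateway.

```python
IMAGE_RATE_USD = 30.0  # USD per MTok, image-modality output
TEXT_RATE_USD = 1.50   # USD per MTok, text and thinking output
TOKENS_PER_MTOK = 1_000_000

# Illustrative token counts (assumed for this example).
image_tokens = 1290
reasoning_tokens = 800

# Previous behavior: the whole output_tokens count bills at the image rate.
old_cost = (image_tokens + reasoning_tokens) * IMAGE_RATE_USD / TOKENS_PER_MTOK

# Split billing: thinking at the text rate, the remainder at the image rate.
new_cost = (
    image_tokens * IMAGE_RATE_USD + reasoning_tokens * TEXT_RATE_USD
) / TOKENS_PER_MTOK

# Per-token overcharge on thinking, and the inflation of the whole bill.
per_token_overcharge = IMAGE_RATE_USD / TEXT_RATE_USD
inflation = old_cost / new_cost - 1

print(f"old=${old_cost:.4f} new=${new_cost:.4f} inflation={inflation:.0%}")
```

Under these assumed counts the old bill is about $0.0627 against $0.0399 with the split, roughly 57% higher, and each thinking token was billed at 20 times its proper rate.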

Add image_output_price_usd to ModelConfig and split billing: reasoning tokens
(broken out by langchain via output_token_details) are charged at output_price_usd
(text/thinking rate), the remainder at image_output_price_usd. langchain folds
image+text+thinking into one output_tokens count and does not expose the per-
modality breakdown, so the small text caption rides the image rate — conservative
(never undercharges) and strictly cheaper than the previous behavior.

Plumb reasoning_tokens through extract_usage and both streaming paths; keep the
OpenAI usage triple on responses clean.

The output_price_usd is now the text/thinking rate ($3/MTok); the image
rate ($60/MTok) moved to image_output_price_usd.

2 participants