Implement dual-rate billing for Gemini image-output models #90
Draft
adambalogh wants to merge 2 commits into
Google bills nano banana / nano banana 2 output at two rates: image-modality tokens at a high rate ($30/MTok for 2.5-flash-image, $60/MTok for 3.1-flash-image) and text + thinking tokens at a much lower rate ($1.50 and $3/MTok). The gateway previously billed the entire output_tokens count (image + text + thinking) at the single image rate, overcharging thinking tokens by up to 20x and inflating a typical generation with reasoning by ~50-70%.

Add image_output_price_usd to ModelConfig and split billing: reasoning tokens (broken out by LangChain via output_token_details) are charged at output_price_usd (text/thinking rate), the remainder at image_output_price_usd. LangChain folds image + text + thinking into one output_tokens count and does not expose the per-modality breakdown, so the small text caption rides the image rate. This is conservative (never undercharges) and strictly cheaper than the previous behavior.

Plumb reasoning_tokens through extract_usage and both streaming paths; keep the OpenAI usage triple on responses clean.
The output_price_usd is now the text/thinking rate ($3/MTok); the image rate ($60/MTok) moved to image_output_price_usd.
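The split described above can be illustrated with a small worked example. This is a minimal sketch, not the gateway's actual code: `ModelPricing` and `output_cost_usd` are hypothetical stand-ins for the pricing fields on `ModelConfig` and the output portion of `compute_session_cost()`, the token counts are made up for illustration, and clamping the reasoning count to the output total is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Optional

MTOK = 1_000_000  # prices are quoted per million tokens


@dataclass(frozen=True)
class ModelPricing:
    """Hypothetical stand-in for the pricing fields on ModelConfig."""
    output_price_usd: float                         # text/thinking rate, USD per MTok
    image_output_price_usd: Optional[float] = None  # image rate, USD per MTok


def output_cost_usd(pricing: ModelPricing, output_tokens: int,
                    reasoning_tokens: int = 0) -> float:
    """Bill reasoning tokens at the text rate and the remainder at the image rate."""
    if pricing.image_output_price_usd is None:
        # Single-rate model: every output token uses output_price_usd.
        return output_tokens * pricing.output_price_usd / MTOK
    # Assumption: clamp so a bad reasoning count can never push the remainder negative.
    reasoning = min(max(reasoning_tokens, 0), output_tokens)
    remainder = output_tokens - reasoning  # image tokens plus any text caption
    return (reasoning * pricing.output_price_usd
            + remainder * pricing.image_output_price_usd) / MTOK


# Rates quoted above for 3.1-flash-image; token counts are illustrative only.
pricing = ModelPricing(output_price_usd=3.0, image_output_price_usd=60.0)
old_cost = 2000 * 60.0 / MTOK  # previous behavior: everything at the image rate
new_cost = output_cost_usd(pricing, output_tokens=2000, reasoning_tokens=700)
print(f"old=${old_cost:.4f} new=${new_cost:.4f}")  # old=$0.1200 new=$0.0801
```

With 700 of 2,000 output tokens spent on thinking, the old single-rate charge is roughly 50% higher than the split charge, which is in line with the inflation range quoted in the description.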
Summary
Gemini image-output models (gemini-2.5-flash-image, gemini-3.1-flash-image) bill output tokens at two different rates: image-modality tokens at a premium rate ($30–$60/MTok) and text/thinking tokens at the standard rate ($1.50–$3/MTok). This PR implements proper split-rate billing by:

- Adding an `image_output_price_usd` field to `ModelConfig` for dual-rate models
- Plumbing `reasoning_tokens` (thinking) through the usage pipeline
- Splitting the output charge in `compute_session_cost()` based on reasoning count

Key Changes

model_registry.py:
- Added an optional `image_output_price_usd` field to `ModelConfig` for image-modality token pricing
- Updated `GEMINI_2_5_FLASH_IMAGE` and `GEMINI_3_1_FLASH_IMAGE` to use correct dual-rate pricing (text/thinking at `output_price_usd`, images at `image_output_price_usd`)

llm_backend.py:
- Updated `extract_usage()` to extract and return `reasoning_tokens` from `output_token_details` nested in usage metadata

pricing.py:
- Implemented split billing in `compute_session_cost()`: when `image_output_price_usd` is set, reasoning tokens are billed at `output_price_usd` and remaining output tokens (images + captions) at `image_output_price_usd`

chat_controller.py:
- Updated `_create_non_streaming_response()` to surface only the standard OpenAI usage triple (prompt/completion/total tokens) while passing reasoning tokens separately to the cost calculator
- Updated `generate()` to extract and accumulate `reasoning` tokens from `output_token_details` in both non-streaming and streaming paths
- Pass `reasoning_tokens` to `compute_session_cost()` via a separate `cost_usage` dict to avoid polluting the OpenAI-compatible response

test_image_billing.py:
- Updated `_usage_dict()` to use `extract_usage()`, which now carries reasoning tokens
- Renamed `test_generated_image_is_charged_as_output_tokens()` → `test_generated_image_is_charged_at_image_rate()` with updated assertions
- Added `test_thinking_tokens_billed_at_text_rate()` to verify thinking tokens use the cheaper text rate
- Added `test_thinking_is_cheaper_than_billing_all_at_image_rate()` regression test ensuring the fix doesn't revert to the old buggy behavior

test_price_feed.py:
- Set `image_output=False` and `image_output_price_usd=None` for single-rate test models

Implementation Details

LangChain reports thinking tokens in `output_token_details.reasoning` but folds them into the main `output_tokens` count. The billing split uses this breakdown to charge thinking at the text rate and the remainder (image + any caption) at the image rate.

https://claude.ai/code/session_01GDGKRki93xtXCFUNDkcEyq
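As a rough illustration of the usage plumbing, the sketch below pulls the reasoning count out of a LangChain-style usage metadata mapping and keeps it out of the OpenAI-compatible response. The function names `sketch_extract_usage` and `split_for_response` are hypothetical and are not the PR's actual `extract_usage()` or controller code; the returned key names are assumed from the prompt/completion/total triple mentioned in this description, and the sample counts are invented.

```python
from typing import Any, Dict, Mapping, Optional, Tuple

OPENAI_USAGE_KEYS = ("prompt_tokens", "completion_tokens", "total_tokens")


def sketch_extract_usage(usage_metadata: Optional[Mapping[str, Any]]) -> Dict[str, int]:
    """Flatten usage metadata into a dict that also carries reasoning_tokens."""
    meta = usage_metadata or {}
    details = meta.get("output_token_details") or {}
    prompt = int(meta.get("input_tokens") or 0)
    completion = int(meta.get("output_tokens") or 0)  # already includes thinking tokens
    return {
        "prompt_tokens": prompt,
        "completion_tokens": completion,
        "total_tokens": int(meta.get("total_tokens") or prompt + completion),
        "reasoning_tokens": int(details.get("reasoning") or 0),
    }


def split_for_response(usage: Mapping[str, int]) -> Tuple[Dict[str, int], Dict[str, int]]:
    """Return (response_usage, cost_usage): only cost_usage keeps reasoning_tokens."""
    response_usage = {key: usage[key] for key in OPENAI_USAGE_KEYS}
    cost_usage = dict(usage)
    return response_usage, cost_usage


# Illustrative metadata in the shape LangChain attaches to an AI message.
sample = {
    "input_tokens": 120,
    "output_tokens": 2000,
    "total_tokens": 2120,
    "output_token_details": {"reasoning": 700},
}
response_usage, cost_usage = split_for_response(sketch_extract_usage(sample))
print(response_usage)  # the usage triple returned to the client
print(cost_usage)      # the same triple plus reasoning_tokens, for the cost calculator
```

Keeping two dicts means the cost calculator can see `reasoning_tokens` while clients continue to receive only the three standard usage fields.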