This document describes support for OpenAI's reasoning models (o1, o3, o4 series) in the benchmarking system.
Reasoning models (o1, o3, o4 series) require the OpenAI Responses API (/v1/responses) instead of the Chat Completions API (/v1/chat/completions). The implementation automatically detects reasoning models and uses the appropriate endpoint.
The provider automatically detects reasoning models based on their name prefix:
```python
REASONING_MODEL_PREFIXES = ("o1", "o3", "o4")

def _is_reasoning_model(model_name: str) -> bool:
    """Check if model is a reasoning model that requires the Responses API."""
    return any(model_name.startswith(prefix) for prefix in REASONING_MODEL_PREFIXES)
```

Models that match these prefixes will use the Responses API.
| Feature | Chat Completions | Responses API |
|---|---|---|
| Endpoint | `/v1/chat/completions` | `/v1/responses` |
| Token limit param | `max_completion_tokens` | `max_output_tokens` |
| Input format | `messages` array | `input` string |
| Response format | `choices` array | `output` array |
| Streaming events | Delta chunks | Semantic events |
| Reasoning tokens | Not exposed | Included in usage |
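The differences above can be sketched as a request builder. `build_request` is an illustrative helper (not part of the codebase) that picks the endpoint and parameter names for each model family:

```python
REASONING_MODEL_PREFIXES = ("o1", "o3", "o4")

def build_request(model: str, prompt: str, max_tokens: int) -> dict:
    """Choose endpoint and payload shape based on model family.

    Illustrative only: the real provider builds SDK calls, not raw dicts.
    """
    if any(model.startswith(p) for p in REASONING_MODEL_PREFIXES):
        # Reasoning models: Responses API with a plain input string.
        return {
            "endpoint": "/v1/responses",
            "payload": {
                "model": model,
                "input": prompt,
                "max_output_tokens": max_tokens,
            },
        }
    # Everything else: Chat Completions with a messages array.
    return {
        "endpoint": "/v1/chat/completions",
        "payload": {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "max_completion_tokens": max_tokens,
        },
    }
```
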
The Responses API provides `usage.output_tokens`, which includes:
- Visible output tokens: Text generated for the user
- Reasoning tokens: Internal reasoning (not visible)
For benchmarking, we use output_tokens (total) to measure model performance, as reasoning is part of the generation process.
Example from o3-mini:

```
input_tokens: 19
output_tokens: 512
├─ reasoning_tokens: 256 (internal)
└─ visible tokens: 256 (shown to user)
```
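As a sketch of the arithmetic above, visible tokens are the total output minus the reasoning tokens. The field names mirror the Responses API usage object (`output_tokens_details.reasoning_tokens`), passed here as a plain dict for illustration:

```python
def visible_output_tokens(usage: dict) -> int:
    """Visible tokens = total output tokens minus internal reasoning tokens.

    `usage` is a dict stand-in for the Responses API usage object,
    assuming the field names exposed by the current API.
    """
    return usage["output_tokens"] - usage["output_tokens_details"]["reasoning_tokens"]
```
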
The Responses API uses semantic events instead of delta chunks:
| Event Type | Description |
|---|---|
| `response.created` | Response started |
| `response.in_progress` | Generation in progress |
| `response.output_text.delta` | Text chunk (with timing) |
| `response.completed` | Generation finished successfully |
| `response.incomplete` | Hit token limit |
| `response.failed` | Generation failed |
The implementation tracks timing from `response.output_text.delta` events.
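A minimal sketch of that timing logic, assuming the stream has been flattened into hypothetical `(timestamp, event_type)` pairs rather than SDK event objects:

```python
def time_to_first_token(events: list[tuple[float, str]], start: float) -> float:
    """Seconds from request start to the first visible text chunk.

    Returns 0.0 when no text was emitted (e.g. all tokens spent on
    reasoning). `events` is a hypothetical (timestamp, event_type)
    trace, not the SDK stream object.
    """
    for ts, etype in events:
        if etype == "response.output_text.delta":
            return ts - start
    return 0.0
```
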
- **No visible text output**: Some reasoning models (e.g., o3-mini with low token limits) may use all tokens for reasoning and emit no visible text. This is valid: `output_tokens` will be non-zero but `time_to_first_token` will be 0.
- **Incomplete responses**: If `max_output_tokens` is too low, the response will be marked `incomplete`. The usage metrics are still captured correctly.
- **Temperature parameter**: Some reasoning models reject the `temperature` parameter. The implementation automatically retries without it.
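The retry behavior can be sketched as follows. `send` is a stand-in for the real API call, and a real implementation would catch the SDK's `BadRequestError` rather than `ValueError`:

```python
def call_with_retry(send, payload: dict) -> dict:
    """Retry once without `temperature` if the model rejects it.

    `send` is a hypothetical callable standing in for the actual API
    request; the real code would catch openai.BadRequestError instead.
    """
    try:
        return send(payload)
    except ValueError as exc:
        if "temperature" in str(exc) and "temperature" in payload:
            # Drop the rejected parameter and retry once.
            retry_payload = {k: v for k, v in payload.items() if k != "temperature"}
            return send(retry_payload)
        raise
```
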
Three test scripts are provided.

Tests the raw OpenAI Responses API to understand behavior:

```bash
uv run python test_responses_api.py
```

Tests the provider implementation:

```bash
uv run python test_openai_provider.py
```

Comprehensive end-to-end test suite:

```bash
uv run python test_reasoning_models_e2e.py
```

To add a new reasoning model to the benchmark system:
- Add to the MongoDB models collection:

```bash
mongosh "$MONGODB_URI" --eval '
db.models.insertOne({
  provider: "openai",
  model_id: "o5-mini",  // New model
  enabled: true,
  deprecated: false,
  created_at: new Date()
})
'
```

- **No code changes needed**: The prefix-based detection automatically handles new o1/o3/o4 models. (A model with a new prefix, such as o5, would also need that prefix added to `REASONING_MODEL_PREFIXES`.)
- Test the model:

```bash
uv run python test_openai_provider.py  # Update test_cases list
```

If a model requires the Responses API but wasn't detected, check whether its name matches one of the prefixes: o1, o3, o4.
Check whether the model is using all tokens for reasoning. Increase `max_tokens` in the run config (the default is 64; reasoning models may need 256+).
The Responses API should always include usage data. If it is missing, check that:

- The response object exists (`response_obj` is not None)
- The event stream completed (a `response.completed` or `response.incomplete` event was received)
- The OpenAI API version is up to date (`openai>=1.63.2`)