
fix: OpenAI Compatible Provider empty response auto-retry mechanism #191

Open
liaoyl830 wants to merge 1 commit into shenminglinyi:master from liaoyl830:fix/issue-185


liaoyl830 commented Jun 8, 2026


Summary

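  • Provider-layer automatic retry: OpenAIProvider.generate() now detects empty responses and retries automatically, at most 2 times, with exponential backoff
  • Enhanced diagnostic logging: when a response is empty, a summary of the raw response is logged to help locate upstream gateway problems
  • Pipeline-layer retry recognition: structured_json_pipeline marks empty content as a retryable error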
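
The retry behavior described in the first bullet can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `generate_with_retry`, `EmptyContentError`, the injected `sleep` parameter, and the 0.5 s base delay are all assumptions made for the example. Only the "at most 2 retries" limit and the exponential backoff come from the summary.

```python
import asyncio

MAX_RETRIES = 2                  # from the summary: at most 2 retries
RETRY_BASE_DELAY_SECONDS = 0.5   # assumed value; the PR text does not state it


class EmptyContentError(RuntimeError):
    """Hypothetical error raised when every attempt returns empty text."""


async def generate_with_retry(call, sleep=asyncio.sleep):
    """Await `call()` until it returns non-empty text or retries run out."""
    for attempt in range(1 + MAX_RETRIES):
        text = await call()
        if text and text.strip():
            return text
        if attempt < MAX_RETRIES:
            # Exponential backoff: base, 2 * base, 4 * base, ...
            await sleep(RETRY_BASE_DELAY_SECONDS * (2 ** attempt))
    raise EmptyContentError(f"empty content after {1 + MAX_RETRIES} attempts")


def demo():
    """Fake upstream that answers empty twice, then succeeds."""
    replies = iter(["", "   ", "hello"])
    delays = []

    async def fake_call():
        return next(replies)

    async def fake_sleep(seconds):
        # Record the requested delay instead of actually waiting.
        delays.append(seconds)

    result = asyncio.run(generate_with_retry(fake_call, fake_sleep))
    return result, delays
```

Passing the sleep function in as a parameter is only a convenience for the demo; the PR's tests patch asyncio.sleep instead.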

Test plan

  • All 17 OpenAI Provider tests pass
  • Added two new test cases: retry succeeds, and retries exhausted
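
For readers unfamiliar with this style of test, the two new cases might look something like the sketch below. It does not reproduce the tests in this PR: the `generate` function here is a hypothetical stand-in for the provider method, and only the idea of patching asyncio.sleep and checking call counts is taken from the PR.

```python
import asyncio
from unittest.mock import AsyncMock, patch


async def generate(call, max_retries=2, base_delay=1.0):
    """Hypothetical stand-in for the provider method under test."""
    for attempt in range(1 + max_retries):
        text = await call()
        if text:
            return text
        if attempt < max_retries:
            await asyncio.sleep(base_delay * 2 ** attempt)
    raise RuntimeError("empty content after retries")


def run_retry_success():
    """First reply is empty, second one succeeds."""
    call = AsyncMock(side_effect=["", "ok"])
    # Replace asyncio.sleep so the test does not actually wait.
    with patch("asyncio.sleep", new=AsyncMock()) as fake_sleep:
        result = asyncio.run(generate(call))
    return result, call.await_count, fake_sleep.await_count


def run_retry_exhausted():
    """Every reply is empty, so the retries run out."""
    call = AsyncMock(return_value="")
    with patch("asyncio.sleep", new=AsyncMock()) as fake_sleep:
        try:
            asyncio.run(generate(call))
        except RuntimeError:
            raised = True
        else:
            raised = False
    return raised, call.await_count, fake_sleep.await_count
```

With 2 retries, the exhausted case should make 3 calls and sleep twice, while the success case makes 2 calls and sleeps once.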

Closes #185

Summary by CodeRabbit

Release Notes

  • Bug Fixes
    • Improved stability when AI provider returns empty or incomplete responses through automatic retry logic with exponential backoff.
    • Enhanced fallback mechanism for Chat Completions requests to better handle edge cases.
    • Added diagnostic logging to provide clearer error messages when empty responses persist after retries.

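- Add an automatic retry mechanism for empty responses to the OpenAI Provider (at most 2 retries, exponential backoff)
- Enhance diagnostic logging for empty responses by recording a summary of the raw response for troubleshooting
- structured_json_pipeline marks empty responses as retryable errors
- Update and add unit tests covering the retry logic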

Closes shenminglinyi#185
@liaoyl830 liaoyl830 requested a review from shenminglinyi as a code owner June 8, 2026 14:51

coderabbitai Bot commented Jun 8, 2026


📝 Walkthrough


This PR implements automatic retry logic with exponential backoff for transient empty-content responses from the OpenAI provider. It expands error detection markers in the pipeline, reworks the provider's generate method to retry empty responses, adds diagnostic logging of raw response summaries, and adds comprehensive test coverage for retry and recovery paths.
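
As a rough illustration of marker-based classification in the pipeline, a substring check over the error message could look like this. The function and tuple names are hypothetical, and the "timeout" entry is an invented placeholder for whatever markers already existed; only the two "empty content" strings are taken from this review.

```python
# Assumed marker list. Only the two "empty content" strings come from the PR;
# "timeout" is a made-up example of a pre-existing marker.
RETRYABLE_MARKERS = (
    "timeout",
    "empty content",
    "empty non-stream content",
)


def is_retryable_llm_error(exc: BaseException) -> bool:
    """Return True if the error message contains a known transient marker."""
    message = str(exc).lower()
    return any(marker in message for marker in RETRYABLE_MARKERS)
```

A substring check like this is simple but depends on the provider's error message wording staying stable, which is worth keeping in mind when the provider's final error message is changed.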

Changes

Empty Content Retry Mechanism

| Layer | File(s) | Summary |
| --- | --- | --- |
| Error Detection Markers | application/ai/structured_json_pipeline.py | Updated _is_retryable_llm_error docstring and added new error markers for "empty content" and "empty non-stream content" to classify these as transient, retryable conditions. |
| Retry Infrastructure | infrastructure/ai/providers/openai_provider.py | Added asyncio import and introduced MAX_RETRIES and RETRY_BASE_DELAY_SECONDS constants to configure exponential backoff retry behavior. |
| Generate Method Retry Control Flow | infrastructure/ai/providers/openai_provider.py | Reworked generate() to wrap generation in a retry loop that detects empty-content responses, integrates the Responses-to-Chat-Completions fallback into the retry flow, caches base_url to avoid repeated downgrade overhead, and raises a clear error message after exhausting retries. |
| Diagnostic Response Logging | infrastructure/ai/providers/openai_provider.py | Added _summarize_raw_response() helper to build a compact diagnostic string from response choices, finish_reason, reasoning presence, and token counts; integrated diagnostic logging before fallback to streaming aggregation. |
| Retry Behavior Test Coverage | tests/unit/infrastructure/ai/providers/test_openai_provider.py | Added two new legacy-path tests that patch asyncio.sleep and verify retry counts, sleep invocations, and recovery scenarios (one for exhaustion, one for eventual success); updated Responses API test to validate retry attempts and sleep call counts. |
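
To give a feel for the diagnostic helper described above, here is a hedged sketch of a response summarizer. It is not the PR's _summarize_raw_response(): the output format is invented, and `reasoning_content` is an assumed field name that some OpenAI-compatible gateways attach to the message. The `choices`, `finish_reason`, and `usage.completion_tokens` attributes follow the usual Chat Completions response shape.

```python
from types import SimpleNamespace


def summarize_raw_response(response) -> str:
    """Build a one-line diagnostic summary of a chat-completion-like object."""
    choices = getattr(response, "choices", None) or []
    parts = [f"choices={len(choices)}"]
    if choices:
        first = choices[0]
        message = getattr(first, "message", None)
        parts.append(f"finish_reason={getattr(first, 'finish_reason', None)}")
        # "reasoning_content" is an assumed, gateway-specific field name.
        has_reasoning = bool(getattr(message, "reasoning_content", None))
        parts.append(f"has_reasoning={has_reasoning}")
    usage = getattr(response, "usage", None)
    if usage is not None:
        tokens = getattr(usage, "completion_tokens", None)
        parts.append(f"completion_tokens={tokens}")
    return " ".join(parts)


# A fake empty reply: no content, but reasoning text is present.
example = SimpleNamespace(
    choices=[
        SimpleNamespace(
            finish_reason="stop",
            message=SimpleNamespace(content="", reasoning_content="..."),
        )
    ],
    usage=SimpleNamespace(prompt_tokens=12, completion_tokens=0),
)
```

Logging only counts and flags, rather than the full payload, keeps the log line short and avoids writing prompt or completion text to the logs.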

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Possibly related PRs

  • shenminglinyi/PlotPilot#46: Related provider modifications for handling empty responses from Responses API with fallback to Chat Completions.

Suggested reviewers

  • shenminglinyi

Poem

🐰 A rabbit hops through empty skies,
With retries now to our surprise!
Exponential backoff does the trick,
Empty content? We'll retry quick! 🔄
Recover, retry, succeed with glee! ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 3

❌ Failed checks (2 warnings, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description check | ⚠️ Warning | The PR description is missing critical required sections from the template including 变更类型 (change type), 架构影响 (architecture impact), and 测试 (testing command results). | Complete the PR description by filling in all required template sections: mark the change type (fix), specify affected architecture layers, and provide actual test command outputs. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Title check | ❓ Inconclusive | The PR title uses Chinese mixed with technical terms, making it difficult to assess clarity; in English it translates to 'fix: OpenAI Compatible Provider empty response auto-retry mechanism'. | Consider using English-only titles or translating the full title to English for consistency with international developer teams and Git history readability. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Linked Issues check | ✅ Passed | The PR addresses all core requirements from issue #185: empty response detection at Provider layer, automatic retry mechanism with backoff, original response logging for diagnostics, and retry identification in pipeline layer. |
| Out of Scope Changes check | ✅ Passed | All changes are directly scoped to the objectives in issue #185: modifications only target Provider retry logic, pipeline retry markers, and corresponding unit tests without introducing unrelated functionality. |


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@infrastructure/ai/providers/openai_provider.py`:
- Around line 77-93: The computed flag use_responses should be recalculated on
each retry so changes to the class-level _fallback_to_chat_cache are respected;
move or re-evaluate use_responses (which depends on base_url, self._use_legacy
and self.__class__._fallback_to_chat_cache) inside the for loop in generate()
before deciding to call _generate_via_responses, so that after you add base_url
to _fallback_to_chat_cache subsequent attempts will skip the Responses path and
fall back to chat completions.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: ceb01747-95b5-4c7d-a1ed-0257cd115370

📥 Commits

Reviewing files that changed from the base of the PR and between 1c7df5e and bf693ac.

📒 Files selected for processing (3)
  • application/ai/structured_json_pipeline.py
  • infrastructure/ai/providers/openai_provider.py
  • tests/unit/infrastructure/ai/providers/test_openai_provider.py

Comment on lines 77 to +93
             base_url = self.settings.base_url or "https://api.openai.com/v1"
             use_responses = not self._use_legacy and base_url not in self.__class__._fallback_to_chat_cache

-            if use_responses:
+            last_empty_exc: Exception | None = None
+            for attempt in range(1 + _EMPTY_CONTENT_MAX_RETRIES):
                 try:
-                    return await self._generate_via_responses(prompt, config)
-                except (openai.NotFoundError, openai.BadRequestError) as e:
-                    logger.info(f"Responses API unsupported for {base_url}, falling back to chat completions: {str(e)}")
-                    self.__class__._fallback_to_chat_cache.add(base_url)
-                except Exception as e:
-                    # 某些网关在路径错误时可能不抛严格的 404 而是抛出其他错误,如果消息含有明确路径错误也尝试降级
-                    if "404" in str(e) or "Not Found" in str(e) or "400" in str(e) or "Account invalid" in str(e) or "INVALID_ARGUMENT" in str(e):
-                        logger.info(f"Gateway returned error for Responses API ({base_url}), falling back: {str(e)}")
-                        self.__class__._fallback_to_chat_cache.add(base_url)
-                    else:
-                        raise
+                    if use_responses:
+                        try:
+                            return await self._generate_via_responses(prompt, config)
+                        except (openai.NotFoundError, openai.BadRequestError) as e:
+                            logger.info(f"Responses API unsupported for {base_url}, falling back to chat completions: {str(e)}")
+                            self.__class__._fallback_to_chat_cache.add(base_url)
+                        except Exception as e:
+                            # 某些网关在路径错误时可能不抛严格的 404 而是抛出其他错误,如果消息含有明确路径错误也尝试降级
+                            if "404" in str(e) or "Not Found" in str(e) or "400" in str(e) or "Account invalid" in str(e) or "INVALID_ARGUMENT" in str(e):
+                                logger.info(f"Gateway returned error for Responses API ({base_url}), falling back: {str(e)}")
+                                self.__class__._fallback_to_chat_cache.add(base_url)


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Recompute use_responses inside the retry loop.

Line 78 computes use_responses once, but Lines 88/93 mutate _fallback_to_chat_cache later. In the same generate() call, subsequent retries still use the stale True and keep hitting Responses before falling back again.

💡 Suggested fix
-            use_responses = not self._use_legacy and base_url not in self.__class__._fallback_to_chat_cache
-
             last_empty_exc: Exception | None = None
             for attempt in range(1 + _EMPTY_CONTENT_MAX_RETRIES):
+                use_responses = (
+                    not self._use_legacy
+                    and base_url not in self.__class__._fallback_to_chat_cache
+                )
                 try:
                     if use_responses:
                         try:
                             return await self._generate_via_responses(prompt, config)
                         except (openai.NotFoundError, openai.BadRequestError) as e:
                             logger.info(f"Responses API unsupported for {base_url}, falling back to chat completions: {str(e)}")
                             self.__class__._fallback_to_chat_cache.add(base_url)
+                            use_responses = False
                         except Exception as e:
                             # 某些网关在路径错误时可能不抛严格的 404 而是抛出其他错误,如果消息含有明确路径错误也尝试降级
                             if "404" in str(e) or "Not Found" in str(e) or "400" in str(e) or "Account invalid" in str(e) or "INVALID_ARGUMENT" in str(e):
                                 logger.info(f"Gateway returned error for Responses API ({base_url}), falling back: {str(e)}")
                                 self.__class__._fallback_to_chat_cache.add(base_url)
+                                use_responses = False
                             else:
                                 raise
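
The effect of the stale flag can be seen in a small simulation. This is a simplified model rather than the provider code: `first_calls`, the example URL, and the assumption that every attempt falls through to chat completions and comes back empty are all invented for illustration.

```python
def first_calls(attempts: int, recompute: bool) -> list[str]:
    """Return the sequence of API paths tried across retry attempts."""
    fallback_cache: set[str] = set()  # stands in for _fallback_to_chat_cache
    base_url = "https://gateway.example/v1"  # hypothetical gateway URL
    calls = []
    # Computed once before the loop, as in the reviewed code.
    use_responses = base_url not in fallback_cache
    for _ in range(attempts):
        if recompute:
            # Suggested fix: re-read the cache on every attempt.
            use_responses = base_url not in fallback_cache
        if use_responses:
            calls.append("responses")
            # Simulate a gateway that rejects the Responses API.
            fallback_cache.add(base_url)
        # Fall through to chat completions, which returns empty content
        # in this model, so the loop retries.
        calls.append("chat")
    return calls
```

With three attempts, the stale version tries the Responses path on every attempt, while the recomputed version tries it once and then goes straight to chat completions.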
🧰 Tools
🪛 Ruff (0.15.15)

[warning] 87-87: Use explicit conversion flag

Replace with conversion flag

(RUF010)


[warning] 90-90: Comment contains ambiguous , (FULLWIDTH COMMA). Did you mean , (COMMA)?

(RUF003)


[warning] 92-92: Use explicit conversion flag

Replace with conversion flag

(RUF010)



Successfully merging this pull request may close these issues.

Multiple modules fail when the OpenAI Compatible Provider returns empty content
