feat(shared): exponential backoff retry and graceful degradation by zong0728 · Pull Request #167 · kubeflow/docs-agent

zong0728 · 2026-03-22T19:35:25Z

Summary

Fixes #160

milvus_search() silently swallows all exceptions and returns {"results": []},
preventing retry on transient failures and causing the LLM to hallucinate from
empty context instead of informing the user the service is unavailable.

Root Cause

Both servers wrap the entire Milvus operation in a broad try/except that
discards the error and returns a fake success:

except Exception as e:
    print(f"[ERROR] Milvus search failed: {e}")
    return {"results": []}   # exception swallowed, caller sees no failure

Because the exception never propagates, there is no opportunity to retry. execute_tool receives an empty result set and passes it to the LLM as if the search had succeeded.

Failure Scenario

User asks a Kubeflow-specific question
LLM correctly calls search_kubeflow_docs
Milvus pod is restarting (transient unavailability)
milvus_search fails → exception caught → returns {"results": []} ❌
execute_tool sends "No relevant results found." to the LLM
LLM generates a hallucinated answer — user never knows the search failed

Fix

shared/retry.py: reusable @with_retry decorator (sync + async) using AWS full-jitter exponential backoff (sleep = random.uniform(0, min(cap, base * 2^n))); exposes DEGRADED_RESULT sentinel string for LLM-visible outage messages
milvus_search: remove silent exception swallow; decorated with @with_retry(max_attempts=3, base_delay=1s, backoff_factor=2x, jitter=True); encoder loaded once at module level via _get_encoder() singleton
execute_tool: offload blocking search to asyncio.to_thread (WebSocket server) and run_in_threadpool (FastAPI server); return DEGRADED_RESULT on retry exhaustion so the LLM communicates the outage to the user

Testing

Python syntax verified
Simulated Milvus failure triggers 3 retry attempts with increasing delays
DEGRADED_RESULT returned to LLM after retry exhaustion
Single-turn happy-path behaviour unchanged
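The degradation path listed above could be exercised roughly as follows. This is a sketch under stated assumptions, not the PR's test code: the search_fn parameter, the stub search functions, the "content" key and the DEGRADED_RESULT wording are all hypothetical, and the retry decorator is left out so the block stays self-contained. Only asyncio.to_thread, the "No relevant results found." message and the idea of returning a sentinel after the search raises are taken from the description.

```python
import asyncio

# Placeholder wording: the PR's actual sentinel text is not shown on this page.
DEGRADED_RESULT = "[DEGRADED] Documentation search is currently unavailable."


def failing_search(query):
    """Stub standing in for milvus_search once its retries are exhausted."""
    raise ConnectionError("Milvus unavailable")


def working_search(query):
    """Stub standing in for a successful milvus_search call."""
    return {"results": [{"content": f"doc snippet for {query!r}"}]}


async def execute_tool(query, search_fn):
    """Run a blocking search off the event loop and degrade visibly on failure."""
    try:
        # to_thread keeps the event loop responsive while the search blocks.
        result = await asyncio.to_thread(search_fn, query)
    except Exception as exc:
        print(f"[ERROR] search failed after retries: {exc}")
        return DEGRADED_RESULT  # the LLM sees an outage, not an empty result set

    hits = result.get("results", [])
    if not hits:
        return "No relevant results found."
    return "\n".join(hit["content"] for hit in hits)


if __name__ == "__main__":
    print(asyncio.run(execute_tool("pipelines", failing_search)))
    print(asyncio.run(execute_tool("pipelines", working_search)))
```

The point of the distinction is that an empty result set and a failed search now produce different strings, so the model can tell "nothing matched" apart from "the search backend is down".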

Checklist

Commits are signed off (DCO)
Fixes #
Implements GSoC 2026 spec Requirement #5 (exponential backoff + graceful degradation)
No regressions to single-turn query paths

Implements GSoC 2026 Agentic RAG spec Requirement #5: 'Robust retry logic is a must for all tools. The agent implements exponential backoff with jitter for Vector DB retrievals and LLM API timeouts. If tools strictly fail, the agent is configured to transparently degrade, informing the user that Live code context is currently unreachable.'

Changes:
- shared/retry.py: reusable @with_retry decorator supporting both sync and async callables; uses AWS full-jitter strategy (random.uniform(0, delay)) to prevent thundering-herd on retry; exposes DEGRADED_RESULT sentinel string for LLM-visible outage messages
- server/app.py, server-https/app.py:
  * milvus_search: remove silent exception swallow; add @with_retry (3 attempts, base 1s, max 10s, factor 2x + jitter); encoder loaded once at module level via _get_encoder() singleton
  * execute_tool: offload blocking milvus_search to asyncio.to_thread (websocket server) and run_in_threadpool (FastAPI server) so the async event loop stays responsive under concurrent load; on retry exhaustion return DEGRADED_RESULT so LLM communicates the outage to the user instead of silently hallucinating from empty context

Signed-off-by: Shengzhong Guan <guan@cmu.edu>
Made-with: Cursor

google-oss-prow · 2026-03-22T19:35:33Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign franciscojavierarceo for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

google-oss-prow bot requested a review from franciscojavierarceo March 22, 2026 19:35

google-oss-prow bot added the size/L label Mar 22, 2026

zong0728 wants to merge 1 commit into kubeflow:main from zong0728:feat/retry-graceful-degradation
