
feat(shared): exponential backoff retry and graceful degradation#167

Open
zong0728 wants to merge 1 commit into kubeflow:main from zong0728:feat/retry-graceful-degradation

Conversation

@zong0728

Summary

Fixes #160

milvus_search() silently swallows all exceptions and returns {"results": []},
preventing retry on transient failures and causing the LLM to hallucinate from
empty context instead of informing the user the service is unavailable.

Root Cause

Both servers wrap the entire Milvus operation in a broad try/except that
discards the error and returns a fake success:

except Exception as e:
    print(f"[ERROR] Milvus search failed: {e}")
    return {"results": []}   # exception swallowed, caller sees no failure

Because the exception never propagates, there is no opportunity to retry. execute_tool receives an empty result set and passes it to the LLM as if the search had succeeded.

Failure Scenario

1. User asks a Kubeflow-specific question
2. LLM correctly calls search_kubeflow_docs
3. Milvus pod is restarting (transient unavailability)
4. milvus_search fails → exception caught → returns {"results": []} ❌
5. execute_tool sends "No relevant results found." to the LLM
6. LLM generates a hallucinated answer — user never knows the search failed

Fix

  • shared/retry.py: reusable @with_retry decorator (sync + async) using AWS full-jitter exponential backoff (sleep = random.uniform(0, min(cap, base * 2^n))); exposes a DEGRADED_RESULT sentinel string for LLM-visible outage messages

  • milvus_search: remove the silent exception swallow; decorated with @with_retry(max_attempts=3, base_delay=1s, backoff_factor=2x, jitter=True); encoder loaded once at module level via a _get_encoder() singleton

  • execute_tool: offload the blocking search to asyncio.to_thread (WebSocket server) and run_in_threadpool (FastAPI server); return DEGRADED_RESULT on retry exhaustion so the LLM communicates the outage to the user
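
The decorator described above might look roughly like this. This is a minimal sketch based on the parameter names in the PR description (with_retry, DEGRADED_RESULT, max_attempts, base_delay, backoff_factor, jitter); the wording of the sentinel string and the max_delay cap are illustrative, and the actual shared/retry.py may differ:

```python
import asyncio
import functools
import random
import time

# Illustrative sentinel text; the real DEGRADED_RESULT wording may differ.
DEGRADED_RESULT = (
    "Live documentation search is currently unreachable. "
    "Please inform the user of the outage instead of guessing."
)

def with_retry(max_attempts=3, base_delay=1.0, max_delay=10.0,
               backoff_factor=2.0, jitter=True):
    """Retry decorator for sync and async callables (AWS full-jitter backoff)."""
    def _delay(attempt):
        # Exponential growth capped at max_delay: base * factor^attempt.
        capped = min(max_delay, base_delay * (backoff_factor ** attempt))
        # Full jitter: sleep a uniformly random duration in [0, capped].
        return random.uniform(0, capped) if jitter else capped

    def decorator(fn):
        if asyncio.iscoroutinefunction(fn):
            @functools.wraps(fn)
            async def async_wrapper(*args, **kwargs):
                for attempt in range(max_attempts):
                    try:
                        return await fn(*args, **kwargs)
                    except Exception:
                        if attempt == max_attempts - 1:
                            raise  # exhausted: caller handles degradation
                        await asyncio.sleep(_delay(attempt))
            return async_wrapper

        @functools.wraps(fn)
        def sync_wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts - 1:
                        raise
                    time.sleep(_delay(attempt))
        return sync_wrapper
    return decorator
```

Re-raising on the final attempt (rather than returning a fake-empty result) is what lets execute_tool distinguish exhaustion from success and substitute DEGRADED_RESULT.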

Testing

  • Python syntax verified

  • Simulated Milvus failure triggers 3 retry attempts with increasing delays

  • DEGRADED_RESULT returned to LLM after retry exhaustion

  • Single-turn happy-path behaviour unchanged
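
The degraded path exercised by the tests above, in which execute_tool offloads the blocking search and falls back to DEGRADED_RESULT, can be sketched as follows. This is a hypothetical shape, not the PR's actual code: the stub milvus_search simulating retry exhaustion and the response dict layout are assumptions, while asyncio.to_thread and the DEGRADED_RESULT name come from the PR text:

```python
import asyncio

# Sentinel from shared/retry.py; wording here is illustrative.
DEGRADED_RESULT = "Live documentation search is currently unreachable."

def milvus_search(query: str) -> dict:
    # Stand-in for the real @with_retry-decorated blocking search.
    # Raising simulates retry exhaustion against a restarting Milvus pod.
    raise ConnectionError("Milvus unavailable")

async def execute_tool(query: str) -> dict:
    try:
        # Offload the blocking call so the event loop stays responsive
        # under concurrent WebSocket traffic.
        return await asyncio.to_thread(milvus_search, query)
    except Exception:
        # Retries exhausted: surface the outage to the LLM instead of
        # returning a fake-empty result set.
        return {"results": [], "status": DEGRADED_RESULT}
```

On the FastAPI server the same offload would use starlette.concurrency.run_in_threadpool in place of asyncio.to_thread, as noted in the Fix section.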

Checklist

  • Commits are signed off (DCO)

  • Fixes #

  • Implements GSoC 2026 spec Requirement #5 (exponential backoff + graceful degradation)

  • No regressions to single-turn query paths

Implements GSoC 2026 Agentic RAG spec Requirement kubeflow#5:

  'Robust retry logic is a must for all tools. The agent implements
   exponential backoff with jitter for Vector DB retrievals and LLM
   API timeouts. If tools strictly fail, the agent is configured to
   transparently degrade, informing the user that Live code context
   is currently unreachable.'

Changes:
- shared/retry.py: reusable @with_retry decorator supporting both
  sync and async callables; uses AWS full-jitter strategy
  (random.uniform(0, delay)) to prevent thundering-herd on retry;
  exposes DEGRADED_RESULT sentinel string for LLM-visible outage
  messages
- server/app.py, server-https/app.py:
  * milvus_search: remove silent exception swallow; add @with_retry
    (3 attempts, base 1s, max 10s, factor 2x + jitter); encoder
    loaded once at module level via _get_encoder() singleton
  * execute_tool: offload blocking milvus_search to asyncio.to_thread
    (websocket server) and run_in_threadpool (FastAPI server) so the
    async event loop stays responsive under concurrent load; on retry
    exhaustion return DEGRADED_RESULT so LLM communicates the outage
    to the user instead of silently hallucinating from empty context

Signed-off-by: Shengzhong Guan <guan@cmu.edu>
Made-with: Cursor
@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign franciscojavierarceo for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


Successfully merging this pull request may close these issues.

bug(server, server-https): milvus_search silently swallows all exceptions, preventing retry and transparent degradation
