Fix performance issue: Avoid re-initializing SentenceTransformer and remove duplicate milvus_search definition by Ayush-kathil · Pull Request #186 · kubeflow/docs-agent

Ayush-kathil · 2026-03-29T19:49:10Z

Fixes #128

Problem:
SentenceTransformer(EMBEDDING_MODEL) was instantiated inside the milvus_search() function, causing repeated model loading on every request, leading to latency spikes and increased memory usage. Additionally, duplicate definitions of milvus_search existed, causing ambiguity.

Solution:

Moved SentenceTransformer initialization to global scope so it loads once at startup
Updated milvus_search() to reuse the global model instance
Removed duplicate function definition to improve clarity

Impact:

Eliminates repeated model loading
Reduces query latency significantly
Improves memory efficiency
Cleans up redundant code

Tested locally and observed faster response times for repeated queries.

google-oss-prow · 2026-03-29T19:49:16Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign franciscojavierarceo for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Ayush-kathil · 2026-03-29T19:55:29Z

Hi, I’ve submitted a PR fixing this issue by moving SentenceTransformer initialization to the global scope and removing duplicate function definitions. This significantly improves performance and code clarity. Would appreciate feedback!

Instantiate the SentenceTransformer at module level in server-https to avoid recreating the encoder for each milvus_search call, and update milvus_search to use embedding_model.encode(...). Remove the duplicated milvus_search implementation from server/app.py to centralize the search logic and reduce redundancy and overhead from repeated model loads. Signed-off-by: Ayush-kathil <kathilshiva@gmail.com>

Signed-off-by: Ayush-kathil <kathilshiva@gmail.com>

Ayush-kathil · 2026-04-05T18:45:57Z

This PR addresses a clear performance anti-pattern in the RAG pipeline.

Previously, SentenceTransformer was instantiated inside milvus_search() on every invocation, which is particularly costly given the model size (all-mpnet-base-v2) and its load time (~2–5 seconds). In agentic workflows where multiple tool calls are triggered per user request, this results in compounded latency and unnecessary memory churn.

The refactor ensures that the embedding model is initialized once and reused across requests, aligning with standard practices for ML model lifecycle management in backend services.

What’s good:

Eliminates repeated model loading → significantly reduces per-request latency
Prevents redundant high-memory allocations (~400MB per load)
Improves throughput and scalability under multi-step RAG execution

Suggestions / Minor improvements:

Consider adding a short comment near the initialization explaining why the model is cached (helps future contributors avoid regressions)
If not already handled, ensure thread-safety or document assumptions (e.g., single-process vs multi-worker deployment like Gunicorn)
Optionally, lazy initialization (on first request) could further optimize cold-start scenarios

Overall, this is a meaningful performance improvement with no functional regression. Good contribution.

…s_search Implemented thread-safe lazy-loading for SentenceTransformer to eliminate redundant loading within milvus_search. Signed-off-by: Ayush-kathil <kathilshiva@gmail.com>

Ayush-kathil · 2026-04-06T13:43:58Z

Description

This PR resolves a critical performance and memory bottleneck in the RAG pipeline caused by redundant instantiation of SentenceTransformer on every call to milvus_search.

Core Changes

Thread-Safe Singleton
- Introduced a globally cached encoder instance using threading.Lock() with double-checked locking.
- Ensures safe, one-time initialization in multi-threaded environments (e.g., FastAPI, Gunicorn, Uvicorn).
Refactored Endpoints
- Applied the global caching mechanism consistently across:
  - server/app.py (WebSocket layer)
  - server-https/app.py (FastAPI layer)

Performance Impact

Latency Reduction
- Eliminates repeated model loading overhead.
- Converts per-query cost from O(n) initialization to a one-time O(1) startup cost.
- First (cold) query retains ~500ms+ load time; subsequent (warm) queries reuse the cached encoder with near-instant response.
Memory Stability
- Prevents multiple model instances from being created.
- Resolves memory bloat issues observed in long-running Gunicorn/Uvicorn workers.

Validation

Tested locally with no regressions observed.
Performance improvement verified under repeated query conditions.

Please let me know if any refinements or additional checks are required before merge.

google-oss-prow bot requested a review from franciscojavierarceo March 29, 2026 19:49

google-oss-prow bot added the size/M label Mar 29, 2026

Ayush-kathil force-pushed the main branch 3 times, most recently from f1c478d to 63baf14 Compare March 30, 2026 11:25

Ayush-kathil added 2 commits March 30, 2026 16:57

fix(server-https): preserve multi-hop citations in stream_llm_response

f05614a

Signed-off-by: Ayush-kathil <kathilshiva@gmail.com>

Ayush-kathil force-pushed the main branch from 63baf14 to f05614a Compare March 30, 2026 11:27

Ayush-kathil mentioned this pull request Mar 30, 2026

Performance Bug: SentenceTransformer is re-initialized on every Milvus search request #128

Open

Ayush-kathil force-pushed the main branch from bd1725f to 2e45262 Compare April 5, 2026 18:44

Ayush-kathil mentioned this pull request Apr 5, 2026

SentenceTransformer model reloaded on every search request #189

Open

Ayush-kathil force-pushed the main branch from 2e45262 to f1c0cd0 Compare April 6, 2026 13:35

fix(performance): avoid repeated SentenceTransformer loading in milvu…

c323042

…s_search Implemented thread-safe lazy-loading for SentenceTransformer to eliminate redundant loading within milvus_search. Signed-off-by: Ayush-kathil <kathilshiva@gmail.com>

Ayush-kathil force-pushed the main branch from f1c0cd0 to c323042 Compare April 6, 2026 13:39

Ayush-kathil mentioned this pull request Apr 6, 2026

perf: SentenceTransformer model reloaded on every search request #178

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix performance issue: Avoid re-initializing SentenceTransformer and remove duplicate milvus_search definition#186

Fix performance issue: Avoid re-initializing SentenceTransformer and remove duplicate milvus_search definition#186
Ayush-kathil wants to merge 3 commits intokubeflow:mainfrom
Ayush-kathil:main

Ayush-kathil commented Mar 29, 2026

Uh oh!

google-oss-prow bot commented Mar 29, 2026

Uh oh!

Ayush-kathil commented Mar 29, 2026

Uh oh!

Ayush-kathil commented Apr 5, 2026

Uh oh!

Ayush-kathil commented Apr 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Ayush-kathil commented Mar 29, 2026

Uh oh!

google-oss-prow bot commented Mar 29, 2026

Uh oh!

Ayush-kathil commented Mar 29, 2026

Uh oh!

Ayush-kathil commented Apr 5, 2026

Uh oh!

Ayush-kathil commented Apr 6, 2026

Description

Core Changes

Performance Impact

Validation

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant