perf/architecture: centralize SentenceTransformer and Milvus client #184
Open
B-Mohid wants to merge 1 commit into kubeflow:main
[APPROVALNOTIFIER] This PR is NOT APPROVED.
Signed-off-by: B-Mohid <mohidbabanbhai05@gmail.com>
a85c358 to 33d52fe
Summary
This PR refactors the embedding and vector-store access paths in docs-agent to treat the SentenceTransformer embedding model and Milvus client as shared, long‑lived services with connection pooling and in‑memory caching, instead of recreating them per request or per component.
The goal is to better align with the Agentic RAG reference architecture (GSoC 2026) by:
Reducing cold‑start latency for both embedding and retrieval.
Avoiding unnecessary GPU/CPU thrash from repeatedly loading embedding weights.
Providing a single, well‑defined “vector layer” abstraction that can be reused by KFP components and the online agent server.
This is a non‑breaking change for external users (same API and env vars), but it significantly improves internal performance and makes the codebase more modular for future agentic extensions.
Problem
Today, the docs-agent stack uses SentenceTransformers and Milvus in multiple places:
In Kubeflow Pipelines components during indexing (chunk_and_embed, store_milvus).
In the online RAG flow when the agent executes documentation search tools via the HTTPS / WebSocket APIs.
In each layer, we (a) create new SentenceTransformer instances and (b) open independent Milvus connections, typically without explicit connection reuse or structured pooling. This leads to several issues:
Cold starts and resource waste: Loading the same embedding model multiple times (especially GPU‑backed) increases startup time and GPU memory pressure in both pipelines and serving pods.
Scattered Milvus usage: Query logic is partially duplicated and tightly bound to individual components, which complicates future extensions like multi‑collection routing or “Golden Data” indices.
Harder to evolve toward agentic RAG: The GSoC 2026 design envisions multiple specialized agents sharing common tooling (indices, embedding services). With the current layout, each new agent/tool risks duplicating model loading and connection handling rather than plugging into a shared vector service.
The README already highlights “Future Improvements” suggesting exposing the embedding model as a service to reduce repeated heavy installs. This PR is a step in that direction at the code level.
Design
This PR introduces a small vector_services module and refactors call sites to use it:
Centralized SentenceTransformer service
A new module (e.g. server/vector_services/embedding.py) exposes a singleton-style accessor for the shared model.
All embedding code (both in the online agent and, where practical, in pipeline components) calls this accessor instead of instantiating SentenceTransformer directly.
This guarantees a single in‑process model instance per pod, enabling warm reuse and making it trivial to later swap this local model with an embedding microservice if the project adopts that future architecture.
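A minimal sketch of what that accessor could look like, under stated assumptions: the function name `get_embedding_model`, the default model name, and the `loader` injection hook are all hypothetical (the hook only exists here so the pattern can be exercised without loading real model weights); in the actual module the accessor would construct `SentenceTransformer` directly from the `EMBEDDING_MODEL` env var the PR already relies on.

```python
# Hypothetical sketch of server/vector_services/embedding.py.
# `loader` is an illustration-only hook; the real module would build
# SentenceTransformer(os.environ["EMBEDDING_MODEL"]) itself.
import os
import threading

_model = None
_model_lock = threading.Lock()


def get_embedding_model(loader=None):
    """Return the process-wide embedding model, constructing it exactly once."""
    global _model
    if _model is None:
        with _model_lock:  # double-checked locking: safe under concurrent requests
            if _model is None:
                if loader is None:
                    from sentence_transformers import SentenceTransformer
                    name = os.environ.get("EMBEDDING_MODEL", "all-MiniLM-L6-v2")
                    loader = lambda: SentenceTransformer(name)
                _model = loader()
    return _model
```

Because the module-level `_model` survives across requests within a pod, every caller shares one set of weights, which is the warm-reuse property described above.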
Centralized Milvus client with pooling
A new module (e.g. server/vector_services/milvus_client.py) encapsulates pymilvus.connections behind a thin pooling pattern.
All query / search functions use get_milvus_collection(); no ad‑hoc calls to connections.connect scattered across the codebase.
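A hedged sketch of that pooling pattern, assuming the module layout above: `get_milvus_collection` is the name the PR uses, but the default collection name and the `factory` injection hook are hypothetical (the hook lets the caching behaviour be demonstrated without a running Milvus; the real module would use `pymilvus.Collection`).

```python
# Hypothetical sketch of server/vector_services/milvus_client.py: one shared
# connection per process, with collection handles cached by name.
import os
import threading

_connected = False
_collections = {}
_lock = threading.Lock()


def _ensure_connection():
    """Open the single pymilvus connection lazily, driven by the existing env vars."""
    global _connected
    if not _connected:
        from pymilvus import connections
        connections.connect(
            alias="default",
            host=os.environ.get("MILVUS_HOST", "localhost"),
            port=os.environ.get("MILVUS_PORT", "19530"),
        )
        _connected = True


def get_milvus_collection(name=None, factory=None):
    """Return a cached Collection handle instead of reconnecting per call."""
    # Default collection name is an assumption; the real code reads MILVUS_COLLECTION.
    name = name or os.environ.get("MILVUS_COLLECTION", "kubeflow_docs")
    with _lock:
        if name not in _collections:
            if factory is None:
                from pymilvus import Collection
                _ensure_connection()
                factory = Collection
            _collections[name] = factory(name)
        return _collections[name]
```

Caching handles by name is what keeps a later move to multi-collection routing cheap: new collections plug into the same dictionary rather than opening new connections.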
This matches Milvus best practices for client reuse and simplifies any later move to multiple collections or sharding.
Lightweight in‑process embedding cache
For high‑traffic, doc‑heavy queries (e.g. repeated lookups of the same documentation chunk or canonical question forms), a simple functools.lru_cache wrapper is added at the embedding level.
For batch cases, we add a helper that internally uses the cached single‑text function where applicable (or at least shares the same underlying model instance).
This reduces recomputation of identical embeddings in the online agent path, particularly for templated questions, canonical doc titles, or internal tool prompts.
Separation of concerns for Agentic RAG
### Backward compatibility
No change to external API endpoints (/chat, WebSocket protocol) or request/response format.
Existing environment variables (EMBEDDING_MODEL, MILVUS_HOST, MILVUS_PORT, MILVUS_COLLECTION) still control behaviour; this PR only centralizes their use.
Helm values and manifests do not require changes for basic deployments.
Operational considerations
Because the embedding model is now shared, the first request after pod startup still pays the load cost; subsequent calls are significantly cheaper.
For very memory‑constrained deployments, the cache size is configurable (or can be turned off) via a small constant or future env var (EMBEDDING_CACHE_SIZE).
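One possible way to wire that knob, under stated assumptions: the factory name `make_embedding_cache` and the default of 1024 are hypothetical; only the env var name `EMBEDDING_CACHE_SIZE` comes from the text above.

```python
import os
from functools import lru_cache


def make_embedding_cache():
    """Build a cache decorator from EMBEDDING_CACHE_SIZE; 0 turns caching off."""
    size = int(os.environ.get("EMBEDDING_CACHE_SIZE", "1024"))  # assumed default
    if size <= 0:
        return lambda fn: fn  # pass-through decorator: caching disabled
    return lru_cache(maxsize=size)
```

Reading the env var inside the factory (rather than at import time) keeps the knob testable and lets memory-constrained deployments opt out without code changes.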
This structure makes it straightforward to follow the README’s future direction of moving embedding to a separate KServe or MCP service: only vector_services/embedding.py needs to be swapped to an HTTP/gRPC client.