
perf/architecture: centralize SentenceTransformer and Milvus client#184

Open
B-Mohid wants to merge 1 commit into kubeflow:main from B-Mohid:perf/centralize-vector-services

Conversation


@B-Mohid B-Mohid commented Mar 29, 2026

Summary

This PR refactors the embedding and vector-store access paths in docs-agent to treat the SentenceTransformer embedding model and Milvus client as shared, long‑lived services with connection pooling and in‑memory caching, instead of recreating them per request or per component.

The goal is to better align with the Agentic RAG reference architecture (GSoC 2026) by:

Reducing cold‑start latency for both embedding and retrieval.

Avoiding unnecessary GPU/CPU thrash from repeatedly loading embedding weights.

Providing a single, well-defined "vector layer" abstraction that can be reused by KFP components and the online agent server.

This is a non-breaking change for external users (same API and env vars), but it significantly improves internal performance and makes the codebase more modular for future agentic extensions.

Problem

Today, the docs-agent stack uses SentenceTransformers and Milvus in multiple places:

In Kubeflow Pipelines components during indexing (chunk_and_embed, store_milvus).

In the online RAG flow when the agent executes documentation search tools via the HTTPS / WebSocket APIs.

In each layer, we (a) create new SentenceTransformer instances and (b) open independent Milvus connections, typically without explicit connection reuse or structured pooling. This leads to several issues:

Cold starts and resource waste: Loading the same embedding model multiple times (especially GPU‑backed) increases startup time and GPU memory pressure in both pipelines and serving pods.

Scattered Milvus usage: Query logic is partially duplicated and tightly bound to individual components, which complicates future extensions like multi‑collection routing or “Golden Data” indices.

Harder to evolve toward agentic RAG: The GSoC 2026 design envisions multiple specialized agents sharing common tooling (indices, embedding services). With the current layout, each new agent/tool risks duplicating model loading and connection handling rather than plugging into a shared vector service.

The README already highlights “Future Improvements” suggesting exposing the embedding model as a service to reduce repeated heavy installs. This PR is a step in that direction at the code level.

Design

This PR introduces a small vector_services module and refactors call sites to use it:

Centralized SentenceTransformer service

New module (e.g. server/vector_services/embedding.py) exposes a singleton-style accessor:

```python
from functools import lru_cache
import os

from sentence_transformers import SentenceTransformer

EMBEDDING_MODEL_NAME = os.getenv(
    "EMBEDDING_MODEL",
    "sentence-transformers/all-mpnet-base-v2",
)

@lru_cache(maxsize=1)
def get_sentence_transformer() -> SentenceTransformer:
    model = SentenceTransformer(EMBEDDING_MODEL_NAME)
    # Optional: configure device here if a GPU is available.
    return model
```

All embedding code (both in the online agent and, where practical, in pipeline components) calls this accessor instead of instantiating SentenceTransformer directly.

This guarantees a single in‑process model instance per pod, enabling warm reuse and making it trivial to later swap this local model with an embedding microservice if the project adopts that future architecture.
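The singleton behaviour of an `lru_cache(maxsize=1)` accessor can be illustrated without loading real weights; `DummyModel` below is a hypothetical stand-in for `SentenceTransformer`:

```python
from functools import lru_cache

class DummyModel:
    """Hypothetical stand-in for SentenceTransformer (no real weights loaded)."""
    instances = 0

    def __init__(self) -> None:
        DummyModel.instances += 1  # count how often the "model" is constructed

@lru_cache(maxsize=1)
def get_model() -> DummyModel:
    # Heavy construction runs exactly once per process; later calls are cache hits.
    return DummyModel()

first = get_model()
second = get_model()
assert first is second            # every caller shares one instance
assert DummyModel.instances == 1  # constructed only once
```

The same property holds for `get_sentence_transformer()` above: every caller in the pod shares one in-process model instance.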

Centralized Milvus client with pooling

New module (e.g. server/vector_services/milvus_client.py) encapsulates pymilvus.connections with a thin pooling pattern:

```python
from functools import lru_cache
import os

from pymilvus import Collection, connections

MILVUS_HOST = os.getenv("MILVUS_HOST", "my-release-milvus.docs-agent.svc.cluster.local")
MILVUS_PORT = os.getenv("MILVUS_PORT", "19530")
MILVUS_COLLECTION = os.getenv("MILVUS_COLLECTION", "docs_rag")

@lru_cache(maxsize=1)
def get_milvus_collection() -> Collection:
    # Establish (or reuse) the single shared connection for this process.
    connections.connect(
        "default",
        host=MILVUS_HOST,
        port=MILVUS_PORT,
    )
    return Collection(MILVUS_COLLECTION)
```

All query / search functions use get_milvus_collection(); no ad‑hoc calls to connections.connect scattered across the codebase.
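As a sketch of what a consolidated call site could look like — `search_docs`, the `embedding` field name, the output fields, and the search parameters are all illustrative assumptions, not code from this PR:

```python
def build_search_params(top_k: int = 5) -> dict:
    # Hypothetical defaults shared by all doc-search tools.
    return {"metric_type": "IP", "params": {"nprobe": 16}, "limit": top_k}

def search_docs(query_embedding: list[float], top_k: int = 5):
    # Deferred import keeps this sketch importable without pymilvus installed;
    # module path follows the layout proposed in this PR.
    from vector_services.milvus_client import get_milvus_collection

    collection = get_milvus_collection()  # shared, pooled collection handle
    params = build_search_params(top_k)
    return collection.search(
        data=[query_embedding],
        anns_field="embedding",              # assumed vector field name
        param={"metric_type": params["metric_type"], "params": params["params"]},
        limit=params["limit"],
        output_fields=["text", "source"],    # assumed payload fields
    )
```

Because every tool goes through `get_milvus_collection()`, later changes such as multi-collection routing live in one place.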

This matches Milvus best practices for client reuse and simplifies any later move to multiple collections or sharding.

Lightweight in‑process embedding cache

For high‑traffic, doc‑heavy queries (e.g. repeated lookups of the same documentation chunk or canonical question forms), a simple functools.lru_cache wrapper is added at the embedding level:

```python
from functools import lru_cache

@lru_cache(maxsize=8192)
def embed_text_cached(text: str) -> list[float]:
    model = get_sentence_transformer()
    # encode() returns a 2-D numpy array for a list input; take the single
    # row and cast it to a plain list of floats for Milvus insertion/query.
    emb = model.encode([text])[0]
    return emb.tolist()
```

For batch cases, we add a helper that internally uses the cached single‑text function where applicable (or at least shares the same underlying model instance).

This reduces recomputation of identical embeddings in the online agent path, particularly for templated questions, canonical doc titles, or internal tool prompts.
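A stdlib-only sketch of how the cached single-text path can back a batch helper; the encode step is stubbed with a hypothetical length-based function so the example is self-contained:

```python
from functools import lru_cache

@lru_cache(maxsize=8192)
def embed_text_cached(text: str) -> tuple[float, ...]:
    # Stub standing in for model.encode(); returns a tuple so the cached
    # value is immutable and cannot be corrupted by callers.
    return (float(len(text)), float(text.count(" ")))

def embed_batch(texts: list[str]) -> list[list[float]]:
    # Each distinct string pays the "model" cost at most once per process.
    return [list(embed_text_cached(t)) for t in texts]

embed_batch(["kubeflow pipelines", "kubeflow pipelines", "milvus"])
info = embed_text_cached.cache_info()
assert info.hits == 1 and info.misses == 2  # duplicate text was served from cache
```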

Separation of concerns for Agentic RAG

With embedding and Milvus access consolidated behind vector_services, future specialized agents and tools envisioned by the GSoC 2026 design can plug into a shared vector layer rather than each duplicating model loading and connection handling.

Backward compatibility

No change to external API endpoints (/chat, WebSocket protocol) or request/response format.

Existing environment variables (EMBEDDING_MODEL, MILVUS_HOST, MILVUS_PORT, MILVUS_COLLECTION) still control behaviour; this PR only centralizes their use.

Helm values and manifests do not require changes for basic deployments.
Operational considerations

Because the embedding model is now shared, the first request after pod startup still pays the load cost; subsequent calls are significantly cheaper.

For very memory‑constrained deployments, the cache size is configurable (or can be turned off) via a small constant or future env var (EMBEDDING_CACHE_SIZE).

This structure makes it straightforward to follow the README’s future direction of moving embedding to a separate KServe or MCP service: only vector_services/embedding.py needs to be swapped to an HTTP/gRPC client.
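The configurable cache size mentioned above could be wired roughly as follows; EMBEDDING_CACHE_SIZE is the future env var named in this PR, while the stubbed embedding function is purely illustrative:

```python
import os
from functools import lru_cache

# "0" would disable caching entirely; the default mirrors the maxsize used above.
_CACHE_SIZE = int(os.getenv("EMBEDDING_CACHE_SIZE", "8192"))

def _embed_uncached(text: str) -> tuple[float, ...]:
    # Stub standing in for the real SentenceTransformer encode call.
    return (float(len(text)),)

# Apply the LRU wrapper only when caching is enabled.
embed_text = (
    lru_cache(maxsize=_CACHE_SIZE)(_embed_uncached)
    if _CACHE_SIZE > 0
    else _embed_uncached
)
```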

@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign franciscojavierarceo for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Signed-off-by: B-Mohid <mohidbabanbhai05@gmail.com>
@B-Mohid B-Mohid force-pushed the perf/centralize-vector-services branch from a85c358 to 33d52fe Compare March 29, 2026 17:14