## Summary
Both `server/app.py` and `server-https/app.py` instantiate a new `SentenceTransformer` and open a new Milvus connection inside the `milvus_search` function on every search call. These two costs compound to roughly 2-3 seconds of overhead per query on CPU before the actual vector search even begins.
Issues #63 and #28 each track one half of this individually. This issue consolidates them as a compound performance problem since fixing only one still leaves significant per-request overhead.
## Location
In `server/app.py` and `server-https/app.py`, in the `milvus_search()` function:
```python
def milvus_search(query, top_k=5):
    # Cost 1: new Milvus connection per request (~200-500 ms)
    connections.connect(alias="default", host=MILVUS_HOST, port=MILVUS_PORT)
    collection = Collection(MILVUS_COLLECTION)
    collection.load()  # idempotent after the first call

    # Cost 2: model reload from disk per request (~2-3 s on CPU)
    encoder = SentenceTransformer(EMBEDDING_MODEL)
```
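To confirm the overhead split locally, a rough timing harness can bracket the cold and warm calls. This is a generic sketch, not code from the repo; the `time.sleep` calls stand in for real `milvus_search` invocations:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    """Print wall-clock time for the wrapped block in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        print(f"{label}: {elapsed_ms:.1f} ms")

# Example: wrap consecutive queries to see cold vs. warm cost.
with timed("first query (cold)"):
    time.sleep(0.01)  # stand-in for milvus_search("some query")
with timed("second query (warm)"):
    time.sleep(0.01)  # stand-in for a follow-up milvus_search call
```

With the current code, both calls would show the full model-load plus connection cost; after the fix below, only the first should.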
## The Right Pattern Already Exists
`kagent-feast-mcp/mcp-server/server.py` already implements the correct pattern:
```python
model: SentenceTransformer = None
client: MilvusClient = None

def _init():
    global model, client
    if model is None:
        model = SentenceTransformer(EMBEDDING_MODEL)
    if client is None:
        client = MilvusClient(uri=MILVUS_URI, ...)
```
The main servers should adopt this same lazy-init singleton pattern. Combined, this would reduce per-query overhead from ~3s to near zero for all requests after the first.
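A minimal, runnable sketch of that lazy-init pattern applied to `milvus_search` (the `get_encoder`/`get_collection` helper names are illustrative, and the function bodies here are stand-ins so the sketch runs without pymilvus or sentence-transformers installed):

```python
import functools

# Counters only exist to make the "loaded once" behavior observable.
load_counts = {"encoder": 0, "collection": 0}

@functools.lru_cache(maxsize=1)
def get_encoder():
    load_counts["encoder"] += 1
    return object()  # stand-in for SentenceTransformer(EMBEDDING_MODEL)

@functools.lru_cache(maxsize=1)
def get_collection():
    load_counts["collection"] += 1
    # Stand-in for connections.connect(...) + Collection(MILVUS_COLLECTION)
    return object()

def milvus_search(query, top_k=5):
    encoder = get_encoder()        # constructed once, cached afterwards
    collection = get_collection()  # connected once, cached afterwards
    # ... embed `query` with encoder and run the vector search here ...
    return []

milvus_search("first query")   # pays the one-time init cost
milvus_search("second query")  # hits the caches, near-zero overhead
```

`functools.lru_cache(maxsize=1)` plays the same role as the explicit `global` + `if ... is None` checks in the feast server; either form works, this one just keeps the caching out of the function bodies.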
Note: As Sinan pointed out in the Slack discussion, `collection.load()` is idempotent server-side in Milvus -- once loaded, the collection stays loaded across client disconnects. So the real per-request costs are the model reload and the connection setup/teardown, not all three.
PR freeze is on, so flagging this for when PRs reopen. Happy to pick this up.
Related: #63 (model reload), #28 (connection pooling), #181 (content truncation)