Performance: SentenceTransformer reload + Milvus reconnect compound to ~3s overhead per search in main servers #183

@JayDS22

Summary

Both server/app.py and server-https/app.py instantiate a new SentenceTransformer and open a new Milvus connection inside milvus_search() on every search call. Together, these two costs add roughly 2-3 seconds of overhead per query on CPU before the actual vector search even begins.

Issues #63 and #28 each track one half of this individually. This issue consolidates them as a compound performance problem since fixing only one still leaves significant per-request overhead.

Location

In server/app.py and server-https/app.py -- milvus_search() function:

def milvus_search(query, top_k=5):
    # Cost 1: New Milvus connection per request (~200-500ms)
    connections.connect(alias="default", host=MILVUS_HOST, port=MILVUS_PORT)
    collection = Collection(MILVUS_COLLECTION)
    collection.load()  # idempotent after first call
    
    # Cost 2: Model reload from disk per request (~2-3s on CPU)
    encoder = SentenceTransformer(EMBEDDING_MODEL)
    # ... query encoding and the actual vector search follow (omitted here)
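
To confirm how the two costs split on a given machine, each constructor can be timed in isolation. The following is a minimal sketch, not code from the repository: `time_calls` and `fake_model_load` are hypothetical names, and the stand-in loader only sleeps so that the example runs without sentence-transformers or pymilvus installed.

```python
import time
from statistics import median
from typing import Callable, List


def time_calls(fn: Callable[[], object], repeats: int = 5) -> List[float]:
    """Return the wall-clock duration in seconds of each call to fn."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


# Hypothetical stand-in for an expensive constructor. In the real servers one
# would pass something like: lambda: SentenceTransformer(EMBEDDING_MODEL)
def fake_model_load() -> str:
    time.sleep(0.01)
    return "model"


durations = time_calls(fake_model_load, repeats=3)
print(f"median: {median(durations) * 1000:.1f} ms over {len(durations)} calls")
```

Timing the model load and `connections.connect(...)` separately this way should show whether the estimates in the comments above hold for a particular deployment.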

The Right Pattern Already Exists

kagent-feast-mcp/mcp-server/server.py already implements the correct pattern:

model: SentenceTransformer = None
client: MilvusClient = None

def _init():
    global model, client
    if model is None:
        model = SentenceTransformer(EMBEDDING_MODEL)
    if client is None:
        client = MilvusClient(uri=MILVUS_URI, ...)

The main servers should adopt this same lazy-init singleton pattern. Combined, this would reduce per-query overhead from ~3s to near zero for all requests after the first.
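
A rough sketch of how that could look inside the main servers is below. It is an illustration under assumptions, not the proposed patch: `LazyResource`, `build_encoder`, and `build_client` are hypothetical names, and the two builders are stand-ins that count how often they run so the example works without a model or a Milvus instance. Unlike `_init()` above, it also takes a lock, on the assumption that the servers may handle requests from several threads.

```python
import threading
from typing import Callable, Generic, List, Optional, TypeVar

T = TypeVar("T")


class LazyResource(Generic[T]):
    """Build a resource on first use, then reuse it for every later call."""

    def __init__(self, factory: Callable[[], T]) -> None:
        self._factory = factory
        self._value: Optional[T] = None
        self._lock = threading.Lock()

    def get(self) -> T:
        if self._value is None:
            with self._lock:
                # Re-check inside the lock so concurrent first calls build once.
                if self._value is None:
                    self._value = self._factory()
        return self._value


build_counts = {"encoder": 0, "client": 0}


def build_encoder() -> Callable[[str], List[float]]:
    # Stand-in: the real factory would return SentenceTransformer(EMBEDDING_MODEL).
    build_counts["encoder"] += 1
    return lambda text: [float(len(text))]


def build_client() -> object:
    # Stand-in: the real factory would open the Milvus connection or client.
    build_counts["client"] += 1
    return object()


encoder = LazyResource(build_encoder)
client = LazyResource(build_client)


def milvus_search(query: str, top_k: int = 5) -> List[float]:
    embed = encoder.get()   # built on the first request only
    _conn = client.get()    # built on the first request only
    # The real vector search against Milvus would go here.
    return embed(query)


for q in ("first", "second", "third"):
    milvus_search(q)

print(build_counts)  # {'encoder': 1, 'client': 1}
```

Only the first request pays the construction cost; later requests reuse the cached objects, which is the behavior the issue asks for.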

Note: As Sinan pointed out in the Slack discussion, collection.load() is idempotent server-side in Milvus -- once a collection is loaded, it stays loaded across client disconnects. So the real per-request costs are the model reload and the connection setup/teardown; collection.load() is not a third.

PR freeze is on, so flagging this for when PRs reopen. Happy to pick this up.

Related: #63 (model reload), #28 (connection pooling), #181 (content truncation)
