## Problem

The `milvus_search()` function in both `server/app.py` (line 77) and `server-https/app.py` (line 123) initializes a new `SentenceTransformer` instance on every call:

```python
encoder = SentenceTransformer(EMBEDDING_MODEL)
```

Loading `sentence-transformers/all-mpnet-base-v2` takes ~2-5 seconds and ~400 MB of memory each time. In agentic RAG workflows with multiple tool calls per user turn, this adds 10-25 seconds of unnecessary latency per query.

## Proposed Fix

Initialize the model once at module level and reuse it across all requests.
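A minimal sketch of the fix. The `SentenceTransformer` stand-in class and the 3-dimensional dummy vectors are illustrative only, so the snippet runs without downloading the model; the real code would keep its existing `from sentence_transformers import SentenceTransformer` import and the change is just hoisting the constructor call out of the handler:

```python
class SentenceTransformer:
    """Stand-in for sentence_transformers.SentenceTransformer (illustration only)."""
    load_count = 0  # counts constructions to show the model loads exactly once

    def __init__(self, model_name: str):
        type(self).load_count += 1  # the real __init__ pays the ~2-5 s load here
        self.model_name = model_name

    def encode(self, sentences):
        # The real encode() returns 768-dim embeddings for all-mpnet-base-v2;
        # dummy vectors keep this sketch self-contained.
        return [[0.0, 0.0, 0.0] for _ in sentences]


EMBEDDING_MODEL = "sentence-transformers/all-mpnet-base-v2"

# Load once at import time instead of inside the handler.
encoder = SentenceTransformer(EMBEDDING_MODEL)


def milvus_search(query: str):
    # Reuse the module-level encoder; no per-call construction.
    return encoder.encode([query])[0]
```

If paying the load cost at import time is undesirable (e.g. for fast server startup or tests that never hit search), the same single-instance guarantee can be had lazily by wrapping the constructor in a `functools.lru_cache(maxsize=1)` getter and calling that from `milvus_search()`.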