feat: Optimize similarity search with vectorized cosine similarity (#634)#648

Open
fennhelloworld wants to merge 1 commit into ritesh-1918:main from fennhelloworld:feat/vectorized-cosine-similarity

fennhelloworld commented May 29, 2026

Summary

Closes #634 — Optimizes the duplicate detection similarity search by replacing the per-ticket loop with vectorized batched cosine similarity.

Problem

DuplicateService.check_duplicate() previously iterated over every stored ticket embedding and called util.cos_sim() individually, resulting in O(n) separate tensor operations and kernel launches. Under load with many cached tickets, this caused significant latency.

Solution

All stored embeddings are now stacked into a single 2D tensor (_embedding_matrix) and compared against the query embedding in one batched matrix operation, then torch.argmax() identifies the best match.
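
Written out, the batched comparison amounts to the following. This is a sketch in notation that does not appear in the PR: $q$ is the query embedding, $e_i$ is the $i$-th stored embedding (row $i$ of _embedding_matrix), and $n$ is the number of stored tickets.

```latex
s_i = \frac{e_i \cdot q}{\lVert e_i \rVert \, \lVert q \rVert}, \quad i = 1, \dots, n,
\qquad
i^{*} = \operatorname*{arg\,max}_{i} \; s_i
```

Stacking the $e_i$ as rows of one matrix lets all $s_i$ come out of a single matrix-vector product between the row-normalized matrix and the normalized query, which is what replaces the $n$ separate calls. The best match is then the ticket at index $i^{*}$, accepted only if $s_{i^{*}}$ clears the threshold.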

Key changes

| File | Change |
| --- | --- |
| backend/services/duplicate_service.py | Vectorized check_duplicate(), added _rebuild_embedding_matrix(), lazy matrix caching |
| backend/services/benchmark_similarity.py | New benchmark script comparing loop vs vectorized performance |

Benchmark results

| Tickets | Loop (ms) | Vectorized (ms) | Speedup |
| --- | --- | --- | --- |
| 10 | 0.70 | 0.07 | 10x |
| 100 | 2.90 | 0.09 | 33x |
| 500 | 14.43 | 0.07 | 196x |
| 1,000 | 29.52 | 0.07 | 394x |
| 5,000 | 144.16 | 0.34 | 421x |
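
As a sanity check on the table, the speedup column is the loop time divided by the vectorized time. The snippet below recomputes that ratio from the rounded timings shown above. The function name `speedup` is illustrative and not taken from the benchmark script.

```python
# (tickets, loop_ms, vectorized_ms, reported_speedup), copied from the table above.
rows = [
    (10, 0.70, 0.07, 10),
    (100, 2.90, 0.09, 33),
    (500, 14.43, 0.07, 196),
    (1000, 29.52, 0.07, 394),
    (5000, 144.16, 0.34, 421),
]


def speedup(loop_ms: float, vectorized_ms: float) -> float:
    """Ratio of loop time to vectorized time."""
    return loop_ms / vectorized_ms


# Ratios recomputed from the two-decimal timings in the table.
recomputed = [round(speedup(loop, vec), 1) for _, loop, vec, _ in rows]

for (n, _, _, reported), ratio in zip(rows, recomputed):
    print(f"n={n}: reported {reported}x, from rounded timings {ratio}x")
```

The recomputed ratios are close to the reported ones but not identical, most visibly at n=500 and n=1,000. A likely explanation, assumed here rather than confirmed by the PR, is that the reported speedups were derived from unrounded timings: a vectorized time near 0.07 ms loses several percent of precision when rounded to two decimals.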

Implementation details

  • Lazy rebuild: The embedding matrix is only rebuilt when _embedding_matrix_dirty is True (after add_ticket()), avoiding redundant computation.
  • Backward compatible: The public API (check_duplicate(), add_ticket(), is_available(), load()) is unchanged — same inputs, same outputs.
  • No new dependencies: Uses existing torch and sentence_transformers.util already in the project.
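
The lazy-rebuild pattern described above can be sketched without any dependencies. The class below is a hypothetical stand-in, not the PR's DuplicateService: plain lists of floats replace torch tensors, and the names `LazyMatrixCache`, `best_match`, and `rebuild_count` are assumptions made for illustration.

```python
from __future__ import annotations

import math


def _normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


class LazyMatrixCache:
    """Illustrative dirty-flag cache; lists stand in for tensors."""

    def __init__(self) -> None:
        self._tickets: list[tuple[str, list[float]]] = []
        self._ticket_ids: list[str] = []
        self._matrix: list[list[float]] = []
        self._dirty = False
        self.rebuild_count = 0  # exposed only to make the laziness observable

    def add_ticket(self, ticket_id: str, embedding: list[float]) -> None:
        self._tickets.append((ticket_id, embedding))
        self._dirty = True  # invalidate only; the rebuild is deferred

    def _rebuild(self) -> None:
        snapshot = list(self._tickets)
        self._ticket_ids = [tid for tid, _ in snapshot]
        # Unit-length rows reduce cosine similarity to a dot product.
        self._matrix = [_normalize(emb) for _, emb in snapshot]
        self._dirty = False
        self.rebuild_count += 1

    def best_match(self, query: list[float]) -> tuple[str, float] | None:
        if not self._tickets:
            return None
        if self._dirty:
            self._rebuild()
        q = _normalize(query)
        scores = [sum(a * b for a, b in zip(row, q)) for row in self._matrix]
        best = max(range(len(scores)), key=scores.__getitem__)  # argmax
        return self._ticket_ids[best], scores[best]


cache = LazyMatrixCache()
cache.add_ticket("T-1", [1.0, 0.0])
cache.add_ticket("T-2", [0.0, 1.0])
match = cache.best_match([0.9, 0.1])  # first query triggers one rebuild
cache.best_match([0.1, 0.9])          # matrix is clean, so no rebuild
print(match, cache.rebuild_count)
```

Two adds followed by two queries cost a single rebuild, which is the saving the dirty flag is meant to provide. A workload that alternates adds and queries would still rebuild on every query.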

How to test

# Run the benchmark
python backend/services/benchmark_similarity.py

Checklist

Summary by CodeRabbit

  • Performance

    • Optimized duplicate detection to use vectorized similarity computations, improving throughput when processing large ticket batches.
  • Chores

    • Added internal performance benchmarking tooling for duplicate detection analysis.


…itesh-1918#634)

Replace per-ticket loop in DuplicateService.check_duplicate() with
vectorized batched cosine similarity computation. Instead of calling
util.cos_sim() individually for each stored embedding (O(n) kernel
launches), all stored embeddings are stacked into a single 2D tensor
and compared against the query in one matrix operation.

Key changes:
- Add _embedding_matrix, _ticket_ids, and _embedding_matrix_dirty
  to DuplicateService for lazy-rebuild caching
- Add _rebuild_embedding_matrix() to stack embeddings into 2D tensor
- Rewrite check_duplicate() to use vectorized util.cos_sim() with
  the stacked matrix and torch.argmax() for best-match selection
- Mark matrix dirty on add_ticket() for correctness
- Add benchmark_similarity.py showing speedup results:
  n=10: 10x, n=100: 33x, n=500: 196x, n=1000: 394x, n=5000: 421x

Closes ritesh-1918#634
vercel Bot commented May 29, 2026

Someone is attempting to deploy a commit to the ritesh Team on Vercel.

A member of the Team first needs to authorize it.

coderabbitai Bot commented May 29, 2026

Walkthrough

The PR introduces vectorized cosine similarity computation to duplicate ticket detection. DuplicateService now caches a stacked embedding matrix and replaces per-ticket loops with a single batched similarity call. A benchmark script validates the performance improvement across multiple dataset sizes.

Changes

Vectorized Duplicate Detection

| Layer / File(s) | Summary |
| --- | --- |
| DuplicateService vectorized implementation (backend/services/duplicate_service.py) | Adds torch and numpy imports, caches an embedding matrix and ticket-ID list with a dirty flag, introduces _rebuild_embedding_matrix() to construct the matrix from stored tickets, marks the cache as stale in add_ticket(), and replaces per-ticket cosine similarity looping in check_duplicate() with vectorized matrix computation via torch.argmax. |
| Benchmark comparison script (backend/services/benchmark_similarity.py) | Generates synthetic unit-normalized embeddings, implements separate loop-based and vectorized similarity benchmark functions, and runs both approaches across multiple dataset sizes to measure and report speedup. |

Sequence Diagram(s)

Not applicable. The changes are a performance optimization refactor within a single service class and a supporting benchmark utility; they do not introduce new multi-component control flows or external interactions that would benefit from visualization beyond the checkpoints already shown in the hidden artifact.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related issues

  • #634 — Vectorize Sentence-Transformers Cosine Similarity Computations: This PR directly implements the core vectorization objective using PyTorch (not ONNX); it replaces loop-based cosine similarity with batched matrix operations and includes benchmark validation of the speedup.

Poem

🐰 A rabbit's ode to swift lookups—
No loops, just stacked embeddings bright,
One argmax finds the match at light!
The matrix dances, fast and lean,
Benchmarks sing of scaling supreme.

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Linked Issues check | ⚠️ Warning | Vectorization objective met with torch batched operations; benchmarks provided showing speedups; but ONNX export script not implemented, partially addressing #634 requirements. | Implement ONNX export script to convert the SentenceTransformer model to .onnx format as required by issue #634. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 77.78% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title clearly and concisely summarizes the main optimization: vectorizing cosine similarity for faster duplicate detection. |
| Out of Scope Changes check | ✅ Passed | All changes are in-scope: benchmark script and DuplicateService modifications directly support the vectorization objective; no unrelated changes detected. |


coderabbitai Bot left a comment

🧹 Nitpick comments (3)
backend/services/duplicate_service.py (2)

125-125: ⚡ Quick win

Make the optional parameter explicit (float | None).

Ruff flags this as an implicit Optional (RUF013). Line 23 already uses | None syntax, so this is consistent with the file.

♻️ Proposed fix
-    def check_duplicate(self, text: str, threshold: float = None) -> dict:
+    def check_duplicate(self, text: str, threshold: float | None = None) -> dict:
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/services/duplicate_service.py` at line 125, The function signature
for check_duplicate currently uses the implicit Optional pattern (threshold:
float = None); update the type annotation to be explicit by changing it to
threshold: float | None = None in the check_duplicate method so it matches the
file's use of `| None` and satisfies the RUF013 rule.

96-112: ⚡ Quick win

Fix potential state desync in _rebuild_embedding_matrix() by snapshotting _tickets

DuplicateService._rebuild_embedding_matrix() builds _ticket_ids and the stacked embeddings from two separate passes over self._tickets. add_ticket() appends to self._tickets and sets _embedding_matrix_dirty=True, while check_duplicate() may rebuild the matrix when dirty/stale, so concurrent mutation could desync _ticket_ids vs _embedding_matrix.

In backend/main.py, the call sites for duplicate_service.add_ticket(...) and duplicate_service.check_duplicate(...) are inside async def routes, but the service methods are synchronous and torch ops may release the GIL; if the app is running with multiple threads/workers within a process, this race is still plausible. Snapshotting avoids the mismatch without relying on deployment details.

-        self._ticket_ids = [tid for tid, _, _ in self._tickets]
-        embeddings = [emb for _, emb, _ in self._tickets]
-        self._embedding_matrix = torch.stack(embeddings)
+        tickets = list(self._tickets)  # consistent snapshot
+        self._ticket_ids = [tid for tid, _, _ in tickets]
+        embeddings = [emb for _, emb, _ in tickets]
+        self._embedding_matrix = torch.stack(embeddings)
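
To make the failure mode concrete, here is a small dependency-free simulation. It is only a sketch: the `between_passes` callback is a hypothetical stand-in for a concurrent add_ticket(), and the third tuple field is assumed to be ticket text, since the diff only shows that tickets are 3-tuples.

```python
def two_pass(tickets, between_passes=lambda: None):
    """Original shape: two separate passes over the live list."""
    ids = [tid for tid, _, _ in tickets]
    between_passes()  # stands in for a concurrent add_ticket()
    rows = [emb for _, emb, _ in tickets]
    return ids, rows


def snapshot_pass(tickets, between_passes=lambda: None):
    """Suggested shape: both passes read the same shallow copy."""
    snapshot = list(tickets)
    ids = [tid for tid, _, _ in snapshot]
    between_passes()  # the append lands in `tickets`, not in `snapshot`
    rows = [emb for _, emb, _ in snapshot]
    return ids, rows


def make_tickets():
    # Third field is assumed to be the ticket text (hypothetical).
    return [("T-1", [1.0, 0.0], "printer offline"),
            ("T-2", [0.0, 1.0], "vpn down")]


late = ("T-3", [0.5, 0.5], "added mid-rebuild")

racy = make_tickets()
racy_ids, racy_rows = two_pass(racy, lambda: racy.append(late))

safe = make_tickets()
safe_ids, safe_rows = snapshot_pass(safe, lambda: safe.append(late))

print(len(racy_ids), len(racy_rows))  # lengths differ
print(len(safe_ids), len(safe_rows))  # lengths match
```

With mismatched lengths, an argmax index into the matrix can point past the end of the ID list or at the wrong ticket. The snapshot keeps the two structures aligned; the late ticket is picked up on the next rebuild, provided the dirty flag is still set at that point.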
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/services/duplicate_service.py` around lines 96 - 112,
_rebuild_embedding_matrix currently iterates over self._tickets twice, which can
lead to _ticket_ids vs the stacked _embedding_matrix getting out of sync if
self._tickets is mutated concurrently (e.g., between add_ticket and
check_duplicate); fix by snapshotting tickets at the start of
_rebuild_embedding_matrix (e.g., local_tickets = list(self._tickets)) and then
build _ticket_ids and embeddings from that snapshot before calling torch.stack,
then set _embedding_matrix and _ticket_ids and clear _embedding_matrix_dirty;
this ensures atomic consistency without changing add_ticket or check_duplicate
signatures.
backend/services/benchmark_similarity.py (1)

26-45: ⚡ Quick win

Add an untimed warm-up before measuring.

The first timed round absorbs one-time allocation/kernel-init overhead, which can skew the reported averages (most visibly at small n). Since the PR's speedup claims rely on these numbers, a warm-up call makes them more representative.

♻️ Proposed fix
 def benchmark_loop(query: torch.Tensor, stored: list[torch.Tensor], rounds: int = 5) -> float:
     """Old approach: iterate and compute cos_sim one at a time."""
+    for emb in stored:  # warm-up
+        util.cos_sim(query, emb)
     times = []
 def benchmark_vectorized(query: torch.Tensor, matrix: torch.Tensor, rounds: int = 5) -> float:
     """New approach: single batched cos_sim call."""
     query_2d = query.unsqueeze(0)
+    util.cos_sim(query_2d, matrix)  # warm-up
     times = []
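
The same idea can be factored into a small timing helper. This is a generic sketch, not code from benchmark_similarity.py: `time_with_warmup` and `work` are hypothetical names, and the workload is a placeholder for the cos_sim calls.

```python
import time
from typing import Callable


def time_with_warmup(fn: Callable[[], object], rounds: int = 5) -> float:
    """Return mean wall-clock milliseconds per call, excluding one warm-up call."""
    fn()  # untimed warm-up absorbs one-time allocation and init costs
    times = []
    for _ in range(rounds):
        start = time.perf_counter()
        fn()
        times.append((time.perf_counter() - start) * 1000.0)
    return sum(times) / len(times)


calls = []  # records each invocation so the warm-up call is visible


def work() -> int:
    """Placeholder workload standing in for the similarity computation."""
    calls.append(1)
    return sum(i * i for i in range(1000))


mean_ms = time_with_warmup(work, rounds=5)
print(f"{mean_ms:.4f} ms per call over 5 timed rounds, {len(calls)} calls in total")
```

Passing the workload as a callable lets the loop and vectorized variants share one timing path, so any difference in the numbers comes from the computation rather than from the measurement code.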
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/services/benchmark_similarity.py` around lines 26 - 45, Both
benchmark_loop and benchmark_vectorized should perform an untimed warm-up call
to amortize one-time allocation/kernel-init overhead before starting the timed
rounds; update the functions (benchmark_loop and benchmark_vectorized) to run
the same computation once (e.g., one pass over stored in benchmark_loop and one
util.cos_sim call in benchmark_vectorized) prior to the for _ in range(rounds)
timing loop so the measured rounds exclude initialization costs.
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 42974083-3733-43bf-ad65-454075d2fccd

📥 Commits

Reviewing files that changed from the base of the PR and between da8faf2 and 35a9990.

📒 Files selected for processing (2)
  • backend/services/benchmark_similarity.py
  • backend/services/duplicate_service.py


Development

Successfully merging this pull request may close these issues.

[BOUNTY] [level:critical] Vectorize Sentence-Transformers Cosine Similarity Computations with NumPy and ONNX Runtime