fix: separate document and query prefixes for embedding models #11985

Draft
roomote-v0[bot] wants to merge 1 commit into main from fix/embedding-query-prefix-separation

@roomote-v0 roomote-v0 bot commented Mar 24, 2026

Related GitHub Issue

Closes: #11707

Description

This PR attempts to address Issue #11707 by separating document (indexing) and query (search) prefixes for embedding models. Feedback and guidance are welcome.

Problem: The queryPrefix in EmbeddingModelProfile was applied to all texts indiscriminately -- both during indexing (scanner/file-watcher) and during search queries. Most modern embedding models (including nomic-embed-code) require different instructions for indexing vs querying, or at minimum: no prefix for indexing and a specific prefix for querying.

Solution:

  • Added EmbeddingPurpose type ("index" | "query") to @roo-code/types
  • Added optional documentPrefix field to EmbeddingModelProfile for indexing/document embedding
  • Added getModelDocumentPrefix() and getModelPrefixForPurpose() helpers
  • Added optional purpose parameter to IEmbedder.createEmbeddings():
    • "index" -> uses documentPrefix (for document indexing)
    • "query" or undefined -> uses queryPrefix (backward compatible)
  • Updated all 8 embedder implementations to accept and pass through the purpose parameter
  • Scanner and file-watcher now pass "index" purpose
  • Search service now passes "query" purpose
  • nomic-embed-code profiles: queryPrefix kept for search, documentPrefix intentionally omitted (undefined) since nomic does not require a prefix for document indexing
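
The prefix resolution described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the type, field, and helper names come from the description, while the `PROFILES` table, the model IDs, the prefix strings, and the single-argument helper signatures are assumptions made for the example. In particular, the sketch assumes that "index" does not fall back to `queryPrefix` when `documentPrefix` is absent, which matches the nomic-embed-code bullet.

```typescript
// Names taken from the PR description; shapes are assumed for illustration.
type EmbeddingPurpose = "index" | "query";

interface EmbeddingModelProfile {
  queryPrefix?: string; // prepended to search queries
  documentPrefix?: string; // prepended to documents at indexing time
}

// Hypothetical profile table with placeholder prefix strings.
const PROFILES: Record<string, EmbeddingModelProfile> = {
  // Query prefix only, like the nomic-embed-code case described above.
  "query-only-model": { queryPrefix: "query: " },
  // Distinct prefixes for both purposes.
  "dual-prefix-model": { queryPrefix: "query: ", documentPrefix: "passage: " },
  // No prefixes at all.
  "plain-model": {},
};

function getModelQueryPrefix(modelId: string): string | undefined {
  return PROFILES[modelId]?.queryPrefix;
}

function getModelDocumentPrefix(modelId: string): string | undefined {
  return PROFILES[modelId]?.documentPrefix;
}

// "index" selects documentPrefix; "query" or an omitted purpose selects queryPrefix.
function getModelPrefixForPurpose(
  modelId: string,
  purpose?: EmbeddingPurpose,
): string | undefined {
  return purpose === "index"
    ? getModelDocumentPrefix(modelId)
    : getModelQueryPrefix(modelId);
}
```

With this table, `getModelPrefixForPurpose("query-only-model", "index")` resolves to `undefined`, so documents are embedded without a prefix, while the same model with `"query"` resolves to its query prefix.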

Backward compatibility: Models without documentPrefix continue to work exactly as before. The purpose parameter is optional and defaults to queryPrefix behavior when omitted.
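
As a rough illustration of the optional `purpose` parameter, the sketch below shows how an embedder might apply a prefix before sending texts to a model. The `applyPrefix` helper and the `PrefixPair` shape are hypothetical and are not part of this PR; only the `"index"` / `"query"` / omitted semantics are taken from the description.

```typescript
type EmbeddingPurpose = "index" | "query";

// Hypothetical container for a model's two optional prefixes.
interface PrefixPair {
  queryPrefix?: string;
  documentPrefix?: string;
}

// Hypothetical helper: returns the texts with the purpose-appropriate prefix
// prepended. An omitted purpose is treated like "query".
function applyPrefix(
  texts: string[],
  prefixes: PrefixPair,
  purpose?: EmbeddingPurpose,
): string[] {
  const prefix =
    purpose === "index" ? prefixes.documentPrefix : prefixes.queryPrefix;
  if (!prefix) {
    return texts; // nothing configured for this purpose, leave texts untouched
  }
  return texts.map((text) => `${prefix}${text}`);
}

const queryOnly: PrefixPair = { queryPrefix: "query: " };

// Caller that omits purpose: queryPrefix is applied.
const legacy = applyPrefix(["find the parser"], queryOnly);

// Indexing caller: no documentPrefix is configured, so the text is unchanged.
const indexed = applyPrefix(["function parse() {}"], queryOnly, "index");

console.log(legacy, indexed);
```

Under these assumptions, a call site that never passes `purpose` needs no changes, while the scanner and file-watcher opt into document-side behavior by passing `"index"`.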

Test Procedure

  • Added 14 new tests in embeddingModels.spec.ts covering getModelQueryPrefix, getModelDocumentPrefix, and getModelPrefixForPurpose
  • Updated 5 existing test assertions in gemini, mistral, and vercel-ai-gateway specs for the new purpose parameter passthrough
  • All 407 tests pass (26 embeddingModels + 174 embedders + 207 processors/services)
  • Lint and type checks pass across the entire monorepo

Pre-Submission Checklist

  • Issue Linked: This PR is linked to an approved GitHub Issue (see "Related GitHub Issue" above).
  • Scope: My changes are focused on the linked issue (one major feature/fix per PR).
  • Self-Review: I have performed a thorough self-review of my code.
  • Testing: New and/or updated tests have been added to cover my changes.
  • Documentation Impact: No documentation updates required -- this is an internal API change.
  • Contribution Guidelines: I have read and agree to the Contributor Guidelines.

Documentation Updates

  • No documentation updates are required.

Additional Notes

This change is fully backward compatible. Existing callers that do not pass a purpose will continue to use queryPrefix as before.

Adds EmbeddingPurpose type and documentPrefix field to support
different prefixes for indexing vs querying. Models like nomic-embed-code
require a query prefix for search but no prefix for document indexing.

- Add documentPrefix to EmbeddingModelProfile type
- Add getModelDocumentPrefix() and getModelPrefixForPurpose() helpers
- Add purpose parameter to IEmbedder.createEmbeddings()
- Update all embedder implementations to use purpose-aware prefixing
- Scanner and file-watcher pass "index" purpose
- Search service passes "query" purpose
- Update nomic-embed-code profiles (no documentPrefix for indexing)
- Update tests for new behavior

Closes #11707

Development

Successfully merging this pull request may close these issues.

[BUG] Embedding services wrong / incorrect use of queryPrefix