Skip to content

TFIDF slop calculator implementation#5

Open
boda26 wants to merge 2 commits into
scoringfrom
tfidf-slop
Open

TFIDF slop calculator implementation#5
boda26 wants to merge 2 commits into
scoringfrom
tfidf-slop

Conversation

@boda26

@boda26 boda26 commented Jun 5, 2026

Copy link
Copy Markdown
Owner

Add SlopCalculator for TFIDF scoring

Adds SlopCalculator (src/indexes/scoring/slop_calculator.{h,cc}) — a standalone, dependency-free component computing slop, the proximity penalty that
divides the TFIDF numerator (score = words_TFIDF * doc_score / norm / slop).

This is the math-only slice; wiring it into query execution is a follow-up.

Algorithm

slop = max(1, floor(sqrt(Σ MinGap(node[i], node[i+1])²)))

over consecutive outermost-level query nodes. A nested group collapses to the union of its leaf positions (one anchor); MinGap is a two-pointer closest-pair
scan. The min-1 guard applies only to the final result.

Behavior

  • Query order preserved (no OR reordering) — diverges from RediSearch on nested/OR queries by design; matches on single-term, flat-AND, flat-OR.
  • Repeated terms (red red): no special case → gap 0 → guard → slop 1.

Tests

testing/scoring/slop_calculator_test.cc — 11 algorithm cases + 2 death tests, golden values hand-computed. All pass.

Build

New scoring lib + scoring_test target. Run: ./build.sh --run-tests=scoring_test.

Summary by CodeRabbit

  • New Features

    • Added scoring calculation system for improved document relevance ranking based on term proximity analysis.
  • Tests

    • Added comprehensive test suite for scoring functionality, covering single and multi-term scenarios, nested query grouping, and edge cases.

boda26 added 2 commits June 3, 2026 18:37
Signed-off-by: Miles Song <bodasong@amazon.com>
Signed-off-by: Miles Song <bodasong@amazon.com>
@coderabbitai

coderabbitai Bot commented Jun 5, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 24a6f4b3-5ce6-4b75-bccb-c30c8ca8e16d

📥 Commits

Reviewing files that changed from the base of the PR and between 2f27f41 and a66e816.

📒 Files selected for processing (5)
  • src/indexes/CMakeLists.txt
  • src/indexes/scoring/slop_calculator.cc
  • src/indexes/scoring/slop_calculator.h
  • testing/CMakeLists.txt
  • testing/scoring/slop_calculator_test.cc

📝 Walkthrough

Walkthrough

This PR introduces SlopCalculator, a new utility in the valkey_search::indexes::scoring namespace for computing per-document TFIDF slop scores. The implementation tracks term positions through nested query groups, collapses groups into unioned anchors, and returns a slop value computed from the floored Euclidean distance of gaps between consecutive anchors.

Changes

Slop Calculator Scoring Feature

Layer / File(s) Summary
Header contract and type definitions
src/indexes/scoring/slop_calculator.h
Declares SlopPosition type alias, SlopCalculator class with public API (OnTerm, EnterGroup, ExitGroup, Finalize), internal anchor and group-stack structures, and required standard/absl includes.
Core slop calculation logic
src/indexes/scoring/slop_calculator.cc
Implements helper functions (MinGap for two-pointer gap scanning, IntSqrt for safe integer square root), anchor/group management (EmitAnchor, OnTerm, EnterGroup, ExitGroup), and Finalize method that sorts anchors, computes accumulated squared gaps with uint64_t saturation, and returns floor-square-root of sum.
Build system integration
src/indexes/CMakeLists.txt
Registers scoring static library target with source files, public include directory, and linkage to vmsdklib.
Test suite and validation
testing/CMakeLists.txt, testing/scoring/slop_calculator_test.cc
Adds scoring_test executable; provides GoogleTest coverage for single terms, flat/nested AND/OR combinations, term ordering, repeated terms, gap-zero guard behavior, absent OR branches, and death tests validating group lifecycle invariants.

🎯 3 (Moderate) | ⏱️ ~20 minutes


🐰 A calculator hops through anchor lands,
Gap by gap, it understands,
Squared roots bloom where positions stand,
TFIDF slop, perfectly planned!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 69.57% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'TFIDF slop calculator implementation' directly and clearly summarizes the main change: introducing a new SlopCalculator component for TFIDF proximity penalty computation.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch tfidf-slop

Warning

Review ran into problems

🔥 Problems

Git: Failed to clone repository. Please run the @coderabbitai full review command to re-trigger a full review. If the issue persists, set path_filters to include or exclude specific files.


Comment @coderabbitai help to get the list of available commands and usage tips.

// (single-char tokens are dropped, so only the words below have positions)
// apple=0 red=1 blue=2 banana=3 yellow=4 green=5 grape=6 purple=7
// orange=8 cherry=9 pink=10 violet=11 one=12 two=13 three=14 four=15
// five=16 six=17

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feel like we should continue with listing the last few positions if we are going to list them at all

// rounding of std::sqrt under -ffast-math: seed from a double then correct.
// Comparisons use division rather than squaring so they cannot overflow even
// when n is near UINT64_MAX (where (x+1)*(x+1) would wrap).
uint32_t IntSqrt(uint64_t n) {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we add a test for this IntSqrt as this seems like something we had to update. The code seems correct but having a test on the overflow might be a good sanity check

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants