fix: dedupe duplicate observations in retrieval output #673
Remove exact normalized duplicate observations when formatting representation and reasoning-chain tool output without mutating stored documents or source links. Add regression coverage for per-level retrieval-time dedupe.
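A minimal sketch of what per-level, retrieval-time dedupe could look like. It assumes that "normalized" means case-insensitive with whitespace collapsed, as the review walkthrough describes. The `Observation` type, its fields, the helper names `normalize` and `dedupe_for_output`, and the level names used in the usage note are hypothetical and are not taken from this PR's code.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """Hypothetical stand-in for a retrieved observation document."""
    content: str
    level: str
    source_id: str


def normalize(text: str) -> str:
    # Collapse runs of whitespace, trim the ends, and compare case-insensitively.
    return re.sub(r"\s+", " ", text).strip().casefold()


def dedupe_for_output(observations: list[Observation]) -> list[Observation]:
    """Return a new list without exact normalized duplicates at the same level.

    The first occurrence in input order is kept, so ranking stays stable.
    The input list and its items are not modified.
    """
    seen: set[tuple[str, str]] = set()
    unique: list[Observation] = []
    for obs in observations:
        key = (obs.level, normalize(obs.content))
        if key in seen:
            continue  # Same level and same normalized text: skip in the output only.
        seen.add(key)
        unique.append(obs)
    return unique
```

Given three observations where the second differs from the first only in case and spacing and the third repeats the text at a different level, this sketch keeps the first and third and leaves the input list untouched.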
Walkthrough
This PR adds deduplication logic to remove duplicate observation documents across document retrieval and tool output rendering. Deduplication is based on normalized content (case-insensitive, whitespace-collapsed) and level matching. The core helpers are applied in
Changes
Document Deduplication Feature
Estimated code review effort: 2 (Simple), ~10 minutes
Pre-merge checks: 4 passed, 1 failed (1 warning)
Summary
Safety
This is retrieval-time only. It does not delete, rewrite, or merge raw documents rows, vector-store data, source IDs, premise links, or session metadata. Ranking remains stable by keeping the first document in input order.
Test plan
python -m pytest tests/utils/test_representation_retrieval_dedupe.py -q
Summary by CodeRabbit
Bug Fixes
Tests