Original file line number Diff line number Diff line change
@@ -0,0 +1,25 @@
# Context: Update VectorStore Interface with URL Parameter

## Requirements
- Add `url string` parameter to `VectorStore.Search` interface method
- Add `URL string` field to `SearchCall` struct
- Update `MockVectorStore.Search` to accept and record the `url` parameter
- Package `vectorstore` must compile; downstream breakage is expected and handled by tasks 02/03

## Key Files
- `backend/internal/vectorstore/store.go` — interface, mock, and call recording types (PRIMARY TARGET)
- `backend/internal/vectorstore/store_test.go` — mock tests (must be updated)
- `backend/internal/vectorstore/qdrant.go:135` — QdrantStore.Search (task 02)
- `backend/internal/vectorstore/qdrant_test.go` — Qdrant tests (task 02)
- `backend/internal/rag/pipeline.go:100` — RAG caller (task 03)

## Patterns
- Interface + mock live in same file (`store.go`)
- Call recording structs capture all parameters for test assertions
- Mock returns preconfigured results/errors, records all calls

## Downstream Impact
Code that will break (handled by later tasks):
- `QdrantStore.Search` in qdrant.go (task 02)
- `pipeline.go:100` in rag package (task 03)
- All test files calling Search (tasks 02/03)
@@ -0,0 +1,13 @@
# Plan: task-01-update-vectorstore-interface

## Test Strategy
- Existing `store_test.go` tests updated to pass `url` parameter
- `TestMockSearch_RecordsCalls` extended with URL assertion to verify recording
- `TestVectorStoreInterfaceSatisfied` confirms mock still satisfies interface

## Implementation Plan
1. Add `url string` as 5th parameter to `VectorStore.Search` interface
2. Add `URL string` field to `SearchCall` struct
3. Update `MockVectorStore.Search` signature to accept and record `url`
4. Update all test calls with appropriate URL values
5. Verify compilation and tests pass
@@ -0,0 +1,22 @@
# Progress: task-01-update-vectorstore-interface

## Setup
- [x] Created documentation directory structure
- [x] Discovered instruction files (README.md, backend/README.md)
- [x] Read existing store.go and identified all callers

## Implementation Checklist
- [x] Update `VectorStore.Search` interface signature
- [x] Add `URL string` field to `SearchCall`
- [x] Update `MockVectorStore.Search` method signature and recording
- [x] Update `store_test.go` mock tests to pass new `url` parameter
- [x] Add URL assertion in `TestMockSearch_RecordsCalls`
- [x] Verify vectorstore package compiles
- [x] Verify vectorstore tests pass

## TDD Cycles
1. Updated interface + mock + tests simultaneously (single coherent change)
2. All 10 tests pass: `ok github.com/parth/smolterms/backend/internal/vectorstore 0.003s`

## Commit
_(pending)_
@@ -0,0 +1,26 @@
# Context: Add URL Filter to Qdrant Search

## Requirements
1. Update `QdrantStore.Search` signature to match new interface: add `url string` parameter
2. When `url` is non-empty, add a `Filter` to `QueryPoints` with `qdrant.NewMatchKeyword("url", url)`
3. When `url` is empty, no filter applied (backward-compatible)
4. Add `url` to the search log line
5. Update existing tests to pass `""` for url
6. Add new tests verifying filter presence/absence

## Key Files
- `backend/internal/vectorstore/qdrant.go:135` — `QdrantStore.Search` method
- `backend/internal/vectorstore/qdrant_test.go` — all Qdrant tests
- `backend/internal/vectorstore/store.go` — interface (already updated in task-01)

## Qdrant Filter API
- `qdrant.NewMatchKeyword(field, keyword string) *qdrant.Condition` — helper constructor
- `&qdrant.Filter{Must: []*qdrant.Condition{...}}` — wrap conditions
- `QueryPoints.Filter` field accepts `*qdrant.Filter`
- Mock records `queryCalls []*qdrant.QueryPoints` — tests can inspect `.Filter`
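
A rough sketch of the conditional-filter logic, assuming the shapes listed above. To keep it runnable without the Qdrant client, `Condition`, `Filter`, `QueryPoints`, and `newMatchKeyword` are local stand-ins for the `qdrant` package types and `qdrant.NewMatchKeyword`; `buildQuery` is a hypothetical helper name, and the real request carries more fields than shown.

```go
package main

import "fmt"

// Condition is a stand-in for *qdrant.Condition (keyword match only).
type Condition struct {
	Field   string
	Keyword string
}

// Filter is a stand-in for qdrant.Filter; only Must is modeled.
type Filter struct {
	Must []*Condition
}

// QueryPoints is a stand-in for qdrant.QueryPoints with a subset of fields.
type QueryPoints struct {
	CollectionName string
	Limit          uint64
	Filter         *Filter
}

// newMatchKeyword plays the role of qdrant.NewMatchKeyword(field, keyword).
func newMatchKeyword(field, keyword string) *Condition {
	return &Condition{Field: field, Keyword: keyword}
}

// buildQuery attaches a url filter only when url is non-empty, so an empty
// url keeps the previous "search everything" behavior.
func buildQuery(collection string, limit int, url string) *QueryPoints {
	q := &QueryPoints{CollectionName: collection, Limit: uint64(limit)}
	if url != "" {
		q.Filter = &Filter{Must: []*Condition{newMatchKeyword("url", url)}}
	}
	return q
}

func main() {
	withURL := buildQuery("policies", 20, "https://example.com/privacy")
	without := buildQuery("policies", 20, "")
	fmt.Println(withURL.Filter != nil, without.Filter == nil)
}
```

Leaving `Filter` as `nil` instead of an empty filter makes the "no filter" case easy to assert on in tests that inspect the recorded `QueryPoints`.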

## Test Strategy
- Existing tests: add `""` as url parameter (no filter applied)
- New test: `TestQdrantStore_Search_WithURLFilter` — verify filter present in QueryPoints
- New test: `TestQdrantStore_Search_WithoutURLFilter` — verify filter nil in QueryPoints
- Log test: verify `url` attribute in search log entry
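
One way the log assertion might look, as a sketch only: it captures `log/slog` output in a buffer through the standard JSON handler and reads back the `url` attribute. `logSearch` and `capturedAttr` are hypothetical helpers, and the message text and the other attributes are assumptions, not the store's actual log line.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log/slog"
)

// logSearch is a hypothetical stand-in for the log call inside Search.
func logSearch(logger *slog.Logger, url string, resultCount int) {
	logger.Info("searched chunks",
		slog.String("operation", "search"),
		slog.String("url", url),
		slog.Int("result_count", resultCount),
	)
}

// capturedAttr logs once into a buffer and returns the named top-level
// attribute from the resulting JSON record.
func capturedAttr(url, key string) (any, bool) {
	var buf bytes.Buffer
	logger := slog.New(slog.NewJSONHandler(&buf, nil))
	logSearch(logger, url, 3)

	var entry map[string]any
	if err := json.Unmarshal(buf.Bytes(), &entry); err != nil {
		return nil, false
	}
	v, ok := entry[key]
	return v, ok
}

func main() {
	v, ok := capturedAttr("https://example.com/privacy", "url")
	fmt.Println(v, ok)
}
```

Decoding the JSON record avoids brittle substring matching on the log line and also distinguishes a missing attribute from one that is present but empty.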
@@ -0,0 +1,24 @@
# Progress: task-02-add-url-filter-to-qdrant-search

## Setup
- [x] Created documentation directory
- [x] Read qdrant.go and qdrant_test.go
- [x] Researched Qdrant Go client filter API

## Implementation Checklist
- [x] Update `QdrantStore.Search` signature with `url string` parameter
- [x] Add conditional filter to `QueryPoints` when url is non-empty
- [x] Add `url` to search log line
- [x] Update existing qdrant tests to pass `""` for url (7 call sites)
- [x] Add test: `TestQdrantStore_Search_WithURLFilter` — verifies filter structure
- [x] Add test: `TestQdrantStore_Search_WithoutURLFilter` — verifies nil filter
- [x] Update log test to verify url attribute
- [x] All 36 tests pass

## TDD Cycles
1. Updated implementation and tests together (interface change + filter logic + tests)
2. All tests pass on first run: `ok github.com/parth/smolterms/backend/internal/vectorstore 0.004s`

## Commit
- Hash: `a8489f0`
- Message: `feat(vectorstore): add URL-based filtering to Qdrant search`
@@ -0,0 +1,24 @@
# Task 03: Thread URL Through RAG and Analyzer — Progress

## Status: Complete

## Changes Made

### 1. `backend/internal/rag/pipeline.go`
- Updated `Pipeline.Retrieve` signature: added `url string` parameter
- Forwarded `url` to `p.store.Search(ctx, p.collection, vectors[0], limit, url)`
- Added `slog.String("url", url)` to the retrieve log line

### 2. `backend/internal/rag/pipeline_test.go`
- Updated all 6 `Retrieve` calls to include the `url` parameter
- Added `sc.URL` assertion in `TestRetrieve_EmbedsAndSearches` to verify URL forwarding
- Added `"url"` to the expected log attrs in `TestRetrieve_LogsOperationFields`

### 3. `backend/internal/analyzer/analyzer.go`
- Updated Stage 6 (RAG Retrieve) to pass `req.URL` as the 4th arg to `Retrieve`

### 4. `backend/internal/analyzer/analyze_pipeline_test.go`
- Added `searchCall.URL` assertion in `TestAnalyze_CorrectDependencyCalls` to verify `req.URL` reaches the vector store search

## Test Results
- All backend tests pass: `go test ./backend/...` — all packages OK
@@ -0,0 +1,54 @@
# Task: Update VectorStore Interface and Mock with URL Parameter

## Description
Add a `url string` parameter to the `VectorStore.Search` method signature, the `SearchCall` recording struct, and the `MockVectorStore` implementation. This is the foundational change that all downstream packages depend on.

## Background
All embeddings from all analyzed websites are currently stored in a single Qdrant collection. The `Search` method performs pure vector similarity across the entire collection with no filtering. Since most privacy policies use similar language, this causes cross-contamination — analyzing site-b.com can pull back chunks from site-a.com. The `url` field is already stored as payload metadata and has a keyword index in Qdrant, but is never used during search.

This task adds the `url` parameter to the interface so that the Qdrant implementation can apply URL-based filtering and the mock can record the URL for test assertions.
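
To make the cross-contamination concrete, here is a toy in-memory search, not the Qdrant-backed implementation. The `chunk` type, the two-dimensional vectors, and the dot-product ranking are illustrative assumptions; the point is only that near-identical policy language ranks across sites unless candidates are restricted by URL.

```go
package main

import (
	"fmt"
	"sort"
)

// chunk is a toy record: a stored vector plus the url payload it came from.
type chunk struct {
	URL    string
	Text   string
	Vector []float32
}

// dot returns the dot product of two equal-length vectors.
func dot(a, b []float32) float32 {
	var s float32
	for i := range a {
		s += a[i] * b[i]
	}
	return s
}

// search ranks chunks by dot-product similarity to query. A non-empty url
// restricts candidates to chunks stored under exactly that URL.
func search(chunks []chunk, query []float32, limit int, url string) []chunk {
	var candidates []chunk
	for _, c := range chunks {
		if url == "" || c.URL == url {
			candidates = append(candidates, c)
		}
	}
	sort.SliceStable(candidates, func(i, j int) bool {
		return dot(candidates[i].Vector, query) > dot(candidates[j].Vector, query)
	})
	if len(candidates) > limit {
		candidates = candidates[:limit]
	}
	return candidates
}

func main() {
	chunks := []chunk{
		{URL: "https://site-a.com/privacy", Text: "We share data with partners.", Vector: []float32{0.9, 0.1}},
		{URL: "https://site-b.com/privacy", Text: "We may share data with affiliates.", Vector: []float32{0.8, 0.2}},
	}
	query := []float32{1, 0}

	// Unfiltered: the best match comes from site-a even if site-b is being analyzed.
	fmt.Println(search(chunks, query, 1, "")[0].URL)
	// Filtered: only site-b chunks are candidates.
	fmt.Println(search(chunks, query, 1, "https://site-b.com/privacy")[0].URL)
}
```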

## Technical Requirements
1. Add `url string` parameter to `VectorStore.Search` interface method
2. Add `URL string` field to `SearchCall` struct for test assertion
3. Update `MockVectorStore.Search` to accept and record the `url` parameter
4. Tests inside the `vectorstore` package that call the mock must be updated so the package still compiles; callers in downstream packages are updated in tasks 02 and 03

## Dependencies
- `backend/internal/vectorstore/store.go` — the file being modified
- No external dependencies

## Implementation Approach
1. Read the current `store.go` to understand the interface, mock, and call recording types
2. Add `url string` to the `Search` method in the `VectorStore` interface
3. Add `URL string` field to `SearchCall`
4. Update `MockVectorStore.Search` signature and recording logic
5. Run `go build ./backend/internal/vectorstore/...` to verify the package compiles
6. Note: downstream packages (qdrant, rag, analyzer) will fail to compile until tasks 02 and 03 are completed

## Acceptance Criteria

1. **Interface Updated**
- Given the `VectorStore` interface in `store.go`
- When a developer reads the `Search` method signature
- Then it includes `url string` as the fifth parameter: `Search(ctx context.Context, collectionID string, query []float32, limit int, url string) ([]Chunk, error)`

2. **SearchCall Records URL**
- Given a `SearchCall` struct
- When a test inspects recorded calls
- Then the `URL` field contains the URL passed to `Search`

3. **Mock Implementation Updated**
- Given the `MockVectorStore`
- When `Search` is called with a URL
- Then the URL is recorded in `SearchCalls` and behavior is otherwise unchanged

4. **Package Compiles**
- Given the updated `store.go`
- When running `go build ./backend/internal/vectorstore/...`
- Then compilation succeeds with no errors

## Metadata
- **Complexity**: Low
- **Labels**: VectorStore, Interface, Mock, Refactor
- **Required Skills**: Go interfaces, test doubles
@@ -0,0 +1,64 @@
# Task: Add URL Filter to Qdrant Search Implementation

## Description
Update the `QdrantStore.Search` method to accept the new `url string` parameter and apply a Qdrant payload filter that restricts results to chunks matching the given URL. When the URL is empty, no filter is applied (preserving backward-compatible behavior).

## Background
The Qdrant collection already has a keyword index on the `url` field (created in `ensureCollection`), but the `Search` method never uses it. This task wires up a `FieldCondition` filter on `url` in the `QueryPoints` request so that vector similarity search is scoped to a single website's chunks.

## Technical Requirements
1. Update `QdrantStore.Search` signature to match the new `VectorStore` interface: `Search(ctx context.Context, collectionID string, query []float32, limit int, url string) ([]Chunk, error)`
2. When `url` is non-empty, add a `Filter` to `QueryPoints` with a `FieldCondition` matching `url` exactly (keyword match)
3. When `url` is empty, do not add any filter (search across all chunks)
4. Add the `url` value to the search log line for observability
5. Update existing Qdrant tests to pass the new parameter
6. Add new test cases verifying filter is applied when URL is provided and omitted when empty

## Dependencies
- Task 01 must be completed first (interface change)
- `backend/internal/vectorstore/qdrant.go` — implementation file
- `backend/internal/vectorstore/qdrant_test.go` — test file
- Qdrant Go client `qdrant` package for filter types

## Implementation Approach
1. Read `qdrant.go` and `qdrant_test.go` to understand current implementation and test patterns
2. Update `Search` method signature to include `url string`
3. Build the Qdrant filter using `qdrant.Filter` with a `Must` condition containing a `FieldCondition` on the `url` field with a `Match` of type `MatchKeyword`
4. Conditionally attach the filter to `QueryPoints` only when `url != ""`
5. Add `url` to the existing slog line in Search
6. Update all existing test cases to pass `""` for `url` (preserving current behavior)
7. Add test: `TestQdrantStore_Search_WithURLFilter` — verify the `QueryPoints` sent to mock client includes the filter when URL is provided
8. Add test: `TestQdrantStore_Search_WithoutURLFilter` — verify no filter when URL is empty
9. Run `go test ./backend/internal/vectorstore/...` to verify all tests pass

## Acceptance Criteria

1. **Filter Applied When URL Provided**
- Given a `QdrantStore` with a mock client
- When `Search` is called with `url = "https://example.com/privacy"`
- Then the `QueryPoints` sent to the Qdrant client includes a `Filter` with a `Must` condition matching `url` = `"https://example.com/privacy"`

2. **No Filter When URL Empty**
- Given a `QdrantStore` with a mock client
- When `Search` is called with `url = ""`
- Then the `QueryPoints` sent to the Qdrant client has `nil` or empty `Filter`

3. **Results Unchanged for Matching Chunks**
- Given chunks stored with URL "https://example.com/privacy"
- When `Search` is called with that same URL
- Then matching chunks are returned with correct scores and metadata

4. **URL Logged**
- Given a `Search` call with a URL
- When the search completes
- Then the log line includes the URL value

5. **Existing Tests Pass**
- Given the updated implementation
- When running `go test ./backend/internal/vectorstore/...`
- Then all tests pass including new URL filter tests

## Metadata
- **Complexity**: Medium
- **Labels**: VectorStore, Qdrant, Filter, gRPC
- **Required Skills**: Go, Qdrant gRPC client API, payload filtering
@@ -0,0 +1,70 @@
# Task: Thread URL Through RAG Pipeline and Analyzer

## Description
Update the RAG `Pipeline.Retrieve` method to accept and forward the `url` parameter to `VectorStore.Search`, then update `Analyzer.Analyze` to pass `req.URL` through the retrieval call. Update all affected tests across both packages.

## Background
With the `VectorStore.Search` interface now accepting a `url` parameter (tasks 01-02), the RAG pipeline and analyzer need to thread the URL through so that retrieval is scoped to the correct website. Currently `Pipeline.Retrieve` takes only `query` and `limit`, and `Analyzer.Analyze` calls `Retrieve` without any URL context.

## Technical Requirements
1. Update `Pipeline.Retrieve` signature to accept `url string`: `Retrieve(ctx context.Context, query string, limit int, url string) ([]vectorstore.Chunk, error)`
2. Pass `url` through to `p.store.Search(ctx, p.collection, vectors[0], limit, url)`
3. Add `url` to the retrieve log line for observability
4. Update `Analyzer.Analyze` stage 6 to pass `req.URL` to `Retrieve`: `a.rag.Retrieve(ctx, broadRetrievalQuery, retrievalLimit, req.URL)`
5. Update all RAG pipeline tests to pass URL parameter and assert it's forwarded correctly
6. Update all analyzer pipeline tests to verify `req.URL` reaches the `Search` call

## Dependencies
- Tasks 01 and 02 must be completed first
- `backend/internal/rag/pipeline.go` — RAG pipeline
- `backend/internal/rag/pipeline_test.go` — RAG tests
- `backend/internal/analyzer/analyzer.go` — analyzer pipeline
- `backend/internal/analyzer/analyze_pipeline_test.go` — analyzer tests
- Any other files that call `Pipeline.Retrieve` or `MockVectorStore.Search`

## Implementation Approach
1. Read `pipeline.go`, `pipeline_test.go`, `analyzer.go`, and `analyze_pipeline_test.go`
2. Update `Pipeline.Retrieve` to accept `url string` and forward it to `store.Search`
3. Add `slog.String("url", url)` to the retrieve log line
4. Update `Analyzer.Analyze` line 148 to pass `req.URL` as the fourth argument to `Retrieve`
5. Update RAG pipeline tests:
- All existing `Retrieve` calls need the URL parameter added
- Add test case verifying URL is passed through to `MockVectorStore.Search`
- Assert `SearchCall.URL` matches expected value
6. Update analyzer pipeline tests:
- Verify the mock's `SearchCalls[0].URL` equals `req.URL` in the happy-path test
- Update any other test cases that call through the pipeline
7. Search for any other callers of `Retrieve` across the codebase and update them
8. Run full test suite: `go test ./backend/...`

## Acceptance Criteria

1. **RAG Retrieve Forwards URL**
- Given a RAG pipeline with a mock vector store
- When `Retrieve` is called with `url = "https://example.com/privacy"`
- Then `MockVectorStore.Search` is called with that same URL

2. **Analyzer Passes req.URL**
- Given an analyzer processing a request with `URL: "https://example.com/privacy"`
- When the pipeline reaches the RAG retrieve stage
- Then the vector store search is filtered to `"https://example.com/privacy"`

3. **URL Logged in Retrieve**
- Given a `Retrieve` call with a URL
- When retrieval completes
- Then the log line includes the URL

4. **All Tests Pass**
- Given the updated code across all packages
- When running `go test ./backend/...`
- Then all tests pass with no compilation errors

5. **No Cross-Contamination Path**
- Given the full pipeline from API request to vector search
- When tracing the URL parameter
- Then `req.URL` flows through `Analyzer.Analyze` → `Pipeline.Retrieve` → `VectorStore.Search` without being lost or defaulted

## Metadata
- **Complexity**: Medium
- **Labels**: RAG, Analyzer, Integration, Pipeline
- **Required Skills**: Go, interface threading, test mocks, pipeline architecture
8 changes: 7 additions & 1 deletion backend/internal/analyzer/analyze_pipeline_test.go
@@ -490,11 +490,17 @@ func TestAnalyze_CorrectDependencyCalls(t *testing.T) {
 		}
 	}

-	// Verify the search limit was 20
+	// Verify the search limit, URL filter, and content hash filter
 	if len(deps.store.SearchCalls) > 0 {
 		searchCall := deps.store.SearchCalls[0]
 		if searchCall.Limit != retrievalLimit {
 			t.Errorf("Search limit = %d, want %d", searchCall.Limit, retrievalLimit)
 		}
+		if searchCall.URL != req.URL {
+			t.Errorf("Search URL = %q, want %q", searchCall.URL, req.URL)
+		}
+		if searchCall.ContentHash == "" {
+			t.Error("Search ContentHash is empty, want non-empty (should match computed content hash)")
+		}
 	}
 }
2 changes: 1 addition & 1 deletion backend/internal/analyzer/analyzer.go
@@ -145,7 +145,7 @@ func (a *Analyzer) Analyze(ctx context.Context, req AnalysisRequest) (*AnalysisR

 	// Stage 6: RAG Retrieve
 	start = time.Now()
-	retrieved, err := a.rag.Retrieve(ctx, broadRetrievalQuery, retrievalLimit)
+	retrieved, err := a.rag.Retrieve(ctx, broadRetrievalQuery, retrievalLimit, req.URL, contentHash)
 	if err != nil {
 		return nil, fmt.Errorf("analyze: rag retrieve: %w", err)
 	}
6 changes: 4 additions & 2 deletions backend/internal/rag/pipeline.go
@@ -84,7 +84,7 @@ func (p *Pipeline) Store(ctx context.Context, url string, contentHash string, ch

 // Retrieve embeds the query, searches the vector store, and returns
 // deduplicated results.
-func (p *Pipeline) Retrieve(ctx context.Context, query string, limit int) ([]vectorstore.Chunk, error) {
+func (p *Pipeline) Retrieve(ctx context.Context, query string, limit int, url string, contentHash string) ([]vectorstore.Chunk, error) {
 	if err := ctx.Err(); err != nil {
 		return nil, fmt.Errorf("rag: retrieve: %w", err)
 	}
@@ -97,7 +97,7 @@ func (p *Pipeline) Retrieve(ctx context.Context, query string, limit int) ([]vec
 	embedLatency := time.Since(embedStart)

 	searchStart := time.Now()
-	results, err := p.store.Search(ctx, p.collection, vectors[0], limit)
+	results, err := p.store.Search(ctx, p.collection, vectors[0], limit, url, contentHash)
 	if err != nil {
 		return nil, fmt.Errorf("rag: search: %w", err)
 	}
@@ -113,6 +113,8 @@ func (p *Pipeline) Retrieve(ctx context.Context, query string, limit int) ([]vec
 	p.logger.Info("retrieved chunks",
 		slog.String("operation", "retrieve"),
 		slog.String("query", truncatedQuery),
+		slog.String("url", url),
+		slog.String("content_hash", contentHash),
 		slog.Int("result_count", len(deduped)),
 		slog.Duration("embed_latency", embedLatency),
 		slog.Duration("search_latency", searchLatency),