Story: HNSW/FTS/Payload Cache Invalidation (Change Detection + TTL)
Part of: #408
[Conversation Reference: "In memory for performance. We need invalidation signaling. Before using an index, maybe a quick check on a signal file... Or simply we check for a new versioned folder" and "HNSW/FTS/Payload caches stay local per instance (TTL-based) <- AND cache invalidation based on change detection too"]
Story Overview
Objective: Implement a dual-strategy cache invalidation system for the per-node HNSW, FTS, and payload caches. Strategy 1: change detection via alias JSON mtime/content check provides immediate invalidation when a repo is refreshed. Strategy 2: TTL-based expiry provides background cleanup for edge cases. Both strategies ensure that after a golden repo refresh (which changes the versioned path), all nodes eventually serve queries from the new index data.
User Value: After a repo refresh, query results reflect the updated code within seconds on all cluster nodes, without requiring manual cache clearing or node restarts. No stale results served.
CRITICAL DEPENDENCY: Cache invalidation latency is bounded by the NFS attribute cache timeout (actimeo). Story #419 sets actimeo=3, meaning mtime-based change detection has worst-case 3-second delay. If actimeo is not set (default 60s), invalidation is delayed by up to 60 seconds.
Acceptance Criteria
AC1: Alias JSON Change Detection (Immediate Invalidation)
Scenario: A repo refresh creates a new versioned path, and queries detect the change.
Given Node A caches HNSW/FTS indexes for repo "my-repo" pointing to .versioned/my-repo/v_1000/
When the leader node refreshes "my-repo" and alias JSON target_path changes to .versioned/my-repo/v_2000/
Then Node A's next query for "my-repo" detects the alias JSON has changed
And Node A evicts the cached HNSW/FTS/payload for "my-repo"
And Node A loads fresh indexes from .versioned/my-repo/v_2000/
And the query returns results from the updated index
Technical Requirements:
- CacheInvalidationManager in src/code_indexer/server/services/cache_invalidation_manager.py
- On each query: check alias JSON file mtime before using cached indexes
- If mtime changed: re-read alias JSON, compare target_path with cached path
- If target_path changed: evict all caches for that repo, reload from new path
- mtime check is cheap (~0.1ms stat() call on local, ~1-5ms on NFS) -- acceptable per-query overhead
- stat() calls are per-queried-repo only (not all repos on every query) to keep overhead at ~1-5ms per query on NFS
- AliasManager.read_alias() already reads the alias per query -- leverage this
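The per-query check above can be sketched as follows. This is a minimal illustration, not the actual implementation: the constructor signature, the `_known` bookkeeping dict, and passing the alias JSON path explicitly to `is_stale()` are all assumptions; only the class name and the stat-then-compare flow come from the story.

```python
import os


class CacheInvalidationManager:
    """Sketch of AC1: cheap alias JSON mtime check, then target_path comparison."""

    def __init__(self, alias_manager):
        self._alias_manager = alias_manager   # assumed: read_alias(alias) -> dict
        self._known = {}                      # alias -> (alias_json_mtime, target_path)

    def is_stale(self, repo_alias: str, alias_json_path: str) -> bool:
        mtime = os.stat(alias_json_path).st_mtime       # cheap stat() first
        known = self._known.get(repo_alias)
        if known is not None and known[0] == mtime:
            return False                                # mtime unchanged: keep cache
        alias = self._alias_manager.read_alias(repo_alias)
        target = alias["target_path"]
        if known is not None and known[1] == target:
            self._known[repo_alias] = (mtime, target)   # mtime moved, path did not
            return False
        self._known[repo_alias] = (mtime, target)
        return known is not None    # first sighting is a cold load, not staleness
```

The alias JSON is re-read only when the mtime actually moved, so the steady-state cost per query is a single stat() call.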
AC2: Mtime-Based Index Staleness Detection
Scenario: Index files on disk are newer than the cached version.
Given Node A has cached HNSW index loaded from disk at time T1
When another node or the leader re-indexes and writes new index files at time T2
Then Node A's cache invalidation detects the index file mtime is newer than T1
And Node A reloads the HNSW index from disk
Technical Requirements:
- Track mtime of loaded index files (HNSW .hnsw files, FTS tantivy dir, payload JSON)
- On query: stat the index directory/file mtime
- If disk mtime > cached load time: reload index
- Stat the index directory (not individual files) for efficiency
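A minimal version of that staleness test, assuming the cache records the directory mtime captured at load time (the function name is illustrative):

```python
import os


def index_needs_reload(index_dir: str, loaded_dir_mtime: float) -> bool:
    """One stat() on the index directory, not on each file inside it (AC2).

    Returns True when the directory has been rewritten since the index was loaded.
    """
    return os.stat(index_dir).st_mtime > loaded_dir_mtime
```

Comparing against the directory mtime recorded at load time (rather than wall-clock load time) avoids false reloads when node clocks and the NFS server clock disagree.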
AC3: TTL-Based Background Cleanup
Scenario: Caches expire after a configurable TTL as a safety net.
Given a cache TTL of 300 seconds (configurable)
When an HNSW/FTS/payload cache entry has not been revalidated in 300 seconds
Then it is evicted from memory
And the next query triggers a fresh load from disk
Technical Requirements:
- Configurable TTL: cache_ttl_seconds in config.json (default 300)
- Background sweep thread: runs every 60 seconds, evicts expired entries
- TTL is a SAFETY NET, not the primary invalidation mechanism
- Change detection (AC1/AC2) provides immediate invalidation; TTL handles edge cases
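The background sweep could look like the sketch below. The class name, the shape of the `last_validated` dict, and the daemon-thread loop are assumptions; the 300s TTL and 60s sweep interval are from the story.

```python
import threading
import time


class TTLSweeper:
    """Safety-net sweep (AC3): evict entries whose last validation exceeds the TTL."""

    def __init__(self, last_validated: dict, ttl_seconds: float = 300.0,
                 sweep_interval: float = 60.0):
        self._last_validated = last_validated  # alias -> last_validated_at timestamp
        self._ttl = ttl_seconds
        self._interval = sweep_interval
        self._stop = threading.Event()

    def sweep_once(self, now=None) -> list:
        now = time.time() if now is None else now
        expired = [a for a, ts in self._last_validated.items() if now - ts > self._ttl]
        for alias in expired:
            del self._last_validated[alias]    # evict; next query reloads from disk
        return expired

    def start(self) -> None:
        def loop():
            while not self._stop.wait(self._interval):  # wake every sweep_interval
                self.sweep_once()
        threading.Thread(target=loop, daemon=True, name="cache-ttl-sweep").start()

    def stop(self) -> None:
        self._stop.set()
```

Using `Event.wait(interval)` instead of `time.sleep()` lets `stop()` interrupt the sweep loop immediately during shutdown.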
AC4: Cache Entry Metadata Tracking
Scenario: Each cache entry tracks when it was loaded and from which path.
Given an HNSW index is cached for repo "my-repo"
When the cache entry is inspected
Then it records: repo_alias, source_path, loaded_at, last_validated_at, alias_json_mtime
And this metadata enables both change detection and TTL expiry
Technical Requirements:
- CacheEntry dataclass: repo_alias, source_path, loaded_at, last_validated_at, alias_json_mtime, index_dir_mtime
- Updated on each validation check (even if cache is still valid)
- Available for diagnostics/health endpoint reporting
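The dataclass might be sketched as follows; the field names come from the acceptance criteria, while the `mark_validated` helper is an illustrative assumption:

```python
import time
from dataclasses import dataclass, field


@dataclass
class CacheEntry:
    """Per-entry metadata (AC4); field names follow the acceptance criteria."""
    repo_alias: str
    source_path: str          # versioned path the index was loaded from
    alias_json_mtime: float
    index_dir_mtime: float
    loaded_at: float = field(default_factory=time.time)
    last_validated_at: float = field(default_factory=time.time)

    def mark_validated(self, alias_json_mtime: float, index_dir_mtime: float) -> None:
        # Updated on every validation check, even when the cache is still valid.
        self.alias_json_mtime = alias_json_mtime
        self.index_dir_mtime = index_dir_mtime
        self.last_validated_at = time.time()
```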
AC5: Concurrent Reload Protection
Scenario: Multiple queries trigger cache invalidation for the same repo simultaneously.
Given HNSW index reload is in progress for repo "my-repo"
When subsequent queries also detect that "my-repo" cache is stale
Then subsequent invalidation detections for the same repo queue behind the in-progress reload
And only one reload occurs (not multiple concurrent reloads)
And queued queries wait for the reload to complete and then use the fresh cache
Technical Requirements:
- Per-repo reload lock (threading.Lock per alias) prevents concurrent reloads
- Queries that arrive during reload wait for completion rather than spawning parallel reloads
- Lock is lightweight and does not block queries to other repos
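One way to implement the per-alias lock is a lazily populated lock registry with a double-check under the lock; the class and method names here are hypothetical:

```python
import threading


class ReloadGuard:
    """Per-repo reload serialization (AC5): one lock per alias, lazily created."""

    def __init__(self):
        self._registry_lock = threading.Lock()
        self._locks = {}

    def _lock_for(self, repo_alias: str) -> threading.Lock:
        # Registry lock is held only for a dict lookup, so repos never
        # block each other's reloads.
        with self._registry_lock:
            return self._locks.setdefault(repo_alias, threading.Lock())

    def reload_if_stale(self, repo_alias, is_stale, reload) -> None:
        with self._lock_for(repo_alias):
            # Double-check under the lock: a query that queued behind an
            # in-progress reload usually finds the cache already fresh.
            if is_stale(repo_alias):
                reload(repo_alias)
```

The re-check inside the critical section is what guarantees "only one reload occurs": every waiter re-evaluates staleness after the winner's reload completes.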
AC6: Standalone Mode Compatibility
Scenario: Cache invalidation works in standalone mode too.
Given the server is running in standalone (SQLite) mode
When a repo is refreshed and versioned path changes
Then the same change detection mechanism invalidates caches
And the TTL mechanism also works
And no cluster-specific infrastructure is required
Technical Requirements:
- CacheInvalidationManager works regardless of storage_mode
- mtime checks work on local filesystem (standalone) and NFS (cluster)
- No PostgreSQL dependency for cache invalidation
- Existing cache behavior preserved in standalone mode (additive, not replacement)
Implementation Status
- Core implementation complete
- Unit tests passing
- Integration tests passing
- E2E tests passing
- Code review approved
- Manual E2E testing completed
- Documentation updated
Technical Implementation Details
Invalidation Decision Flow (Per Query)
query arrives for repo "my-repo"
|
v
stat alias JSON mtime (for queried repo only, not all repos)
|
+-- mtime unchanged? --> use cached indexes
|
+-- mtime changed? --> re-read alias JSON
|
+-- target_path same? --> update mtime record, use cache
|
+-- target_path changed? --> acquire per-repo reload lock
|
+-- EVICT, reload from new path
|
+-- release lock, serve query
File Structure
src/code_indexer/server/services/
cache_invalidation_manager.py # CacheInvalidationManager
Integration Points
The cache invalidation manager integrates with existing caching layers:
- HNSW cache: SemanticQueryManager caches HNSWIndex objects in memory
- FTS cache: TantivyIndexManager caches tantivy index objects in memory
- Payload cache: PayloadStore caches document payloads in memory
Each cache consumer needs to check with CacheInvalidationManager before using cached data:
class SemanticQueryManager:
    def query(self, repo_alias: str, query_text: str, ...):
        # Validate before using cached indexes; reload only when stale.
        if self._cache_invalidation.is_stale(repo_alias):
            self._evict_cache(repo_alias)
            self._load_cache(repo_alias)
        # proceed with cached index
NFS Considerations
- stat() calls on NFS have ~1-5ms latency (vs ~0.1ms local)
- Acceptable for per-query overhead (query itself takes 50-500ms)
- NFS attribute caching (actimeo) MUST be set to 3 seconds (Story #419, ONTAP FSx NFS Mount Management and Validation) -- this bounds worst-case invalidation delay
- If actimeo is not set (default 60s), mtime-based detection is delayed by up to 60 seconds
- stat() is performed only for the queried repo, not all repos, keeping overhead minimal
Cache Metrics (for diagnostics)
- Total cache entries per repo
- Cache hit rate
- Cache evictions (change-detected vs TTL-expired)
- Average time since last validation
- Available via health/diagnostics endpoint
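A simple shape for these counters, usable from a health endpoint; all field names here are assumptions for illustration, not the implemented API:

```python
from dataclasses import dataclass, field


@dataclass
class CacheMetrics:
    """Illustrative diagnostics counters; distinguishes eviction causes."""
    entries_per_repo: dict = field(default_factory=dict)
    hits: int = 0
    misses: int = 0
    evictions_change_detected: int = 0
    evictions_ttl_expired: int = 0

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```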
Testing Requirements
- Automated: Cache detects alias JSON mtime change and evicts.
- Automated: Cache detects target_path change and reloads from new path.
- Automated: TTL expiry evicts old cache entries.
- Automated: Unchanged alias JSON does NOT trigger eviction (performance).
- Automated: CacheEntry metadata is correctly tracked.
- Automated: Concurrent reload protection -- only one reload per repo at a time.
- Manual E2E: In cluster mode, refresh a golden repo on the leader, then immediately query on a follower node. Verify the follower returns results from the updated index (not stale cache).
Definition of Done
- Alias JSON mtime change detection triggers cache eviction
- Index directory mtime detection triggers reload
- TTL-based background cleanup works as safety net
- CacheEntry metadata tracked for all cached indexes
- Concurrent reload protection (one reload per repo at a time)
- Per-query overhead under 5ms (stat calls on queried repo only)
- Works in both standalone and cluster modes
- All tests pass