[STORY] HNSW FTS Payload Cache Invalidation #427

@jsbattig

Description

Part of: #408

Story: HNSW/FTS/Payload Cache Invalidation (Change Detection + TTL)

[Conversation Reference: "In memory for performance. We need invalidation signaling. Before using an index, maybe a quick check on a signal file... Or simply we check for a new versioned folder" and "HNSW/FTS/Payload caches stay local per instance (TTL-based) <- AND cache invalidation based on change detection too"]

Story Overview

Objective: Implement a dual-strategy cache invalidation system for the per-node HNSW, FTS, and payload caches. Strategy 1: change detection via alias JSON mtime/content check provides immediate invalidation when a repo is refreshed. Strategy 2: TTL-based expiry provides background cleanup for edge cases. Both strategies ensure that after a golden repo refresh (which changes the versioned path), all nodes eventually serve queries from the new index data.

User Value: After a repo refresh, query results reflect the updated code within seconds on all cluster nodes, without requiring manual cache clearing or node restarts. No stale results are served.

CRITICAL DEPENDENCY: Cache invalidation latency is bounded by the NFS attribute cache timeout (actimeo). Story #419 sets actimeo=3, meaning mtime-based change detection has worst-case 3-second delay. If actimeo is not set (default 60s), invalidation is delayed by up to 60 seconds.

Acceptance Criteria

AC1: Alias JSON Change Detection (Immediate Invalidation)

Scenario: A repo refresh creates a new versioned path, and queries detect the change.

Given Node A caches HNSW/FTS indexes for repo "my-repo" pointing to .versioned/my-repo/v_1000/
When the leader node refreshes "my-repo" and alias JSON target_path changes to .versioned/my-repo/v_2000/
Then Node A's next query for "my-repo" detects the alias JSON has changed
And Node A evicts the cached HNSW/FTS/payload for "my-repo"
And Node A loads fresh indexes from .versioned/my-repo/v_2000/
And the query returns results from the updated index

Technical Requirements:

  • CacheInvalidationManager in src/code_indexer/server/services/cache_invalidation_manager.py
  • On each query: check alias JSON file mtime before using cached indexes
  • If mtime changed: re-read alias JSON, compare target_path with cached path
  • If target_path changed: evict all caches for that repo, reload from new path
  • mtime check is cheap (~0.1ms stat() call on local disk, ~1-5ms on NFS) and is performed only for the queried repo, not all repos -- acceptable per-query overhead
  • AliasManager.read_alias() already reads per query -- leverage this
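
The AC1 checks above can be sketched as a small helper. This is an illustrative sketch, not the story's implementation: the function name `check_alias` and its return shape are assumptions; the story places this logic inside `CacheInvalidationManager`.

```python
import json
from pathlib import Path


def check_alias(alias_path: Path, cached_mtime: float, cached_target: str):
    """Return (target_changed, current_mtime, current_target).

    Cheap path: if the alias JSON mtime is unchanged, skip reading the file
    entirely -- a single stat() per queried repo.
    """
    current_mtime = alias_path.stat().st_mtime
    if current_mtime == cached_mtime:
        return False, cached_mtime, cached_target
    # mtime moved: re-read the alias JSON and compare the versioned target path.
    current_target = json.loads(alias_path.read_text())["target_path"]
    return current_target != cached_target, current_mtime, current_target
```

Note that when the mtime moved but `target_path` is unchanged (e.g. the file was rewritten with identical content), the helper returns `target_changed=False` with the new mtime, matching the "update mtime record, use cache" branch of the decision flow.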

AC2: Mtime-Based Index Staleness Detection

Scenario: Index files on disk are newer than the cached version.

Given Node A has cached HNSW index loaded from disk at time T1
When another node or the leader re-indexes and writes new index files at time T2
Then Node A's cache invalidation detects the index file mtime is newer than T1
And Node A reloads the HNSW index from disk

Technical Requirements:

  • Track mtime of loaded index files (HNSW .hnsw files, FTS tantivy dir, payload JSON)
  • On query: stat the index directory/file mtime
  • If disk mtime > cached load time: reload index
  • Stat the index directory (not individual files) for efficiency
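
A minimal sketch of the directory-level staleness check (the function name is an assumption for illustration). One caveat worth noting: a directory's mtime changes when entries are added, removed, or renamed -- which covers the atomic write-temp-then-rename pattern -- but not when a file's bytes are rewritten in place.

```python
import os


def index_dir_stale(index_dir: str, cached_dir_mtime: float) -> bool:
    """Single stat() on the index directory, not on each index file.

    A re-index that renames new .hnsw / tantivy / payload files into place
    bumps the directory mtime, signalling that a reload is needed.
    """
    return os.stat(index_dir).st_mtime > cached_dir_mtime
```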

AC3: TTL-Based Background Cleanup

Scenario: Caches expire after a configurable TTL as a safety net.

Given a cache TTL of 300 seconds (configurable)
When an HNSW/FTS/payload cache entry has not been revalidated in 300 seconds
Then it is evicted from memory
And the next query triggers a fresh load from disk

Technical Requirements:

  • Configurable TTL: cache_ttl_seconds in config.json (default 300)
  • Background sweep thread: runs every 60 seconds, evicts expired entries
  • TTL is a SAFETY NET, not the primary invalidation mechanism
  • Change detection (AC1/AC2) provides immediate invalidation; TTL handles edge cases
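
The TTL safety net might be sketched as follows; the class name `TTLSweeper` and the flat `alias -> last_validated_at` cache shape are assumptions for illustration, not the story's actual data model.

```python
import threading
import time
from typing import Optional


class TTLSweeper:
    """Background safety net: evict entries not revalidated within the TTL."""

    def __init__(self, cache: dict, ttl_seconds: float = 300.0,
                 sweep_interval: float = 60.0):
        self._cache = cache          # repo_alias -> last_validated_at timestamp
        self._ttl = ttl_seconds
        self._interval = sweep_interval
        self._stop = threading.Event()

    def sweep_once(self, now: Optional[float] = None) -> list:
        """Evict expired entries; returns the evicted aliases (testable hook)."""
        now = time.time() if now is None else now
        expired = [alias for alias, last_validated in list(self._cache.items())
                   if now - last_validated > self._ttl]
        for alias in expired:
            self._cache.pop(alias, None)
        return expired

    def start(self) -> None:
        def loop():
            # Event.wait doubles as an interruptible sleep for clean shutdown.
            while not self._stop.wait(self._interval):
                self.sweep_once()
        threading.Thread(target=loop, daemon=True).start()

    def stop(self) -> None:
        self._stop.set()
```

Change detection remains the primary mechanism; the sweep only catches entries that no query has revalidated in `ttl_seconds`.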

AC4: Cache Entry Metadata Tracking

Scenario: Each cache entry tracks when it was loaded and from which path.

Given an HNSW index is cached for repo "my-repo"
When the cache entry is inspected
Then it records: repo_alias, source_path, loaded_at, last_validated_at, alias_json_mtime
And this metadata enables both change detection and TTL expiry

Technical Requirements:

  • CacheEntry dataclass: repo_alias, source_path, loaded_at, last_validated_at, alias_json_mtime, index_dir_mtime
  • Updated on each validation check (even if cache is still valid)
  • Available for diagnostics/health endpoint reporting
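
The `CacheEntry` dataclass named above could look roughly like this; the helper methods (`mark_validated`, `expired`, `as_dict`) are illustrative additions, not requirements from the story.

```python
import time
from dataclasses import dataclass


@dataclass
class CacheEntry:
    repo_alias: str
    source_path: str          # versioned path the indexes were loaded from
    loaded_at: float          # when the indexes were read from disk
    last_validated_at: float  # updated on every check, even if still valid
    alias_json_mtime: float   # alias JSON mtime at load/validation time
    index_dir_mtime: float    # index directory mtime at load time

    def mark_validated(self) -> None:
        self.last_validated_at = time.time()

    def expired(self, ttl_seconds: float, now: float) -> bool:
        return now - self.last_validated_at > ttl_seconds

    def as_dict(self) -> dict:
        """Flat form suitable for a health/diagnostics endpoint payload."""
        return self.__dict__.copy()
```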

AC5: Concurrent Reload Protection

Scenario: Multiple queries trigger cache invalidation for the same repo simultaneously.

Given HNSW index reload is in progress for repo "my-repo"
When subsequent queries also detect that "my-repo" cache is stale
Then subsequent invalidation detections for the same repo queue behind the in-progress reload
And only one reload occurs (not multiple concurrent reloads)
And queued queries wait for the reload to complete and then use the fresh cache

Technical Requirements:

  • Per-repo reload lock (threading.Lock per alias) prevents concurrent reloads
  • Queries that arrive during reload wait for completion rather than spawning parallel reloads
  • Lock is lightweight and does not block queries to other repos
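
A sketch of the per-repo lock plus the double-checked reload it enables. The class name and the `stale`/`reload_counts` dictionaries are illustrative stand-ins for the real cache state; the point is that queued threads re-check staleness after acquiring the lock, so only one reload runs.

```python
import threading


class PerRepoReloadLocks:
    """One lock per repo alias; locks for different repos never contend."""

    def __init__(self) -> None:
        self._registry_lock = threading.Lock()  # guards the lock dict itself
        self._locks = {}

    def lock_for(self, repo_alias: str) -> threading.Lock:
        with self._registry_lock:
            return self._locks.setdefault(repo_alias, threading.Lock())


# --- demo: 8 concurrent queries, exactly one reload ---
locks = PerRepoReloadLocks()
stale = {"my-repo": True}
reload_counts = {"my-repo": 0}


def ensure_fresh(repo: str) -> None:
    if not stale[repo]:
        return                        # fast path: cache already valid
    with locks.lock_for(repo):
        if stale[repo]:               # re-check: a queued thread finds it fresh
            reload_counts[repo] += 1  # stands in for evict + reload from disk
            stale[repo] = False


threads = [threading.Thread(target=ensure_fresh, args=("my-repo",))
           for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

After the demo, `reload_counts["my-repo"]` is 1: the seven queued threads waited on the lock, observed the cache was already fresh, and served from it.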

AC6: Standalone Mode Compatibility

Scenario: Cache invalidation works in standalone mode too.

Given the server is running in standalone (SQLite) mode
When a repo is refreshed and versioned path changes
Then the same change detection mechanism invalidates caches
And the TTL mechanism also works
And no cluster-specific infrastructure is required

Technical Requirements:

  • CacheInvalidationManager works regardless of storage_mode
  • mtime checks work on local filesystem (standalone) and NFS (cluster)
  • No PostgreSQL dependency for cache invalidation
  • Existing cache behavior preserved in standalone mode (additive, not replacement)

Implementation Status

  • Core implementation complete
  • Unit tests passing
  • Integration tests passing
  • E2E tests passing
  • Code review approved
  • Manual E2E testing completed
  • Documentation updated

Technical Implementation Details

Invalidation Decision Flow (Per Query)

query arrives for repo "my-repo"
    |
    v
stat alias JSON mtime (for queried repo only, not all repos)
    |
    +-- mtime unchanged? --> use cached indexes
    |
    +-- mtime changed? --> re-read alias JSON
                               |
                               +-- target_path same? --> update mtime record, use cache
                               |
                               +-- target_path changed? --> acquire per-repo reload lock
                                                               |
                                                               +-- EVICT, reload from new path
                                                               |
                                                               +-- release lock, serve query

File Structure

src/code_indexer/server/services/
    cache_invalidation_manager.py   # CacheInvalidationManager

Integration Points

The cache invalidation manager integrates with existing caching layers:

  1. HNSW cache: SemanticQueryManager caches HNSWIndex objects in memory
  2. FTS cache: TantivyIndexManager caches tantivy index objects in memory
  3. Payload cache: PayloadStore caches document payloads in memory

Each cache consumer needs to check with CacheInvalidationManager before using cached data:

class SemanticQueryManager:
    def query(self, repo_alias: str, query_text: str, ...):
        # Revalidate before using the in-memory index for this repo.
        if self._cache_invalidation.is_stale(repo_alias):
            self._evict_cache(repo_alias)
            self._load_cache(repo_alias)
        # proceed with cached index

NFS Considerations

  • stat() calls on NFS have ~1-5ms latency (vs ~0.1ms local)
  • Acceptable for per-query overhead (query itself takes 50-500ms)
  • NFS attribute caching (actimeo) MUST be set to 3 seconds (Story #419: ONTAP FSx NFS Mount Management and Validation) -- this bounds worst-case invalidation delay
  • If actimeo is not set (default 60s), mtime-based detection is delayed by up to 60 seconds
  • stat() is performed only for the queried repo, not all repos, keeping overhead minimal
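
Story #419 owns the mount configuration; purely as illustration, an /etc/fstab entry pinning the attribute-cache timeout to 3 seconds might look like the following (the FSx hostname and mount point are placeholders, not values from this story):

```
# Hypothetical FSx for ONTAP NFS mount with actimeo=3 (per Story #419)
fs-example.fsx.us-east-1.amazonaws.com:/ /mnt/cidx nfs nfsvers=4.1,actimeo=3,_netdev 0 0
```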

Cache Metrics (for diagnostics)

  • Total cache entries per repo
  • Cache hit rate
  • Cache evictions (change-detected vs TTL-expired)
  • Average time since last validation
  • Available via health/diagnostics endpoint

Testing Requirements

  • Automated: Cache detects alias JSON mtime change and evicts.
  • Automated: Cache detects target_path change and reloads from new path.
  • Automated: TTL expiry evicts old cache entries.
  • Automated: Unchanged alias JSON does NOT trigger eviction (performance).
  • Automated: CacheEntry metadata is correctly tracked.
  • Automated: Concurrent reload protection -- only one reload per repo at a time.
  • Manual E2E: In cluster mode, refresh a golden repo on the leader, then immediately query on a follower node. Verify the follower returns results from the updated index (not stale cache).

Definition of Done

  • Alias JSON mtime change detection triggers cache eviction
  • Index directory mtime detection triggers reload
  • TTL-based background cleanup works as safety net
  • CacheEntry metadata tracked for all cached indexes
  • Concurrent reload protection (one reload per repo at a time)
  • Per-query overhead under 5ms (stat calls on queried repo only)
  • Works in both standalone and cluster modes
  • All tests pass
