RAG Ingestion Script Critical Issues and Improvements #16

@ForestMars

Issue Summary

The current RAG ingestion script (src/scripts/ingestRagFolder.ts) has several reliability and data integrity issues that could lead to missing documents, data corruption, and inconsistent incremental processing.

Critical Issues

1. File Name Collision Risk (high)

Problem:

const fileName = path.basename(filePath);

Files with identical names in different subdirectories overwrite each other in the database, because path.basename discards the directory portion of the path. The ingestion script now recurses into subdirectories, so these collisions can occur in practice.

Example:

  • docs/readme.md
  • examples/readme.md
  • api/readme.md

All three files get stored as readme.md, causing the latter two to overwrite the first.

Impact: Missing documents in RAG retrieval, data loss

Solution:

const fileName = path.relative(ragFolder, filePath);
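
To make the difference concrete, the sketch below compares the two keys for the three readme.md files from the example. The helper name documentKey and the /rag root are hypothetical. Normalizing separators to "/" is an added assumption, so that the same file yields the same key on Windows and POSIX hosts.

```typescript
import * as path from "node:path";

// Hypothetical helper: derive a storage key that is unique per file under ragFolder.
function documentKey(ragFolder: string, filePath: string): string {
  const relative = path.relative(ragFolder, filePath);
  // Normalize platform separators so the key does not depend on the host OS.
  return relative.split(path.sep).join("/");
}

const ragFolder = "/rag"; // assumed root, for illustration only
const files = ["docs/readme.md", "examples/readme.md", "api/readme.md"].map(
  (p) => path.join(ragFolder, p),
);

// basename collapses all three paths into a single key
const baseNames = new Set(files.map((f) => path.basename(f)));
// relative paths keep one key per file
const relativeKeys = new Set(files.map((f) => documentKey(ragFolder, f)));

console.log(baseNames.size, relativeKeys.size); // 1 3
```

Rows already stored under a bare file name would not match the new relative keys, so a one-time re-ingestion (or a migration of file_name values) is probably needed when this change ships.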

2. No Transaction Safety (medium-high)

Problem:
Files are processed sequentially without database transactions. If processing fails mid-file, partial chunks remain in the database.

Example Failure Scenario:

  1. File starts processing, inserts chunks 1-5
  2. Embedding API fails on chunk 6
  3. File processing stops, but chunks 1-5 remain in database
  4. Next run skips file (thinks it's already processed)
  5. File is permanently incomplete in RAG system

Impact: Incomplete documents, inconsistent retrieval results

Solution:
Wrap each file's processing in a database transaction:

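async function processFile(filePath: string, fileName: string) {
  // Use one dedicated client: BEGIN/COMMIT only affect the connection that issued them
  const client = await pool.connect();
  try {
    await client.query('BEGIN');
    // ... process chunks, issuing every insert through `client` (not `pool`) ...
    await client.query('COMMIT');
  } catch (error) {
    // Discard any chunks inserted before the failure
    await client.query('ROLLBACK');
    throw error;
  } finally {
    client.release();
  }
}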
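
The same pattern can be factored into a reusable wrapper so that each file's work cannot forget the rollback or the release. This is a minimal sketch: withTransaction and the TxClient interface are hypothetical names, and the interface only models the two methods the pattern needs, not the full client type of any particular database library.

```typescript
// Minimal shape of a checked-out database client (assumed, for illustration).
interface TxClient {
  query(sql: string, params?: unknown[]): Promise<unknown>;
  release(): void;
}

// Hypothetical helper: run `work` inside BEGIN/COMMIT, rolling back on any error.
async function withTransaction<T>(
  connect: () => Promise<TxClient>,
  work: (client: TxClient) => Promise<T>,
): Promise<T> {
  const client = await connect();
  try {
    await client.query("BEGIN");
    const result = await work(client);
    await client.query("COMMIT");
    return result;
  } catch (error) {
    // Undo every statement issued through this client since BEGIN
    await client.query("ROLLBACK");
    throw error;
  } finally {
    // Always return the client, whether the transaction succeeded or failed
    client.release();
  }
}
```

A caller would pass something like () => pool.connect() as the first argument and perform all chunk inserts through the client handed to the callback. Depending on the library's typings, a thin adapter may be needed to satisfy TxClient.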

3. Incremental Logic Gap (medium)

Problem:
The script decides whether to skip a file with existingFiles.has(fileName), which only tests whether at least one row exists for that file name. A file left partially processed by an error is therefore indistinguishable from a complete one.

Current Logic:

async function getExistingFiles(): Promise<Set<string>> {
  const result = await pool.query("SELECT DISTINCT file_name FROM rag_documents");
  return new Set(result.rows.map(row => row.file_name));
}

Issue: A file with only some chunks inserted will still appear "existing" and may be skipped inappropriately.

Impact: Incomplete files never get fully reprocessed

Solution:
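Check for complete file processing by validating that chunk indices are contiguous. This detects gaps but not a truncated tail (for example, chunks 0-4 present and chunk 5 never inserted), so it complements transaction safety rather than replacing it: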

async function getFileProcessingStatus(fileName: string): Promise<'complete' | 'partial' | 'missing'> {
  // Cast to int: assuming `pool` is a node-postgres Pool, COUNT(*) (a bigint) comes back
  // as a string by default, which would make the strict comparisons below always fail
  const result = await pool.query(
    "SELECT COUNT(*)::int AS chunk_count, MAX(chunk_index) AS max_index FROM rag_documents WHERE file_name = $1",
    [fileName]
  );
  const { chunk_count: actualChunks, max_index: maxIndex } = result.rows[0];

  if (actualChunks === 0) return 'missing';

  // Chunk indices are assumed to start at 0, so a gap-free file has max_index + 1 rows
  const expectedChunks = maxIndex + 1;

  return expectedChunks === actualChunks ? 'complete' : 'partial';
}
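
The classification rule can also be exercised without a database. The sketch below is a pure function over the chunk indices found for one file. classifyChunks and its expectedTotal parameter are hypothetical: expectedTotal stands for a chunk count recorded at ingestion time, which the current schema is not known to store.

```typescript
type FileStatus = "complete" | "partial" | "missing";

// Hypothetical pure helper: classify a file from the chunk indices present in storage.
// Indices are assumed to be 0-based, matching the max_index + 1 rule above.
function classifyChunks(indices: number[], expectedTotal?: number): FileStatus {
  if (indices.length === 0) return "missing";
  const unique = new Set(indices);
  const maxIndex = Math.max(...indices);
  // Contiguous means: no duplicates, and exactly the indices 0..maxIndex are present
  const contiguous = unique.size === indices.length && unique.size === maxIndex + 1;
  if (!contiguous) return "partial";
  // Only a known expected total can reveal missing trailing chunks
  if (expectedTotal !== undefined && unique.size !== expectedTotal) return "partial";
  return "complete";
}

console.log(classifyChunks([]));                 // missing
console.log(classifyChunks([0, 1, 3]));          // partial (gap at index 2)
console.log(classifyChunks([0, 1, 2, 3, 4]));    // complete, even if chunk 5 was never written
console.log(classifyChunks([0, 1, 2, 3, 4], 6)); // partial once the expected total is known
```

The third call shows why a gap check alone does not cover the failure scenario from issue 2: a file that stops after chunk 4 still looks contiguous.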

Implementation Priority

  1. High Priority: Fix file name collision (prevents data loss)
  2. Medium Priority: Add transaction safety (prevents partial corruption)
  3. Low Priority: Improve incremental logic (handles edge cases)

Acceptance Criteria

  • Files in different directories with the same name are stored with unique identifiers
  • File processing is atomic (all chunks inserted or none)
  • Partially processed files are detected and reprocessed
  • All existing functionality remains intact
  • Incremental mode continues to skip truly unchanged files

Additional Considerations

  • Consider adding a processing_status column to track file processing state
  • Add retry logic for transient failures (network, API rate limits)
  • Consider batch processing for better performance with large document sets
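
As one possible shape for the retry suggestion above, the sketch below retries an async operation with exponential backoff. withRetry, its defaults, and the idea of wrapping the embedding call are assumptions for illustration. This version retries on every error, whereas a production version would probably retry only transient ones such as timeouts or rate-limit responses.

```typescript
// Hypothetical helper: retry an async operation with exponential backoff.
async function withRetry<T>(
  operation: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 500,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      if (attempt === maxAttempts) break;
      // Double the wait after each failure: baseDelayMs, 2x, 4x, ...
      const delayMs = baseDelayMs * 2 ** (attempt - 1);
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  // All attempts failed: surface the most recent error to the caller
  throw lastError;
}
```

Wrapping only the embedding request, rather than the whole per-file transaction, keeps a transient API failure from rolling back chunks that were already embedded successfully.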
