RAG Ingestion Script: Critical Issues and Improvements
Issue Summary
The current RAG ingestion script (src/scripts/ingestRagFolder.ts) has several reliability and data integrity issues that could lead to missing documents, data corruption, and inconsistent incremental processing.
Critical Issues
1. File Name Collision Risk (high)
Problem:
```ts
const fileName = path.basename(filePath);
```
Files with identical names in different subdirectories overwrite each other in the database. Since the ingestion script now recurses into subdirectories, this is a real concern.
Example:
- docs/readme.md
- examples/readme.md
- api/readme.md
All three files get stored as readme.md, causing the latter two to overwrite the first.
Impact: Missing documents in RAG retrieval, data loss
Solution:
```ts
const fileName = path.relative(ragFolder, filePath);
```
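For illustration, here is a minimal sketch of a recursive walk built around that idea; `collectRelativeFileNames` is a hypothetical helper, not the script's actual traversal:

```ts
import { promises as fs } from "fs";
import * as path from "path";

// Hypothetical helper (not in the current script): walk the RAG folder
// recursively and key each file by its path relative to the root, so
// docs/readme.md and api/readme.md remain distinct in the database.
async function collectRelativeFileNames(ragFolder: string, dir: string = ragFolder): Promise<string[]> {
  const names: string[] = [];
  for (const entry of await fs.readdir(dir, { withFileTypes: true })) {
    const fullPath = path.join(dir, entry.name);
    if (entry.isDirectory()) {
      names.push(...(await collectRelativeFileNames(ragFolder, fullPath)));
    } else if (entry.isFile()) {
      names.push(path.relative(ragFolder, fullPath)); // e.g. "examples/readme.md"
    }
  }
  return names;
}
```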
2. No Transaction Safety (medium-high)
Problem:
Files are processed sequentially without database transactions. If processing fails mid-file, partial chunks remain in the database.
Example Failure Scenario:
- File starts processing, inserts chunks 1-5
- Embedding API fails on chunk 6
- File processing stops, but chunks 1-5 remain in database
- Next run skips file (thinks it's already processed)
- File is permanently incomplete in RAG system
Impact: Incomplete documents, inconsistent retrieval results
Solution:
Wrap each file's processing in a database transaction:
```ts
async function processFile(filePath: string, fileName: string) {
  const client = await pool.connect();
  try {
    await client.query('BEGIN');
    // ... process chunks (all inserts must go through `client`,
    // not `pool`, so they stay inside this transaction) ...
    await client.query('COMMIT');
  } catch (error) {
    // Roll back so a mid-file failure leaves no partial chunks behind
    await client.query('ROLLBACK');
    throw error;
  } finally {
    client.release();
  }
}
```
3. Incremental Logic Gap (medium)
Problem:
The script checks `existingFiles.has(fileName)` using only distinct file names, so files that were only partially processed (left behind by earlier errors) are not detected.
Current Logic:
```ts
async function getExistingFiles(): Promise<Set<string>> {
  const result = await pool.query("SELECT DISTINCT file_name FROM rag_documents");
  return new Set(result.rows.map(row => row.file_name));
}
```
Issue: A file with only some chunks inserted will still appear "existing" and may be skipped inappropriately.
Impact: Incomplete files never get fully reprocessed
Solution:
Check for complete file processing by validating chunk sequences:
```ts
async function getFileProcessingStatus(fileName: string): Promise<'complete' | 'partial' | 'missing'> {
  const result = await pool.query(
    "SELECT COUNT(*) AS chunk_count, MAX(chunk_index) AS max_index FROM rag_documents WHERE file_name = $1",
    [fileName]
  );
  // node-postgres returns COUNT(*) as a string, so parse it before comparing
  const chunkCount = parseInt(result.rows[0].chunk_count, 10);
  if (chunkCount === 0) return 'missing';
  // Check if chunks are sequential (no gaps): indices 0..max_index
  // should account for every row
  const expectedChunks = Number(result.rows[0].max_index) + 1;
  return expectedChunks === chunkCount ? 'complete' : 'partial';
}
```
Note that a failure partway through a sequential insert leaves a gap-free prefix that this check cannot distinguish from a complete file, so it complements rather than replaces the per-file transaction from issue 2; storing the expected chunk count per file would close that gap entirely.
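To show where this check fits, here is a hedged sketch of the incremental loop; it reuses `pool`, `path`, `getFileProcessingStatus`, and the transactional `processFile` from the snippets above, and the delete-before-reprocess step is an assumption rather than current behavior:

```ts
// Sketch of the incremental loop (assumed shape, not the current script).
async function ingestIncrementally(filePaths: string[], ragFolder: string) {
  for (const filePath of filePaths) {
    const fileName = path.relative(ragFolder, filePath);
    const status = await getFileProcessingStatus(fileName);
    if (status === 'complete') continue; // skip files already fully ingested
    if (status === 'partial') {
      // Assumed cleanup step: clear leftover chunks before reprocessing
      await pool.query("DELETE FROM rag_documents WHERE file_name = $1", [fileName]);
    }
    await processFile(filePath, fileName);
  }
}
```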
Implementation Priority
- High Priority: Fix file name collision (prevents data loss)
- Medium Priority: Add transaction safety (prevents partial corruption)
- Low Priority: Improve incremental logic (handles edge cases)
Acceptance Criteria
- Files in different directories with the same name are stored with unique identifiers
- File processing is atomic (all chunks inserted or none)
- Partially processed files are detected and reprocessed
- All existing functionality remains intact
- Incremental mode continues to skip truly unchanged files
Additional Considerations
- Consider adding a `processing_status` column to track file processing state
- Add retry logic for transient failures (network, API rate limits); see the sketch after this list
- Consider batch processing for better performance with large document sets
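For the retry item above, a minimal sketch assuming a generic async operation and simple exponential backoff; the attempt count and delays are illustrative, not tuned values:

```ts
// Hypothetical helper (not in the current script): retry a transient
// failure such as an embedding API call with exponential backoff.
async function withRetry<T>(operation: () => Promise<T>, maxAttempts = 3): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      if (attempt < maxAttempts) {
        // Back off 500ms, 1000ms, 2000ms, ... between attempts
        await new Promise((resolve) => setTimeout(resolve, 500 * Math.pow(2, attempt - 1)));
      }
    }
  }
  throw lastError;
}

// Example use around an embedding call (hypothetical function name):
// const embedding = await withRetry(() => createEmbedding(chunkText));
```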