Releases: DNSdecoded/IndicRAG
v1.2.0 – Security Hardening & Ingest Fixes
🔒 Security
Path Traversal Hardened — CWE-22/23/36/73/99 (api_server.py)
- `IngestRequest.pdf_path` now validated at source via Pydantic `field_validator`: rejects absolute paths, `..` traversal sequences, and shell metacharacters
- Runtime uses `Path.relative_to()` to confirm containment, then reconstructs `safe_pdf_path` purely from `base_dir`, fully severing the taint chain for CodeQL
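The containment check described above can be sketched with the standard library alone. This is a minimal illustration, not the code in `api_server.py`: the function name `resolve_inside` is hypothetical, and the Pydantic validation step is left out.

```python
from pathlib import Path


def resolve_inside(base_dir: Path, user_path: str) -> Path:
    """Return a path guaranteed to sit under base_dir, or raise ValueError.

    Hypothetical helper: the name and signature are assumptions.
    """
    base = base_dir.resolve()
    # Joining an absolute user_path discards base, so the check below catches it.
    candidate = (base / user_path).resolve()
    # relative_to() raises ValueError when candidate is not inside base.
    relative = candidate.relative_to(base)
    # Rebuild the result from the trusted base instead of reusing the
    # user-derived object.
    return base / relative
```

A relative name such as `papers/a.pdf` passes through, while `../secret.pdf` or an absolute path raises `ValueError`.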
🐛 Bug Fixes
Ingest returns 0 chunks for multi-chapter textbooks (ingest.py)
- `ingest_paper` now uses a two-pass approach: the normal pass skips references, and a fallback pass retries without the section-name filter when 0 chunks result
- Verified: a 34 MB electromagnetics textbook now yields 983 chunks instead of 0
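The two-pass idea can be sketched as follows. The section representation, the `SKIPPED_SECTIONS` set, and both function names are assumptions for illustration; the release notes do not say how `ingest_paper` labels sections or why the textbook was filtered out entirely.

```python
# Hypothetical filter list; the real section names are not given in the notes.
SKIPPED_SECTIONS = frozenset({"references", "bibliography"})


def chunk_sections(sections, skip_names):
    """Keep the text of every (name, text) section whose name is not skipped."""
    return [text for name, text in sections if name.lower() not in skip_names]


def ingest_two_pass(sections):
    """Normal pass with the section filter, then an unfiltered fallback."""
    chunks = chunk_sections(sections, SKIPPED_SECTIONS)
    if not chunks:
        # Fallback pass: the filter removed everything, so retry without it.
        chunks = chunk_sections(sections, frozenset())
    return chunks
```

The fallback runs only when the first pass produces nothing, so papers that already ingest correctly keep their references excluded.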
422 Unprocessable Entity on filenames with accents/parens (api_server.py)
- Replaced strict allowlist regex with a blocklist that only rejects genuinely
dangerous shell metacharacters — accented characters (é, á, ó), parentheses,
commas, and hyphens in filenames are now accepted
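A blocklist of this kind might look like the sketch below. The exact set of rejected characters is an assumption (the notes only say "genuinely dangerous shell metacharacters"), and `is_safe_filename` is a hypothetical name.

```python
import re

# Assumed blocklist: common shell metacharacters plus control characters.
# Parentheses, commas, hyphens, and accented letters are deliberately absent.
SHELL_METACHARS = re.compile(r"[;&|`$<>\\\n\r\x00]")


def is_safe_filename(name: str) -> bool:
    """Return True when the name contains none of the blocked characters."""
    return SHELL_METACHARS.search(name) is None
```

Compared with an allowlist, this accepts any character not explicitly blocked, which is why it still needs the path containment check to run afterwards.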
v1.1.0 – Bug Fixes & Background Ingestion
Bug Fixes
api_server.py
- Added a `BulkIngestResponse` Pydantic model with complete statistics fields.
- Updated `POST /ingest/all` to return `status: "partial"` when one or more files fail.
- Restored proper exception chaining using `raise ... from e`.
- Removed the unnecessary in-function `import config`.
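As a rough sketch of the `"partial"` status rule, the function below summarizes per-file outcomes. Only the `"partial"` value comes from the notes; the `"success"` value, the field names, and the `summarize` function are assumptions, and a plain dictionary stands in for the `BulkIngestResponse` model.

```python
from typing import Dict, Optional


def summarize(results: Dict[str, Optional[str]]) -> dict:
    """Build a bulk-ingest summary from {filename: error message or None}.

    Hypothetical shape; the real BulkIngestResponse fields are not listed.
    """
    errors = {name: err for name, err in results.items() if err is not None}
    return {
        # "partial" as soon as any file failed; "success" is an assumed label.
        "status": "partial" if errors else "success",
        "total_files": len(results),
        "failed_files": len(errors),
        "errors": errors,
    }
```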
ingest.py
- Wrapped `delete_by_paper_id` in a `try/except` block to stop execution on deletion failure and prevent duplicate chunks.
- Removed the internal import of `calculate_md5` within `_extract_worker`; hashing is now handled locally using `hashlib`.
- Evaluated `metadata_fn` in the parent process and passed the resulting metadata dictionary to worker processes.
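The delete guard can be illustrated with an in-memory stand-in for the store. Everything here except the `delete_by_paper_id` name is hypothetical: `IngestError`, `InMemoryStore`, `add_chunks`, and `replace_chunks` are assumptions made for the sketch.

```python
class IngestError(RuntimeError):
    """Hypothetical error raised when re-ingestion must stop."""


class InMemoryStore:
    """Hypothetical stand-in for the chunk store, for illustration only."""

    def __init__(self, fail_on_delete=False):
        self.chunks = {}
        self.fail_on_delete = fail_on_delete

    def delete_by_paper_id(self, paper_id):
        if self.fail_on_delete:
            raise ConnectionError("store unavailable")
        self.chunks.pop(paper_id, None)

    def add_chunks(self, paper_id, new_chunks):
        self.chunks.setdefault(paper_id, []).extend(new_chunks)


def replace_chunks(store, paper_id, new_chunks):
    """Delete old chunks first; abort before inserting if deletion fails."""
    try:
        store.delete_by_paper_id(paper_id)
    except Exception as e:
        # Chain the original error so the root cause stays in the traceback.
        raise IngestError(f"could not delete old chunks for {paper_id!r}") from e
    store.add_chunks(paper_id, new_chunks)
```

If the delete step were allowed to fail silently, the new chunks would be added alongside the old ones, which is the duplicate-chunk case the fix targets.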
pdf_utils.py
- Extended `math_pattern` to support additional LaTeX math delimiters: `$$...$$` and `\[...\]` (display), and `\(...\)` (inline).
- Improved sentence splitting with the lookbehind regex `(?<=[.!?])\s+` to preserve punctuation.
- Merged chunks smaller than `MIN_CHUNK_SIZE` with the subsequent chunk instead of dropping them.
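The sentence splitting and small-chunk merging can be sketched like this. The regex is the one quoted above; the value of `MIN_CHUNK_SIZE`, the function names, and the handling of a trailing small chunk (appended to the previous one, since it has no successor) are assumptions.

```python
import re

MIN_CHUNK_SIZE = 10  # assumed value, for illustration only

# Split on whitespace that follows ., ! or ?; the lookbehind consumes nothing,
# so the punctuation stays attached to its sentence.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text):
    return [s for s in SENTENCE_BOUNDARY.split(text.strip()) if s]


def merge_small_chunks(chunks, min_size=MIN_CHUNK_SIZE):
    """Prepend each undersized chunk to the chunk that follows it."""
    merged, carry = [], ""
    for chunk in chunks:
        current = f"{carry} {chunk}" if carry else chunk
        if len(current) < min_size:
            carry = current  # still too small: hold it for the next chunk
        else:
            merged.append(current)
            carry = ""
    if carry:
        # Assumed behavior: a trailing small chunk joins the previous one.
        if merged:
            merged[-1] = f"{merged[-1]} {carry}"
        else:
            merged.append(carry)
    return merged
```

Note that a plain punctuation lookbehind also splits after abbreviations such as "Eq. 3", so real text may need extra handling.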
patterns.json
- Anchored citation matching to full lines using `^\[\d+(?:,\s*\d+)*\]\s*$`.
- Corrected the email TLD character class from `[A-Z|a-z]` to `[A-Za-z]`.
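Both pattern changes are easy to check in isolation. In the sketch below the citation regex is the one quoted above, applied one line at a time (an assumption about how `patterns.json` is used). The `{2,}` quantifier on the TLD class is also an assumption, since the full email pattern is not shown.

```python
import re

CITATION_LINE = re.compile(r"^\[\d+(?:,\s*\d+)*\]\s*$")


def is_citation_line(line: str) -> bool:
    """True only when the whole line is a bracketed citation list."""
    return CITATION_LINE.match(line) is not None


# Inside a character class, | is a literal pipe, not alternation,
# so the old class wrongly accepted it.
OLD_TLD = re.compile(r"[A-Z|a-z]{2,}")
NEW_TLD = re.compile(r"[A-Za-z]{2,}")
```

With the anchors in place, an inline citation such as "as shown in [3] earlier" no longer matches, while a line containing only `[1, 2,3]` still does.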
IndicRAG v1.0.0 - Major UI Overhaul and Ingestion Optimization
🎨 UI/UX Transformation
- Premium Aesthetics: New tokenized design system with glassmorphism.
- Dynamic Theming: Native Dark/Light mode support with smooth transitions.
- Scientific Rendering: High-fidelity markdown and citation rendering via `marked.js`.
⚙️ Performance & Architecture
- Parallel Ingestion: Drastically reduced processing time using `ProcessPoolExecutor`.
- MD5 Caching: Smart change detection to skip redundant processing of unchanged PDFs.
- Regex-based Chunking: Superior handling of scientific notation and math formula preservation.
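MD5-based change detection can be sketched as below. The cache is a plain dictionary keyed by path, which is an assumption; the notes do not describe how IndicRAG stores hashes, and `file_md5` and `needs_ingest` are hypothetical names.

```python
import hashlib
from pathlib import Path


def file_md5(path, block_size=1 << 20):
    """Hash a file in blocks so large PDFs are not read into memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def needs_ingest(path, cache):
    """True when the file's hash differs from the cached one (or is missing)."""
    return cache.get(str(path)) != file_md5(path)
```

The caller would store `cache[str(path)] = file_md5(path)` only after a successful ingest, so a failed run is retried next time.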
🛡️ Stability & Rigor
- Grounded Prompting: System prompts now enforce strict epistemic honesty and mechanistic rigor.
- Windows Reliability: Refactored `purge.py` to handle file locking gracefully.
- Bulk Operations: Added `/ingest/all` endpoint for one-click knowledge base builds.