Releases: DNSdecoded/IndicRAG

v1.2.0 – Security Hardening & Ingest Fixes

24 Feb 15:46

🔒 Security

Path Traversal Hardened — CWE-22/23/36/73/99 (api_server.py)

  • IngestRequest.pdf_path now validated at source via Pydantic field_validator:
    rejects absolute paths, .. traversal sequences, and shell metacharacters
  • Runtime uses Path.relative_to() to confirm containment, then reconstructs
    safe_pdf_path purely from base_dir — fully severs taint chain for CodeQL
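
The containment step described above can be sketched as follows. This is a minimal illustration using only pathlib, not the project's actual code: the function name resolve_inside and its error messages are hypothetical, and the real check is wired into a Pydantic field_validator, which is omitted here.

```python
from pathlib import Path


def resolve_inside(base_dir: Path, user_path: str) -> Path:
    """Return a path guaranteed to sit under base_dir, or raise ValueError.

    Hypothetical helper; the name and messages are illustrative only.
    """
    candidate = Path(user_path)
    # Cheap rejections first: absolute paths and ".." components.
    if candidate.is_absolute() or ".." in candidate.parts:
        raise ValueError(f"unsafe path: {user_path!r}")

    base = base_dir.resolve()
    resolved = (base / candidate).resolve()
    try:
        # relative_to() raises ValueError when resolved is not under base.
        relative = resolved.relative_to(base)
    except ValueError as e:
        raise ValueError(f"path escapes base directory: {user_path!r}") from e

    # Rebuild the result from the trusted base rather than returning
    # the value derived directly from user input.
    return base / relative
```

Resolving before calling relative_to() means symlinks that point outside the base directory are also rejected, since the comparison is made on the fully resolved path.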

🐛 Bug Fixes

Ingest returns 0 chunks for multi-chapter textbooks (ingest.py)

  • ingest_paper now uses a two-pass approach: normal pass skips references,
    fallback pass retries without section-name filter when 0 chunks result
  • Verified: 34 MB electromagnetics textbook now yields 983 chunks instead of 0
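
A minimal sketch of the two-pass idea, assuming sections arrive as (name, text) pairs. The names SKIPPED_SECTIONS, collect_chunks, and ingest_with_fallback are hypothetical, and the real ingest_paper presumably does far more per section than this.

```python
# Assumed set of section names that the normal pass filters out.
SKIPPED_SECTIONS = {"references", "bibliography"}


def collect_chunks(sections, apply_filter=True):
    """Collect non-empty section texts, optionally skipping filtered names."""
    chunks = []
    for name, text in sections:
        if apply_filter and name.strip().lower() in SKIPPED_SECTIONS:
            continue
        if text.strip():
            chunks.append(text.strip())
    return chunks


def ingest_with_fallback(sections):
    """First pass applies the section-name filter; retry without it on 0 chunks."""
    chunks = collect_chunks(sections, apply_filter=True)
    if not chunks:
        # Fallback pass: the filter removed everything, so keep all sections.
        chunks = collect_chunks(sections, apply_filter=False)
    return chunks
```

The fallback only triggers when the filtered pass yields nothing, so ordinary papers still have their reference lists skipped.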

422 Unprocessable Entity on filenames with accents/parens (api_server.py)

  • Replaced strict allowlist regex with a blocklist that only rejects genuinely
    dangerous shell metacharacters — accented characters (é, á, ó), parentheses,
    commas, and hyphens in filenames are now accepted
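
The blocklist approach might look roughly like this. The exact character set below is an assumption made for illustration; the release note does not list which metacharacters the project rejects, and is_safe_filename is a hypothetical name.

```python
import re

# Assumed blocklist: command separators, pipes, substitution, redirection,
# and control characters. Parentheses, commas, hyphens, and accented
# letters are deliberately absent, so they pass.
_SHELL_METACHARS = re.compile(r"[;&|`$<>\x00\n\r]")


def is_safe_filename(name: str) -> bool:
    """True when the name is non-empty and contains no blocklisted character."""
    return bool(name) and _SHELL_METACHARS.search(name) is None
```

A blocklist trades strictness for compatibility: anything not explicitly listed is accepted, which is why it is usually paired with a separate path containment check.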

v1.1.0 – Bug Fixes & Background Ingestion

24 Feb 15:09

Bug Fixes

api_server.py

  • Added a BulkIngestResponse Pydantic model with complete statistics fields.
  • Updated POST /ingest/all to return status: "partial" when one or more files fail.
  • Restored proper exception chaining using raise ... from e.
  • Removed the unnecessary in-function import config.
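
As a rough illustration of the "partial" status and of raise ... from e, here is a stdlib-only sketch. It uses a dataclass as a stand-in for the Pydantic BulkIngestResponse model; the field names, the IngestError type, and the "success" status value are all assumptions, since the notes only confirm "partial".

```python
from dataclasses import dataclass, field


class IngestError(Exception):
    """Hypothetical error type raised when a single file fails to ingest."""


@dataclass
class BulkIngestResult:
    # Stand-in for BulkIngestResponse; field names are assumed.
    status: str
    total: int
    succeeded: int
    failed: int
    errors: dict = field(default_factory=dict)


def ingest_one(path, ingest_fn):
    try:
        return ingest_fn(path)
    except OSError as e:
        # "from e" keeps the original exception reachable via __cause__.
        raise IngestError(f"failed to ingest {path}") from e


def ingest_all(paths, ingest_fn):
    errors = {}
    for path in paths:
        try:
            ingest_one(path, ingest_fn)
        except IngestError as e:
            errors[path] = f"{e}: {e.__cause__}"
    failed = len(errors)
    return BulkIngestResult(
        status="partial" if failed else "success",  # "success" is assumed
        total=len(paths),
        succeeded=len(paths) - failed,
        failed=failed,
        errors=errors,
    )
```

Collecting per-file errors instead of aborting on the first failure is what makes a "partial" outcome reportable at all.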

ingest.py

  • Wrapped delete_by_paper_id in a try/except block so that a deletion failure aborts re-ingestion instead of adding duplicate chunks.
  • Removed the internal import of calculate_md5 within _extract_worker; hashing is now handled locally using hashlib.
  • Evaluated metadata_fn in the parent process and passed the resulting metadata dictionary to worker processes.

pdf_utils.py

  • Extended math_pattern to support additional math delimiters: display math ($$...$$, \[...\]) and inline math (\(...\)).
  • Improved sentence splitting with the lookbehind regex (?<=[.!?])\s+, which splits on the whitespace after ., !, or ? so the punctuation stays attached to its sentence.
  • Merged chunks smaller than MIN_CHUNK_SIZE with the subsequent chunk instead of dropping them.
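
A small sketch of the splitting and merging behaviour described above. The regex is the one quoted in the notes; the MIN_CHUNK_SIZE value, the function names, and the handling of a trailing small chunk (attached to the previous chunk because it has no successor) are assumptions.

```python
import re

MIN_CHUNK_SIZE = 20  # assumed value; the project's setting is not given

# Split on whitespace that follows ., !, or ? without consuming the mark.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text):
    return [s for s in SENTENCE_BOUNDARY.split(text.strip()) if s]


def merge_small_chunks(chunks, min_size=MIN_CHUNK_SIZE):
    """Prepend each undersized chunk to the chunk that follows it."""
    merged = []
    carry = ""
    for chunk in chunks:
        current = f"{carry} {chunk}" if carry else chunk
        if len(current) < min_size:
            carry = current  # still too small: hold it for the next chunk
        else:
            merged.append(current)
            carry = ""
    if carry:
        # Assumed policy: a trailing small chunk joins the previous one,
        # or stands alone when it is the only content.
        if merged:
            merged[-1] = f"{merged[-1]} {carry}"
        else:
            merged.append(carry)
    return merged
```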

patterns.json

  • Anchored citation matching to full lines using: ^\[\d+(?:,\s*\d+)*\]\s*$.
  • Corrected the email TLD character class from [A-Z|a-z] to [A-Za-z]; inside a character class, | is a literal, so the old pattern also accepted a pipe character.
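
Both pattern fixes can be checked quickly in Python. The citation regex is taken from the notes; the TLD patterns are reduced to the character class alone, with anchors and a {2,} quantifier added here as assumptions to make them testable.

```python
import re

# Matches only lines that consist entirely of a bracketed citation list.
CITATION_LINE = re.compile(r"^\[\d+(?:,\s*\d+)*\]\s*$")

# Anchors and {2,} are assumed; only the character classes come from the notes.
OLD_TLD = re.compile(r"^[A-Z|a-z]{2,}$")  # "|" is a literal inside [...]
NEW_TLD = re.compile(r"^[A-Za-z]{2,}$")


def is_citation_line(line):
    return CITATION_LINE.match(line) is not None
```

With the full-line anchor, an inline mention such as "See [1] for details" no longer counts as a citation line, while a bare "[1, 2,3]" still does.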

IndicRAG v1.0.0 - Major UI Overhaul and Ingestion Optimization

23 Feb 19:01

🎨 UI/UX Transformation

  • Premium Aesthetics: New tokenized design system with glassmorphism.
  • Dynamic Theming: Native Dark/Light mode support with smooth transitions.
  • Scientific Rendering: High-fidelity markdown and citation rendering via marked.js.

⚙️ Performance & Architecture

  • Parallel Ingestion: PDFs are processed in parallel worker processes via ProcessPoolExecutor, reducing ingestion time.
  • MD5 Caching: Change detection based on MD5 hashes skips reprocessing of unchanged PDFs.
  • Regex-based Chunking: Regex-driven chunking that preserves scientific notation and math formulas.
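
The MD5-based change detection might work along these lines. This is a sketch under assumptions: the cache is modelled as a plain dict keyed by path, and file_md5 and ingest_if_changed are hypothetical names; how the project persists its cache is not described.

```python
import hashlib
from pathlib import Path


def file_md5(path: Path) -> str:
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest()


def ingest_if_changed(path: Path, cache: dict, ingest_fn) -> bool:
    """Call ingest_fn(path) only when the file's hash differs from the cache."""
    digest = file_md5(path)
    key = str(path)
    if cache.get(key) == digest:
        return False  # unchanged since the last successful ingest
    ingest_fn(path)
    cache[key] = digest  # record only after ingest_fn returned without error
    return True
```

Writing the hash after ingestion rather than before means a failed ingest is retried on the next run.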

🛡️ Stability & Rigor

  • Grounded Prompting: System prompts now enforce strict epistemic honesty and mechanistic rigor.
  • Windows Reliability: Refactored purge.py to handle file locking gracefully.
  • Bulk Operations: Added /ingest/all endpoint for one-click knowledge base builds.