🛠️ DocVaultAI

A privacy-centric document intelligence platform designed for secure, local semantic analysis. Documents and their embeddings are stored and processed on your own machine, so sensitive data stays private while still delivering powerful Retrieval-Augmented Generation (RAG) capabilities. Built with Streamlit, LangChain, Groq, and ChromaDB.

🛡️ Functional Service Architecture

DocVaultAI is positioned not just as a chat interface but as a secure data pipeline: what distinguishes it from standard conversational tools is professional-grade data sovereignty.

🔒 Air-Gapped Embedding Engine

The semantic meaning of your documents never leaves your machine. Vectors are generated locally using Ollama/FastEmbed, ensuring no third party or cloud provider ever reads, processes, or stores your original files or their semantic representations.
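
A minimal sketch of this local embedding step, assuming the langchain-ollama integration (older LangChain versions import OllamaEmbeddings from langchain_community.embeddings); the sample text is illustrative.

```python
from langchain_ollama import OllamaEmbeddings

embeddings = OllamaEmbeddings(model="nomic-embed-text")

# The text goes only to the local Ollama server (http://localhost:11434 by
# default); neither the document nor its vector ever leaves the machine.
vector = embeddings.embed_query("Quarterly revenue grew 12% year over year.")
print(len(vector))  # dimensionality of the nomic-embed-text embedding
```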

☁️ Hybrid-Cloud Architecture

Local privacy with global intelligence. Heavy reasoning and language generation are offloaded to Groq's high-speed inference engine, but context selection and retrieval happen entirely on-premise. Your full document set is never exposed to the cloud; only the specific snippets relevant to the current query are sent.
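
A hedged sketch of what actually crosses the network, using the groq Python SDK directly for illustration (the app itself goes through LangChain); the model name, prompt wording, and placeholder context are assumptions.

```python
import os
from groq import Groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])

# `context` holds only the few snippets selected by the local retriever,
# never the full document set.
context = "…top-ranked snippets chosen on-premise…"
response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[
        {"role": "system", "content": "Answer strictly from the provided context."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: What does clause 4.2 cover?"},
    ],
)
print(response.choices[0].message.content)
```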

📜 Audit-Ready Citations

Designed for verification-heavy fields such as legal or medical research. Every response includes a transparent Provenance Log that links each assertion to its source PDF page, along with a confidence score, so the reliability of the information can be verified.
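
A small sketch of how such a provenance log could be assembled, assuming each retrieved chunk carries "source" and "page" metadata (as PDF loaders typically attach) and a 0-1 relevance score from the reranker; the helper name is hypothetical.

```python
def provenance_log(chunks, scores):
    """Build human-readable citations: source file, page, and confidence score."""
    lines = []
    for doc, score in zip(chunks, scores):
        source = doc.metadata.get("source", "unknown file")
        page = doc.metadata.get("page", "?")
        lines.append(f"{source}, p. {page}  (confidence: {score:.2f})")
    return "\n".join(lines)

# Example output line:
#   documents/contract.pdf, p. 12  (confidence: 0.87)
```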

🚫 Zero-Retention Guarantee

Data sent to the Groq API is used strictly for inference. As a self-hosted solution, it guarantees that your query context is never stored, trained on, or retained, offering a level of privacy unattainable with public SaaS models like ChatGPT.

✨ Features

  • Query Your Documents — Ask questions and get accurate answers based on your documents
  • Dark Notion-style UI — Clean, minimal dark theme interface
  • Document Management — Add and delete documents from the sidebar
  • Duplicate Detection — Warns before re-uploading files already indexed
  • Semantic Chunking — Splits by topic, not arbitrary character counts
  • Text Preprocessing — Removes citations, page numbers, and bibliography noise
  • Cross-Encoder Reranking — Filters irrelevant results using semantic relevance scoring
  • Persistent Memory — ChromaDB saves embeddings to disk (load in seconds)
  • High-Performance LLM — Groq API running Llama 3.3 70B
  • Local Embeddings — Ollama nomic-embed-text for private processing

🛠️ Tech Stack

| Component | Tool | Why? |
|---|---|---|
| Frontend | Streamlit | Fast, interactive UI in pure Python |
| Framework | LangChain | Orchestrates the RAG pipeline |
| LLM | Groq API | Extremely fast inference for Llama 3.3 |
| Embeddings | Ollama | Runs nomic-embed-text locally |
| Vector Store | ChromaDB | Persists to disk (unlike RAM-only FAISS) |
| PDF Parser | PyMuPDF | Better text extraction than PyPDF |
| Reranker | Cross-Encoder | Filters irrelevant results with semantic scoring |

📁 Project Structure

DocVaultAI/
├── src/
│   ├── app.py          # Streamlit UI
│   ├── rag_core.py     # RAG logic
│   └── styles.css      # Dark theme CSS
├── documents/          # Your PDFs go here
├── rag_vector_store/   # ChromaDB persistence
└── .env                # API keys

⚙️ Setup

  1. Prerequisites:

    • Python 3.13+
    • Ollama installed
    • Pull the embedding model: ollama pull nomic-embed-text
  2. Install Dependencies:

    pip install -r requirements.txt
  3. Environment Variables: Create a .env file in the root directory:

    GROQ_API_KEY=your_groq_api_key_here
  4. Run the App:

    cd src
    streamlit run app.py

🧠 How It Works

  1. Ingestion — Scans documents/ folder for PDFs (PyMuPDF)
  2. Cleaning — Removes citations, page numbers, bibliography entries
  3. Chunking — Semantic splitting by topic shifts, with size limits
  4. Embedding — Converts text to vectors via nomic-embed-text
  5. Storage — Saves vectors to rag_vector_store/ (ChromaDB)
  6. Retrieval — Fetches top 20 similar chunks for your question
  7. Reranking — Cross-Encoder scores relevance, filters to top 3
  8. Generation — Sends question + context to Groq (Llama 3.3); see the sketch below
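
A condensed sketch of steps 6-8, assuming langchain-chroma, langchain-ollama, langchain-groq, and sentence-transformers; the question, reranker checkpoint, and prompt are illustrative rather than the exact DocVaultAI code.

```python
from langchain_chroma import Chroma
from langchain_groq import ChatGroq
from langchain_ollama import OllamaEmbeddings
from sentence_transformers import CrossEncoder

question = "What warranty period does the contract specify?"

# 6. Retrieval: top 20 candidate chunks from the persisted local store.
store = Chroma(persist_directory="rag_vector_store",
               embedding_function=OllamaEmbeddings(model="nomic-embed-text"))
candidates = store.similarity_search(question, k=20)

# 7. Reranking: a Cross-Encoder scores each (question, chunk) pair; keep the top 3.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(question, d.page_content) for d in candidates])
ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
context = "\n\n".join(doc.page_content for _, doc in ranked[:3])

# 8. Generation: only the question and the three reranked snippets go to Groq.
llm = ChatGroq(model="llama-3.3-70b-versatile")
answer = llm.invoke(f"Answer using only this context:\n{context}\n\nQuestion: {question}")
print(answer.content)
```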

💡 Lessons Learned

Vector Store: FAISS vs Chroma

  • FAISS: Stores in RAM, requires re-processing on restart
  • Chroma: Persists to disk, reloads in about 2 seconds ✓
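
A minimal sketch of the persistence pattern behind that choice, assuming the langchain-chroma integration; paths and sample content are illustrative.

```python
from langchain_chroma import Chroma
from langchain_core.documents import Document
from langchain_ollama import OllamaEmbeddings

embeddings = OllamaEmbeddings(model="nomic-embed-text")
chunks = [Document(page_content="Example chunk from a parsed PDF.",
                   metadata={"source": "documents/example.pdf", "page": 1})]

# First run: embed the chunks and write the index under rag_vector_store/.
store = Chroma.from_documents(chunks, embeddings,
                              persist_directory="rag_vector_store")

# Any later run: re-open the on-disk index, no re-parsing or re-embedding.
store = Chroma(persist_directory="rag_vector_store",
               embedding_function=embeddings)
```

The second call is all a restart needs, which is where the roughly 2-second reload comes from.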

Embeddings: Speed vs Accuracy

  • HuggingFace (all-MiniLM-L6-v2): Fast but lower accuracy
  • Ollama (nomic-embed-text): Best balance, 8192 token context ✓
  • FastEmbed (BAAI/bge-small): Future option for 1000+ docs
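
All three options plug into LangChain through the same embeddings interface, so swapping backends is essentially a one-line change. A hedged sketch, assuming the langchain-huggingface, langchain-ollama, and langchain-community (FastEmbed) packages; import paths may vary by LangChain version.

```python
from langchain_community.embeddings.fastembed import FastEmbedEmbeddings
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_ollama import OllamaEmbeddings

fast_but_coarse = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2")
balanced = OllamaEmbeddings(model="nomic-embed-text")      # the one DocVaultAI uses
for_large_corpora = FastEmbedEmbeddings(model_name="BAAI/bge-small-en-v1.5")
```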

Chunking: Character vs Semantic

| Method | How it works | Pros | Cons |
|---|---|---|---|
| Character | Cut every N chars | Simple, fast | Breaks mid-sentence |
| Semantic | Split by topic shifts | Coherent chunks | Variable sizes |
| Recursive Semantic | Semantic + size limits | Best of both ✓ | More complex |
| Small-to-Big | Search small chunks, return parent context | Very precise search + full context | Complex metadata linking |
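
A sketch of the Recursive Semantic approach chosen above, assuming langchain-experimental's SemanticChunker plus a size-capped RecursiveCharacterTextSplitter; the size and threshold values are illustrative, not the exact DocVaultAI settings.

```python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_ollama import OllamaEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

semantic = SemanticChunker(OllamaEmbeddings(model="nomic-embed-text"),
                           breakpoint_threshold_type="percentile")
size_cap = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=150)

def recursive_semantic_split(text: str):
    """Split by topic shifts, then re-split any chunk that is still too large."""
    chunks = []
    for chunk in semantic.split_text(text):
        if len(chunk) > 1500:
            chunks.extend(size_cap.split_text(chunk))
        else:
            chunks.append(chunk)
    return chunks
```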

PDF Parsing: PyPDF vs PyMuPDF

  • PyPDF: Simple but breaks text with unusual fonts ("Ar e W e")
  • PyMuPDF: Handles styled text, fonts, and formatting better ✓
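
A minimal PyMuPDF sketch of the extraction step; the file path is illustrative. PyMuPDF is imported as fitz.

```python
import fitz  # PyMuPDF

def load_pdf(path: str):
    """Yield (page_number, text) pairs for every page in the PDF."""
    with fitz.open(path) as doc:
        for page in doc:
            yield page.number + 1, page.get_text()

for page_no, text in load_pdf("documents/example.pdf"):
    print(page_no, text[:80])
```

Keeping the page number alongside the text is what later lets the citations point back to a specific PDF page.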

Text Preprocessing Trade-offs

Regex cleaning removes citations and page numbers, but may also catch valid content such as "Table 1". The post-retrieval reranker filters irrelevant results more intelligently, using semantic understanding rather than pattern matching.
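
Illustrative regex patterns in the spirit of this cleaning step; the exact patterns DocVaultAI uses may differ, and over-eager patterns are precisely how valid strings like "Table 1" can get caught.

```python
import re

def clean_text(text: str) -> str:
    text = re.sub(r"\[\d+(?:,\s*\d+)*\]", "", text)                    # inline citations like [3] or [3, 7]
    text = re.sub(r"^\s*\d+\s*$", "", text, flags=re.MULTILINE)        # lone page-number lines
    text = re.sub(r"(?is)\n(references|bibliography)\b.*$", "", text)  # trailing bibliography section
    return re.sub(r"\n{3,}", "\n\n", text).strip()                     # collapse leftover blank lines
```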

Reranking: Why Cross-Encoder?

| Reranker | Latency (20 docs) | MRR@10* | Cost | Complexity |
|---|---|---|---|---|
| Cross-Encoder | ~150ms | 0.39 | Free (local) | Low ✓ |
| ColBERT | ~50ms | 0.36 | Free (local) | High (GPU) |
| LLM-as-Reranker | ~2s | 0.40+ | API costs | Low |
| Cohere API | ~100ms | 0.40 | Per-request | Very Low |

*MRR@10 = Mean Reciprocal Rank on MS MARCO passage reranking benchmark

Why Cross-Encoder: Best local accuracy (MRR 0.39), runs locally with no API costs, simple integration with sentence-transformers, and works well for small candidate sets (k ≤ 25).

Reranker Implementation: .rank() vs .predict()

| Method | Pros | Cons |
|---|---|---|
| .predict() | Full control | Manual sorting required |
| .rank() | Built-in sorting, cleaner API ✓ | Less flexible |

Applying a Sigmoid() activation converts the raw logits into 0-1 probability scores, which are easier to interpret.
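
A sketch of this chosen path, assuming sentence-transformers' CrossEncoder with a Sigmoid() default activation; the passages, the model checkpoint, and the 0.3 cut-off (mirroring the min_score threshold mentioned under Known Limitations) are illustrative.

```python
import torch
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2",
                        default_activation_function=torch.nn.Sigmoid())

query = "What warranty period does the contract specify?"
passages = [
    "The warranty period is twelve months from the date of delivery.",
    "Invoices are payable within 30 days of receipt.",
]

# .rank() scores every (query, passage) pair, sorts by score, and returns the top_k.
hits = reranker.rank(query, passages, top_k=3, return_documents=True)
kept = [hit for hit in hits if hit["score"] >= 0.3]   # drop weak matches
for hit in kept:
    print(f'{hit["score"]:.2f}  {hit["text"]}')
```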

Rejected Idea: Passing Relevance Scores to LLM

Idea: Include relevance scores with context so LLM can weight sources differently.

Why rejected:

  • LLMs don't reason well about numerical scores
  • Document ordering already conveys importance
  • Risk of LLM ignoring correct content due to low score
  • Adds prompt complexity without clear benefit

Known Limitations: Styled Text Extraction

PyMuPDF sometimes fails to extract styled text (bold, italic, hyperlinks, colored text).

Examples encountered:

  • "Cyberpunk" (blue hyperlink) → extracted as blank
  • "MMLU" (bold italic) → not captured

Workaround: The reranker's min_score=0.3 threshold filters out irrelevant results.

Future solution: Multimodal RAG using vision models to "see" PDFs as images.


🏗️ Architecture

Architecture diagram: see the screenshot in the repository.
