🛠️ DocVaultAI

A privacy-centric document intelligence platform designed for secure, local semantic analysis. Documents and their embeddings are stored and processed on your own machine, so sensitive data stays private while still delivering powerful Retrieval-Augmented Generation (RAG) capabilities. Built with Streamlit, LangChain, Groq, and ChromaDB.

🛡️ Functional Service Architecture

DocVaultAI is positioned not just as a chat interface but as a secure data pipeline: what distinguishes it from standard conversational tools is professional-grade data sovereignty.

🔒 Air-Gapped Embedding Engine

The semantic meaning of your documents never leaves your machine. Vectors are generated locally using Ollama/FastEmbed, ensuring no third party or cloud provider ever reads, processes, or stores your original files or their semantic representations.
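
A minimal sketch of this local embedding step, assuming the langchain-ollama integration (older LangChain versions import OllamaEmbeddings from langchain_community.embeddings); the sample text is illustrative.

```python
from langchain_ollama import OllamaEmbeddings

embeddings = OllamaEmbeddings(model="nomic-embed-text")

# The text goes only to the local Ollama server (http://localhost:11434 by
# default); neither the document nor its vector ever leaves the machine.
vector = embeddings.embed_query("Quarterly revenue grew 12% year over year.")
print(len(vector))  # dimensionality of the nomic-embed-text embedding
```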

☁️ Hybrid-Cloud Architecture

Local privacy with global intelligence. Heavy reasoning and language generation are offloaded to Groq's high-speed inference engine, but context selection and retrieval happen entirely on-premise. Your full document set is never exposed to the cloud; only the specific snippets relevant to the current query are sent.
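
A hedged sketch of what actually crosses the network, using the groq Python SDK directly for illustration (the app itself goes through LangChain); the model name, prompt wording, and placeholder context are assumptions.

```python
import os
from groq import Groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])

# `context` holds only the few snippets selected by the local retriever,
# never the full document set.
context = "…top-ranked snippets chosen on-premise…"
response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[
        {"role": "system", "content": "Answer strictly from the provided context."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: What does clause 4.2 cover?"},
    ],
)
print(response.choices[0].message.content)
```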

📜 Audit-Ready Citations

Designed for verification-heavy fields such as legal or medical research. Every response includes a transparent Provenance Log that links each assertion to its source PDF page, along with a confidence score, so the reliability of the information can be verified.
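
A small sketch of how such a provenance log could be assembled, assuming each retrieved chunk carries "source" and "page" metadata (as PDF loaders typically attach) and a 0-1 relevance score from the reranker; the helper name is hypothetical.

```python
def provenance_log(chunks, scores):
    """Build human-readable citations: source file, page, and confidence score."""
    lines = []
    for doc, score in zip(chunks, scores):
        source = doc.metadata.get("source", "unknown file")
        page = doc.metadata.get("page", "?")
        lines.append(f"{source}, p. {page}  (confidence: {score:.2f})")
    return "\n".join(lines)

# Example output line:
#   documents/contract.pdf, p. 12  (confidence: 0.87)
```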

🚫 Zero-Retention Guarantee

Data sent to the Groq API is used strictly for inference. As a self-hosted solution, it guarantees that your query context is never stored, trained on, or retained, offering a level of privacy unattainable with public SaaS models like ChatGPT.

✨ Features

  • Query Your Documents — Ask questions and get accurate answers based on your documents
  • Dark Notion-style UI — Clean, minimal dark theme interface
  • Document Management — Add and delete documents from the sidebar
  • Duplicate Detection — Warns before re-uploading files already indexed
  • Semantic Chunking — Splits by topic, not arbitrary character counts
  • Text Preprocessing — Removes citations, page numbers, and bibliography noise
  • Cross-Encoder Reranking — Filters irrelevant results using semantic relevance scoring
  • Persistent Memory — ChromaDB saves embeddings to disk (load in seconds)
  • High-Performance LLM — Groq API running Llama 3.3 70B
  • Local Embeddings — Ollama nomic-embed-text for private processing

🛠️ Tech Stack

| Component | Tool | Why? |
|---|---|---|
| Frontend | Streamlit | Fast, interactive UI in pure Python |
| Framework | LangChain | Orchestrates the RAG pipeline |
| LLM | Groq API | Extremely fast inference for Llama 3.3 |
| Embeddings | Ollama | Runs nomic-embed-text locally |
| Vector Store | ChromaDB | Persists to disk (unlike RAM-only FAISS) |
| PDF Parser | PyMuPDF | Better text extraction than PyPDF |
| Reranker | Cross-Encoder | Filters irrelevant results with semantic scoring |

📁 Project Structure

DocVaultAI/
├── src/
│   ├── app.py          # Streamlit UI
│   ├── rag_core.py     # RAG logic
│   └── styles.css      # Dark theme CSS
├── documents/          # Your PDFs go here
├── rag_vector_store/   # ChromaDB persistence
└── .env                # API keys

⚙️ Setup

  1. Prerequisites:

    • Python 3.13+
    • Ollama installed
    • Pull the embedding model: ollama pull nomic-embed-text
  2. Install Dependencies:

    pip install -r requirements.txt
  3. Environment Variables: Create a .env file in the root directory:

    GROQ_API_KEY=your_groq_api_key_here
  4. Run the App:

    cd src
    streamlit run app.py

🧠 How It Works

  1. Ingestion — Scans documents/ folder for PDFs (PyMuPDF)
  2. Cleaning — Removes citations, page numbers, bibliography entries
  3. Chunking — Semantic splitting by topic shifts, with size limits
  4. Embedding — Converts text to vectors via nomic-embed-text
  5. Storage — Saves vectors to rag_vector_store/ (ChromaDB)
  6. Retrieval — Fetches top 20 similar chunks for your question
  7. Reranking — Cross-Encoder scores relevance, filters to top 3
  8. Generation — Sends question + context to Groq (Llama 3.3); see the sketch below
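
A condensed sketch of steps 6-8, assuming langchain-chroma, langchain-ollama, langchain-groq, and sentence-transformers; the question, reranker checkpoint, and prompt are illustrative rather than the exact DocVaultAI code.

```python
from langchain_chroma import Chroma
from langchain_groq import ChatGroq
from langchain_ollama import OllamaEmbeddings
from sentence_transformers import CrossEncoder

question = "What warranty period does the contract specify?"

# 6. Retrieval: top 20 candidate chunks from the persisted local store.
store = Chroma(persist_directory="rag_vector_store",
               embedding_function=OllamaEmbeddings(model="nomic-embed-text"))
candidates = store.similarity_search(question, k=20)

# 7. Reranking: a Cross-Encoder scores each (question, chunk) pair; keep the top 3.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(question, d.page_content) for d in candidates])
ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
context = "\n\n".join(doc.page_content for _, doc in ranked[:3])

# 8. Generation: only the question and the three reranked snippets go to Groq.
llm = ChatGroq(model="llama-3.3-70b-versatile")
answer = llm.invoke(f"Answer using only this context:\n{context}\n\nQuestion: {question}")
print(answer.content)
```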

💡 Lessons Learned

Vector Store: FAISS vs Chroma

  • FAISS: Stores in RAM, requires re-processing on restart
  • Chroma: Persists to disk, reloads in about 2 seconds ✓
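
A minimal sketch of the persistence pattern behind that choice, assuming the langchain-chroma integration; paths and sample content are illustrative.

```python
from langchain_chroma import Chroma
from langchain_core.documents import Document
from langchain_ollama import OllamaEmbeddings

embeddings = OllamaEmbeddings(model="nomic-embed-text")
chunks = [Document(page_content="Example chunk from a parsed PDF.",
                   metadata={"source": "documents/example.pdf", "page": 1})]

# First run: embed the chunks and write the index under rag_vector_store/.
store = Chroma.from_documents(chunks, embeddings,
                              persist_directory="rag_vector_store")

# Any later run: re-open the on-disk index, no re-parsing or re-embedding.
store = Chroma(persist_directory="rag_vector_store",
               embedding_function=embeddings)
```

The second call is all a restart needs, which is where the roughly 2-second reload comes from.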

Embeddings: Speed vs Accuracy

  • HuggingFace (all-MiniLM-L6-v2): Fast but lower accuracy
  • Ollama (nomic-embed-text): Best balance, 8192 token context ✓
  • FastEmbed (BAAI/bge-small): Future option for 1000+ docs
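
All three options plug into LangChain through the same embeddings interface, so swapping backends is essentially a one-line change. A hedged sketch, assuming the langchain-huggingface, langchain-ollama, and langchain-community (FastEmbed) packages; import paths may vary by LangChain version.

```python
from langchain_community.embeddings.fastembed import FastEmbedEmbeddings
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_ollama import OllamaEmbeddings

fast_but_coarse = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2")
balanced = OllamaEmbeddings(model="nomic-embed-text")      # the one DocVaultAI uses
for_large_corpora = FastEmbedEmbeddings(model_name="BAAI/bge-small-en-v1.5")
```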

Chunking: Character vs Semantic

| Method | How it works | Pros | Cons |
|---|---|---|---|
| Character | Cut every N chars | Simple, fast | Breaks mid-sentence |
| Semantic | Split by topic shifts | Coherent chunks | Variable sizes |
| Recursive Semantic | Semantic + size limits | Best of both ✓ | More complex |
| Small-to-Big | Search small chunks, return parent context | Very precise search + full context | Complex metadata linking |
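
A sketch of the Recursive Semantic approach chosen above, assuming langchain-experimental's SemanticChunker plus a size-capped RecursiveCharacterTextSplitter; the size and threshold values are illustrative, not the exact DocVaultAI settings.

```python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_ollama import OllamaEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

semantic = SemanticChunker(OllamaEmbeddings(model="nomic-embed-text"),
                           breakpoint_threshold_type="percentile")
size_cap = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=150)

def recursive_semantic_split(text: str):
    """Split by topic shifts, then re-split any chunk that is still too large."""
    chunks = []
    for chunk in semantic.split_text(text):
        if len(chunk) > 1500:
            chunks.extend(size_cap.split_text(chunk))
        else:
            chunks.append(chunk)
    return chunks
```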

PDF Parsing: PyPDF vs PyMuPDF

  • PyPDF: Simple but breaks text with unusual fonts ("Ar e W e")
  • PyMuPDF: Handles styled text, fonts, and formatting better ✓
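
A minimal PyMuPDF sketch of the extraction step; the file path is illustrative. PyMuPDF is imported as fitz.

```python
import fitz  # PyMuPDF

def load_pdf(path: str):
    """Yield (page_number, text) pairs for every page in the PDF."""
    with fitz.open(path) as doc:
        for page in doc:
            yield page.number + 1, page.get_text()

for page_no, text in load_pdf("documents/example.pdf"):
    print(page_no, text[:80])
```

Keeping the page number alongside the text is what later lets the citations point back to a specific PDF page.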

Text Preprocessing Trade-offs

Regex cleaning removes citations and page numbers, but may also catch valid content such as "Table 1". The post-retrieval reranker filters irrelevant results more intelligently, using semantic understanding rather than pattern matching.
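
Illustrative regex patterns in the spirit of this cleaning step; the exact patterns DocVaultAI uses may differ, and over-eager patterns are precisely how valid strings like "Table 1" can get caught.

```python
import re

def clean_text(text: str) -> str:
    text = re.sub(r"\[\d+(?:,\s*\d+)*\]", "", text)                    # inline citations like [3] or [3, 7]
    text = re.sub(r"^\s*\d+\s*$", "", text, flags=re.MULTILINE)        # lone page-number lines
    text = re.sub(r"(?is)\n(references|bibliography)\b.*$", "", text)  # trailing bibliography section
    return re.sub(r"\n{3,}", "\n\n", text).strip()                     # collapse leftover blank lines
```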

Reranking: Why Cross-Encoder?

| Reranker | Latency (20 docs) | MRR@10* | Cost | Complexity |
|---|---|---|---|---|
| Cross-Encoder | ~150ms | 0.39 | Free (local) | Low ✓ |
| ColBERT | ~50ms | 0.36 | Free (local) | High (GPU) |
| LLM-as-Reranker | ~2s | 0.40+ | API costs | Low |
| Cohere API | ~100ms | 0.40 | Per-request | Very Low |

*MRR@10 = Mean Reciprocal Rank on MS MARCO passage reranking benchmark

Why Cross-Encoder: Best local accuracy (MRR 0.39), runs locally with no API costs, simple integration with sentence-transformers, and works well for small candidate sets (k ≤ 25).

Reranker Implementation: .rank() vs .predict()

| Method | Pros | Cons |
|---|---|---|
| .predict() | Full control | Manual sorting required |
| .rank() | Built-in sorting, cleaner API ✓ | Less flexible |

Applying a Sigmoid() activation converts the raw logits into 0-1 probability scores, which are easier to interpret.
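
A sketch of this chosen path, assuming sentence-transformers' CrossEncoder with a Sigmoid() default activation; the passages, the model checkpoint, and the 0.3 cut-off (mirroring the min_score threshold mentioned under Known Limitations) are illustrative.

```python
import torch
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2",
                        default_activation_function=torch.nn.Sigmoid())

query = "What warranty period does the contract specify?"
passages = [
    "The warranty period is twelve months from the date of delivery.",
    "Invoices are payable within 30 days of receipt.",
]

# .rank() scores every (query, passage) pair, sorts by score, and returns the top_k.
hits = reranker.rank(query, passages, top_k=3, return_documents=True)
kept = [hit for hit in hits if hit["score"] >= 0.3]   # drop weak matches
for hit in kept:
    print(f'{hit["score"]:.2f}  {hit["text"]}')
```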

Rejected Idea: Passing Relevance Scores to LLM

Idea: Include relevance scores with context so LLM can weight sources differently.

Why rejected:

  • LLMs don't reason well about numerical scores
  • Document ordering already conveys importance
  • Risk of LLM ignoring correct content due to low score
  • Adds prompt complexity without clear benefit

Known Limitations: Styled Text Extraction

PyMuPDF sometimes fails to extract styled text (bold, italic, hyperlinks, colored text).

Examples encountered:

  • "Cyberpunk" (blue hyperlink) → extracted as blank
  • "MMLU" (bold italic) → not captured

Workaround: The reranker's min_score=0.3 threshold filters out irrelevant results.

Future solution: Multimodal RAG using vision models to "see" PDFs as images.


🏗️ Architecture

Architecture diagram: see the screenshot in the repository.
