A privacy-centric document intelligence platform designed for secure, local semantic analysis. By leveraging locally saved documents and embeddings, it ensures sensitive data remains secure while delivering powerful Retrieval-Augmented Generation (RAG) capabilities. Built with Streamlit, LangChain, Groq, and ChromaDB.
We position this not just as a chat interface but as a secure data pipeline, distinguished from standard conversational tools by professional-grade data sovereignty.
The semantic meaning of your documents never leaves your machine. Vectors are generated locally using Ollama/FastEmbed, ensuring no third party or cloud provider ever reads, processes, or stores your original files or their semantic representations.
Local Privacy with Global Intelligence. Heavy reasoning and language generation are offloaded to Groq's high-speed inference engine, but the sensitive context selection and retrieval happen entirely on-premise. Your full document set is never exposed to the cloud—only the specific, anonymized snippets relevant to a query.
Designed for verification-heavy fields like legal or medical research. Every response provides a transparent Provenance Log, linking assertions directly to source PDF pages with confidence scores so the reliability of the information can be verified.
Data sent to the Groq API is used strictly for inference: because the solution is self-hosted, your query context is never stored, trained on, or retained, offering a level of privacy unattainable with public SaaS models like ChatGPT.
- Query Your Documents — Ask questions and get accurate answers based on your documents
- Dark Notion-style UI — Clean, minimal dark theme interface
- Document Management — Add and delete documents from the sidebar
- Duplicate Detection — Warns before re-uploading files already indexed
- Semantic Chunking — Splits by topic, not arbitrary character counts
- Text Preprocessing — Removes citations, page numbers, and bibliography noise
- Cross-Encoder Reranking — Filters irrelevant results using semantic relevance scoring
- Persistent Memory — ChromaDB saves embeddings to disk (load in seconds)
- High-Performance LLM — Groq API running Llama 3.3 70B
- Local Embeddings — Ollama `nomic-embed-text` for private processing
| Component | Tool | Why? |
|---|---|---|
| Frontend | Streamlit | Fast, interactive UI in pure Python |
| Framework | LangChain | Orchestrates the RAG pipeline |
| LLM | Groq API | Extremely fast inference for Llama 3 |
| Embeddings | Ollama | Runs nomic-embed-text locally |
| Vector Store | ChromaDB | Persists to disk (unlike RAM-only FAISS) |
| PDF Parser | PyMuPDF | Better text extraction than PyPDF |
| Reranker | Cross-Encoder | Filters irrelevant results with semantic scoring |
```
DocVaultAI/
├── src/
│   ├── app.py            # Streamlit UI
│   ├── rag_core.py       # RAG logic
│   └── styles.css        # Dark theme CSS
├── documents/            # Your PDFs go here
├── rag_vector_store/     # ChromaDB persistence
└── .env                  # API keys
```
- Prerequisites:
  - Python 3.13+
  - Ollama installed
  - Pull the embedding model: `ollama pull nomic-embed-text`
- Install Dependencies: `pip install -r requirements.txt`
- Environment Variables: Create a `.env` file in the root directory: `GROQ_API_KEY=your_groq_api_key_here`
- Run the App: `cd src`, then `streamlit run app.py`
- Ingestion — Scans the `documents/` folder for PDFs (PyMuPDF)
- Cleaning — Removes citations, page numbers, bibliography entries
- Chunking — Semantic splitting by topic shifts, with size limits
- Embedding — Converts text to vectors via `nomic-embed-text`
- Storage — Saves vectors to `rag_vector_store/` (ChromaDB)
- Retrieval — Fetches top 20 similar chunks for your question
- Reranking — Cross-Encoder scores relevance, filters to top 3
- Generation — Sends question + context to Groq (Llama 3.3)
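The retrieval, reranking, and generation steps condense to a few calls. Below is a minimal sketch, assuming the LangChain partner packages `langchain-ollama`, `langchain-chroma`, and `langchain-groq` plus the `ms-marco-MiniLM-L-6-v2` cross-encoder; the project's actual module and function names (e.g. in `rag_core.py`) may differ.

```python
# Hedged sketch of the Retrieval -> Reranking -> Generation steps above.
from langchain_ollama import OllamaEmbeddings
from langchain_chroma import Chroma
from langchain_groq import ChatGroq
from sentence_transformers import CrossEncoder

embeddings = OllamaEmbeddings(model="nomic-embed-text")           # local, private vectors
vectordb = Chroma(persist_directory="rag_vector_store",           # reopened from disk
                  embedding_function=embeddings)
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")   # runs locally
llm = ChatGroq(model="llama-3.3-70b-versatile")                   # reads GROQ_API_KEY

def answer(question: str, k: int = 20, top_n: int = 3) -> str:
    # Retrieval: fetch the top-k similar chunks from the local store
    candidates = vectordb.similarity_search(question, k=k)
    # Reranking: score each (question, chunk) pair, keep the best few
    scores = reranker.predict([(question, d.page_content) for d in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    context = "\n\n".join(doc.page_content for _, doc in ranked[:top_n])
    # Generation: only these reranked snippets, not the full documents, leave the machine
    prompt = f"Answer using only this context:\n\n{context}\n\nQuestion: {question}"
    return llm.invoke(prompt).content
```

Only the reranked snippets are forwarded to Groq, which is what keeps the full document set on-premise as described above.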
- FAISS: Stores in RAM, requires re-processing on restart
- Chroma: Persists to disk, reloads in ~2 seconds ✓
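A minimal sketch of what that persistence looks like in code, assuming `langchain-chroma` and `langchain-ollama`; the directory name follows the project layout above.

```python
from langchain_chroma import Chroma
from langchain_core.documents import Document
from langchain_ollama import OllamaEmbeddings

embeddings = OllamaEmbeddings(model="nomic-embed-text")
chunks = [Document(page_content="An example chunk produced by the splitter.")]  # placeholder

# First run: embed the chunks once and write them to rag_vector_store/.
db = Chroma.from_documents(chunks, embeddings, persist_directory="rag_vector_store")

# Every later run: reopen the same directory; nothing is re-embedded.
db = Chroma(persist_directory="rag_vector_store", embedding_function=embeddings)
```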
- HuggingFace (`all-MiniLM-L6-v2`): Fast but lower accuracy
- Ollama (`nomic-embed-text`): Best balance, 8192-token context ✓
- FastEmbed (`BAAI/bge-small`): Future option for 1000+ docs
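Since LangChain wraps each of these behind the same embeddings interface, switching backends is a one-line change. A sketch, assuming the `langchain-huggingface`, `langchain-ollama`, and `langchain-community` packages (the exact FastEmbed model id is an assumption):

```python
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_ollama import OllamaEmbeddings
from langchain_community.embeddings import FastEmbedEmbeddings

# Fast but lower accuracy
minilm = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Current choice: best balance, 8192-token context, served locally by Ollama
nomic = OllamaEmbeddings(model="nomic-embed-text")

# Future option for 1000+ docs
bge = FastEmbedEmbeddings(model_name="BAAI/bge-small-en-v1.5")

# All three expose the same interface, so the vector store code stays identical.
vector = nomic.embed_query("How does the reranker filter results?")
```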
| Method | How it works | Pros | Cons |
|---|---|---|---|
| Character | Cut every N chars | Simple, fast | Breaks mid-sentence |
| Semantic | Split by topic shifts | Coherent chunks | Variable sizes |
| Recursive Semantic | Semantic + size limits | Best of both ✓ | More complex |
| Small-to-Big | Search small chunks, return parent context | Very precise search + full context | Complex metadata linking |
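A sketch of the "Recursive Semantic" row, assuming LangChain's experimental `SemanticChunker` for topic-shift splitting followed by a `RecursiveCharacterTextSplitter` to enforce size limits; the thresholds and sizes are illustrative, not the project's actual settings.

```python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_ollama import OllamaEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

embeddings = OllamaEmbeddings(model="nomic-embed-text")
raw_text = "Sentences about one topic. ... Sentences about a different topic."  # placeholder

# Pass 1: split where the embedding distance between sentences jumps (a topic shift)
semantic = SemanticChunker(embeddings, breakpoint_threshold_type="percentile")
chunks = semantic.create_documents([raw_text])

# Pass 2: cap oversized chunks so no single chunk exceeds the size limit
size_cap = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=150)
chunks = size_cap.split_documents(chunks)
```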
- PyPDF: Simple but breaks text with unusual fonts ("Ar e W e")
- PyMuPDF: Handles styled text, fonts, and formatting better ✓
Regex cleaning removes citations and page numbers but may catch valid content like "Table 1". The post-retrieval reranker filters irrelevant results more intelligently, using semantic understanding.
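The regex pass might look roughly like the sketch below; the exact patterns in the project may differ, and as noted above patterns this blunt can occasionally remove valid content.

```python
import re

def clean(text: str) -> str:
    # Drop bracketed numeric citations such as [12] or [3, 7]
    text = re.sub(r"\[\d+(?:,\s*\d+)*\]", "", text)
    # Drop lines that contain nothing but a page number
    text = re.sub(r"^\s*\d+\s*$", "", text, flags=re.MULTILINE)
    # Cut everything after a References / Bibliography heading
    text = re.split(r"\n(?:References|Bibliography)\s*\n", text, flags=re.IGNORECASE)[0]
    # Collapse the blank lines left behind
    return re.sub(r"\n{3,}", "\n\n", text).strip()
```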
| Reranker | Latency (20 docs) | MRR@10* | Cost | Complexity |
|---|---|---|---|---|
| Cross-Encoder | ~150ms | 0.39 | Free (local) | Low ✓ |
| ColBERT | ~50ms | 0.36 | Free (local) | High (GPU) |
| LLM-as-Reranker | ~2s | 0.40+ | API costs | Low |
| Cohere API | ~100ms | 0.40 | Per-request | Very Low |
*MRR@10 = Mean Reciprocal Rank on MS MARCO passage reranking benchmark
Why Cross-Encoder: Best local accuracy (MRR 0.39), runs locally with no API costs, simple integration with sentence-transformers, and works well for small candidate sets (k ≤ 25).
| Method | Pros | Cons |
|---|---|---|
| `.predict()` | Full control | Manual sorting required |
| `.rank()` | Built-in sorting, cleaner API ✓ | Less flexible |
Using Sigmoid() activation converts raw logits to 0-1 probability scores for interpretability.
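A sketch of the `.rank()` path with an explicit 0-1 conversion, assuming the `sentence-transformers` cross-encoder API; the model name and threshold are illustrative.

```python
import math
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "How are irrelevant chunks filtered?"
docs = [
    "The cross-encoder scores each (query, chunk) pair and low-scoring chunks are dropped.",
    "An unrelated paragraph about PDF font extraction quirks.",
]

# .rank() scores and sorts in one call; .predict() would return unsorted scores instead.
hits = model.rank(query, docs, return_documents=True, top_k=3)

for hit in hits:
    score = float(hit["score"])
    # Depending on the sentence-transformers version / model config the score may already
    # be sigmoid-ed; if it is clearly a raw logit (outside 0-1), squash it ourselves.
    if not 0.0 <= score <= 1.0:
        score = 1 / (1 + math.exp(-score))
    if score >= 0.3:                      # illustrative min_score-style threshold
        print(round(score, 3), hit["text"])
```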
Idea: Include relevance scores with the context so the LLM can weight sources differently.
Why rejected:
- LLMs don't reason well about numerical scores
- Document ordering already conveys importance
- Risk of LLM ignoring correct content due to low score
- Adds prompt complexity without clear benefit
PyMuPDF sometimes fails to extract styled text (bold, italic, hyperlinks, colored text).
Examples encountered:
"Cyberpunk"(blue hyperlink) → extracted as blank"MMLU"(bold italic) → not captured
Workaround: The reranker's min_score=0.3 threshold filters out irrelevant results.
Future solution: Multimodal RAG using vision models to "see" PDFs as images.
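One plausible first step toward that is rendering each page to an image with PyMuPDF and handing it to a vision-capable model; a hedged sketch of the rendering half (the vision call itself depends on the model chosen):

```python
import fitz  # PyMuPDF

# Rasterize each page so a vision model can "see" the styled text the parser misses.
with fitz.open("documents/example.pdf") as doc:        # illustrative path
    for page_number, page in enumerate(doc, start=1):
        pix = page.get_pixmap(dpi=150)                 # render the page to a bitmap
        pix.save(f"page_{page_number}.png")
```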