A production-grade Retrieval-Augmented Generation (RAG) chatbot that answers driving rules questions using official DMV manuals from all 50 US states + DC. Built with hybrid vector search, cross-encoder reranking, Redis caching, and Llama 4 via Groq.
User Query
│
▼
State Detection ← Detects state from query text (e.g. "California" → "CA")
│
▼
Redis Cache Check ← Return instantly if same query was asked before
│ (cache miss)
▼
Hybrid Search ← Dense vector search (Qdrant) + Keyword search, merged via RRF
│
▼
S3 Hydration ← Fetch full chunk text from S3 (Qdrant stores only previews)
│
▼
Reranking ← Cross-encoder scores each chunk for true relevance
│
▼
LLM Generation ← Groq (Llama 4) generates cited answer from top 5 chunks
│
▼
Cache Write + Response ← Store in Redis, return answer with sources to user
Text chunks are converted into 1024-dimensional numerical vectors using BAAI/bge-large-en-v1.5. Semantically similar text produces numerically close vectors, enabling meaning-based search rather than keyword matching.
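"Numerically close" here means high cosine similarity between vectors. A minimal stdlib sketch of the idea, using toy 4-dimensional vectors in place of real 1024-dimensional bge embeddings (the vectors and topic labels are illustrative, not actual model outputs):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 = same direction, ~0.0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for embeddings of three text chunks.
school_zone = [0.9, 0.1, 0.3, 0.0]
speed_limit = [0.8, 0.2, 0.4, 0.1]   # semantically close topic
vehicle_reg = [0.1, 0.9, 0.0, 0.7]   # unrelated topic

# Meaning-based search: the semantically related chunk scores higher.
print(cosine_similarity(school_zone, speed_limit) >
      cosine_similarity(school_zone, vehicle_reg))
```

Vector search engines like Qdrant apply the same principle at scale, returning the chunks whose vectors score highest against the embedded query.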
Stores all 7,644 chunk vectors and indexes them using HNSW graphs for millisecond similarity search. When a user asks a question, the query is embedded and Qdrant finds the closest matching vectors filtered by state.
Combines dense vector search (semantic similarity via Qdrant) with keyword search (term frequency over chunk previews), then merges both result lists using Reciprocal Rank Fusion (RRF). This catches cases where one method misses what the other finds.
A rank-merging algorithm that scores each chunk as 1 / (60 + rank) in each result list and sums the scores across lists. Chunks ranked highly by both search methods receive the highest combined score.
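The RRF formula above fits in a few lines. A self-contained sketch, assuming each search method returns an ordered list of chunk IDs (the IDs and rankings below are hypothetical):

```python
from collections import defaultdict

def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists: each chunk scores 1 / (k + rank) per list, summed."""
    scores: dict[str, float] = defaultdict(float)
    for results in result_lists:
        for rank, chunk_id in enumerate(results, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense   = ["c3", "c1", "c7"]   # hypothetical dense-search ranking
keyword = ["c1", "c3", "c9"]   # hypothetical keyword-search ranking

# c1 and c3 appear in both lists, so they rise above c7 and c9.
print(reciprocal_rank_fusion([dense, keyword]))
```

The constant k = 60 dampens the gap between adjacent ranks, so a chunk ranked 2nd in both lists beats one ranked 1st in only one list.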
A cross-encoder model (BAAI/bge-reranker-v2-m3) scores each query-chunk pair jointly for true relevance. Unlike embeddings which encode query and chunk separately, the cross-encoder reads both together and produces a more accurate relevance score.
The system prompt instructs the LLM to use only the provided context, always cite sources as (STATE Manual, Page X), and explicitly say when it cannot find an answer rather than hallucinate. The RAG prompt formats each chunk with its source header before sending it to the LLM.
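A minimal sketch of that prompt assembly. The exact instruction wording and chunk-dict keys are assumptions for illustration; only the citation format (STATE Manual, Page X) comes from the description above:

```python
# Hedged sketch: instruction text is illustrative, not the project's verbatim prompt.
SYSTEM_PROMPT = (
    "Answer only from the provided context. "
    "Cite every claim as (STATE Manual, Page X). "
    "If the context does not contain the answer, say so instead of guessing."
)

def build_rag_prompt(query: str, chunks: list[dict]) -> str:
    """Prefix each chunk with its source header, then append the question."""
    context = "\n\n".join(
        f"[{c['state']} Manual, Page {c['page']}]\n{c['text']}" for c in chunks
    )
    return f"Context:\n{context}\n\nQuestion: {query}"

prompt = build_rag_prompt(
    "What is the school zone speed limit?",
    [{"state": "CA", "page": 73, "text": "The speed limit in a school zone is 25 mph."}],
)
print(prompt)
```

The source header on each chunk is what lets the model emit page-level citations instead of attributing everything to the manual as a whole.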
Redis stores two types of data: LLM responses (keyed by MD5 hash of query + state, TTL 24h) and hot chunks promoted from S3 (chunks accessed 10+ times get cached for 7 days). This prevents redundant LLM calls and reduces S3 fetch latency.
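The cache key and promotion rule can be sketched with the stdlib. The query normalization and the "llm:" key prefix are assumptions; the MD5-of-query-plus-state scheme, 10-access threshold, and 7-day hot-chunk TTL come from the description above:

```python
import hashlib

def cache_key(query: str, state: str) -> str:
    """MD5 of normalized query + state (normalization is an assumed detail)."""
    raw = f"{query.strip().lower()}|{state}"
    return "llm:" + hashlib.md5(raw.encode("utf-8")).hexdigest()

# Hot-chunk promotion: chunks fetched from S3 10+ times get a 7-day Redis TTL.
PROMOTE_THRESHOLD = 10
HOT_CHUNK_TTL = 7 * 24 * 3600  # seconds

def should_promote(access_count: int) -> bool:
    return access_count >= PROMOTE_THRESHOLD

key = cache_key("What is the speed limit in school zones?", "CA")
print(key)
```

Hashing the normalized query means trivially different phrasings of the same question ("  California BAC limit" vs "california bac limit") hit the same cache entry.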
| Tier | Storage | What Lives Here |
|---|---|---|
| 1 | Redis | Cached LLM responses + hot chunks |
| 2 | Qdrant | Vectors + metadata + text previews |
| 3 | S3 Standard | Full chunk JSON files |
| 4 | S3 Glacier | Original PDF manuals (cold archive) |
dmv-rag/
├── ingestion/
│ ├── download_pdfs.py # Downloads all 51 state manuals
│ ├── parse_pdfs.py # Extracts text via PyMuPDF + pdfplumber
│ ├── chunk.py # Splits pages into 512-token chunks
│ └── embed_and_upload.py # Embeds chunks, uploads to Qdrant + S3
│
├── retrieval/
│ ├── state_detector.py # Detects US state from query text
│ ├── hybrid_search.py # Dense + keyword search with RRF fusion
│ └── reranker.py # Cross-encoder reranking via HF API
│
├── generation/
│ ├── prompt.py # System prompt + RAG prompt builder
│ └── llm.py # Groq (primary) + OpenAI (fallback) caller
│
├── cache/
│ ├── redis_cache.py # LLM response cache
│ └── tier_manager.py # S3 fetcher with Redis promotion
│
├── api/
│ └── main.py # FastAPI app with /query, /health, /cache/stats
│
├── frontend/
│ └── app.py # Streamlit chatbot UI
│
├── data/ # Local data (gitignored)
│ ├── pdfs/ # Downloaded state manuals
│ ├── parsed/ # Extracted text JSON
│ └── chunks/ # Chunked text JSON
│
├── requirements.txt
└── .env
| Component | Technology |
|---|---|
| LLM | Groq — Llama 4 Scout 17B |
| Embeddings | BAAI/bge-large-en-v1.5 via HF Inference API |
| Reranker | BAAI/bge-reranker-v2-m3 via HF Inference API |
| Vector DB | Qdrant (Docker) |
| Cache | Redis 7 (Docker) |
| Object Storage | AWS S3 |
| PDF Parsing | PyMuPDF + pdfplumber |
| Chunking | LangChain RecursiveCharacterTextSplitter + tiktoken |
| API | FastAPI + uvicorn |
| Frontend | Streamlit |
- Python 3.10
- Docker Desktop
- AWS account with S3 bucket
- Groq API key (free at console.groq.com)
- HuggingFace API token (free at huggingface.co/settings/tokens)
Create a .env file in the project root:
# HuggingFace
HF_API_TOKEN=hf_xxx
EMBED_MODEL=BAAI/bge-large-en-v1.5
RERANKER_MODEL=BAAI/bge-reranker-v2-m3
# Qdrant
QDRANT_HOST=localhost
QDRANT_PORT=6333
QDRANT_COLLECTION=dmv_manuals
# AWS
AWS_ACCESS_KEY_ID=xxx
AWS_SECRET_ACCESS_KEY=xxx
AWS_DEFAULT_REGION=us-east-1
S3_CHUNKS_BUCKET=your-chunks-bucket
S3_PDFS_BUCKET=your-pdfs-bucket
# Redis
REDIS_HOST=localhost
REDIS_PORT=6379
REDIS_TTL_SECONDS=86400
REDIS_PROMOTE_THRESHOLD=10
# LLM
GROQ_API_KEY=xxx
OPENAI_API_KEY=xxx
LLM_MODEL=meta-llama/llama-4-scout-17b-16e-instruct
# Search
TOP_K_RETRIEVE=20
TOP_K_RERANK=5

# Clone and setup
git clone https://github.com/yourusername/dmv-rag.git
cd dmv-rag
python3.10 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
# Start Docker services
docker start qdrant redis
# or first time:
docker run -d --name qdrant -p 6333:6333 -v ~/qdrant_data:/qdrant/storage qdrant/qdrant
docker run -d --name redis -p 6379:6379 redis:7-alpine

python ingestion/download_pdfs.py # Download 51 state PDFs
python ingestion/parse_pdfs.py # Extract text
python ingestion/chunk.py # Split into chunks
python ingestion/embed_and_upload.py # Embed + upload to Qdrant + S3

# Terminal 1 — API
uvicorn api.main:app --host 0.0.0.0 --port 8000 --reload
# Terminal 2 — Frontend
streamlit run frontend/app.py

| Method | Endpoint | Description |
|---|---|---|
| GET | /health | Health check + collection name |
| GET | /cache/stats | Redis cache statistics |
| POST | /query | Main RAG query endpoint |
curl -X POST http://localhost:8000/query \
-H 'Content-Type: application/json' \
-d '{"query": "What is the speed limit in school zones in California?"}'

{
"answer": "The speed limit in school zones in California is 25 mph...",
"state_detected": "CA",
"sources": [{"state": "CA", "page": 73, "rerank_score": 0.424}],
"cached": false,
"model_used": "meta-llama/llama-4-scout-17b-16e-instruct"
}

- 51 state driver manuals (50 states + DC)
- ~4,200 pages of text extracted
- 7,644 chunks at 512 tokens each
- 1024-dimensional embeddings
The project includes a pre-computed analysis module that scores and ranks all 51 state DMV manuals across three dimensions.
| Dimension | Weight | What It Measures |
|---|---|---|
| Content Depth | 20% | avg words per page, total pages, image-only pages |
| Readability | 20% | Flesch Reading Ease, FK Grade, Gunning Fog, SMOG |
| Topic Coverage | 60% | keyword frequency across 7 categories |
Test Prep · Safety · Legal Compliance · Teen Rules · Emergency · Registration · Commercial
weighted_score = (0.20 × content_depth) + (0.20 × readability) + (0.60 × topic_coverage)
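The weighted-score formula above as a worked example. The dimension values are hypothetical and assumed to be normalized to a 0-100 scale before weighting:

```python
def weighted_score(content_depth: float, readability: float, topic_coverage: float) -> float:
    """Weighted sum from the formula above (weights 0.20 / 0.20 / 0.60)."""
    return 0.20 * content_depth + 0.20 * readability + 0.60 * topic_coverage

# Hypothetical normalized (0-100) dimension scores for one state's manual:
# 0.20*80 + 0.20*70 + 0.60*90 = 16 + 14 + 54 = 84
print(weighted_score(80.0, 70.0, 90.0))
```

Topic coverage dominates at 60%, so a manual that thoroughly covers the seven keyword categories can outrank a longer or more readable one.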
10 standard questions (BAC limits, school zone speeds, minimum age, DUI penalties, etc.) are queried against all 51 states via the RAG pipeline. Reranker confidence scores surface which states document each topic clearly and which have coverage gaps.
# Pre-compute stats (run once)
python -m analysis.compute_stats
python -m analysis.cross_state_compare
# View dashboard
streamlit run frontend/app.py
# Navigate to "Manual Comparison" in the sidebar

Developed by Yash Mahajan under the guidance of Dr. Rakesh Mahto and Dr. Deepak Sharma.