Drive RAG System

Problem

Organizations accumulate thousands of documents in Google Drive. Finding information requires manually searching through hundreds of files, often taking hours and yielding incomplete results. Traditional keyword search fails when documents use different terminology, and even when relevant documents are found, users must read through multiple files and manually synthesize information.

Solution

This system enables natural language questions over document collections, returning accurate answers with verifiable source citations. It addresses three fundamental limitations of traditional search:

Semantic understanding: Finds documents even when they use different terminology than the query
Information synthesis: Combines information from multiple documents into coherent answers
Source verification: Every answer includes specific citations with document names and page references

How It Works

The system implements a retrieval-augmented generation pipeline:

Ingestion: Documents are parsed, segmented into semantic chunks, converted to vector embeddings, and indexed for both semantic and keyword search.

Query Processing: User questions are expanded into multiple variations, searched using both vector similarity and keyword matching, reranked for relevance, and synthesized into answers with source citations.

Key Design Decisions:

Hybrid retrieval combines semantic search with traditional keyword matching
Two-stage retrieval (broad recall, then precision reranking) optimizes accuracy
Structured prompts enforce source citations and prevent hallucination
Asynchronous ingestion handles large document collections

Documentation

SETUP.md - Installation and configuration guide
LM_STUDIO.md - LM Studio setup guide (beginner-friendly GUI for local LLMs)
GEMINI.md - Google Gemini API setup guide
ITERATIVE_RAG.md - Agentic RAG with iterative search
ARCHITECTURE.md - Design decisions and rationale

Technology Stack

Backend: FastAPI with async request handling, PostgreSQL with pgvector for vector operations, Celery and Redis for background job processing

RAG Components: LangChain framework with HuggingFace embeddings (multilingual-e5-large) for local embedding generation, hybrid search combining vector similarity with BM25 full-text search, BGE cross-encoder for reranking, flexible LLM support (Ollama, OpenAI-compatible APIs like LM Studio, or Gemini) for answer generation via LangChain providers

Fully Open Source: No API keys required - runs entirely on local infrastructure with open source models

LLM Options:

Ollama (default): Free, open-source local models (Mistral, Llama, etc.)
LM Studio: OpenAI-compatible API for running local models with a GUI
Gemini: Google's Gemini API (requires API key from AI Studio)
Other OpenAI-compatible APIs: Any service that implements the OpenAI API standard

Evaluation: Ragas framework for quality metrics (faithfulness, recall, precision, relevance)

Quick Start

See SETUP.md for detailed installation instructions.

# Configure environment
cp .env.example .env
# Edit .env with your credentials

# Start services
docker-compose up -d

# Verify
curl http://localhost:8000/healthz

## Usage Examples

**List available documents:**
```bash
python scripts/list_drive_files.py --folder-id YOUR_FOLDER_ID

Selective ingestion:

python scripts/list_drive_files.py --folder-id YOUR_FOLDER_ID --format csv > files.csv
# Edit files.csv to keep only desired files
python scripts/ingest_from_csv.py --csv files.csv

Query via API:

curl -X POST http://localhost:8000/ask \
  -H "Content-Type: application/json" \
  -d '{"query": "your question here", "multi_query": true}'

Search without answer generation:

curl -X POST http://localhost:8000/search \
  -H "Content-Type: application/json" \
  -d '{"query": "search terms", "k": 20}'

Prerequisites

Docker and Docker Compose
Google Drive Service Account credentials (for document ingestion only)
LLM provider (choose one):
- Ollama (default): Free, open-source, install from https://ollama.ai
- LM Studio: User-friendly GUI for local models, download from https://lmstudio.ai
- Any OpenAI-compatible API endpoint

No paid API services required. For detailed setup instructions, see SETUP.md.

Project Structure

app/
├── ingest/         # Google Drive integration
├── parse/          # PDF and Google Docs extraction
├── chunking/       # Semantic text segmentation
├── index/          # Vector and full-text indexing
├── retrieval/      # Hybrid search implementation
├── rerank/         # Cross-encoder reranking
├── generate/       # LLM answer generation
└── eval/           # Quality evaluation

License

MIT License

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
app		app
migrations		migrations
scripts		scripts
.env.example		.env.example
.gitignore		.gitignore
ARCHITECTURE.md		ARCHITECTURE.md
Dockerfile		Dockerfile
ITERATIVE_RAG.md		ITERATIVE_RAG.md
LM_STUDIO.md		LM_STUDIO.md
README.md		README.md
SETUP.md		SETUP.md
docker-compose.yml		docker-compose.yml
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Drive RAG System

Problem

Solution

How It Works

Documentation

Technology Stack

Quick Start

Prerequisites

Project Structure

License

About

Uh oh!

Languages

MineTech0/drive-rag

Folders and files

Latest commit

History

Repository files navigation

Drive RAG System

Problem

Solution

How It Works

Documentation

Technology Stack

Quick Start

Prerequisites

Project Structure

License

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Languages