A comprehensive search platform for the DOJ-released Epstein files. Unlike the official DOJ search that only matches document titles, this platform enables full-text search, entity extraction, and document analysis across 15,875 documents.
Current Status (Feb 2025)
| Component | Status | Details |
|---|---|---|
| Data Ingestion | ✅ Complete | 15,875 documents from 12 DOJ data sets |
| Full-Text Search | ✅ Working | Meilisearch indexed, instant results |
| Entity Extraction | ✅ Complete | 139,018 entities, 731,021 mentions |
| Image Extraction | ✅ Complete | 15,875 image folders |
| OCR Processing | ✅ Complete | All documents text-extracted |
| Court Documents | ✅ Added | Maxwell case 2024 (943 pages) |
| Documents View | ✅ Complete | Browse all documents with pagination |
| PDF Viewer | ✅ Working | Embedded PDF viewer with download |
| Entity Explorer | ✅ Working | Search and filter entities by type |
| Faces Detection | ⏳ Pending | Requires dlib installation |
| Graph View | ✅ Working | Neo4j optional, graceful fallback |
| Timeline View | ✅ Working | Browse documents by year/month |
| Bookmarks | ✅ Working | Save and manage document bookmarks |
| Settings | ✅ Working | Service status and statistics |
Data Summary
| Metric | Value |
|---|---|
| Total Documents | 15,875 |
| Total Size | 5.7 GB |
| Unique Entities | 139,018 |
| Entity Mentions | 731,021 |
| Search Results for "Clinton" | 253 |
| Search Results for "Maxwell" | 1,000+ |
Features
Full-Text Search: Search the actual content of 15,875+ PDFs, not just titles
# Download PDFs from DOJ
python -m app.cli download
# Run OCR on PDFs
python -m app.cli ocr
# Extract entities
python -m app.cli entities
# Detect faces
python -m app.cli faces
# Index for search
python -m app.cli index
# Or run all at once
python -m app.cli process-all
# Download PDFs from DOJ website
python -m app.cli download [--limit N]
# Run OCR on downloaded PDFs
python -m app.cli ocr [--limit N] [--reprocess]
# Extract named entities
python -m app.cli entities [--limit N] [--reprocess]
# Detect and cluster faces
python -m app.cli faces [--limit N] [--reprocess]
# Index documents for search
python -m app.cli index [--limit N] [--reindex]
# Run full pipeline
python -m app.cli process-all [--limit N]
# Check processing status
python -m app.cli status
# Start API server
python -m app.cli serve [--host HOST] [--port PORT]
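The individual stages can also be driven from a script. The following is a minimal sketch, assuming it runs from the directory where `python -m app.cli` works (presumably `backend/`). `build_stage_command` and `run_pipeline` are hypothetical helper names that are not part of the project, and the stage order simply mirrors the order in which the commands are listed above.

```python
import subprocess
import sys

# Stage names as listed in the CLI commands above, in the order they appear.
STAGES = ["download", "ocr", "entities", "faces", "index"]


def build_stage_command(stage, limit=None):
    """Return the argv list for one pipeline stage (hypothetical helper)."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage}")
    cmd = [sys.executable, "-m", "app.cli", stage]
    if limit is not None:
        # --limit N is documented for every stage listed in STAGES.
        cmd += ["--limit", str(limit)]
    return cmd


def run_pipeline(limit=None, skip=()):
    """Run each stage in turn; check=True stops at the first failing stage."""
    for stage in STAGES:
        if stage in skip:
            continue
        subprocess.run(build_stage_command(stage, limit), check=True)
```

For example, `run_pipeline(limit=100, skip=("faces",))` would process a small batch while skipping face detection, which the status section lists as pending until dlib is installed.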
API Endpoints
Search
GET /api/search?q={query} - Full-text search
GET /api/search/suggestions?q={prefix} - Autocomplete
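A minimal sketch of calling the search endpoint from Python using only the standard library. The base URL (`http://localhost:8000`) is an assumption about where `python -m app.cli serve` listens, and the response is assumed to be JSON. Its exact shape is not documented here, so the body is returned unmodified. `search_url` and `search` are hypothetical helper names.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: host and port of the API server; adjust to match --host/--port.
BASE_URL = "http://localhost:8000"


def search_url(query, base_url=BASE_URL):
    """Build the full-text search URL, percent-encoding the query string."""
    return f"{base_url}/api/search?{urlencode({'q': query})}"


def search(query, base_url=BASE_URL):
    """Call GET /api/search and return the decoded JSON body as-is."""
    with urlopen(search_url(query, base_url), timeout=10) as resp:
        return json.load(resp)
```

Building the URL with `urlencode` keeps multi-word or punctuated queries (for example a full name) safe to pass as the `q` parameter.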
Documents
GET /api/documents - List documents
GET /api/documents/{id} - Get document details
GET /api/documents/{id}/text - Get extracted text
GET /api/documents/{id}/pdf - Download PDF
Entities
GET /api/entities - List entities
GET /api/entities/{id} - Get entity details
GET /api/entities/{id}/documents - Get documents mentioning entity
Faces
GET /api/faces - List detected faces
GET /api/faces/clusters - List face clusters
GET /api/faces/{id}/similar - Find similar faces
POST /api/faces/search - Search by uploaded face image
Graph
GET /api/graph/document/{id}/connections - Document connections
GET /api/graph/entity/{id}/connections - Entity connections
Timeline
GET /api/timeline - Get timeline events
GET /api/timeline/by-year - Document counts by year
Annotations
GET /api/annotations - List annotations
POST /api/annotations - Create annotation
PUT /api/annotations/{id} - Update annotation
Export
GET /api/export/search?q={query}&format=csv - Export search results
GET /api/export/entities?format=csv - Export entities
GET /api/export/faces/clusters?format=json - Export face clusters
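The CSV exports can be loaded directly into Python for further analysis. This is a sketch under stated assumptions: the server is reachable at `http://localhost:8000`, the export is UTF-8 encoded, and its first row is a header. The column names are not documented here, so rows are keyed by whatever header the server returns. All function names are hypothetical.

```python
import csv
import io
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: host and port of the API server; adjust to match --host/--port.
BASE_URL = "http://localhost:8000"


def export_search_url(query, fmt="csv", base_url=BASE_URL):
    """Build the URL for GET /api/export/search with q and format parameters."""
    params = urlencode({"q": query, "format": fmt})
    return f"{base_url}/api/export/search?{params}"


def parse_csv_export(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def export_search(query, base_url=BASE_URL):
    """Fetch a CSV export of search results and return it as a list of rows."""
    url = export_search_url(query, base_url=base_url)
    with urlopen(url, timeout=30) as resp:
        # Assumption: the export is UTF-8 text.
        return parse_csv_export(resp.read().decode("utf-8"))
```

Keeping the parsing step separate from the HTTP call makes it possible to reuse `parse_csv_export` on a file that was downloaded earlier.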
Development
Backend
cd backend
pip install -e ".[dev]"
# Run tests
pytest
# Run linter
ruff check .
# Start dev server
uvicorn app.main:app --reload
Frontend
cd frontend
npm install
# Run dev server
npm run dev
# Build for production
npm run build
License
This project processes publicly released government documents. The code is provided as-is for research and journalistic purposes.
Acknowledgments
DOJ for releasing the Epstein files
Open source libraries: spaCy, face_recognition, Meilisearch, Neo4j
About
A searchable, structured archive of publicly released Epstein-related documents, built to enable fast queries, cross-referencing, and investigative analysis.