A smart, context-aware document query engine powered by LLMs, built for real-world use cases in insurance, legal, HR, and compliance.
This project builds an LLM-driven intelligent retrieval system that can:
- Parse large unstructured documents (PDFs, DOCX, emails)
- Understand natural language queries
- Retrieve and match relevant clauses or policies
- Provide explainable answers and output structured JSON responses
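
As a rough illustration of what a structured, explainable JSON response could look like, here is a minimal sketch using only the standard library. Only the `answer` field appears in this README's sample output; `rationale`, `clauses`, the `ClauseRef` type, and the clause values are hypothetical names and made-up sample data, not the project's actual schema (which lives in `app/schemas.py`).

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class ClauseRef:
    # Hypothetical fields describing where a supporting clause was found.
    document: str
    clause_id: str
    text: str


@dataclass
class QueryResponse:
    answer: str                      # the only field shown in the README sample
    rationale: str = ""              # assumed: short explanation of the answer
    clauses: List[ClauseRef] = field(default_factory=list)  # assumed: references

    def to_json(self) -> str:
        # asdict() recursively converts nested dataclasses into plain dicts.
        return json.dumps(asdict(self), indent=2, ensure_ascii=False)


# Illustrative values only; the clause number and file name are invented.
response = QueryResponse(
    answer="The agreement may be terminated with 30 days prior notice by either party.",
    rationale="Clause 12.1 states the notice period directly.",
    clauses=[ClauseRef("contract.pdf", "12.1", "Either party may terminate ...")],
)
print(response.to_json())
```

Keeping the rationale and clause references next to the answer lets a caller verify the answer against the source text.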
Designed to work in domains such as:
- 🛡 Insurance
- ⚖️ Legal
- 🧑‍💼 HR
- ✅ Compliance
- 📄 Multi-format document ingestion (PDF, DOCX, Email)
- 🔍 Semantic clause search via vector embeddings (Pinecone)
- 🤖 Query parsing + clause matching with LLM assistance
- 📊 Explainable rationale with clause references
- 🧾 JSON structured outputs
- 🚀 FastAPI backend for API-based interaction
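
The idea behind semantic clause search can be sketched as follows: embed the query and each clause as vectors, then rank clauses by cosine similarity. This toy version uses word counts in place of a real embedding model and a plain Python loop in place of a vector index, so it only illustrates the ranking step. The function names, the stop-word list, and the sample clauses are assumptions made for this example.

```python
import math
import re
from collections import Counter

# Small, hand-picked stop-word list (an assumption for this toy example).
STOP_WORDS = {"what", "is", "the", "for", "of", "to", "this", "with", "a", "by", "be", "in", "per"}


def embed(text: str) -> Counter:
    """Toy embedding: counts of lowercase words, stop words removed."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def search(query: str, clauses: list, top_k: int = 2) -> list:
    """Return the top_k (score, clause) pairs, best match first."""
    q = embed(query)
    scored = [(cosine(q, embed(c)), c) for c in clauses]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


clauses = [
    "Either party may terminate this agreement with 30 days prior notice.",
    "The employee is entitled to 20 days of paid leave per year.",
    "All disputes shall be resolved by arbitration.",
]
results = search("What is the notice period for termination?", clauses)
for score, clause in results:
    print(f"{score:.3f}  {clause}")
```

A real embedding model would also match "termination" to "terminate"; the word-count stand-in only matches exact tokens.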
- Documents: `.pdf`, `.docx`, and email files (`.eml`)
- Natural language queries like: “What’s the notice period for termination?”
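
A client request carrying these inputs might be built as in the sketch below. The endpoint path (`/query`), the payload field names (`documents`, `questions`), the local host and port, and the bearer-token header scheme are all assumptions for illustration; check `app/routes.py`, `app/schemas.py`, and `app/auth.py` for the actual contract.

```python
import json
import urllib.request

# Hypothetical endpoint: host, port, and path are assumptions, not taken from the project.
API_URL = "http://localhost:8000/query"


def build_query_request(document_url: str, questions: list, token: str) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for the query API."""
    payload = {"documents": document_url, "questions": questions}  # assumed field names
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",  # assumed auth scheme
        },
        method="POST",
    )


request = build_query_request(
    "https://example.com/contract.pdf",
    ["What’s the notice period for termination?"],
    token="test-token",
)
# Once the server is running, urllib.request.urlopen(request) would send it.
```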
| Component | Tech |
|---|---|
| Backend API | FastAPI |
| Embeddings | FAISS-CPU |
| File Parsers | pdfplumber, python-docx, email.parser |
| LLM Provider | Gemini 2.5 |
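
Of the parsers listed, `email.parser` ships with the Python standard library, so email ingestion can be sketched without extra dependencies. The `parse_email` helper and its returned keys are hypothetical and not taken from `core/parser.py`; the sample message is made up.

```python
from email import policy
from email.parser import Parser


def parse_email(raw: str) -> dict:
    """Extract the subject, sender, and plain-text body from a raw email string."""
    # policy.default yields modern EmailMessage objects with get_body()/get_content().
    msg = Parser(policy=policy.default).parsestr(raw)
    body_part = msg.get_body(preferencelist=("plain",))
    body = body_part.get_content().strip() if body_part is not None else ""
    return {
        "subject": str(msg["subject"]),
        "from": str(msg["from"]),
        "body": body,
    }


raw_message = (
    "From: hr@example.com\n"
    "To: employee@example.com\n"
    "Subject: Notice period\n"
    "\n"
    "The notice period for termination is 30 days.\n"
)
parsed = parse_email(raw_message)
print(parsed["subject"], "->", parsed["body"])
```

For `.eml` files read from disk in binary mode, `email.parser.BytesParser` with `parse()` is the equivalent entry point.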
git clone https://github.com/AnirbansarkarS/Document-extractor.git
cd Document-extractor
pip install -r requirements.txt
GEMINI_KEYS = <your gemini key>
ACCESS_TOKEN = <A TOKEN TO ACCESS YOUR API>
uvicorn main:app --host 0.0.0.0 --port <$PORT>

Document-extractor
├── app
│ ├── auth.py # Handles authentication and authorization logic
│ ├── routes.py # Registers routes and maps endpoints to controller functions
│ └── schemas.py # Defines Pydantic models for request and response validation
├── core
│ ├── embbeding.py # Generates and manages embeddings for semantic search
│ ├── llm_handeler.py # Handles interaction with the LLM for query answering
│ ├── logic_evaluator.py # Evaluates logical expressions or conditions in queries
│ └── parser.py # Parses documents and extracts structured data
├── utils
│ ├── chunker.py # Splits documents into chunks for processing and indexing
│ └── output_answers.py # Formats and post-processes answers from the LLM
├── requirements.txt # Lists Python dependencies for the project
├── useage.txt # Contains test cases for core functionalities
├── main.py # Entry point to start the FastAPI server
├── .gitignore # Specifies files and directories to ignore in Git
├── README.md # Project description, usage instructions, and documentation
└── LICENCE # Licensing information for the project
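
To make the role of `utils/chunker.py` more concrete, here is a minimal sketch of one common chunking strategy: fixed-size word windows with overlap, so that a clause split across a boundary still appears whole in at least one chunk. The function name, parameters, and default sizes are assumptions; the project's own chunker may work differently.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list:
    """Split text into chunks of up to chunk_size words, sharing overlap words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


sample_chunks = chunk_text("a b c d e f g h", chunk_size=4, overlap=1)
print(sample_chunks)  # ['a b c d', 'd e f g', 'g h']
```

Larger overlaps reduce the chance of cutting a clause in half at the cost of indexing more duplicated text.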
Query: “What is the termination clause in this contract?”
Response:
{
  "answer": "The agreement may be terminated with 30 days prior notice by either party."
}

This project is licensed under the MIT License.
⚠️ OPEN FOR CONTRIBUTION
