A professional-grade Retrieval-Augmented Generation (RAG) system for intelligent question-answering over Class 11 and 12 NCERT textbooks. The project combines LangChain, a Qdrant vector store, a Dockerized Ollama LLM backend, and a PDF ingestion pipeline to deliver accurate, contextual answers to academic questions.
To build a local, private QA system using open-source LLMs and the RAG methodology, suited to educational and contextual querying of NCERT content.
NCERT_RAG/
├── NCERT/ # Folder containing subject-wise NCERT PDFs (Class 11 & 12)
│ ├── Class11_Chemistry/
│ ├── Class12_Physics/
│ └── ...
├── data_ingestion.py # Main driver to ingest PDFs into vector DB
├── data_loader.py # Loads and parses PDFs
├── ncert_loader.py # Handles recursive loading from subject folders
├── ingestion_to_qdrant.py # Handles document embedding and ingestion to Qdrant
├── retriever.py # Query interface, fetches and answers from vector store
├── requirements.txt # Python dependencies
└── README.md # Project documentation
| Tool/Library | Purpose |
|---|---|
| LangChain | Framework for building LLM-powered applications |
| Qdrant | Vector database for storing embedded document vectors |
| Ollama | Runs open-source LLMs like LLaMA3 locally via Docker |
| Docker | Containerized deployment of LLMs (backend requirement for Ollama) |
| PyMuPDF / fitz | Parses and reads text content from PDF files |
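For context on the parsing step: PyMuPDF exposes each page's text through its fitz API. The snippet below is only an illustrative sketch of what data_loader.py does, not the exact implementation:

```python
import fitz  # PyMuPDF

def extract_pdf_text(pdf_path: str) -> str:
    """Return the plain text of every page in a PDF, joined by newlines."""
    with fitz.open(pdf_path) as doc:
        return "\n".join(page.get_text() for page in doc)
```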
graph TD
A[Start] --> B[Start Docker + Ollama]
B --> C[Run data_ingestion.py]
C --> D[Parse PDFs using ncert_loader.py and data_loader.py]
D --> E[Generate embeddings]
E --> F[Store in Qdrant via ingestion_to_qdrant.py]
F --> G[Run retriever.py]
G --> H[Accept user query]
H --> I[Retrieve top documents from Qdrant]
I --> J["Answer using Ollama LLM (LangChain)"]
J --> K[Display final answer]
Ollama must be installed and running inside Docker to serve the LLaMA 3 model.
docker run -d -p 11434:11434 ollama/ollama:latest
ollama run llama3

Create a virtual environment and install dependencies:
conda create -n ncert_rag python=3.10 -y
conda activate ncert_rag
pip install -r requirements.txt

requirements.txt:
langchain>=0.1.0
qdrant-client>=1.6.0
openai>=1.0.0
PyMuPDF
ollama
docker start <container_id>
ollama run llama3
python data_ingestion.py

This step recursively loads all PDFs from the NCERT/ directory, generates embeddings, and stores them in Qdrant.
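Conceptually, the embedding-and-storage step in ingestion_to_qdrant.py looks something like the sketch below. Treat it as an assumption-laden outline: the embedding model (OllamaEmbeddings with llama3), chunk sizes, Qdrant URL, and collection name ncert_rag are illustrative choices, not necessarily what the script uses.

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Qdrant

def ingest_documents(documents, collection_name="ncert_rag"):
    """Chunk the loaded NCERT documents, embed them, and push them to Qdrant."""
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    chunks = splitter.split_documents(documents)

    embeddings = OllamaEmbeddings(model="llama3")  # assumes Ollama is already serving llama3
    return Qdrant.from_documents(
        chunks,
        embeddings,
        url="http://localhost:6333",  # assumes a local Qdrant server on the default port
        collection_name=collection_name,
    )
```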
python retriever.py

Enter your query when prompted and get accurate responses using the local LLM.
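retriever.py follows the usual LangChain pattern: embed the query, pull the most similar chunks from Qdrant, and let the local Ollama model answer from that context. The sketch below makes the same illustrative assumptions as the ingestion sketch (embedding model, collection name ncert_rag, local Qdrant URL); it ends with the raw print that the tip in the next section replaces.

```python
from langchain.chains.question_answering import load_qa_chain
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.llms import Ollama
from langchain_community.vectorstores import Qdrant
from qdrant_client import QdrantClient

# Reconnect to the collection populated during ingestion.
embeddings = OllamaEmbeddings(model="llama3")
vector_store = Qdrant(
    client=QdrantClient(url="http://localhost:6333"),
    collection_name="ncert_rag",
    embeddings=embeddings,
)

query = input("Ask a question about the NCERT material: ")

# Retrieve the most relevant chunks and answer from them with the local LLM.
docs = vector_store.similarity_search(query, k=4)
chain = load_qa_chain(Ollama(model="llama3"), chain_type="stuff")
answer = chain.invoke({"input_documents": docs, "question": query})

print("\n######### Answer #########:\n", answer)
```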
To avoid raw metadata and show only the final answer:
In retriever.py, replace:
print("\n######### Answer #########:\n", answer)with:
print("\n######### Answer #########:\n", answer['output_text'])Question: Was India isolated from the world 2000 years ago?
Answer: According to the context, it is mentioned that "Globalization and Social Change" affected independent India. This implies that India was not isolated from the world, as global connections existed in the past.
- Ollama must be running in the background before you run ingestion or query steps.
- All documents are loaded from the subject-wise NCERT/ folder. Make sure the PDFs are properly placed.
- Only Class 11 and 12 PDFs are supported by default. Modify ncert_loader.py to extend this (see the sketch below).
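For reference, the recursive loading in ncert_loader.py can be pictured like this (an illustrative sketch, not the actual code; the classes tuple is the natural place to extend beyond Class 11 and 12):

```python
from pathlib import Path

def find_ncert_pdfs(root: str = "NCERT", classes: tuple = ("Class11", "Class12")) -> list:
    """Recursively collect PDF paths from subject folders such as NCERT/Class11_Chemistry/."""
    base = Path(root)
    return [
        pdf
        for folder in sorted(base.iterdir())
        if folder.is_dir() and folder.name.startswith(classes)
        for pdf in sorted(folder.rglob("*.pdf"))
    ]
```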
This project is for learning and educational use. Customize it for your use case! Licensed under the MIT License. See the LICENSE file for details.
For any feedback or contributions, please open an issue or submit a pull request!