
NCERT RAG-based QA System

A Retrieval-Augmented Generation (RAG) system for intelligent question answering over NCERT textbooks (Classes 11 and 12). The project integrates LangChain, the Qdrant vector store, a Dockerized Ollama LLM backend, and a PDF data pipeline to deliver accurate, contextual answers to academic questions.


Objective

To build a localized and private QA system using open-source LLMs and RAG methodology, suitable for educational and contextual querying of NCERT content.


Project Structure

```
NCERT_RAG/
├── NCERT/                     # Subject-wise NCERT PDFs (Class 11 & 12)
│   ├── Class11_Chemistry/
│   ├── Class12_Physics/
│   └── ...
├── data_ingestion.py          # Main driver to ingest PDFs into the vector DB
├── data_loader.py             # Loads and parses PDFs
├── ncert_loader.py            # Handles recursive loading from subject folders
├── ingestion_to_qdrant.py     # Document embedding and ingestion into Qdrant
├── retriever.py               # Query interface; fetches and answers from the vector store
├── requirements.txt           # Python dependencies
└── README.md                  # Project documentation
```

🛠️ Tools & Libraries Used

| Tool/Library | Purpose |
| --- | --- |
| LangChain | Framework for building LLM-powered applications |
| Qdrant | Vector database for storing embedded document vectors |
| Ollama | Runs open-source LLMs like LLaMA 3 locally via Docker |
| Docker | Containerized deployment of LLMs (backend requirement for Ollama) |
| PyMuPDF (`fitz`) | Parses and extracts text content from PDF files |

Flowchart

```mermaid
graph TD
    A[Start] --> B[Start Docker + Ollama]
    B --> C[Run data_ingestion.py]
    C --> D[Parse PDFs using ncert_loader.py and data_loader.py]
    D --> E[Generate embeddings]
    E --> F[Store in Qdrant via ingestion_to_qdrant.py]
    F --> G[Run retriever.py]
    G --> H[Accept user query]
    H --> I[Retrieve top documents from Qdrant]
    I --> J["Answer using Ollama LLM (LangChain)"]
    J --> K[Display final answer]
```

Pre-Requisites

1. Install Docker & Start Ollama

Ollama must be installed and running inside Docker to serve the LLaMA 3 model. Note that `ollama run` has to be executed inside the container (via `docker exec`) unless the Ollama CLI is also installed on the host:

```shell
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama:latest
docker exec -it ollama ollama run llama3
```
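Before ingesting or querying, it can help to verify that the Dockerized Ollama server is actually reachable. The sketch below is illustrative and not part of the repository's scripts: it probes Ollama's default base URL and shows the JSON body shape used by Ollama's `/api/generate` endpoint.

```python
import json
import urllib.request

# Default port published by the docker run command above.
OLLAMA_URL = "http://localhost:11434"

def build_generate_payload(model: str, prompt: str) -> dict:
    """Build the JSON body for Ollama's /api/generate endpoint."""
    return {"model": model, "prompt": prompt, "stream": False}

def ollama_is_up(base_url: str = OLLAMA_URL, timeout: float = 2.0) -> bool:
    """Return True if the Ollama server responds at its base URL."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return resp.status == 200  # the root path replies "Ollama is running"
    except OSError:
        return False

if __name__ == "__main__":
    print(json.dumps(build_generate_payload("llama3", "Say hello.")))
    print("Ollama reachable:", ollama_is_up())
```

Running this before the ingestion step gives a quick yes/no on whether the container is up and the port mapping is correct.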

2. Install Python Libraries

Create a virtual environment and install dependencies:

```shell
conda create -n ncert_rag python=3.10 -y
conda activate ncert_rag
pip install -r requirements.txt
```

requirements.txt:

```
langchain>=0.1.0
qdrant-client>=1.6.0
openai>=1.0.0
PyMuPDF
ollama
```

Running the Application

Step 1: Start Ollama (must be running before everything else)

```shell
docker start <container_id>
docker exec -it <container_id> ollama run llama3
```

Step 2: Ingest Data

```shell
python data_ingestion.py
```

This step recursively loads all PDFs from the NCERT/ directory, generates embeddings, and stores them in Qdrant.
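The exact splitting logic lives in the ingestion scripts and is not reproduced here; as an illustration only, the sketch below shows the general idea of chunking extracted text into overlapping windows before embedding (LangChain's text splitters do something similar, with extra handling for sentence boundaries). `chunk_text` is a hypothetical helper, not a function from this repository.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping character windows for embedding.

    The overlap preserves context that would otherwise be cut off
    at chunk borders.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    step = chunk_size - overlap
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += step
    return chunks

# A 1200-character document yields windows [0:500], [450:950], [900:1200].
parts = chunk_text("a" * 1200)
print([len(p) for p in parts])  # [500, 500, 300]
```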

Step 3: Query the Data

```shell
python retriever.py
```

Enter your query when prompted and get accurate responses using the local LLM.
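The prompt template used by retriever.py is not reproduced here, so the following is only a sketch of the final step of any RAG query: stitching the top retrieved passages and the user's question into a single grounded prompt for the LLM. `build_rag_prompt` is a hypothetical helper for illustration.

```python
def build_rag_prompt(question: str, contexts: list[str]) -> str:
    """Combine retrieved passages and a question into one grounded prompt."""
    context_block = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(contexts))
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_rag_prompt(
    "Was India isolated from the world 2000 years ago?",
    ["Trade routes linked India to Central Asia.",
     "Globalization affected independent India."],
)
print(prompt)
```

Numbering the passages (`[1]`, `[2]`, …) makes it easy to ask the model to cite which retrieved chunk supported its answer.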


Clean Output Tips

To avoid raw metadata and show only the final answer: In retriever.py, replace:

print("\n######### Answer #########:\n", answer)

with:

print("\n######### Answer #########:\n", answer['output_text'])

Sample Query

Question: Was India isolated from the world 2000 years ago?
Answer: According to the context, it is mentioned that "Globalization and Social Change" affected independent India. This implies that India was not isolated from the world, as global connections existed in the past.

📌 Notes

- Ollama must be running in the background before you run the ingestion or query steps.
- All documents are loaded from the subject-wise NCERT/ folder; make sure the PDFs are properly placed.
- Only Class 11 and 12 PDFs are supported by default. Modify ncert_loader.py to extend this.
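To extend support beyond Classes 11 and 12, the folder filter in ncert_loader.py would need widening. The repository's actual loader code isn't shown here, so the sketch below only illustrates one plausible shape: recursive PDF discovery filtered by the class prefix of each subject folder (`find_pdfs` and `is_supported_class` are hypothetical names).

```python
from pathlib import Path

def is_supported_class(folder_name: str,
                       allowed: tuple[str, ...] = ("Class11", "Class12")) -> bool:
    """True if a subject folder like 'Class11_Chemistry' matches an allowed class."""
    return folder_name.startswith(allowed)

def find_pdfs(root: str) -> list[Path]:
    """Recursively collect PDFs whose parent folder is a supported class."""
    return [p for p in sorted(Path(root).rglob("*.pdf"))
            if is_supported_class(p.parent.name)]
```

Under this layout, adding `"Class10"` to `allowed` would pick up Class 10 folders without touching the directory walk itself.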

License

This project is intended for learning and educational use; customize it for your own use case. Licensed under the MIT License. See the LICENSE file for details.


For any feedback or contributions, please open an issue or submit a pull request!
