# IIT Bhilai RAG Agent

A production-ready, pluggable Retrieval-Augmented Generation (RAG) system for IIT Bhilai information retrieval. The system ingests documents, answers questions with the Gemini LLM, and uses a two-layer cache to keep response latency low.
## Table of Contents

- Overview
- Architecture
- Features
- Tech Stack
- Installation
- Configuration
- Running the System
- API Documentation
- Performance Metrics
- Project Structure
- Testing
- Rate Limiting
- Troubleshooting
- License
## Overview

The IIT Bhilai RAG Agent is a question-answering system designed to provide accurate, context-aware responses about IIT Bhilai's courses, programs, and campus information. It uses semantic search to retrieve relevant information from indexed documents and generates natural language responses with Google's Gemini language model.
## Architecture

The system implements a hub-and-spoke architecture with the following components:
- Document Ingestion Pipeline: Automated processing of PDF documents with semantic chunking and embedding generation
- Two-Layer Cache System: Exact match (sub-millisecond) and semantic similarity (90% threshold) caching
- Vector Store: Chroma DB with 470 chunks indexed using Gemini embeddings (3072 dimensions)
- LLM Integration: Google Gemini 2.5 Flash for answer generation
- FastAPI Backend: RESTful API with CORS support for frontend integration
- Next.js Frontend: Modern chat interface with ChatGPT-like user experience
### Data Flow

1. Documents are ingested, chunked, and embedded into vector representations.
2. User queries pass through the two-layer cache system (see the sketch after this list).
3. On a cache miss, semantic search retrieves the relevant document chunks.
4. The LLM generates a contextual answer from the retrieved content.
5. The response is cached for future identical or similar queries.
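
The two-layer lookup is easiest to see in code. Below is a minimal, self-contained sketch assuming an `embed_fn` callable that maps a question to an embedding vector; the class and method names are illustrative, not the project's actual API.

```python
import math

SIMILARITY_THRESHOLD = 0.90  # semantic-layer cutoff from the configuration table


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class TwoLayerCache:
    def __init__(self, embed_fn):
        self.embed_fn = embed_fn  # callable: str -> list[float]
        self.exact = {}           # layer 1: question string -> answer
        self.semantic = []        # layer 2: (question embedding, answer) pairs

    def get(self, question):
        # Layer 1: exact string match -- a sub-millisecond dict lookup.
        if question in self.exact:
            return self.exact[question], "exact"
        # Layer 2: compare the query embedding against cached embeddings.
        query_emb = self.embed_fn(question)
        best_score, best_answer = 0.0, None
        for emb, answer in self.semantic:
            score = cosine(query_emb, emb)
            if score > best_score:
                best_score, best_answer = score, answer
        if best_score >= SIMILARITY_THRESHOLD:
            return best_answer, "semantic"
        return None, "miss"

    def put(self, question, answer):
        self.exact[question] = answer
        self.semantic.append((self.embed_fn(question), answer))
```

A semantic hit still pays for one embedding call, which is why semantically similar queries return in 1-2 seconds rather than microseconds (see Performance Metrics).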
## Features

- Automated document detection and ingestion with a chunking registry
- Semantic chunking with configurable parameters (500-character chunks, 50-character overlap)
- Two-layer caching (exact and semantic) for reduced latency
- Pluggable tool registry for multi-website and multi-source support
- Production-ready FastAPI server with comprehensive endpoints
- Next.js frontend with responsive ChatGPT-like interface
- Rate limit handling and batch processing for API optimization (see the ingestion sketch after this list)
- Persistent vector storage with Chroma DB
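
The last two items above combine in the ingestion path: chunks are written to the vector store in fixed-size batches, with a backoff on rate-limit errors. A hedged sketch, assuming a LangChain-style vector store (e.g. Chroma) whose `add_documents()` accepts a list of documents; the retry policy and function name are illustrative, not the project's actual code:

```python
import time

BATCH_SIZE = 20   # chunks per API call, per the configuration table
MAX_RETRIES = 5


def ingest_in_batches(vector_store, chunks):
    """Add chunks to the vector store in batches, backing off on failures."""
    for start in range(0, len(chunks), BATCH_SIZE):
        batch = chunks[start:start + BATCH_SIZE]
        for attempt in range(MAX_RETRIES):
            try:
                # LangChain vector stores such as Chroma expose add_documents().
                vector_store.add_documents(batch)
                break
            except Exception:
                if attempt == MAX_RETRIES - 1:
                    raise
                # Exponential backoff: wait 1s, 2s, 4s, 8s between retries.
                time.sleep(2 ** attempt)
```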
## Tech Stack

| Component | Technology |
|---|---|
| Backend Framework | FastAPI, Python 3.13 |
| LLM Provider | Google Gemini 2.5 Flash |
| Embeddings | Google Gemini Embedding 2 (3072 dimensions) |
| Vector Database | Chroma DB |
| Orchestration | LangChain |
| Frontend | Next.js, TypeScript, Tailwind CSS |
| Caching | In-memory exact cache, vector-based semantic cache |
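
To illustrate how the backend and frontend pieces connect, here is a minimal FastAPI sketch with the CORS setup implied by the stack above; `answer_question()` is a hypothetical stub standing in for the cached orchestrator, and the response shape mirrors the /chat examples later in this README.

```python
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

app = FastAPI(title="IIT Bhilai RAG Agent")

# Allow the Next.js dev server to call the API from the browser.
app.add_middleware(
    CORSMiddleware,
    allow_origins=["http://localhost:3000", "http://localhost:3001"],
    allow_methods=["*"],
    allow_headers=["*"],
)


def answer_question(question: str):
    """Stub for the cached RAG orchestrator; returns (answer, from_cache, layer)."""
    return f"(stubbed answer for: {question})", False, "miss"


@app.get("/health")
def health():
    return {"status": "healthy", "vector_store": True}


@app.get("/chat")
def chat(question: str):
    answer, from_cache, layer = answer_question(question)
    return {
        "question": question,
        "answer": answer,
        "from_cache": from_cache,
        "cache_layer": layer,
    }
```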
## Installation

### Prerequisites

- Python 3.13 or higher
- Node.js 18 or higher
- Google Gemini API key (get from Google AI Studio)
- 4GB RAM minimum (8GB recommended)
- 500MB free disk space for vector store
### Backend Setup

```bash
# Clone the repository
git clone <repository-url>
cd IITBhilai_RAG/backend
# Create and activate virtual environment
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Install dependencies
pip install --upgrade pip
pip install -r requirements.txt
# Configure environment
cp .env.example .env
# Edit .env with your GOOGLE_API_KEY
```

### Frontend Setup

```bash
cd ../frontend
npm install
```

## Configuration

Create a `.env` file in the backend directory with the following variables:
```env
# Required
GOOGLE_API_KEY=your_api_key_here
# LLM Configuration
LLM_PROVIDER=gemini
GEMINI_MODEL=gemini-2.5-flash
# Embedding Configuration
EMBEDDING_PROVIDER=gemini
GEMINI_EMBEDDING_MODEL=gemini-embedding-2
# Database Configuration
CHROMA_PERSIST_DIRECTORY=./data/chroma_langchain_db
# Server Configuration
API_HOST=0.0.0.0
API_PORT=8000
CORS_ORIGINS=http://localhost:3000,http://localhost:3001
```

### Chunking Parameters

The system uses optimized chunking parameters to balance context preservation against API rate limits (a chunking sketch follows the table):
| Parameter | Value | Description |
|---|---|---|
| Chunk Size | 500 characters | Length of each text segment |
| Chunk Overlap | 50 characters | Overlap between consecutive chunks |
| Batch Size | 20 chunks | Number of chunks per API call |
| Similarity Threshold | 0.90 | Minimum score for semantic cache hits |
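
As a concrete reference, the chunking step with these parameters might look like the following. The splitter class is an assumption (LangChain's `RecursiveCharacterTextSplitter`), since the project's actual splitter is not shown here, and `document.txt` is a placeholder path.

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,    # characters per chunk, per the table above
    chunk_overlap=50,  # characters shared between consecutive chunks
)

with open("document.txt") as f:
    chunks = splitter.split_text(f.read())

print(f"{len(chunks)} chunks produced")
```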
## Running the System

### Start the Backend

```bash
cd backend
python run.py
```

The API will be available at http://localhost:8000.
Expected output:

```
INFO: Uvicorn running on http://0.0.0.0:8000
INFO: Application startup complete.
Agent ready!
```
### Start the Frontend

```bash
cd frontend
npm run dev
```

The frontend will be available at http://localhost:3000.
### Health Check

```bash
curl http://localhost:8000/health
```

Response:

```json
{
  "status": "healthy",
  "vector_store": true
}
```

## API Documentation

### Endpoints

| Method | Endpoint | Description | Authentication |
|---|---|---|---|
| GET | /chat | Ask a question to the agent | None |
| GET | /health | Health check endpoint | None |
| GET | /stats | System statistics | None |
| GET | /cache/stats | Cache performance metrics | None |
| DELETE | /cache/{question} | Clear the cached response for a specific question | None |
| DELETE | /cache/all | Clear the entire cache | None |
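
For programmatic access, a short Python client sketch against these endpoints (assumes the `requests` package and a server running locally; the curl equivalents follow below):

```python
import requests

BASE = "http://localhost:8000"

# Ask a question; the response fields match the examples below.
data = requests.get(
    f"{BASE}/chat",
    params={"question": "What BTech programs are offered?"},
).json()
print(data["answer"])
print("cache layer:", data["cache_layer"])

# Inspect cache performance after a few queries.
print(requests.get(f"{BASE}/cache/stats").json())
```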
### Example: Ask a Question

Request:

```bash
curl "http://localhost:8000/chat?question=What%20BTech%20programs%20are%20offered"
```

Response (cache miss):

```json
{
"question": "What BTech programs are offered?",
"answer": "IIT Bhilai offers BTech programs in Computer Science and Engineering, Data Science and Artificial Intelligence, Electrical Engineering, Mechanical Engineering, and Mechatronics Engineering.",
"processing_time": 3.58,
"from_cache": false,
"cache_layer": "miss",
"sources": ["vector_store"]
}
```

Response (cache hit):

```json
{
"question": "What BTech programs are offered?",
"answer": "IIT Bhilai offers BTech programs in Computer Science and Engineering, Data Science and Artificial Intelligence, Electrical Engineering, Mechanical Engineering, and Mechatronics Engineering.",
"processing_time": 0.000017,
"from_cache": true,
"cache_layer": "exact",
"sources": ["vector_store"]
}
```

### Cache Management

Clear the cached response for a specific question:

```bash
curl -X DELETE "http://localhost:8000/cache/What%20BTech%20programs%20are%20offered"
```

Clear the entire cache:

```bash
curl -X DELETE "http://localhost:8000/cache/all"
```

View cache statistics:

```bash
curl "http://localhost:8000/cache/stats"| Operation | First Request | Cached Request |
|---|---|---|
| Exact Match Query | 3-5 seconds | < 0.001 seconds |
| Semantic Variation | 3-5 seconds | 1-2 seconds |
| Document Indexing (470 chunks) | 37 API calls | N/A |

### Cache Performance

| Metric | Value |
|---|---|
| Exact Cache Hit Rate | 20-40% |
| Semantic Cache Hit Rate | 30-50% |
| Memory Usage | < 100MB |
| Storage (Vector DB) | 9.1 MB |
## Project Structure

```
IITBhilai_RAG/
├── backend/
│ ├── src/
│ │ ├── api/ # FastAPI endpoints
│ │ │ └── app.py
│ │ ├── core/ # Core orchestration logic
│ │ │ ├── orchestrator_with_cache.py
│ │ │ ├── config_loader.py
│ │ │ └── llm_factory.py
│ │ ├── caching/ # Two-layer cache system
│ │ │ └── enhanced_cache.py
│ │ ├── tools/ # Pluggable tool registry
│ │ │ ├── tool_registry.py
│ │ │ └── base_tool.py
│ │ ├── ingestion/ # Document processing pipeline
│ │ │ ├── vector_store_wrapper.py
│ │ │ ├── document_ingestion.py
│ │ │ └── chunking_registry.py
│ │ └── config/ # YAML configuration files
│ │ ├── agent.yaml
│ │ ├── cache.yaml
│ │ └── websites.yaml
│ ├── data/ # Chroma DB storage
│ ├── scripts/ # Utility scripts
│ ├── tests/ # Test files
│ ├── run.py # Main entry point
│ └── requirements.txt
├── frontend/
│ ├── src/
│ │ └── app/
│ │ ├── page.tsx # Main chat interface
│ │ ├── layout.tsx
│ │ ├── globals.css
│ │ └── api/chat/ # API route
│ ├── components/
│ ├── package.json
│ └── tsconfig.json
└── README.md
```

## Troubleshooting

Enable debug logging for detailed diagnostics:

```python
import logging

logging.basicConfig(level=logging.DEBUG)
```

## License

This project is proprietary and confidential. Unauthorized copying, distribution, or use of this software is strictly prohibited.
## Support

For issues, questions, or contributions:
- Documentation: Refer to this README
- Issues: Create a ticket in the project management system
- Email: Contact the development team
