A Retrieval-Augmented Generation (RAG) agent that answers questions using your own data, not the LLM's training data. Answers are grounded in retrieved passages, so they can be verified against the source.
Built by Nawang Dorjay for GSSoC 2026.
Most AI chatbots rely only on the LLM's training data, so they can hallucinate, give outdated information, or lack knowledge of your domain.
RAG fixes this:
- Chunk your documents into passages
- Embed each chunk into a vector (numerical representation of meaning)
- Store vectors in a searchable index (FAISS)
- Retrieve the most relevant chunks when a user asks a question
- Generate an answer grounded in retrieved data
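
The first four steps above can be sketched in plain Python. This is a toy illustration under stated assumptions, not the project's code: `embed` is a hypothetical bag-of-words stand-in for a sentence-transformers model, a plain list stands in for the FAISS index, and the rice and cotton passages are placeholders.

```python
import math
import re
from collections import Counter


def chunk(text):
    """Step 1: split a document into passages (here: one per paragraph)."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def embed(text):
    """Step 2 (toy): word counts stand in for a learned embedding vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def build_index(chunks):
    """Step 3 (toy): keep (vector, chunk) pairs in a list instead of FAISS."""
    return [(embed(c), c) for c in chunks]


def retrieve(index, question, k=2):
    """Step 4: return the k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[0]), reverse=True)
    return [text for _, text in ranked[:k]]


document = "\n\n".join([
    "Wheat: Sown Oct-Nov. Each week delay after Nov 15 loses 50kg/ha.",
    "Placeholder passage about rice.",    # illustrative filler, not real data
    "Placeholder passage about cotton.",  # illustrative filler, not real data
])

index = build_index(chunk(document))
top = retrieve(index, "When to sow wheat?", k=1)
print(top[0])
```

Note that the toy embedding only matches on the exact word "wheat"; it cannot tell that "sow" and "sown" are related. A learned embedding model captures that kind of similarity, which is why the real pipeline uses one.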
User: "When to sow wheat?"
↓
Vector search → finds: "Wheat: Sown Oct-Nov. Each week delay after Nov 15 loses 50kg/ha."
↓
LLM + context → "Wheat should be sown in October-November. Delaying after November 15 reduces yield by ~50 kg per hectare per week."
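
The "LLM + context" step in the example above amounts to placing the retrieved passages in the prompt ahead of the question. A minimal sketch follows; `build_prompt` and its wording are hypothetical, and the actual prompt in `rag_agent.py` may differ.

```python
def build_prompt(question, chunks):
    """Combine retrieved chunks and the user question into one grounded prompt."""
    # Number each chunk so the model (and the reader) can point back to a source.
    context = "\n".join(f"[{i}] {c}" for i, c in enumerate(chunks, start=1))
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


prompt = build_prompt(
    "When to sow wheat?",
    ["Wheat: Sown Oct-Nov. Each week delay after Nov 15 loses 50kg/ha."],
)
print(prompt)
```

The resulting string would then be sent to the LLM as the user message. Telling the model to admit when the context lacks an answer is one common way to discourage it from falling back on its training data.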
| Component | Technology | Why |
|---|---|---|
| Embeddings | sentence-transformers (all-MiniLM-L6-v2) | Fast, good quality, 384-dim vectors |
| Vector DB | FAISS (Facebook AI Similarity Search) | Fast nearest-neighbor search, no server |
| LLM | Groq Llama 3.3 / OpenAI GPT-4o-mini | Grounded answer generation |
| UI | Streamlit | Interactive chat interface |
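
As a rough sketch of how the first two rows of the table fit together, the snippet below embeds texts with sentence-transformers and searches them with FAISS. The helper names `build_index` and `search` are hypothetical, and the choice of `IndexFlatIP` (exact inner-product search) is an assumption; the project's `knowledge_base.py` may use a different index type.

```python
EMBEDDING_MODEL = "all-MiniLM-L6-v2"  # model named in the table above
EMBEDDING_DIM = 384                   # vector size listed in the table above


def build_index(texts):
    """Embed texts and add them to an in-memory FAISS index (no server needed)."""
    # Imported lazily so this module loads before the heavy dependencies do.
    import faiss
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(EMBEDDING_MODEL)
    # With normalized vectors, inner product equals cosine similarity.
    vectors = model.encode(texts, normalize_embeddings=True, convert_to_numpy=True)
    index = faiss.IndexFlatIP(EMBEDDING_DIM)  # assumption: exact search is enough
    index.add(vectors.astype("float32"))
    return model, index


def search(model, index, texts, question, k=3):
    """Return up to k (text, score) pairs most similar to the question."""
    query = model.encode([question], normalize_embeddings=True, convert_to_numpy=True)
    scores, ids = index.search(query.astype("float32"), k)
    # FAISS pads with id -1 when the index holds fewer than k vectors.
    return [(texts[i], float(s)) for i, s in zip(ids[0], scores[0]) if i != -1]
```

Usage would look like `model, index = build_index(texts)` followed by `search(model, index, texts, "When to sow wheat?")`. The first call downloads the model if it is not already cached.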
git clone https://github.com/nawangdorjay/rag-knowledge-agent.git
cd rag-knowledge-agent
pip install -r requirements.txt
cp .env.example .env
# Add GROQ_API_KEY=gsk_xxxxx
streamlit run app.py

rag-knowledge-agent/
├── app.py                      # Streamlit chat UI
├── agent/
│   ├── __init__.py
│   ├── knowledge_base.py       # Vector store, chunking, search
│   └── rag_agent.py            # RAG pipeline (retrieve → generate)
├── data/
│   ├── crops_knowledge.json    # Crop knowledge (rice, wheat, cotton, etc.)
│   └── schemes_knowledge.json  # Government schemes (PM-KISAN, Ayushman, etc.)
├── tests/
│   └── test_rag.py             # 6 tests
├── requirements.txt
└── .github/workflows/ci.yml
This project is designed as a learning base. Here's what to explore:
- Read knowledge_base.py to understand chunking, embedding, indexing
- Run test_rag.py to see how data flows through the pipeline
- Try different questions in the UI
- TODO: Add overlapping chunks (sliding window) for better context
- TODO: Try different embedding models (multilingual for Hindi)
- TODO: Add MMR (Maximal Marginal Relevance) for diverse results
- TODO: Implement a re-ranker (cross-encoder) for better ranking
- TODO: Parse PDF files (use PyPDF2 or pdfplumber)
- TODO: Scrape web pages (requests + BeautifulSoup)
- TODO: Add markdown/text file support
- TODO: Build from your existing farmer/health agent data
- TODO: Add source citation in answers
- TODO: Implement conversation memory
- TODO: Add evaluation metrics (retrieval precision, answer faithfulness)
- TODO: Cache embeddings for faster startup
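
Two of the retrieval TODOs above, overlapping chunks and MMR, can be prototyped without any dependencies. The sketch below is one possible starting point, not the project's implementation; the function names, default sizes, and the `lam` trade-off parameter are all illustrative assumptions.

```python
import math


def sliding_window_chunks(words, size=50, overlap=10):
    """Split a word list into chunks of `size` words sharing `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + size])
        if start + size >= len(words):  # the last window reached the end
            break
    return chunks


def cosine(a, b):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(query_vec, doc_vecs, k=3, lam=0.5):
    """Pick k document indices, trading relevance against redundancy.

    lam=1.0 ranks purely by relevance; lower values favor diversity.
    """
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Similarity to the closest document that was already picked.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lam * relevance - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


windows = sliding_window_chunks(list(range(10)), size=4, overlap=2)
docs = [[1, 0], [1, 0], [0, 1]]  # the first two vectors are duplicates
diverse = mmr([1, 0.5], docs, k=2, lam=0.5)
relevant_only = mmr([1, 0.5], docs, k=2, lam=1.0)
```

In the small demo, MMR with `lam=0.5` skips the duplicate vector in favor of the dissimilar one, while `lam=1.0` returns both duplicates. In the real pipeline the vectors would come from the embedding model rather than being written by hand.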
python tests/test_rag.py

6 tests covering: JSON chunking, knowledge base operations, data loading, and edge cases.
MIT
Nawang Dorjay, B.Tech CSE (Data Science), MAIT Delhi | GitHub
This project was built with AI assistance. See BUILDING.md for full transparency.
