Status: 🚧 Work in Progress — Research project in active development.
A Retrieval-Augmented Generation (RAG) pipeline designed to handle code-mixed queries, transliterated inputs, and low-resource Indian languages — enabling faithful multilingual response generation across diverse real-world user inputs.
Existing RAG systems work best in English and often fail when users query in Indian languages, mix several languages within one query, or write an Indian language phonetically in Roman script (transliteration).
Example 1 — Kannada query, English document:
ಈ ಬೆಳೆಗೆ ಯಾವ ರಸಗೊಬ್ಬರ ಉಪಯೋಗಿಸಬೇಕು? (Which fertilizer should I use for this crop?)
The answer exists in an English government document, but current systems often fail to retrieve it, or retrieve it and still fail to answer faithfully in Kannada.
Example 2 — Code-mixed, transliterated query (Hindi + English in Roman script):
"Meri crop ke liye konsa fertilizer use karoon?" (Which fertilizer should I use for my crop?)
Current multilingual RAG systems struggle with:
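- Cross-lingual retrieval: the query and the document are in different languages or scripts
- Code-mixed query understanding: two languages, such as Hindi and English, mixed within the same sentence
- Transliteration handling: Indic-language text typed phonetically in Roman script, such as Hindi that would normally be written in Devanagari
- Faithful multilingual response generation: answering in the user's language while staying grounded in the retrieved documents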
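As a rough illustration of why the two example queries need different handling, the sketch below counts letters per script using only the Python standard library. The function names `script_counts` and `dominant_script` are hypothetical and not part of this project's code. Note that script detection alone cannot tell English from Romanized Hindi, since both come out as Latin; that gap is what a dedicated transliteration detector would have to cover.

```python
import unicodedata
from collections import Counter


def script_counts(text: str) -> Counter:
    """Count letters per script, taking the first word of each Unicode character name."""
    counts: Counter = Counter()
    for ch in text:
        # Only letters (category L*) are counted; digits, spaces,
        # punctuation, and combining marks are skipped.
        if unicodedata.category(ch).startswith("L"):
            script = unicodedata.name(ch, "UNKNOWN").split()[0]
            counts[script] += 1
    return counts


def dominant_script(text: str) -> str:
    """Return the most frequent script name, or "UNKNOWN" if the text has no letters."""
    counts = script_counts(text)
    return counts.most_common(1)[0][0] if counts else "UNKNOWN"


kannada_query = "ಈ ಬೆಳೆಗೆ ಯಾವ ರಸಗೊಬ್ಬರ ಉಪಯೋಗಿಸಬೇಕು?"
code_mixed_query = "Meri crop ke liye konsa fertilizer use karoon?"

print(dominant_script(kannada_query))     # KANNADA
print(dominant_script(code_mixed_query))  # LATIN
```

A query whose `script_counts` contains more than one script is a candidate for script normalization, while an all-Latin query still needs a separate check for Romanized Indic text.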
```text
User Query (any language / script / mix)
        │
        ▼
Query Normalization Layer
(transliteration detection + script normalization)
        │
        ▼
Code-Mix Aware Encoder
(multilingual embeddings + Indic fine-tuning)
        │
        ▼
Cross-Lingual Vector Retrieval
(multilingual vector DB)
        │
        ▼
Reranker
(language-aware relevance scoring)
        │
        ▼
Multilingual Generator (LLM)
        │
        ▼
Response in User's Language
```
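The stages in the diagram can be wired together as plain callables. The following is a minimal sketch under stated assumptions, not the project's implementation (the source code is not yet published): `Passage`, `RagPipeline`, and every stage stub are hypothetical names, and the lambdas stand in for real models.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Passage:
    text: str
    score: float = 0.0


@dataclass
class RagPipeline:
    """Chains the five stages from the diagram; each stage is a plain callable."""
    normalize: Callable[[str], str]
    encode: Callable[[str], list[float]]
    retrieve: Callable[[list[float]], list[Passage]]
    rerank: Callable[[str, list[Passage]], list[Passage]]
    generate: Callable[[str, list[Passage]], str]

    def answer(self, query: str) -> str:
        normalized = self.normalize(query)
        vector = self.encode(normalized)
        candidates = self.retrieve(vector)
        ranked = self.rerank(normalized, candidates)
        # The generator gets the original query so it can reply in the
        # user's own language and script (an assumption of this sketch).
        return self.generate(query, ranked)


# Placeholder stages for illustration only; real stages would call models.
pipeline = RagPipeline(
    normalize=lambda q: q.strip().lower(),
    encode=lambda q: [float(len(q))],
    retrieve=lambda v: [Passage("passage about weather"),
                        Passage("passage about fertilizer")],
    rerank=lambda q, ps: sorted(ps, key=lambda p: "fertilizer" in p.text,
                                reverse=True),
    generate=lambda q, ps: f"{q.strip()} -> {ps[0].text}",
)

print(pipeline.answer("  Meri crop ke liye konsa fertilizer use karoon?  "))
```

Keeping each stage behind a simple callable makes it possible to swap one component, such as the encoder, without touching the rest of the pipeline.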
| Limitation | This Work's Response |
|---|---|
| Weak transliteration robustness | Dedicated transliteration normalization layer |
| Poor code-mixed query understanding | Code-mix aware encoder with Indic fine-tuning |
| Benchmark-focused, not real-world | Optimized for noisy, real-world user queries |
| Cross-lingual retrieval failures | Cross-lingual embedding + reranking pipeline |
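The idea behind cross-lingual embedding retrieval is that a query and a document in different languages land close together in a shared vector space. The sketch below shows the ranking step with cosine similarity in pure Python. The three-dimensional vectors and document IDs are hand-written stand-ins rather than real model outputs, and a real system would use a multilingual encoder plus a vector index instead of a dictionary.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], index: dict[str, list[float]], k: int = 2):
    """Return the k (doc_id, score) pairs most similar to the query vector."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical toy vectors standing in for multilingual embeddings.
index = {
    "en_fertilizer_guide": [0.9, 0.1, 0.0],
    "en_weather_bulletin": [0.1, 0.9, 0.1],
    "hi_fertilizer_faq": [0.8, 0.2, 0.1],
}
query_vec = [1.0, 0.0, 0.0]  # pretend embedding of the Kannada fertilizer query

for doc_id, score in top_k(query_vec, index):
    print(doc_id, round(score, 3))
```

With these toy vectors the two fertilizer documents rank above the weather bulletin regardless of their language, which is the behavior the cross-lingual retrieval stage aims for.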
- Query normalization: transliteration detection and script normalization
- Code-mix aware multilingual encoder (IndicBERT / MuRIL / LaBSE fine-tuned)
- Cross-lingual vector retrieval (FAISS / Qdrant)
- Language-aware reranker
- Multilingual response generator
- Evaluation on IndicRAGSuite and custom code-mixed benchmarks
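For the evaluation item, one plausible way to report retrieval quality is recall@k broken down by query type, so that code-mixed and transliterated queries are not hidden inside an overall average. This is an assumption: the metric and the query-type labels below are not specified by the project, and the runs are made-up placeholders, not results.

```python
from collections import defaultdict


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant documents that appear in the top-k retrieved list."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def recall_by_query_type(runs: list[dict], k: int) -> dict[str, float]:
    """Average recall@k separately for each query type."""
    buckets: dict[str, list[float]] = defaultdict(list)
    for run in runs:
        score = recall_at_k(run["retrieved"], run["relevant"], k)
        buckets[run["type"]].append(score)
    return {qtype: sum(vals) / len(vals) for qtype, vals in buckets.items()}


# Made-up runs for illustration only; these are not results from this project.
runs = [
    {"type": "native_script", "retrieved": ["d1", "d2"], "relevant": {"d1"}},
    {"type": "code_mixed", "retrieved": ["d3", "d4"], "relevant": {"d4", "d9"}},
    {"type": "code_mixed", "retrieved": ["d5", "d6"], "relevant": {"d7"}},
]

print(recall_by_query_type(runs, k=2))
```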
- Python 3.10+
- LangChain / LlamaIndex — RAG pipeline
- IndicNLP / Aksharamukha — transliteration + script tools
- MuRIL / LaBSE — multilingual embeddings
- FAISS / Qdrant — vector database
- Hugging Face Transformers — fine-tuning and inference
```text
indic-rag/
├── README.md
├── docs/
│   ├── research-overview.md    # Detailed problem + approach
│   ├── related-work.md         # Literature review notes
│   └── learning-notes.md       # Study notes (RAG, embeddings, Indic NLP, etc.)
├── src/                        # Source code (coming soon)
├── notebooks/                  # Experiments and exploration (coming soon)
├── data/                       # Sample queries and documents (coming soon)
└── results/                    # Evaluation results (coming soon)
```
- IndicRAGSuite: Large-Scale Datasets for Indian Language RAG
- XRAG: Cross-lingual Retrieval-Augmented Generation
- ACL 2025 Multilingual RAG
- EMNLP 2025 Findings
- arxiv.org/abs/2505.10089
- arxiv.org/abs/2504.03616
- arxiv.org/abs/2410.01171
| Date | Milestone |
|---|---|
| TBD | Initial repo setup |
MIT License — see LICENSE for details.