MrAshwin2142/indic-rag

Code-Mix and Transliteration Robust Retrieval for Indian Language RAG Systems

Status: 🚧 Work in Progress — Research project in active development.

A Retrieval-Augmented Generation (RAG) pipeline designed to handle code-mixed queries, transliterated inputs, and low-resource Indian languages — enabling faithful multilingual response generation across diverse real-world user inputs.


Problem

Existing RAG systems perform well primarily in English but often fail when users query in Indian languages, mix multiple languages in one query, or write an Indian language phonetically in Roman script (transliteration).

Example 1 — Kannada query, English document:

ಈ ಬೆಳೆಗೆ ಯಾವ ರಸಗೊಬ್ಬರ ಉಪಯೋಗಿಸಬೇಕು? (Which fertilizer should I use for this crop?)

The answer exists in an English government document — but current systems fail to retrieve it and respond faithfully in Kannada.

Example 2 — Code-mixed query:

"Meri crop ke liye konsa fertilizer use karoon?" (Hindi-English mix: Which fertilizer should I use for my crop?)

Current multilingual RAG systems struggle with:

  • Cross-lingual retrieval: the query and the document are in different languages or scripts
  • Code-mixed query understanding: Hindi and English mixed in the same sentence
  • Transliteration handling: Indic-language text (e.g. Hindi, normally written in Devanagari) typed phonetically in Roman script
  • Faithful multilingual response generation: answering in the user's language without departing from the retrieved source
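The script-mismatch part of these challenges can be made concrete with a small check on which Unicode scripts a query contains. The following is only a minimal sketch using Python's standard library; the helper names (`script_of`, `script_profile`, `is_mixed_script`) are hypothetical and not part of this repository.

```python
import unicodedata
from collections import Counter
from typing import Optional


def script_of(ch: str) -> Optional[str]:
    """Coarse script label for one character, or None for non-letters."""
    if not ch.isalpha():
        # Skips spaces, digits, punctuation, and combining vowel signs.
        return None
    # Unicode character names begin with the script,
    # e.g. "KANNADA LETTER II" or "LATIN SMALL LETTER A".
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else None


def script_profile(text: str) -> Counter:
    """Count letters per script in a query."""
    return Counter(label for label in map(script_of, text) if label)


def is_mixed_script(text: str) -> bool:
    """True when letters from more than one script appear in the text."""
    return len(script_profile(text)) > 1
```

Note that an all-Latin query such as "Meri crop ke liye konsa fertilizer use karoon?" looks identical to English at the script level, so a check like this can flag mixed-script input but cannot by itself detect romanized Hindi. That case needs a language-level signal.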

Proposed Approach

User Query (any language / script / mix)
          │
          ▼
  Query Normalization Layer
  (transliteration detection + script normalization)
          │
          ▼
  Code-Mix Aware Encoder
  (multilingual embeddings + Indic fine-tuning)
          │
          ▼
  Cross-Lingual Vector Retrieval
  (multilingual vector DB)
          │
          ▼
  Reranker
  (language-aware relevance scoring)
          │
          ▼
  Multilingual Generator (LLM)
          │
          ▼
  Response in User's Language
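
One way to read the diagram above is as a chain of stages that each transform a shared query state. The sketch below is illustrative only: every name in it (`QueryState`, `run_pipeline`, `normalize`, `make_retriever`, `generate`) is hypothetical, and the stage bodies are deliberately naive placeholders (lowercasing, token overlap, a templated string) standing in for the planned normalization, embedding retrieval, and LLM generation components.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class QueryState:
    raw_query: str
    normalized_query: str = ""
    candidates: List[str] = field(default_factory=list)
    answer: str = ""


Stage = Callable[[QueryState], QueryState]


def run_pipeline(state: QueryState, stages: List[Stage]) -> QueryState:
    """Apply each stage in order, passing the state along."""
    for stage in stages:
        state = stage(state)
    return state


def normalize(state: QueryState) -> QueryState:
    # Placeholder: lowercases and collapses whitespace only.
    state.normalized_query = " ".join(state.raw_query.lower().split())
    return state


def make_retriever(corpus: List[str], top_k: int = 2) -> Stage:
    def retrieve(state: QueryState) -> QueryState:
        # Placeholder: ranks by shared tokens instead of embeddings.
        query_tokens = set(state.normalized_query.split())
        ranked = sorted(
            corpus,
            key=lambda doc: len(query_tokens & set(doc.lower().split())),
            reverse=True,
        )
        state.candidates = ranked[:top_k]
        return state
    return retrieve


def generate(state: QueryState) -> QueryState:
    # Placeholder: a real generator would call an LLM here.
    if state.candidates:
        state.answer = f"(stub) answer based on: {state.candidates[0]}"
    else:
        state.answer = "(stub) no evidence found"
    return state


corpus = [
    "Apply urea fertilizer to the wheat crop in two doses.",
    "Monsoon rainfall forecast for coastal districts.",
]
result = run_pipeline(
    QueryState("Meri crop ke liye konsa fertilizer use karoon?"),
    [normalize, make_retriever(corpus), generate],
)
print(result.answer)
```

Keeping each stage behind the same signature means an encoder or reranker stage could be added to the list later without changing `run_pipeline`.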

Research Gap

| Limitation | This Work's Response |
|---|---|
| Weak transliteration robustness | Dedicated transliteration normalization layer |
| Poor code-mixed query understanding | Code-mix aware encoder with Indic fine-tuning |
| Benchmark-focused, not real-world | Optimized for noisy, real-world user queries |
| Cross-lingual retrieval failures | Cross-lingual embedding + reranking pipeline |

Planned Components

  • Query normalization: transliteration detection and script normalization
  • Code-mix aware multilingual encoder (IndicBERT / MuRIL / LaBSE fine-tuned)
  • Cross-lingual vector retrieval (FAISS / Qdrant)
  • Language-aware reranker
  • Multilingual response generator
  • Evaluation on IndicRAGSuite and custom code-mixed benchmarks
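
The retrieval component in the list above comes down to nearest-neighbour search over embedding vectors. A dependency-free sketch of cosine-similarity top-k search is shown here for orientation; the function names are hypothetical, the vectors are hand-made toy values rather than output from any model, and a real deployment would delegate this step to a vector index such as FAISS or Qdrant.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(
    query_vec: Sequence[float],
    doc_vecs: Sequence[Sequence[float]],
    k: int = 3,
) -> List[Tuple[int, float]]:
    """Return (document index, score) pairs for the k most similar documents."""
    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy 2-dimensional vectors, purely illustrative.
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hits = top_k([1.0, 0.0], doc_vecs, k=2)
```

For cross-lingual retrieval to work with a search like this, the encoder has to place a Kannada or code-mixed query close to the English passage that answers it, which is the job of the multilingual encoder listed above.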

Tech Stack (planned)


Repository Structure (evolving)

indic-rag/
├── README.md
├── docs/
│   ├── research-overview.md     # Detailed problem + approach
│   ├── related-work.md          # Literature review notes
│   └── learning-notes.md        # Study notes (RAG, embeddings, Indic NLP, etc.)
├── src/                         # Source code (coming soon)
├── notebooks/                   # Experiments and exploration (coming soon)
├── data/                        # Sample queries and documents (coming soon)
└── results/                     # Evaluation results (coming soon)

References


Progress Log

| Date | Milestone |
|---|---|
| TBD | Initial repo setup |

License

MIT License — see LICENSE for details.
