Code-Mix and Transliteration Robust Retrieval for Indian Language RAG Systems

Status: 🚧 Work in Progress — Research project in active development.

A Retrieval-Augmented Generation (RAG) pipeline designed to handle code-mixed queries, transliterated inputs, and low-resource Indian languages, with the goal of faithful multilingual response generation across diverse real-world user inputs.

Problem

Existing RAG systems perform well mainly in English but often fail when users query in Indian languages, mix multiple languages in one query, or write an Indian language phonetically in Roman script (transliteration).

Example 1 — Kannada query, English document:

ಈ ಬೆಳೆಗೆ ಯಾವ ರಸಗೊಬ್ಬರ ಉಪಯೋಗಿಸಬೇಕು? (Which fertilizer should I use for this crop?)

The answer exists in an English government document — but current systems fail to retrieve it and respond faithfully in Kannada.

Example 2 — Code-mixed query:

"Meri crop ke liye konsa fertilizer use karoon?" (Which fertilizer should I use for my crop?)

Current multilingual RAG systems struggle with:

Cross-lingual retrieval: the query and the document are in different languages or scripts
Code-mixed query understanding: Hindi and English mixed in the same sentence
Transliteration handling: Hindi (normally written in Devanagari) typed in Roman script
Faithful multilingual response generation: answering in the user's language
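
To make the script problem concrete, here is a minimal sketch of script detection using only the Python standard library. The function name `dominant_script` is hypothetical and not part of this repository; it takes the first word of each character's Unicode name as a rough script label.

```python
import unicodedata
from collections import Counter


def dominant_script(text: str) -> str:
    """Return the most frequent script label among letters and marks in text."""
    counts = Counter()
    for ch in text:
        # Only letters (L*) and combining marks (M*) carry script information.
        if unicodedata.category(ch)[0] not in ("L", "M"):
            continue
        # The first word of the Unicode name is the script, e.g. "KANNADA".
        counts[unicodedata.name(ch, "UNKNOWN").split()[0]] += 1
    return counts.most_common(1)[0][0] if counts else "NONE"


print(dominant_script("ಈ ಬೆಳೆಗೆ ಯಾವ ರಸಗೊಬ್ಬರ ಉಪಯೋಗಿಸಬೇಕು?"))
print(dominant_script("Meri crop ke liye konsa fertilizer use karoon?"))
```

The second query is reported as Latin script even though most of its words are Hindi. Script detection alone therefore cannot tell romanized Hindi from English, which is why transliteration and code-mixing need separate handling.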

Proposed Approach

User Query (any language / script / mix)
          │
          ▼
  Query Normalization Layer
  (transliteration detection + script normalization)
          │
          ▼
  Code-Mix Aware Encoder
  (multilingual embeddings + Indic fine-tuning)
          │
          ▼
  Cross-Lingual Vector Retrieval
  (multilingual vector DB)
          │
          ▼
  Reranker
  (language-aware relevance scoring)
          │
          ▼
  Multilingual Generator (LLM)
          │
          ▼
  Response in User's Language
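
The flow above can be sketched as a chain of stages that each update a shared state object. This is an illustrative skeleton only, not the project's implementation: every name here (`QueryState`, `run_pipeline`, the stage functions, and the two-sentence `CORPUS`) is hypothetical, and the encoder, vector retrieval, reranker, and LLM are replaced by simple keyword-overlap placeholders so the sketch runs without any model.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical two-document corpus standing in for the vector database.
CORPUS = [
    "Apply nitrogen fertilizer in split doses for paddy.",
    "Drip irrigation saves water for sugarcane.",
]


@dataclass
class QueryState:
    raw_query: str
    normalized_query: str = ""
    candidates: List[str] = field(default_factory=list)
    answer: str = ""


def _words(text: str) -> set:
    # Lowercase, split on whitespace, and trim basic punctuation.
    return {w.strip("?.,!") for w in text.lower().split()}


def normalize(state: QueryState) -> QueryState:
    # Placeholder for transliteration detection and script normalization.
    state.normalized_query = " ".join(state.raw_query.lower().split())
    return state


def retrieve(state: QueryState) -> QueryState:
    # Placeholder for encoder + vector search: keep documents sharing a word.
    terms = _words(state.normalized_query)
    state.candidates = [doc for doc in CORPUS if terms & _words(doc)]
    return state


def rerank(state: QueryState) -> QueryState:
    # Placeholder for the language-aware reranker: order by word overlap.
    terms = _words(state.normalized_query)
    state.candidates.sort(key=lambda doc: len(terms & _words(doc)), reverse=True)
    return state


def generate(state: QueryState) -> QueryState:
    # Placeholder for the multilingual LLM: echo the top document.
    state.answer = state.candidates[0] if state.candidates else "No relevant document found."
    return state


STAGES = [normalize, retrieve, rerank, generate]


def run_pipeline(query: str, stages=STAGES) -> QueryState:
    state = QueryState(raw_query=query)
    for stage in stages:
        state = stage(state)
    return state


print(run_pipeline("Meri crop ke liye konsa fertilizer use karoon?").answer)
print(run_pipeline("ಈ ಬೆಳೆಗೆ ಯಾವ ರಸಗೊಬ್ಬರ ಉಪಯೋಗಿಸಬೇಕು?").answer)
```

With these placeholders the code-mixed query finds the fertilizer document only because it happens to contain the English word "fertilizer", while the Kannada query finds nothing. Keeping each stage as a separate function means a placeholder can later be swapped for a real component without changing the others.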

Research Gap

Limitation of Existing Systems	This Work's Response
Weak transliteration robustness	Dedicated transliteration normalization layer
Poor code-mixed query understanding	Code-mix aware encoder with Indic fine-tuning
Benchmark-focused, not real-world	Optimized for noisy, real-world user queries
Cross-lingual retrieval failures	Cross-lingual embedding + reranking pipeline
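
As a rough illustration of what "code-mixed query understanding" has to deal with, the sketch below tags each token of a romanized query as Hindi or English and reports how mixed the query is. It is a toy: the `ROMAN_HINDI` word list is a hypothetical hand-written lexicon, the function names are invented for this example, and a real system would presumably use a trained token-level language identifier instead.

```python
import re

# Hypothetical, hand-written lexicon of romanized Hindi words (assumption).
ROMAN_HINDI = {"meri", "mera", "ke", "liye", "konsa", "kaunsa", "karoon", "karun", "hai", "kya"}


def tag_tokens(query: str):
    """Tag each alphabetic token as 'hi' (in the lexicon) or 'en' (everything else)."""
    tokens = re.findall(r"[a-zA-Z]+", query.lower())
    return [(tok, "hi" if tok in ROMAN_HINDI else "en") for tok in tokens]


def code_mix_ratio(query: str) -> float:
    """Fraction of tokens that belong to the minority language (0.0 = monolingual)."""
    tags = [lang for _, lang in tag_tokens(query)]
    if not tags:
        return 0.0
    minority = min(tags.count("hi"), tags.count("en"))
    return minority / len(tags)


query = "Meri crop ke liye konsa fertilizer use karoon?"
print(tag_tokens(query))
print(code_mix_ratio(query))
```

For the example query, five tokens are tagged as Hindi and three as English, giving a ratio of 3/8. A lexicon lookup like this breaks on spelling variants that are not listed, which is one reason a learned approach is likely needed for noisy real-world input.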

Planned Components

Query normalization: transliteration detection and script normalization
Code-mix aware multilingual encoder (IndicBERT / MuRIL / LaBSE fine-tuned)
Cross-lingual vector retrieval (FAISS / Qdrant)
Language-aware reranker
Multilingual response generator
Evaluation on IndicRAGSuite and custom code-mixed benchmarks
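
The evaluation metrics are not yet specified in this README. As one possible starting point, the sketch below computes two standard retrieval metrics, recall@k and mean reciprocal rank (MRR), over ranked document IDs. The function names and the toy query and document IDs are assumptions made for illustration, not results from this project.

```python
from typing import Dict, List, Set


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of relevant documents that appear in the top-k retrieved list."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def mean_reciprocal_rank(runs: Dict[str, List[str]], gold: Dict[str, Set[str]]) -> float:
    """Average of 1/rank of the first relevant document per query (0 if none found)."""
    if not runs:
        return 0.0
    total = 0.0
    for qid, retrieved in runs.items():
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in gold.get(qid, set()):
                total += 1.0 / rank
                break
    return total / len(runs)


# Toy data (made up): ranked results per query and the gold relevant documents.
runs = {"q1": ["d3", "d1", "d2"], "q2": ["d5", "d4"]}
gold = {"q1": {"d1"}, "q2": {"d5"}}

print(recall_at_k(runs["q1"], gold["q1"], k=1))
print(recall_at_k(runs["q1"], gold["q1"], k=2))
print(mean_reciprocal_rank(runs, gold))
```

Reporting these numbers separately for native-script, transliterated, and code-mixed versions of the same queries would show how much each input style costs in retrieval quality.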

Tech Stack (planned)

Python 3.10+
LangChain / LlamaIndex — RAG pipeline
IndicNLP / Aksharamukha — transliteration + script tools
MuRIL / LaBSE — multilingual embeddings
FAISS / Qdrant — vector database
Hugging Face Transformers — fine-tuning and inference
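
To show the role the embedding model and the vector store play together, here is a dependency-free sketch of nearest-neighbour retrieval by cosine similarity. It does not call MuRIL, LaBSE, FAISS, or Qdrant; the three-dimensional vectors and document IDs are made up, and the brute-force `top_k` function is a stand-in for what a vector index does at scale.

```python
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec: List[float], index: Dict[str, List[float]], k: int = 2) -> List[str]:
    """Return the IDs of the k documents most similar to the query vector."""
    ranked = sorted(index.items(), key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]


# Made-up embeddings: in practice these would come from a multilingual encoder.
index = {
    "fertilizer_guide_en": [0.9, 0.1, 0.0],
    "irrigation_guide_en": [0.1, 0.9, 0.0],
    "loan_scheme_en": [0.0, 0.2, 0.9],
}
query_vec = [0.8, 0.2, 0.1]  # pretend embedding of a Kannada fertilizer question

print(top_k(query_vec, index, k=2))
```

Cross-lingual retrieval depends on the encoder placing a Kannada or code-mixed question near the English document that answers it. The search step itself is language-agnostic.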

Repository Structure (evolving)

indic-rag/
├── README.md
├── docs/
│   ├── research-overview.md     # Detailed problem + approach
│   ├── related-work.md          # Literature review notes
│   └── learning-notes.md        # Study notes (RAG, embeddings, Indic NLP, etc.)
├── src/                         # Source code (coming soon)
├── notebooks/                   # Experiments and exploration (coming soon)
├── data/                        # Sample queries and documents (coming soon)
└── results/                     # Evaluation results (coming soon)

References

Progress Log

Date	Milestone
TBD	Initial repo setup

License

MIT License — see LICENSE for details.
