# Detoxify-Telugu: A Fine-Tuned BERT-Based Language Model for Hate Speech Detection in Telugu & Tenglish
Detoxify-Telugu is an end-to-end platform for detecting toxic content in Telugu, Tenglish (Telugu-English code-mixed), and English social media text. It supports binary (Toxic vs. Non-Toxic) and multi-class (11 toxicity types plus "none") classification using fine-tuned BERT-based models. The full pipeline, powered by a Streamlit UI, lets both technical and non-technical users scrape, annotate, train, evaluate, and predict toxicity with ease.
## Key Features

This platform empowers both technical and non-technical users with:

- Intuitive UI for data annotation, cleaning, model training, and evaluation
- Real-time and batch toxicity detection via text or CSV inputs
- Keyword-based auto-annotation for efficient labeling
- Data scraping from YouTube
- End-to-end NLP pipeline via Streamlit
- Fine-tuned BERT models for binary and multi-class toxicity detection
- Comprehensive metrics: Accuracy, Precision, Recall, F1-Score, Confusion Matrix
- Real-time single and batch CSV predictions for moderation
- Built-in data cleaning, balancing, and annotation
- Rule-based Tenglish-to-Telugu transliteration for preprocessing
- Selenium-based YouTube comment scraping (headless mode)
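The rule-based Tenglish-to-Telugu transliteration can be pictured as a greedy longest-match replacement of Latin clusters with Telugu characters. The rule table below is a toy subset invented for this illustration; the project's actual rules must also handle vowel signs (matras) versus independent vowels and conjunct consonants:

```python
# Minimal sketch of rule-based transliteration: greedily replace the
# longest matching Latin cluster at each position with Telugu script.
# The mapping is a toy subset invented for illustration only.
RULES = {
    "ch": "చ", "th": "త", "dh": "ద",
    "k": "క", "g": "గ", "n": "న", "m": "మ",
    "r": "ర", "l": "ల", "v": "వ", "s": "స",
    "aa": "ా", "ee": "ీ",
    "a": "", "i": "ి", "u": "ు", "e": "ె", "o": "ొ",  # "a" is the inherent vowel
}

def transliterate(word: str) -> str:
    """Greedily map the longest matching cluster at each position."""
    keys = sorted(RULES, key=len, reverse=True)  # try longer clusters first
    out, i = [], 0
    while i < len(word):
        for k in keys:
            if word.startswith(k, i):
                out.append(RULES[k])
                i += len(k)
                break
        else:
            out.append(word[i])  # unmapped characters pass through
            i += 1
    return "".join(out)

print(transliterate("raama"))  # → రామ (a rough rendering of "Rama")
```

Real Tenglish is far noisier than this (inconsistent spellings, dropped vowels), which is part of why the feature is described as a preprocessing aid rather than a full transliterator.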
## Project Structure

```
Detoxify_Telugu/
├── assets/
├── config/
├── data/
│   ├── processed/
│   │   └── toxic_data_cleaned.csv
│   ├── raw/
│   │   └── scraped/
│   │       └── toxicity_data.csv
│   ├── testing_data/
│   │   └── testing.csv
│   └── training/
│       ├── binary/
│       │   └── dataset_binary.csv
│       └── multi/
│           └── dataset_multiclass.csv
├── docs/
├── logs/
├── models/
│   ├── Binary/
│   │   ├── google_bert_uncased_L-4_H-256_A-4/
│   │   └── prajjwal1_bert-tiny/
│   └── Multi-Class/
│       ├── google_bert_uncased_L-4_H-256_A-4/
│       ├── prajjwal1_bert-mini/
│       └── prajjwal1_bert-tiny/
├── utils/
│   ├── __pycache__/
│   ├── data_annotation.py
│   ├── data_balancing.py
│   ├── data_cleaning.py
│   ├── data_collection.py
│   ├── data_summary.py
│   ├── model_evaluation.py
│   ├── model_training.py
│   ├── predict.py
│   ├── YT_Scraper.py
│   └── ...........................
├── venv/
├── .gitignore
├── app.py
├── DetoxifyTelugu.html
├── project.aurdino
├── README.md
└── requirements.txt
```
## Prerequisites

- Python 3.8+
- pip
- Google Chrome + ChromeDriver (or Firefox + GeckoDriver)
## Installation

```bash
# Step 1: Clone the repo
git clone https://github.com/YOUR_USERNAME/detoxify-telugu.git
cd detoxify-telugu

# Step 2: Set up a virtual environment
python -m venv venv
source venv/bin/activate  # (Windows: venv\Scripts\activate)

# Step 3: Install dependencies
pip install -r requirements.txt

# Step 4: Launch the app
streamlit run app.py
```

## Pipeline Modules

| Module | Description |
|---|---|
| Data Collection | Scrapes YouTube comments using YT_Scraper.py |
| Data Cleaning | Normalizes text, handles Tenglish, removes noise |
| Data Annotation | Supports rule-based and manual labeling |
| Data Balancing | Equalizes class distribution for training |
| Model Training | Fine-tunes BERT models with configurable settings |
| Model Evaluation | Computes Accuracy, Precision, Recall, F1, matrices |
| Toxicity Prediction | Real-time or CSV-based toxicity classification |
| Tenglish Generator | Converts Tenglish to Telugu script for consistency |
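The auto-annotation step can be sketched as a simple lexicon lookup: assign the first label whose trigger keyword appears in the text, otherwise fall back to "none". The label names match the project's categories, but the keyword lists here are hypothetical placeholders, not the real lexicon in `data_annotation.py`:

```python
# Sketch of keyword-based auto-annotation. The keyword lists are
# hypothetical placeholders; the project's real lexicon is curated.
KEYWORDS = {
    "political_toxicity": ["party", "leader", "vote"],
    "films_fan_war": ["fan war", "flop", "blockbuster"],
}

def auto_annotate(text: str) -> str:
    """Return the first label whose keyword occurs in the text, else 'none'."""
    lowered = text.lower()
    for label, words in KEYWORDS.items():
        # Naive substring matching; word-boundary checks would reduce
        # false hits (e.g. "fan" inside "fantastic").
        if any(w in lowered for w in words):
            return label
    return "none"

rows = ["That movie was a FLOP", "Good morning everyone"]
print([(t, auto_annotate(t)) for t in rows])
```

Rule-based labels like these are a starting point for efficient labeling; the UI's manual annotation pass is what corrects their inevitable false positives.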
## Supported Models

| Category | Model Name | Parameters |
|---|---|---|
| Tiny | prajjwal1/bert-tiny | ~4.4M |
| Mini | prajjwal1/bert-mini | ~11.3M |
| Mini | google/bert_uncased_L-4_H-256_A-4 | ~11.3M |
Models are selected based on resource availability. Training parameters (e.g., epochs, learning rate, batch size) are tunable via the Streamlit UI. Larger models like distilbert-base-uncased (~66M parameters) are planned for future enhancements.
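These parameter counts can be sanity-checked from the model configuration alone. The sketch below uses the standard BERT sizing formula (WordPiece vocabulary of 30,522, 512 positions, per-layer attention plus feed-forward weights, and the pooler head); it is an approximation, not an exact framework count:

```python
# Rough BERT parameter count from vocab size V, hidden size H, layers L.
def bert_params(V: int, H: int, L: int, max_pos: int = 512) -> int:
    # Word, position, and token-type embeddings plus embedding LayerNorm.
    embeddings = V * H + max_pos * H + 2 * H + 2 * H
    # Per layer: QKV + output projections (4H^2), FFN with 4H intermediate
    # size (8H^2), plus all biases and two LayerNorms (13H).
    per_layer = 12 * H * H + 13 * H
    pooler = H * H + H
    return embeddings + L * per_layer + pooler

# prajjwal1/bert-tiny (L=2, H=128)                      -> ~4.4M
# bert-mini / google/bert_uncased_L-4_H-256_A-4 (L=4, H=256) -> ~11.2M
print(bert_params(30522, 128, 2), bert_params(30522, 256, 4))
```

The quadratic `12*H*H` term is why hidden size dominates the budget on constrained hardware, which motivates the tiny/mini model choices above.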
## Results

The models were evaluated on a Telugu/Tenglish dataset (~8,000 records) for both binary and multi-class toxicity detection tasks. Detailed results are below.
### Binary Classification (Toxic vs. Non-Toxic)

- Model used: `google/bert_uncased_L-4_H-256_A-4`
- Accuracy: 85.62%
- Precision: Toxic 85.7%, Non-Toxic 85.65%
- Recall: Toxic 93.4%, Non-Toxic 85.62%
- F1-Score: Toxic 84.5%, Overall 85.62%

Insights: This model exhibits strong and reliable performance, making it well suited for live moderation of Telugu, Tenglish, and English text inputs. `prajjwal1/bert-tiny` performed closely with 84.72% accuracy and an even higher Non-Toxic precision of 93.8%, but slightly weaker Toxic-class performance.
### Multi-Class Classification (11 toxicity types + "none")

- Best model: `prajjwal1/bert-mini` with 47.64% accuracy
- Precision (macro): 0.571
- Recall (macro): 0.476
- F1-Score (macro): 0.491

| Model | Accuracy | F1 Score (macro) |
|---|---|---|
| google/bert_uncased_L-4_H-256_A-4 | 45.30% | 0.404 |
| prajjwal1/bert-tiny | 34.18% | 0.237 |
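The macro-averaged scores reported here are unweighted means of per-class metrics, which is why heavy class imbalance drags them down. A pure-Python sketch of macro-F1 (scikit-learn's `classification_report` computes the same quantities):

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (macro averaging)."""
    labels = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

# One badly-predicted rare class pulls the macro average down hard,
# even when the dominant class is predicted well.
print(macro_f1(["none", "none", "none", "threatening"],
               ["none", "none", "none", "none"]))
```

This behavior is exactly what the challenge notes below describe: classes like `threatening`, where a small model scores zero precision, depress the macro scores for the whole task.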
### Key Challenges

- Class imbalance:
  - Only one class has more than 1,000 samples; another has 700+; the remaining classes range between 400 and 600.
  - This skews predictions toward "none" or the dominant labels.
- Semantic overlap:
  - Frequent confusion between `mixed_toxicity`, `gender_targeted`, `films_fan_war`, and `political_toxicity`.
- Linguistic diversity:
  - The dataset mixes Telugu (native script), Tenglish (Telugu in Latin script), and standard English.
  - This adds linguistic noise that is especially hard for smaller models; for example, `bert-tiny` scored zero precision for the `threatening` class.
## Future Improvements

- Expand the annotated dataset (aim for 20K+ diverse examples)
- Address class imbalance via:
  - Oversampling
  - Synthetic augmentation
- Multilingual and cross-lingual fine-tuning (e.g., `indicBERT`, `distilbert-base-uncased`)
- Add language detection and script normalization preprocessing
- Evaluate on real-world moderation scenarios
## Pipeline Flow

[Scrape] → [Annotate] → [Clean] → [Balance] → [Train] → [Evaluate] → [Predict]

Each step is managed via the Streamlit UI, with modular scripts located in the `utils/` directory.
## Documentation

Detailed documentation is available in the `/docs` folder, covering:

- Research literature: insights into BERT models and their application to toxicity detection in multilingual contexts.
- Evaluation reports: confusion matrices, performance metrics, and key training observations.
- Tenglish transliteration: notes on the challenges and methods used to normalize Telugu-English mixed inputs.
- Open `index.html` for a step-by-step walkthrough of the system; it contains annotated screenshots explaining each module and how to use it.
- A YouTube video walkthrough (linked inside the `/docs` folder) demonstrates the full Streamlit UI and pipeline flow.
## Dataset Availability

Important: this repository includes only a sanitized, minimized dataset for demonstration purposes. Because the original dataset contains explicit and potentially offensive content, it has not been publicly released. If you require access to the full dataset for academic or research purposes, please contact the author directly to discuss terms of use.
## Author

- Pavan Yellathakota, Clarkson University
- Email: pavanyellathakota@gmail.com
- Website: https://pye.pages.dev
## Acknowledgements

- Inspired by Detoxify
- Built with HuggingFace Transformers
- UI powered by Streamlit
- Scraping module powered by Selenium