Detoxify-Telugu: A Fine-Tuned BERT-Based Language Model for Hate Speech Detection in Telugu & Tenglish

A Streamlit-Based NLP Platform for Regional Hate Speech Classification


🧭 Overview

Detoxify-Telugu is an end-to-end platform for detecting toxic content in Telugu, Tenglish (Telugu-English code-mixed), and English social media text. It supports binary (Toxic vs. Non-Toxic) and multi-class (11 toxicity types plus "none") classification using fine-tuned BERT-based models. The pipeline, powered by a Streamlit UI, covers scraping, annotation, training, evaluation, and prediction.

This platform empowers both technical and non-technical users with:

  • Intuitive UI for data annotation, cleaning, model training, and evaluation
  • Real-time and batch toxicity detection via text or CSV inputs
  • Keyword-based auto-annotation for efficient labeling
  • Data scraping from YouTube

✨ Features

  • 🚀 End-to-end NLP pipeline via Streamlit
  • ⚙️ Fine-tuned BERT models for binary and multi-class toxicity detection
  • 📊 Comprehensive metrics: Accuracy, Precision, Recall, F1-Score, Confusion Matrix
  • 🧪 Real-time single and batch CSV predictions for moderation
  • 🧼 Built-in data cleaning, balancing, and annotation
  • 💬 Rule-based Tenglish-to-Telugu transliteration for preprocessing
  • 📥 Selenium-based YouTube comment scraping (headless mode)
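The headless scraping step above can be sketched roughly as follows. This is a minimal illustration, not the actual contents of YT_Scraper.py: the function names, scroll count, and CSS selector are assumptions, and YouTube's markup changes over time.

```python
# Minimal sketch of headless YouTube comment scraping with Selenium.
# Illustrative only -- the project's real logic lives in utils/YT_Scraper.py.

def watch_url(video_id):
    """Build the watch-page URL for a YouTube video id."""
    return f"https://www.youtube.com/watch?v={video_id}"

def scrape_comments(video_id, max_scrolls=10):
    """Scroll the watch page and collect visible comment texts."""
    # Imported lazily so the URL helper stays usable without Selenium installed.
    import time
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.chrome.options import Options

    opts = Options()
    opts.add_argument("--headless=new")   # run Chrome without a window
    driver = webdriver.Chrome(options=opts)
    try:
        driver.get(watch_url(video_id))
        for _ in range(max_scrolls):      # comments lazy-load as the page scrolls
            driver.execute_script("window.scrollBy(0, window.innerHeight);")
            time.sleep(1.5)
        elems = driver.find_elements(
            By.CSS_SELECTOR, "ytd-comment-thread-renderer #content-text"
        )
        return [e.text for e in elems if e.text.strip()]
    finally:
        driver.quit()
```

The scraped texts would then be written to `data/raw/scraped/toxicity_data.csv` for annotation.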

๐Ÿ—‚๏ธ Project Structure

Detoxify_Telugu/
├── assets/
├── config/
├── data/
│   ├── processed/
│   │   └── toxic_data_cleaned.csv
│   ├── raw/
│   │   └── scraped/
│   │       └── toxicity_data.csv
│   ├── testing_data/
│   │   └── testing.csv
│   └── training/
│       ├── binary/
│       │   └── dataset_binary.csv
│       └── multi/
│           └── dataset_multiclass.csv
├── docs/
├── logs/
├── models/
│   ├── Binary/
│   │   ├── google_bert_uncased_L-4_H-256_A-4/
│   │   └── prajjwal1_bert-tiny/
│   └── Multi-Class/
│       ├── google_bert_uncased_L-4_H-256_A-4/
│       ├── prajjwal1_bert-mini/
│       └── prajjwal1_bert-tiny/
├── utils/
│   ├── __pycache__/
│   ├── data_annotation.py
│   ├── data_balancing.py
│   ├── data_cleaning.py
│   ├── data_collection.py
│   ├── data_summary.py
│   ├── model_evaluation.py
│   ├── model_training.py
│   ├── predict.py
│   ├── YT_Scraper.py
│   └── ...
├── venv/
├── .gitignore
├── app.py
├── DetoxifyTelugu.html
├── project.aurdino
├── README.md
└── requirements.txt


🛠 Installation

📦 Requirements

  • Python 3.8+
  • pip
  • Google Chrome + ChromeDriver, or Firefox + GeckoDriver

🔧 Setup Instructions

# Step 1: Clone the repo
git clone https://github.com/YOUR_USERNAME/detoxify-telugu.git
cd detoxify-telugu

# Step 2: Set up virtual environment
python -m venv venv
source venv/bin/activate        # (Windows: venv\Scripts\activate)

# Step 3: Install dependencies
pip install -r requirements.txt

# Step 4: Launch app
streamlit run app.py

🚀 Functional Modules

| Module | Description |
| --- | --- |
| Data Collection | Scrapes YouTube comments using YT_Scraper.py |
| Data Cleaning | Normalizes text, handles Tenglish, removes noise |
| Data Annotation | Supports rule-based and manual labeling |
| Data Balancing | Equalizes class distribution for training |
| Model Training | Fine-tunes BERT models with configurable settings |
| Model Evaluation | Computes Accuracy, Precision, Recall, F1, confusion matrices |
| Toxicity Prediction | Real-time or CSV-based toxicity classification |
| Tenglish Generator | Converts Tenglish to Telugu script for consistency |
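The rule-based Tenglish-to-Telugu conversion above can be pictured as longest-match-first string rewriting. The sketch below is a toy: the handful of mappings is purely illustrative, and the project's Tenglish Generator uses its own, much larger rule set with proper vowel-sign handling.

```python
# Toy longest-match-first Tenglish -> Telugu rewriting sketch.
# The mappings below are illustrative only, not the project's actual rules.

RULES = [            # multi-letter clusters must come before single letters
    ("ch", "చ్"), ("th", "త్"), ("aa", "ా"),
    ("k", "క్"), ("m", "మ్"), ("a", "అ"),
]

def transliterate(word):
    """Rewrite a Latin-script word using the first matching rule at each position."""
    out, i = [], 0
    while i < len(word):
        for latin, telugu in RULES:
            if word.startswith(latin, i):
                out.append(telugu)
                i += len(latin)
                break
        else:                      # no rule matched: keep the character as-is
            out.append(word[i])
            i += 1
    return "".join(out)
```

Listing multi-letter clusters first matters: otherwise "ch" would be consumed as "c" + "h" before the cluster rule ever fires.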

✅ Supported Models

| Category | Model Name | Parameters |
| --- | --- | --- |
| Tiny | prajjwal1/bert-tiny | ~4.3M |
| Small | prajjwal1/bert-mini | ~29M |
| Small | google/bert_uncased_L-4_H-256_A-4 | ~4.3M |

Models are selected based on resource availability. Training parameters (e.g., epochs, learning rate, batch size) are tunable via the Streamlit UI. Larger models like distilbert-base-uncased (~66M parameters) are planned for future enhancements.
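Resource-based model selection could look like the sketch below. The parameter counts mirror the table above; `pick_model` and `load_for_classification` are hypothetical helper names, not functions from this repo, and loading checkpoints requires the `transformers` package.

```python
# Sketch of resource-aware model selection; parameter counts follow the table above.

CANDIDATES = {
    "prajjwal1/bert-tiny": 4.3,                    # millions of parameters
    "google/bert_uncased_L-4_H-256_A-4": 4.3,
    "prajjwal1/bert-mini": 29.0,
}

def pick_model(max_params_m):
    """Return the largest candidate that fits a parameter budget (in millions)."""
    fitting = {name: p for name, p in CANDIDATES.items() if p <= max_params_m}
    if not fitting:
        raise ValueError("no candidate fits the budget")
    return max(fitting, key=fitting.get)

def load_for_classification(name, num_labels):
    """Load a candidate checkpoint for fine-tuning (requires `transformers`)."""
    from transformers import AutoTokenizer, AutoModelForSequenceClassification
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSequenceClassification.from_pretrained(
        name, num_labels=num_labels
    )
    return tok, model
```

For binary classification `num_labels` would be 2; for the multi-class task, 12 (11 toxicity types plus "none").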


📈 Model Performance Report

The models were evaluated on a Telugu/Tenglish dataset (~8,000 records) for both binary and multi-class toxicity detection tasks. Below are the detailed results:


🔹 Binary Classification

  • Model Used: google/bert_uncased_L-4_H-256_A-4
  • Accuracy: 85.62%
  • Precision:
    • Toxic: 85.7%
    • Non-Toxic: 85.65%
  • Recall:
    • Toxic: 93.4%
    • Non-Toxic: 85.62%
  • F1-Score:
    • Toxic: 84.5%
    • Overall: 85.62%

✅ Insights:
This model exhibits strong and reliable performance, making it well-suited for live moderation of Telugu, Tenglish, and English text inputs.

🆚 prajjwal1/bert-tiny performed comparably at 84.72% accuracy, with even higher Non-Toxic precision (93.8%), but slightly weaker performance on the Toxic class.
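At prediction time, a binary model's two raw logits are typically turned into a label via softmax. The sketch below shows that step in plain Python; the label order and the 0.5 threshold are assumptions for illustration, and may differ from what predict.py actually does.

```python
import math

LABELS = ["Non-Toxic", "Toxic"]   # assumed label order; the project's may differ

def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def classify(logits, threshold=0.5):
    """Map two raw logits to a label plus the Toxic-class probability."""
    probs = softmax(logits)
    toxic_p = probs[1]
    label = "Toxic" if toxic_p >= threshold else "Non-Toxic"
    return label, toxic_p
```

Raising the threshold trades recall for precision on the Toxic class, which is a common knob for tuning moderation strictness.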


🔹 Multi-Class Classification (11 Toxicity Types + "none")

  • Best Model: prajjwal1/bert-mini
  • Accuracy: 47.64%
  • Precision (macro): 0.571
  • Recall (macro): 0.476
  • F1-Score (macro): 0.491

📊 Other Model Performances:

| Model | Accuracy | F1 Score |
| --- | --- | --- |
| google/bert_uncased_L-4_H-256_A-4 | 45.30% | 0.404 |
| prajjwal1/bert-tiny | 34.18% | 0.237 |
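The macro-averaged scores above give every class equal weight, which is why they sit well below accuracy on this imbalanced dataset. A minimal sketch of macro F1 in plain Python (in practice `sklearn.metrics` would be used):

```python
def macro_f1(y_true, y_pred, labels):
    """Average per-class F1 with equal weight per class (macro averaging)."""
    f1s = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```

Because each class contributes equally, a single rare class with zero precision (as bert-tiny showed on threatening) drags the macro score down sharply.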

โš ๏ธ Challenges

  • Class Imbalance:

    • Only 1 class has more than 1,000 samples
    • Another class has 700+
    • Remaining classes range between 400โ€“600
    • Skews predictions toward "none" or dominant labels
  • Semantic Overlap:

    • Confusion seen between:
      • mixed_toxicity
      • gender_targeted
      • films_fan_war
      • political_toxicity
  • Linguistic Diversity:

    • Dataset includes:
      • Telugu (native script)
      • Tenglish (Telugu in Latin script)
      • Standard English
    • Adds linguistic noise, especially hard for smaller models like bert-tiny
    • Example: bert-tiny scored zero precision for the class threatening

🔮 Future Plans

  • Expand annotated dataset (aim for 20K+ diverse examples)
  • Address class imbalance via:
    • Oversampling
    • Synthetic augmentation
  • Multilingual & cross-lingual fine-tuning (e.g., indicBERT, distilbert-base-uncased)
  • Add language detection and script normalization pre-processing
  • Evaluate on real-world moderation scenarios
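The oversampling idea above can be sketched as random duplication of minority-class rows until every class matches the largest. This is a generic illustration, not the contents of data_balancing.py; the `label` key and fixed seed are assumptions.

```python
import random
from collections import defaultdict

def oversample(rows, label_key="label", seed=42):
    """Randomly duplicate minority-class rows until all classes match the largest."""
    rng = random.Random(seed)          # fixed seed for reproducible balancing
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label_key]].append(row)
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced
```

Plain duplication can encourage overfitting on rare classes, which is why synthetic augmentation is listed alongside it as a complementary option.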

🧠 How It Works (Visual Overview)

[Scrape] → [Annotate] → [Clean] → [Balance] → [Train] → [Evaluate] → [Predict]

Each step is managed via the Streamlit UI, with modular scripts located in the utils/ directory.


📘 Documentation

Detailed documentation is available in the /docs folder, covering:

  • 📚 Research Literature
    Insights into BERT models and their application to toxicity detection in multilingual contexts.

  • 📊 Evaluation Reports
    Includes confusion matrices, performance metrics, and key training observations.

  • 🌐 Tenglish Transliteration
    Notes on challenges and methods used to normalize Telugu-English mixed inputs.


📄 Visual Setup Guide

  • Open index.html for a step-by-step walkthrough of the system.
  • It contains annotated screenshots explaining each module and how to use them.
  • ▶️ A YouTube video walkthrough (linked inside the /docs folder) demonstrates the full Streamlit UI and pipeline flow.

โš ๏ธ Dataset Disclaimer

Important:
This repository only includes a sanitized and minimized dataset for demonstration purposes.

Due to the presence of explicit and potentially offensive content in the original dataset, it has not been publicly released.

๐Ÿ”’ If you require access to the full dataset for academic or research purposes, please contact the author directly to discuss terms of use.


📫 Author & Contact

Pavan Yellathakota
🎓 Clarkson University
📧 pavanyellathakota@gmail.com
🔗 https://pye.pages.dev


🧾 License & Credits

