A web-based machine learning application that detects toxic and abusive comments using both classical NLP techniques (TF-IDF + Logistic Regression) and Transformer-based models (BERT).
This project demonstrates an end-to-end ML pipeline, from data preprocessing and model training to deployment with a web interface.
Online platforms often face challenges in moderating toxic content such as hate speech, insults, and threats. This system helps automatically identify such content by classifying user input into multiple toxicity categories.
The application supports multi-label classification, meaning a single comment can belong to more than one toxic category.
- 🔹 Multi-label toxicity detection
- 🔹 Two model options:
  - TF-IDF + Logistic Regression (fast & lightweight)
  - BERT (Transformer-based model)
- 🔹 Web-based UI for live prediction
- 🔹 Probability scores for each toxicity category
- 🔹 Clean and responsive frontend
- 🔹 FastAPI-based backend
- Toxic
- Severe Toxic
- Obscene
- Threat
- Insult
- Identity Hate
Each category is predicted independently with a confidence score.
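To make the multi-label setup concrete, here is a hedged sketch of how independent per-category classifiers over TF-IDF features could look with scikit-learn. The toy corpus, labels, and use of `OneVsRestClassifier` are illustrative assumptions, not the project's actual training code:

```python
# Illustrative sketch (not the project's exact code): one logistic-regression
# classifier per toxicity category over shared TF-IDF features.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

CATEGORIES = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Tiny toy corpus with multi-hot labels (one column per category).
texts = ["you are an idiot", "have a nice day", "i will hurt you", "thanks for the help"]
labels = np.array([
    [1, 0, 0, 0, 1, 0],   # toxic + insult
    [0, 0, 0, 0, 0, 0],   # clean
    [1, 0, 0, 1, 0, 0],   # toxic + threat
    [0, 0, 0, 0, 0, 0],   # clean
])

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

# OneVsRestClassifier fits an independent binary classifier per label,
# so a single comment can score high in several categories at once.
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(X, labels)

probs = clf.predict_proba(vectorizer.transform(["you idiot, i will hurt you"]))[0]
scores = dict(zip(CATEGORIES, probs))
```

Because each label has its own classifier, the resulting `scores` dict carries an independent confidence per category, matching the behaviour described above.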
```
Smart-Toxic-Comment-Detection-System/
│
├── frontend/
│   ├── index.html
│   └── style.css
│
├── utils/
│   └── preprocess.py
│
├── models/                 # Generated after training
│   ├── tfidf_vectorizer.pkl
│   └── tfidf_logreg.pkl
│
├── app.py                  # FastAPI application
├── train_model.ipynb       # Model training notebook
├── download_bert.py        # Script to download BERT model
├── train.csv               # Dataset (local, ignored by git)
├── requirements.txt
├── README.md
└── .gitignore
```
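The contents of `utils/preprocess.py` are not reproduced here; a hypothetical sketch of the kind of cleaning function it might contain (lowercasing, URL removal, punctuation stripping) could look like this:

```python
# Hypothetical sketch of what utils/preprocess.py might contain; the
# actual cleaning steps in the repository may differ.
import re

def clean_text(text: str) -> str:
    """Normalise a raw comment before vectorisation."""
    text = text.lower()                          # case-fold
    text = re.sub(r"https?://\S+", " ", text)    # drop URLs
    text = re.sub(r"[^a-z0-9\s]", " ", text)     # strip punctuation/symbols
    text = re.sub(r"\s+", " ", text).strip()     # collapse whitespace
    return text

# Example: clean_text("Check THIS out!!! http://spam.example") -> "check this out"
```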
- The system is based on the Toxic Comment Classification dataset (originally by Jigsaw / Kaggle).
- For demonstration and local testing, a small curated dataset is used.
- Large datasets are not pushed to GitHub to keep the repository clean and lightweight.
📌 In real-world deployment, large-scale datasets such as Kaggle’s Toxic Comment dataset are used for training.
```bash
git clone https://github.com/PavanRa-crypto/Smart-Toxic-Comment-Detection-System.git
cd Smart-Toxic-Comment-Detection-System
```

Create a virtual environment:

```bash
python -m venv venv
```

Activate:

- Windows:

```bash
venv\Scripts\activate
```

Install dependencies:

```bash
pip install -r requirements.txt
```

If installing manually:

```bash
pip install fastapi uvicorn pandas scikit-learn joblib torch transformers python-multipart notebook
```

Open the training notebook:

```bash
jupyter notebook
```

Run all cells in:
train_model.ipynb
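The contents of `download_bert.py` are not shown above; a plausible sketch using the Hugging Face `transformers` library follows. The checkpoint name and save directory are assumptions — the project may use a fine-tuned toxicity checkpoint rather than the base model:

```python
# Plausible sketch of download_bert.py (the actual script may differ).
# MODEL_NAME and SAVE_DIR are assumptions, not taken from the repository.
MODEL_NAME = "bert-base-uncased"
SAVE_DIR = "models/bert"

def download_bert(model_name: str = MODEL_NAME, save_dir: str = SAVE_DIR) -> None:
    """Fetch the tokenizer and model once, then cache them locally."""
    # Imported lazily so merely loading this module stays cheap.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    tokenizer.save_pretrained(save_dir)
    model.save_pretrained(save_dir)

# Calling download_bert() fetches several hundred MB; run it once before
# selecting the BERT option in the web UI.
</imports>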
This will generate:

```
models/
├── tfidf_vectorizer.pkl
└── tfidf_logreg.pkl
```
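The notebook's exact code is not reproduced here, but the artifact-saving step it performs is typically a `joblib.dump` of the fitted vectorizer and classifier, mirrored by `joblib.load` in `app.py`. A self-contained sketch on illustrative data (the real notebook trains on `train.csv` and writes into `models/`):

```python
# Illustrative sketch of the save/load round trip for the two pickles;
# toy data and a temp directory stand in for train.csv and models/.
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["you are awful", "great point, thanks"]
y = [1, 0]  # a single "toxic" flag for brevity; the project uses six labels

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)
model = LogisticRegression(max_iter=1000).fit(X, y)

out_dir = tempfile.mkdtemp()  # the project writes to models/ instead
joblib.dump(vectorizer, os.path.join(out_dir, "tfidf_vectorizer.pkl"))
joblib.dump(model, os.path.join(out_dir, "tfidf_logreg.pkl"))

# Reloading in app.py is the mirror image of the dump above.
loaded_vec = joblib.load(os.path.join(out_dir, "tfidf_vectorizer.pkl"))
loaded_model = joblib.load(os.path.join(out_dir, "tfidf_logreg.pkl"))
pred = loaded_model.predict(loaded_vec.transform(["you are awful"]))[0]
```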
Start the FastAPI server:

```bash
uvicorn app:app --reload
```

Open in browser:
- 🌐 Web UI: http://127.0.0.1:8000
- 📘 API Docs: http://127.0.0.1:8000/docs
1. User enters a comment
2. Model predicts toxicity probabilities
3. System displays:
   - Overall status (Safe / Toxic)
   - Category-wise confidence scores
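One way the overall Safe/Toxic status can be derived from the category-wise scores is a simple threshold rule; the 0.5 cutoff below is an assumption for illustration, not the project's documented value:

```python
# Derive an overall Safe/Toxic status from per-category probabilities.
# The 0.5 threshold is an illustrative assumption.
def overall_status(scores: dict[str, float], threshold: float = 0.5) -> str:
    """Flag the comment as Toxic if any category crosses the threshold."""
    return "Toxic" if any(p >= threshold for p in scores.values()) else "Safe"

# Example: overall_status({"toxic": 0.91, "insult": 0.40}) -> "Toxic"
```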
Screenshots included in this project demonstrate:
- Home page UI
- Toxic comment prediction
- Safe comment prediction
- Model selection (TF-IDF / BERT)
- Running server (Uvicorn)
(Screenshots can be found in the project report and documentation.)
This project is suitable for:
- MCA final year project
- Machine Learning / NLP coursework
- Demonstration of ML deployment concepts
It covers:
- Text preprocessing
- Feature extraction (TF-IDF)
- Supervised learning
- Transformer models
- Web deployment using FastAPI
Developed by: M. Pavan, Master of Computer Applications (MCA)
Project Title: Smart Toxic Comment Detection System
This project is developed for academic and educational purposes.