An end-to-end machine learning pipeline that classifies SMS and email messages as spam or ham (not spam). It covers TF-IDF vectorization, a comparison of four models, evaluation metrics, and a FastAPI endpoint for serving predictions.
- ✅ Full NLP preprocessing (lowercasing, stemming, stopword removal, URL/phone normalization)
- ✅ TF-IDF vectorization with unigrams + bigrams
- ✅ 4 models compared: Naive Bayes, Logistic Regression, Random Forest, Linear SVM
- ✅ Evaluation: Accuracy, F1, ROC-AUC, Confusion Matrix, ROC Curves
- ✅ Feature importance visualization
- ✅ 5-fold cross-validation
- ✅ FastAPI REST endpoint for real-time prediction
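
The preprocessing steps listed above (lowercasing, URL/phone normalization, stopword removal, stemming) might look roughly like the sketch below. It is not the contents of `src/preprocess.py`: to stay dependency-free it uses a tiny illustrative stopword set and a crude suffix stripper in place of NLTK's stopword list and stemmer, and the placeholder tokens `urltoken` and `phonetoken` are assumptions.

```python
import re

# Illustrative subset only; the project relies on NLTK's stopword list.
STOPWORDS = {"a", "an", "the", "is", "are", "to", "and", "or",
             "you", "your", "of", "in", "on", "for"}

URL_RE = re.compile(r"(?:https?://|www\.)\S+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")
TOKEN_RE = re.compile(r"[a-z]+")


def crude_stem(token: str) -> str:
    """Strip a few common suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text: str) -> str:
    """Lowercase, normalize URLs and phone numbers, drop stopwords, stem."""
    text = text.lower()
    # Replace URLs first, since they may contain digit runs.
    text = URL_RE.sub(" urltoken ", text)
    text = PHONE_RE.sub(" phonetoken ", text)
    tokens = TOKEN_RE.findall(text)
    return " ".join(crude_stem(t) for t in tokens if t not in STOPWORDS)


print(preprocess("Call 0800 123 4567 NOW or visit http://win.example.com to claim your prizes!"))
# → call phonetoken now visit urltoken claim priz
```

Collapsing every URL and phone number into a single token lets the vectorizer learn "contains a link" as one feature instead of thousands of unique strings.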
spam-detector/
│
├── spam_detection.ipynb # Main Jupyter / Colab notebook
├── src/
│ ├── preprocess.py # Text cleaning & preprocessing
│ ├── train.py # Model training & saving
│ ├── predict.py # Prediction function
│ └── api.py # FastAPI REST API
│
├── models/ # Saved model files (auto-created on train)
│ ├── spam_model.pkl
│ └── tfidf_vectorizer.pkl
│
├── data/
│ └── README.md # Dataset download instructions
│
├── requirements.txt
├── .gitignore
└── README.md
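
As a rough illustration of how `train.py` and `predict.py` could share the two files under `models/`, the sketch below fits a vectorizer and a classifier, saves both with Joblib, then reloads them to score one message. The function names, the toy data, and the choice of Logistic Regression are assumptions for illustration; a temporary directory stands in for `models/` so the example does not touch the repository.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression


def train_and_save(texts, labels, model_dir):
    """Fit TF-IDF + classifier and write both artifacts to model_dir."""
    model_dir = Path(model_dir)
    model_dir.mkdir(parents=True, exist_ok=True)
    vectorizer = TfidfVectorizer(ngram_range=(1, 2))  # unigrams + bigrams
    features = vectorizer.fit_transform(texts)
    model = LogisticRegression(max_iter=1000).fit(features, labels)
    joblib.dump(model, model_dir / "spam_model.pkl")
    joblib.dump(vectorizer, model_dir / "tfidf_vectorizer.pkl")


def load_and_predict(text, model_dir):
    """Reload both artifacts and return a label with its probability."""
    model_dir = Path(model_dir)
    model = joblib.load(model_dir / "spam_model.pkl")
    vectorizer = joblib.load(model_dir / "tfidf_vectorizer.pkl")
    probabilities = model.predict_proba(vectorizer.transform([text]))[0]
    best = probabilities.argmax()
    return {"label": str(model.classes_[best]),
            "confidence": round(float(probabilities[best]), 4)}


# Toy data, invented for this example only.
toy_texts = ["win a free prize now", "free cash click here",
             "claim your free reward", "are we meeting today",
             "see you at lunch", "call me when you get home"]
toy_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

with tempfile.TemporaryDirectory() as tmp:
    train_and_save(toy_texts, toy_labels, tmp)
    result = load_and_predict("free prize, click now", tmp)
print(result)
```

The vectorizer has to be saved alongside the model because prediction must reuse the exact vocabulary and IDF weights learned during training.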
git clone https://github.com/MithraKL/spam-detector.git
cd spam-detector

python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate

pip install -r requirements.txt

Download SMS Spam Collection from Kaggle and place spam.csv in the data/ folder.

python src/train.py

uvicorn src.api:app --reload

Then test it:
curl -X POST http://localhost:8000/predict \
-H "Content-Type: application/json" \
-d '{"text": "Congratulations! You won a FREE iPhone. Click now!"}'

| Method | Endpoint | Description |
|---|---|---|
| GET | / | Health check |
| POST | /predict | Predict spam or ham |
Request body:
{ "text": "Your message here" }

Response:
{ "label": "spam", "confidence": 0.9871 }

| Model | Accuracy | AUC-ROC |
|---|---|---|
| Naive Bayes | ~97.4% | ~0.985 |
| Logistic Regression | ~98.1% | ~0.992 |
| Random Forest | ~97.8% | ~0.991 |
| Linear SVM | ~98.3% | ~0.994 |
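
A comparison like the one in the table could be run with a loop of the following shape: each model is wrapped in a pipeline with a unigram + bigram TF-IDF vectorizer and scored with 5-fold cross-validation. This is a sketch under assumptions: the toy messages are invented so the example runs without `spam.csv`, the hyperparameters are defaults, and the scores it prints say nothing about the figures reported above.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy data, invented for this example; the project trains on spam.csv.
spam = ["win a free prize now", "free cash click here", "claim your free reward",
        "you won a free phone", "click to win cash", "urgent prize waiting claim now",
        "free entry win tickets", "cash reward click now", "winner claim free gift",
        "free offer click link"]
ham = ["are we meeting today", "see you at lunch", "call me when you get home",
       "can you send the report", "happy birthday to you", "running late be there soon",
       "thanks for dinner last night", "what time is the meeting", "let me know when done",
       "i will call you later"]
texts = spam + ham
labels = [1] * len(spam) + [0] * len(ham)  # 1 = spam, 0 = ham

models = {
    "Naive Bayes": MultinomialNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Linear SVM": LinearSVC(),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = {}
for name, model in models.items():
    # Fitting the vectorizer inside the pipeline keeps each fold's
    # vocabulary from leaking into its held-out messages.
    pipeline = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), model)
    fold_scores = cross_val_score(pipeline, texts, labels, cv=cv, scoring="accuracy")
    scores[name] = fold_scores.mean()

for name, score in scores.items():
    print(f"{name}: mean accuracy {score:.3f}")
```

Passing `scoring="roc_auc"` instead would give the AUC column; `LinearSVC` has no `predict_proba`, so scikit-learn falls back to its `decision_function` for that metric.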
- Language: Python 3.10+
- ML/NLP: scikit-learn, NLTK
- Visualization: Matplotlib, Seaborn, WordCloud
- API: FastAPI, Uvicorn
- Serialization: Joblib
This project is licensed under the MIT License.