A machine learning-powered application that classifies SMS messages as Spam or Ham (Not Spam) using Natural Language Processing (NLP). This project includes:
- ✅ Command-line interface (CLI)
- ✅ Interactive Streamlit web app
- ✅ Trained model using Multinomial Naive Bayes
- ✅ TF-IDF-based text preprocessing
- ✅ K-Fold evaluation metrics
- Vectorizer: TfidfVectorizer(stop_words='english')
- Classifier: MultinomialNB()
- Training Strategy: 10-Fold Cross Validation
- Evaluation Metrics:
  - Accuracy: ~97.15%
  - Precision: ~99.66%
  - Recall: ~78.98%
  - F1 Score: ~88.10%
  - Geometric Mean: ~88.84%
Multinomial Naive Bayes is efficient and performs well for text classification tasks like spam detection, where input features are word frequencies or TF-IDF scores.
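To make the point above concrete, here is a minimal sketch of TF-IDF features feeding a Multinomial Naive Bayes classifier. The four messages are invented for illustration and are not taken from the real dataset; the variable names are assumptions, not the repository's own.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy messages invented for illustration (not from the real dataset).
messages = [
    "win a free prize now",
    "free cash offer, claim now",
    "are we meeting today",
    "see you at lunch today",
]
labels = ["spam", "spam", "ham", "ham"]

# Turn each message into a row of TF-IDF weights (English stop words dropped).
vectorizer = TfidfVectorizer(stop_words="english")
features = vectorizer.fit_transform(messages)  # sparse matrix, one row per message

# Fit the classifier on the TF-IDF rows.
classifier = MultinomialNB()
classifier.fit(features, labels)

# New text must go through the same fitted vectorizer before prediction.
new_features = vectorizer.transform(["claim your free prize"])
prediction = classifier.predict(new_features)[0]
print(features.shape, prediction)
```

The key detail is that the vectorizer is fitted once on the training text and then reused with `transform` for new messages, so the feature columns line up with what the classifier learned.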
Clone the repository:

    git clone https://github.com/mayankraj052/SpamdetectionApp.git
    cd SpamdetectionApp

Create a virtual environment:

    python -m venv venv

Activate it on Windows:

    venv\Scripts\activate

Activate it on macOS/Linux:

    source venv/bin/activate

Install the dependencies:

    pip install -r requirements.txt

Run the Streamlit web app:

    streamlit run app.py

Run the CLI:

    python spam_check_cli.py

Example predictions:

- "Hey John, I thought you might like this opportunity — earn $500/day working from home!" → Spam
- "Are we still meeting at 6 PM today?" → Ham
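A check like the ones above could be served by a pipeline saved with Joblib and reloaded at prediction time. The following is only a sketch under assumptions: the file name `spam_model.joblib`, the `classify` helper, and the tiny training set are all hypothetical, and the repository's actual CLI may work differently.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny invented training set so the sketch runs on its own.
train_messages = [
    "earn money working from home",
    "win free cash now",
    "are we still meeting today",
    "see you at lunch tomorrow",
]
train_labels = ["spam", "spam", "ham", "ham"]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("model", MultinomialNB()),
])
pipeline.fit(train_messages, train_labels)

# Hypothetical file name; the repository may store its model elsewhere.
model_path = os.path.join(tempfile.mkdtemp(), "spam_model.joblib")
joblib.dump(pipeline, model_path)


def classify(message, path=model_path):
    """Load the saved pipeline and return 'Spam' or 'Ham' for one message."""
    model = joblib.load(path)
    label = model.predict([message])[0]
    return "Spam" if label == "spam" else "Ham"


print(classify("win free cash"))
```

Saving the whole pipeline, rather than the classifier alone, keeps the fitted vectorizer and the model together, so raw text can be passed straight to `predict`.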
- SMS Spam Collection Dataset (UCI ML Repository)
- Data Set Link
- Format: Label (spam or ham) + Message content
- TF-IDF Vectorizer (preprocessing)
- Multinomial Naive Bayes Classifier
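For the dataset format noted above (a spam or ham label followed by the message content), loading with Pandas might look like the sketch below. The UCI file is tab-separated with no header row; the two sample rows, the column names, and the `is_spam` column are assumptions made for illustration.

```python
import csv
import io

import pandas as pd

# Two invented rows standing in for the real file: label, tab, message text.
sample = (
    "ham\tAre we still meeting at 6 PM today?\n"
    "spam\tEarn $500/day working from home!\n"
)

df = pd.read_csv(
    io.StringIO(sample),       # replace with the path to the downloaded file
    sep="\t",
    header=None,               # the file has no header row
    names=["label", "message"],
    quoting=csv.QUOTE_NONE,    # messages may contain stray quote characters
)

# Encode labels as integers: 1 for spam, 0 for ham.
df["is_spam"] = (df["label"] == "spam").astype(int)
print(df["label"].value_counts())
```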
Model pipeline:

    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB

    spam_pipeline = Pipeline([
        ('tfidf', TfidfVectorizer(stop_words='english')),
        ('model', MultinomialNB())
    ])

K-Fold evaluation:

    from sklearn.model_selection import KFold
    from sklearn.metrics import confusion_matrix

- Used to validate model performance across 10 different splits. Metrics like accuracy, precision, recall, F1, specificity, and geometric mean were averaged across the folds.
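The evaluation described above can be sketched roughly as follows. This is not the repository's evaluation script: the helper names, the shuffling and seed, and the demo data are assumptions. The geometric mean is assumed here to be the square root of recall times specificity, a common definition for imbalanced classification; the source does not state which formula it uses.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import KFold
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


def fold_metrics(y_true, y_pred):
    """Derive the metrics for one fold from its confusion matrix (spam = 1)."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "specificity": specificity,
        # Assumed definition: sqrt(recall * specificity).
        "g_mean": float(np.sqrt(recall * specificity)),
    }


def cross_validate(messages, labels, n_splits=10, seed=42):
    """Average the per-fold metrics over a shuffled K-Fold split."""
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    per_fold = []
    for train_idx, test_idx in kfold.split(messages):
        # A fresh pipeline per fold, so nothing leaks from the test split.
        pipeline = Pipeline([
            ("tfidf", TfidfVectorizer(stop_words="english")),
            ("model", MultinomialNB()),
        ])
        pipeline.fit([messages[i] for i in train_idx],
                     [labels[i] for i in train_idx])
        predictions = pipeline.predict([messages[i] for i in test_idx])
        per_fold.append(fold_metrics([labels[i] for i in test_idx], predictions))
    return {name: float(np.mean([m[name] for m in per_fold]))
            for name in per_fold[0]}


# Invented demo data; the real project would pass the full dataset
# and keep the default n_splits=10.
demo_messages = [
    "win free cash", "free prize offer", "claim cash prize", "win prize today",
    "meeting at lunch", "lunch today maybe", "running late sorry",
    "meeting moved tomorrow",
]
demo_labels = [1, 1, 1, 1, 0, 0, 0, 0]
print(cross_validate(demo_messages, demo_labels, n_splits=4))
```

Fitting the vectorizer inside each fold, rather than once on the full dataset, keeps test-fold vocabulary and document frequencies out of training.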
- Python 3.11+
- Scikit-learn
- Pandas & NumPy
- Streamlit
- Joblib
- Git for version control
This project is open-source and available under the MIT License.

🙋‍♂️ Author

Made with ❤️ by Mayank Raj
