An end-to-end Machine Learning and Natural Language Processing (NLP) project that automatically classifies SMS messages as Spam or Ham (Legitimate Message) using Logistic Regression and TF-IDF Vectorization.
The application is deployed using Streamlit, allowing users to test SMS messages in real time.
Live demo: https://machine-learning-based-spam-ham-detection-yuewtcszp3runl7wssgn.streamlit.app/
Mobile users frequently receive unwanted SMS messages, commonly known as spam. These messages may contain advertisements, scams, phishing attempts, or fraudulent content.
The objective of this project is to develop a machine learning model that automatically classifies each SMS message, based on its text content, as one of:
- Spam
- Ham (Legitimate Message)
Objectives:
- Detect spam SMS messages automatically
- Reduce unwanted message exposure
- Apply NLP techniques to text data
- Build and evaluate a machine learning classification model
- Deploy the model for real-time predictions
Dataset: SMS Spam Collection Dataset
The dataset contains the following columns:
- Text → SMS Content
- Target → Spam / Ham Label
Performed exploratory analysis to understand:
- Message distribution
- Character count
- Word count
- Sentence count
- Spam vs Ham patterns
Created additional features:
- Number of Characters
- Number of Words
- Number of Sentences
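The three length features above can be sketched with the standard library alone. This is a minimal illustration, not the project's code: the function name `length_features` and the regex-based word and sentence splitting are assumptions (the project lists NLTK, whose tokenizers would normally do this job more carefully).

```python
import re

def length_features(message: str) -> dict:
    """Return character, word, and sentence counts for one SMS message."""
    # Words: runs of word characters, optionally with an apostrophe part ("don't").
    words = re.findall(r"\w+(?:'\w+)?", message)
    # Sentences: split on runs of ., ! or ?; a crude stand-in for a real sentence tokenizer.
    sentences = [s for s in re.split(r"[.!?]+", message) if s.strip()]
    return {
        "num_characters": len(message),
        "num_words": len(words),
        "num_sentences": len(sentences),
    }

example = "Free entry! Text WIN to 80086 now. Call today"
features = length_features(example)
```

Applied to every row of the Text column, these counts give the extra columns used in the exploratory comparison of Spam and Ham messages.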
Text preprocessing steps:
- Removed special characters
- Removed numbers
- Removed extra spaces
- Converted sentences into individual words (tokenization)
- Removed common words with little semantic value (stop words)
- Converted words into their root form, for example Running → Run and Playing → Play
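The preprocessing steps above can be sketched as a single function. Everything named here is hypothetical: `STOPWORDS` is a tiny illustrative list and `naive_stem` is a crude suffix stripper standing in for a real stemmer or lemmatizer (the project uses NLTK, but which NLTK component it uses is not stated).

```python
import re

# Tiny illustrative stop word list; a real pipeline would use a fuller one.
STOPWORDS = {"a", "an", "and", "are", "is", "the", "to", "you", "your", "for", "of", "in"}

def naive_stem(word: str) -> str:
    """Strip "ing"/"ed" and undo consonant doubling; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed"):
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem) >= 3:
            # "runn" -> "run", but keep doubled l, s, z ("call" stays "call").
            if stem[-1] == stem[-2] and stem[-1] not in "aeioulsz":
                stem = stem[:-1]
            return stem
    return word

def preprocess(text: str) -> list:
    """Lowercase, drop non-letters, tokenize, remove stop words, reduce to root form."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())  # removes special characters and numbers
    tokens = text.split()                          # also collapses extra spaces
    return [naive_stem(t) for t in tokens if t not in STOPWORDS]

tokens = preprocess("Running and playing 123!!")
```

The same function would need to run on incoming messages at prediction time, so that the model sees text in the form it was trained on.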
Applied the following to identify and handle abnormal message lengths:
- Box Plot Analysis
- Interquartile Range (IQR) Method
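As a rough illustration of the IQR method, the sketch below flags message lengths that fall outside 1.5 × IQR of the quartiles. The sample lengths, the function name `iqr_bounds`, and the choice to simply drop flagged values are assumptions; the README does not say how the outliers were handled.

```python
import statistics

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical character counts; one message is far longer than the rest.
lengths = [20, 25, 30, 35, 40, 45, 50, 400]
lower, upper = iqr_bounds(lengths)

outliers = [n for n in lengths if n < lower or n > upper]
kept = [n for n in lengths if lower <= n <= upper]
```

These fences are the same ones a box plot draws as whiskers, which is why the two techniques are usually used together.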
Implemented TF-IDF Vectorization to convert text data into numerical feature vectors suitable for machine learning algorithms.
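A minimal sketch of the TF-IDF step with scikit-learn's `TfidfVectorizer`, using default settings. The three sample messages are made up, and the project's actual vectorizer parameters are not given in this README.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up messages, for illustration only.
messages = [
    "win free prize now",
    "are we meeting for lunch",
    "free prize call now",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(messages)  # sparse matrix: one row per message, one column per term

n_terms = len(vectorizer.vocabulary_)   # number of distinct terms learned from the messages
```

Terms that appear in many messages receive lower weights than terms concentrated in a few, which tends to push generic words down and distinctive spam vocabulary up.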
Applied SMOTE to balance Spam and Ham classes before model training.
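To show the idea behind SMOTE, here is a simplified standard-library sketch of its core step: new minority-class points are created by interpolating between a minority sample and one of its nearest minority neighbours. This is not the imbalanced-learn implementation the project most likely uses (`imblearn.over_sampling.SMOTE`); the function name `smote_like` and the toy vectors are hypothetical.

```python
import math
import random

def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority points to the base point, excluding the base itself.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

# Toy 2-D feature vectors standing in for TF-IDF rows of spam messages.
spam_vectors = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.1]]
new_points = smote_like(spam_vectors, n_new=4)
```

Oversampling is generally applied to the training split only, so that the test set keeps the real class balance.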
Algorithm Used: Logistic Regression
Steps:
- Train-Test Split
- Model Training
- Prediction
- Hyperparameter Tuning using GridSearchCV
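The steps above can be sketched end to end with scikit-learn. The toy messages, the `Pipeline` wrapper, the `C` grid, and the F1 scoring choice are all assumptions made for illustration; the project saves its vectorizer and model as separate files, so its actual training code probably differs.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Toy data for illustration only (1 = spam, 0 = ham); not taken from the project's dataset.
spam = [
    "win a free prize now", "free cash claim now", "urgent claim your free reward",
    "win cash prize today", "free entry win money", "claim free prize call now",
    "urgent win free cash", "call now for free reward",
]
ham = [
    "are we meeting for lunch", "see you at home tonight", "can you call me later",
    "thanks for the help today", "i will be late for dinner", "let us meet at the office",
    "did you finish the report", "happy birthday see you soon",
]
texts = spam + ham
labels = [1] * len(spam) + [0] * len(ham)

# Train-test split, stratified so both splits keep the same class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

# Vectorizer and classifier chained so the vocabulary is learned from training folds only.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Hyperparameter tuning: try several regularization strengths with 3-fold cross-validation.
grid = GridSearchCV(pipeline, {"clf__C": [0.1, 1.0, 10.0]}, cv=3, scoring="f1")
grid.fit(X_train, y_train)

predictions = grid.predict(X_test)  # predictions from the best estimator found
```

After fitting, `grid.best_params_` holds the selected value of `C`, and `grid.best_estimator_` is the refitted pipeline that would be saved for deployment.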
Evaluation Metrics:
- Accuracy
- Precision
- Recall
- F1 Score
- Confusion Matrix
These metrics were used to assess the effectiveness of spam detection.
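For reference, the metrics listed above all derive from the four confusion matrix counts. The sketch below computes them by hand on made-up labels, treating Spam as the positive class; the function name `spam_metrics` and the sample predictions are hypothetical, and scikit-learn's `sklearn.metrics` functions would normally be used instead.

```python
def spam_metrics(y_true, y_pred, positive="spam"):
    """Compute confusion counts and derived metrics, with `positive` as the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))

    precision = tp / (tp + fp) if tp + fp else 0.0  # of predicted spam, how much is spam
    recall = tp / (tp + fn) if tp + fn else 0.0     # of actual spam, how much was caught
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(y_true)
    return {
        "confusion": {"tp": tp, "fp": fp, "fn": fn, "tn": tn},
        "accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1,
    }

# Made-up labels: 3 spam and 5 ham messages, with one miss and one false alarm.
y_true = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "ham"]
y_pred = ["spam", "spam", "ham", "ham", "ham", "ham", "spam", "ham"]
scores = spam_metrics(y_true, y_pred)
```

On an imbalanced dataset, accuracy alone can look good even when most spam is missed, which is why precision, recall, and F1 are reported alongside it.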
Technologies used:
- Python
- Pandas
- NumPy
- NLTK
- Scikit-Learn
- TF-IDF Vectorizer
- Logistic Regression
- SMOTE
- Joblib
- Streamlit
Machine-Learning-Based-Spam-Ham-Detection/
│
├── app.py
├── logistic_regression_sms_spam_model.pkl
├── tfidf_vectorizer.pkl
├── label_encoder.pkl
├── requirements.txt
├── spam.csv
└── README.md
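One way the three `.pkl` files in this layout could be used together at prediction time is sketched below. The function names `load_artifacts` and `classify_message` are hypothetical, and it is an assumption that the files were saved with Joblib and that the label encoder maps the model's output back to the Spam / Ham labels; the real `app.py` may be organized differently.

```python
from pathlib import Path

def load_artifacts(directory="."):
    """Load the saved vectorizer, model, and label encoder (assumes Joblib files)."""
    import joblib  # imported here so the helper below can be used without Joblib installed

    base = Path(directory)
    vectorizer = joblib.load(base / "tfidf_vectorizer.pkl")
    model = joblib.load(base / "logistic_regression_sms_spam_model.pkl")
    label_encoder = joblib.load(base / "label_encoder.pkl")
    return vectorizer, model, label_encoder

def classify_message(text, vectorizer, model, label_encoder):
    """Return the decoded label for a single SMS message."""
    features = vectorizer.transform([text])               # same TF-IDF vocabulary as training
    encoded = model.predict(features)                     # encoded class, e.g. 0 or 1
    return label_encoder.inverse_transform(encoded)[0]    # back to the original label
```

In a Streamlit app, `classify_message` would typically be called with the contents of a text input when the user submits a message, after applying the same preprocessing used during training.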
Key learnings:
- Natural Language Processing workflow
- Text preprocessing techniques
- TF-IDF feature extraction
- Handling imbalanced datasets using SMOTE
- Logistic Regression for text classification
- Hyperparameter tuning with GridSearchCV
- Streamlit deployment
- End-to-end machine learning project lifecycle
Author: Daniel J
LinkedIn: https://www.linkedin.com/in/daniel-j77
GitHub: https://github.com/daniel-j77
Future improvements:
- Deep Learning based spam detection
- LSTM / RNN implementation
- Transformer-based models (BERT)
- Multilingual spam detection
- Real-time SMS filtering API