📩 Machine Learning Based Spam & Ham Detection

An end-to-end Machine Learning and Natural Language Processing (NLP) project that automatically classifies SMS messages as Spam or Ham (Legitimate Message) using Logistic Regression and TF-IDF Vectorization.

The application is deployed using Streamlit, allowing users to test SMS messages in real time.

🚀 Live Demo

https://machine-learning-based-spam-ham-detection-yuewtcszp3runl7wssgn.streamlit.app/

📌 Problem Statement

Mobile users frequently receive unwanted SMS messages, commonly known as spam. These messages may contain advertisements, scams, phishing attempts, or fraudulent content.

The objective of this project is to develop a machine learning model capable of automatically classifying SMS messages as:

Spam
Ham (Legitimate Message)

based on the text content of the message.

🎯 Project Objectives

Detect spam SMS messages automatically
Reduce unwanted message exposure
Apply NLP techniques to text data
Build and evaluate a machine learning classification model
Deploy the model for real-time predictions

📊 Dataset

SMS Spam Collection Dataset

Dataset contains:

Text → SMS Content
Target → Spam / Ham Label

🔍 Exploratory Data Analysis (EDA)

Performed exploratory analysis to understand:

Message distribution
Character count
Word count
Sentence count
Spam vs Ham patterns

Feature Engineering

Created additional features:

Number of Characters
Number of Words
Number of Sentences
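
As a rough illustration of these three features, the sketch below computes them for a single message with regular expressions. The function name `length_features` and the regex-based splitting are assumptions made for this example; the project may well have used NLTK tokenizers instead.

```python
import re

def length_features(message: str) -> dict:
    """Return character, word, and sentence counts for one SMS message."""
    # Words are approximated as runs of word characters.
    words = re.findall(r"\b\w+\b", message)
    # Sentences are approximated by splitting on runs of ., ! or ?
    # and discarding empty pieces.
    sentences = [s for s in re.split(r"[.!?]+", message) if s.strip()]
    return {
        "num_characters": len(message),
        "num_words": len(words),
        "num_sentences": len(sentences),
    }

sample = "Free entry! Text WIN to 80082 now. Reply STOP to end."
features = length_features(sample)
print(features)
```

Regex splitting miscounts abbreviations such as "e.g." as sentence breaks, so a proper sentence tokenizer is the safer choice on real data.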

⚙️ Data Preprocessing

Text Cleaning

Removed special characters
Removed numbers
Removed extra spaces
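
A minimal sketch of these cleaning steps, assuming regular expressions are used. The helper name `clean_text` is hypothetical, and the lowercasing step is an added assumption that the list above does not mention.

```python
import re

def clean_text(message: str) -> str:
    """Strip special characters and digits, then collapse whitespace."""
    text = message.lower()                    # assumption: normalise case first
    text = re.sub(r"[^a-z\s]", " ", text)     # drop digits and special characters
    text = re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace
    return text

cleaned = clean_text("WINNER!! You have won £1000,  call 09061701461 NOW")
print(cleaned)
```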

Tokenization

Converted sentences into individual words

Stopword Removal

Removed common words with little semantic value (for example "the", "is", "and")
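
The two steps above can be sketched as follows. To keep the example self-contained, it uses whitespace splitting and a tiny hand-written `STOPWORDS` set; both are stand-ins chosen for illustration. NLTK, which is listed under Technologies Used, provides `word_tokenize` and `stopwords.words("english")` for the same purpose once its data files are downloaded.

```python
# Tiny illustrative stopword set (an assumption, not NLTK's full English list).
STOPWORDS = {"a", "an", "the", "is", "to", "you", "have", "for", "of", "and", "in"}

def tokenize(text: str) -> list:
    """Split already-cleaned text into word tokens on whitespace."""
    return text.split()

def remove_stopwords(tokens: list) -> list:
    """Keep only the tokens that are not in the stopword set."""
    return [token for token in tokens if token not in STOPWORDS]

tokens = tokenize("you have won a free ticket to the final")
filtered = remove_stopwords(tokens)
print(filtered)
```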

Lemmatization

Reduced words to their dictionary base form (lemma)

Example:

Running → Run

Playing → Play

📈 Outlier Detection

Applied:

Box Plot Analysis
Interquartile Range (IQR) Method

to identify and handle abnormal message lengths.
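
For reference, the IQR rule flags values that fall outside Q1 - 1.5 × IQR and Q3 + 1.5 × IQR. The sketch below applies it to a made-up list of message lengths; the function name `iqr_bounds`, the sample data, and the 1.5 multiplier are assumptions for illustration, since the thresholds used in the project are not stated.

```python
import statistics

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences using the interquartile range rule."""
    # method="inclusive" treats the data as the full population and
    # interpolates linearly between neighbouring points.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical character counts; the last message is unusually long.
lengths = [20, 25, 30, 35, 40, 45, 50, 55, 60, 400]
lower, upper = iqr_bounds(lengths)
outliers = [x for x in lengths if x < lower or x > upper]
print(lower, upper, outliers)
```

Whether flagged messages are dropped, capped, or kept is a separate decision. Long messages are not necessarily noise in spam data, so removal is worth checking against the labels first.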

🔤 Text Vectorization

Implemented:

TF-IDF Vectorization

Converted text data into numerical feature vectors suitable for machine learning algorithms.

⚖️ Handling Class Imbalance

Applied:

SMOTE (Synthetic Minority Oversampling Technique)

to balance Spam and Ham classes before model training.
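
To show the idea behind SMOTE, the sketch below generates synthetic minority-class rows by interpolating between a minority sample and one of its nearest minority neighbours. It is a simplified NumPy illustration, not the library implementation: the function `smote_oversample`, its parameters, and the sample data are all hypothetical. In practice the `SMOTE` class from the imbalanced-learn package, with its `fit_resample` method, is the usual route.

```python
import numpy as np

def smote_oversample(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic rows from minority-class samples X_min."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    # Pairwise Euclidean distances within the minority class.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]
    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_min))          # pick a minority sample
        nb = neighbours[base, rng.integers(k)]   # pick one of its k neighbours
        gap = rng.random()                       # position along the segment
        synthetic[i] = X_min[base] + gap * (X_min[nb] - X_min[base])
    return synthetic

# Hypothetical 2-D feature vectors for four spam messages.
X_spam = np.array([[0.9, 0.1], [0.8, 0.2], [0.85, 0.3], [0.95, 0.25]])
X_new = smote_oversample(X_spam, n_new=6)
print(X_new.shape)
```

Oversampling is normally fitted on the training split only, so that synthetic points derived from test messages do not leak into evaluation.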

🤖 Model Building

Algorithm Used:

Logistic Regression

Steps:

Train-Test Split
Model Training
Prediction
Hyperparameter Tuning using GridSearchCV
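
The steps above can be sketched with scikit-learn as follows. The toy messages, the 0/1 label encoding, the pipeline step names, and the grid of `C` values are all assumptions for illustration; the project's actual split ratio and search grid are not given.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Invented toy data: 1 = spam, 0 = ham.
messages = [
    "win a free prize now", "claim your free cash reward",
    "urgent call to claim prize", "free entry in a weekly draw",
    "you have won a cash prize", "text now to win free tickets",
    "are we meeting for lunch", "see you at home tonight",
    "can you call me later", "thanks for the lovely dinner",
    "i will be late for the meeting", "happy birthday to you",
]
labels = [1] * 6 + [0] * 6

# Train-test split, stratified so both classes appear in each part.
X_train, X_test, y_train, y_test = train_test_split(
    messages, labels, test_size=0.25, stratify=labels, random_state=42
)

# Vectorizer and classifier in one pipeline, so TF-IDF is refit per CV fold.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Hyperparameter tuning over the regularisation strength C.
search = GridSearchCV(pipeline, {"clf__C": [0.1, 1.0, 10.0]}, cv=2, scoring="f1")
search.fit(X_train, y_train)

predictions = search.predict(X_test)
print(search.best_params_, list(predictions))
```

Wrapping the vectorizer inside the pipeline keeps the cross-validation honest, because vocabulary and IDF weights are learned from each training fold alone.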

📉 Model Evaluation

Evaluation Metrics:

Accuracy
Precision
Recall
F1 Score
Confusion Matrix

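These metrics were used to assess the effectiveness of spam detection. Precision measures how many messages flagged as spam are truly spam, and recall measures how much of the actual spam is caught.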
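
To make the definitions concrete, the sketch below derives each metric from the four confusion-matrix counts, treating spam as the positive class. The function `classification_metrics` and the label lists are made up for this example; scikit-learn's `accuracy_score`, `precision_score`, `recall_score`, `f1_score`, and `confusion_matrix` compute the same quantities.

```python
def classification_metrics(y_true, y_pred, positive="spam"):
    """Compute confusion-matrix counts and the metrics derived from them."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "confusion": {"tp": tp, "fp": fp, "fn": fn, "tn": tn},
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Hypothetical true labels and model predictions for eight messages.
y_true = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "ham"]
y_pred = ["spam", "spam", "ham", "ham", "ham", "ham", "ham", "spam"]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```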

🛠️ Technologies Used

Python
Pandas
NumPy
NLTK
Scikit-Learn
TF-IDF Vectorizer
Logistic Regression
SMOTE
Joblib
Streamlit

📁 Project Structure

Machine-Learning-Based-Spam-Ham-Detection/
│
├── app.py
├── logistic_regression_sms_spam_model.pkl
├── tfidf_vectorizer.pkl
├── label_encoder.pkl
├── requirements.txt
├── spam.csv
└── README.md
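
The three `.pkl` files suggest that the fitted model, vectorizer, and label encoder are saved with Joblib and reloaded at prediction time. The sketch below shows one way such a round trip could look. It trains throwaway objects on four invented messages and writes them to a temporary directory, so it does not touch or reproduce the real artifacts; the helper `predict_label` is hypothetical, and the actual contents of `app.py` are not shown in this README.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder

# Invented training data, used only to have something to save.
texts = ["win a free prize now", "claim your free cash",
         "see you at lunch", "call me when you are home"]
labels = ["spam", "spam", "ham", "ham"]

encoder = LabelEncoder().fit(labels)
vectorizer = TfidfVectorizer().fit(texts)
model = LogisticRegression().fit(vectorizer.transform(texts),
                                 encoder.transform(labels))

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    # Save the three artifacts under the file names used in this repository.
    joblib.dump(model, folder / "logistic_regression_sms_spam_model.pkl")
    joblib.dump(vectorizer, folder / "tfidf_vectorizer.pkl")
    joblib.dump(encoder, folder / "label_encoder.pkl")
    # Reload them, as a prediction app would at start-up.
    loaded_model = joblib.load(folder / "logistic_regression_sms_spam_model.pkl")
    loaded_vectorizer = joblib.load(folder / "tfidf_vectorizer.pkl")
    loaded_encoder = joblib.load(folder / "label_encoder.pkl")

def predict_label(message: str) -> str:
    """Vectorize one message and map the predicted class back to its name."""
    features = loaded_vectorizer.transform([message])
    encoded = loaded_model.predict(features)
    return str(loaded_encoder.inverse_transform(encoded)[0])

prediction = predict_label("free prize waiting, claim now")
print(prediction)
```

Saving the vectorizer alongside the model matters because new messages must be mapped onto the same vocabulary and IDF weights the model was trained on.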

💡 Key Learnings

Natural Language Processing workflow
Text preprocessing techniques
TF-IDF feature extraction
Handling imbalanced datasets using SMOTE
Logistic Regression for text classification
Hyperparameter tuning with GridSearchCV
Streamlit deployment
End-to-end machine learning project lifecycle

👨‍💻 Author

Daniel J

LinkedIn: https://www.linkedin.com/in/daniel-j77

GitHub: https://github.com/daniel-j77

⭐ Future Improvements

Deep Learning based spam detection
LSTM / RNN implementation
Transformer-based models (BERT)
Multilingual spam detection
Real-time SMS filtering API
