This project implements a complete Machine Learning pipeline for binary sentiment classification of tweets. It classifies tweets as either Negative (0) or Positive (1) using Natural Language Processing (NLP) techniques and Logistic Regression.
The workflow involves cleaning raw text data, performing stemming, converting text to numerical vectors, training a predictive model, and evaluating its performance.
- Python
- Pandas & NumPy
- NLTK (Natural Language Toolkit)
  - Stopwords removal
  - Porter Stemmer
- Scikit-learn
  - `TfidfVectorizer`
  - `LogisticRegression`
- Pickle
This project uses the Sentiment140 dataset (1.6M tweets) containing pre-labeled sentiments.
Download the dataset here: 👉 https://www.kaggle.com/datasets/kazanova/sentiment140
After downloading:
- Extract the CSV file
- Place it in the project root directory
- Ensure the filename matches what `main.ipynb` references
Key Columns:
- `target`: Sentiment label (`0` = Negative, `4` = Positive; converted to `1` in preprocessing)
- `id`: Tweet ID
- `date`: Time of tweet
- `flag`: Query status
- `user`: Username
- `text`: Tweet content
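Assuming the dataset's default filename and a header-less CSV (Sentiment140 ships without a header row), loading and relabeling might look like the sketch below. The sample rows written here are a tiny illustrative stand-in, not real data:

```python
import pandas as pd

# Tiny stand-in for the real Sentiment140 CSV (which has no header row)
sample = "sentiment140_sample.csv"
with open(sample, "w", encoding="ISO-8859-1") as f:
    f.write('"0","1","Mon Apr 06","NO_QUERY","userA","this is awful"\n')
    f.write('"4","2","Mon Apr 06","NO_QUERY","userB","this is great"\n')

# Supply column names ourselves, matching the six columns described above
columns = ["target", "id", "date", "flag", "user", "text"]
df = pd.read_csv(sample, encoding="ISO-8859-1", names=columns)

# Map the positive label 4 to 1 so targets are binary (0 = negative, 1 = positive)
df["target"] = df["target"].replace(4, 1)
```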
Text preprocessing (stemming) is computationally expensive. Using NLTK's `PorterStemmer` on the full dataset will take 50–65 minutes on standard machines. This is expected behavior and not a bug; please be patient.
1. Data Loading: load the CSV with `ISO-8859-1` encoding
2. Data Cleaning: null checks and basic cleaning
3. Target Conversion: convert `4` → `1` for binary classification
4. Text Preprocessing:
   - Regex cleaning
   - Lowercasing
   - Stopwords removal
   - Porter Stemming
5. TF-IDF Vectorization
6. Train/Test Split
7. Logistic Regression Training
8. Evaluation
9. Save Model (`.sav`)
10. Inference on new tweets
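The vectorization, split, training, and evaluation steps above can be sketched as follows. The toy corpus is purely illustrative; the real notebook fits on the full preprocessed tweet corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Toy corpus standing in for the preprocessed (stemmed) tweets
texts = ["love great day", "happi good news", "hate bad day", "aw sad news"] * 25
labels = [1, 1, 0, 0] * 25

# Convert text to TF-IDF feature vectors
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

# Hold out a stratified test set so both classes appear in it
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, stratify=labels, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```

Note that the vectorizer must be fitted on the training text and reused unchanged at inference time, since the model's input dimensions are tied to the TF-IDF vocabulary.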
- Training Accuracy: ~79.8%
- Test Accuracy: ~77.7%
A close accuracy gap suggests reasonable generalization.
```shell
git clone https://github.com/aarogyaojha/x-sentiment-analyzer.git
cd x-sentiment-analyzer
pip install pandas numpy nltk scikit-learn
```
Download NLTK stopwords:
```python
import nltk
nltk.download('stopwords')
```

Visit the Kaggle link below to download Sentiment140:
https://www.kaggle.com/datasets/kazanova/sentiment140
Place the extracted CSV in the project root.
- Open `main.ipynb`
- Run cells in order
- Note: the stemming step will take time (~50–65 minutes)
- After training, the model is saved as `sentiment_analyzer.sav`
The file `sentiment_analyzer.sav` contains the trained Logistic Regression model. It can be reused for prediction without retraining, as long as the same preprocessing and vectorization steps are applied to new text.
The model outputs:
- 0 → Negative tweet
- 1 → Positive tweet
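A sketch of saving and reloading with `pickle` for inference. Bundling the fitted vectorizer alongside the model is a suggestion here (the repository's `.sav` file may contain only the model), and the tiny training set is a stand-in for the real one produced by `main.ipynb`:

```python
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Train a tiny stand-in model (the real one comes from main.ipynb)
texts = ["love great day", "hate aw day"] * 50
labels = [1, 0] * 50
vectorizer = TfidfVectorizer()
model = LogisticRegression().fit(vectorizer.fit_transform(texts), labels)

# Save both objects so inference reuses the exact same TF-IDF vocabulary
with open("sentiment_analyzer.sav", "wb") as f:
    pickle.dump((vectorizer, model), f)

# Later: load and predict on a new (already preprocessed) tweet
with open("sentiment_analyzer.sav", "rb") as f:
    vec, clf = pickle.load(f)
prediction = clf.predict(vec.transform(["love great day"]))[0]
print(prediction)
```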