This project implements a complete Machine Learning pipeline for binary sentiment classification of tweets. It classifies tweets as either Negative (0) or Positive (1) using Natural Language Processing (NLP) techniques and Logistic Regression.
The workflow involves cleaning raw text data, performing stemming, converting text to numerical vectors, training a predictive model, and evaluating its performance.
- Python
- Pandas & NumPy
- NLTK (Natural Language Toolkit)
  - Stopwords removal
  - Porter Stemmer
- Scikit-learn
  - `TfidfVectorizer`
  - `LogisticRegression`
- Pickle
This project uses the Sentiment140 dataset (1.6M tweets) containing pre-labeled sentiments.
Download the dataset here: 👉 https://www.kaggle.com/datasets/kazanova/sentiment140
After downloading:
- Extract the CSV file
- Place it in the project root directory
- Ensure the filename matches what `main.ipynb` references
Key Columns:
- `target`: Sentiment label (`0` = Negative, `4` = Positive; converted to `1` in preprocessing)
- `id`: Tweet ID
- `date`: Time of tweet
- `flag`: Query status
- `user`: Username
- `text`: Tweet content
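Assuming the dataset's default filename and a header-less CSV (Sentiment140 ships without a header row), loading and relabeling might look like the sketch below. The sample rows written here are a tiny illustrative stand-in, not real data:

```python
import pandas as pd

# Tiny stand-in for the real Sentiment140 CSV (which has no header row)
sample = "sentiment140_sample.csv"
with open(sample, "w", encoding="ISO-8859-1") as f:
    f.write('"0","1","Mon Apr 06","NO_QUERY","userA","this is awful"\n')
    f.write('"4","2","Mon Apr 06","NO_QUERY","userB","this is great"\n')

# Supply column names ourselves, matching the six columns described above
columns = ["target", "id", "date", "flag", "user", "text"]
df = pd.read_csv(sample, encoding="ISO-8859-1", names=columns)

# Map the positive label 4 to 1 so targets are binary (0 = negative, 1 = positive)
df["target"] = df["target"].replace(4, 1)
```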
Text preprocessing (stemming) is computationally expensive. Using NLTK's `PorterStemmer` on the full dataset will take 50–65 minutes on standard machines. This is expected behavior and not a bug; please be patient.
1. Data Loading: load the CSV with `ISO-8859-1` encoding
2. Data Cleaning: null checks and basic cleaning
3. Target Conversion: convert `4` → `1` for binary classification
4. Text Preprocessing:
   - Regex cleaning
   - Lowercasing
   - Stopwords removal
   - Porter Stemming
5. TF-IDF Vectorization
6. Train/Test Split
7. Logistic Regression Training
8. Evaluation
9. Save Model (`.sav`)
10. Inference on new tweets
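The vectorization, split, training, and evaluation steps above can be sketched as follows. The toy corpus is purely illustrative; the real notebook fits on the full preprocessed tweet corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Toy corpus standing in for the preprocessed (stemmed) tweets
texts = ["love great day", "happi good news", "hate bad day", "aw sad news"] * 25
labels = [1, 1, 0, 0] * 25

# Convert text to TF-IDF feature vectors
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

# Hold out a stratified test set so both classes appear in it
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, stratify=labels, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```

Note that the vectorizer must be fitted on the training text and reused unchanged at inference time, since the model's input dimensions are tied to the TF-IDF vocabulary.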
- Training Accuracy: ~79.8%
- Test Accuracy: ~77.7%
A close accuracy gap suggests reasonable generalization.
```shell
git clone https://github.com/aarogyaojha/x-sentiment-analyzer.git
cd x-sentiment-analyzer
pip install pandas numpy nltk scikit-learn
```
Download NLTK stopwords:
```python
import nltk
nltk.download('stopwords')
```

Visit the Kaggle link below to download Sentiment140:
https://www.kaggle.com/datasets/kazanova/sentiment140
Place the extracted CSV in the project root.
- Open `main.ipynb`
- Run cells in order
- Note: the stemming step will take time (~50–65 minutes)
- After training, the model is saved as `sentiment_analyzer.sav`
The file `sentiment_analyzer.sav` contains the trained Logistic Regression model. It can be reused for prediction without retraining, as long as the same preprocessing and vectorization steps are applied to new text.
The model outputs:
- 0 → Negative tweet
- 1 → Positive tweet
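A sketch of saving and reloading with `pickle` for inference. Bundling the fitted vectorizer alongside the model is a suggestion here (the repository's `.sav` file may contain only the model), and the tiny training set is a stand-in for the real one produced by `main.ipynb`:

```python
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Train a tiny stand-in model (the real one comes from main.ipynb)
texts = ["love great day", "hate aw day"] * 50
labels = [1, 0] * 50
vectorizer = TfidfVectorizer()
model = LogisticRegression().fit(vectorizer.fit_transform(texts), labels)

# Save both objects so inference reuses the exact same TF-IDF vocabulary
with open("sentiment_analyzer.sav", "wb") as f:
    pickle.dump((vectorizer, model), f)

# Later: load and predict on a new (already preprocessed) tweet
with open("sentiment_analyzer.sav", "rb") as f:
    vec, clf = pickle.load(f)
prediction = clf.predict(vec.transform(["love great day"]))[0]
print(prediction)
```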