X Sentiment Analysis using NLP & Logistic Regression

Project Overview

This project implements a complete Machine Learning pipeline for binary sentiment classification of tweets. It classifies tweets as either Negative (0) or Positive (1) using Natural Language Processing (NLP) techniques and Logistic Regression.

The workflow involves cleaning raw text data, performing stemming, converting text to numerical vectors, training a predictive model, and evaluating its performance.

Technologies Used

  • Python

  • Pandas & NumPy

  • NLTK (Natural Language Toolkit)

    • Stopwords removal
    • Porter Stemmer
  • Scikit-learn

    • TfidfVectorizer
    • LogisticRegression
  • Pickle

Dataset Description

This project uses the Sentiment140 dataset (1.6M tweets) containing pre-labeled sentiments.

Download the dataset here: 👉 https://www.kaggle.com/datasets/kazanova/sentiment140

After downloading:

  • Extract the CSV file
  • Place it in the project root directory
  • Ensure the filename matches what main.ipynb references

Key Columns:

  • target: Sentiment label

    • 0: Negative
    • 4: Positive (Converted to 1 in preprocessing)
  • id: Tweet ID

  • date: Time of tweet

  • flag: Query status

  • user: Username

  • text: Tweet content
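
Loading the raw CSV with these column names might look like the following sketch. The two inline rows are invented sample data, and in practice you would point `read_csv` at your extracted CSV file instead of the `StringIO` buffer:

```python
import io

import pandas as pd

# Sentiment140 ships without a header row, so supply the column names
# from the description above. The two rows below are invented samples;
# replace the StringIO buffer with the path to your extracted CSV.
sample_csv = io.StringIO(
    '0,1467810369,"Mon Apr 06 2009","NO_QUERY","someuser","bad day"\n'
    '4,1467810672,"Mon Apr 06 2009","NO_QUERY","otheruser","great day"\n'
)
columns = ["target", "id", "date", "flag", "user", "text"]
df = pd.read_csv(sample_csv, encoding="ISO-8859-1", names=columns)

# Map the positive label 4 to 1 so targets are binary (0/1).
df["target"] = df["target"].replace(4, 1)
```

The `encoding="ISO-8859-1"` argument matters when reading the real file from disk, since the dataset is not UTF-8 encoded.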

> [!IMPORTANT]
> **Processing Time Warning**
>
> Text preprocessing (stemming) is computationally expensive. Running NLTK's PorterStemmer over the full 1.6M-tweet dataset takes roughly 50–65 minutes on a standard machine. This is expected behavior, not a bug; please be patient.

Workflow Explanation

  1. Data Loading — Load the CSV with ISO-8859-1 encoding

  2. Data Cleaning — Null checks and basic cleaning

  3. Target Conversion — Convert label 4 → 1 for binary classification

  4. Text Preprocessing

    • Regex cleaning
    • Lowercasing
    • Stopwords removal
    • Porter Stemming
  5. TF-IDF Vectorization

  6. Train/Test Split

  7. Logistic Regression Training

  8. Evaluation

  9. Save Model (.sav)

  10. Inference on new tweets
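
Steps 5–9 can be sketched end to end on a toy corpus. The corpus itself and the `test_size`, `random_state`, and `max_iter` values here are illustrative assumptions, not the notebook's actual settings:

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Toy stand-in for the preprocessed tweet text and its 0/1 labels.
texts = ["love movi", "hate movi", "great day", "bad day",
         "love day", "hate day", "great movi", "bad movi"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# Step 5: TF-IDF vectorization.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

# Step 6: stratified train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, stratify=labels, random_state=42)

# Steps 7-8: train Logistic Regression and evaluate.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
train_acc = accuracy_score(y_train, model.predict(X_train))
test_acc = accuracy_score(y_test, model.predict(X_test))

# Step 9: persist the trained model with pickle.
with open("sentiment_analyzer.sav", "wb") as f:
    pickle.dump(model, f)
```

On the real dataset the same sequence produces the accuracy figures reported below.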

Model Performance

  • Training Accuracy: ~79.8%
  • Test Accuracy: ~77.7%

The small gap between training and test accuracy suggests the model generalizes reasonably and is not severely overfitting.

How to Run the Project

1. Clone the Repository

git clone https://github.com/aarogyaojha/x-sentiment-analyzer.git
cd x-sentiment-analyzer

2. Prerequisites

pip install pandas numpy nltk scikit-learn

Download NLTK stopwords:

import nltk
nltk.download('stopwords')

3. Download Dataset

Visit the Kaggle link below to download Sentiment140:

https://www.kaggle.com/datasets/kazanova/sentiment140

Place the extracted CSV in the project root.

4. Run the Notebook

  1. Open main.ipynb
  2. Run cells in order
  3. Expect the stemming step to take roughly 50–65 minutes (see the processing time warning above)
  4. After training, the model is saved as sentiment_analyzer.sav

Saved Model

The file sentiment_analyzer.sav contains the trained Logistic Regression model. It can be reused for prediction without retraining, provided new text goes through the same preprocessing steps and the same fitted TF-IDF vectorizer used during training.
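
A minimal save/load round trip with pickle might look like this. Note two assumptions: the notebook itself saves only the model, so bundling the fitted vectorizer alongside it is one possible choice made here for self-containment, and the two-tweet training corpus is invented:

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Train a toy model so the load/predict round trip is self-contained.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(["love movi", "hate movi"])
model = LogisticRegression().fit(X, [1, 0])

# Persist vectorizer and model together: without the fitted vectorizer,
# new tweets cannot be mapped into the same feature space at inference time.
with open("sentiment_analyzer.sav", "wb") as f:
    pickle.dump((vectorizer, model), f)

# Later (or in another process): reload and classify a new, preprocessed tweet.
with open("sentiment_analyzer.sav", "rb") as f:
    loaded_vectorizer, loaded_model = pickle.load(f)

pred = loaded_model.predict(loaded_vectorizer.transform(["love movi"]))[0]
print("Positive" if pred == 1 else "Negative")
```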

Output Interpretation

The model outputs:

  • 0 → Negative tweet
  • 1 → Positive tweet
