Sentiment Classification Pipeline

A multi-model text classification pipeline for binary sentiment analysis. Trains four classifiers on bag-of-words features, evaluates on a held-out validation set, and generates test-set submission files.


Table of Contents

  • Overview
  • Project Structure
  • Requirements
  • Quickstart
  • Data Format
  • Pipeline Walkthrough
  • Models
  • Output Files
  • Configuration
  • Reproducing Results

Overview

This pipeline takes raw text reviews, vectorizes them with a bigram bag-of-words representation, and trains four classifiers in sequence. Each model is evaluated on a validation split before generating predictions on the held-out test set. Submission files are produced for all four models.

The full run — from raw .tsv files to submission CSVs — is a single command.


Project Structure

.
├── data/
│   ├── train.tsv                           # Training data (required)
│   ├── test.tsv                            # Test data (required)
│   ├── train_X.npz                         # Vectorized training features (generated)
│   ├── valid_X.npz                         # Vectorized validation features (generated)
│   ├── y_train.csv                         # Training labels (generated)
│   ├── y_valid.csv                         # Validation labels (generated)
│   ├── logistic_regression_results.csv     # (generated)
│   ├── random_forest_results.csv           # (generated)
│   ├── svm_results.csv                     # (generated)
│   ├── gradient_boosting_results.csv       # (generated)
│   ├── submission_logistic_regression.csv  # (generated)
│   ├── submission_random_forest.csv        # (generated)
│   ├── submission_svm.csv                  # (generated)
│   └── submission_gradient_boosting.csv    # (generated)
├── LICENSE
├── Makefile
├── Makefile.save
├── README.md
├── memo.md
├── pipeline.py
├── requirements.txt
└── sample-submission.csv

Requirements

Python 3.8+

pandas
scikit-learn
scipy

Install dependencies:

pip install pandas scikit-learn scipy

Quickstart

  1. Place your data files in the data/ directory (see Data Format).
  2. Run the pipeline:
python pipeline.py

That's it. All intermediate artifacts and submission files will be written to data/.


Data Format

data/train.tsv — tab-separated, must contain these two columns:

| Column | Type    | Description             |
|--------|---------|-------------------------|
| review | string  | Raw text of the review  |
| label  | int/str | Binary class label      |

data/test.tsv — tab-separated, must contain:

| Column | Type    | Description                                      |
|--------|---------|--------------------------------------------------|
| id     | int/str | Unique row identifier (optional but recommended) |
| review | string  | Raw text of the review                           |

Rows with missing values in required columns are dropped automatically.
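
A quick way to sanity-check your own files against this format, shown as a hedged pandas sketch (the exact loading code lives in pipeline.py and may differ):

import pandas as pd

# Tab-separated files, as described above.
train = pd.read_csv("data/train.tsv", sep="\t")
test = pd.read_csv("data/test.tsv", sep="\t")

# The required columns must be present.
assert {"review", "label"} <= set(train.columns)
assert "review" in test.columns

# The pipeline drops rows with missing required values; preview that effect here.
print(len(train), "->", len(train.dropna(subset=["review", "label"])))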


Pipeline Walkthrough

The main() function executes the following steps in order (a condensed sketch of steps 2 and 3 follows the list):

1. Data preparation — Loads train.tsv, applies an 80/20 stratified random split, and vectorizes both splits using a CountVectorizer with bigrams. Sparse matrices and label files are saved to disk.

2. Individual model evaluation — Each of the four classifiers is trained on the training split and scored on the validation split. Accuracy is written to a per-model results CSV.

3. Submission generation — All four models are retrained on the full training set. Cross-validation scores are printed to stdout. Test-set predictions are written to submission CSVs.
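
A condensed sketch of steps 2 and 3 for a single model (logistic regression). It assumes the vectorized features and label files saved in step 1; variable and file names follow the conventions above rather than the exact helpers in pipeline.py.

import pandas as pd
from scipy import sparse
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score

# Step 2: train on the training split and score on the validation split.
X_train = sparse.load_npz("data/train_X.npz")
X_valid = sparse.load_npz("data/valid_X.npz")
y_train = pd.read_csv("data/y_train.csv").squeeze("columns")
y_valid = pd.read_csv("data/y_valid.csv").squeeze("columns")

model = LogisticRegression(max_iter=2000)
model.fit(X_train, y_train)
accuracy = accuracy_score(y_valid, model.predict(X_valid))
pd.DataFrame([{"Model": "Logistic Regression", "Accuracy": accuracy}]).to_csv(
    "data/logistic_regression_results.csv", index=False
)

# Step 3 (partial): cross-validation scores go to stdout before submissions are written.
scores = cross_val_score(model, X_train, y_train, cv=5)
print(f"Logistic Regression CV accuracy: {scores.mean():.4f}")

The submission-writing half of step 3 is sketched under Output Files below.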


Models

| Model               | Library Class                      | CV Folds | Notes                             |
|---------------------|------------------------------------|----------|-----------------------------------|
| Logistic Regression | LogisticRegression                 | 5        | max_iter=2000                     |
| Linear SVM          | LinearSVC + CalibratedClassifierCV | 3        | Calibrated for probability output |
| Random Forest       | RandomForestClassifier             | 3        | 200 trees, all cores              |
| Gradient Boosting   | GradientBoostingClassifier         | 3        | 100 estimators, depth 3           |

Slow models (SVM, Random Forest, Gradient Boosting) use 3-fold CV during submission generation to keep runtime reasonable.
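
In scikit-learn terms, the table corresponds roughly to the constructors below; any argument not listed in the table (for example, the CalibratedClassifierCV settings) is an assumption rather than a guarantee about pipeline.py.

from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

models = {
    # Fast model: 5-fold CV during submission generation.
    "logistic_regression": LogisticRegression(max_iter=2000),
    # LinearSVC has no predict_proba, so it is wrapped for calibrated probabilities.
    "svm": CalibratedClassifierCV(LinearSVC()),
    # 200 trees, using all available cores.
    "random_forest": RandomForestClassifier(n_estimators=200, n_jobs=-1),
    # 100 boosting stages with depth-3 trees.
    "gradient_boosting": GradientBoostingClassifier(n_estimators=100, max_depth=3),
}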


Output Files

After a full run, the following files will be present in data/:

Validation results (one row each, columns: Model, Accuracy):

  • logistic_regression_results.csv
  • random_forest_results.csv
  • svm_results.csv
  • gradient_boosting_results.csv

Submission files (columns: id, label):

  • submission_logistic_regression.csv
  • submission_random_forest.csv
  • submission_svm.csv
  • submission_gradient_boosting.csv

For models with predict_proba support, the label column contains soft probability scores. For models without it, hard class predictions are used.
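
A sketch of that branching; write_submission is an illustrative helper name, not necessarily a function in pipeline.py.

import pandas as pd

def write_submission(model, X_test, ids, path):
    """Write an id/label CSV, preferring soft probability scores when available."""
    if hasattr(model, "predict_proba"):
        # Probability of the positive class (soft score).
        labels = model.predict_proba(X_test)[:, 1]
    else:
        # Fall back to hard class predictions.
        labels = model.predict(X_test)
    pd.DataFrame({"id": ids, "label": labels}).to_csv(path, index=False)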


Configuration

All tunable constants live in the Settings dataclass at the top of pipeline.py:

from dataclasses import dataclass
from pathlib import Path

@dataclass(frozen=True)
class Settings:
    data_dir: Path = Path("data")
    train_file: Path = Path("data/train.tsv")
    test_file: Path = Path("data/test.tsv")

    train_ratio: float = 0.8      # Fraction of training data used for training split
    random_seed: int = 100        # Controls the train/validation split

    cv_default: int = 5           # CV folds for fast models
    cv_slow: int = 3              # CV folds for slow models

The vectorizer is configured inline inside vectorize_and_save():

CountVectorizer(max_features=50000, ngram_range=(1, 2), min_df=2)

Unigrams and bigrams, capped at 50,000 features, with a minimum document frequency of 2.
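
For reference, a minimal sketch of how this vectorizer produces the sparse feature files listed under Project Structure; the placeholder texts stand in for the review columns of the 80/20 split.

from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

# Placeholders: in the pipeline these are the review columns of the train/validation split.
train_texts = ["great product and great value", "bad product, very disappointed"]
valid_texts = ["a decent product overall"]

vectorizer = CountVectorizer(max_features=50000, ngram_range=(1, 2), min_df=2)

# Fit the vocabulary on the training split only, then reuse it for the validation split.
X_train = vectorizer.fit_transform(train_texts)
X_valid = vectorizer.transform(valid_texts)

sparse.save_npz("data/train_X.npz", X_train)
sparse.save_npz("data/valid_X.npz", X_valid)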


Reproducing Results

To get identical results across runs, make sure random_seed in Settings is unchanged (default: 100) and that the input files have not been shuffled. The train/validation split, all model random_state parameters, and the vectorizer are all deterministic given the same seed and input data.

# Clean run from scratch
rm -rf data/train_X.npz data/valid_X.npz data/y_train.csv data/y_valid.csv
python pipeline.py
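
To convince yourself the split itself is seed-stable, here is a hedged check that assumes the split is made with scikit-learn's train_test_split as described above.

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("data/train.tsv", sep="\t").dropna(subset=["review", "label"])

def split(seed=100):
    return train_test_split(
        df["review"], df["label"], train_size=0.8, stratify=df["label"], random_state=seed
    )

first, second = split(), split()
# Same seed and same input data -> identical train/validation membership.
assert first[0].index.equals(second[0].index)
assert first[1].index.equals(second[1].index)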

About

Classifies Amazon product reviews as positive or negative sentiment for the SP26 INLS 642 Kaggle competition.
