A multi-model text classification pipeline for binary sentiment analysis. Trains four classifiers on bag-of-words features, evaluates on a held-out validation set, and generates test-set submission files.
- Overview
- Project Structure
- Requirements
- Quickstart
- Data Format
- Pipeline Walkthrough
- Models
- Output Files
- Configuration
- Reproducing Results
## Overview

This pipeline takes raw text reviews, vectorizes them with a bigram bag-of-words representation, and trains four classifiers in sequence. Each model is evaluated on a validation split before generating predictions on the held-out test set. Submission files are produced for all four models.
The full run — from raw .tsv files to submission CSVs — is a single command.
## Project Structure

```
.
├── data/
│   ├── train.tsv                           # Training data (required)
│   ├── test.tsv                            # Test data (required)
│   ├── train_X.npz                         # Vectorized training features (generated)
│   ├── valid_X.npz                         # Vectorized validation features (generated)
│   ├── y_train.csv                         # Training labels (generated)
│   ├── y_valid.csv                         # Validation labels (generated)
│   ├── logistic_regression_results.csv     # (generated)
│   ├── random_forest_results.csv           # (generated)
│   ├── svm_results.csv                     # (generated)
│   ├── gradient_boosting_results.csv       # (generated)
│   ├── submission_logistic_regression.csv  # (generated)
│   ├── submission_random_forest.csv        # (generated)
│   ├── submission_svm.csv                  # (generated)
│   └── submission_gradient_boosting.csv    # (generated)
├── LICENSE
├── Makefile
├── Makefile.save
├── README.md
├── memo.md
├── pipeline.py
├── requirements.txt
└── sample-submission.csv
```
## Requirements

- Python 3.8+
- pandas
- scikit-learn
- scipy
## Quickstart

1. Install dependencies:

   ```sh
   pip install pandas scikit-learn scipy
   ```

2. Place your data files in the `data/` directory (see Data Format).

3. Run the pipeline:

   ```sh
   python pipeline.py
   ```

That's it. All intermediate artifacts and submission files will be written to `data/`.
## Data Format

`data/train.tsv` — tab-separated, must contain these two columns:

| Column | Type | Description |
|---|---|---|
| `review` | string | Raw text of the review |
| `label` | int/str | Binary class label |

`data/test.tsv` — tab-separated, must contain:

| Column | Type | Description |
|---|---|---|
| `id` | int/str | Unique row identifier (optional but recommended) |
| `review` | string | Raw text of the review |

Rows with missing values in required columns are dropped automatically.
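The row-dropping behavior can be sketched with pandas; the inline sample here is hypothetical, standing in for `data/train.tsv`:

```python
import pandas as pd
from io import StringIO

# Hypothetical inline sample standing in for data/train.tsv;
# the middle row is missing its review text.
raw = "review\tlabel\ngreat movie\t1\n\t0\nterrible film\t0\n"
df = pd.read_csv(StringIO(raw), sep="\t")

# Mirror the pipeline's behavior: drop rows missing a required column.
clean = df.dropna(subset=["review", "label"])
```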
## Pipeline Walkthrough

The `main()` function executes the following steps in order:

1. Data preparation — loads `train.tsv`, applies an 80/20 stratified random split, and vectorizes both splits using a `CountVectorizer` with bigrams. Sparse matrices and label files are saved to disk.
2. Individual model evaluation — each of the four classifiers is trained on the training split and scored on the validation split. Accuracy is written to a per-model results CSV.
3. Submission generation — all four models are retrained on the full training set. Cross-validation scores are printed to stdout. Test-set predictions are written to submission CSVs.
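The first two steps can be sketched with scikit-learn as follows; the toy data and variable names are illustrative, not the pipeline's own identifiers:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

texts = ["good fun", "great movie", "bad film", "awful mess", "good movie", "bad mess"]
labels = [1, 1, 0, 0, 1, 0]

# Step 1: stratified 80/20 split, then bigram bag-of-words features.
X_tr, X_va, y_tr, y_va = train_test_split(
    texts, labels, train_size=0.8, stratify=labels, random_state=100
)
vec = CountVectorizer(ngram_range=(1, 2))
X_tr_vec = vec.fit_transform(X_tr)  # vocabulary is fit on the training split only
X_va_vec = vec.transform(X_va)

# Step 2: train one of the models and score it on the validation split.
clf = LogisticRegression(max_iter=2000).fit(X_tr_vec, y_tr)
accuracy = clf.score(X_va_vec, y_va)
```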
## Models

| Model | Library Class | CV Folds | Notes |
|---|---|---|---|
| Logistic Regression | `LogisticRegression` | 5 | `max_iter=2000` |
| Linear SVM | `LinearSVC` + `CalibratedClassifierCV` | 3 | Calibrated for probability output |
| Random Forest | `RandomForestClassifier` | 3 | 200 trees, all cores |
| Gradient Boosting | `GradientBoostingClassifier` | 3 | 100 estimators, depth 3 |

Slow models (SVM, Random Forest, Gradient Boosting) use 3-fold CV during submission generation to keep runtime reasonable.
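The calibrated Linear SVM with reduced-fold CV can be sketched like this, on synthetic data; the fold counts mirror the table above:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

texts = ["good great", "fine nice", "lovely good", "bad awful", "awful poor", "bad poor"]
labels = [1, 1, 1, 0, 0, 0]
X = CountVectorizer(ngram_range=(1, 2)).fit_transform(texts)

# LinearSVC has no predict_proba; wrapping it in CalibratedClassifierCV
# fits a calibrator so probability scores become available.
svm = CalibratedClassifierCV(LinearSVC(), cv=2)
scores = cross_val_score(svm, X, labels, cv=3)  # 3 folds for the slow model
```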
## Output Files

After a full run, the following files will be present in `data/`:

Validation results (one row each, columns: `Model`, `Accuracy`):

- `logistic_regression_results.csv`
- `random_forest_results.csv`
- `svm_results.csv`
- `gradient_boosting_results.csv`

Submission files (columns: `id`, `label`):

- `submission_logistic_regression.csv`
- `submission_random_forest.csv`
- `submission_svm.csv`
- `submission_gradient_boosting.csv`

For models with `predict_proba` support, the `label` column contains soft probability scores. For models without it, hard class predictions are used.
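The soft-vs-hard rule can be sketched as below; `submission_scores` is an illustrative helper, not the pipeline's actual API:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

def submission_scores(model, X):
    # Use probability scores when available, otherwise hard predictions.
    if hasattr(model, "predict_proba"):
        return model.predict_proba(X)[:, 1]  # P(label == 1), a soft score
    return model.predict(X)                  # hard 0/1 class labels

X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]
soft = submission_scores(LogisticRegression().fit(X, y), X)
hard = submission_scores(LinearSVC().fit(X, y), X)
```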
## Configuration

All tunable constants live in the `Settings` dataclass at the top of `pipeline.py`:

```python
@dataclass(frozen=True)
class Settings:
    data_dir: Path = Path("data")
    train_file: Path = Path("data/train.tsv")
    test_file: Path = Path("data/test.tsv")
    train_ratio: float = 0.8  # Fraction of training data used for training split
    random_seed: int = 100    # Controls the train/validation split
    cv_default: int = 5       # CV folds for fast models
    cv_slow: int = 3          # CV folds for slow models
```

The vectorizer is configured inline inside `vectorize_and_save()`:

```python
CountVectorizer(max_features=50000, ngram_range=(1, 2), min_df=2)
```

Unigrams and bigrams, capped at 50k features, minimum document frequency of 2.
## Reproducing Results

To get identical results across runs, keep `random_seed` in `Settings` at its default (100) and do not shuffle the input files. The train/validation split, all model `random_state` parameters, and the vectorizer are all deterministic given the same seed and input data.
```sh
# Clean run from scratch
rm -rf data/train_X.npz data/valid_X.npz data/y_train.csv data/y_valid.csv
python pipeline.py
```
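A quick determinism check in the spirit of the note above: with the same seed and unshuffled inputs, the stratified split comes out identical on every run (the data here is synthetic).

```python
from sklearn.model_selection import train_test_split

texts = [f"review {i}" for i in range(10)]
labels = [0, 1] * 5

def split(seed):
    # Same seed + same input order => same train/validation split.
    return train_test_split(texts, labels, train_size=0.8,
                            stratify=labels, random_state=seed)

run_a = split(100)
run_b = split(100)
```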