A multi-model text classification pipeline for binary sentiment analysis. Trains four classifiers on bag-of-words features, evaluates on a held-out validation set, and generates test-set submission files.
- Overview
- Project Structure
- Requirements
- Quickstart
- Data Format
- Pipeline Walkthrough
- Models
- Output Files
- Configuration
- Reproducing Results
## Overview

This pipeline takes raw text reviews, vectorizes them with a bigram bag-of-words representation, and trains four classifiers in sequence. Each model is evaluated on a validation split before generating predictions on the held-out test set. Submission files are produced for all four models.
The full run — from raw .tsv files to submission CSVs — is a single command.
## Project Structure

```
.
├── data/
│   ├── train.tsv                           # Training data (required)
│   ├── test.tsv                            # Test data (required)
│   ├── train_X.npz                         # Vectorized training features (generated)
│   ├── valid_X.npz                         # Vectorized validation features (generated)
│   ├── y_train.csv                         # Training labels (generated)
│   ├── y_valid.csv                         # Validation labels (generated)
│   ├── logistic_regression_results.csv     # (generated)
│   ├── random_forest_results.csv           # (generated)
│   ├── svm_results.csv                     # (generated)
│   ├── gradient_boosting_results.csv       # (generated)
│   ├── submission_logistic_regression.csv  # (generated)
│   ├── submission_random_forest.csv        # (generated)
│   ├── submission_svm.csv                  # (generated)
│   └── submission_gradient_boosting.csv    # (generated)
├── LICENSE
├── Makefile
├── Makefile.save
├── README.md
├── memo.md
├── pipeline.py
├── requirements.txt
└── sample-submission.csv
```
## Requirements

- Python 3.8+
- pandas
- scikit-learn
- scipy
## Quickstart

1. Install dependencies:

   ```sh
   pip install pandas scikit-learn scipy
   ```

2. Place your data files in the `data/` directory (see Data Format).

3. Run the pipeline:

   ```sh
   python pipeline.py
   ```

That's it. All intermediate artifacts and submission files will be written to `data/`.
## Data Format

`data/train.tsv` — tab-separated, must contain these two columns:

| Column | Type | Description |
|---|---|---|
| `review` | string | Raw text of the review |
| `label` | int/str | Binary class label |

`data/test.tsv` — tab-separated, must contain:

| Column | Type | Description |
|---|---|---|
| `id` | int/str | Unique row identifier (optional but recommended) |
| `review` | string | Raw text of the review |

Rows with missing values in required columns are dropped automatically.
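The row-dropping behavior can be sketched with pandas; the inline sample here is hypothetical, standing in for `data/train.tsv`:

```python
import pandas as pd
from io import StringIO

# Hypothetical inline sample standing in for data/train.tsv;
# the middle row is missing its review text.
raw = "review\tlabel\ngreat movie\t1\n\t0\nterrible film\t0\n"
df = pd.read_csv(StringIO(raw), sep="\t")

# Mirror the pipeline's behavior: drop rows missing a required column.
clean = df.dropna(subset=["review", "label"])
```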
## Pipeline Walkthrough

The `main()` function executes the following steps in order:

1. Data preparation — loads `train.tsv`, applies an 80/20 stratified random split, and vectorizes both splits using a `CountVectorizer` with bigrams. Sparse matrices and label files are saved to disk.
2. Individual model evaluation — each of the four classifiers is trained on the training split and scored on the validation split. Accuracy is written to a per-model results CSV.
3. Submission generation — all four models are retrained on the full training set. Cross-validation scores are printed to stdout. Test-set predictions are written to submission CSVs.
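The first two steps can be sketched with scikit-learn as follows; the toy data and variable names are illustrative, not the pipeline's own identifiers:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

texts = ["good fun", "great movie", "bad film", "awful mess", "good movie", "bad mess"]
labels = [1, 1, 0, 0, 1, 0]

# Step 1: stratified 80/20 split, then bigram bag-of-words features.
X_tr, X_va, y_tr, y_va = train_test_split(
    texts, labels, train_size=0.8, stratify=labels, random_state=100
)
vec = CountVectorizer(ngram_range=(1, 2))
X_tr_vec = vec.fit_transform(X_tr)  # vocabulary is fit on the training split only
X_va_vec = vec.transform(X_va)

# Step 2: train one of the models and score it on the validation split.
clf = LogisticRegression(max_iter=2000).fit(X_tr_vec, y_tr)
accuracy = clf.score(X_va_vec, y_va)
```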
## Models

| Model | Library Class | CV Folds | Notes |
|---|---|---|---|
| Logistic Regression | `LogisticRegression` | 5 | `max_iter=2000` |
| Linear SVM | `LinearSVC` + `CalibratedClassifierCV` | 3 | Calibrated for probability output |
| Random Forest | `RandomForestClassifier` | 3 | 200 trees, all cores |
| Gradient Boosting | `GradientBoostingClassifier` | 3 | 100 estimators, depth 3 |

Slow models (SVM, Random Forest, Gradient Boosting) use 3-fold CV during submission generation to keep runtime reasonable.
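The calibrated Linear SVM with reduced-fold CV can be sketched like this, on synthetic data; the fold counts mirror the table above:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

texts = ["good great", "fine nice", "lovely good", "bad awful", "awful poor", "bad poor"]
labels = [1, 1, 1, 0, 0, 0]
X = CountVectorizer(ngram_range=(1, 2)).fit_transform(texts)

# LinearSVC has no predict_proba; wrapping it in CalibratedClassifierCV
# fits a calibrator so probability scores become available.
svm = CalibratedClassifierCV(LinearSVC(), cv=2)
scores = cross_val_score(svm, X, labels, cv=3)  # 3 folds for the slow model
```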
## Output Files

After a full run, the following files will be present in `data/`:

Validation results (one row each, columns: `Model`, `Accuracy`):

- `logistic_regression_results.csv`
- `random_forest_results.csv`
- `svm_results.csv`
- `gradient_boosting_results.csv`

Submission files (columns: `id`, `label`):

- `submission_logistic_regression.csv`
- `submission_random_forest.csv`
- `submission_svm.csv`
- `submission_gradient_boosting.csv`

For models with `predict_proba` support, the `label` column contains soft probability scores. For models without it, hard class predictions are used.
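The soft-vs-hard rule can be sketched as below; `submission_scores` is an illustrative helper, not the pipeline's actual API:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

def submission_scores(model, X):
    # Use probability scores when available, otherwise hard predictions.
    if hasattr(model, "predict_proba"):
        return model.predict_proba(X)[:, 1]  # P(label == 1), a soft score
    return model.predict(X)                  # hard 0/1 class labels

X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]
soft = submission_scores(LogisticRegression().fit(X, y), X)
hard = submission_scores(LinearSVC().fit(X, y), X)
```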
## Configuration

All tunable constants live in the `Settings` dataclass at the top of `pipeline.py`:

```python
@dataclass(frozen=True)
class Settings:
    data_dir: Path = Path("data")
    train_file: Path = Path("data/train.tsv")
    test_file: Path = Path("data/test.tsv")
    train_ratio: float = 0.8  # Fraction of training data used for training split
    random_seed: int = 100    # Controls the train/validation split
    cv_default: int = 5       # CV folds for fast models
    cv_slow: int = 3          # CV folds for slow models
```

The vectorizer is configured inline inside `vectorize_and_save()`:

```python
CountVectorizer(max_features=50000, ngram_range=(1, 2), min_df=2)
```

Unigrams and bigrams, capped at 50k features, minimum document frequency of 2.
## Reproducing Results

To get identical results across runs, keep `random_seed` in `Settings` at its default (100) and do not shuffle the input files. The train/validation split, all model `random_state` parameters, and the vectorizer are all deterministic given the same seed and input data.
```sh
# Clean run from scratch
rm -rf data/train_X.npz data/valid_X.npz data/y_train.csv data/y_valid.csv
python pipeline.py
```
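A quick determinism check in the spirit of the note above: with the same seed and unshuffled inputs, the stratified split comes out identical on every run (the data here is synthetic).

```python
from sklearn.model_selection import train_test_split

texts = [f"review {i}" for i in range(10)]
labels = [0, 1] * 5

def split(seed):
    # Same seed + same input order => same train/validation split.
    return train_test_split(texts, labels, train_size=0.8,
                            stratify=labels, random_state=seed)

run_a = split(100)
run_b = split(100)
```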