Customer Review Sentiment Classification

A complete machine learning pipeline for sentiment classification of customer reviews, from synthetic data generation to model training and evaluation.

Overview

This project demonstrates an end-to-end workflow for building a sentiment classification model using synthetic data and few-shot learning. The pipeline consists of three main stages:

1. Synthetic Data Generation: Generate realistic customer review datasets using LLMs
2. Data Preprocessing & EDA: Analyze and prepare data for training
3. Model Training & Evaluation: Train and evaluate a sentiment classifier using SetFit

Notebooks

1. Synthetic Data Collection (`synthetic_data_collection.ipynb`)

Generates synthetic customer review data using NVIDIA Data Designer and LLMs.

What it does:

- Creates structured data columns (product categories, subcategories, customer demographics, star ratings)
- Uses an LLM (GPT-OSS-120B) to generate product names and customer reviews
- Generates realistic customer personas with demographics
- Exports the dataset to CSV format

Key outputs:

- `synthetic_reviews_dataset_large.csv`: Generated dataset with reviews, ratings, and metadata

Configuration:

- Product categories: Electronics, Clothing, Home & Kitchen, Books, Home Office
- Star ratings: 1-5 stars
- Review styles: rambling, brief, detailed, structured with bullet points
- Customer age ranges: 18-25, 25-35, 35-50, 50-65, 65+
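
The configuration above amounts to a handful of categorical value sets. As a rough illustration only, the sketch below draws one structured record from those sets using Python's standard library. It does not use NVIDIA Data Designer's API, and the field names (`category`, `stars`, `style`, `age_range`) and the uniform sampling are assumptions, not taken from the notebook.

```python
import random

# Value sets copied from the configuration list above.
PRODUCT_CATEGORIES = ["Electronics", "Clothing", "Home & Kitchen", "Books", "Home Office"]
STAR_RATINGS = [1, 2, 3, 4, 5]
REVIEW_STYLES = ["rambling", "brief", "detailed", "structured with bullet points"]
AGE_RANGES = ["18-25", "25-35", "35-50", "50-65", "65+"]


def sample_review_seed(rng: random.Random) -> dict:
    """Draw one structured record; the field names are hypothetical."""
    return {
        "category": rng.choice(PRODUCT_CATEGORIES),
        "stars": rng.choice(STAR_RATINGS),
        "style": rng.choice(REVIEW_STYLES),
        "age_range": rng.choice(AGE_RANGES),
    }


seed_record = sample_review_seed(random.Random(0))
print(seed_record)
```

A record like this would then serve as the context an LLM is prompted with to write the product name and review text.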

2. Data Preprocessing and EDA (`data_preprocessing_and_eda.ipynb`)

Processes the synthetic dataset and prepares it for model training.

What it does:

- Loads the synthetic reviews dataset
- Performs exploratory data analysis on categorical distributions
- Converts star ratings to binary sentiment labels (negative: < 3 stars, positive: >= 3 stars)
- Splits data into train/validation sets (80/20 split)
- Converts to HuggingFace Dataset format
- Saves processed datasets to disk
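
The labeling rule and the 80/20 split can be sketched with the standard library alone. This is an illustration under assumptions, not the notebook's code: the column names (`text`, `stars`, `label`), the 0/1 label encoding, the seed, and the toy rows are all hypothetical, and the notebook itself works with HuggingFace Datasets.

```python
import random


def stars_to_label(stars: int) -> int:
    """Map a 1-5 star rating to a binary label: 0 = negative (< 3), 1 = positive (>= 3)."""
    if not 1 <= stars <= 5:
        raise ValueError(f"star rating out of range: {stars}")
    return 0 if stars < 3 else 1


def train_validation_split(rows: list, train_fraction: float = 0.8, seed: int = 42):
    """Shuffle a copy of the rows and cut it into train and validation parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical rows standing in for the CSV contents.
rows = [{"text": f"review {i}", "stars": 1 + i % 5} for i in range(10)]
for row in rows:
    row["label"] = stars_to_label(row["stars"])

train_rows, val_rows = train_validation_split(rows)
print(len(train_rows), len(val_rows))
```

Note that under this rule a 3-star review counts as positive, so the positive class covers three of the five rating values.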

Key outputs:

- `data/synthetic_reviews_hf_ds_splits/`: Processed dataset splits ready for training

Visualizations:

- Distribution plots for product categories, age ranges, and star ratings

3. Model Training and Evaluation (`model_training_and_evaluation.ipynb`)

Trains a sentiment classification model using SetFit for few-shot learning.

What it does:

- Loads the preprocessed dataset splits
- Initializes a SetFit model with the `sentence-transformers/all-MiniLM-L6-v2` backbone
- Trains on minimal data (16 examples) using few-shot learning
- Evaluates model performance with accuracy metrics and a confusion matrix
- Demonstrates real-time inference on new text

Training configuration:

- Base model: all-MiniLM-L6-v2
- Training samples: 16 examples
- Batch size: 32
- Epochs: 4
- Evaluation: once per epoch, keeping the best model checkpoint
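
The README gives the size of the few-shot training set (16 examples) but not how those examples are chosen. One plausible reading, assumed here, is a class-balanced draw of 8 examples per label. The sketch below shows that selection with the standard library; the `label` field, the seed, and the toy pool are hypothetical. SetFit also ships a `sample_dataset` helper for this purpose, though whether the notebook uses it is not stated.

```python
import random
from collections import defaultdict


def sample_few_shot(rows: list, per_class: int = 8, seed: int = 42) -> list:
    """Pick `per_class` examples for each label value, without replacement."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row["label"]].append(row)

    rng = random.Random(seed)
    selected = []
    for label in sorted(by_label):
        candidates = by_label[label]
        if len(candidates) < per_class:
            raise ValueError(f"label {label} has only {len(candidates)} examples")
        selected.extend(rng.sample(candidates, per_class))
    rng.shuffle(selected)  # avoid feeding the trainer one class at a time
    return selected


# Hypothetical training pool: 50 negative and 50 positive reviews.
pool = [{"text": f"review {i}", "label": i % 2} for i in range(100)]
few_shot = sample_few_shot(pool)
print(len(few_shot))
```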

Results:

- Validation accuracy: ~75%
- Supports real-time sentiment classification
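
For reference, the two evaluation outputs mentioned above (accuracy and a confusion matrix) can be computed as in the following sketch. The labels and predictions are toy values chosen for illustration and are unrelated to the reported result; the 0 = negative, 1 = positive encoding is an assumption.

```python
def confusion_matrix(y_true: list, y_pred: list) -> list:
    """Return a 2x2 matrix; rows are true labels, columns are predicted labels."""
    matrix = [[0, 0], [0, 0]]
    for true, pred in zip(y_true, y_pred):
        matrix[true][pred] += 1
    return matrix


def accuracy(y_true: list, y_pred: list) -> float:
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Toy labels and predictions (0 = negative, 1 = positive).
y_true = [0, 0, 0, 1, 1, 1, 1, 1, 0, 1]
y_pred = [0, 1, 0, 1, 1, 0, 1, 1, 0, 1]

print(confusion_matrix(y_true, y_pred))
print(accuracy(y_true, y_pred))
```

Reading the matrix row by row separates the two error types: negatives predicted as positive sit in the first row, positives predicted as negative in the second.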

Quick Start

Prerequisites

- Python 3.12+
- NVIDIA GPU (recommended)
- NVIDIA NGC account (for Data Designer)

Setup

1. Install dependencies:
```
uv sync
```
2. Set up the NVIDIA NGC CLI (for data generation):
- Create an account at https://catalog.ngc.nvidia.com/
- Install the CLI: https://org.ngc.nvidia.com/setup/installers/cli
- Configure credentials
3. Configure the environment:
```
# Create .env file with your API keys
```
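
The README does not name the variables the notebooks expect, so the sketch below only illustrates the usual `KEY=value` layout of a `.env` file and one way to load it. `EXAMPLE_API_KEY` is a hypothetical placeholder, not a name the project defines, and the loader is a minimal stand-in for a package such as python-dotenv.

```python
import os
import tempfile


def load_env_file(path: str) -> dict:
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values = {}
    with open(path, encoding="utf-8") as handle:
        for raw_line in handle:
            line = raw_line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip().strip('"').strip("'")
    return values


# Demonstrate with a throwaway file; EXAMPLE_API_KEY is a hypothetical name.
with tempfile.TemporaryDirectory() as tmp_dir:
    env_path = os.path.join(tmp_dir, ".env")
    with open(env_path, "w", encoding="utf-8") as handle:
        handle.write("# API credentials\nEXAMPLE_API_KEY=replace-me\n")
    env_values = load_env_file(env_path)

os.environ.update(env_values)  # make the values visible to later code
print(sorted(env_values))
```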

Usage

Run the notebooks in sequence:

1. Generate data: `jupyter notebook DataDesigner/src/notebooks/synthetic_data_collection.ipynb`
2. Preprocess data: `jupyter notebook notebooks/data_preprocessing_and_eda.ipynb`
3. Train model: `jupyter notebook notebooks/model_training_and_evaluation.ipynb`

Project Structure

shared/
├── notebooks/
│   ├── synthetic_data_collection.ipynb      # Data generation
│   ├── data_preprocessing_and_eda.ipynb      # Data preprocessing
│   └── model_training_and_evaluation.ipynb  # Model training
└── README.md

Key Technologies

- NVIDIA Data Designer: Synthetic data generation
- SetFit: Few-shot learning framework
- HuggingFace Datasets: Data management
- Sentence Transformers: Text embeddings
- PyTorch: Deep learning backend

Workflow

Generate Synthetic Data
    ↓
Preprocess & Analyze
    ↓
Train Model (Few-Shot)
    ↓
Evaluate & Deploy

Notes

- The pipeline demonstrates few-shot learning, reaching ~75% validation accuracy with only 16 training examples
- Synthetic data generation requires NVIDIA NGC access and API credentials
- Model checkpoints are saved automatically during training
- Star ratings are converted to binary labels: < 3 stars = negative, >= 3 stars = positive
