A complete machine learning pipeline for sentiment classification of customer reviews, from synthetic data generation to model training and evaluation.
This project demonstrates an end-to-end workflow for building a sentiment classification model using synthetic data and few-shot learning. The pipeline consists of three main stages:
- Synthetic Data Generation - Generate realistic customer review datasets using LLMs
- Data Preprocessing & EDA - Analyze and prepare data for training
- Model Training & Evaluation - Train and evaluate a sentiment classifier using SetFit
Generates synthetic customer review data using NVIDIA Data Designer and LLMs.
What it does:
- Creates structured data columns (product categories, subcategories, customer demographics, star ratings)
- Uses an LLM (GPT-OSS-120B) to generate product names and customer reviews
- Generates realistic customer personas with demographics
- Exports dataset to CSV format
Key outputs:
- synthetic_reviews_dataset_large.csv - Generated dataset with reviews, ratings, and metadata
Configuration:
- Product categories: Electronics, Clothing, Home & Kitchen, Books, Home Office
- Star ratings: 1-5 stars
- Review styles: rambling, brief, detailed, structured with bullet points
- Customer age ranges: 18-25, 25-35, 35-50, 50-65, 65+
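The structured columns listed above can be pictured as independent categorical draws that seed each LLM prompt. The following is a minimal plain-Python sketch of that idea, not the NVIDIA Data Designer API: the function name, the field names, and the uniform sampling are all assumptions made for illustration.

```python
import random

# Values copied from the configuration list above.
PRODUCT_CATEGORIES = ["Electronics", "Clothing", "Home & Kitchen", "Books", "Home Office"]
STAR_RATINGS = [1, 2, 3, 4, 5]
REVIEW_STYLES = ["rambling", "brief", "detailed", "structured with bullet points"]
AGE_RANGES = ["18-25", "25-35", "35-50", "50-65", "65+"]


def sample_review_seed(rng: random.Random) -> dict:
    """Draw one row of structured columns (hypothetical field names, uniform sampling assumed)."""
    return {
        "product_category": rng.choice(PRODUCT_CATEGORIES),
        "star_rating": rng.choice(STAR_RATINGS),
        "review_style": rng.choice(REVIEW_STYLES),
        "customer_age_range": rng.choice(AGE_RANGES),
    }


rng = random.Random(0)  # fixed seed so the draws are repeatable
seeds = [sample_review_seed(rng) for _ in range(5)]
for seed in seeds:
    print(seed)
```

Each sampled row would then be passed to the LLM as context for generating a product name and a review in the chosen style.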
Processes the synthetic dataset and prepares it for model training.
What it does:
- Loads the synthetic reviews dataset
- Performs exploratory data analysis on categorical distributions
- Converts star ratings to binary sentiment labels (negative < 3 stars, positive >= 3 stars)
- Splits data into train/validation sets (80/20 split)
- Converts to HuggingFace Dataset format
- Saves processed datasets to disk
Key outputs:
- data/synthetic_reviews_hf_ds_splits/ - Processed dataset splits ready for training
Visualizations:
- Distribution plots for product categories, age ranges, and star ratings
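The label conversion and the 80/20 split described above can be sketched with the standard library alone. The column names (text, star_rating, label), the seed, and the use of a plain shuffled split rather than a stratified one are assumptions; the notebook may do this differently.

```python
import random


def rating_to_label(star_rating: int) -> int:
    """Map a star rating to a binary label: 0 = negative (< 3), 1 = positive (>= 3)."""
    return 1 if star_rating >= 3 else 0


def train_validation_split(rows, validation_fraction=0.2, seed=42):
    """Shuffle a copy of rows and split it into (train, validation) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_validation = int(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]


# Toy rows standing in for the CSV contents; column names are hypothetical.
reviews = [{"text": f"review {i}", "star_rating": (i % 5) + 1} for i in range(10)]
labeled = [{"text": r["text"], "label": rating_to_label(r["star_rating"])} for r in reviews]
train_rows, validation_rows = train_validation_split(labeled)
print(len(train_rows), len(validation_rows))
```

Lists of dictionaries like these can be turned into HuggingFace datasets with Dataset.from_list before saving to disk.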
Trains a sentiment classification model using SetFit for few-shot learning.
What it does:
- Loads preprocessed dataset splits
- Initializes SetFit model with sentence-transformers/all-MiniLM-L6-v2 backbone
- Trains on minimal data (16 examples) using few-shot learning
- Evaluates model performance with accuracy metrics and confusion matrix
- Demonstrates real-time inference on new text
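Few-shot training starts from a small labeled subset of the training split. The sketch below draws a class-balanced subset in plain Python. It assumes the 16 examples are split evenly (8 per class), which the description above does not state, and sample_few_shot is a hypothetical helper rather than a SetFit function.

```python
import random
from collections import defaultdict


def sample_few_shot(rows, per_class, seed=0):
    """Pick `per_class` random rows for each label value."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row["label"]].append(row)
    rng = random.Random(seed)
    sample = []
    for label in sorted(by_label):
        candidates = by_label[label]
        if len(candidates) < per_class:
            raise ValueError(f"label {label} has only {len(candidates)} rows")
        sample.extend(rng.sample(candidates, per_class))
    return sample


# Toy pool: 20 negative and 20 positive rows.
pool = [{"text": f"review {i}", "label": i % 2} for i in range(40)]
few_shot = sample_few_shot(pool, per_class=8)  # 8 per class is an assumption
print(len(few_shot))
```

Balancing the subset keeps a rare class from being missed entirely, which matters when only a handful of examples are used.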
Training configuration:
- Base model: all-MiniLM-L6-v2
- Training samples: 16 examples
- Batch size: 32
- Epochs: 4
- Evaluation: Per epoch with best model checkpointing
Results:
- Validation accuracy: ~75%
- Supports real-time sentiment classification
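For reference, accuracy and a binary confusion matrix can be computed as in the sketch below. The labels are made up for illustration and are not results from the notebook; the 0 = negative, 1 = positive encoding is an assumption.

```python
def confusion_matrix(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for binary labels (0 = negative, 1 = positive)."""
    matrix = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(a == p for a, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Made-up labels for illustration only; not results from the notebook.
y_true = [0, 0, 1, 1, 1, 0, 1, 1, 0, 1]
y_pred = [0, 1, 1, 1, 0, 0, 1, 1, 0, 1]
matrix = confusion_matrix(y_true, y_pred)
print(matrix)
print(accuracy(y_true, y_pred))
```

Rows of the matrix are true labels and columns are predictions, so the off-diagonal cells show which kind of error the classifier makes more often.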
- Python 3.12+
- NVIDIA GPU (recommended)
- NVIDIA NGC account (for Data Designer)
- Install dependencies:
  uv sync
- Set up NVIDIA NGC CLI (for data generation):
  - Create account at https://catalog.ngc.nvidia.com/
  - Install CLI: https://org.ngc.nvidia.com/setup/installers/cli
  - Configure credentials
- Configure environment:
  # Create .env file with your API keys
Run the notebooks in sequence:
- Generate data:
  jupyter notebook DataDesigner/src/notebooks/synthetic_data_collection.ipynb
- Preprocess data:
  jupyter notebook notebooks/data_preprocessing_and_eda.ipynb
- Train model:
  jupyter notebook notebooks/model_training_and_evaluation.ipynb
shared/
├── notebooks/
│ ├── synthetic_data_collection.ipynb # Data generation
│ ├── data_preprocessing_and_eda.ipynb # Data preprocessing
│ └── model_training_and_evaluation.ipynb # Model training
└── README.md
- NVIDIA Data Designer - Synthetic data generation
- SetFit - Few-shot learning framework
- HuggingFace Datasets - Data management
- Sentence Transformers - Text embeddings
- PyTorch - Deep learning backend
Generate Synthetic Data
↓
Preprocess & Analyze
↓
Train Model (Few-Shot)
↓
Evaluate & Deploy
- The pipeline demonstrates few-shot learning, achieving ~75% accuracy with only 16 training examples
- Synthetic data generation requires NVIDIA NGC access and API credentials
- Model checkpoints are saved automatically during training
- The binary classification converts star ratings: < 3 stars = negative, >= 3 stars = positive