
qriomatrix/handson-synthetic-ml

Customer Review Sentiment Classification

A complete machine learning pipeline for sentiment classification of customer reviews, from synthetic data generation to model training and evaluation.

Overview

This project demonstrates an end-to-end workflow for building a sentiment classification model using synthetic data and few-shot learning. The pipeline consists of three main stages:

  1. Synthetic Data Generation - Generate realistic customer review datasets using LLMs
  2. Data Preprocessing & EDA - Analyze and prepare data for training
  3. Model Training & Evaluation - Train and evaluate a sentiment classifier using SetFit

Notebooks

1. Synthetic Data Collection (synthetic_data_collection.ipynb)

Generates synthetic customer review data using NVIDIA Data Designer and LLMs.

What it does:

  • Creates structured data columns (product categories, subcategories, customer demographics, star ratings)
  • Uses an LLM (GPT-OSS-120B) to generate product names and customer reviews
  • Generates realistic customer personas with demographics
  • Exports dataset to CSV format

Key outputs:

  • synthetic_reviews_dataset_large.csv - Generated dataset with reviews, ratings, and metadata

Configuration:

  • Product categories: Electronics, Clothing, Home & Kitchen, Books, Home Office
  • Star ratings: 1-5 stars
  • Review styles: rambling, brief, detailed, structured with bullet points
  • Customer age ranges: 18-25, 25-35, 35-50, 50-65, 65+
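
As a rough illustration of how the structured columns above could seed an LLM prompt, here is a minimal standard-library sketch. It does not use the NVIDIA Data Designer API. The uniform sampling, the dictionary keys, and the `build_prompt` template are assumptions made for this example, not the notebook's actual code.

```python
import random

# Values copied from the configuration list above.
CATEGORIES = ["Electronics", "Clothing", "Home & Kitchen", "Books", "Home Office"]
STAR_RATINGS = [1, 2, 3, 4, 5]
REVIEW_STYLES = ["rambling", "brief", "detailed", "structured with bullet points"]
AGE_RANGES = ["18-25", "25-35", "35-50", "50-65", "65+"]


def sample_review_seed(rng: random.Random) -> dict:
    """Draw one combination of column values (uniform sampling is an assumption)."""
    return {
        "product_category": rng.choice(CATEGORIES),
        "star_rating": rng.choice(STAR_RATINGS),
        "review_style": rng.choice(REVIEW_STYLES),
        "age_range": rng.choice(AGE_RANGES),
    }


def build_prompt(seed: dict) -> str:
    """Hypothetical prompt template; the notebook's real prompts are not shown here."""
    return (
        f"Write a {seed['review_style']} customer review for a product in the "
        f"{seed['product_category']} category. The reviewer is aged "
        f"{seed['age_range']} and gives {seed['star_rating']} out of 5 stars."
    )


rng = random.Random(0)  # fixed seed so the sketch is reproducible
seeds = [sample_review_seed(rng) for _ in range(3)]
for seed in seeds:
    print(build_prompt(seed))
```

Sampling the structured columns first and generating text second keeps the star rating available as ground truth for the later labeling step.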

2. Data Preprocessing and EDA (data_preprocessing_and_eda.ipynb)

Processes the synthetic dataset and prepares it for model training.

What it does:

  • Loads the synthetic reviews dataset
  • Performs exploratory data analysis on categorical distributions
  • Converts star ratings to binary sentiment labels (negative < 3 stars, positive >= 3 stars)
  • Splits data into train/validation sets (80/20 split)
  • Converts to HuggingFace Dataset format
  • Saves processed datasets to disk
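
The labeling rule and the 80/20 split can be sketched with the standard library alone. The column names (`review`, `star_rating`, `label`), the shuffle seed, and the toy rows are assumptions for illustration; the notebook itself may use pandas or HuggingFace Datasets utilities instead.

```python
import random


def star_to_label(stars: int) -> int:
    """Map a 1-5 star rating to a binary label: 0 = negative (< 3), 1 = positive (>= 3)."""
    if not 1 <= stars <= 5:
        raise ValueError(f"star rating out of range: {stars}")
    return 1 if stars >= 3 else 0


def train_validation_split(rows: list, train_fraction: float = 0.8, seed: int = 42):
    """Shuffle a copy of rows and split it into (train, validation)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Toy rows standing in for the CSV; the column names are hypothetical.
rows = [{"review": f"review {i}", "star_rating": (i % 5) + 1} for i in range(10)]
for row in rows:
    row["label"] = star_to_label(row["star_rating"])

train_rows, validation_rows = train_validation_split(rows)
print(len(train_rows), len(validation_rows))  # prints: 8 2
```

Note that under this rule a 3-star review counts as positive. With the HuggingFace `datasets` library, lists like these can be wrapped with `Dataset.from_list` and written out with `DatasetDict.save_to_disk`; whether the notebook does exactly that is not shown in this README.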

Key outputs:

  • data/synthetic_reviews_hf_ds_splits/ - Processed dataset splits ready for training

Visualizations:

  • Distribution plots for product categories, age ranges, and star ratings

3. Model Training and Evaluation (model_training_and_evaluation.ipynb)

Trains a sentiment classification model using SetFit for few-shot learning.

What it does:

  • Loads preprocessed dataset splits
  • Initializes SetFit model with sentence-transformers/all-MiniLM-L6-v2 backbone
  • Trains on minimal data (16 examples) using few-shot learning
  • Evaluates model performance with accuracy metrics and confusion matrix
  • Demonstrates real-time inference on new text

Training configuration:

  • Base model: all-MiniLM-L6-v2
  • Training samples: 16 examples
  • Batch size: 32
  • Epochs: 4
  • Evaluation: once per epoch, keeping the best model checkpoint
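
To make the "16 examples" setting concrete, the sketch below draws a small balanced subset from a training split. It assumes 16 examples in total, split evenly across the two labels; the README does not say whether the notebook samples this way, and `sample_few_shot` is a hypothetical helper, not part of SetFit.

```python
import random
from collections import defaultdict


def sample_few_shot(rows: list, total: int = 16, label_key: str = "label", seed: int = 42) -> list:
    """Pick `total` rows with an equal number per label (balanced sampling is an assumption)."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)
    if not by_label:
        raise ValueError("rows must not be empty")

    per_label, remainder = divmod(total, len(by_label))
    if remainder:
        raise ValueError("total must be divisible by the number of labels")

    rng = random.Random(seed)
    sample = []
    for label in sorted(by_label):
        if len(by_label[label]) < per_label:
            raise ValueError(f"not enough rows for label {label!r}")
        # Sample without replacement within each label.
        sample.extend(rng.sample(by_label[label], per_label))
    rng.shuffle(sample)  # mix the labels so they are not grouped
    return sample


# Toy training rows; in the pipeline these would come from the preprocessed train split.
train_rows = [{"review": f"review {i}", "label": i % 2} for i in range(100)]
few_shot_rows = sample_few_shot(train_rows)
print(len(few_shot_rows))  # prints: 16
```

SetFit ships a `sample_dataset` helper that serves a similar purpose for HuggingFace datasets; the version above only illustrates the idea without that dependency.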

Results:

  • Validation accuracy: ~75%
  • Supports real-time sentiment classification
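
For reference, accuracy and a confusion matrix can be computed from true and predicted labels as in the sketch below. The label lists are made up for illustration and are not the notebook's results; the notebook may well use a library such as scikit-learn for this step.

```python
def confusion_matrix(y_true: list, y_pred: list, labels=(0, 1)) -> list:
    """Return matrix[i][j] = number of examples with true labels[i] predicted as labels[j]."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for truth, prediction in zip(y_true, y_pred):
        matrix[index[truth]][index[prediction]] += 1
    return matrix


def accuracy(y_true: list, y_pred: list) -> float:
    """Fraction of predictions that match the true labels."""
    if not y_true:
        raise ValueError("y_true must not be empty")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Made-up labels for illustration only (0 = negative, 1 = positive).
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 0, 1, 1, 0, 1, 1, 1]
print(accuracy(y_true, y_pred))          # prints: 0.8
print(confusion_matrix(y_true, y_pred))  # prints: [[3, 1], [1, 5]]
```

Rows of the matrix are true labels and columns are predicted labels, so the off-diagonal cells count negative reviews predicted as positive and positive reviews predicted as negative.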

Quick Start

Prerequisites

  • Python 3.12+
  • NVIDIA GPU (recommended)
  • NVIDIA NGC account (for Data Designer)

Setup

  1. Install dependencies:

    uv sync

  2. Set up NVIDIA NGC CLI (for data generation):

  3. Configure environment:

    # Create .env file with your API keys

Usage

Run the notebooks in sequence:

  1. Generate data:

    jupyter notebook DataDesigner/src/notebooks/synthetic_data_collection.ipynb

  2. Preprocess data:

    jupyter notebook notebooks/data_preprocessing_and_eda.ipynb

  3. Train model:

    jupyter notebook notebooks/model_training_and_evaluation.ipynb

Project Structure

shared/
├── notebooks/
│   ├── synthetic_data_collection.ipynb      # Data generation
│   ├── data_preprocessing_and_eda.ipynb     # Data preprocessing
│   └── model_training_and_evaluation.ipynb  # Model training
└── README.md

Key Technologies

  • NVIDIA Data Designer - Synthetic data generation
  • SetFit - Few-shot learning framework
  • HuggingFace Datasets - Data management
  • Sentence Transformers - Text embeddings
  • PyTorch - Deep learning backend

Workflow

Generate Synthetic Data
    ↓
Preprocess & Analyze
    ↓
Train Model (Few-Shot)
    ↓
Evaluate & Deploy

Notes

  • The pipeline demonstrates few-shot learning, reaching ~75% validation accuracy with only 16 training examples
  • Synthetic data generation requires NVIDIA NGC access and API credentials
  • Model checkpoints are saved automatically during training
  • The binary classification converts star ratings: < 3 stars = negative, >= 3 stars = positive

About

Workshop assets: Achieving Final Year Research Success in AI/ML
