aryanKaga/Topic-Modeling-on-Amazon-Dataset

Amazon Reviews Topic Modeling: LDA vs BERTopic

Overview

This project applies topic modeling to the Amazon Reviews Dataset to extract recurring themes, customer concerns, and product-related insights from a large body of review text. It compares two topic modeling approaches:

  • Latent Dirichlet Allocation (LDA) – A traditional probabilistic topic modeling technique
  • BERTopic – A transformer-based modern topic modeling framework

The goal is to evaluate how classical and contextual NLP methods differ in topic coherence, interpretability, and semantic understanding when applied to real-world e-commerce reviews.


Dataset

Amazon Reviews Dataset: https://www.kaggle.com/datasets/kritanjalijain/amazon-reviews

This dataset contains large volumes of customer reviews across products, making it ideal for:

  • Customer sentiment exploration
  • Product trend identification
  • Hidden theme extraction
  • NLP benchmarking

Models Used

1. Latent Dirichlet Allocation (LDA)

LDA is a generative probabilistic model that identifies hidden topics based on word distributions across documents.

Workflow:

  • Text cleaning and preprocessing
  • Tokenization
  • Stopword removal
  • Vectorization using Bag-of-Words / TF-IDF
  • Topic extraction using probabilistic distributions
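The workflow above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the stopword list, the `preprocess` and `fit_lda` helper names, and the parameter values are all assumptions made for the example. The preprocessing uses only the standard library; the LDA step assumes scikit-learn is installed and uses plain Bag-of-Words counts.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; a real run would use a fuller one).
STOPWORDS = {"the", "a", "an", "and", "is", "it", "this", "to", "of", "i", "was", "for"}


def preprocess(text):
    """Lowercase, replace non-letters with spaces, split on whitespace,
    then drop stopwords and tokens shorter than three characters."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return [tok for tok in text.split() if tok not in STOPWORDS and len(tok) > 2]


def bag_of_words(tokens):
    """Count how often each token occurs in one document."""
    return Counter(tokens)


def fit_lda(texts, n_topics=2, n_top_words=5):
    """Fit LDA on Bag-of-Words counts and return the top words per topic.

    Hypothetical helper: scikit-learn is imported lazily so the
    preprocessing functions above work without it.
    """
    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.feature_extraction.text import CountVectorizer

    # Passing a callable as `analyzer` makes the vectorizer use our tokenizer.
    vectorizer = CountVectorizer(analyzer=preprocess)
    counts = vectorizer.fit_transform(texts)

    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    lda.fit(counts)

    vocab = vectorizer.get_feature_names_out()
    # Each row of components_ holds per-word weights for one topic.
    return [
        [vocab[i] for i in row.argsort()[::-1][:n_top_words]]
        for row in lda.components_
    ]
```

Calling `fit_lda(reviews, n_topics=10)` on a list of review strings would return ten lists of top words, one per topic. The number of topics has to be chosen up front, which is one reason LDA results are sensitive to setup.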

Advantages:

  • Interpretable topic-word distributions
  • Strong baseline for topic modeling
  • Efficient for large corpora

Limitations:

  • Limited semantic understanding
  • Sensitive to preprocessing
  • Less effective on short or context-rich text

2. BERTopic

BERTopic combines:

  • Transformer-based document embeddings (BERT-family models)
  • UMAP dimensionality reduction
  • HDBSCAN clustering
  • Class-based TF-IDF (c-TF-IDF) for topic representation

This approach captures contextual and semantic relationships beyond simple word frequency.
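To illustrate the last stage of that pipeline, here is a small standard-library sketch of class-based TF-IDF. It assumes documents have already been embedded and clustered, and follows the weighting described in the BERTopic documentation as best understood here: a word's frequency within a cluster multiplied by log(1 + A / f), where A is the average number of words per cluster and f is the word's frequency across all clusters. The function names and toy data are illustrative, and the library's own implementation differs in details such as normalization.

```python
import math
from collections import Counter


def c_tf_idf(class_tokens):
    """Score words per cluster.

    class_tokens maps a cluster id to the list of tokens from all
    documents assigned to that cluster (treated as one long document).
    """
    tf = {c: Counter(tokens) for c, tokens in class_tokens.items()}

    # Frequency of each word across all clusters.
    total = Counter()
    for counts in tf.values():
        total.update(counts)

    # Average number of words per cluster.
    avg_words = sum(len(t) for t in class_tokens.values()) / len(class_tokens)

    return {
        c: {w: n * math.log(1 + avg_words / total[w]) for w, n in counts.items()}
        for c, counts in tf.items()
    }


def top_words(scores, k=3):
    """Return the k highest-scoring words per cluster (ties broken alphabetically)."""
    return {
        c: [w for w, _ in sorted(ws.items(), key=lambda kv: (-kv[1], kv[0]))[:k]]
        for c, ws in scores.items()
    }
```

Words that are frequent inside one cluster but rare elsewhere score highest, which is why a word like "good" that appears in every cluster tends to drop out of the topic description.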

Advantages:

  • Context-aware topic extraction
  • Higher topic coherence
  • Better performance on diverse review texts
  • More meaningful semantic clusters

Limitations:

  • Higher computational cost
  • Larger model size
  • More resource intensive

Project Objectives

  • Compare LDA and BERTopic performance
  • Analyze topic coherence and semantic quality
  • Visualize extracted topics
  • Understand customer concerns and product themes
  • Benchmark traditional vs transformer-based NLP methods
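One way to make the coherence comparison concrete is a co-occurrence score such as UMass coherence. The sketch below is an assumption about how such a measurement could be done, since the project does not state which coherence metric it uses. It takes a topic's top words (ordered from most to least probable) and the tokenized corpus, and sums log((D(w_m, w_l) + 1) / D(w_l)) over word pairs, where D counts the documents containing the given words.

```python
import math


def umass_coherence(top_words, documents):
    """UMass-style coherence for one topic.

    top_words: the topic's words, most probable first.
    documents: tokenized corpus, one list of tokens per document.
    Scores closer to zero (or positive) mean the words co-occur more often.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain every given word.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            d_l = doc_freq(top_words[l])
            if d_l == 0:
                # Skip words that never occur, to avoid dividing by zero.
                continue
            co = doc_freq(top_words[m], top_words[l])
            score += math.log((co + 1) / d_l)
    return score
```

Because the score depends only on top words and the corpus, the same function can be applied to topics from either LDA or BERTopic, which keeps the comparison on equal footing.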

Tech Stack

  • Python
  • BERTopic
  • Scikit-learn
  • Gensim
  • Pandas
  • NumPy
  • Matplotlib / Seaborn
  • HuggingFace Transformers

Results

LDA:

  • Produced interpretable keyword-based topics
  • Useful for broad trend analysis
  • Lower contextual accuracy

BERTopic:

  • Generated semantically richer topics
  • Better clustered nuanced customer concerns
  • Higher topic coherence than the LDA baseline

Conclusion

This project demonstrates the evolution of topic modeling from traditional statistical methods like LDA to modern embedding-based approaches like BERTopic. While LDA offers simplicity and interpretability, BERTopic produced semantically richer and more coherent topics for large-scale customer review analysis in this comparison.


Future Improvements

  • Fine-tune transformer models on review-specific corpora
  • Add sentiment-topic correlation analysis
  • Implement dynamic topic tracking over time
  • Deploy interactive dashboard for topic exploration

Author

Developed as an NLP and machine learning project to explore advanced topic modeling techniques on real-world Amazon review datasets.

About

Automated topic labeling pipeline for BERTopic and LDA models using local flan-t5 (no API required). Generates concise, deduplicated labels with multi-run inference, fallback handling, and structured text output. Supports gensim, sklearn, and pickle-saved LDA models.
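The labeling logic described above (multi-run inference, fallback handling, deduplicated labels) might look something like the following sketch. It is not the repository's code: the function names, the prompt wording, and the `generate` callable are hypothetical. `generate` stands in for whatever wraps the local flan-t5 model and is expected to take a prompt string and return a label string.

```python
from collections import Counter


def normalize(label):
    """Trim whitespace, quotes and trailing periods, collapse spaces, title-case."""
    return " ".join(label.strip().strip(".\"'").split()).title()


def label_topic(keywords, generate, runs=3):
    """Ask the model several times and keep the most frequent cleaned label.

    Falls back to the first two keywords if every run fails or returns nothing.
    """
    prompt = "Give a short topic label for these keywords: " + ", ".join(keywords)
    votes = Counter()
    for _ in range(runs):
        try:
            raw = generate(prompt)
        except Exception:
            # A failed run simply casts no vote.
            continue
        cleaned = normalize(raw or "")
        if cleaned:
            votes[cleaned] += 1
    if votes:
        return votes.most_common(1)[0][0]
    return normalize(" ".join(keywords[:2]))


def label_all(topics, generate, runs=3):
    """Label every topic, adding a numeric suffix to repeated labels."""
    labels, seen = {}, Counter()
    for topic_id, keywords in topics.items():
        label = label_topic(keywords, generate, runs)
        seen[label] += 1
        if seen[label] > 1:
            label = f"{label} ({seen[label]})"
        labels[topic_id] = label
    return labels
```

Keeping the model behind a plain callable means the same voting and fallback logic works for topics from either LDA or BERTopic, as long as each topic is reduced to a list of keywords first.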
