Customer Segmentation & Purchase Pattern Analysis
This project analyzes retail transaction data to:
- Segment customers based on their purchasing behavior using RFM Analysis and K-Means Clustering
- Visualize customer clusters using PCA (Principal Component Analysis)
- Discover product purchase patterns using Market Basket Analysis (Apriori Algorithm)
- Provide an interactive analytics dashboard using Streamlit
The goal is to support data-driven marketing and business strategy decisions.
- Identify different customer segments (e.g., loyal, inactive, high spenders)
- Support marketing strategies such as:
  - Loyalty programs
  - Personalized promotions
  - Product bundling
- Understand which products are frequently purchased together
Online Retail Dataset
Transaction data with the following columns:
InvoiceNo, StockCode, Description, Quantity, InvoiceDate, UnitPrice, CustomerID, Country
Source: https://www.kaggle.com/datasets/ulrikthygepedersen/online-retail-dataset
The following preprocessing steps were applied:
- Removed duplicate records
- Removed rows with missing CustomerID and Description
- Converted InvoiceDate to datetime format
- Removed cancelled transactions (InvoiceNo starting with "C")
- Removed invalid values (Quantity ≤ 0, UnitPrice ≤ 0)
- Removed outliers using the IQR method
- Standardized text fields (Description, Country)
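The preprocessing steps above can be sketched with pandas roughly as follows. This is a minimal illustration, not the project's actual code: the function name `clean_transactions` is hypothetical, and the IQR multiplier (1.5), the columns the IQR filter is applied to, and the meaning of "standardized" (trimmed and upper-cased) are assumptions, since the list above does not specify them.

```python
import pandas as pd


def clean_transactions(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the listed preprocessing steps to raw transaction data."""
    df = df.drop_duplicates()
    df = df.dropna(subset=["CustomerID", "Description"])
    df = df.assign(InvoiceDate=pd.to_datetime(df["InvoiceDate"]))

    # Cancelled invoices are those whose number starts with "C".
    df = df[~df["InvoiceNo"].astype(str).str.startswith("C")]
    # Keep only rows with a positive quantity and a positive unit price.
    df = df[(df["Quantity"] > 0) & (df["UnitPrice"] > 0)]

    # IQR filter. Assumption: 1.5 * IQR fences on Quantity and UnitPrice.
    for col in ["Quantity", "UnitPrice"]:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        df = df[df[col].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

    # Assumption: "standardized" means trimmed whitespace and upper case.
    df = df.assign(
        Description=df["Description"].str.strip().str.upper(),
        Country=df["Country"].str.strip().str.upper(),
    )
    return df.reset_index(drop=True)
```

With the project layout described in this README, it could be called as `clean_transactions(pd.read_csv("data/online_retail.csv"))`, depending on the file's encoding.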
Feature engineering:
- Created TotalPrice = Quantity × UnitPrice
RFM variables:
- Recency → Number of days since the last transaction
- Frequency → Number of unique invoices
- Monetary → Total spending
These variables represent customer purchasing behavior and are used as input for clustering.
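One way to derive these three variables from the cleaned transactions is a single group-by per customer. The sketch below is illustrative only: `compute_rfm` is a hypothetical name, and the reference date for Recency (one day after the latest transaction in the data) is an assumption, because the definition above does not say which date "since" is measured from.

```python
import pandas as pd


def compute_rfm(df: pd.DataFrame, snapshot_date=None) -> pd.DataFrame:
    """Return one row per customer with Recency, Frequency and Monetary."""
    df = df.assign(TotalPrice=df["Quantity"] * df["UnitPrice"])

    # Assumption: measure recency from the day after the last transaction.
    if snapshot_date is None:
        snapshot_date = df["InvoiceDate"].max() + pd.Timedelta(days=1)

    return df.groupby("CustomerID").agg(
        # Days between the snapshot date and the customer's last purchase.
        Recency=("InvoiceDate", lambda d: (snapshot_date - d.max()).days),
        # Number of distinct invoices, not number of line items.
        Frequency=("InvoiceNo", "nunique"),
        # Total spending across all line items.
        Monetary=("TotalPrice", "sum"),
    )
```

The function expects `InvoiceDate` to already be a datetime column, as produced by the preprocessing step.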
- Algorithm: K-Means Clustering
- Input features: Scaled RFM variables
- Visualization: PCA (2D scatter plot)
- Output: Cluster label for each customer
Cluster profiling is performed using the average RFM values for each cluster.
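The clustering, projection and profiling steps could look roughly like this with scikit-learn. It is a sketch under stated assumptions: `cluster_customers` is a hypothetical name, `StandardScaler` is assumed as the scaling method (the README only says "scaled"), and the defaults `k=4` and `seed=42` are placeholders rather than values taken from the project.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def cluster_customers(rfm: pd.DataFrame, k: int = 4, seed: int = 42):
    """Cluster customers on scaled RFM values and profile each cluster."""
    features = ["Recency", "Frequency", "Monetary"]

    # Assumption: z-score scaling so no single feature dominates distances.
    scaled = StandardScaler().fit_transform(rfm[features])

    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(scaled)

    # PCA is used only to get 2D coordinates for the scatter plot;
    # the clustering itself runs on all three scaled features.
    coords = PCA(n_components=2).fit_transform(scaled)

    result = rfm.assign(Cluster=labels, PC1=coords[:, 0], PC2=coords[:, 1])
    # Cluster profile: average unscaled RFM values per cluster.
    profile = result.groupby("Cluster")[features].mean()
    return result, profile
```

Profiling on the unscaled values keeps the averages readable in business terms (days, invoices, currency), which is what makes labels such as "loyal" or "inactive" easy to assign.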
- Algorithm: Apriori
- Metrics used:
  - Support
  - Confidence
  - Lift
- Output:
  - Association rules such as: {Product A} → {Product B}
These rules can be used to support product recommendation and bundling strategies.
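To make the three metrics concrete, the sketch below computes them by hand for a single rule over a toy set of baskets. It is not the Apriori algorithm and not how Mlxtend is used in this project; `rule_metrics` is a hypothetical helper, and the product names are made up for illustration.

```python
def rule_metrics(baskets, antecedent, consequent):
    """Compute support, confidence and lift for antecedent -> consequent.

    Assumes the antecedent and the consequent each occur in at least
    one basket; otherwise a division by zero would occur.
    """
    antecedent, consequent = set(antecedent), set(consequent)
    n = len(baskets)
    n_a = sum(antecedent <= set(b) for b in baskets)
    n_c = sum(consequent <= set(b) for b in baskets)
    n_both = sum((antecedent | consequent) <= set(b) for b in baskets)

    support = n_both / n              # share of baskets containing both sides
    confidence = n_both / n_a         # share of antecedent baskets with consequent
    lift = confidence / (n_c / n)     # confidence relative to consequent's base rate
    return support, confidence, lift


# Hypothetical baskets, one set of products per invoice.
baskets = [
    {"mug", "coaster"},
    {"mug", "coaster", "teapot"},
    {"mug"},
    {"teapot"},
]
support, confidence, lift = rule_metrics(baskets, {"mug"}, {"coaster"})
print(support, confidence, lift)
```

In this toy data the rule {mug} → {coaster} has a lift above 1, meaning coasters appear with mugs more often than their overall frequency would suggest, which is the kind of signal used for bundling decisions.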
- Interactive selection of number of clusters (k)
- PCA scatter plot for cluster visualization
- Bar chart for cluster profiling
- Association rules table
- Python
- Pandas, NumPy
- Scikit-learn
- Mlxtend
- Matplotlib
- Streamlit
- Clone the repository:
git clone https://github.com/yourusername/retail-customer-analytics.git
cd retail-customer-analytics
- Install dependencies:
pip install -r requirements.txt
- Project structure:
retail-customer-analytics/
│
├── app.py
├── requirements.txt
├── README.md
└── data/
    └── online_retail.csv
- Run the Streamlit app:
streamlit run app.py