Customer Ordering Behavior Clustering

This project uses K-Means clustering to segment a customer base into distinct risk-profile clusters based on purchasing patterns. By analyzing historical order data (January 2020 to July 2024), we identify customer archetypes to inform sales strategy and client management.

Project Overview

The goal of this analysis is to classify customers not just by volume, but by the nature of their ordering habits. We calculate three key metrics for every customer:

Consistency: The standard deviation of their order quantity (lower = more consistent).
Loyalty: The percentage of active months relative to their total lifespan as a customer.
Slope: The trend direction of their order volume over time (linear regression slope).
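
The three metrics above can be sketched as follows. This is a minimal illustration, not the code in src/: the column names (customer_id, month, quantity), the one-row-per-active-month layout, and the definition of lifespan as the span from first to last order are all assumptions.

```python
import numpy as np
import pandas as pd


def customer_metrics(orders: pd.DataFrame) -> pd.DataFrame:
    """Compute Consistency, Loyalty, and Slope per customer.

    Assumes hypothetical columns: customer_id, month (datetime64), quantity,
    with one row per customer per month in which they ordered.
    """
    rows = []
    for customer_id, grp in orders.groupby("customer_id"):
        grp = grp.sort_values("month")
        first = grp["month"].iloc[0]
        # Month index relative to the customer's first order (0, 1, 2, ...).
        t = ((grp["month"].dt.year - first.year) * 12
             + (grp["month"].dt.month - first.month)).to_numpy()
        qty = grp["quantity"].to_numpy(dtype=float)

        lifespan = t[-1] + 1        # months from first to last order, inclusive
        active = len(np.unique(t))  # months with at least one order

        # Slope of a least-squares line through (month index, quantity);
        # undefined for a single active month, so fall back to 0.
        slope = np.polyfit(t, qty, 1)[0] if active > 1 else 0.0

        rows.append({
            "customer_id": customer_id,
            "consistency": qty.std(ddof=0),  # lower = more consistent
            "loyalty": active / lifespan,    # share of months with an order
            "slope": slope,
        })
    return pd.DataFrame(rows).set_index("customer_id")


# Illustrative data only: A orders every month with growing volume,
# B orders in January and April and skips the months in between.
orders = pd.DataFrame({
    "customer_id": ["A"] * 4 + ["B"] * 2,
    "month": pd.to_datetime([
        "2024-01-01", "2024-02-01", "2024-03-01", "2024-04-01",
        "2024-01-01", "2024-04-01",
    ]),
    "quantity": [10, 12, 14, 16, 5, 5],
})
metrics = customer_metrics(orders)
```

In this toy data, customer A is active in every month of its lifespan and trends upward, while customer B is active in only half of its months with a flat trend.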

Methodology

The analysis pipeline follows these steps, as detailed in the src/ directory:

1. Data Preparation

Merging: Combines recent closing data with historical combined data (01_join_data.py).
Feature Engineering: Calculates the core metrics (Consistency, Loyalty, Slope) and temporal attributes like "July Average" and "P3MA" (Orders per day in April/May/June) (02_feature_engineering.py).
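
As an illustration of the "P3MA" attribute, the sketch below computes orders per day over April, May, and June. The column names (customer_id, order_date, quantity), the choice of 2024 as the year, and the use of summed quantity rather than order count are assumptions; 02_feature_engineering.py may define it differently.

```python
import pandas as pd


def p3ma(orders: pd.DataFrame, year: int = 2024) -> pd.Series:
    """Average daily order quantity per customer over April-June of `year`.

    Assumes hypothetical columns: customer_id, order_date (datetime64), quantity.
    """
    start = pd.Timestamp(year=year, month=4, day=1)
    end = pd.Timestamp(year=year, month=7, day=1)  # exclusive upper bound
    in_window = (orders["order_date"] >= start) & (orders["order_date"] < end)
    n_days = (end - start).days  # 30 + 31 + 30 = 91 days
    return orders[in_window].groupby("customer_id")["quantity"].sum() / n_days


# Illustrative data only; the July order falls outside the window.
orders = pd.DataFrame({
    "customer_id": ["A", "A", "B"],
    "order_date": pd.to_datetime(["2024-04-15", "2024-07-02", "2024-05-01"]),
    "quantity": [91, 50, 182],
})
result = p3ma(orders)
```

Customers with no orders in the window are absent from the result in this sketch; a real pipeline would need to decide whether to fill them with zero.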

2. Preprocessing

To prepare the data for K-Means clustering, we apply two preprocessing steps (03_winsorization.py):

Normalization: A PowerTransformer is applied to reduce skew and make each feature's distribution more Gaussian-like.
Winsorization: Extreme outliers are capped (floors and ceilings) rather than removed, preserving the dataset size while reducing the impact of anomalies.
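
A minimal sketch of these two steps with scikit-learn and NumPy is shown below. The Yeo-Johnson method, the 1st/99th percentile caps, and applying the transform before the capping are assumptions for illustration; the actual settings live in 03_winsorization.py.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer


def preprocess(X: np.ndarray, lower_pct: float = 1.0, upper_pct: float = 99.0) -> np.ndarray:
    """Power-transform each column, then cap extremes at the given percentiles."""
    # Yeo-Johnson accepts zero and negative inputs (e.g. negative slopes);
    # standardize=True also rescales each column to zero mean, unit variance.
    transformed = PowerTransformer(method="yeo-johnson", standardize=True).fit_transform(X)

    # Winsorize: values beyond the percentile bounds are clipped, not dropped,
    # so the number of rows is unchanged.
    floor = np.percentile(transformed, lower_pct, axis=0)
    ceiling = np.percentile(transformed, upper_pct, axis=0)
    return np.clip(transformed, floor, ceiling)


# Illustrative data only: skewed features with one extreme row.
rng = np.random.default_rng(0)
X = rng.lognormal(size=(200, 3))
X[0] = 500.0
out = preprocess(X)
```

The extreme row is pulled back to the per-column ceiling instead of being removed, which matches the intent of capping rather than discarding outliers.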

3. Clustering

Determining K: We evaluate the optimal number of clusters for different customer sizes (Small, Medium, Large, Cow) using Silhouette, Calinski-Harabasz, and Davies-Bouldin scores (04_determine_k.py).
K-Means Execution: The algorithm assigns customers to clusters. For example, "Small" customers are divided into 3 clusters (05_clustering.py).
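
The K selection step can be sketched as follows, using the three scikit-learn metrics named above. The candidate range of K, the random seed, and the synthetic data are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)


def score_k(X: np.ndarray, k_values=range(2, 7), seed: int = 0) -> dict:
    """Fit K-Means for each candidate K and collect three cluster-quality scores."""
    results = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        results[k] = {
            "silhouette": silhouette_score(X, labels),                # higher is better
            "calinski_harabasz": calinski_harabasz_score(X, labels),  # higher is better
            "davies_bouldin": davies_bouldin_score(X, labels),        # lower is better
        }
    return results


# Illustrative data only: three well-separated groups in a 3-feature space.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0, 0.0], [10.0, 10.0, 10.0], [-10.0, 10.0, 0.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 3)) for c in centers])

scores = score_k(X)
best_k = max(scores, key=lambda k: scores[k]["silhouette"])
```

The three scores do not always agree on real data, so in practice the choice of K is a judgment call made per customer size group.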

4. Evaluation & Visualization

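PCA Analysis: We use Principal Component Analysis (PCA) to compute the principal components (eigenvectors) of the feature data and the share of variance each one explains. The component loadings indicate how much each field (Consistency, Loyalty, Slope) contributes to the variation that separates the clusters (06_pca_analysis.py).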
Distributions: We plot the statistical distribution of each field across clusters (07_plot_distributions.py).
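
As a rough sketch of the PCA step, the code below fits scikit-learn's PCA and reads off the explained variance ratios and the absolute loadings of the first component. The function name and the synthetic data are hypothetical; 06_pca_analysis.py may summarize contributions differently.

```python
import numpy as np
from sklearn.decomposition import PCA


def pca_contributions(X: np.ndarray, feature_names: list) -> tuple:
    """Return explained variance ratios and |loadings| of the first component."""
    pca = PCA(n_components=len(feature_names)).fit(X)
    # Each row of components_ is a unit-length eigenvector of the covariance
    # matrix; its absolute entries show how strongly each field drives it.
    first_component = dict(zip(feature_names, np.abs(pca.components_[0])))
    return pca.explained_variance_ratio_, first_component


# Illustrative data only: the first column varies far more than the others.
# Real inputs would be the preprocessed (standardized) features.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3)) * np.array([10.0, 1.0, 1.0])

ratios, loadings = pca_contributions(X, ["consistency", "loyalty", "slope"])
```

Here the first component is dominated by the high-variance column, which is the kind of pattern the loadings are meant to reveal.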

Cluster Archetypes (Results)

Based on the analysis (refer to presentation/Clustering Clients.pptx), the following archetypes were identified for "Small" clients:

Cluster 0 (Steady & Loyal): Lowest consistency score (meaning highly consistent), 98% Loyalty, slightly negative slope.
Cluster 1 (Volatile & Declining): High consistency score (meaning highly variable), 91% Loyalty, and a strongly negative slope.
Cluster 2 (Churn Risk/Sporadic): Low consistency, low loyalty (52%), and negative slope.

Presentation

A detailed presentation containing 3D PCA visualizations and deep-dive statistics is available in the presentation/ directory.

