This project uses K-Means clustering to segment a customer base into distinct risk-profile clusters based on purchasing patterns. By analyzing historical order data (January 2020 to July 2024), we identify customer archetypes to inform sales strategies and client management.
The goal of this analysis is to classify customers not just by volume, but by the nature of their ordering habits. We calculate three key metrics for every customer:
- Consistency: The standard deviation of their order quantity (lower = more consistent).
- Loyalty: The percentage of active months relative to their total lifespan as a customer.
- Slope: The trend direction of their order volume over time (linear regression slope).
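As a rough illustration of the three metrics, the sketch below computes them for one customer from a series of monthly order quantities. The function name `customer_metrics`, the monthly aggregation, the use of the sample standard deviation, and the treatment of months without orders as zeros are all assumptions made for this example. The actual logic lives in 02_feature_engineering.py and may differ.

```python
import numpy as np
import pandas as pd

def customer_metrics(monthly_qty: pd.Series) -> dict:
    """Compute Consistency, Loyalty, and Slope for a single customer.

    monthly_qty holds one value per calendar month of the customer's
    lifespan, with 0 for months without orders (an assumption).
    """
    qty = monthly_qty.to_numpy(dtype=float)
    months = np.arange(len(qty))

    # Consistency: standard deviation of order quantity (lower = more consistent).
    consistency = float(np.std(qty, ddof=1))
    # Loyalty: percentage of months with at least one order.
    loyalty = float((qty > 0).mean() * 100)
    # Slope: coefficient of a degree-1 least-squares fit of quantity on month index.
    slope = float(np.polyfit(months, qty, 1)[0])

    return {"consistency": consistency, "loyalty": loyalty, "slope": slope}

# Made-up quantities for illustration only.
demo_metrics = customer_metrics(pd.Series([10, 0, 30, 40]))
```

A negative `slope` indicates shrinking order volume, which is why it appears alongside Loyalty in the archetype descriptions.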
The analysis pipeline follows these steps, as detailed in the src/ directory:
- Merging: Combines recent closing data with historical combined data (01_join_data.py).
- Feature Engineering: Calculates the core metrics (Consistency, Loyalty, Slope) and temporal attributes like "July Average" and "P3MA" (orders per day in April/May/June) (02_feature_engineering.py).
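One way the "P3MA" attribute could be computed is sketched below. The column names (`customer_id`, `order_date`, `quantity`), the function name `p3ma`, the choice of 2024 as the reference year, and the reading of "orders" as summed quantity are assumptions for illustration, not details taken from the project code.

```python
import pandas as pd

def p3ma(orders: pd.DataFrame, year: int = 2024) -> pd.Series:
    """Ordered quantity per day over April, May, and June of `year`."""
    start = pd.Timestamp(year=year, month=4, day=1)
    end = pd.Timestamp(year=year, month=7, day=1)  # exclusive upper bound
    in_window = (orders["order_date"] >= start) & (orders["order_date"] < end)
    n_days = (end - start).days  # 30 + 31 + 30 = 91
    return orders.loc[in_window].groupby("customer_id")["quantity"].sum() / n_days

# Made-up orders for illustration only.
orders_demo = pd.DataFrame({
    "customer_id": ["A", "A", "A", "B"],
    "order_date": pd.to_datetime(
        ["2024-04-10", "2024-06-30", "2024-07-02", "2024-05-15"]
    ),
    "quantity": [91, 91, 50, 182],
})
p3ma_demo = p3ma(orders_demo)
```

In this sketch, customers with no orders in the window are simply absent from the result, so a real pipeline would need to decide whether to fill them with zero.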
To prepare the data for K-Means clustering, we apply rigorous preprocessing (03_winsorization.py):
- Normalization: A PowerTransformer is applied to make each feature's distribution more Gaussian-like.
- Winsorization: Extreme outliers are capped (floors and ceilings) rather than removed, preserving the dataset size while reducing the impact of anomalies.
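The two preprocessing steps above can be sketched with scikit-learn and NumPy as follows. The function name `preprocess`, the 1st/99th percentile caps, and the order of the two steps (transform first, then cap) are assumptions for this example. The thresholds actually used in 03_winsorization.py are not stated here.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

def preprocess(X, lower_pct=1.0, upper_pct=99.0):
    """Power-transform each column, then cap extremes at the given percentiles."""
    # Yeo-Johnson (the scikit-learn default) accepts negative values,
    # which matters because Slope can be negative.
    transformer = PowerTransformer(method="yeo-johnson", standardize=True)
    Xt = transformer.fit_transform(X)

    # Winsorize: clip each column to its own floor and ceiling instead of
    # dropping rows, so the number of customers is unchanged.
    floors = np.percentile(Xt, lower_pct, axis=0)
    ceilings = np.percentile(Xt, upper_pct, axis=0)
    return np.clip(Xt, floors, ceilings)

# Synthetic stand-in for (Consistency, Loyalty, Slope); not real customer data.
rng = np.random.default_rng(0)
X_demo = np.column_stack([
    rng.lognormal(size=200),          # skewed, like a standard deviation
    rng.uniform(0, 100, size=200),    # a percentage
    rng.normal(size=200),             # can be negative
])
X_clean = preprocess(X_demo)
```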
- Determining K: We evaluate the optimal number of clusters for different customer sizes (Small, Medium, Large, Cow) using Silhouette, Calinski-Harabasz, and Davies-Bouldin scores (04_determine_k.py).
- K-Means Execution: The algorithm assigns customers to clusters. For example, "Small" customers are divided into 3 clusters (05_clustering.py).
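A minimal sketch of how candidate values of K might be compared for one size group is shown below, using scikit-learn's three scoring functions. The function name `score_candidate_ks`, the candidate range 2 to 6, the fixed seed, and the use of silhouette alone to pick the final K are assumptions. The project scripts may weigh the three scores differently.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

def score_candidate_ks(X, k_values=range(2, 7), seed=0):
    """Fit K-Means for each candidate k and collect the three validity scores."""
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        scores[k] = {
            "silhouette": silhouette_score(X, labels),                # higher is better
            "calinski_harabasz": calinski_harabasz_score(X, labels),  # higher is better
            "davies_bouldin": davies_bouldin_score(X, labels),        # lower is better
        }
    return scores

# Synthetic stand-in for one size group: three well-separated blobs in
# (Consistency, Loyalty, Slope) space. Not real customer data.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0, 0.0], [8.0, 8.0, 0.0], [0.0, 8.0, 8.0]])
X_demo = np.vstack([c + rng.normal(scale=0.5, size=(50, 3)) for c in centers])

scores = score_candidate_ks(X_demo)
best_k = max(scores, key=lambda k: scores[k]["silhouette"])
final_labels = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(X_demo)
```

Because the three scores do not always agree, it is usually worth inspecting all of them per size group rather than relying on a single number.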
- PCA Analysis: We use Principal Component Analysis (PCA) to calculate the eigenvectors that capture the variability in the data, determining how much each field (Consistency, Loyalty, Slope) contributes to the cluster definitions (06_pca_analysis.py).
- Distributions: Visualizing the statistical distribution of fields across clusters (07_plot_distributions.py).
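To illustrate the PCA step, the sketch below fits scikit-learn's `PCA` on three features and prints each component's share of variance together with its per-field weights. The helper name `pca_loadings` and the synthetic input are assumptions. 06_pca_analysis.py may present the eigenvectors differently.

```python
import numpy as np
from sklearn.decomposition import PCA

FEATURES = ["Consistency", "Loyalty", "Slope"]

def pca_loadings(X):
    """Return (explained variance ratios, eigenvectors) for the feature matrix X."""
    pca = PCA(n_components=X.shape[1])
    pca.fit(X)
    # Each row of components_ is a unit-length eigenvector of the covariance
    # matrix; its entries are the weights of the original fields.
    return pca.explained_variance_ratio_, pca.components_

# Synthetic data with deliberately unequal spread per column; not real customer data.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3)) * np.array([3.0, 1.0, 0.3])

ratios, components = pca_loadings(X_demo)
for i, (ratio, vec) in enumerate(zip(ratios, components), start=1):
    weights = ", ".join(f"{name}={w:+.2f}" for name, w in zip(FEATURES, vec))
    print(f"PC{i}: {ratio:.1%} of variance ({weights})")
```

A field with a large absolute weight in a high-variance component contributes strongly to the separation seen in that direction. The sign of an eigenvector is arbitrary, so only the magnitudes should be compared.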
Based on the analysis (refer to presentation/Clustering Clients.pptx), the following archetypes were identified for "Small" clients:
- Cluster 0 (Steady & Loyal): Lowest consistency score (meaning highly consistent), 98% Loyalty, slightly negative slope.
- Cluster 1 (Volatile & Declining): High consistency score (highly variable), 91% Loyalty, but a highly negative slope.
- Cluster 2 (Churn Risk/Sporadic): Low consistency, low loyalty (52%), and negative slope.
A detailed presentation containing 3D PCA visualizations and deep-dive statistics is available in the presentation/ directory.