Jouzou-M/Clustering-Project

Customer Ordering Behavior Clustering

This project uses K-Means clustering to segment a customer base into distinct risk-profile clusters based on purchasing patterns. By analyzing historical order data (January 2020 to July 2024), we identify customer archetypes to inform sales strategy and client management.

Project Overview

The goal of this analysis is to classify customers not just by volume, but by the nature of their ordering habits. We calculate three key metrics for every customer:

  1. Consistency: The standard deviation of their order quantity (lower = more consistent).
  2. Loyalty: The percentage of active months relative to their total lifespan as a customer.
  3. Slope: The trend direction of their order volume over time (linear regression slope).
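The three metrics above can be sketched for a single customer as follows. This is a minimal illustration, not the code in src/: the function name `customer_metrics`, the monthly aggregation, the use of the population standard deviation, and the treatment of months without orders as zeros are all assumptions.

```python
import numpy as np

def customer_metrics(monthly_qty):
    """Compute the three metrics from one customer's monthly order quantities.

    monthly_qty is assumed to cover every month of the customer's lifespan,
    with 0 for months without orders (hypothetical input layout).
    """
    qty = np.asarray(monthly_qty, dtype=float)
    months = np.arange(len(qty))
    return {
        # Standard deviation of order quantity (lower = more consistent).
        "consistency": float(np.std(qty)),
        # Share of months with at least one order.
        "loyalty": float(np.mean(qty > 0)),
        # Slope of a least-squares line fitted to quantity over time.
        "slope": float(np.polyfit(months, qty, 1)[0]),
    }

steady = customer_metrics([5, 5, 5, 5])
growing = customer_metrics([1, 2, 3, 4])
sporadic = customer_metrics([10, 0, 0, 4])
print(steady, growing, sporadic)
```

Under these assumptions a perfectly steady customer has a consistency of 0 and a slope of about 0, while a customer active in two of four months has a loyalty of 0.5.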

Methodology

The analysis pipeline follows these steps, as detailed in the src/ directory:

1. Data Preparation

  • Merging: Combines recent closing data with historical combined data (01_join_data.py).
  • Feature Engineering: Calculates the core metrics (Consistency, Loyalty, Slope) and temporal attributes like "July Average" and "P3MA" (Orders per day in April/May/June) (02_feature_engineering.py).
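As a rough illustration of the temporal attributes, the sketch below computes an orders-per-day figure over a date window. The helper `orders_per_day`, the sample dates, the choice of 2024 as the reference year, and the reading of "July Average" as orders per day in July are assumptions made for the example, not details taken from 02_feature_engineering.py.

```python
from datetime import date

def orders_per_day(order_dates, start, end):
    """Average number of orders per calendar day in [start, end], inclusive."""
    n_days = (end - start).days + 1
    n_orders = sum(start <= d <= end for d in order_dates)
    return n_orders / n_days

# Hypothetical order dates for one customer.
order_dates = [date(2024, 4, 3), date(2024, 5, 20), date(2024, 6, 30), date(2024, 7, 2)]

# P3MA: orders per day across April, May, and June (91 days).
p3ma = orders_per_day(order_dates, date(2024, 4, 1), date(2024, 6, 30))
# Assumed meaning of "July Average": orders per day in July (31 days).
july_avg = orders_per_day(order_dates, date(2024, 7, 1), date(2024, 7, 31))
print(p3ma, july_avg)
```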

2. Preprocessing

To prepare the data for K-Means clustering, which is sensitive to feature scale and outliers, we apply two preprocessing steps (03_winsorization.py):

  • Normalization: A PowerTransformer is applied to make each feature's distribution more Gaussian-like.
  • Winsorization: Extreme outliers are capped (floors and ceilings) rather than removed, preserving the dataset size while reducing the impact of anomalies.
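The two preprocessing steps might look like the sketch below, run on synthetic data. The Yeo-Johnson method (chosen here because the slope can be negative), the 1st/99th percentile caps, and the order of the two steps are assumptions; the actual settings live in 03_winsorization.py.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Synthetic, right-skewed stand-in for the three features, plus one extreme row.
rng = np.random.default_rng(0)
X = rng.lognormal(size=(200, 3))
X[0] = [500.0, 500.0, 500.0]

# Power transform: reshapes each column to be more Gaussian-like and
# standardizes it to zero mean and unit variance.
pt = PowerTransformer(method="yeo-johnson", standardize=True)
X_t = pt.fit_transform(X)

# Winsorization: cap each column at assumed 1st and 99th percentile limits
# instead of dropping rows, so the dataset keeps its size.
lo, hi = np.percentile(X_t, [1, 99], axis=0)
X_w = np.clip(X_t, lo, hi)

print(X_w.shape, X_t[0], X_w[0])
```

The extreme row is still present after the second step, but its values are pulled back to the column ceilings.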

3. Clustering

  • Determining K: We evaluate the optimal number of clusters for different customer sizes (Small, Medium, Large, Cow) using Silhouette, Calinski-Harabasz, and Davies-Bouldin scores (04_determine_k.py).
  • K-Means Execution: The algorithm assigns customers to clusters. For example, "Small" customers are divided into 3 clusters (05_clustering.py).
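A minimal sketch of the K selection and the final fit is shown below on synthetic blobs. The candidate range of K, the tie-breaking by Silhouette score alone, and the data are assumptions for illustration; 04_determine_k.py and 05_clustering.py may weigh the three scores differently.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

# Synthetic stand-in for one size segment: three well-separated groups
# in a three-feature space.
centers = [[0, 0, 0], [8, 8, 8], [-8, 8, 0]]
X, _ = make_blobs(n_samples=300, centers=centers, cluster_std=0.6, random_state=0)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = {
        "silhouette": silhouette_score(X, labels),                # higher is better
        "calinski_harabasz": calinski_harabasz_score(X, labels),  # higher is better
        "davies_bouldin": davies_bouldin_score(X, labels),        # lower is better
    }

# Assumed selection rule: pick the K with the best Silhouette score.
best_k = max(scores, key=lambda k: scores[k]["silhouette"])
final_labels = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(X)
print(best_k, np.bincount(final_labels))
```

In practice the three scores do not always agree, so it helps to inspect all of them per segment before fixing K.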

4. Evaluation & Visualization

  • PCA Analysis: We use Principal Component Analysis (PCA) to compute the principal components (the eigenvectors of the feature covariance matrix) and their loadings, which show how much each field (Consistency, Loyalty, Slope) contributes to the variance that separates the clusters (06_pca_analysis.py).
  • Distributions: Visualizing the statistical distribution of fields across clusters (07_plot_distributions.py).
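The PCA step can be illustrated with the sketch below, which prints each component's share of variance and its loadings on the three fields. The synthetic data and the feature ordering are assumptions; 06_pca_analysis.py may present the results differently.

```python
import numpy as np
from sklearn.decomposition import PCA

features = ["consistency", "loyalty", "slope"]

# Synthetic stand-in: the first column is given the largest spread on purpose.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * [3.0, 1.0, 0.3]

pca = PCA(n_components=3).fit(X)

# Each row of components_ is one eigenvector; its entries are the loadings
# of the original fields on that component.
for i, (ratio, comp) in enumerate(zip(pca.explained_variance_ratio_, pca.components_), 1):
    loadings = ", ".join(f"{name}={w:+.2f}" for name, w in zip(features, comp))
    print(f"PC{i}: {ratio:.1%} of variance ({loadings})")

# Field with the largest absolute loading on the first component.
dominant = features[int(np.argmax(np.abs(pca.components_[0])))]
```

A field with a large absolute loading on a high-variance component accounts for much of the spread in the data, which is one way to judge its influence on how the clusters separate.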

Cluster Archetypes (Results)

Based on the analysis (refer to presentation/Clustering Clients.pptx), the following archetypes were identified for "Small" clients:

  • Cluster 0 (Steady & Loyal): Lowest consistency score (meaning highly consistent), 98% Loyalty, slightly negative slope.
  • Cluster 1 (Volatile & Declining): High consistency score (highly variable), 91% Loyalty, but a highly negative slope.
  • Cluster 2 (Churn Risk/Sporadic): Low consistency, low loyalty (52%), and negative slope.

Presentation

A detailed presentation containing 3D PCA visualizations and deep-dive statistics is available in the presentation/ directory.

About

Customer Ordering Behavior Clustering for Wakilni (fulfilment company)
