This project performs customer segmentation analysis using Kaggle's Bank Marketing dataset to identify distinct customer groups for targeted marketing strategies. Through unsupervised machine learning and comprehensive feature engineering, we identified 5 distinct customer segments with deposit conversion rates ranging from 34.1% to 88.9%.
Dataset: Kaggle Bank Marketing Dataset
Records: 11,162 customers | Features: 17 original + 15 engineered features
# Clone the repository
git clone https://github.com/yourusername/Customer-Segmentation-Project.git
# Install dependencies
pip install -r requirements.txt
# Run the analysis notebook
jupyter notebook notebook/data_exploration.ipynb
# Run inference on new data
python inference.py
Requirements: Python 3.8+, pandas, numpy, scikit-learn, matplotlib, seaborn, scipy, statsmodels, joblib
- Source: Kaggle Bank Marketing dataset (Portuguese banking institution marketing campaigns)
- Size: 11,162 customers with 17 features
- Features: Demographics (age, job, marital, education), financial (default, balance, housing, loan), contact and behavioral (contact, duration, campaign, pdays, previous, poutcome), temporal (month, day_of_week), and target (deposit)
- Data Quality: No missing values, approximately balanced target distribution (52.6% no deposit, 47.4% deposit; 5,289 of 11,162 customers subscribed)
Key Exploratory Findings:
- Age distribution: Mean 40.9 years (range: 18-95)
- Balance distribution: Highly right-skewed (mean: $1,362, median: $448)
- Campaign contacts: Most customers contacted 1-3 times (mean: 2.76)
- Previous contact outcomes: 81.7% never contacted, 14.5% failure, 3.8% success
Created 15 engineered features to capture complex customer behaviors:
Financial Features:
- `balance_per_age`: Wealth accumulation rate indicator
- `balance_log`: Log-transformed balance for skewness correction
- `balance_positive`: Binary indicator for positive account balance
- `balance_category`: Quartile-based balance bins (low/medium/high/very_high)
Campaign Intensity Metrics:
- `campaign_intensity`: Campaign contacts normalized by previous contacts
- `total_contacts`: Cumulative campaign + previous contacts
- `is_first_campaign`: Binary flag for first-time contact
Previous Contact Features:
- `contacted_before`: Whether the customer was contacted in previous campaigns
- `pdays_normalized`: Days since last contact (handling the -1 "never contacted" placeholder)
- `recent_contact`: Recent contact indicator (< 30 days)
Temporal Encoding:
- `month_sin`, `month_cos`: Cyclical month encoding
- `quarter`: Quarterly seasonality indicator (Q1-Q4)
Interaction Features:
- `has_debt`: Binary indicator for any loan/housing debt
- `debt_count`: Total number of active debts (0-2)
- `prev_success`: Previous campaign success indicator
- `duration_per_contact`: Average call duration per campaign contact
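
The feature definitions above can be illustrated with a short pandas sketch. The exact formulas used in the notebook are not given here, so everything below is an assumption: the function name `engineer_features_sketch`, the signed `log1p` transform for negative balances, and the lowercase month abbreviations are illustrative choices, not the project's actual implementation.

```python
import numpy as np
import pandas as pd

# Month abbreviations as assumed to appear in the `month` column.
MONTH_TO_NUM = {m: i for i, m in enumerate(
    ["jan", "feb", "mar", "apr", "may", "jun",
     "jul", "aug", "sep", "oct", "nov", "dec"], start=1)}


def engineer_features_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical re-creation of several engineered features (formulas assumed)."""
    out = df.copy()

    # Financial features
    out["balance_per_age"] = out["balance"] / out["age"]
    # Signed log1p keeps negative balances defined (assumed handling).
    out["balance_log"] = np.sign(out["balance"]) * np.log1p(out["balance"].abs())
    out["balance_positive"] = (out["balance"] > 0).astype(int)

    # Campaign and previous-contact features
    out["total_contacts"] = out["campaign"] + out["previous"]
    out["contacted_before"] = (out["pdays"] != -1).astype(int)  # -1 = never contacted
    out["recent_contact"] = out["pdays"].between(0, 29).astype(int)  # < 30 days

    # Cyclical month encoding, so December and January sit next to each other
    month_num = out["month"].str.lower().map(MONTH_TO_NUM)
    angle = 2 * np.pi * month_num / 12
    out["month_sin"] = np.sin(angle)
    out["month_cos"] = np.cos(angle)

    # Debt features
    out["debt_count"] = ((out["housing"] == "yes").astype(int)
                         + (out["loan"] == "yes").astype(int))
    out["has_debt"] = (out["debt_count"] > 0).astype(int)
    # Guard against division by zero in case a row has no campaign contacts
    out["duration_per_contact"] = out["duration"] / out["campaign"].clip(lower=1)
    return out
```

The sine/cosine pair is used instead of a raw month number because it keeps the distance between month 12 and month 1 as small as between any other adjacent months, which matters for distance-based clustering.
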
Applied three-stage feature selection to reduce dimensionality and multicollinearity:
Stage 1 - Variance Threshold:
- Removed features with variance < 0.01
- Eliminated quasi-constant features post one-hot encoding
- Reduced from 50+ encoded features to 40+ features
Stage 2 - Correlation Analysis:
- Identified highly correlated feature pairs (|r| > 0.8)
- Removed redundant features retaining business-relevant ones
- Example: Kept `balance_log` over raw `balance` due to skewness handling
Stage 3 - VIF Analysis:
- Calculated Variance Inflation Factors for multicollinearity detection
- Iteratively removed features with VIF > 10
- Final feature set: 33 features with low multicollinearity
Result: Reduced from 50+ features to 33 high-quality features while preserving 95%+ information content
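
The three stages above can be sketched as follows. This is a minimal illustration under stated assumptions, not the notebook's code: the helper names are hypothetical, the correlation step simply drops the later column of each correlated pair (the project instead kept the more business-relevant one), and VIF is computed from the diagonal of the inverse correlation matrix, which equals 1 / (1 - R²) for each feature, rather than with statsmodels.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold


def drop_low_variance(X: pd.DataFrame, threshold: float = 0.01) -> pd.DataFrame:
    """Stage 1: remove quasi-constant columns."""
    selector = VarianceThreshold(threshold=threshold).fit(X)
    return X.loc[:, selector.get_support()]


def drop_correlated(X: pd.DataFrame, limit: float = 0.8) -> pd.DataFrame:
    """Stage 2: for each pair with |r| > limit, drop the later column."""
    corr = X.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > limit).any()]
    return X.drop(columns=to_drop)


def vif(X: pd.DataFrame) -> pd.Series:
    """VIF per column via the diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(X.to_numpy(dtype=float), rowvar=False)
    return pd.Series(np.diag(np.linalg.inv(corr)), index=X.columns)


def drop_high_vif(X: pd.DataFrame, limit: float = 10.0) -> pd.DataFrame:
    """Stage 3: repeatedly drop the column with the largest VIF above limit."""
    X = X.copy()
    while X.shape[1] > 1:
        scores = vif(X)
        if scores.max() <= limit:
            break
        X = X.drop(columns=scores.idxmax())
    return X
```

Removing one feature at a time in stage 3 matters because dropping a single collinear column usually lowers the VIF of the columns it was correlated with, so recomputing after each removal avoids discarding more features than necessary.
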
- Numerical Features: StandardScaler normalization (mean=0, std=1)
- Categorical Features: One-hot encoding for job, marital, education, contact, month, day_of_week, poutcome
- Binary Features: Direct encoding for default, housing, loan, deposit
- Pipeline: Sklearn ColumnTransformer for consistent train/inference preprocessing
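
A preprocessing pipeline of this shape could look like the sketch below. The column lists are a shortened, assumed subset chosen for illustration, and `build_preprocessor` is a hypothetical helper name; the project's real column lists live in the notebook and the saved `preprocessor.pkl`.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Assumed, shortened column lists for illustration only
NUMERIC = ["age", "balance", "campaign"]
CATEGORICAL = ["job", "marital"]


def build_preprocessor() -> ColumnTransformer:
    """Scale numeric columns and one-hot encode categorical ones."""
    return ColumnTransformer(
        transformers=[
            ("num", StandardScaler(), NUMERIC),
            # Categories unseen at fit time become all-zero rows at inference
            ("cat", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL),
        ],
        remainder="drop",
        sparse_threshold=0,  # always return a dense array
    )
```

Fitting the transformer once on the training data and reusing the fitted object at inference keeps scaling parameters and one-hot columns identical between the two, which is the consistency the pipeline bullet refers to.
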
Methodological Note: The deposit target variable is included in clustering for descriptive segmentation (understanding existing customer behaviors) rather than predictive modeling. This approach creates actionable marketing personas based on actual conversion patterns.
Evaluated 3 clustering algorithms using multiple validation metrics:
| Algorithm | Silhouette Score ↑ | Davies-Bouldin Index ↓ | Calinski-Harabasz Score ↑ |
|---|---|---|---|
| K-Means | 0.1153 | 2.1270 | 475.38 |
| Hierarchical (Ward) | 0.1136 | 2.1459 | 468.42 |
| Gaussian Mixture | 0.1089 | 2.2031 | 451.29 |
Winner: K-Means (highest silhouette, lowest Davies-Bouldin)
Optimal k Selection:
- Elbow method: Diminishing returns after k=5
- Silhouette analysis: Peak at k=5 (score: 0.115)
- Davies-Bouldin: Lowest at k=5 (score: 2.13)
- Calinski-Harabasz: Stable plateau at k=5
- Business consideration: 5 clusters provide actionable granularity without oversegmentation
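
The metric sweep behind the k selection above can be sketched with scikit-learn. This is an illustrative loop under assumptions (the function name `score_k_range`, the `n_init` value, and the random seed are not taken from the project); it returns the four quantities the list refers to for each candidate k.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)


def score_k_range(X, k_values, random_state=42):
    """Fit K-Means for each k and collect internal validation metrics."""
    results = []
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = model.fit_predict(X)
        results.append({
            "k": k,
            "inertia": model.inertia_,                             # elbow method
            "silhouette": silhouette_score(X, labels),             # higher is better
            "davies_bouldin": davies_bouldin_score(X, labels),     # lower is better
            "calinski_harabasz": calinski_harabasz_score(X, labels),  # higher is better
        })
    return results
```

A typical call would be `score_k_range(X_selected, range(2, 11))`, where `X_selected` stands for the preprocessed, feature-selected matrix. The metrics rarely agree perfectly, so the final choice usually combines them with the business consideration noted above.
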
- ANOVA Tests: Confirmed statistically significant differences between clusters for all key features (p < 0.05)
- Cluster Stability: Consistent cluster assignments across multiple random initializations
- PCA Visualization: 2-component PCA explains 28.3% variance, shows reasonable cluster separation
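
The ANOVA check can be reproduced with `scipy.stats.f_oneway`, as in the sketch below. The helper name `anova_by_cluster` and the `cluster` column name are assumptions; the 0.05 threshold matches the one quoted above.

```python
import pandas as pd
from scipy.stats import f_oneway


def anova_by_cluster(df: pd.DataFrame, features, cluster_col: str = "cluster") -> pd.DataFrame:
    """One-way ANOVA per feature, comparing its values across clusters."""
    rows = []
    for feature in features:
        # One array of feature values per cluster
        groups = [g[feature].to_numpy() for _, g in df.groupby(cluster_col)]
        stat, p_value = f_oneway(*groups)
        rows.append({
            "feature": feature,
            "F": float(stat),
            "p_value": float(p_value),
            "significant": bool(p_value < 0.05),
        })
    return pd.DataFrame(rows)
```

Because the clusters were formed from these same features, significant differences are expected; the test confirms that the segments are distinguishable rather than serving as independent evidence.
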
| Cluster | Name | Size | % | Age (μ) | Balance (μ) | Duration (μ) | Campaign (μ) | Deposit Rate |
|---|---|---|---|---|---|---|---|---|
| 0 | Premium Engaged | 1,104 | 9.9% | 39.9 | $1,025 | 1,032 sec | 1.4 | 88.9% |
| 1 | High-Volume Prospects | 4,182 | 37.5% | 38.0 | $630 | 270 sec | 2.5 | 34.1% |
| 2 | Mature Balanced | 1,711 | 15.3% | 43.0 | $1,388 | 312 sec | 1.8 | 54.5% |
| 3 | Affluent Seniors | 1,279 | 11.5% | 48.6 | $7,086 | 354 sec | 2.5 | 59.6% |
| 4 | Campaign-Intensive | 2,886 | 25.9% | 42.0 | $644 | 310 sec | 3.4 | 41.2% |
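
A profile table like the one above can be derived from the labeled data with a pandas group-by. The sketch assumes a `cluster` column holding the K-Means labels and a `deposit` column already encoded as 0/1; `profile_clusters` is a hypothetical helper name.

```python
import pandas as pd


def profile_clusters(df: pd.DataFrame, cluster_col: str = "cluster") -> pd.DataFrame:
    """Per-cluster size, share, and mean/median summaries."""
    profile = df.groupby(cluster_col).agg(
        size=("age", "size"),
        age_mean=("age", "mean"),
        balance_mean=("balance", "mean"),
        balance_median=("balance", "median"),
        duration_mean=("duration", "mean"),
        campaign_mean=("campaign", "mean"),
        deposit_rate=("deposit", "mean"),  # mean of a 0/1 column is the conversion rate
    )
    profile["share_pct"] = 100 * profile["size"] / len(df)
    return profile
```

Reporting the median balance next to the mean is useful here because balances are strongly right-skewed, so the mean alone overstates what a typical customer in each segment holds.
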
Demographics: Average age 39.9 years
Financial Profile: Moderate balance ($1,025 mean, $539 median)
Behavioral Pattern:
- Exceptional call engagement: 17.2-minute average call duration (highest)
- Low campaign intensity: 1.4 contacts on average
- Strong previous contact history: 57% contacted before
- Highest conversion rate: 88.9% deposit subscription
Key Insight: This segment exhibits quality over quantity - long, engaged conversations lead to exceptional conversion. These are relationship-oriented customers who value personalized attention.
Demographics: Youngest active segment (38.0 years)
Financial Profile: Lower balance ($630 mean, $364 median)
Behavioral Pattern:
- Brief engagement: 4.5-minute average call duration (lowest)
- Moderate campaign intensity: 2.5 contacts
- Mostly first-time contacts: 78% no previous contact
- Lowest conversion: 34.1% deposit rate
Key Insight: Largest segment (37.5%) with untapped volume potential. Short call durations suggest transactional mindset or time constraints. Requires efficient, value-focused messaging.
Demographics: Mid-career professionals (43.0 years)
Financial Profile: Moderate-high balance ($1,388 mean, $771 median)
Behavioral Pattern:
- Moderate engagement: 5.2-minute average call duration
- Low campaign intensity: 1.8 contacts
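- Significant previous contact history: reported as 126%, a value above 100% that likely reflects average prior contacts per customer (about 1.26) rather than the share contacted before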
- Strong conversion: 54.5% deposit rate
Key Insight: Reliable, established customers with proven responsiveness. Previous campaign success indicates loyalty and financial stability. Ideal for relationship expansion.
Demographics: Oldest segment (48.6 years), nearing peak earning years
Financial Profile: Wealthiest segment ($7,086 mean, $5,154 median)
Behavioral Pattern:
- Moderate engagement: 5.9-minute average call duration
- Moderate campaign intensity: 2.5 contacts
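- Previous contact mix: reported as 109%, a value above 100% that likely reflects average prior contacts per customer (about 1.09) rather than the share contacted before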
- Strong conversion: 59.6% deposit rate
Key Insight: High-value targets with substantial assets and mature financial decision-making. Premium product offerings and wealth management focus appropriate.
Demographics: Mid-career professionals (42.0 years)
Financial Profile: Low balance ($644 mean, $333 median)
Behavioral Pattern:
- Moderate engagement: 5.2-minute average call duration
- Highest campaign intensity: 3.4 contacts (oversaturation signal)
- Low previous contact: 65% no prior contact
- Below-average conversion: 41.2% deposit rate
Key Insight: Diminishing returns from high contact frequency. Evidence of campaign fatigue: the highest contact count (3.4) yet the second-lowest conversion rate (41.2%). Requires a contact optimization strategy.
Target: Premium Engaged (88.9% conversion) + Affluent Seniors (59.6% conversion)
Combined Size: 21.4% of customer base | Projected Revenue Impact: Highest per-customer value
Tactical Actions:
- Cluster 0 - Relationship-Based Strategy:
  - Assign dedicated relationship managers for personalized service
  - Increase call duration allowance (target: 15-20 min per call)
  - Offer premium products: Wealth management, private banking, investment advisory
  - Channel: Phone-first approach with follow-up email summaries
  - Timing: Weekday afternoons (2-5 PM) when engagement is highest
  - Expected Outcome: Maintain 85%+ conversion rate, expand product adoption
- Cluster 3 - Affluent Product Focus:
  - Curate high-balance products: Fixed deposits (>$5K), retirement planning, insurance
  - Emphasize capital preservation and growth strategies
  - Provide market insights and quarterly financial reviews
  - Channel: Multi-channel (phone + email with investment reports)
  - Timing: End-of-quarter financial planning cycles
  - Expected Outcome: 60%+ conversion, increased deposit amounts
Budget Allocation: 35% of marketing spend | ROI: 8-12x (estimated based on conversion rates)
Target: High-Volume Prospects (37.5% of base, 34.1% conversion)
Opportunity: Largest untapped volume with room for conversion improvement
Tactical Actions:
- Optimize for Efficiency:
  - Develop 3-5 minute scripted value propositions (time-constrained audience)
  - Create self-service digital onboarding portal for quick signup
  - A/B test incentive structures: Sign-up bonuses, referral rewards, gamification
  - Channel: Hybrid (brief phone + SMS/email follow-ups + mobile app)
  - Timing: Evening/weekend outreach (accommodate working schedules)
- Conversion Rate Improvement Target: 34% → 42% (+8 percentage points)
  - Implement urgency-based campaigns (limited-time offers, seasonal promotions)
  - Reduce friction: Simplified KYC, instant approval processes
  - Expected Outcome: +330 additional conversions from existing segment
Budget Allocation: 30% of marketing spend | ROI: 4-6x (volume play)
Target: Campaign-Intensive (25.9% of base, 41.2% conversion)
Problem: Oversaturation (3.4 contacts) yielding diminishing returns
Tactical Actions:
- Reduce Contact Frequency:
  - Implement 90-day cooling-off period after 2 unsuccessful contacts
  - Shift from push (outbound calls) to pull (inbound lead generation) strategies
  - Test Hypothesis: Reduce contacts from 3.4 → 2.0, measure conversion change
- Re-engagement Strategy:
  - Personalized "We miss you" campaigns with exclusive offers
  - Segment by last contact outcome: Differentiate "not interested" vs "bad timing"
  - Channel: Low-pressure digital (email, SMS) before re-attempting calls
Budget Allocation: 15% of marketing spend (redirect savings to Priority 1-2)
Expected Outcome: Conversion improvement from 41% → 48% through smarter targeting
Target: Mature Balanced (15.3% of base, 54.5% conversion)
Strategy: Relationship expansion and loyalty reinforcement
Tactical Actions:
- Cross-Selling & Upselling:
  - Leverage the segment's strong prior-contact history and previous campaign success
  - Introduce tiered product ladder: Savings → Fixed Deposits → Investment Products
  - Channel: Email-first with phone follow-ups for high-value products
- Loyalty Program:
  - Reward existing deposit holders with rate bonuses, exclusive offerings
  - Create referral incentives (leverage established trust)
Budget Allocation: 20% of marketing spend | ROI: 5-7x (relationship extension)
| Cluster | Campaign Start | Frequency | Primary Channel | Target Conversion | Success Metric |
|---|---|---|---|---|---|
| 0 (Premium) | Q1 2026 | Quarterly | Phone + Email | 85-90% | Avg. deposit value >$3K |
| 1 (Volume) | Ongoing | Weekly batches | SMS/App + Phone | 40-45% | Total conversions >1,700 |
| 2 (Mature) | Q2 2026 | Bi-monthly | Email + Phone | 55-60% | Cross-sell rate 30% |
| 3 (Affluent) | Q1 2026 | Quarterly | Phone + Reports | 60-65% | Avg. deposit value >$8K |
| 4 (Fatigue) | Q2 2026 | Reduced cadence | Email → Phone | 45-50% | Contact reduction to 2.0 |
Baseline Performance: 47.4% overall deposit rate (5,289 conversions from 11,162 customers)
Optimized Performance (12-month projection):
- Cluster 0: 981 → 1,020 conversions (+39) | 88.9% → 92.4% implied, above the stated 88-90% maintenance range
- Cluster 1: 1,425 → 1,755 conversions (+330) | Improve 34% → 42%
- Cluster 2: 933 → 1,010 conversions (+77) | Improve 55% → 59%
- Cluster 3: 762 → 820 conversions (+58) | Improve 60% → 64%
- Cluster 4: 1,188 → 1,385 conversions (+197) | Improve 41% → 48%
Total Projected Impact: +701 additional conversions (a 13.3% relative increase in conversions, raising the overall deposit rate from 47.4% to 53.7%)
Revenue Estimate (assuming $3,000 average deposit value): $2.1M additional deposit volume
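
The projection arithmetic can be checked with a few lines of plain Python. The per-cluster counts and the $3,000 average deposit are copied from the figures above; the function name `summarize_projection` is only illustrative.

```python
# (baseline conversions, projected conversions) per cluster, from the projection above
PROJECTION = {
    0: (981, 1020),
    1: (1425, 1755),
    2: (933, 1010),
    3: (762, 820),
    4: (1188, 1385),
}
TOTAL_CUSTOMERS = 11_162
AVG_DEPOSIT_VALUE = 3_000  # assumed average deposit value, as stated above


def summarize_projection(projection, total_customers, avg_deposit):
    """Aggregate per-cluster conversions into overall uplift figures."""
    baseline = sum(before for before, _ in projection.values())
    projected = sum(after for _, after in projection.values())
    uplift = projected - baseline
    return {
        "baseline": baseline,
        "projected": projected,
        "uplift": uplift,
        "relative_uplift_pct": round(100 * uplift / baseline, 1),
        "baseline_rate_pct": round(100 * baseline / total_customers, 1),
        "projected_rate_pct": round(100 * projected / total_customers, 1),
        "added_deposit_volume": uplift * avg_deposit,
    }


summary = summarize_projection(PROJECTION, TOTAL_CUSTOMERS, AVG_DEPOSIT_VALUE)
print(summary)
```

The 13.3% figure is a relative increase in the number of conversions (701 on a base of 5,289), not a 13.3 percentage-point rise in the deposit rate.
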
The inference.py script enables real-time customer segmentation for new prospects:
# Usage Example
import pandas as pd
from inference import load_models, predict_clusters, engineer_features
# Load trained artifacts
preprocessor, variance_selector, model, feature_info = load_models()
# Score new customers
new_customers = pd.read_csv('new_customer_data.csv')
clusters = predict_clusters(new_customers, preprocessor, variance_selector, model, feature_info)
# Output: Cluster assignments for targeted marketing
Production Integration:
- CRM Integration: Export clustered data to Salesforce/HubSpot for campaign execution
- Real-Time Scoring: API-ify inference pipeline for instant lead scoring
- Monitoring: Track cluster drift over time, retrain quarterly
- A/B Testing: Validate recommended strategies with controlled experiments
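
For the A/B testing step, one common way to compare conversion between a control group and a group receiving a cluster-specific strategy is a two-proportion z-test. The sketch below is a generic standard-library implementation, not something taken from this project, and the counts in the usage line are hypothetical.

```python
import math


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Two-sided z-test for a difference in conversion rates (B minus A)."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis of no difference
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical pilot: 340/1,000 conversions in control vs 420/1,000 in treatment
z, p = two_proportion_z_test(340, 1000, 420, 1000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

Randomizing customers within a cluster into control and treatment, rather than comparing across clusters, keeps the segment's baseline behavior from being mistaken for a strategy effect.
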
Model Artifacts (saved in output/):
- `preprocessor.pkl`: Feature transformation pipeline
- `variance_selector.pkl`: Feature selection model
- `clustering_model.pkl`: K-Means model (k=5)
- `feature_info.pkl`: Metadata (optimal k, algorithm, feature list)
- Feature Engineering: Expanded from 17 → 32 features through domain-driven engineering
- Feature Selection: Reduced dimensionality from 50+ → 33 features using unsupervised methods
- Algorithm Selection: K-Means outperformed Hierarchical and GMM on multiple metrics
- Optimal Clustering: 5 clusters identified through holistic metric analysis
- Statistical Validation: ANOVA tests confirmed significant inter-cluster differences (p < 0.05)
- Reproducibility: All models and artifacts saved for production deployment
- Conversion Rate Disparity: 88.9% (Cluster 0) vs 34.1% (Cluster 1), a 2.6x difference
- Wealth Concentration: Cluster 3 accounts for 11.5% of customers but holds the highest average balance ($7,086)
- Volume Opportunity: 37.5% of base (Cluster 1) with 8 percentage points of conversion upside
- Relationship Value: The segment with the longest calls (17 min average) converts at 88.9%
- Invest in high-touch service for Clusters 0 & 3 (21% of base, 73% combined conversion rate)
- Optimize digital efficiency for Cluster 1 (38% of base, volume conversion opportunity)
- Reduce contact frequency for Cluster 4 (26% of base, combat campaign fatigue)
- Expand product offerings for Cluster 2 (15% of base, cross-sell potential)
- Projected 13.3% increase in total conversions through segmented strategies
- Present findings to marketing and product teams
- Design A/B test frameworks for each cluster strategy
- Integrate inference pipeline with CRM system
- Create cluster-specific campaign templates
- Launch pilot campaigns for Priority 1 clusters (0 & 3)
- Implement campaign frequency optimization for Cluster 4
- Develop digital self-service portal for Cluster 1
- Establish cluster monitoring dashboard (track drift)
- Validate the projected 13.3% increase in conversions
- Expand feature set with transaction data, web behavior
- Explore predictive modeling for churn, lifetime value
- Retrain clustering model quarterly with new data
- Dataset: Moro et al. (2014) Bank Marketing Dataset, UCI Machine Learning Repository
- Clustering Validation: Rousseeuw (1987) Silhouette Score, Davies & Bouldin (1979) DB Index
- Feature Selection: Kuhn & Johnson (2013) Applied Predictive Modeling
- Business Strategy: Kumar & Rajan (2009) Customer Profitability Management