This project builds a credit default prediction system that estimates the probability of default for customers using historical repayment and financial behavior.
The focus is not just classification, but:
- Ranking users by risk
- Maximizing recall of defaulters
- Producing calibrated probability estimates for decision-making
Given customer credit history, predict the likelihood of default in the next month.
This is a risk-sensitive problem where:
- Missing a defaulter → financial loss
- False positives → operational cost
Therefore, the system prioritizes:
- High recall for defaulters
- Efficient identification of high-risk users
The dataset contains:
- PAY_X: Repayment status (delay behavior)
- BILL_AMT_X: Monthly bill amounts (debt)
- PAY_AMT_X: Payments made
Target:
- default (1 = default, 0 = non-default)
The dataset is imbalanced, making accuracy an unreliable metric.
Constructed behavior-driven features:
- recent_delay → recent repayment severity
- delay_trend → worsening payment behavior
- utilization → credit usage ratio
- pay_ratio → payment discipline
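The features above could be derived roughly as follows. This is a minimal sketch, not the notebook's actual code: the column layout (PAY_1..PAY_6 with suffix 1 as the most recent month, BILL_AMT_1, PAY_AMT_1), the LIMIT_BAL credit-limit column, and the exact formulas are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def build_features(df: pd.DataFrame) -> pd.DataFrame:
    """Derive behavior-driven features from raw monthly columns.

    Assumed (hypothetical) layout: PAY_1..PAY_6 where suffix 1 is the most
    recent month, BILL_AMT_1, PAY_AMT_1, and a LIMIT_BAL credit-limit column.
    """
    pay_cols = [f"PAY_{i}" for i in range(1, 7)]
    out = pd.DataFrame(index=df.index)

    # Recent repayment severity: worst delay status over the last two months.
    out["recent_delay"] = df[["PAY_1", "PAY_2"]].max(axis=1)

    # Positive when the three most recent months are more delayed on average
    # than the three older months, i.e. payment behavior is worsening.
    recent = df[pay_cols[:3]].mean(axis=1)
    older = df[pay_cols[3:]].mean(axis=1)
    out["delay_trend"] = recent - older

    # Share of the credit limit in use; a zero limit yields 0 instead of inf.
    limit = df["LIMIT_BAL"].replace(0, np.nan)
    out["utilization"] = (df["BILL_AMT_1"] / limit).fillna(0.0)

    # Fraction of the latest bill that was paid, clipped to [0, 1].
    # Users with no positive bill are treated as fully paid (assumption).
    bill = df["BILL_AMT_1"].where(df["BILL_AMT_1"] > 0)
    out["pay_ratio"] = (df["PAY_AMT_1"] / bill).clip(0, 1).fillna(1.0)
    return out
```

Treating a missing or zero bill as "fully paid" is a judgment call; a separate indicator column may be preferable if such users behave differently.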
- Logistic Regression (baseline, interpretable)
- XGBoost (final model, non-linear patterns)
Instead of the default 0.5 threshold:
- Lower threshold → higher recall
- Selected threshold based on business trade-off
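One way to pick such a threshold is to fix a recall target and take the highest cutoff that meets it. The sketch below is illustrative only: the function name `threshold_for_recall` and the 0.80 default target are assumptions, and the actual business trade-off used in the notebook may differ.

```python
import numpy as np


def threshold_for_recall(y_true, y_prob, target_recall=0.80):
    """Return the highest threshold whose recall on defaulters meets the target."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob, dtype=float)
    n_pos = int((y_true == 1).sum())
    if n_pos == 0:
        raise ValueError("y_true contains no defaulters")

    # Try every distinct score as a cutoff, from strictest to most lenient.
    for t in np.sort(np.unique(y_prob))[::-1]:
        flagged = y_prob >= t
        recall = (flagged & (y_true == 1)).sum() / n_pos
        if recall >= target_recall:
            return float(t)
    # Only reached if target_recall > 1; flag everyone in that case.
    return float(y_prob.min())
```

The threshold should be chosen on a validation split, not on the data used for the final evaluation, to avoid an optimistic recall estimate.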
Focused on:
- Recall (defaulters)
- Precision
- Confusion matrix
Accuracy was not used as a primary metric.
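The metrics above can be computed at any chosen threshold with a few lines of NumPy. This is a sketch with a hypothetical helper name (`evaluate_at_threshold`); the confusion matrix uses the `[[TN, FP], [FN, TP]]` layout.

```python
import numpy as np


def evaluate_at_threshold(y_true, y_prob, threshold):
    """Confusion matrix, recall and precision for the defaulter class."""
    actual = np.asarray(y_true).astype(bool)
    flagged = np.asarray(y_prob, dtype=float) >= threshold

    tp = int((flagged & actual).sum())    # defaulters correctly flagged
    fp = int((flagged & ~actual).sum())   # non-defaulters flagged (operational cost)
    fn = int((~flagged & actual).sum())   # missed defaulters (financial loss)
    tn = int((~flagged & ~actual).sum())  # non-defaulters correctly left alone

    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return {
        "confusion_matrix": [[tn, fp], [fn, tp]],
        "recall": recall,
        "precision": precision,
    }
```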
- Top 10% users capture ~68% of total defaulters
- Recall improved to ~80%
- XGBoost achieved higher precision at similar recall compared to logistic regression
- Model effectively ranks users by risk
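The "top 10% capture" figure is a ranking metric: sort users by predicted risk, take the top fraction, and count the share of all defaulters that fall inside it. A minimal sketch, assuming a hypothetical helper `capture_rate`:

```python
import numpy as np


def capture_rate(y_true, y_prob, top_frac=0.10):
    """Share of all defaulters found among the top_frac highest-risk users."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob, dtype=float)

    # Number of users in the targeted group (at least one).
    n_top = max(1, int(round(top_frac * len(y_prob))))

    # Indices of the highest-scoring users, highest risk first.
    top_idx = np.argsort(-y_prob, kind="stable")[:n_top]

    total_defaulters = y_true.sum()
    if total_defaulters == 0:
        return 0.0
    return float(y_true[top_idx].sum() / total_defaulters)
```

Dividing the result by `top_frac` gives the lift over random targeting, since a random group of that size is expected to contain the same fraction of defaulters.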
Initial model showed overconfidence in probability estimates.
Applied isotonic calibration:
- Improved probability reliability
- Did not affect ranking performance (the isotonic mapping is monotonic, so the ordering of users is preserved up to ties)
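A minimal sketch of isotonic calibration using scikit-learn's `IsotonicRegression`, fitted on held-out scores. The helper name and the use of a separate validation split are assumptions; the notebook may instead wrap the model in another way.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression


def fit_isotonic_calibrator(raw_prob_val, y_val):
    """Fit a monotonic map from raw model scores to calibrated probabilities.

    raw_prob_val: uncalibrated scores on a held-out validation set.
    y_val: observed outcomes for the same rows (1 = default, 0 = non-default).
    """
    calibrator = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
    calibrator.fit(
        np.asarray(raw_prob_val, dtype=float),
        np.asarray(y_val, dtype=float),
    )
    return calibrator


# Usage (names are placeholders):
#   calibrator = fit_isotonic_calibrator(model_scores_val, y_val)
#   calibrated = calibrator.predict(model_scores_test)
```

Because the fitted map is non-decreasing, a user with a higher raw score never receives a lower calibrated probability, although distinct raw scores can be mapped to the same value.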
- Repayment delay is the strongest predictor of default
- Behavioral signals outperform static financial metrics
- Model is fragile without key delay features (validated via ablation)
- Ranking users is more valuable than binary classification
The model enables:
- Prioritization of high-risk customers
- Efficient allocation of risk management resources
- Estimation of expected default rates
Example:
- Top 10% high-risk users → ~68% of defaulters
→ Random targeting of 10% of users would capture about 10% of defaulters, so this is roughly a 6.8× improvement
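Expected default rates follow directly from calibrated probabilities: the expected number of defaults in a group is the sum of its members' probabilities. The sketch below splits users into equal-sized risk bands; the helper name and the choice of ten bands are assumptions for illustration.

```python
import numpy as np


def expected_defaults_by_band(y_prob, n_bands=10):
    """Per risk band (highest risk first), return a tuple of
    (band size, expected number of defaults, expected default rate)."""
    # Sort calibrated probabilities from highest to lowest risk.
    ordered = np.sort(np.asarray(y_prob, dtype=float))[::-1]

    # Split into n_bands groups of (nearly) equal size.
    bands = np.array_split(ordered, n_bands)
    return [
        (len(band), float(band.sum()), float(band.mean()))
        for band in bands
        if len(band) > 0
    ]
```

These estimates are only as good as the calibration: with overconfident raw scores, the summed probabilities would overstate defaults in the top bands.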
- Python
- Pandas, NumPy
- Scikit-learn
- XGBoost
- Matplotlib / Seaborn
├── notebook.ipynb # Full analysis and modeling
├── credit_model.pkl # Saved calibrated model
├── README.md
├── requirements.txt
git clone <repo_url>
cd <repo>
pip install -r requirements.txt
jupyter notebook
- Add more temporal features
- Improve robustness beyond single dominant feature
- Deploy as real-time scoring API
- Integrate with live financial data
This project demonstrates a real-world credit risk modeling pipeline, including:
- Feature engineering from behavioral data
- Model comparison and threshold tuning
- Risk-based ranking
- Probability calibration
The system functions as an early warning mechanism for identifying high-risk customers.