Credit Risk Modeling

Overview

This project builds a credit default prediction system that estimates the probability of default for customers using historical repayment and financial behavior.

The focus extends beyond binary classification to:

Ranking users by risk
Maximizing recall of defaulters
Producing calibrated probability estimates for decision-making

Problem Statement

Given customer credit history, predict the likelihood of default in the next month.

This is a risk-sensitive problem where:

Missing a defaulter → financial loss
False positives → operational cost

Therefore, the system prioritizes:

High recall for defaulters
Efficient identification of high-risk users

Dataset

The dataset contains:

PAY_X: Repayment status (delay behavior)
BILL_AMT_X: Monthly bill amounts (debt)
PAY_AMT_X: Payments made

Target:

default (1 = default, 0 = non-default)

The dataset is imbalanced, making accuracy an unreliable metric.
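A minimal sketch of why accuracy misleads on imbalanced data. The 22% default rate and the 1,000-customer sample are illustrative assumptions, not figures taken from this dataset.

```python
import numpy as np

# Hypothetical labels: 1,000 customers, 22% defaulters (illustrative rate).
y_true = np.array([1] * 220 + [0] * 780)

# A trivial "model" that predicts non-default for everyone.
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()   # 0.78
recall = y_pred[y_true == 1].mean()    # 0.0, no defaulter is flagged

print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")
```

The do-nothing model scores 78% accuracy while catching no defaulters, which is why recall and precision are tracked instead.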

Approach

1. Feature Engineering

Constructed behavior-driven features:

recent_delay → recent repayment severity
delay_trend → worsening payment behavior
utilization → credit usage ratio
pay_ratio → payment discipline
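The features above could be derived along these lines. This is a sketch only: the exact formulas used in the notebook are not stated here, so every definition below is an assumption, as are the column suffixes (PAY_1 as the most recent month, PAY_3 as the oldest in the window) and the LIMIT_BAL credit-limit column.

```python
import numpy as np
import pandas as pd

def add_behavior_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add assumed versions of the four behavior features to a copy of df."""
    out = df.copy()
    # Most recent repayment status; negative codes (paid early/on time) count as no delay.
    out["recent_delay"] = out["PAY_1"].clip(lower=0)
    # Positive when the latest delay is worse than the oldest month in the window.
    out["delay_trend"] = out["PAY_1"] - out["PAY_3"]
    # Latest bill relative to the credit limit (LIMIT_BAL is an assumed column).
    out["utilization"] = out["BILL_AMT1"] / out["LIMIT_BAL"].replace(0, np.nan)
    # Share of the latest bill that was paid; left as NaN when nothing was billed.
    out["pay_ratio"] = out["PAY_AMT1"] / out["BILL_AMT1"].where(out["BILL_AMT1"] > 0)
    return out

# Tiny hypothetical sample to show the shape of the output.
sample = pd.DataFrame({
    "PAY_1": [2, 0, -1],
    "PAY_3": [0, 0, 1],
    "BILL_AMT1": [5000.0, 2000.0, 0.0],
    "PAY_AMT1": [500.0, 2000.0, 0.0],
    "LIMIT_BAL": [10000.0, 20000.0, 5000.0],
})
features = add_behavior_features(sample)
print(features[["recent_delay", "delay_trend", "utilization", "pay_ratio"]])
```

Leaving pay_ratio as NaN for zero bills keeps "nothing owed" distinct from "nothing paid"; how to impute it is a separate modeling choice.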

2. Models Used

Logistic Regression (baseline, interpretable)
XGBoost (final model, non-linear patterns)

3. Threshold Optimization

Instead of the default 0.5 threshold:

Lower threshold → higher recall
Selected threshold based on business trade-off
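One way to turn a recall target into a threshold is sketched below. The helper name threshold_for_recall, the toy scores, and the 0.8 target are hypothetical; the project's actual selection rule is described only as a business trade-off.

```python
import numpy as np

def threshold_for_recall(y_true, scores, target_recall):
    """Return the highest threshold whose recall on defaulters meets the target."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    # Candidate thresholds are the observed scores, tried from highest to lowest.
    for t in np.unique(scores)[::-1]:
        flagged = scores >= t
        recall = flagged[y_true == 1].mean()
        if recall >= target_recall:
            return float(t)
    # Fallback: flag everyone if the target was never reached.
    return float(scores.min())

# Toy example: 5 defaulters among 10 customers.
y_true = [0, 0, 1, 0, 1, 1, 0, 1, 1, 0]
scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.45, 0.50, 0.60, 0.80, 0.15]

threshold = threshold_for_recall(y_true, scores, target_recall=0.8)
print(threshold)  # 0.35
```

Choosing the highest qualifying threshold keeps the number of flagged non-defaulters as small as the recall target allows.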

4. Model Evaluation

Focused on:

Recall (defaulters)
Precision
Confusion matrix

Accuracy was not used as a primary metric.
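The metrics above can be computed at any chosen threshold. The following sketch uses plain NumPy with made-up scores; evaluate_at_threshold is a hypothetical helper, not a function from this repository.

```python
import numpy as np

def evaluate_at_threshold(y_true, scores, threshold):
    """Confusion matrix, recall, and precision for the defaulter class."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(scores, dtype=float) >= threshold).astype(int)
    tp = int(((y_pred == 1) & (y_true == 1)).sum())
    fp = int(((y_pred == 1) & (y_true == 0)).sum())
    fn = int(((y_pred == 0) & (y_true == 1)).sum())
    tn = int(((y_pred == 0) & (y_true == 0)).sum())
    return {
        # Rows are actual (non-default, default); columns are predicted.
        "confusion": [[tn, fp], [fn, tp]],
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
    }

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.4, 0.2, 0.6, 0.1, 0.3, 0.05, 0.7]

report = evaluate_at_threshold(y_true, scores, threshold=0.3)
print(report)  # recall 0.75, precision 0.6
```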

Key Results

The top 10% of users by predicted risk capture ~68% of all defaulters
Recall on defaulters improved to ~80%
XGBoost achieved higher precision than logistic regression at similar recall
The model ranks users effectively by risk
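A top-decile capture figure like the one above can be computed as follows. The data here is synthetic and capture_rate is a hypothetical helper, so the numbers it prints are not the project's results.

```python
import numpy as np

def capture_rate(y_true, scores, top_fraction=0.10):
    """Share of all defaulters found in the top-scored fraction of users.

    Assumes y_true contains at least one defaulter.
    """
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    n_top = max(1, int(round(top_fraction * len(scores))))
    # Indices of the highest-scored users.
    top_idx = np.argsort(-scores, kind="stable")[:n_top]
    return y_true[top_idx].sum() / y_true.sum()

# 20 synthetic users sorted by descending score, 5 of them defaulters.
y_true = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
scores = np.linspace(0.95, 0.0, 20)

rate = capture_rate(y_true, scores, top_fraction=0.10)
lift = rate / 0.10
print(rate, lift)  # 0.4 and 4.0 for this toy data
```

Dividing the capture rate by the targeted fraction gives the lift over random selection.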

Calibration

The initial model produced overconfident probability estimates.

Applied isotonic calibration:

Improved probability reliability
Did not affect ranking performance (isotonic regression is a monotonic mapping, so score order is preserved up to ties)
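A sketch of isotonic calibration on synthetic, deliberately overconfident scores. It uses scikit-learn's IsotonicRegression directly; whether the project wraps the model this way or through another calibration utility is not stated, so treat the setup as an assumption.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)

# Synthetic ground truth: each user has a true default probability.
true_prob = rng.uniform(0.02, 0.6, size=5000)
y = rng.binomial(1, true_prob)

# Simulate overconfidence by stretching the logits away from zero.
logit = np.log(true_prob / (1 - true_prob))
raw_scores = 1 / (1 + np.exp(-2.5 * logit))

# Fit the calibrator on one half, apply it to the held-out half.
fit_part, holdout = slice(0, 2500), slice(2500, None)
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
iso.fit(raw_scores[fit_part], y[fit_part])
calibrated = iso.predict(raw_scores[holdout])

# Calibrated values never decrease as the raw score increases,
# so the risk ordering is kept (some users may become tied).
order = np.argsort(raw_scores[holdout])
print(np.all(np.diff(calibrated[order]) >= -1e-12))
```

Fitting the calibrator on data the model was not trained on is the usual precaution against calibrating to overfit scores.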

Key Insights

Repayment delay is the strongest predictor of default
Behavioral signals outperform static financial metrics
Model is fragile without key delay features (validated via ablation)
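An ablation check of the kind mentioned above might look like this. The data is synthetic, the feature names are stand-ins, and logistic regression replaces XGBoost to keep the sketch dependency-light, so it illustrates the procedure rather than reproducing the project's result.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 4000

# Synthetic stand-ins: one dominant delay feature and one weaker feature.
recent_delay = rng.normal(size=n)
utilization = rng.normal(size=n)
logit = -1.5 + 2.0 * recent_delay + 0.3 * utilization
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

X = np.column_stack([recent_delay, utilization])
names = ["recent_delay", "utilization"]
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

def auc_without(drop=None):
    """Held-out AUC after retraining with one feature removed (or none)."""
    keep = [i for i, name in enumerate(names) if name != drop]
    model = LogisticRegression().fit(X_tr[:, keep], y_tr)
    return roc_auc_score(y_te, model.predict_proba(X_te[:, keep])[:, 1])

auc_full = auc_without()
auc_no_delay = auc_without("recent_delay")
print(f"full={auc_full:.3f}, without recent_delay={auc_no_delay:.3f}")
```

A large drop after removing a single feature is the signature of the fragility described above.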
Ranking users is more valuable than binary classification

Business Impact

The model enables:

Prioritization of high-risk customers
Efficient allocation of risk management resources
Estimation of expected default rates

Example:

Top 10% high-risk users → ~68% of defaulters
→ Roughly 6.8x the ~10% of defaulters that random targeting of the same share would reach
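Estimating expected default rates relies on the probabilities being calibrated: under that assumption, the expected number of defaults in a group is the sum of its members' probabilities. The values below are made up for illustration.

```python
import numpy as np

# Hypothetical calibrated default probabilities for six customers.
calibrated_probs = np.array([0.02, 0.05, 0.10, 0.40, 0.70, 0.03])

# Expected number of defaults is the sum of individual probabilities.
expected_defaults = calibrated_probs.sum()   # about 1.30
# Expected default rate is the mean probability.
expected_rate = calibrated_probs.mean()      # about 0.217

print(f"expected defaults={expected_defaults:.2f}, rate={expected_rate:.3f}")
```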

Tech Stack

Python
Pandas, NumPy
Scikit-learn
XGBoost
Matplotlib / Seaborn

Project Structure

├── notebook.ipynb       # Full analysis and modeling
├── credit_model.pkl     # Saved calibrated model
├── README.md
└── requirements.txt

How to Run

git clone <repo_url>
cd <repo>
pip install -r requirements.txt
jupyter notebook

Future Improvements

Add more temporal features
Improve robustness beyond a single dominant feature
Deploy as real-time scoring API
Integrate with live financial data

Conclusion

This project demonstrates a real-world credit risk modeling pipeline, including:

Feature engineering from behavioral data
Model comparison and threshold tuning
Risk-based ranking
Probability calibration

The system functions as an early warning mechanism for identifying high-risk customers.
