Gestational Diabetes Mellitus (GDM) Prediction

📌 Project Overview

This repository implements a complete machine learning workflow for the prediction of Gestational Diabetes Mellitus (GDM) using clinical and demographic data.
The dataset consists mainly of categorical, ordinal, and binary variables, so preprocessing and model selection must account for non-numeric features.

The project focuses on:

Proper handling of categorical data
Class imbalance mitigation
Robust model evaluation
Comparative analysis of multiple machine learning models

📊 Dataset Description

Target variable: GDM (Binary: 0 / 1)
Features: Mostly categorical and ordinal clinical variables
Dataset size: ~3,000 records
Class distribution: Imbalanced
Missing values: Minimal or none

⚙️ Data Preprocessing

The following preprocessing steps were performed:

Exploratory data analysis (EDA)
Unique value and frequency analysis per column
Handling mixed-type categorical values
Data type validation
Stratified data splitting:
- 80% Training
- 20% Testing

Class distribution was preserved during splitting using stratified sampling.
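
As an illustration of the stratified 80/20 split, here is a minimal sketch using scikit-learn's `train_test_split` with the `stratify` argument. The arrays `X` and `y` are synthetic stand-ins (not the GDM data), and the seed value 42 is an assumption rather than the value used in the notebooks.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 1,000 rows, 20% positive class.
rng = np.random.default_rng(42)
X = rng.integers(0, 3, size=(1000, 5))  # five hypothetical categorical codes
y = np.array([0] * 800 + [1] * 200)     # hypothetical binary labels

# stratify=y keeps the class ratio the same in both partitions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)

print("train positive rate:", y_train.mean())
print("test positive rate:", y_test.mean())
```

Both printed rates should match the overall positive rate of the input labels, which is what "class distribution was preserved" means in practice.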

⚖️ Class Imbalance Handling

To address the imbalance in the GDM target variable, the following sampling strategies were compared, each applied to the training set only:

No sampling (baseline)
Random OverSampling
SMOTE
Random UnderSampling

The test set remained unchanged to avoid data leakage and ensure fair evaluation.
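
To make the "training set only" rule concrete, the sketch below implements random oversampling in plain Python. It is an illustrative stand-in, not the notebooks' code: the function name `random_oversample` and the toy rows are hypothetical. In practice, the imbalanced-learn package offers `RandomOverSampler`, `SMOTE`, and `RandomUnderSampler`, each with a `fit_resample` method.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=42):
    """Duplicate rows of smaller classes at random until every class
    has as many rows as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, l in zip(rows, labels) if l == cls]
        extra = [rng.choice(pool) for _ in range(target - n)]
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels

# Hypothetical training split; the test split is never passed to the sampler.
train_rows = [[i % 3, i % 2] for i in range(10)]
train_labels = [0] * 8 + [1] * 2

bal_rows, bal_labels = random_oversample(train_rows, train_labels)
print(Counter(bal_labels))
```

Because the sampler only ever sees the training rows, no duplicated or synthetic record can end up in the data used for evaluation.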

🤖 Machine Learning Models

The following models were implemented and evaluated:

Logistic Regression
Support Vector Machine (SVM)
Decision Tree
Random Forest (RF)
Gradient Boosting Decision Trees (GBDT)
XGBoost
CatBoost
Generalized Additive Model (GAM)

Special emphasis was placed on CatBoost because it supports categorical features natively, without explicit encoding.

🔁 Model Evaluation Strategy

1️⃣ Train/Test Evaluation

Stratified 80/20 split
Evaluation metrics:
- Accuracy
- Precision
- Recall
- F1-score
- ROC-AUC
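
For reference, the metrics listed above can be computed from the confusion counts as in the following plain-Python sketch. The helper names and the toy labels and scores are hypothetical; the AUC here uses the pairwise-ranking definition (the probability that a random positive scores higher than a random negative, with ties counted as one half).

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (positive class = 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example with hypothetical labels, hard predictions, and scores.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1, 0.1, 0.2, 0.3]

metrics = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(metrics, auc)
```

On imbalanced data, accuracy alone can look high while recall on the positive class is poor, which is why the threshold-based metrics are reported together with ROC-AUC.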

2️⃣ Cross-Validation

Stratified K-Fold Cross-Validation (k = 10)
Mean and standard deviation reported for all metrics
Used to assess model stability and robustness
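
A minimal sketch of stratified 10-fold cross-validation with mean and standard deviation per metric is shown below. It uses scikit-learn's `StratifiedKFold` and `cross_validate` on synthetic data; the choice of logistic regression, the seed, and the dataset shape are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic, imbalanced stand-in for the real dataset.
X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.8, 0.2], random_state=42
)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scoring = ["accuracy", "precision", "recall", "f1", "roc_auc"]

results = cross_validate(
    LogisticRegression(max_iter=1000), X, y, cv=cv, scoring=scoring
)

# Mean and standard deviation across the 10 folds for each metric.
summary = {
    m: (results[f"test_{m}"].mean(), results[f"test_{m}"].std())
    for m in scoring
}
for metric, (mean, std) in summary.items():
    print(f"{metric}: {mean:.3f} ± {std:.3f}")
```

A small standard deviation across folds suggests the model's performance does not depend strongly on which records land in the validation fold.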

All evaluation results were stored in structured dictionaries and exported as CSV files.
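
One possible shape for this step is sketched below with the standard-library `csv` module. The model names and metric values are placeholders, not results from this project, and the sketch writes to an in-memory buffer; passing a file opened with `open(path, "w", newline="")` should work the same way.

```python
import csv
import io

# Placeholder results: hypothetical model names and dummy values.
results = {
    "model_a": {"accuracy": 0.0, "roc_auc": 0.0},
    "model_b": {"accuracy": 0.0, "roc_auc": 0.0},
}

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["model", "accuracy", "roc_auc"])
writer.writeheader()
for model_name, metrics in results.items():
    # One row per model: the dictionary key becomes the "model" column.
    writer.writerow({"model": model_name, **metrics})

csv_text = buffer.getvalue()
print(csv_text)
```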

📈 Visualization and Analysis

The following visual analyses were performed:

ROC-AUC comparison plots for all models
Separate plots for:
- Test set AUC
- K-Fold mean AUC
Feature importance analysis (Top 15 features) using:
- Random Forest
- GBDT
Categorical correlation analysis using Cramér’s V
Heatmaps for inter-feature dependency analysis
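
For readers unfamiliar with Cramér's V, the sketch below computes it for two categorical columns in plain Python: V = sqrt(chi² / (n · (min(r, c) − 1))), where chi² is the Pearson chi-squared statistic of the r × c contingency table. This is the uncorrected form; whether the notebooks apply a bias correction is not stated here, and the function name and toy columns are hypothetical.

```python
import math
from collections import Counter

def cramers_v(x, y):
    """Uncorrected Cramér's V between two equal-length categorical sequences."""
    n = len(x)
    joint = Counter(zip(x, y))   # observed cell counts
    row = Counter(x)             # marginal counts for x
    col = Counter(y)             # marginal counts for y
    chi2 = 0.0
    for a in row:
        for b in col:
            expected = row[a] * col[b] / n
            observed = joint.get((a, b), 0)
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(row), len(col)) - 1
    if k == 0:
        return 0.0  # a constant column carries no association
    return math.sqrt(chi2 / (n * k))

# Perfectly associated columns and independent columns (toy data).
same = ["a", "b"] * 10
v_perfect = cramers_v(same, same)
v_indep = cramers_v(["a", "a", "b", "b"] * 5, ["u", "v", "u", "v"] * 5)
print(v_perfect, v_indep)
```

Values near 1 indicate that one feature nearly determines the other, while values near 0 indicate little association; computing V for every feature pair yields the matrix that a heatmap can display.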

🔍 Feature Importance

Feature importance was extracted from tree-based models to identify the most influential predictors of GDM.
Consistency of important features across different models was also examined.
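
As a rough illustration of extracting a top-15 ranking from a tree-based model, the sketch below fits scikit-learn's `RandomForestClassifier` on synthetic data and sorts its `feature_importances_` attribute. The feature names, seed, and hyperparameters are assumptions, not the project's settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in with 20 features, 5 of them informative.
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=42
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # hypothetical names

model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Impurity-based importances; sort descending and keep the top 15.
importances = model.feature_importances_
order = np.argsort(importances)[::-1][:15]
top15 = [(feature_names[i], float(importances[i])) for i in order]

for name, score in top15:
    print(f"{name}: {score:.4f}")
```

Repeating the same ranking with a second model (for example, gradient boosting) and comparing the two lists is one simple way to examine cross-model consistency.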

📁 Outputs

The pipeline generates multiple output files:

Model performance summaries (CSV)
Cross-validation statistics (CSV)
Feature importance rankings (CSV)
ROC-AUC comparison plots
Correlation heatmaps

These outputs are suitable for reports, theses, and scientific publications.

🧪 Reproducibility

Fixed random seeds used throughout the project
Stratified sampling applied consistently
Modular and reusable evaluation framework

📌 Conclusion

This repository provides a reproducible and extensible machine learning framework for GDM prediction using categorical clinical data.
The workflow emphasizes proper preprocessing, robust evaluation, and interpretability, making it suitable for academic and applied research.

🚀 Future Improvements

Hyperparameter tuning
Threshold optimization for clinical sensitivity
SHAP-based explainability
External dataset validation

🔗 Publication

This work is associated with the following peer-reviewed publication:

The Power of Machine Learning Models in Predicting Gestational Diabetes Mellitus
Published in BMC Pregnancy and Childbirth (Springer Nature)
DOI: https://doi.org/10.1186/s12884-026-08856-1
