This repository implements a complete machine learning workflow for the prediction of Gestational Diabetes Mellitus (GDM) using clinical and demographic data.
The dataset consists mainly of categorical, ordinal, and binary variables, so preprocessing and model selection are tailored to categorical data.
The project focuses on:
- Proper handling of categorical data
- Class imbalance mitigation
- Robust model evaluation
- Comparative analysis of multiple machine learning models
The dataset has the following characteristics:
- Target variable: GDM (binary: 0 / 1)
- Features: Mostly categorical and ordinal clinical variables
- Dataset size: ~3,000 records
- Class distribution: Imbalanced
- Missing values: Minimal or none
The following preprocessing steps were performed:
- Exploratory data analysis (EDA)
- Unique value and frequency analysis per column
- Handling mixed-type categorical values
- Data type validation
- Stratified data splitting:
  - 80% training
  - 20% testing
Class distribution was preserved during splitting using stratified sampling.
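
The stratified 80/20 split described above can be sketched as follows. This is a minimal illustration using scikit-learn's `train_test_split`; the synthetic data, the 15% positive rate, and the seed value `42` are assumptions for the example, not details taken from the project.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-in for the clinical data: 3,000 rows, 5 ordinal-coded features.
X = rng.integers(0, 4, size=(3000, 5))
# Imbalanced binary target; the 15% positive rate is an assumed value.
y = (rng.random(3000) < 0.15).astype(int)

# stratify=y keeps the class ratio (almost) identical in both partitions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)

print(f"train positive rate: {y_train.mean():.3f}")
print(f"test positive rate:  {y_test.mean():.3f}")
```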
To address the imbalance in the GDM target variable, several strategies were applied to the training set only:
- No sampling (baseline)
- Random OverSampling
- SMOTE
- Random UnderSampling
The test set remained unchanged to avoid data leakage and ensure fair evaluation.
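
To illustrate what "training set only" means in practice, here is a small NumPy sketch of random oversampling. The helper `random_oversample` is hypothetical and written for this example; the project may well have used a library such as imbalanced-learn (which provides `RandomOverSampler`, `SMOTE`, and `RandomUnderSampler`) instead.

```python
import numpy as np

def random_oversample(X, y, seed=42):
    """Duplicate minority-class rows at random until all classes are equal in size.

    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    idx_parts = []
    for cls, count in zip(classes, counts):
        cls_idx = np.flatnonzero(y == cls)
        # Draw the missing rows with replacement (zero draws for the majority class).
        extra = rng.choice(cls_idx, size=target - count, replace=True)
        idx_parts.append(np.concatenate([cls_idx, extra]))
    idx = rng.permutation(np.concatenate(idx_parts))
    return X[idx], y[idx]

# Toy training split: 90 negatives, 10 positives.
X_train = np.arange(200).reshape(100, 2)
y_train = np.array([0] * 90 + [1] * 10)

# Resample the training data only; the test split is never passed to this function.
X_res, y_res = random_oversample(X_train, y_train)
print(np.bincount(y_train), "->", np.bincount(y_res))
```

Because the resampler never sees the test split, the test set keeps the original class distribution.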
The following models were implemented and evaluated:
- Logistic Regression
- Support Vector Machine (SVM)
- Decision Tree
- Random Forest (RF)
- Gradient Boosting Decision Trees (GBDT)
- XGBoost
- CatBoost
- Generalized Additive Model (GAM)
Special emphasis was placed on CatBoost because it supports categorical features natively, without explicit encoding.
Models were evaluated on a hold-out test set and with cross-validation:
- Stratified 80/20 split
- Evaluation metrics:
  - Accuracy
  - Precision
  - Recall
  - F1-score
  - ROC-AUC
- Stratified K-Fold Cross-Validation (k = 10)
  - Mean and standard deviation reported for all metrics
  - Used to assess model stability and robustness
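
The cross-validation scheme above could look roughly like the following sketch, which uses scikit-learn's `StratifiedKFold` and `cross_validate`. The synthetic data, the choice of Logistic Regression, and the seed are assumptions for illustration, not the project's actual configuration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic imbalanced data standing in for the GDM dataset.
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.85, 0.15], random_state=42
)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scoring = ["accuracy", "precision", "recall", "f1", "roc_auc"]

scores = cross_validate(
    LogisticRegression(max_iter=1000), X, y, cv=cv, scoring=scoring
)

# Mean and standard deviation across the 10 folds for each metric.
summary = {
    metric: (scores[f"test_{metric}"].mean(), scores[f"test_{metric}"].std())
    for metric in scoring
}
for metric, (mean, std) in summary.items():
    print(f"{metric:>9}: {mean:.3f} ± {std:.3f}")
```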
All evaluation results were stored in structured dictionaries and exported as CSV files.
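
As a sketch of this export step, the snippet below writes a nested results dictionary to CSV with Python's standard `csv` module. The dictionary layout, the function name `write_results_csv`, and the zero values are placeholders invented for the example; they are not results from the study.

```python
import csv
import io

# Placeholder structure: {model name: {metric name: value}}.
# The zeros are dummy values, not measured results.
results = {
    "Logistic Regression": {
        "accuracy": 0.0, "precision": 0.0, "recall": 0.0, "f1": 0.0, "roc_auc": 0.0,
    },
    "CatBoost": {
        "accuracy": 0.0, "precision": 0.0, "recall": 0.0, "f1": 0.0, "roc_auc": 0.0,
    },
}

def write_results_csv(results, fileobj):
    """Write one CSV row per model, with one column per metric."""
    metrics = list(next(iter(results.values())))
    writer = csv.DictWriter(fileobj, fieldnames=["model", *metrics])
    writer.writeheader()
    for model, row in results.items():
        writer.writerow({"model": model, **row})

# An in-memory buffer keeps the example self-contained.
buffer = io.StringIO()
write_results_csv(results, buffer)
print(buffer.getvalue())
```

To write to disk instead, pass a file opened with `open(path, "w", newline="")` in place of the buffer.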
The following visual analyses were performed:
- ROC-AUC comparison plots for all models
- Separate plots for:
  - Test set AUC
  - K-Fold mean AUC
- Feature importance analysis (top 15 features) using:
  - Random Forest
  - GBDT
- Categorical correlation analysis using Cramér’s V
- Heatmaps for inter-feature dependency analysis
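
For reference, Cramér's V for two categorical variables is $V = \sqrt{\chi^2 / (n \cdot (\min(r, c) - 1))}$, where $\chi^2$ is the chi-squared statistic of their $r \times c$ contingency table and $n$ is the number of observations. The sketch below computes it without bias correction; the function names and the toy column names are hypothetical, and the project may have used a different variant.

```python
import numpy as np
import pandas as pd

def cramers_v(x, y):
    """Cramér's V for two categorical series (no bias correction)."""
    observed = pd.crosstab(x, y).to_numpy(dtype=float)
    n = observed.sum()
    # Expected counts under independence: row total * column total / n.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    k = min(observed.shape) - 1
    return float(np.sqrt(chi2 / (n * k))) if k > 0 else 0.0

def cramers_v_matrix(df):
    """Symmetric matrix of pairwise Cramér's V values for all columns."""
    cols = df.columns
    # The diagonal is 1: a column with two or more levels is perfectly
    # associated with itself.
    out = pd.DataFrame(np.eye(len(cols)), index=cols, columns=cols)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            out.loc[a, b] = out.loc[b, a] = cramers_v(df[a], df[b])
    return out

# Toy data with hypothetical column names.
df = pd.DataFrame({
    "a": ["low", "high"] * 50,
    "b": ["low", "high"] * 50,        # identical to "a"
    "c": ["x", "x", "y", "y"] * 25,   # independent of "a"
})
matrix = cramers_v_matrix(df)
print(matrix.round(3))
```

The resulting matrix can be passed to a heatmap function such as `seaborn.heatmap` to visualize inter-feature dependencies.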
Feature importance was extracted from tree-based models to identify the most influential predictors of GDM.
Consistency of important features across different models was also examined.
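
A minimal sketch of this kind of analysis is shown below. It assumes scikit-learn's `RandomForestClassifier` and `GradientBoostingClassifier` as the RF and GBDT implementations and uses synthetic data with made-up feature names; the project's actual models and settings may differ.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

# Synthetic data with 20 features; names are placeholders.
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=42
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

models = {
    "RF": RandomForestClassifier(n_estimators=100, random_state=42),
    "GBDT": GradientBoostingClassifier(random_state=42),
}

top_features = {}
for name, model in models.items():
    model.fit(X, y)
    importances = pd.Series(model.feature_importances_, index=feature_names)
    # Keep the 15 highest-ranked features for each model.
    top_features[name] = importances.sort_values(ascending=False).head(15)

# Features ranked in the top 15 by both models.
shared = set(top_features["RF"].index) & set(top_features["GBDT"].index)
print(f"{len(shared)} features appear in both top-15 lists")
```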
The pipeline generates multiple output files:
- Model performance summaries (CSV)
- Cross-validation statistics (CSV)
- Feature importance rankings (CSV)
- ROC-AUC comparison plots
- Correlation heatmaps
These outputs are suitable for reports, theses, and scientific publications.
Reproducibility measures:
- Fixed random seeds used throughout the project
- Stratified sampling applied consistently
- Modular and reusable evaluation framework
This repository provides a reproducible and extensible machine learning framework for GDM prediction using categorical clinical data.
The workflow emphasizes proper preprocessing, robust evaluation, and interpretability, making it suitable for academic and applied research.
Possible future extensions:
- Hyperparameter tuning
- Threshold optimization for clinical sensitivity
- SHAP-based explainability
- External dataset validation
This work is associated with the following peer-reviewed publication:
The Power of Machine Learning Models in Predicting Gestational Diabetes Mellitus
Published in BMC Pregnancy and Childbirth (Springer Nature)
DOI: https://doi.org/10.1186/s12884-026-08856-1