alihaghighat/GDM-Machine-Learning-main

Gestational Diabetes Mellitus (GDM) Prediction

📌 Project Overview

This repository implements a complete machine learning workflow for the prediction of Gestational Diabetes Mellitus (GDM) using clinical and demographic data.
The dataset consists mainly of categorical, ordinal, and binary variables, which calls for preprocessing and model selection suited to non-numeric features.

The project focuses on:

  • Proper handling of categorical data
  • Class imbalance mitigation
  • Robust model evaluation
  • Comparative analysis of multiple machine learning models

📊 Dataset Description

  • Target variable: GDM (Binary: 0 / 1)
  • Features: Mostly categorical and ordinal clinical variables
  • Dataset size: ~3,000 records
  • Class distribution: Imbalanced
  • Missing values: Minimal or none

⚙️ Data Preprocessing

The following preprocessing steps were performed:

  • Exploratory data analysis (EDA)
  • Unique value and frequency analysis per column
  • Handling mixed-type categorical values
  • Data type validation
  • Stratified data splitting:
    • 80% Training
    • 20% Testing

Class distribution was preserved during splitting using stratified sampling.
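
As a rough illustration of the stratified 80/20 split described above, here is a minimal sketch using scikit-learn's `train_test_split`. The synthetic data, the 15% positive rate, and the seed value 42 are assumptions made for the example; they are not taken from the project's dataset or code.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the clinical table: 3,000 rows, 5 ordinal-coded features.
rng = np.random.default_rng(42)
X = rng.integers(0, 4, size=(3000, 5))
# Imbalanced binary target; the 15% positive rate is illustrative only.
y = (rng.random(3000) < 0.15).astype(int)

# stratify=y keeps the GDM / non-GDM ratio (almost) identical in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)

print(len(y_train), len(y_test))  # 2400 600
print(round(y_train.mean(), 3), round(y_test.mean(), 3))
```

The two printed class rates should agree to within a fraction of a percent, which is the property stratification is meant to guarantee.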


⚖️ Class Imbalance Handling

To address the imbalance in the GDM target variable, the following strategies were applied to the training set only:

  • No sampling (baseline)
  • Random OverSampling
  • SMOTE
  • Random UnderSampling

The test set remained unchanged to avoid data leakage and ensure fair evaluation.
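
To make the "training set only" rule concrete, the sketch below implements plain random oversampling with NumPy. The helper name `random_oversample` is hypothetical and the toy arrays are made up; the source does not state which library the project used. In practice, the imbalanced-learn package offers `RandomOverSampler`, `SMOTE`, and `RandomUnderSampler` for the strategies listed above.

```python
import numpy as np

def random_oversample(X, y, seed=42):
    """Duplicate randomly chosen rows of smaller classes until every class
    has as many rows as the largest one (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    keep = []
    for cls, count in zip(classes, counts):
        idx = np.flatnonzero(y == cls)
        # Sample with replacement only the number of rows that are missing.
        extra = rng.choice(idx, size=target - count, replace=True)
        keep.append(np.concatenate([idx, extra]))
    order = rng.permutation(np.concatenate(keep))
    return X[order], y[order]

# Toy training split: 8 negatives, 2 positives. The test split is never passed in.
X_train = np.arange(20).reshape(10, 2)
y_train = np.array([0] * 8 + [1] * 2)

X_bal, y_bal = random_oversample(X_train, y_train)
print(np.bincount(y_bal))  # [8 8]
```

Because the function only ever receives the training arrays, the held-out test rows cannot leak into the resampled data.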


🤖 Machine Learning Models

The following models were implemented and evaluated:

  • Logistic Regression
  • Support Vector Machine (SVM)
  • Decision Tree
  • Random Forest (RF)
  • Gradient Boosting Decision Trees (GBDT)
  • XGBoost
  • CatBoost
  • Generalized Additive Model (GAM)

Special emphasis was placed on CatBoost because it supports categorical features natively, without explicit encoding.


🔁 Model Evaluation Strategy

1️⃣ Train/Test Evaluation

  • Stratified 80/20 split
  • Evaluation metrics:
    • Accuracy
    • Precision
    • Recall
    • F1-score
    • ROC-AUC
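
For readers who want to see how the listed metrics relate to one another, here is a small dependency-free sketch. The function names and the toy labels and scores are assumptions for illustration; the project itself may well compute these with a library such as scikit-learn. ROC-AUC is computed here as the fraction of (positive, negative) pairs in which the positive case receives the higher score, with ties counting as one half.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 from hard 0/1 predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

def roc_auc(y_true, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy labels and predicted probabilities (made up for the example).
y_true = [1, 0, 1, 1, 0, 0, 0, 1]
scores = [0.9, 0.2, 0.4, 0.7, 0.6, 0.1, 0.3, 0.8]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

print(binary_metrics(y_true, y_pred))  # all four metrics equal 0.75 here
print(roc_auc(y_true, scores))         # 0.9375
```

On an imbalanced target, accuracy alone can look good for a model that rarely predicts the positive class, which is why recall, F1, and ROC-AUC are reported alongside it.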

2️⃣ Cross-Validation

  • Stratified K-Fold Cross-Validation (k = 10)
  • Mean and standard deviation reported for all metrics
  • Used to assess model stability and robustness

All evaluation results were stored in structured dictionaries and exported as CSV files.
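
The cross-validation step could be sketched as follows, assuming scikit-learn. The synthetic data, the choice of Logistic Regression as the example model, the seed, and the CSV column names are all assumptions; only the 10-fold stratified scheme, the metric list, and the dictionary-then-CSV flow come from the description above.

```python
import csv
import io

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic imbalanced data standing in for the clinical features.
X, y = make_classification(
    n_samples=600, n_features=8, weights=[0.85, 0.15], random_state=42
)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scoring = ["accuracy", "precision", "recall", "f1", "roc_auc"]
scores = cross_validate(LogisticRegression(max_iter=1000), X, y, cv=cv, scoring=scoring)

# One dictionary entry per metric: mean and standard deviation over the 10 folds.
summary = {
    metric: {
        "mean": float(np.mean(scores[f"test_{metric}"])),
        "std": float(np.std(scores[f"test_{metric}"])),
    }
    for metric in scoring
}

# Write the summary as CSV text; swap the buffer for open("cv_summary.csv", "w", newline="")
# to produce a file (the file name is hypothetical).
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["metric", "mean", "std"])
for metric, stats in summary.items():
    writer.writerow([metric, f"{stats['mean']:.4f}", f"{stats['std']:.4f}"])
csv_text = buffer.getvalue()
print(csv_text)
```

If a resampling strategy is combined with cross-validation, it should be fitted inside each training fold rather than on the full data, for the same leakage reason that applies to the test set.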


📈 Visualization and Analysis

The following visual analyses were performed:

  • ROC-AUC comparison plots for all models
  • Separate plots for:
    • Test set AUC
    • K-Fold mean AUC
  • Feature importance analysis (Top 15 features) using:
    • Random Forest
    • GBDT
  • Categorical correlation analysis using Cramér’s V
  • Heatmaps for inter-feature dependency analysis
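
Cramér's V measures association between two categorical variables on a scale from 0 (no association) to 1 (perfect association). It is the square root of the chi-squared statistic divided by n times (min(rows, columns) - 1). The sketch below is a plain-Python version without the bias correction that some implementations apply; the function name and toy columns are assumptions, not the project's code.

```python
from collections import Counter
from math import sqrt

def cramers_v(x, y):
    """Uncorrected Cramér's V for two equal-length sequences of categories."""
    n = len(x)
    joint = Counter(zip(x, y))  # observed counts per (category_x, category_y) cell
    row = Counter(x)            # marginal counts for x
    col = Counter(y)            # marginal counts for y
    chi2 = 0.0
    for a in row:
        for b in col:
            expected = row[a] * col[b] / n
            observed = joint.get((a, b), 0)
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(row), len(col)) - 1
    # A variable with a single category carries no association information.
    return sqrt(chi2 / (n * k)) if k > 0 else 0.0

print(cramers_v(["a", "b", "a", "b"], ["u", "v", "u", "v"]))  # 1.0
print(cramers_v(["a", "a", "b", "b"], ["u", "v", "u", "v"]))  # 0.0
```

Applying the function to every pair of columns yields the symmetric matrix that a heatmap of inter-feature dependency would display.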

🔍 Feature Importance

Feature importance was extracted from tree-based models to identify the most influential predictors of GDM.
Consistency of important features across different models was also examined.
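
One way to extract a top-15 ranking and check its consistency across models is sketched below with scikit-learn's Random Forest and Gradient Boosting classifiers. The synthetic data, the feature names, the helper `top_features`, and the hyperparameters are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

# Synthetic data with 20 features, 5 of them informative.
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=42
)
names = [f"feature_{i}" for i in range(X.shape[1])]

def top_features(model, names, k=15):
    """Return the k features with the largest impurity-based importance."""
    importances = model.feature_importances_
    order = np.argsort(importances)[::-1][:k]
    return [(names[i], float(importances[i])) for i in order]

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)
gbdt = GradientBoostingClassifier(random_state=42).fit(X, y)

rf_top = top_features(rf, names)
gbdt_top = top_features(gbdt, names)

# Features that both models place in their top 15.
shared = sorted({n for n, _ in rf_top} & {n for n, _ in gbdt_top})
print(rf_top[:3])
print(len(shared), "features shared between the two rankings")
```

Impurity-based importances can favor features with many distinct values, so agreement between models is a useful sanity check rather than proof of clinical relevance.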


📁 Outputs

The pipeline generates multiple output files:

  • Model performance summaries (CSV)
  • Cross-validation statistics (CSV)
  • Feature importance rankings (CSV)
  • ROC-AUC comparison plots
  • Correlation heatmaps

These outputs are suitable for reports, theses, and scientific publications.


🧪 Reproducibility

  • Fixed random seeds used throughout the project
  • Stratified sampling applied consistently
  • Modular and reusable evaluation framework

📌 Conclusion

This repository provides a reproducible and extensible machine learning framework for GDM prediction using categorical clinical data.
The workflow emphasizes proper preprocessing, robust evaluation, and interpretability, making it suitable for academic and applied research.


🚀 Future Improvements

  • Hyperparameter tuning
  • Threshold optimization for clinical sensitivity
  • SHAP-based explainability
  • External dataset validation

🔗 Publication

This work is associated with the following peer-reviewed publication:

The Power of Machine Learning Models in Predicting Gestational Diabetes Mellitus
Published in BMC Pregnancy and Childbirth (Springer Nature)
DOI: https://doi.org/10.1186/s12884-026-08856-1
