hetuvpatel/ML-Diabetes-Risk-Progression-Stage

🧠 Clustering Analysis of Diabetes Risk and Progression in Pima Indian Females


📖 Project Overview

This project conducts a clustering analysis on the Pima Indian Diabetes dataset, aiming to identify subgroups at different diabetes risk stages.
Using K-Means and Hierarchical Clustering, the study uncovers patterns across medical indicators like Glucose, BMI, and Insulin, offering insights into early detection and personalized interventions.

✅ Dataset: 768 diagnostic records of Pima Indian females (aged 21 and older) from the National Institute of Diabetes and Digestive and Kidney Diseases.


🛠️ Techniques Used

  • Data Cleaning and Imputation: Replace invalid zero values with NaN, then impute them with the column median.
  • Normalization: MinMaxScaler applied to bring every feature into the [0, 1] range.
  • Exploratory Data Analysis:
    • Correlation heatmaps
    • Pair plots (e.g., BMI vs Glucose)
  • Clustering:
    • K-Means (k=3, chosen with the Elbow Method)
    • Hierarchical Clustering (Ward linkage, dendrogram analysis)

✨ Key Visualizations

| Visualization | Description |
| --- | --- |
| Heatmap | Correlation heatmap showing relationships between health features. |
| Pair Plot | Pair-wise feature comparison highlighting clustering tendencies. |
| K-Means Clusters | Scatter plot (by stages) colored by K-Means risk groups. |
| K-Means Clusters | Scatter plot (by risks) colored by K-Means risk groups. |
| Radar Chart | Comparison of feature averages across Low, Medium, and High risk clusters. |
| Agglomerative Clusters | Agglomerative clustering results showing the risk groups. |


🚀 Methodology

📂 Data Loading and Preprocessing

  • Loaded dataset with pandas.
  • Invalid zero entries replaced with NaN, then imputed with the column median.
  • Features scaled using MinMaxScaler.
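
The three preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the list of columns treated as "zero means missing" is an assumption (the README does not name them), and the small `sample` frame is made-up data standing in for the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Assumption: columns where a value of 0 is physiologically impossible.
ZERO_INVALID = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Replace invalid zeros with NaN, impute medians, scale to [0, 1]."""
    out = df.copy()
    cols = [c for c in ZERO_INVALID if c in out.columns]
    # Step 1: zeros in these columns are treated as missing values.
    out[cols] = out[cols].replace(0, np.nan)
    # Step 2: fill each missing value with its column median.
    out[cols] = out[cols].fillna(out[cols].median())
    # Step 3: rescale every column to the [0, 1] range.
    scaled = MinMaxScaler().fit_transform(out)
    return pd.DataFrame(scaled, columns=out.columns, index=out.index)


# Made-up rows for illustration only.
sample = pd.DataFrame({
    "Glucose": [148, 85, 0, 89],
    "BMI": [33.6, 26.6, 23.3, 0.0],
    "Insulin": [0, 94, 168, 0],
    "Age": [50, 31, 32, 21],
})
clean = preprocess(sample)
```

Median imputation is less sensitive to outliers than mean imputation, which matters for skewed columns such as Insulin.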

📊 Exploratory Data Analysis

  • Generated pair plots and correlation heatmaps to discover relationships.
  • Identified candidate clusters from visual trends in these plots.
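
As a rough companion to the heatmap, the same correlation matrix can be inspected numerically. The sketch below ranks feature pairs by absolute Pearson correlation; the helper name `strongest_pairs` and the synthetic `demo` frame are illustrative assumptions, not part of the project.

```python
import numpy as np
import pandas as pd


def strongest_pairs(df: pd.DataFrame, top: int = 3) -> pd.Series:
    """Return the feature pairs with the largest absolute Pearson correlation."""
    corr = df.corr()
    # Keep the strict upper triangle so each pair appears once, without the diagonal.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(top)


# Synthetic data for illustration: Insulin is constructed to track Glucose.
rng = np.random.default_rng(0)
glucose = rng.normal(120, 30, 200)
demo = pd.DataFrame({
    "Glucose": glucose,
    "Insulin": 0.8 * glucose + rng.normal(0, 10, 200),
    "BMI": rng.normal(32, 6, 200),
})
top_pairs = strongest_pairs(demo)
```

The matrix returned by `df.corr()` is also what a plotting call such as `seaborn.heatmap(corr, annot=True)` would draw.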

🧩 Clustering

K-Means Clustering:

  • Optimal k=3 determined by Elbow Method.
  • Grouped data into:
    • Cluster 0: Low Risk
    • Cluster 1: Medium Risk
    • Cluster 2: High Risk
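
A minimal sketch of the Elbow Method and the risk labeling, under stated assumptions: the data is synthetic (three made-up groups in a scaled feature space), and the first column is assumed to play the role of Glucose. Note that the cluster ids K-Means returns are arbitrary, so the sketch ranks clusters by their center on that column before naming them.

```python
import numpy as np
from sklearn.cluster import KMeans


def elbow_inertias(X, k_values):
    """Fit K-Means for each k and return the within-cluster sum of squares."""
    return [
        KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
        for k in k_values
    ]


# Synthetic stand-in for the scaled dataset: three well-separated groups.
rng = np.random.default_rng(42)
centers = np.array([[0.2, 0.2, 0.1], [0.5, 0.5, 0.4], [0.8, 0.8, 0.8]])
X = np.vstack([c + rng.normal(0, 0.04, size=(100, 3)) for c in centers])

# The "elbow" is the k after which inertia stops dropping sharply.
inertias = elbow_inertias(X, range(1, 7))

km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# Rank clusters by their center on column 0 (assumed here to be Glucose).
order = np.argsort(km.cluster_centers_[:, 0])
names = {int(cid): name
         for cid, name in zip(order, ["Low Risk", "Medium Risk", "High Risk"])}
risk = np.array([names[int(c)] for c in km.labels_])
```

Plotting `inertias` against k gives the usual elbow curve; on this synthetic data the bend falls at k=3 by construction.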

Hierarchical Clustering:

  • Applied Ward's method.
  • Dendrogram cut at three clusters for comparison.
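
One way to reproduce these two steps is with SciPy's hierarchy module, sketched below on synthetic data. The project may have used scikit-learn's `AgglomerativeClustering` instead; the README does not say, so treat the choice of SciPy here as an assumption.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Synthetic stand-in for the scaled dataset: three made-up groups.
rng = np.random.default_rng(1)
centers = np.array([[0.2, 0.2, 0.1], [0.5, 0.5, 0.4], [0.8, 0.8, 0.8]])
X = np.vstack([c + rng.normal(0, 0.04, size=(60, 3)) for c in centers])

# Ward linkage merges, at each step, the pair of clusters whose union
# gives the smallest increase in within-cluster variance.
Z = linkage(X, method="ward")

# Cut the tree so that at most three flat clusters remain (labels start at 1).
labels = fcluster(Z, t=3, criterion="maxclust")

# Compute dendrogram coordinates without drawing; show only the last 12 merges.
tree = dendrogram(Z, no_plot=True, truncate_mode="lastp", p=12)
```

Dropping `no_plot=True` draws the dendrogram with matplotlib, where the heights of the final merges indicate how distinct the top-level groups are.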

📈 Results

| Metric | Cluster 0 (Low Risk) | Cluster 1 (Medium Risk) | Cluster 2 (High Risk) |
| --- | --- | --- | --- |
| Glucose | Low | Medium | High |
| BMI | Low | Elevated | High |
| Insulin | Low | Moderate | High |

  • 🔥 K-Means separated the records into groups that differ mainly in Glucose, BMI, and Insulin.
  • 🧠 Hierarchical Clustering exposed the nested structure of the data through its dendrogram.
  • 🏆 Both methods, cut at three clusters, produced groups consistent with Low, Medium, and High risk.
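
A qualitative table like the one above is typically derived from per-cluster feature means. The sketch below shows that computation; the numbers in `records` are made up for illustration and are not the project's results, and the helper name `cluster_profile` is hypothetical.

```python
import pandas as pd


def cluster_profile(df: pd.DataFrame, labels, features) -> pd.DataFrame:
    """Mean of each feature within each cluster."""
    return df.assign(cluster=list(labels)).groupby("cluster")[features].mean()


# Made-up values for illustration only, not measurements from the dataset.
records = pd.DataFrame({
    "Glucose": [90, 100, 120, 130, 160, 180],
    "BMI": [24, 26, 30, 32, 36, 40],
    "Insulin": [60, 80, 120, 140, 200, 240],
})
labels = ["Low Risk", "Low Risk", "Medium Risk", "Medium Risk", "High Risk", "High Risk"]

profile = cluster_profile(records, labels, ["Glucose", "BMI", "Insulin"])
```

Computing the profile on the unscaled values keeps the means in clinical units, which is easier to read than the [0, 1] scaled features.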

🧠 Discussion

  • K-Means was computationally efficient but required pre-specifying k.
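  • Hierarchical Clustering did not require fixing k up front, since the dendrogram can be cut at any level, but it was computationally heavier.
  • Combining both methods gave a cross-check: subgroups that appear under both algorithms are less likely to be artifacts of a single method.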
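
Agreement between the two methods can be quantified rather than judged by eye. One common option, which the README does not mention and is offered here only as a suggestion, is the adjusted Rand index. The sketch uses synthetic data.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-in for the scaled dataset: three made-up groups.
rng = np.random.default_rng(0)
centers = np.array([[0.2, 0.2, 0.1], [0.5, 0.5, 0.4], [0.8, 0.8, 0.8]])
X = np.vstack([c + rng.normal(0, 0.04, size=(80, 3)) for c in centers])

kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
ward_labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X)

# The adjusted Rand index ignores how cluster ids are numbered:
# 1.0 means identical partitions, values near 0 mean chance-level agreement.
ari = adjusted_rand_score(kmeans_labels, ward_labels)
```

On real data the score will usually be below 1.0; how far below indicates how much the two methods disagree about cluster membership.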

✅ Conclusion

Clustering techniques provided actionable insights into the risk profiles of Pima Indian females for diabetes progression.
This project demonstrates how machine learning can support early diagnosis, risk stratification, and personalized care strategies in healthcare.


🔮 Future Improvements

  • Apply Gaussian Mixture Models for soft clustering.
  • Incorporate feature selection techniques to optimize model performance.
  • Expand dataset to multi-ethnic groups for broader generalization.
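
For the first item, a Gaussian Mixture Model assigns each record a probability of belonging to every cluster instead of a single hard label. The following is a small sketch of what that could look like with scikit-learn, on synthetic data; it is not something the project has implemented.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for the scaled dataset: three made-up groups.
rng = np.random.default_rng(0)
centers = np.array([[0.2, 0.2, 0.1], [0.5, 0.5, 0.4], [0.8, 0.8, 0.8]])
X = np.vstack([c + rng.normal(0, 0.04, size=(80, 3)) for c in centers])

gmm = GaussianMixture(n_components=3, random_state=0).fit(X)

# One row per record, one column per component; each row sums to 1.
probs = gmm.predict_proba(X)

# A hard label can still be recovered as the most probable component.
hard = probs.argmax(axis=1)
```

Records with no dominant probability sit between groups, which could help flag individuals transitioning from one risk stage to the next.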

"Empowering healthcare through machine learning, one cluster at a time." 🧠🚀
