This project conducts a clustering analysis on the Pima Indian Diabetes dataset, aiming to identify subgroups at different diabetes risk stages.
Using K-Means and Hierarchical Clustering, the study uncovers patterns across medical indicators like Glucose, BMI, and Insulin, offering insights into early detection and personalized interventions.
- Dataset: 770 diagnostic records of Pima Indian females (21+ years) from the National Institute of Diabetes and Digestive and Kidney Diseases.
- Data Cleaning and Imputation: Invalid zero values replaced with NaN, then filled by median imputation.
- Normalization: MinMaxScaler applied to bring features into [0, 1] range.
- Exploratory Data Analysis:
  - Correlation heatmaps
  - Pair plots (e.g., BMI vs Glucose)
- Clustering:
  - K-Means (using Elbow Method for k=3)
  - Hierarchical Clustering (Ward linkage, dendrogram analysis)
- Loaded dataset with `pandas`.
- Invalid zero entries replaced with `NaN` and imputed.
- Features scaled using MinMaxScaler.
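The preprocessing steps above could be sketched roughly as follows. This is a minimal illustration, not the project's notebook code: the column names are assumed to match the common Kaggle CSV, the `preprocess` helper and the `diabetes.csv` file name are hypothetical, and a tiny inline frame stands in for the real data.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Columns where a zero is physiologically implausible and is treated as missing.
# Assumption: these names match the Kaggle CSV; adjust them to your file.
ZERO_AS_MISSING = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Replace invalid zeros with NaN, impute column medians, scale to [0, 1]."""
    out = df.astype(float)
    cols = [c for c in ZERO_AS_MISSING if c in out.columns]
    out[cols] = out[cols].replace(0, np.nan)
    out[cols] = out[cols].fillna(out[cols].median())
    scaled = MinMaxScaler().fit_transform(out)
    return pd.DataFrame(scaled, columns=out.columns, index=out.index)


# Toy stand-in for pd.read_csv("diabetes.csv") (hypothetical file name).
toy = pd.DataFrame({
    "Glucose": [148, 85, 0, 183],
    "BMI": [33.6, 26.6, 23.3, 0.0],
    "Insulin": [0, 94, 168, 0],
})
clean = preprocess(toy)
print(clean)
```

The median is used here because it is less sensitive than the mean to skewed columns such as Insulin.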
- Generated pair plots and correlation heatmaps to discover relationships.
- Identified clusters through visual trends.
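As a rough illustration of the correlation step, the sketch below computes the matrix that a heatmap would display and ranks the feature pairs by strength. The data is synthetic (an assumption made so the example is self-contained), so the numbers say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (assumption): Insulin loosely follows Glucose, BMI is independent.
rng = np.random.default_rng(0)
glucose = rng.normal(120, 30, 200)
df = pd.DataFrame({
    "Glucose": glucose,
    "Insulin": 0.8 * glucose + rng.normal(0, 20, 200),
    "BMI": rng.normal(32, 6, 200),
})

corr = df.corr()  # Pearson correlation matrix, the input to a heatmap

# Keep the upper triangle so each pair appears once, then rank by absolute strength.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna().sort_values(key=abs, ascending=False)
print(pairs)
```

With seaborn installed, `sns.heatmap(corr)` and `sns.pairplot(df)` would render the corresponding plots.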
K-Means Clustering:
- Optimal `k=3` determined by Elbow Method.
- Grouped data into:
  - Cluster 0: Low Risk
  - Cluster 1: Medium Risk
  - Cluster 2: High Risk
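The Elbow Method and the final K-Means fit might look like the sketch below. It runs on synthetic blobs standing in for the scaled feature matrix (an assumption), and the range of `k` values tried is illustrative rather than taken from the project.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the scaled features (assumption): three blobs inside [0, 1].
rng = np.random.default_rng(42)
centers = np.array([[0.2, 0.2, 0.2], [0.5, 0.5, 0.5], [0.8, 0.8, 0.8]])
X = np.vstack([c + rng.normal(0, 0.04, size=(100, 3)) for c in centers])

# Elbow Method: record the within-cluster sum of squares (inertia) for each k.
inertias = {}
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 3))

# Fit the final model with the k read off the elbow (3 in this project).
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_
```

The cluster ids returned by K-Means are arbitrary, so names such as Low, Medium, and High Risk have to be assigned after inspecting each cluster's feature averages.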
Hierarchical Clustering:
- Applied Ward's method.
- Dendrogram cut at three clusters for comparison.
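One way to reproduce these two steps is with SciPy, sketched below on synthetic data (an assumption; the project may have used a different library). `linkage` builds the merge tree with Ward's method, and `fcluster` cuts it into at most three clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Synthetic stand-in for the scaled features (assumption): three blobs inside [0, 1].
rng = np.random.default_rng(42)
centers = np.array([[0.2, 0.2, 0.2], [0.5, 0.5, 0.5], [0.8, 0.8, 0.8]])
X = np.vstack([c + rng.normal(0, 0.04, size=(100, 3)) for c in centers])

# Ward linkage merges, at each step, the pair of clusters that least increases
# the total within-cluster variance. Z has one row per merge.
Z = linkage(X, method="ward")

# Cut the tree so that no more than three flat clusters remain (labels start at 1).
hc_labels = fcluster(Z, t=3, criterion="maxclust")
print(np.bincount(hc_labels)[1:])
```

`scipy.cluster.hierarchy.dendrogram(Z)` draws the tree with matplotlib; large vertical gaps between merges suggest where to cut.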
| Metric | Cluster 0 (Low Risk) | Cluster 1 (Medium Risk) | Cluster 2 (High Risk) |
|---|---|---|---|
| Glucose | Low | Medium | High |
| BMI | Low | Elevated | High |
| Insulin | Low | Moderate | High |
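A profile table like this one can be derived by averaging each feature per cluster. The sketch below uses made-up values (an assumption, not project results) and names the clusters by ranking them on mean Glucose, which is one possible convention rather than the project's documented rule.

```python
import pandas as pd

# Toy stand-in (assumption): original-unit features plus one cluster id per row.
df = pd.DataFrame({
    "Glucose": [95, 100, 125, 130, 170, 180],
    "BMI": [24.0, 26.0, 31.0, 33.0, 38.0, 40.0],
    "Insulin": [60, 70, 120, 140, 220, 260],
    "cluster": [2, 2, 0, 0, 1, 1],  # arbitrary ids, as a clustering algorithm returns them
})

# Mean of each indicator within each cluster.
profile = df.groupby("cluster")[["Glucose", "BMI", "Insulin"]].mean()

# Rank clusters by mean Glucose and attach risk names in that order.
order = profile["Glucose"].sort_values().index
risk = dict(zip(order, ["Low Risk", "Medium Risk", "High Risk"]))
profile["risk"] = profile.index.map(risk)
print(profile)
```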
- K-Means efficiently segmented groups based on Glucose, BMI, and Insulin.
- Hierarchical Clustering provided detailed data structure insights through dendrograms.
- Both methods validated the existence of Low, Medium, and High risk groups.
- K-Means was computationally efficient but required pre-specifying `k`.
- Hierarchical Clustering offered deeper insights without needing `k`, but was computationally heavier.
- Combining both methods provided robust and validated subgroup classifications.
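Agreement between the two methods can be quantified rather than judged by eye. The sketch below uses the adjusted Rand index, a metric chosen here for illustration that the project does not report, on synthetic data (also an assumption). A value near 1 means both methods produce nearly the same partition, regardless of how the cluster ids are numbered.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-in for the scaled features (assumption): three blobs inside [0, 1].
rng = np.random.default_rng(42)
centers = np.array([[0.2, 0.2, 0.2], [0.5, 0.5, 0.5], [0.8, 0.8, 0.8]])
X = np.vstack([c + rng.normal(0, 0.04, size=(100, 3)) for c in centers])

km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
hc_labels = fcluster(linkage(X, method="ward"), t=3, criterion="maxclust")

# The adjusted Rand index is invariant to label permutation, so K-Means ids
# 0..2 and hierarchical ids 1..3 can be compared directly.
ari = adjusted_rand_score(km_labels, hc_labels)
print(f"Adjusted Rand index: {ari:.3f}")
```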
Clustering techniques provided actionable insights into the risk profiles of Pima Indian females for diabetes progression.
This project demonstrates how machine learning can support early diagnosis, risk stratification, and personalized care strategies in healthcare.
- Apply Gaussian Mixture Models for soft clustering.
- Incorporate feature selection techniques to optimize model performance.
- Expand dataset to multi-ethnic groups for broader generalization.
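For the Gaussian Mixture Model idea, a possible starting point is sketched below with scikit-learn on synthetic data (an assumption). Unlike K-Means, the model returns a membership probability for each cluster, so a borderline record can sit partly in two risk groups.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for the scaled features (assumption): three blobs inside [0, 1].
rng = np.random.default_rng(42)
centers = np.array([[0.2, 0.2, 0.2], [0.5, 0.5, 0.5], [0.8, 0.8, 0.8]])
X = np.vstack([c + rng.normal(0, 0.04, size=(100, 3)) for c in centers])

gmm = GaussianMixture(n_components=3, random_state=0).fit(X)

# Soft assignment: one row per record, one probability per component.
probs = gmm.predict_proba(X)

# A hard label can still be recovered by taking the most likely component.
hard = probs.argmax(axis=1)
print(probs[:3].round(3))
```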
- Pima Indian Diabetes Dataset - Kaggle
- Hopkins Medicine - Diabetes Resources
- Google Colab Notebook - Project Code
- Hetu Patel
- hetu.patel@torontomu.ca
- Portfolio Website
- GitHub Profile
"Empowering healthcare through machine learning, one cluster at a time."





