🧠 Clustering Analysis of Diabetes Risk and Progression in Pima Indian Females

📖 Project Overview

This project conducts a clustering analysis on the Pima Indian Diabetes dataset, aiming to identify subgroups at different diabetes risk stages.
Using K-Means and Hierarchical Clustering, the study uncovers patterns across medical indicators like Glucose, BMI, and Insulin, offering insights into early detection and personalized interventions.

✅ Dataset: 770 diagnostic records of Pima Indian females (21+ years) from the National Institute of Diabetes and Digestive and Kidney Diseases.

🛠️ Techniques Used

Data Cleaning and Imputation: Invalid zero values replaced with NaN, then filled by median imputation.
Normalization: MinMaxScaler applied to bring features into [0, 1] range.
Exploratory Data Analysis:
- Correlation heatmaps
- Pair plots (e.g., BMI vs Glucose)
Clustering:
- K-Means (k=3 chosen with the Elbow Method)
- Hierarchical Clustering (Ward linkage, dendrogram analysis)

✨ Key Visualizations

- Correlation heatmap showing relationships between health features.
- Pair plot comparing features pairwise and highlighting clustering tendencies.
- Scatter plot by stage, colored by K-Means risk group.
- Scatter plot by risk, colored by K-Means risk group.
- Comparison of feature averages across the Low, Medium, and High risk clusters.
- Agglomerative clustering plot showing the risk groups.


🚀 Methodology

📂 Data Loading and Preprocessing

Loaded dataset with pandas.
Invalid zero entries replaced with NaN and imputed.
Features scaled using MinMaxScaler.
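
The three preprocessing steps above can be sketched as follows. This is a minimal illustration on a toy frame, not the project's script: the column names and the list of columns where zero counts as missing are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy stand-in for the real dataset; column names and values are assumed.
df = pd.DataFrame({
    "Glucose": [148, 85, 0, 89, 137, 116],
    "BMI": [33.6, 26.6, 23.3, 0.0, 43.1, 25.6],
    "Insulin": [0, 0, 0, 94, 168, 0],
    "Age": [50, 31, 32, 21, 33, 30],
})

# Columns where a zero is treated as a missing measurement (assumed list).
zero_invalid = ["Glucose", "BMI", "Insulin"]

clean = df.copy()
# Step 1: mark invalid zeros as NaN.
clean[zero_invalid] = clean[zero_invalid].replace(0, np.nan)
# Step 2: fill each column's NaNs with that column's median.
clean[zero_invalid] = clean[zero_invalid].fillna(clean[zero_invalid].median())

# Step 3: rescale every feature into the [0, 1] range.
scaler = MinMaxScaler()
scaled = pd.DataFrame(scaler.fit_transform(clean), columns=clean.columns)
```

Computing the median after the zeros are converted to NaN matters: otherwise the zeros would pull the median down and bias the imputed values.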

📊 Exploratory Data Analysis

Generated pair plots and correlation heatmaps to discover relationships.
Used visual trends in these plots to spot candidate clusters.
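
A rough sketch of this kind of exploration is shown below. It uses synthetic data and plain matplotlib plus pandas plotting helpers; the column names, the synthetic relationship between Glucose and Insulin, and the plotting choices are all assumptions made for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in data; Insulin is built to correlate with Glucose.
rng = np.random.default_rng(0)
n = 200
glucose = rng.normal(120, 30, n)
df = pd.DataFrame({
    "Glucose": glucose,
    "Insulin": 0.8 * glucose + rng.normal(0, 15, n),
    "BMI": rng.normal(32, 6, n),
})

# Pearson correlation between every pair of features.
corr = df.corr()

# Correlation heatmap: one colored cell per feature pair.
fig, ax = plt.subplots(figsize=(4, 4))
im = ax.imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=45)
ax.set_yticks(range(len(corr.columns)))
ax.set_yticklabels(corr.columns)
fig.colorbar(im, ax=ax)

# Pair plot: scatter plots off the diagonal, histograms on it.
axes = pd.plotting.scatter_matrix(df, figsize=(6, 6), diagonal="hist")
```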

🧩 Clustering

K-Means Clustering:

Optimal k=3 determined by Elbow Method.
Grouped data into:
- Cluster 0: Low Risk
- Cluster 1: Medium Risk
- Cluster 2: High Risk
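
The Elbow Method and the final fit can be sketched as below. The data here are synthetic blobs rather than the diabetes records, and the rule for naming clusters (ordering centroids by the first feature) is a hypothetical choice for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import MinMaxScaler

# Synthetic data with three well-separated groups (assumed, for illustration).
X_raw, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0, 0], [5, 5, 5], [10, 10, 10]],
    cluster_std=1.0,
    random_state=42,
)
X = MinMaxScaler().fit_transform(X_raw)

# Elbow Method: record the within-cluster sum of squares (inertia) for each k
# and look for the k after which the curve flattens.
inertias = {}
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = km.inertia_

# Final model with the chosen k.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
labels = kmeans.labels_

# K-Means label numbers are arbitrary, so risk names have to be assigned by
# inspecting the centroids. Here clusters are ordered by the first feature.
order = np.argsort(kmeans.cluster_centers_[:, 0])
risk_names = {int(c): name for c, name in zip(order, ["Low", "Medium", "High"])}
```

Because the integer labels depend on initialization, a mapping such as Cluster 0 = Low Risk only holds for one particular fitted model; deriving the names from the centroids keeps the labeling reproducible.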

Hierarchical Clustering:

Applied Ward's method.
Dendrogram cut at three clusters for comparison.
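
One way to reproduce these two steps is with SciPy's hierarchy tools, with scikit-learn's AgglomerativeClustering as an equivalent shortcut. The sketch below runs on synthetic blobs; which library the project actually used is not stated, so both variants are shown as assumptions.

```python
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Synthetic data with three groups (assumed, for illustration).
X, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [5, 5], [10, 0]],
    cluster_std=0.8,
    random_state=0,
)

# Build the full merge tree with Ward's minimum-variance criterion.
Z = linkage(X, method="ward")

# Dendrogram structure without drawing it; drop no_plot to render with matplotlib.
tree = dendrogram(Z, no_plot=True)

# Cut the tree so that at most three flat clusters remain (labels start at 1).
scipy_labels = fcluster(Z, t=3, criterion="maxclust")

# Same idea in scikit-learn: Ward linkage, stopped at three clusters.
sk_labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X)
```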

📈 Results

Metric	Cluster 0 (Low Risk)	Cluster 1 (Medium Risk)	Cluster 2 (High Risk)
Glucose	Low	Medium	High
BMI	Low	Elevated	High
Insulin	Low	Moderate	High
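
A profile table like the one above is typically built by averaging each feature within each cluster. The sketch below shows the mechanics on made-up numbers; the values are placeholders and are not results from this project.

```python
import pandas as pd

# Made-up feature values and cluster labels, for illustration only.
df = pd.DataFrame({
    "Glucose": [90, 95, 120, 125, 160, 170],
    "BMI": [24, 26, 31, 30, 38, 40],
    "Insulin": [60, 70, 120, 130, 220, 240],
    "cluster": [0, 0, 1, 1, 2, 2],
})

# Mean of every feature per cluster, transposed so features are rows
# and clusters are columns, matching the layout of the table above.
profile = df.groupby("cluster").mean().T
profile.columns = [f"Cluster {c}" for c in profile.columns]
```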

🔥 K-Means efficiently segmented groups based on Glucose, BMI, and Insulin.
🧠 Hierarchical Clustering provided detailed data structure insights through dendrograms.
🏆 Both methods validated the existence of Low, Medium, and High risk groups.

🧠 Discussion

K-Means was computationally efficient but required pre-specifying k.
Hierarchical Clustering did not require fixing k before fitting (the dendrogram was cut at three clusters afterward) and exposed the nested structure of the data, but was computationally heavier.
Combining both methods provided robust and validated subgroup classifications.
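
Agreement between the two methods can be quantified rather than judged by eye. The sketch below uses the adjusted Rand index on synthetic data; this metric is a suggestion, not something the project reports having used.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic data with three groups (assumed, for illustration).
X, _ = make_blobs(
    n_samples=200,
    centers=[[0, 0], [6, 6], [12, 0]],
    cluster_std=1.0,
    random_state=1,
)

km_labels = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X)
hc_labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X)

# Adjusted Rand index: 1.0 means identical partitions, values near 0 mean
# agreement no better than chance. It ignores how the clusters are numbered.
ari = adjusted_rand_score(km_labels, hc_labels)
```

Because the index is invariant to label permutations, it works even when K-Means and hierarchical clustering number the same group differently.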

✅ Conclusion

Clustering techniques provided actionable insights into the diabetes risk profiles of Pima Indian females.
This project demonstrates how machine learning can support early diagnosis, risk stratification, and personalized care strategies in healthcare.

🔮 Future Improvements

Apply Gaussian Mixture Models for soft clustering.
Incorporate feature selection techniques to optimize model performance.
Expand dataset to multi-ethnic groups for broader generalization.
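
As a starting point for the soft-clustering idea above, a Gaussian Mixture Model returns a membership probability for each cluster instead of a single label. The sketch below uses synthetic, partly overlapping blobs; the 0.9 confidence threshold is an arbitrary assumption.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic, partly overlapping groups (assumed, for illustration).
X, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [3, 3], [6, 0]],
    cluster_std=1.2,
    random_state=0,
)

gmm = GaussianMixture(n_components=3, random_state=0).fit(X)

# One row per sample, one column per component; each row sums to 1.
proba = gmm.predict_proba(X)

# Hard labels are still available as the most probable component.
hard = proba.argmax(axis=1)

# Samples whose top probability is below the (arbitrary) threshold sit
# between groups and could be flagged as borderline cases.
uncertain = np.flatnonzero(proba.max(axis=1) < 0.9)
```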

📬 Contact

"Empowering healthcare through machine learning, one cluster at a time." 🧠🚀
