A multivariate data analysis project on Chronic Kidney Disease (CKD) using the UCI CKD dataset. The work covers dataset description, data preprocessing, exploratory data analysis (EDA), principal component analysis (PCA), and confirmatory analysis (hypothesis testing + confidence intervals).

---

Chronic Kidney Disease (CKD) is a progressive condition that can remain undetected until advanced stages. Early identification and intervention are important for preventing progression and improving outcomes.
This project uses multivariate data analysis methods to explore differences between CKD and Not-CKD groups and to confirm which variables show evidence of mean differences.
- Source: UCI Chronic Kidney Disease dataset
- Size: 398 instances, 25 attributes
- Attributes:
- 11 numerical (e.g., age, bp, bgr, bu, sc, sod, pot, hemo, pcv, wbcc, rbcc)
- 14 nominal (e.g., sg, al, su, rbc, pc, pcc, ba, htn, dm, cad, appet, pe, ane)
- Target variable: `class` (CKD vs Not-CKD)
Main dataset issues handled:
- Missing (null) values
- Multi-level factor variables (collapsed to two levels for analysis where needed)
- Class imbalance
- Missing value imputation methods:
- CART (classification and regression trees)
- Logistic / polynomial regression approaches
- Multi-level factors collapsed to two levels (examples mentioned in slides):
- Specific gravity (sg)
- Sugar (su)
- Albumin (al)
- Data split:
- Stratified sampling based on `class`
- EDA (training): 1/3
- Confirmatory (testing): 2/3
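
The stratified 1/3 vs 2/3 split described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the helper name `stratified_split`, the seed, and the toy labels are all assumptions, and the class counts are made up.

```python
import random
from collections import defaultdict


def stratified_split(labels, eda_fraction=1 / 3, seed=0):
    """Return (eda_idx, confirm_idx), drawing eda_fraction of each class."""
    rng = random.Random(seed)

    # Group row indices by class label so each class is sampled separately.
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    eda_idx, confirm_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        n_eda = round(len(idx) * eda_fraction)
        eda_idx.extend(idx[:n_eda])
        confirm_idx.extend(idx[n_eda:])
    return sorted(eda_idx), sorted(confirm_idx)


# Toy labels only; these counts are illustrative, not the dataset's.
labels = ["ckd"] * 30 + ["notckd"] * 18
eda_idx, confirm_idx = stratified_split(labels)
print(len(eda_idx), len(confirm_idx))
```

Sampling within each class keeps the CKD / Not-CKD proportions roughly equal in both subsets, which matters given the class imbalance noted earlier.
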
- Numerical variables showed:
- Skewness / non-normality (age, bgr, bu, sc, wbcc)
- Multi-modality (pcv, hemo)
- Mean differences and clustering by CKD classification
- Qualitative variables showed clear separation for some levels (examples highlighted):
- classification, anemia, albumin, appetite
- Observed relationships:
- Lower hemoglobin levels relate to anemia in CKD
- rbcc associated with pcv
- sc correlated with bu
- Multicollinearity noted among:
- hemoglobin, rbcc, pcv
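
The skewness and correlation findings above are the kind of summaries sketched below. The helper names and the toy values are hypothetical and the numbers are not taken from the dataset. The skewness here is the simple moment-based estimator, which may differ from whatever the original analysis used.

```python
import math


def sample_skewness(values):
    """Moment-based skewness g1 = m3 / m2**1.5; positive means a right tail."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Made-up values standing in for a right-skewed variable such as sc or bu.
right_skewed = [1, 1, 2, 2, 3, 10]
print(sample_skewness(right_skewed) > 0)

# Made-up paired values standing in for two related variables such as sc and bu.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]
print(pearson_r(x, y))
```
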
- Scree plot summary from slides:
- First six components explain about 85% of variation
- 3D PCA plots suggested clustering by:
- classification and some clinical indicators (e.g., hypertension)
- Data imbalance noted as a limitation
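
A scree-plot summary such as "the first six components explain about 85% of variation" comes from cumulative explained-variance ratios. The sketch below shows that calculation with NumPy on synthetic data. It assumes PCA on standardized numeric columns, which the slides do not confirm, and the function names are hypothetical.

```python
import numpy as np


def pca_explained_variance(X):
    """Explained-variance ratios from PCA on standardized columns."""
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Squared singular values of Z, divided by (n - 1), are the
    # eigenvalues of the sample correlation matrix.
    s = np.linalg.svd(Z, compute_uv=False)
    eigvals = s ** 2 / (Z.shape[0] - 1)
    return eigvals / eigvals.sum()


def n_components_for(ratios, threshold=0.85):
    """Smallest number of components whose cumulative ratio reaches threshold."""
    cumulative = np.cumsum(ratios)
    return int(np.searchsorted(cumulative, threshold) + 1)


# Synthetic data: three strongly correlated columns plus two noise columns.
rng = np.random.default_rng(0)
base = rng.normal(size=(60, 1))
X = np.hstack([base + 0.1 * rng.normal(size=(60, 3)), rng.normal(size=(60, 2))])

ratios = pca_explained_variance(X)
k = n_components_for(ratios, threshold=0.85)
print(np.round(np.cumsum(ratios), 3), k)
```
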
- Multivariate normality check:
- Mardia’s test indicated that the data are not multivariate normal
- Hypothesis testing approach:
- Non-parametric bootstrap and permutation tests were used to compare group means (CKD vs Not-CKD), because the assumptions of classical multivariate normal (MVN) methods were not satisfied.
- Confidence intervals:
- Bootstrap CIs were initially wider
- Extended to include:
- Bootstrap CIs
- Permutation CIs
- Bonferroni CIs (noted as more precise in slides)
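
The resampling ideas above can be illustrated for a single variable. The sketch below is not the project's code: it pairs a two-sided permutation test for a difference in means with a percentile bootstrap interval whose level is Bonferroni-adjusted for `n_tests` simultaneous comparisons. Combining the percentile method with Bonferroni is an assumption here, and the slides may have built their intervals differently. Function names and the toy values are hypothetical.

```python
import random


def _mean(values):
    return sum(values) / len(values)


def permutation_pvalue(a, b, n_perm=2000, seed=0):
    """Two-sided permutation p-value for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(_mean(a) - _mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        # Shuffling the pooled values reassigns group labels at random.
        rng.shuffle(pooled)
        diff = abs(_mean(pooled[:len(a)]) - _mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from an exact zero.
    return (hits + 1) / (n_perm + 1)


def bootstrap_ci(a, b, alpha=0.05, n_tests=1, n_boot=2000, seed=0):
    """Percentile bootstrap CI for mean(a) - mean(b), Bonferroni-adjusted."""
    rng = random.Random(seed)
    diffs = sorted(
        _mean(rng.choices(a, k=len(a))) - _mean(rng.choices(b, k=len(b)))
        for _ in range(n_boot)
    )
    adj = alpha / n_tests  # Bonferroni: split alpha across n_tests intervals
    lo = diffs[int((adj / 2) * n_boot)]
    hi = diffs[min(n_boot - 1, int((1 - adj / 2) * n_boot))]
    return lo, hi


# Made-up hemoglobin-like values for two groups; not taken from the dataset.
ckd = [9.1, 10.2, 8.7, 11.0, 9.8, 10.5, 8.9, 9.4]
not_ckd = [14.2, 15.1, 13.8, 14.9, 15.5, 14.0, 13.6, 15.0]

p_value = permutation_pvalue(ckd, not_ckd)
ci_single = bootstrap_ci(ckd, not_ckd, n_tests=1)
ci_simultaneous = bootstrap_ci(ckd, not_ckd, n_tests=3)
print(p_value, ci_single, ci_simultaneous)
```

An interval that excludes zero points to a mean difference for that variable. Raising `n_tests` makes each interval at least as wide, which is the cost of simultaneous coverage.
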
- EDA shows strong separation for several variables, especially among qualitative factors like anemia, albumin, appetite, and classification.
- Confirmatory testing rejects the equal-mean hypothesis for most variables. In the presented table, the hypothesis was not rejected for at least one variable (bacteria).
- Using simultaneous bootstrap CIs, the slides report no significant mean differences for a small subset of variables (examples mentioned: age, potassium, white blood cell count), while others show clearer differences.