Chronic Kidney Disease (CKD) — Multivariate Data Analysis

A multivariate data analysis project on Chronic Kidney Disease (CKD) using the UCI CKD dataset. The work covers dataset description, data preprocessing, exploratory data analysis (EDA), principal component analysis (PCA), and confirmatory analysis (hypothesis testing + confidence intervals).

---

Project Overview

Chronic Kidney Disease (CKD) is a progressive condition that can remain undetected until advanced stages. Early identification and intervention are important for preventing progression and improving outcomes.
This project uses multivariate data analysis methods to explore differences between CKD and Not-CKD groups and to confirm which variables show evidence of mean differences.

Dataset

Source: UCI Chronic Kidney Disease dataset
Size: 400 instances, 25 attributes
Attributes:
- 11 numerical (e.g., age, bp, bgr, bu, sc, sod, pot, hemo, pcv, wbcc, rbcc)
- 14 nominal (e.g., sg, al, su, rbc, pc, pcc, ba, htn, dm, cad, appet, pe, ane)
Target variable: class (CKD vs Not-CKD)

Main dataset issues handled:
- Missing (null) values
- Multi-level factor variables (converted/forced to two levels for analysis where needed)
- Class imbalance

Methods

1) Data Preprocessing

Missing value imputation methods:
- CART (classification and regression trees)
- Logistic / polynomial regression approaches
Multi-level factors forced to two levels (examples mentioned in the slides):
- Specific gravity (sg)
- Sugar (su)
- Albumin (al)
Data split:
- Stratified sampling based on class
- EDA (training): 1/3
- Confirmatory (testing): 2/3
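
The stratified split described above can be sketched as follows. This is a minimal illustration in Python's standard library, not the project's own code; the function name `stratified_split`, the seed, and the toy labels are assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_split(labels, eda_fraction=1 / 3, seed=0):
    """Split row indices into (eda_idx, confirm_idx), class by class.

    Each class contributes roughly `eda_fraction` of its rows to the EDA
    part, so both parts keep a similar CKD / Not-CKD ratio.
    """
    rng = random.Random(seed)

    # Group row indices by class label.
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    eda_idx, confirm_idx = [], []
    for label in sorted(by_class):
        idx = by_class[label]
        rng.shuffle(idx)
        n_eda = round(len(idx) * eda_fraction)
        eda_idx.extend(idx[:n_eda])
        confirm_idx.extend(idx[n_eda:])
    return sorted(eda_idx), sorted(confirm_idx)


# Made-up labels for illustration only (not the project's data).
labels = ["ckd"] * 30 + ["notckd"] * 18
eda_idx, confirm_idx = stratified_split(labels)
print(len(eda_idx), len(confirm_idx))
```

Splitting within each class, rather than over the whole table, keeps the class imbalance the same in the EDA and confirmatory parts.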

2) Exploratory Data Analysis (EDA)

Numerical variables showed:
- Skewness / non-normality (age, bgr, bu, sc, wbcc)
- Multi-modality (pcv, hemo)
- Mean differences and clustering by CKD classification
Qualitative variables showed clear separation for some levels (examples highlighted):
- classification, anemia, albumin, appetite

3) Correlation Analysis

Observed relationships:
- Lower hemoglobin levels relate to anemia in CKD
- rbcc associated with pcv
- sc correlated with bu
Multicollinearity noted among:
- hemoglobin, rbcc, pcv
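
One way to screen for the kind of multicollinearity noted above is to compute pairwise Pearson correlations and flag the strongly correlated pairs. The sketch below uses only Python's standard library; the helper names, the 0.8 threshold, and the toy values are assumptions for illustration and are not taken from the project.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences.

    Constant columns (zero variance) are not handled in this sketch.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def high_pairs(columns, threshold=0.8):
    """Return (name_a, name_b, r) for each pair with |r| >= threshold."""
    names = list(columns)
    flagged = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            if abs(r) >= threshold:
                flagged.append((a, b, r))
    return flagged


# Made-up values for illustration only (not the project's data).
toy = {
    "hemo": [15.4, 11.3, 9.6, 11.2, 11.6, 12.2],
    "pcv": [44, 38, 31, 32, 35, 39],
    "rbcc": [5.2, 4.1, 3.4, 3.9, 4.6, 4.4],
}
for a, b, r in high_pairs(toy):
    print(a, b, round(r, 2))
```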

4) Principal Component Analysis (PCA)

Scree plot summary from slides:
- First six components explain about 85% of variation
3D PCA plots suggested clustering by:
- classification and some clinical indicators (e.g., hypertension)
Data imbalance noted as a limitation
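
The scree plot reading above amounts to a cumulative proportion of variance: with eigenvalues sorted in decreasing order, the first k components explain the sum of the first k eigenvalues divided by the sum of all of them. A minimal sketch in plain Python follows; the function names and the example eigenvalues are made up for illustration and are not the project's PCA output.

```python
def cumulative_explained_variance(eigenvalues):
    """Cumulative share of total variance, largest eigenvalue first."""
    total = sum(eigenvalues)
    running = 0.0
    shares = []
    for value in sorted(eigenvalues, reverse=True):
        running += value
        shares.append(running / total)
    return shares


def components_needed(eigenvalues, target=0.85):
    """Smallest number of components whose cumulative share reaches `target`."""
    shares = cumulative_explained_variance(eigenvalues)
    for k, share in enumerate(shares, start=1):
        if share >= target:
            return k
    return len(eigenvalues)


# Hypothetical eigenvalues, for illustration only.
example = [4.0, 3.0, 2.0, 1.0]
print(cumulative_explained_variance(example))
print(components_needed(example, target=0.85))
```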

5) Confirmatory Analysis

Multivariate normality check:
- Mardia’s test indicated that the data are not multivariate normal
Hypothesis testing approach:
- Non-parametric bootstrap and permutation tests were used to compare group means (CKD vs Not-CKD), because the multivariate normality assumption behind classical methods was not satisfied.
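
For a single variable, a permutation test of equal group means can be sketched as below. This is a generic two-sided version written with Python's standard library; the function name, the number of permutations, and the seed are assumptions, and it is not necessarily the exact procedure used in the slides.

```python
import random


def mean(values):
    return sum(values) / len(values)


def permutation_test_mean_diff(group_a, group_b, n_perm=5000, seed=0):
    """Two-sided permutation test for a difference in group means.

    Returns (observed_difference, p_value). Group labels are shuffled
    `n_perm` times; the p-value is the share of shuffles whose absolute
    mean difference is at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            extreme += 1

    # Add-one correction so the estimated p-value is never exactly zero.
    p_value = (extreme + 1) / (n_perm + 1)
    return observed, p_value


# Made-up values for illustration only (not the project's data).
ckd = [10, 11, 12, 13, 14, 15, 16, 17]
not_ckd = [1, 2, 3, 4, 5, 6, 7, 8]
print(permutation_test_mean_diff(ckd, not_ckd))
```

The test makes no normality assumption; it only relies on the group labels being exchangeable under the null hypothesis.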
Confidence intervals:
- The initial bootstrap CIs were comparatively wide
- The comparison was extended to three interval types:
  - Bootstrap CIs
  - Permutation CIs
  - Bonferroni CIs (noted as more precise in the slides)
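
As a rough illustration of two of the ideas above, the sketch below builds a percentile bootstrap CI for a difference in means and optionally applies a Bonferroni adjustment by dividing the significance level by the number of variables tested. It is a generic construction under stated assumptions (function name, percentile method, seed, toy data), not the slides' own method, and it does not reproduce their comparison of interval widths.

```python
import random


def mean(values):
    return sum(values) / len(values)


def bootstrap_ci_mean_diff(group_a, group_b, alpha=0.05, n_tests=1,
                           n_boot=5000, seed=0):
    """Percentile bootstrap CI for mean(group_a) - mean(group_b).

    With n_tests > 1 the level is Bonferroni-adjusted to alpha / n_tests,
    so that n_tests such intervals hold simultaneously.
    """
    rng = random.Random(seed)
    adjusted_alpha = alpha / n_tests

    # Resample each group with replacement and record the mean difference.
    diffs = []
    for _ in range(n_boot):
        resample_a = rng.choices(group_a, k=len(group_a))
        resample_b = rng.choices(group_b, k=len(group_b))
        diffs.append(mean(resample_a) - mean(resample_b))
    diffs.sort()

    # Read the interval ends off the sorted bootstrap differences.
    lower_i = int((adjusted_alpha / 2) * n_boot)
    upper_i = min(n_boot - 1, int((1 - adjusted_alpha / 2) * n_boot))
    return diffs[lower_i], diffs[upper_i]


# Made-up values for illustration only (not the project's data).
ckd = [10, 11, 12, 13, 14, 15, 16, 17]
not_ckd = [1, 2, 3, 4, 5, 6, 7, 8]
print(bootstrap_ci_mean_diff(ckd, not_ckd))              # single interval
print(bootstrap_ci_mean_diff(ckd, not_ckd, n_tests=11))  # adjusted for 11 variables
```

An interval that excludes zero is read as evidence of a mean difference for that variable. In this construction the adjusted interval is never narrower than the unadjusted one for the same resamples.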

Key Findings (from slides)

EDA shows strong separation for several variables, especially among qualitative factors like anemia, albumin, appetite, and classification.
Confirmatory testing rejects the equal-mean hypothesis for most variables; for at least one variable (bacteria), the hypothesis was not rejected in the presented table.
Using simultaneous bootstrap CIs, the slides report no significant mean differences for a small subset of variables (examples mentioned: age, potassium, white blood cell count), while others show clearer differences.
