shakkyaNV/MultivariateAnalysis

Chronic Kidney Disease (CKD) — Multivariate Data Analysis

A multivariate data analysis project on Chronic Kidney Disease (CKD) using the UCI CKD dataset. The work covers dataset description, data preprocessing, exploratory data analysis (EDA), principal component analysis (PCA), and confirmatory analysis (hypothesis testing and confidence intervals).

---

Project Overview

Chronic Kidney Disease (CKD) is a progressive condition that can remain undetected until advanced stages. Early identification and intervention are important for preventing progression and improving outcomes.
This project uses multivariate data analysis methods to explore differences between CKD and Not-CKD groups and to confirm which variables show evidence of mean differences.


Dataset

  • Source: UCI Chronic Kidney Disease dataset
  • Size: 398 instances, 25 attributes
  • Attributes:
    • 11 numerical: age, bp, bgr, bu, sc, sod, pot, hemo, pcv, wbcc, rbcc
    • 14 nominal: sg, al, su, rbc, pc, pcc, ba, htn, dm, cad, appet, pe, ane, and the target class
  • Target variable: class (CKD vs Not-CKD)

Main dataset issues handled:

  • Missing (null) values
  • Multi-level factor variables (collapsed to two levels for analysis where needed)
  • Class imbalance

Methods

1) Data Preprocessing

  • Missing value imputation methods:
    • CART (classification and regression trees)
    • Logistic / polynomial regression approaches
  • Multi-level factors collapsed to two levels (examples mentioned in slides):
    • Specific gravity (sg)
    • Sugar (su)
    • Albumin (al)
  • Data split:
    • Stratified sampling based on class
    • EDA (training): 1/3
    • Confirmatory (testing): 2/3
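The stratified 1/3 vs 2/3 split can be sketched as follows. This is a minimal illustration in standard-library Python, not the project's own code; the function name `stratified_split`, the seed, and the label counts are hypothetical and chosen only to show that each class keeps its proportion in both parts.

```python
import random
from collections import defaultdict

def stratified_split(labels, eda_fraction=1/3, seed=0):
    """Split row indices into (eda_idx, confirm_idx), class by class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    eda_idx, confirm_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)                       # random order within the class
        n_eda = round(len(idx) * eda_fraction)
        eda_idx.extend(idx[:n_eda])            # EDA part
        confirm_idx.extend(idx[n_eda:])        # confirmatory part
    return sorted(eda_idx), sorted(confirm_idx)

# Hypothetical, imbalanced labels (not the real class counts).
labels = ["ckd"] * 60 + ["notckd"] * 30
eda_idx, confirm_idx = stratified_split(labels)
```

Splitting within each class, rather than over the pooled rows, keeps the class imbalance the same in the EDA and confirmatory subsets.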

2) Exploratory Data Analysis (EDA)

  • Numerical variables showed:
    • Skewness / non-normality (age, bgr, bu, sc, wbcc)
    • Multi-modality (pcv, hemo)
    • Mean differences and clustering by CKD classification
  • Qualitative variables showed clear separation for some levels (examples highlighted):
    • classification, anemia, albumin, appetite
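One way to quantify the skewness noted for variables such as bgr or sc is the moment-based sample skewness. The sketch below is illustrative only: the function name and the two value lists are hypothetical and not taken from the dataset.

```python
def sample_skewness(values):
    """Moment-based skewness g1 = m3 / m2**1.5 (0 for symmetric data)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n   # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n   # third central moment
    return m3 / m2 ** 1.5

# Hypothetical values: one symmetric sample, one with a long right tail.
symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]
right_skewed = [1.0, 1.0, 1.2, 1.5, 2.0, 9.0]

skew_symmetric = sample_skewness(symmetric)
skew_right = sample_skewness(right_skewed)
```

A value near zero suggests symmetry, while a clearly positive value indicates a long right tail, which is the pattern typically seen in laboratory measurements with a few very high readings.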

3) Correlation Analysis

  • Observed relationships:
    • Lower hemoglobin levels relate to anemia in CKD
    • rbcc associated with pcv
    • sc correlated with bu
  • Multicollinearity noted among:
    • hemoglobin, rbcc, pcv
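The multicollinearity check can be illustrated by computing pairwise Pearson correlations and flagging pairs above a cutoff. This is a sketch under assumptions: the 0.8 threshold, the helper names, and the small data columns are all hypothetical and are not the project's values.

```python
import math
from itertools import combinations

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def collinear_pairs(columns, threshold=0.8):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold."""
    pairs = []
    for a, b in combinations(columns, 2):
        r = pearson_r(columns[a], columns[b])
        if abs(r) >= threshold:
            pairs.append((a, b, round(r, 3)))
    return pairs

# Hypothetical measurements for six patients (illustration only).
columns = {
    "hemo": [15.0, 13.5, 11.0, 9.5, 8.0, 12.0],
    "pcv":  [45, 41, 33, 29, 24, 36],
    "rbcc": [5.2, 4.8, 3.9, 3.4, 2.9, 4.3],
    "age":  [48, 62, 51, 70, 45, 66],
}
flagged = collinear_pairs(columns)
```

In this toy data the three blood-related columns are flagged against each other while age is not, mirroring the kind of grouping described above.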

4) Principal Component Analysis (PCA)

  • Scree plot summary from slides:
    • First six components explain about 85% of variation
  • 3D PCA plots suggested clustering by:
    • classification and some clinical indicators (e.g., hypertension)
  • Data imbalance noted as a limitation
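A scree-plot summary such as "the first k components explain a given share of variation" comes from the cumulative proportion of the PCA eigenvalues. The sketch below shows only that arithmetic; the eigenvalues are hypothetical and do not reproduce the project's scree plot.

```python
def cumulative_explained_variance(eigenvalues):
    """Cumulative share of total variance, largest eigenvalue first."""
    total = sum(eigenvalues)
    running, shares = 0.0, []
    for ev in sorted(eigenvalues, reverse=True):
        running += ev
        shares.append(running / total)
    return shares

def components_needed(eigenvalues, threshold):
    """Smallest number of components whose cumulative share >= threshold."""
    for k, share in enumerate(cumulative_explained_variance(eigenvalues), start=1):
        if share >= threshold:
            return k
    return len(eigenvalues)

# Hypothetical eigenvalues of a correlation matrix (illustration only).
eigenvalues = [4.0, 2.0, 1.5, 1.0, 0.75, 0.75]
shares = cumulative_explained_variance(eigenvalues)
k_80 = components_needed(eigenvalues, threshold=0.80)
k_90 = components_needed(eigenvalues, threshold=0.90)
```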

5) Confirmatory Analysis

  • Multivariate normality check:
    • Mardia’s test indicated that the data do not follow a multivariate normal distribution
  • Hypothesis testing approach:
    • Used non-parametric bootstrap and permutation tests to compare group means (CKD vs Not-CKD), because the multivariate normality (MVN) assumption behind classical methods was not satisfied.
  • Confidence intervals:
    • The initial bootstrap CIs were comparatively wide
    • The comparison was then extended to include:
      • Bootstrap CIs
      • Permutation CIs
      • Bonferroni CIs (noted as more precise in slides)
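A permutation test for a difference in group means can be sketched as below. This is a generic two-sample version in standard-library Python, not the project's implementation; the function name, the number of permutations, the seed, and the hemoglobin-like values are hypothetical.

```python
import random

def permutation_test_mean_diff(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation p-value for H0: equal group means."""
    rng = random.Random(seed)
    n_a = len(group_a)
    observed = abs(sum(group_a) / n_a - sum(group_b) / len(group_b))

    pooled = list(group_a) + list(group_b)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)                     # reassign group labels at random
        perm_a, perm_b = pooled[:n_a], pooled[n_a:]
        diff = abs(sum(perm_a) / len(perm_a) - sum(perm_b) / len(perm_b))
        if diff >= observed:
            extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (extreme + 1) / (n_permutations + 1)

# Hypothetical hemoglobin-like values for the two groups (illustration only).
ckd = [9.5, 10.2, 8.8, 11.0, 9.9, 10.5, 9.1, 10.8]
not_ckd = [14.8, 15.2, 13.9, 16.0, 15.5, 14.4, 15.0, 16.3]
p_value = permutation_test_mean_diff(ckd, not_ckd)
```

Because the reference distribution is built by reshuffling the observed labels, the test does not rely on normality, which is why it suits data that fail Mardia's test.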

Key Findings (from slides)

  • EDA shows strong separation for several variables, especially among qualitative factors like anemia, albumin, appetite, and classification.
  • Confirmatory testing rejects the equal-mean hypothesis for most variables; for at least one variable (bacteria), the hypothesis was not rejected in the presented table.
  • Using simultaneous bootstrap CIs, the slides report no significant mean differences for a small subset of variables (examples mentioned: age, potassium, white blood cell count), while others show clearer differences.
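The idea behind simultaneous intervals can be illustrated by combining percentile bootstrap intervals with a Bonferroni adjustment, where each of m intervals is built at level alpha/m. This pairing is an assumption made for illustration and may differ from how the slides constructed their intervals; all names, seeds, and data values below are hypothetical.

```python
import random

def bootstrap_mean_diff_ci(group_a, group_b, alpha=0.05, n_boot=2000, seed=0):
    """Percentile bootstrap CI for mean(group_a) - mean(group_b)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        a = rng.choices(group_a, k=len(group_a))   # resample with replacement
        b = rng.choices(group_b, k=len(group_b))
        diffs.append(sum(a) / len(a) - sum(b) / len(b))
    diffs.sort()
    lower = diffs[int((alpha / 2) * (n_boot - 1))]
    upper = diffs[int((1 - alpha / 2) * (n_boot - 1))]
    return lower, upper

def bonferroni_bootstrap_cis(variables, alpha=0.05):
    """One interval per variable, each at level alpha / m."""
    m = len(variables)
    return {name: bootstrap_mean_diff_ci(a, b, alpha=alpha / m)
            for name, (a, b) in variables.items()}

# Hypothetical (CKD, Not-CKD) samples for two variables (illustration only).
variables = {
    "hemo": ([9.5, 10.2, 8.8, 11.0, 9.9, 10.5, 9.1, 10.8],
             [14.8, 15.2, 13.9, 16.0, 15.5, 14.4, 15.0, 16.3]),
    "age":  ([55, 62, 48, 70, 51, 66, 59, 44],
             [52, 60, 47, 68, 54, 63, 57, 49]),
}
cis = bonferroni_bootstrap_cis(variables)
```

An interval that excludes zero points to a mean difference for that variable, while an interval that contains zero does not, which is how statements like "no significant mean difference for age" are read off a table of simultaneous intervals.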

About

The project applies multivariate statistical methods, PCA, and non-parametric bootstrap and permutation inference to the UCI Chronic Kidney Disease dataset to identify clinically important variables under minimal distributional assumptions.
