jessicakushidaa/data-prep-and-sampling

Data Analysis and Preprocessing for Heart Disease Prediction

🔹 Project Overview

This project demonstrates essential data preprocessing and sampling techniques on a heart disease dataset. It covers the steps from initial Exploratory Data Analysis (EDA) through methods for handling class imbalance, preparing the data for machine learning models. It is an applied study for the Advanced Database Practice course at PUC-Campinas.


💡 Key Skills Demonstrated

  • Exploratory Data Analysis (EDA)
  • Data Visualization (matplotlib, seaborn)
  • Feature Engineering & Preprocessing
    • Categorical Variable Encoding (One-Hot vs. Label)
    • Continuous Variable Discretization (Equal-Width vs. Quantile)
  • Handling Class Imbalance (imblearn)
    • Random Oversampling
    • Random Undersampling
    • SMOTE (Synthetic Minority Over-sampling Technique)

📊 Methodology and Analyses Performed

  1. Initial Data Exploration: Performed EDA using .info(), .describe(), and histograms to assess data structure and quality, identifying potential masked missing values (e.g., 0 for Cholesterol).

  2. Categorical Data Encoding: Compared One-Hot vs. Label Encoding on the RestingECG variable, concluding that One-Hot is superior for distance-based algorithms as it avoids creating an artificial ordinal relationship.

  3. Continuous Data Discretization: Applied Equal-Width and Quantile-based Discretization to the Age variable, demonstrating how quantile binning creates more balanced groups ideal for comparative analysis.

  4. Handling Class Imbalance: Addressed class imbalance in the HeartDisease target variable by implementing and visualizing the results of Random Oversampling, Random Undersampling, and SMOTE.
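
Steps 1 to 3 above can be sketched with pandas alone. This is a minimal illustration, not the notebook's actual code: the toy rows are invented, and only the column names (`Age`, `Cholesterol`, `RestingECG`) come from the description above. The choice of four bins is also an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; values are invented for illustration only.
df = pd.DataFrame({
    "Age": [29, 41, 45, 52, 54, 58, 60, 77],
    "Cholesterol": [210, 0, 180, 250, 0, 300, 195, 220],
    "RestingECG": ["Normal", "ST", "LVH", "Normal",
                   "Normal", "ST", "LVH", "Normal"],
})

# Step 1: a cholesterol reading of 0 is not physiologically plausible,
# so treat it as a masked missing value.
df["Cholesterol"] = df["Cholesterol"].replace(0, np.nan)
n_masked = int(df["Cholesterol"].isna().sum())
print("masked missing Cholesterol values:", n_masked)

# Step 2a: label encoding maps each category to an integer code,
# which implies an ordering between categories that has no meaning.
label_codes = df["RestingECG"].astype("category").cat.codes

# Step 2b: one-hot encoding creates one indicator column per category,
# so no artificial order or distance between categories is introduced.
one_hot = pd.get_dummies(df["RestingECG"], prefix="RestingECG")

# Step 3a: equal-width discretization (bins span equal age ranges).
age_equal_width = pd.cut(df["Age"], bins=4)

# Step 3b: quantile discretization (bins hold roughly equal row counts).
age_quantile = pd.qcut(df["Age"], q=4)

width_counts = age_equal_width.value_counts(sort=False).tolist()
quantile_counts = age_quantile.value_counts(sort=False).tolist()
print("equal-width bin sizes:", width_counts)
print("quantile bin sizes:   ", quantile_counts)
```

Comparing the two printed lists shows the trade-off described in step 3: equal-width bins keep interpretable, evenly spaced age ranges, while quantile bins keep the group sizes close to each other.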


🛠️ Technologies Used

  • Language: Python
  • Libraries: Pandas, NumPy, Scikit-learn, Matplotlib, Seaborn, Imbalanced-learn
  • Environment: Jupyter Notebook

🚀 How to Run the Project

  1. Clone the repository:

  2. Install the dependencies:

    pip install pandas numpy matplotlib seaborn scikit-learn imbalanced-learn jupyter

  3. Start the Jupyter Notebook:

    jupyter notebook excs.ipynb

📬 Contact

Authors: Jéssica Kushida, Natália Naomi Sumida, and Isabella Tressino

About

Preprocessing the heart disease dataset: A practical guide to EDA, feature encoding, discretization, and handling class imbalance with SMOTE.
