This project demonstrates essential data preprocessing and sampling techniques on a heart disease dataset. It covers the steps from initial Exploratory Data Analysis (EDA) through methods for handling class imbalance, preparing the data for machine learning models. It is an applied study for the Advanced Database Practice course at PUC-Campinas (Pontifical Catholic University of Campinas).
- Exploratory Data Analysis (EDA)
- Data Visualization (matplotlib, seaborn)
- Feature Engineering & Preprocessing
  - Categorical Variable Encoding (One-Hot vs. Label)
  - Continuous Variable Discretization (Equal-Width vs. Quantile)
- Handling Class Imbalance (imblearn)
  - Random Oversampling
  - Random Undersampling
  - SMOTE (Synthetic Minority Over-sampling Technique)
- Initial Data Exploration: Performed EDA using .info(), .describe(), and histograms to assess data structure and quality, identifying potential masked missing values (e.g., 0 for Cholesterol).
- Categorical Data Encoding: Compared One-Hot vs. Label Encoding on the RestingECG variable, concluding that One-Hot is superior for distance-based algorithms because it avoids creating an artificial ordinal relationship.
- Continuous Data Discretization: Applied Equal-Width and Quantile-based Discretization to the Age variable, demonstrating how quantile binning creates more balanced groups, which suits comparative analysis.
- Handling Class Imbalance: Addressed class imbalance in the HeartDisease target variable by implementing and visualizing the results of Random Oversampling, Random Undersampling, and SMOTE.
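The masked-missing-value check from the exploration step can be sketched as follows. This is a minimal illustration, not the notebook's code: the values are invented, and only the column name Cholesterol comes from the project.

```python
import numpy as np
import pandas as pd

# Toy data (invented values); only the column name comes from the dataset.
df = pd.DataFrame({
    "Age": [40, 49, 37, 54, 65],
    "Cholesterol": [289, 0, 283, 0, 214],
})

# A cholesterol reading of 0 is not physiologically plausible, so zeros
# are treated here as missing values in disguise.
masked = (df["Cholesterol"] == 0).sum()
print(f"Zero-valued Cholesterol rows: {masked}")

# Replace the zeros with NaN so summary statistics skip them.
df["Cholesterol"] = df["Cholesterol"].replace(0, np.nan)
print(df["Cholesterol"].describe())
```

Once the zeros are marked as NaN, .describe() computes the mean and quartiles over the valid readings only, so the placeholder values no longer drag the statistics down.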
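The difference between the two encodings can be seen on a small example. The sketch below uses pandas only (category codes stand in for label encoding, and get_dummies for one-hot encoding); the category labels are assumed for illustration and the notebook may use different tools.

```python
import pandas as pd

# Toy column; the three category labels are assumed for illustration.
ecg = pd.Series(["Normal", "ST", "LVH", "Normal"], name="RestingECG")

# Label encoding: each category becomes an integer code. The codes follow
# alphabetical order, which implies an ordering the data does not have.
label_encoded = ecg.astype("category").cat.codes

# One-hot encoding: one binary column per category, no implied order.
one_hot = pd.get_dummies(ecg, prefix="RestingECG", dtype=int)

print(label_encoded.tolist())
print(one_hot)
```

With integer codes, a distance-based model would treat two categories whose codes differ by 2 as farther apart than two whose codes differ by 1, even though the categories have no natural order. In the one-hot representation every pair of distinct categories is equally far apart.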
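To illustrate why quantile binning yields more balanced groups than equal-width binning, here is a short sketch using pandas' cut and qcut. The ages are invented, and the choice of four bins is an assumption made for the example.

```python
import pandas as pd

# Invented ages, concentrated in the mid-50s to low-60s.
age = pd.Series([29, 35, 41, 52, 54, 55, 57, 58, 60, 61, 63, 77], name="Age")

# Equal-width: 4 bins that each span the same range of ages.
equal_width = pd.cut(age, bins=4)

# Quantile-based: 4 bins that each hold roughly the same number of rows.
quantile = pd.qcut(age, q=4)

print(equal_width.value_counts().sort_index())
print(quantile.value_counts().sort_index())
```

On this toy data the equal-width bins hold 3, 1, 7, and 1 rows, while each quantile bin holds 3. The trade-off is that quantile bins cover age ranges of unequal width.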
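The mechanics of random oversampling and undersampling can be sketched by hand. The example below deliberately uses plain pandas rather than imblearn so that each step is visible; it is not the notebook's implementation, the data is invented, and SMOTE is left out because it synthesizes new points by interpolating between neighbors rather than copying rows.

```python
import pandas as pd

# Toy imbalanced data (invented): 8 negative rows, 2 positive rows.
df = pd.DataFrame({
    "Age": [40, 49, 37, 48, 54, 39, 45, 58, 60, 65],
    "HeartDisease": [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
})

counts = df["HeartDisease"].value_counts()
majority = df[df["HeartDisease"] == counts.idxmax()]
minority = df[df["HeartDisease"] == counts.idxmin()]

# Random oversampling: keep every original row and add minority rows,
# drawn with replacement, until both classes have the same size.
extra = minority.sample(
    n=len(majority) - len(minority), replace=True, random_state=42
)
oversampled = pd.concat([majority, minority, extra])

# Random undersampling: keep a random subset of majority rows (without
# replacement) equal in size to the minority class.
undersampled = pd.concat([
    majority.sample(n=len(minority), random_state=42),
    minority,
])

print(oversampled["HeartDisease"].value_counts().to_dict())
print(undersampled["HeartDisease"].value_counts().to_dict())
```

Oversampling keeps all the information but duplicates minority rows, while undersampling discards majority rows. For real use, imblearn provides RandomOverSampler, RandomUnderSampler, and SMOTE classes for these techniques.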
- Language: Python
- Libraries: Pandas, NumPy, Scikit-learn, Matplotlib, Seaborn, Imbalanced-learn
- Environment: Jupyter Notebook
- Clone the repository:
- Install the dependencies:
  pip install pandas numpy matplotlib seaborn scikit-learn imbalanced-learn jupyter
- Start the Jupyter Notebook:
  jupyter notebook excs.ipynb
Authors: Jéssica Kushida, Natália Naomi Sumida, and Isabella Tressino
- LinkedIn: https://www.linkedin.com/in/jessicakushida/
- Email: jessicakushida2@gmail.com