Data Analysis and Preprocessing for Heart Disease Prediction

🔹 Project Overview

This project demonstrates essential data preprocessing and sampling techniques on a heart disease dataset. It covers the steps from initial Exploratory Data Analysis (EDA) through to methods for handling class imbalance, so that the data is ready for training machine learning models. It is an applied study for the Advanced Database Practice course at the PUC-Campinas university.

💡 Key Skills Demonstrated

- Exploratory Data Analysis (EDA)
- Data Visualization (matplotlib, seaborn)
- Feature Engineering & Preprocessing
  - Categorical Variable Encoding (One-Hot vs. Label)
  - Continuous Variable Discretization (Equal-Width vs. Quantile)
- Handling Class Imbalance (imblearn)
  - Random Oversampling
  - Random Undersampling
  - SMOTE (Synthetic Minority Over-sampling Technique)

📊 Methodology and Analyses Performed

Initial Data Exploration: Performed EDA using .info(), .describe(), and histograms to assess data structure and quality, identifying potential masked missing values (e.g., 0 for Cholesterol).
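
As a rough illustration of the masked-missing-value check, the sketch below counts zero readings in Cholesterol and recodes them as NaN. The toy values are invented and stand in for heart.csv, and recoding to NaN is only one possible follow-up, not necessarily what the notebook does.

```python
import numpy as np
import pandas as pd

# Toy stand-in for heart.csv; these values are invented for illustration.
df = pd.DataFrame({
    "Age": [40, 49, 37, 48, 54, 39],
    "Cholesterol": [289, 0, 283, 0, 195, 339],
})

# A cholesterol reading of 0 is not plausible, so zeros likely mask missing data.
n_masked = int((df["Cholesterol"] == 0).sum())

# Assumption: recode zeros as NaN so that summary statistics skip them.
cleaned = df.assign(Cholesterol=df["Cholesterol"].replace(0, np.nan))

print("zero readings:", n_masked)
print(cleaned["Cholesterol"].describe())
```

Comparing `.describe()` before and after the recoding shows how much the zeros pull the mean and minimum down.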
Categorical Data Encoding: Compared One-Hot vs. Label Encoding on the RestingECG variable, concluding that One-Hot is preferable for distance-based algorithms because it does not impose an artificial order on nominal categories.
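
The difference between the two encodings can be sketched on a few rows. The category labels used here (Normal, ST, LVH) are an assumption about the contents of RestingECG and should be checked against heart.csv; the distance comparison is an illustration, not the notebook's own code.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Assumed category labels for RestingECG; verify them against heart.csv.
ecg = pd.Series(["Normal", "ST", "LVH", "Normal"], name="RestingECG")

# Label encoding assigns integers in sorted class order: LVH=0, Normal=1, ST=2.
encoder = LabelEncoder()
label_codes = encoder.fit_transform(ecg)

# One-hot encoding creates one 0/1 indicator column per category.
one_hot = pd.get_dummies(ecg, prefix="RestingECG", dtype=int)

# Distance from the LVH row (index 2) to the ST row (index 1) and to the
# Normal row (index 0) under each encoding.
label_lvh_st = abs(int(label_codes[2]) - int(label_codes[1]))
label_lvh_normal = abs(int(label_codes[2]) - int(label_codes[0]))
onehot_lvh_st = float(np.linalg.norm(one_hot.iloc[2] - one_hot.iloc[1]))
onehot_lvh_normal = float(np.linalg.norm(one_hot.iloc[2] - one_hot.iloc[0]))

print("label distances:", label_lvh_st, label_lvh_normal)
print("one-hot distances:", onehot_lvh_st, onehot_lvh_normal)
```

With label encoding, LVH ends up twice as far from ST as from Normal purely because of alphabetical order, whereas the one-hot rows are all equally far apart.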
Continuous Data Discretization: Applied Equal-Width and Quantile-based Discretization to the Age variable, showing that quantile binning yields groups of roughly equal size, which makes comparisons across groups more reliable.
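
A minimal sketch of the two binning strategies, using pandas `cut` and `qcut`. The ages below are invented for illustration, and the choice of four bins is an assumption rather than the setting used in the notebook.

```python
import pandas as pd

# Invented ages for illustration; the project uses the Age column of heart.csv.
ages = pd.Series(
    [28, 35, 41, 45, 48, 50, 52, 54, 55, 57, 58, 60, 62, 65, 70, 77],
    name="Age",
)

# Equal-width: split the range [min, max] into 4 intervals of the same length.
equal_width = pd.cut(ages, bins=4)

# Quantile-based: choose the edges so each bin holds about the same number of rows.
quantile = pd.qcut(ages, q=4)

# Count rows per bin, ordered from the lowest interval to the highest.
width_counts = equal_width.value_counts().sort_index()
quantile_counts = quantile.value_counts().sort_index()

print(width_counts)
print(quantile_counts)
```

On this toy sample the quantile bins each hold four values, while the equal-width bins are uneven because most ages cluster in the middle of the range.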
Handling Class Imbalance: Addressed class imbalance in the HeartDisease target variable by implementing and visualizing the results of Random Oversampling, Random Undersampling, and SMOTE.

🛠️ Technologies Used

Language: Python
Libraries: Pandas, NumPy, Scikit-learn, Matplotlib, Seaborn, Imbalanced-learn
Environment: Jupyter Notebook

🚀 How to Run the Project

Clone the repository:

Install the dependencies:

```
pip install pandas numpy matplotlib seaborn scikit-learn imbalanced-learn jupyter
```

Start the Jupyter Notebook:
```
jupyter notebook excs.ipynb
```

📬 Contact

Authors: Jéssica Kushida, Natália Naomi Sumida, and Isabella Tressino

LinkedIn: https://www.linkedin.com/in/jessicakushida/
Email: jessicakushida2@gmail.com
