A hands-on collection of Jupyter notebooks for learning and practicing Exploratory Data Analysis (EDA).
This repository is organized as a step-by-step workflow:
- Understand your dataset
- Exploratory Data Analysis (EDA)
  - Explore univariate patterns
  - Explore multivariate patterns
  - Analyze bivariate and multivariate relationships
- Detect outliers
- Handle missing values
- Train baseline ML models
  - Classification
  - Regression
  - Clustering
It includes multiple CSV datasets (stored in datasets/) so each notebook can be executed directly.
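
As a rough illustration of a first pass over one of these CSV files, the sketch below loads a file from datasets/ and collects shape, data types, and missing-value counts. The helper names `load_dataset` and `overview` are hypothetical and do not come from the notebooks; the only assumption about the repository is that the code runs from its root, so that the relative path datasets/ resolves.

```python
from pathlib import Path

import pandas as pd


def load_dataset(name: str, root: Path = Path("datasets")) -> pd.DataFrame:
    """Read one CSV from the datasets/ folder (path relative to the repo root)."""
    return pd.read_csv(root / name)


def overview(df: pd.DataFrame) -> dict:
    """Collect the basic facts a first inspection usually starts with."""
    return {
        "shape": df.shape,                           # (rows, columns)
        "dtypes": df.dtypes.astype(str).to_dict(),   # column -> dtype name
        "missing": df.isna().sum().to_dict(),        # column -> count of NaN
    }
```

From the repository root, `overview(load_dataset("iris_seaborn.csv"))` would return the summary for the iris file, and `df.describe(include="all")` adds per-column summary statistics.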
- 1. Dataset Overview.ipynb: Initial data inspection: shape, data types, summary statistics, and first quality checks.
- 2.1 EDA - Univariate.ipynb: Distribution analysis for single variables (numerical and categorical).
- 2.2 EDA - Bivariate and Multivariate.ipynb: Relationship analysis using pairwise comparisons, grouped summaries, and multivariate visualizations.
- 3. Outliers.ipynb: Outlier detection methods and interpretation.
- 4. Missing Values.ipynb: Missing data inspection and practical handling/imputation techniques.
- 5.1 Model - Classification.ipynb: End-to-end supervised learning workflow for classification tasks: preprocessing, model training, and evaluation.
- 5.2 Model - Regression.ipynb: End-to-end supervised learning workflow for regression tasks with appropriate metrics and model diagnostics.
- 5.3 Model - Clustering.ipynb: Unsupervised learning workflow for clustering, including cluster quality analysis and interpretation.
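
To make the outlier and missing-value steps listed above more concrete, here is a minimal sketch of two common techniques: the 1.5 × IQR rule for flagging outliers and median imputation for filling gaps. These are textbook choices picked for illustration; the notebooks may use different or additional methods. The function names and the toy series are invented for this example.

```python
import numpy as np
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]; NaN is never flagged."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


def impute_median(values: pd.Series) -> pd.Series:
    """Replace missing entries with the median of the observed ones."""
    return values.fillna(values.median())


# Toy data, invented for illustration only.
s = pd.Series([10.0, 12.0, 11.0, np.nan, 13.0, 95.0])
mask = iqr_outlier_mask(s)    # only the value 95.0 is flagged
filled = impute_median(s)     # the NaN becomes 12.0, the median of observed values
```

Note that the median here is computed with the outlier still present; whether to treat outliers before or after imputation is a judgment call that depends on the dataset.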
- datasets/auto-mpg.csv
- datasets/california-housing.csv
- datasets/flights_seaborn.csv
- datasets/healthcare-dataset-stroke-data.csv
- datasets/iris_seaborn.csv
- datasets/marketing-data.csv
- datasets/outlier_detection_dataset.csv
- datasets/students.csv
- datasets/synthetic_stroke_data.csv
- datasets/tips_seaborn.csv
- datasets/titanic_seaborn.csv

git clone https://github.com/DataSciencePolimi/Exploratory-Data-Analysis
cd Exploratory-Data-Analysis

python -m venv .venv
source .venv/bin/activate

pip install jupyter pandas numpy matplotlib seaborn scikit-learn scipy

For a structured learning flow, run notebooks in this order:
- 1. Dataset Overview.ipynb
- 2.1 EDA - Univariate.ipynb
- 2.2 EDA - Bivariate and Multivariate.ipynb
- 3. Outliers.ipynb
- 4. Missing Values.ipynb
- 5.1 Model - Classification.ipynb
- 5.2 Model - Regression.ipynb
- 5.3 Model - Clustering.ipynb

- Build intuition for reading and profiling real datasets
- Practice selecting the right visual/statistical technique for each question
- Learn robust preprocessing patterns before modeling
- Improve reproducibility in data analysis workflows
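
One pattern that serves the preprocessing and reproducibility goals above is to keep imputation and scaling inside a scikit-learn pipeline and to fix the random seed. The sketch below is an illustration, not code taken from the notebooks. It uses scikit-learn's bundled iris data rather than the CSV in datasets/, and the model choice and the `SEED` value are arbitrary assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

SEED = 0  # arbitrary; fixing it makes the split repeatable

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=SEED
)

# Preprocessing lives inside the pipeline, so it is fitted on the training
# split only and then applied unchanged to the test split.
model = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)  # mean accuracy on held-out data
```

Fitting the imputer and scaler inside the pipeline avoids leaking test-set statistics into training, which is a common source of overly optimistic baseline scores.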
- Dataset files are stored in datasets/; if you move them, update notebook file paths accordingly.
- The cache/ folder is required for slow computations (for example, clustering metric sweeps) and stores precomputed arrays used by some notebooks.
- If plots do not display, verify your Jupyter kernel and package installation.
- Some notebooks may require rerunning cells from top to bottom after kernel restarts.
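
The idea behind a folder of precomputed arrays can be sketched as a load-or-compute helper. This is a generic pattern, not the notebooks' actual caching code: the function name `cached_array` is hypothetical, and the file names and format used in cache/ are not specified here, so the `.npy` format is an assumption.

```python
from pathlib import Path
from typing import Callable

import numpy as np


def cached_array(path: Path, compute: Callable[[], np.ndarray]) -> np.ndarray:
    """Load an array from `path` if present; otherwise compute and store it.

    `path` should end in ".npy", since np.save appends that suffix otherwise.
    """
    if path.exists():
        return np.load(path)
    result = compute()
    path.parent.mkdir(parents=True, exist_ok=True)  # create the folder on first use
    np.save(path, result)
    return result
```

A call such as `cached_array(Path("cache/example.npy"), run_sweep)`, with `run_sweep` standing in for any slow function that returns an array, would run the computation once and read the stored result on later runs.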
Riccardo Campi, PhD student in Information Technology,
Politecnico di Milano, Data Science Lab.
This project is licensed under the MIT License. See LICENSE for details.