A comprehensive machine learning project demonstrating production-ready data preprocessing pipelines with scikit-learn
Key Features • Quick Start • Methodology • Results • Contributing
This project demonstrates a complete end-to-end machine learning pipeline for predicting passenger survival on the Titanic, using the well-known Titanic dataset. The focus is on building robust, reusable preprocessing pipelines that handle missing values and feature transformations in a consistent, repeatable way.
- Missing Value Detection: Identify and analyze patterns in incomplete data
- Imputation Strategies: Apply median, mode, and custom imputation techniques
- Feature Engineering: Transform raw features into ML-ready representations
- Pipeline Architecture: Build modular, production-ready sklearn pipelines
- Data Leakage Prevention: Implement proper train/test splitting strategies
- Model Evaluation: Assess classifier performance with comprehensive metrics
The Titanic Dataset contains information about 891 passengers from the RMS Titanic. The goal is to predict whether a passenger survived the sinking based on various features.
| Feature | Type | Description | Missing Values |
|---|---|---|---|
| `pclass` | Categorical | Passenger class (1st, 2nd, 3rd) | 0 |
| `sex` | Categorical | Gender (male/female) | 0 |
| `age` | Numerical | Age in years | 177 |
| `sibsp` | Numerical | Number of siblings/spouses aboard | 0 |
| `parch` | Numerical | Number of parents/children aboard | 0 |
| `fare` | Numerical | Ticket price (British Pounds) | 0 |
| `embarked` | Categorical | Port of embarkation (C/Q/S) | 2 |
| `survived` | Binary | Target variable (0 = No, 1 = Yes) | 0 |
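
The Missing Values column above can be checked with a couple of pandas calls. The sketch below runs on a hypothetical six-row sample: only the column names come from the table, the values are made up for illustration, so the counts it prints are not the 177 and 2 reported for the full dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical six-row sample; the values are illustrative, not real passengers.
df = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0, np.nan, 54.0],
    "fare": [7.25, 71.28, 7.92, 53.10, 8.05, 51.86],
    "embarked": ["S", "C", None, "S", "Q", "S"],
})

# Count missing entries per column and express them as a share of all rows.
missing_counts = df.isna().sum()
missing_share = df.isna().mean().round(3)

summary = pd.DataFrame({"missing": missing_counts, "share": missing_share})
print(summary)
```

Running the same two calls on the full dataset is one way to confirm which columns need an imputer before any pipeline is built.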
| Age Distribution | Embarked Distribution |
|---|---|
| ![Age Distribution]() | ![Embarked Distribution]() |
┌─────────────────────────────────────────────────────────────┐
│                   PREPROCESSING PIPELINE                    │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│   ┌─────────────────┐         ┌─────────────────┐           │
│   │   NUMERICAL     │         │  CATEGORICAL    │           │
│   │   FEATURES      │         │   FEATURES      │           │
│   │                 │         │                 │           │
│   │  • age          │         │  • pclass       │           │
│   │  • sibsp        │         │  • sex          │           │
│   │  • parch        │         │  • embarked     │           │
│   │  • fare         │         │                 │           │
│   │                 │         │                 │           │
│   │  ┌───────────┐  │         │  ┌───────────┐  │           │
│   │  │  Imputer  │  │         │  │  Imputer  │  │           │
│   │  │ (median)  │  │         │  │  (mode)   │  │           │
│   │  └─────┬─────┘  │         │  └─────┬─────┘  │           │
│   │        │        │         │        │        │           │
│   │  ┌─────▼─────┐  │         │  ┌─────▼─────┐  │           │
│   │  │ Standard  │  │         │  │  OneHot   │  │           │
│   │  │  Scaler   │  │         │  │  Encoder  │  │           │
│   │  └───────────┘  │         │  └───────────┘  │           │
│   └────────┬────────┘         └────────┬────────┘           │
│            │                           │                    │
│            └─────────────┬─────────────┘                    │
│                          │                                  │
│                 ┌────────▼────────┐                         │
│                 │ColumnTransformer│                         │
│                 └────────┬────────┘                         │
│                          │                                  │
├──────────────────────────┼──────────────────────────────────┤
│                 ┌────────▼────────┐                         │
│                 │    Logistic     │                         │
│                 │   Regression    │                         │
│                 └─────────────────┘                         │
│                        MODEL                                │
└─────────────────────────────────────────────────────────────┘
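from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Column groups, taken from the pipeline diagram
numerical_features = ['age', 'sibsp', 'parch', 'fare']
categorical_features = ['pclass', 'sex', 'embarked']

numerical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),  # Robust to outliers
    ('scaler', StandardScaler())                    # Zero mean, unit variance
])

categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),  # Mode imputation
    ('encoder', OneHotEncoder(handle_unknown='ignore'))    # Sparse output; unseen categories encode as all zeros
])

preprocessor = ColumnTransformer([
    ('num', numerical_pipeline, numerical_features),
    ('cat', categorical_pipeline, categorical_features)
])

model = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(max_iter=100000))
])

| Metric | Score |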
|---|---|
| Accuracy | 79.4% |
| Precision (Survived) | 77% |
| Recall (Survived) | 70% |
| F1-Score (Survived) | 73% |
              precision    recall  f1-score   support
           0       0.81      0.86      0.83       133
           1       0.77      0.70      0.73        90
    accuracy                           0.79       223
   macro avg       0.79      0.78      0.78       223
weighted avg       0.79      0.79      0.79       223
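
The scores above can be tied back to confusion-matrix counts. The counts in this sketch are inferred from the rounded report and its support column (133 non-survivors, 90 survivors); they are an assumption, not output copied from the notebook, though they are the only whole-number counts that reproduce every rounded figure shown.

```python
# Counts inferred from the rounded report (an assumption, not notebook output).
tp, fn = 63, 27    # actual survivors:     63 + 27 = 90
tn, fp = 114, 19   # actual non-survivors: 114 + 19 = 133

total = tp + fn + tn + fp

# Standard definitions for the positive class (survived = 1).
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
# accuracy=0.794 precision=0.77 recall=0.70 f1=0.73
```

Read this way, the model misses 27 of the 90 survivors in the test set and raises 19 false alarms among the 133 non-survivors.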
- Python 3.8+
- pip package manager
# Clone the repository
git clone https://github.com/yourusername/featureEngineering.git
cd featureEngineering
# Create virtual environment (recommended)
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Install dependencies
pip install pandas scikit-learn seaborn matplotlib numpy jupyter

# Launch the notebook
jupyter notebook missing_values.ipynb

featureEngineering/
├── missing_values.ipynb   # Main notebook with complete analysis
├── README.md              # Project documentation
├── age_plot.png           # Age distribution visualization
├── embarked_plot.png      # Embarked distribution visualization
├── .venv/                 # Virtual environment (not tracked)
└── .vscode/               # VS Code settings

| Technology | Purpose |
|---|---|
| Pandas | Data manipulation and analysis |
| Seaborn | Statistical data visualization |
| Scikit-learn | ML algorithms and preprocessing |
| Matplotlib | Plotting and visualization |
| NumPy | Numerical computations |
By studying this project, you will learn:
| Topic | Description |
|---|---|
| Missing Data Handling | Strategies for imputing incomplete datasets |
| Pipeline Architecture | Building modular, reusable preprocessing workflows |
| Data Leakage Prevention | Proper train/test splitting to avoid bias |
| Feature Engineering | Transforming raw data for ML consumption |
| Model Evaluation | Comprehensive metrics for classification tasks |
| Code Organization | Clean, documented, production-ready code |
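
The leakage-prevention row above comes down to one habit: split first, then fit every preprocessing step on the training rows only. The sketch below illustrates this on synthetic stand-in data (hypothetical columns and labels, not the Titanic dataset) with a reduced numeric-only pipeline. The `test_size=0.25` setting is a guess based on 223 test rows out of 891; the notebook's actual split settings are not shown here.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical), not the Titanic dataset.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "age": rng.normal(30, 12, n),
    "fare": rng.exponential(30, n),
})
y = (X["fare"] + rng.normal(0, 20, n) > 30).astype(int)
X.loc[rng.choice(n, size=40, replace=False), "age"] = np.nan  # inject gaps

# Split first, so test rows never influence the imputer or the scaler.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

model = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
])
model.fit(X_train, y_train)  # all statistics are learned from training rows

learned_age_median = model.named_steps["imputer"].statistics_[0]
print("median learned by imputer:", learned_age_median)
print("train-only median:        ", X_train["age"].median())
print("test accuracy:            ", model.score(X_test, y_test))
```

Because the imputer and scaler live inside the `Pipeline`, calling `fit` on the training split is enough to keep test-set statistics out of the model; imputing the full table before splitting would not give that guarantee.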
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
Ahmed Essam
Machine Learning & Data Science Enthusiast
This project is licensed under the MIT License - see the LICENSE file for details.

