ahmedessammdev-hub/featureEngnering

🚒 Titanic Survival Prediction: Missing Values & Feature Engineering

A comprehensive machine learning project demonstrating production-ready data preprocessing pipelines with scikit-learn

Key Features • Quick Start • Methodology • Results • Contributing


📖 Overview

This project demonstrates an end-to-end machine learning pipeline for predicting passenger survival on the Titanic, using the well-known Titanic dataset. The focus is on building robust, reusable preprocessing pipelines that handle missing values and feature transformations inside a single scikit-learn workflow.

🎯 What You'll Learn

  • ✅ Missing Value Detection: Identify and analyze patterns in incomplete data
  • ✅ Imputation Strategies: Apply median, mode, and custom imputation techniques
  • ✅ Feature Engineering: Transform raw features into ML-ready representations
  • ✅ Pipeline Architecture: Build modular, production-ready sklearn pipelines
  • ✅ Data Leakage Prevention: Implement proper train/test splitting strategies
  • ✅ Model Evaluation: Assess classifier performance with comprehensive metrics

📊 Dataset

The Titanic Dataset contains information about 891 passengers from the RMS Titanic. The goal is to predict whether a passenger survived the sinking based on various features.

Feature Descriptions

| Feature  | Type        | Description                      | Missing Values |
|----------|-------------|----------------------------------|----------------|
| pclass   | Categorical | Passenger class (1st, 2nd, 3rd)  | 0              |
| sex      | Categorical | Gender (male/female)             | 0              |
| age      | Numerical   | Age in years                     | 177            |
| sibsp    | Numerical   | # of siblings/spouses aboard     | 0              |
| parch    | Numerical   | # of parents/children aboard     | 0              |
| fare     | Numerical   | Ticket price (British Pounds)    | 0              |
| embarked | Categorical | Port of embarkation (C/Q/S)      | 2              |
| survived | Binary      | Target variable (0 = No, 1 = Yes) | 0             |
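
The missing-value counts in the table can be obtained with a per-column null count. The snippet below is a minimal sketch on a small, made-up DataFrame (the rows are hypothetical, not taken from the real dataset); run against the full data, the same calls would be expected to report 177 missing values for age and 2 for embarked.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for the Titanic data (not the real rows).
df = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, np.nan, 35.0],
    "fare": [7.25, 71.28, 7.93, 8.05, 53.10],
    "embarked": ["S", "C", None, "S", "S"],
})

# Count missing entries per column and express them as a share of all rows.
missing_counts = df.isna().sum()
missing_share = df.isna().mean().round(2)

summary = pd.DataFrame({"missing": missing_counts, "share": missing_share})
print(summary)
```

Reporting the share alongside the raw count helps decide between imputing a column and dropping it.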

📈 Data Visualizations

[Figures: Age Distribution (age_plot.png) and Embarked Distribution (embarked_plot.png)]

🔧 Methodology

Pipeline Architecture

┌─────────────────────────────────────────────────────────────┐
│                    PREPROCESSING PIPELINE                   │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌─────────────────┐         ┌─────────────────┐            │
│  │  NUMERICAL      │         │  CATEGORICAL    │            │
│  │  FEATURES       │         │  FEATURES       │            │
│  │                 │         │                 │            │
│  │  • age          │         │  • pclass       │            │
│  │  • sibsp        │         │  • sex          │            │
│  │  • parch        │         │  • embarked     │            │
│  │  • fare         │         │                 │            │
│  │                 │         │                 │            │
│  │  ┌───────────┐  │         │  ┌───────────┐  │            │
│  │  │ Imputer   │  │         │  │ Imputer   │  │            │
│  │  │ (median)  │  │         │  │ (mode)    │  │            │
│  │  └─────┬─────┘  │         │  └─────┬─────┘  │            │
│  │        │        │         │        │        │            │
│  │  ┌─────▼─────┐  │         │  ┌─────▼─────┐  │            │
│  │  │ Standard  │  │         │  │ OneHot    │  │            │
│  │  │ Scaler    │  │         │  │ Encoder   │  │            │
│  │  └───────────┘  │         │  └───────────┘  │            │
│  └────────┬────────┘         └────────┬────────┘            │
│           │                           │                     │
│           └───────────┬───────────────┘                     │
│                       │                                     │
│              ┌────────▼────────┐                            │
│              │ColumnTransformer│                            │
│              └────────┬────────┘                            │
│                       │                                     │
├───────────────────────┼─────────────────────────────────────┤
│              ┌────────▼────────┐                            │
│              │   Logistic      │                            │
│              │   Regression    │                            │
│              └─────────────────┘                            │
│                    MODEL                                    │
└─────────────────────────────────────────────────────────────┘

Key Implementation Details

1. Numerical Feature Pipeline

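from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

numerical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),  # Robust to outliers
    ('scaler', StandardScaler())                    # Zero mean, unit variance
])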
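
To illustrate why the median is described as robust to outliers, here is a minimal sketch comparing median and mean imputation on a handful of made-up ages (the values are hypothetical and chosen so that one of them is an extreme outlier). It also checks that the two-step pipeline leaves the column with zero mean and unit variance.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical ages: one missing value and one extreme outlier (80).
ages = np.array([[20.0], [22.0], [24.0], [np.nan], [80.0]])

# Fill the gap with the median and, for comparison, with the mean.
median_filled = SimpleImputer(strategy="median").fit_transform(ages)
mean_filled = SimpleImputer(strategy="mean").fit_transform(ages)

print(median_filled[3, 0])  # 23.0, the median of 20, 22, 24, 80
print(mean_filled[3, 0])    # 36.5, pulled upward by the outlier

# Same structure as the numerical pipeline: impute, then standardize.
numerical_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
scaled = numerical_pipeline.fit_transform(ages)
print(scaled.mean(), scaled.std())
```

The median-imputed value stays close to the bulk of the data, while the mean-imputed value sits between the typical ages and the outlier.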

2. Categorical Feature Pipeline

from sklearn.preprocessing import OneHotEncoder

categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),  # Mode imputation
    ('encoder', OneHotEncoder(handle_unknown='ignore'))     # One-hot encode; unseen categories become all zeros
])
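
The effect of handle_unknown='ignore' is easiest to see on a category that never appears during fitting. The following is a small sketch with made-up embarkation ports (hypothetical data): the training frame contains only "S" and "C", so "Q" in the test frame is encoded as a row of zeros instead of raising an error.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

categorical_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
])

# Hypothetical ports; the training data never contains "Q".
train = pd.DataFrame({"embarked": ["S", "C", "S", np.nan]})
test = pd.DataFrame({"embarked": ["C", "Q"]})

# The missing training value is filled with the mode ("S") before encoding.
categorical_pipeline.fit(train)
learned = categorical_pipeline.named_steps["encoder"].categories_[0].tolist()

# The encoder returns a sparse matrix by default; densify it for display.
encoded = categorical_pipeline.transform(test).toarray()
print(learned)
print(encoded)
```

With the default handle_unknown='error', the same transform call would fail on "Q", which matters whenever the test split contains a rare category.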

3. Combined Preprocessor

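from sklearn.compose import ColumnTransformer
numerical_features = ['age', 'sibsp', 'parch', 'fare']    # Column groups from the diagram above
categorical_features = ['pclass', 'sex', 'embarked']

preprocessor = ColumnTransformer([
    ('num', numerical_pipeline, numerical_features),
    ('cat', categorical_pipeline, categorical_features)
])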

4. Full ML Pipeline

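from sklearn.linear_model import LogisticRegression

model = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(max_iter=100000))  # High iteration cap so the solver can converge
])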
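
The four snippets above can be assembled into one runnable sketch. The data here is synthetic (randomly generated columns with the same names, not the real Titanic rows), the toy target is derived from sex purely so the model has something to learn, and test_size=0.25 is an assumption. The point is the order of operations: the split happens first, and the imputers, scaler, and encoder are fitted on the training portion only, which is what keeps test-set statistics from leaking into preprocessing.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (assumption: NOT the real Titanic rows).
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "age": rng.normal(30, 12, n),
    "sibsp": rng.integers(0, 4, n),
    "parch": rng.integers(0, 3, n),
    "fare": rng.gamma(2.0, 15.0, n),
    "pclass": rng.choice([1, 2, 3], n),
    "sex": rng.choice(["male", "female"], n),
    "embarked": rng.choice(["C", "Q", "S"], n),
})
y = (X["sex"] == "female").astype(int)  # toy target, for illustration only

# Knock out some values so the imputers have work to do.
X.loc[rng.choice(n, 30, replace=False), "age"] = np.nan
X.loc[rng.choice(n, 3, replace=False), "embarked"] = np.nan

numerical_features = ["age", "sibsp", "parch", "fare"]
categorical_features = ["pclass", "sex", "embarked"]

preprocessor = ColumnTransformer([
    ("num", Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]), numerical_features),
    ("cat", Pipeline([
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("encoder", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_features),
])

model = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", LogisticRegression(max_iter=100000)),
])

# Split BEFORE fitting: medians, modes, and scaling statistics are then
# learned from the training rows only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(f"held-out accuracy on synthetic data: {test_accuracy:.2f}")
```

Because preprocessing lives inside the pipeline, calling model.fit on the training split and model.score on the test split is enough; there is no separate fit_transform step that could accidentally see the full dataset.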

📈 Results

Model Performance

| Metric               | Score |
|----------------------|-------|
| Accuracy             | 79.4% |
| Precision (Survived) | 77%   |
| Recall (Survived)    | 70%   |
| F1-Score (Survived)  | 73%   |

Classification Report

              precision    recall  f1-score   support

           0       0.81      0.86      0.83       133
           1       0.77      0.70      0.73        90

    accuracy                           0.79       223
   macro avg       0.79      0.78      0.78       223
weighted avg       0.79      0.79      0.79       223
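
As a worked check, the summary rows of the report can be recomputed from its per-class rows. The sketch below uses only the numbers printed above; because those inputs are already rounded to two decimals, the recomputed accuracy lands near, not exactly on, the reported 79.4%.

```python
# Per-class values copied from the classification report above.
precision = {0: 0.81, 1: 0.77}
recall = {0: 0.86, 1: 0.70}
support = {0: 133, 1: 90}
total = sum(support.values())  # 223 test passengers

# F1 is the harmonic mean of precision and recall for one class.
f1_survived = 2 * precision[1] * recall[1] / (precision[1] + recall[1])

# Macro average: plain mean over classes, ignoring class sizes.
macro_precision = sum(precision.values()) / len(precision)

# Weighted average: each class weighted by its support.
weighted_precision = sum(precision[c] * support[c] for c in support) / total

# Accuracy: recall * support gives the correct predictions per class.
accuracy = sum(recall[c] * support[c] for c in support) / total

print(f"F1 (survived):      {f1_survived:.3f}")
print(f"macro precision:    {macro_precision:.3f}")
print(f"weighted precision: {weighted_precision:.3f}")
print(f"accuracy (approx.): {accuracy:.3f}")
```

The gap between recall for class 0 (0.86) and class 1 (0.70) shows that the model misses survivors more often than it misses non-survivors, which a single accuracy figure would hide.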

🚀 Quick Start

Prerequisites

  • Python 3.8+
  • pip package manager

Installation

# Clone the repository
git clone https://github.com/yourusername/featureEngineering.git
cd featureEngineering

# Create virtual environment (recommended)
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install dependencies
pip install pandas scikit-learn seaborn matplotlib numpy jupyter

Run the Notebook

jupyter notebook missing_values.ipynb

πŸ“ Project Structure

featureEngineering/
β”œβ”€β”€ πŸ““ missing_values.ipynb    # Main notebook with complete analysis
β”œβ”€β”€ πŸ“– README.md               # Project documentation
β”œβ”€β”€ πŸ“Š age_plot.png            # Age distribution visualization
β”œβ”€β”€ πŸ“Š embarked_plot.png       # Embarked distribution visualization
β”œβ”€β”€ πŸ“ .venv/                  # Virtual environment (not tracked)
└── πŸ“ .vscode/                # VS Code settings

πŸ› οΈ Technologies Used

Technology Purpose
🐼 Pandas Data manipulation and analysis
πŸ“Š Seaborn Statistical data visualization
πŸ€– Scikit-learn ML algorithms and preprocessing
πŸ“ˆ Matplotlib Plotting and visualization
πŸ”’ NumPy Numerical computations

🎓 Learning Outcomes

By studying this project, you will learn:

| Topic                   | Description                                        |
|-------------------------|----------------------------------------------------|
| Missing Data Handling   | Strategies for imputing incomplete datasets        |
| Pipeline Architecture   | Building modular, reusable preprocessing workflows |
| Data Leakage Prevention | Proper train/test splitting to avoid bias          |
| Feature Engineering     | Transforming raw data for ML consumption           |
| Model Evaluation        | Comprehensive metrics for classification tasks     |
| Code Organization       | Clean, documented, production-ready code           |

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

👤 Author

Ahmed Essam
Machine Learning & Data Science Enthusiast


📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


⭐ If you found this project helpful, please give it a star!

Made with ❤️ by Ahmed Essam
