This project aims to predict individual income using a complete Machine Learning workflow.
The dataset contains demographic, education, employment, and household-related attributes, and the model uses a Decision Tree Regressor with full hyperparameter tuning to estimate income values.
- Perform data cleaning and preprocessing.
- Apply Ordinal and One-Hot Encoding to categorical features.
- Use log transformation to reduce skewness in the target variable.
- Split data into training and testing sets.
- Use GridSearchCV to optimize tree hyperparameters.
- Evaluate model performance (R², RMSE).
- Visualize predicted vs actual income values.
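The preprocessing steps above (encoding, log transform, split) can be sketched roughly as follows. The column names and category orders here are illustrative stand-ins, not the actual columns of `data.csv`:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Toy stand-in for data.csv; real column names/categories will differ.
df = pd.DataFrame({
    "education": ["HS", "BSc", "MSc", "HS", "PhD", "BSc"],
    "occupation": ["sales", "tech", "tech", "admin", "tech", "sales"],
    "age": [25, 32, 41, 29, 51, 36],
    "income": [30_000, 55_000, 72_000, 38_000, 95_000, 60_000],
})

# Ordinal encoding for the ordered feature, one-hot for the nominal one;
# sparse_threshold=0 keeps the output a dense array.
preprocess = ColumnTransformer([
    ("edu", OrdinalEncoder(categories=[["HS", "BSc", "MSc", "PhD"]]), ["education"]),
    ("occ", OneHotEncoder(handle_unknown="ignore"), ["occupation"]),
], remainder="passthrough", sparse_threshold=0)

X = preprocess.fit_transform(df.drop(columns="income"))
y = np.log1p(df["income"])  # log transform to reduce target skew

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
```

`np.expm1` inverts the `np.log1p` transform when predictions need to be reported back on the original income scale.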
- Python
- Pandas
- NumPy
- Scikit-Learn
- Plotly
- Jupyter Notebook
- Google Colab
- Algorithm: Decision Tree Regression
- Tuning: `max_depth`, `min_samples_leaf`, `min_samples_split`
- Scoring: Negative Mean Squared Error (MSE)
- Target: Income (log-transformed during training)
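A minimal sketch of the tuning and evaluation setup described above. The synthetic features and the grid values are illustrative assumptions, not the ones used in the notebooks:

```python
import numpy as np
from sklearn.datasets import make_regression  # stand-in for the real features
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
y_log = np.log1p(y - y.min() + 1)  # log-transformed target, as in the project

X_train, X_test, y_train, y_test = train_test_split(
    X, y_log, test_size=0.2, random_state=42
)

# Grid values are illustrative only.
param_grid = {
    "max_depth": [3, 5, 10, None],
    "min_samples_leaf": [1, 5, 10],
    "min_samples_split": [2, 10, 20],
}

search = GridSearchCV(
    DecisionTreeRegressor(random_state=42),
    param_grid,
    scoring="neg_mean_squared_error",  # negative MSE, as stated above
    cv=5,
)
search.fit(X_train, y_train)

pred = search.predict(X_test)
r2 = r2_score(y_test, pred)
rmse = np.sqrt(mean_squared_error(y_test, pred))
print(f"R^2={r2:.3f}  RMSE={rmse:.3f}")
```

`search.best_params_` then holds the winning combination of the three tree hyperparameters.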
Due to the synthetic nature of the dataset, the model shows:
- Training R²: low
- Testing R²: low
This indicates underfitting, meaning the dataset lacks strong relationships between features and income.
Despite this, the project demonstrates a clean, end-to-end ML pipeline suitable for learning and experimentation.
- data.csv → Dataset used for training and testing the model
- Income Prediction Project No LogTransformation.ipynb → Main notebook containing the full ML workflow without the log transformation applied (higher accuracy)
- Income Prediction with Log Trasnfromation.ipynb → Alternative version of the notebook with the log transformation applied (lower accuracy)
- README.md → Project documentation
Due to the synthetic nature of the dataset and the chosen model (Decision Tree Regression), both notebooks show underfitting, with low R² scores:
- Income Prediction Project No LogTransformation.ipynb → R² = 1.68%
- Income Prediction with Log Trasnfromation.ipynb → R² = -8.54%
This demonstrates a realistic challenge when datasets lack strong feature–target relationships.
- Try ensemble models: RandomForest, GradientBoosting, XGBoost
- Use a more realistic dataset
- Apply advanced feature engineering to extract meaningful patterns
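The ensemble suggestion above could be tried with scikit-learn as shown here; the synthetic data is only a placeholder, and XGBoost would follow the same pattern via its scikit-learn-compatible `XGBRegressor`:

```python
from sklearn.datasets import make_regression  # placeholder for the real data
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

# Compare ensembles with 5-fold cross-validated R².
for model in (
    RandomForestRegressor(n_estimators=200, random_state=42),
    GradientBoostingRegressor(random_state=42),
):
    scores = cross_val_score(model, X, y, scoring="r2", cv=5)
    print(f"{type(model).__name__}: mean R^2 = {scores.mean():.3f}")
```

Because ensembles average many trees, they typically reduce the variance of a single decision tree, though they cannot create signal where the feature-target relationship is genuinely weak.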
Developed by Samir Mohamed as part of a regression machine learning practice project.