An end-to-end Machine Learning project that predicts medical insurance charges using statistical analysis, feature engineering, and Linear Regression.
The application is deployed using Streamlit and allows users to estimate insurance charges in real time based on healthcare and financial attributes.
Live app: https://predictive-insurance-pricing-system-qnmpk5cxusutnndxyfsknk.streamlit.app/
Insurance providers must accurately estimate medical insurance charges based on various customer attributes such as claim amount, hospital expenditure, annual salary, smoking habits, and family details.
The objective of this project is to build a predictive system capable of estimating insurance charges using historical healthcare-related data and machine learning techniques.
- Analyze healthcare and insurance-related data
- Identify significant factors influencing insurance charges
- Build a predictive regression model
- Evaluate model performance and reliability
- Deploy the model for real-time predictions
Performed comprehensive exploratory data analysis to understand:
- Data distributions
- Missing values
- Feature relationships
- Outlier behavior
- Correlation patterns
Applied:
- Mean Imputation
- Median Imputation
- Mode Imputation
Analyzed skewness and used Median Imputation for skewed features to maintain data stability.
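The skewness-based choice between imputation strategies can be sketched as follows. This is a minimal illustration, not the project's actual code: the column names, the toy data, and the `skew_threshold` of 0.5 are assumptions made for the example.

```python
import numpy as np
import pandas as pd


def impute_missing(df: pd.DataFrame, skew_threshold: float = 0.5) -> pd.DataFrame:
    """Fill gaps: median for skewed numeric columns, mean for roughly
    symmetric numeric columns, and mode for non-numeric columns."""
    out = df.copy()
    for col in out.columns:
        if not out[col].isna().any():
            continue  # nothing to fill in this column
        if pd.api.types.is_numeric_dtype(out[col]):
            skewed = abs(out[col].skew()) > skew_threshold
            fill = out[col].median() if skewed else out[col].mean()
        else:
            fill = out[col].mode().iloc[0]  # most frequent category
        out[col] = out[col].fillna(fill)
    return out


# Toy data with hypothetical column names
toy = pd.DataFrame({
    "claim_amount": [100.0, 120.0, 110.0, 5000.0, np.nan],  # right-skewed
    "age": [30.0, 40.0, np.nan, 50.0, 60.0],                # symmetric
    "smoker": ["no", "no", "yes", None, "no"],              # categorical
})
filled = impute_missing(toy)
print(filled)
```

The median is less sensitive to extreme values than the mean, which is why it is the safer fill value for a skewed column such as the toy `claim_amount`.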
Implemented:
- IQR-based outlier detection
Used interquartile range (IQR) analysis to identify and handle extreme values that could degrade model performance.
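A minimal sketch of IQR-based outlier handling is shown below. The README does not say how the extreme values were handled, so capping them at the fences is an assumption, as are the conventional multiplier `k = 1.5` and the toy values.

```python
import pandas as pd


def iqr_bounds(s: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the lower and upper fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def cap_outliers(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values outside the IQR fences to the nearest fence."""
    lower, upper = iqr_bounds(s, k)
    return s.clip(lower=lower, upper=upper)


# Toy values; the last one is an obvious extreme
charges = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 100.0])
lower, upper = iqr_bounds(charges)
is_outlier = (charges < lower) | (charges > upper)
capped = cap_outliers(charges)
print(is_outlier.sum(), capped.max())
```

Capping keeps every row in the dataset, whereas dropping flagged rows is the simpler alternative when the extremes are likely data errors.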
Performed:
- Correlation Analysis
to understand linear relationships among variables.
Applied:
- Variance Inflation Factor (VIF)
to detect and reduce multicollinearity.
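For reference, the VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on all the other features. The sketch below computes it with NumPy only; the function name `compute_vif`, the column names, and the synthetic data are assumptions for illustration, and the project may well have used a library routine instead.

```python
import numpy as np
import pandas as pd


def compute_vif(X: pd.DataFrame) -> pd.Series:
    """VIF per column: regress each column on the others (with intercept)."""
    values = X.to_numpy(dtype=float)
    vifs = {}
    for j, name in enumerate(X.columns):
        y = values[:, j]
        others = np.delete(values, j, axis=1)
        A = np.column_stack([np.ones(len(y)), others])  # add intercept term
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()
        vifs[name] = np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return pd.Series(vifs, name="VIF")


# Synthetic data: the second column is almost a multiple of the first
rng = np.random.default_rng(0)
salary = rng.normal(50_000, 10_000, 200)
demo = pd.DataFrame({
    "annual_salary": salary,
    "hospital_expenditure": salary * 0.1 + rng.normal(0, 100, 200),
    "children": rng.integers(0, 4, 200).astype(float),
})
vif = compute_vif(demo)
print(vif)
```

A common rule of thumb treats a VIF above 5 or 10 as a sign of problematic multicollinearity, at which point one of the correlated features is usually dropped or combined.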
Conducted:
- T-Test
- ANOVA
These statistical tests helped identify significant features contributing to insurance charge prediction.
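The two tests can be sketched with SciPy as follows. The groups are synthetic and the variable names are hypothetical; whether the project used Welch's or Student's t-test is not stated, so `equal_var=False` is an assumption.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# T-test: compare mean charges between two groups (e.g. smoker vs non-smoker)
smoker_charges = rng.normal(30_000, 5_000, 100)
nonsmoker_charges = rng.normal(10_000, 5_000, 100)
t_stat, t_p = stats.ttest_ind(smoker_charges, nonsmoker_charges, equal_var=False)

# One-way ANOVA: compare mean charges across three or more groups
# (e.g. a categorical feature with several levels)
groups = [rng.normal(mu, 5_000, 100) for mu in (10_000, 12_000, 20_000)]
f_stat, f_p = stats.f_oneway(*groups)

alpha = 0.05  # conventional significance level
print(f"t-test p={t_p:.3g}, significant={t_p < alpha}")
print(f"ANOVA  p={f_p:.3g}, significant={f_p < alpha}")
```

A p-value below the chosen significance level suggests the group means differ, which is one reason to keep the corresponding feature as a predictor.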
Implemented:
- Feature Selection
- Feature Encoding
- Data Transformation
- Feature Scaling (Standardization)
Created an optimized preprocessing pipeline for model training.
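One way to express the encoding and scaling steps as a single reusable object is scikit-learn's `ColumnTransformer`. This is a sketch under assumptions: the column names and toy rows are hypothetical, and the separate `scaler.pkl` in the repository suggests the project may have fitted a standalone scaler rather than a combined transformer.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column names for illustration
numeric_cols = ["age", "annual_salary"]
categorical_cols = ["smoker"]

preprocessor = ColumnTransformer(
    [
        ("num", StandardScaler(), numeric_cols),  # zero mean, unit variance
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0.0,  # always return a dense array
)

toy = pd.DataFrame({
    "age": [25, 35, 45, 55],
    "annual_salary": [30_000, 45_000, 60_000, 90_000],
    "smoker": ["no", "yes", "no", "yes"],
})
X_ready = preprocessor.fit_transform(toy)
print(X_ready.shape)
```

Fitting the transformer on the training split only, then reusing it at prediction time, keeps the scaling parameters consistent between training and the deployed app.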
Algorithm Used:
- Linear Regression
Workflow:
- Data Preprocessing
- Train-Test Split
- Model Training
- Model Testing
- Model Evaluation
- Bias-Variance Assessment
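The workflow above can be sketched end to end on synthetic data. The feature matrix, the coefficients, the 80/20 split, and the `random_state` are assumptions for the example and are not taken from the project.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the processed dataset
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 5000 + X @ np.array([3000.0, 1500.0, -800.0]) + rng.normal(0, 500, 500)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Preprocessing: fit the scaler on the training split only
scaler = StandardScaler().fit(X_train)

# Model training
model = LinearRegression().fit(scaler.transform(X_train), y_train)

# Model testing and a rough bias-variance check:
# low scores on both splits hint at high bias (underfitting),
# a large train-test gap hints at high variance (overfitting).
train_r2 = model.score(scaler.transform(X_train), y_train)
test_r2 = model.score(scaler.transform(X_test), y_test)
print(f"train R2={train_r2:.3f}, test R2={test_r2:.3f}")
```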
Evaluated model performance using:
- Regression Metrics
- Prediction Analysis
- Bias-Variance Assessment
These evaluations helped validate the model's predictive capability.
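The README does not name the regression metrics used. As an illustration only, the sketch below reports three common ones (MAE, RMSE, and R²); the helper name `regression_report` and the toy values are hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def regression_report(y_true, y_pred) -> dict:
    """Collect common regression metrics in one dictionary."""
    return {
        "mae": mean_absolute_error(y_true, y_pred),
        "rmse": float(np.sqrt(mean_squared_error(y_true, y_pred))),
        "r2": r2_score(y_true, y_pred),
    }


# Toy actual vs predicted charges
y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 300.0])
report = regression_report(y_true, y_pred)
print(report)
```

MAE and RMSE are expressed in the same unit as the charges, so they are easy to interpret as a typical prediction error, while R² describes the share of variance the model explains.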
The trained model was:
- Saved using Joblib
- Integrated with Streamlit
- Deployed using GitHub and Streamlit Cloud
Users can provide input values and receive real-time insurance charge predictions in:
- USD
- INR
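Saving the model with Joblib and returning a prediction in both currencies might look like the sketch below. The contents of `app.py` are not shown in this README, so the helper `predict_charges`, the toy model, and the fixed `USD_TO_INR` rate are all assumptions for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

USD_TO_INR = 83.0  # hypothetical fixed exchange rate, not from the project


def predict_charges(model, features, usd_to_inr: float = USD_TO_INR) -> dict:
    """Predict charges for one input row and report them in USD and INR."""
    row = np.asarray(features, dtype=float).reshape(1, -1)
    usd = float(model.predict(row)[0])
    return {"USD": round(usd, 2), "INR": round(usd * usd_to_inr, 2)}


# Toy one-feature model standing in for the trained regression model
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([1000.0, 2000.0, 3000.0, 4000.0])
model = LinearRegression().fit(X, y)

# Round-trip through Joblib, as a deployed app would do at start-up
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "linear_regression_model.joblib")
    joblib.dump(model, path)
    loaded = joblib.load(path)

result = predict_charges(loaded, [5.0])
print(result)
```

In a Streamlit app, a function like this would typically be called after the user submits the input widgets, with any saved scaler applied to the inputs first.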
- Python
- Pandas
- NumPy
- Scikit-Learn
- Statistical Analysis
- T-Test
- ANOVA
- VIF
- Linear Regression
- Joblib
- Streamlit
- GitHub
Predictive-Insurance-Pricing-System/
│
├── app.py
├── linear_regression_model.joblib
├── scaler.pkl
├── new_insurance_data.csv
├── Processed_Insurance_data.csv
├── requirements.txt
└── README.md
new_insurance_data.csv: Raw insurance dataset used for data exploration, preprocessing, and feature analysis.
Processed_Insurance_data.csv: Cleaned dataset used for model development, produced by handling missing values and outliers and applying feature engineering and data transformations.
- Practical application of statistical analysis in machine learning
- Missing value treatment strategies
- Outlier detection using IQR
- Multicollinearity analysis using VIF
- Feature engineering and scaling
- Linear Regression model development
- Model deployment using Streamlit
- End-to-end machine learning workflow
Daniel J
LinkedIn: https://www.linkedin.com/in/daniel-j77
GitHub: https://github.com/daniel-j77
- Ensemble Regression Models
- XGBoost Regression
- Random Forest Regression
- Advanced Feature Selection Techniques
- Cloud-Based Deployment
- Model Monitoring and Performance Tracking