An End-to-End MLOps project that demonstrates how a real-world machine learning system is built, deployed, and automated using modern industry tools.
This project implements a complete ML lifecycle pipeline, including:
- Data ingestion from MongoDB Atlas
- Data validation and transformation
- Model training and evaluation
- Model versioning in AWS S3
- Containerization with Docker
- Automated CI/CD using GitHub Actions
- Deployment on AWS EC2
The project simulates a production-grade ML system architecture.
Data Source (MongoDB Atlas)
        │
        ▼
Data Ingestion
        │
        ▼
Data Validation
        │
        ▼
Data Transformation
        │
        ▼
Model Trainer
        │
        ▼
Model Evaluation
        │
        ▼
Model Pusher (AWS S3)
        │
        ▼
Prediction Pipeline
        │
        ▼
Web Application (Streamlit)
        │
        ▼
Docker Container
        │
        ▼
CI/CD (GitHub Actions)
        │
        ▼
Deployment (AWS EC2)
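The training stages above can be sketched as a chain in which each step hands its artifact to the next. The function names and artifact shape here are illustrative only, not the project's actual classes:

```python
# Illustrative sketch of artifact passing between pipeline stages;
# names are hypothetical stand-ins for the project's components.

def data_ingestion():
    # The real component pulls records from MongoDB Atlas into an artifact dir.
    return {"rows": 100}

def data_validation(artifact):
    artifact["validated"] = True    # schema / type / null checks in practice
    return artifact

def data_transformation(artifact):
    artifact["transformed"] = True  # imputation, scaling, encoding in practice
    return artifact

def model_trainer(artifact):
    artifact["model"] = "trained-model"
    return artifact

def run_training_pipeline():
    artifact = data_ingestion()
    artifact = data_validation(artifact)
    artifact = data_transformation(artifact)
    return model_trainer(artifact)

print(run_training_pipeline()["model"])  # trained-model
```

Passing a single artifact object between stages keeps each component independently testable, which is the core idea behind the architecture shown above.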
- Python 3.10
- Scikit-Learn
- Pandas
- NumPy
- Docker
- GitHub Actions
- CI/CD Automation
- AWS S3 (Model Registry)
- AWS EC2 (Deployment)
- AWS ECR (Docker Registry)
- MongoDB Atlas
- Streamlit
vehicle-data-mlops
│
├── notebook/
│   ├── mongoDB_demo.ipynb
│   └── EDA.ipynb
│
├── src/
│   ├── components/
│   │   ├── data_ingestion.py
│   │   ├── data_validation.py
│   │   ├── data_transformation.py
│   │   ├── model_trainer.py
│   │   ├── model_evaluation.py
│   │   └── model_pusher.py
│   │
│   ├── configuration/
│   │   ├── mongo_db_connection.py
│   │   └── aws_connection.py
│   │
│   ├── entity/
│   │   ├── config_entity.py
│   │   ├── artifact_entity.py
│   │   ├── estimator.py
│   │   └── s3_estimator.py
│   │
│   └── utils/
│       └── main_utils.py
│
├── pipeline/
│   ├── training_pipeline.py
│   └── prediction_pipeline.py
│
├── app.py
├── requirements.txt
├── setup.py
├── pyproject.toml
├── Dockerfile
├── .dockerignore
└── README.md
The ingestion pipeline:
- Connects to MongoDB Atlas
- Fetches raw dataset
- Converts key-value records into Pandas DataFrame
- Stores dataset inside artifact directory
Main modules involved:
data_access/
configuration/
entity/
components/data_ingestion.py
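The key-value-to-DataFrame step can be sketched as below. The real component fetches documents with pymongo; here sample records stand in for the MongoDB cursor output, and the field names are illustrative:

```python
# Sketch of the ingestion step's record-to-DataFrame conversion.
import pandas as pd

def records_to_dataframe(records):
    """Convert MongoDB-style key-value documents into a flat DataFrame."""
    df = pd.DataFrame(records)
    # MongoDB adds an internal _id field that is not a model feature.
    if "_id" in df.columns:
        df = df.drop(columns=["_id"])
    return df

records = [
    {"_id": "a1", "vehicle_age": 3, "mileage": 42000},
    {"_id": "a2", "vehicle_age": 7, "mileage": 98000},
]
df = records_to_dataframe(records)
print(list(df.columns))  # ['vehicle_age', 'mileage']
```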
Validation checks include:
- Schema validation
- Column type validation
- Missing values
- Data consistency
Configuration defined in:
config/schema.yaml
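A minimal sketch of these checks, assuming the expected schema has been loaded from config/schema.yaml into a dict (the column names and dtypes below are illustrative):

```python
# Hypothetical validation routine; the project's actual checks live in
# src/components/data_validation.py and are driven by config/schema.yaml.
import pandas as pd

EXPECTED_SCHEMA = {"vehicle_age": "int64", "mileage": "int64"}

def validate(df, schema):
    errors = []
    # Schema validation: every expected column must be present.
    missing = set(schema) - set(df.columns)
    if missing:
        errors.append(f"missing columns: {sorted(missing)}")
    # Column type validation.
    for col, dtype in schema.items():
        if col in df.columns and str(df[col].dtype) != dtype:
            errors.append(f"{col}: expected {dtype}, got {df[col].dtype}")
    # Missing values.
    if df.isnull().any().any():
        errors.append("null values present")
    return errors

df = pd.DataFrame({"vehicle_age": [3, 7], "mileage": [42000, 98000]})
print(validate(df, EXPECTED_SCHEMA))  # [] -> dataset passes validation
```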
Feature engineering and preprocessing steps:
- Handling missing values
- Feature scaling
- Encoding categorical variables
- Preparing training dataset
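These steps can be sketched with scikit-learn's ColumnTransformer; the column names are stand-ins for the project's actual features:

```python
# Illustrative preprocessing pipeline: imputation + scaling for numerics,
# one-hot encoding for categoricals. Feature names are hypothetical.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric = ["vehicle_age", "mileage"]
categorical = ["fuel_type"]

preprocessor = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),  # handle missing values
        ("scale", StandardScaler()),                   # feature scaling
    ]), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),  # encoding
])

df = pd.DataFrame({
    "vehicle_age": [3, 7, None],
    "mileage": [42000, 98000, 60000],
    "fuel_type": ["petrol", "diesel", "petrol"],
})
X = preprocessor.fit_transform(df)
print(X.shape)  # (3, 4): 2 scaled numerics + 2 one-hot fuel_type columns
```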
The model trainer:
- Splits dataset
- Trains ML models
- Selects best performing model
- Saves trained model artifact
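A sketch of the train-and-select loop, assuming the candidates and metric below (they are not necessarily the ones the project uses):

```python
# Illustrative model selection: fit each candidate, keep the best scorer.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=42),
}
scores = {name: model.fit(X_train, y_train).score(X_test, y_test)
          for name, model in candidates.items()}
best_name = max(scores, key=scores.get)
print(best_name, round(scores[best_name], 3))  # best performer saved as artifact
```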
Compares new model with previous production model.
Threshold defined in constants:
MODEL_EVALUATION_CHANGED_THRESHOLD_SCORE = 0.02
If performance improves beyond the threshold, the model is pushed to the S3 Model Registry.
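The gate can be sketched as below; whether the comparison is strict (`>`) is an assumption here:

```python
# Sketch of the evaluation gate: the new model must beat the current
# production model by at least the threshold before being pushed to S3.
MODEL_EVALUATION_CHANGED_THRESHOLD_SCORE = 0.02

def should_push(new_score, production_score):
    return (new_score - production_score) > MODEL_EVALUATION_CHANGED_THRESHOLD_SCORE

print(should_push(0.91, 0.88))  # True  -> push to S3 model registry
print(should_push(0.89, 0.88))  # False -> keep current production model
```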
CI/CD automatically performs:
1. Build Docker image
2. Push image to AWS ECR
3. Deploy to EC2 instance
Secrets required in GitHub:
AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY
AWS_DEFAULT_REGION
ECR_REPO
Launch an EC2 instance:
- AMI: Ubuntu Server 24.04
- Instance type: t2.medium
- Storage: 30 GB
Install Docker:
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh
sudo usermod -aG docker ubuntu
newgrp docker

Steps:
GitHub → Settings → Actions → Runners → New self-hosted runner
Run commands on EC2:
./config.sh
./run.sh
Allow port 5000 in EC2 security group.
Access application:
http://<EC2-PUBLIC-IP>:5000
Training endpoint:
/training
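A hypothetical client helper for calling the deployed app; the IP below is a placeholder for your EC2 instance's public IP, and the route name mirrors the one above:

```python
# Build full URLs for the app's routes; BASE_URL is a placeholder value.
BASE_URL = "http://198.51.100.7:5000"  # replace with <EC2-PUBLIC-IP>

def endpoint(path: str) -> str:
    """Return the full URL for an app route such as /training."""
    return f"{BASE_URL}/{path.lstrip('/')}"

print(endpoint("/training"))  # http://198.51.100.7:5000/training
```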
✅ Train ML model from UI
✅ Predict using trained model
✅ Fully automated CI/CD deployment
✅ Cloud model registry
This project demonstrates real production ML system design including:
- End-to-End ML pipelines
- Cloud infrastructure
- Model versioning
- CI/CD automation
- Docker containerization
- Scalable deployment architecture
Lipu Daman
Machine Learning | MLOps | Data Scientist
⭐ If you like this project, please star the repository!