This project is an end-to-end bioinformatics and machine learning system that analyzes biological tumor features to predict whether a tumor is benign or malignant. Using the widely recognized Breast Cancer Wisconsin Dataset, this project demonstrates how machine learning can be used in the medical field to aid diagnoses based on digitized imaging and biological data.
- Data Analysis: Perform comprehensive Exploratory Data Analysis (EDA) to understand the distribution and correlation of 30 continuous biological variables.
- Dimensionality Reduction: Apply PCA to project the high-dimensional biological data into an interpretable 2D space for visualization.
- Machine Learning: Train, evaluate, and compare Logistic Regression, Random Forest, and Support Vector Machines for medical diagnostics.
- Model Deployment: Build a simple interactive web application using Flask, allowing clinicians and other users to input patient features and receive a prediction.
- Python (Data Analysis & Modeling)
- Pandas & NumPy (Data Preprocessing)
- Matplotlib & Seaborn (Data Visualization)
- Scikit-Learn (Machine Learning & PCA)
- Flask (Web Backend Framework)
- HTML/CSS/JS (Glassmorphic Web Frontend)
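As a quick orientation to the data behind the objectives above, the following minimal sketch loads the dataset and checks the correlation between three size-related features. It assumes the copy of the Breast Cancer Wisconsin (Diagnostic) dataset bundled with scikit-learn; the project itself may load the data differently.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

# Assumption: use scikit-learn's bundled copy of the dataset.
data = load_breast_cancer()
X, feature_names = data.data, list(data.feature_names)
print(X.shape)  # (569, 30): 569 samples, 30 continuous features

# Pearson correlation between three size-related features.
size_features = ["mean radius", "mean perimeter", "mean area"]
idx = [feature_names.index(name) for name in size_features]
corr = np.corrcoef(X[:, idx], rowvar=False)

for name, row in zip(size_features, corr):
    print(f"{name:>15}: " + "  ".join(f"{value:.3f}" for value in row))
```

Strongly correlated features like these carry overlapping information, which is one reason a projection such as PCA is useful for this dataset.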
OncoPredict/
│
├── data_analysis.ipynb # Jupyter Notebook for EDA & biological visualizations
├── model_training.py # ML Pipeline script to train & evaluate models
├── prediction_system.py # Prediction logic to accept features & output class/prob
├── app.py # Flask backend providing endpoints and UI
├── templates/
│ └── index.html # HTML Web Application Interface
├── static/
│ ├── style.css # Glassmorphic CSS Styling
│ └── script.js # Interactive UI logic
├── generate_notebook.py # Python script automating notebook creation
├── requirements.txt # Project dependencies
└── README.md # Project documentation
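To illustrate the role of prediction_system.py in the layout above, here is a minimal sketch of prediction logic that accepts 30 features and returns a class name with its probability. The helper name `predict_tumor` is hypothetical, and the pipeline is fitted in memory as a stand-in for the saved model and scaler; the actual script may be organized differently.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()

# Stand-in for the saved artifacts: scaler and classifier fitted together.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipeline.fit(data.data, data.target)


def predict_tumor(features):
    """Hypothetical helper: return (class name, probability) for one sample."""
    sample = np.asarray(features, dtype=float).reshape(1, -1)
    n_expected = data.data.shape[1]
    if sample.shape[1] != n_expected:
        raise ValueError(f"expected {n_expected} features, got {sample.shape[1]}")
    proba = pipeline.predict_proba(sample)[0]
    best = int(np.argmax(proba))
    # In scikit-learn's encoding, class 0 is malignant and class 1 is benign.
    label = str(data.target_names[pipeline.classes_[best]])
    return label, float(proba[best])


label, probability = predict_tumor(data.data[0])
print(f"{label} (p = {probability:.3f})")
```

Validating the feature count up front gives a clear error for malformed input instead of a failure deep inside the model.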
- Data Preprocessing: The dataset was standardized using StandardScaler so that all biological measurements are on the same scale.
- Exploratory Visualizations: Pairwise scatterplots and heatmaps showed that mean radius, mean perimeter, and mean area are highly correlated and separate benign from malignant cases well.
- PCA: The first two principal components capture over 63% of the total variance and show distinct clustering of the two classes in 2D space.
- Model Performance:
  - Evaluated using Accuracy, Precision, Recall, F1-score, and ROC curve (AUC).
  - Logistic Regression, Random Forest, and SVM models were trained using an 80/20 train-test split.
  - The best model was selected by F1-score. Random Forest was also easy to interpret thanks to its feature importances.
- Feature Importance: Random Forest analysis revealed that features such as worst radius, worst perimeter, and worst area are critical indicators for predictions.
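The preprocessing and PCA findings above can be reproduced with a short sketch. It assumes scikit-learn's bundled copy of the dataset and standardizes all samples before the projection, which is a reasonable choice for visualization but not necessarily what the notebook does.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()

# Standardize each feature to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(data.data)

# Project the 30 standardized features onto the first two principal components.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

explained = pca.explained_variance_ratio_
print(f"PC1: {explained[0]:.1%}, PC2: {explained[1]:.1%}, total: {explained.sum():.1%}")
```

`X_2d` can then be passed to a scatterplot, colored by `data.target`, to view the class clustering.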
pip install -r requirements.txt
If the jupyter command is not recognized, you can run the notebook through Python directly:
python -m jupyter notebook data_analysis.ipynb
(Alternatively, open data_analysis.ipynb in VS Code, which can run notebook cells natively.)
Run model_training.py to train models, generate evaluation plots (roc_curve_comparison.png, feature_importance.png), and save artifacts (best_model.pkl, scaler.pkl).
python model_training.py
Start the Flask web backend:
python app.py
Open the URL shown in the terminal (usually http://127.0.0.1:5000/) in your browser to view the application.
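For reference, the training step performed by model_training.py (compare three models on an 80/20 split, select by F1-score, save the model and scaler) could look roughly like the sketch below. Several details are assumptions rather than facts about the project: the stratified split, the fixed random seed, treating malignant as the positive class, the use of `pickle`, and writing to a temporary directory instead of the working directory.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

data = load_breast_cancer()

# 80/20 split; stratification and the seed are assumptions of this sketch.
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, stratify=data.target, random_state=42
)

# Fit the scaler on the training set only to avoid leaking test statistics.
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
    "SVM": SVC(),
}

# Score each model by F1, with malignant (class 0) as the positive class.
scores = {}
for name, model in models.items():
    model.fit(X_train_s, y_train)
    scores[name] = f1_score(y_test, model.predict(X_test_s), pos_label=0)
    print(f"{name}: F1 = {scores[name]:.3f}")

best_name = max(scores, key=scores.get)
best_model = models[best_name]
print(f"Best model: {best_name}")

# Save and reload the artifacts to confirm they round-trip correctly.
with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "best_model.pkl"
    scaler_path = Path(tmp) / "scaler.pkl"
    model_path.write_bytes(pickle.dumps(best_model))
    scaler_path.write_bytes(pickle.dumps(scaler))
    restored_model = pickle.loads(model_path.read_bytes())
    restored_scaler = pickle.loads(scaler_path.read_bytes())
    restored_pred = restored_model.predict(restored_scaler.transform(X_test))

# Rank features by Random Forest importance, highest first.
rf = models["Random Forest"]
ranked = sorted(zip(rf.feature_importances_, data.feature_names), reverse=True)
for importance, feature in ranked[:5]:
    print(f"{feature}: {importance:.3f}")
```

Saving the scaler alongside the model matters because the web application must apply the same transformation to user input that was applied during training.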