OncoPredict - AI-Based Breast Cancer Detection System

📌 Project Overview

OncoPredict is an end-to-end bioinformatics and machine learning system that analyzes tumor features to predict whether a breast tumor is benign or malignant. Built on the widely used Breast Cancer Wisconsin dataset, it demonstrates how machine learning can support medical diagnosis using features derived from digitized imaging and biological data.

🎯 Objectives

  • Data Analysis: Perform comprehensive Exploratory Data Analysis (EDA) to understand the distribution and correlation of 30 continuous biological variables.
  • Dimensionality Reduction: Apply PCA to project high-dimensional biological data into an interpretable 2D space for visualization.
  • Machine Learning: Train, evaluate, and compare Logistic Regression, Random Forest, and Support Vector Machines for medical diagnostics.
  • Model Deployment: Build a simple interactive web dashboard using Flask, allowing clinicians and other users to input patient features and receive a prediction.
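
The data analysis and dimensionality reduction objectives can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: it assumes scikit-learn's bundled copy of the dataset (`load_breast_cancer`) stands in for whatever data source the repository uses, and the variable names are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled dataset stands in for the repo's data source.
data = load_breast_cancer(as_frame=True)
X, y = data.data, data.target  # 569 rows x 30 features; 0 = malignant, 1 = benign

# Pairwise Pearson correlation among three size-related features.
size_cols = ["mean radius", "mean perimeter", "mean area"]
size_corr = X[size_cols].corr()
print(size_corr.round(3))

# Standardize first, since PCA is sensitive to feature scale,
# then project onto the first two principal components.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

explained = pca.explained_variance_ratio_.sum()
print(f"Variance captured by 2 components: {explained:.1%}")
```

The `X_2d` array can then be passed to a scatter plot colored by `y` to inspect how well the two classes separate.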

🛠️ Tech Stack

  • Python (Data Analysis & Modeling)
  • Pandas & NumPy (Data Preprocessing)
  • Matplotlib & Seaborn (Data Visualization)
  • Scikit-Learn (Machine Learning & PCA)
  • Flask (Web Backend Framework)
  • HTML/CSS/JS (Glassmorphic Web Frontend)

🗂️ Project Structure

OncoPredict/
│
├── data_analysis.ipynb     # Jupyter Notebook for EDA & biological visualizations
├── model_training.py       # ML Pipeline script to train & evaluate models
├── prediction_system.py    # Prediction logic to accept features & output class/prob
├── app.py                  # Flask backend providing endpoints and UI
├── templates/
│   └── index.html          # HTML Web Application Interface
├── static/
│   ├── style.css           # Glassmorphic CSS Styling
│   └── script.js           # Interactive UI logic
├── generate_notebook.py    # Python script automating notebook creation
├── requirements.txt        # Project dependencies
└── README.md               # Project documentation

📊 Methodology & Results

  1. Data Preprocessing: The dataset was standardized using StandardScaler so that every biological measurement has zero mean and unit variance and no feature dominates because of its scale.

  2. Exploratory Visualizations: Pairwise scatterplots and heatmaps show that mean radius, mean perimeter, and mean area are highly correlated with one another and separate benign from malignant cases well.

  3. PCA: The first two principal components capture over 63% of the total variance, showing distinct clustering between the classes in 2D space.

  4. Model Performance:

    • Evaluated using Accuracy, Precision, Recall, F1-score, and ROC curve (AUC).
    • Logistic Regression, Random Forest, and SVM models were trained using an 80/20 train-test split.
    • The best model was selected by F1-score. Random Forest was also easy to interpret thanks to its extracted feature importances.
  5. Feature Importance: Random Forest analysis revealed that features like worst radius, worst perimeter, and worst area are critical indicators for predictions.
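
The steps above can be sketched as a short comparison script. This is a hedged illustration rather than the contents of model_training.py: the hyperparameters, `random_state`, and stratified split are assumptions, and scikit-learn's bundled dataset is used as a stand-in data source.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# 80/20 split; stratify and random_state are assumptions for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fit the scaler on the training split only to avoid leaking test statistics.
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=42),
    "SVM": SVC(probability=True, random_state=42),
}

results = {}
for name, model in models.items():
    model.fit(X_train_s, y_train)
    pred = model.predict(X_test_s)
    proba = model.predict_proba(X_test_s)[:, 1]
    # Label 1 is "benign" in scikit-learn's encoding, so these scores treat
    # benign as the positive class; pass pos_label=0 to score malignant instead.
    results[name] = {
        "f1": f1_score(y_test, pred),
        "auc": roc_auc_score(y_test, proba),
    }
    print(f"{name}: F1={results[name]['f1']:.3f}, AUC={results[name]['auc']:.3f}")

# Select the best model by F1-score.
best_name = max(results, key=lambda n: results[n]["f1"])
print("Best by F1:", best_name)

# Rank features by Random Forest importance and keep the top five.
forest = models["Random Forest"]
top_features = sorted(
    zip(X.columns, forest.feature_importances_), key=lambda p: p[1], reverse=True
)[:5]
for feature, importance in top_features:
    print(f"{feature}: {importance:.3f}")
```

Exact scores and the importance ranking depend on the split and the random seed, so they may differ from the results reported here.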

🚀 How to Run Locally

1. Install Dependencies

pip install -r requirements.txt

2. Run Exploratory Data Analysis

If the jupyter command is not recognized, you can run the notebook through Python directly:

python -m jupyter notebook data_analysis.ipynb

(Alternatively, open data_analysis.ipynb in VS Code, which can run notebook cells natively.)

3. Model Training

Run model_training.py to train models, generate evaluation plots (roc_curve_comparison.png, feature_importance.png), and save artifacts (best_model.pkl, scaler.pkl).

python model_training.py
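
As a rough sketch of how the saved artifacts might be consumed afterwards, the example below persists a model and scaler and reloads them for a single prediction. It assumes the artifacts are written with joblib (the script may use pickle instead), writes to a temporary directory rather than the repository, and `predict_tumor` is a hypothetical helper, not a function known to exist in prediction_system.py.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Stand-in for the training step: fit a scaler and one model, then persist both.
scaler = StandardScaler().fit(X)
model = LogisticRegression(max_iter=1000).fit(scaler.transform(X), y)

artifact_dir = Path(tempfile.mkdtemp())  # temp dir so the sketch leaves the repo untouched
joblib.dump(model, artifact_dir / "best_model.pkl")
joblib.dump(scaler, artifact_dir / "scaler.pkl")


def predict_tumor(features, artifact_dir):
    """Hypothetical helper: return (label, probability of malignancy) for one sample."""
    loaded_model = joblib.load(Path(artifact_dir) / "best_model.pkl")
    loaded_scaler = joblib.load(Path(artifact_dir) / "scaler.pkl")
    # Apply the same scaling that was used at training time.
    scaled = loaded_scaler.transform([features])
    proba = loaded_model.predict_proba(scaled)[0]
    # scikit-learn's encoding: class 0 = malignant, class 1 = benign.
    malignant_index = list(loaded_model.classes_).index(0)
    p_malignant = float(proba[malignant_index])
    label = "malignant" if p_malignant >= 0.5 else "benign"
    return label, p_malignant


label, p_malignant = predict_tumor(X[0], artifact_dir)
print(label, round(p_malignant, 3))
```

Saving the scaler alongside the model matters because new inputs must be transformed with the training-time statistics before prediction.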

4. Start the Application

Boot up the Flask web backend.

python app.py

Open the URL shown in the terminal (usually http://127.0.0.1:5000/) in your browser to view the application!
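
For readers who want to see what a prediction endpoint in a Flask backend like this could look like, here is a minimal sketch. The `/predict` route, the JSON payload shape, and the response field names are all hypothetical and not taken from app.py; the model is trained inline as a stand-in for loading best_model.pkl and scaler.pkl.

```python
from flask import Flask, jsonify, request
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

N_FEATURES = 30

# Stand-in model trained at import time; the real app presumably loads saved artifacts.
X, y = load_breast_cancer(return_X_y=True)
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X, y)

app = Flask(__name__)


@app.route("/predict", methods=["POST"])  # hypothetical route name
def predict():
    payload = request.get_json(silent=True) or {}
    features = payload.get("features")
    # Reject payloads that do not carry exactly 30 values.
    if not isinstance(features, list) or len(features) != N_FEATURES:
        return jsonify(error=f"expected a list of {N_FEATURES} numbers"), 400
    try:
        row = [[float(v) for v in features]]
    except (TypeError, ValueError):
        return jsonify(error="features must be numeric"), 400
    proba = pipeline.predict_proba(row)[0]
    # Class 0 is malignant and class 1 is benign in scikit-learn's encoding.
    p_malignant = float(proba[list(pipeline.classes_).index(0)])
    return jsonify(
        prediction="malignant" if p_malignant >= 0.5 else "benign",
        probability_malignant=p_malignant,
    )


# Exercise the endpoint in-process with Flask's test client (no server needed).
client = app.test_client()
response = client.post("/predict", json={"features": X[0].tolist()})
print(response.status_code, response.get_json())
```

Validating the feature count and types before calling the model keeps malformed form input from surfacing as a server error.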

A sample report to use as input is linked below:

https://docs.google.com/document/d/1sJXyiBUusD7_LzGzv139xyZWyBF2y0Y-/edit?usp=sharing&ouid=112675565658180695174&rtpof=true&sd=true
