Genetic Disease Risk Prediction System

A full-stack web application that leverages machine learning and genomic data to predict an individual's risk of developing specific genetic diseases. The system integrates a FastAPI backend with a React frontend, utilizing a hybrid database architecture combining PostgreSQL for structured relational data and MongoDB for flexible document storage.

Overview

The Genetic Disease Risk Prediction System is designed to assess an individual's genetic predisposition to various diseases based on their Single Nucleotide Polymorphism (SNP) data. By analyzing genetic variants and their associated risk factors, the system provides probabilistic risk assessments for conditions including:

Type 2 Diabetes
Coronary Artery Disease
Hypertension
Breast Cancer

The application employs a Logistic Regression model trained on Genome-Wide Association Study (GWAS) data, with disease-specific models that account for the unique genetic markers associated with each condition.
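
As a rough illustration of how a logistic regression model turns a feature vector into a risk probability, the sketch below applies the logistic (sigmoid) function to a weighted sum of features. The function name, weights, intercept, and feature values are made-up placeholders, not values taken from the trained models.

```python
import math


def logistic_probability(features, weights, intercept):
    """Return sigmoid(intercept + sum(w_i * x_i)), a value in (0, 1)."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical standardized features and coefficients (illustration only).
example_features = [0.5, -1.2, 0.3]
example_weights = [0.8, 0.1, -0.4]
example_probability = logistic_probability(
    example_features, example_weights, intercept=0.0
)
```

A weighted sum of zero maps to a probability of exactly 0.5; larger positive sums push the result toward 1, and larger negative sums push it toward 0.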

System Architecture

The application follows a three-tier architecture:

+-------------------+     +-------------------+     +-------------------+
|                   |     |                   |     |                   |
|   React Frontend  |<--->|  FastAPI Backend  |<--->|    Databases      |
|                   |     |                   |     |                   |
+-------------------+     +-------------------+     +-------------------+
                                   |
                                   v
                          +-------------------+
                          |                   |
                          |   ML Pipeline     |
                          |   (scikit-learn)  |
                          |                   |
                          +-------------------+

Components

Frontend: React-based single-page application providing user input forms and risk visualization
Backend: FastAPI server handling RESTful API endpoints, data processing, and ML inference
PostgreSQL (Neon): Stores structured data including patients, genes, SNPs, diseases, and predictions
MongoDB (Atlas): Stores model metadata, serialized trained models, and raw genetic input logs
ML Pipeline: scikit-learn-based Logistic Regression models with StandardScaler preprocessing

Technology Stack

Backend

Component	Technology	Version
Web Framework	FastAPI	0.115.0
ASGI Server	Uvicorn	0.30.1
ORM	SQLAlchemy	2.0.32
PostgreSQL Driver	psycopg2-binary	2.9.9
MongoDB Driver	PyMongo	4.8.0
Data Processing	Pandas	2.2.2
Numerical Computing	NumPy	2.1.2
Machine Learning	scikit-learn	1.5.1
Environment Management	python-dotenv	1.0.1

Frontend

Component	Technology	Version
JavaScript Library	React	18.2.0
Build Tool	react-scripts	5.0.1

Infrastructure

Component	Service
SQL Database	Neon (PostgreSQL)
NoSQL Database	MongoDB Atlas

Database Schema

PostgreSQL Tables

Patient

Column	Type	Description
patient_id	Integer (PK)	Unique patient identifier
name	String(100)	Patient name
age	Integer	Patient age
gender	String(10)	Patient gender

Gene

Column	Type	Description
gene_id	Integer (PK)	Unique gene identifier
gene_name	String (Unique)	Gene symbol
description	String	Gene description

SNP

Column	Type	Description
snp_id	Integer (PK)	Unique SNP identifier
rsid	String(20) (Unique)	Reference SNP ID (e.g., rs123456)
gene_id	Integer (FK)	Associated gene
chromosome	Integer	Chromosome number
position	Integer	Genomic position
risk_allele	String(1)	Risk-associated allele (A, T, G, or C)
odds_ratio	Float	Association strength
risk_allele_freq	Float	Population frequency of risk allele
p_value	Float	Statistical significance
is_significant	Boolean	Whether SNP meets significance threshold
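
The schema does not state which threshold drives is_significant. The sketch below mirrors the SNP columns in a plain dataclass and assumes the conventional genome-wide significance cutoff of 5e-8; the class name, the constant, and the sample values are hypothetical.

```python
from dataclasses import dataclass

# Assumption: conventional genome-wide significance cutoff. The project's
# actual threshold is not documented in this README.
GENOME_WIDE_SIGNIFICANCE = 5e-8


@dataclass(frozen=True)
class SnpRecord:
    """Plain-Python mirror of the SNP table columns (hypothetical helper)."""

    rsid: str
    chromosome: int
    position: int
    risk_allele: str
    odds_ratio: float
    risk_allele_freq: float
    p_value: float

    @property
    def is_significant(self) -> bool:
        """True when the p-value falls below the assumed cutoff."""
        return self.p_value < GENOME_WIDE_SIGNIFICANCE


# Made-up values for illustration only.
snp = SnpRecord("rs123456", 1, 1000000, "T", 1.3, 0.25, 2e-10)
```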

Disease

Column	Type	Description
disease_id	Integer (PK)	Unique disease identifier
name	String(100)	Disease name
description	String	Disease description

DiseaseSNP

Column	Type	Description
id	Integer (PK)	Unique mapping identifier
disease_id	Integer (FK)	Associated disease
snp_id	Integer (FK)	Associated SNP

Prediction

Column	Type	Description
pred_id	Integer (PK)	Unique prediction identifier
patient_id	Integer (FK)	Associated patient
disease_id	Integer (FK)	Predicted disease
probability	Float	Risk probability (0.0 - 1.0)
risk_level	String(20)	Risk classification (High/Low)
timestamp	DateTime	Prediction timestamp
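
The risk_level column holds a High/Low label derived from probability. A minimal sketch of that mapping follows, assuming a 0.5 cutoff; the cutoff and the function name are assumptions, since the source does not say where the boundary lies.

```python
def classify_risk(probability: float, threshold: float = 0.5) -> str:
    """Map a probability in [0, 1] to the High/Low labels used by risk_level.

    Assumption: probabilities at or above the threshold count as High.
    """
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0.0 and 1.0")
    return "High" if probability >= threshold else "Low"
```

Under this assumption, the example probability of 0.67 shown in the API reference would be labeled High.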

MongoDB Collections

model_metadata

Stores serialized ML models (pickled)
Feature names and training metadata
Disease-specific model identifiers
Creation timestamps

genetic_inputs

Raw SNP input logs from user submissions
Derived feature vectors
Patient and model associations
Audit trail for predictions
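
To make the model_metadata description more concrete, the sketch below pickles a stand-in model and wraps it in a dictionary shaped like a MongoDB document. All field names (disease_id, feature_names, model_blob, trained_at) are assumptions for illustration, and a plain dict stands in for the trained scikit-learn estimator.

```python
import pickle
from datetime import datetime, timezone


def build_model_document(model, disease_id, feature_names):
    """Assemble a dict shaped like a model_metadata document (field names assumed)."""
    return {
        "disease_id": disease_id,
        "feature_names": list(feature_names),
        # pickle.dumps returns bytes, which PyMongo encodes as BSON binary data.
        "model_blob": pickle.dumps(model),
        "trained_at": datetime.now(timezone.utc),
    }


def load_model(document):
    """Rebuild the model object. Only unpickle data from a trusted source."""
    return pickle.loads(document["model_blob"])


# A plain dict stands in for a trained estimator here.
stand_in_model = {"coef": [0.8, 0.1], "intercept": -0.2}
doc = build_model_document(
    stand_in_model,
    disease_id=1,
    feature_names=["odds_ratio", "risk_allele_freq"],
)
restored = load_model(doc)
```

Because unpickling can execute arbitrary code, the collection holding these blobs should be writable only by the training scripts.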

Project Structure

Genetic_database/
|
|-- backend/
|   |-- main.py                  # FastAPI application with endpoints
|   |-- database.py              # Database connection configuration
|   |-- models.py                # SQLAlchemy ORM models
|   |-- crud.py                  # Database CRUD operations
|   |-- etl.py                   # Data extraction, transformation, loading
|   |-- ml.py                    # Core ML training and prediction logic
|   |-- train_pipeline.py        # One-time training pipeline script
|   |-- train_disease_models.py  # Disease-specific model training
|   |-- populate_disease_snp.py  # Disease-SNP mapping population
|   |-- diagnose_model.py        # Model diagnostics and debugging
|   |-- retrain.py               # Model retraining utilities
|   |-- requirements.txt         # Python dependencies
|   |-- .env                     # Environment variables (not committed)
|
|-- frontend/
|   |-- public/                  # Static assets
|   |-- src/
|   |   |-- App.js               # Main application component
|   |   |-- App.css              # Global styles
|   |   |-- index.js             # Application entry point
|   |   |-- components/
|   |   |   |-- Home.js          # Landing page component
|   |   |   |-- inputForm.js     # Patient and SNP input form
|   |   |   |-- results.js       # Prediction results display
|   |   |   |-- RiskVisualization.js  # Risk visualization charts
|   |   |   |-- ComputationFlow.js    # ML computation explanation
|   |   |   |-- Glossary.js      # Genetic terms glossary
|   |   |   |-- TermTooltip.js   # Interactive term tooltips
|   |   |   |-- SQLViewer.js     # SQL query visualization
|   |   |-- pages/
|   |   |   |-- RiskPredictor.js # Main prediction page
|   |   |   |-- GlossaryPage.js  # Glossary page
|   |   |   |-- BlogPage.js      # Educational blog content
|   |   |-- styles/              # Component-specific CSS files
|   |-- package.json             # Node.js dependencies
|
|-- cleaned_gwas.csv             # GWAS training dataset
|-- cleaned_gwas_test_split.csv  # Test dataset split
|-- .gitignore                   # Git ignore rules
|-- README.md                    # Project documentation

Installation and Setup

Prerequisites

Python 3.12 or higher
Node.js 18 or higher
npm (included with Node.js)
PostgreSQL database (Neon recommended)
MongoDB database (Atlas recommended)

Step 1: Clone the Repository

git clone https://github.com/your-repository/genetic-risk-prediction.git
cd genetic-risk-prediction

Step 2: Set Up Cloud Databases

PostgreSQL (Neon)

Create an account at https://console.neon.tech
Create a new project and branch (default settings are acceptable)

Copy the connection string in the format:

postgresql://USER:PASSWORD@HOST/DBNAME?sslmode=require

MongoDB Atlas

Create an account at https://www.mongodb.com/atlas/database
Create a free M0 cluster
Create a database user with read/write permissions
Configure network access (allow your current IP address)

Copy the SRV connection URI:

mongodb+srv://USER:PASSWORD@CLUSTER.mongodb.net/?retryWrites=true&w=majority

Step 3: Backend Setup

cd backend

# Create and activate virtual environment
python -m venv venv

# Windows
venv\Scripts\activate

# macOS/Linux
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

Step 4: Frontend Setup

# From the repository root (run `cd ..` first if you are still in backend/)
cd frontend

# Install dependencies
npm install

Configuration

Create a .env file in the backend/ directory with the following variables:

SQLDATABASE_URL=postgresql+psycopg2://USER:PASSWORD@HOST/DBNAME?sslmode=require
NoSQLDATABASE_URL=mongodb+srv://USER:PASSWORD@CLUSTER.mongodb.net/?retryWrites=true&w=majority

Replace the placeholder values with your actual database credentials.

Important Notes:

The SQL URL must include +psycopg2 for the SQLAlchemy PostgreSQL driver
Ensure sslmode=require is appended for secure Neon connections
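
The two notes above can be checked mechanically before starting the server. The following sketch uses only the standard library; check_sql_url is a hypothetical helper, not part of the backend.

```python
from urllib.parse import parse_qs, urlsplit


def check_sql_url(url: str) -> list:
    """Return a list of problems found in a SQLDATABASE_URL value (empty if none)."""
    problems = []
    parts = urlsplit(url)
    if parts.scheme != "postgresql+psycopg2":
        problems.append("scheme should be postgresql+psycopg2")
    if parse_qs(parts.query).get("sslmode") != ["require"]:
        problems.append("sslmode=require is missing")
    return problems


# The placeholder URL from this section passes both checks.
ok = check_sql_url(
    "postgresql+psycopg2://USER:PASSWORD@HOST/DBNAME?sslmode=require"
)
```

Pasting the raw Neon connection string (which starts with postgresql://) would be flagged by the first check.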

Training the Machine Learning Model

The training pipeline performs the following operations:

Creates database tables using SQLAlchemy ORM
Loads and cleans the GWAS dataset
Splits data into training (80%) and test (20%) sets
Populates PostgreSQL with gene, SNP, and disease data
Trains disease-specific Logistic Regression models
Stores model metadata and serialized models in MongoDB
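
The 80/20 split step can be pictured with the standard-library sketch below. The pipeline itself may well use scikit-learn's train_test_split; this version only illustrates the idea, and the function name and seed are assumptions.

```python
import random


def split_rows(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of rows and return (train, test) with the given test share."""
    shuffled = list(rows)
    # A dedicated Random instance keeps the split reproducible for a given seed.
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# 100 dummy row indices stand in for the GWAS records.
train_rows, test_rows = split_rows(range(100))
```

Fixing the seed makes the held-out rows stable across runs, which keeps evaluation results comparable between retraining attempts.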

Run Initial Training

cd backend

# Activate virtual environment
venv\Scripts\activate  # Windows
source venv/bin/activate  # macOS/Linux

# Run the training pipeline
python train_pipeline.py --dataset ../cleaned_gwas.csv

Train Disease-Specific Models

After the initial pipeline, train models for each disease:

python train_disease_models.py

This creates separate models for each disease in the database, allowing for disease-specific risk predictions.

Running the Application

Start the Backend Server

cd backend

# Activate virtual environment
venv\Scripts\activate  # Windows
source venv/bin/activate  # macOS/Linux

# Start the FastAPI server
uvicorn main:app --host 0.0.0.0 --port 8000 --reload

The backend API will be accessible at http://localhost:8000.

Start the Frontend Development Server

cd frontend

# Start React development server
npm start

The frontend application will be accessible at http://localhost:3000.

API Reference

POST /init/

Verifies that the ML model is loaded and ready for predictions.

Response:

{
    "message": "Disease-specific model metadata loaded",
    "model_id": "60a7b8c9d1e2f3a4b5c6d7e8",
    "trained_at": "2024-01-15T10:30:00Z",
    "disease_id": 1
}

POST /predict/

Generates a disease risk prediction for a patient.

Request Body:

{
    "patient": {
        "name": "John Doe",
        "age": 45,
        "gender": "Male"
    },
    "disease_name": "Type 2 Diabetes",
    "snps": [
        {"rsid": "rs1234567", "allele": "A"},
        {"rsid": "rs7654321", "allele": "G"}
    ]
}

Response:

{
    "patient_id": 1,
    "disease": "Type 2 Diabetes",
    "disease_id": 1,
    "risk_probability": 0.67,
    "model_id": "60a7b8c9d1e2f3a4b5c6d7e8"
}
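
For scripted use, the request body above can be assembled with the standard library alone. The sketch below only builds the request; build_predict_request is a hypothetical helper, and the base URL assumes the backend is running locally as described under Running the Application.

```python
import json
from urllib.request import Request

API_BASE = "http://localhost:8000"  # backend address from the run instructions


def build_predict_request(name, age, gender, disease_name, snps):
    """Build (but do not send) a POST /predict/ request with a JSON body."""
    payload = {
        "patient": {"name": name, "age": age, "gender": gender},
        "disease_name": disease_name,
        "snps": [{"rsid": rsid, "allele": allele} for rsid, allele in snps],
    }
    return Request(
        f"{API_BASE}/predict/",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_predict_request(
    "John Doe", 45, "Male", "Type 2 Diabetes",
    [("rs1234567", "A"), ("rs7654321", "G")],
)
# To send it against a running backend:
#   from urllib.request import urlopen
#   with urlopen(request) as response:
#       result = json.load(response)
```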

Usage Guide

Access the Application: Navigate to http://localhost:3000 in your web browser.
Navigate to Risk Predictor: From the home page, click on the prediction tool.
Enter Patient Information:
- Patient name
- Age (1-120)
- Gender (Male/Female/Other)
Select Target Disease: Choose from the available diseases (Type 2 Diabetes, Coronary Artery Disease, Hypertension, or Breast Cancer).
Input Genetic Variants:
- Enter SNP rsID values (e.g., rs123456)
- Specify the patient's allele for each SNP (A, T, G, or C)
- Add multiple SNPs using the "Add Another SNP" button
Submit for Prediction: Click "Predict Risk" to generate the risk assessment.
Review Results: The application displays:
- Risk probability (0-100%)
- Risk classification (High/Low)
- Visualization of the computation process
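
The input constraints listed in this guide (age range, rsID format, allowed alleles) could be checked with a small function like the one below. It is a sketch, not the form's actual validation logic; the rsID pattern and the rule that at least one SNP is required are assumptions.

```python
import re

RSID_PATTERN = re.compile(r"rs[0-9]+")  # e.g. rs123456
VALID_ALLELES = {"A", "T", "G", "C"}


def validate_submission(age, snps):
    """Return a list of problems; an empty list means the input looks valid."""
    problems = []
    if not isinstance(age, int) or not 1 <= age <= 120:
        problems.append("age must be an integer from 1 to 120")
    if not snps:
        # Assumption: a prediction needs at least one SNP.
        problems.append("at least one SNP is required")
    for rsid, allele in snps:
        if not RSID_PATTERN.fullmatch(rsid):
            problems.append(f"invalid rsID: {rsid}")
        if allele not in VALID_ALLELES:
            problems.append(f"invalid allele for {rsid}: {allele}")
    return problems
```

Returning every problem at once, rather than stopping at the first, lets a form highlight all invalid fields in a single pass.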

Data Flow

User Input: Patient information and SNP data submitted via React frontend
API Request: Frontend sends POST request to /predict/ endpoint
Patient Creation: Backend creates new patient record in PostgreSQL
Feature Extraction:
- SNPs are matched against the database
- Features extracted: odds_ratio, risk_allele_freq, chromosome, position
- Weighted averaging applied based on odds ratios
Input Logging: Raw input and derived features stored in MongoDB
Model Inference:
- Disease-specific model loaded from MongoDB
- Features passed through StandardScaler and Logistic Regression
- Risk probability calculated
Prediction Storage: Result saved to the PostgreSQL Prediction table
Response: Risk probability and metadata returned to frontend for visualization
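
The feature extraction and scaling steps above can be sketched as follows. This is one possible reading of "weighted averaging based on odds ratios", in which each feature is averaged across the matched SNPs with the odds ratio as the weight; the backend's exact formula may differ, and all function names and sample rows are hypothetical.

```python
FEATURE_NAMES = ("odds_ratio", "risk_allele_freq", "chromosome", "position")


def weighted_feature_vector(matched_snps):
    """Average each feature over the matched SNPs, weighting by odds_ratio.

    Assumption: the weights are the odds ratios themselves.
    """
    total_weight = sum(snp["odds_ratio"] for snp in matched_snps)
    if total_weight <= 0:
        raise ValueError("need at least one matched SNP with a positive odds ratio")
    return [
        sum(snp["odds_ratio"] * snp[name] for snp in matched_snps) / total_weight
        for name in FEATURE_NAMES
    ]


def standardize(vector, means, scales):
    """Apply (x - mean) / scale per feature, the transform StandardScaler performs."""
    return [(x - m) / s for x, m, s in zip(vector, means, scales)]


# Made-up SNP rows for illustration only.
matched = [
    {"odds_ratio": 1.0, "risk_allele_freq": 0.2, "chromosome": 1, "position": 100},
    {"odds_ratio": 3.0, "risk_allele_freq": 0.4, "chromosome": 5, "position": 500},
]
features = weighted_feature_vector(matched)
```

In the real pipeline the means and scales would come from the scaler fitted during training, and the standardized vector would then be passed to the disease-specific Logistic Regression model.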

Verification and Testing

Verify Backend Connectivity

curl -X POST http://localhost:8000/init/

Expected response: JSON object with model metadata confirming successful initialization.

End-to-End Test

Submit a prediction through the frontend interface
Verify data in PostgreSQL:
- Check patient table for new patient record
- Check prediction table for prediction result
Verify data in MongoDB:
- Check genetic_inputs collection for input log

Check MongoDB Collections

Using MongoDB Atlas Data Explorer:

Database: genetic_risk
Collections: model_metadata, genetic_inputs

Troubleshooting

Common Issues

Connection Refused on Backend

Ensure the virtual environment is activated and the uvicorn server is running on port 8000
Verify the .env file contains correct database credentials
Check that PostgreSQL and MongoDB services are accessible

Model Not Found Error

Run python train_pipeline.py --dataset ../cleaned_gwas.csv (from backend/) to initialize the database and train models
Run train_disease_models.py to create disease-specific models

CORS Errors in Frontend

The backend is configured to allow all origins (*)
Ensure the frontend is pointing to http://localhost:8000

Database Connection Timeout

Check network connectivity to Neon and MongoDB Atlas
Verify that your current IP address is on the IP access list in MongoDB Atlas

Missing Dependencies

Backend: pip install -r requirements.txt
Frontend: npm install

License

This project is developed for educational and research purposes. Please ensure compliance with applicable data protection regulations when handling genetic information.

Acknowledgments

GWAS Catalog for providing the foundational genetic association data
Neon for serverless PostgreSQL hosting
MongoDB Atlas for document database services
scikit-learn for machine learning infrastructure
