Adhithi02/Genetic_database

Genetic Disease Risk Prediction System

A full-stack web application that applies machine learning to genomic data to estimate an individual's genetic risk of developing specific diseases. It pairs a FastAPI backend with a React frontend and uses a hybrid database architecture: PostgreSQL for structured relational data and MongoDB for flexible document storage.


Table of Contents

  1. Overview
  2. System Architecture
  3. Technology Stack
  4. Database Schema
  5. Project Structure
  6. Installation and Setup
  7. Configuration
  8. Training the Machine Learning Model
  9. Running the Application
  10. API Reference
  11. Usage Guide
  12. Data Flow
  13. Verification and Testing
  14. Troubleshooting
  15. License
  16. Acknowledgments

Overview

The Genetic Disease Risk Prediction System is designed to assess an individual's genetic predisposition to various diseases based on their Single Nucleotide Polymorphism (SNP) data. By analyzing genetic variants and their associated risk factors, the system provides probabilistic risk assessments for conditions including:

  • Type 2 Diabetes
  • Coronary Artery Disease
  • Hypertension
  • Breast Cancer

The application uses Logistic Regression models trained on Genome-Wide Association Study (GWAS) data. Each disease has its own model, so predictions reflect the genetic markers associated with that condition.


System Architecture

The application follows a three-tier architecture:

+-------------------+     +-------------------+     +-------------------+
|                   |     |                   |     |                   |
|   React Frontend  |<--->|  FastAPI Backend  |<--->|    Databases      |
|                   |     |                   |     |                   |
+-------------------+     +-------------------+     +-------------------+
                                   |
                                   v
                          +-------------------+
                          |                   |
                          |   ML Pipeline     |
                          |   (scikit-learn)  |
                          |                   |
                          +-------------------+

Components

  • Frontend: React-based single-page application providing user input forms and risk visualization
  • Backend: FastAPI server handling RESTful API endpoints, data processing, and ML inference
  • PostgreSQL (Neon): Stores structured data including patients, genes, SNPs, diseases, and predictions
  • MongoDB (Atlas): Stores model metadata, serialized trained models, and raw genetic input logs
  • ML Pipeline: scikit-learn-based Logistic Regression models with StandardScaler preprocessing
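
As a rough illustration of what the ML Pipeline component computes at inference time, the sketch below reproduces the two steps in plain Python: standardization (what StandardScaler applies to new data) followed by the logistic function. The parameter values are hypothetical and not taken from a trained model; the feature order follows the four features this README lists under Data Flow.

```python
import math

def standardize(features, means, stds):
    """Scale each feature using a stored mean and standard deviation."""
    return [(x - m) / s for x, m, s in zip(features, means, stds)]

def predict_probability(features, means, stds, weights, bias):
    """Logistic regression on standardized features; returns a value in (0, 1)."""
    z = standardize(features, means, stds)
    logit = bias + sum(w * v for w, v in zip(weights, z))
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical parameters (assumptions for illustration only).
# Feature order: odds_ratio, risk_allele_freq, chromosome, position
means = [1.2, 0.3, 10.0, 5.0e7]
stds = [0.2, 0.1, 6.0, 4.0e7]
weights = [0.9, 0.4, 0.0, 0.0]
bias = -0.5

p = predict_probability([1.4, 0.3, 10.0, 5.0e7], means, stds, weights, bias)
print(f"risk probability: {p:.3f}")
```

In the actual backend these parameters would come from the fitted scikit-learn objects rather than hand-written lists.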

Technology Stack

Backend

Component              | Technology      | Version
-----------------------|-----------------|--------
Web Framework          | FastAPI         | 0.115.0
ASGI Server            | Uvicorn         | 0.30.1
ORM                    | SQLAlchemy      | 2.0.32
PostgreSQL Driver      | psycopg2-binary | 2.9.9
MongoDB Driver         | PyMongo         | 4.8.0
Data Processing        | Pandas          | 2.2.2
Numerical Computing    | NumPy           | 2.1.2
Machine Learning       | scikit-learn    | 1.5.1
Environment Management | python-dotenv   | 1.0.1

Frontend

Component          | Technology    | Version
-------------------|---------------|--------
JavaScript Library | React         | 18.2.0
Build Tool         | react-scripts | 5.0.1

Infrastructure

Component      | Service
---------------|------------------
SQL Database   | Neon (PostgreSQL)
NoSQL Database | MongoDB Atlas

Database Schema

PostgreSQL Tables

Patient

Column     | Type         | Description
-----------|--------------|--------------------------
patient_id | Integer (PK) | Unique patient identifier
name       | String(100)  | Patient name
age        | Integer      | Patient age
gender     | String(10)   | Patient gender

Gene

Column      | Type            | Description
------------|-----------------|-----------------------
gene_id     | Integer (PK)    | Unique gene identifier
gene_name   | String (Unique) | Gene symbol
description | String          | Gene description

SNP

Column           | Type                | Description
-----------------|---------------------|-----------------------------------------
snp_id           | Integer (PK)        | Unique SNP identifier
rsid             | String(20) (Unique) | Reference SNP ID (e.g., rs123456)
gene_id          | Integer (FK)        | Associated gene
chromosome       | Integer             | Chromosome number
position         | Integer             | Genomic position
risk_allele      | String(1)           | Risk-associated allele (A, T, G, or C)
odds_ratio       | Float               | Association strength
risk_allele_freq | Float               | Population frequency of risk allele
p_value          | Float               | Statistical significance
is_significant   | Boolean             | Whether SNP meets significance threshold

Disease

Column      | Type         | Description
------------|--------------|--------------------------
disease_id  | Integer (PK) | Unique disease identifier
name        | String(100)  | Disease name
description | String       | Disease description

DiseaseSNP

Column     | Type         | Description
-----------|--------------|--------------------------
id         | Integer (PK) | Unique mapping identifier
disease_id | Integer (FK) | Associated disease
snp_id     | Integer (FK) | Associated SNP

Prediction

Column      | Type         | Description
------------|--------------|--------------------------------
pred_id     | Integer (PK) | Unique prediction identifier
patient_id  | Integer (FK) | Associated patient
disease_id  | Integer (FK) | Predicted disease
probability | Float        | Risk probability (0.0 - 1.0)
risk_level  | String(20)   | Risk classification (High/Low)
timestamp   | DateTime     | Prediction timestamp
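
To show how the DiseaseSNP mapping table links diseases to their SNPs, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for PostgreSQL and SQLAlchemy. The lowercase table names, the reduced column set, and the odds ratio values are assumptions made for illustration; only the column names come from the schema above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE disease (
    disease_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE snp (
    snp_id INTEGER PRIMARY KEY,
    rsid TEXT UNIQUE NOT NULL,
    risk_allele TEXT,
    odds_ratio REAL
);
CREATE TABLE disease_snp (
    id INTEGER PRIMARY KEY,
    disease_id INTEGER REFERENCES disease(disease_id),
    snp_id INTEGER REFERENCES snp(snp_id)
);
""")

# Illustrative rows only; the odds ratios are made up.
conn.executemany("INSERT INTO disease VALUES (?, ?)",
                 [(1, "Type 2 Diabetes"), (2, "Hypertension")])
conn.executemany("INSERT INTO snp VALUES (?, ?, ?, ?)",
                 [(1, "rs1234567", "A", 1.3), (2, "rs7654321", "G", 1.1)])
conn.executemany("INSERT INTO disease_snp (disease_id, snp_id) VALUES (?, ?)",
                 [(1, 1), (1, 2), (2, 2)])

def snps_for_disease(conn, disease_name):
    """Return (rsid, risk_allele, odds_ratio) for every SNP mapped to a disease."""
    cursor = conn.execute(
        """
        SELECT s.rsid, s.risk_allele, s.odds_ratio
        FROM snp AS s
        JOIN disease_snp AS ds ON ds.snp_id = s.snp_id
        JOIN disease AS d ON d.disease_id = ds.disease_id
        WHERE d.name = ?
        ORDER BY s.rsid
        """,
        (disease_name,),
    )
    return cursor.fetchall()

print(snps_for_disease(conn, "Type 2 Diabetes"))
```

Because the mapping is many-to-many, one SNP (rs7654321 in this example) can contribute to more than one disease-specific model.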

MongoDB Collections

model_metadata

  • Stores serialized ML models (pickled)
  • Feature names and training metadata
  • Disease-specific model identifiers
  • Creation timestamps

genetic_inputs

  • Raw SNP input logs from user submissions
  • Derived feature vectors
  • Patient and model associations
  • Audit trail for predictions
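
A model_metadata document might be assembled roughly as follows. This is a sketch under stated assumptions: the field names (model_blob, feature_names, created_at) are hypothetical, and a plain dictionary stands in for the trained scikit-learn model so the example runs without extra dependencies.

```python
import pickle
from datetime import datetime, timezone

def build_model_document(model, feature_names, disease_id):
    """Bundle a pickled model with the metadata listed above (field names are assumed)."""
    return {
        "disease_id": disease_id,
        "feature_names": list(feature_names),
        "model_blob": pickle.dumps(model),         # bytes; PyMongo stores these as BSON binary
        "created_at": datetime.now(timezone.utc),  # creation timestamp
    }

def load_model(document):
    """Restore the model object from a stored document."""
    return pickle.loads(document["model_blob"])

# A dictionary stands in for a fitted scikit-learn estimator here.
stand_in_model = {"weights": [0.9, 0.4], "bias": -0.5}
doc = build_model_document(stand_in_model, ["odds_ratio", "risk_allele_freq"], disease_id=1)
restored = load_model(doc)
print(restored == stand_in_model)
```

With PyMongo, a document of this shape could be passed to a collection's insert_one method. Unpickling executes arbitrary code, so only load model blobs from a database you control.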

Project Structure

Genetic_database/
|
|-- backend/
|   |-- main.py                  # FastAPI application with endpoints
|   |-- database.py              # Database connection configuration
|   |-- models.py                # SQLAlchemy ORM models
|   |-- crud.py                  # Database CRUD operations
|   |-- etl.py                   # Data extraction, transformation, loading
|   |-- ml.py                    # Core ML training and prediction logic
|   |-- train_pipeline.py        # One-time training pipeline script
|   |-- train_disease_models.py  # Disease-specific model training
|   |-- populate_disease_snp.py  # Disease-SNP mapping population
|   |-- diagnose_model.py        # Model diagnostics and debugging
|   |-- retrain.py               # Model retraining utilities
|   |-- requirements.txt         # Python dependencies
|   |-- .env                     # Environment variables (not committed)
|
|-- frontend/
|   |-- public/                  # Static assets
|   |-- src/
|   |   |-- App.js               # Main application component
|   |   |-- App.css              # Global styles
|   |   |-- index.js             # Application entry point
|   |   |-- components/
|   |   |   |-- Home.js          # Landing page component
|   |   |   |-- inputForm.js     # Patient and SNP input form
|   |   |   |-- results.js       # Prediction results display
|   |   |   |-- RiskVisualization.js  # Risk visualization charts
|   |   |   |-- ComputationFlow.js    # ML computation explanation
|   |   |   |-- Glossary.js      # Genetic terms glossary
|   |   |   |-- TermTooltip.js   # Interactive term tooltips
|   |   |   |-- SQLViewer.js     # SQL query visualization
|   |   |-- pages/
|   |   |   |-- RiskPredictor.js # Main prediction page
|   |   |   |-- GlossaryPage.js  # Glossary page
|   |   |   |-- BlogPage.js      # Educational blog content
|   |   |-- styles/              # Component-specific CSS files
|   |-- package.json             # Node.js dependencies
|
|-- cleaned_gwas.csv             # GWAS training dataset
|-- cleaned_gwas_test_split.csv  # Test dataset split
|-- .gitignore                   # Git ignore rules
|-- README.md                    # Project documentation

Installation and Setup

Prerequisites

  • Python 3.12 or higher
  • Node.js 18 or higher
  • npm (included with Node.js)
  • PostgreSQL database (Neon recommended)
  • MongoDB database (Atlas recommended)

Step 1: Clone the Repository

git clone https://github.com/Adhithi02/Genetic_database.git
cd Genetic_database

Step 2: Set Up Cloud Databases

PostgreSQL (Neon)

  1. Create an account at https://console.neon.tech
  2. Create a new project and branch (default settings are acceptable)
  3. Copy the connection string in the format:
    postgresql://USER:PASSWORD@HOST/DBNAME?sslmode=require
    

MongoDB Atlas

  1. Create an account at https://www.mongodb.com/atlas/database
  2. Create a free M0 cluster
  3. Create a database user with read/write permissions
  4. Configure network access (allow your current IP address)
  5. Copy the SRV connection URI:
    mongodb+srv://USER:PASSWORD@CLUSTER.mongodb.net/?retryWrites=true&w=majority
    

Step 3: Backend Setup

cd backend

# Create and activate virtual environment
python -m venv venv

# Windows
venv\Scripts\activate

# macOS/Linux
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

Step 4: Frontend Setup

# From the repository root (use cd ../frontend if you are still in backend/)
cd frontend

# Install dependencies
npm install

Configuration

Create a .env file in the backend/ directory with the following variables:

SQLDATABASE_URL=postgresql+psycopg2://USER:PASSWORD@HOST/DBNAME?sslmode=require
NoSQLDATABASE_URL=mongodb+srv://USER:PASSWORD@CLUSTER.mongodb.net/?retryWrites=true&w=majority

Replace the placeholder values with your actual database credentials.

Important Notes:

  • The SQL URL must include +psycopg2 for the SQLAlchemy PostgreSQL driver
  • Ensure sslmode=require is appended for secure Neon connections
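
The two notes above can be checked before starting the server. The helper below is a hypothetical sketch, not part of the repository; it takes any mapping of environment variables (for example os.environ after python-dotenv's load_dotenv has run) and reports what looks wrong.

```python
from urllib.parse import parse_qs, urlsplit

def check_database_urls(env):
    """Return a list of problems found; an empty list means both URLs look plausible."""
    problems = []

    sql = urlsplit(env.get("SQLDATABASE_URL", ""))
    if sql.scheme != "postgresql+psycopg2":
        problems.append("SQLDATABASE_URL should start with postgresql+psycopg2://")
    if parse_qs(sql.query).get("sslmode") != ["require"]:
        problems.append("SQLDATABASE_URL should include sslmode=require")

    nosql = urlsplit(env.get("NoSQLDATABASE_URL", ""))
    if nosql.scheme not in ("mongodb+srv", "mongodb"):
        problems.append("NoSQLDATABASE_URL should start with mongodb+srv:// or mongodb://")

    return problems

example_env = {
    "SQLDATABASE_URL": "postgresql+psycopg2://USER:PASSWORD@HOST/DBNAME?sslmode=require",
    "NoSQLDATABASE_URL": "mongodb+srv://USER:PASSWORD@CLUSTER.mongodb.net/?retryWrites=true&w=majority",
}
print(check_database_urls(example_env))
```

This only inspects the shape of each URL; it does not open a connection or verify credentials.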

Training the Machine Learning Model

The training pipeline performs the following operations:

  1. Creates database tables using SQLAlchemy ORM
  2. Loads and cleans the GWAS dataset
  3. Splits data into training (80%) and test (20%) sets
  4. Populates PostgreSQL with gene, SNP, and disease data
  5. Trains disease-specific Logistic Regression models
  6. Stores model metadata and serialized models in MongoDB
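
Step 3 can be pictured with a small, reproducible split. This sketch uses only the standard library; the repository may well use scikit-learn's train_test_split instead, and the seed value and row format here are assumptions.

```python
import random

def train_test_split_rows(rows, test_fraction=0.2, seed=42):
    """Shuffle rows with a fixed seed and return (train_rows, test_rows)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Placeholder rows standing in for records read from cleaned_gwas.csv.
rows = [{"rsid": f"rs{i}"} for i in range(10)]
train_rows, test_rows = train_test_split_rows(rows)
print(len(train_rows), len(test_rows))
```

Fixing the seed keeps the held-out rows stable between runs, which matters if the test split is written to a file such as cleaned_gwas_test_split.csv and reused later.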

Run Initial Training

cd backend

# Activate virtual environment
venv\Scripts\activate  # Windows
source venv/bin/activate  # macOS/Linux

# Run the training pipeline
python train_pipeline.py --dataset ../cleaned_gwas.csv

Train Disease-Specific Models

After the initial pipeline, train models for each disease:

python train_disease_models.py

This creates separate models for each disease in the database, allowing for disease-specific risk predictions.


Running the Application

Start the Backend Server

cd backend

# Activate virtual environment
venv\Scripts\activate  # Windows
source venv/bin/activate  # macOS/Linux

# Start the FastAPI server
uvicorn main:app --host 0.0.0.0 --port 8000 --reload

The backend API will be accessible at http://localhost:8000.

Start the Frontend Development Server

cd frontend

# Start React development server
npm start

The frontend application will be accessible at http://localhost:3000.


API Reference

POST /init/

Verifies that the ML model is loaded and ready for predictions.

Response:

{
    "message": "Disease-specific model metadata loaded",
    "model_id": "60a7b8c9d1e2f3a4b5c6d7e8",
    "trained_at": "2024-01-15T10:30:00Z",
    "disease_id": 1
}

POST /predict/

Generates a disease risk prediction for a patient.

Request Body:

{
    "patient": {
        "name": "John Doe",
        "age": 45,
        "gender": "Male"
    },
    "disease_name": "Type 2 Diabetes",
    "snps": [
        {"rsid": "rs1234567", "allele": "A"},
        {"rsid": "rs7654321", "allele": "G"}
    ]
}

Response:

{
    "patient_id": 1,
    "disease": "Type 2 Diabetes",
    "disease_id": 1,
    "risk_probability": 0.67,
    "model_id": "60a7b8c9d1e2f3a4b5c6d7e8"
}
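
For scripted access, the request above could be built from Python as in the sketch below. It uses only the standard library and assumes the backend is reachable at http://localhost:8000, as described under Running the Application; the helper name is hypothetical.

```python
import json
import urllib.request

API_BASE = "http://localhost:8000"  # backend address given in this README

def build_predict_request(name, age, gender, disease_name, snps):
    """Build a POST /predict/ request; snps is a list of (rsid, allele) pairs."""
    payload = {
        "patient": {"name": name, "age": age, "gender": gender},
        "disease_name": disease_name,
        "snps": [{"rsid": rsid, "allele": allele} for rsid, allele in snps],
    }
    return urllib.request.Request(
        f"{API_BASE}/predict/",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

request = build_predict_request(
    "John Doe", 45, "Male", "Type 2 Diabetes",
    [("rs1234567", "A"), ("rs7654321", "G")],
)

# To send it against a running backend:
# with urllib.request.urlopen(request) as response:
#     result = json.load(response)
#     print(result["risk_probability"])
```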

Usage Guide

  1. Access the Application: Navigate to http://localhost:3000 in your web browser.

  2. Navigate to Risk Predictor: From the home page, click on the prediction tool.

  3. Enter Patient Information:

    • Patient name
    • Age (1-120)
    • Gender (Male/Female/Other)

  4. Select Target Disease: Choose from the available diseases (Type 2 Diabetes, Coronary Artery Disease, Hypertension, or Breast Cancer).

  5. Input Genetic Variants:

    • Enter SNP rsID values (e.g., rs123456)
    • Specify the patient's allele for each SNP (A, T, G, or C)
    • Add multiple SNPs using the "Add Another SNP" button

  6. Submit for Prediction: Click "Predict Risk" to generate the risk assessment.

  7. Review Results: The application displays:

    • Risk probability (0-100%)
    • Risk classification (High/Low)
    • Visualization of the computation process
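
The input constraints from steps 3 and 5 can be expressed as a small validator. This is a sketch of the rules as stated in this guide, not the form's actual code; the function name and the requirement of at least one SNP are assumptions.

```python
import re

VALID_ALLELES = {"A", "T", "G", "C"}
VALID_GENDERS = {"Male", "Female", "Other"}
RSID_PATTERN = re.compile(r"^rs\d+$")  # "rs" followed by digits, e.g. rs123456

def validate_submission(name, age, gender, snps):
    """Return a list of validation errors; snps is a list of (rsid, allele) pairs."""
    errors = []
    if not name.strip():
        errors.append("Patient name is required")
    if not (isinstance(age, int) and 1 <= age <= 120):
        errors.append("Age must be a whole number between 1 and 120")
    if gender not in VALID_GENDERS:
        errors.append("Gender must be Male, Female, or Other")
    if not snps:
        errors.append("At least one SNP is required")  # assumed rule
    for rsid, allele in snps:
        if not RSID_PATTERN.match(rsid):
            errors.append(f"Invalid rsID: {rsid}")
        if allele not in VALID_ALLELES:
            errors.append(f"Invalid allele for {rsid}: {allele}")
    return errors

print(validate_submission("John Doe", 45, "Male", [("rs1234567", "A")]))
```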

Data Flow

  1. User Input: Patient information and SNP data submitted via React frontend

  2. API Request: Frontend sends POST request to /predict/ endpoint

  3. Patient Creation: Backend creates new patient record in PostgreSQL

  4. Feature Extraction:

    • SNPs are matched against the database
    • Features extracted: odds_ratio, risk_allele_freq, chromosome, position
    • Weighted averaging applied based on odds ratios

  5. Input Logging: Raw input and derived features stored in MongoDB

  6. Model Inference:

    • Disease-specific model loaded from MongoDB
    • Features passed through StandardScaler and Logistic Regression
    • Risk probability calculated

  7. Prediction Storage: Result saved to PostgreSQL predictions table

  8. Response: Risk probability and metadata returned to frontend for visualization
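
The feature extraction in step 4 could look something like the sketch below. It shows one possible reading of "weighted averaging based on odds ratios": each matched SNP contributes to every feature in proportion to its odds ratio. The exact weighting used in the backend is not documented here, so treat the formula and the sample values as assumptions.

```python
FEATURES = ("odds_ratio", "risk_allele_freq", "chromosome", "position")

def aggregate_features(matched_snps):
    """Collapse per-SNP rows into one feature vector, weighting each SNP by its odds ratio."""
    total_weight = sum(snp["odds_ratio"] for snp in matched_snps)
    if total_weight == 0:
        raise ValueError("no matched SNPs with a non-zero odds ratio")
    return [
        sum(snp["odds_ratio"] * snp[name] for snp in matched_snps) / total_weight
        for name in FEATURES
    ]

# Made-up rows standing in for SNPs matched against the database.
matched = [
    {"odds_ratio": 1.0, "risk_allele_freq": 0.2, "chromosome": 2, "position": 100},
    {"odds_ratio": 3.0, "risk_allele_freq": 0.4, "chromosome": 10, "position": 200},
]
vector = aggregate_features(matched)
print(vector)
```

The resulting vector has one value per feature regardless of how many SNPs the user submits, which is what lets a fixed-width model accept a variable number of inputs.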


Verification and Testing

Verify Backend Connectivity

curl -X POST http://localhost:8000/init/

Expected response: JSON object with model metadata confirming successful initialization.
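
If you want to check the response programmatically, a small helper like the following could compare it against the fields shown in the API Reference. The helper name is hypothetical, and the sample text simply repeats the documented example response.

```python
import json

# Fields shown in the POST /init/ example response in this README.
EXPECTED_INIT_KEYS = {"message", "model_id", "trained_at", "disease_id"}

def missing_init_keys(response_text):
    """Return the expected fields that are absent from an /init/ response body."""
    payload = json.loads(response_text)
    return sorted(EXPECTED_INIT_KEYS - set(payload))

sample = """
{
    "message": "Disease-specific model metadata loaded",
    "model_id": "60a7b8c9d1e2f3a4b5c6d7e8",
    "trained_at": "2024-01-15T10:30:00Z",
    "disease_id": 1
}
"""
print(missing_init_keys(sample))
```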

End-to-End Test

  1. Submit a prediction through the frontend interface
  2. Verify data in PostgreSQL:
    • Check patient table for new patient record
    • Check prediction table for prediction result
  3. Verify data in MongoDB:
    • Check genetic_inputs collection for input log

Check MongoDB Collections

Using MongoDB Atlas Data Explorer:

  • Database: genetic_risk
  • Collections: model_metadata, genetic_inputs

Troubleshooting

Common Issues

Connection Refused on Backend

  • Ensure the virtual environment is activated
  • Verify the .env file contains correct database credentials
  • Check that PostgreSQL and MongoDB services are accessible

Model Not Found Error

  • Run train_pipeline.py to initialize the database and train models
  • Run train_disease_models.py to create disease-specific models

CORS Errors in Frontend

  • The backend is configured to allow all origins (*)
  • Ensure the frontend is pointing to http://localhost:8000

Database Connection Timeout

  • Check network connectivity to Neon and MongoDB Atlas
  • Verify that your current IP address is allowed under Network Access in MongoDB Atlas

Missing Dependencies

  • Backend: pip install -r requirements.txt
  • Frontend: npm install

License

This project is developed for educational and research purposes. Please ensure compliance with applicable data protection regulations when handling genetic information.


Acknowledgments

  • GWAS Catalog for providing the foundational genetic association data
  • Neon for serverless PostgreSQL hosting
  • MongoDB Atlas for document database services
  • scikit-learn for machine learning infrastructure
