A full-stack web application that leverages machine learning and genomic data to predict an individual's risk of developing specific genetic diseases. The system integrates a FastAPI backend with a React frontend, utilizing a hybrid database architecture combining PostgreSQL for structured relational data and MongoDB for flexible document storage.
- Overview
- System Architecture
- Technology Stack
- Database Schema
- Project Structure
- Installation and Setup
- Configuration
- Training the Machine Learning Model
- Running the Application
- API Reference
- Usage Guide
- Data Flow
- Verification and Testing
- Troubleshooting
- License
The Genetic Disease Risk Prediction System is designed to assess an individual's genetic predisposition to various diseases based on their Single Nucleotide Polymorphism (SNP) data. By analyzing genetic variants and their associated risk factors, the system provides probabilistic risk assessments for conditions including:
- Type 2 Diabetes
- Coronary Artery Disease
- Hypertension
- Breast Cancer
The application employs a Logistic Regression model trained on Genome-Wide Association Study (GWAS) data, with disease-specific models that account for the unique genetic markers associated with each condition.
The application follows a three-tier architecture:
+-------------------+     +-------------------+     +-------------------+
|                   |     |                   |     |                   |
|  React Frontend   |<--->|  FastAPI Backend  |<--->|     Databases     |
|                   |     |                   |     |                   |
+-------------------+     +-------------------+     +-------------------+
                                    |
                                    v
                          +-------------------+
                          |                   |
                          |    ML Pipeline    |
                          |  (scikit-learn)   |
                          |                   |
                          +-------------------+
- Frontend: React-based single-page application providing user input forms and risk visualization
- Backend: FastAPI server handling RESTful API endpoints, data processing, and ML inference
- PostgreSQL (Neon): Stores structured data including patients, genes, SNPs, diseases, and predictions
- MongoDB (Atlas): Stores model metadata, serialized trained models, and raw genetic input logs
- ML Pipeline: scikit-learn-based Logistic Regression models with StandardScaler preprocessing
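To make the ML Pipeline component concrete, here is a minimal sketch of the arithmetic that standard scaling followed by logistic regression performs at inference time. It is written in plain Python rather than scikit-learn so it runs without dependencies; the function names and all numeric values are hypothetical and are not taken from the project's `ml.py`.

```python
import math

def standardize(features, means, stds):
    """Apply z-score scaling, (x - mean) / std, feature by feature."""
    return [(x - m) / s for x, m, s in zip(features, means, stds)]

def predict_probability(features, means, stds, weights, bias):
    """Logistic regression inference: sigmoid(w . z + b) on scaled features."""
    z = standardize(features, means, stds)
    logit = sum(w * v for w, v in zip(weights, z)) + bias
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical scaler statistics and coefficients, for illustration only.
p = predict_probability(
    features=[1.3, 0.25],
    means=[1.1, 0.30],
    stds=[0.2, 0.1],
    weights=[0.8, -0.5],
    bias=0.0,
)
print(round(p, 3))
```

In the real application the means, standard deviations, weights, and bias would come from the fitted scaler and model rather than being written by hand.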
Backend
| Component | Technology | Version |
|---|---|---|
| Web Framework | FastAPI | 0.115.0 |
| ASGI Server | Uvicorn | 0.30.1 |
| ORM | SQLAlchemy | 2.0.32 |
| PostgreSQL Driver | psycopg2-binary | 2.9.9 |
| MongoDB Driver | PyMongo | 4.8.0 |
| Data Processing | Pandas | 2.2.2 |
| Numerical Computing | NumPy | 2.1.2 |
| Machine Learning | scikit-learn | 1.5.1 |
| Environment Management | python-dotenv | 1.0.1 |
Frontend
| Component | Technology | Version |
|---|---|---|
| JavaScript Library | React | 18.2.0 |
| Build Tool | react-scripts | 5.0.1 |
Database Services
| Component | Service |
|---|---|
| SQL Database | Neon (PostgreSQL) |
| NoSQL Database | MongoDB Atlas |
Patient
| Column | Type | Description |
|---|---|---|
| patient_id | Integer (PK) | Unique patient identifier |
| name | String(100) | Patient name |
| age | Integer | Patient age |
| gender | String(10) | Patient gender |
Gene
| Column | Type | Description |
|---|---|---|
| gene_id | Integer (PK) | Unique gene identifier |
| gene_name | String (Unique) | Gene symbol |
| description | String | Gene description |
SNP
| Column | Type | Description |
|---|---|---|
| snp_id | Integer (PK) | Unique SNP identifier |
| rsid | String(20) (Unique) | Reference SNP ID (e.g., rs123456) |
| gene_id | Integer (FK) | Associated gene |
| chromosome | Integer | Chromosome number |
| position | Integer | Genomic position |
| risk_allele | String(1) | Risk-associated allele (A, T, G, or C) |
| odds_ratio | Float | Association strength |
| risk_allele_freq | Float | Population frequency of risk allele |
| p_value | Float | Statistical significance |
| is_significant | Boolean | Whether SNP meets significance threshold |
Disease
| Column | Type | Description |
|---|---|---|
| disease_id | Integer (PK) | Unique disease identifier |
| name | String(100) | Disease name |
| description | String | Disease description |
DiseaseSNP
| Column | Type | Description |
|---|---|---|
| id | Integer (PK) | Unique mapping identifier |
| disease_id | Integer (FK) | Associated disease |
| snp_id | Integer (FK) | Associated SNP |
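The DiseaseSNP table is a many-to-many mapping between diseases and SNPs. The sketch below shows how such a mapping can be queried, using an in-memory SQLite database as a stand-in for PostgreSQL. The table names (`disease`, `snp`, `disease_snp`), the helper `snps_for_disease`, and the odds ratio values are assumptions for illustration; only the column names come from the schema above.

```python
import sqlite3

# In-memory stand-in for the PostgreSQL schema (a subset of the columns).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE disease (disease_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE snp (snp_id INTEGER PRIMARY KEY, rsid TEXT UNIQUE, odds_ratio REAL);
CREATE TABLE disease_snp (
    id INTEGER PRIMARY KEY,
    disease_id INTEGER REFERENCES disease(disease_id),
    snp_id INTEGER REFERENCES snp(snp_id)
);
INSERT INTO disease VALUES (1, 'Type 2 Diabetes'), (2, 'Hypertension');
INSERT INTO snp VALUES (1, 'rs1234567', 1.3), (2, 'rs7654321', 1.1);
INSERT INTO disease_snp (disease_id, snp_id) VALUES (1, 1), (1, 2), (2, 2);
""")

def snps_for_disease(connection, disease_name):
    """Return the rsIDs mapped to a disease through the disease_snp table."""
    rows = connection.execute(
        """
        SELECT s.rsid
        FROM snp AS s
        JOIN disease_snp AS ds ON ds.snp_id = s.snp_id
        JOIN disease AS d ON d.disease_id = ds.disease_id
        WHERE d.name = ?
        ORDER BY s.rsid
        """,
        (disease_name,),
    ).fetchall()
    return [rsid for (rsid,) in rows]

print(snps_for_disease(conn, "Type 2 Diabetes"))
```

Because one SNP can map to several diseases, `rs7654321` appears in the results for both diseases in this toy data.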
Prediction
| Column | Type | Description |
|---|---|---|
| pred_id | Integer (PK) | Unique prediction identifier |
| patient_id | Integer (FK) | Associated patient |
| disease_id | Integer (FK) | Predicted disease |
| probability | Float | Risk probability (0.0 - 1.0) |
| risk_level | String(20) | Risk classification (High/Low) |
| timestamp | DateTime | Prediction timestamp |
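The `risk_level` column holds a High/Low label derived from `probability`. The cutoff is not stated in this document, so the sketch below assumes a threshold of 0.5 purely for illustration; the function name `classify_risk` is also hypothetical.

```python
def classify_risk(probability, threshold=0.5):
    """Map a risk probability to a High/Low label.

    The 0.5 default is an assumption, not the project's documented cutoff.
    """
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0.0 and 1.0")
    return "High" if probability >= threshold else "Low"

print(classify_risk(0.67))
print(classify_risk(0.20))
```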
model_metadata
- Stores serialized ML models (pickled)
- Feature names and training metadata
- Disease-specific model identifiers
- Creation timestamps
genetic_inputs
- Raw SNP input logs from user submissions
- Derived feature vectors
- Patient and model associations
- Audit trail for predictions
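As a rough illustration of what a `model_metadata` document might contain, the sketch below pickles a stand-in model into a dictionary shaped like a MongoDB document and restores it again. The field names (`model_blob`, `feature_names`, `created_at`) and both helper functions are hypothetical; the actual document layout is defined by the backend code, and no database connection is made here.

```python
import pickle
from datetime import datetime, timezone

def build_model_document(model, feature_names, disease_id):
    """Bundle a pickled model with its metadata (hypothetical field names)."""
    return {
        "disease_id": disease_id,
        "feature_names": list(feature_names),
        "model_blob": pickle.dumps(model),
        "created_at": datetime.now(timezone.utc),
    }

def load_model(document):
    """Restore the model object from a stored document."""
    return pickle.loads(document["model_blob"])

# A plain dict stands in for a trained scikit-learn estimator.
stand_in_model = {"weights": [0.8, -0.5], "bias": 0.0}
doc = build_model_document(
    stand_in_model, ["odds_ratio", "risk_allele_freq"], disease_id=1
)
restored = load_model(doc)
```

Unpickling executes code embedded in the payload, so documents should only be loaded from a database that you control.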
Genetic_database/
|
|-- backend/
| |-- main.py # FastAPI application with endpoints
| |-- database.py # Database connection configuration
| |-- models.py # SQLAlchemy ORM models
| |-- crud.py # Database CRUD operations
| |-- etl.py # Data extraction, transformation, loading
| |-- ml.py # Core ML training and prediction logic
| |-- train_pipeline.py # One-time training pipeline script
| |-- train_disease_models.py # Disease-specific model training
| |-- populate_disease_snp.py # Disease-SNP mapping population
| |-- diagnose_model.py # Model diagnostics and debugging
| |-- retrain.py # Model retraining utilities
| |-- requirements.txt # Python dependencies
| |-- .env # Environment variables (not committed)
|
|-- frontend/
| |-- public/ # Static assets
| |-- src/
| | |-- App.js # Main application component
| | |-- App.css # Global styles
| | |-- index.js # Application entry point
| | |-- components/
| | | |-- Home.js # Landing page component
| | | |-- inputForm.js # Patient and SNP input form
| | | |-- results.js # Prediction results display
| | | |-- RiskVisualization.js # Risk visualization charts
| | | |-- ComputationFlow.js # ML computation explanation
| | | |-- Glossary.js # Genetic terms glossary
| | | |-- TermTooltip.js # Interactive term tooltips
| | | |-- SQLViewer.js # SQL query visualization
| | |-- pages/
| | | |-- RiskPredictor.js # Main prediction page
| | | |-- GlossaryPage.js # Glossary page
| | | |-- BlogPage.js # Educational blog content
| | |-- styles/ # Component-specific CSS files
| |-- package.json # Node.js dependencies
|
|-- cleaned_gwas.csv # GWAS training dataset
|-- cleaned_gwas_test_split.csv # Test dataset split
|-- .gitignore # Git ignore rules
|-- README.md # Project documentation
- Python 3.12 or higher
- Node.js 18 or higher
- npm (included with Node.js)
- PostgreSQL database (Neon recommended)
- MongoDB database (Atlas recommended)
git clone https://github.com/your-repository/genetic-risk-prediction.git
cd genetic-risk-prediction
PostgreSQL (Neon)
- Create an account at https://console.neon.tech
- Create a new project and branch (default settings are acceptable)
- Copy the connection string in the format:
postgresql://USER:PASSWORD@HOST/DBNAME?sslmode=require
MongoDB Atlas
- Create an account at https://www.mongodb.com/atlas/database
- Create a free M0 cluster
- Create a database user with read/write permissions
- Configure network access (allow your current IP address)
- Copy the SRV connection URI:
mongodb+srv://USER:PASSWORD@CLUSTER.mongodb.net/?retryWrites=true&w=majority
cd backend
# Create and activate virtual environment
python -m venv venv
# Windows
venv\Scripts\activate
# macOS/Linux
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt
cd frontend
# Install dependencies
npm install
Create a .env file in the backend/ directory with the following variables:
SQLDATABASE_URL=postgresql+psycopg2://USER:PASSWORD@HOST/DBNAME?sslmode=require
NoSQLDATABASE_URL=mongodb+srv://USER:PASSWORD@CLUSTER.mongodb.net/?retryWrites=true&w=majority
Replace the placeholder values with your actual database credentials.
Important Notes:
- The SQL URL must include +psycopg2 for the SQLAlchemy PostgreSQL driver
- Ensure sslmode=require is appended for secure Neon connections
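The two notes above can be checked before starting the backend. The following is a small sketch, using only the standard library, that inspects a connection string for the `+psycopg2` driver suffix and the `sslmode=require` parameter. The helper name `check_sql_url` is hypothetical and is not part of the project.

```python
from urllib.parse import parse_qs, urlsplit

def check_sql_url(url):
    """Return a list of problems found in a SQLAlchemy PostgreSQL URL."""
    parts = urlsplit(url)
    problems = []
    if parts.scheme != "postgresql+psycopg2":
        problems.append("scheme should be postgresql+psycopg2")
    if parse_qs(parts.query).get("sslmode") != ["require"]:
        problems.append("sslmode=require is missing")
    return problems

good = "postgresql+psycopg2://USER:PASSWORD@HOST/DBNAME?sslmode=require"
bad = "postgresql://USER:PASSWORD@HOST/DBNAME"
print(check_sql_url(good))
print(check_sql_url(bad))
```

An empty list means both conditions hold; the plain Neon connection string copied from the console would be flagged for the missing driver suffix.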
The training pipeline performs the following operations:
- Creates database tables using SQLAlchemy ORM
- Loads and cleans the GWAS dataset
- Splits data into training (80%) and test (20%) sets
- Populates PostgreSQL with gene, SNP, and disease data
- Trains disease-specific Logistic Regression models
- Stores model metadata and serialized models in MongoDB
cd backend
# Activate virtual environment
venv\Scripts\activate # Windows
source venv/bin/activate # macOS/Linux
# Run the training pipeline
python train_pipeline.py --dataset ../cleaned_gwas.csv
After the initial pipeline, train models for each disease:
python train_disease_models.py
This creates separate models for each disease in the database, allowing for disease-specific risk predictions.
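To illustrate the 80/20 split listed among the pipeline operations, here is a minimal, dependency-free sketch. The pipeline itself may well rely on scikit-learn's splitting utilities; the function `split_rows`, the seed, and the toy data below are assumptions used only to show the ratio.

```python
import random

def split_rows(rows, test_fraction=0.2, seed=42):
    """Shuffle rows reproducibly and split them into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# 100 integers stand in for the rows of the GWAS dataset.
train, test = split_rows(range(100))
print(len(train), len(test))
```

Fixing the seed keeps the split reproducible, which matters when the held-out rows are written to a separate file for later evaluation.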
cd backend
# Activate virtual environment
venv\Scripts\activate # Windows
source venv/bin/activate # macOS/Linux
# Start the FastAPI server
uvicorn main:app --host 0.0.0.0 --port 8000 --reload
The backend API will be accessible at http://localhost:8000.
cd frontend
# Start React development server
npm start
The frontend application will be accessible at http://localhost:3000.
GET /init/
Verifies that the ML model is loaded and ready for predictions.
Response:
{
"message": "Disease-specific model metadata loaded",
"model_id": "60a7b8c9d1e2f3a4b5c6d7e8",
"trained_at": "2024-01-15T10:30:00Z",
"disease_id": 1
}
POST /predict/
Generates a disease risk prediction for a patient.
Request Body:
{
"patient": {
"name": "John Doe",
"age": 45,
"gender": "Male"
},
"disease_name": "Type 2 Diabetes",
"snps": [
{"rsid": "rs1234567", "allele": "A"},
{"rsid": "rs7654321", "allele": "G"}
]
}
Response:
{
"patient_id": 1,
"disease": "Type 2 Diabetes",
"disease_id": 1,
"risk_probability": 0.67,
"model_id": "60a7b8c9d1e2f3a4b5c6d7e8"
}
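For scripted access without the frontend, the request body above can be assembled with the Python standard library. This is a sketch under the assumption that the backend runs at http://localhost:8000 and accepts the JSON shape shown; the helper `build_predict_request` is hypothetical, and the network call is left commented out so the snippet runs without a server.

```python
import json
import urllib.request

def build_predict_request(base_url, patient, disease_name, snps):
    """Build a POST request for the /predict/ endpoint with a JSON body."""
    payload = {"patient": patient, "disease_name": disease_name, "snps": snps}
    return urllib.request.Request(
        url=f"{base_url}/predict/",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

predict_request = build_predict_request(
    "http://localhost:8000",
    {"name": "John Doe", "age": 45, "gender": "Male"},
    "Type 2 Diabetes",
    [{"rsid": "rs1234567", "allele": "A"}, {"rsid": "rs7654321", "allele": "G"}],
)

# To send it against a running backend:
# with urllib.request.urlopen(predict_request) as response:
#     result = json.load(response)
#     print(result["risk_probability"])
```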
- Access the Application: Navigate to http://localhost:3000 in your web browser.
- Navigate to Risk Predictor: From the home page, click on the prediction tool.
- Enter Patient Information:
  - Patient name
  - Age (1-120)
  - Gender (Male/Female/Other)
- Select Target Disease: Choose from the available diseases (Type 2 Diabetes, Coronary Artery Disease, Hypertension, or Breast Cancer).
- Input Genetic Variants:
  - Enter SNP rsID values (e.g., rs123456)
  - Specify the patient's allele for each SNP (A, T, G, or C)
  - Add multiple SNPs using the "Add Another SNP" button
- Submit for Prediction: Click "Predict Risk" to generate the risk assessment.
- Review Results: The application displays:
  - Risk probability (0-100%)
  - Risk classification (High/Low)
  - Visualization of the computation process
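The input constraints in the steps above (age between 1 and 120, alleles limited to A, T, G, or C, rsIDs of the form `rs` followed by digits) can be expressed as a small validation routine. This is only a sketch: the function name, the error messages, and the exact rsID pattern are assumptions, and the project's form may validate differently.

```python
import re

RSID_PATTERN = re.compile(r"rs\d+")
VALID_ALLELES = {"A", "T", "G", "C"}

def validate_input(age, snps):
    """Return a list of human-readable validation errors (empty if valid)."""
    errors = []
    if not 1 <= age <= 120:
        errors.append("age must be between 1 and 120")
    for snp in snps:
        if not RSID_PATTERN.fullmatch(snp["rsid"]):
            errors.append(f"invalid rsID: {snp['rsid']}")
        if snp["allele"] not in VALID_ALLELES:
            errors.append(f"invalid allele: {snp['allele']}")
    return errors

print(validate_input(45, [{"rsid": "rs123456", "allele": "A"}]))
print(validate_input(0, [{"rsid": "123456", "allele": "X"}]))
```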
- User Input: Patient information and SNP data submitted via React frontend
- API Request: Frontend sends POST request to /predict/ endpoint
- Patient Creation: Backend creates new patient record in PostgreSQL
- Feature Extraction:
  - SNPs are matched against the database
  - Features extracted: odds_ratio, risk_allele_freq, chromosome, position
  - Weighted averaging applied based on odds ratios
- Input Logging: Raw input and derived features stored in MongoDB
- Model Inference:
  - Disease-specific model loaded from MongoDB
  - Features passed through StandardScaler and Logistic Regression
  - Risk probability calculated
- Prediction Storage: Result saved to PostgreSQL predictions table
- Response: Risk probability and metadata returned to frontend for visualization
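The feature extraction step mentions weighted averaging based on odds ratios without spelling out the formula. One plausible reading, shown here strictly as an assumption, is that each feature is averaged across the matched SNPs with the SNP's odds ratio as its weight. The function name and the sample values are hypothetical.

```python
FEATURES = ("odds_ratio", "risk_allele_freq", "chromosome", "position")

def weighted_feature_vector(matched_snps):
    """Average each feature across SNPs, weighting every SNP by its odds ratio.

    This weighting scheme is an illustrative assumption, not the project's
    documented formula.
    """
    total_weight = sum(snp["odds_ratio"] for snp in matched_snps)
    if total_weight == 0:
        raise ValueError("no matched SNPs with a non-zero odds ratio")
    return [
        sum(snp["odds_ratio"] * snp[name] for snp in matched_snps) / total_weight
        for name in FEATURES
    ]

# Hypothetical matched SNP records.
matched = [
    {"odds_ratio": 1.0, "risk_allele_freq": 0.2, "chromosome": 2, "position": 100},
    {"odds_ratio": 3.0, "risk_allele_freq": 0.4, "chromosome": 10, "position": 200},
]
vector = weighted_feature_vector(matched)
print(vector)
```

With these numbers the second SNP carries three times the weight of the first, so every averaged feature sits closer to its values.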
curl http://localhost:8000/init/
Expected response: JSON object with model metadata confirming successful initialization.
- Submit a prediction through the frontend interface
- Verify data in PostgreSQL:
  - Check patient table for new patient record
  - Check prediction table for prediction result
- Verify data in MongoDB:
  - Check genetic_inputs collection for input log
Using MongoDB Atlas Data Explorer:
- Database: genetic_risk
- Collections: model_metadata, genetic_inputs
Connection Refused on Backend
- Ensure the virtual environment is activated
- Verify the .env file contains correct database credentials
- Check that PostgreSQL and MongoDB services are accessible
Model Not Found Error
- Run train_pipeline.py to initialize the database and train models
- Run train_disease_models.py to create disease-specific models
CORS Errors in Frontend
- The backend is configured to allow all origins (*)
- Ensure the frontend is pointing to http://localhost:8000
Database Connection Timeout
- Check network connectivity to Neon and MongoDB Atlas
- Verify IP address is whitelisted in MongoDB Atlas
Missing Dependencies
- Backend: pip install -r requirements.txt
- Frontend: npm install
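For the connection timeout case, a quick TCP reachability check can separate network problems from credential problems. The sketch below uses only the standard library. The helper `can_connect` is hypothetical, the host names in the usage comment are placeholders to be taken from your own connection strings, and the ports are the PostgreSQL and MongoDB defaults (5432 and 27017).

```python
import socket

def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example usage with placeholder hosts:
# can_connect("HOST", 5432)        # PostgreSQL (Neon)
# can_connect("CLUSTER-HOST", 27017)  # MongoDB Atlas node
```

A `False` result points to firewall rules, IP allow lists, or DNS rather than to the credentials in `.env`.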
This project is developed for educational and research purposes. Please ensure compliance with applicable data protection regulations when handling genetic information.
- GWAS Catalog for providing the foundational genetic association data
- Neon for serverless PostgreSQL hosting
- MongoDB Atlas for document database services
- scikit-learn for machine learning infrastructure