harishbabu2007/Drug-Target-Binding-Affinity-Prediction

Drug-Target Binding Affinity (DTBA) Prediction

Overview

This project implements a deep learning framework for predicting the binding affinity between drug molecules (ligands) and target proteins. The model uses a hybrid architecture: a Graph Neural Network (GNN) learns a representation of each molecule from its graph, and Evolutionary Scale Modeling (ESM) supplies embeddings of each protein sequence. Combining the two lets the predictor draw on both the structure of the small molecule and the sequence context of the protein target.

Key Features

  • Graph Neural Network (GNN) for Ligands: Utilizes TransformerConv layers from PyTorch Geometric to process molecular graphs, capturing complex atomic interactions and structural features.
  • ESM Protein Embeddings: Uses the pre-trained ESM-2 model esm2_t6_8M_UR50D to generate a fixed-size embedding for each protein sequence.
  • Hybrid Architecture: Concatenates ligand and protein representations to predict binding affinity values through a multi-layer perceptron (MLP) regressor.
  • Automated Preprocessing Pipeline: Includes scripts for validating SMILES strings, generating protein embeddings, and preparing datasets for training.
  • Comprehensive Evaluation: Reports standard regression metrics, namely Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R-squared (R²).

Project Structure

DTBA/
├── data/
│   ├── processed/          # Processed PyTorch Geometric data files
│   └── raw/                # Raw input data (e.g., Ki_bind.tsv)
├── preprocessing/
│   ├── dataset.py          # PyTorch Geometric dataset implementation
│   └── preprocess_drugs.py # Data cleaning and embedding generation script
├── results/                # Training visualizations and loss analysis
├── saved_models/           # Model checkpoints and best performing models
├── evaluate.py             # Script for model evaluation
├── model.py                # Neural network architecture definition
├── train.py                # Main training loop and optimization
└── requirements.txt        # Project dependencies

Prerequisites

  • Python 3.8+
  • CUDA-enabled GPU (recommended for training)

Installation

  1. Clone the repository:

    git clone <repository-url>
    cd DTBA
  2. Create a virtual environment (optional but recommended):

    python -m venv venv
    source venv/bin/activate  # On Windows use `venv\Scripts\activate`
  3. Install dependencies:

    pip install -r requirements.txt

    Note: You may need to install PyTorch and PyTorch Geometric specifically for your CUDA version. Please refer to the PyTorch website for specific instructions.

Usage

1. Data Preprocessing

Before training, the raw data must be processed to generate protein embeddings and validate molecular structures.

python preprocessing/preprocess_drugs.py

This script reads from data/raw/Ki_bind.tsv, filters invalid entries, generates ESM embeddings for proteins, and saves the train/test splits to data/raw/.
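
The filter-then-split step described above can be sketched roughly as follows. This is a minimal illustration, not the code in preprocess_drugs.py: the column names (smiles, sequence, ki), the 80/20 split, and the seed are all assumptions, and is_valid stands in for whatever SMILES check the script actually performs (with RDKit, for example, Chem.MolFromSmiles returns None for a string it cannot parse).

```python
import csv
import io
import random


def filter_and_split(tsv_text, is_valid, test_fraction=0.2, seed=0):
    """Drop rows that fail validation, then split the rest into train/test.

    Column names are hypothetical; adjust them to the real TSV header.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    kept = []
    for row in reader:
        try:
            affinity = float(row["ki"])
        except (TypeError, ValueError):
            continue  # missing or non-numeric affinity label
        if not row["sequence"] or not is_valid(row["smiles"]):
            continue  # empty protein sequence or rejected SMILES
        kept.append((row["smiles"], row["sequence"], affinity))

    # Shuffle with a fixed seed so the split is reproducible.
    random.Random(seed).shuffle(kept)
    n_test = int(len(kept) * test_fraction)
    return kept[n_test:], kept[:n_test]


# Toy input: one rejected SMILES and one bad label among usable rows.
toy_tsv = "smiles\tsequence\tki\n" + "".join(
    f"{smi}\tMKTAYIAK\t{ki}\n"
    for smi, ki in [("CCO", "1.5"), ("BAD", "2.0"), ("CCN", "n/a"),
                    ("CCC", "3.0"), ("CCCl", "4.0"), ("c1ccccc1", "5.0"),
                    ("CC(=O)O", "6.0")]
)

# Stand-in validator for the demo; a real pipeline would parse the SMILES.
train_rows, test_rows = filter_and_split(toy_tsv, is_valid=lambda s: s != "BAD")
print(len(train_rows), len(test_rows))
```

Passing the validator in as a function keeps the sketch independent of any chemistry library; the ESM embedding step is left out here because it requires the pre-trained model.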

2. Training the Model

To train the model, run the training script. This will initialize the model, load the dataset, and begin the training process.

python train.py

The script will:

  • Load the processed dataset.
  • Train the model for the specified number of epochs.
  • Save the best model to saved_models/best_model.pth.
  • Generate loss analysis plots in the results/ directory.
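
Saving the best model usually amounts to tracking the lowest loss seen so far. The sketch below assumes the criterion is a per-epoch validation loss (this README does not say which loss train.py monitors) and uses a hypothetical save_fn callback in place of the real checkpoint call.

```python
def track_best(epoch_losses, save_fn):
    """Call save_fn(epoch) whenever the loss improves; return best epoch and loss."""
    best_loss = float("inf")
    best_epoch = None
    for epoch, loss in enumerate(epoch_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            # In PyTorch this is typically torch.save(model.state_dict(), path).
            save_fn(epoch)
    return best_epoch, best_loss


# Hypothetical loss values for five epochs.
saved_at = []
best_epoch, best_loss = track_best([0.90, 0.70, 0.75, 0.60, 0.65], saved_at.append)
print(best_epoch, best_loss, saved_at)
```

Because the checkpoint is overwritten only on improvement, the saved file reflects the best epoch rather than the last one.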

3. Evaluation

To evaluate the trained model on the test set:

python evaluate.py

This prints the MSE, RMSE, and R² score of the model's predictions on the test set.

Model Architecture

The MainNetwork class in model.py defines the architecture:

  1. Ligand Branch:

    • Input: Molecular graph (node features, edge indices, edge attributes).
    • Layers: Two TransformerConv layers with multi-head attention, followed by global mean and max pooling.
    • Output: A fixed-size vector representation of the ligand.
  2. Protein Branch:

    • Input: Pre-computed ESM embeddings (320 dimensions).
    • Processing: Directly passed to the concatenation stage (extensible for further processing).
  3. Interaction Module:

    • The ligand and protein vectors are concatenated.
    • Passed through a sequence of Linear layers with BatchNorm, ReLU activation, and Dropout for regularization.
    • Final Output: Predicted binding affinity (scalar value).
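
To make the vector sizes concrete, the readout and concatenation steps can be illustrated in plain Python. This is a sketch under assumptions, not the code in model.py: the hidden size of 4 and the toy node features are made up, and joining the mean-pooled and max-pooled vectors by concatenation is an assumption about how the two poolings are combined. In PyTorch Geometric the equivalent per-graph operations are global_mean_pool and global_max_pool.

```python
def global_mean_max_pool(node_features):
    """Collapse a (num_nodes x hidden) matrix into one vector of length 2 * hidden."""
    columns = list(zip(*node_features))  # one tuple per feature channel
    mean_part = [sum(col) / len(col) for col in columns]
    max_part = [max(col) for col in columns]
    return mean_part + max_part


# Toy molecule: 3 atoms, hidden size 4 (both hypothetical).
node_features = [
    [0.0, 1.0, 2.0, 3.0],
    [2.0, 1.0, 0.0, -1.0],
    [4.0, 1.0, 1.0, 1.0],
]
ligand_vec = global_mean_max_pool(node_features)

# Stand-in for a 320-dimensional ESM-2 embedding of the protein.
protein_vec = [0.0] * 320

# The interaction module receives the concatenation of both vectors.
joint_vec = ligand_vec + protein_vec
print(len(ligand_vec), len(joint_vec))
```

Pooling is what makes the ligand vector fixed-size: the number of atoms varies per molecule, but the pooled vector's length depends only on the hidden size, so it can be concatenated with the 320-dimensional protein embedding and fed to Linear layers.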

Results

Training progress and loss curves are automatically saved in the results/ folder. Key metrics tracked include:

  • MSE (Mean Squared Error): The average squared difference between predicted and measured binding affinities. Lower is better.
  • R² Score: The proportion of variance in the measured binding affinities that the model's predictions explain. A score of 1 is a perfect fit, and 0 is no better than always predicting the mean.
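
For reference, the metrics can be computed from scratch as below. This is a minimal sketch using the standard definitions, not the code in evaluate.py, and the affinity values are invented for illustration.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, and R² for two equal-length sequences of numbers."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R² is undefined when all true values are identical")
    mse = ss_res / n
    return {"mse": mse, "rmse": math.sqrt(mse), "r2": 1.0 - ss_res / ss_tot}


# Hypothetical measured and predicted affinities.
y_true = [5.0, 6.0, 7.0, 8.0]
y_pred = [5.5, 6.0, 6.5, 8.0]
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

RMSE is the square root of MSE, so it is expressed in the same units as the affinity values themselves.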

License

This project is licensed under the MIT License. See the LICENSE file for details.
