FABind: Fast and Accurate Protein-Ligand Binding 🔥

News

🔥Apr 01 2024: Release our new version, FABind+, with enhanced performance and sampling ability. Check the FABind+ paper on arXiv. The corresponding code will be released soon.

🔥Mar 02 2024: Fix a bug in inference on custom complexes caused by an incorrectly loaded parameter and the rdkit version. We also normalize the atom order in the mol file written during post-optimization. See more details in this commit.

🔥Jan 01 2024: Upload trained checkpoint into Google Drive.

🔥Nov 09 2023: Move trained checkpoint from Github to HuggingFace.

🔥Oct 10 2023: The trained FABind model and processed dataset are released!

🔥Oct 11 2023: Initial commits. More code, the pre-trained model, and data are coming soon.

Overview

This repository contains the source code for the NeurIPS 2023 paper "FABind: Fast and Accurate Protein-Ligand Binding". FABind achieves accurate docking performance with high speed compared to recent baselines. If you have questions, feel free to open an issue or contact Qizhi Pei at qizhipei@ruc.edu.cn, Kaiyuan Gao at im_kai@hust.edu.cn, or Lijun Wu at lijuwu@microsoft.com. We are happy to hear from you!

Setup Environment

This is an example of how to set up a working conda environment to run the code. In this example, we use CUDA 11.3, torch==1.12.0, and rdkit==2021.03.4. To make sure the PyG packages are installed correctly, we install them directly from prebuilt wheels (.whl).

As the trained model checkpoint is included in the HuggingFace repository with git-lfs, you need to install git-lfs to pull the data correctly.

sudo apt-get install git-lfs # run this if you have not installed git-lfs
git lfs install
git clone https://github.com/QizhiPei/FABind.git --recursive
conda create --name fabind python=3.8
conda activate fabind
conda install pytorch==1.12.0 torchvision==0.13.0 torchaudio==0.12.0 cudatoolkit=11.3 -c pytorch
pip install https://data.pyg.org/whl/torch-1.12.0%2Bcu113/torch_cluster-1.6.0%2Bpt112cu113-cp38-cp38-linux_x86_64.whl
pip install https://data.pyg.org/whl/torch-1.12.0%2Bcu113/torch_scatter-2.1.0%2Bpt112cu113-cp38-cp38-linux_x86_64.whl
pip install https://data.pyg.org/whl/torch-1.12.0%2Bcu113/torch_sparse-0.6.15%2Bpt112cu113-cp38-cp38-linux_x86_64.whl 
pip install https://data.pyg.org/whl/torch-1.12.0%2Bcu113/torch_spline_conv-1.2.1%2Bpt112cu113-cp38-cp38-linux_x86_64.whl
pip install https://data.pyg.org/whl/torch-1.12.0%2Bcu113/pyg_lib-0.2.0%2Bpt112cu113-cp38-cp38-linux_x86_64.whl
pip install torch-geometric==2.4.0
pip install torchdrug==0.1.2 torchmetrics==0.10.2 tqdm mlcrate pyarrow accelerate Bio lmdb fair-esm tensorboard
pip install rdkit-pypi==2021.03.4
conda install -c conda-forge openbabel # install openbabel to save .mol2 file and .sdf file at the same time
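
After running the commands above, it can be useful to confirm that the pinned versions actually ended up in the environment. The following is a minimal sketch using only the Python standard library; the helper names (`normalize`, `installed_version`, `check_environment`) are hypothetical and not part of FABind, and the list of packages checked is just the subset pinned in the commands above.

```python
from importlib import metadata

# Versions pinned in the setup commands above (distribution name -> version).
EXPECTED = {
    "torch": "1.12.0",
    "torch-geometric": "2.4.0",
    "torchdrug": "0.1.2",
    "torchmetrics": "0.10.2",
    "rdkit-pypi": "2021.03.4",
}


def normalize(version):
    """Drop a local suffix such as '+cu113' and compare numeric parts as ints,
    so that '2021.03.4' and '2021.3.4' are treated as the same version."""
    base = version.split("+")[0]
    return tuple(int(p) if p.isdigit() else p for p in base.split("."))


def installed_version(name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def check_environment(expected=EXPECTED, lookup=installed_version):
    """Return a list of human-readable mismatches (empty if all pins match)."""
    problems = []
    for name, wanted in expected.items():
        found = lookup(name)
        if found is None:
            problems.append(f"{name}: not installed (expected {wanted})")
        elif normalize(found) != normalize(wanted):
            problems.append(f"{name}: found {found}, expected {wanted}")
    return problems


if __name__ == "__main__":
    for line in check_environment() or ["all pinned packages match"]:
        print(line)
```

Run it inside the activated `fabind` environment; any line it prints other than the success message names a package that is missing or at a different version.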

Data

The PDBbind 2020 dataset can be downloaded from http://www.pdbbind.org.cn. We then follow the same data processing as TankBind.

We also provide the processed dataset on Zenodo. If you want to train FABind from scratch or reproduce the FABind results, you can:

Download the dataset from Zenodo.
Unzip the file and place it in data_path such that data_path=pdbbind2020.

Generate the ESM2 embeddings for the proteins

Before training or evaluation, you need to first generate the ESM2 embeddings for the proteins based on the preprocessed data above.

data_path=pdbbind2020

python fabind/tools/generate_esm2_t33.py ${data_path}

The ESM2 embeddings will then be saved at ${data_path}/dataset/processed/esm2_t33_650M_UR50D.lmdb.
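
Since both training and evaluation depend on this file, a quick existence check before launching a long job may save time. The sketch below only rebuilds the output path stated above with `pathlib`; the function names are hypothetical helpers, not part of the FABind code.

```python
from pathlib import Path


def esm2_lmdb_path(data_path):
    """Build the location where the ESM2 embeddings are expected to be saved."""
    return Path(data_path) / "dataset" / "processed" / "esm2_t33_650M_UR50D.lmdb"


def embeddings_ready(data_path):
    """Return True if the LMDB output already exists on disk."""
    return esm2_lmdb_path(data_path).exists()


if __name__ == "__main__":
    target = esm2_lmdb_path("pdbbind2020")
    status = "found" if target.exists() else "missing, run generate_esm2_t33.py first"
    print(f"{target}: {status}")
```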

Model

The pre-trained model is located at ckpt/best_model.bin and is downloaded automatically when you clone this repository with --recursive.

You can also manually download the pre-trained model from Hugging Face or Google Drive.
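
If git-lfs was not installed when the repository was cloned, the checkpoint path may contain only a small Git LFS pointer file instead of the real weights. Pointer files are short text files that begin with the line `version https://git-lfs.github.com/spec/v1`, so this case can be detected before trying to load the model. The helper below is a hypothetical sketch, not part of FABind.

```python
# First line of every Git LFS pointer file.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if the file is a Git LFS pointer stub rather than real content."""
    with open(path, "rb") as handle:
        head = handle.read(len(LFS_POINTER_PREFIX))
    return head == LFS_POINTER_PREFIX
```

For example, if `is_lfs_pointer("ckpt/best_model.bin")` returns True, the weights were not pulled; in that case, install git-lfs and pull again, or download the checkpoint manually as described above.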

Evaluation

data_path=pdbbind2020
ckpt_path=ckpt/best_model.bin

python fabind/test_fabind.py \
    --batch_size 4 \
    --data-path $data_path \
    --resultFolder ./results \
    --exp-name test_exp \
    --ckpt $ckpt_path \
    --local-eval

Inference on Custom Complexes

The following scripts run inference from SMILES strings and their corresponding PDB files.

The script below runs these steps in sequence:

Given the SMILES in index_csv, preprocess the molecules using num_threads worker processes and save each processed molecule to {save_pt_dir}/mol.
Given the protein PDB files in pdb_file_dir, preprocess the protein information and save it to {save_pt_dir}/processed_protein.pt.
Load the model checkpoint from ckpt_path and save the predicted molecule conformations to output_dir. A CSV file in output_dir maps each SMILES string to its corresponding filename.

index_csv=../inference_examples/example.csv
pdb_file_dir=../inference_examples/pdb_files
num_threads=1
save_pt_dir=../inference_examples/temp_files
save_mols_dir=${save_pt_dir}/mol
ckpt_path=../ckpt/best_model.bin
output_dir=../inference_examples/inference_output

cd fabind

echo "======  preprocess molecules  ======"
python inference_preprocess_mol_confs.py --index_csv ${index_csv} --save_mols_dir ${save_mols_dir} --num_threads ${num_threads}

echo "======  preprocess proteins  ======"
python inference_preprocess_protein.py --pdb_file_dir ${pdb_file_dir} --save_pt_dir ${save_pt_dir}

echo "======  inference begins  ======"
python fabind_inference.py \
    --ckpt ${ckpt_path} \
    --batch_size 4 \
    --seed 128 \
    --test-gumbel-soft \
    --redocking \
    --post-optim \
    --write-mol-to-file \
    --sdf-output-path-post-optim ${output_dir} \
    --index-csv ${index_csv} \
    --preprocess-dir ${save_pt_dir} \
    --sdf-to-mol2
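
The same three steps can also be driven from Python, which may be convenient when inference is part of a larger workflow. The sketch below only assembles the script names and flags shown in the shell script above and runs them in order, stopping at the first failure; `build_commands` and `run_pipeline` are hypothetical helper names, and the working directory is assumed to be the `fabind` folder of the repository.

```python
import subprocess
import sys


def build_commands(index_csv, pdb_file_dir, save_pt_dir, ckpt_path, output_dir,
                   num_threads=1, batch_size=4, seed=128):
    """Return the three inference steps as argv lists, in execution order."""
    save_mols_dir = f"{save_pt_dir}/mol"
    py = sys.executable
    return [
        # Step 1: preprocess molecules from the SMILES listed in index_csv.
        [py, "inference_preprocess_mol_confs.py",
         "--index_csv", index_csv,
         "--save_mols_dir", save_mols_dir,
         "--num_threads", str(num_threads)],
        # Step 2: preprocess the protein PDB files.
        [py, "inference_preprocess_protein.py",
         "--pdb_file_dir", pdb_file_dir,
         "--save_pt_dir", save_pt_dir],
        # Step 3: run the model and write the predicted conformations.
        [py, "fabind_inference.py",
         "--ckpt", ckpt_path,
         "--batch_size", str(batch_size),
         "--seed", str(seed),
         "--test-gumbel-soft", "--redocking", "--post-optim",
         "--write-mol-to-file",
         "--sdf-output-path-post-optim", output_dir,
         "--index-csv", index_csv,
         "--preprocess-dir", save_pt_dir,
         "--sdf-to-mol2"],
    ]


def run_pipeline(commands, cwd="fabind"):
    """Run each step from the given directory; raise on the first failure."""
    for argv in commands:
        subprocess.run(argv, cwd=cwd, check=True)
```

A call such as `run_pipeline(build_commands("../inference_examples/example.csv", "../inference_examples/pdb_files", "../inference_examples/temp_files", "../ckpt/best_model.bin", "../inference_examples/inference_output"))` from the repository root mirrors the shell script, with paths given relative to the `fabind` directory.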

Re-training

data_path=pdbbind2020
# write the default accelerate settings
python -c "from accelerate.utils import write_basic_config; write_basic_config(mixed_precision='no')"
# "accelerate launch" runs the experiment on multiple GPUs if available
accelerate launch fabind/main_fabind.py \
    --batch_size 3 \
    -d 0 \
    -m 5 \
    --data-path $data_path \
    --label baseline \
    --addNoise 5 \
    --resultFolder ./results \
    --use-compound-com-cls \
    --total-epochs 500 \
    --exp-name train_tmp \
    --coord-loss-weight 1.0 \
    --pair-distance-loss-weight 1.0 \
    --pair-distance-distill-loss-weight 1.0 \
    --pocket-cls-loss-weight 1.0 \
    --pocket-distance-loss-weight 0.05 \
    --lr 5e-05 --lr-scheduler poly_decay \
    --distmap-pred mlp \
    --hidden-size 512 --pocket-pred-hidden-size 128 \
    --n-iter 8 --mean-layers 4 \
    --refine refine_coord \
    --coordinate-scale 5 \
    --geometry-reg-step-size 0.001 \
    --rm-layernorm --add-attn-pair-bias --explicit-pair-embed --add-cross-attn-layer \
    --noise-for-predicted-pocket 0 \
    --clip-grad \
    --random-n-iter \
    --pocket-idx-no-noise \
    --pocket-cls-loss-func bce \
    --use-esm2-feat
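
The five `--*-loss-weight` flags in the command above suggest that the training objective combines several loss terms. As an illustration only, and assuming the terms are combined as a plain weighted sum (this README does not state the exact formula), the weighting could be sketched as follows. The dictionary keys and the `total_loss` function are hypothetical names, not identifiers from the FABind code.

```python
# Weights copied from the training command above; the key names are hypothetical.
LOSS_WEIGHTS = {
    "coord": 1.0,                  # --coord-loss-weight
    "pair_distance": 1.0,          # --pair-distance-loss-weight
    "pair_distance_distill": 1.0,  # --pair-distance-distill-loss-weight
    "pocket_cls": 1.0,             # --pocket-cls-loss-weight
    "pocket_distance": 0.05,       # --pocket-distance-loss-weight
}


def total_loss(losses, weights=LOSS_WEIGHTS):
    """Weighted sum of per-term losses (assumed form, see the note above).

    Raises KeyError if a loss term has no configured weight.
    """
    return sum(weights[name] * value for name, value in losses.items())
```

Under this assumption, the pocket-distance term contributes at one twentieth of the scale of the other four terms.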

About

Citations

@article{pei2023fabind,
  title={FABind: Fast and Accurate Protein-Ligand Binding},
  author={Pei, Qizhi and Gao, Kaiyuan and Wu, Lijun and Zhu, Jinhua and Xia, Yingce and Xie, Shufang and Qin, Tao and He, Kun and Liu, Tie-Yan and Yan, Rui},
  journal={arXiv preprint arXiv:2310.06763},
  year={2023}
}

@inproceedings{pei2023fabind,
  title={{FAB}ind: Fast and Accurate Protein-Ligand Binding},
  author={Qizhi Pei and Kaiyuan Gao and Lijun Wu and Jinhua Zhu and Yingce Xia and Shufang Xie and Tao Qin and Kun He and Tie-Yan Liu and Rui Yan},
  booktitle={Thirty-seventh Conference on Neural Information Processing Systems},
  year={2023},
  url={https://openreview.net/forum?id=PnWakgg1RL}
}

@misc{gao2024fabind,
  title={FABind+: Enhancing Molecular Docking through Improved Pocket Prediction and Pose Generation},
  author={Kaiyuan Gao and Qizhi Pei and Jinhua Zhu and Tao Qin and Kun He and Lijun Wu},
  journal={arXiv preprint arXiv:2403.20261},
  year={2024}
}

Acknowledgments

We appreciate EquiBind, TankBind, E3Bind, DiffDock and other related works for their open-sourced contributions.
