Implementation of DiffDock-PP: Rigid Protein-Protein Docking with Diffusion Models in PyTorch (ICLR 2023 - MLDD Workshop) by Mohamed Amine Ketata*, Cedrik Laue*, Ruslan Mammadov*, Hannes Stärk, Menghua Wu, Gabriele Corso, Céline Marquet, Regina Barzilay, Tommi S. Jaakkola.
DiffDock-PP is a new approach to rigid-body protein-protein docking. It combines a diffusion generative model (the score model), which learns to translate and rotate unbound protein structures into their bound conformations, with a confidence model, which learns to rank the poses generated by the score model and select the best one.
If you encounter any problem with the code, feel free to open an issue or contact mohamedamine.ketata@tum.de.
First, clone this repository:
git clone https://github.com/ketatam/DiffDock-PP.git
Then, create a virtual environment to install the dependencies. We use python=3.10.8, but other recent Python versions should work as well.
conda create -n diffdock_pp python=3.10.8
conda activate diffdock_pp
Now, you can install the required packages:
conda install pytorch=1.13.0 pytorch-cuda=11.6 -c pytorch -c nvidia
# install compatible pytorch geometric in this order WITH versions
pip install --no-cache-dir torch-scatter==2.0.9 torch-sparse==0.6.15 torch-cluster torch-spline-conv torch-geometric -f https://data.pyg.org/whl/torch-1.13.0+cu116.html
pip install numpy dill tqdm pyyaml pandas biopandas scikit-learn biopython e3nn wandb tensorboard tensorboardX matplotlib
Note that the code was tested on Ubuntu 22.04 using NVIDIA RTX A6000 GPUs.
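As a quick sanity check after installation, the following sketch reports which of the packages listed above can be located by Python. The helper name check_packages is hypothetical and not part of this repository. The import names (for example yaml for pyyaml, sklearn for scikit-learn, Bio for biopython) are the usual module names of those packages. The check only tests that each module can be found, not that its version matches.

```python
import importlib.util

# Map from the package name used at install time to the module name used in `import`.
REQUIRED = {
    "pytorch": "torch",
    "torch-scatter": "torch_scatter",
    "torch-sparse": "torch_sparse",
    "torch-cluster": "torch_cluster",
    "torch-spline-conv": "torch_spline_conv",
    "torch-geometric": "torch_geometric",
    "numpy": "numpy",
    "dill": "dill",
    "tqdm": "tqdm",
    "pyyaml": "yaml",
    "pandas": "pandas",
    "biopandas": "biopandas",
    "scikit-learn": "sklearn",
    "biopython": "Bio",
    "e3nn": "e3nn",
    "wandb": "wandb",
    "tensorboard": "tensorboard",
    "tensorboardX": "tensorboardX",
    "matplotlib": "matplotlib",
}


def check_packages(modules):
    """Return {package_name: bool}, True if the module can be located."""
    return {
        pkg: importlib.util.find_spec(mod) is not None
        for pkg, mod in modules.items()
    }


if __name__ == "__main__":
    for pkg, found in check_packages(REQUIRED).items():
        print(f"{pkg:20s} {'ok' if found else 'MISSING'}")
```

Run it inside the activated diffdock_pp environment; any package reported as MISSING can be reinstalled with the commands above.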
To get the DIPS dataset, you can either follow the steps stated in the EquiDock repo to download the raw data and process it to prepare the protein pairs for docking, or you can directly download the processed files (2.6GB):
curl -L -o datasets/DIPS/dips.zip "https://www.dropbox.com/s/sqknqofy58nlosh/DIPS.zip?dl=0"
unzip datasets/DIPS/dips.zip -d datasets/DIPS/pairs_pruned
This should create the folder datasets/DIPS/pairs_pruned, which contains around 962 folders corresponding to the different protein pairs present in DIPS.
The data split that we used can be found in datasets/DIPS/data_file.csv.
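To confirm that the archive was extracted where the code expects it, a minimal sketch like the following counts the folders under datasets/DIPS/pairs_pruned. The function name count_subfolders is hypothetical and not part of this repository, and the script assumes it is run from the repository root.

```python
from pathlib import Path


def count_subfolders(root):
    """Count the immediate subdirectories of `root` (0 if `root` is missing)."""
    root = Path(root)
    if not root.is_dir():
        return 0
    return sum(1 for entry in root.iterdir() if entry.is_dir())


if __name__ == "__main__":
    # Path taken from the instructions above; run from the repository root.
    n = count_subfolders("datasets/DIPS/pairs_pruned")
    print(f"Found {n} folders in datasets/DIPS/pairs_pruned (expected around 962)")
```

A count of 0 usually means the zip file was extracted to a different location or the download did not complete.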
We also support using the DB5.5 dataset, which can be downloaded from:
https://zlab.umassmed.edu/benchmark/
However, in this work we only focused on DIPS and therefore only release models that were trained on DIPS.
Our code supports multi-GPU training to accelerate the development of new models. To train a new score model from scratch, simply run:
sh src/train.sh
In this bash file, you can specify the experimental setup such as the number and IDs of GPUs, batch size, etc. You also need to specify a config file that defines the parameters related to the data, model, training and inference.
The parameters that we used to train our score model can be found in config/dips_esm.yaml. This file specifies, among other things, the path to the data folder, the model hyperparameters, and the training and inference configuration.
Note that the very first run will take longer, as it processes the data and caches it for future runs. Also note that you need to set up a WandB account to log the experiment results, or change the logger to "tensorboard".
As described in the paper, the confidence model is used to rank multiple poses generated by the score model based on predicted confidence scores. As such, it is a classification network that is trained to predict the quality of the poses generated by the score model.
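The ranking step can be illustrated with a minimal sketch. It is not the repository's implementation: the function rank_poses, the pose identifiers, and the scores are all hypothetical. It only shows the idea of ordering candidate poses by a predicted confidence score and keeping the top one.

```python
def rank_poses(poses, confidences):
    """Return (pose, score) pairs sorted from highest to lowest confidence."""
    if len(poses) != len(confidences):
        raise ValueError("need exactly one confidence score per pose")
    return sorted(zip(poses, confidences), key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical pose identifiers and made-up scores, for illustration only.
    ranked = rank_poses(["pose_0", "pose_1", "pose_2"], [0.12, 0.87, 0.45])
    best_pose, best_score = ranked[0]
    print(f"selected {best_pose} with confidence {best_score}")
```

In the actual pipeline the scores would come from the trained confidence model rather than from a fixed list.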
To train the confidence model, you first need to create its training dataset by generating multiple samples using the score model:
sh src/generate_samples.sh
Then, you can start the actual training by running:
sh src/train_confidence.sh
We provide the weights of our score model trained for 170 epochs on DIPS in checkpoints/large_model_dips.
We also provide the weights of our trained confidence model in checkpoints/confidence_model_dips.
To run inference on the validation or test set, run:
sh src/inference.sh
Similarly to training, you can specify all necessary configurations in the bash file and in the config files. The default configuration with the provided trained score and confidence models allows you to reproduce the numbers in the paper.
Note that if you want to test our models on your own dataset, the easiest way is to use the DB5Loader class defined in src/data/data_train_utils.py and name your PDB files {PDB_ID}_l_b.pdb and {PDB_ID}_r_b.pdb for the ligand and the receptor, respectively. For illustration, check out the script src/db5_inference.sh and the corresponding config file config/single_pair_inference.yaml, which run inference on a single pair located in datasets/single_pair_dataset.
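The naming convention above can be applied with a small helper such as the following sketch. The function add_pair and the example paths are hypothetical and not part of this repository. Whether the loader expects the files directly in the dataset folder or in a subfolder is an assumption here, so compare the result with the layout of datasets/single_pair_dataset before running inference.

```python
import shutil
from pathlib import Path


def add_pair(dataset_dir, pdb_id, ligand_pdb, receptor_pdb):
    """Copy a ligand/receptor pair into `dataset_dir` under the expected names."""
    dataset_dir = Path(dataset_dir)
    dataset_dir.mkdir(parents=True, exist_ok=True)
    ligand_dst = dataset_dir / f"{pdb_id}_l_b.pdb"    # ligand file name
    receptor_dst = dataset_dir / f"{pdb_id}_r_b.pdb"  # receptor file name
    shutil.copyfile(ligand_pdb, ligand_dst)
    shutil.copyfile(receptor_pdb, receptor_dst)
    return ligand_dst, receptor_dst


# Example call with hypothetical paths (uncomment and adapt to your files):
# add_pair("datasets/my_dataset", "1ABC", "raw/1abc_ligand.pdb", "raw/1abc_receptor.pdb")
```

The original files are copied rather than moved, so the source structures stay untouched.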
To visualize the predictions of the model, add the flags --visualization_path path/to/visualization/folder and --visualize_n_val_graphs NUMBER_OF_COMPLEXES_TO_VISUALIZE to the inference.sh script. It will then save the protein complex structure at each time step of the reverse diffusion process as .pdb files that you can visualize using, e.g., PyMOL.
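After a run with visualization enabled, a sketch like the following lists the saved .pdb files so that you can check what was written before opening them in a viewer. The function list_pdb_files is hypothetical and not part of this repository, and no assumption is made about how the files are named or nested inside the folder.

```python
from pathlib import Path


def list_pdb_files(folder):
    """Return all .pdb files under `folder` (searched recursively), sorted by path."""
    folder = Path(folder)
    if not folder.is_dir():
        return []
    return sorted(folder.rglob("*.pdb"))


if __name__ == "__main__":
    # Same placeholder path as in the flag above; replace it with your folder.
    for path in list_pdb_files("path/to/visualization/folder"):
        print(path)
```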
@article{ketata2023diffdock,
title={DiffDock-PP: Rigid Protein-Protein Docking with Diffusion Models},
author={Ketata, Mohamed Amine and Laue, Cedrik and Mammadov, Ruslan and St{\"a}rk, Hannes and Wu, Menghua and Corso, Gabriele and Marquet, C{\'e}line and Barzilay, Regina and Jaakkola, Tommi S},
journal={arXiv preprint arXiv:2304.03889},
year={2023}
}
License: MIT