Pocket2Drug is an encoder-decoder deep neural network that predicts binding drugs given protein binding sites (pockets). The pocket graphs are generated using Graphsite. The encoder is a graph neural network, and the decoder is a recurrent neural network. The SELFIES molecule representation is used as the tokenization scheme instead of SMILES. The pipeline of Pocket2Drug is illustrated below:
If you find Pocket2Drug helpful, please cite our paper in your work :)
Pocket2Drug: An encoder-decoder deep neural network for the target-based drug design
Wentao Shi, Manali Singha, Gopal Srivastava, Limeng Pu, J. Ramanujam, and Michal Brylinski
Frontiers in Pharmacology: 587
- Install PyTorch:
conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
- Install PyTorch Geometric:
conda install pyg -c pyg
- Install BioPandas:
conda install biopandas -c conda-forge
- Install selfies:
pip install selfies
- Install RDKit:
conda install rdkit -c conda-forge
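After installation, a quick check like the one below can confirm that the packages are importable. This is a minimal sketch and not part of the Pocket2Drug repository: the helper name missing_modules is hypothetical, and the list assumes the usual import names for the packages above (torch, torch_geometric, biopandas, selfies, rdkit).

```python
import importlib.util

# Import names assumed for the packages installed above.
REQUIRED = ["torch", "torch_geometric", "biopandas", "selfies", "rdkit"]


def missing_modules(names):
    """Return the top-level module names in `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

The check only looks up the modules on the import path; it does not verify versions or CUDA support.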
All the related data can be downloaded here. After extraction, there will be two folders:
- pocket-data: files that contain information about the pockets. We will use the .mol2 files.
- protein-data: files that contain information about the proteins. We will use the .pops and .profile files.
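To confirm that the extracted folders contain the expected file types, a small script such as the following may help. It is only a sketch: the function name count_by_suffix is hypothetical, and because the internal layout of the folders is not described here, the search is recursive so that it works whether the files are nested or flat.

```python
from pathlib import Path


def count_by_suffix(folder, suffixes):
    """Count files under `folder` (searched recursively) for each suffix."""
    counts = {suffix: 0 for suffix in suffixes}
    for path in Path(folder).rglob("*"):
        if path.is_file() and path.suffix in counts:
            counts[path.suffix] += 1
    return counts


if __name__ == "__main__":
    # Folder names follow the description above; adjust to your extraction path.
    print(count_by_suffix("pocket-data", [".mol2"]))
    print(count_by_suffix("protein-data", [".pops", ".profile"]))
```

A count of zero for any suffix suggests that the archive was extracted to a different location than the one passed in.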
The training configuration is set in train.yaml. Set pocket_dir to the path of pocket-data, then set pop_dir and profile_dir to the path of protein-data. Set out_dir to the folder where you want to save the output. The remaining options are hyperparameters, and their names are self-explanatory. The script train.py trains the model on a 90%-10% split of the dataset, and you can specify which fold is used for validation:
python train.py -val_fold 0
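A 90%-10% split with a selectable validation fold suggests a 10-fold partition of the pockets. The sketch below illustrates that idea only. How train.py actually assigns pockets to folds is not described here, so the round-robin assignment, the default of 10 folds, and the function name split_by_fold are all assumptions.

```python
def split_by_fold(items, val_fold, num_folds=10):
    """Split `items` into (train, val), holding out one fold for validation.

    Assumption: items are assigned to folds round-robin by position.
    The real train.py may partition the pockets differently.
    """
    if not 0 <= val_fold < num_folds:
        raise ValueError(f"val_fold must be in [0, {num_folds - 1}]")
    train, val = [], []
    for index, item in enumerate(items):
        if index % num_folds == val_fold:
            val.append(item)
        else:
            train.append(item)
    return train, val


if __name__ == "__main__":
    # Hypothetical pocket names, used only to show the split sizes.
    pockets = [f"pocket_{i}" for i in range(100)]
    train, val = split_by_fold(pockets, val_fold=0)
    print(len(train), len(val))
```

Under these assumptions, changing val_fold selects a different tenth of the data for validation while the remaining nine tenths are used for training.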
In addition, you can use a pretrained RNN to initialize the decoder; the pretrained model can be found here. The RNN is pretrained on the ChEMBL dataset and can improve the performance of the model. I have written an example of pretraining the RNN here.
After training, the trained model will be saved in out_dir, and we can use it to sample molecules for the pockets in the validation fold:
python sample.py -batch_size 1024 -num_batches 2 -pocket_dir path_to_dataset_folder -popsa_dir path_to_pops_folder -profile_dir path_to_profile_folder -result_dir path_to_training_output_folder -fold 0
The model can also be used to sample molecules for unseen, user-defined pockets. Simply omit the -fold option, and the code will run on the specified input directories:
python sample.py -batch_size 1024 -num_batches 2 -pocket_dir path_to_dataset_folder -popsa_dir path_to_pops_folder -profile_dir path_to_profile_folder -result_dir path_to_training_output_folder
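The two sample.py invocations differ only in the -fold option, so a small wrapper can build either command line. In the sketch below the flag names are copied from the commands above, while the function name build_sample_command and its default values are hypothetical.

```python
import sys


def build_sample_command(pocket_dir, popsa_dir, profile_dir, result_dir,
                         batch_size=1024, num_batches=2, fold=None):
    """Build the argument list for sample.py; omit -fold when `fold` is None."""
    cmd = [
        sys.executable, "sample.py",
        "-batch_size", str(batch_size),
        "-num_batches", str(num_batches),
        "-pocket_dir", str(pocket_dir),
        "-popsa_dir", str(popsa_dir),
        "-profile_dir", str(profile_dir),
        "-result_dir", str(result_dir),
    ]
    if fold is not None:
        # Sample for the pockets of one validation fold.
        cmd += ["-fold", str(fold)]
    return cmd


if __name__ == "__main__":
    # Placeholder paths, as in the commands above.
    print(" ".join(build_sample_command(
        "path_to_dataset_folder", "path_to_pops_folder",
        "path_to_profile_folder", "path_to_training_output_folder")))
```

The returned list can be passed to subprocess.run(cmd, check=True) from the repository root to launch the sampling.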