DeepPocket is a 3D convolutional neural network framework for ligand binding site detection and segmentation from protein structures. This is the official open-source repository for the following paper:
Aggarwal, Rishal; Gupta, Akash; Chelur, Vineeth; Jawahar, C. V.; Priyakumar, U. Deva (2021): DeepPocket: Ligand Binding Site Detection and Segmentation using 3D Convolutional Neural Networks. ChemRxiv. Preprint. https://doi.org/10.26434/chemrxiv.14611146.v1
fpocket, PyTorch, libmolgrid, Biopython, and other frequently used Python packages
PDB files are first parsed to remove hetero atoms, then converted to "gninatypes" files, and finally collected into a "molcache" file for quicker input and model training with libmolgrid. More about "gninatypes" and "molcache" here.
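The hetero-atom removal step can be pictured with a minimal stand-in like the one below. The repository's own clean_pdb.py uses Biopython and handles more cases; this simplified sketch just drops HETATM records from raw PDB text.

```python
def strip_hetatm(pdb_text: str) -> str:
    """Return PDB text with HETATM records (waters, ligands, ions) removed.

    Simplified illustration of the hetero-atom removal step; the real
    clean_pdb.py in this repository works differently.
    """
    return "\n".join(
        line for line in pdb_text.splitlines()
        if not line.startswith("HETATM")
    )
```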
The cavity6.mol2 files provided by scPDB (and generated by VolSite for other datasets) are used as is; the "data_dir" argument in the training scripts has to point to their parent directory.
".types" files contain the prepared training data points: the first column is the class label, the next three columns are the pocket center coordinates (x, y, z), and the final columns list the molecule files required for that data point. All molecule files specified in a types file must be present either in the molcache or in the "data_dir".
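A types line can therefore be split into its three parts with a few lines of Python. The helper below is purely illustrative (it is not the repository's own parser, and the file names in the example are placeholders):

```python
def parse_types_line(line: str):
    """Split one ".types" line into (label, center, molecule files)."""
    fields = line.split()
    label = int(fields[0])                           # class label
    center = tuple(float(v) for v in fields[1:4])    # pocket center (x, y, z)
    mol_files = fields[4:]                           # molecule files for this point
    return label, center, mol_files

label, center, mols = parse_types_line(
    "1 12.5 -3.2 45.0 protein.gninatypes cavity6.mol2"
)
# label == 1, center == (12.5, -3.2, 45.0)
```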
Prepared types, molcache and saved model checkpoints can be downloaded here.
I know this may seem cryptic, so I have written down simple steps in the last section of this README that can be used to prepare a new dataset for training.
"predict.py" is a simple script that can be used for predicting binding sites from a .pdb file. It follows 6 steps namely:
- Hetero atom removal (clean_pdb)
- fpocket run
- Parsing fpocket output for candidate centers (get_centers)
- Creating gninatypes and types files for CNN input (types_and_gninatyper)
- Reranking candidate pockets according to CNN score (rank_pockets)
- Segmenting the shapes of the top-ranked pockets (segment_pockets)
Example usage of predict.py:
python predict.py -p protein.pdb -c first_model_fold1_best_test_auc_85001.pth.tar -s seg0_best_test_IOU_91.pth.tar -r 3
A description of each argument is given in the script.
If the input file is named protein.pdb, fpocket creates a protein_out/pockets directory. The CNN-ranked pockets are listed in the bary_centers_ranked.types file in that directory.
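The top-ranked centers can be read back from that file with a small helper like the one below. It assumes the ranked file keeps the types-style layout described earlier (a leading label/score column, then x, y, z); this is an illustrative sketch, not repository code.

```python
def top_centers(path, k=3):
    """Read the first k pocket centers (x, y, z) from a ranked .types file.

    Assumes each line starts with a label/score followed by the three
    center coordinates, as in the ".types" layout described above.
    """
    centers = []
    with open(path) as fh:
        for line in fh:
            fields = line.split()
            if len(fields) >= 4:
                centers.append(tuple(float(v) for v in fields[1:4]))
            if len(centers) == k:
                break
    return centers
```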
If you asked for segmented pockets ("-r"), the script will also output ".dx" files that can be visualised in PyMOL.
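For example, a segmented pocket map can be rendered as a surface from the PyMOL command line. The object names and contour level below are illustrative placeholders, not values produced by the script:

```
load protein.pdb
load pocket1.dx, pocket1_map
isosurface pocket1_surf, pocket1_map, 0.5
```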
We use wandb to track training performance. It's free and easy to use. If you want to avoid using wandb, simply comment out all lines that contain "wandb" in the training script.
Example usage of train.py:
python train.py -m model.py --train_types scPDB_train0.types --test_types scPDB_test0.types -i 200000 --train_recmolcache scPDB_new.molcache2 --test_recmolcache scPDB_new.molcache2 -r val0 -o /model_saves/val9 --base_lr 0.001 --solver Adam
A description of each argument is given in the script.
Example usage of train_segmentation.py:
python train_segmentation.py --train_types seg_scPDB_train9.types --test_types seg_scPDB_test9.types -d data/ --train_recmolcache scPDB_new.molcache2 --test_recmolcache scPDB_new.molcache2 -b 8 -o model_saves/seg9 -e 200 -r seg9
A description of each argument is given in the script.
I have written down steps below for preparing training data from a dataset like PDBbind; they can easily be adapted to other datasets by making appropriate changes to the file paths and file names in the scripts.
Steps for preparing training data:
- remove hetero atoms (clean_pdb.py)
- run fpocket through structures (fpocket -f *_protein.pdb)
- get candidate pocket centers for all structures (get_centers.py)
- create .gninatypes files for all structures (gninatype() in types_and_gninatyper.py)
- make train and test types (make_types.py)
- create molcache file for training (create_molcache2.py)
Example usage of create_molcache2:
python create_molcache2.py -c 4 --recmolcache scPDB_new.molcache2 -d data/scPDB/ scPDB_train0.types scPDB_test0.types
If you find this useful, please cite the paper mentioned above.