Accepted in the Journal of Computer-Aided Molecular Design. https://link.springer.com/article/10.1007/s10822-022-00448-3
The conda environment should include rdkit, obfit (Open Babel), sklearn, pandas, scipy, and pytorch.
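Before starting, it may help to confirm that the environment actually provides these dependencies. The following is a minimal sketch, not part of this repository: it assumes the usual import names (sklearn for scikit-learn, torch for pytorch) and assumes that obfit is available as a command on PATH once Open Babel is installed. The function name find_missing is hypothetical.

```python
import importlib.util
import shutil

# Assumed import names for the packages listed above
# (scikit-learn is imported as sklearn, pytorch as torch).
REQUIRED_MODULES = ["rdkit", "sklearn", "pandas", "scipy", "torch"]
# Assumption: obfit is installed as a command-line tool by Open Babel.
REQUIRED_TOOLS = ["obfit"]


def find_missing(modules=REQUIRED_MODULES, tools=REQUIRED_TOOLS):
    """Return (missing_modules, missing_tools) without importing the packages."""
    missing_modules = [m for m in modules if importlib.util.find_spec(m) is None]
    missing_tools = [t for t in tools if shutil.which(t) is None]
    return missing_modules, missing_tools


if __name__ == "__main__":
    mods, tools = find_missing()
    print("missing modules:", mods or "none")
    print("missing tools:", tools or "none")
```

Run it with the interpreter of the conda environment that will later be referenced by PATH_TO_PYTHON, so that the check reflects the environment the scripts will use.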
1 PDBbind data (re-docking)
- http://www.pdbbind.org.cn/
- Download both the "general set" and the "refined set" of the PDBbind data (v2019) and merge all the files into the folder pdbbind_files.
- Run "utilities/pdbbind_redo.py" to create the re-docking dataset from the PDBbind data.
- The pdbbind_redo.py file contains the Python path for the conda environment. Please modify PATH_TO_PYTHON accordingly.
- Assuming that it takes up to an hour to process 100 instances, the job will finish in less than 40 hours.
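The merge step above can be sketched as follows. This is only an illustration under assumptions: it assumes that each extracted set is a directory whose entries (for example, one folder per complex) should end up side by side in pdbbind_files, and the source directory names in the usage comment are hypothetical placeholders for wherever the two archives were extracted. The function name merge_sets is also hypothetical.

```python
import shutil
from pathlib import Path


def merge_sets(source_dirs, target_dir):
    """Copy every entry of each source directory into target_dir.

    Entries that already exist in target_dir are skipped, so an entry that
    appears in both sets is copied only once. Returns the number of entries copied.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    copied = 0
    for source in source_dirs:
        for entry in sorted(Path(source).iterdir()):
            destination = target / entry.name
            if destination.exists():
                continue  # already copied from an earlier set
            if entry.is_dir():
                shutil.copytree(entry, destination)
            else:
                shutil.copy2(entry, destination)
            copied += 1
    return copied


# Hypothetical usage; replace the two source paths with the extracted archives:
# merge_sets(["path/to/general_set", "path/to/refined_set"], "pdbbind_files")
```

Skipping existing entries means the first source listed takes precedence when both sets contain an entry with the same name.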
2 chembl_bace data (cross-docking, "bace_chembl_cd")
- https://www.ebi.ac.uk/chembl/target_report_card/CHEMBL4822/
- One can download the IC50 values of the target protein BACE (ChEMBL ID: CHEMBL4822) or start from the "BACE_IC50.csv" file.
- Run "make_data_smilecomp.py", "smile_comp2.py", "conformer_gen_BACE.py", and "alignedSubdir.py" in this order.
- Some of the files contain the Python path for the conda environment. Please modify PATH_TO_PYTHON accordingly.
- Copy all the contents of the "truth" folder into "alignedsubset".
- Copy "BACE_IC50.csv" into "alignedsubset".
- Copy "similar_pdbid_info_bace.tsv" into "alignedsubset".
- Move the "alignedsubset" directory into "src2" and rename it to "chembl_bace".
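The copy and move steps above can also be scripted. The sketch below is an assumption about how one might automate them, not code from this repository; the function name assemble_dataset is hypothetical, and it assumes that all paths are relative to the directory in which the preprocessing scripts were run. It requires Python 3.8 or later for dirs_exist_ok.

```python
import shutil
from pathlib import Path


def assemble_dataset(aligned_dir, truth_dir, extra_files, src_dir, new_name):
    """Copy truth contents and extra files into aligned_dir, then move it to src_dir/new_name."""
    aligned = Path(aligned_dir)
    # Copy everything inside the truth folder (files and subfolders).
    for entry in Path(truth_dir).iterdir():
        if entry.is_dir():
            shutil.copytree(entry, aligned / entry.name, dirs_exist_ok=True)
        else:
            shutil.copy2(entry, aligned / entry.name)
    # Copy the individual data files next to the aligned structures.
    for extra in extra_files:
        shutil.copy2(extra, aligned / Path(extra).name)
    destination = Path(src_dir) / new_name
    if destination.exists():
        raise FileExistsError(f"{destination} already exists")
    Path(src_dir).mkdir(parents=True, exist_ok=True)
    # Moving to a non-existing destination renames the directory.
    shutil.move(str(aligned), str(destination))
    return destination


# Usage for the chembl_bace data, with the names from the steps above:
# assemble_dataset("alignedsubset", "truth",
#                  ["BACE_IC50.csv", "similar_pdbid_info_bace.tsv"],
#                  "src2", "chembl_bace")
```

The function refuses to overwrite an existing destination, so a previously prepared "src2/chembl_bace" is not silently replaced.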
3 BACE data (cross-docking, "bace_gc4_cd")
- https://drugdesigndata.org/about/grand-challenge-4
- Start with the "BACE_score_compounds_D3R_GC4_answers.csv" file
- Run "make_data_smilecomp.py", "smile_comp_bace.py", "conformer_generation_bace.py", and "alignedSubdir_bace.py" in this order.
- Some of the files contain the Python path for the conda environment. Please modify PATH_TO_PYTHON accordingly.
- Copy all the contents of the "truth" folder into "alignedBACEsubset".
- Copy "BACE_score_compounds_D3R_GC4_answers.csv" into "alignedBACEsubset".
- Copy "similar_pdbid_info2.tsv" into "alignedBACEsubset".
- Move the "alignedBACEsubset" directory into "src2" and rename it to "BACE".
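Since the four preprocessing scripts above must run in a fixed order, a small driver can help avoid continuing after a failed step. This is a minimal sketch under assumptions: it assumes the scripts take no command-line arguments and are started from the directory that contains them, and the function name run_in_order is hypothetical. It does not replace editing PATH_TO_PYTHON inside the scripts themselves.

```python
import subprocess
import sys

# Script names and order as listed in the steps above.
BACE_GC4_SCRIPTS = [
    "make_data_smilecomp.py",
    "smile_comp_bace.py",
    "conformer_generation_bace.py",
    "alignedSubdir_bace.py",
]


def run_in_order(scripts, python=sys.executable, runner=subprocess.run):
    """Run each script with the given interpreter, stopping at the first failure."""
    for script in scripts:
        print("running", script)
        # check=True makes subprocess.run raise CalledProcessError on a
        # non-zero exit code, so later scripts are not started.
        runner([python, script], check=True)


# Usage (assumes the current interpreter belongs to the conda environment):
# run_in_order(BACE_GC4_SCRIPTS)
```

The runner parameter exists only so that the function can be exercised without launching real processes; in normal use the default subprocess.run is sufficient.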
- pdbbind_files ==> Preparatory work is required (too big to upload all files)
- pdbbind_index ==> no preparation required
- src2/chembl_bace ==> Preparatory work is required (too big to upload all files)
- src2/BACE ==> Preparatory work is required (too big to upload all files)
- src2/CATS ==> unzip the compressed file
- src2/gc3_CATS ==> unzip the compressed file
- The name "src2" means that the code runs on the third GPU card (indices 0, 1, 2).
- Run "src2/train.py".
- In the first run, set the hyperparameter usePickledData to False.
- I used an ensemble for the final result. The code can be found in the utilities folder.
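The directory naming convention mentioned above ("src2" for the GPU at index 2) can be illustrated with a short sketch. This is not how train.py is known to select its device; it is only an assumed way to derive a device string from a directory name, and both function names are hypothetical.

```python
import re


def gpu_index_from_dirname(dirname):
    """Return the trailing integer of a name such as "src2", or None if there is none."""
    match = re.search(r"(\d+)$", dirname)
    return int(match.group(1)) if match else None


def device_string(dirname):
    """Map a directory name to a device string such as "cuda:2", falling back to "cpu"."""
    index = gpu_index_from_dirname(dirname)
    return "cpu" if index is None else f"cuda:{index}"
```

The resulting string (for example "cuda:2" for "src2") is in the form accepted by torch.device. On a machine with fewer GPUs, copying the directory under a different name or adjusting the device in the training code would be needed.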
Please feel free to ask me if anything is unclear.