We provide a large-scale benchmark that enables comprehensive evaluation of drug-target predictive models and helps researchers select computational strategies for pre-screening. The benchmark provides:
- an extensive multipartite network (0.95 million biomedical concepts, including 59 thousand drugs and 75 thousand targets, and 2.5 million associations, including 817 thousand drug-target associations), as well as drug-drug and protein-protein similarities based on drug chemical structures and gene sequences
- a comprehensive way of evaluating strategies under diverse scenarios (a total of 1300 tasks across two types of training/testing sampling strategies based on the drug-target space, as well as five types of validation strategies).
We provide three resources: 1) benchmark-raw (raw data downloaded directly from each source), 2) benchmark-data (networks, similarities, chemical structures and gene sequences, and mappings), and 3) benchmark-tasks (a total of 1300 tasks). Please follow the paths below to download the archived benchmark and data. Please note that the archives are encrypted because of the copyright policies of the source licenses (e.g., DrugBank). If you want to use the data for your study, please contact the author (Zong.nansu@mayo.edu) for decryption.
- benchmark raw (These are the raw data)
- benchmark data (These are the benchmark data, i.e., networks, similarity, chemical structure & gene sequence, and mapping)
- benchmark tasks (These are the evaluation tasks in the benchmark)
We also provide a data loader that automatically downloads the data from the remote server. The program unzips the files into local directories ("benchmark/data" for network data and "benchmark/tasks" for tasks) and loads all the files into memory.
- Java src/main/DataLoader.java "password"
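After the loader has run, a quick check that the two local directories were actually populated can save debugging time. The following is a minimal sketch, not part of the repository: the class name `BenchmarkDirCheck` and its method are hypothetical, and it only assumes the directory names "benchmark/data" and "benchmark/tasks" stated above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

// Hypothetical helper (not shipped with the benchmark): reports how many
// files ended up under the directories the data loader unzips into.
public class BenchmarkDirCheck {

    /** Returns the number of regular files under dir, or -1 if dir does not exist. */
    public static long countFiles(Path dir) {
        if (!Files.isDirectory(dir)) {
            return -1;
        }
        // Files.walk visits the directory tree recursively; close the stream when done.
        try (Stream<Path> paths = Files.walk(dir)) {
            return paths.filter(Files::isRegularFile).count();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Directory names taken from the loader description above.
        for (String name : new String[] {"benchmark/data", "benchmark/tasks"}) {
            long n = countFiles(Paths.get(name));
            System.out.println(name + ": " + (n < 0 ? "missing" : n + " file(s)"));
        }
    }
}
```

A count of 0 or a "missing" report suggests that the download or decryption step did not complete.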
We also provide the code for generating your own benchmark. To do so, you need to download the data from each database.
Data sources:
Data download (you can reuse the data shipped with the benchmark unless you want to replace them with a preferred version):
- https://bio2rdf.org (RDF data of DrugBank, GOA, Irefindex, KEGG, Linkedspl, OMIM, Pharmgkb, Pharmgkb-offside, Sider)
- http://networkrepository.com/bio-diseasome.php (RDF data of Disease network from diseasome)
- https://go.drugbank.com (drugbank.xml, which contains the drug-target associations from DrugBank)
Sample data (note: the sample datasets only demonstrate the data format; running the program on them will fail):
- data_sample/input/done (sample raw data)
- data_sample/output (sample network data and mappings)
- data_sample/output/datasets/experiment (sample evaluation tasks)
- data_sample/output/datasets/orignial (sample supplementary data to generate the network and evaluation tasks)
- Data space for the running program
- data_space/input (put raw data here, input path of Render_main.java)
- data_space/output (generate network data and mappings, output path of Render_main.java)
- data_space/output/datasets/orignial (put supplementary data, input path of Benchmark_main.java)
- data_space/output/datasets/orignial/drugbank.xml (copy drugbank.xml here)
- data_space/output/datasets/orignial/sequence.txt (copy gene sequence data here)
- data_space/output/datasets/orignial/smile.xml (copy chemical structure data here)
- Environment:
- Java 1.8
- Weka 3.8
- nxparser 2.2
- Maven install:
- Network generation:
- Java src/main/Render_main.java (processing the full data takes approximately 1 hour)
- Benchmark generation:
- Java src/main/Benchmark_main.java (processing the full data takes approximately 3 hours)
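Because benchmark generation runs for hours, it may be worth checking the data space layout listed above before starting Benchmark_main.java. The sketch below is an illustration only: the class `DataSpaceCheck` and its `missing` method are hypothetical and not part of the repository, and the paths are copied verbatim from the list above (including the directory name `orignial`).

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical pre-flight check (not shipped with the benchmark).
public class DataSpaceCheck {

    // Directories and supplementary files named in the data space layout.
    static final String[] REQUIRED_DIRS = {
        "input", "output", "output/datasets/orignial"
    };
    static final String[] REQUIRED_FILES = {
        "output/datasets/orignial/drugbank.xml",
        "output/datasets/orignial/sequence.txt",
        "output/datasets/orignial/smile.xml"
    };

    /** Returns the relative paths that are absent under the given data space root. */
    public static List<String> missing(Path dataSpace) {
        List<String> missing = new ArrayList<>();
        for (String dir : REQUIRED_DIRS) {
            if (!Files.isDirectory(dataSpace.resolve(dir))) {
                missing.add(dir + "/");
            }
        }
        for (String file : REQUIRED_FILES) {
            if (!Files.isRegularFile(dataSpace.resolve(file))) {
                missing.add(file);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Defaults to "data_space" in the working directory; pass another root as the first argument.
        Path root = Paths.get(args.length > 0 ? args[0] : "data_space");
        List<String> missing = missing(root);
        if (missing.isEmpty()) {
            System.out.println("Data space looks complete: " + root);
        } else {
            missing.forEach(p -> System.out.println("Missing: " + root.resolve(p)));
        }
    }
}
```

Note that `data_space/output` is produced by Render_main.java, so a missing `output/` directory before network generation is expected.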
We provide code for mapping drugs, targets, and diseases to the corresponding entities in BETA.
- Mappings for drugs based on the common IDs (UMLS, DBpedia, Wikipedia, KEGG, PubChem, Pharmgkb, Drugbank): refer to the examples in 'src/data/render/node/drugs/'
- Mappings for targets based on the common IDs (UniProtKB, HGNC, GenAtlas, OMIM): refer to the examples in 'src/data/render/node/targets/'
- Mappings for diseases based on the common IDs (DBpedia, UMLS, SNOMED CT, OMIM): refer to the examples in 'src/data/render/node/diseases/'
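Conceptually, mapping by common IDs amounts to a lookup keyed on an ID namespace plus an external identifier. The sketch below illustrates that idea only; it is not the code under 'src/data/render/node/'. The class `IdMappingSketch`, its methods, and the placeholder identifiers are all hypothetical, and the actual mapping file formats are not assumed here.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical illustration of ID-based entity mapping (not the repository's implementation).
public class IdMappingSketch {

    // Key: "<namespace>:<external id>", value: identifier of the benchmark entity.
    private final Map<String, String> index = new HashMap<>();

    /** Registers that an external ID in the given namespace refers to a benchmark entity. */
    public void add(String namespace, String externalId, String entity) {
        index.put(key(namespace, externalId), entity);
    }

    /** Looks up the benchmark entity for an external ID, if one was registered. */
    public Optional<String> lookup(String namespace, String externalId) {
        return Optional.ofNullable(index.get(key(namespace, externalId)));
    }

    // Namespaces are compared case-insensitively; surrounding whitespace in IDs is ignored.
    private static String key(String namespace, String externalId) {
        return namespace.toLowerCase(Locale.ROOT) + ":" + externalId.trim();
    }

    public static void main(String[] args) {
        IdMappingSketch drugs = new IdMappingSketch();
        // Placeholder values, not real identifiers or benchmark entities.
        drugs.add("Drugbank", "EXAMPLE_DRUG_ID", "example-drug-node");
        System.out.println(drugs.lookup("drugbank", "EXAMPLE_DRUG_ID").orElse("not mapped"));
        System.out.println(drugs.lookup("KEGG", "EXAMPLE_DRUG_ID").orElse("not mapped"));
    }
}
```

Keeping the namespace in the key avoids collisions when two databases happen to use the same identifier string for different entities.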
We regularly generate a new version of the benchmark from newly released raw data.
- Current version: 1.0
- Release date: October 10, 2021
- Specification: Bio2RDF release 4; DrugBank 5.1.7; ClinicalTrials as of March 2021
For help or questions about using the application, please contact Zong.nansu@mayo.edu.