Using QLever for PubChem
The following is a performance comparison between QLever and Virtuoso, carried out on an AMD Ryzen 9 7950X 16-Core processor with 128 GB RAM and 2 x 4 TB NVMe SSD (RAID 0). The performance is measured on the 16 use cases from https://pubchem.ncbi.nlm.nih.gov/docs/rdf-use-cases, with minor modifications [1]. Exactly the same queries were run for QLever and for Virtuoso. For both engines, the complete result was downloaded as TSV. The number of result rows was the same for each query. Before running each benchmark, disk caches were cleared using sudo bash -c "sync; sleep 5; echo 3 > /proc/sys/vm/drop_caches". Regarding the engine-internal caching, see below.
QLever was used in the version from 04.10.2024 (commit hash 77ea2c6), with PR #1537 merged to make it simpler to assign one graph per input file. Notable settings in the Qleverfile were MEMORY_FOR_QUERIES = 40G (use at most that much memory for query processing) and QUERY_PLANNING_BUDGET = 1500 (if there are more than this many connected subgraphs, switch to greedy query planning). QLever's internal query cache was cleared before each query.
qlever example-queries --query-ids 1-16 --download-or-count download --sparql-endpoint localhost:7023
Case 1: Protein targets inhibited by ... 1.84 s 17
Case 2: Pharmacological roles of SID4... 0.04 s 13
Case 3: NSAID compounds with molecula... 0.09 s 11
Case 4: NSAID substances according to... 0.24 s 13,867
Case 5: Protein targets inhibited by ... 5.00 s 71
Case 6: Substances inhibiting targets... 3.87 s 149
Case 7: Protein targets inhibited by ... 1.84 s 22
Case 8: Substances inhibiting protein... 1.98 s 9,981
Case 9: Pharmacological roles for sub... 3.83 s 149
Case 10: For each protein, the number... 5.95 s 10,183
Case 11: Top five diseases commonly m... 0.03 s 5
Case 12: Three most recent references... 0.33 s 3
Case 13: Top 20 genes co-mentioned wi... 0.03 s 20
Case 14: Top ten diseases co-occurrin... 0.03 s 10
Case 15: Chemicals commonly mentioned... 0.03 s 1,000
Case 16: Chemicals co-mentioned with ... 0.06 s 274
TOTAL for 16 queries 25.20 s 35,775
AVERAGE for 16 queries 1.57 s 2,236
Virtuoso was used in version 7.2.14-rc1.3240-pthreads as of Oct 5 2024 (a52d25b1d). Notable settings in virtuoso.ini were ThreadsPerQuery = 4, NumberOfBuffers = 5450000 and MaxDirtyBuffers = 4000000. Virtuoso's internal query cache was not cleared before each query because we are not aware of any reliable way to do that.
qlever example-queries --query-ids 1-16 --download-or-count download --sparql-endpoint localhost:8890/sparql
Case 1: Protein targets inhibited by ... 7.71 s 17
Case 2: Pharmacological roles of SID4... 0.04 s 13
Case 3: NSAID compounds with molecula... 71.93 s 11
Case 4: NSAID substances according to... 1.22 s 13,867
Case 5: Protein targets inhibited by ... 23.60 s 71
Case 6: Substances inhibiting targets... 0.67 s 149
Case 7: Protein targets inhibited by ... 1.44 s 22
Case 8: Substances inhibiting protein... 0.34 s 9,981
Case 9: Pharmacological roles for sub... 0.16 s 149
Case 10: For each protein, the number... 208.06 s 10,183
Case 11: Top five diseases commonly m... 1.35 s 5
Case 12: Three most recent references... 4.96 s 3
Case 13: Top 20 genes co-mentioned wi... 0.15 s 20
Case 14: Top ten diseases co-occurrin... 0.05 s 10
Case 15: Chemicals commonly mentioned... 0.31 s 1,000
Case 16: Chemicals co-mentioned with ... 21.56 s 274
TOTAL for 16 queries 343.53 s 35,775
AVERAGE for 16 queries 21.47 s 2,236
[1] For Cases 5, 6, 7, 8, 9, 16, the part before the final FILTER was put in { ... } due to a minor bug in QLever, which will be fixed soon. This change does not affect the query processing times for Virtuoso. You can inspect and try out the queries on https://qlever.cs.uni-freiburg.de/pubchem, or download them via https://qlever.cs.uni-freiburg.de/api/examples/pubchem.
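The totals and individual rows of the two listings above can be compared with a little arithmetic. The following is a minimal sketch, not part of the benchmark itself: the helper name speedup is hypothetical, and the numbers are copied from the listings. Keep in mind that Virtuoso's internal cache was not cleared between queries, and that some cases (for example Case 9) are faster on Virtuoso.

```shell
# Ratio of two running times, printed with one decimal place.
# Usage: speedup <seconds_a> <seconds_b>   (prints a / b)
# `speedup` is a hypothetical helper, not part of the qlever script.
speedup() {
  LC_ALL=C awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'
}

# Totals over all 16 queries: Virtuoso time divided by QLever time.
speedup 343.53 25.20
# Case 10: Virtuoso time divided by QLever time.
speedup 208.06 5.95
# Case 9, the other direction: QLever time divided by Virtuoso time.
speedup 3.83 0.16
```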
Install the qlever script following the instructions at https://github.com/ad-freiburg/qlever-control (this takes only a few minutes, and nothing needs to be compiled). Make sure that the qlever script is on your PATH and that you are in a fresh directory with no other content. Then run:
qlever setup-config pubchem
qlever get-data
qlever index
qlever start
qlever ui
The get-data command downloads the data and fixes it (in several of the IRIs, forbidden characters are not properly percent-encoded). This takes around 5 hours on an AMD Ryzen 9 with 16 cores and requires about 250 GB of disk space. The index command builds the index data structures needed by QLever. This also takes around 5 hours and requires around 1.5 TB of disk space. The start command starts the server, which is then up within seconds. The ui command starts the UI, which looks just like the UI of the public QLever SPARQL endpoint for PubChem at https://qlever.cs.uni-freiburg.de/pubchem. See the Qleverfile (created by qlever setup-config pubchem) for a more detailed description of some of the peculiarities of the PubChem dataset.
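The two preconditions for the setup (the qlever script on the PATH, and an empty working directory) can be checked before starting the multi-hour download. This is only a sketch: the function names dir_is_empty, qlever_on_path and preflight are hypothetical helpers and not part of qlever-control.

```shell
# Succeeds if the given directory (default: current directory) exists and
# contains no entries at all, hidden files included.
dir_is_empty() {
  [ -d "${1:-.}" ] && [ -z "$(ls -A "${1:-.}")" ]
}

# Succeeds if an executable named qlever can be found via the PATH.
qlever_on_path() {
  command -v qlever >/dev/null 2>&1
}

# Prints one warning per unmet precondition; it never runs qlever itself.
preflight() {
  qlever_on_path || echo "qlever not found on PATH"
  dir_is_empty   || echo "current directory is not empty"
}
```

Calling preflight in the intended working directory prints nothing when both preconditions hold.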
PubChem makes heavy use of alphanumeric identifiers like sio:CHEMINF_000339 (molecular entity name) or obo:CHEBI_15365 (acetylsalicylic acid) for its predicates and entities. The labels for these identifiers are not part of the PubChem datasets. We recommend adding them to the data by downloading the respective ontologies. Here is a command to do that:
mkdir -p rdf.ontologies && cut -d, -f3,4 <<EOT | while IFS=, read -r URL NAME; do echo "Downloading $URL -> $NAME ..."; curl --location --silent --remote-time --output "rdf.ontologies/$NAME" "$URL"; done
BAO - BioAssay Ontology,bao,http://www.bioassayontology.org/bao/bao_complete.owl,bao.rdf
BFO - Basic Formal Ontology,bfo,http://purl.obolibrary.org/obo/bfo.owl,bfo.rdf
BioPAX - biological pathway data,bp,http://www.biopax.org/release/biopax-level3.owl,bio-pax.rdf
CHEMINF - Chemical Information Ontology,cheminf,http://purl.obolibrary.org/obo/cheminf.owl,cheminf.rdf
ChEBI - Chemical Entities of Biological Interest,chebi,http://purl.obolibrary.org/obo/chebi.owl,chebi.rdf
CiTO,cito,http://purl.org/spar/cito.nt,cito.nt
DCMI Terms,dcterms,https://www.dublincore.org/specifications/dublin-core/dcmi-terms/dublin_core_terms.nt,dcterms.nt
FaBiO,fabio,http://purl.org/spar/fabio.nt,fabio.nt
GO - Gene Ontology,go,http://purl.obolibrary.org/obo/go.owl,go.rdf
IAO - Information Artifact Ontology,iao,http://purl.obolibrary.org/obo/iao.owl,iao.rdf
NCIt,ncit,http://purl.obolibrary.org/obo/ncit.owl,ncit.rdf
NDF-RT,ndfrt,https://data.bioontology.org/ontologies/NDF-RT/submissions/1/download?apikey=8b5b7825-538d-40e0-9e9e-5ab9274a9aeb,ndfrt.rdf
OBI - Ontology for Biomedical Investigations,obi,http://purl.obolibrary.org/obo/obi.owl,obi.rdf
OWL,owl,http://www.w3.org/2002/07/owl,owl.ttl
PDBo,pdbo,http://rdf.wwpdb.org/schema/pdbx-v40.owl,pdbo.rdf
PR - PRotein Ontology (PRO),pr,http://purl.obolibrary.org/obo/pr.owl,pr.rdf
RDF Schema,rdfs,https://www.w3.org/2000/01/rdf-schema,rdf-schema.ttl,rdfs.ttl
RDF,rdf,http://www.w3.org/1999/02/22-rdf-syntax-ns,22-rdf-syntax-ns.ttl,rdf.ttl
RO - Relation Ontology,ro,http://purl.obolibrary.org/obo/ro.owl,ro.rdf
SIO - Semanticscience Integrated Ontology,sio,http://semanticscience.org/ontology/sio.owl,sio.rdf
SKOS,skos,http://www.w3.org/TR/skos-reference/skos.rdf,skos.rdf
SO - Sequence types and features ontology,so,http://purl.obolibrary.org/obo/so.owl,so.rdf
UO - Units of measurement ontology,uo,http://purl.obolibrary.org/obo/uo.owl,uo.rdf
EOT
The PubChem data is about three central kinds of entities:
- A compound is an abstract chemical structure, for example: compound:CID2244 (acetylsalicylic acid)
- A substance is a concrete materialization of a compound, for example: substance:SID24890623 (a particular edition of Aspirin)
- A bioassay is an analytical method for measuring the effect of a substance on living matter
TLDR: There is no "canonical" name, neither for compounds nor for substances; each compound can have many substances; each substance can have many different kinds of names; each substance can even have multiple names of the same kind; some compounds are related to entities from other ontologies
Compounds are related to substances via the predicate sio:CHEMINF_000477 (has normalized counterpart), for example: substance:SID24890623 sio:CHEMINF_000477 compound:CID2244
For each substance, there are different kinds of names, for example, sio:CHEMINF_000339 (molecular entity name), sio:CHEMINF_000476 (chemical database identifier), or sio:CHEMINF_000561 (drug trade name). That way, even a single compound can have hundreds of names and synonyms, for example https://qlever.cs.uni-freiburg.de/pubchem/PAlJvI (all names/synonyms of Diclofenac) or https://qlever.cs.uni-freiburg.de/pubchem/7TwZLX (same, grouped by kind of name/synonym).
To get a particular kind of name of a particular substance, use substance:SID24890623 sio:SIO_000008 [ rdf:type sio:CHEMINF_000339 ; sio:SIO_000300 ?name ], where the intermediate node is called a "synonym".
Some compounds are related to entities from other ontologies via rdf:type or skos:closeMatch. For example, compound:CID2244 rdf:type obo:CHEBI_15365 (where obo:CHEBI_15365 is the identifier for acetylsalicylic acid in ChEBI, Chemical Entities of Biological Interest) or compound:CID2244 skos:closeMatch wd:Q18216 (where wd:Q18216 is the identifier for Aspirin in Wikidata).
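The patterns described above (substance-to-compound link plus the "synonym" node) can be combined into one query that lists all names of a compound together with their kind. The following is a sketch under assumptions: the function name compound_names_query is hypothetical, the prefix IRIs are the ones commonly used for PubChem RDF and SIO, and the endpoint in the commented curl call is the local one from the benchmark section.

```shell
# Prints a SPARQL query for all names (with their kind) of the substances
# that belong to one compound.
# $1: compound ID, for example CID2244
# `compound_names_query` is a hypothetical helper name.
compound_names_query() {
  cat <<EOF
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX sio: <http://semanticscience.org/resource/>
PREFIX compound: <http://rdf.ncbi.nlm.nih.gov/pubchem/compound/>
SELECT DISTINCT ?kind ?name WHERE {
  ?substance sio:CHEMINF_000477 compound:$1 .
  ?substance sio:SIO_000008 [ rdf:type ?kind ; sio:SIO_000300 ?name ]
}
EOF
}

# Possible usage against a locally running server (assumes port 7023):
# curl -s http://localhost:7023 -H "Accept: text/tab-separated-values" \
#   --data-urlencode "query=$(compound_names_query CID2244)"
```

Replacing ?kind by a fixed class such as sio:CHEMINF_000561 restricts the result to one kind of name, here drug trade names.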
TLDR: Most properties in PubChem are not expressed via a single predicate, but via multiple predicates and entities
The various chemical properties of a compound are realized via the generic predicate sio:SIO_000008 (has attribute) and a mediator node. For example, molecular weight is realized as follows, using the specific sio:CHEMINF_000334 (molecular weight) and the generic sio:SIO_000300 (has value):
?compound sio:SIO_000008 [
rdf:type sio:CHEMINF_000334 ;
sio:SIO_000300 ?value ]
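Embedded in a complete query, the pattern above looks as follows. This is a minimal sketch: the function name attribute_query is hypothetical, the prefix IRIs are the ones commonly used for PubChem RDF and SIO, and the attribute class is passed as a parameter on the assumption that other properties follow the same mediator-node shape.

```shell
# Prints a SPARQL query for one attribute value of one compound.
# $1: compound ID, for example CID2244
# $2: attribute class, for example CHEMINF_000334 (molecular weight)
# `attribute_query` is a hypothetical helper name.
attribute_query() {
  cat <<EOF
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX sio: <http://semanticscience.org/resource/>
PREFIX compound: <http://rdf.ncbi.nlm.nih.gov/pubchem/compound/>
SELECT ?value WHERE {
  compound:$1 sio:SIO_000008 [
    rdf:type sio:$2 ;
    sio:SIO_000300 ?value ]
}
EOF
}

# Print the query for the molecular weight of compound:CID2244.
attribute_query CID2244 CHEMINF_000334
```

The printed query can be pasted into the QLever UI or sent to the endpoint with any SPARQL client.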