Queries with more than 20 triple patterns are never solved #1428

Open
tarcisiotmf opened this issue Jul 31, 2024 · 2 comments
Comments

@tarcisiotmf

When I execute the queries below with QLever, they are never solved; after 5 minutes, the error below is returned. The same queries were tested with GraphDB, where they are solved in less than 2 seconds. The issue can be replicated with the links that follow the error output:

{
    "exception": "Query timed out. Last operation: Query planning",
    "query": "# Among Melochia umbellata LCMS features in PI mode,\n# get the ones that are annoatted as [M+H]+ by SIRIUS and for which\n# a LCMS feature in NI mode with the corresponding [M-H]- m/z is found.\n\nPREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\nPREFIX wdt: <http://www.wikidata.org/prop/direct/>\nPREFIX xsd: <http://www.w3.org/2001/XMLSchema#>\nPREFIX emi: <https://purl.org/emi#>\nPREFIX sosa: <http://www.w3.org/ns/sosa/>\nPREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\nPREFIX prov: <http://www.w3.org/ns/prov#>\nSELECT DISTINCT ?lcms_opp ?feature ?rt ?pm ?feature_opp ?rt_opp ?pm_opp\nWHERE\n    { \n    VALUES ?ppm {\n        \"5\"^^xsd:decimal # m/z tolerance\n        }\n    VALUES ?rt_tol {\n        \"0.05\"^^xsd:decimal # RT tolerance (minute)\n        }\n    ?sample rdf:type emi:ExtractSample.\n    ?sample sosa:isSampleOf* ?organe .\n    ?organe emi:inTaxon ?taxon . \n    ?taxon rdfs:label \"melochia umbellata\" .\n    ?sample sosa:isFeatureOfInterestOf ?lcms .\n    ?lcms sosa:hasResult ?feature_list .  \n    ?lcms rdf:type emi:LCMSAnalysisPos .\n    ?feature_list emi:hasLCMSFeature ?feature .                    
\n    ?feature emi:hasParentMass ?pm .\n    ?feature emi:hasRetentionTime  ?rt .\n\t?feature emi:hasAnnotation ?sirius .\n\t?sirius rdf:type emi:StructuralAnnotation .\n    ?sirius prov:wasGeneratedBy ?activiy .\n    ?activiy prov:wasAssociatedWith <https://bio.informatik.uni-jena.de/software/sirius> .\n    ?sirius emi:hasAdduct ?adduct .\n \tFILTER(regex(str(?adduct), \"[M+H]+\"))       \n    ?sample sosa:isFeatureOfInterestOf ?lcms_opp .\n    ?lcms_opp rdf:type emi:LCMSAnalysisNeg .\n    ?lcms_opp sosa:hasResult ?feature_list_opp .\n    ?feature_list_opp emi:hasLCMSFeature ?feature_opp .\n\t?feature_opp emi:hasParentMass ?pm_opp .\n    ?feature_opp emi:hasRetentionTime ?rt_opp .\n    FILTER(((?rt - ?rt_tol) < ?rt_opp) && ((?rt + ?rt_tol) > ?rt_opp))\n    FILTER((?pm_opp > ((?pm - 2.014) - ((?ppm * 0.000001) * (?pm - 2.014)))) && (?pm_opp < ((?pm - 2.014) + ((?ppm * 0.000001) * (?pm - 2.014)))))\n    }\n",
    "resultsize": 0,
    "status": "ERROR",
    "time": {
        "computeResult": 300403,
        "total": 300403
    }
}

Execute the query with QLever

Execute the query with GraphDB (select the emi-dbgi repository)

The dataset used in our test is available here.

For your information, I also tried to simplify Query 2 by removing the property paths, VALUES clauses, and filters, but it still did not work (see Query 3 below).

Query 1:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX emi: <https://purl.org/emi#>
PREFIX sosa: <http://www.w3.org/ns/sosa/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX prov: <http://www.w3.org/ns/prov#>
SELECT DISTINCT ?lcms_opp ?feature ?rt ?pm ?feature_opp ?rt_opp ?pm_opp
WHERE
    { 
    VALUES ?ppm {
        "5"^^xsd:decimal # m/z tolerance
        }
    VALUES ?rt_tol {
        "0.05"^^xsd:decimal # RT tolerance (minute)
        }
    ?sample rdf:type emi:ExtractSample .
    ?sample sosa:isSampleOf* ?organe .
    ?organe emi:inTaxon ?taxon .
    ?taxon rdfs:label "melochia umbellata" .
    ?sample sosa:isFeatureOfInterestOf ?lcms .
    ?lcms sosa:hasResult ?feature_list .
    ?lcms rdf:type emi:LCMSAnalysisPos .
    ?feature_list emi:hasLCMSFeature ?feature .
    ?feature emi:hasParentMass ?pm .
    ?feature emi:hasRetentionTime ?rt .
    ?feature emi:hasAnnotation ?sirius .
    ?sirius rdf:type emi:StructuralAnnotation .
    ?sirius prov:wasGeneratedBy ?activiy .
    ?activiy prov:wasAssociatedWith <https://bio.informatik.uni-jena.de/software/sirius> .
    ?sirius emi:hasAdduct ?adduct .
    FILTER(regex(str(?adduct), "[M+H]+"))
    ?sample sosa:isFeatureOfInterestOf ?lcms_opp .
    ?lcms_opp rdf:type emi:LCMSAnalysisNeg .
    ?lcms_opp sosa:hasResult ?feature_list_opp .
    ?feature_list_opp emi:hasLCMSFeature ?feature_opp .
    ?feature_opp emi:hasParentMass ?pm_opp .
    ?feature_opp emi:hasRetentionTime ?rt_opp .
    FILTER(((?rt - ?rt_tol) < ?rt_opp) && ((?rt + ?rt_tol) > ?rt_opp))
    FILTER((?pm_opp > ((?pm - 2.014) - ((?ppm * 0.000001) * (?pm - 2.014)))) && (?pm_opp < ((?pm - 2.014) + ((?ppm * 0.000001) * (?pm - 2.014)))))
    }

Query 2:

# Get the PI mode LCMS features with SIRIUS annotation for which
# a LCMS feature in NI mode of the same extract is annotated with
# the same IK2D and has the same RT.

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>          
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX emi: <https://purl.org/emi#>
PREFIX sosa: <http://www.w3.org/ns/sosa/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX prov: <http://www.w3.org/ns/prov#>
SELECT DISTINCT ?feature ?feature_opp ?ik2d ?rt ?rt_opp
WHERE
    {
    VALUES ?rt_tol {
        "0.05"^^xsd:decimal # RT tolerance (minute)
        }
    ?sample rdf:type emi:ExtractSample .
    ?sample sosa:isSampleOf* ?organe .
    ?organe emi:inTaxon ?taxon .
    ?taxon rdfs:label "melochia umbellata" .
    ?sample sosa:isFeatureOfInterestOf ?lcms .
    ?lcms sosa:hasResult ?feature_list .
    ?lcms rdf:type emi:LCMSAnalysisPos .
    ?feature_list emi:hasLCMSFeature ?feature .
    ?feature emi:hasRetentionTime ?rt .
    ?feature emi:hasAnnotation ?sirius .
    ?sirius rdf:type emi:StructuralAnnotation .
    ?sirius prov:wasGeneratedBy ?activiy .
    ?activiy prov:wasAssociatedWith <https://bio.informatik.uni-jena.de/software/sirius> .
    ?sirius emi:hasChemicalStructure ?ik2d .
    ?sample sosa:isFeatureOfInterestOf ?lcms_opp .
    ?lcms_opp rdf:type emi:LCMSAnalysisNeg .
    ?lcms_opp sosa:hasResult ?feature_list_opp .
    ?feature_list_opp emi:hasLCMSFeature ?feature_opp .
    ?feature_opp emi:hasRetentionTime ?rt_opp .
    ?feature_opp emi:hasAnnotation ?sirius_opp .
    ?sirius_opp rdf:type emi:StructuralAnnotation .
    ?sirius_opp prov:wasGeneratedBy ?activiy_opp .
    ?activiy_opp prov:wasAssociatedWith <https://bio.informatik.uni-jena.de/software/sirius> .
    ?sirius_opp emi:hasChemicalStructure ?ik2d .

    FILTER(((?rt - ?rt_tol) < ?rt_opp) && ((?rt + ?rt_tol) > ?rt_opp))
    }

Query 3:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>          
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX emi: <https://purl.org/emi#>
PREFIX sosa: <http://www.w3.org/ns/sosa/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX prov: <http://www.w3.org/ns/prov#>
SELECT DISTINCT ?feature ?feature_opp ?ik2d ?rt ?rt_opp
WHERE
    {
    ?sample rdf:type emi:ExtractSample .
    ?sample sosa:isSampleOf ?a .
    ?a sosa:isSampleOf ?organe .
    ?organe emi:inTaxon ?taxon .
    ?taxon rdfs:label "melochia umbellata" .
    ?sample sosa:isFeatureOfInterestOf ?lcms .
    ?lcms sosa:hasResult ?feature_list .
    ?lcms rdf:type emi:LCMSAnalysisPos .
    ?feature_list emi:hasLCMSFeature ?feature .
    ?feature emi:hasRetentionTime ?rt .
    ?feature emi:hasAnnotation ?sirius .
    ?sirius rdf:type emi:StructuralAnnotation .
    ?sirius prov:wasGeneratedBy ?activiy .
    ?activiy prov:wasAssociatedWith <https://bio.informatik.uni-jena.de/software/sirius> .
    ?sirius emi:hasChemicalStructure ?ik2d .
    ?sample sosa:isFeatureOfInterestOf ?lcms_opp .
    ?lcms_opp rdf:type emi:LCMSAnalysisNeg .
    ?lcms_opp sosa:hasResult ?feature_list_opp .
    ?feature_list_opp emi:hasLCMSFeature ?feature_opp .
    ?feature_opp emi:hasRetentionTime ?rt_opp .
    ?feature_opp emi:hasAnnotation ?sirius_opp .
    ?sirius_opp rdf:type emi:StructuralAnnotation .
    ?sirius_opp prov:wasGeneratedBy ?activiy_opp .
    ?activiy_opp prov:wasAssociatedWith <https://bio.informatik.uni-jena.de/software/sirius> .
    ?sirius_opp emi:hasChemicalStructure ?ik2d .

    #FILTER(((?rt - ?rt_tol) < ?rt_opp) && ((?rt + ?rt_tol) > ?rt_opp))
    }
LIMIT 100

@hannahbast
Member

@tarcisiotmf The problem is that QLever's query planner currently generates all possible query plans (and then chooses the best). For large queries like yours, a heuristic is needed to limit the number of possible query plans that QLever evaluates.

Doing this automatically is on our TODO list. Until then, you can easily do it manually by grouping parts of the query with { ... }. For each such group, all query plans are generated, and the best query plans of the groups are then combined. You should choose the groups so that each group makes sense by itself. The smaller the final result for each group, the better.
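
As a rough sketch of what such a grouping could look like, here is Query 3 from above split into three groups. The choice of groups is only an assumption made for illustration and has not been run against the emi-dbgi dataset; the unused `wdt:` and `xsd:` prefixes are left out. The groups are joined on their shared variables (`?sample` and `?ik2d`), so the result should be the same as for the ungrouped query.

```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX emi: <https://purl.org/emi#>
PREFIX sosa: <http://www.w3.org/ns/sosa/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX prov: <http://www.w3.org/ns/prov#>
SELECT DISTINCT ?feature ?feature_opp ?ik2d ?rt ?rt_opp
WHERE
    {
    # Group 1 (illustrative split): extract samples of the taxon of interest.
    {
    ?sample rdf:type emi:ExtractSample .
    ?sample sosa:isSampleOf ?a .
    ?a sosa:isSampleOf ?organe .
    ?organe emi:inTaxon ?taxon .
    ?taxon rdfs:label "melochia umbellata" .
    }
    # Group 2 (illustrative split): PI mode features annotated by SIRIUS.
    {
    ?sample sosa:isFeatureOfInterestOf ?lcms .
    ?lcms rdf:type emi:LCMSAnalysisPos .
    ?lcms sosa:hasResult ?feature_list .
    ?feature_list emi:hasLCMSFeature ?feature .
    ?feature emi:hasRetentionTime ?rt .
    ?feature emi:hasAnnotation ?sirius .
    ?sirius rdf:type emi:StructuralAnnotation .
    ?sirius prov:wasGeneratedBy ?activiy .
    ?activiy prov:wasAssociatedWith <https://bio.informatik.uni-jena.de/software/sirius> .
    ?sirius emi:hasChemicalStructure ?ik2d .
    }
    # Group 3 (illustrative split): NI mode features annotated by SIRIUS.
    {
    ?sample sosa:isFeatureOfInterestOf ?lcms_opp .
    ?lcms_opp rdf:type emi:LCMSAnalysisNeg .
    ?lcms_opp sosa:hasResult ?feature_list_opp .
    ?feature_list_opp emi:hasLCMSFeature ?feature_opp .
    ?feature_opp emi:hasRetentionTime ?rt_opp .
    ?feature_opp emi:hasAnnotation ?sirius_opp .
    ?sirius_opp rdf:type emi:StructuralAnnotation .
    ?sirius_opp prov:wasGeneratedBy ?activiy_opp .
    ?activiy_opp prov:wasAssociatedWith <https://bio.informatik.uni-jena.de/software/sirius> .
    ?sirius_opp emi:hasChemicalStructure ?ik2d .
    }
    }
LIMIT 100
```

Two caveats with this particular split. Groups 2 and 3 are not restricted to the taxon, so their intermediate results may be large; a different split may work better. And a FILTER only sees the variables of the group it is written in, so a filter that compares `?rt` with `?rt_opp` (as in Query 2) would have to stay in the outer group, not inside group 2 or 3.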

Please try it and let us know if it worked for you.

@tarcisiotmf
Author

tarcisiotmf commented Aug 2, 2024

Thanks for the clarification! Currently, we are mostly evaluating available RDF stores for the different types of data and use cases we have at the SIB Swiss Institute of Bioinformatics, including example queries that are in use in real-world applications.

In fact, the majority of our use cases involve queries that are indeed large. We also develop Question Answering (QA) systems over different independent SPARQL endpoints (with different RDF technologies). It would therefore be quite complex for us to tailor queries case by case to a specific RDF store, especially since the queries are generated automatically by the QA system.

I am really impressed by the latest QLever developments. Once this issue is solved, QLever will be highly relevant for us.

Thanks again for your quick reply and support!
