-
Notifications
You must be signed in to change notification settings - Fork 69
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
check for invalid language and data type QIDs #371
check for invalid language and data type QIDs #371
Conversation
Thank you for the pull request!The Scribe team will do our best to address your contribution as soon as we can. The following is a checklist for maintainers to make sure this process goes as well as possible. Feel free to address the points below yourself in further commits if you realize that actions are needed :) If you're not already a member of our public Matrix community, please consider joining! We'd suggest using Element as your Matrix client, and definitely join the General and Data rooms once you're in. Also consider joining our bi-weekly Saturday dev syncs. It'd be great to have you! Maintainer checklist |
Some observations
It will be hard to extract the QIDs. For example see # tool: scribe-data
# All English (Q1860) nouns and their plural.
# Enter this query at https://query.wikidata.org/.
SELECT DISTINCT
(REPLACE(STR(?lexeme), "http://www.wikidata.org/entity/", "") AS ?lexemeID)
?singular
?plural
WHERE {
VALUES ?nounTypes {wd:Q1084 wd:Q147276} # Nouns and proper nouns
?lexeme dct:language wd:Q1860 ;
wikibase:lexicalCategory ?nounTypes ;
wikibase:lemma ?singular .
# MARK: Plural
OPTIONAL {
?lexeme ontolex:lexicalForm ?pluralForm .
?pluralForm ontolex:representation ?plural ;
wikibase:grammaticalFeature wd:Q146786 ;
} .
} and see verbs # tool: scribe-data
# All Nigerian Pidgin (Q33655) verbs.
# Enter this query at https://query.wikidata.org/.
SELECT
(REPLACE(STR(?lexeme), "http://www.wikidata.org/entity/", "") AS ?lexemeID)
?verb
WHERE {
?lexeme dct:language wd:Q33655 ;
wikibase:lexicalCategory wd:Q24905 ;
wikibase:lemma ?verb .
} the |
language_pattern = r"\?lexeme dct:language wd:Q\d+" | ||
data_type_pattern = r"wikibase:lexicalCategory wd:Q\d+" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this pattern will not work for all data_types because we don't have a consistent SPARQL query structure.
To get the language QID is easy because it's the same format regardless.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can certainly split the current preposition and postposition queries, @DeleMike :) Do you want to send along a PR for that?
As far as nouns and proper nouns, how are we feeling on this? Split them too? CC @catreedle :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I agree. we need to split them
languages = language_metadata.get( | ||
"languages" | ||
) # might not work since language_metadata file is not fully updated |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For language verification, the stumbling block might be the language_metadata.json
. If it is updated properly then we won't have issues. It depends of #293
@andrewtavis what do you think of this? The language issue is kind of sorted. Once we have an updated language_metadata file then this code should work as expected. However, verifying data_types is difficult because we have different ways the SPARQL queries were constructed. PS: Please ignore some of the print statements, I was using them to debug. This is still a draft PR. @catreedle any ideas too on how we can tackle these issues? |
As stated above, if we want to make the switch to one data type per query then I'm fine with that :) |
There's potentially more value in making sure that we can maintain the query base than combining nouns and proper nouns, plus some people might only want one 🤔 |
How about we keep a dictionary of the patterns for data type? data_type_patterns = {
"nouns": r"\?nounTypes \{wd:(Q\d+)\}",
"default": r"wikibase:lexicalCategory wd:Q\d+"
} We can retrieve the pattern based on the directory name directory_name = query_file.parent.name
data_type_pattern = data_type_patterns.get(directory_name, data_type_patterns["default"]) # Use default if no match This allows us to manage different patterns for various data types. If a directory name matches a key in the dictionary, we get its specific pattern, otherwise we use a default pattern for other types. |
I think that postpositions and prepositions should be split regardless, but this could certainly work 😊 |
@catreedle your idea would work but the real issue might still be there. So as @andrewtavis says, we should convert the queries to have one query per type. It would make the validation process easier. This pattern will now stand for all queries if we split all the queries: data_type_pattern = r"wikibase:lexicalCategory wd:Q\d+" Hence, in this style of code: SELECT
(REPLACE(STR(?lexeme), "http://www.wikidata.org/entity/", "") AS ?lexemeID)
?preposition
?case
WHERE {
# Prepositions and postpositions.
VALUES ?prePostPositions { wd:Q4833830 wd:Q161873 }
?lexeme dct:language wd:Q9610 ;
wikibase:lexicalCategory ?prePostPositions ;
wikibase:lemma ?preposition .
# MARK: Corresponding Case
OPTIONAL {
?lexeme wdt:P5713 ?caseForm .
} .
SERVICE wikibase:label {
bd:serviceParam wikibase:language "[AUTO_LANGUAGE]".
?caseForm rdfs:label ?case .
}
} we should eliminate all SELECT DISTINCT
?lexeme
(REPLACE(STR(?lexeme), "http://www.wikidata.org/entity/", "") AS ?lexemeID)
?preposition
WHERE {
?lexeme dct:language wd:Q9610 ;
wikibase:lexicalCategory wd:Q4833830 ;
wikibase:lemma ?preposition .
FILTER(lang(?preposition) = "hi")
}
SELECT DISTINCT
?lexeme
(REPLACE(STR(?lexeme), "http://www.wikidata.org/entity/", "") AS ?lexemeID)
?postposition
WHERE {
?lexeme dct:language wd:Q9610 ;
wikibase:lexicalCategory wd:Q161873 ;
wikibase:lemma ?postposition .
FILTER(lang(?postposition) = "hi")
} However, this might lead to a large PR because there would be many shake-ups. @andrewtavis is there a need for me to create an issue? And what do you think about this? |
I'm totally fine with this if this means that it's a bit easier for us to make sure that everything is working properly. Plus again there's a chance that someone just wants proper nouns. I can even think of an example right now where someone wants a list of names in a language, and then they'd be able to use Scribe-Data to easily get all names of a given language :) Should this be another issue? |
Thanks @andrewtavis! Yes, this should be tracked as a separate issue. It’s a significant refactor that involves modifying the structure of multiple query files and creating a dedicated issue would help us manage the process effectively. I’ll create an issue, and then get started on a PR asap so this current PR can make progress. |
Sounds good, Thanks @DeleMike! |
Do you want to clean this PR up as well, @DeleMike, and then I can do a final review? |
I will clean it up, but we can't say for sure if everything will be 100% good to go with the checks, even after resolving #380. Would you like me to adjust the current PR with the new plan in mind, where the pattern will be updated to: data_type_pattern = r"wikibase:lexicalCategory wd:Q\d+" This will ensure that our refactor is aligned with the upcoming changes. |
I will focus on cleaning up You should be expecting a clean PR in less than 30 mins :) |
currently facing merge conflicts... |
adverbs for yoruba
adverb for chinese/mandarin
adverb for english
Adverb for Basque
… print statements
@andrewtavis I resolved the merge conflicts. Please help review to know if this is acceptable for now🤔 |
and prints out any files with incorrect QIDs for both languages and data types. | ||
""" | ||
language_pattern = r"\?lexeme dct:language wd:Q\d+" | ||
data_type_pattern = r"wikibase:lexicalCategory\s+wd:Q\d+" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
upgraded the regex expression
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and docstrings. Those were the major work in the commit.
are the docstrings in the right format? @andrewtavis
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes they're looking good now, @DeleMike :) I'd just indent parameter and return entries, but really good 😊
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great, @DeleMike! Looking forward to seeing it in action in a workflow 😊
Contributor checklist
Description
I am showing a draft of how we want to verify queries. That is, language QIDs and data type QIDs.
Related issue