Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

check for invalid language and data type QIDs #371

Merged

Conversation

DeleMike
Copy link
Contributor

Contributor checklist


Description

I am showing a draft of how we want to verify queries. That is, language QIDs and data type QIDs.

Related issue

Copy link

github-actions bot commented Oct 15, 2024

Thank you for the pull request!

The Scribe team will do our best to address your contribution as soon as we can. The following is a checklist for maintainers to make sure this process goes as well as possible. Feel free to address the points below yourself in further commits if you realize that actions are needed :)

If you're not already a member of our public Matrix community, please consider joining! We'd suggest using Element as your Matrix client, and definitely join the General and Data rooms once you're in. Also consider joining our bi-weekly Saturday dev syncs. It'd be great to have you!

Maintainer checklist

  • The linting and formatting workflow within the PR checks do not indicate new errors in the files changed

  • The CHANGELOG has been updated with a description of the changes for the upcoming release and the corresponding issue (if necessary)

@DeleMike
Copy link
Contributor Author

DeleMike commented Oct 15, 2024

Some observations

  • The issue with language QID verification is we don't have an updated source of truth.

  • The issue with data type QID verification is the queries are written in different manners.

It will be hard to extract the QIDs. For example see src/scribe_data/language_data_extraction/English/nouns/query_nouns.sparql:

# tool: scribe-data
# All English (Q1860) nouns and their plural.
# Enter this query at https://query.wikidata.org/.

SELECT DISTINCT
  (REPLACE(STR(?lexeme), "http://www.wikidata.org/entity/", "") AS ?lexemeID)
  ?singular
  ?plural

WHERE {
  VALUES ?nounTypes {wd:Q1084 wd:Q147276} # Nouns and proper nouns

  ?lexeme dct:language wd:Q1860 ;
    wikibase:lexicalCategory ?nounTypes ;
    wikibase:lemma ?singular .

  # MARK: Plural

  OPTIONAL {
    ?lexeme ontolex:lexicalForm ?pluralForm .
    ?pluralForm ontolex:representation ?plural ;
      wikibase:grammaticalFeature wd:Q146786 ;
  } .
}

and see verbs

# tool: scribe-data
# All Nigerian Pidgin (Q33655) verbs.
# Enter this query at https://query.wikidata.org/.

SELECT
  (REPLACE(STR(?lexeme), "http://www.wikidata.org/entity/", "") AS ?lexemeID)
  ?verb

WHERE {
  ?lexeme dct:language wd:Q33655 ;
    wikibase:lexicalCategory wd:Q24905 ;
    wikibase:lemma ?verb .
}

the lexicalCategory wd: is different for verbs and nouns.
one event has lexicalCategory ?nounTypes and the other lexicalCategory wd:Q24905...this also happens for things like prepositions and postpositions

Comment on lines 27 to 28
language_pattern = r"\?lexeme dct:language wd:Q\d+"
data_type_pattern = r"wikibase:lexicalCategory wd:Q\d+"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this pattern will not work for all data_types because we don't have a consistent SPARQL query structure.
To get the language QID is easy because it's the same format regardless.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can certainly split the current preposition and postposition queries, @DeleMike :) Do you want to send along a PR for that?

As far as nouns and proper nouns, how are we feeling on this? Split them too? CC @catreedle :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I agree. we need to split them

Comment on lines +56 to +58
languages = language_metadata.get(
"languages"
) # might not work since language_metadata file is not fully updated
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For language verification, the stumbling block might be the language_metadata.json. If it is updated properly then we won't have issues. It depends of #293

@DeleMike
Copy link
Contributor Author

DeleMike commented Oct 15, 2024

Some observations

  • The issue with language QID verification is we don't have an updated source of truth.
  • The issue with data type QID verification is the queries are written in different manners.

It will be hard to extract the QIDs. For example see src/scribe_data/language_data_extraction/English/nouns/query_nouns.sparql:

# tool: scribe-data
# All English (Q1860) nouns and their plural.
# Enter this query at https://query.wikidata.org/.

SELECT DISTINCT
  (REPLACE(STR(?lexeme), "http://www.wikidata.org/entity/", "") AS ?lexemeID)
  ?singular
  ?plural

WHERE {
  VALUES ?nounTypes {wd:Q1084 wd:Q147276} # Nouns and proper nouns

  ?lexeme dct:language wd:Q1860 ;
    wikibase:lexicalCategory ?nounTypes ;
    wikibase:lemma ?singular .

  # MARK: Plural

  OPTIONAL {
    ?lexeme ontolex:lexicalForm ?pluralForm .
    ?pluralForm ontolex:representation ?plural ;
      wikibase:grammaticalFeature wd:Q146786 ;
  } .
}

and see verbs

# tool: scribe-data
# All Nigerian Pidgin (Q33655) verbs.
# Enter this query at https://query.wikidata.org/.

SELECT
  (REPLACE(STR(?lexeme), "http://www.wikidata.org/entity/", "") AS ?lexemeID)
  ?verb

WHERE {
  ?lexeme dct:language wd:Q33655 ;
    wikibase:lexicalCategory wd:Q24905 ;
    wikibase:lemma ?verb .
}

the lexicalCategory wd:Q24905 is different for verbs and nouns.

@andrewtavis what do you think of this?

The language issue is kind of sorted. Once we have an updated language_metadata file then this code should work as expected.

However, verifying data_types is difficult because we have different ways the SPARQL queries were constructed.

PS: Please ignore some of the print statements, I was using them to debug. This is still a draft PR.

@catreedle any ideas too on how we can tackle these issues?

@andrewtavis andrewtavis self-requested a review October 15, 2024 14:36
@andrewtavis andrewtavis added the hacktoberfest-accepted Accepted as a part of Hacktoberfest label Oct 15, 2024
@andrewtavis
Copy link
Member

As stated above, if we want to make the switch to one data type per query then I'm fine with that :)

@andrewtavis
Copy link
Member

There's potentially more value in making sure that we can maintain the query base than combining nouns and proper nouns, plus some people might only want one 🤔

@catreedle
Copy link
Contributor

How about we keep a dictionary of the patterns for data type?

data_type_patterns = {
    "nouns": r"\?nounTypes \{wd:(Q\d+)\}", 
    "default": r"wikibase:lexicalCategory wd:Q\d+"
}

We can retrieve the pattern based on the directory name

directory_name = query_file.parent.name
data_type_pattern = data_type_patterns.get(directory_name, data_type_patterns["default"])  # Use default if no match

This allows us to manage different patterns for various data types. If a directory name matches a key in the dictionary, we get its specific pattern, otherwise we use a default pattern for other types.
What do you think?

@andrewtavis
Copy link
Member

I think that postpositions and prepositions should be split regardless, but this could certainly work 😊

@DeleMike
Copy link
Contributor Author

@catreedle your idea would work but the real issue might still be there. So as @andrewtavis says, we should convert the queries to have one query per type. It would make the validation process easier. This pattern will now stand for all queries if we split all the queries:

data_type_pattern = r"wikibase:lexicalCategory wd:Q\d+"

Hence, in this style of code:

SELECT
  (REPLACE(STR(?lexeme), "http://www.wikidata.org/entity/", "") AS ?lexemeID)
  ?preposition
  ?case

WHERE {
  # Prepositions and postpositions.
  VALUES ?prePostPositions { wd:Q4833830 wd:Q161873 }

  ?lexeme dct:language wd:Q9610 ;
    wikibase:lexicalCategory ?prePostPositions ;
    wikibase:lemma ?preposition .

  # MARK: Corresponding Case

  OPTIONAL {
    ?lexeme wdt:P5713 ?caseForm .
  } .

  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "[AUTO_LANGUAGE]".
    ?caseForm rdfs:label ?case .
  }
}

we should eliminate all VALUES ?... and then have:
query_preposition.sparql:

SELECT DISTINCT
  ?lexeme
  (REPLACE(STR(?lexeme), "http://www.wikidata.org/entity/", "") AS ?lexemeID)
  ?preposition

WHERE {
  ?lexeme dct:language wd:Q9610 ;
    wikibase:lexicalCategory wd:Q4833830 ;
    wikibase:lemma ?preposition .
    FILTER(lang(?preposition) = "hi")
}

query_postposition.sparql:

SELECT DISTINCT
  ?lexeme
  (REPLACE(STR(?lexeme), "http://www.wikidata.org/entity/", "") AS ?lexemeID)
  ?postposition

WHERE {
  ?lexeme dct:language wd:Q9610 ;
    wikibase:lexicalCategory wd:Q161873 ;
    wikibase:lemma ?postposition .
    FILTER(lang(?postposition) = "hi")
}

However, this might lead to a large PR because there would be many shake-ups.

@andrewtavis is there a need for me to create an issue? And what do you think about this?

@andrewtavis
Copy link
Member

I'm totally fine with this if this means that it's a bit easier for us to make sure that everything is working properly. Plus again there's a chance that someone just wants proper nouns. I can even think of an example right now where someone wants a list of names in a language, and then they'd be able to use Scribe-Data to easily get all names of a given language :)

Should this be another issue?

@DeleMike
Copy link
Contributor Author

Thanks @andrewtavis!

Yes, this should be tracked as a separate issue. It’s a significant refactor that involves modifying the structure of multiple query files and creating a dedicated issue would help us manage the process effectively.

I’ll create an issue, and then get started on a PR asap so this current PR can make progress.

@andrewtavis
Copy link
Member

Sounds good, Thanks @DeleMike!

@andrewtavis
Copy link
Member

Do you want to clean this PR up as well, @DeleMike, and then I can do a final review?

@DeleMike
Copy link
Contributor Author

Do you want to clean this PR up as well, @DeleMike, and then I can do a final review?

I will clean it up, but we can't say for sure if everything will be 100% good to go with the checks, even after resolving #380.

Would you like me to adjust the current PR with the new plan in mind, where the pattern will be updated to:

data_type_pattern = r"wikibase:lexicalCategory wd:Q\d+"

This will ensure that our refactor is aligned with the upcoming changes.

@DeleMike
Copy link
Contributor Author

I will focus on cleaning up src/scribe_data/check/check_query_identifiers.py to have the right pattern and flow to achieve these checks.
Then, I will jump straight into configuring the queries.
Once that is set up, I will run the checks again to verify the queries.

You should be expecting a clean PR in less than 30 mins :)

@andrewtavis
Copy link
Member

Thanks much, @DeleMike, and yes I think that it makes sense to refactor based on the changes in #380 going through 😊 We can merge the work for the other one before this one :)

@DeleMike
Copy link
Contributor Author

currently facing merge conflicts...

@DeleMike DeleMike marked this pull request as ready for review October 16, 2024 12:54
@DeleMike
Copy link
Contributor Author

@andrewtavis I resolved the merge conflicts. Please help review to know if this is acceptable for now🤔

and prints out any files with incorrect QIDs for both languages and data types.
"""
language_pattern = r"\?lexeme dct:language wd:Q\d+"
data_type_pattern = r"wikibase:lexicalCategory\s+wd:Q\d+"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

upgraded the regex expression

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and docstrings. Those were the major work in the commit.

are the docstrings in the right format? @andrewtavis

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes they're looking good now, @DeleMike :) I'd just indent parameter and return entries, but really good 😊

Copy link
Member

@andrewtavis andrewtavis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great, @DeleMike! Looking forward to seeing it in action in a workflow 😊

@andrewtavis andrewtavis merged commit 6189a84 into scribe-org:main Oct 16, 2024
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hacktoberfest-accepted Accepted as a part of Hacktoberfest
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants