check for invalid language and data type QIDs #371

DeleMike · 2024-10-15T13:38:35Z

Contributor checklist

This pull request is on a separate branch and not the main branch

Description

This is a draft of how we want to verify the queries, specifically their language QIDs and data type QIDs.

Related issue

Add workflow to check queries #339

github-actions · 2024-10-15T13:39:05Z

Thank you for the pull request!

The Scribe team will do our best to address your contribution as soon as we can. The following is a checklist for maintainers to make sure this process goes as well as possible. Feel free to address the points below yourself in further commits if you realize that actions are needed :)

If you're not already a member of our public Matrix community, please consider joining! We'd suggest using Element as your Matrix client, and definitely join the General and Data rooms once you're in. Also consider joining our bi-weekly Saturday dev syncs. It'd be great to have you!

Maintainer checklist

The linting and formatting workflow within the PR checks do not indicate new errors in the files changed
The CHANGELOG has been updated with a description of the changes for the upcoming release and the corresponding issue (if necessary)

DeleMike · 2024-10-15T13:40:03Z

Some observations

The issue with language QID verification is we don't have an updated source of truth.
The issue with data type QID verification is the queries are written in different manners.

It will be hard to extract the QIDs. For example see src/scribe_data/language_data_extraction/English/nouns/query_nouns.sparql:

# tool: scribe-data
# All English (Q1860) nouns and their plural.
# Enter this query at https://query.wikidata.org/.

SELECT DISTINCT
  (REPLACE(STR(?lexeme), "http://www.wikidata.org/entity/", "") AS ?lexemeID)
  ?singular
  ?plural

WHERE {
  VALUES ?nounTypes {wd:Q1084 wd:Q147276} # Nouns and proper nouns

  ?lexeme dct:language wd:Q1860 ;
    wikibase:lexicalCategory ?nounTypes ;
    wikibase:lemma ?singular .

  # MARK: Plural

  OPTIONAL {
    ?lexeme ontolex:lexicalForm ?pluralForm .
    ?pluralForm ontolex:representation ?plural ;
      wikibase:grammaticalFeature wd:Q146786 ;
  } .
}

Compare that with a verbs query:

# tool: scribe-data
# All Nigerian Pidgin (Q33655) verbs.
# Enter this query at https://query.wikidata.org/.

SELECT
  (REPLACE(STR(?lexeme), "http://www.wikidata.org/entity/", "") AS ?lexemeID)
  ?verb

WHERE {
  ?lexeme dct:language wd:Q33655 ;
    wikibase:lexicalCategory wd:Q24905 ;
    wikibase:lemma ?verb .
}

The wikibase:lexicalCategory value is written differently for verbs and nouns.
One query has wikibase:lexicalCategory ?nounTypes and the other has wikibase:lexicalCategory wd:Q24905. This also happens for things like prepositions and postpositions.

DeleMike · 2024-10-15T13:41:35Z

src/scribe_data/check/check_query_identifiers.py

+    language_pattern = r"\?lexeme dct:language wd:Q\d+"
+    data_type_pattern = r"wikibase:lexicalCategory wd:Q\d+"


This pattern will not work for all data types because we don't have a consistent SPARQL query structure.
Getting the language QID is easy because it has the same format in every query.
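
To make the mismatch concrete, here is a minimal sketch that applies both patterns to trimmed copies of the English nouns and Nigerian Pidgin verbs queries. The names nouns_query and verbs_query are illustrative only and do not come from the script.

```python
import re

language_pattern = r"\?lexeme dct:language wd:Q\d+"
data_type_pattern = r"wikibase:lexicalCategory wd:Q\d+"

# Trimmed copies of the two queries discussed in this PR (illustrative only).
nouns_query = """
VALUES ?nounTypes {wd:Q1084 wd:Q147276}
?lexeme dct:language wd:Q1860 ;
  wikibase:lexicalCategory ?nounTypes ;
"""
verbs_query = """
?lexeme dct:language wd:Q33655 ;
  wikibase:lexicalCategory wd:Q24905 ;
"""

for name, query in (("nouns", nouns_query), ("verbs", verbs_query)):
    has_language = re.search(language_pattern, query) is not None
    has_data_type = re.search(data_type_pattern, query) is not None
    print(name, has_language, has_data_type)
```

The language pattern matches both queries, while the data type pattern only matches the verbs query, because the nouns query refers to its QIDs through the ?nounTypes variable.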

We can certainly split the current preposition and postposition queries, @DeleMike :) Do you want to send along a PR for that?

As far as nouns and proper nouns, how are we feeling on this? Split them too? CC @catreedle :)

Yes, I agree. We need to split them.

DeleMike · 2024-10-15T13:43:20Z

src/scribe_data/check/check_query_identifiers.py

+    languages = language_metadata.get(
+        "languages"
+    )  # might not work since language_metadata file is not fully updated


For language verification, the stumbling block might be language_metadata.json. If it is updated properly then we won't have issues. It depends on #293.
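
As a rough sketch of why the metadata file matters, the check below compares the QID found in a query against a metadata dictionary. The structure of language_metadata shown here (a "languages" list of entries with "language" and "qid" keys) is an assumption made for illustration, and is_valid_language_qid is a hypothetical helper, not a function from the script.

```python
import re

# Assumed shape of language_metadata.json; the real file may differ.
language_metadata = {
    "languages": [
        {"language": "english", "qid": "Q1860"},
        {"language": "pidgin", "qid": "Q33655"},
    ]
}


def is_valid_language_qid(query_text, language_name, metadata):
    """Return True if the QID in the query matches the metadata entry."""
    match = re.search(r"\?lexeme dct:language wd:(Q\d+)", query_text)
    if match is None:
        return False
    expected = {
        entry["language"]: entry["qid"] for entry in metadata.get("languages", [])
    }
    # A language missing from the metadata cannot be verified, so it fails.
    return expected.get(language_name.lower()) == match.group(1)


print(is_valid_language_qid("?lexeme dct:language wd:Q1860 ;", "English", language_metadata))
print(is_valid_language_qid("?lexeme dct:language wd:Q9610 ;", "Bengali", language_metadata))
```

The second call fails even though the query itself may be correct, simply because the language is absent from the metadata. That is the kind of false alarm an incomplete metadata file would produce.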

DeleMike · 2024-10-15T13:45:47Z

@andrewtavis what do you think of this?

The language issue is kind of sorted. Once we have an updated language_metadata file then this code should work as expected.

However, verifying data_types is difficult because we have different ways the SPARQL queries were constructed.

PS: Please ignore some of the print statements, I was using them to debug. This is still a draft PR.

@catreedle any ideas too on how we can tackle these issues?

andrewtavis · 2024-10-15T20:19:49Z

As stated above, if we want to make the switch to one data type per query then I'm fine with that :)

andrewtavis · 2024-10-15T20:20:31Z

There's potentially more value in making sure that we can maintain the query base than combining nouns and proper nouns, plus some people might only want one 🤔

catreedle · 2024-10-16T01:49:14Z

How about we keep a dictionary of the patterns for data type?

data_type_patterns = {
    # Matches a VALUES list of one or more QIDs, e.g. {wd:Q1084 wd:Q147276}.
    "nouns": r"\?nounTypes\s*\{\s*(?:wd:Q\d+\s*)+\}",
    "default": r"wikibase:lexicalCategory wd:Q\d+",
}

We can retrieve the pattern based on the directory name

directory_name = query_file.parent.name
data_type_pattern = data_type_patterns.get(directory_name, data_type_patterns["default"])  # Use default if no match

This allows us to manage different patterns for various data types. If a directory name matches a key in the dictionary, we get its specific pattern, otherwise we use a default pattern for other types.
What do you think?
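
A small self-contained sketch of the directory-based lookup is given here. The nouns pattern used in this sketch accepts one or more QIDs inside the braces, because the English nouns query lists two. pattern_for is a hypothetical helper and the verbs path is made up for the example.

```python
import re
from pathlib import Path

data_type_patterns = {
    "nouns": r"\?nounTypes\s*\{\s*(?:wd:Q\d+\s*)+\}",
    "default": r"wikibase:lexicalCategory wd:Q\d+",
}


def pattern_for(query_file: Path) -> str:
    """Pick the regex for a query based on its parent directory name."""
    return data_type_patterns.get(query_file.parent.name, data_type_patterns["default"])


nouns_file = Path("English/nouns/query_nouns.sparql")
verbs_file = Path("Pidgin/verbs/query_verbs.sparql")  # hypothetical path

nouns_text = "VALUES ?nounTypes {wd:Q1084 wd:Q147276}"
verbs_text = "wikibase:lexicalCategory wd:Q24905 ;"

print(bool(re.search(pattern_for(nouns_file), nouns_text)))
print(bool(re.search(pattern_for(verbs_file), verbs_text)))
```

The trade-off is that every new query style needs its own dictionary entry, so the dictionary grows with each inconsistency in the query base.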

andrewtavis · 2024-10-16T05:39:21Z

I think that postpositions and prepositions should be split regardless, but this could certainly work 😊

DeleMike · 2024-10-16T07:45:46Z

@catreedle your idea would work, but the underlying inconsistency would still be there. So, as @andrewtavis says, we should convert the queries to have one data type per query. It would make the validation process easier. If we split all the queries, this single pattern will work for all of them:

data_type_pattern = r"wikibase:lexicalCategory wd:Q\d+"

Hence, in this style of code:

SELECT
  (REPLACE(STR(?lexeme), "http://www.wikidata.org/entity/", "") AS ?lexemeID)
  ?preposition
  ?case

WHERE {
  # Prepositions and postpositions.
  VALUES ?prePostPositions { wd:Q4833830 wd:Q161873 }

  ?lexeme dct:language wd:Q9610 ;
    wikibase:lexicalCategory ?prePostPositions ;
    wikibase:lemma ?preposition .

  # MARK: Corresponding Case

  OPTIONAL {
    ?lexeme wdt:P5713 ?caseForm .
  } .

  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "[AUTO_LANGUAGE]".
    ?caseForm rdfs:label ?case .
  }
}

we should eliminate all VALUES ?... and then have:
query_preposition.sparql:

SELECT DISTINCT
  ?lexeme
  (REPLACE(STR(?lexeme), "http://www.wikidata.org/entity/", "") AS ?lexemeID)
  ?preposition

WHERE {
  ?lexeme dct:language wd:Q9610 ;
    wikibase:lexicalCategory wd:Q4833830 ;
    wikibase:lemma ?preposition .
    FILTER(lang(?preposition) = "bn")
}

query_postposition.sparql:

SELECT DISTINCT
  ?lexeme
  (REPLACE(STR(?lexeme), "http://www.wikidata.org/entity/", "") AS ?lexemeID)
  ?postposition

WHERE {
  ?lexeme dct:language wd:Q9610 ;
    wikibase:lexicalCategory wd:Q161873 ;
    wikibase:lemma ?postposition .
    FILTER(lang(?postposition) = "bn")
}

However, this might lead to a large PR because there would be many shake-ups.
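
With one data type per query, the check itself could stay small. The sketch below adds a capture group to the pattern so the QID can be compared against an expected value. The QIDs come from the queries quoted in this thread, while the dictionary keys and the has_expected_data_type helper are hypothetical.

```python
import re

# Same idea as data_type_pattern, with a capture group added for the QID.
data_type_pattern = r"wikibase:lexicalCategory\s+wd:(Q\d+)"

# QIDs taken from the queries in this thread; the key names are hypothetical.
expected_data_type_qids = {
    "nouns": "Q1084",
    "proper_nouns": "Q147276",
    "verbs": "Q24905",
    "prepositions": "Q4833830",
    "postpositions": "Q161873",
}


def has_expected_data_type(query_text: str, data_type: str) -> bool:
    """Return True if the query's lexical category QID matches the data type."""
    match = re.search(data_type_pattern, query_text)
    if match is None:
        return False
    return match.group(1) == expected_data_type_qids.get(data_type)


postposition_query = """
?lexeme dct:language wd:Q9610 ;
  wikibase:lexicalCategory wd:Q161873 ;
  wikibase:lemma ?postposition .
"""

print(has_expected_data_type(postposition_query, "postpositions"))
print(has_expected_data_type(postposition_query, "prepositions"))
```

A query that still uses a VALUES variable would fail this check, which makes leftover combined queries easy to spot during the refactor.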

@andrewtavis is there a need for me to create an issue? And what do you think about this?

andrewtavis · 2024-10-16T10:46:11Z

I'm totally fine with this if this means that it's a bit easier for us to make sure that everything is working properly. Plus again there's a chance that someone just wants proper nouns. I can even think of an example right now where someone wants a list of names in a language, and then they'd be able to use Scribe-Data to easily get all names of a given language :)

Should this be another issue?

DeleMike · 2024-10-16T11:08:55Z

Thanks @andrewtavis!

Yes, this should be tracked as a separate issue. It’s a significant refactor that involves modifying the structure of multiple query files and creating a dedicated issue would help us manage the process effectively.

I’ll create an issue, and then get started on a PR asap so this current PR can make progress.

andrewtavis · 2024-10-16T11:11:57Z

Sounds good, Thanks @DeleMike!

andrewtavis · 2024-10-16T11:16:32Z

Do you want to clean this PR up as well, @DeleMike, and then I can do a final review?

DeleMike · 2024-10-16T11:27:11Z

Do you want to clean this PR up as well, @DeleMike, and then I can do a final review?

I will clean it up, but we can't say for sure if everything will be 100% good to go with the checks, even after resolving #380.

Would you like me to adjust the current PR with the new plan in mind, where the pattern will be updated to:

data_type_pattern = r"wikibase:lexicalCategory wd:Q\d+"

This will ensure that our refactor is aligned with the upcoming changes.

DeleMike · 2024-10-16T11:36:41Z

I will focus on cleaning up src/scribe_data/check/check_query_identifiers.py to have the right pattern and flow to achieve these checks.
Then, I will jump straight into configuring the queries.
Once that is set up, I will run the checks again to verify the queries.

You should be expecting a clean PR in less than 30 mins :)

andrewtavis · 2024-10-16T11:38:17Z

Thanks much, @DeleMike, and yes I think that it makes sense to refactor based on the changes in #380 going through 😊 We can merge the work for the other one before this one :)

DeleMike · 2024-10-16T12:44:47Z

currently facing merge conflicts...

DeleMike · 2024-10-16T13:10:51Z

@andrewtavis I resolved the merge conflicts. Please help review to know if this is acceptable for now🤔

DeleMike · 2024-10-16T13:12:09Z

src/scribe_data/check/check_query_identifiers.py

+    and prints out any files with incorrect QIDs for both languages and data types.
+    """
+    language_pattern = r"\?lexeme dct:language wd:Q\d+"
+    data_type_pattern = r"wikibase:lexicalCategory\s+wd:Q\d+"


Upgraded the regex and the docstrings. Those were the major changes in the commit.
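
For context on what the \s+ change buys, this short sketch compares the earlier single-space pattern with the updated one on three spacing variants. The sample strings are made up for illustration.

```python
import re

old_pattern = r"wikibase:lexicalCategory wd:Q\d+"
new_pattern = r"wikibase:lexicalCategory\s+wd:Q\d+"

samples = [
    "wikibase:lexicalCategory wd:Q24905 ;",  # single space
    "wikibase:lexicalCategory   wd:Q24905 ;",  # extra spaces
    "wikibase:lexicalCategory\n      wd:Q24905 ;",  # value on the next line
]

for text in samples:
    print(bool(re.search(old_pattern, text)), bool(re.search(new_pattern, text)))
```

Only the first sample matches the old pattern, while all three match the new one, so the check no longer depends on exact query formatting.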

are the docstrings in the right format? @andrewtavis

Yes they're looking good now, @DeleMike :) I'd just indent parameter and return entries, but really good 😊
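
One possible reading of that suggestion is sketched below: a numpydoc-style docstring where the entries under each section are indented. extract_qid is a hypothetical helper, and the exact layout the maintainers prefer is an assumption.

```python
import re


def extract_qid(pattern, query_text):
    """
    Find the first QID that a pattern captures in a query.

    Parameters
    ----------
        pattern : str
            A regular expression with one capture group for the QID.

        query_text : str
            The text of a SPARQL query.

    Returns
    -------
        str or None
            The captured QID, or None if the pattern does not match.
    """
    match = re.search(pattern, query_text)
    return match.group(1) if match else None
```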

andrewtavis

This is great, @DeleMike! Looking forward to seeing it in action in a workflow 😊

check for invalid language and data type QIDs

594a5ac

andrewtavis self-requested a review October 15, 2024 14:36

andrewtavis added the hacktoberfest-accepted Accepted as a part of Hacktoberfest label Oct 15, 2024

Otom-obhazi and others added 9 commits October 16, 2024 13:46

Create query_adverbs.sparql

defab4d

adverbs for yoruba

Remove select distinct from all queries

662a0f6

Create query_adverbs.sparql

b5fecce

adverb for chinese/mandarin

Add filter for language

ae15e77

Create query_adverbs.sparql

f5f7404

adverb for english

Remove adverb file and prepare tests

e250233

Re-add English adverbs

52dca19

Add Chinese Mndarin adverbs,prepositions,adjectives and emoji keywords

7dbf7b0

Update Mandarin prepositions query

5a383f2

andrewtavis and others added 22 commits October 16, 2024 13:46

Add missing init file

318cceb

Create query_adverbs.sparql

52b7426

Adverb for Basque

Rename adverb directory

e16dc24

Create query_adjectives_1.sparql

e0f0598

Create query_adjective_2.sparql

51d1f1d

Create query_adjectives_3.sparql

cc7b9e6

Rename query_adjective_2.sparql to query_adjectives_2.sparql

2fc8ed7

Create query_adverbs.sparql

0bd670e

Create generate_emoji_keywords.py

f276d16

Add forms to adjectives query

a577951

adding a sparql file in Tamil/adverbs for Tamil adverbs

adc061f

simple sparql query for fetching Tamil adverbs from wikidata

7d0195b

Add vocative

7c3b037

fix lists of arguments to be validated

ae2e662

Minor formatting and edits to outputs

3e6835c

add workflow check_query_identifiers and dummy script scribe-org#339

343ffdb

Update workflow to trigger on future commits

230fa58

Deactivate workflow so it can be brought into other PRs

408abc9

Remove yaml from workflow name

bf02ac8

Update unicode docs

08f6ed1

Update Sphynx RTD theme for docs

5fba72f

Cleanup query validation logic: update data_type_pattern and clean up print statements

d37872c

DeleMike marked this pull request as ready for review October 16, 2024 12:54

Merge branch 'main' into feat/add-check-query-identifiers-script

620922e

DeleMike mentioned this pull request Oct 16, 2024

Refactor SPARQL queries into atomic structures: #387

Merged

1 task

Minor edits to script formatting

5e86265

andrewtavis approved these changes Oct 16, 2024

andrewtavis merged commit 6189a84 into scribe-org:main Oct 16, 2024
5 checks passed
