check for invalid language and data type QIDs #371
Merged
andrewtavis merged 37 commits into scribe-org:main from DeleMike:feat/add-check-query-identifiers-script on Oct 16, 2024
Changes from 1 commit
594a5ac check for invalid language and data type QIDs (DeleMike)
defab4d Create query_adverbs.sparql (Otom-obhazi)
662a0f6 Remove select distinct from all queries (andrewtavis)
b5fecce Create query_adverbs.sparql (Otom-obhazi)
ae15e77 Add filter for language (andrewtavis)
f5f7404 Create query_adverbs.sparql (Otom-obhazi)
e250233 Remove adverb file and prepare tests (andrewtavis)
52dca19 Re-add English adverbs (andrewtavis)
7dbf7b0 Add Chinese Mndarin adverbs,prepositions,adjectives and emoji keywords (VNW22)
5a383f2 Update Mandarin prepositions query (VNW22)
1942d09 Remove Mandarin Adverbs directory (VNW22)
3d505a7 Create query_adverbs.sparql (Otom-obhazi)
a871de3 Create generate_emoji_keywords.py (Otom-obhazi)
318cceb Add missing init file (andrewtavis)
52b7426 Create query_adverbs.sparql (Otom-obhazi)
e16dc24 Rename adverb directory (andrewtavis)
e0f0598 Create query_adjectives_1.sparql (KesharwaniArpita)
51d1f1d Create query_adjective_2.sparql (KesharwaniArpita)
cc7b9e6 Create query_adjectives_3.sparql (KesharwaniArpita)
2fc8ed7 Rename query_adjective_2.sparql to query_adjectives_2.sparql (KesharwaniArpita)
0bd670e Create query_adverbs.sparql (KesharwaniArpita)
f276d16 Create generate_emoji_keywords.py (KesharwaniArpita)
a577951 Add forms to adjectives query (andrewtavis)
adc061f adding a sparql file in Tamil/adverbs for Tamil adverbs (OmarAI2003)
7d0195b simple sparql query for fetching Tamil adverbs from wikidata (OmarAI2003)
7c3b037 Add vocative (andrewtavis)
ae2e662 fix lists of arguments to be validated (axif0)
3e6835c Minor formatting and edits to outputs (andrewtavis)
343ffdb add workflow check_query_identifiers and dummy script #339 (catreedle)
230fa58 Update workflow to trigger on future commits (catreedle)
408abc9 Deactivate workflow so it can be brought into other PRs (andrewtavis)
bf02ac8 Remove yaml from workflow name (andrewtavis)
08f6ed1 Update unicode docs (andrewtavis)
5fba72f Update Sphynx RTD theme for docs (andrewtavis)
d37872c Cleanup query validation logic: update data_type_pattern and clean up… (DeleMike)
620922e Merge branch 'main' into feat/add-check-query-identifiers-script (DeleMike)
5e86265 Minor edits to script formatting (andrewtavis)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,99 @@ | ||
import re | ||
from pathlib import Path | ||
from typing import Optional | ||
|
||
from scribe_data.cli.cli_utils import ( | ||
    LANGUAGE_DATA_EXTRACTION_DIR, | ||
    language_metadata, | ||
    data_type_metadata, | ||
) | ||
|
||
|
||
def extract_qid_from_sparql(file_path: Path, pattern: str) -> Optional[str]: | ||
    """ | ||
    Extract the QID based on the pattern provided (either language or data type). | ||
    Returns None if the file cannot be read or the pattern does not match. | ||
    """ | ||
    try: | ||
        with open(file_path, "r", encoding="utf-8") as file: | ||
            content = file.read() | ||
        match = re.search(pattern, content) | ||
        if match: | ||
            # group(0) is the whole matched text, so keep only the part after "wd:". | ||
            return match.group(0).split("wd:")[-1] | ||
    except Exception as e: | ||
        print(f"Error reading {file_path}: {e}") | ||
    return None | ||
|
||
|
||
def check_queries(): | ||
    language_pattern = r"\?lexeme dct:language wd:Q\d+" | ||
    data_type_pattern = r"wikibase:lexicalCategory wd:Q\d+" | ||
    incorrect_languages = [] | ||
    incorrect_data_types = [] | ||
|
||
    language_extraction_dir = LANGUAGE_DATA_EXTRACTION_DIR | ||
    for query_file in language_extraction_dir.glob("**/*.sparql"): | ||
        lang_qid = extract_qid_from_sparql(query_file, language_pattern) | ||
        data_type_qid = extract_qid_from_sparql(query_file, data_type_pattern) | ||
|
||
        # Validate language QID and data type QID | ||
        if not is_valid_language(query_file, lang_qid): | ||
            incorrect_languages.append(query_file) | ||
        if not is_valid_data_type(query_file, data_type_qid): | ||
            incorrect_data_types.append(query_file) | ||
|
||
    if incorrect_languages: | ||
        print("Queries with incorrect language QIDs are:") | ||
        for file in incorrect_languages: | ||
            print(f"- {file}") | ||
|
||
    if incorrect_data_types: | ||
        print("Queries with incorrect data type QIDs are:") | ||
        for file in incorrect_data_types: | ||
            print(f"- {file}") | ||
|
||
|
||
def is_valid_language(query_file, lang_qid): | ||
    lang_directory_name = query_file.parent.parent.name.lower() | ||
    languages = language_metadata.get( | ||
        "languages" | ||
    )  # might not work since language_metadata file is not fully updated | ||
Comment on lines +96 to +98:
For language verification, the stumbling block might be the |
||
    language_entry = next( | ||
        (lang for lang in languages if lang["language"] == lang_directory_name), None | ||
    ) | ||
|
||
    if not language_entry: | ||
        print( | ||
            f"Warning: Language '{lang_directory_name}' not found in language_metadata.json." | ||
        ) | ||
        return False | ||
|
||
    expected_language_qid = language_entry["qid"] | ||
    print("Expected language QID:", expected_language_qid) | ||
|
||
    if lang_qid != expected_language_qid: | ||
        print( | ||
            f"Incorrect language QID in {lang_directory_name}. " | ||
            f"Found: {lang_qid}, Expected: {expected_language_qid}" | ||
        ) | ||
        return False | ||
    return True | ||
|
||
|
||
def is_valid_data_type(query_file, data_type_qid): | ||
    directory_name = query_file.parent.name  # e.g., "nouns" or "verbs" | ||
    expected_data_type_qid = data_type_metadata.get(directory_name) | ||
|
||
    if data_type_qid != expected_data_type_qid: | ||
        print( | ||
            f"Warning: Incorrect data type QID in {query_file}. Found: {data_type_qid}, Expected: {expected_data_type_qid}" | ||
        ) | ||
        return False | ||
    return True | ||
|
||
|
||
# Examples: | ||
|
||
# file_path = Path("French/verbs/query_verbs.sparql") | ||
# print(is_valid_data_type(file_path, "QW24907"))  # check the data type QID | ||
# print(is_valid_language(file_path, "Q150"))  # check the language QID | ||
|
||
if __name__ == "__main__": | ||
    check_queries() |
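The extraction step in this script can also be written with a capturing group. Note that `match.group(0)` returns the whole matched text, including the predicate, so a capturing group is one way to get back only the QID. The following is a minimal sketch, not the code merged in this PR: the names `extract_qid`, `LANGUAGE_PATTERN`, `DATA_TYPE_PATTERN` and `SAMPLE_QUERY` are hypothetical, and the sample query text is an assumption about the query shape that the two patterns expect.

```python
import re
from typing import Optional

# Capturing groups isolate the QID, so no string cleanup is needed afterwards.
# \s+ tolerates any run of whitespace between the tokens.
LANGUAGE_PATTERN = r"\?lexeme\s+dct:language\s+wd:(Q\d+)"
DATA_TYPE_PATTERN = r"wikibase:lexicalCategory\s+wd:(Q\d+)"


def extract_qid(query_text: str, pattern: str) -> Optional[str]:
    """Return the first QID captured by pattern, or None if nothing matches."""
    match = re.search(pattern, query_text)
    return match.group(1) if match else None


# Hypothetical query text (an assumption for illustration only).
SAMPLE_QUERY = """
SELECT ?lexeme ?verb WHERE {
  ?lexeme dct:language wd:Q150 ;
    wikibase:lexicalCategory wd:Q24905 ;
    wikibase:lemma ?verb .
}
"""

print(extract_qid(SAMPLE_QUERY, LANGUAGE_PATTERN))
print(extract_qid(SAMPLE_QUERY, DATA_TYPE_PATTERN))
```

Working on the query text rather than a file path keeps the function easy to test. Reading the file can stay in the caller.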
This pattern will not work for all data types because our SPARQL queries do not share a consistent structure.
Getting the language QID is easy because that line has the same format in every query.

We can certainly split the current preposition and postposition queries, @DeleMike :) Do you want to send along a PR for that?
As for nouns and proper nouns, how do we feel about this? Should we split them too? CC @catreedle :)

Yes, I agree. We need to split them.
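
The concern raised in this thread can be illustrated with a small sketch. It collects every lexical category QID a query refers to, so that queries covering more than one category (candidates for splitting) can be told apart from queries a single-QID check can validate. Everything here is an assumption for illustration: the two query shapes, the helper name `lexical_category_qids`, and the QIDs in the sample strings are hypothetical and not taken from the repository's queries.

```python
import re
from typing import List

# Shape 1 (assumed): the category QID is written directly after the predicate.
DIRECT_CATEGORY = re.compile(r"wikibase:lexicalCategory\s+wd:(Q\d+)")
# Shape 2 (assumed): the predicate points at a variable bound by a VALUES clause.
VARIABLE_CATEGORY = re.compile(r"wikibase:lexicalCategory\s+(\?\w+)")
QID = re.compile(r"wd:(Q\d+)")


def lexical_category_qids(query_text: str) -> List[str]:
    """Return all lexical category QIDs referenced by the query text."""
    qids = DIRECT_CATEGORY.findall(query_text)
    for variable in VARIABLE_CATEGORY.findall(query_text):
        # Look up the VALUES clause that binds this variable, if there is one.
        values_clause = re.search(
            r"VALUES\s+" + re.escape(variable) + r"\s*\{([^}]*)\}", query_text
        )
        if values_clause:
            qids.extend(QID.findall(values_clause.group(1)))
    return qids


# Hypothetical query texts; the QIDs are placeholders for illustration.
SINGLE_QUERY = """
SELECT ?lexeme WHERE {
  ?lexeme dct:language wd:Q150 ;
    wikibase:lexicalCategory wd:Q4833830 .
}
"""
COMBINED_QUERY = """
SELECT ?lexeme WHERE {
  VALUES ?category { wd:Q4833830 wd:Q161873 }
  ?lexeme dct:language wd:Q150 ;
    wikibase:lexicalCategory ?category .
}
"""

for name, text in [("single", SINGLE_QUERY), ("combined", COMBINED_QUERY)]:
    qids = lexical_category_qids(text)
    status = "candidate for splitting" if len(qids) > 1 else "one category"
    print(f"{name}: {qids} ({status})")
```

If the queries were split so that each file names exactly one category, the direct pattern alone would be enough, and the check could compare that single QID with the one expected for the directory.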