
check for invalid language and data type QIDs #371

Merged

37 commits
594a5ac
check for invalid language and data type QIDs
DeleMike Oct 15, 2024
defab4d
Create query_adverbs.sparql
Otom-obhazi Oct 15, 2024
662a0f6
Remove select distinct from all queries
andrewtavis Oct 15, 2024
b5fecce
Create query_adverbs.sparql
Otom-obhazi Oct 15, 2024
ae15e77
Add filter for language
andrewtavis Oct 15, 2024
f5f7404
Create query_adverbs.sparql
Otom-obhazi Oct 15, 2024
e250233
Remove adverb file and prepare tests
andrewtavis Oct 15, 2024
52dca19
Re-add English adverbs
andrewtavis Oct 15, 2024
7dbf7b0
Add Chinese Mandarin adverbs, prepositions, adjectives and emoji keywords
VNW22 Oct 15, 2024
5a383f2
Update Mandarin prepositions query
VNW22 Oct 15, 2024
1942d09
Remove Mandarin Adverbs directory
VNW22 Oct 15, 2024
3d505a7
Create query_adverbs.sparql
Otom-obhazi Oct 15, 2024
a871de3
Create generate_emoji_keywords.py
Otom-obhazi Oct 15, 2024
318cceb
Add missing init file
andrewtavis Oct 15, 2024
52b7426
Create query_adverbs.sparql
Otom-obhazi Oct 15, 2024
e16dc24
Rename adverb directory
andrewtavis Oct 15, 2024
e0f0598
Create query_adjectives_1.sparql
KesharwaniArpita Oct 15, 2024
51d1f1d
Create query_adjective_2.sparql
KesharwaniArpita Oct 15, 2024
cc7b9e6
Create query_adjectives_3.sparql
KesharwaniArpita Oct 15, 2024
2fc8ed7
Rename query_adjective_2.sparql to query_adjectives_2.sparql
KesharwaniArpita Oct 15, 2024
0bd670e
Create query_adverbs.sparql
KesharwaniArpita Oct 15, 2024
f276d16
Create generate_emoji_keywords.py
KesharwaniArpita Oct 15, 2024
a577951
Add forms to adjectives query
andrewtavis Oct 15, 2024
adc061f
adding a sparql file in Tamil/adverbs for Tamil adverbs
OmarAI2003 Oct 15, 2024
7d0195b
simple sparql query for fetching Tamil adverbs from wikidata
OmarAI2003 Oct 15, 2024
7c3b037
Add vocative
andrewtavis Oct 15, 2024
ae2e662
fix lists of arguments to be validated
axif0 Oct 15, 2024
3e6835c
Minor formatting and edits to outputs
andrewtavis Oct 15, 2024
343ffdb
add workflow check_query_identifiers and dummy script #339
catreedle Oct 14, 2024
230fa58
Update workflow to trigger on future commits
catreedle Oct 15, 2024
408abc9
Deactivate workflow so it can be brought into other PRs
andrewtavis Oct 15, 2024
bf02ac8
Remove yaml from workflow name
andrewtavis Oct 15, 2024
08f6ed1
Update unicode docs
andrewtavis Oct 16, 2024
5fba72f
Update Sphinx RTD theme for docs
andrewtavis Oct 16, 2024
d37872c
Cleanup query validation logic: update data_type_pattern and clean up…
DeleMike Oct 16, 2024
620922e
Merge branch 'main' into feat/add-check-query-identifiers-script
DeleMike Oct 16, 2024
5e86265
Minor edits to script formatting
andrewtavis Oct 16, 2024
99 changes: 99 additions & 0 deletions src/scribe_data/check/check_query_identifiers.py
@@ -0,0 +1,99 @@
import re
from pathlib import Path

from scribe_data.cli.cli_utils import (
    LANGUAGE_DATA_EXTRACTION_DIR,
    language_metadata,
    data_type_metadata,
)


def extract_qid_from_sparql(file_path: Path, pattern: str) -> str:
    """
    Extract the QID based on the pattern provided (either language or data type).
    """
    try:
        with open(file_path, "r", encoding="utf-8") as file:
            content = file.read()
            match = re.search(pattern, content)
            if match:
                return match.group(0).replace("wd:", "")

    except Exception as e:
        print(f"Error reading {file_path}: {e}")

    return None


def check_queries():
    language_pattern = r"\?lexeme dct:language wd:Q\d+"
    data_type_pattern = r"wikibase:lexicalCategory wd:Q\d+"
Contributor Author:

This pattern will not work for all data types because we don't have a consistent SPARQL query structure. Getting the language QID is easy because it's the same format regardless.

Member:

We can certainly split the current preposition and postposition queries, @DeleMike :) Do you want to send along a PR for that?

As far as nouns and proper nouns, how are we feeling on this? Split them too? CC @catreedle :)

Contributor Author:

Yes, I agree. We need to split them.
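As a minimal sketch of the point above, the two regexes behave like this against a hypothetical query; the query text and QIDs below are illustrative, not taken from the repository:

import re

# Hypothetical query shaped like a typical single-category SPARQL file.
sample_query = """
SELECT ?lexeme ?noun WHERE {
  ?lexeme dct:language wd:Q150 ;
    wikibase:lexicalCategory wd:Q1084 ;
    wikibase:lemma ?noun .
}
"""

language_pattern = r"\?lexeme dct:language wd:Q\d+"
data_type_pattern = r"wikibase:lexicalCategory wd:Q\d+"

print(re.search(language_pattern, sample_query).group(0))   # ?lexeme dct:language wd:Q150
print(re.search(data_type_pattern, sample_query).group(0))  # wikibase:lexicalCategory wd:Q1084

# A combined query (e.g. prepositions and postpositions in one file) may declare
# the lexical category differently, so data_type_pattern can miss or mismatch it,
# while the dct:language line keeps the same shape across queries.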

    incorrect_languages = []
    incorrect_data_types = []

    language_extraction_dir = LANGUAGE_DATA_EXTRACTION_DIR
    for query_file in language_extraction_dir.glob("**/*.sparql"):
        lang_qid = extract_qid_from_sparql(query_file, language_pattern)
        data_type_qid = extract_qid_from_sparql(query_file, data_type_pattern)

        # Validate language QID and data type QID.
        if not is_valid_language(query_file, lang_qid):
            incorrect_languages.append(query_file)
        if not is_valid_data_type(query_file, data_type_qid):
            incorrect_data_types.append(query_file)

    if incorrect_languages:
        print("Queries with incorrect language QIDs are:")
        for file in incorrect_languages:
            print(f"- {file}")

    if incorrect_data_types:
        print("Queries with incorrect data type QIDs are:")
        for file in incorrect_data_types:
            print(f"- {file}")


def is_valid_language(query_file, lang_qid):
    lang_directory_name = query_file.parent.parent.name.lower()
    languages = language_metadata.get(
        "languages"
    )  # might not work since language_metadata file is not fully updated
Comment on lines +96 to +98

Contributor Author:

For language verification, the stumbling block might be the language_metadata.json file. If it is updated properly, then we won't have issues. This depends on #293.
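For context, the lookup below assumes language_metadata.json exposes a "languages" list whose entries carry "language" and "qid" keys; the entries in this sketch are an assumed shape for illustration, not the file's actual contents:

# Assumed (illustrative) shape of language_metadata.json as this check reads it.
language_metadata = {
    "languages": [
        {"language": "english", "qid": "Q1860"},
        {"language": "french", "qid": "Q150"},
    ]
}

# is_valid_language() looks the query's language directory up by name:
lang_directory_name = "french"
language_entry = next(
    (
        lang
        for lang in language_metadata["languages"]
        if lang["language"] == lang_directory_name
    ),
    None,
)
print(language_entry["qid"])  # Q150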

    language_entry = next(
        (lang for lang in languages if lang["language"] == lang_directory_name), None
    )

    if not language_entry:
        print(
            f"Warning: Language '{lang_directory_name}' not found in language_metadata.json."
        )
        return False

    expected_language_qid = language_entry["qid"]
    print("Expected language QID:", expected_language_qid)

    if lang_qid != expected_language_qid:
        print(
            f"Incorrect language QID in {lang_directory_name}. "
            f"Found: {lang_qid}, Expected: {expected_language_qid}"
        )
        return False
    return True


def is_valid_data_type(query_file, data_type_qid):
    directory_name = query_file.parent.name  # e.g., "nouns" or "verbs"
    expected_data_type_qid = data_type_metadata.get(directory_name)

    if data_type_qid != expected_data_type_qid:
        print(
            f"Warning: Incorrect data type QID in {query_file}. Found: {data_type_qid}, Expected: {expected_data_type_qid}"
        )
        return False
    return True


# Examples:

# file_path = Path("French/verbs/query_verbs.sparql")
# print(is_valid_data_type(file_path, "QW24907"))  # check the data type QID
# print(is_valid_language(file_path, "Q150"))  # check if the language QID is valid

check_queries()
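As a usage sketch, reusing the example path from the comments above and assuming the <language>/<data type>/<query>.sparql layout under LANGUAGE_DATA_EXTRACTION_DIR, the validators derive their expectations purely from a query file's location:

from pathlib import Path

file_path = Path("French/verbs/query_verbs.sparql")

print(file_path.parent.parent.name.lower())  # french -> key into language_metadata
print(file_path.parent.name)                 # verbs  -> key into data_type_metadata

# Importing or running this module calls check_queries(), which globs every
# **/*.sparql file under LANGUAGE_DATA_EXTRACTION_DIR and reports mismatches.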