Significantly enhance the safety of metadata manipulation #221
Codecov Report
All modified and coverable lines are covered by tests ✅

Additional details and impacted files
@@            Coverage Diff            @@
##              main      #221   +/-  ##
=========================================
  Coverage   100.00%   100.00%
=========================================
  Files           38        38
  Lines         2224      2369   +145
  Branches       426       448    +22
=========================================
+ Hits          2224      2369   +145
Wow, that's a lot of changes!
See the inline comments; maybe we should discuss them live once you've looked at them.
@@ -136,3 +136,8 @@ def compute_tags(
     return {
         tag.strip() for tag in list(default_tags) + (user_tags or "").split(";") if tag
     }


def unique_values(items: list) -> list:
Must be added to changelog
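The hunk above shows only the signature of `unique_values`; the PR description says it returns the unique values of a list while preserving the initial order. A minimal sketch of one way to do that, assuming the items are hashable (this is an illustration, not the code from the PR):

```python
from collections.abc import Hashable, Iterable


def unique_values(items: Iterable[Hashable]) -> list:
    """Return items with duplicates removed, keeping first-seen order.

    Assumption: every item is hashable. Sketch only, not the PR's code.
    """
    # dict keys are unique and keep insertion order (guaranteed since Python 3.7)
    return list(dict.fromkeys(items))


print(unique_values(["wikipedia", "_category:other", "wikipedia"]))
```

A plain `set()` would also remove duplicates, but it would lose the order, which is the property the description calls out.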
import zimscraperlib.zim.metadata
What's this for?
-        By default, all metadata are validated for compliance with openZIM guidelines and
-        conventions. Set disable_metadata_checks=True to disable this validation (you can
+        By default, all metadata are validated for compliance with openZIM specification and
+        conventions. Set check_metadata_conventions=True to disable this validation (you can
I guess:
-        conventions. Set check_metadata_conventions=True to disable this validation (you can
+        conventions. Set check_metadata_conventions=False to disable this validation (you can
         if indexing and not is_valid_iso_639_3(language):
             raise ValueError("Not a valid ISO-639-3 language code")
         super().config_indexing(indexing, language)
         self.__indexing_configured = True
         return self

     def _log_metadata(self):
-        """Log (DEBUG) all metadata set on (_metadata ~ config_metadata())
+        """Log in DEBUG level all metadata key and value"""
Do we still need this method?
It only covers the metadata present in `self._metadata` at the time of the call.
Wouldn't it make more sense to simply log calls to `add_metadata` instead? That would reflect more accurately what actually gets added.
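To make the suggestion concrete, here is a minimal sketch of logging inside `add_metadata` itself. `LoggingCreator` and the logger name are hypothetical stand-ins, not the zimscraperlib `Creator`; the real method also deals with mimetypes and the libzim, which are left out here.

```python
import logging

logger = logging.getLogger("metadata_sketch")  # hypothetical logger name


class LoggingCreator:
    """Hypothetical stand-in for a creator; not the zimscraperlib class."""

    def __init__(self) -> None:
        self.added: dict[str, bytes] = {}

    def add_metadata(self, name: str, value: bytes) -> None:
        # Log at the moment the metadata is added, so the log reflects
        # every call instead of a snapshot of an internal dict.
        logger.debug("Metadata: %s = %r", name, value)
        self.added[name] = value
```

With this shape, metadata added outside of `config_metadata` shows up in the DEBUG log as well, and no separate `_log_metadata` pass is needed.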
         name: str,
-        value: bytes | str,
+        value: Metadata,
         mimetype: str = "text/plain;charset=UTF-8",
Shouldn't this be handled in the `Metadata` itself? Feels weird to have to specify it here.
        super().__init__(name=name, value=value)


class _MandatoryTextMetadata(_TextMetadata):
I see no reason for this to be kept private
        return value


class _MandatoryMetadata(Metadata):
It's a shame that `_MandatoryMetadata` and `_MandatoryTextMetadata` are identical.
We could have used a mixin approach instead of single inheritance, even stacking checks; for our use case, it's good enough. Just wanted to acknowledge it.
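For reference, a minimal sketch of what the mixin approach could look like. All class names and the `validate` hook are hypothetical; the PR's classes are structured differently.

```python
class Metadata:
    """Hypothetical base class holding a raw bytes value."""

    def __init__(self, name: str, value: bytes) -> None:
        self.name = name
        self.value = self.validate(value)

    def validate(self, value: bytes) -> bytes:
        return value


class TextMetadata(Metadata):
    def validate(self, value: bytes) -> bytes:
        value = super().validate(value)
        value.decode("utf-8")  # UnicodeDecodeError is a ValueError subclass
        return value


class MandatoryMixin:
    """Check that can be stacked on top of any Metadata subclass."""

    def validate(self, value: bytes) -> bytes:
        # Cooperative super(): runs the checks of the next class in the MRO first
        value = super().validate(value)
        if not value:
            raise ValueError("Missing value for mandatory metadata")
        return value


class MandatoryMetadata(MandatoryMixin, Metadata):
    pass


class MandatoryTextMetadata(MandatoryMixin, TextMetadata):
    pass
```

The "mandatory" check is written once and composed with either base, at the cost of a method resolution order that readers have to keep in mind.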
        value = super().libzim_value
        if check_metadata_conventions:
            if nb_grapheme_for(value.decode()) > RECOMMENDED_MAX_TITLE_LENGTH:
                raise ValueError("Title is too long.")
Do we want to use this opportunity to raise custom exceptions (`ValueError` subclasses) for metadata validation errors?
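A minimal sketch of that idea, assuming hypothetical exception names and a placeholder limit of 30. It uses `len()` as a stand-in for the grapheme count that the PR gets from `nb_grapheme_for`, so it is not equivalent for titles with combining characters.

```python
RECOMMENDED_MAX_TITLE_LENGTH = 30  # assumed value, for illustration only


class MetadataValidationError(ValueError):
    """Hypothetical base class for metadata validation failures."""


class TitleTooLongError(MetadataValidationError):
    """Hypothetical error raised when a title exceeds the recommended length."""


def check_title(value: str) -> str:
    # len() counts code points, not graphemes: a simplification of the PR's check
    if len(value) > RECOMMENDED_MAX_TITLE_LENGTH:
        raise TitleTooLongError("Title is too long.")
    return value
```

Because the new classes derive from `ValueError`, existing callers that catch `ValueError` keep working, while new callers can catch `MetadataValidationError` specifically.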
class IllustrationMetadata(_MandatoryMetadata):
    """Any Illustration_**x**@* metadata"""

    def __init__(self, name: str, value: bytes | io.BytesIO) -> None:
Given that illustrations have a static name pattern, maybe we should not ask for a name but for the size and scale instead. That would remove the need for the pattern check.
We can imagine having a special `Default48IllustrationMetadata` or something similar that only takes the data and calls `super()` with 48x48 at scale 1, since that one is mandatory.
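A minimal sketch of building the name from size and scale, following the `Illustration_<width>x<height>@<scale>` pattern. The function name and its parameters are hypothetical, not part of the PR.

```python
from __future__ import annotations


def illustration_name(width: int, height: int | None = None, scale: int = 1) -> str:
    """Build an illustration metadata name from its dimensions and scale.

    Hypothetical helper: a missing height means a square illustration.
    """
    if height is None:
        height = width
    if width <= 0 or height <= 0 or scale <= 0:
        raise ValueError("width, height and scale must be positive integers")
    return f"Illustration_{width}x{height}@{scale}"


print(illustration_name(48))  # → Illustration_48x48@1
```

Since the name is generated rather than supplied, there is nothing left to validate against a pattern, which is the point of the suggestion.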
        )
        super().__init__(
            name="Language",
            value=",".join(value) if isinstance(value, list | set) else value,
Why are we not storing the original value in this case?
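A minimal sketch of keeping the caller's value next to the joined form. `LanguageMetadataSketch` and its attributes are hypothetical. Sorting sets is an assumption added here to make the output deterministic, since sets are unordered; the hunk above joins the value as-is.

```python
from __future__ import annotations


class LanguageMetadataSketch:
    """Hypothetical class: stores the original value and the joined string."""

    def __init__(self, value: list[str] | set[str] | str) -> None:
        self.name = "Language"
        self.original = value  # what the caller passed, untouched
        if isinstance(value, set):
            items = sorted(value)  # assumption: deterministic order for sets
        elif isinstance(value, list):
            items = value
        else:
            items = [value]
        self.value = ",".join(items)


print(LanguageMetadataSketch(["fra", "eng"]).value)  # → fra,eng
```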
Fix #205

This is a full rewrite of #217, so I've opened a new PR: looking at the changes since the last review no longer made sense from my PoV.

- `zim.metadata.check_metadata_conventions`
- `zim.creator.Creator.config_metadata`, by using these types and being more strict:
  - `StandardMetadata` class for standard metadata, including the list of mandatory ones
  - `X-` prefix
  - `fail_on_missing_prefix` argument
- `add_metadata`: use the same metadata types
- `zim.creator.Creator.start` with the new types, and drop all metadata from memory after they have been passed to the libzim
- `zim.creator.convert_and_check_metadata` (not useful anymore, simply use the proper metadata type)
- `MANDATORY_ZIM_METADATA_KEYS` and `DEFAULT_DEV_ZIM_METADATA` from `constants` to `zim.metadata`, to avoid circular dependencies
- `inputs.unique_values` utility function to compute the list of unique values from a given list, while preserving the initial list order
- `__init__` of `zim.creator.Creator`: rename `disable_metadata_checks` to `check_metadata_conventions` for clarity and brevity
- `zim.metadata.check_metadata_conventions`, so if you have many creators running in parallel, they can't have different settings; the last one initialized will "win"

Note:
- `tests/zim/test_zim_creator.py` to `tests/zim/test_metadata.py`, since most checks are now done at metadata initialization instead of when `config_metadata` or `start` are called, but coverage is similar
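The caveat about parallel creators can be illustrated with a minimal sketch. `settings`, `CreatorSketch` and `conventions_enabled` are hypothetical stand-ins for a module-level flag and the code reading it; they are not the PR's implementation.

```python
class _Settings:
    """Hypothetical stand-in for a module-level flag."""

    check_metadata_conventions = True


settings = _Settings()


class CreatorSketch:
    def __init__(self, *, check_metadata_conventions: bool = True) -> None:
        # The flag is written to shared state, not stored on the instance
        settings.check_metadata_conventions = check_metadata_conventions


def conventions_enabled() -> bool:
    # Every metadata check reads the same shared flag
    return settings.check_metadata_conventions


strict = CreatorSketch(check_metadata_conventions=True)
lenient = CreatorSketch(check_metadata_conventions=False)
print(conventions_enabled())  # → False: the last creator initialized "wins"
```

Here `strict` ends up running without convention checks too, which is the trade-off the description accepts.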