Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Significantly enhance the safety of metadata manipulation #221

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

benoit74
Copy link
Collaborator

@benoit74 benoit74 commented Nov 21, 2024

Fix #205

This is a full rewrite of #217, so I've opened a new PR since changes since last review made no more sense from my PoV.

  • add types for all metadata, one type per metadata name plus some generic ones for non-standard metadata
    • all types are responsible to validate metadata value at initialization time
    • validation checks for adherence to the ZIM specification and conventions are automated
    • cleanup of unwanted control characters and stripping white characters are automated in all text metadata
    • whenever possible, try to automatically clean a "reasonably" bad metadata (e.g. automaticall accept and remove duplicate tags - harmless - but not duplicate language codes - codes are supposed to be ordered, so it is a weird situation) ; this is an alignment of paradigm, because for some metadata the lib was permissive, while for other it was quite restrictive ; this PR tries to align this and make the lib as permissive as possible, avoiding to fail a scraper for something which could be automatically fixed
    • it is now possible to disable ZIM conventions checks with zim.metadata.check_metadata_conventions
  • simplify zim.creator.Creator.config_metadata by using these types and been more strict:
    • add new StandardMetadata class for standard metadata, including list of mandatory one
    • by default, all non-standard metadata must start with X- prefix
      • this not yet an openZIM convention / specification, so it is possible to disable this check with fail_on_missing_prefix argument
  • simplify add_metadata, use same metadata types
  • simplify zim.creator.Creator.start with new types, and drop all metadata from memory after being passed to the libzim
  • drop zim.creator.convert_and_check_metadata (not usefull anymore, simply use proper metadata type)
  • move MANDATORY_ZIM_METADATA_KEYS and DEFAULT_DEV_ZIM_METADATA from constants to zim.metadata to avoid circular dependencies
  • new inputs.unique_values utility function to compute the list of uniques values from a given list, but preserving initial list order
  • in __init__ of zim.creator.Creator, rename disable_metadata_checks to check_metadata_conventions for clarity and brevity
    • beware that this manipulate the global zim.metadata.check_metadata_conventions, so if you have many creator running in parallel, they can't have different settings, last one initialized will "win"

Nota:

  • I've moved many tests from tests/zim/test_zim_creator.py to tests/zim/test_metadata.py since most checks are now done at metadata initialization instead of when config_metadata or start are called, but coverage is similar

Copy link

codecov bot commented Nov 21, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 100.00%. Comparing base (5f92462) to head (298beef).

Additional details and impacted files
@@            Coverage Diff             @@
##              main      #221    +/-   ##
==========================================
  Coverage   100.00%   100.00%            
==========================================
  Files           38        38            
  Lines         2224      2369   +145     
  Branches       426       448    +22     
==========================================
+ Hits          2224      2369   +145     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.


🚨 Try these New Features:

@benoit74 benoit74 marked this pull request as ready for review November 22, 2024 07:47
Copy link
Member

@rgaudin rgaudin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow that's a lot of change!
See inline comments ; maybe we should discuss it live once you've looked at it

@@ -136,3 +136,8 @@ def compute_tags(
return {
tag.strip() for tag in list(default_tags) + (user_tags or "").split(";") if tag
}


def unique_values(items: list) -> list:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Must be added to changelog


import zimscraperlib.zim.metadata
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's this for?

By default, all metadata are validated for compliance with openZIM guidelines and
conventions. Set disable_metadata_checks=True to disable this validation (you can
By default, all metadata are validated for compliance with openZIM specification and
conventions. Set check_metadata_conventions=True to disable this validation (you can
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess

Suggested change
conventions. Set check_metadata_conventions=True to disable this validation (you can
conventions. Set check_metadata_conventions=False to disable this validation (you can

if indexing and not is_valid_iso_639_3(language):
raise ValueError("Not a valid ISO-639-3 language code")
super().config_indexing(indexing, language)
self.__indexing_configured = True
return self

def _log_metadata(self):
"""Log (DEBUG) all metadata set on (_metadata ~ config_metadata())
"""Log in DEBUG level all metadata key and value"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we still need this method?
It's only limited to the metadata in self._metadata at the time of the call.
Wouldn't it make more sense to simply log calls to add_metadata instead? That would be more true in a sense

name: str,
value: bytes | str,
value: Metadata,
mimetype: str = "text/plain;charset=UTF-8",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be handled in the Metadata itself. Feels weird to have to specify it here

super().__init__(name=name, value=value)


class _MandatoryTextMetadata(_TextMetadata):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see no reason for this to be kept private

return value


class _MandatoryMetadata(Metadata):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a shame both _MandatoryMetadata and _MandatoryTextMetadata are identical.
We could have used a MixIn approach instead of single inheritance ; even stacking checks for for our use case, it's good enough. Just wanted to acknowledge it.

value = super().libzim_value
if check_metadata_conventions:
if nb_grapheme_for(value.decode()) > RECOMMENDED_MAX_TITLE_LENGTH:
raise ValueError("Title is too long.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to use this opportunity to raise custom Exceptions (ValueError subclass) for metadata validation error?

class IllustrationMetadata(_MandatoryMetadata):
"""Any Illustration_**x**@* metadata"""

def __init__(self, name: str, value: bytes | io.BytesIO) -> None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given Illustrations have a static name pattern, maybe we should not ask for a name but sizes and scale instead. That would remove the need for the pattern check.
We can imagine having a special Default48IllustrationMetadata or something that only takes data and supers this with 48x48 at 1 ; since that's mandatory

)
super().__init__(
name="Language",
value=",".join(value) if isinstance(value, list | set) else value,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we not storing original in this case?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[next major] remove **extra from Creator.config_metadata
2 participants