Significantly enhance the safety of metadata manipulation
benoit74 committed Nov 21, 2024
1 parent 5f92462 commit 6033f68
Showing 10 changed files with 1,271 additions and 557 deletions.
17 changes: 17 additions & 0 deletions CHANGELOG.md
@@ -11,6 +11,23 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

- Renamed `filesystem.validate_zimfile_creatable` to `filesystem.file_creatable` to reflect general applicability to check file creation beyond ZIM files #200
- Remove any "ZIM" reference in exceptions while working with files #200
- Significantly enhance the safety of metadata manipulation (#205)
  - add types for all metadata, one type per metadata name plus some generic ones for non-standard metadata
  - all types are responsible for validating the metadata value at initialization time
  - validation checks for adherence to the ZIM specification and conventions are automated
  - cleanup of unwanted control characters and stripping of whitespace characters are automated
  - whenever possible, try to clean up "reasonably" bad metadata (e.g. automatically accept and remove duplicate tags, which is harmless, but not duplicate language codes: codes are supposed to be ordered, so duplicates are a weird situation)
  - it is now possible to disable ZIM conventions checks with `zim.metadata.check_metadata_conventions`
  - simplify `zim.creator.Creator.config_metadata` by using these types and being more strict:
    - add new `StandardMetadata` class for standard metadata, including the list of mandatory ones
    - by default, all non-standard metadata must start with the `X-` prefix
      - this is not yet an openZIM convention / specification, so it is possible to disable this check with the `fail_on_missing_prefix` argument
  - simplify `add_metadata`, use same metadata types
  - simplify `zim.creator.Creator.start` with new types, and drop all metadata from memory after being passed to the libzim
  - drop `zim.creator.convert_and_check_metadata` (not useful anymore, simply use the proper metadata type)
  - move `MANDATORY_ZIM_METADATA_KEYS` and `DEFAULT_DEV_ZIM_METADATA` from `constants` to `zim.metadata` to avoid circular dependencies
  - new `inputs.unique_values` utility function to compute the list of unique values from a given list while preserving the initial list order
  - in `__init__` of `zim.creator.Creator`, rename `disable_metadata_checks` to `check_metadata_conventions` for clarity and brevity
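
The "validate and clean at initialization time" idea in the entries above can be illustrated with a minimal sketch. This is not the library's implementation: `clean_text`, `TagsSketch` and `LanguageSketch` are hypothetical names invented for this example, and the cleaning rule (drop every Unicode control character) is an assumption, not the actual rule applied by `zim.metadata`.

```python
from __future__ import annotations

import unicodedata


def clean_text(value: str) -> str:
    """Hypothetical helper: drop control characters, then strip whitespace."""
    kept = "".join(ch for ch in value if unicodedata.category(ch) != "Cc")
    return kept.strip()


class TagsSketch:
    """Hypothetical metadata type: the value is validated and cleaned in __init__."""

    def __init__(self, value: list[str]) -> None:
        if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
            raise ValueError("Tags must be a list of strings")
        cleaned = [tag for tag in (clean_text(v) for v in value) if tag]
        # duplicate tags are harmless: remove them silently, keeping first-seen order
        self.value = list(dict.fromkeys(cleaned))


class LanguageSketch:
    """Hypothetical metadata type: duplicates are rejected instead of cleaned."""

    def __init__(self, value: list[str]) -> None:
        cleaned = [clean_text(v) for v in value]
        if not cleaned or not all(cleaned):
            raise ValueError("Language needs at least one non-empty code")
        # language codes are ordered, so a duplicate is treated as an error
        if len(set(cleaned)) != len(cleaned):
            raise ValueError("Language codes must not be duplicated")
        self.value = cleaned


print(TagsSketch(["wikipedia", " wikipedia ", "science"]).value)
# ['wikipedia', 'science']
```

With this design a caller never holds an unvalidated metadata object: construction either succeeds with a cleaned value or raises.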

### Added

29 changes: 0 additions & 29 deletions src/zimscraperlib/constants.py
@@ -1,7 +1,6 @@
#!/usr/bin/env python3
# vim: ai ts=4 sts=4 et sw=4 nu

import base64
import pathlib
import re

@@ -21,34 +20,6 @@
# list of mimetypes we consider articles using it should default to FRONT_ARTICLE
FRONT_ARTICLE_MIMETYPES = ["text/html"]

# list of mandatory meta tags of the zim file.
MANDATORY_ZIM_METADATA_KEYS = [
    "Name",
    "Title",
    "Creator",
    "Publisher",
    "Date",
    "Description",
    "Language",
    "Illustration_48x48@1",
]

DEFAULT_DEV_ZIM_METADATA = {
    "Name": "Test Name",
    "Title": "Test Title",
    "Creator": "Test Creator",
    "Publisher": "Test Publisher",
    "Date": "2023-01-01",
    "Description": "Test Description",
    "Language": "fra",
    # blank 48x48 transparent PNG
    "Illustration_48x48_at_1": base64.b64decode(
        "iVBORw0KGgoAAAANSUhEUgAAADAAAAAwAQMAAABtzGvEAAAAGXRFWHRTb2Z0d2FyZQBB"
        "ZG9iZSBJbWFnZVJlYWR5ccllPAAAAANQTFRFR3BMgvrS0gAAAAF0Uk5TAEDm2GYAAAAN"
        "SURBVBjTY2AYBdQEAAFQAAGn4toWAAAAAElFTkSuQmCC"
    ),
}

RECOMMENDED_MAX_TITLE_LENGTH = 30
MAXIMUM_DESCRIPTION_METADATA_LENGTH = 80
MAXIMUM_LONG_DESCRIPTION_METADATA_LENGTH = 4000
5 changes: 5 additions & 0 deletions src/zimscraperlib/inputs.py
@@ -136,3 +136,8 @@ def compute_tags(
    return {
        tag.strip() for tag in list(default_tags) + (user_tags or "").split(";") if tag
    }


def unique_values(items: list) -> list:
    """Return unique values in input list while preserving list order"""
    return list(dict.fromkeys(items))
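
A short usage check of the new helper may be useful here. The function body is copied from the hunk above so the snippet runs on its own; the sample values are made up for illustration. It relies on `dict` preserving insertion order (guaranteed since Python 3.7), and it assumes every item is hashable.

```python
def unique_values(items: list) -> list:
    """Return unique values in input list while preserving list order"""
    return list(dict.fromkeys(items))


codes = ["fra", "eng", "fra", "deu", "eng"]
print(unique_values(codes))  # ['fra', 'eng', 'deu']

# Unhashable items (e.g. nested lists) cannot be dict keys, so they raise TypeError.
try:
    unique_values([["fra"], ["fra"]])
except TypeError as exc:
    print(f"unhashable input rejected: {exc}")
```

Unlike `list(set(items))`, this keeps the first occurrence of each value in its original position, which matters for ordered metadata.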