v1.20.3-at.20200401.01

Wget-AT 20200401.01 (Wget 1.20.3-at.20200401.01) Release Notes

This is the first official release of Wget-AT as a continuation of Wget-Lua. Wget-AT takes Wget-Lua in a new direction, adding more modern features for web archiving on top of the existing Lua scripting.

This release adds support for Zstandard compression with dictionaries, implements URL-agnostic deduplication, and moves to version 1.1 of the WARC format.

WARC/1.1

Version 1.1 of the WARC format (https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/) introduces a number of new fields and corrects a number of erroneous recommendations in version 1.0 of the format.

The notable changes to version 1.1 WARCs created with 1.20.3-at.20200401.01 compared to 1.0 WARCs created with previous versions are the addition of

  • the WARC-Refers-To-Target-URI header and
  • the WARC-Refers-To-Date header

for WARC revisit records. The version noted in the WARC records is now WARC/1.1 instead of WARC/1.0.

Zstandard with dictionary

Normally, following the WARC/1.1 standard, WARC records are compressed using Zlib, creating .warc.gz files. Every record is compressed individually. If a WARC file stores many webpages with overlapping content, every compressed record carries its own copy of that overlap, so the redundancy between records survives compression. With a dictionary that the overlapping parts can reference, most of the overlap is compressed away, so records compressed with Zstandard and a dictionary carry much less size overhead.
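The effect of a shared dictionary on individually compressed records can be sketched with Python's standard library. This is only an illustration: it uses Zlib's preset-dictionary feature as a stand-in for Zstandard dictionaries, and the SHARED markup and page contents are made up for the example.

```python
import zlib

# Hypothetical markup shared by every page of a site (navigation, scripts, ...).
SHARED = (
    b"<html><head><link rel='stylesheet' href='/static/site.css'>"
    b"<script src='/static/app.js'></script></head><body>"
    b"<nav><a href='/'>Home</a><a href='/about'>About</a>"
    b"<a href='/contact'>Contact</a></nav>"
)


def compress_record(record: bytes, dictionary: bytes = b"") -> bytes:
    """Compress one record on its own, optionally with a preset dictionary."""
    if dictionary:
        compressor = zlib.compressobj(level=9, zdict=dictionary)
    else:
        compressor = zlib.compressobj(level=9)
    return compressor.compress(record) + compressor.flush()


def decompress_record(data: bytes, dictionary: bytes = b"") -> bytes:
    """Reverse compress_record; the same dictionary must be supplied."""
    if dictionary:
        decompressor = zlib.decompressobj(zdict=dictionary)
    else:
        decompressor = zlib.decompressobj()
    return decompressor.decompress(data) + decompressor.flush()


# Three records that differ only in a short unique part.
pages = [SHARED + b"<h1>Page %d</h1></body></html>" % i for i in range(3)]
plain = [compress_record(page) for page in pages]
with_dict = [compress_record(page, SHARED) for page in pages]

for p, d in zip(plain, with_dict):
    print(f"without dictionary: {len(p)} bytes, with dictionary: {len(d)} bytes")
```

Without the dictionary, every record pays for the shared markup again. With it, each record mostly encodes references into the dictionary plus its own unique part.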

Implementation

The implementation of Zstandard with dictionary compression has been created in cooperation with the Internet Archive to allow playback of Zstandard-compressed WARCs through the Wayback Machine. WARCs created with Zstandard compression have the extension .warc.zst, similar to .warc.gz when Zlib compression is used.

Zstandard can be used both with and without a dictionary. Without a dictionary, Zstandard has been shown to perform better than many other compression algorithms, such as the Zlib compression normally used for WARC records. Adding a dictionary allows records to be compressed to smaller sizes and, with a well-trained dictionary, allows data that overlaps between records to be compressed away.

Zstandard supports skippable frames, which let arbitrary user data be placed between regular frames in a frame of its own. Software handling Zstandard-compressed files normally skips such a frame. A skippable frame (see https://facebook.github.io/zstd/zstd_manual.html for details) consists of, in listed order,

  • the skippable frame ID with values between 0x184D2A50 and 0x184D2A5F, in little endian format,
  • the frame size in 4 bytes, in little endian format, and
  • the content of the frame.
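The layout listed above can be sketched in a few lines of Python. The function names build_skippable_frame and parse_skippable_frame are hypothetical helpers for this illustration, not part of Wget-AT or of any Zstandard library.

```python
import struct

SKIPPABLE_ID_MIN = 0x184D2A50
SKIPPABLE_ID_MAX = 0x184D2A5F


def build_skippable_frame(content: bytes, frame_id: int = SKIPPABLE_ID_MIN) -> bytes:
    """Serialize a skippable frame: ID, size, then the content."""
    if not SKIPPABLE_ID_MIN <= frame_id <= SKIPPABLE_ID_MAX:
        raise ValueError("frame ID is outside the skippable range")
    # "<II" packs two unsigned 32-bit integers in little-endian order.
    return struct.pack("<II", frame_id, len(content)) + content


def parse_skippable_frame(data: bytes):
    """Return (frame_id, content, remaining_bytes) for a leading skippable frame."""
    if len(data) < 8:
        raise ValueError("not enough bytes for a skippable frame header")
    frame_id, size = struct.unpack_from("<II", data, 0)
    if not SKIPPABLE_ID_MIN <= frame_id <= SKIPPABLE_ID_MAX:
        raise ValueError("data does not start with a skippable frame")
    if len(data) < 8 + size:
        raise ValueError("frame content is truncated")
    return frame_id, data[8:8 + size], data[8 + size:]


frame = build_skippable_frame(b"hello", 0x184D2A5D)
print(frame.hex())  # ID bytes, then size bytes, then the content
```

A reader that does not care about the user data only needs the 4-byte size to jump over the frame.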

The dictionary in use can be stored in a skippable frame with frame ID 0x184D2A5D as the very first frame of the WARC file. By default the Zstandard dictionary is compressed with Zstandard before being added as the content of the skippable frame, unless option --warc-zstd-dict-no-compression is given to prevent compression of the dictionary before storing it. To prevent the dictionary from being included at the start of the resulting WARC file, option --warc-zstd-dict-no-include should be used.
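A reader of a .warc.zst file could check for an embedded dictionary along the following lines. This is a minimal sketch: read_embedded_dictionary is a hypothetical name, and telling a compressed dictionary from an uncompressed one by its leading magic number is an assumption based on the Zstandard format (0xFD2FB528 for a Zstandard frame, 0xEC30A437 for a dictionary), not something these notes specify.

```python
import struct

DICT_FRAME_ID = 0x184D2A5D
# Magic numbers from the Zstandard format, as stored on disk (little endian).
ZSTD_FRAME_MAGIC = b"\x28\xb5\x2f\xfd"  # 0xFD2FB528
ZSTD_DICT_MAGIC = b"\x37\xa4\x30\xec"   # 0xEC30A437


def read_embedded_dictionary(warc_bytes: bytes):
    """Return (kind, content) for a leading dictionary frame, or None if absent.

    kind is "compressed", "uncompressed", or "unknown" (assumed classification).
    """
    if len(warc_bytes) < 8:
        return None
    frame_id, size = struct.unpack_from("<II", warc_bytes, 0)
    if frame_id != DICT_FRAME_ID:
        return None  # e.g. written with --warc-zstd-dict-no-include
    content = warc_bytes[8:8 + size]
    if len(content) != size:
        raise ValueError("dictionary frame is truncated")
    if content.startswith(ZSTD_FRAME_MAGIC):
        kind = "compressed"    # decompress with Zstandard before use
    elif content.startswith(ZSTD_DICT_MAGIC):
        kind = "uncompressed"  # usable as a dictionary directly
    else:
        kind = "unknown"
    return kind, content
```

Decompressing the "compressed" case needs a Zstandard library, which is left out here to keep the sketch free of third-party dependencies.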

--warc-compression-use-zstd

Use Zstandard instead of Zlib compression for compressing WARC records. To use a Zstandard dictionary as well, use option --warc-zstd-dict=FILENAME.

--warc-zstd-dict=FILENAME

The Zstandard dictionary to use for compression. Option --warc-compression-use-zstd needs to be used in order to use this option.

By default the dictionary is compressed with Zstandard and included at the beginning of the WARC file, unless options --warc-zstd-dict-no-compression or --warc-zstd-dict-no-include, respectively, are used.

--warc-zstd-dict-no-include

Prevent the used Zstandard dictionary from being included in a skippable frame at the start of the WARC file. Option --warc-zstd-dict=FILENAME needs to be used in order to use this option.

It can be useful to not include the dictionary if many separate WARCs are created using the same dictionary. Storing the dictionary in every WARC adds size overhead. Instead, it may be useful to store the Zstandard dictionary separately.

--warc-zstd-dict-no-compression

Prevent the compression of the used Zstandard dictionary with Zstandard before writing it to the skippable frame. Option --warc-zstd-dict=FILENAME needs to be used in order to use this option.

Zstandard dictionaries themselves are not compressed, and compressing one can often shrink the skippable frame by tens of percent compared with storing the dictionary uncompressed. Not compressing the dictionary might improve performance, as no decompression needs to take place in order to use the dictionary.
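The dependencies between the four Zstandard options can be summarised in a small helper that assembles the command-line arguments. The function zstd_warc_options and its parameter names are hypothetical; only the option strings come from these notes, and whether Wget-AT itself reports missing prerequisites as errors is not stated here.

```python
def zstd_warc_options(use_zstd=False, dict_path=None,
                      include_dict=True, compress_dict=True):
    """Build the Zstandard-related WARC options, checking their prerequisites."""
    if dict_path is not None and not use_zstd:
        raise ValueError("--warc-zstd-dict requires --warc-compression-use-zstd")
    if dict_path is None and not (include_dict and compress_dict):
        raise ValueError("the no-include and no-compression options require --warc-zstd-dict")

    options = []
    if use_zstd:
        options.append("--warc-compression-use-zstd")
    if dict_path is not None:
        options.append(f"--warc-zstd-dict={dict_path}")
        if not include_dict:
            # Keep the dictionary out of the WARC, e.g. when many WARCs share it.
            options.append("--warc-zstd-dict-no-include")
        if not compress_dict:
            # Store the dictionary uncompressed in the skippable frame.
            options.append("--warc-zstd-dict-no-compression")
    return options


print(zstd_warc_options(use_zstd=True, dict_path="site.dict", include_dict=False))
```

The resulting list would be placed among the other arguments of the Wget-AT invocation. The dictionary file name site.dict is a placeholder.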

Deduplication

With deduplication on WARC records, a response record can be converted to a revisit record if it is found to be a duplicate of another record. In accordance with version 1.1 of the WARC format, the headers

  • WARC-Refers-To, referring to WARC-Record-ID of the original record,
  • WARC-Refers-To-Target-URI, referring to WARC-Target-URI of the original record,
  • WARC-Refers-To-Date, referring to WARC-Date of the original record,
  • WARC-Profile, with value http://netpreserve.org/warc/1.1/revisit/identical-payload-digest, and
  • WARC-Truncated, with value length,

are added, and the WARC-Type header is set to revisit. WARC-Block-Digest is set to the digest of the truncated data, and WARC-Payload-Digest is the digest of the original payload.
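The header changes described above can be sketched as a function that derives the revisit headers from the headers of the original record. The function name revisit_headers, the dictionary representation of headers, and all sample values are assumptions made for this illustration.

```python
REVISIT_PROFILE = "http://netpreserve.org/warc/1.1/revisit/identical-payload-digest"


def revisit_headers(original: dict, payload_digest: str, truncated_block_digest: str) -> dict:
    """Headers set on a revisit record that refers back to `original`."""
    return {
        "WARC-Type": "revisit",
        "WARC-Refers-To": original["WARC-Record-ID"],
        "WARC-Refers-To-Target-URI": original["WARC-Target-URI"],
        "WARC-Refers-To-Date": original["WARC-Date"],
        "WARC-Profile": REVISIT_PROFILE,
        "WARC-Truncated": "length",
        # Digest of the data actually stored in the (truncated) revisit record.
        "WARC-Block-Digest": truncated_block_digest,
        # Digest of the original, full payload.
        "WARC-Payload-Digest": payload_digest,
    }


# Made-up example values.
original = {
    "WARC-Record-ID": "<urn:uuid:00000000-0000-0000-0000-000000000001>",
    "WARC-Target-URI": "http://example.com/a",
    "WARC-Date": "2020-04-01T00:00:00Z",
}
headers = revisit_headers(original, "sha1:PAYLOADDIGEST", "sha1:BLOCKDIGEST")
for name, value in headers.items():
    print(f"{name}: {value}")
```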

With this release, URL-agnostic deduplication is supported for WARC records in a single Wget session with the --warc-dedup-url-agnostic option. URL-gnostic deduplication, which also requires the WARC-Target-URI to match, is used by default for WARC writing, unless disabled with --warc-dedup-disable.

--warc-dedup-url-agnostic

Allow URL-agnostic deduplication of WARC records in the same Wget session.

With URL-agnostic deduplication, a response record is converted into a revisit record when only the WARC-Payload-Digest matches that of a previously written record. Other WARC headers, like WARC-Target-URI, do not have to be equal in order for a revisit record to be written.

--warc-dedup-min-size=NUMBER

The minimum size in bytes a payload must have before it is deduplicated. The default value is 100.

When a response record is converted to a revisit record, a number of fields are added. The value of --warc-dedup-min-size is used to determine when it is 'worth it' to write a revisit record instead of the original, given the increase or decrease in size, performance, and other factors.

--warc-dedup-disable

Disable URL-gnostic deduplication. This deduplication is turned on by default.

URL-gnostic deduplication converts a response record into a revisit record when another record was previously written with equal values for the WARC-Payload-Digest and WARC-Target-URI WARC headers.
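The difference between the two deduplication modes comes down to which values form the lookup key. The following is a minimal sketch, not Wget-AT's implementation: the Deduplicator class is hypothetical, and it assumes that a payload of exactly the minimum size is eligible and that payloads below the minimum size are never remembered.

```python
from typing import Dict, Hashable, Optional, Tuple


class Deduplicator:
    """Tracks written records within one session (hypothetical helper)."""

    def __init__(self, url_agnostic: bool = False, min_size: int = 100,
                 enabled: bool = True) -> None:
        self.url_agnostic = url_agnostic  # --warc-dedup-url-agnostic
        self.min_size = min_size          # --warc-dedup-min-size=NUMBER
        self.enabled = enabled            # False models --warc-dedup-disable
        self._seen: Dict[Hashable, Tuple[str, str]] = {}

    def check(self, record_id: str, payload_digest: str, target_uri: str,
              payload_size: int) -> Optional[Tuple[str, str]]:
        """Return (record_id, target_uri) of the original if this is a duplicate.

        Returns None when a full response record should be written instead.
        """
        if not self.enabled or payload_size < self.min_size:
            return None
        # URL-agnostic: the digest alone decides. URL-gnostic: digest and URI.
        key: Hashable = (payload_digest if self.url_agnostic
                         else (payload_digest, target_uri))
        original = self._seen.get(key)
        if original is None:
            self._seen[key] = (record_id, target_uri)
        return original


gnostic = Deduplicator()
gnostic.check("id-1", "sha1:AAA", "http://example.com/a", 500)
print(gnostic.check("id-2", "sha1:AAA", "http://example.com/b", 500))  # different URI

agnostic = Deduplicator(url_agnostic=True)
agnostic.check("id-1", "sha1:AAA", "http://example.com/a", 500)
print(agnostic.check("id-2", "sha1:AAA", "http://example.com/b", 500))  # digest match
```

In the URL-gnostic case the second call finds no match because the URI differs. In the URL-agnostic case it returns the first record, which would then be referenced by the revisit record.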