v1.20.3-at.20200401.01
Wget-AT 20200401.01 (Wget 1.20.3-at.20200401.01) Release Notes
This is the first official release of Wget-AT as a continuation of Wget-Lua. Wget-AT is a new direction for Wget-Lua that adds more modern features for web archiving, in addition to the already implemented Lua scripting.
This release adds support for Zstandard with dictionary compression, implements URL-agnostic deduplication and moves to version 1.1 of the WARC format.
WARC/1.1
Version 1.1 of the WARC format (https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/) adds a number of new fields and corrects a number of erroneous recommendations in version 1.0 of the format.
The notable changes to version 1.1 WARCs created with 1.20.3-at.20200401.01 compared to 1.0 WARCs created with previous versions are the addition of
- the WARC-Refers-To-Target-URI header and
- the WARC-Refers-To-Date header
for WARC revisit records. The version noted in the WARC records is now WARC/1.1 instead of WARC/1.0.
Zstandard with dictionary
Normally, according to the standard for WARC/1.1, WARC records are compressed using Zlib, creating .warc.gz files. Every record is compressed individually. If a WARC file stores many web pages that share content, the shared parts are compressed again in every record, so the overlap between the pages carries over to the compressed records. With a dictionary in which these overlapping parts can be referenced, the overlap can be largely compressed away, which makes records compressed with Zstandard and a dictionary considerably smaller.
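The effect of a shared dictionary on individually compressed records can be illustrated with a minimal sketch. It deliberately uses the preset-dictionary feature of Python's standard zlib module as a stand-in, not Zstandard itself, so the numbers say nothing about what Wget-AT achieves. The boilerplate and the records are made up for the illustration.

```python
import zlib

# Hypothetical records: every page shares the same boilerplate and differs
# only in a short unique part.
BOILERPLATE = (
    b"<html><head><title>Example site</title>"
    b"<link rel='stylesheet' href='/static/site.css'></head>"
    b"<body><nav>Home | About | Contact</nav><main>"
)
records = [BOILERPLATE + b"page %d" % i + b"</main></body></html>" for i in range(50)]


def compress_each(records, zdict=None):
    """Compress every record on its own and return the total compressed size."""
    total = 0
    for record in records:
        if zdict is None:
            compressor = zlib.compressobj()
        else:
            # The preset dictionary lets the compressor reference shared bytes
            # that are not part of the record itself.
            compressor = zlib.compressobj(zdict=zdict)
        total += len(compressor.compress(record) + compressor.flush())
    return total


without_dict = compress_each(records)
with_dict = compress_each(records, zdict=BOILERPLATE)
print("without dictionary:", without_dict, "bytes")
print("with dictionary:   ", with_dict, "bytes")
```

Because each record is compressed separately, the shared boilerplate is paid for 50 times without a dictionary, while with a dictionary each record only needs a short reference to it.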
Implementation
The implementation of Zstandard with dictionary compression was created in cooperation with the Internet Archive to allow playback of Zstandard-compressed WARCs through the Wayback Machine. WARCs created with Zstandard compression have the extension .warc.zst, similar to .warc.gz when Zlib compression is used.
Zstandard can be used both with and without a dictionary. Even without a dictionary, Zstandard has been shown to perform better than many other compression algorithms, such as the Zlib compression normally used for WARC records. Adding a dictionary allows records to be compressed to smaller sizes and, with a well-trained dictionary, allows data that overlaps between records to be compressed away.
Zstandard allows for skippable frames, which allow for any user data to be added between frames in an additional frame. This frame is normally skipped by software handling Zstandard compressed files. The skippable frame (see https://facebook.github.io/zstd/zstd_manual.html for details) consists of, in listed order,
- the skippable frame ID with values between 0x184D2A50 and 0x184D2A5F, in little endian format,
- the frame size in 4 bytes, in little endian format, and
- the content of the frame.
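The three-part layout listed above can be sketched with Python's struct module. This is an illustration of the frame layout only, not code taken from Wget-AT; the function names are hypothetical.

```python
import struct

SKIPPABLE_ID_MIN = 0x184D2A50
SKIPPABLE_ID_MAX = 0x184D2A5F
HEADER_SIZE = 8  # 4-byte frame ID + 4-byte frame size


def build_skippable_frame(frame_id: int, content: bytes) -> bytes:
    """Serialize a skippable frame: ID, size, then content (hypothetical helper)."""
    if not SKIPPABLE_ID_MIN <= frame_id <= SKIPPABLE_ID_MAX:
        raise ValueError("frame ID outside the skippable range")
    # "<II" packs two unsigned 32-bit integers in little endian order.
    return struct.pack("<II", frame_id, len(content)) + content


def parse_skippable_frame(data: bytes):
    """Return (frame_id, content, bytes_consumed) for a frame at the start of data."""
    if len(data) < HEADER_SIZE:
        raise ValueError("not enough data for a frame header")
    frame_id, size = struct.unpack_from("<II", data, 0)
    if not SKIPPABLE_ID_MIN <= frame_id <= SKIPPABLE_ID_MAX:
        raise ValueError("not a skippable frame")
    content = data[HEADER_SIZE:HEADER_SIZE + size]
    if len(content) != size:
        raise ValueError("truncated frame content")
    return frame_id, content, HEADER_SIZE + size


frame = build_skippable_frame(0x184D2A5D, b"user data")
print(frame.hex())
print(parse_skippable_frame(frame + b"following frames"))
```

A reader that does not care about the user data only needs the size field to jump over the frame, which is why software handling Zstandard files can skip it.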
The dictionary that was used can be stored in a skippable frame with frame ID 0x184D2A5D as the very first frame of the WARC file. By default the Zstandard dictionary is compressed with Zstandard before it is added as the content of the skippable frame, unless option --warc-zstd-dict-no-compression is given to prevent compression of the dictionary before storing it. To prevent the dictionary from being included at the start of the resulting WARC file, option --warc-zstd-dict-no-include should be used.
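A reader could check whether a WARC file starts with such a dictionary frame along the following lines. This is a sketch under assumptions: it uses the magic numbers from the Zstandard format (0xFD2FB528 for a regular frame, 0xEC30A437 for a trained dictionary) to guess whether the stored dictionary was compressed, which is one plausible way to tell the two cases apart and not necessarily how Wget-AT or the Wayback Machine does it. The function name is hypothetical.

```python
import struct

DICT_FRAME_ID = 0x184D2A5D     # skippable frame ID used for the dictionary
ZSTD_FRAME_MAGIC = 0xFD2FB528  # magic number of a regular Zstandard frame
ZSTD_DICT_MAGIC = 0xEC30A437   # magic number of a trained Zstandard dictionary


def inspect_dictionary_frame(data: bytes):
    """Return (kind, content) for a leading dictionary frame, or None if absent.

    kind is "zstd-compressed", "uncompressed" or "unknown" (assumed labels).
    """
    if len(data) < 8:
        return None
    frame_id, size = struct.unpack_from("<II", data, 0)
    if frame_id != DICT_FRAME_ID:
        return None  # the file does not start with a dictionary frame
    content = data[8:8 + size]
    if len(content) != size:
        raise ValueError("truncated dictionary frame")
    magic = struct.unpack_from("<I", content, 0)[0] if size >= 4 else None
    if magic == ZSTD_FRAME_MAGIC:
        kind = "zstd-compressed"   # dictionary must be decompressed before use
    elif magic == ZSTD_DICT_MAGIC:
        kind = "uncompressed"      # dictionary can be used as is
    else:
        kind = "unknown"
    return kind, content
```

When the function returns None, the dictionary was presumably left out of the file and has to be obtained separately before the records can be decompressed.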
--warc-compression-use-zstd
Use Zstandard instead of Zlib compression for compressing WARC records. To use a Zstandard dictionary as well, use option --warc-zstd-dict=FILENAME.
--warc-zstd-dict=FILENAME
The Zstandard dictionary to use for compression. Option --warc-compression-use-zstd needs to be used in order to use this option.
By default the dictionary is compressed with Zstandard and included at the beginning of the WARC file, unless options --warc-zstd-dict-no-compression or --warc-zstd-dict-no-include, respectively, are used.
--warc-zstd-dict-no-include
Prevent the Zstandard dictionary in use from being included in a skippable frame at the start of the WARC file. Option --warc-zstd-dict=FILENAME needs to be used in order to use this option.
Not including the dictionary can be useful if many separate WARCs are created using the same dictionary. Storing the dictionary in every WARC adds size overhead. Instead, it may be useful to store the Zstandard dictionary separately.
--warc-zstd-dict-no-compression
Prevent the Zstandard dictionary in use from being compressed with Zstandard before it is written to the skippable frame. Option --warc-zstd-dict=FILENAME needs to be used in order to use this option.
Zstandard dictionaries themselves are not compressed, and compressing the dictionary can often make the skippable frame tens of percent smaller than one holding the uncompressed dictionary. Not compressing the dictionary might improve performance, as no decompression needs to take place in order to use the dictionary.
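The dependencies between the Zstandard options can be captured in a small helper that assembles the command line arguments. This is a hypothetical wrapper for illustration; the helper name, the binary name wget-at, the dictionary file name and the URL are assumptions, and only the --warc-* Zstandard options come from these notes.

```python
def zstd_warc_args(dictionary=None, include_dict=True, compress_dict=True):
    """Build the Zstandard-related WARC options (hypothetical helper).

    dictionary    -- path of the dictionary file, or None for no dictionary
    include_dict  -- False adds --warc-zstd-dict-no-include
    compress_dict -- False adds --warc-zstd-dict-no-compression
    """
    args = ["--warc-compression-use-zstd"]
    if dictionary is None:
        # Both "no-..." options only make sense together with a dictionary.
        if not include_dict or not compress_dict:
            raise ValueError("this option requires --warc-zstd-dict=FILENAME")
        return args
    args.append(f"--warc-zstd-dict={dictionary}")
    if not include_dict:
        args.append("--warc-zstd-dict-no-include")
    if not compress_dict:
        args.append("--warc-zstd-dict-no-compression")
    return args


# Assumed invocation: binary name, WARC file name and URL are placeholders.
command = [
    "wget-at",
    "--warc-file=example",
    *zstd_warc_args("site.zstdict", include_dict=False),
    "https://example.org/",
]
print(" ".join(command))
```

Keeping the dictionary out of each file, as in the example, fits the case where many WARCs share one dictionary that is stored separately.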
Deduplication
With deduplication on WARC records, a response record can be converted to a revisit record if it is found to be a duplicate of another record. In accordance with version 1.1 of the WARC format, the headers
- WARC-Refers-To, referring to WARC-Record-ID of the original record,
- WARC-Refers-To-Target-URI, referring to WARC-Target-URI of the original record,
- WARC-Refers-To-Date, referring to WARC-Date of the original record,
- WARC-Profile, with value http://netpreserve.org/warc/1.1/revisit/identical-payload-digest, and
- WARC-Truncated, with value length,
are added and header WARC-Type is assigned value revisit. WARC-Block-Digest is set to the digest of the truncated data and WARC-Payload-Digest is the digest of the original payload.
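The header changes described above can be sketched as a function over plain dictionaries. This is an illustration only: the function name and the dict-based record representation are assumptions, and headers not mentioned in these notes (such as Content-Length) are left untouched here.

```python
REVISIT_PROFILE = "http://netpreserve.org/warc/1.1/revisit/identical-payload-digest"


def to_revisit_headers(response: dict, original: dict, truncated_block_digest: str) -> dict:
    """Derive revisit headers from a duplicate response record (hypothetical helper).

    response               -- WARC headers of the duplicate response record
    original               -- WARC headers of the previously written record
    truncated_block_digest -- digest of the block that remains after truncation
    """
    headers = dict(response)  # copy, so the caller's headers are not modified
    headers["WARC-Type"] = "revisit"
    headers["WARC-Refers-To"] = original["WARC-Record-ID"]
    headers["WARC-Refers-To-Target-URI"] = original["WARC-Target-URI"]
    headers["WARC-Refers-To-Date"] = original["WARC-Date"]
    headers["WARC-Profile"] = REVISIT_PROFILE
    headers["WARC-Truncated"] = "length"
    headers["WARC-Block-Digest"] = truncated_block_digest
    # WARC-Payload-Digest stays as it was: the digest of the full payload,
    # which is what identified the record as a duplicate.
    return headers


original = {
    "WARC-Record-ID": "<urn:uuid:00000000-0000-0000-0000-000000000001>",
    "WARC-Target-URI": "https://example.org/a",
    "WARC-Date": "2020-04-01T00:00:00Z",
}
response = {
    "WARC-Type": "response",
    "WARC-Record-ID": "<urn:uuid:00000000-0000-0000-0000-000000000002>",
    "WARC-Target-URI": "https://example.org/b",
    "WARC-Date": "2020-04-01T00:00:05Z",
    "WARC-Payload-Digest": "sha1:EXAMPLEPAYLOADDIGEST",
}
for name, value in to_revisit_headers(response, original, "sha1:EXAMPLEBLOCKDIGEST").items():
    print(f"{name}: {value}")
```

The record IDs, URLs, dates and digest strings in the example are placeholders, not real values.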
With this release URL-agnostic deduplication is supported for WARC records in a single Wget session with the --warc-dedup-url-agnostic option. URL-gnostic deduplication is used by default for WARC writing, unless disabled with --warc-dedup-disable.
--warc-dedup-url-agnostic
Allow URL-agnostic deduplication of WARC records in the same Wget session.
A response record is converted into a revisit record with URL-agnostic deduplication when only the WARC-Payload-Digest matches that of a previously written record. Other WARC headers, like WARC-Target-URI, do not have to be equal in order for a revisit record to be written.
--warc-dedup-min-size=NUMBER
The minimum size in bytes a payload must have before it is deduplicated. The default value is 100.
When a response record is converted to a revisit record, a number of fields are added. The value of --warc-dedup-min-size is used to determine when it is 'worth it' to write a revisit record instead of the original, given the increase or decrease in size, performance, and other factors.
--warc-dedup-disable
Disables the URL-gnostic deduplication. This deduplication is turned on by default.
URL-gnostic deduplication converts a response record into a revisit record when another record was previously written with equal values for the WARC-Payload-Digest and WARC-Target-URI WARC headers.
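The difference between the two deduplication modes, together with the minimum size, can be sketched as a lookup keyed on different header values. This is a simplified model, not the Wget-AT implementation. The class, its mode names and the exact size comparison (payloads smaller than the minimum are never deduplicated) are assumptions, as is the way the three options combine into a single mode.

```python
class DedupIndex:
    """Track written records within one session (hypothetical model)."""

    MODES = ("url-gnostic", "url-agnostic", "disabled")

    def __init__(self, mode="url-gnostic", min_size=100):
        if mode not in self.MODES:
            raise ValueError(f"unknown mode: {mode}")
        self.mode = mode
        self.min_size = min_size
        self._seen = {}  # lookup key -> record ID of the first record written

    def _key(self, payload_digest, target_uri):
        if self.mode == "url-agnostic":
            return payload_digest             # the payload digest alone decides
        return (payload_digest, target_uri)   # digest and URL must both match

    def lookup_or_add(self, record_id, payload_digest, target_uri, payload_size):
        """Return the record ID to refer to, or None to write a response record."""
        if self.mode == "disabled" or payload_size < self.min_size:
            return None
        key = self._key(payload_digest, target_uri)
        original = self._seen.get(key)
        if original is None:
            self._seen[key] = record_id  # first occurrence becomes the original
            return None
        return original


index = DedupIndex(mode="url-agnostic")
print(index.lookup_or_add("rec-1", "sha1:AAA", "https://example.org/a", 500))
print(index.lookup_or_add("rec-2", "sha1:AAA", "https://example.org/b", 500))
```

In the example the second record has a different URL but the same payload digest, so in URL-agnostic mode it refers back to the first record, whereas in URL-gnostic mode it would be written as a normal response record.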