Many benchmarks focus on intangible measurements, analyzing the relationship between compression speed and ratio without providing intuition as to their real-world benefits.
(Notable exceptions are the LZ4 benchmark and Squash, but they are either not as extensive or not as focused as this article.)
This benchmark aims to bridge that gap, while comparing the most widely deployed compression schemes in the world.
The motivation for compressing data mostly falls within three categories:
You wish to reduce disk, memory or network data use. For instance, you are a package distribution service, or you store lots of logs or files. What you want to maximize then is compression ratio: how much content a megabyte of compressed data holds once uncompressed. For instance, a ratio of 3 means that a 1 MB zip on disk stores 3 MB of content, effectively tripling the amount of content your disk can hold.
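As a quick sanity check of that definition, here is a minimal Python sketch (the sizes are made-up numbers, not measurements from this benchmark):

```python
# Made-up sizes, for illustration only.
uncompressed_bytes = 3_000_000  # 3 MB of raw content
compressed_bytes = 1_000_000    # 1 MB once compressed on disk

ratio = uncompressed_bytes / compressed_bytes
print(ratio)  # 3.0 -- each stored megabyte holds 3 MB of content
```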
The winner there is Lzip, closely followed by XZ.
(Lzip and XZ contain the same compression algorithm, LZMA, which relies on range encoding, a derivative of arithmetic coding, a compute-intensive method that has very high compression power.)
Of course, storage costs may need to be balanced against CPU costs of running compression. The faster the compression, the lower the CPU cost.
Assuming the overall cost is a linear combination of both parameters, a given budget is representable as a line on the plot that “falls” from the top right. If disk space is more important, the line will be closer to the horizontal; if CPU is more relevant, it will be closer to the vertical. (That said, storage costs usually dominate.)
A line that goes up is never optimal, so all positive slopes can be discarded. If you imagine that line starting horizontal at the top and slowly tilting downwards, it will come into contact with various compressors, until it ends up vertical.
That creates a convex hull around the compressors, corresponding to the best choices for all weighted compromises between storage and CPU costs.
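One way to make that geometric argument concrete is to sweep the weight given to storage versus CPU and record which compressor comes out cheapest at each weighting; whatever never wins is strictly inside the hull. The (ratio, compression speed) points below are invented placeholders, not results from this benchmark:

```python
# Toy (compression ratio, compression speed in MB/s) points.
# These are assumptions for illustration, not measured values.
compressors = {
    "fast":     (2.5, 400.0),
    "balanced": (3.2,  60.0),
    "dense":    (4.1,   2.0),
    "mediocre": (2.6,  50.0),  # dominated: "balanced" beats it on both axes
}

def cost(ratio, speed_mb_s, storage_weight):
    """Linear combination of storage cost (compressed MB kept per raw MB)
    and CPU cost (seconds spent per raw MB)."""
    storage = 1.0 / ratio
    cpu = 1.0 / speed_mb_s
    return storage_weight * storage + (1.0 - storage_weight) * cpu

# Sweep the weight from "only CPU matters" to "only storage matters".
winners = set()
for w in [i / 100 for i in range(101)]:
    best = min(compressors, key=lambda name: cost(*compressors[name], w))
    winners.add(best)

# Anything absent from this set is dominated: no weighting of storage
# against CPU ever makes it the best choice.
print("on the hull for some weighting:", sorted(winners))
```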
You want fast downloads. For instance, you serve assets for a website. To truly shine there, you need to have previously compressed the files to disk once. The time needed to get the file loaded from the server to your client is the sum of two pieces: first, you need to transfer the compressed zip, then the client needs to extract it.
Your hope is that the time saved by transferring fewer bytes will more than make up for the time lost extracting them.
Transfer 1 MB of raw data at 1 MB/s
┌───────────────────────────────────────────┐
└───────────────────────────────────────────┘
└──────────────── 1 second ─────────────────┘
  Transfer (1÷ratio) MB   Extract 1 MB
┌───────────────────────┬──────────────┐
└───────────────────────┴──────────────┘
└──────── 203 ms ───────┴──── 3 ms ────┘
Usually, for a given algorithm, the decompression speed (the number of megabytes decompressed every second) is the same regardless of the compression level (i.e., how viciously the compressor tried to reduce the size of the file). So the extraction time depends only on the raw data size.
Therefore, higher compression levels always yield faster downloads. You only need to compare compression software at the flags that yield the highest ratio.
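In code, that load time is just the sum of those two pieces. The ratios and decompression speeds below are placeholders, not the measured values behind the plots:

```python
def load_time_s(raw_mb, ratio, bandwidth_mb_s, decompress_mb_s):
    """Time to fetch a pre-compressed file and extract it on the client."""
    transfer = (raw_mb / ratio) / bandwidth_mb_s  # send the compressed bytes
    extract = raw_mb / decompress_mb_s            # extraction scales with the raw size
    return transfer + extract

# 1 MB of content over a 1 MB/s link, with made-up compressor figures.
for name, ratio, dspeed in [("dense, slower", 5.0, 300.0),
                            ("light, faster", 2.5, 2000.0)]:
    print(f"{name}: {load_time_s(1.0, ratio, 1.0, dspeed) * 1000:.0f} ms")
```

With these toy numbers the denser compressor wins comfortably, despite extracting more slowly: at 1 MB/s, transfer time dwarfs extraction time.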
The winners in this field seem to be Zstandard and Brotli, with XZ and Lzip not far behind (in that order).
(At least, at 1 MB/s. As network bandwidth grows closer to the decompression speed, the bottleneck becomes decompression. Past a certain point, the compressors get slower than sending data without compression, and you need compressors with much higher decompression speeds, like Zstandard and LZ4, to compete.)
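That crossover point can be derived directly: sending raw data takes raw ÷ bandwidth, while sending it compressed takes raw ÷ (ratio × bandwidth) + raw ÷ decompression speed. Equating the two gives the bandwidth above which compression stops paying off. A small sketch, with assumed figures:

```python
def break_even_bandwidth_mb_s(ratio, decompress_mb_s):
    """Bandwidth above which sending raw data beats transfer-then-extract.
    Solves raw/bw == raw/(ratio*bw) + raw/dspeed for bw."""
    return decompress_mb_s * (ratio - 1.0) / ratio

# Assumed figures: a ratio-3 compressor that extracts at 1500 MB/s stops
# being worthwhile once the link exceeds 1000 MB/s.
print(break_even_bandwidth_mb_s(3.0, 1500.0))  # 1000.0
```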
If you are looking for the fastest transfer time, and you don't want to keep the compressed version, you may still be able to gain a little extra speed by compressing the file first, provided the network bandwidth is not too high.
To compute this, we add the time it takes to compress the file to the loading time we computed previously.
At 1 MB/s, when the sending time (compression + transfer + extraction) per megabyte exceeds 1 second, we are better off sending the raw file.
Unlike the decompression speed, the compression speed slows linearly with the compression level. As a result, the sending time is a U curve over compression levels, for any given compressor.
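A sketch of that U curve, with a made-up table of per-level ratios and compression speeds standing in for a real compressor:

```python
def send_time_s(raw_mb, ratio, compress_mb_s, bandwidth_mb_s, decompress_mb_s):
    """Compress on the fly, transfer the compressed bytes, extract on arrival."""
    return (raw_mb / compress_mb_s
            + (raw_mb / ratio) / bandwidth_mb_s
            + raw_mb / decompress_mb_s)

# Hypothetical levels: the ratio improves while the compression speed drops.
levels = {1: (2.4, 150.0), 5: (3.0, 40.0), 9: (3.4, 4.0)}
for level, (ratio, cspeed) in levels.items():
    t = send_time_s(1.0, ratio, cspeed, bandwidth_mb_s=1.0, decompress_mb_s=500.0)
    print(f"level {level}: {t * 1000:.0f} ms")

# With these numbers the middle level wins: at level 9, the transfer time
# saved no longer covers the extra compression time, hence the U shape.
```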
The winner seems to be Zstandard -6, with Brotli -5 not far behind, twenty milliseconds later. Gzip -5 becomes relevant another 20 milliseconds later, closely followed by bzip2. XZ and Lzip are many tens of milliseconds slower than the rest, even at their lowest compression levels.
The big surprise is bzip2. With its consistently high compression speed, it remains competitive throughout its levels. If you are ready to compromise a tiny bit of transfer time to gain space, you can transmit one more megabyte of content for each megabyte transmitted.
I chose tools that ❶ are readily available, ❷ are popular, and ❸ address various segments of the Pareto frontier.
Name | Algorithm | Related archivers |
---|---|---|
gzip | DEFLATE (LZSS, Huffman) | ZIP, PNG |
bzip2 | BWT, MTF, RLE, Huffman | |
XZ | LZMA (LZ77, range encoding) | 7-Zip |
Lzip | LZMA (LZ77, range encoding) | 7-Zip |
Brotli | LZ77, Huffman, context modeling | WOFF2 |
LZ4 | LZ77 | |
Zstandard | LZ77, tANS, Huffman | |
- A given algorithm cannot compress all inputs to a smaller size. In particular, already-compressed inputs (such as images) might actually compress to a bigger file, as the sketch after this list demonstrates.
- Certain implementations may make use of CPU-specific optimizations, so their relative performance can differ on other devices.
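The first caveat is easy to demonstrate with Python's standard gzip module; random bytes stand in here for an already-compressed file, since both are close to maximum entropy:

```python
import gzip
import os

# One megabyte of random bytes: essentially incompressible, so a
# general-purpose compressor can only add framing overhead to it.
data = os.urandom(1_000_000)
compressed = gzip.compress(data)

print(len(data), "->", len(compressed))  # the gzip output is slightly larger
```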