Implementation

Structure

Lempel-Ziv

The data is encoded into sequences of three bytes. The first 4 bits are the offset left of the current position. The next 12 bits represent the length of the match. The remaining byte is the next byte in the original data.

However, an optimization is made when the match length is under 3 bytes. In that case only the byte is inserted.

To decode this variable-width data 0 bit is inserted before the three bytes when match is found. If only the byte is written a 1 bit is inserted before the byte.

Huffman

In the beginning of the data there is a header of three 32-bit integers. These are:

Size of the Huffman tree in bytes.
Amount of unique bytes in the original data.
Size of the original data in bytes.

The Huffman tree is stored after the header. This tree is encoded with post-order traversal, starting from the left. When a leaf node is encountered 1 bit is appended after which the byte that the leaf node represents is inserted. Other nodes in the tree will output 0 bit in the encoding.

After the Huffman tree the data encoded according to the tree is written.

Complexity

Lempel-Ziv

Encoding:

The encoding algorithm approaches linear time since the input is processed in a single loop. Inside that loop, however, some matches are found recursively. This recursion still has a stack limit of 15 since the length of the match cannot be longer. This is due to the limit of four bits that are allocated for the match length in the output. Further, the matches of the already-processed part of the input are store in a dict. Dict has a O(1) constant time complexity in access operations. So accessing past matches will not affect time complexity.

Time complexity: O(n).

Decoding:

Much like the encoding, the decoder processed the input in a single while-loop. Looping over the data and appending matches to an output buffer.

Time complexity: O(n).

Huffman

Encoding:

The algorithm starts by iterating the input to calculate the frequencies of characters. This is done in O(n) time. After that the Huffman tree is built. This also happens in linear time as it is normal tree traversal. When the tree is ready, it is encoded into a buffer. This also traverses the tree in linear time. At last the input bytes are processed one by one in a for loop to build the output.

Time complexity: O(n).

Decoding:

The decoding reverses the encoding operation thus starting with decoding the Huffman tree from the beginning of the input. The tree is built in linear time in a single while loop. After the tree is build the rest of the input is processed in a while loop and the arrays of bits are searched from the Huffman tree. Tree access operation is done in O(log n) time, but since the tree will not contain more than 256 leaf nodes the tree will not grow at the same rate as the input data size.

Time complexity: O(n).

Performance Results

Further empirical proof of the time complexities can be read from the visualization below.

Lempel-Ziv

Sample	Encoding Speed (s)	Decoding Speed (s)	Original Data (KB)	Compressed Data (KB)	Compression Ratio
sample/simple.txt	0.0001	0.0000	0.078	0.065	0.833
sample/lorem.txt	0.0955	0.0266	88.319	58.291	0.66
sample/shakespeare.txt	6.4050	1.6093	5330.272	3901.245	0.732
sample/simple_image.png	0.0020	0.0005	2.185	0.878	0.402
sample/small_image.jpeg	0.0129	0.0020	8.248	9.026	1.094
n = 1 000	0.0014	0.0002	0.977	1.099	1.125
n = 10 000	0.0173	0.0024	9.766	10.978	1.124
n = 100 000	0.1893	0.0238	97.656	109.718	1.124
n = 1 000 000	1.9506	0.2337	976.562	1096.839	1.123
n = 10 000 000	21.3077	2.3422	9765.625	10968.202	1.123
n = 100 000 000	233.3153	23.7175	97656.25	109682.097	1.123

Huffman

Sample	Encoding Speed (s)	Decoding Speed (s)	Original Data (KB)	Compressed Data (KB)	Compression Ratio
sample/simple.txt	0.0001	0.0001	0.078	0.048	0.615
sample/lorem.txt	0.0126	0.0855	88.319	47.287	0.535
sample/shakespeare.txt	0.7357	5.4583	5330.272	3083.375	0.578
sample/simple_image.png	0.0006	0.0021	2.185	1.149	0.526
sample/small_image.jpeg	0.0022	0.0138	8.248	8.44	1.023
n = 1 000	0.0004	0.0014	0.977	0.938	0.96
n = 10 000	0.0017	0.0129	9.766	8.296	0.849
n = 100 000	0.0153	0.1281	97.656	82.016	0.84
n = 1 000 000	0.1528	1.2804	976.562	820.079	0.84
n = 10 000 000	1.5420	12.9038	9765.625	8201.985	0.84
n = 100 000 000	15.3825	128.9075	97656.25	82027.189	0.84

Notes

The LZ implementation does not seem to play well with totally random bytes. The speed of LZ compression could be improved.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

implementation.md

implementation.md

Implementation

Structure

Lempel-Ziv

Huffman

Complexity

Lempel-Ziv

Encoding:

Decoding:

Huffman

Encoding:

Decoding:

Performance Results

Lempel-Ziv

Huffman

Notes

Sources

Files

implementation.md

Latest commit

History

implementation.md

File metadata and controls

Implementation

Structure

Lempel-Ziv

Huffman

Complexity

Lempel-Ziv

Encoding:

Decoding:

Huffman

Encoding:

Decoding:

Performance Results

Lempel-Ziv

Huffman

Notes

Sources