
Sparse array as compressor dictionary #20

Open
andrew-aladev opened this issue Jan 17, 2019 · 3 comments

@andrew-aladev
Contributor

andrew-aladev commented Jan 17, 2019

Hello. In 2019 we have a large amount of RAM available, so we can aim for the maximum possible performance of the LZW (with clear) algorithm.

The idea is simple: we can use a sparse array instead of a double-hashing array as the dictionary.
Imagine a big array mapping (code << 8) | symbol => next_code, where symbol is between 0 and 255, code is between 0 and (2 ** 16) - 1, and next_code is between 257 and (2 ** 16) - 1. Such an array requires about 33.5 MB of RAM (2 ** 24 slots of 2 bytes each).

The problem is that we have to clear the sparse array. For example, compressing the 850 MB linux-4.20.3.tar requires clearing the dictionary 2040 times, and 33.5 MB * 2040 ≈ 68 GB of memory would have to be zeroed. I've solved this issue by collecting the used sparse array indexes in a separate array and clearing just those indexes.

The complexity of insert and find is still O(1). You can find the docs here; the implementation is here.
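For illustration, here is a minimal sketch of the idea in C, assuming a 16-bit code space; the names (dict_t, dict_insert, dict_clear) and the allocation details are mine, not taken from the linked implementation:

```c
#include <stdint.h>
#include <stdlib.h>

/* Illustrative sketch only -- not the linked implementation. */

#define DICT_SLOTS   (1u << 24)   /* (code << 8) | symbol spans 2^24 indexes          */
#define MAX_INSERTS  (1u << 16)   /* fewer than 2^16 codes are added between clears   */
#define EMPTY        0            /* 0 is never a valid next_code (codes start at 257) */

typedef struct {
    uint16_t *codes;       /* sparse array: index -> next_code, 2 bytes/slot (~33.5 MB) */
    uint32_t *used;        /* indexes written since the last clear                      */
    uint32_t  used_count;
} dict_t;

static int dict_init(dict_t *d)
{
    d->codes = calloc(DICT_SLOTS, sizeof *d->codes);
    d->used = malloc(MAX_INSERTS * sizeof *d->used);
    d->used_count = 0;
    return d->codes != NULL && d->used != NULL;
}

/* O(1) lookup: returns the next code for (code, symbol), or EMPTY if absent. */
static uint16_t dict_find(const dict_t *d, uint16_t code, uint8_t symbol)
{
    return d->codes[((uint32_t)code << 8) | symbol];
}

/* O(1) insert: store next_code and remember the index so clearing stays cheap. */
static void dict_insert(dict_t *d, uint16_t code, uint8_t symbol, uint16_t next_code)
{
    uint32_t index = ((uint32_t)code << 8) | symbol;
    d->codes[index] = next_code;
    d->used[d->used_count++] = index;
}

/* Clear only the slots that were touched instead of zeroing all 33.5 MB. */
static void dict_clear(dict_t *d)
{
    for (uint32_t i = 0; i < d->used_count; i++)
        d->codes[d->used[i]] = EMPTY;
    d->used_count = 0;
}
```

The key point is that dict_clear touches only used_count slots, so the cost of a clear is proportional to the number of codes added since the last clear rather than to the size of the whole array.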

I am fairly sure that ncompress won't accept such a huge memory eater. I just want to inform people. Thank you.

@vapier
Owner

vapier commented Jan 19, 2019

stepping back, the compress format is pretty uncommon nowadays. it shows up with a lot of older files, but nowadays files are compressed using something like XZ when people care about speed/size. this is 2019 after all and there are many free/open source compression algorithms out there to choose from that are way faster & smaller than LZW :).

so i have a hard time justifying significant changes to the ncompress package when, in practice, people simply aren't using compress anymore, nor should they.

that isn't to say collecting thoughts and feedback from people such as yourself isn't useful. thanks for posting your findings/research and code for others to utilize.

@andrew-aladev
Contributor Author

LZW compress is old. The problem is that it was used all over the world to create smth.tar.Z files and to serve content via ancient web servers. Today nobody will try to use LZW, and that's fine, but we still want to support ancient applications. The problem is that application/x-compress has undocumented hidden pitfalls. We will reverse engineer it and restore the ability to support ancient software.

@N-R-K

N-R-K commented Apr 22, 2024

I've experimented a bit with this approach, but it doesn't seem to be any faster than the hashtable. Using ch << 16 | code I get around 42ms compressing a 1080p binary PPM file; with the hashtable I get 43ms.

Using perf I can see that the vast majority of time is spent stalled waiting for memory to arrive from the sparse array, since the array is too big to fit into the L1 or L2 cache.

Also interestingly, when I use code << 8 | ch to index into the array, performance gets MUCH worse: 110ms. I think this is because when ch is in the high bits of the index, the index tends to stay closer to recently accessed indices and thus makes better use of the cache.
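To make the comparison concrete, the two layouts differ only in which field occupies the high bits of the 24-bit index (variable names here are illustrative, not the exact benchmark code):

```c
/* ch is 0..255, code is 0..65535; both forms produce a 24-bit index. */
uint32_t fast_index = ((uint32_t)ch << 16) | code;  /* ch in the high bits:   ~42ms  */
uint32_t slow_index = ((uint32_t)code << 8) | ch;   /* code in the high bits: ~110ms */
```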

TL;DR: sparse array doesn't seem worthwhile to use over a hash-table.
