Sparse array as compressor dictionary #20
stepping back, the compress format is pretty uncommon nowadays. it shows up with a lot of older files, but these days files are compressed using something like XZ when speed/size matters. this is 2019 after all, and there are many free/open source compression algorithms out there to choose from that are way faster & smaller than LZW :). so i have a hard time justifying significant changes to the ncompress package when, in practice, people simply aren't using it. that isn't to say collecting thoughts and feedback from people such as yourself isn't useful. thanks for posting your findings/research and code for others to utilize.
I've experimented a bit with this approach, but it doesn't seem to be any faster than the hashtable. Using […]. Using […]. Also, interestingly, when I use […]. TL;DR: a sparse array doesn't seem worthwhile to use over a hash table.
Hello. In 2019 we have a great amount of RAM available, so we can reach the maximum possible performance for the LZW (with clear) algorithm.

The idea is simple: we can use a sparse array instead of a double-hashing array as the dictionary. Please imagine a big array where `(code << 8) | symbol => next_code`. Here `symbol` is between `0` and `255`, `code` is between `0` and `(2 ** 16) - 1`, and `next_code` is between `257` and `(2 ** 16) - 1`. `33.5 MB` of RAM is required for such an array.

The problem is that we have to clear the sparse array. For example, we have to clear the dictionary 2040 times when compressing the `850 MB` `linux-4.20.3.tar`, which would mean rewriting `33.5 MB * 2040 ~ 68 GB` in total. I've solved this issue by collecting the used sparse array indexes in a separate array and clearing just those indexes. The complexity of insert or find is still O(1). You can find docs here. Implementation is here.
I am almost sure that ncompress won't accept such a huge memory eater; I just want to inform people. Thank you.