Optimize PFCOUNT, PFMERGE command by SIMD acceleration #1293
Signed-off-by: Xuyang Wang <xuyangwang@link.cuhk.edu.cn>
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##           unstable    #1293    +/-   ##
============================================
+ Coverage     70.69%   70.70%   +0.01%
============================================
  Files           114      115       +1
  Lines         63161    63233      +72
============================================
+ Hits          44650    44710      +60
- Misses        18511    18523      +12
How was this change tested, particularly to confirm that the non-AVX2 and AVX2 implementations produce the same results?
The algorithms were verified by comparing the results of the scalar code and the SIMD code on random input.
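The random-input comparison described above can be sketched as a small differential test. This is only an illustration, not the PR's test code: `pack_reference`, `pack_candidate`, and `count_mismatches` are hypothetical names, the candidate is a second scalar packer standing in for the SIMD routine, and the layout assumes 6-bit registers packed least-significant-bit first.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define NREGS 16384                 /* register count in the default configuration */
#define DENSE_BYTES (NREGS * 6 / 8) /* 12288 bytes at 6 bits per register */

/* Reference packer (hypothetical): writes one bit at a time. */
static void pack_reference(uint8_t *dense, const uint8_t *raw) {
    memset(dense, 0, DENSE_BYTES);
    for (size_t i = 0; i < NREGS; i++) {
        for (int b = 0; b < 6; b++) {
            size_t bit = i * 6 + b;
            if ((raw[i] >> b) & 1) dense[bit / 8] |= (uint8_t)(1u << (bit % 8));
        }
    }
}

/* Candidate packer (hypothetical stand-in for a SIMD routine):
 * packs four registers into three bytes per step. */
static void pack_candidate(uint8_t *dense, const uint8_t *raw) {
    for (size_t i = 0; i < NREGS; i += 4, dense += 3) {
        uint32_t v = (uint32_t)raw[i] | ((uint32_t)raw[i + 1] << 6) |
                     ((uint32_t)raw[i + 2] << 12) | ((uint32_t)raw[i + 3] << 18);
        dense[0] = (uint8_t)(v & 0xff);
        dense[1] = (uint8_t)((v >> 8) & 0xff);
        dense[2] = (uint8_t)((v >> 16) & 0xff);
    }
}

/* Returns the number of random trials on which the two packers disagree. */
static int count_mismatches(unsigned seed, int trials) {
    static uint8_t raw[NREGS], a[DENSE_BYTES], b[DENSE_BYTES];
    int bad = 0;
    srand(seed);
    for (int t = 0; t < trials; t++) {
        for (size_t i = 0; i < NREGS; i++) raw[i] = (uint8_t)(rand() & 63);
        pack_reference(a, raw);
        pack_candidate(b, raw);
        if (memcmp(a, b, DENSE_BYTES) != 0) bad++;
    }
    return bad;
}
```

Calling `count_mismatches(1, 20)` should return 0 if the two packers agree; a real test would swap the candidate for the AVX2 function and skip the check on CPUs without AVX2.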
This is so cool! 😎
I think we need to test this in our repo in some way. The binary representation can't change, because things like replicas will not understand it, so we should verify the binary representation. A hyperloglog key is actually a string so we can use the GET command to get the binary representation. In a TCL test case, we can use GET and compare the reply to the binary data we have stored earlier. Alternatively we can use DUMP. Can you add it?
@lipzhu You're the performance expert. Do you want to review this?
@Nugine Great job, the performance is impressive.
Echoing @zuiderkwast and @xbasel.
Maybe a unit test is needed to verify that the results are exactly the same with AVX2 instructions, and to make sure the binary file does not change when RDB persistence is enabled?
case HLL_DENSE: hllDenseSet(hdr->registers, j, max[j]); break;
case HLL_SPARSE: hllSparseSet(o, j, max[j]); break;
Nit: Missing indentation.
clang-format-18 removes the indentation 😂
Yes, case and labels for goto are indented one step less than other code. This is correct indentation, a common style in C.
* EEEFFFGGGHHH0000
* AAABBBCCCDDDEEEFFFGGGHHH0000
*
* Note that the last 4 bytes are padding bytes.
Will the 4 padding bytes at the end affect the binary dump files, making them inconsistent with the original ones?
The 4 padding bytes produced by AVX2 STORE are overwritten by the last scalar computation (32 registers).
The two AVX2 functions keep the memory format unchanged. They just accelerate the computation.
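The overwrite argument can be illustrated with a scalar sketch in which `memcpy` stands in for the wide store. Everything here is an assumption for illustration: the names `pack32`, `pack_with_wide_stores`, and `wide_store_is_safe` are hypothetical, and the chunk is modelled as 24 payload bytes plus 4 zero padding bytes to mirror the diagram in the code comment, which may not match the actual store width in the PR.

```c
#include <stdint.h>
#include <string.h>

#define NREGS 16384
#define DENSE_BYTES (NREGS * 6 / 8)
#define GROUP_REGS 32
#define PAYLOAD 24 /* 32 registers * 6 bits = 24 bytes */
#define PAD 4      /* assumed trailing zero bytes per wide store */

/* Pack 32 six-bit registers into exactly 24 bytes. */
static void pack32(uint8_t *out, const uint8_t *raw) {
    for (int i = 0; i < GROUP_REGS; i += 4, out += 3) {
        uint32_t v = (uint32_t)raw[i] | ((uint32_t)raw[i + 1] << 6) |
                     ((uint32_t)raw[i + 2] << 12) | ((uint32_t)raw[i + 3] << 18);
        out[0] = (uint8_t)(v & 0xff);
        out[1] = (uint8_t)((v >> 8) & 0xff);
        out[2] = (uint8_t)((v >> 16) & 0xff);
    }
}

static void pack_with_wide_stores(uint8_t *dense, const uint8_t *raw) {
    size_t groups = NREGS / GROUP_REGS;
    for (size_t g = 0; g + 1 < groups; g++) {
        uint8_t chunk[PAYLOAD + PAD] = {0};
        pack32(chunk, raw + g * GROUP_REGS);
        /* Wide store: PAD zero bytes spill into the next group's slot
         * and are overwritten by the store for that group. */
        memcpy(dense + g * PAYLOAD, chunk, sizeof chunk);
    }
    /* Final group: exact-width scalar write. It overwrites the padding
     * left by the previous store and never writes past the buffer. */
    pack32(dense + (groups - 1) * PAYLOAD, raw + (groups - 1) * GROUP_REGS);
}

/* Returns 1 if the output equals an exact-width packing and a guard
 * byte placed right after the buffer is untouched. */
static int wide_store_is_safe(void) {
    static uint8_t raw[NREGS], a[DENSE_BYTES + 1], b[DENSE_BYTES];
    for (size_t i = 0; i < NREGS; i++) raw[i] = (uint8_t)((i * 7 + 3) & 63);
    a[DENSE_BYTES] = 0xAA;
    pack_with_wide_stores(a, raw);
    for (size_t g = 0; g < NREGS / GROUP_REGS; g++)
        pack32(b + g * PAYLOAD, raw + g * GROUP_REGS);
    return memcmp(a, b, DENSE_BYTES) == 0 && a[DENSE_BYTES] == 0xAA;
}
```

Because each store starts 24 bytes after the previous one, the spilled padding never survives into the final buffer, which is why the memory format stays the same.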
WIP: unit tests for verifying that the non-AVX2 and AVX2 implementations produce the same results. I have other things to do these days, so it may take a while.
WIP: unit tests
This PR optimizes the performance of HyperLogLog commands (PFCOUNT, PFMERGE) by adding AVX2 fast paths.
Two AVX2 functions are added for conversion between the raw representation and the dense representation. They are 15 to 30 times faster than the scalar implementation. Note that the sparse representation is not accelerated.
AVX2 fast paths are enabled when the CPU supports AVX2 (checked at runtime) and the hyperloglog configuration is default (HLL_REGISTERS == 16384 && HLL_BITS == 6).
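A minimal sketch of such a guard, assuming GCC or Clang on x86, is shown here. The function name `hll_can_use_avx2` is hypothetical and the PR may perform the check differently; `__builtin_cpu_supports` is the compiler built-in used for the runtime CPU test, and the preprocessor condition covers the default-configuration requirement.

```c
#define HLL_BITS 6
#define HLL_REGISTERS 16384

/* Hypothetical helper: returns 1 if an AVX2 fast path may be taken. */
static int hll_can_use_avx2(void) {
#if (HLL_REGISTERS == 16384 && HLL_BITS == 6) && defined(__GNUC__) && \
    (defined(__x86_64__) || defined(__i386__))
    /* Runtime check: the binary may run on a CPU without AVX2. */
    return __builtin_cpu_supports("avx2") ? 1 : 0;
#else
    /* Non-default configuration or unsupported platform: scalar only. */
    return 0;
#endif
}
```

A caller would branch on this result once per command and fall back to the scalar loop whenever it returns 0.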
When merging 3 dense hll structures, the benchmark shows a 12x speedup compared to the scalar version.
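For context, merging HyperLogLog structures means taking, for each register, the maximum value across the inputs. The scalar sketch below shows that idea on dense buffers; the helper names (`dense_get`, `dense_set`, `merge_max`, `merge_example`) are hypothetical, and the 6-bit least-significant-bit-first packing is an assumption about the dense layout rather than code from the PR.

```c
#include <stdint.h>
#include <stddef.h>

#define NREGS 16384
#define DENSE_BYTES (NREGS * 6 / 8)

/* Read the 6-bit register at index i. */
static uint8_t dense_get(const uint8_t *dense, size_t i) {
    size_t bit = i * 6, byte = bit / 8, shift = bit % 8;
    unsigned v = dense[byte] >> shift;
    /* The register crosses into the next byte only when shift > 2. */
    if (shift > 2) v |= (unsigned)dense[byte + 1] << (8 - shift);
    return (uint8_t)(v & 63);
}

/* Write the 6-bit register at index i. */
static void dense_set(uint8_t *dense, size_t i, uint8_t val) {
    size_t bit = i * 6, byte = bit / 8, shift = bit % 8;
    dense[byte] = (uint8_t)((dense[byte] & ~(63u << shift)) | ((unsigned)val << shift));
    if (shift > 2) {
        unsigned lo = 8 - shift; /* bits already stored in the first byte */
        dense[byte + 1] = (uint8_t)((dense[byte + 1] & ~(63u >> lo)) | (unsigned)(val >> lo));
    }
}

/* Fold one dense input into max[], a raw array with one byte per register. */
static void merge_max(uint8_t *max, const uint8_t *dense) {
    for (size_t j = 0; j < NREGS; j++) {
        uint8_t v = dense_get(dense, j);
        if (v > max[j]) max[j] = v;
    }
}

/* Usage example: merge two inputs and write the result back as dense.
 * Returns 1 if the merged registers hold the per-register maxima. */
static int merge_example(void) {
    static uint8_t a[DENSE_BYTES], b[DENSE_BYTES], out[DENSE_BYTES], max[NREGS];
    for (size_t j = 0; j < NREGS; j++) max[j] = 0; /* reset between calls */
    dense_set(a, 5, 10); dense_set(b, 5, 20);
    dense_set(a, 7, 33); dense_set(b, 7, 1);
    merge_max(max, a);
    merge_max(max, b);
    for (size_t j = 0; j < NREGS; j++) dense_set(out, j, max[j]);
    return dense_get(out, 5) == 20 && dense_get(out, 7) == 33 && dense_get(out, 6) == 0;
}
```

The two per-register loops, dense to raw and raw to dense, are the conversions the PR description says are accelerated; the maximum itself is a plain byte-wise operation.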
Experiment repo: https://github.com/Nugine/redis-hyperloglog
Benchmark script: https://github.com/Nugine/redis-hyperloglog/blob/main/scripts/memtier.sh
Algorithm: https://github.com/Nugine/redis-hyperloglog/blob/main/cpp/bench.cpp