Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize PFCOUNT, PFMERGE command by SIMD acceleration #1293

Open
wants to merge 3 commits into
base: unstable
Choose a base branch
from

Conversation

Nugine
Copy link

@Nugine Nugine commented Nov 12, 2024

WIP: unit tests


This PR optimizes the performance of HyperLogLog commands (PFCOUNT, PFMERGE) by adding AVX2 fast paths.

Two AVX2 functions are added for conversion between raw representation and dense representation. They are 15 ~ 30 times faster than scalar implementaion. Note that sparse representation is not accelerated.

AVX2 fast paths are enabled when the CPU supports AVX2 (checked at runtime) and the hyperloglog configuration is default (HLL_REGISTERS == 16384 && HLL_BITS == 6).

When merging 3 dense hll structures, the benchmark shows a 12x speedup compared to the scalar version.

pfcount key1 key2 key3
pfmerge keyall key1 key2 key3
======================================================================================================
Type             Ops/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
------------------------------------------------------------------------------------------------------
PFCOUNT-scalar    5665.56        35.29839        32.25500        63.99900        67.58300       608.60
PFCOUNT-avx2     72377.83         2.75834         2.67100         5.34300         6.81500      7774.96
------------------------------------------------------------------------------------------------------
PFMERGE-scalar    9851.29        20.28806        20.09500        36.86300        39.16700       615.71
PFMERGE-avx2    125621.89         1.59126         1.55100         3.11900         4.70300     15702.74
------------------------------------------------------------------------------------------------------

scalar: valkey:unstable  2df56d87c0ebe802f38e8922bb2ea1e4ca9cfa76
avx2:   Nugine:hll-simd  8f9adc34021080d96e60bd0abe06b043f3ed0275

CPU:    13th Gen Intel® Core™ i9-13900H × 20
Memory: 32.0 GiB
OS:     Ubuntu 22.04.5 LTS

Experiment repo: https://github.com/Nugine/redis-hyperloglog
Benchmark script: https://github.com/Nugine/redis-hyperloglog/blob/main/scripts/memtier.sh
Algorithm: https://github.com/Nugine/redis-hyperloglog/blob/main/cpp/bench.cpp

@Nugine Nugine force-pushed the hll-simd branch 2 times, most recently from 8bcf1ae to f730f91 Compare November 12, 2024 07:56
@Nugine Nugine mentioned this pull request Nov 12, 2024
3 tasks
Signed-off-by: Xuyang Wang <xuyangwang@link.cuhk.edu.cn>
Copy link

codecov bot commented Nov 12, 2024

Codecov Report

Attention: Patch coverage is 52.80899% with 42 lines in your changes missing coverage. Please review.

Project coverage is 70.70%. Comparing base (2df56d8) to head (a6136ca).
Report is 10 commits behind head on unstable.

Files with missing lines Patch % Lines
src/hyperloglog.c 52.80% 42 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #1293      +/-   ##
============================================
+ Coverage     70.69%   70.70%   +0.01%     
============================================
  Files           114      115       +1     
  Lines         63161    63233      +72     
============================================
+ Hits          44650    44710      +60     
- Misses        18511    18523      +12     
Files with missing lines Coverage Δ
src/hyperloglog.c 86.35% <52.80%> (-4.89%) ⬇️

... and 22 files with indirect coverage changes

Signed-off-by: Xuyang Wang <xuyangwang@link.cuhk.edu.cn>
@xbasel
Copy link
Member

xbasel commented Nov 14, 2024

How was this change tested, particularly to confirm that the non-AVX2 and AVX2 implementations produce the same results?

@Nugine
Copy link
Author

Nugine commented Nov 14, 2024

How was this change tested, particularly to confirm that the non-AVX2 and AVX2 implementations produce the same results?

The algorithms are verified by comparing the results between scalar code and simd code with random input.
https://github.com/Nugine/redis-hyperloglog/blob/main/cpp/bench.cpp

Signed-off-by: Xuyang Wang <xuyangwang@link.cuhk.edu.cn>
@zuiderkwast
Copy link
Contributor

This is so cool! 😎

How was this change tested, particularly to confirm that the non-AVX2 and AVX2 implementations produce the same results?

The algorithms are verified by comparing the results between scalar code and simd code with random input. https://github.com/Nugine/redis-hyperloglog/blob/main/cpp/bench.cpp

I think we need to test this in our repo in some way. The binary representation can't change, because things like replicas will not understand it, so we should verify the binary representation.

A hyperloglog key is actually a string so we can use the GET command to get the binary representation. In a TCL test case, we can use GET and compare the reply to the binary data we have stored earlier. Alternatively we can use DUMP. Can you add it?

@zuiderkwast
Copy link
Contributor

@lipzhu You're the performance expert. Do you want to review this?

Copy link
Contributor

@lipzhu lipzhu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Nugine Great job, the performance is impressive.

Echo @zuiderkwast @xbasel .
Maybe a unit test needed to verify the results is totally same W/ AVX2 instructions and make sure binary file will not changed when RDB presentence is enabled?

Comment on lines +1617 to +1618
case HLL_DENSE: hllDenseSet(hdr->registers, j, max[j]); break;
case HLL_SPARSE: hllSparseSet(o, j, max[j]); break;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Missing indentation.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-format-18 removes the indentation 😂

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, case and labels for goto are indented one step less than other code. This is correct indentation, a common style in C.

* EEEFFFGGGHHH0000
* AAABBBCCCDDDEEEFFFGGGHHH0000
*
* Note that the last 4 bytes are padding bytes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will the last 4 bytes padding bytes impact the binary dump files which caused the inconsistent with origin one?

Copy link
Author

@Nugine Nugine Nov 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The 4 padding bytes produced by AVX2 STORE are overwritten by the last scalar computation (32 registers).

The two AVX2 functions keep the memory format unchanged. They just accelerate the computation.

@Nugine
Copy link
Author

Nugine commented Nov 20, 2024

WIP: unit tests for verifying the results produced by non-AVX2 and AVX2 implementations.

I have other things to do these days. So it may take a while.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants