feat: MSM skip doubling when window has all zeros #152
Do you have benchmark results for this change? I tried running some and found the results surprising: there's a small speed improvement with k < 21, and at k > 21 there's a slowdown (that's only for the [...]). I ran a test where all the coefficients are 8 bits (so that the skipping of zeros can shine). These are my results:
Tested via a6abbc8 with
My CPU is an AMD Ryzen 5 3600 6-Core Processor. If you benchmarked this, did you get different results? I wonder if my tests have a mistake that leads to this surprising result 🤔
Here's a proposal for a different approach to skipping zeros. Instead of preparing the coeffs in windows beforehand, just scan the coeffs to find the maximum number of bytes any of them uses, and then run the algorithm over only that many bytes. This is what it looks like: Lines 541 to 559 in db46631
Now, if the maximum number of bytes is smaller than the byte size of the field, there are fewer windows to slide over (the windows that would pick up the most significant bits, which were found to be zero, are skipped). These are the test results I get:
An interesting next step, which I was exploring a few weeks ago, is adding metadata to the vector of coefficients to indicate how many bytes they use. When writing a circuit we may already know how many bytes are used in certain columns, so we could skip the scan that determines the maximum. See privacy-scaling-explorations/halo2#315
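The scan proposed in this comment can be sketched roughly as follows. This is not the code referenced at db46631; it is a minimal illustration that assumes each coefficient is available as 32 little-endian bytes. The names `max_used_bytes` and `num_windows` are hypothetical and do not come from the PR.

```rust
/// Hypothetical helper: the largest number of significant bytes used by any
/// coefficient. Assumes each coefficient is 32 bytes in little-endian order,
/// so trailing zero bytes are the most significant ones.
fn max_used_bytes(coeffs: &[[u8; 32]]) -> usize {
    coeffs
        .iter()
        // Index of the last non-zero byte, plus one; 0 if the scalar is zero.
        .map(|c| c.iter().rposition(|&b| b != 0).map_or(0, |i| i + 1))
        .max()
        .unwrap_or(0)
}

/// Hypothetical helper: number of `c`-bit windows needed to cover
/// `used_bytes` bytes (rounded up).
fn num_windows(used_bytes: usize, c: usize) -> usize {
    assert!(c >= 1, "window width must be at least one bit");
    (used_bytes * 8 + c - 1) / c
}
```

With 8-bit coefficients and a 4-bit window, `max_used_bytes` returns 1 and `num_windows(1, 4)` is 2, compared with 64 windows for full 32-byte scalars. The scan costs one linear pass over the coefficients, which is the overhead discussed further down the thread.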
@jonathanpwang did you have time to take a look at this?
Sorry, I have not had a chance to look at this further. I had considered your suggestion before: I was worried that for scalars that are not small, the initial scan would add unwanted overhead. What happens if you run your second approach on full-size scalars?
I ran the test with full-size scalars and found that the older implementation was slightly slower than the one I suggest, which doesn't make sense: the suggestion should add a small overhead. For each k, the test ran the old implementation first and then my suggested one. When I swapped the order, my suggestion came out slightly slower instead. So I don't think this way of comparing is very good; my guess is that whichever runs second has an advantage because some data is already in the cache. I decided to test this with proper benchmarks and will report the results once I have them.
Did some benches with criterion to compare the original msm, your proposal and my proposal; with big values and 8 bit values. Here are the results:
The summary is:
To reproduce, check out commit d1f79a5 and run with
Yes, interesting, thanks for the benchmarks. I'm guessing your scan just preloads data into the cache, so it doesn't cause much slowdown. I am in favor of going with yours.
Superseded by #168
Closes #150
To be honest, I did not have enough time to understand the full implementation of Cyclone MSM. However, the principle that a window whose bits are all 0 can be skipped entirely seems like it carries over unchanged.
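As a rough illustration of that principle (not the implementation in this PR, and not Cyclone MSM), here is a toy bucket-method MSM in which the "points" are plain `u64` values under wrapping addition and the scalars are `u64`. The function `toy_msm` and its helper `window` are hypothetical names. The sketch skips the bucket work for any window in which every scalar has a zero digit; the `c` doublings are still applied so that later windows keep their positional weight, although on an accumulator that is still the identity they are a no-op.

```rust
/// Extract the `w`-th `c`-bit window (digit) of scalar `s`.
fn window(s: u64, w: usize, c: usize) -> usize {
    ((s >> (w * c)) & ((1u64 << c) - 1)) as usize
}

/// Toy MSM: computes sum(scalars[i] * points[i]) in the additive group of
/// u64 with wrapping arithmetic, standing in for curve points.
/// Returns (result, number of windows whose bucket work was skipped).
fn toy_msm(scalars: &[u64], points: &[u64], c: usize) -> (u64, usize) {
    assert_eq!(scalars.len(), points.len());
    assert!(c >= 1 && c <= 16, "window width kept small for the toy example");
    let n_windows = (64 + c - 1) / c;
    let mut acc = 0u64;
    let mut skipped = 0;
    // Process windows from most significant to least significant.
    for w in (0..n_windows).rev() {
        // "Double" the accumulator c times (doubling is x + x in this group).
        for _ in 0..c {
            acc = acc.wrapping_add(acc);
        }
        // If every scalar has a zero digit in this window, there is nothing
        // to add: skip bucket allocation, accumulation and the running sum.
        if scalars.iter().all(|&s| window(s, w, c) == 0) {
            skipped += 1;
            continue;
        }
        // buckets[d - 1] collects the points whose digit in this window is d.
        let mut buckets = vec![0u64; (1usize << c) - 1];
        for (&s, &p) in scalars.iter().zip(points) {
            let d = window(s, w, c);
            if d != 0 {
                buckets[d - 1] = buckets[d - 1].wrapping_add(p);
            }
        }
        // Running-sum trick: bucket d ends up counted d times.
        let mut running = 0u64;
        let mut window_sum = 0u64;
        for b in buckets.iter().rev() {
            running = running.wrapping_add(*b);
            window_sum = window_sum.wrapping_add(running);
        }
        acc = acc.wrapping_add(window_sum);
    }
    (acc, skipped)
}
```

For the 8-bit scalars `[3, 200, 17]` with points `[5, 7, 11]` and `c = 4`, only the two lowest of the 16 windows contain non-zero digits, so 14 windows are skipped and the result matches the naive sum 3·5 + 200·7 + 17·11 = 1602.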