Replies: 9 comments
-
Howdy! I'm afraid I'm not familiar with the math in ChaCha20, but I'll try to help.
This may not be a problem as long as you don't need a ton of registers for your data and state. For example, I do this in Blake2Fast/src/Blake2Fast/Blake2b/Blake2bAvx2.cs (lines 27 to 31 in e49f3b9). On x64, we have plenty of registers to spare, so those masks stay in those registers. On x86, they can be spilled to the stack, but the spill/fill doesn't make it noticeably slower. Using I'm using a static property for my
I assume you mean the way a
The same can be applied to a SIMD vector, provided the shift instruction exists for the element size you're using. It appears from your example, though, you're doing a
If you aren't able to spare a register to cache the permute mask, I'd suggest taking advantage of the fact that VEX encoding allows a memory-address operand, and the JIT will fold a vector load into that encoding. Combined with the ROS trick for fixed data,
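A minimal sketch of the two ideas in this reply, not the code from Blake2Fast. It assumes the "shift" approach means the usual shift-left, shift-right, OR pattern, and that "ROS trick" refers to a static `ReadOnlySpan<byte>` property over constant data. The names `RotateSketch`, `RotateLeftScalar`, `RotateLeftVector`, `Rol8Mask` and `RotateLeft8` are hypothetical. Whether the JIT actually folds the mask load into the shuffle depends on the runtime version.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class RotateSketch
{
    // Scalar reference: rotate a 32-bit value left by n bits (0 < n < 32).
    public static uint RotateLeftScalar(uint x, int n) => (x << n) | (x >> (32 - n));

    // The same shift/shift/or pattern on eight 32-bit lanes. Requires AVX2.
    // Pass a constant n at the call site so the JIT can use the immediate form.
    public static Vector256<uint> RotateLeftVector(Vector256<uint> x, byte n) =>
        Avx2.Or(Avx2.ShiftLeftLogical(x, n), Avx2.ShiftRightLogical(x, (byte)(32 - n)));

    // Byte-shuffle mask that rotates every 32-bit lane left by 8 bits.
    // Assumption: constant data exposed through a static ReadOnlySpan<byte>
    // property, so no array is allocated and no static field is needed.
    private static ReadOnlySpan<byte> Rol8Mask => new byte[]
    {
        3, 0, 1, 2, 7, 4, 5, 6, 11, 8, 9, 10, 15, 12, 13, 14,
        3, 0, 1, 2, 7, 4, 5, 6, 11, 8, 9, 10, 15, 12, 13, 14
    };

    // Rotation by a multiple of 8 bits done as a single byte shuffle. Requires AVX2.
    public static Vector256<uint> RotateLeft8(Vector256<uint> x)
    {
        // Load the mask straight from the constant data instead of caching it.
        Vector256<byte> mask = Unsafe.ReadUnaligned<Vector256<byte>>(
            ref MemoryMarshal.GetReference(Rol8Mask));
        return Avx2.Shuffle(x.AsByte(), mask).AsUInt32();
    }
}
```

The shift/or form works for any rotation count but costs three instructions; the shuffle form only covers byte-aligned rotations (8, 16, 24 bits) but needs one instruction plus the mask. Callers should check `Avx2.IsSupported` before using either vector method.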
-
Ah, it's so simple for a 64-bit increment:
var addOne = Vector256.Create(0ul, 0, 1, 0);
var vector = Vector256.Create(0ul, 0, (ulong)uint.MaxValue, 0);
var result = Avx2.Add(vector, addOne).AsUInt32();
What is the difference between Vector256.Create and Avx2.LoadVector256? Can both be used with VEX encoding?
One more question: what is the more optimal way to get the high/low 128-bit part of a 256-bit vector?
Vector256<uint> a = ...
a.GetUpper();
Avx2.ExtractVector128(a, (byte)Vector128<uint>.Count);
-
For your
The low part can be extracted for free with
Since you're obviously doing your research and coming up with clear and specific questions, you should be asking these on StackOverflow, where both the questions and answers will get more visibility. I'm on there too (same username), so if you want to tag me or message me (I'm actually not sure how it works), feel free.
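A small sketch of splitting a 256-bit vector into halves, using the two calls from the question plus `GetLower()`. Which call the reply named for the "free" low half is missing from the text above, so `GetLower()` is an assumption here; `HalvesSketch` and `Split` are hypothetical names. The low half can be free because the lower 128 bits of a YMM register are the corresponding XMM register, while the upper half needs an actual extract.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class HalvesSketch
{
    public static (Vector128<uint> Low, Vector128<uint> High) Split(Vector256<uint> a)
    {
        // Low half: a reinterpretation of the same register, usually no instruction.
        Vector128<uint> low = a.GetLower();

        // High half: index 1 selects the upper 128 bits (the index is 0 or 1).
        // GetUpper() is the portable spelling and is used when AVX2 is absent.
        Vector128<uint> high = Avx2.IsSupported
            ? Avx2.ExtractVector128(a, 1)
            : a.GetUpper();

        return (low, high);
    }
}
```

For example, splitting `Vector256.Create(0u, 1, 2, 3, 4, 5, 6, 7)` gives a low half holding elements 0 to 3 and a high half holding elements 4 to 7.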
-
Actually, I very much like that the discussion happened here. If it had been on SO from the start, I'd have found it only when explicitly looking for it, while here it became a new point of interest to everyone subscribed to the repo, and it is directly relatable to your work, which is quite an interesting one for sure :)
-
@viktor-svub Indeed, I have searched SO a lot, together with Google/DuckDuckGo, and there is not much about AVX/AVX2, especially when it comes to .NET Core, which supports these instruction sets. I see big potential in using these to close the gap between C/C++ and C#. I have actually solved my problem of fully vectorizing ChaCha20 using AVX2. It still needs some refactoring, unit tests (including test vectors) and benchmarks. I plan to release it as open source for any .NET developer once I'm satisfied with it. As I still have some questions about the performance of different AVX2 instructions and their combinations, I'll definitely ask on SO and post a link here 😉.
-
Ha, I was trying to cut down on the noise here, but that's a good point. I'm glad you found it interesting.
@xtremertx That's great! If you need a review, I'd be happy to give it a look. My thought was the larger community at SO might get your questions seen by people who have experience specifically with SIMD implementations of ChaCha20, either with assembly or native intrinsics. It's true there's not much info out there about using the new .NET intrinsics, and I'm happy to help other devs getting into them. I figure if they end up on SO, they'll show up better for people looking in the future :)
-
Ok, I have posted a new question on SO. Here is a link.
-
Upvoted 👍 Some really nice discussion over there already, I see 😄
-
Hi,
sorry that I have opened an issue, but I couldn't find any contact details for you.
First of all, this repository is a brilliant implementation and I'm still learning from it.
I'm currently trying to vectorize ChaCha20 (a symmetric cipher) using C#/.NET Core and I don't know how to rotate left or right using AVX2, as there are no native rotate instructions in AVX2.
My current code sucks, because I have to define a mask for each fixed rotation (x >>> 96, x >>> 64, x >>> 32), and this takes more registers...
I have heard that it is possible to use "Xor, Shift, Add" to do so. How?
Would you be so kind as to share your insight or a fragment of code with a little explanation?
Thanks a lot!