AVX2 Question #3

xtremertx · 2020-05-15T14:33:07Z


Hi,
sorry for opening an issue, but I couldn't find any contact details for you.

First of all, this repository is a brilliant implementation and I'm still learning from it.

I'm currently trying to vectorize ChaCha20 (a symmetric cipher) using C#/.NET Core, and I don't know how to rotate left or right using AVX2, as there are no native rotate instructions in AVX2.

My current code sucks, because I have to define a mask for each fixed rotation ( x >>> 96, x >>> 64, x >>> 32), this however takes more registers...

uint* msk96 = stackalloc uint[8] { 5, 6, 7, 0, 1, 2, 3, 4 };   // rotate right 96-bits using a hardcoded mask
var mask = Avx2.LoadVector256(msk96);
d = Avx2.PermuteVar8x32(d, mask);

I have heard that it is possible to use "Xor, Shift, Add" to do so? How?

Would you be so kind as to share your insight or a fragment of code with a little explanation?

Thanks a lot!

saucecontrol (Maintainer) · 2020-05-15T18:45:11Z

Howdy! I'm afraid I'm not familiar with the math in ChaCha20, but I'll try to help.

> I have to define a mask for each fixed rotation ( x >>> 96, x >>> 64, x >>> 32), this however takes more registers...

This may not be a problem as long as you don't need a ton of registers for your data and state. For example, I do ror by 24 and 16 bits in the AVX2 BLAKE2 implementation with masks that are cached in spare registers:

Blake2Fast/src/Blake2Fast/Blake2b/Blake2bAvx2.cs

Lines 27 to 31 in e49f3b9

    
    // Rotate shuffle masks. We can safely convert the ref to a pointer because the compiler guarantees the
    // data is in a fixed location, and the ref itself is converted from a pointer. Same for the IV below.
    byte* prm = (byte*)Unsafe.AsPointer(ref MemoryMarshal.GetReference(rormask));
    var r24 = Avx2.BroadcastVector128ToVector256(prm);
    var r16 = Avx2.BroadcastVector128ToVector256(prm + Vector128<byte>.Count);

On x64, we have plenty of registers to spare, so those masks stay in those registers. On x86, they can be spilled to the stack, but the spill/fill doesn't make it noticeably slower.

Using stackalloc to set up and load the mask as you have in that example would be very inefficient, so hopefully that's just illustrative. If you're not familiar with the ReadOnlySpan<byte> trick with Roslyn for fixed data, give this a read: https://vcsjones.dev/2019/02/01/csharp-readonly-span-bytes-static/

I'm using a static property for my ror masks, but a local works as well with the ROS trick.

> I have heard that it is possible to use "Xor, Shift, Add" to do so? How?

I assume you mean the way a ror or rol is written in languages that don't natively support such a construct, as implemented here:

https://github.com/dotnet/runtime/blob/592a1e61af160f383859f1a46c63649cc71a8274/src/libraries/System.Private.CoreLib/src/System/Numerics/BitOperations.cs#L463-L464

The same can be applied to a SIMD vector, provided the shift instruction exists for the element size you're using. It appears from your example, though, you're doing a ror of 96 bits across a 256-bit value. In that case, your choice of Avx2.PermuteVar8x32 is the right way to go. In fact, since that operation requires data to move between 128-bit lanes, a permute is the only option.
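For the per-element case, here is a minimal sketch of the shift-and-combine rotate applied to a vector of 32-bit values. `SimdRotate` and `RotateLeft32` are hypothetical names chosen for illustration; they are not part of Blake2Fast or the .NET libraries.

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class SimdRotate
{
    // Hypothetical helper: rotates every 32-bit element left by `count` bits,
    // assuming 0 < count < 32. Requires Avx2.IsSupported.
    public static Vector256<uint> RotateLeft32(Vector256<uint> x, byte count) =>
        Avx2.Or(
            Avx2.ShiftLeftLogical(x, count),                 // bits that move up
            Avx2.ShiftRightLogical(x, (byte)(32 - count)));  // bits that wrap around
}
```

ChaCha20's quarter round rotates by 16, 12, 8 and 7 bits, so a call would look like `d = SimdRotate.RotateLeft32(d, 16);`. Passing a constant count gives the JIT the chance to encode it as an immediate once the helper is inlined. The byte-multiple rotations (16 and 8) could instead use a byte shuffle with a cached mask, as in the BLAKE2 snippet above.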

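If you aren't able to spare a register to cache the permute mask, I'd suggest taking advantage of the fact that VEX encoding allows a memory-address operand as the last source, and the JIT will fold a vector load into that encoding where the instruction permits it. Combined with the ROS trick for fixed data, dest = Avx2.PermuteVar8x32(data, Avx.LoadVector256(maskptr)) avoids dedicating a register to the mask for the whole loop. One caveat: vpermd takes its index vector as the first source (vpermd dest, idx, data/m256), so the operand that can come straight from memory is the data, not the mask, and the mask is loaded into a scratch register right before use. For instructions whose constant is the last operand, such as vpshufb or vpaddd, the load folds away completely, which is as good as it gets :)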
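As a rough sketch of how the ROS trick and the permute fit together: the class name `PermuteExample`, the property `PermuteMask` and the method `Permute` are made up for this example, and the index pattern is the one from the question. The sketch only shows the data flow and makes no claim about the exact instructions the JIT emits.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static unsafe class PermuteExample
{
    // Indices { 5, 6, 7, 0, 1, 2, 3, 4 } written out as little-endian 32-bit values.
    // Returning `new byte[] { ... }` as ReadOnlySpan<byte> lets the compiler place
    // the bytes in the assembly's static data instead of allocating an array.
    private static ReadOnlySpan<byte> PermuteMask => new byte[]
    {
        5, 0, 0, 0,  6, 0, 0, 0,  7, 0, 0, 0,  0, 0, 0, 0,
        1, 0, 0, 0,  2, 0, 0, 0,  3, 0, 0, 0,  4, 0, 0, 0,
    };

    // result[i] = data[mask[i]] for each of the eight 32-bit elements.
    public static Vector256<uint> Permute(Vector256<uint> data)
    {
        uint* mask = (uint*)Unsafe.AsPointer(ref MemoryMarshal.GetReference(PermuteMask));
        return Avx2.PermuteVar8x32(data, Avx.LoadVector256(mask));
    }
}
```

This needs `AllowUnsafeBlocks` in the project file and a CPU where `Avx2.IsSupported` is true. No `stackalloc` or per-call setup is involved; the mask bytes already exist in memory when the method runs.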


xtremertx (Author) · 2020-05-16T20:15:25Z

Alright, I was able to use the ROS trick together with cached masks and VEX encoding, and that helped a lot!

The main calculation seems to work. Now I need to increment a 64-bit section inside a 256-bit vector by one, as I'm calculating 2 blocks simultaneously and each block contains a sequential "counter".

I have this Vector256 |C:0, D:1, E:2, F:3, C:4, D:5, E:6, F:7| and I need to increment (C:4 D:5) as a 64-bit number by one without affecting the other elements. Is that possible to do using vectors?

Basically, I have four 256-bit vectors (a, b, c, d). The first 4 columns represent "block0" and the second 4 columns represent "block1". I need to increment (C D) in block1 as a 64-bit number by one, because block1 is a copy of block0 with the same "counter" (C D).

The reason for not incrementing the counter before fetching the data into vectors is that I'm using BroadcastVector128ToVector256:

// test state matrix (16 * sizeof(uint) = 64 bytes per block)
uint* state = stackalloc uint[16] 
{ 
0x0, 0x1, 0x2, 0x3, 
0x4, 0x5, 0x6, 0x7, 
0x8, 0x9, 0xA, 0xB, 
0xC, 0xD, 0xE, 0xF      // 64bit Counter (C, D);
};

// More efficient than loading Vector256<uint>
// This will load block0 (state matrix) and block1 (copy of block0), represented as 4 x 256bit vectors
a = Avx2.BroadcastVector128ToVector256(state);
b = Avx2.BroadcastVector128ToVector256(state + Vector128<uint>.Count);
c = Avx2.BroadcastVector128ToVector256(state + Vector128<uint>.Count * 2);
d = Avx2.BroadcastVector128ToVector256(state + Vector128<uint>.Count * 3);

// Now I need to increment "counter" in block1 by one... before doing any calculations..

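// a += b; d ^= a; d <<<= 16;
a = Avx2.Add(a, b);
d = Avx2.Xor(d, a);
// rotate left = (d << 16) | (d >> 16); a shift alone would drop the high bits
d = Avx2.Or(Avx2.ShiftLeftLogical(d, 16), Avx2.ShiftRightLogical(d, 16));

// c += d; b ^= c; b <<<= 12;
c = Avx2.Add(c, d);
b = Avx2.Xor(b, c);
// rotate left by 12 = (b << 12) | (b >> 20)
b = Avx2.Or(Avx2.ShiftLeftLogical(b, 12), Avx2.ShiftRightLogical(b, 20));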

// .............


xtremertx (Author) · 2020-05-16T21:35:07Z

Ah, it's so simple for a 64-bit increment:

var addOne = Vector256.Create(0ul, 0, 1, 0);
var vector = Vector256.Create(0ul, 0, (ulong)uint.MaxValue, 0);
var result = Avx2.Add(vector, addOne).AsUInt32();

What is the difference between Vector256.Create and Avx2.LoadVector256? Can both be used by VEX encoding?

One more question: what is the more optimal way to get the high/low 128-bit part of a 256-bit vector?

Vector256<uint> a = ...
a.GetUpper();
Avx2.ExtractVector128(a, 1);   // the index selects the 128-bit half (0 = lower, 1 = upper), not an element offset


saucecontrol (Maintainer) · 2020-05-17T20:57:11Z

> What is the difference between Vector256.Create and Avx2.LoadVector256? Can both be used by VEX encoding?

Vector256.Create() can generate a sequence of mov/shuffle/insert/broadcast steps to construct a vector, whereas Avx.LoadVector256() is a single instruction that loads the vector as already laid out in memory. VEX is the encoding used for all AVX instructions. My comment earlier referred to the fact that VEX-encoded SIMD instructions often allow for a memory address operand in place of the last XMM/YMM operand. In many cases, the JIT can eliminate an Avx.LoadVector256() entirely by folding it into the following instruction.

For your addOne vector, you can use the ROS trick I mentioned before to pre-generate the bytes for the vector and then have its load folded into the Avx2.Add when you use it. Additionally, since you are already pre-generating your rotate masks and loading them from a ROS, you could help the JIT out by putting your constant vectors all sequentially in a single ROS and referencing them as offsets from a single pointer. That's actually what a native compiler would do.
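A minimal sketch of that layout, assuming just two constants: the permute indices from the question and the addOne vector. The names `ChaChaConstants`, `Table`, `PermuteOffset`, `AddOneOffset`, `Permute` and `IncrementBlock1Counter` are all hypothetical and only serve the illustration.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static unsafe class ChaChaConstants
{
    private const int PermuteOffset = 0;   // 8 x uint permute indices
    private const int AddOneOffset = 32;   // 4 x ulong: 0, 0, 1, 0

    // All constant vectors stored back to back in one block of static data
    // (little-endian byte order, as on x86/x64).
    private static ReadOnlySpan<byte> Table => new byte[]
    {
        // offset 0: uint indices { 5, 6, 7, 0, 1, 2, 3, 4 }
        5, 0, 0, 0,  6, 0, 0, 0,  7, 0, 0, 0,  0, 0, 0, 0,
        1, 0, 0, 0,  2, 0, 0, 0,  3, 0, 0, 0,  4, 0, 0, 0,
        // offset 32: ulong values { 0, 0, 1, 0 }
        0, 0, 0, 0, 0, 0, 0, 0,  0, 0, 0, 0, 0, 0, 0, 0,
        1, 0, 0, 0, 0, 0, 0, 0,  0, 0, 0, 0, 0, 0, 0, 0,
    };

    private static byte* Base => (byte*)Unsafe.AsPointer(ref MemoryMarshal.GetReference(Table));

    public static Vector256<uint> Permute(Vector256<uint> data) =>
        Avx2.PermuteVar8x32(data, Avx.LoadVector256((uint*)(Base + PermuteOffset)));

    // Adds 1 to the 64-bit value formed by uint elements 4 and 5 (block1's counter);
    // every other element is left unchanged because its addend is zero.
    public static Vector256<uint> IncrementBlock1Counter(Vector256<uint> d) =>
        Avx2.Add(d.AsUInt64(), Avx.LoadVector256((ulong*)(Base + AddOneOffset))).AsUInt32();
}
```

Because the add is done on 64-bit elements, a carry out of element 4 propagates into element 5, which is the behaviour wanted for a 64-bit counter. Like the earlier snippets, this needs `AllowUnsafeBlocks` and AVX2 support.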

> One more question: what is the more optimal way to get the high/low 128-bit part of a 256-bit vector?

The low part can be extracted for free with .GetLower(), which simply reinterprets a Vector256<T> as a Vector128<T> (or YMM as XMM). .GetUpper() will emit vextracti128 to get the upper half. It's the same as Avx2.ExtractVector128(vec, 1) but easier on the eyes ;)
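A small sketch to make the three options concrete; `HalvesExample` and its method names are hypothetical wrappers added only for illustration.

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class HalvesExample
{
    // Elements 0..3: a reinterpretation of the same register, no data movement needed.
    public static Vector128<uint> Lower(Vector256<uint> v) => v.GetLower();

    // Elements 4..7 via the helper method.
    public static Vector128<uint> Upper(Vector256<uint> v) => v.GetUpper();

    // Elements 4..7 via the explicit intrinsic; index 1 selects the upper half.
    public static Vector128<uint> UpperExplicit(Vector256<uint> v) => Avx2.ExtractVector128(v, 1);
}
```

For a vector holding 0 through 7, `Lower` yields elements 0 to 3 and both `Upper` variants yield elements 4 to 7.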

Since you're obviously doing your research and coming up with clear and specific questions, you should be asking these on StackOverflow, where both the questions and answers will get more visibility. I'm on there too (same username), so if you want to tag me or message me (I'm actually not sure how it works), feel free.


viktor-svub · 2020-05-18T07:14:21Z


Actually, I very much like the discussion that happened here -- if it was on SO from the start, I'd find it only when explicitly looking for it -- while here, it became a new point of interest to everyone subscribed to the repo, and it relates directly to your work, which is quite an interesting one for sure :)
If it moves to SO, please link the question here so we can follow up 👍


xtremertx (Author) · 2020-05-18T07:38:59Z

@viktor-svub Indeed, I have searched SO a lot, together with Google/DuckDuckGo, and there is not much about AVX/AVX2, especially when it comes to .NET Core, which supports these instruction sets. I see big potential in using them to close the gap between C/C++ and C#.

I have actually solved my problem and fully vectorized ChaCha20 using AVX2. It still needs some refactoring, unit tests (including test vectors) and benchmarks. I plan to release it as open source for any .NET developer once I'm satisfied with it.

As I still have some questions about the performance of different AVX2 instructions and their combinations, I'll definitely ask on SO and post a link here 😉.


saucecontrol (Maintainer) · 2020-05-18T07:52:26Z

> if it was on SO from the start, I'd find it only when explicitly looking for it

Ha, I was trying to cut down on the noise here, but that's a good point. I'm glad you found it interesting.

@xtremertx That's great! If you need a review, I'd be happy to give it a look.

My thought was the larger community at SO might get your questions seen by people who have experience specifically with SIMD implementations of ChaCha20, either with assembly or native intrinsics.

It's true there's not much info out there about using the new .NET intrinsics, and I'm happy to help other devs getting into them. I figure if they end up on SO, they'll show up better for people looking in the future :)


xtremertx (Author) · 2020-05-18T11:14:46Z

Ok, I have posted a new question on SO. Here is a link.


saucecontrol (Maintainer) · 2020-05-18T19:34:32Z

Upvoted 👍

Some really nice discussion over there already, I see 😄
