Idempotent finish #1

buybackoff · 2018-12-07T21:50:05Z

Hi, thanks for the blog posts and this implementation!

Just FYI, based on your code I implemented incremental hashing with intermediate hash calculations and optimized somewhat (removed fixed usage in favor of Unsafe and pinned input).

Spreads/Spreads@4dae1bf

saucecontrol (Maintainer) · 2018-12-07T23:07:49Z

Nice! Thanks for sharing. I plan on adding an AVX2 implementation as well once the API settles down in netcoreapp3.0. Should compare favorably to anything native at that point.

buybackoff (Author) · 2018-12-07T23:19:25Z

That would be great! One thing I had to change was the ArraySegment usage. Why do you think it's preferable not to use the System.Memory NuGet package when fast Span is not available?

Also, a dispatch like this:

#if NETCOREAPP2_1
            if (Sse41.IsSupported)
            {
                mixSse41(s, m);
            }
            else
#endif
            {
                mixScalar(s, m);
            }

is as efficient as #ifdef USE_INTRINSICS, because Sse41.IsSupported is a JIT-time constant and the branch is completely eliminated. Intrinsics are going official quite soon, so there shouldn't be any reason not to use them if the hardware supports them.

saucecontrol (Maintainer) · 2018-12-07T23:54:46Z

> Why do you think it's preferable not to use the System.Memory NuGet package when fast Span is not available?

This was just an attempt to keep dependencies to a minimum. I'll probably switch over to Span<T> for all builds in the next version. I can't really imagine anyone being put off by a dependency on System.Memory.

> Intrinsics are going official quite soon, so there shouldn't be any reason not to use them if the hardware supports them.

Yeah, it's only because they're not official yet that I wanted a generic define to toggle the intrinsics support. It's useful for benchmarking as well. I'll have to change that define in the next version anyway because the X86 Intrinsics API has changed considerably in 3.0. I haven't decided whether to maintain the 2.1-compatible version of the code, since it was experimental anyway.

buybackoff (Author) · 2018-12-08T00:34:10Z

BTW, some .NET-specific changes: Spreads/Spreads@895e446

1. For the incremental case, copy only h, t, f.
2. Cache the s->h and s->viv pointers as locals; otherwise they are dereferenced every time.
3. Always pass Vector128 by ref; otherwise it has to be copied even if the callee is inlined by the JIT.

Together it's c. 10% faster incremental caching. The first one gave 3.5%. The last two gave the rest and apply generally to all cases.

saucecontrol (Maintainer) · 2018-12-08T01:25:34Z

You sure about those numbers?

> Cache the s->h and s->viv pointers as locals; otherwise they are dereferenced every time.

There's no dereference for struct members accessed from a pointer. Struct members have a known constant offset, so for example s->h is just s + 64.

> Always pass Vector128 by ref; otherwise it has to be copied even if the callee is inlined by the JIT.

If the callee is inlined, the Vector128 will stay enregistered. It's not passed anywhere, so there's no copy.
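
The constant-offset point above can be checked with a small sketch. The `State` struct here is a hypothetical stand-in (a 64-byte buffer followed by the hash words), not necessarily the library's actual layout, and the code assumes compilation with unsafe blocks enabled.

```csharp
using System;
using System.Runtime.InteropServices;

public static unsafe class OffsetDemo
{
    // Hypothetical stand-in for a hasher state; the real struct may be laid out differently.
    [StructLayout(LayoutKind.Sequential)]
    public struct State
    {
        public fixed byte b[64];   // input buffer, offsets 0..63
        public fixed ulong h[8];   // hash words, starting right after the buffer
    }

    // Returns the byte distance between the struct start and its h member.
    public static long OffsetOfH()
    {
        State s = default;
        State* p = &s;
        // p->h is not a load from memory: it is the address p plus a fixed offset.
        return (byte*)p->h - (byte*)p;
    }
}
```

With this layout `OffsetDemo.OffsetOfH()` returns 64, which is the sense in which `s->h` is "just s + 64": the member address is computed by an add, and no pointer is read from the struct.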

buybackoff (Author) · 2018-12-08T01:41:26Z

> You sure about those numbers?

The code is there. The difference is there. With the last 3 commits, the numbers went from c. 290 to c. 330 in the incremental benchmark. Anyone can try to reproduce it.

I will not go into the assembly, but (s->h) first calculates the pointer; maybe the +64 addition is the reason.

> If the callee is inlined, the Vector128 will stay enregistered. It's not passed anywhere, so there's no copy.

Maybe ref is better optimized. Or maybe the copy happens earlier. .NET is known to make defensive copies of mutable structs unless it's 100% sure they're not needed. Maybe here it's the 99% case :)

saucecontrol (Maintainer) · 2018-12-08T01:57:40Z

> The code is there. The difference is there. With the last 3 commits, the numbers went from c. 290 to c. 330 in the incremental benchmark. Anyone can try to reproduce it.

That's why I ask. Unless you ran both versions of the code together in the same benchmark run, I'd put any differences down to other variables between the runs.

> I will not go into the assembly, but (s->h) first calculates the pointer; maybe the +64 addition is the reason.

I actually did look at the assembly as I was writing the code. s->h becomes s + 64, s->h + 2 becomes s + 80, etc. In the SSE version of the code, there's no harm in introducing a new local there (also no benefit), but in the scalar version of the code, such a local adds register pressure in the allocator and results in worse performance. You may find the same is true on x86 runtimes even with SSE because there are fewer GP registers available.

> Maybe ref is better optimized. Or maybe the copy happens earlier. .NET is known to make defensive copies of mutable structs unless it's 100% sure they're not needed. Maybe here it's the 99% case :)

This is also something you need not guess about if you look at the assembly. Vector128<T> is a readonly struct, so no defensive copies ever, and in any case an inlined method accepting Vector128<T> keeps the vector enregistered.
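
For context, here is a minimal sketch of where defensive copies do occur in C#: calling a method on a non-readonly struct through a `readonly` field. The `Counter` and `Holder` types are invented for illustration. Declaring the struct `readonly` (as Vector128<T> is) tells the compiler the call cannot mutate it, so no such copy is needed.

```csharp
using System;

// Mutable (non-readonly) struct: the compiler must assume methods may mutate it.
public struct Counter
{
    public int Value;
    public int Increment() => ++Value;
}

public class Holder
{
    private readonly Counter _ro;  // readonly field of a mutable struct type
    private Counter _rw;           // ordinary mutable field

    // Each call runs on a temporary copy of _ro, so the field never changes.
    public int IncrementReadonlyTwice()
    {
        _ro.Increment();
        return _ro.Increment();
    }

    // Calls operate on the field itself, so the increments accumulate.
    public int IncrementMutableTwice()
    {
        _rw.Increment();
        return _rw.Increment();
    }
}
```

On a fresh `Holder`, `IncrementReadonlyTwice()` returns 1 and `IncrementMutableTwice()` returns 2. This shows the language-level rule only; whether any copy survives in the generated machine code is still best confirmed by reading the JIT output.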

buybackoff (Author) · 2018-12-08T02:16:24Z

It would be interesting to statistically prove a 0.15% or 1.5% difference, but 15% is plainly visible. In the original code many methods were also not inlined (either no AggressiveInlining attribute, or exceptions thrown directly instead of through a throw helper).

Interesting read on a related subject: https://github.com/dotnet/coreclr/issues/21330#issuecomment-443531813

A minor change could make the JIT decide not to inline, etc., and probably only @AndyAyersMS knows how it works.

I'm just happy with the result. And copying just half the struct is definitely visible.
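
The throw-helper pattern mentioned above can be sketched as follows. This is an illustration only, and the names `CheckedIndex` and `ThrowOutOfRange` are hypothetical. The idea is to move the `throw` into a separate non-inlined method so the hot method stays small and remains a good inlining candidate.

```csharp
using System;
using System.Runtime.CompilerServices;

public static class ThrowHelperDemo
{
    // Hot path: a compare and a call on the rare failure branch, nothing else.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static int CheckedIndex(int index, int length)
    {
        // The unsigned compare also rejects negative indices.
        if ((uint)index >= (uint)length)
            ThrowOutOfRange();
        return index;
    }

    // Cold path: exception construction lives here, outside the inlined body.
    [MethodImpl(MethodImplOptions.NoInlining)]
    private static void ThrowOutOfRange() =>
        throw new ArgumentOutOfRangeException("index");
}
```

`ThrowHelperDemo.CheckedIndex(3, 10)` returns 3, while `CheckedIndex(10, 10)` or `CheckedIndex(-1, 10)` throws. Whether the JIT actually inlines `CheckedIndex` at a given call site still has to be verified from the generated code.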

saucecontrol (Maintainer) · 2018-12-08T02:29:12Z

I made an edit to my previous reply, but it's worth repeating. If you haven't tested with your changes on x86, you may find a regression there. There are fewer GP registers available on x86, and you can introduce additional register pressure by adding an unnecessary local variable. It's been a while since I wrote the code, so I can't remember if that was the case.

The issue you referenced regarding inlining only applies to the heuristic-driven inlining, which in this case is overridden with the AggressiveInlining hint. Again, it pays to see what actually comes out after JIT.

If you're happy with your changes, then by all means keep them. I'm just saying your separate benchmarking runs (with multiple code changes between) make it difficult to demonstrate the exact improvement. I don't doubt that your partial state struct improved things for your modified incremental hashing implementation, but the other changes are unlikely to have had the impact you claimed.

Thanks for sharing regardless. I'm happy to see someone getting use from my code, and I'm always interested to know if I can improve things.

AndyAyersMS · 2018-12-08T03:17:08Z

Inlining can certainly make things slower, so I usually tell people to go easy on AggressiveInlining and add it in gradually, measuring along the way...

saucecontrol (Maintainer) · 2018-12-08T06:27:48Z

Totally. I was referring to methods like this, where it would be very bad for perf if it weren't inlined for some reason. This should, in fact, compile to a single instruction.

Blake2Fast/src/Blake2Fast/Blake2bSse4.cs

Lines 36 to 38 in 4880435

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static Vector128<ulong> alignr_ulong(Vector128<ulong> x, Vector128<ulong> y, byte m) =>
        Sse.StaticCast<sbyte, ulong>(Ssse3.AlignRight(Sse.StaticCast<ulong, sbyte>(x), Sse.StaticCast<ulong, sbyte>(y), m));

In that case, there's no reason to pass args by ref since it's always inlined and should operate on an already enregistered local.

Thankfully, the clumsier aspects of the X86 Intrinsics API have been tidied to the point that method probably doesn't need to exist anymore 😄

buybackoff (Author) · 2020-06-28T12:32:27Z

@saucecontrol
Just want to thank you for the v2.0 API changes! I will happily drop my copy of this code and use the NuGet package directly. And it's much faster 👍

saucecontrol (Maintainer) · 2020-06-28T22:55:12Z

Hey, thanks for checking back in! I meant to look through your changes before releasing a new version and forgot to. I'm glad the updates covered your use case as well.

I saw you put up a PR. I'm AFK today but will give it a look this week.
