Idepmotent finish #1
Replies: 13 comments
-
Nice! Thanks for sharing. I plan on adding an AVX2 implementation as well once the API settles down in netcoreapp3.0. Should compare favorably to anything native at that point. |
Beta Was this translation helpful? Give feedback.
-
That would be great! One thing I had to change was ArraySegment usage. Why you think it's preferable not to use Also, a dispatch like this:
is as efficient as #ifdef USE_INTRINSICS, because |
Beta Was this translation helpful? Give feedback.
-
This was just an attempt to keep dependencies to a minimum. I'll probably switch over to
Yeah, it's only because they're not official yet, I wanted to make a generic define to toggle the intrinsics support. It's useful for benchmarking as well. I'll have to change that define in the next version anyway because the X86 Intrinsics API has changed considerably in 3.0. I haven't decided whether to maintain the 2.1-compatible version of the code since it was experimental anyway. |
Beta Was this translation helpful? Give feedback.
-
BTW, some .NET-specific changes Spreads/Spreads@895e446
Together it's c.10% faster incremental caching. The first one gave 3.5%. The last 2 gave the rest and are general for all cases. |
Beta Was this translation helpful? Give feedback.
-
You sure about those numbers?
There's no dereference for struct members accessed from a pointer. Struct members have a known constant offset, so for example
If the callee is inlined, the |
Beta Was this translation helpful? Give feedback.
-
The code is there. Difference is there. With the last 3 commit I got numbers from c.290 to c.330 in incremental benchmark. Anyone could try to reproduce. I will not go into assembly, but (s->h) first calculates the pointer, maybe
Maybe |
Beta Was this translation helpful? Give feedback.
-
That's why I ask. Unless you ran both versions of the code together in the same benchmark run, I'd put any differences down to other variables between the runs.
I actually did look at the assembly as I was writing the code.
This is also something you need not guess about if you look at the assembly. |
Beta Was this translation helpful? Give feedback.
-
That would be interesting to statistically prove 0.15% or 1.5%, but 15% is just visible. In the original code many methods were also not inlined (either no AggressiveInlining attribute or exception thrwing instead of throw helper). Interesting read on a related subject: https://github.com/dotnet/coreclr/issues/21330#issuecomment-443531813 A minor change could make JIT decide not to inline, etc and probably only @AndyAyersMS knows how it works . I'm just happy with the result. And copying just half the struct is definitely visible. |
Beta Was this translation helpful? Give feedback.
-
I made an edit to my previous reply, but it's worth repeating. If you haven't tested with your changes on x86, you may find a regression there. There are fewer GP registers available on x86, and you can introduce additional register pressure by adding an unnecessary local variable. It's been a while since I wrote the code, so I can't remember if that was the case. The issue you referenced regarding inlining only applies to the heuristic-driven inlining, which in this case is overridden with the If you're happy with your changes, then by all means keep them. I'm just saying your separate benchmarking runs (with multiple code changes between) make it difficult to demonstrate the exact improvement. I don't doubt that your partial state struct improved things for your modified incremental hashing implementation, but the other changes are unlikely to have had the impact you claimed. Thanks for sharing regardless. I'm happy to see someone getting use from my code, and I'm always interested to know if I can improve things. |
Beta Was this translation helpful? Give feedback.
-
Inlining can certainly make things slower, so I usually tell people to go easy on |
Beta Was this translation helpful? Give feedback.
-
Totally. I was referring to methods like this, where it would be very bad for perf if it weren't inlined for some reason. This should, in fact, compile to a single instruction. Blake2Fast/src/Blake2Fast/Blake2bSse4.cs Lines 36 to 38 in 4880435 In that case, there's no reason to pass args by ref since it's always inlined and should operate on an already enregistered local. Thankfully, the clumsier aspects of the X86 Intrinsics API have been tidied to the point that method probably doesn't need to exist anymore 😄 |
Beta Was this translation helpful? Give feedback.
-
@saucecontrol |
Beta Was this translation helpful? Give feedback.
-
Hey, thanks for checking back in! I meant to look through your changes before releasing a new version and forgot to. I'm glad the updates covered your use case as well. I saw you put up a PR. I'm AFK today but will give it a look this week. |
Beta Was this translation helpful? Give feedback.
-
Hi, thanks for the blog posts and this implementation!
Just FYI, based on your code I implemented incremental hashing with intermediate hash calculations and optimized somewhat (removed
fixed
usage in favor of Unsafe and pinned input).Spreads/Spreads@4dae1bf
Beta Was this translation helpful? Give feedback.
All reactions