Hi, I was in @jamierpond's talk on SWAR at the C++ on Sea conference and really enjoyed it. Great concept. I haven't worked with AVX, and I'm pretty much a layman when it comes to CPU SIMD. I have implemented some reduction algorithms with CUDA on the GPU, so I know the overall idea, but I'm not sure how memory access and cache utilization differ on the CPU, so I'm not up to speed on the standard practices for SWAR.
On the way back, I thought a bit more about SWAR, and a few questions came to mind. You may have answered these before, and there are some forum posts on the topic, but I'd enjoy hearing your opinion.
What is the advantage of packing primitives over a std::bitset? It feels like most bitwise mask operations would still be supported, so a lot of the clever algorithms could be used, but with potentially less boilerplate from a user's standpoint. Also, is there a performance difference that you've found?
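To make the question concrete, here's a minimal sketch of the kind of lane-wise arithmetic I understand packed primitives give you and std::bitset doesn't (the swar_add_u8 helper and the masks are just my own illustration, not anything from the library):

```cpp
// Lane-wise add of 8-bit lanes packed into a uint64_t -- the classic SWAR trick.
// std::bitset offers &, |, ^, ~ and shifts, but no carry propagation, so this
// kind of per-lane arithmetic is (I assume) what packed primitives buy you.
#include <bitset>
#include <cstdint>
#include <iostream>

constexpr uint64_t kHigh = 0x8080808080808080ull; // top bit of every 8-bit lane
constexpr uint64_t kLow  = 0x7F7F7F7F7F7F7F7Full; // lower 7 bits of every lane

// Add corresponding 8-bit lanes without carries bleeding across lane borders.
uint64_t swar_add_u8(uint64_t a, uint64_t b) {
    uint64_t low = (a & kLow) + (b & kLow); // add the low 7 bits of each lane
    return low ^ ((a ^ b) & kHigh);         // fix up the top bit of each lane
}

int main() {
    uint64_t a = 0x0102030405060708ull;
    uint64_t b = 0x1010101010101010ull;
    std::cout << std::hex << swar_add_u8(a, b) << '\n'; // 1112131415161718

    // With std::bitset the same bitwise ops exist, but lane-wise + does not:
    std::bitset<64> ba{a}, bb{b};
    std::cout << (ba & bb).to_ullong() << '\n'; // bitwise AND is fine
    // ba + bb;  // no operator+, so the carry trick above has no equivalent
}
```

So my guess is that std::bitset covers the pure mask/shift side, and the packed-primitive view is what makes per-lane arithmetic possible, but I'd like to hear whether that matches your experience.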
There was a question on handling large data with multiple primitives. How do you plan on handling lanes for non-power-of-two lane sizes? Say a lane size of 5 bits within a 64-bit integer, and the user wants to operate on 50 values. In GPGPU reductions the leftover is generally padded with zeros, i.e. in this case each primitive would use 60 bits for 12 lanes and leave 4 bits unused. Do you plan on using the same pattern, or something else? A rough sketch of the layout I mean is below.
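Just so the padding pattern I'm describing is clear, here's a minimal sketch (the pack helper and the constants are hypothetical, not from the library):

```cpp
// Zero-padding layout: 5-bit lanes in a 64-bit word -> 12 lanes per word,
// 4 bits unused at the top, and the tail of the last word padded with zeros.
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <vector>

constexpr std::size_t kLaneBits     = 5;
constexpr std::size_t kWordBits     = 64;
constexpr std::size_t kLanesPerWord = kWordBits / kLaneBits; // 12
constexpr std::size_t kUnusedBits   = kWordBits % kLaneBits; // 4

// Pack `values` (each assumed to fit in kLaneBits) into 64-bit words,
// leaving the lanes with no corresponding input value as zero.
std::vector<uint64_t> pack(const std::vector<uint8_t>& values) {
    std::size_t words = (values.size() + kLanesPerWord - 1) / kLanesPerWord;
    std::vector<uint64_t> out(words, 0); // zero init == zero padding
    for (std::size_t i = 0; i < values.size(); ++i) {
        out[i / kLanesPerWord] |=
            uint64_t{values[i]} << (kLaneBits * (i % kLanesPerWord));
    }
    return out;
}

int main() {
    std::vector<uint8_t> values(50, 1); // 50 inputs -> 5 words
    auto packed = pack(values);
    std::cout << packed.size() << " words, last word holds "
              << values.size() % kLanesPerWord << " real lanes\n";
}
```

The appeal of zero padding on the GPU is that the padded lanes act as an identity for reductions like add or unsigned max, so the tail needs no special casing; I'm curious whether the same reasoning carries over here or whether the unused high bits get in the way.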
Thanks a lot for the talk. Will definitely give SWAR a go. Can you recommend any other talks that go over the concept? Eduardo's talk from CppCon 2019 is already on my list. :)