While testing PR #9, I observed a performance curiosity using LB1 with the CUDA-based multi-GPU code: when the same instance is run multiple times, the execution time may vary drastically, and so does the workload per GPU.
For instance:
Workload per GPU: 39.48 20.65 19.68 20.19 -> takes 28.2437s
Workload per GPU: 24.26 24.15 26.18 25.41 -> takes 19.8376s
Workload per GPU: 22.18 21.23 22.93 33.65 -> takes 42.1932s
As far as I know, this does not happen with LB2, nor with LB1 in the Chapel code. One potential cause could be a bottleneck in the CUDA-based version of the WS mechanism.
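As a quick sanity check on the numbers above, one can quantify the imbalance of each run as the largest per-GPU share divided by the ideal share (25% for 4 GPUs). This is just a minimal sketch over the three runs reported here, not part of the codebase:

```python
# Quick check of the imbalance in the three runs reported above.
# Imbalance = max per-GPU share / ideal share (25% for 4 GPUs).
runs = [
    (28.2437, [39.48, 20.65, 19.68, 20.19]),
    (19.8376, [24.26, 24.15, 26.18, 25.41]),
    (42.1932, [22.18, 21.23, 22.93, 33.65]),
]

for elapsed, shares in runs:
    imbalance = max(shares) / (100 / len(shares))
    print(f"{elapsed:8.4f}s  imbalance = {imbalance:.2f}")
```

Interestingly, the slowest run (42.19s) is not the most imbalanced one, which would be consistent with the suspicion that the variability comes from the mechanism that distributes the work rather than from the final workload split alone.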