While testing PR #9, I observed a performance curiosity using LB1 with the CUDA-based multi-GPU code: when the same instance is run multiple times, the execution time may vary drastically, and so does the workload per GPU.
For instance:
Workload per GPU: 39.48 20.65 19.68 20.19 -> takes 28.2437s
Workload per GPU: 24.26 24.15 26.18 25.41 -> takes 19.8376s
Workload per GPU: 22.18 21.23 22.93 33.65 -> takes 42.1932s
As far as I know, this does not happen with LB2, nor with LB1 in the Chapel code. One potential cause could be a bottleneck in the CUDA-based version of the WS mechanism.
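As a quick sanity check on the numbers above, one can quantify the imbalance of each run as the largest per-GPU share divided by the ideal share (25% for 4 GPUs). This is just a minimal sketch over the three runs reported here, not part of the codebase:

```python
# Quick check of the imbalance in the three runs reported above.
# Imbalance = max per-GPU share / ideal share (25% for 4 GPUs).
runs = [
    (28.2437, [39.48, 20.65, 19.68, 20.19]),
    (19.8376, [24.26, 24.15, 26.18, 25.41]),
    (42.1932, [22.18, 21.23, 22.93, 33.65]),
]

for elapsed, shares in runs:
    imbalance = max(shares) / (100 / len(shares))
    print(f"{elapsed:8.4f}s  imbalance = {imbalance:.2f}")
```

Interestingly, the slowest run (42.19s) is not the most imbalanced one, which would be consistent with the suspicion that the variability comes from the mechanism that distributes the work rather than from the final workload split alone.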