Hi all, I am running the TensorFlow benchmarks inside the horovod-docker container to evaluate the models in distributed mode. I have installed the Mellanox driver and the GPUDirect RDMA API, and loaded the GPUDirect kernel module on each server. I also checked its status to confirm that GPUDirect RDMA is active, and found that it is not recognized inside the horovod docker; see below:
Outside the docker:
service nv_peer_mem status
Output:
● nv_peer_mem.service - LSB: Activates/Deactivates nv_peer_mem to \ start at boot time.
Loaded: loaded (/etc/init.d/nv_peer_mem; bad; vendor preset: enabled)
Active: active (exited) since Thu 2018-06-07 16:02:45 CDT; 16h ago
Docs: man:systemd-sysv-generator(8)
Process: 303965 ExecStart=/etc/init.d/nv_peer_mem start (code=exited, status=0/SUCCESS)
Tasks: 0
Memory: 0B
CPU: 0
Jun 07 16:02:45 C4140-V100-1 systemd[1]: Starting LSB: Activates/Deactivates nv_peer_mem to \ start at boot time....
Jun 07 16:02:45 C4140-V100-1 nv_peer_mem[303965]: starting... OK
Inside the docker:
service nv_peer_mem status
Output:
nv_peer_mem: unrecognized service
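For reference, this is roughly how I imagine the container would need to be started so that the host's InfiniBand devices are visible inside it (the image name and the exact flag set below are my assumptions, not my verified working command):

# Hypothetical launch sketch: pass the InfiniBand character devices into the
# container and allow memory locking so the verbs/RDMA stack can be used inside it.
nvidia-docker run -it \
    --device=/dev/infiniband \
    --cap-add=IPC_LOCK \
    --net=host \
    horovod-docker:latest /bin/bash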
Also, when I run the benchmarks inside the docker, the scaling efficiency drops from ~90% to ~77%, and the system prints this warning:
host-1-V100:24:203 [0] misc/ibvwrap.cu:61 WARN Failed to open libibverbs.so[.1]
host-1-V100:24:203 [0] INFO Using internal Network Socket
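As far as I can tell, that warning suggests the user-space verbs library is missing (or not on the loader path) inside the image, so NCCL falls back to sockets. A sketch of what I assume would need to be installed in an Ubuntu-based image (package names taken from the stock Ubuntu repositories, not from the horovod Dockerfile):

# Hypothetical: install the user-space verbs/RDMA libraries inside the container
apt-get update && apt-get install -y libibverbs1 libibverbs-dev librdmacm1 ibverbs-utils
# ibv_devinfo should then list the Mellanox HCAs if the devices were passed through
ibv_devinfo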
Can you help me figure out how to fix this? Also, what are the mpirun flags to enable RDMA (InfiniBand) and to make sure the network communication goes over RDMA (InfiniBand) instead of sockets?
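To frame the question, this is roughly the kind of invocation I am experimenting with (the host names, slot counts, HCA name mlx5_0, and the NCCL environment variables are my guesses about what would force the IB transport, not a confirmed recipe):

# Hypothetical sketch: export NCCL's InfiniBand settings through mpirun so that
# NCCL uses RDMA (NET/IB) instead of falling back to the internal socket transport.
mpirun -np 8 -H host-1-V100:4,host-2-V100:4 \
    -bind-to none -map-by slot \
    -x NCCL_DEBUG=INFO \
    -x NCCL_IB_DISABLE=0 \
    -x NCCL_IB_HCA=mlx5_0 \
    -x NCCL_SOCKET_IFNAME=^lo,docker0 \
    -x LD_LIBRARY_PATH -x PATH \
    python tf_cnn_benchmarks.py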