Hi all, I am running the TensorFlow benchmarks inside the horovod-docker container to evaluate the models in distributed mode. I have installed the Mellanox driver and the GPUDirect RDMA API, and loaded the GPUDirect kernel module on each server. I also checked its status to confirm that GPUDirect RDMA is active, and found that it is not recognized inside the horovod docker; see below:
Outside the docker:
service nv_peer_mem status
Output:
● nv_peer_mem.service - LSB: Activates/Deactivates nv_peer_mem to \ start at boot time.
Loaded: loaded (/etc/init.d/nv_peer_mem; bad; vendor preset: enabled)
Active: active (exited) since Thu 2018-06-07 16:02:45 CDT; 16h ago
Docs: man:systemd-sysv-generator(8)
Process: 303965 ExecStart=/etc/init.d/nv_peer_mem start (code=exited, status=0/SUCCESS)
Tasks: 0
Memory: 0B
CPU: 0
Jun 07 16:02:45 C4140-V100-1 systemd[1]: Starting LSB: Activates/Deactivates nv_peer_mem to \ start at boot time....
Jun 07 16:02:45 C4140-V100-1 nv_peer_mem[303965]: starting... OK
Inside the docker:
service nv_peer_mem status
Output:
nv_peer_mem: unrecognized service
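For reference, this is roughly how I imagine the container would need to be started so that the host's InfiniBand devices are visible inside it (the image name and the exact flag set below are my assumptions, not my verified working command):

# Hypothetical launch sketch: pass the InfiniBand character devices into the
# container and allow memory locking so the verbs/RDMA stack can be used inside it.
nvidia-docker run -it \
    --device=/dev/infiniband \
    --cap-add=IPC_LOCK \
    --net=host \
    horovod-docker:latest /bin/bash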
Also, when I run the benchmarks inside the docker, the scaling efficiency drops from ~90% to ~77%, and the system prints this warning:
host-1-V100:24:203 [0] misc/ibvwrap.cu:61 WARN Failed to open libibverbs.so[.1]
host-1-V100:24:203 [0] INFO Using internal Network Socket
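As far as I can tell, that warning suggests the user-space verbs library is missing (or not on the loader path) inside the image, so NCCL falls back to sockets. A sketch of what I assume would need to be installed in an Ubuntu-based image (package names taken from the stock Ubuntu repositories, not from the horovod Dockerfile):

# Hypothetical: install the user-space verbs/RDMA libraries inside the container
apt-get update && apt-get install -y libibverbs1 libibverbs-dev librdmacm1 ibverbs-utils
# ibv_devinfo should then list the Mellanox HCAs if the devices were passed through
ibv_devinfo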
Can you help me figure out how to fix this? Also, what are the mpirun flags to enable RDMA (InfiniBand) and to make sure the network communication goes over RDMA (InfiniBand) instead of sockets?
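To frame the question, this is roughly the kind of invocation I am experimenting with (the host names, slot counts, HCA name mlx5_0, and the NCCL environment variables are my guesses about what would force the IB transport, not a confirmed recipe):

# Hypothetical sketch: export NCCL's InfiniBand settings through mpirun so that
# NCCL uses RDMA (NET/IB) instead of falling back to the internal socket transport.
mpirun -np 8 -H host-1-V100:4,host-2-V100:4 \
    -bind-to none -map-by slot \
    -x NCCL_DEBUG=INFO \
    -x NCCL_IB_DISABLE=0 \
    -x NCCL_IB_HCA=mlx5_0 \
    -x NCCL_SOCKET_IFNAME=^lo,docker0 \
    -x LD_LIBRARY_PATH -x PATH \
    python tf_cnn_benchmarks.py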