/usr/local/lib64/python3.6/site-packages/dgl/base.py:25: UserWarning: multigraph will be deprecated.DGL will treat all graphs as multigraph in the future.
warnings.warn(msg, warn_type)
Traceback (most recent call last):
  File "/usr/local/bin/dglke_server", line 11, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.6/site-packages/dglke/kvserver.py", line 232, in main
    start_server(args)
  File "/usr/local/lib/python3.6/site-packages/dglke/kvserver.py", line 227, in start_server
    my_server.start()
  File "/usr/local/lib64/python3.6/site-packages/dgl/contrib/dis_kvstore.py", line 509, in start
    _sender_connect(self._sender)
  File "/usr/local/lib64/python3.6/site-packages/dgl/network.py", line 98, in _sender_connect
    _CAPI_DGLSenderConnect(sender)
  File "/usr/local/lib64/python3.6/site-packages/dgl/_ffi/_ctypes/function.py", line 190, in __call__
    ctypes.byref(ret_val), ctypes.byref(ret_tcode)))
  File "/usr/local/lib64/python3.6/site-packages/dgl/_ffi/base.py", line 62, in check_call
    raise DGLError(py_str(_LIB.DGLGetLastError()))
dgl._ffi.base.DGLError: Resource temporarily unavailable
  File "/usr/local/lib/python3.6/site-packages/dglke/models/pytorch/tensor_models.py", line 77, in decorated_function
    raise exception.__class__(trace)
dgl._ffi.base.DGLError: Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/dglke/models/pytorch/tensor_models.py", line 65, in _queue_result
    res = func(*args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/dglke/train_pytorch.py", line 1492, in dist_train_test
    client = connect_to_kvstore(args, entity_pb, relation_pb, l2g)
  File "/usr/local/lib/python3.6/site-packages/dglke/train_pytorch.py", line 1111, in connect_to_kvstore
    my_client.connect()
  File "/usr/local/lib64/python3.6/site-packages/dgl/contrib/dis_kvstore.py", line 953, in connect
    _receiver_wait(self._receiver, client_ip, int(client_port), self._server_count)
  File "/usr/local/lib64/python3.6/site-packages/dgl/network.py", line 116, in _receiver_wait
    _CAPI_DGLReceiverWait(receiver, ip_addr, int(port), int(num_sender))
  File "/usr/local/lib64/python3.6/site-packages/dgl/_ffi/_ctypes/function.py", line 190, in __call__
    ctypes.byref(ret_val), ctypes.byref(ret_tcode)))
  File "/usr/local/lib64/python3.6/site-packages/dgl/_ffi/base.py", line 62, in check_call
    raise DGLError(py_str(_LIB.DGLGetLastError()))
dgl._ffi.base.DGLError: Resource temporarily unavailable
terminate called after throwing an instance of 'dmlc::Error'
what(): [11:13:56] /opt/dgl/src/graph/network/socket_communicator.cc:144: Check failed: tmp != -1 (-1 vs. -1) :
Stack trace:
[bt] (0) /usr/local/lib64/python3.6/site-packages/dgl/libdgl.so(dgl::network::SocketSender::SendLoop(dgl::network::TCPSocket*, dgl::network::MessageQueue*)+0x7a6) [0x7f3002adce16]
[bt] (1) /lib64/libstdc++.so.6(+0xb5070) [0x7f305d5ee070]
[bt] (2) /lib64/libpthread.so.0(+0x7dd5) [0x7f306f1f6dd5]
[bt] (3) /lib64/libc.so.6(clone+0x6d) [0x7f306e816ead]
When I tried the following command, I found that the number of servers and clients were different on each machine:
I am running the following command on a cluster of 4 machines.
I got the following errors:
Experimental configuration:
When I change "--num_client_proc 40" to "--num_client_proc 8" or less, it works fine.
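For anyone triaging this: the "Resource temporarily unavailable" text in the tracebacks looks like the C library's standard message for errno EAGAIN, which a socket call can return when it cannot complete immediately. Below is a minimal sketch, assuming a Linux host, for checking that mapping locally. The helper names `eagain_message` and `mentions_eagain` are hypothetical and are not part of DGL or DGL-KE. It only identifies which errno the text corresponds to; it does not explain why the failure appears with 40 client processes and not with 8.

```python
import errno
import os


def eagain_message() -> str:
    """Return the C library's text for errno EAGAIN on this platform."""
    return os.strerror(errno.EAGAIN)


def mentions_eagain(log_line: str) -> bool:
    """Return True if a log line ends with the EAGAIN message text."""
    return log_line.rstrip().endswith(eagain_message())


if __name__ == "__main__":
    # Print the numeric errno and its message so they can be compared
    # with the last line of the traceback.
    print(errno.EAGAIN, eagain_message())
```

If the printed message matches the last line of the traceback, the failure is probably surfacing from a socket call inside libdgl rather than from Python code in dglke.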