ananse network Out Of Memory (OOM) error and killed #96
Comments
Hi Alberto, thanks for the issue and the interest! Unfortunately ANANSE is extremely memory hungry 😞, mainly due to our naive design, but also because of the way Python handles threads/processes. Python code is (often) parallelized by making a complete copy of what is in memory for each process and running those copies; no memory is shared. This is exactly how it is implemented for ANANSE, which means that the more cores you use, the higher the memory usage. I expect that using fewer cores (e.g. 4) will give a more acceptable memory usage than what it is currently requesting for you. Let us know whether that helps!
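To make that concrete, here is a minimal, generic Python sketch (not ANANSE's actual code): data handed to process-based workers is pickled and rebuilt inside each worker, so memory grows roughly linearly with the number of processes.

# Generic illustration, not ANANSE code: process-based parallelism in
# Python copies data into every worker instead of sharing it.
import numpy as np
from multiprocessing import Pool

def total(arr):
    # 'arr' arrives here as a pickled copy, so each worker holds its
    # own ~80 MB array while its task runs.
    return arr.sum()

if __name__ == "__main__":
    big = np.ones(10_000_000)  # ~80 MB in the parent process
    with Pool(processes=4) as pool:
        # Four tasks, four copies of 'big': memory use grows roughly
        # linearly with the number of worker processes.
        print(pool.map(total, [big] * 4))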
Hi Maarten, thanks a lot for the suggestion! I'll re-run it using 4 threads and let you know how it goes. All the best,
@apposada if the solution from @Maarten-vd-Sande doesn't help, there is a fix that you can try. You'll have to run the following command in your environment to install the develop version of ANANSE with this fix:
We haven't thoroughly tested this yet, but it should keep memory usage within ~12-15GB. The
@apposada scratch that, there is an occasional error that pops up in that version that seems to affect the results as well. Still have to look into that more deeply.
Hi Simon, Maarten, I'm afraid to say that Maarten's suggestion did not work either. I ran it using 4 cores but still there was a memory overload. Again, when looking at dmesg:
Thanks and all the best,
Hi @apposada
This version has an option to control memory usage (at the expense of speed). Each additional core uses ~12GB of memory. Running it with one core will likely take about 1 hour; with 4 cores it will take ~48GB of memory but run in ~15 minutes. The strange thing is that 8GB of that 12GB per core is not ANANSE-related. It is related to the
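The cut-off sentence above most likely refers to what the patched __init__.py quoted later in this thread works around: on a many-core machine, simply importing numpy can reserve several gigabytes of (virtual) memory per process through its BLAS/OpenMP thread pools. A minimal sketch of applying that workaround manually; the variables must be set before numpy (and therefore ANANSE) is imported. Setting the same variables on the command line, as in the OMP_NUM_THREADS=1 prefix used further down in the thread, has the same effect.

# Sketch of the workaround that later appears in ananse/__init__.py:
# capping the BLAS/OpenMP thread pools keeps each process's baseline
# memory low. This must run before numpy is imported.
import os

for var in (
    "OMP_NUM_THREADS",
    "OPENBLAS_NUM_THREADS",
    "MKL_NUM_THREADS",
    "VECLIB_MAXIMUM_THREADS",
    "NUMEXPR_NUM_THREADS",
):
    os.environ[var] = "1"

import numpy as np  # numpy now starts with single-threaded math libraries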
Hi @simonvh, I was actually waiting for the fix, as I have the same memory issue. Using ananse network I got to 65% and then it was killed. In the high-memory cloud, it is killed at 25%. Thank you!!!
ananse network -b fibroblast.binding/binding.tsv -e ANANSE_example_data/RNAseq/fibroblast*TPM.txt -n 8 -o fibroblast.network.txt
2021-06-25 18:49:44 | INFO | Loading expression
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
Hi Simon, I tried updating using pip, as you suggested, and later I was asked to install dask distributed (which I did using:
After installing and re-trying with both 1 and 4 cores, I got the following. It seems unrelated to the error by @cdsoria, and potentially has nothing to do with the OOM issue, but I am still reporting it here. No luck so far, it seems... Thanks
Thanks so much @cdsoria and @apposada for providing this input and these bug reports. I'm sorry it's such a hassle for you, but I'm thankful you provide the feedback, which allows us to hopefully fix these issues.
@cdsoria can you try with a lower value for -n?
@apposada I suspect this has something to do with the format of the input files. Would it be possible for you to provide the output of
Hello @simonvh. No problem at all, happy to help. I tried with -n 2, -n 1 and no -n, but I get the same error. It seems to be something about wanting to connect. Apologies, I don't really understand the error well.
During handling of the above exception, another exception occurred:
@cdsoria The error is completely unclear to me, but it seems like the jobs are cancelled somehow, after which this error is thrown. To check if this is related to available resources, can you check the following? This will download a small test data set, and run
Thanks @simonvh, so it goes a bit further but stops at "Computing network":
2021-06-28 16:27:05 | INFO | Loading expression
@simonvh Just to say that, when I go back to my previous version of ANANSE ("pip install git+https://github.com/vanheeringen-lab/ANANSE.git@9de0982"), your test data runs ok. I am re-running with the full data at the moment just to make sure that the memory runs out as before. Yep, with the old version it runs out at 66% Completed.
Okay, @cdsoria, another try. The high memory usage was related to a really obscure issue with another library. I've fixed it, and as a result the memory usage (at least on our server) has decreased significantly. Can you run the following command again and check if it works after that?
@simonvh Thank you so much again. So this time it goes further, to 99%, but then it hangs there as it stops and restarts workers. I tested this on my local computer (error attached) but also on an AWS EC2 r5d.24xlarge instance, with a similar error once it reaches 99%. Also tested with -n 2. It gives this error to start with, but it actually starts:
Traceback (most recent call last):
Huh, that worker memory is too high, it should not reach more than 4-5GB. Can you double-check something for me, and post the output of this:
You can also try running the command like this
Hi @simonvh, this is the output:
Hi again @simonvh. Still the same problem persists, unfortunately:
OMP_NUM_THREADS=1 ananse network -b fibroblast.binding/binding.tsv -e ANANSE_example_data/RNAseq/fibroblast*TPM.txt -n 4 -o fibroblast.network.txt
2021-06-30 10:28:58 | INFO | Loading expression
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
So you also get this on an Amazon EC2 instance, right? That should make it possible to see if we can reproduce and thereby test this. Do you have the exact steps that you used to create your environment?
@simonvh good news!!! Sorry my bad, I should have also tried in the EC2 instances. So, yay! it worked! I am posting for ref. Also it was very fast. Thank you so much!!!!
Great! Just out of curiosity, what is the total memory size of the computer on which it failed?
Sure, this is my hardware overview:
Thanks. Mac is indeed a platform that we don't test. If you have some time, can you check if the following works on your Mac?
Of course, happy to try. With those parameters it actually does worse: it gets to 8% and then starts throwing the errors.
@simonvh not sure why, but the fix you made
The changes have been merged into the
(This link will be stable btw)
Great, thank you!!!!
Based on this issue, and on other reports of memory issues, we completely overhauled
Note: this is incompatible with the output from the current version of
Dear @simonvh, I also started having memory issues (almost all the RAM being used on the server) with the current version of ananse. I saw this thread and installed the changes that should fix the need for so much RAM by executing:
pip install git+https://github.com/vanheeringen-lab/ANANSE.git@3ec07af
Unfortunately this did not fix it for me; I got errors regarding workers. Is there any other way to fix this? This is the content of my __init__.py file after the pip install:
(base) julian@cn45:~$ cat /vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/ananse/__init__.py
from ._version import get_versions
import os
# This is here to prevent very high memory usage on numpy import.
# On a machine with many cores, just importing numpy can result in up to
# 8GB of (virtual) memory. This wreaks havoc on management of the dask
# workers.
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["VECLIB_MAXIMUM_THREADS"] = "1"
os.environ["NUMEXPR_NUM_THREADS"] = "1"
__version__ = get_versions()["version"]
del get_versions
It seems that with a low number of cores it does not begin the network calculation at all, while with a high number of cores it calculates only up to a certain point (see below). Command with two cores (which was manually interrupted, because the last line ran for over 8 minutes without updates):
(ananse) julian@cn45:/ceph/rimlsfnwi/data/moldevbio/zhou/jarts/jupyter_notebooks$ nice -15 OMP_NUM_THREADS=1 ananse network -e /ceph/rimlsfnwi/data/moldevbio/zhou/jarts/jupyter_notebooks/CjStpm.tsv -b /ceph/rimlsfnwi/data/moldevbio/zhou/jarts/data/lako2021/ANANSE/outs2/CjS/binding.tsv -a /ceph/rimlsfnwi/data/moldevbio/zhou/jarts/data/genome/hg38/hg38.annotation.bed -o /ceph/rimlsfnwi/data/moldevbio/zhou/jarts/data/lako2021/ANANSE/outs2/CjS/full_network_includeprom.txt -g /ceph/rimlsfnwi/data/moldevbio/zhou/jarts/data/genome/hg38/hg38.fa -n 2
2021-07-15 11:22:18 | INFO | Loading expression
2021-07-15 11:22:18 | INFO | Aggregate binding
2021-07-15 11:22:18 | INFO | reading enhancers
2021-07-15 11:22:50 | INFO | Reading binding file...
2021-07-15 11:23:10 | INFO | Grouping by tf and target gene...
2021-07-15 11:23:10 | INFO | Done grouping...
2021-07-15 11:23:11 | INFO | Reading factor activity
2021-07-15 11:23:11 | INFO | Computing network
distributed.worker - WARNING - Worker is at 85% memory usage. Pausing worker. Process memory: 9.54 GiB -- Worker memory limit: 11.18 GiB
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:41309
Traceback (most recent call last):
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/comm/tcp.py", line 200, in read
n = await stream.read_into(frames)
tornado.iostream.StreamClosedError: Stream is closed
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/worker.py", line 2335, in gather_dep
response = await get_data_from_worker(
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/worker.py", line 3754, in get_data_from_worker
return await retry_operation(_get_data, operation="get_data_from_worker")
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/utils_comm.py", line 385, in retry_operation
return await retry(
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/utils_comm.py", line 370, in retry
return await coro()
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/worker.py", line 3734, in _get_data
response = await send_recv(
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/core.py", line 647, in send_recv
response = await comm.read(deserializers=deserializers)
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/comm/tcp.py", line 206, in read
convert_stream_closed_error(self, e)
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/comm/tcp.py", line 128, in convert_stream_closed_error
raise CommClosedError("in %s: %s" % (obj, exc)) from exc
distributed.comm.core.CommClosedError: in <closed TCP>: Stream is closed
distributed.worker - ERROR - failed during get data with tcp://127.0.0.1:41387 -> tcp://127.0.0.1:41309
Traceback (most recent call last):
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/tornado/iostream.py", line 867, in _read_to_buffer
bytes_read = self.read_from_fd(buf)
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/tornado/iostream.py", line 1140, in read_from_fd
return self.socket.recv_into(buf, len(buf))
ConnectionResetError: [Errno 104] Connection reset by peer
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/worker.py", line 1431, in get_data
response = await comm.read(deserializers=serializers)
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/comm/tcp.py", line 206, in read
convert_stream_closed_error(self, e)
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/comm/tcp.py", line 124, in convert_stream_closed_error
raise CommClosedError(
distributed.comm.core.CommClosedError: in <closed TCP>: ConnectionResetError: [Errno 104] Connection reset by peer
distributed.nanny - WARNING - Restarting worker
distributed.worker - ERROR - Handle missing dep failed, retrying
Traceback (most recent call last):
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/worker.py", line 2491, in handle_missing_dep
for dep in deps:
RuntimeError: Set changed size during iteration
distributed.nanny - WARNING - Worker process still alive after 3 seconds, killing
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker process still alive after 3 seconds, killing
^C
This is my console command (with 6 cores):
(ananse) julian@cn45:/ceph/rimlsfnwi/data/moldevbio/zhou/jarts/jupyter_notebooks$ nice -15 ananse network -e /ceph/rimlsfnwi/data/moldevbio/zhou/jarts/jupyter_notebooks/CjStpm.tsv -b /ceph/rimlsfnwi/data/moldevbio/zhou/jarts/data/lako2021/ANANSE/outs2/CjS/binding.tsv -a /ceph/rimlsfnwi/data/moldevbio/zhou/jarts/data/genome/hg38/hg38.annotation.bed -o /ceph/rimlsfnwi/data/moldevbio/zhou/jarts/data/lako2021/ANANSE/outs2/CjS/full_network_includeprom.txt -g /ceph/rimlsfnwi/data/moldevbio/zhou/jarts/data/genome/hg38/hg38.fa -n 6
2021-07-15 11:32:52 | INFO | Loading expression
2021-07-15 11:32:52 | INFO | Aggregate binding
2021-07-15 11:32:52 | INFO | reading enhancers
2021-07-15 11:33:25 | INFO | Reading binding file...
2021-07-15 11:33:50 | INFO | Grouping by tf and target gene...
2021-07-15 11:33:50 | INFO | Done grouping...
2021-07-15 11:33:51 | INFO | Reading factor activity
2021-07-15 11:33:51 | INFO | Computing network
[### ] | 9% Completed | 20.2sdistributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:42231
Traceback (most recent call last):
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/comm/tcp.py", line 196, in read
frames_nbytes = await stream.read_bytes(fmt_size)
asyncio.exceptions.CancelledError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/asyncio/tasks.py", line 492, in wait_for
fut.result()
asyncio.exceptions.CancelledError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/comm/core.py", line 320, in connect
handshake = await asyncio.wait_for(comm.read(), time_left())
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/asyncio/tasks.py", line 494, in wait_for
raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/worker.py", line 2335, in gather_dep
response = await get_data_from_worker(
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/worker.py", line 3754, in get_data_from_worker
return await retry_operation(_get_data, operation="get_data_from_worker")
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/utils_comm.py", line 385, in retry_operation
return await retry(
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/utils_comm.py", line 370, in retry
return await coro()
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/worker.py", line 3731, in _get_data
comm = await rpc.connect(worker)
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/core.py", line 1012, in connect
comm = await connect(
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/comm/core.py", line 325, in connect
raise IOError(
OSError: Timed out during handshake while connecting to tcp://127.0.0.1:42231 after 10 s
[###### ] | 16% Completed | 41.7sdistributed.worker - WARNING - Worker is at 80% memory usage. Pausing worker. Process memory: 9.01 GiB -- Worker memory limit: 11.18 GiB
[###### ] | 17% Completed | 1min 1.2sdistributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
[####### ] | 17% Completed | 1min 2.1sdistributed.nanny - WARNING - Restarting worker
[####### ] | 17% Completed | 1min 6.1sdistributed.worker - WARNING - Worker is at 37% memory usage. Resuming worker. Process memory: 4.16 GiB -- Worker memory limit: 11.18 GiB
[####### ] | 17% Completed | 1min 6.3sdistributed.worker - ERROR - failed during get data with tcp://127.0.0.1:41013 -> tcp://127.0.0.1:38655
Traceback (most recent call last):
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/tornado/iostream.py", line 971, in _handle_write
num_bytes = self.write_to_fd(self._write_buffer.peek(size))
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/tornado/iostream.py", line 1148, in write_to_fd
return self.socket.send(data) # type: ignore
BrokenPipeError: [Errno 32] Broken pipe
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/worker.py", line 1431, in get_data
response = await comm.read(deserializers=serializers)
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/comm/tcp.py", line 206, in read
convert_stream_closed_error(self, e)
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/comm/tcp.py", line 124, in convert_stream_closed_error
raise CommClosedError(
distributed.comm.core.CommClosedError: in <closed TCP>: BrokenPipeError: [Errno 32] Broken pipe
[####### ] | 17% Completed | 1min 6.4sdistributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:38655
Traceback (most recent call last):
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/comm/tcp.py", line 200, in read
n = await stream.read_into(frames)
tornado.iostream.StreamClosedError: Stream is closed
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/worker.py", line 2335, in gather_dep
response = await get_data_from_worker(
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/worker.py", line 3754, in get_data_from_worker
return await retry_operation(_get_data, operation="get_data_from_worker")
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/utils_comm.py", line 385, in retry_operation
return await retry(
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/utils_comm.py", line 370, in retry
return await coro()
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/worker.py", line 3734, in _get_data
response = await send_recv(
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/core.py", line 647, in send_recv
response = await comm.read(deserializers=deserializers)
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/comm/tcp.py", line 206, in read
convert_stream_closed_error(self, e)
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/comm/tcp.py", line 128, in convert_stream_closed_error
raise CommClosedError("in %s: %s" % (obj, exc)) from exc
distributed.comm.core.CommClosedError: in <closed TCP>: Stream is closed
[####### ] | 18% Completed | 1min 18.8sdistributed.worker - ERROR - failed during get data with tcp://127.0.0.1:34621 -> tcp://127.0.0.1:38655
Traceback (most recent call last):
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/tornado/iostream.py", line 867, in _read_to_buffer
bytes_read = self.read_from_fd(buf)
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/tornado/iostream.py", line 1140, in read_from_fd
return self.socket.recv_into(buf, len(buf))
ConnectionResetError: [Errno 104] Connection reset by peer
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/worker.py", line 1431, in get_data
response = await comm.read(deserializers=serializers)
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/comm/tcp.py", line 206, in read
convert_stream_closed_error(self, e)
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/comm/tcp.py", line 124, in convert_stream_closed_error
raise CommClosedError(
distributed.comm.core.CommClosedError: in <closed TCP>: ConnectionResetError: [Errno 104] Connection reset by peer
distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:38655
Traceback (most recent call last):
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/comm/tcp.py", line 200, in read
n = await stream.read_into(frames)
tornado.iostream.StreamClosedError: Stream is closed
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/worker.py", line 2335, in gather_dep
response = await get_data_from_worker(
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/worker.py", line 3754, in get_data_from_worker
return await retry_operation(_get_data, operation="get_data_from_worker")
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/utils_comm.py", line 385, in retry_operation
return await retry(
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/utils_comm.py", line 370, in retry
return await coro()
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/worker.py", line 3734, in _get_data
response = await send_recv(
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/core.py", line 647, in send_recv
response = await comm.read(deserializers=deserializers)
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/comm/tcp.py", line 206, in read
convert_stream_closed_error(self, e)
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/comm/tcp.py", line 128, in convert_stream_closed_error
raise CommClosedError("in %s: %s" % (obj, exc)) from exc
distributed.comm.core.CommClosedError: in <closed TCP>: Stream is closed
[####### ] | 18% Completed | 1min 21.1sdistributed.nanny - WARNING - Worker process still alive after 3 seconds, killing
[####### ] | 19% Completed | 1min 37.9sdistributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
[####### ] | 19% Completed | 1min 39.1sdistributed.nanny - WARNING - Restarting worker
[####### ] | 19% Completed | 1min 47.4sdistributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
[####### ] | 19% Completed | 1min 48.4sdistributed.nanny - WARNING - Restarting worker
[####### ] | 19% Completed | 1min 56.6sdistributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
[####### ] | 19% Completed | 1min 57.0sdistributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:43089
Traceback (most recent call last):
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/comm/tcp.py", line 200, in read
n = await stream.read_into(frames)
tornado.iostream.StreamClosedError: Stream is closed
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/worker.py", line 2335, in gather_dep
response = await get_data_from_worker(
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/worker.py", line 3754, in get_data_from_worker
return await retry_operation(_get_data, operation="get_data_from_worker")
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/utils_comm.py", line 385, in retry_operation
return await retry(
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/utils_comm.py", line 370, in retry
return await coro()
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/worker.py", line 3734, in _get_data
response = await send_recv(
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/core.py", line 647, in send_recv
response = await comm.read(deserializers=deserializers)
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/comm/tcp.py", line 206, in read
convert_stream_closed_error(self, e)
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/comm/tcp.py", line 128, in convert_stream_closed_error
raise CommClosedError("in %s: %s" % (obj, exc)) from exc
distributed.comm.core.CommClosedError: in <closed TCP>: Stream is closed
distributed.nanny - WARNING - Worker process still alive after 3 seconds, killing
[####### ] | 19% Completed | 1min 58.2sdistributed.nanny - WARNING - Restarting worker
[####### ] | 19% Completed | 2min 0.3sdistributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Restarting worker% Completed | 2min 1.6s
Traceback (most recent call last):
File "/vol/mbconda/julian/envs/ananse/bin/ananse", line 326, in <module>
args.func(args)
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/ananse/commands/network.py", line 41, in network
b.run_network(
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/ananse/network.py", line 616, in run_network
result = result.compute()
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/dask/base.py", line 285, in compute
(result,) = compute(self, traverse=False, **kwargs)
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/dask/base.py", line 567, in compute
results = schedule(dsk, keys, **kwargs)
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/client.py", line 2705, in get
results = self.gather(packed, asynchronous=asynchronous, direct=direct)
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/client.py", line 2014, in gather
return self.sync(
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/client.py", line 855, in sync
return sync(
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/utils.py", line 338, in sync
raise exc.with_traceback(tb)
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/utils.py", line 321, in f
result[0] = yield future
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/tornado/gen.py", line 762, in run
value = future.result()
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/client.py", line 1879, in _gather
raise exception.with_traceback(traceback)
distributed.scheduler.KilledWorker: ("('merge-92b1078b3a42900fd782ac0b2b147609', 62)", <WorkerState 'tcp://127.0.0.1:34569', name: 4, memory: 0, processing: 75>)
distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:34569
Traceback (most recent call last):
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/tornado/iostream.py", line 867, in _read_to_buffer
bytes_read = self.read_from_fd(buf)
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/tornado/iostream.py", line 1140, in read_from_fd
return self.socket.recv_into(buf, len(buf))
ConnectionResetError: [Errno 104] Connection reset by peer
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/worker.py", line 2335, in gather_dep
response = await get_data_from_worker(
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/worker.py", line 3754, in get_data_from_worker
return await retry_operation(_get_data, operation="get_data_from_worker")
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/utils_comm.py", line 385, in retry_operation
return await retry(
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/utils_comm.py", line 370, in retry
return await coro()
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/worker.py", line 3734, in _get_data
response = await send_recv(
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/core.py", line 647, in send_recv
response = await comm.read(deserializers=deserializers)
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/comm/tcp.py", line 206, in read
convert_stream_closed_error(self, e)
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/comm/tcp.py", line 124, in convert_stream_closed_error
raise CommClosedError(
distributed.comm.core.CommClosedError: in <closed TCP>: ConnectionResetError: [Errno 104] Connection reset by peer
distributed.nanny - WARNING - Worker process still alive after 3 seconds, killing
distributed.nanny - WARNING - Worker process still alive after 3 seconds, killing
distributed.nanny - WARNING - Worker process still alive after 3 seconds, killing
distributed.core - ERROR - Exception while handling op register-client
Traceback (most recent call last):
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/core.py", line 498, in handle_comm
result = await result
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/scheduler.py", line 5002, in add_client
self.remove_client(client=client)
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/scheduler.py", line 5029, in remove_client
self.client_releases_keys(
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/scheduler.py", line 4769, in client_releases_keys
self.transitions(recommendations)
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/scheduler.py", line 6683, in transitions
self.send_all(client_msgs, worker_msgs)
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/scheduler.py", line 5265, in send_all
w = stream_comms[worker]
KeyError: None
tornado.application - ERROR - Exception in callback functools.partial(<function TCPServer._handle_connection.<locals>.<lambda> at 0x1507069f4e50>, <Task finished name='Task-56' coro=<BaseTCPListener._handle_stream() done, defined at /vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/comm/tcp.py:476> exception=KeyError(None)>)
Traceback (most recent call last):
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/tornado/ioloop.py", line 741, in _run_callback
ret = callback()
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/tornado/tcpserver.py", line 331, in <lambda>
gen.convert_yielded(future), lambda f: f.result()
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/comm/tcp.py", line 493, in _handle_stream
await self.comm_handler(comm)
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/core.py", line 498, in handle_comm
result = await result
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/scheduler.py", line 5002, in add_client
self.remove_client(client=client)
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/scheduler.py", line 5029, in remove_client
self.client_releases_keys(
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/scheduler.py", line 4769, in client_releases_keys
self.transitions(recommendations)
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/scheduler.py", line 6683, in transitions
self.send_all(client_msgs, worker_msgs)
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/scheduler.py", line 5265, in send_all
w = stream_comms[worker]
KeyError: None
Exception in thread AsyncProcess Dask Worker process (from Nanny) watch process join:
Traceback (most recent call last):
File "/vol/mbconda/julian/envs/ananse/lib/python3.9/threading.py", line 954, in _bootstrap_inner
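For readers puzzling over the recurring distributed.nanny warnings in the logs above: dask.distributed gives every worker process a memory budget (by default the machine's memory split across the workers) and restarts a worker once it exceeds roughly 95% of that budget, which is exactly the "Worker exceeded 95% memory budget. Restarting" message. A minimal, illustrative local-cluster setup follows; it is not ANANSE's internal code and the numbers are placeholders.

# Illustrative only, not ANANSE's internal setup. A dask.distributed
# local cluster gives each worker process its own memory budget; the
# nanny restarts any worker that exceeds ~95% of its limit.
from dask.distributed import Client, LocalCluster

if __name__ == "__main__":
    cluster = LocalCluster(
        n_workers=4,           # four worker processes (placeholder)
        threads_per_worker=1,
        memory_limit="12GB",   # placeholder per-worker budget
    )
    client = Client(cluster)
    print(client)              # reports workers, threads and memory limits
    client.close()
    cluster.close()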
Can you try the new (just released!) version: 0.3.0?
Hi,
I am trying to run ananse (pip virtual environment; last updated yesterday) on a 64-thread / 110 GB RAM machine. While ananse binding runs fine, ananse network progresses up to ~95% of the network construction before it is killed by the system due to an out-of-memory error.
The command I am running is:
The progress bar stays at 96%, with a fixed, unchanging time, for a while until it is killed.
RAM usage remains near half of the total available (~50-60GB) for a while, then slowly increases until the process is killed.
Prior to interruption, a number of child processes are generated that remain in sleep status. The number of child processes increases over time. Tracing their calls returns the following (an excerpt below):
After interruption, running dmesg and looking for 'ananse' shows the following:
If I understand correctly, ananse allocates 87855820 kB (~83 GB) of memory, which leaves the system with less than half of its memory. Is this expected behavior? Is it possible for the user to set a limit, or can this behavior be modified?
Really looking forward to using this tool!
All the best,
Alberto
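On the question above about a user-set limit: nothing in this thread exposes such an option in ANANSE itself, but a generic, Linux-only workaround is to cap the process's address space so that an oversized run fails with a Python MemoryError instead of being taken down by the kernel OOM killer. A rough sketch with an arbitrary 80 GB cap follows; note that this limits virtual memory, which can be considerably larger than resident memory. From a shell, ulimit -v before the ananse call achieves the same thing.

# Generic workaround, not an ANANSE feature: cap this process's address
# space so allocations beyond the cap raise MemoryError rather than
# triggering the kernel OOM killer. Linux-only; the limit is inherited
# by child processes such as dask workers.
import resource

limit_bytes = 80 * 1024**3  # hypothetical 80 GB cap, adjust to your machine
resource.setrlimit(resource.RLIMIT_AS, (limit_bytes, limit_bytes))

# Anything imported or executed after this point in the same process is
# subject to the cap.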