Don't keep large data around in a pipeline #3381
Have you actually observed any problems in practice, or is this a hypothetical question?
Can you elaborate on what the collections API has to do with this, and how partially implementing it relates to the problem you describe?
You could add a fake dependency, forcing completion of the metadata task early, before other more expensive compute steps. Aside from that, did you have a look at the Dask docs, such as https://docs.dask.org/en/stable/order.html and https://docs.dask.org/en/stable/scheduling-policy.html?
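For illustration, here is a minimal sketch of that fake-dependency idea using `dask.delayed` (all names are invented stand-ins, not the actual Sciline providers): passing the metadata result as an extra, unused argument to the expensive step forces the scheduler to finish the metadata task first, after which the raw data can be dropped as soon as the expensive step completes.

```python
import dask
from dask import delayed


@delayed
def load():
    # Stand-in for loading a very large dataset.
    return list(range(1_000_000))


@delayed
def extract_meta(raw):
    # Cheap extraction of the small metadata.
    return {'n_events': len(raw)}


@delayed
def reduce(raw, _barrier=None):
    # `_barrier` is unused: it exists only to add a fake dependency on the
    # metadata task, so `extract_meta` must finish before `reduce` starts.
    return sum(raw) / len(raw)


raw = load()
meta = extract_meta(raw)
reduced = reduce(raw, _barrier=meta)  # fake dependency on `meta`

result, metadata = dask.compute(reduced, meta)
```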
The problem
In Sciline, we can have pipelines where multiple providers depend on some large piece of data. Depending on the order in which those providers are evaluated, they can keep that large data in memory for a long time. This increases the risk of running out of RAM.
Here is an example:
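Below is a minimal sketch of such a pipeline in Sciline's provider style. Only `RawData`, `reduce`, and `build_meta` are taken from the description; the other types and providers are invented for illustration, and plain Python objects stand in for scipp data structures.

```python
from typing import NewType

import sciline

RawData = NewType('RawData', list)          # stand-in for a large scipp object
ReducedData = NewType('ReducedData', float)
Meta = NewType('Meta', dict)
Result = NewType('Result', dict)


def load() -> RawData:
    # Pretend this loads a very large dataset.
    return RawData(list(range(1_000_000)))


def reduce(raw: RawData) -> ReducedData:
    # Quickly shrinks the large data to something small.
    return ReducedData(sum(raw) / len(raw))


def build_meta(raw: RawData) -> Meta:
    # Needs only a tiny piece of information, but depends on all of RawData.
    return Meta({'n_events': len(raw)})


def finalize(data: ReducedData, meta: Meta) -> Result:
    return Result({'value': data, **meta})


pipeline = sciline.Pipeline([load, reduce, build_meta, finalize])
result = pipeline.compute(Result)
```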
`RawData` is potentially very large and `reduce` likely reduces the size quickly. But since `build_meta` depends on `RawData`, it keeps the large data in memory almost until the end. We can insert an extra step to significantly reduce how much data needs to be kept in memory:
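Continuing the sketch above, a hypothetical `extract_meta` provider pulls out just the small metadata, so that `build_meta` no longer depends on `RawData`:

```python
RawMeta = NewType('RawMeta', dict)


def extract_meta(raw: RawData) -> RawMeta:
    # Cheap and small: the only part of RawData that build_meta needs.
    return RawMeta({'n_events': len(raw)})


def build_meta(raw_meta: RawMeta) -> Meta:
    # No longer holds a reference to RawData.
    return Meta(dict(raw_meta))


pipeline = sciline.Pipeline([load, reduce, extract_meta, build_meta, finalize])
result = pipeline.compute(Result)
```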
Now, we can call `extract_meta` early and release `RawData`. But this depends on the scheduler, so it may or may not know to schedule `extract_meta` before `reduce` has finished.

Solutions?
Is there a good way to steer the scheduler to do what we want? I think that in Dask, if we can communicate the size of intermediate data to the scheduler, it can make a clever decision. This probably means implementing at least part of the collections API for Scipp objects.
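As a smaller first step than the full collections API, one could register a size estimate for scipp objects with Dask's `sizeof` dispatch, which the distributed scheduler uses for memory accounting. This is only a sketch: the `nbytes`-based estimate is an assumption and ignores masks and non-numeric dtypes, and on its own it informs memory bookkeeping rather than task ordering.

```python
from dask.sizeof import sizeof


@sizeof.register_lazy("scipp")
def register_scipp_sizeof():
    import scipp as sc

    @sizeof.register(sc.Variable)
    def sizeof_variable(obj):
        # Assumption: the numeric buffer dominates; use the numpy view's size.
        return obj.values.nbytes

    @sizeof.register(sc.DataArray)
    def sizeof_data_array(obj):
        # Rough estimate: data buffer plus coordinate buffers.
        return sizeof(obj.data) + sum(sizeof(c) for c in obj.coords.values())
```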
I can't think of another way to do this in Dask, or of a way to do it with the naive scheduler at all. At least none that wouldn't make scheduling much more complicated.