Don't keep large data around in a pipeline #3381
Have you actually observed any problems in practice, or is this a hypothetical question?
Can you elaborate on what the collections API has to do with this, and how partially implementing it relates to the problem you describe?
You could add a fake dependency, forcing completion of the metadata task early, before other more expensive compute steps. Aside from that, did you have a look at the Dask docs, such as https://docs.dask.org/en/stable/order.html and https://docs.dask.org/en/stable/scheduling-policy.html?
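For illustration, here is a minimal sketch of that fake-dependency idea using `dask.delayed` (all names are invented stand-ins, not the actual Sciline providers): passing the metadata result as an extra, unused argument to the expensive step forces the scheduler to finish the metadata task first, after which the raw data can be dropped as soon as the expensive step completes.

```python
import dask
from dask import delayed


@delayed
def load():
    # Stand-in for loading a very large dataset.
    return list(range(1_000_000))


@delayed
def extract_meta(raw):
    # Cheap extraction of the small metadata.
    return {'n_events': len(raw)}


@delayed
def reduce(raw, _barrier=None):
    # `_barrier` is unused: it exists only to add a fake dependency on the
    # metadata task, so `extract_meta` must finish before `reduce` starts.
    return sum(raw) / len(raw)


raw = load()
meta = extract_meta(raw)
reduced = reduce(raw, _barrier=meta)  # fake dependency on `meta`

result, metadata = dask.compute(reduced, meta)
```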
The problem
In Sciline, we can have pipelines where multiple providers depend on some large piece of data. Depending on the order in which those providers are evaluated, they can keep that large data in memory for a long time. This increases the risk of running out of RAM.
Here is an example:
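Below is a minimal sketch of such a pipeline in Sciline's provider style. Only `RawData`, `reduce`, and `build_meta` are taken from the description; the other types and providers are invented for illustration, and plain Python objects stand in for scipp data structures.

```python
from typing import NewType

import sciline

RawData = NewType('RawData', list)          # stand-in for a large scipp object
ReducedData = NewType('ReducedData', float)
Meta = NewType('Meta', dict)
Result = NewType('Result', dict)


def load() -> RawData:
    # Pretend this loads a very large dataset.
    return RawData(list(range(1_000_000)))


def reduce(raw: RawData) -> ReducedData:
    # Quickly shrinks the large data to something small.
    return ReducedData(sum(raw) / len(raw))


def build_meta(raw: RawData) -> Meta:
    # Needs only a tiny piece of information, but depends on all of RawData.
    return Meta({'n_events': len(raw)})


def finalize(data: ReducedData, meta: Meta) -> Result:
    return Result({'value': data, **meta})


pipeline = sciline.Pipeline([load, reduce, build_meta, finalize])
result = pipeline.compute(Result)
```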
`RawData` is potentially very large and `reduce` likely reduces the size quickly. But since `build_meta` depends on `RawData`, it keeps the large data in memory almost until the end. We can insert an extra step to significantly reduce how much data needs to be kept in memory:
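Continuing the sketch above, a hypothetical `extract_meta` provider pulls out just the small metadata, so that `build_meta` no longer depends on `RawData`:

```python
RawMeta = NewType('RawMeta', dict)


def extract_meta(raw: RawData) -> RawMeta:
    # Cheap and small: the only part of RawData that build_meta needs.
    return RawMeta({'n_events': len(raw)})


def build_meta(raw_meta: RawMeta) -> Meta:
    # No longer holds a reference to RawData.
    return Meta(dict(raw_meta))


pipeline = sciline.Pipeline([load, reduce, extract_meta, build_meta, finalize])
result = pipeline.compute(Result)
```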
Now, we can call `extract_meta` early and release `RawData`. But this depends on the scheduler, so it may or may not know to schedule `extract_meta` before `reduce` has finished.

Solutions?
Is there a good way to steer the scheduler to do what we want? I think that in Dask, if we can communicate the size of intermediate data to the scheduler, it can make a clever decision. This probably means implementing at least part of the collections API for Scipp objects.
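As a smaller first step than the full collections API, one could register a size estimate for scipp objects with Dask's `sizeof` dispatch, which the distributed scheduler uses for memory accounting. This is only a sketch: the `nbytes`-based estimate is an assumption and ignores masks and non-numeric dtypes, and on its own it informs memory bookkeeping rather than task ordering.

```python
from dask.sizeof import sizeof


@sizeof.register_lazy("scipp")
def register_scipp_sizeof():
    import scipp as sc

    @sizeof.register(sc.Variable)
    def sizeof_variable(obj):
        # Assumption: the numeric buffer dominates; use the numpy view's size.
        return obj.values.nbytes

    @sizeof.register(sc.DataArray)
    def sizeof_data_array(obj):
        # Rough estimate: data buffer plus coordinate buffers.
        return sizeof(obj.data) + sum(sizeof(c) for c in obj.coords.values())
```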
I can't think of another way to do this in Dask, or of a way to do it with the naive scheduler at all. At least none that wouldn't make scheduling much more complicated.