@willwerscheid was wondering about a scenario where, for a large data set, we can load the data once and subsample from it many times to simulate smaller data sets. This boils down to some sequential execution logic, e.g.:
```yaml
simulate:
  seed: R{1:10}
  data: '/path/to/data'
  ...
```
and we execute this module sequentially, not in parallel (or parallelize it within the R session), so that the data is loaded only once.
The biggest challenge is that we would then have to move the `for` loop to the module-script level (which is language specific) rather than handling it at the DSC level. That is a fundamental change that the existing code cannot easily be adapted to. But I can see the appeal of the request, so we need to think about how best to do it.
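To make the idea concrete, here is a minimal sketch (in Python, using only the standard library) of what the "for loop at the module-script level" might look like: the large data set is loaded once, and the loop over seeds draws subsamples inside a single module script. Function names, sizes, and the synthetic data are illustrative assumptions, not DSC API.

```python
import random

def subsample_once(seed, data, n_small):
    """Draw one reproducible subsample of n_small items for a given seed."""
    rng = random.Random(seed)  # per-seed RNG keeps replicates independent
    return rng.sample(data, n_small)

def simulate_all(data, seeds, n_small):
    """One load, many subsamples: the per-seed loop lives inside the script."""
    results = {}
    for seed in seeds:
        results[seed] = subsample_once(seed, data, n_small)
    return results

# Synthetic data standing in for the expensive-to-load data set:
data = list(range(1000))
subsets = simulate_all(data, seeds=range(1, 11), n_small=20)
```

Because each subsample is keyed by its seed, the result is equivalent to running ten independent module instances, but the data is read only once.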
@willwerscheid I think if loading a data set multiple times is a big issue, you should consider: (1) finding a way to make the data loading faster (e.g., by saving in an efficient format), or (2) having a single module that creates all the data subsets in one go, so that the subsets can then be loaded in a separate module that is replicated many times.
This seems to me more a question about how to best design your DSC, and I think can be accomplished with the existing DSC features.
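Option (2) can be sketched as a two-module design: an upstream module runs once and writes one subset file per seed, and a downstream module, replicated once per seed, loads only its own subset. The sketch below uses plain Python with `pickle` and a temporary directory; the file-naming scheme and sizes are assumptions for illustration, not actual DSC output conventions.

```python
import os
import pickle
import random
import tempfile

def create_subsets(data, seeds, n_small, out_dir):
    """Upstream module: runs once, writes one subset file per seed."""
    paths = {}
    for seed in seeds:
        subset = random.Random(seed).sample(data, n_small)
        path = os.path.join(out_dir, f"subset_{seed}.pkl")
        with open(path, "wb") as fh:
            pickle.dump(subset, fh)
        paths[seed] = path
    return paths

def load_subset(path):
    """Downstream module: each replicate loads only its own small file."""
    with open(path, "rb") as fh:
        return pickle.load(fh)

# Synthetic data standing in for the expensive-to-load data set:
data = list(range(1000))
out_dir = tempfile.mkdtemp()
paths = create_subsets(data, seeds=range(1, 11), n_small=20, out_dir=out_dir)
one_subset = load_subset(paths[3])
```

The large data set is read once by `create_subsets`; the replicated downstream modules only pay the (small) cost of reading their own subset file, which fits the existing parallel execution model.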
gaow changed the title from "Sequentially executed modules" to "DSC execution logic" on May 2, 2019.
@pcarbo I agree with your assessment, although it is not completely impossible to address this at a higher DSC level. I'm thinking of addressing things like this in DSC 2.0, along with the map-reduce notion that in the end all results flow to one node. A third thing worth doing is to allow multiple outcomes per module instance -- that is the best way to address the issue of benchmarking with command-line tools in, say, bash.
In any case, points 2 and 3 are not relevant to @willwerscheid's initial question, but they are related in the sense that they are exceptions or extensions to the parallel execution paradigm. So I'd like to keep this ticket open as a reminder to myself for when I re-evaluate and redesign some of the execution logic down the road.