Detect and disallow orphan downstream modules #183

Open
zouyuxin opened this issue Apr 24, 2019 · 14 comments

Comments

@zouyuxin
Member

zouyuxin commented Apr 24, 2019

@pcarbo How do I extract filenames using dscquery? I'm using DSC to benchmark the summary-statistics version of susie. For some simulated datasets, the summary-statistics version fails to converge within a fixed number of iterations, so I want to inspect those datasets to understand the problem.

dscout = dscquery('r_compare_add_z', targets = 'get_sumstats score_susie.converged', omit.filenames = FALSE)

This used to work, but it now gives an error:

ERROR: Requested module get_sumstats is an orphan branch. Please consider using separate queries for information from this module.

Is there a way to extract filenames? Thanks.

@pcarbo
Member

pcarbo commented Apr 24, 2019

@zouyuxin Which version of dscquery do you have? Does it work with omit.filenames = TRUE? Could you post the full error?

@zouyuxin
Member Author

I'm using dscrutils_0.3.8

@gaow
Member

gaow commented Apr 24, 2019

@zouyuxin it looks like the bug (or "feature"?) is at the dsc-query level. @pcarbo please hold on; I'll take a look at this first.

@gaow
Member

gaow commented Apr 24, 2019

@zouyuxin the error is unfortunately a "feature" in dsc-query, and the reason is exactly what the message suggests. But I understand this can be confusing, and it is sometimes unnecessarily complicated, even impossible, to develop the exact queries that ensure a linear logic flow. Therefore I have now downgraded the message to a WARNING so that your query will not be interrupted. The message now looks like:

WARNING: Requested module get_sumstats is an orphan branch with respect to susie; thus removed from sub-query involving susie.

Now you will see a complaint about the sub-query involving susie, which you do not care about anyway.

Please test and let us know if you still have trouble extracting filenames!

(@pcarbo on a side note, the output extraction is quite slow for @zouyuxin's test example benchmark, which involves 12000 rows and 2 fields to extract)

@gaow changed the title from "Extract filenames using dscquery" to "Behavior of query with non-linear logic" on Apr 24, 2019

@pcarbo
Member

pcarbo commented Apr 24, 2019

@gaow Could you please explain this warning? I've seen this warning/error before, and I still don't quite understand it.

@pcarbo
Member

pcarbo commented Apr 24, 2019

@pcarbo on a side note the output extraction is quite slow for @zouyuxin 's test example benchmark involving 12000 rows and 2 fields to extract.

Yes, if you have a lot of results, it will take a long time to load. No way around that!

@gaow
Member

gaow commented Apr 24, 2019

No way around that!

We've yet to benchmark parallel I/O, but I don't believe it makes no difference on a modern system. And 12,000 results is possibly not a lot? I think the speed limit here is not just I/O but also the computation involved in processing and loading the data.

There are some other design decisions we could possibly make, e.g., a different data storage structure or format. For example, it would be a lot faster if there were a way to store data in R and access only the relevant keys on disk, rather than loading the entire object and then extracting the relevant keys. But I'm not aware of such a solution.
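To make the idea of key-level access concrete, here is a minimal sketch in Python rather than R, using the standard-library shelve module. It is only an illustration of the storage pattern, not something DSC or dscrutils does; the file name and the fields "converged" and "fitted" are hypothetical.

```python
import os
import shelve
import tempfile


def save_results(path, results):
    """Write each field under its own key so it can be read back on its own."""
    with shelve.open(path, flag="n") as db:
        for key, value in results.items():
            db[key] = value


def load_keys(path, keys):
    """Read back only the requested keys; other entries are not unpickled."""
    with shelve.open(path, flag="r") as db:
        return {key: db[key] for key in keys}


# Hypothetical module output with one small field and one large field.
store = os.path.join(tempfile.mkdtemp(), "score_susie_1")
save_results(store, {"converged": False, "fitted": list(range(100000))})

# Only "converged" is deserialized here; "fitted" stays on disk.
subset = load_keys(store, ["converged"])
print(subset)
```

Each value is pickled separately, so the cost of a query depends on the fields requested rather than on the size of the whole stored object.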

Could you please explain this warning?

For example, suppose a DSC executes two types of pipelines: simulate -> compute_summary_stats -> susie_with_summary_stats -> score and simulate -> susie_full_data -> score. Suppose a query then asks for compute_summary_stats, susie, score. There will be a gap in the logic: the flow is broken into susie -> score AND compute_summary_stats, and there is no way to connect these two. In this case I call compute_summary_stats an orphan. The current behavior is to look backwards from the end of the pipelines, ignoring orphans along the way, and to give a warning.
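The gap described here can be sketched in a few lines of Python. This is an illustration only, not the dsc-query implementation: the function orphans_wrt is hypothetical, pipelines are modeled as plain lists of module names, and susie_full_data is assumed to stand in for "susie" in the query.

```python
def orphans_wrt(pipelines, targets, anchor):
    """Return the requested targets that share no pipeline with `anchor`."""
    connected = set()
    for pipeline in pipelines:
        if anchor in pipeline:
            connected.update(pipeline)
    return sorted(t for t in targets if t not in connected)


# The two pipeline types from the example above.
pipelines = [
    ["simulate", "compute_summary_stats", "susie_with_summary_stats", "score"],
    ["simulate", "susie_full_data", "score"],
]

# Assumption: "susie" in the query refers to susie_full_data.
targets = ["compute_summary_stats", "susie_full_data", "score"]

print(orphans_wrt(pipelines, targets, "susie_full_data"))
# prints ['compute_summary_stats']
```

No executed pipeline contains both compute_summary_stats and susie_full_data, so there are no rows in which both outputs can be reported together.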

@pcarbo
Member

pcarbo commented Apr 25, 2019

We've yet to benchmark parallel I/O.

What is parallel I/O?

I think the speed limit here is not just I/O but also the computation involved in processing and loading the data.

This will depend on many things. In general, assessing I/O load is difficult, and highly dependent on the computing infrastructure and the file system used. In short, I wouldn't bother with it.

@pcarbo
Member

pcarbo commented Apr 25, 2019

@zouyuxin I'd like to take a look at your DSC, if that's okay. Where can I see it?

@zouyuxin
Member Author

@gaow
Member

gaow commented Apr 29, 2019

I take it that this issue is resolved? We can use other tickets to discuss slowness in dscquery if we get more complaints from users.

@gaow gaow closed this as completed Apr 29, 2019
@gaow
Member

gaow commented Apr 29, 2019

As one example requested by @pcarbo, the DSC is:

A: R()
  $out1: 1

B: R()
  $out2: 1

C: R()
  out1: $out1
  $out3: 1

DSC:
  run: A*B*C

Run it with dsc test.dsc, then:

> dscrutils::dscquery('test', targets='A C')
Calling dsc-query.
Running shell command:
dsc-query test -o /tmp/RtmpE24m5m/file39c97e249c8a.csv --target "A C" --force
INFO: Loading database ...
INFO: Running queries ...
INFO: Extraction complete!
Importing dsc-query output.
Reading DSC outputs.
  DSC A.output.file C.output.file
1   1         A/A_1     C/A_1_C_1
> dscrutils::dscquery('test', targets='A B')
Calling dsc-query.
Running shell command:
dsc-query test -o /tmp/RtmpE24m5m/file39c96bd1501f.csv --target "A B" --force
INFO: Loading database ...
INFO: Running queries ...
WARNING: Requested module A is an orphan branch with respect to module B; thus removed from sub-query involving module B.
INFO: Extraction complete!
Importing dsc-query output.
Reading DSC outputs.
  DSC B.output.file
1   1         B/B_1
> dscrutils::dscquery('test', targets='A B C')
Calling dsc-query.
Running shell command:
dsc-query test -o /tmp/RtmpE24m5m/file39c96bb482ae.csv --target "A B C" --force
INFO: Loading database ...
INFO: Running queries ...
WARNING: Requested module B is an orphan branch with respect to module C; thus removed from sub-query involving module C.
INFO: Extraction complete!
Importing dsc-query output.
Reading DSC outputs.
  DSC A.output.file C.output.file
1   1         A/A_1     C/A_1_C_1

The first query will work, but the 2nd and 3rd will give warnings. I won't explain it for now; I'd rather let you decide whether this is reasonable behavior. That way we will know whether the message makes sense and how we can possibly improve it :) Note that you'll need the current master version to reproduce this, not the current release.
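One reading that is consistent with all three outputs above can be written as a small Python sketch. It is a guess at the rule, not the actual dsc-query code: it assumes the last requested target acts as the anchor, and that a target is kept only if the anchor depends on it directly or indirectly. The names depends_on, ancestors and resolve_query are hypothetical.

```python
def ancestors(depends_on, module):
    """All modules that `module` depends on, directly or indirectly."""
    seen = set()
    stack = list(depends_on.get(module, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(depends_on.get(current, ()))
    return seen


def resolve_query(depends_on, targets):
    """Keep the last target and the targets upstream of it; report the rest."""
    anchor = targets[-1]  # assumption: the last target anchors the query
    upstream = ancestors(depends_on, anchor)
    kept = [t for t in targets if t == anchor or t in upstream]
    dropped = [t for t in targets if t not in kept]
    return kept, dropped


# test.dsc: C reads $out1 from A; B reads nothing, and C reads nothing from B.
depends_on = {"A": [], "B": [], "C": ["A"]}

for query in (["A", "C"], ["A", "B"], ["A", "B", "C"]):
    kept, dropped = resolve_query(depends_on, query)
    print(query, "kept:", kept, "not connected:", dropped)
```

Under these assumptions the sketch keeps A and C for the first query, only B for the second (A is dropped), and A and C for the third (B is dropped), which matches the columns returned above.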

@gaow
Member

gaow commented Apr 30, 2019

@pcarbo The warning text has now been changed to "WARNING: Requested module B is not connected to module C;". The word "orphan" comes from the term "orphan process", meaning a process whose "parent" (upper-level) process is not available. But it is a phrase we should avoid anyway.

@pcarbo
Member

pcarbo commented May 3, 2019

Consider the following very simple pipeline:

DSC:
  run: A * B * C

Suppose that module C does not use any of the outputs from module B, and only uses outputs from module A. (This may seem like an odd thing to do, but in practice it could happen.)

Due to the way this would be run internally, this odd situation creates several downstream issues, particularly in querying the results. (I'll spare you the details, but happy to provide them if you want.)

This is my proposed rule:

In DSC, we should require that module C take as input at least one output from module B.

There's some additional motivation for this rule:

When C does not depend on B, internally this would be run as:

(A * B, A * C)

Which could make querying the DSC results difficult if you aren't expecting this.

I'm not saying that this is the best way to do this, but in light of this fact, I was expediently suggesting this rule to avoid downstream querying issues.
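A check for the proposed rule could look roughly like the following Python sketch. It is not part of DSC: the function find_unconnected_steps and the dictionary layout are hypothetical, the rule is generalized here to every adjacent pair in the run sequence, and the variable names out1, out2, out3 are borrowed from the earlier test example, with B changed to read out1 so that only C is disconnected from its predecessor.

```python
def find_unconnected_steps(modules, sequence):
    """Return (previous, current) pairs where current uses no output of previous."""
    problems = []
    for previous, current in zip(sequence, sequence[1:]):
        produced = set(modules[previous]["outputs"])
        consumed = set(modules[current]["inputs"])
        if not produced & consumed:
            problems.append((previous, current))
    return problems


# Hypothetical module definitions for the pipeline A * B * C,
# where C uses only an output of A and nothing from B.
modules = {
    "A": {"inputs": [], "outputs": ["out1"]},
    "B": {"inputs": ["out1"], "outputs": ["out2"]},
    "C": {"inputs": ["out1"], "outputs": ["out3"]},
}

problems = find_unconnected_steps(modules, ["A", "B", "C"])
for previous, current in problems:
    print(f"module {current} takes no input from module {previous}")
# prints: module C takes no input from module B
```

A check like this could run before execution and reject the DSC file, rather than leaving the disconnect to surface later as a query warning.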

@pcarbo pcarbo reopened this May 3, 2019
@gaow changed the title from "Behavior of query with non-linear logic" to "Detect and disallow orphan downstream modules" on Oct 2, 2020

3 participants