Detect and disallow orphan downstream modules #183

zouyuxin · 2019-04-24T16:02:17Z

@pcarbo How to extract filenames using dscquery? I'm using dsc to benchmark susie summary stats version. For some simulated dataset, susie summary version fails to converge within a fixed number of iterations. So, I want to check the dataset to understand the problem.

dscout = dscquery('r_compare_add_z', targets = 'get_sumstats score_susie.converged', omit.filenames = FALSE)

used to work. But it gives ERROR now.

ERROR: Requested module get_sumstats is an orphan branch. Please consider using separate queries for information from this module.

Is there a way to extract filenames? Thanks.

The text was updated successfully, but these errors were encountered:

pcarbo · 2019-04-24T16:04:06Z

@zouyuxin Which version of dscquery do you have? Does it work with omit.filenames = TRUE? Could you post the full error?

zouyuxin · 2019-04-24T16:06:07Z

I'm using dscrutils_0.3.8

gaow · 2019-04-24T16:07:46Z

@zouyuxin it looks like the bug (or "feature"?) is on dsc-query level. @pcarbo please hold on I'll take a look at this first now.

gaow · 2019-04-24T19:02:10Z

@zouyuxin the error is unfortunately a "feature" in dsc-query and the reason is exactly what it suggests. But I understand this can be confusing and sometimes unnecessarily complicated, even impossible, to develop the exact queries that ensures linear logic flow. Therefore I now downgraded the message to a WARNING so that your query will not be interrupted. The message now looks like:

WARNING: Requested module get_sumstats is an orphan branch with respect to susie; thus removed from sub-query involving susie.

Now you see that there is a complain in subquery susie which you do not care anyways.

Please test and let us know if you still have troubles extracting filenames!

(@pcarbo on a side note the output extraction is quite slow for @zouyuxin 's test example benchmark involving 12000 rows and 2 fields to extract)

pcarbo · 2019-04-24T21:25:30Z

@gaow Could you please explain this warning? I've seen this warning/error before, and I still don't quite understand it.

pcarbo · 2019-04-24T21:48:44Z

@pcarbo on a side note the output extraction is quite slow for @zouyuxin 's test example benchmark involving 12000 rows and 2 fields to extract.

Yes, if you have a lot of results, it will take a long time to load. No way around that!

gaow · 2019-04-24T22:31:31Z

No way around that!

We've yet to benchmark parallel I/O because I dont believe in a modern system it does not make a difference. And possibly 12,000 result is not a lot? I think the speed limit here is not just I/O but also the computation involve to process and load data.

There are some other design decisions we can possibly make, eg, different data storage structure or format. For example, it will be a lot faster if there is a way to store data in R and access only relevant keys in the disk not loading the entire object then extract relevant keys. But I'm not aware of such a solution to my current knowledge

Could you please explain this warning?

For example in a DSC two types of pipelines are executed simulate -> compute_summary_stats -> susie_with_summary_stats -> score and simulate -> susie_full_data -> score. Then a query asks for compute_summary_stats, susie, score. Then there will be a gap in the logic, that is, the flow is broken into susie -> score AND compute_summary_stats there is no way to connect these two. In this case I call compute_summary_stats an orphan. Current behavior is to look backwards from the end of pipelines, ignoring orphans along the way and give a warning.

pcarbo · 2019-04-25T15:19:10Z

We've yet to benchmark parallel I/O.

What is parallel I/O?

I think the speed limit here is not just I/O but also the computation involve to process and load data.

This will depend on many things. In general, assessing I/O load is difficult, and highly dependent on the computing infrastructure and the file system used. In short, I wouldn't bother with it.

pcarbo · 2019-04-25T15:19:25Z

@zouyuxin I'd like to take a look at your DSC, if that's okay. Where can I see it?

zouyuxin · 2019-04-25T16:57:20Z

It's at https://github.com/stephenslab/dsc-finemap/blob/dsc-R/r_compare.dsc

gaow · 2019-04-29T18:23:45Z

I take that this issue is resolved? We can use other tickets for discussions RE slowness in dscquery if we get more complains from users.

gaow · 2019-04-29T23:44:50Z

As one example requested by @pcarbo, the DSC is:

A: R()
  $out1: 1

B: R()
  $out2: 1

C: R()
  out1: $out1
  $out3: 1

DSC:
  run: A*B*C

and use dsc test.dsc to run, then:

> dscrutils::dscquery('test', targets='A C')
Calling dsc-query.
Running shell command:
dsc-query test -o /tmp/RtmpE24m5m/file39c97e249c8a.csv --target "A C" --force
INFO: Loading database ...
INFO: Running queries ...
INFO: Extraction complete!
Importing dsc-query output.
Reading DSC outputs.
  DSC A.output.file C.output.file
1   1         A/A_1     C/A_1_C_1
> dscrutils::dscquery('test', targets='A B')
Calling dsc-query.
Running shell command:
dsc-query test -o /tmp/RtmpE24m5m/file39c96bd1501f.csv --target "A B" --force
INFO: Loading database ...
INFO: Running queries ...
WARNING: Requested module A is an orphan branch with respect to module B; thus removed from sub-query involving module B.
INFO: Extraction complete!
Importing dsc-query output.
Reading DSC outputs.
  DSC B.output.file
1   1         B/B_1
> dscrutils::dscquery('test', targets='A B C')
Calling dsc-query.
Running shell command:
dsc-query test -o /tmp/RtmpE24m5m/file39c96bb482ae.csv --target "A B C" --force
INFO: Loading database ...
INFO: Running queries ...
WARNING: Requested module B is an orphan branch with respect to module C; thus removed from sub-query involving module C.
INFO: Extraction complete!
Importing dsc-query output.
Reading DSC outputs.
  DSC A.output.file C.output.file
1   1         A/A_1     C/A_1_C_1

The first query will work, but the 2nd and 3rd will give warnings. I'll try not to explain it but let you decide if it is reasonable behavior? That way we know if the prompted message makes sense and how we can possibly improve it :) Note you'll need the current master version to reproduce, not the current release.

gaow · 2019-04-30T17:56:55Z

@pcarbo The warning text is now changed to "WARNING: Requested module B is not connected to module C;". The word "orphan" is from the term "orphan process" whose "parent" (upper level) process is not available. But it is a phrase we should avoid anyways.

pcarbo · 2019-05-03T11:50:10Z

Consider the following very simple pipeline:

DSC:
  run: A * B * C

Suppose that module C does not use any of the outputs from module B, and only uses outputs from module A. (This may seem like an odd thing to do, but in practice it could happen.)

Due to the way this would be run internally, this odd situation creates several downstream issues, particularly in querying the results. (I'll spare you the details, but happy to provide them if you want.)

This is my proposed rule:

In DSC, we should require that module C take as input at least one output from module B.

There's some additional motivation for this rule:

When C does not depend on B, internally this would be run as:

(A * B, A * C)

Which could make querying the DSC results difficult if you aren't expecting this.

I'm not saying that this is the best way to do this, but in light of this fact, I was expediently suggesting this rule to avoid downstream querying issues.

gaow pushed a commit that referenced this issue Apr 24, 2019

Downgrade ERROR to WARNING for orphan branches #183

6baebf1

gaow changed the title ~~Extract filenames using dscquery~~ Behavior of query with non-linear logic Apr 24, 2019

pcarbo assigned gaow Apr 24, 2019

gaow closed this as completed Apr 29, 2019

pcarbo reopened this May 3, 2019

gaow added discussion enhancement soon labels Nov 21, 2019

gaow changed the title ~~Behavior of query with non-linear logic~~ Detect and disallow orphan downstream modules Oct 2, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Detect and disallow orphan downstream modules #183

Detect and disallow orphan downstream modules #183

zouyuxin commented Apr 24, 2019 •

edited

Loading

pcarbo commented Apr 24, 2019

zouyuxin commented Apr 24, 2019

gaow commented Apr 24, 2019

gaow commented Apr 24, 2019

pcarbo commented Apr 24, 2019

pcarbo commented Apr 24, 2019

gaow commented Apr 24, 2019 •

edited

Loading

pcarbo commented Apr 25, 2019

pcarbo commented Apr 25, 2019

zouyuxin commented Apr 25, 2019

gaow commented Apr 29, 2019

gaow commented Apr 29, 2019 •

edited by pcarbo

Loading

gaow commented Apr 30, 2019

pcarbo commented May 3, 2019

Detect and disallow orphan downstream modules #183

Detect and disallow orphan downstream modules #183

Comments

zouyuxin commented Apr 24, 2019 • edited Loading

pcarbo commented Apr 24, 2019

zouyuxin commented Apr 24, 2019

gaow commented Apr 24, 2019

gaow commented Apr 24, 2019

pcarbo commented Apr 24, 2019

pcarbo commented Apr 24, 2019

gaow commented Apr 24, 2019 • edited Loading

pcarbo commented Apr 25, 2019

pcarbo commented Apr 25, 2019

zouyuxin commented Apr 25, 2019

gaow commented Apr 29, 2019

gaow commented Apr 29, 2019 • edited by pcarbo Loading

gaow commented Apr 30, 2019

pcarbo commented May 3, 2019

zouyuxin commented Apr 24, 2019 •

edited

Loading

gaow commented Apr 24, 2019 •

edited

Loading

gaow commented Apr 29, 2019 •

edited by pcarbo

Loading