Detect and disallow orphan downstream modules #183
Comments
@zouyuxin Which version of dscquery do you have? Does it work with |
I'm using dscrutils_0.3.8 |
@zouyuxin the error is unfortunately a "feature" in
Now you see that there is a complaint in subquery Please test and let us know if you still have trouble extracting filenames! (@pcarbo, on a side note, the output extraction is quite slow for @zouyuxin's test benchmark, which involves 12,000 rows and 2 fields to extract.) |
@gaow Could you please explain this warning? I've seen this warning/error before, and I still don't quite understand it. |
We've yet to benchmark parallel I/O because I don't believe it makes a difference on a modern system. And possibly 12,000 results is not a lot? I think the speed limit here is not just I/O but also the computation involved in processing and loading the data. There are some other design decisions we could make, e.g., a different data storage structure or format. For example, it would be a lot faster if there were a way to store data in R and access only the relevant keys on disk, rather than loading the entire object and then extracting the relevant keys. But I'm not aware of such a solution.
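To make the "access only relevant keys on disk" idea concrete, here is a minimal sketch in Python rather than R, using the standard-library `shelve` module. It is only an illustration of key-wise storage, not how DSC stores module outputs; the file name `module_output` and the keys `beta_hat`, `niter` and `big_matrix` are made up for the example.

```python
import os
import shelve
import tempfile


def write_store(path, records):
    """Write each top-level key as its own entry in an on-disk store."""
    with shelve.open(path, flag="n") as db:
        for key, value in records.items():
            db[key] = value


def read_keys(path, keys):
    """Read back only the requested keys; other entries are never unpickled."""
    with shelve.open(path, flag="r") as db:
        return {key: db[key] for key in keys}


# Hypothetical module output with one large field and two small ones.
workdir = tempfile.mkdtemp()
store = os.path.join(workdir, "module_output")
write_store(
    store,
    {"beta_hat": [0.1, 0.2], "niter": 100, "big_matrix": list(range(10000))},
)

# Only "niter" is deserialized here; "big_matrix" stays on disk.
subset = read_keys(store, ["niter"])
```

Whether something similar would help in practice depends on the storage format DSC uses for R outputs, which this sketch does not model.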
For example in a DSC two types of pipelines are executed |
What is parallel I/O?
This will depend on many things. In general, assessing I/O load is difficult, and highly dependent on the computing infrastructure and the file system used. In short, I wouldn't bother with it. |
@zouyuxin I'd like to take a look at your DSC, if that's okay. Where can I see it? |
I take it that this issue is resolved? We can use other tickets for discussions regarding slowness in |
As one example requested by @pcarbo, the DSC is:
A: R()
  $out1: 1
B: R()
  $out2: 1
C: R()
  out1: $out1
  $out3: 1
DSC:
  run: A*B*C
and use
> dscrutils::dscquery('test', targets='A C')
Calling dsc-query.
Running shell command:
dsc-query test -o /tmp/RtmpE24m5m/file39c97e249c8a.csv --target "A C" --force
INFO: Loading database ...
INFO: Running queries ...
INFO: Extraction complete!
Importing dsc-query output.
Reading DSC outputs.
DSC A.output.file C.output.file
1 1 A/A_1 C/A_1_C_1
> dscrutils::dscquery('test', targets='A B')
Calling dsc-query.
Running shell command:
dsc-query test -o /tmp/RtmpE24m5m/file39c96bd1501f.csv --target "A B" --force
INFO: Loading database ...
INFO: Running queries ...
WARNING: Requested module A is an orphan branch with respect to module B; thus removed from sub-query involving module B.
INFO: Extraction complete!
Importing dsc-query output.
Reading DSC outputs.
DSC B.output.file
1 1 B/B_1
> dscrutils::dscquery('test', targets='A B C')
Calling dsc-query.
Running shell command:
dsc-query test -o /tmp/RtmpE24m5m/file39c96bb482ae.csv --target "A B C" --force
INFO: Loading database ...
INFO: Running queries ...
WARNING: Requested module B is an orphan branch with respect to module C; thus removed from sub-query involving module C.
INFO: Extraction complete!
Importing dsc-query output.
Reading DSC outputs.
DSC A.output.file C.output.file
1 1 A/A_1 C/A_1_C_1
The first query will work, but the 2nd and 3rd will give warnings. I'll try not to explain it and instead let you decide whether it is reasonable behavior. That way we know if the prompted message makes sense and how we can possibly improve it :) Note you'll need the current |
@pcarbo The warning text is now changed to "WARNING: Requested module B is not connected to module C;". The word "orphan" is from the term "orphan process" whose "parent" (upper level) process is not available. But it is a phrase we should avoid anyways. |
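One way to read the three queries in the A, B, C example is as a connectivity check on module inputs and outputs. The sketch below is a hypothetical model in Python, not dsc-query's actual implementation: it assumes the most downstream requested module is the anchor of the query, and that any other requested module is dropped unless the anchor consumes its output (directly or through intermediate modules). The names `MODULES`, `PIPELINE`, `upstream_of` and `split_targets` are invented for illustration.

```python
# Hypothetical model of the example DSC: each module lists the variables it
# consumes and produces. This is NOT how dsc-query is implemented.
MODULES = {
    "A": {"inputs": set(), "outputs": {"out1"}},
    "B": {"inputs": set(), "outputs": {"out2"}},
    "C": {"inputs": {"out1"}, "outputs": {"out3"}},
}
PIPELINE = ["A", "B", "C"]  # run: A*B*C


def upstream_of(module, modules=MODULES, pipeline=PIPELINE):
    """Return modules whose outputs `module` consumes, directly or transitively."""
    found = set()
    stack = [module]
    while stack:
        current = stack.pop()
        # Only modules earlier in the pipeline can feed the current one.
        for candidate in pipeline[: pipeline.index(current)]:
            if candidate in found:
                continue
            if modules[candidate]["outputs"] & modules[current]["inputs"]:
                found.add(candidate)
                stack.append(candidate)
    return found


def split_targets(targets, modules=MODULES, pipeline=PIPELINE):
    """Anchor on the most downstream target; return (kept, dropped) modules."""
    ordered = sorted(targets, key=pipeline.index)
    anchor = ordered[-1]
    connected = upstream_of(anchor, modules, pipeline)
    kept = [m for m in ordered if m == anchor or m in connected]
    dropped = [m for m in ordered if m not in kept]
    return kept, dropped


for query in (["A", "C"], ["A", "B"], ["A", "B", "C"]):
    kept, dropped = split_targets(query)
    for module in dropped:
        print(f"Requested module {module} is not connected to module {kept[-1]}")
```

Under these assumptions, the kept modules match the output columns shown in the three queries: `A` and `C` for targets `A C`, only `B` for targets `A B`, and `A` and `C` for targets `A B C`.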
Consider the following very simple pipeline:
DSC:
  run: A * B * C
Suppose that module
Due to the way this would be run internally, this odd situation creates several downstream issues, particularly in querying the results. (I'll spare you the details, but happy to provide them if you want.)
This is my proposed rule: In DSC, we should require that module
There's some additional motivation for this rule: When
Which could make querying the DSC results difficult if you aren't expecting this. I'm not saying that this is the best way to do this, but in light of this fact, I was expediently suggesting this rule to avoid downstream querying issues. |
@pcarbo How do I extract filenames using dscquery? I'm using DSC to benchmark the summary statistics version of susie. For some simulated datasets, the susie summary version fails to converge within a fixed number of iterations. So, I want to check the datasets to understand the problem.
used to work. But it gives ERROR now.
Is there a way to extract filenames? Thanks.