Deal properly with download of data #168
Comments
Hello, in metatdenovo there are three main programs that need a database: EUKulele, EggNOG and KOFAMSCAN. EUKulele and EggNOG have built-in tools for downloading theirs, while KOFAMSCAN relies on a wget module. For each module it is possible to skip the download step if you already have the database, by pointing to the directory with the stored files. This method works quite well, but we are still experiencing issues in the download steps (when the user doesn't have the databases, the pipeline tries to download them on its own):
There might be other cases but I don't recall them now. I am sure something is missing in our code: there might be some kind of conflict, and we should avoid being too conservative and save the databases in the work directory and the output directory if necessary.
One thought: EUKulele downloads things automatically in search mode too, right? Would it be better to rely on that and scrap the download module? Or would that make it more difficult to make sure files are available after a run?
Will skip the download modules if the user specifies an available database, and ensure the files are staged properly for their downstream modules too.
Just to be clear: when you check whether the download module should be called or not, check for the existence of at least one file (see the second if clause in […]). For EUKulele, this file could e.g. be […]. We might want to make larger changes later (see the discussion above), but this should get us started.
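The check suggested here can be sketched independently of Nextflow. The snippet below is only an illustration in Python, not the pipeline's actual code: the function name `needs_download` and the idea of passing a single marker file name are assumptions made for the example, and the real marker file for each database is whatever the maintainers choose.

```python
from pathlib import Path
from typing import Optional


def needs_download(db_dir: Optional[str], marker_file: str) -> bool:
    """Return True when the download step should run (hypothetical helper)."""
    # No directory supplied by the user: the pipeline has to download.
    if not db_dir:
        return True
    # Directory supplied: skip the download only if the marker file exists in it.
    return not (Path(db_dir) / marker_file).is_file()
```

Checking a single marker file keeps the test cheap, at the cost of not noticing a partially downloaded database.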
I'll have to look into doing this outside of the module's context. I think the core of the problem is that staging the files in the projectDir isn't helpful in the Tower context, since files are moved around and symlinked automatically. I'll definitely try to test for input file completeness, but will have to think of some solutions here.
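A completeness test could look roughly like the following Python sketch. It is an assumption-laden illustration rather than anything from the pipeline: `missing_files` is a hypothetical helper, and the list of expected file names would have to come from each tool's documentation.

```python
from pathlib import Path
from typing import Iterable, List


def missing_files(db_dir: str, expected: Iterable[str]) -> List[str]:
    """List the expected database files that are absent from db_dir (hypothetical helper)."""
    base = Path(db_dir)
    # Path.exists() follows symlinks, so a dangling symlink counts as missing.
    return [name for name in expected if not (base / name).exists()]
```

Because `exists()` resolves symlinks, this also flags staged links whose targets have been moved, which is the situation described for the Tower context.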
I think the subworkflow is the best place, just like in the EggNOG case, see above.
For the EUKulele download, we are addressing the issue in PR #190.
Today we agreed on closing this issue.
Description of the bug
I think it's great to have the possibility to automatically download data within the pipeline, but I would make that optional and add the possibility to provide already downloaded data.
We already talked about this on Slack, and I'd be happy to help find a smart solution to deal with it.