
Deal properly with download of data #168

Closed
maxulysse opened this issue Aug 24, 2023 · 9 comments
@maxulysse
Member

Description of the bug

I think it's great to have the possibility to automatically download data within the pipeline, but I would make that optional and add the possibility to provide already downloaded data.
We already talked about this on Slack, and I'd be happy to help find a smart solution to deal with it.

Command used and terminal output

No response

Relevant files

No response

System information

No response

@maxulysse maxulysse added the bug Something isn't working label Aug 24, 2023
@erikrikarddaniel erikrikarddaniel added this to the nf-core candidate milestone Aug 24, 2023
@erikrikarddaniel
Member

👍 @maxulysse

@danilodileo
Collaborator

Hello,
to address the issue, here is a recap of how the pipeline currently handles downloading the databases.

In metatdenovo there are three main programs that need a database: EUKulele, EggNOG and KOFAMSCAN. EUKulele and EggNOG have their own built-in tools for downloading it, while KOFAMSCAN relies on a wget module. For each module it is possible to skip the download step if you already have the database, by pointing to the directory with the stored files. This method works quite well, but we are still experiencing issues in the download step (when the user doesn't have the databases and the pipeline tries to fetch them on its own):

  • Nextflow Tower;
  • on some clusters wget does not seem to work;

There might be other cases but I don't recall them now.

I am sure something is missing in our code: there might be some kind of conflict, and we should avoid being too conservative and save the databases in the work directory and, if necessary, in the output directory.
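For illustration, a minimal sketch (with hypothetical parameter, module and channel names, not the actual metatdenovo code) of how a subworkflow can skip the download when the user points at a pre-downloaded database directory:

```nextflow
// Hypothetical subworkflow: skip the download when the user supplies a
// database directory, otherwise fall back to the download module.
// Parameter, module and channel names are illustrative only.

include { KOFAMSCAN_DOWNLOAD } from '../../modules/local/kofamscan_download'

workflow KOFAMSCAN_DB {
    main:
    if ( params.kofamscan_db ) {
        // The user already has the database: just stage the existing directory.
        ch_db = Channel.fromPath( params.kofamscan_db, checkIfExists: true )
    } else {
        // No database given: let the pipeline download it (currently via wget).
        KOFAMSCAN_DOWNLOAD()
        ch_db = KOFAMSCAN_DOWNLOAD.out.db
    }

    emit:
    db = ch_db
}
```

With something like this, running with e.g. `--kofamscan_db /path/to/existing/db` (a hypothetical flag) would bypass the download entirely, and the download module would only be the fallback.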

@erikrikarddaniel
Member

One thought: EUKulele downloads things automatically in search mode too, right? Would it be better to rely on that and scrap the download module? Or would that make it more difficult to make sure files are available after a run?

@tfalkarkea
Collaborator

I will skip the download modules if the user specifies an available database, and ensure the files are staged properly for their downstream modules too.

@erikrikarddaniel
Member

Just to be clear: when you check whether the download module should be called or not, check for the existence of at least one file (see the second if clause in subworkflows/local/eggnog.nf).

For EUKulele this file could e.g. be eukulele/$db/reference.pep.fa (where $db is the name of the database); for KOFAMSCAN, kofamscan/db/ko_list.

We might want to make larger changes later (see above discussion) but this should get us started.
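A minimal sketch of that check (hypothetical parameter and module names, modelled on the pattern described above rather than copied from eggnog.nf):

```nextflow
// Hypothetical check: only call the download module if a marker file from the
// database is not already present, as suggested above.

include { EUKULELE_DOWNLOAD } from '../../modules/local/eukulele_download'

workflow EUKULELE_DB {
    main:
    def db     = params.eukulele_db
    def marker = file( "${params.eukulele_dbpath}/${db}/reference.pep.fa" )

    if ( marker.exists() ) {
        // At least one expected file is present: reuse the existing database.
        ch_db = Channel.fromPath( "${params.eukulele_dbpath}/${db}", type: 'dir' )
    } else {
        // Otherwise let EUKulele's own tool fetch it.
        EUKULELE_DOWNLOAD( db )
        ch_db = EUKULELE_DOWNLOAD.out.db
    }

    emit:
    db = ch_db
}
```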

@tfalkarkea
Collaborator

I'll have to look into doing this outside of the module's context. I think the core of the problem is that staging the files in the projectDir isn't helpful in the Tower context, since files are moved around and symlinked automatically. I'll definitely try to test for input file completeness, but I'll have to think of some solutions here.
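One possible direction (a hedged sketch, assuming the database directory can be passed to the process as a path input so Nextflow, and hence Tower, stages it into the task work directory instead of the process reading from projectDir):

```nextflow
// Hypothetical process: take the database as a `path` input so that Nextflow
// (and Tower) stages it into the task work directory, instead of the process
// reaching into projectDir directly.
process KOFAMSCAN_RUN {
    input:
    path proteins
    path ko_db    // staged by Nextflow regardless of where it actually lives

    output:
    path "kofamscan.tsv", emit: results

    script:
    """
    exec_annotation -p $ko_db/profiles -k $ko_db/ko_list -o kofamscan.tsv $proteins
    """
}
```

Whether the directory comes from a user parameter or from a download module, the process then only ever sees the staged copy, which should behave the same locally and on Tower.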

@erikrikarddaniel
Member

I think the subworkflows are the best place, just like in the EggNOG case; see above.

@danilodileo
Collaborator

For the EUKulele download we are addressing the issue in PR #190.

@danilodileo
Collaborator

Today we agreed on closing this issue.
