Use indexed BLAST DB if already exists #124

Open
wants to merge 13 commits into
base: master

Conversation

cnthornton

Currently, when given the --proteins argument, prokka always creates an indexed BLAST database with the makeblastdb command, regardless of whether one already exists. This modification searches the directory containing the user-supplied protein database for an indexed database with the same prefix as the supplied FASTA file. If one exists, prokka uses it instead of creating a new one.
@tseemann tseemann self-assigned this Jun 28, 2015
@tseemann
Owner

@cnthornton Thanks for this request - I have some comments:

I assume you want to provide a large database to Prokka but not have to format it each time?

If so, this is related to Issue #90

The code you provide is fine, except I am not sure that parsing for a .XXX suffix is correct. What happens if the original file is "refseq.proteins"? It will look for "refseq.pin" rather than "refseq.proteins.pin".

I think what we want to do is allow the --proteins option to be a BLAST database name. If the --proteins file doesn't exist at all, just check for a .pal or .pin and go. No need to extract any .XXX suffix.
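The check suggested above could be sketched roughly as follows. This is only an illustration, not code from the pull request or from prokka itself: the helper name `classify_proteins_arg` and its return values are hypothetical, and it assumes a protein BLAST database is recognisable by a `.pin` index or `.pal` alias file next to the given name.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of prokka): decide how to treat the value
# passed to --proteins.
#   'fasta'   - a plain file exists at that path, so it still needs makeblastdb
#   'blastdb' - no such file, but <path>.pal or <path>.pin exists
#   'missing' - neither was found
sub classify_proteins_arg {
    my ($path) = @_;
    return 'fasta' if -f $path;
    for my $ext (qw(pal pin)) {
        # No suffix is stripped: the extension is appended to the full name.
        return 'blastdb' if -f "$path.$ext";
    }
    return 'missing';
}
```

With a layout like `/db/refseq.proteins.pin` and no `/db/refseq.proteins` file, `classify_proteins_arg('/db/refseq.proteins')` returns `'blastdb'`, which is the point at which a msg() line could tell the user that an existing database is being reused.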

The other big problem is that later on the files get deleted!

```perl
# Cleanup step: removes the temporary BLAST index files from the output directory
if ($proteins) {
  delfile( map { "$outdir/proteins.$_" } qw(psq phr pin) );
}
```

Does this make sense?
If so, can you amend the pull request?
You should also add in a msg() line to inform the user what is happening.

@cnthornton
Author

Your assumption is correct. The protein database that I have been giving prokka is large, although not excessively so. And while makeblastdb does not contribute significantly to the run time, it does add up when run frequently enough.

You are right though - there really is no need to strip away the portion presumed to be the file extension. I will fix this, as well as add an associated message, when I get back from my upcoming sampling expedition next Thursday.

As for your other point, I don't think that that section of code will be a problem. The files being deleted there should not exist anyway unless somehow the indexed dbs being provided are in the output directory, which I think should never be the case. However, it is sloppy coding on my end. I should have modified that section to check the existence of one or more of the actual db files in the output directory, and if they exist delete them.
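A minimal sketch of that existence check, under stated assumptions: the helper name `remove_temp_blast_index` is hypothetical, and it calls Perl's built-in `unlink` directly rather than prokka's `delfile`, whose exact behaviour is not shown in this thread. The `proteins` prefix and the psq/phr/pin extensions are taken from the snippet quoted above.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of prokka): delete only those temporary
# BLAST index files that are actually present in the output directory.
# Returns the number of files removed.
sub remove_temp_blast_index {
    my ($outdir) = @_;
    my @present = grep { -e $_ } map { "$outdir/proteins.$_" } qw(psq phr pin);
    return 0 unless @present;    # nothing to clean up
    return unlink @present;      # unlink returns the count of deleted files
}
```

Because the paths are built from the output directory only, a pre-built database supplied from elsewhere via --proteins would not be touched by this cleanup.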

2 participants