Partial v Complete Genes for Metagenomic Analysis #283

Open
JChristopherEllis opened this issue Feb 26, 2018 · 8 comments

Comments

@JChristopherEllis

Hi,

I would like to see the partial genes and the complete genes when performing metagenomic analysis. Is there a way to identify both?

Thanks,
micromania

@tseemann tseemann self-assigned this Feb 27, 2018
@tseemann
Owner

What do you mean by partial genes?
True pseudo/broken genes?
Genes artificially broken by contig ends? Or by mis-assemblies?

@JChristopherEllis
Author

Sorry for the confusion. I am referring to genes that are artificially broken by contig ends.

@JChristopherEllis
Author

Or really anything that would yield a partial protein product. I would like to be able to tell the difference between partial protein sequences and complete protein sequences in my metagenomic data.

@jvollme

jvollme commented Mar 12, 2018

Hi micromania2,

I had the same interest (specifically for metagenomic bins and single-cell genomes). Since I had the impression that nobody else wanted that feature, I simply made a small change to the "prodigal" call of my locally installed prokka version.

Prokka originally calls the ORF caller prodigal with the "-c" argument (for "closed ends"), which does not let ORFs run off the contig ends. To remove it, you simply have to edit line 961 in the prokka script from:

my $cmd = "prodigal -i \Q$outdir/$prefix.fna\E -c -m -g $gcode -p $prodigal_mode -f sco -q";

to

my $cmd = "prodigal -i \Q$outdir/$prefix.fna\E -m -g $gcode -p $prodigal_mode -f sco -q"; #removed "-c" argument

Now you will also get all genes that are artificially broken by contig ends (however, they will not be specifically marked as such).
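
The same two prodigal calls can also be reproduced outside prokka, without touching the script. The following is only a minimal sketch, assuming prodigal is on the PATH: the function name `prodigal_command`, the default values, and the file name `contigs.fna` are hypothetical, while the flags are the ones from the prokka line above.

```python
import shlex


def prodigal_command(fasta, gcode=11, mode="meta", closed_ends=True):
    """Build the prodigal argument list used in the prokka line above.

    closed_ends=True keeps "-c" (genes may not run off contig edges);
    closed_ends=False drops it, so edge-truncated genes are reported too.
    The defaults here are illustrative assumptions, not prokka's settings.
    """
    cmd = ["prodigal", "-i", fasta]
    if closed_ends:
        cmd.append("-c")
    cmd += ["-m", "-g", str(gcode), "-p", mode, "-f", "sco", "-q"]
    return cmd


if __name__ == "__main__":
    # Print both variants; pass the list to subprocess.run() to execute one.
    for closed in (True, False):
        print(shlex.join(prodigal_command("contigs.fna", closed_ends=closed)))
```

Comparing the two outputs on the same contigs shows which gene calls only appear once "-c" is dropped.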

(Edit: this is related to #88 btw)

@tseemann
Owner

I agree that Prokka should be allowing partial genes AND annotating them as such.

I am rethinking the whole design of Prokka, esp in terms of metagenomes.

@novigit

novigit commented Mar 29, 2018

Hi! I would just like to mention that @jennahd, @lguy and I submitted pull request #219 a while ago, which deals exactly with the problem of partial genes at contig edges.

Simply changing the prodigal line to remove the -c flag is not enough! For example, in the resulting GenBank files, gene coordinates should be annotated with '<1' (if partial at the start of the contig) or '>5234' (if partial at the end of a contig with length 5234). We recommend using this version with the flag (--partialgenes), which should deal with the problem automatically!
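
To make the '<1' / '>5234' notation concrete, here is a small sketch of how a location string could be built for a gene that runs off a contig edge. It is not the code from the pull request. The function name `genbank_location` is hypothetical, and the two-digit `partial` code is assumed to follow the convention prodigal uses in its own output (first digit for the left edge, second digit for the right edge, "1" meaning the gene is unfinished at that edge).

```python
def genbank_location(start, end, strand, partial):
    """Return a GenBank-style location string for one gene.

    start, end : 1-based coordinates with start <= end
    strand     : "+" or "-"
    partial    : two-character code, assumed prodigal-style; the first
                 digit refers to the left (low-coordinate) edge, the
                 second to the right (high-coordinate) edge
    """
    if len(partial) != 2 or set(partial) - {"0", "1"}:
        raise ValueError(f"unexpected partial code: {partial!r}")
    left = f"<{start}" if partial[0] == "1" else str(start)
    right = f">{end}" if partial[1] == "1" else str(end)
    loc = f"{left}..{right}"
    return f"complement({loc})" if strand == "-" else loc


if __name__ == "__main__":
    print(genbank_location(1, 200, "+", "10"))      # <1..200
    print(genbank_location(4900, 5234, "-", "01"))  # complement(4900..>5234)
```

The '<' and '>' symbols stay in coordinate order even on the reverse strand, which is why the second example wraps the whole range in `complement(...)` instead of swapping the markers.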

Hope the pull request will be implemented in the main software at some point.

@JChristopherEllis
Author

I went back and used prodigal to differentiate the full-length genes from the fragmented genes. I then separated them into two files: one with full-length genes and the other with only fragmented putative genes. I passed these two files back through prokka for functional annotation.
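
One way such a split could be done is sketched below. This is an assumption about the approach, not necessarily the exact procedure used here: it relies on the `partial=` tag that prodigal writes into the FASTA headers of its gene and protein output, where `partial=00` marks a gene with both a start and a stop. The function name `split_by_partial` and the demo records are hypothetical.

```python
import re

# prodigal FASTA headers carry a tag such as "partial=10"
PARTIAL_RE = re.compile(r"partial=([01]{2})")


def split_by_partial(lines):
    """Split FASTA lines into (complete, partial) lists of record strings.

    A record counts as complete only if its header contains partial=00.
    """
    complete, partial = [], []
    target = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            match = PARTIAL_RE.search(line)
            if match is None:
                raise ValueError(f"no partial= tag in header: {line}")
            target = complete if match.group(1) == "00" else partial
            target.append([line])
        elif line and target is not None:
            target[-1].append(line)  # sequence line of the current record
    return (["\n".join(rec) for rec in complete],
            ["\n".join(rec) for rec in partial])


if __name__ == "__main__":
    demo = [
        ">c1_1 # 2 # 181 # 1 # ID=1_1;partial=10;start_type=Edge",
        "MKVLA",
        ">c1_2 # 300 # 600 # -1 # ID=1_2;partial=00;start_type=ATG",
        "MAAQK",
        "LLK*",
    ]
    full, frag = split_by_partial(demo)
    print(len(full), "complete;", len(frag), "partial")
```

Writing the two lists to separate files gives the full-length and fragmented sets described above.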

The full-length putative genes worked well, with almost all of them functionally annotated when passed back through prokka.

However, for the fragmented genes, only about 1/3 of them were identified by Prokka. I think this may be an issue with the options I am using. Is there something I could do differently to restore the functional annotation calls to what they were before I split the fragmented and full-length sequences into separate files?

Here is the command line...

prokka --outdir --prefix <PREFIX_NAME> --notrna --metagenome --cpus <#CPUs> --addgenes <FILE_NAME>

@ankeetkumar

ankeetkumar commented Aug 21, 2023

> Hi! I would just like to mention that @jennahd, @lguy and I submitted pull request #219 a while ago, which deals exactly with the problem of partial genes at contig edges.
>
> Simply changing the prodigal line to remove the -c flag is not enough! For example, in the resulting GenBank files, gene coordinates should be annotated with '<1' (if partial at the start of the contig) or '>5234' (if partial at the end of a contig with length 5234). We recommend using this version with the flag (--partialgenes), which should deal with the problem automatically!
>
> Hope the pull request will be implemented in the main software at some point.

Dear Sir,

I am trying to annotate a viral genome, and due to the lack of coverage, most of my genes are partial.

When I make submissions to BankIt, I get an error that says the gene starts with a downstream methionine, and I haven't labelled partial genes.

How do I add the partial flag to the genes that are partial? Also, I sometimes see Prokka break a gene into two and label the pieces Gene1_1 and Gene1_2. How can I solve that?

Thank you in advance.

Regards,
Ankeet
