Geneid Accuracy
Emilio Righi edited this page Jan 19, 2023 · 1 revision
Because of the lack of well annotated large genomic sequences, it is
difficult to assess the accuracy of "ab initio" gene finders. We have
attempted to analyze the accuracy of geneid on a number of different
sets. On this page, we present the results compared with
genscan.
Accuracy measures are the usual ones (see the page
Accuracy of Gene Identification Programs on this server for a description).
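As a reading aid for the column headers used in the tables on this page, here is a minimal sketch of how such measures are commonly computed. It assumes the conventional gene-finding definitions (Sp as the fraction of predictions that are correct, CC as the correlation coefficient over per-base coding calls, an exon counted as correct only when both boundaries match, ME and WE as the fractions of annotated and predicted exons with no overlapping counterpart). This page does not spell the definitions out, so treat them, and the function names `nucleotide_accuracy` and `exon_accuracy`, as assumptions rather than the exact evaluation code behind the tables.

```python
import math


def nucleotide_accuracy(tp, fp, tn, fn):
    """Nucleotide-level Sn, Sp and CC from per-base counts.

    tp/fp/tn/fn count bases that are coding/non-coding in the annotation
    versus the prediction. Assumes no count combination makes a
    denominator zero.
    """
    sn = tp / (tp + fn)  # annotated coding bases that were predicted coding
    sp = tp / (tp + fp)  # predicted coding bases that are annotated coding
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    cc = (tp * tn - fn * fp) / denom
    return sn, sp, cc


def exon_accuracy(annotated, predicted):
    """Exon-level measures from (start, end) exon coordinates (inclusive)."""
    annotated, predicted = set(annotated), set(predicted)

    def overlaps(a, b):
        return a[0] <= b[1] and b[0] <= a[1]

    # Correct exons match the annotation at both boundaries.
    correct = annotated & predicted
    # Missed exons: annotated exons with no overlapping prediction.
    missed = [a for a in annotated
              if not any(overlaps(a, p) for p in predicted)]
    # Wrong exons: predicted exons with no overlapping annotated exon.
    wrong = [p for p in predicted
             if not any(overlaps(p, a) for a in annotated)]

    sn = len(correct) / len(annotated)
    sp = len(correct) / len(predicted)
    return {
        "Sn": sn,
        "Sp": sp,
        "SnSp": (sn + sp) / 2,  # average of exon Sn and exon Sp
        "ME": len(missed) / len(annotated),
        "WE": len(wrong) / len(predicted),
    }
```

The SnSp column in the tables is consistent with the averaging used here: for geneid on h178, (0.66 + 0.74) / 2 = 0.70.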
- h178 This is a set of well annotated human sequences, the sort of dataset on which genefinders are typically evaluated. This set contains only single gene sequences, it is biased towards short genes, and it is likely to have been included in the training sets of genefinders. Therefore, it is unclear how well results on this set can be extrapolated to large genomic sequences. genscan outperforms geneid on this set, although geneid tends to predict fewer false positives.
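  | Program | Nucleotide Sn | Nucleotide Sp | Nucleotide CC | Exon Sn | Exon Sp | Exon SnSp | Exon ME | Exon WE |
  |---|---|---|---|---|---|---|---|---|
  | geneid | 0.89 | 0.90 | 0.89 | 0.66 | 0.74 | 0.70 | 0.18 | 0.08 |
  | genscan | 0.97 | 0.86 | 0.91 | 0.83 | 0.75 | 0.79 | 0.06 | 0.14 |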
- h178Art To simulate large genomic sequences, the single gene sequences in h178 have been embedded in simulated intergenic DNA. Thus, the 178 single gene sequences have been collapsed into 42 multigene sequences of about 200 Kb each. Although preliminary results seem to indicate that the actual accuracy of genefinders can be better estimated on this data set than on the original set of single gene sequences, it is unclear how realistic this procedure is, and therefore how well results on this set can be extrapolated to large genomic sequences. While genscan outperformed geneid on the set of single gene sequences, geneid outperforms genscan on this set. genscan, however, is still superior in sensitivity.
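  | Program | Nucleotide Sn | Nucleotide Sp | Nucleotide CC | Exon Sn | Exon Sp | Exon SnSp | Exon ME | Exon WE |
  |---|---|---|---|---|---|---|---|---|
  | geneid | 0.89 | 0.79 | 0.84 | 0.67 | 0.61 | 0.64 | 0.18 | 0.25 |
  | genscan | 0.94 | 0.64 | 0.78 | 0.68 | 0.45 | 0.57 | 0.11 | 0.42 |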
- h800 This is a set of 800 human single gene sequences present in embl release 56 but not in embl release 50. Therefore, these sequences are less likely to have been included in the training sets of current genefinders (including the current version of geneid). It is unclear, however, how well annotated these sequences are. In particular, they may include unannotated genes, and genscan may have already been used to annotate some of them. genscan and geneid perform comparably: genscan is more sensitive, while geneid produces fewer false positives.
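  | Program | Nucleotide Sn | Nucleotide Sp | Nucleotide CC | Exon Sn | Exon Sp | Exon SnSp | Exon ME | Exon WE |
  |---|---|---|---|---|---|---|---|---|
  | geneid | 0.89 | 0.82 | 0.85 | 0.59 | 0.63 | 0.61 | 0.19 | 0.18 |
  | genscan | 0.95 | 0.77 | 0.85 | 0.71 | 0.58 | 0.64 | 0.09 | 0.28 |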
- h800Art The 800 human sequences above, embedded in simulated intergenic DNA and collapsed into 195 multigene sequences. geneid outperforms genscan on this set.
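  | Program | Nucleotide Sn | Nucleotide Sp | Nucleotide CC | Exon Sn | Exon Sp | Exon SnSp | Exon ME | Exon WE |
  |---|---|---|---|---|---|---|---|---|
  | geneid | 0.90 | 0.76 | 0.83 | 0.60 | 0.52 | 0.56 | 0.19 | 0.32 |
  | genscan | 0.93 | 0.61 | 0.77 | 0.62 | 0.36 | 0.49 | 0.12 | 0.50 |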
- SGR-C1 This is a set of 25 real genomic sequences of about 100 Kb each from chromosome 1, annotated at the Sanger Center. It is unclear how exhaustive the annotation is, and it is likely that genscan has been used to obtain it. genscan slightly outperforms geneid on this set.
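  | Program | Nucleotide Sn | Nucleotide Sp | Nucleotide CC | Exon Sn | Exon Sp | Exon SnSp | Exon ME | Exon WE |
  |---|---|---|---|---|---|---|---|---|
  | geneid | 0.81 | 0.34 | 0.57 | 0.56 | 0.34 | 0.45 | 0.28 | 0.57 |
  | genscan | 0.91 | 0.33 | 0.61 | 0.68 | 0.29 | 0.48 | 0.15 | 0.64 |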
- Chromosome 22 We ran geneid on the sequence of chromosome 22 and compared the predictions with two different annotations. The released genscan predictions were also compared with these two sets of annotations. The completeness of the annotation of the chromosome 22 sequence is unclear, and genscan may have already been used to annotate it. geneid and genscan perform comparably on this set.
- annotation 1
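    | Program | Nucleotide Sn | Nucleotide Sp | Nucleotide CC | Exon Sn | Exon Sp | Exon SnSp | Exon ME | Exon WE |
    |---|---|---|---|---|---|---|---|---|
    | geneid | 0.23 | 0.30 | 0.25 | 0.56 | 0.22 | 0.39 | 0.19 | 0.66 |
    | genscan | 0.23 | 0.31 | 0.26 | 0.58 | 0.19 | 0.38 | 0.14 | 0.69 |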
- annotation 2
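    | Program | Nucleotide Sn | Nucleotide Sp | Nucleotide CC | Exon Sn | Exon Sp | Exon SnSp | Exon ME | Exon WE |
    |---|---|---|---|---|---|---|---|---|
    | geneid | 0.27 | 0.43 | 0.34 | 0.53 | 0.32 | 0.43 | 0.21 | 0.50 |
    | genscan | 0.28 | 0.45 | 0.35 | 0.54 | 0.28 | 0.41 | 0.17 | 0.55 |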
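
The h178Art and h800Art sets above were built by embedding single gene sequences in simulated intergenic DNA. The sketch below illustrates one way such artificial multigene sequences could be assembled. This page does not say how the intergenic DNA was simulated, so the uniform random spacer model, the function name `embed_genes`, and the parameters `target_len`, `min_gap` and `max_gap` are all hypothetical and chosen only for illustration.

```python
import random


def embed_genes(genes, target_len=200_000, min_gap=1_000, max_gap=10_000, seed=0):
    """Join single gene sequences with random spacers into multigene sequences.

    genes: list of (name, sequence) pairs.
    Returns a list of (sequence, placements) pairs, where placements is a
    list of (name, offset) giving the 0-based start of each gene in the
    artificial sequence, so that annotations can be shifted accordingly.
    """
    rng = random.Random(seed)
    contigs = []
    parts, placements, length = [], [], 0

    for name, seq in genes:
        # ASSUMPTION: intergenic DNA is modeled as uniform random bases.
        gap = rng.randint(min_gap, max_gap)
        spacer = "".join(rng.choices("ACGT", k=gap))
        parts.append(spacer)
        length += gap

        placements.append((name, length))
        parts.append(seq)
        length += len(seq)

        # Start a new artificial sequence once the target length is reached.
        if length >= target_len:
            contigs.append(("".join(parts), placements))
            parts, placements, length = [], [], 0

    if parts:
        contigs.append(("".join(parts), placements))
    return contigs
```

Each recorded offset is what an evaluation would need in order to translate the original exon coordinates of a gene into coordinates on the artificial sequence before comparing them with predictions.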