Home
Geneid v 1.2 Documentation
Enrique Blanco García · Genís Parra · Roderic Guigó Serra
Table of contents
- Description
- Main Features
- Examples
- Training Geneid
- Gene predictions on genomes
- Accuracy
- Speed
- Source Code Distribution
- Geneid Parameter Files
- Web Server
- If you encounter problems
- References
- Authors and acknowledgements
- geneid on line
- Geneid authors email address
Geneid is a program to predict genic elements such as splice sites, exons and genes along eukaryotic DNA sequences. geneid can integrate predictions from multiple sources and can improve the quality of its predictions by using sequence homology information.
geneid is designed with a hierarchical structure:
- In the first step, splice sites, start and stop codons are predicted and scored along the sequence using Position Weight Arrays (PWAs).
- In the second step, exons are built from the sites. Exons are scored as the sum of the scores of the defining sites, plus the log-likelihood ratio of a Markov Model for coding DNA.
- Finally, from the set of predicted exons, the gene structure is assembled, maximizing the sum of the scores of the assembled exons.
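The second and third steps can be illustrated with a small sketch. This is not geneid's implementation: the `Exon` record and the functions `exon_score` and `assemble` are hypothetical names, the scores are made-up numbers, and the assembly only requires that the chosen exons do not overlap. It ignores any gene-model constraints (for example reading-frame compatibility) that a real gene assembler has to respect.

```python
from bisect import bisect_right
from typing import NamedTuple


class Exon(NamedTuple):
    start: int           # first position of the exon (inclusive)
    end: int             # last position of the exon (inclusive)
    site_scores: tuple   # scores of the two defining sites (hypothetical values)
    coding_llr: float    # log-likelihood ratio of the coding model (hypothetical)


def exon_score(exon):
    """Sum of the defining site scores plus the coding log-likelihood ratio."""
    return sum(exon.site_scores) + exon.coding_llr


def assemble(exons):
    """Choose non-overlapping exons that maximize the sum of exon scores."""
    exons = sorted(exons, key=lambda e: e.end)
    ends = [e.end for e in exons]
    n = len(exons)
    best = [0.0] * (n + 1)    # best[i]: best total using the first i exons
    take = [False] * (n + 1)  # take[i]: whether exon i is part of best[i]
    prev = [0] * (n + 1)      # prev[i]: prefix to continue from if exon i is taken
    for i, exon in enumerate(exons, start=1):
        # Number of earlier exons that end strictly before this exon starts.
        p = bisect_right(ends, exon.start - 1, 0, i - 1)
        with_exon = best[p] + exon_score(exon)
        if with_exon > best[i - 1]:
            best[i], take[i], prev[i] = with_exon, True, p
        else:
            best[i] = best[i - 1]
    # Trace back the chosen exons.
    chain, i = [], n
    while i > 0:
        if take[i]:
            chain.append(exons[i - 1])
            i = prev[i]
        else:
            i -= 1
    return best[n], chain[::-1]


candidates = [
    Exon(10, 50, (1.0, 2.0), 3.0),
    Exon(40, 90, (2.0, 2.0), 5.0),
    Exon(60, 95, (0.5, 0.5), 1.0),
    Exon(100, 150, (1.0, 1.0), 2.0),
]
total, chain = assemble(candidates)
print(total, [(e.start, e.end) for e in chain])
```

In this toy input the exons at 10-50 and 40-90 overlap, so only the higher-scoring one can be kept; the sketch selects 40-90 and 100-150.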
Currently, geneid v1.2 analyzes the whole human genome in 3 hours (approx. 1 Gbp / hour) on an Intel(R) Xeon CPU 2.80 GHz processor.
The program is written in ANSI C and runs on UNIX operating systems such as Linux, Solaris or Irix. Installation, setup and usage are quick and simple, and a wide range of options is available to configure the behaviour of the program engine.
Geneid source code, compiled binaries and documentation are available under the GNU GENERAL PUBLIC LICENSE.
Comments, suggestions and other questions will be warmly welcomed by the authors.
- geneid accuracy is comparable to that of other existing "ab initio" gene prediction tools.
- geneid is very efficient in terms of speed and memory usage. In practice, geneid can analyze chromosome-size sequences at a rate of about 1 Gbp per hour on an Intel(R) Xeon CPU 2.80 GHz. For the largest human chromosome (chr1), it requires 1/2 Gbyte of RAM plus the size of the FASTA sequence.
- geneid can integrate predictions from multiple sources (ESTs, BLAST HSPs) and reannotate genomic sequences, using external GFF files together with a redefinition of the "gene model".
- geneid output can be customized to different levels of detail, including an exhaustive listing of potential signals and exons. Several output formats, such as GFF or XML, are also available.
- Parameter files are available in geneid v 1.2 for Drosophila melanogaster, human (which can also be used for other vertebrate genomes), Dictyostelium discoideum and Tetraodon nigroviridis (which can be used for Fugu rubripes), among many others for species spanning the four "classical" kingdoms. The additional parameter files currently available are listed in the section "Geneid Parameter Files".
- INPUT: test.fa
- Prediction of acceptor splice sites
- Prediction of exons
- Gene prediction
- Improving gene prediction by using re-annotation
- Improving gene prediction by using homology information
To build a parameter file for geneid, it is necessary to "train" the program; parameter configurations already exist for a number of eukaryotic species. Training basically consists of computing position weight matrices (PWMs) or Markov models for the splice sites and start codons, and deriving a model for coding DNA (generally a Markov model of order 4 or 5). The basic requirements for a training set are an annotation file (preferably in geneid GFF format) and a set of FASTA sequences corresponding to the gene models in the annotation file.
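As an illustration of what computing a position weight matrix involves, here is a minimal sketch. It assumes a uniform background, a pseudocount of 1 and toy two-base sites; the function names `train_pwm` and `score_site` are hypothetical and this is not the procedure used by the geneid training scripts, which is described in the training tutorial.

```python
import math
from collections import Counter

ALPHABET = "ACGT"


def train_pwm(sites, background=None, pseudocount=1.0):
    """Build a log-odds matrix (one dict per position) from aligned sites."""
    background = background or {base: 0.25 for base in ALPHABET}
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all training sites must have the same length")
    total = len(sites) + pseudocount * len(ALPHABET)
    pwm = []
    for pos in range(length):
        counts = Counter(site[pos] for site in sites)
        pwm.append({
            base: math.log2((counts[base] + pseudocount) / total / background[base])
            for base in ALPHABET
        })
    return pwm


def score_site(pwm, candidate):
    """Score a candidate site as the sum of its per-position log-odds."""
    return sum(column[base] for column, base in zip(pwm, candidate))


# Toy training set: four aligned sites of length 2 (made-up data).
pwm = train_pwm(["AG", "AG", "AG", "CG"])
print(score_site(pwm, "AG"), score_site(pwm, "TT"))
```

A candidate that resembles the training sites gets a positive score, while one made of bases rarely seen at those positions gets a negative score.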
As few as 100 gene models can be enough to build a reasonably accurate geneid parameter file, but a user would generally want as many sequences as possible (> 500) to build an optimally accurate matrix and to be able to set aside some of the gene models for testing purposes (see training document).
If a user wants to evaluate the accuracy of the newly developed parameter file, she will also require an annotation file and FASTA files corresponding to the sequences in the evaluation set. However, if a user has only a limited number of gene models to train geneid with (generally < 500 sequences), she can use a "leave-one-out" strategy for evaluating the accuracy (more information in the training tutorial).
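The general shape of a leave-one-out evaluation is sketched below. The `train` and `evaluate` callables are placeholders that the user would supply (they are not geneid functions), and the toy stand-ins at the end exist only to make the sketch runnable.

```python
def leave_one_out(gene_models, train, evaluate):
    """Train on all gene models but one, test on the held-out one, and average."""
    scores = []
    for i, held_out in enumerate(gene_models):
        training_set = gene_models[:i] + gene_models[i + 1:]
        params = train(training_set)                  # hypothetical training step
        scores.append(evaluate(params, held_out))     # hypothetical accuracy measure
    return sum(scores) / len(scores)


# Toy stand-ins: "training" remembers the models it saw, and "evaluation"
# returns 1.0 only if the held-out model was not part of the training set.
models = ["gene_a", "gene_b", "gene_c"]
seen_sizes = []


def toy_train(training_set):
    seen_sizes.append(len(training_set))
    return set(training_set)


def toy_evaluate(params, held_out):
    return 0.0 if held_out in params else 1.0


print(leave_one_out(models, toy_train, toy_evaluate))
```

Every gene model is used for testing exactly once and is never part of the set used to train the parameters it is tested against, which is what makes the strategy usable with small training sets.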
The user can go through an example of a typical geneid "training" protocol (Training geneid for the parasite Perkinsus marinus) by following this tutorial.
This link contains the sets of genes predicted by geneid on recently sequenced genomes (Drosophila melanogaster, Homo sapiens, Mus musculus, Fugu rubripes and Dictyostelium discoideum) for some of their most common releases.
Because of the lack of well annotated large genomic sequences, it is
difficult to assess the accuracy of "ab initio" gene finders. We have
attempted to analyze the accuracy of geneid in a number of different
sets. We believe that in the analysis of large genomic sequences geneid may be superior to other existing tools. A side by side comparison with
genscan can be found here.
The benchmark sequence is human chromosome 1 (239 Mb), extracted from the goldenPath-UCSC assembly (July 2003 release):
Computer | Intel Pentium Intel(R) Xeon CPU 2.80 GHz, 4 GB RAM |
CPU / real time | 1025 s / 1045 s |
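The benchmark figures above can be converted into a throughput with a short calculation. The only inputs are the sequence length (239 Mb) and the CPU time (1025 s) reported in the benchmark; the variable names are illustrative.

```python
CHR1_BP = 239_000_000   # benchmark sequence length in base pairs (239 Mb)
CPU_SECONDS = 1025      # CPU time reported for the benchmark

bp_per_second = CHR1_BP / CPU_SECONDS
bp_per_hour = bp_per_second * 3600

print(f"{bp_per_hour / 1e9:.2f} Gbp/hour")  # prints "0.84 Gbp/hour"
```

This is in line with the rate of about 1 Gbp per hour quoted elsewhere in this document.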
The geneid distribution contains several directories and files compressed in a tar.gz file. Source code and documentation files are included in the distribution, as well as several parameter files and other extra information.
All of the files can be obtained from our ftp server:
Cumulative change log: ChangeLog
geneid v 1.4:
geneid v 1.4.4 full distribution: source code and documentation
(documentation does not yet reflect new features; for help, type geneid -h)
[DOWNLOAD]
Note: Please, verify the check-sum file value
Type: md5sum geneid_v1.4.4.Jan_13_2011.tar.gz
-> 05c00f283a8fa996418aff0bc8db1c6d
geneid v 1.4.4 (current development version):
(documentation does not yet reflect new features; for help, type geneid -h)
[DOWNLOAD]
Note: Please, verify the check-sum file value
Type: md5sum geneid_v1.4.4.Jan_13_2011.tar.gz
-> 05c00f283a8fa996418aff0bc8db1c6d
geneid v 1.3:
geneid v 1.3 preview release 3 (version used for NGASP phase II category 4):
(documentation does not yet reflect new features; for help, type geneid -h)
[DOWNLOAD]
Note: Please, verify the check-sum file value
Type: md5sum geneid_v1.3.Mar_30_2007.tar.gz
-> 10cad4e6ae25a57fcc6bb062692626ae
geneid v 1.3 preview release 1 (version used for NGASP phase I category 1):
geneid v 1.3 full distribution: source code and documentation (documentation does not yet reflect new features; for help, type geneid -h)
[DOWNLOAD]
Note: Please, verify the check-sum file value
Type: md5sum geneid_v1.3.Dec_21_2006.tar.gz
-> 1ff0f870e5ec5a553e4603102a9d7c62
geneid v 1.2:
- geneid v 1.2 Solaris 64-bits distribution (Makefiles optimized by Mithun Sridharan, Sun Microsystems GmbH) [FULL VERSION - DOWNLOAD] [BINARY FILE]
- geneid v 1.2 Linux binary (gcc version 3.3.1) [DOWNLOAD]
- geneid v 1.2 documentation (HTML) [READ]
geneid v 1.2 full distribution: source code and documentation
[DOWNLOAD]
Note: Please, verify the check-sum file value
Type: md5sum geneid_v1.2.March_1_2005.tar.gz
-> 6f350210ead7e49ac76be1fd17ef91f9
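Where the `md5sum` command is not available, the same check can be done with Python's standard `hashlib` module. The sketch below is one possible way to do it; the helper names `md5_of_file` and `checksum_matches` are illustrative, and the file name and expected value in the usage note are the ones listed above for geneid v 1.2.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(path, expected_md5):
    """Compare the file's MD5 digest with the published value (case-insensitive)."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

For example, `checksum_matches("geneid_v1.2.March_1_2005.tar.gz", "6f350210ead7e49ac76be1fd17ef91f9")` should return `True` for an intact download of that archive.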
Instructions to install geneid on your computer.
Geneid Chapter's Documentation
1. Signals, exons and genes
2. How to install geneid
3. Running Geneid
4. Assembling predicted exons into gene structures
5. Introducing external information: annotations
6. Using sequence homology information
7. Geneid parameter file
8. Source Code Documentation
9. Geneid History
Old releases:
geneid v 1.1:
- geneid v 1.1 full distribution: source code and documentation [DOWNLOAD]
- geneid v 1.1 Linux binary (gcc version 2.95 19990728) [DOWNLOAD]
- geneid v 1.1 documentation (HTML) [DOWNLOAD] [READ]
geneid v 1.0:
- geneid v 1.0 full distribution: source code and documentation [DOWNLOAD]
- geneid v 1.0 binary files for some architectures: Linux, SGI and Solaris.
- geneid v 1.0 documentation (PostScript) [DOWNLOAD]
geneid v 1.0 (Parallel version):
- geneid Parallel full distribution: source code and documentation [DOWNLOAD]. Requires UNIX/LINUX pthreads library.
geneid has been trained on several species and is being trained on other genomes as well. See this help for more details about the different parts of the parameter files and their statistical meaning.
ℹ️ The parameter files for geneid v 1.2 are not compatible with previous versions.
The parameter files for geneid v 1.3 and 1.4 are not backward-compatible with previous versions; however, version 1.2 parameter files ARE forward-compatible with versions 1.3 and 1.4.
List of available parameter files (geneid v 1.3 and 1.4):
- Homo sapiens (suitable for vertebrates) (UPDATED - January 2nd, 2007)
- Drosophila melanogaster (suitable for fly and mosquito) (UPDATED - January 2nd, 2007)
- Acyrthosiphon pisum (this version of the aphid parameter file detects GC donors and requires geneid v 1.3 and above)
List of available ANIMAL parameter files (geneid v 1.2 and above):
- Homo sapiens (suitable for vertebrates) (UPDATED - February 22nd, 2006)
- Tetraodon nigroviridis
- Loa loa
- Caenorhabditis elegans (UPDATED - December 20th, 2006)
- Ciona intestinalis
- Oikopleura dioica
- Schistosoma japonicum
- Dinoponera longipes (dino ant)
- Nasonia vitripennis (parasitoid wasp species)
- Polistes canadensis (red paper wasp)
- Polybia occidentalis (camoati wasp) (NEW - Feb 13th, 2017)
- Metapolybia cingulata (paper wasp) (NEW - Feb 13th, 2017)
- Liostenogaster flavolineata (tropical hover wasp) (NEW - Nov 29th, 2017)
- Harpegnathos saltator
- Drosophila melanogaster (suitable for species of fly and mosquito) (UPDATED: Nov 13, 2015)
- Culex pipiens (suitable for species of mosquito)
- Apis mellifera (European honey bee)
- Apis dorsata (Giant bee)
- Apis florea (Dwarf bee)
- Bombus impatiens (common eastern bumble bee)
- Bombus terrestris (buff-tailed bumble bee)
- Acyrthosiphon pisum (pea aphid)
- Cinara cedri (aphid species)
- Rhizoglyphus echinopus (bulbmite) (NEW - July 20th, 2016)
- Aceria tosichella (NEW - June 25th, 2018)
- Gyrodactylus bullatarudis (NEW - May 29th, 2018)
List of available PROTIST parameter files (geneid v 1.2 and above):
- Dictyostelium discoideum
- Perkinsus marinus
- Plasmodium vivax
- Plasmodium falciparum
- Trypanosoma brucei
- Blastocystis hominis
- Cryptosporidium ubiquitum (NEW - September 3rd, 2018)
- Cryptosporidium parvum (NEW - September 3rd, 2018)
- Paramecium tetraurelia (uses codon table 6, in which only TGA is a stop codon; please contact us for the modified version of geneid required to predict on this species)
- Tetrahymena thermophila (uses codon table 6, in which only TGA is a stop codon; please contact us for the modified version of geneid required to predict on this species)
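The effect of codon table 6 on stop codon detection can be seen in a small sketch. The function `first_stop` and the example sequence are illustrative only and unrelated to geneid's code; the sketch simply contrasts the three stop codons of the standard code (table 1) with table 6, where TAA and TAG are read as codons rather than stops.

```python
STANDARD_STOPS = {"TAA", "TAG", "TGA"}   # translation table 1
TABLE6_STOPS = {"TGA"}                   # translation table 6: only TGA is a stop


def first_stop(seq, stops, frame=0):
    """Return the 0-based position of the first in-frame stop codon, or None."""
    seq = seq.upper()
    for i in range(frame, len(seq) - 2, 3):
        if seq[i:i + 3] in stops:
            return i
    return None


orf = "ATGTAACAATGA"  # made-up sequence: ATG TAA CAA TGA
print(first_stop(orf, STANDARD_STOPS), first_stop(orf, TABLE6_STOPS))
```

With the standard code the open reading frame ends at the TAA at position 3; with table 6 it continues to the TGA at position 9. This is why a predictor that assumes the standard stop codons would truncate coding exons in these species.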
List of available PLANT parameter files (geneid v 1.2 and above):
- Triticum aestivum (wheat) (UPDATED - February 12th, 2014)
- Oryza sativa (rice)
- Brachypodium distachyon (a grass spp.)
- Solanaceae (suitable for species of tomato, potato, tobacco and petunia)
- Solanum lycopersicum (tomato)
- Solanum tuberosum (potato)
- Phaseolus vulgaris (common bean)
- Cucumis melo (melon) (UPDATED - January 21st, 2011)
- Cucumis spp. (suitable for species of melon and cucumber)
- Vitis vinifera (grape)
- Arabidopsis thaliana
- Chondrus crispus (red algae)
- Musa acuminata (banana tree)
List of available FUNGI parameter files (geneid v 1.2 and above):
- Emericella nidulans
- Neurospora crassa
- Filobasidiella neoformans
- Coprinopsis cinerea
- Chaetomium globosum
- Stagonospora nodorum
- Rhizopus oryzae
- Sclerotinia sclerotiorum
- Histoplasma capsulatum
- Coccidioides immitis
- Schizosaccharomyces japonicus
- Phytophthora infestans
- Batrachochytrium dendrobatidis
- Puccinia graminis
- Fusarium oxysporum
- Plectosphaerella cucumerina
List of available parameter files for OLDER VERSION OF GENEID (geneid v 1.1):
- Homo sapiens (suitable for vertebrates)
- Drosophila melanogaster
- Tetraodon nigroviridis
- Dictyostelium discoideum
- Plasmodium falciparum
- Triticum aestivum
- Caenorhabditis elegans
- Arabidopsis thaliana
- Oryza sativa
A geneid web server is available to submit sequences over the Internet. There is no limit to the length of the submitted sequence other than that imposed by the Internet (except when plotting is required).
The geneid homepage, which includes more information about the program such as accuracy and efficiency (benchmarking), and the geneid online web server are available here:
If you encounter problems using geneid, or have suggestions on how to improve it, send an e-mail to geneid@crg.es.
- ftp.imim.es in the directory /pub/software/geneid
- E. Blanco, G. Parra and R. Guigo,
"Using geneid to Identify Genes.",
In A. Baxevanis, editor:
Current Protocols in Bioinformatics. Unit 4.3.
John Wiley & Sons Inc., New York (2002) (in press)
- E. Blanco, G. Parra, S. Castellano, J.F. Abril,
M. Burset, X. Fustero, X. Messeguer and R. Guigó
"Gene Prediction in the Post-Genomic Era."
IX th ISMB (Poster), Copenhagen, Denmark (2001)
- G. Parra, E. Blanco, and R. Guigo,
"Geneid in Drosophila",
Genome Research 10(4):511-515 (2000).
- R. Guigo,
"Assembling genes from predicted exons in linear time with dynamic programming",
Journal of Computational Biology, 5:681-702 (1998).
- R. Guigo, S. Knudsen, N. Drake, and T. F. Smith,
"Prediction of gene structure",
Journal of Molecular Biology, 226:141-157 (1992).
The current version of geneid has been written by
Enrique Blanco,
Tyler Alioto and
Roderic Guigó.
The parameter files have been constructed by
Genis Parra,
Tyler Alioto
and Francisco Camara.
With contributions from Josep F. Abril, Moises Burset and Xavier Messeguer.
- Enrique Blanco Garcia: eblanco@imim.es
- Genis Parra Farre: gparra@imim.es
- Roderic Guigo i Serra: rguigo@imim.es
If you need help about geneid, send a message to:
This training tutorial document was prepared by: Francisco Camara
Enrique Blanco Garcia © 2003