An accurate GFF3/GTF lift over pipeline

biogeeker/Liftoff

 
 

Liftoff

Liftoff is a tool that accurately maps annotations in GFF or GTF format between assemblies of the same or closely related species. Unlike current coordinate lift-over tools, which require a pre-generated “chain” file as input, Liftoff is a standalone tool that takes two genome assemblies and a reference annotation as input and outputs an annotation of the target genome. Liftoff uses Minimap2 [(Li, 2018)](https://academic.oup.com/bioinformatics/article/34/18/3094/4994778) to align the gene sequences from a reference genome to the target genome. Aligning only the gene sequences, rather than whole genomes, allows genes to be lifted over even if there are many structural differences between the two genomes. For each gene, Liftoff finds the alignments of the exons that maximize sequence identity while preserving the transcript and gene structure. If two genes incorrectly map to overlapping loci, Liftoff determines which gene is most likely mis-mapped and attempts to re-map it. Liftoff can also find additional gene copies present in the target assembly that are not annotated in the reference.

Getting Started

INSTALLATION

The easiest way to install Liftoff is with the conda package manager.

conda install -c bioconda liftoff

If you don't have conda installed, you need to install Minimap2 (following its own installation instructions) and then install Liftoff either from source (the first three commands below) or with pip (the last command).

git clone https://github.com/agshumate/Liftoff liftoff 
cd liftoff
python setup.py install
pip install Liftoff

USAGE

usage: liftoff [-h] (-g GFF | -db DB) [-o FILE] [-u FILE] [-exclude_partial]
               [-dir DIR] [-mm2_options =STR] [-a A] [-s S] [-d D] [-flank F]
               [-V] [-p P] [-m PATH] [-f TYPES] [-infer_genes]
               [-infer_transcripts] [-chroms TXT] [-unplaced TXT] [-copies]
               [-sc SC] [-overlap O] [-mismatch M] [-gap_open GO]
               [-gap_extend GE]
               target reference

Lift features from one genome assembly to another

Required input (sequences):
  target              target fasta genome to lift genes to
  reference           reference fasta genome to lift genes from

Required input (annotation):
  -g GFF              annotation file to lift over in GFF or GTF format
  -db DB              name of feature database; if not specified, the -g
                      argument must be provided and a database will be built
                      automatically

Output:
  -o FILE             write output to FILE in GFF3 format; by default, output
                      is written to terminal (stdout)
  -u FILE             write unmapped features to FILE; default is
                      "unmapped_features.txt"
  -exclude_partial    write partial mappings below -s and -a threshold to
                      unmapped_features.txt; if true partial/low sequence
                      identity mappings will be included in the gff file with
                      partial_mapping=True, low_identity=True in comments
  -dir DIR            name of directory to save intermediate fasta and SAM
                      files; default is "intermediate_files"

Alignments:
  -mm2_options =STR   space delimited minimap2 parameters. By default ="-a
                      --end-bonus 5 --eqx -N 50 -p 0.5"
  -a A                designate a feature mapped only if it aligns with
                      coverage ≥A; by default A=0.5
  -s S                designate a feature mapped only if its child features
                      (usually exons/CDS) align with sequence identity ≥S; by
                      default S=0.5
  -d D                distance scaling factor; alignment nodes separated by
                      more than a factor of D in the target genome will not be
                      connected in the graph; by default D=2.0
  -flank F            amount of flanking sequence to align as a fraction
                      [0.0-1.0] of gene length. This can improve gene
                      alignment where gene structure differs between target
                      and reference; by default F=0.0

Miscellaneous settings:
  -h, --help          show this help message and exit
  -V, --version       show program version
  -p P                use p parallel processes to accelerate alignment; by
                      default p=1
  -m PATH             Minimap2 path
  -f TYPES            list of feature types to lift over
  -infer_genes        use if annotation file only includes transcripts,
                      exon/CDS features
  -infer_transcripts  use if annotation file only includes exon/CDS features
                      and does not include transcripts/mRNA
  -chroms TXT         comma seperated file with corresponding chromosomes in
                      the reference,target sequences
  -unplaced TXT       text file with name(s) of unplaced sequences to map
                      genes from after genes from chromosomes in chroms.txt
                      are mapped; default is "unplaced_seq_names.txt"
  -copies             look for extra gene copies in the target genome
  -sc SC              with -copies, minimum sequence identity in exons/CDS for
                      which a gene is considered a copy; must be greater than
                      -s; default is 1.0
  -overlap O          maximum fraction [0.0-1.0] of overlap allowed by 2
                      features; by default O=0.1
  -mismatch M         mismatch penalty in exons when finding best mapping; by
                      default M=2
  -gap_open GO        gap open penalty in exons when finding best mapping; by
                      default GO=2
  -gap_extend GE      gap extend penalty in exons when finding best mapping;
                      by default GE=1

Input

The only required inputs are the reference genome sequence (FASTA format), the target genome sequence (FASTA format), and the reference annotation or feature database. If an annotation file is provided with the -g argument, a feature database will be built automatically and can be reused for future lift-overs by providing it with the -db argument.
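As a minimal sketch of the two ways to supply the annotation, the Python helper below assembles a Liftoff argument list with either -g or -db, mirroring the mutually exclusive "(-g GFF | -db DB)" group in the usage text. The function name build_liftoff_args and all file names are hypothetical, and the command is only constructed here, not executed.

```python
from typing import List, Optional


def build_liftoff_args(target: str, reference: str,
                       gff: Optional[str] = None,
                       db: Optional[str] = None,
                       output: Optional[str] = None) -> List[str]:
    """Assemble an argument list for liftoff (hypothetical helper).

    Exactly one of gff (-g) or db (-db) must be given, matching the
    "(-g GFF | -db DB)" group in the usage text.
    """
    if (gff is None) == (db is None):
        raise ValueError("provide exactly one of gff (-g) or db (-db)")
    args = ["liftoff"]
    args += ["-g", gff] if gff is not None else ["-db", db]
    if output is not None:
        args += ["-o", output]
    # Positional arguments: target genome first, then reference genome.
    args += [target, reference]
    return args


# File names below are placeholders for illustration only.
first_run = build_liftoff_args("target.fa", "reference.fa",
                               gff="reference.gff3", output="target.gff3")
```

With Liftoff installed, a list like first_run could be passed to subprocess.run(first_run, check=True).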

Feature Types

By default, 'gene' features and all child features of genes (i.e. transcripts, mRNA, exons, CDS, UTRs) will be lifted over. The -f parameter can be used to specify a file containing a list of additional parent feature types you wish to lift over. Note: feature IDs must be unique for every feature and may not contain spaces. An example of a feature types file would be the following:

biological_region
miRNA
repeat_element

Sequence Identity and Alignment Coverage

A gene will be considered mapped successfully if the alignment coverage and sequence identity in the child features (usually exons/CDS) are both >= 50%. These thresholds can be changed with the -a and -s options. By default, genes that map below these thresholds will be included in the GFF file with partial_mapping=True and low_identity=True in the last column. To exclude these partial/low-identity mappings from the final GFF, use -exclude_partial; these genes will instead be written to the unmapped_features.txt file. The sequence identity and alignment coverage are reported in the final column of the output GFF for each gene.
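The thresholds above can also be checked after the fact. The following is a minimal Python sketch that reads gene lines from a GFF3 output and reports those whose coverage or sequence_ID attribute falls below a chosen cutoff. It assumes GFF3-style key=value attributes and that both values are written as decimal fractions; the helper names and the two example lines are made up for illustration.

```python
from typing import Dict, Iterable, Iterator, Tuple


def parse_attributes(column9: str) -> Dict[str, str]:
    """Split a GFF3 attribute column (key=value;key=value) into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def low_quality_genes(lines: Iterable[str], min_cov: float = 0.5,
                      min_id: float = 0.5) -> Iterator[Tuple[str, float, float]]:
    """Yield (gene ID, coverage, sequence_ID) for genes below either cutoff."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "gene":
            continue  # only gene features are inspected here
        attrs = parse_attributes(cols[8])
        if "coverage" not in attrs or "sequence_ID" not in attrs:
            continue
        cov = float(attrs["coverage"])
        ident = float(attrs["sequence_ID"])
        if cov < min_cov or ident < min_id:
            yield attrs.get("ID", "?"), cov, ident


# Made-up example lines, not real Liftoff output.
example = [
    "chr1\tLiftoff\tgene\t100\t900\t.\t+\t.\tID=geneA;coverage=0.98;sequence_ID=0.97",
    "chr1\tLiftoff\tgene\t2000\t2400\t.\t-\t.\tID=geneB;coverage=0.41;sequence_ID=0.88;partial_mapping=True",
]
flagged = list(low_quality_genes(example))
```

For a real file, the same function can be given an open file handle instead of the example list.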

Minimap2 parameters

By default, Liftoff uses the following parameters for the minimap2 alignments: -a --eqx --end-bonus 5 -N 50 -p 0.5. The -a and --eqx options specify that the output should be in SAM format with the CIGAR string using "=" for matches and "X" for mismatches (as opposed to the default SAM format, which uses "M" for both). The -N and -p values allow more secondary alignments to be considered, which helps in resolving multi-gene families. The --end-bonus parameter favors end-to-end alignments of the gene over soft clipping a mismatched base at the start or end of the alignment. For example, if the stop codon of the reference gene is TAA and the stop codon of the target gene is TAG, then without the --end-bonus parameter this alignment and the subsequent annotation would be truncated by 1 base.

The user may wish to change the minimap2 parameters for their specific data. This can be done with the -mm2_options parameter, followed by a string of options to add or change, preceded by an "=" sign. The "=" is important because it distinguishes minimap2 parameters from Liftoff parameters with the same flag. For more divergent species in particular, increasing the -r and -z parameters may improve results (see the Minimap2 documentation for more details). An example of changing these with -mm2_options would be

-mm2_options="-r 2k -z 5000"
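To illustrate why the "=" and the quotes matter, the short Python sketch below uses the standard shlex module to split a command line the way a POSIX shell would. The quoted minimap2 options stay attached to -mm2_options as a single argument, so they are not read as Liftoff flags. The file names in the command are placeholders.

```python
import shlex

# Placeholder file names; only the argument splitting is demonstrated here.
command = ('liftoff -g reference.gff3 -o target.gff3 '
           '-mm2_options="-r 2k -z 5000" target.fa reference.fa')
argv = shlex.split(command)

# The whole minimap2 option string arrives as one argument.
mm2_arg = [arg for arg in argv if arg.startswith("-mm2_options")][0]
print(mm2_arg)
```

When launching Liftoff from Python with an argument list, the same single element ("-mm2_options=-r 2k -z 5000") would be passed as one item of the list.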

Polishing Exon/CDS Annotations

With the -polish option, Liftoff will re-align the exons in an attempt to restore proper coding sequences in cases where the lift-over resulted in start/stop codon loss or introduced an in-frame stop codon. This increases the run time but improves the preservation of proper CDS annotations. With the -polish option, 2 output GFF/GTF files will be created, named {output}.gff and {output}.gff_polished. {output}.gff contains the annotations prior to the polishing step and {output}.gff_polished contains the annotations after polishing.

Gene Structure in Cross-Species Lift-over

Liftoff works best when the gene structure (i.e. intron size) is similar in the reference and target genomes. When genes differ significantly in size, the alignments are more fragmented, and small exons at the beginning or end of the gene are often not aligned. Adding and aligning some fraction of flanking sequence to the gene with the -flank option can improve this in some cases. Additionally, increasing the -d parameter will allow mappings where the genes are much larger in the target genome than in the reference.

Chromosome by Chromosome Lift-over

By default, all genes will be aligned to the entire target assembly. However, for chromosome-scale assemblies of the same species, the -chroms option can be used to perform the lift-over chromosome by chromosome, which improves accuracy. After the chromosome-by-chromosome lift-over is complete, any genes that did not map will be aligned to the whole genome. This is strongly recommended for repetitive/polyploid genomes where there are many similar genes on different chromosomes. This option can be enabled by providing a comma-separated file chroms.txt with corresponding chromosome names with the -chroms argument. Each line of the file should follow the form {ref_chrom_name},{target_chrom_name} for each pair of corresponding chromosomes. For example, a lift-over from a GenBank human assembly to a RefSeq human assembly would have the following chroms.txt file.

chr1,NC_000001.10
chr2,NC_000002.11
chr3,NC_000003.11
chr4,NC_000004.11
chr5,NC_000005.9
chr6,NC_000006.11
chr7,NC_000007.13
chr8,NC_000008.10
chr9,NC_000009.11
chr10,NC_000010.10
chr11,NC_000011.9
chr12,NC_000012.11
chr13,NC_000013.10
chr14,NC_000014.8
chr15,NC_000015.9
chr16,NC_000016.9
chr17,NC_000017.10
chr18,NC_000018.9
chr19,NC_000019.9
chr20,NC_000020.10
chr21,NC_000021.8
chr22,NC_000022.10
chrX,NC_000023.10
chrY,NC_000024.9
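A file in this format can be generated rather than typed by hand. The following is a minimal Python sketch that writes one {ref_chrom_name},{target_chrom_name} line per chromosome pair. The function name write_chroms_file is hypothetical, and the comma check is an added safeguard that is not described in the Liftoff documentation.

```python
import os
import tempfile
from typing import Iterable, Tuple


def write_chroms_file(pairs: Iterable[Tuple[str, str]], path: str) -> None:
    """Write one "{ref_chrom_name},{target_chrom_name}" line per pair."""
    with open(path, "w") as handle:
        for ref_name, target_name in pairs:
            # A comma inside a name would make the line ambiguous.
            if "," in ref_name or "," in target_name:
                raise ValueError("chromosome names must not contain commas")
            handle.write(f"{ref_name},{target_name}\n")


# A few pairs taken from the example above.
pairs = [("chr1", "NC_000001.10"), ("chr2", "NC_000002.11"), ("chrX", "NC_000023.10")]
chroms_path = os.path.join(tempfile.mkdtemp(), "chroms.txt")
write_chroms_file(pairs, chroms_path)

with open(chroms_path) as handle:
    written = handle.read().splitlines()
```

The resulting path would then be supplied to Liftoff with the -chroms argument.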

Unplaced Genes

A list of unplaced sequence names can be provided with the -unplaced option. With this option, genes from these unplaced contigs in the reference will be mapped to the target assembly after the genes on the main chromosomes in the chroms.txt have been mapped.

Extra Gene Copies

With the -copies option, Liftoff will look for extra copies of genes that are not annotated in the reference after the initial lift-over. A gene copy will only be annotated at a locus if it does not overlap another annotated feature. By default, exons/CDSs must have 100% sequence identity; this minimum can be changed with the -sc option. Extra gene copies will have the same ID as the reference gene and will be tagged with extra_copy_number={copy_number} in the last column of the GFF file.

Output

The output is a file in the same format as the reference annotation (GFF3 or GTF) for the target genome, plus a file with the IDs of unmapped genes. The 9th column of the target annotation will contain the same information as the original reference plus the following:

Genes:

sequence_ID: The sequence identity of the gene compared to the reference in exon regions
coverage: The alignment coverage of the gene in exon regions
valid_ORFs: The number of valid ORFs annotated within the gene

Transcripts:

valid_ORF: Indicates the CDS annotation properly starts with a start codon, ends with a stop codon, 
           and does not have any in-frame stop codons.
matches_ref_protein: Indicates the translated CDS matches the reference CDS exactly
missing_start_codon: Indicates the CDS does not begin with a start codon
missing_stop_codon: Indicates the CDS does not end with a stop codon
inframe_stop_codon: Indicates the CDS has an inframe stop codon. 

All features:

extra_copy_number: The copy number increase of this feature compared to the reference. 
extra_copy_number=0 means this is the original reference gene. 
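As a rough sketch of how these attributes might be summarized, the Python snippet below counts how many features carry each ORF status flag set to True in a GFF3 output. It assumes GFF3-style key=value attributes and that the flags are written as key=True; the function name and the example lines are made up for illustration, and GTF output would need a different attribute parser.

```python
from collections import Counter
from typing import Iterable

ORF_FLAGS = ("valid_ORF", "matches_ref_protein", "missing_start_codon",
             "missing_stop_codon", "inframe_stop_codon")


def tally_orf_flags(lines: Iterable[str]) -> Counter:
    """Count features whose attribute column sets each ORF flag to True."""
    counts = Counter()
    for line in lines:
        if line.startswith("#"):
            continue  # skip header and comment lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        attrs = dict(f.split("=", 1) for f in cols[8].split(";") if "=" in f)
        for flag in ORF_FLAGS:
            if attrs.get(flag) == "True":
                counts[flag] += 1
    return counts


# Made-up example lines, not real Liftoff output.
example = [
    "chr1\tLiftoff\tmRNA\t100\t900\t.\t+\t.\tID=tx1;Parent=geneA;valid_ORF=True;matches_ref_protein=True",
    "chr1\tLiftoff\tmRNA\t2000\t2400\t.\t-\t.\tID=tx2;Parent=geneB;valid_ORF=False;missing_stop_codon=True",
]
orf_counts = tally_orf_flags(example)
```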

Known Issues:

Extracting gene sequences from bgzipped files is possible but much slower. It is recommended to decompress bgzipped FASTA files first.
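One way to do the decompression from Python is sketched below. It relies on the fact that bgzip output is a series of standard gzip members, which Python's gzip module can read. The function name decompress_fasta and the file names are hypothetical, and the round trip uses a tiny made-up FASTA record written with plain gzip.

```python
import gzip
import os
import shutil
import tempfile


def decompress_fasta(compressed_path: str, output_path: str) -> None:
    """Stream a gzip/bgzip-compressed FASTA to an uncompressed file."""
    with gzip.open(compressed_path, "rb") as src, open(output_path, "wb") as dst:
        shutil.copyfileobj(src, dst)


# Round-trip demo on a tiny made-up FASTA record.
workdir = tempfile.mkdtemp()
gz_path = os.path.join(workdir, "genome.fa.gz")
with gzip.open(gz_path, "wb") as handle:
    handle.write(b">chr1\nACGT\n")

fa_path = os.path.join(workdir, "genome.fa")
decompress_fasta(gz_path, fa_path)

with open(fa_path, "rb") as handle:
    restored = handle.read()
```

The uncompressed file can then be given to Liftoff as the target or reference genome.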

Citation

If you use Liftoff in your work please cite

Shumate, Alaina, and Steven L. Salzberg. 2020. “Liftoff: Accurate Mapping of Gene Annotations.” Bioinformatics, December. https://doi.org/10.1093/bioinformatics/btaa1016.
