Specifications and example usage

Evan Staton edited this page Apr 23, 2023 · 13 revisions

Before I dive into the usage of Tephra, I'd like to briefly explain its intended purpose. Essentially, Tephra is for finding transposons and analyzing patterns of genome evolution. The goal is for this tool to be generally applicable to finding transposons of any type in any genome. This is a major challenge given the diversity of transposons and the variation of forms within a single transposon type. For this reason, it is necessary to be able to fine-tune the search parameters, and this is what the Tephra configuration file is designed to enable. For most eukaryotic species, it is advised to run the tephra all command, which will search for all major transposon types and analyze the patterns of evolution of those elements. This is the scenario I will describe on this page. For some species, yeast for example, it may make sense to search only for LTR retrotransposons, or you may be interested only in finding Helitrons. I will describe those use cases in the future, but for now let's focus on the situation where you want to find all transposons in a genome and describe the evolution of those elements.

CONFIGURING THE SEARCH PARAMETERS

At the root of the distribution is a config folder which contains the main Tephra configuration file. This file can be viewed on GitHub from this link. From the command line, you can type the following command to fetch the configuration file:

curl -sL -o tephra_config.yml https://git.io/v5HFq

Shown below is the configuration file which I will describe in detail on this page.

## For more information about this file, see: 
## https://github.com/sestaton/tephra/wiki/Specifications-and-example-usage.
all:
  - logfile:          tephra_tair10_full.log
  - genome:           TAIR10_chr1-5.fas
  - outfile:          TAIR10_chr1-5_tephra_transposons.gff3
  - repeatdb:         repbase1801_athaliana.fasta 
  - genefile:         TAIR10_genes.fas
  - trnadb:           TephraDB
  - hmmdb:            TephraDB
  - threads:          24
  - clean:            YES
  - debug:            NO
  - subs_rate:        1e-8
findltrs:
  - dedup:             NO
  - tnpfilter:         NO
  - domains_required:  NO
  - ltrharvest:
     - mintsd:         4
     - maxtsd:         20
     - minlenltr:      100
     - maxlenltr:      1000
     - mindistltr:     1000
     - maxdistltr:     15000
     - seedlength:     30
     - tsdradius:      60
     - xdrop:          5
     - swmat:          2 
     - swmis:          -2
     - swins:          -3
     - swdel:          -3
     - overlaps:       best
  - ltrdigest:
     - pptradius:      30
     - pptlen:         8 30
     - pptagpr:        0.25
     - uboxlen:        3 30
     - uboxutpr:       0.91
     - pbsradius:      30
     - pbslen:         11 30
     - pbsoffset:      0 5
     - pbstrnaoffset:  0 5
     - pbsmaxeditdist: 1
     - pdomevalue:     1E-6
     - pdomcutoff:     NONE
     - maxgaplen:      50
classifyltrs:
  - percentcov:       50
  - percentid:        80
  - hitlen:           80
illrecomb:
  - repeat_pid:       10
ltrage:
  - all:              NO
maskref:
  - percentid:        80
  - hitlength:        70
  - splitsize:        5000000
  - overlap:          100
sololtr:
  - percentid:        39
  - percentcov:       80
  - matchlen:         80
  - numfamilies:      20
  - allfamilies:      NO
tirage:
  - all:              NO

The first thing to notice at the top of this file is the URL of this page, which you can always refer back to for a description of the configuration. To help you understand this configuration file, let's take a look at the Tephra subcommands and their descriptions (you can type tephra at the command line for more information):

           age: Calculate the age distribution of LTR or TIR transposons.
           all: Run all subcommands and generate annotations for all transposon types.
  classifyltrs: Classify LTR retrotransposons into superfamilies and families.
  classifytirs: Classify TIR transposons into superfamilies.
 findfragments: Search a masked genome with a repeat database to find fragmented elements.
 findhelitrons: Find Helitrons in a genome assembly.
      findltrs: Find LTR retrotransposons in a genome assembly.
   findnonltrs: Find non-LTR retrotransposons in a genome assembly.  
      findtirs: Find TIR transposons in a genome assembly.
     findtrims: Find TRIM retrotransposons in a genome assembly.
     illrecomb: Characterize the distribution of illegitimate recombination in a genome.
       maskref: Mask a reference genome with transposons.
    reannotate: Transfer annotations from a reference set of repeats to Tephra annotations.
       sololtr: Find solo-LTRs in a genome assembly.
          info: Show version information for all external programs configured and used by Tephra.

You can see that most of the command names start with "find" and these are for finding a particular transposon type. The other commands are for analyzing the transposons that are discovered. The important thing to note is that the command names match the sections in the configuration file, which allows you to tune the parameters for each command. With that point in mind we can now move on to describing the purpose of each line in the configuration. I've separated out each section of this file below.
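To make the correspondence between section names and subcommands concrete, here is a minimal Python sketch that pulls the top-level section names out of a configuration file and compares them with the subcommand list above. The function name config_sections and the abbreviated EXAMPLE text are my own placeholders, not part of Tephra, and the sketch reads section headers with a simple regular expression instead of a YAML parser.

```python
import re

# Subcommand names copied from the listing above.
SUBCOMMANDS = {
    "age", "all", "classifyltrs", "classifytirs", "findfragments",
    "findhelitrons", "findltrs", "findnonltrs", "findtirs", "findtrims",
    "illrecomb", "maskref", "reannotate", "sololtr", "info",
}

def config_sections(text):
    """Return the top-level section names in order of appearance."""
    # A section header starts in column 0 and ends with a colon;
    # indented "- key: value" lines and "##" comments do not match.
    return re.findall(r"^([A-Za-z_]\w*):[ \t]*$", text, flags=re.MULTILINE)

# Abbreviated stand-in for tephra_config.yml (hypothetical, only a few keys).
EXAMPLE = """\
## For more information about this file, see the wiki.
all:
  - threads:          24
findltrs:
  - dedup:             NO
  - ltrharvest:
     - mintsd:         4
ltrage:
  - all:              NO
"""

for name in config_sections(EXAMPLE):
    if name in SUBCOMMANDS:
        print(f"{name}: matches a subcommand")
    else:
        print(f"{name}: no subcommand of exactly this name")
```

To check a real file, pass its contents instead of EXAMPLE, for example config_sections(open("tephra_config.yml").read()). Run against the full configuration shown earlier, a check like this would flag ltrage and tirage, which have no subcommand of exactly that name; they presumably hold the settings for the age subcommand on LTR and TIR elements, but that is my inference from the names.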

all

      - logfile:          tephra_tair10_full.log
      - genome:           TAIR10_chr1-5.fas
      - outfile:          TAIR10_chr1-5_tephra_transposons.gff3
      - repeatdb:         repbase1801_athaliana.fasta
      - genefile:         TAIR10_genes.fas 
      - trnadb:           TephraDB
      - hmmdb:            TephraDB
      - threads:          24
      - clean:            YES
      - debug:            NO
      - subs_rate:        1e-8

The 'all' section of the configuration file is the most important because this is where you specify which files you run the analysis on and which files to write the results to. Each entry will be described briefly:

  • logfile - The main file for logging progress and results. This will indicate the result files from each step and allow you to monitor the analysis.
  • genome - The genome in FASTA format to be used for finding transposons.
  • outfile - The name of the GFF3 file to store all of the resulting transposon annotations.
  • repeatdb - The name of a database of repetitive elements in FASTA format to be used for making similarity-based comparisons.
  • genefile - The name of a FASTA file of gene sequences from the focal species, or a closely related species. This helps to filter out spurious transposons that are tandem gene duplicates.
  • trnadb - The name of a database of tRNA sequences in FASTA format to be used for finding primer-binding sites of LTR retrotransposons. Leave as "TephraDB" (Tephra's custom database) unless you want to provide your own database.
  • hmmdb - The name of the HMM database of transposon-related protein profiles. Leave as "TephraDB" (Tephra's custom database) unless you want to provide your own database.
  • threads - The number of parallel processes you want to run. The more threads you use, the faster the analysis will run, though this should not exceed the number of threads your computer can handle.
  • clean - This boolean (YES|NO) specifies whether to clean up all intermediate files. This should be set to "YES" for all situations except debugging.
  • debug - This boolean (YES|NO) specifies whether to increase the verbosity of the logging for debugging purposes. Unless you are trying to debug issues, set to "NO" or the output will be very noisy.
  • subs_rate - The per-base substitution rate used for calculating transposon age. If this is unknown, set it to that of a closely related species or a model species in your system.
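
Before starting a long run, it can help to sanity-check these settings. The following Python sketch applies the rules described in the list above to a plain dictionary of 'all' settings. The function check_all_section, its cpu_count argument, and the wording of the messages are hypothetical and not part of Tephra; the sketch also assumes the values are held as raw strings or numbers exactly as written in the file.

```python
import os

def check_all_section(cfg, cpu_count=None):
    """Return a list of problems found in a dict of 'all' settings (empty if none)."""
    problems = []
    cpus = cpu_count or os.cpu_count() or 1

    # Input files have to exist before the run starts.
    for key in ("genome", "repeatdb", "genefile"):
        path = str(cfg.get(key, ""))
        if not os.path.isfile(path):
            problems.append(f"{key}: file not found: {path!r}")

    # The two databases may stay at the default or point to your own database.
    for key in ("trnadb", "hmmdb"):
        value = str(cfg.get(key, ""))
        if value != "TephraDB" and not os.path.exists(value):
            problems.append(f"{key}: not 'TephraDB' and not found: {value!r}")

    # threads should be a positive integer within what the machine offers.
    try:
        threads = int(cfg.get("threads"))
    except (TypeError, ValueError):
        threads = 0
    if not 1 <= threads <= cpus:
        problems.append(f"threads: expected 1..{cpus}, got {cfg.get('threads')!r}")

    # clean and debug are YES|NO switches.
    for key in ("clean", "debug"):
        if cfg.get(key) not in ("YES", "NO"):
            problems.append(f"{key}: expected YES or NO, got {cfg.get(key)!r}")

    # subs_rate must be a positive number.
    try:
        rate = float(cfg.get("subs_rate"))
    except (TypeError, ValueError):
        rate = 0.0
    if rate <= 0:
        problems.append(f"subs_rate: expected a positive number, got {cfg.get('subs_rate')!r}")

    return problems

# Values taken from the 'all' section shown above.
settings = {
    "genome": "TAIR10_chr1-5.fas", "repeatdb": "repbase1801_athaliana.fasta",
    "genefile": "TAIR10_genes.fas", "trnadb": "TephraDB", "hmmdb": "TephraDB",
    "threads": 24, "clean": "YES", "debug": "NO", "subs_rate": "1e-8",
}
for problem in check_all_section(settings):
    print(problem)
```

What gets printed depends on your machine: the FASTA files are reported as missing unless they are in the current directory, and threads is flagged if 24 exceeds the number of CPUs available.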