Quarkins edited this page Jul 12, 2016 · 39 revisions

Welcome to the SuperTranscript wiki!

The aim of this wiki is to give the user a brief explanation of the tool and how to use it.

The aim of SuperTranscript is simple: from a group of transcripts, de novo contigs, reference sequences (or a mixture of these), construct a single long transcript (a SuperTranscript) that contains all the bases from all the shorter sequences while preserving the order in which they occur in the shorter transcripts.

Installation

There are two pre-requisites for Ribbon:

  1. Anaconda for python3 (which will come with all necessary python packages) - https://www.continuum.io/downloads
  2. BLAT v.35 (Download zip file and install as per README - https://users.soe.ucsc.edu/~kent/src/)

Annotations

Additionally, SuperTranscript can provide two complementary annotations of these SuperTranscripts:

1) An annotation by SuperBlocks, which are defined by the matched, overlapping parts of two or more of the contigs that build the SuperTranscript; these are non-diverging paths on the graph. SuperBlocks can be thought of as exon-like structures, although in reality they can include fewer or more bases than the corresponding exon in an annotated genome.
2) An annotation built using the transcript coverage over the SuperTranscript. In other words, it annotates where the transcripts used to build the SuperTranscript map back to the SuperTranscript.
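
As a rough illustration of the second, transcript-style annotation, the sketch below collapses the SuperTranscript positions covered by one transcript into intervals and writes them as GFF-style lines (nine tab-separated columns, 1-based inclusive coordinates). The function names, the `source` and `feature` values and the `transcript_id` attribute are assumptions made for this example; they are not taken from the tool's actual output.

```python
def covered_intervals(positions):
    """Collapse 0-based covered positions into 1-based inclusive (start, end) runs."""
    runs = []
    for pos in sorted(set(positions)):
        if runs and pos == runs[-1][1] + 1:
            runs[-1][1] = pos        # position extends the current run
        else:
            runs.append([pos, pos])  # gap found: start a new run
    return [(start + 1, end + 1) for start, end in runs]


def gff_lines(seqid, transcript_id, positions, source="sketch", feature="exon"):
    """Format each covered run as one GFF-style line (hypothetical field values)."""
    lines = []
    for start, end in covered_intervals(positions):
        fields = [
            seqid,            # name of the SuperTranscript (gene/cluster)
            source,           # assumed label, not the tool's real value
            feature,          # assumed feature type
            str(start),
            str(end),
            ".",              # score
            ".",              # strand
            ".",              # phase
            f"transcript_id={transcript_id}",
        ]
        lines.append("\t".join(fields))
    return lines


# Hypothetical transcript "t2" covering positions 0-3 and 6-7 of "gene1".
for line in gff_lines("gene1", "t2", [0, 1, 2, 3, 6, 7]):
    print(line)
```

Each contiguous run of covered bases becomes one feature, so a transcript that skips part of the SuperTranscript shows up as several separate intervals.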

The Algorithm

The algorithm can be thought of as the following steps:

1) Input a list of contigs/transcripts and their sequences in a fasta file, and a text file with the clustering information stating which gene/cluster each transcript belongs to.

For each gene:
2) Using BLAT (https://genome.ucsc.edu/FAQ/FAQblat.html), pairwise align each transcript in the cluster to find the regions which overlap.
3) Construct a directed graph, where each node is a base in one of the transcripts and each directed edge retains the ordering of the bases in each transcript. Using all of the pairwise alignments, merge shared bases (nodes) together.
4) Simplify the graph and remove all cycles in order to create a Directed Acyclic Graph (DAG), which is necessary for the next step.
5) Topologically sort the nodes (each node is now a string of bases from the original unsimplified graph) using Kahn's algorithm, which will give a non-unique sorting of the bases.
6) Extract the annotations (both SuperBlock style and transcript style).
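
Steps 3 and 5 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the tool's implementation: the aligned base pairs are supplied by hand rather than taken from BLAT, the graph is not simplified, and a cycle simply raises an error instead of being removed. All function and variable names are hypothetical. A stack is used for the set of ready nodes (Kahn's algorithm allows any order), which in this small example keeps the bases of each branch together.

```python
from collections import defaultdict


def build_graph(transcripts, matches):
    """Build a base-level graph; `matches` pairs up aligned (name, index) bases."""
    # Union-find: every base starts as its own node, aligned bases are merged.
    parent = {(n, i): (n, i) for n, s in transcripts.items() for i in range(len(s))}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in matches:
        parent[find(b)] = find(a)

    base, edges = {}, defaultdict(set)
    for name, seq in transcripts.items():
        prev = None
        for i, ch in enumerate(seq):
            node = find((name, i))
            base[node] = ch
            edges[node]  # make sure nodes without successors are registered
            if prev is not None and prev != node:
                edges[prev].add(node)  # edge keeps the order within a transcript
            prev = node
    return base, edges


def kahn_sort(edges):
    """Topologically sort the nodes with Kahn's algorithm."""
    indegree = {n: 0 for n in edges}
    for succs in edges.values():
        for s in succs:
            indegree[s] += 1
    ready = sorted(n for n, d in indegree.items() if d == 0)
    order = []
    while ready:
        n = ready.pop()  # stack (last in, first out)
        order.append(n)
        for s in sorted(edges[n]):
            indegree[s] -= 1
            if indegree[s] == 0:
                ready.append(s)
    if len(order) != len(edges):
        raise ValueError("graph contains a cycle; it must be a DAG")
    return order


def super_transcript(transcripts, matches):
    base, edges = build_graph(transcripts, matches)
    return "".join(base[n] for n in kahn_sort(edges))


# Two toy transcripts sharing their first and last two bases.
toy = {"t1": "AAGGTT", "t2": "AACCTT"}
aligned = [(("t1", i), ("t2", i)) for i in (0, 1, 4, 5)]
print(super_transcript(toy, aligned))
```

The result contains every base of both toy transcripts, and each transcript can be read off it in its original order. A different tie-breaking rule would place the two alternative middle segments in the opposite order, which is what "non-unique sorting" means in step 5.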

Visualisation

The output of SuperTranscript is a fasta file and two annotation files (.gff). These can be easily read into the Integrative Genomics Viewer (IGV - https://www.broadinstitute.org/igv/) where one can easily view the read coverage of the SuperTranscript and the various annotations.

Code Example

Producing SuperTranscript

python ParaClusters.py Genome.fasta Clusters.txt

Here Genome.fasta is a fasta file which contains all the transcripts in all the genes/clusters you wish to construct a SuperTranscript for. Clusters.txt is a text file with two tab-separated columns: the transcript/contig name in the first column and the cluster/gene name in the second column (as in the output of Corset).

The run is parallelised: each gene can be processed as a separate stand-alone thread.
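
For reference, the two-column cluster file described above can be read into a cluster-to-transcripts mapping with a few lines of Python. This is a sketch for checking your own input, assuming a plain tab-separated file with no header; the function name and the example identifiers are hypothetical and this is not code from ParaClusters.py.

```python
import csv
from collections import defaultdict


def parse_clusters(lines):
    """Map each cluster/gene name to the list of its transcript/contig names."""
    clusters = defaultdict(list)
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        transcript, cluster = row[0], row[1]
        clusters[cluster].append(transcript)
    return dict(clusters)


# Hypothetical file content: transcript name, then cluster name.
example = ["contig_1\tCluster-0.0\n", "contig_2\tCluster-0.0\n", "contig_3\tCluster-1.0\n"]
print(parse_clusters(example))
```

To read a real file, pass an open file handle instead of the list, for example `with open("Clusters.txt", newline="") as handle: clusters = parse_clusters(handle)`.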

Options:
-n Number of cores to run on (default is the maximum number of cores available)
-C Run in cleanup mode, which deletes all intermediate fasta and psl files (except the SuperTranscript ones)
-a Boolean flag to produce the annotation with the transcript coverage (note: this takes approximately 10% more time to run). Default is True.

Pre-requisite: Standalone BLAT installed on your computer (depending on which version of BLAT you use, results may differ)

IGV viewer

To start IGV from the command line, simply type:

igv

This will load IGV (if you have it installed). One then simply has to load the SuperDuper.fasta file, which contains the sequence for each gene, the sorted .bam files, which contain the reads mapped to SuperDuper.fasta, and the annotation files SuperDuper.gff and SuperDuper_trans.gff (remembering to expand them using a right click on the annotation object in IGV and choosing the expanded view mode).

A suggested pipeline for de novo assembled non-model organisms

1) Run a de novo assembly (e.g. with Trinity)
2) Cluster the contigs into genes (e.g. using Trinity or Corset)
3) Build the SuperTranscript and annotations
4) Map reads to the SuperTranscript
5) Sort and index the bam files (if you want to view them in IGV)

Further optional analyses

  • Extract differential expression at the gene level (e.g. using Corset or DESeq, limma/voom)
  • Extract differential transcript/exon usage (e.g. with DEXSeq or voom/diffSplice)
  • Cryptic Cancer variants