Usage Documentation

nadiadavidson edited this page Feb 12, 2020 · 21 revisions

The files used in this example can be found in the Examples folder of the repository.

Producing a SuperTranscript

Lace/Lace_run.py Example/Example_Genome.fasta Example/clusters.txt -t -o Test

Here Example_Genome.fasta is a fasta file containing all the transcripts for all the genes/clusters you wish to construct a SuperTranscript for, and clusters.txt is a text file with two tab-separated columns: the transcript/contig name in the first column and the cluster/gene name in the second column (this is the format output by Corset).
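Since a malformed cluster file is an easy mistake to make, it can help to check the two-column layout before running Lace. The following is only a minimal sketch, not part of Lace: the function name read_clusters and the transcript/gene names in the example are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def read_clusters(handle):
    """Map each cluster/gene name to the list of its transcript names.

    Hypothetical helper (not part of Lace). Expects two tab-separated
    columns per line: transcript/contig name, then cluster/gene name.
    """
    clusters = defaultdict(list)
    for line_no, row in enumerate(csv.reader(handle, delimiter="\t"), start=1):
        if not row or all(not field.strip() for field in row):
            continue  # skip blank lines
        if len(row) != 2:
            raise ValueError(
                f"line {line_no}: expected 2 tab-separated columns, got {len(row)}"
            )
        transcript, cluster = (field.strip() for field in row)
        clusters[cluster].append(transcript)
    return dict(clusters)


# Made-up names standing in for the contents of a cluster file.
example = io.StringIO("tA1\tGeneA\ntA2\tGeneA\ntB1\tGeneB\ntB2\tGeneB\n")
clusters = read_clusters(example)
for name, transcripts in clusters.items():
    print(name, len(transcripts))
```

To check a real file, open it and pass the handle instead, for example with open("Example/clusters.txt", newline="").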

In this example, there are two mock genes (A and B), each of which is expressed as two different transcripts.

Lace can run in parallel mode (each gene can be run as a separate standalone thread); set the number of cores with --cores (default = 1).

To get the help options simply type:

Lace/Lace_run.py --help
usage: Lace_run.py [-h] [--cores CORES] [--alternate] [--tidy] [--maxTran MAXTRAN] [--outputDir OUTPUTDIR] TranscriptsFile ClusterFile

positional arguments:
  TranscriptsFile       The name of the fasta file containing all transcripts
  ClusterFile           The name of the text file with the transcript to cluster mapping

optional arguments:
  -h, --help            show this help message and exit
  --cores CORES         The number of cores you wish to run the job on (default = 1)
  --alternate, -a       Create alternate annotations and create metrics on success of SuperTranscript Building
  --tidy, -t            Remove intermediate fasta files after running
  --maxTran MAXTRAN     Set a maximum for the number of transcripts from a cluster to be included for building the SuperTranscript (default=50).
  --outputDir OUTPUTDIR, -o OUTPUTDIR
                        Output Directory

The outputs of this script are:

  • SuperDuper.fasta containing the SuperTranscript sequence per gene. The SuperTranscript ID line includes the cluster ID, the number of transcripts used in construction and the number of whirls (loops in the splicing graph). Whirls can result in repeated sequence in the SuperTranscript. If the transcript and whirl numbers are -1, this indicates that the splicing graph was too complex to construct and Lace instead reported the longest isoform.
  • SuperDuper.gff The annotation for each SuperTranscript obtained from the overlap graph. This annotates the SuperTranscript blocks and can be used for downstream alignment and counting for differential transcript usage. Note that it does not link blocks to transcripts, i.e. it is not an annotation of transcripts to the SuperTranscript (see next section for that).
  • Intermediate files: .fasta files containing all transcripts for each gene and .psl files containing the pairwise alignment of all transcripts by blat per gene. These are generally not needed for downstream analysis.

Extracting the annotation of transcripts against the SuperTranscript

If you did not create the alternate annotation with the --alternate (-a) flag in the previous step, you can easily create it afterwards.

Move into the output directory:

cd Test

Make the alternate annotation. If -a was not passed in the original Lace run, this requires running the tool in the same directory as SuperDuper.fasta and the *.fasta files created for each gene (note that these intermediate files are removed if Lace was run with --tidy, -t):

python Lace/Checker.py SuperDuper.fasta

usage: Checker.py [-h] [--cores CORES] SuperFile  

positional arguments:  
  SuperFile      The name of the SuperDuper.fasta file created by  
             SuperTranscript  

optional arguments:  
  -h, --help     show this help message and exit   
  --cores CORES  The number of cores you wish to run the job on (default = 1)  

Outputs:

  • SuperDuperTrans.gff The annotation of the transcripts on the SuperTranscript [Optional - if --alternate flag invoked]
  • LogOut.pdf A pdf documenting various metrics for assessing the quality of the SuperTranscript construction.

IGV viewer

To start IGV from the command line (if you have it installed), simply type: igv

Then load the SuperDuper.fasta file, which contains the sequence for each gene, followed by the sorted .bam files containing the reads mapped to SuperDuper.fasta and the annotation files, SuperDuper.gff and SuperDuperTrans.gff. Remember to expand the annotation tracks by right-clicking on the annotation object in IGV and choosing the expanded view mode.

Viewing transcript coverage on SuperTranscript

Lace also includes a script to view, for a given gene, the coverage of each transcript on the SuperTranscript. Make sure you run this script from the same directory as SuperDuper.fasta.

python Lace/STViewer.py SuperFiles/GeneA.fasta

usage: STViewer.py  GeneFile 

positional arguments:  
  GeneFile    The fasta file for the gene you wish to view  

Outputs:

  • GeneView.pdf - A pdf displaying the transcript coverage to the SuperTranscript.

Creating Splice Block Annotation

To create this annotation, the reads must already have been mapped back to the SuperTranscript with a splice-aware mapper. I typically use STAR, since it outputs the splice junctions in a handy tab-delimited file. The small script I made to construct the splice blocks is called Mobius, and it requires two inputs:

  • SJ.out.tab (tab-delimited output from STAR for splice junctions)
  • SuperDuper.fasta (The fasta file containing sequence for the constructed SuperTranscripts)

The code to produce the splice block annotation:

python Lace/Mobius.py SJ.out.tab SuperDuper.fasta  

Output: Spliced.gtf (the annotation file based off of splice blocks)
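To give a feel for what a splice block is, here is a small sketch that cuts a sequence into blocks at splice junction boundaries. It is not Mobius's actual implementation, which is not described on this page; it only illustrates the general idea. It assumes the standard STAR SJ.out.tab layout (column 1: sequence name, columns 2 and 3: first and last base of the intron, 1-based, column 7: number of uniquely mapping reads crossing the junction). The function names and the min_unique_reads filter are hypothetical.

```python
import io
from collections import defaultdict


def read_junctions(handle, min_unique_reads=1):
    """Collect (intron_start, intron_end) pairs per sequence from SJ.out.tab.

    min_unique_reads is a hypothetical filter, not a Mobius option.
    """
    junctions = defaultdict(set)
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 7:
            continue  # skip blank or truncated lines
        name, start, end = fields[0], int(fields[1]), int(fields[2])
        unique_reads = int(fields[6])
        if unique_reads >= min_unique_reads:
            junctions[name].add((start, end))
    return junctions


def blocks(length, junctions):
    """Split positions 1..length into blocks at every junction boundary."""
    starts = {1}
    for start, end in junctions:
        starts.add(start)    # a new block begins at the first intron base
        starts.add(end + 1)  # and another just after the last intron base
    ordered = sorted(s for s in starts if 1 <= s <= length)
    ends = [s - 1 for s in ordered[1:]] + [length]
    return list(zip(ordered, ends))


# Made-up junctions on a made-up 100 bp SuperTranscript called GeneA.
sj = io.StringIO(
    "GeneA\t21\t40\t1\t1\t0\t12\t0\t30\n"
    "GeneA\t61\t80\t1\t1\t0\t0\t3\t25\n"  # no unique reads, filtered out
)
junctions = read_junctions(sj)
gene_a_blocks = blocks(100, junctions["GeneA"])
for block_start, block_end in gene_a_blocks:
    print("GeneA", block_start, block_end)
```

For real data, use Spliced.gtf as produced by Mobius; the sketch is only meant to show why a junction supported by reads adds block boundaries to the annotation.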

The Perils of Large Clusters

One problem that users have encountered is what to do with large clusters! These clusters can sometimes include several hundred contigs, or contigs with several hundred kbp of sequence. De novo assembly is a really hard job, and often junk gets assembled into contigs, or a group of many junk contigs gets clustered together, sweeping up the garbage. You can try to build SuperTranscripts from them, but BLAT may take a long time to align them (and stall), and Lace may not like making such a large graph! So if you run into issues where Lace takes too long or BLAT hangs, I would edit your cluster file and remove the cluster or contig that is causing the issue. While there may be some real sequence buried in the junk, it will be hard to see the forest for the trees, so in our experience it is best to just cut your losses!
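Editing the cluster file by hand gets tedious when there are many offenders, so the removal can be scripted. The sketch below is not part of Lace: the function names and both thresholds (max_transcripts, max_length) are hypothetical and should be tuned to your assembly. It drops over-long contigs first and then drops any cluster that is still too large.

```python
import io
from collections import Counter


def fasta_lengths(handle):
    """Return {sequence name: length} for a FASTA file handle."""
    lengths = {}
    name = None
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""  # name = first word of header
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def filter_clusters(pairs, lengths, max_transcripts, max_length):
    """Filter (transcript, cluster) pairs using hypothetical thresholds.

    Transcripts missing from `lengths` are treated as length 0 and kept.
    """
    kept = [(t, c) for t, c in pairs if lengths.get(t, 0) <= max_length]
    sizes = Counter(c for _, c in kept)
    return [(t, c) for t, c in kept if sizes[c] <= max_transcripts]


# Made-up data: t2 is too long, and g2 has too many transcripts.
fasta = io.StringIO(">t1\nACGT\n>t2\nACGTACGTAC\n>t3\nAC\n>t4\nAC\n>t5\nGG\n")
pairs = [("t1", "g1"), ("t2", "g1"), ("t3", "g2"), ("t4", "g2"), ("t5", "g2")]
lengths = fasta_lengths(fasta)
filtered = filter_clusters(pairs, lengths, max_transcripts=2, max_length=8)
for transcript, cluster in filtered:
    print(f"{transcript}\t{cluster}")  # same two-column layout as the input
```

Writing the printed lines to a new file gives a reduced cluster file to pass to Lace. Note that --maxTran already caps how many transcripts from a cluster are used; this sketch instead removes the troublesome clusters and contigs entirely.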