README

Manual of TSscan

1. System Requirements

The TSscan pipeline runs on a 64-bit Linux operating system (e.g.,
Bio-Linux 6; see http://nebc.nerc.ac.uk/ for more information). The BLAT
and BFAST aligners can be downloaded from http://genome.ucsc.edu/ (the UCSC
Genome Browser) and http://sourceforge.net/apps/mediawiki/bfast/, respectively.

All source code can be compiled with g++, and a makefile that builds all
executable programs is provided. Note that the compiler must support OpenMP
to compile the source code. Compiled TSscan programs are also available
from our website at
http://idv.sinica.edu.tw/trees/TSscan/TSscan.html

2. Preparation

The initial input data include the reference sequences, the long read data 
and the short read data.

2.1 Reference sequences

The following three data sets are retrieved from the reference sequences 
(e.g., hg19 or GRCh37).

 (1) Data set 1: the whole reference genomic sequences.
 The complete set of reference genomic sequences should be downloaded from
 the UCSC Genome Browser. It includes the chromosome sequences, the
 mitochondrial genome, and the unplaced/unlocalized sequences
 (i.e., chr*_random and chrUn_*).
 
 (2) Data set 2: the processed mitochondrion genomic sequences.
 Mitochondrial genomes are circular. To comprehensively detect possible
 fusion sequences in them, we duplicate each mitochondrial genome and
 concatenate the two copies, so that reads spanning the arbitrary start/end
 point of the linear FASTA record can still be aligned. The resulting
 sequences are designated "processed mitochondrion genomic sequences". They
 can be generated from the mitochondrial genome (chrM.fa) with the following
 UNIX commands.

    head -n 1 chrM.fa > chrM.title
    grep -v "^>" chrM.fa > chrM.seq
    cat chrM.title chrM.seq chrM.seq > RepChrM.fa
 
 (3) Data set 3: the annotated RNA sequences. 
 The annotated RNAs are downloaded from the UCSC Genome Browser and the
 Ensembl Genome Browser (http://www.ensembl.org/).
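
As a quick sanity check on Data set 2, the sketch below builds a toy chrM.fa, applies the same three steps, and confirms that RepChrM.fa holds exactly twice as many bases under a single header. The toy sequence, the scratch directory and the variable names are assumptions made for illustration only.

```shell
# Work in a scratch directory so that real data are untouched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Toy mitochondrial genome (hypothetical 12-base sequence on two lines).
printf '>chrM\nGATCAC\nAGGTCT\n' > chrM.fa

# Same three steps as for Data set 2.
head -n 1 chrM.fa > chrM.title
grep -v '^>' chrM.fa > chrM.seq
cat chrM.title chrM.seq chrM.seq > RepChrM.fa

# Count bases (non-header characters with newlines removed).
orig_len=$(grep -v '^>' chrM.fa | tr -d '\n' | wc -c | tr -d ' ')
rep_len=$(grep -v '^>' RepChrM.fa | tr -d '\n' | wc -c | tr -d ' ')
n_headers=$(grep -c '^>' RepChrM.fa)

echo "chrM: $orig_len bases; RepChrM: $rep_len bases; headers: $n_headers"
```

On real data, the second number should be twice the first and the header count should be 1.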

To minimize mapping errors due to unsequenced gaps, we recommend detecting
trans-splicing candidates in a model species with high-quality genomic
sequences and annotations.

2.2 Long read data

The poly(A) tails of the long 454 reads should be removed, and the raw
sequencing data of the long 454 reads should be converted to FASTA format.
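
A minimal sketch of poly(A) trimming on FASTA input is given below. It treats a run of at least 10 trailing A's as a tail. That threshold, the file names and the choice to ignore poly(T) stretches at the 5' end are assumptions, not TSscan requirements. A dedicated trimming tool may be preferable for real data.

```shell
workdir=$(mktemp -d)

# Toy long reads (hypothetical): read1 ends in a poly(A) tail longer than
# the threshold, read2 ends in a single A that must be kept.
printf '>read1\nACGTACGT\nGGCCAAAAAAAAAAAA\n>read2\nTTGACCA\n' > "$workdir/longreads.raw.fa"

# Join wrapped sequence lines per record, then strip a trailing run of
# min_a or more A's (case-insensitive).
awk -v min_a=10 '
function flush_record() {
    if (name == "") return
    n = length(seq)
    while (n > 0 && toupper(substr(seq, n, 1)) == "A") n--
    if (length(seq) - n >= min_a) seq = substr(seq, 1, n)
    print name
    print seq
}
/^>/ { flush_record(); name = $0; seq = ""; next }
     { seq = seq $0 }
END  { flush_record() }
' "$workdir/longreads.raw.fa" > "$workdir/longreads.fa"

cat "$workdir/longreads.fa"
```

The output is written with one sequence line per record, which BLAT accepts.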

2.3 Short read data

The raw sequencing data of the short reads should be converted to FASTQ
format.

After that, place all data sets and the TSscan files in the same folder.
While TSscan is running, do not move or rename any file.

3. The Pipeline of TSscan

The TSscan processes include the following steps (see Fig. 1).

 Step 1: Identifying chimeric RNA candidates by BLAT-aligning long reads
 against the reference genome.

 1.1: Mapping the long reads onto Data set 1 (the whole reference genomic
 sequences) with BLAT
  
 Example: blat RefGenome.fa longreads.fa out_step1_1.psl
  
 Note: If the BLAT alignments are run chromosome by chromosome, all results
 should be merged into a single psl-formatted file and sorted by long read
 ID (i.e., the "query ID", the 10th column of the psl-formatted file).
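
 One possible way to merge and sort per-chromosome BLAT results is sketched below. It assumes that each psl file starts with the standard 5-line header and that keeping one header in the merged file is acceptable. Whether TSscan1of4 expects the header is not stated here, so check this against your own run. The file names chr1.psl and chr2.psl and the helper make_psl are hypothetical.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Toy per-chromosome BLAT outputs: a 5-line header followed by
# tab-separated rows. Only column 10 (query ID) matters here.
make_psl() {   # usage: make_psl FILE QUERY_ID...
    out=$1; shift
    printf 'psLayout version 3\n\nmatch\theader\n\theader\n-----\n' > "$out"
    for q in "$@"; do
        printf '100\t0\t0\t0\t0\t0\t0\t0\t+\t%s\t100\t0\t100\tchrN\t1000\t0\t100\t1\t100,\t0,\t0,\n' "$q" >> "$out"
    done
}
make_psl chr1.psl readC readA
make_psl chr2.psl readB readA

# Keep one header, drop the header of every file, and sort the rows by
# query ID (column 10, tab-delimited).
{
    head -n 5 chr1.psl
    for f in chr1.psl chr2.psl; do tail -n +6 "$f"; done |
        sort -t "$(printf '\t')" -k10,10
} > out_step1_1.psl

# Show the query IDs in their merged order.
tail -n +6 out_step1_1.psl | cut -f 10
```

 If the per-chromosome runs were made with BLAT's -noHead option, the head and tail steps are unnecessary and the files can be concatenated and sorted directly.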

 1.2 TSscan1of4 out_step1_1.psl longreads.fa out_step1_2.fa

 Usage:
 TSscan1of4 [psl] [fasta] [output] 
 [psl]       the result of the BLAT-alignment between the long reads and the
 reference genome.  
 [fasta]     the long reads in a fasta format. 
 [output]    name of the output file.
 

 1.3 Mapping the output file of Step 1.2 onto Data set 2 and Data set 3
 (the processed mitochondrion genomic sequences and the annotated RNA
 sequences) with BLAT
 
 Example: blat out_step1_2.fa longreads.fa out_step1_3.psl

 1.4 TSscan2of4 out_step1_3.psl longreads.fa out_step1_4.fa

 Usage:
 TSscan2of4 [psl] [fasta] [output]
 [psl]       the output file of Step 1.3.
 [fasta]     the long reads in a fasta format.
 [output]    name of the output file. 

 1.5 Mapping the output file of Step 1.4 onto the unplaced/unlocalized
 sequences (i.e., chr*_random and chrUn_*) with BLAT
 
 Example: blat out_step1_4.fa longreads.fa out_step1_5.psl
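
 Step 1.5 refers to the unplaced/unlocalized sequences on their own. If only the whole reference (Data set 1) is at hand, one possible way to extract them is sketched below. It assumes UCSC-style sequence names. The output name RefUnplaced.fa and the toy sequences are hypothetical.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Toy reference with UCSC-style names (hypothetical sequences).
printf '>chr1\nACGT\n>chr1_gl000191_random\nGGCC\nTTAA\n>chrM\nGATC\n>chrUn_gl000211\nCCGG\n' > RefGenome.fa

# On each header line, decide whether the record is chr*_random or chrUn_*.
# The bare "keep" pattern then prints the header and its sequence lines.
awk '
/^>/ { keep = ($1 ~ /_random$/ || $1 ~ /^>chrUn_/) }
keep
' RefGenome.fa > RefUnplaced.fa

grep '^>' RefUnplaced.fa
```

 Other non-chromosomal records, such as alternate haplotypes, are not selected by this pattern.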

 1.6 TSscan3of4 out_step1_5.psl longreads.fa out_step1_6.fa

 Usage:
 TSscan3of4 [psl] [fasta] [output] 
 [psl]       the output file of Step 1.5.
 [fasta]     the long reads in a fasta format.
 [output]    name of the output file.

 Step 2: Excluding candidates that lack support from short RNA-Seq reads.
 
 2.1 Mapping the short reads onto the output file of Step 1.6 with BFAST
 
 Note: Please see the BFAST page at
 http://sourceforge.net/projects/bfast/files/ for details.
 
 2.2 
 For Illumina RNA-Seq reads:
 cat out_step2_1.sam | ./TSscanSamParser.NT out_step1_6.fa > out_step2_2.sam 

 For color space reads (SOLiD reads):
 cat out_step2_1.sam | ./TSscanSamParser.CS50 out_step1_6.fa > out_step2_2.sam 

 Note: This step parses the SAM output of Step 2.1 against the output of
 Step 1.6. In the current version, Illumina RNA-Seq reads may be at most
 50 bases long, and color space reads must be exactly 50 bases long.
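
 Because of the read-length limits in the note above, it can help to tabulate read lengths before mapping. The sketch below assumes 4-line FASTQ records (no wrapped sequence lines). The toy reads and the file name read_lengths.txt are hypothetical.

```shell
workdir=$(mktemp -d)

# Toy FASTQ with reads of 4 and 6 bases.
printf '@r1\nACGT\n+\nIIII\n@r2\nACGTAC\n+\nIIIIII\n' > "$workdir/shortreads.fastq"

# Tally the length of every sequence line (line 2 of each 4-line record)
# and print "length count" pairs in ascending order of length.
awk 'NR % 4 == 2 { count[length($0)]++ }
     END { for (len in count) print len, count[len] }' "$workdir/shortreads.fastq" |
    sort -n > "$workdir/read_lengths.txt"

cat "$workdir/read_lengths.txt"
```

 For color space FASTQ files the sequence line may also carry a leading adaptor base, so the reported length should be interpreted accordingly.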
 
 2.3 cat shortreads.fastq | ./FastqOut out_step2_2.sam 1 > out_step2_3.fastq
 
 Note: This step extracts the short reads that remain in the output SAM
 file of Step 2.2.
 
 2.4 Mapping the output file of Step 2.3 onto Data sets 1-3 with BFAST.
 All resulting SAM files are then merged into a single SAM file.
 
 Note: Please see the BFAST page at
 http://sourceforge.net/projects/bfast/files/ for details.
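
 The merge in Step 2.4 can be sketched as follows, assuming one SAM file per reference data set. Header lines are written first, each distinct line once, followed by all alignment lines. Whether TSscan4of4 reads the header lines is not stated here, so treat the header handling as an assumption. The names set1.sam and set2.sam are hypothetical.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Toy SAM files, one per reference data set.
printf '@HD\tVN:1.0\n@SQ\tSN:chr1\tLN:1000\nr1\t0\tchr1\t10\t255\t4M\t*\t0\t0\tACGT\tIIII\n' > set1.sam
printf '@HD\tVN:1.0\n@SQ\tSN:chrM\tLN:500\nr2\t0\tchrM\t5\t255\t4M\t*\t0\t0\tGGCC\tIIII\n' > set2.sam

# Header lines first (each distinct line once, in first-seen order),
# then every alignment line from every file.
{
    awk '/^@/ && !seen[$0]++' set1.sam set2.sam
    awk '!/^@/' set1.sam set2.sam
} > out_step2_4.sam

cat out_step2_4.sam
```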
 
 2.5 TSscan4of4 out_step2_2.sam out_step2_4.sam longreads.fa out_step2_5.out

 Usage:
 TSscan4of4 [sam1] [sam2] [fasta] [output] 
 [sam1]     result file of mapping short reads to junction sequences
 (in a SAM format).
 [sam2]     result file of mapping short reads to the reference genomic
 sequences (in a SAM format).
 [fasta]    the long reads in a fasta format.
 [output]   name of the output file.
   
 After that, users can manually filter out potential experimental
 artifacts (Step 3 of Fig. 1) and potential genetic rearrangement events
 (Step 4 of Fig. 1) using the criteria stated in the text and Fig. 1.