Manual of TSscan
1. System Requirements
The TSscan pipeline runs on a 64-bit Linux operating system (e.g.,
Bio-Linux 6; see http://nebc.nerc.ac.uk/ for more information). The BLAT
and BFAST aligners can be downloaded from http://genome.ucsc.edu/ (the UCSC
Genome Browser) and http://sourceforge.net/apps/mediawiki/bfast/, respectively.
All source code can be compiled with g++. A makefile that automatically
builds all executable programs is also provided. Note that the compiler must
support OpenMP to build the source code. The compiled programs of TSscan
are also available from our website at
http://idv.sinica.edu.tw/trees/TSscan/TSscan.html
2. Preparation
The initial input data include the reference sequences, the long read data
and the short read data.
2.1 Reference sequences
The following three data sets are derived from the reference sequences
(e.g., hg19 or GRCh37).
(1) Data set 1: the whole reference genomic sequences.
The whole reference genomic sequences should be downloaded in full from
the UCSC Genome Browser, including the chromosome sequences, the
mitochondrion genome, and the unplaced/unlocalized sequences
(i.e., chr*_random and chrUn_*).
(2) Data set 2: the processed mitochondrion genomic sequences.
The mitochondrion genome is circular. To comprehensively detect possible
fusion sequences in the mitochondrion genome, we make a copy of each
mitochondrion genome and join the two copies end to end. The resulting
sequences are designated "processed mitochondrion genomic sequences". They
can be generated from the mitochondrion genome (chrM.fa) with the
following UNIX commands.
head -1 chrM.fa > chrM.title
cat chrM.fa | grep -v "^>" > chrM.seq
cat chrM.title chrM.seq chrM.seq > RepChrM.fa
(3) Data set 3: the annotated RNA sequences.
The annotated RNAs are downloaded from the UCSC Genome Browser and the
Ensembl Genome Browser (http://www.ensembl.org/).
To minimize mapping errors due to unsequenced gaps, it would be better to
detect trans-splicing candidates on a model species with high-quality genomic
sequences and annotations.
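Step 1.5 below aligns against the unplaced/unlocalized sequences only. The following is a minimal sketch of how they might be extracted from Data set 1 with awk. It assumes UCSC-style sequence names (chr*_random and chrUn_*); the function name extract_unplaced and the output file name RefUnplaced.fa are hypothetical and are not part of the TSscan distribution.

```shell
# Sketch only: extract_unplaced is an illustrative helper, not part of TSscan.
# Prints every FASTA record whose name matches chr*_random or chrUn_*.
extract_unplaced() {
    awk '/^>/ { keep = ($1 ~ /^>chr[^ ]*_random$/ || $1 ~ /^>chrUn_/) } keep' "$1"
}

# Usage (file names are assumptions):
#   extract_unplaced RefGenome.fa > RefUnplaced.fa
```

The header test is re-evaluated at every ">" line, so all sequence lines of a matching record are printed and all other records are skipped.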
2.2 Long read data
The polyA tails of the 454-reads should be removed, and the raw sequencing
data of the long 454-reads should be converted into a fasta format.
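The manual does not prescribe a tool for removing polyA tails. As a rough sketch only, a trailing run of A bases could be trimmed from a fasta file with awk as shown here. The helper name trim_polya and the minimum run length of 10 are assumptions chosen for illustration; a dedicated trimming tool may be more appropriate for real data.

```shell
# Sketch only: trim_polya is an illustrative helper, not part of TSscan.
# $1 = fasta file; $2 = minimum length of the trailing A run to remove
# (defaults to 10, an assumed threshold).
trim_polya() {
    awk -v min="${2:-10}" '
        function emit() {
            if (name == "") return
            # Remove the trailing run of A/a if it is long enough.
            if (match(seq, /[Aa]+$/) && RLENGTH >= min)
                seq = substr(seq, 1, RSTART - 1)
            print name
            print seq
        }
        /^>/ { emit(); name = $0; seq = ""; next }
        { seq = seq $0 }      # join wrapped sequence lines
        END { emit() }
    ' "$1"
}

# Usage (file names are assumptions):
#   trim_polya longreads.raw.fa 10 > longreads.fa
```

Each sequence is written on a single line, which BLAT accepts.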
2.3 Short read data
The raw sequencing data of the short reads should be converted into a fastq
format.
After that, place all data sets and the TSscan files in the same folder.
While TSscan is running, do not move or rename any file.
3. The Pipeline of TSscan
The TSscan pipeline consists of the following steps (see Fig. 1).
Step 1: Identifying chimeric RNA candidates by BLAT-aligning long reads
against the reference genome.
1.1 Mapping the long reads onto Data set 1 (the whole reference genomic
sequences) by BLAT
Example: blat RefGenome.fa longreads.fa out_step1_1.psl
Note: If the BLAT alignments are run per chromosome, all the results
should be merged into a single psl-formatted file and sorted by long read
ID (i.e., "query ID", the 10th column of the psl-formatted file).
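The merging and sorting described in the note above could be done as in the following sketch. It assumes that every per-chromosome file carries the standard 5-line BLAT psl header and that plain byte-order sorting of the query ID is acceptable to TSscan1of4; neither assumption is confirmed by this manual. The helper name merge_psl and the per-chromosome file names are hypothetical.

```shell
# Sketch only: merge_psl is an illustrative helper, not part of TSscan.
# $1 = merged output file; remaining arguments = per-chromosome psl files.
merge_psl() {
    out=$1; shift
    head -n 5 "$1" > "$out"                      # keep one copy of the header
    for f in "$@"; do tail -n +6 "$f"; done |    # alignment lines only
        LC_ALL=C sort -t "$(printf '\t')" -s -k10,10 >> "$out"
}

# Usage (input file names are assumptions):
#   merge_psl out_step1_1.psl chr1.psl chr2.psl chrM.psl
```

The -s option keeps the original order of alignments that share a query ID.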
1.2 TSscan1of4 out_step1_1.psl longreads.fa out_step1_2.fa
Usage:
TSscan1of4 [psl] [fasta] [output]
[psl] the result of the BLAT-alignment between the long reads and the
reference genome.
[fasta] the long reads in a fasta format.
[output] name of the output file.
1.3 Mapping the output file of Step 1.2 onto Data set 2 and Data set 3
(the processed mitochondrion genomic sequences and the annotated RNA
sequences) by BLAT
Example: blat out_step1_2.fa longreads.fa out_step1_3.psl
1.4 TSscan2of4 RefRNA.blat longreads.fa out_step1_4.fa
Usage:
TSscan2of4 [psl] [fasta] [output]
[psl] the output file of Step 1.3.
[fasta] the long reads in a fasta format.
[output] name of the output file.
1.5 Mapping the output file of Step 1.4 onto the unplaced/unlocalized
sequences (i.e., chr*_random and chrUn_*) by BLAT
Example: blat out_step1_4.fa longreads.fa out_step1_5.psl
1.6 TSscan3of4 out_step1_5.psl longreads.fa out_step1_6.fa
Usage:
TSscan3of4 [psl] [fasta] [output]
[psl] the output file of Step 1.5.
[fasta] the long reads in a fasta format.
[output] name of the output file.
Step 2: Excluding candidates without the support of short RNA-Seq reads.
2.1 Mapping the short reads onto the output file of Step 1.6 by BFAST
Note: Please see the BFAST page at
http://sourceforge.net/projects/bfast/files/ for details.
2.2 Parsing the SAM file of Step 2.1 against the output of Step 1.6
For Illumina RNA-Seq reads:
cat out_step2_1.sam | ./TSscanSamParser.NT out_step1_6.fa > out_step2_2.sam
For color space reads (SOLiD reads):
cat out_step2_1.sam | ./TSscanSamParser.CS50 out_step1_6.fa > out_step2_2.sam
Note: In the current version, the length of Illumina RNA-Seq reads is
limited to 50 bases and the length of the color space reads must be
exactly 50 bases.
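Because of the read-length limit in the note above, it may help to check the short reads before running Step 2. The following sketch counts nucleotide-space reads longer than a given limit in a 4-line-per-record fastq file. The helper name count_long_reads is hypothetical, and the check does not apply as written to color space reads, whose sequence line is encoded differently.

```shell
# Sketch only: count_long_reads is an illustrative helper, not part of TSscan.
# $1 = fastq file (4 lines per record); $2 = length limit (defaults to 50).
count_long_reads() {
    awk -v max="${2:-50}" \
        'NR % 4 == 2 && length($0) > max { n++ } END { print n + 0 }' "$1"
}

# Usage:
#   count_long_reads shortreads.fastq 50
```

A result of 0 means that no read exceeds the limit.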
2.3 cat shortreads.fastq | ./FastqOut out_step2_2.sam 1 > out_step2_3.fastq
Note: Extracting short reads which remain in the output SAM file of
Step 2.2.
2.4 Mapping the output file of Step 2.3 onto Data sets 1 to 3 by BFAST.
All the resulting SAM files are then merged into a single SAM file.
Note: Please see the BFAST page at
http://sourceforge.net/projects/bfast/files/ for details.
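The manual does not state how the SAM files should be merged. One simple possibility, given here as a sketch only, is to write the header lines of all files first (dropping exact duplicates) and then all alignment records. The helper name merge_sam and the input file names are hypothetical, and whether TSscan4of4 needs the header lines at all is not confirmed here.

```shell
# Sketch only: merge_sam is an illustrative helper, not part of TSscan or BFAST.
# $1 = merged output file; remaining arguments = SAM files to merge.
merge_sam() {
    out=$1; shift
    grep -h '^@' "$@" | awk '!seen[$0]++' > "$out"   # header lines, duplicates dropped
    grep -h -v '^@' "$@" >> "$out"                   # alignment records
}

# Usage (input file names are assumptions):
#   merge_sam out_step2_4.sam set1.sam set2.sam set3.sam
```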
2.5 TSscan4of4 out_step2_2.sam out_step2_4.sam longreads.fa out_step2_5.out
Usage:
TSscan4of4 [sam1] [sam2] [fasta] [output]
[sam1] result file of mapping short reads to junction sequences
(in a SAM format).
[sam2] result file of mapping short reads to the reference genomic
sequences (in a SAM format).
[fasta] the long reads in a fasta format.
[output] name of the output file.
After that, the users can manually filter out potential experimental
artifacts (Step 3 of Fig. 1) and potential genetic rearrangement events
(Step 4 of Fig. 1) by the criteria stated in the text and Figure 1.