-- still under construction --

Introduction

What is ncSplice?

ncSplice is a collection of ruby scripts for identifying circular, intra-chromosomal and inter-chromosomal junction reads in Illumina RNA sequencing data. Reads indicative of these splicing events are often discarded by standard mapping tools and marked as unmapped. ncSplice takes the leftmost and rightmost 20 base pairs (the anchors) of each unmapped read and maps them independently to the reference genome. Different filtering criteria are then applied to identify circular, intra- and inter-chromosomal candidate reads.

1. Circular RNAs (circRNAs)

CircRNAs are characterized by their atypical splicing behavior, in which the 3’-end of an exon is spliced to the 5’-end of an upstream exon (head-to-tail configuration). To be counted as a candidate read, anchor pairs have to

  • be on the same chromosome
  • map to the same strand
  • be no more than 100 kb apart
2. Intra-chromosomal fusions

Intra-chromosomal fusion reads are reads for which the anchor pair maps

  • on the same chromosome
  • at least 1 Mbp apart
  • on the same or on different strands (the strand orientation of the two anchors is not restricted)
3. Inter-chromosomal fusions

Inter-chromosomal fusion reads are reads for which the anchor pair maps

  • on different chromosomes
  • on the same or on different strands (the strand orientation of the two anchors is not restricted)
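
The three sets of criteria above can be summarized in a small sketch. This is not taken from the ncSplice code: the `Anchor` struct, the constant names and `classify_pair` are hypothetical, the minimum intra-chromosomal distance is assumed to be 1 Mbp, and the head-to-tail order that a circular junction needs is not checked here.

```ruby
# Hypothetical container for one mapped anchor.
Anchor = Struct.new(:chr, :pos, :strand)

MAX_CIRC_DISTANCE  = 100_000    # circRNA anchors at most 100 kb apart
MIN_INTRA_DISTANCE = 1_000_000  # assumption: "1 Mbp" for intra-chromosomal fusions

# Returns :circular, :intra, :inter, or nil if no rule matches.
def classify_pair(a, b)
  if a.chr != b.chr
    :inter
  else
    distance = (a.pos - b.pos).abs
    if a.strand == b.strand && distance <= MAX_CIRC_DISTANCE
      :circular
    elsif distance >= MIN_INTRA_DISTANCE
      :intra
    end
  end
end
```

Pairs on the same chromosome that fit neither distance rule return `nil` and would simply not be treated as candidates in this sketch.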

For what data types does ncSplice work?

Currently, the detection works for unstranded, single-end Illumina sequencing data only. Further library options (paired-end, stranded) will be implemented soon. ncSplice uses Bowtie2 to map reads.

Detection

The detection of circRNAs, intra- and inter-chromosomal fusions follows similar steps, with adaptations for each splicing type. The general outline is:

1. Creation of anchors from unmapped reads

Take the fastq-file containing the unmapped reads and write an output fastq-file with the corresponding anchors. Each anchor gets a new qname built from the original qname, the mate information (1 for single-end data), the read sequence and a terminal A or B indicating which side of the read the anchor was taken from (A = left, B = right): @HWI-D00108:213:C3U67ACXX:1:1106:5218:96673_1_...TTTCTGTGAG..._A
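
As an illustration of the naming scheme, the anchor records for one read could be built roughly as follows. `anchors_for` is a hypothetical helper, not the function used in ncSplice, and skipping reads shorter than two anchors is an assumption of this sketch.

```ruby
# Build the two anchor records (four fastq lines each) for one unmapped read.
# qname may be given with or without the leading "@".
def anchors_for(qname, seq, qual, anchor_len = 20, mate = 1)
  # Assumption: reads too short to yield two non-overlapping anchors are skipped.
  return [] if seq.length < 2 * anchor_len

  base = "#{qname.sub(/\A@/, '')}_#{mate}_#{seq}"
  [
    ["@#{base}_A", seq[0, anchor_len],           "+", qual[0, anchor_len]],
    ["@#{base}_B", seq[-anchor_len, anchor_len], "+", qual[-anchor_len, anchor_len]]
  ]
end

# Usage: print both anchors of a short example read with 3 bp anchors.
anchors_for("read1", "ACGTACGTAC", "IIIIIIIIII", 3).each { |rec| puts rec }
```

Keeping the full read sequence in the qname means the read can be recovered from the anchor alignment alone during the later extension step.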

2. 1st mapping round

Anchor mapping with bowtie2, output sorted according to read name.

3. Seed extension

Anchor alignments are read from the bowtie2 output bam-file. Each anchor pair is checked against the conditions of the respective splice type (see Introduction). If the conditions are met, the anchors are extended, allowing a maximum of 1 mismatch on each side. If the read can be fully extended, it is written to the output file. Reads for which R1 and R2 fall on the same or different junction are removed.
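
The extension idea can be sketched as below. This is a simplified illustration and not the ncSplice implementation: all method names are hypothetical, both anchors are assumed to map to the forward strand, positions are 0-based, and a read counts as fully extended when the two extensions meet.

```ruby
# Extend the left anchor (A) to the right along ref.
# Returns the last read index that still matches, tolerating up to max_mm mismatches.
def extend_right(read, ref, ref_start, anchor_len, max_mm = 1)
  mismatches = 0
  last = anchor_len - 1
  (anchor_len...read.length).each do |i|
    base = ref[ref_start + i]
    break if base.nil?               # ran off the end of the reference
    if read[i] == base
      last = i
    else
      mismatches += 1
      break if mismatches > max_mm
    end
  end
  last
end

# Extend the right anchor (B) to the left along ref.
# ref_start is the reference position of the first base of anchor B.
def extend_left(read, ref, ref_start, anchor_len, max_mm = 1)
  mismatches = 0
  anchor_first = read.length - anchor_len
  first = anchor_first
  (anchor_first - 1).downto(0) do |j|
    pos = ref_start - (anchor_first - j)
    break if pos < 0                 # ran off the start of the reference
    if read[j] == ref[pos]
      first = j
    else
      mismatches += 1
      break if mismatches > max_mm
    end
  end
  first
end

# The read is fully extended if the two extensions leave no gap between them.
def fully_extended?(read, ref_a, start_a, ref_b, start_b, anchor_len = 20)
  extend_right(read, ref_a, start_a, anchor_len) >=
    extend_left(read, ref_b, start_b, anchor_len) - 1
end
```

Passing separate reference sequences for the two anchors lets the same sketch cover inter-chromosomal pairs. For pairs on one chromosome, the same sequence is passed twice.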

4. Collision of reads on candidate loci
5. Conversion of loci into fasta format

Each breakpoint is converted into fasta format by joining the 100 bp upstream and the 100 bp downstream of the breakpoint.
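
A minimal sketch of this conversion is shown below. `junction_record` and its arguments are hypothetical, coordinates are 0-based, and strand handling (reverse-complementing one side where needed) is left out.

```ruby
# Build one fasta record for a junction.
# left_end:    index of the last base before the breakpoint on the left side
# right_start: index of the first base after the breakpoint on the right side
def junction_record(id, left_seq, left_end, right_seq, right_start, flank = 100)
  left_from = [left_end - flank + 1, 0].max   # do not run past the sequence start
  left  = left_seq[left_from..left_end]
  right = right_seq[right_start, flank]
  ">#{id}\n#{left}#{right}\n"
end

# Usage with toy sequences and a 3 bp flank.
print junction_record("junction_1", "AAAACCCC", 5, "GGGGTTTT", 2, 3)
```

With the default flank of 100 bp, the breakpoint lies between positions 100 and 101 of every junction sequence, which makes the later overlap checks straightforward.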

6. 2nd remapping round on junction index

An index based on the detected junctions is created and all unmapped reads are remapped against it. This finds reads that span a junction with an overhang of less than 20 bp (but at least 8 bp).

7. Filtering of remapped reads and final candidate list

Uniquely remapped reads with a maximum of 2 mismatches are added to the first candidate list (if not already used).
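
One way to express this filter on bowtie2 SAM output is sketched below. It is an assumption-laden illustration rather than the ncSplice code: `keep_remapped?` is a hypothetical name, mismatches are read from bowtie2's `XM:i` tag, a read is treated as unique when no `XS:i` tag is present or `AS:i` is greater than `XS:i`, the alignment is assumed to be ungapped and unclipped, and the junction is assumed to sit after base 100 of each junction sequence.

```ruby
# Decide whether one SAM line from the junction remapping should be kept.
def keep_remapped?(sam_line, junction_pos: 100, min_overhang: 8, max_mismatches: 2)
  fields = sam_line.chomp.split("\t")
  return false if fields.length < 11
  return false if (fields[1].to_i & 4) != 0      # flag 0x4: read is unmapped

  start    = fields[3].to_i                      # 1-based leftmost position
  read_len = fields[9].length                    # assumes ungapped, unclipped alignment

  # Collect optional tags such as AS:i:-5 into a hash.
  tags = {}
  fields[11..-1].each do |tag|
    key, _type, value = tag.split(":", 3)
    tags[key] = value
  end
  return false unless tags.key?("XM")

  unique = tags["XS"].nil? || tags["AS"].to_i > tags["XS"].to_i
  left_overhang  = junction_pos - start + 1
  right_overhang = start + read_len - 1 - junction_pos

  unique &&
    tags["XM"].to_i <= max_mismatches &&
    left_overhang >= min_overhang &&
    right_overhang >= min_overhang
end
```

Reads that pass would then be merged into the candidate list from the first round, skipping any read already counted there.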

8. A post-filtering step should be applied to remove low-coverage junctions and extreme outliers.

Requirements

To run ncSplice, bowtie 1 or bowtie 2 and samtools have to be added to path. They can be downloaded and installed from:

ncSplice was developed on ruby 2.0.0. The latest ruby version can be downloaded from

ncSplice will not work on versions < 1.9.2. These versions do not contain the built-in method require_relative, which is used in ncSplice.
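
Both requirements can be checked before a run with a few lines of ruby. The helper names are hypothetical and the tool names `bowtie2` and `samtools` are assumed to be the executable names on your system.

```ruby
# True if the given ruby version provides require_relative (1.9.2 or newer).
def ruby_supported?(version = RUBY_VERSION)
  Gem::Version.new(version) >= Gem::Version.new("1.9.2")
end

# True if an executable called tool is found in one of the PATH directories.
def on_path?(tool, path = ENV["PATH"].to_s)
  path.split(File::PATH_SEPARATOR).any? do |dir|
    candidate = File.join(dir, tool)
    File.file?(candidate) && File.executable?(candidate)
  end
end

# Usage: report anything that is missing.
warn "ruby #{RUBY_VERSION} is too old for ncSplice" unless ruby_supported?
%w[bowtie2 samtools].each do |tool|
  warn "#{tool} not found on PATH" unless on_path?(tool)
end
```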

How to run ncSplice

ncSplice was developed based on the tophat output, which by default consists of two bam files: accepted_hits.bam and unmapped.bam. For further analysis, the unmapped reads have to be converted into fastq-format. This can be done with the helper script readPreparation.rb. This step is not necessary if the unmapped reads are already in fastq-format. (add: what the fastq file should look like)

Usage: ncSplice.rb -u <unmapped.fastq> -p <prefix> -x <index-directory>/<bt2-index> -a <anchor-length> -l <read-length> -c <chromosomes>/*.fa -s <exclude.txt> [options]

-h, --help                       Display help screen.
-v, --version                    Print ncSplice version and dependencies.
-u, --unmapped <filename>        fastq file with unmapped reads
-q, --quality <integer>          Minimal phred quality unmapped reads need to have for further analysis.
    --sequencing-type <string>   Sequencing type, currently only single-end libraries supported
    --library-type <string>      Library type, currently only unstranded libraries supported
-p, --prefix <string>            Prefix for all files.
-x, --bowtie-index <directory>   Bowtie-index directory and base name: <index-directory>/<bt2-index>.
-a, --anchor-length <integer>    Length of the read anchor for remapping (default: 20 bp). Shorter anchors decrease the mapping precision, longer anchors reduce the number of candidates.
-l, --read-length <integer>      Length of the sequencing read.
-c, --chromosome-files <directory> Directory with chromosome fasta files, one fasta-file per chromosome.
-s, --skip-chr <filename>        Text file listing the chromosomes to exclude, such as the mitochondrial chromosome (recommended), with one chromosome per line.

To do

  1. Implementation of
    • paired-end option
    • different library options
  2. Write overall documentation
    • document potential errors
  3. Conversion to ruby gem

Authors

Franziska Gruhl, franziska.gruhl@unil.ch

Apr 2016