This pipelines processes data using the following steps:
- MultiQC - aggregate report, describing results of the whole pipeline
- FastQC - read quality control
- TrimGalore - adapter trimming (optional)
- STAR - alignment
- RSEM - Estimate gene and isoform expression from RNA-Seq data
- Picard CollectRnaSeqMetrics- RNA alignment metrics for a BAM file
- RSeQC - RNA quality control metrics
- deepTools - visualization
All results are written to the so called publish directory, which is results
by default (and used
as example below).
Many results (FastQC, STAR alignment stats etc.) are summarised in MultiQC plots, which
can be found in results/MultiQC/multiqc_report.html
. An example MultiQC report can be download
here
The MultiQC plots are a good starting point to explore your results.
Results per sample and process can be found in the respectively named sub directories.
All pipeline internal info (timings etc.) are written to results/pipeline_info/
. Most of this will not be
of interest to the average user.
MultiQC is a visualisation tool that generates a single HTML report summarising all samples in your project. Most of the pipeline QC results are visualised in the report and further statistics are available in within the report data directory.
- Directory:
results/MultiQC
Project_multiqc_report.html
- MultiQC report - a standalone HTML file that can be viewed in your web browser
Project_multiqc_data/
- Directory containing parsed statistics from the different tools used in the pipeline
See MultiQC homepage
FastQC gives general quality metrics about your reads. It provides information about the quality score distribution across your reads, the per base sequence content (%T/A/G/C). You get information about adapter contamination and other overrepresented sequences.
NB: The FastQC plots displayed in the MultiQC report shows untrimmed reads. They may contain adapter sequence and potentially regions with low quality. To see how your reads look after trimming, look at the FastQC reports in the
trim_galore
directory.
- Directory:
results/fastqc
sample_fastqc.html
- FastQC report, containing quality metrics for your untrimmed raw fastq files
zips/sample_fastqc.zip
- zip file containing the FastQC report, tab-delimited data file and plot images
For further reading and documentation see the FastQC help.
TrimGalore is used for (optional) removal of adapter contamination and trimming of low quality regions. TrimGalore uses Cutadapt for adapter trimming and runs FastQC after it finishes.
MultiQC reports the percentage of bases removed by TrimGalore in the General Statistics table, along with a line plot showing where reads were trimmed.
The directory results/trim_galore
contains FastQ files with quality and adapter trimmed reads for each sample, along with a log file describing the trimming.
sample_val_1.fq.gz
,sample_val_2.fq.gz
- Trimmed FastQ data, reads 1 and 2.
- NB: Only saved if
--saveTrimmed
has been specified.
logs/sample_val_1.fq.gz_trimming_report.txt
- Trimming report (describes which parameters that were used)
FastQC/sample_val_1_fastqc.zip
- FastQC report for trimmed reads
Single-end data will have slightly different file names and only one FastQ file per sample.
sample_trimmed.fq.gz
- Trimmed FastQ data, read 1.
FastQC/sample_trimmed_fastqc.zip
- FastQC report for trimmed reads
STAR is a read aligner designed for RNA sequencing. STAR stands for Spliced Transcripts Alignment to a Reference, it produces results comparable to TopHat (the aligned previously used by NGI for RNA alignments) but is much faster.
The STAR section of the MultiQC report shows a bar plot with alignmen:wt rates: good samples should have most reads as Uniquely mapped and few Unmapped reads.
- Directory:
results/STAR
Sample_Aligned.sortedByCoord.out.bam
- The aligned BAM file
Sample_Log.final.out
- The STAR alignment report, contains mapping results summary
Sample_Log.out
andSample_Log.progress.out
- STAR log files, containing a lot of detailed information about the run. Typically only useful for debugging purposes.
Sample_SJ.out.tab
- Filtered splice junctions detected in the mapping
RSEM is a software package for estimating gene and isoform expression levels from RNA-Seq data by rsem-calculate-expression and rsem-plot-model for visulazing the model learned.
The following plot includes fragment length distribution, mate length distribution, read start position distribution (RSPD), quality score vs observed quality given a reference base, position vs percentage of sequencing error given a reference base and alignment statistics.
RSEM generates mapping statistics, which are plotted by MultiQC. See the following examples:
Sample.genes.results
- genes expression count matrix
Sample.isoforms.results
- isoforms expression count matrix
Sample..pdf
- rsem-plot-model
This produces RNA alignment metrics for alignments. It takes a SAM/BAM file containing the aligned reads from an RNAseq experiment and produces metrics describing the distribution of the bases within the transcripts. It calculates the total numbers and the fractions of nucleotides within specific genomic regions including untranslated regions (UTRs), introns, intergenic sequences (between discrete genes), and peptide-coding sequences (exons). This tool also determines the numbers of bases that pass quality filters that are specific to Illumina data (PF_BASES). For more information please see the corresponding GATK Dictionary entry.
Other metrics include the median coverage (depth), the ratios of 5 prime /3 prime-biases, and the numbers of reads with the correct/incorrect strand designation. The 5 prime /3 prime-bias results from errors introduced by reverse transcriptase enzymes during library construction, ultimately leading to the over-representation of either the 5 prime or 3 prime ends of transcripts.
See results/RnaSeqMetrics
Sample_RNA_Metrics.txt
- RNA alignment metrics
RSeQC is a package of scripts designed to evaluate the quality of RNA seq data. You can find out more about the package at the RSeQC website.
This pipeline runs several, but not all RSeQC scripts. All of these results are summarised within the MultiQC report and described below.
See results/rseqc
These are all quality metrics files and contains the raw data used for the plots in the MultiQC report. In general, the .r
files are R scripts for generating the figures, the .txt
are summary files, the .xls
are data tables and the .pdf
files are summary figures.
Output: Sample_bam_stat.txt
This script gives numerous statistics about the aligned BAM files produced by STAR. A typical output looks as follows:
#Output (all numbers are read count)
#==================================================
Total records: 41465027
QC failed: 0
Optical/PCR duplicate: 0
Non Primary Hits 8720455
Unmapped reads: 0
mapq < mapq_cut (non-unique): 3127757
mapq >= mapq_cut (unique): 29616815
Read-1: 14841738
Read-2: 14775077
Reads map to '+': 14805391
Reads map to '-': 14811424
Non-splice reads: 25455360
Splice reads: 4161455
Reads mapped in proper pairs: 21856264
Proper-paired reads map to different chrom: 7648
MultiQC plots each of these statistics in a dot plot. Each sample in the project is a dot - hover to see the sample highlighted across all fields.
Output:
Sample_read_duplication.DupRate_plot.pdf
Sample_read_duplication.DupRate_plot.r
Sample_read_duplication.pos.DupRate.xls
Sample_read_duplication.seq.DupRate.xls
This plot shows the number of reads (y-axis) with a given number of exact duplicates (x-axis). Most reads in an RNA-seq library should have a low number of exact duplicates. Samples which have many reads with many duplicates (a large area under the curve) may be suffering excessive technical duplication.
Output: Sample_read_distribution.txt
This tool calculates how mapped reads are distributed over genomic features. A good result for a standard RNA seq experiments is generally to have as many exonic reads as possible (CDS_Exons
). A large amount of intronic reads could be indicative of DNA contamination in your sample or some other problem.
This tool takes an alignment of reads from star as input (BAM file) and generates a coverage track (bigWig) as output. Genomic-coordinate files can be visualized by both UCSC Genome browser and Broad Institute's Integrative Genomics Viewer (IGV). Transcript-coordinate files can be visualized by IGV.
It generate two independent bigWig files for all reads on the forward and reverse strand, respectively.
Sample_fwd.bw
Sample_rev.bw