This pipeline is for RNASeq analysis and runs STAR, followed by RNA-SeQC, RSEM and, optionally, Cuffdiff. It should be suitable for most types of RNASeq, except small RNASeq.
Note that this pipeline is not suitable for bacterial data, as its parameters are tuned towards large, spliced genomes and it requires a good-quality genome annotation.
Reads are aligned to the given reference genome using the STAR mapper. See cfg/references.yaml for the references used by default (also refer to the option --references-cfg). The output of STAR includes the uniquely mapped genome BAM file, the transcriptome-mapped BAM file, a gene-based read count matrix, bigwig files etc. (see below). For running STAR we follow the recipes given here.
Transcript and gene expression abundances are estimated by STAR and RSEM (reusing STAR's BAM file). The RSEM results matrix contains mapped read counts and TPM (normalized values) for genes and isoforms.
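To illustrate what the TPM normalization mentioned above means, here is a minimal sketch of the standard TPM definition in Python. It is not code from the pipeline (RSEM computes TPM internally); the function name and the example numbers are made up for illustration.

```python
def counts_to_tpm(counts, eff_lengths):
    """Convert per-feature read counts to TPM (transcripts per million).

    counts and eff_lengths are parallel sequences. Features with a
    non-positive effective length contribute a rate of 0.
    """
    # Length-normalized rate per feature: reads per base
    rates = [c / l if l > 0 else 0.0 for c, l in zip(counts, eff_lengths)]
    total = sum(rates)
    if total == 0:
        return [0.0] * len(rates)
    # Scale so that all values sum to one million
    return [r / total * 1e6 for r in rates]


# Hypothetical example: three genes with equal counts but different lengths
tpm = counts_to_tpm([100, 100, 100], [1000, 2000, 4000])
```

With equal counts, the shortest gene gets the highest TPM, because the value reflects reads per base rather than reads per gene.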
The pipeline also provides generic stats, coverage, mappability and QC, e.g. by running RNA-SeQC.
Cuffdiff can be run optionally (slow!): it will run in Cufflinks mode, with no differential analysis carried out, to get raw fragment counts of genes and isoforms in addition to Cufflinks' FPKM. The different options for the stranded argument are translated as follows (see also http://chipster.csc.fi/manual/library-type-summary.html):
- none (default): fr-unstranded
- reverse: fr-firststrand
- forward: fr-secondstrand
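The translation above amounts to a simple lookup. The following Python sketch shows one way to express it; the dictionary and function names are hypothetical and not taken from the pipeline's Snakefile.

```python
# Mapping copied from the list above; names are illustrative only.
STRANDED_TO_LIBRARY_TYPE = {
    "none": "fr-unstranded",
    "reverse": "fr-firststrand",
    "forward": "fr-secondstrand",
}


def library_type(stranded="none"):
    """Return the Cufflinks/Cuffdiff library type for a stranded setting."""
    try:
        return STRANDED_TO_LIBRARY_TYPE[stranded]
    except KeyError:
        valid = ", ".join(sorted(STRANDED_TO_LIBRARY_TYPE))
        raise ValueError(
            f"unknown stranded value {stranded!r}; expected one of: {valid}"
        ) from None
```

Rejecting unknown values early avoids silently falling back to an unstranded analysis on a typo.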
Expression values of genes and isoforms are provided with annotation in all run methods.
Note, STAR is very memory hungry. Because its shared memory option can cause trouble when jobs fail (memory needs to be cleared manually), we do not make use of it, even in a multi-sample setting.
The following lists the most important files/directories that are created in correspondingly named subfolders:
- Mapped genome BAM: {sample}_{genome}_Aligned.sortedByCoord.out.bam
- Mapped transcriptome BAM (RSEM input): {sample}_{genome}_Aligned.toTranscriptome.out.bam
- Visualization: Bigwig files (*.bw)
- Read count (genes): {sample}_{genome}_ReadsPerGene.out.tab
- Mappability stats: {sample}_{genome}_Log.final.out
Exact STAR mapping parameters can be looked up in the Snakefile.
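As a sketch of how STAR's per-gene count file (ReadsPerGene.out.tab) could be read downstream: the file is tab-separated with four columns (gene ID, unstranded counts, counts for reads on the gene's strand, counts for reads on the opposite strand), and its first four rows are summary counters. The function name is hypothetical, and the assumption that the pipeline's "forward" and "reverse" settings correspond to columns 3 and 4 should be checked against your library protocol.

```python
import csv

# Zero-based column index per strandedness setting.
# Assumption: "forward" = column 3, "reverse" = column 4 of the file.
COUNT_COLUMN = {"none": 1, "forward": 2, "reverse": 3}

# Summary rows that STAR writes before the per-gene rows
SUMMARY_KEYS = {"N_unmapped", "N_multimapping", "N_noFeature", "N_ambiguous"}


def read_star_gene_counts(lines, stranded="none"):
    """Split ReadsPerGene.out.tab lines into (summary, gene_counts) dicts."""
    col = COUNT_COLUMN[stranded]
    summary, counts = {}, {}
    for row in csv.reader(lines, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        target = summary if row[0] in SUMMARY_KEYS else counts
        target[row[0]] = int(row[col])
    return summary, counts


# Made-up example content; real files come from the STAR output folder
example = [
    "N_unmapped\t10\t10\t10\n",
    "N_multimapping\t5\t5\t5\n",
    "N_noFeature\t3\t20\t4\n",
    "N_ambiguous\t1\t0\t1\n",
    "GENE_A\t7\t1\t6\n",
    "GENE_B\t12\t0\t12\n",
]
summary, counts = read_star_gene_counts(example, stranded="reverse")
```

For a real file, pass an open file handle instead of the example list, e.g. `with open(path) as fh: read_star_gene_counts(fh)`.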
- Gene expression values with annotation: {sample}_{genome}_RSEM.genes.results.desc
- Isoform expression values with annotation: {sample}_{genome}_RSEM.isoforms.results.desc
- Visualization: Wiggle files (*.wig)
- Plots: {sample}_{genome}_RSEM.pdf
- QC, rRNA rate and distribution of reads on transcripts: countMetrics.html, metrics.tsv
- Gene expression values with annotation: {sample}_{genome}_genes_FPKM_Rawreadcount_GIS.txt
- Genes with raw fragment and FPKM values: genes.read_group_tracking
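A minimal Python sketch for pulling the raw fragment counts and FPKM values out of genes.read_group_tracking. It assumes the usual Cuffdiff tab-separated layout with a header row containing the columns tracking_id, condition, replicate, raw_frags and FPKM; the function name and the example rows are made up.

```python
import csv


def read_group_tracking(lines):
    """Map (tracking_id, condition, replicate) to (raw_frags, FPKM)."""
    table = {}
    for row in csv.DictReader(lines, delimiter="\t"):
        key = (row["tracking_id"], row["condition"], row["replicate"])
        table[key] = (float(row["raw_frags"]), float(row["FPKM"]))
    return table


# Made-up example content in the assumed column layout
example = [
    "tracking_id\tcondition\treplicate\traw_frags\tinternal_scaled_frags"
    "\texternal_scaled_frags\tFPKM\teffective_length\tstatus\n",
    "GENE_A\tq1\t0\t120\t118.5\t118.5\t15.2\t-\tOK\n",
    "GENE_B\tq1\t0\t0\t0\t0\t0\t-\tOK\n",
]
table = read_group_tracking(example)
```

Keying on condition and replicate as well as the gene ID keeps rows distinct if the file contains more than one replicate.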