Ercc 1.1, Niassa-wrapped Cromwell (WDL) workflow for running ERCC spike-in analysis on RNA-seq data. The workflow counts total reads in fastq files, evaluates expression of ERCC spike-ins using the bwa aligner, and creates a report in .json format along with plots in .png and .pdf formats. The .json file contains RPKMs for all ERCC transcripts as well as other information. An RPKM table in .csv format is also provisioned.
java -jar cromwell.jar run ercc.wdl --inputs inputs.json
Parameter | Value | Description |
---|---|---|
fastqR1 | File | File with reads for mate 1, or the fastq file for single-read data |
mixId | String | LIMS-approved identification of the Spike-In Mix |
sampleId | String | The id used to identify the data in the report |
reference | String | hg19 or hg38 (assembly id) |
Parameter | Value | Default | Description |
---|---|---|---|
fastqR2 | File? | None | File with reads for mate 2, if available |
outputFileNamePrefix | String | basename(fastqR1,'.fastq.gz') | Prefix for the output file |
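For reference, a minimal inputs.json might look like the sketch below. All paths and identifiers are placeholders, and the `ercc.` prefix assumes the workflow in ercc.wdl is named `ercc`:

```json
{
  "ercc.fastqR1": "/path/to/sample_R1.fastq.gz",
  "ercc.fastqR2": "/path/to/sample_R2.fastq.gz",
  "ercc.mixId": "LIMS-approved Spike-In Mix id",
  "ercc.sampleId": "SAMPLE_001",
  "ercc.reference": "hg38"
}
```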
Parameter | Value | Default | Description |
---|---|---|---|
countTotal.jobMemory | Int | 8 | Memory allocated to this job |
countTotal.timeout | Int | 20 | Timeout in hours for this task |
countTranscripts.timeout | Int | 20 | Timeout in hours for this task |
countTranscripts.jobMemory | Int | 20 | Memory allocated to this job |
countTranscripts.threads | Int | 8 | Threads to use with bwa |
countTranscripts.cnv_file | String | "ercc_counts.csv" | Output file containing ERCC ids and their respective read counts |
rpkmTable.timeout | Int | 20 | Timeout in hours for this task |
rpkmTable.jobMemory | Int | 10 | Memory allocated to this job |
makeReport.imagingScriptPath | String | "$ERCC_SCRIPTS_ROOT/ercc_plots.R" | Path to the R script ercc_plots.R |
makeReport.rScript | String | "$RSTATS_ROOT/bin/Rscript" | Path to the Rscript command |
makeReport.jobMemory | Int | 10 | Memory allocated to this job |
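Task-level parameters can be overridden in the same inputs.json by fully qualifying them with the workflow name; for example (the values below are arbitrary, and the `ercc.` prefix is again an assumption about the workflow name):

```json
{
  "ercc.countTranscripts.threads": 16,
  "ercc.countTranscripts.jobMemory": 24,
  "ercc.rpkmTable.timeout": 30
}
```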
Output | Type | Description | Labels |
---|---|---|---|
rpkmData | File | ERCC readouts in RPKMs | vidarr_label: rpkmData |
image | File | .png with a plot (Dose Response supported at the moment) | vidarr_label: image |
pdf | File | .pdf with the same plot as the .png, a legacy output | vidarr_label: pdf |
json | File | .json file with ERCC numbers | vidarr_label: json |
This section lists command(s) run by the ercc workflow

- Running ercc

The ercc workflow extracts the spike-in signal (reads) and plots the QC graph(s).
The total number of reads is counted from the fastq file (countTotal task); each FASTQ record spans four lines, so paste - - - - joins them and wc -l reports the read count:

```
zcat FASTQ_R1 | paste - - - - | wc -l
```
Reads are aligned with bwa and alignments hitting ERCC transcripts are tallied per transcript id (countTranscripts task):

```
bwa mem -t THREADS -M REF_GENOME FASTQ_R1 FASTQ_R2 | samtools view -S - |
cut -f 3 | grep ERCC | sort | uniq -c | sed s/^\ *// > CNV_FILE
```
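After sed strips the leading whitespace produced by uniq -c, each line of CNV_FILE has the form `<read count> <ERCC id>` (e.g. `42 ERCC-00002`), which is the format the RPKM script below parses.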
RPKM values for all ERCC transcripts are computed by an embedded Python script (rpkmTable task):

```
python <<CODE
import os
ercc = os.path.expandvars("~{erccData}")
counts = "~{erccCounts}"
total = ~{totalReads}
output_file = "~{basename(erccCounts, '.csv')}_rpkm.csv"

# 1. Open file, read and collect length data
ercc_data = open(ercc, 'r')
length_hash = {}
for nextLine in ercc_data.readlines():
    nextFields = nextLine.split("\t")
    length_hash[nextFields[0]] = int(nextFields[4])
ercc_data.close()

# 2. Calculate RPKMs, put them in a hash
count_data = open(counts, 'r')
rpkm_hash = {}
count_hash = {}
for nextLine in count_data.readlines():
    nextLine = nextLine.strip("\n")
    countChunk = nextLine.split(" ")
    if len(countChunk) == 2 and countChunk[1] in length_hash.keys():
        count_hash[countChunk[1]] = int(countChunk[0])
pm = int(total)/1000000.0
for e in sorted(length_hash.keys()):
    rpkm = 0.0
    if e in count_hash.keys():
        rpkm = count_hash[e]/pm/(length_hash[e]/1000.0)
    rpkm_hash[e] = "{0:.2f}".format(rpkm)
count_data.close()

# 3. Print out RPKMs sorted by ERCC transcript id
f = open(output_file, "w+")
for e in sorted(rpkm_hash.keys()):
    f.write('\t'.join([e, rpkm_hash[e]]) + '\n')
f.close()
CODE
```
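For each transcript, RPKM = reads_mapped / (total_reads / 10^6) / (transcript_length / 10^3); for example, 500 ERCC reads on a 1,000 bp transcript in a library of 50,000,000 total reads give 500 / 50 / 1.0 = 10.00 RPKM.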
The QC plot(s) are generated from the RPKM table with the ercc_plots.R script (makeReport task):

```
Rscript IMAGING_SCRIPT RPKM_TABLE CONTROL_DATA ERCC_DATA PREFIX DoseResponse SAMPLES
```
For support, please file an issue on the GitHub project or send an email to gsi@oicr.on.ca.
Generated with generate-markdown-readme (https://github.com/oicr-gsi/gsi-wdl-tools/)