Ercc 1.1, Niassa-wrapped Cromwell (WDL) workflow for running ERCC spike-in analysis on RNA-seq data. The workflow counts total reads in fastq files, evaluates expression of ERCC spike-ins using the bwa aligner, and creates a report in .json format along with plots in .png and .pdf formats. The .json file contains RPKMs for all ERCC transcripts as well as other information. An RPKM table in .csv format is also provisioned.
java -jar cromwell.jar run ercc.wdl --inputs inputs.json
Parameter | Value | Description |
---|---|---|
fastqR1 | File | File with reads for mate 1, or the fastq file for single-read data |
mixId | String | LIMS-approved identification of the Spike-In Mix |
sampleId | String | The id used to identify the data in the report |
reference | String | hg19 or hg38 (assembly id) |
Parameter | Value | Default | Description |
---|---|---|---|
fastqR2 | File? | None | File with reads for mate 2, if available |
outputFileNamePrefix | String | basename(fastqR1,'.fastq.gz') | Prefix for the output file |
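For reference, a minimal inputs.json might look like the sketch below. All paths and identifiers are placeholders, and the `ercc.` prefix assumes the workflow in ercc.wdl is named `ercc`:

```json
{
  "ercc.fastqR1": "/path/to/sample_R1.fastq.gz",
  "ercc.fastqR2": "/path/to/sample_R2.fastq.gz",
  "ercc.mixId": "LIMS-approved Spike-In Mix id",
  "ercc.sampleId": "SAMPLE_001",
  "ercc.reference": "hg38"
}
```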
Parameter | Value | Default | Description |
---|---|---|---|
countTotal.jobMemory | Int | 8 | Memory allocated to this job |
countTotal.timeout | Int | 20 | Timeout in hours for this task |
countTranscripts.timeout | Int | 20 | Timeout in hours for this task |
countTranscripts.jobMemory | Int | 20 | Memory allocated to this job |
countTranscripts.threads | Int | 8 | Threads to use with bwa |
countTranscripts.cnv_file | String | "ercc_counts.csv" | Output file containing ERCC ids and their respective read counts |
rpkmTable.timeout | Int | 20 | Timeout in hours for this task |
rpkmTable.jobMemory | Int | 10 | Memory allocated to this job |
makeReport.imagingScriptPath | String | "$ERCC_SCRIPTS_ROOT/ercc_plots.R" | Path to the R script ercc_plots.R |
makeReport.rScript | String | "$RSTATS_ROOT/bin/Rscript" | Path to the Rscript command |
makeReport.jobMemory | Int | 10 | Memory allocated to this job |
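Task-level parameters can be overridden in the same inputs.json by fully qualifying them with the workflow name; for example (the values below are arbitrary, and the `ercc.` prefix is again an assumption about the workflow name):

```json
{
  "ercc.countTranscripts.threads": 16,
  "ercc.countTranscripts.jobMemory": 24,
  "ercc.rpkmTable.timeout": 30
}
```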
Output | Type | Description | Labels |
---|---|---|---|
rpkmData | File | ERCC readouts in RPKMs | vidarr_label: rpkmData |
image | File | .png with a plot (Dose Response supported at the moment) | vidarr_label: image |
pdf | File | .pdf with the same plot as the .png, a legacy output | vidarr_label: pdf |
json | File | .json file with ERCC numbers | vidarr_label: json |
This section lists command(s) run by the ercc workflow

- Running ercc

The ercc workflow extracts the spike-in signal (reads) and plots the QC graph(s).
The total number of reads is counted from the fastq file (countTotal task); each FASTQ record spans four lines, so paste - - - - joins them and wc -l reports the read count:

```
zcat FASTQ_R1 | paste - - - - | wc -l
```
Reads are aligned with bwa and alignments hitting ERCC transcripts are tallied per transcript id (countTranscripts task):

```
bwa mem -t THREADS -M REF_GENOME FASTQ_R1 FASTQ_R2 | samtools view -S - |
cut -f 3 | grep ERCC | sort | uniq -c | sed s/^\ *// > CNV_FILE
```
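After sed strips the leading whitespace produced by uniq -c, each line of CNV_FILE has the form `<read count> <ERCC id>` (e.g. `42 ERCC-00002`), which is the format the RPKM script below parses.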
RPKM values for all ERCC transcripts are computed by an embedded Python script (rpkmTable task):

```
python <<CODE
import os
ercc = os.path.expandvars("~{erccData}")
counts = "~{erccCounts}"
total = ~{totalReads}
output_file = "~{basename(erccCounts, '.csv')}_rpkm.csv"

# 1. Open file, read and collect length data
ercc_data = open(ercc, 'r')
length_hash = {}
for nextLine in ercc_data.readlines():
    nextFields = nextLine.split("\t")
    length_hash[nextFields[0]] = int(nextFields[4])
ercc_data.close()

# 2. Calculate RPKMs, put them in a hash
count_data = open(counts, 'r')
rpkm_hash = {}
count_hash = {}
for nextLine in count_data.readlines():
    nextLine = nextLine.strip("\n")
    countChunk = nextLine.split(" ")
    if len(countChunk) == 2 and countChunk[1] in length_hash.keys():
        count_hash[countChunk[1]] = int(countChunk[0])
pm = int(total)/1000000.0
for e in sorted(length_hash.keys()):
    rpkm = 0.0
    if e in count_hash.keys():
        rpkm = count_hash[e]/pm/(length_hash[e]/1000.0)
    rpkm_hash[e] = "{0:.2f}".format(rpkm)
count_data.close()

# 3. Print out RPKMs sorted by ERCC transcript id
f = open(output_file, "w+")
for e in sorted(rpkm_hash.keys()):
    f.write('\t'.join([e, rpkm_hash[e]]) + '\n')
f.close()
CODE
```
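For each transcript, RPKM = reads_mapped / (total_reads / 10^6) / (transcript_length / 10^3); for example, 500 ERCC reads on a 1,000 bp transcript in a library of 50,000,000 total reads give 500 / 50 / 1.0 = 10.00 RPKM.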
The QC plot(s) are generated from the RPKM table with the ercc_plots.R script (makeReport task):

```
Rscript IMAGING_SCRIPT RPKM_TABLE CONTROL_DATA ERCC_DATA PREFIX DoseResponse SAMPLES
```
For support, please file an issue on the GitHub project or send an email to gsi@oicr.on.ca.
Generated with generate-markdown-readme (https://github.com/oicr-gsi/gsi-wdl-tools/)