Here, we present MeRIPseqPipe, an integrated analysis pipeline for MeRIP-seq data based on Nextflow. It integrates ten main functional modules (data preprocessing, quality control, read mapping, peak calling, peak merging, motif searching, peak annotation, differential methylation analysis, differential expression analysis, and data visualization), covering the core analysis of MeRIP-seq data. All analysis modules are orchestrated by Nextflow, and all third-party tools are encapsulated in a Docker container.
- MeRIPseqPipe is an integrated, automated pipeline that covers ten main analysis modules and provides a user-friendly solution for in-depth mining of MeRIP-seq data.
- MeRIPseqPipe is particularly suitable for analyzing a large number of samples at once with a single command.
- Built on Nextflow, MeRIPseqPipe parallelizes tasks automatically and allows users to cancel the pipeline, reset parameters, skip processes, and resume analysis from the last successful checkpoint.
You only need Nextflow (version >= 19.04.0) and Docker installed to run the pipeline. All dependencies will be pulled automatically.
- Run MeRIPseqPipe by cloning this repository:

```bash
git clone https://github.com/canceromics/MeRIPseqPipe.git
nextflow run /path/to/MeRIPseqPipe --help
```
- Or let Nextflow fetch it for you:

```bash
nextflow pull canceromics/MeRIPseqPipe -r v1.0
nextflow run canceromics/MeRIPseqPipe -r v1.0 --help
```
The designfile is a tab-separated (`\t`) table like the ones below. It is recommended to edit it in Excel and save it with a `.tsv` suffix.
--designfile 'path/to/designfile.tsv'
Example for Paired-end data:
Sample_ID | input_R1 | input_R2 | ip_R1 | ip_R2 | Group_ID |
---|---|---|---|---|---|
A | path/to/A.input.read1.fastq.gz | path/to/A.input.read2.fastq.gz | path/to/A.ip.read1.fastq.gz | path/to/A.ip.read2.fastq.gz | control |
B | path/to/B.input.read1.fastq.gz | path/to/B.input.read2.fastq.gz | path/to/B.ip.read1.fastq.gz | path/to/B.ip.read2.fastq.gz | control |
C | path/to/C.input.read1.fastq.gz | path/to/C.input.read2.fastq.gz | path/to/C.ip.read1.fastq.gz | path/to/C.ip.read2.fastq.gz | treated |
D | path/to/D.input.read1.fastq.gz | path/to/D.input.read2.fastq.gz | path/to/D.ip.read1.fastq.gz | path/to/D.ip.read2.fastq.gz | treated |
Example for Single-end data:
Sample_ID | input_R1 | input_R2 | ip_R1 | ip_R2 | Group_ID |
---|---|---|---|---|---|
A | path/to/A.input.read1.fastq.gz | false | path/to/A.ip.read1.fastq.gz | false | control |
B | path/to/B.input.read1.fastq.gz | false | path/to/B.ip.read1.fastq.gz | false | control |
C | path/to/C.input.read1.fastq.gz | false | path/to/C.ip.read1.fastq.gz | false | treated |
D | path/to/D.input.read1.fastq.gz | false | path/to/D.ip.read1.fastq.gz | false | treated |
Example for BAM data:
Sample_ID | input_R1 | input_R2 | ip_R1 | ip_R2 | Group_ID |
---|---|---|---|---|---|
A | path/to/A.input.bam | false | path/to/A.ip.bam | false | control |
B | path/to/B.input.bam | false | path/to/B.ip.bam | false | control |
C | path/to/C.input.bam | false | path/to/C.ip.bam | false | treated |
D | path/to/D.input.bam | false | path/to/D.ip.bam | false | treated |
Notes:
- You can use either absolute or relative paths in the designfile, but absolute paths are recommended.
- Note that "false" must be lowercase.
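As a quick sketch, a minimal single-end designfile matching the table above can be generated with `printf`; the FASTQ paths are placeholders to be replaced with your real files:

```shell
# Build a minimal single-end designfile (tab-separated; "false" fills the
# unused read2 columns). All FASTQ paths here are placeholders.
printf 'Sample_ID\tinput_R1\tinput_R2\tip_R1\tip_R2\tGroup_ID\n' > designfile.tsv
printf 'A\tpath/to/A.input.read1.fastq.gz\tfalse\tpath/to/A.ip.read1.fastq.gz\tfalse\tcontrol\n' >> designfile.tsv
printf 'C\tpath/to/C.input.read1.fastq.gz\tfalse\tpath/to/C.ip.read1.fastq.gz\tfalse\ttreated\n' >> designfile.tsv
# Sanity check: every row must have exactly 6 tab-separated columns
awk -F'\t' 'NF != 6 { print "bad row: " $0; exit 1 }' designfile.tsv
```

Saving from Excel as "Tab-delimited text" produces the same layout.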
The comparefile is a plain-text file listing the comparisons to run, one per line, as two group IDs joined by "_vs_":
--comparefile 'path/to/compare.txt'
Example:
Control1_vs_Treated1
Control2_vs_Treated2
Notes:
- Experimental replicates are required for DESeq2.
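Using the group IDs from the designfile examples above (`control` and `treated`), a comparefile could be built like this; the names on each side of `_vs_` must match the `Group_ID` column exactly:

```shell
# One comparison per line: "<GroupA>_vs_<GroupB>"
printf 'control_vs_treated\n' > compare.txt
# Sanity check: every line must contain the "_vs_" separator
[ "$(grep -c '_vs_' compare.txt)" -eq "$(wc -l < compare.txt)" ]
```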
Please see parameter docs for the available parameters when running the pipeline.
To specify the parameters, you can:
- edit `nextflow.config`, like:

```groovy
// Setting main parameters of analysis mode
stranded = "no"                      // "yes" OR "no" OR "reverse"
single_end = false
gzip = true
mapq_cutoff = 20                     // [0-255]; "255" means only keep uniquely mapped reads
motiflength = "5,6,7,8"
featurecount_minMQS = "0"
aligners = "star"                    // "star" OR "bwa" OR "tophat2" OR "hisat2" OR "none"
peak_threshold = "medium"            // "low" OR "medium" OR "high"
peakCalling_mode = "independence"    // "group" OR "independence"
peakMerged_mode = "rank"             // "rank" OR "mspc" OR "macs2" OR "MATK" OR "metpeak" OR "meyer"
expression_analysis_mode = "DESeq2"  // "DESeq2" OR "edgeR" OR "none"
methylation_analysis_mode = "QNB"    // "MATK" OR "QNB" OR "Wilcox-test" OR "edgeR" OR "DESeq2"
```
- specify them on the command line, like:

```bash
nextflow run main.nf -c nextflow.config -profile docker \
  --designfile designfile.tsv --comparefile compare.txt \
  --aligners star --fasta hg38_genome.fa \
  --gtf gencode.v25.annotation.gtf --rRNA_fasta hg38_rRNA.fasta \
  --outdir path/to/results --star_index hg38/starindex \
  --peakMerged_mode rank --methylation_analysis_mode Wilcox-test \
  --skip_createbedgraph --skip_meyer --skip_matk \
  -resume
```
The minimum reference genome requirement is a FASTA file and a GTF file; all other files required to run the pipeline can be generated from these. However, it is more storage- and compute-friendly to re-use reference genome files as efficiently as possible. We recommend pointing the pipeline at local genome or index paths to save download time, but we also bundle config files with paths to the Illumina iGenomes reference index files.
There are 31 different species supported in the iGenomes references. To run the pipeline, you must specify which to use with the `--genome` flag.
You can find the keys to specify the genomes in the iGenomes config file. Common genomes that are supported are:
- Human: `--genome GRCh38`
- Mouse: `--genome GRCm38`
- Drosophila: `--genome BDGP6`
- S. cerevisiae: `--genome 'R64-1-1'`
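For example, a run against the human iGenomes reference might look like the following sketch; the designfile and comparefile paths are placeholders for your own files:

```shell
# --genome selects the iGenomes key; indexes are resolved from the bundled config
nextflow run /path/to/MeRIPseqPipe -profile docker \
  --genome GRCh38 \
  --designfile designfile.tsv \
  --comparefile compare.txt
```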
The typical command for running the pipeline is as follows:
```bash
nextflow run /path/to/MeRIPseqPipe -profile test,docker
```

This will launch the pipeline with the `test` and `docker` configuration profiles.
Note that the pipeline will create the following files in your working directory:
```
work            # Directory containing the Nextflow working files
results         # Finished results (configurable, see below)
.nextflow.log   # Log file from Nextflow
# Other Nextflow hidden files, e.g. history of pipeline runs and old logs
```
NB: These options are part of Nextflow and use a single hyphen (pipeline parameters use a double-hyphen).
Use this parameter to choose a configuration profile. Profiles can give configuration presets for different compute environments. Note that multiple profiles can be loaded, for example `-profile test,docker` - the order of arguments is important! If `-profile` is not specified at all, the pipeline will run locally and expect all software to be installed and available on the `PATH`.
- `conda`
- `docker`
  - A generic configuration profile to be used with Docker
  - Pulls software from Docker Hub: `kingzhuky/meripseqpipe:dev`
- `test`
  - A profile with a complete configuration for automated testing
  - Includes links to test data, so needs no other parameters
MeRIPseqPipe allows users to cancel the pipeline, reset parameters, and resume analysis from the last successful checkpoint by specifying `-resume`.
Specify the path to a specific config file (in this pipeline, `nextflow.config`).
Each step in the pipeline has a default set of requirements for the number of CPUs, memory, and time. For most steps, if a job exits with error code `143` (exceeded requested resources), it is automatically resubmitted with higher requests (2x the original, then 3x). If it still fails after three attempts, the pipeline stops. You can edit `base.config` to override the defaults:
```groovy
process {
  cpus = { check_max( 20, 'cpus' ) }
  memory = { check_max( 40.GB * task.attempt, 'memory' ) }
  time = { check_max( 240.h * task.attempt, 'time' ) }
}
```
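Nextflow configuration also supports per-process overrides via `withName` selectors. A sketch, assuming a hypothetical process name - substitute a real process name from this pipeline's `main.nf`:

```groovy
process {
  // 'PeakCalling' is a placeholder; match it to an actual process in main.nf
  withName: 'PeakCalling' {
    cpus = 8
    memory = 16.GB
    time = 48.h
  }
}
```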
Nextflow handles job submissions and supervises the running jobs. The Nextflow process must run until the pipeline is finished.
The Nextflow `-bg` flag launches Nextflow in the background, detached from your terminal, so that the workflow does not stop if you log out of your session. The logs are saved to a file.
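A backgrounded launch might look like the following sketch (pipeline arguments as in the earlier examples; paths are placeholders):

```shell
# -bg detaches the run from your terminal; the run survives logout
# and its log is written to a file in the launch directory
nextflow run /path/to/MeRIPseqPipe -profile docker \
  --designfile designfile.tsv --comparefile compare.txt -bg
```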
Alternatively, you can use `screen`/`tmux` or a similar tool to create a detached session that you can log back into later. Some HPC setups also allow you to run Nextflow within a cluster job submitted to your job scheduler (from where it submits more jobs).
In some cases, the Nextflow Java virtual machine can start to request a large amount of memory. We recommend adding the following line to your environment (typically in `~/.bashrc` or `~/.bash_profile`) to limit this:

```bash
NXF_OPTS='-Xms1g -Xmx4g'
```