Here, we present MeRIPseqPipe, an integrated analysis pipeline for MeRIP-seq data based on Nextflow. It integrates ten main functional modules (data preprocessing, quality control, read mapping, peak calling, peak merging, motif searching, peak annotation, differential methylation analysis, differential expression analysis, and data visualization), covering the core analysis of MeRIP-seq data. All analysis modules are orchestrated by Nextflow, and all third-party tools are encapsulated in a Docker container.
- MeRIPseqPipe is an integrated, automated pipeline that covers ten main analysis modules and provides a user-friendly solution for in-depth mining of MeRIP-seq data.
- MeRIPseqPipe is particularly suitable for analyzing a large number of samples at once with a single command.
- Built on Nextflow, MeRIPseqPipe parallelizes tasks automatically and allows users to cancel the pipeline, reset parameters, skip processes, and resume analysis from the last successful checkpoint.
You only need Nextflow (version >= 19.04.0) and Docker installed to run the pipeline. All dependencies will be pulled automatically.
- Run MeRIPseqPipe by cloning this repository:

```bash
git clone https://github.com/canceromics/MeRIPseqPipe.git
nextflow run /path/to/MeRIPseqPipe --help
```
- Or let Nextflow fetch it for you:

```bash
nextflow pull canceromics/MeRIPseqPipe -r v1.0
nextflow run canceromics/MeRIPseqPipe -r v1.0 --help
```
The designfile is a tab-separated (`\t`) table like the ones below. It is recommended to edit it in Excel and save it with a `.tsv` suffix.
--designfile 'path/to/designfile.tsv'
Example for Paired-end data:
Sample_ID | input_R1 | input_R2 | ip_R1 | ip_R2 | Group_ID |
---|---|---|---|---|---|
A | path/to/A.input.read1.fastq.gz | path/to/A.input.read2.fastq.gz | path/to/A.ip.read1.fastq.gz | path/to/A.ip.read2.fastq.gz | control |
B | path/to/B.input.read1.fastq.gz | path/to/B.input.read2.fastq.gz | path/to/B.ip.read1.fastq.gz | path/to/B.ip.read2.fastq.gz | control |
C | path/to/C.input.read1.fastq.gz | path/to/C.input.read2.fastq.gz | path/to/C.ip.read1.fastq.gz | path/to/C.ip.read2.fastq.gz | treated |
D | path/to/D.input.read1.fastq.gz | path/to/D.input.read2.fastq.gz | path/to/D.ip.read1.fastq.gz | path/to/D.ip.read2.fastq.gz | treated |
Example for Single-end data:
Sample_ID | input_R1 | input_R2 | ip_R1 | ip_R2 | Group_ID |
---|---|---|---|---|---|
A | path/to/A.input.read1.fastq.gz | false | path/to/A.ip.read1.fastq.gz | false | control |
B | path/to/B.input.read1.fastq.gz | false | path/to/B.ip.read1.fastq.gz | false | control |
C | path/to/C.input.read1.fastq.gz | false | path/to/C.ip.read1.fastq.gz | false | treated |
D | path/to/D.input.read1.fastq.gz | false | path/to/D.ip.read1.fastq.gz | false | treated |
Example for BAM data:
Sample_ID | input_R1 | input_R2 | ip_R1 | ip_R2 | Group_ID |
---|---|---|---|---|---|
A | path/to/A.input.bam | false | path/to/A.ip.bam | false | control |
B | path/to/B.input.bam | false | path/to/B.ip.bam | false | control |
C | path/to/C.input.bam | false | path/to/C.ip.bam | false | treated |
D | path/to/D.input.bam | false | path/to/D.ip.bam | false | treated |
Notes:
- You can use either absolute or relative paths in the designfile, but absolute paths are recommended.
- Note that "false" must be lowercase.
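As a quick sketch, a minimal single-end designfile matching the table above can be generated with `printf`; the FASTQ paths are placeholders to be replaced with your real files:

```shell
# Build a minimal single-end designfile (tab-separated; "false" fills the
# unused read2 columns). All FASTQ paths here are placeholders.
printf 'Sample_ID\tinput_R1\tinput_R2\tip_R1\tip_R2\tGroup_ID\n' > designfile.tsv
printf 'A\tpath/to/A.input.read1.fastq.gz\tfalse\tpath/to/A.ip.read1.fastq.gz\tfalse\tcontrol\n' >> designfile.tsv
printf 'C\tpath/to/C.input.read1.fastq.gz\tfalse\tpath/to/C.ip.read1.fastq.gz\tfalse\ttreated\n' >> designfile.tsv
# Sanity check: every row must have exactly 6 tab-separated columns
awk -F'\t' 'NF != 6 { print "bad row: " $0; exit 1 }' designfile.tsv
```

Saving from Excel as "Tab-delimited text" produces the same layout.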
The comparefile is a plain-text file listing the comparisons to run, one per line, as two group IDs joined by "_vs_":
--comparefile 'path/to/compare.txt'
Example:
Control1_vs_Treated1
Control2_vs_Treated2
Notes:
- Experimental replicates are required for DESeq2.
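Using the group IDs from the designfile examples above (`control` and `treated`), a comparefile could be built like this; the names on each side of `_vs_` must match the `Group_ID` column exactly:

```shell
# One comparison per line: "<GroupA>_vs_<GroupB>"
printf 'control_vs_treated\n' > compare.txt
# Sanity check: every line must contain the "_vs_" separator
[ "$(grep -c '_vs_' compare.txt)" -eq "$(wc -l < compare.txt)" ]
```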
Please see parameter docs for the available parameters when running the pipeline.
To specify the parameters, you can:
- edit `nextflow.config`, like:

```groovy
// Setting main parameters of analysis mode
stranded = "no"                      // "yes" OR "no" OR "reverse"
single_end = false
gzip = true
mapq_cutoff = 20                     // [0-255]; "255" means only keep uniquely mapped reads
motiflength = "5,6,7,8"
featurecount_minMQS = "0"
aligners = "star"                    // "star" OR "bwa" OR "tophat2" OR "hisat2" OR "none"
peak_threshold = "medium"            // "low" OR "medium" OR "high"
peakCalling_mode = "independence"    // "group" OR "independence"
peakMerged_mode = "rank"             // "rank" OR "mspc" OR "macs2" OR "MATK" OR "metpeak" OR "meyer"
expression_analysis_mode = "DESeq2"  // "DESeq2" OR "edgeR" OR "none"
methylation_analysis_mode = "QNB"    // "MATK" OR "QNB" OR "Wilcox-test" OR "edgeR" OR "DESeq2"
```
- specify them on the command line, like:

```bash
nextflow run main.nf -c nextflow.config -profile docker \
  --designfile designfile.tsv --comparefile compare.txt \
  --aligners star --fasta hg38_genome.fa \
  --gtf gencode.v25.annotation.gtf --rRNA_fasta hg38_rRNA.fasta \
  --outdir path/to/results --star_index hg38/starindex \
  --peakMerged_mode rank --methylation_analysis_mode Wilcox-test \
  --skip_createbedgraph --skip_meyer --skip_matk \
  -resume
```
The minimum reference genome requirement is a FASTA file and a GTF file; all other files required to run the pipeline can be generated from these. However, it is more storage- and compute-friendly to re-use reference genome files as efficiently as possible. We recommend pointing the pipeline at local genome or index paths to save download time, but we also bundle config files with paths to the Illumina iGenomes reference index files.
There are 31 different species supported in the iGenomes references. To run the pipeline, you must specify which to use with the `--genome` flag.
You can find the keys to specify the genomes in the iGenomes config file. Common genomes that are supported are:
- Human: `--genome GRCh38`
- Mouse: `--genome GRCm38`
- Drosophila: `--genome BDGP6`
- S. cerevisiae: `--genome 'R64-1-1'`
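For example, a run against the human iGenomes reference might look like the following sketch; the designfile and comparefile paths are placeholders for your own files:

```shell
# --genome selects the iGenomes key; indexes are resolved from the bundled config
nextflow run /path/to/MeRIPseqPipe -profile docker \
  --genome GRCh38 \
  --designfile designfile.tsv \
  --comparefile compare.txt
```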
The typical command for running the pipeline is as follows:
```bash
nextflow run /path/to/MeRIPseqPipe -profile test,docker
```

This will launch the pipeline with the `test` and `docker` configuration profiles.
Note that the pipeline will create the following files in your working directory:
```
work            # Directory containing the Nextflow working files
results         # Finished results (configurable, see below)
.nextflow.log   # Log file from Nextflow
# Other Nextflow hidden files, e.g. history of pipeline runs and old logs
```
NB: These options are part of Nextflow and use a single hyphen (pipeline parameters use a double-hyphen).
Use this parameter to choose a configuration profile. Profiles can give configuration presets for different compute environments. Note that multiple profiles can be loaded, for example `-profile test,docker` - the order of arguments is important! If `-profile` is not specified at all, the pipeline will run locally and expect all software to be installed and available on the `PATH`.
- `conda`
- `docker`
  - A generic configuration profile to be used with Docker
  - Pulls software from Docker Hub: `kingzhuky/meripseqpipe:dev`
- `test`
  - A profile with a complete configuration for automated testing
  - Includes links to test data, so needs no other parameters
MeRIPseqPipe allows users to cancel the pipeline, reset parameters, and resume analysis from the last successful checkpoint by specifying `-resume`.
Specify the path to a specific config file (in this pipeline, `nextflow.config`).
Each step in the pipeline has a default set of requirements for the number of CPUs, memory, and time. For most steps, if a job exits with error code `143` (exceeded requested resources), it is automatically resubmitted with higher requests (2x the original, then 3x). If it still fails after three attempts, the pipeline stops. You can edit `base.config` to override the defaults:
```groovy
process {
  cpus = { check_max( 20, 'cpus' ) }
  memory = { check_max( 40.GB * task.attempt, 'memory' ) }
  time = { check_max( 240.h * task.attempt, 'time' ) }
}
```
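Nextflow configuration also supports per-process overrides via `withName` selectors. A sketch, assuming a hypothetical process name - substitute a real process name from this pipeline's `main.nf`:

```groovy
process {
  // 'PeakCalling' is a placeholder; match it to an actual process in main.nf
  withName: 'PeakCalling' {
    cpus = 8
    memory = 16.GB
    time = 48.h
  }
}
```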
Nextflow handles job submissions and supervises the running jobs. The Nextflow process must run until the pipeline is finished.
The Nextflow `-bg` flag launches Nextflow in the background, detached from your terminal, so that the workflow does not stop if you log out of your session. The logs are saved to a file.
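A backgrounded launch might look like the following sketch (pipeline arguments as in the earlier examples; paths are placeholders):

```shell
# -bg detaches the run from your terminal; the run survives logout
# and its log is written to a file in the launch directory
nextflow run /path/to/MeRIPseqPipe -profile docker \
  --designfile designfile.tsv --comparefile compare.txt -bg
```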
Alternatively, you can use `screen`/`tmux` or a similar tool to create a detached session that you can log back into later. Some HPC setups also allow you to run Nextflow within a cluster job submitted to your job scheduler (from where it submits more jobs).
In some cases, the Nextflow Java virtual machine can start to request a large amount of memory. We recommend adding the following line to your environment (typically in `~/.bashrc` or `~/.bash_profile`) to limit this:

```bash
NXF_OPTS='-Xms1g -Xmx4g'
```