MeRIPseqPipe

Introduction

Here, we present MeRIPseqPipe, an integrated analysis pipeline for MeRIP-seq data based on Nextflow. It integrates ten main functional modules covering the core analysis of MeRIP-seq data: data preprocessing, quality control, read mapping, peak calling, peak merging, motif searching, peak annotation, differential methylation analysis, differential expression analysis, and data visualization. All analysis modules are implemented as Nextflow processes, and all third-party tools are encapsulated in Docker containers.

Key Points

  • MeRIPseqPipe is an integrated, automated pipeline that covers ten main analysis modules and provides users a friendly solution for in-depth mining of MeRIP-seq data.
  • MeRIPseqPipe is particularly suitable for analyzing a large number of samples at once with a single command.
  • Based on Nextflow, MeRIPseqPipe achieves automatic parallelization and allows users to cancel the pipeline, reset parameters, skip processes, and resume analysis from any checkpoint.

Installation

You only need Nextflow (version >= 19.04.0) and Docker installed to run the pipeline. All dependencies will be pulled automatically.

  1. Run MeRIPseqPipe by cloning this repository:

    git clone https://github.com/canceromics/MeRIPseqPipe.git
    nextflow run /path/to/MeRIPseqPipe --help
  2. or let Nextflow do the pull:

    nextflow pull canceromics/MeRIPseqPipe -r v1.0 --help

Samplesheet input

--designfile

The designfile is a tab-separated ("\t") table like the following, saved with a .tsv suffix. It is most easily edited in Excel and then saved as a .tsv file.

--designfile 'path/to/designfile.tsv'

Example for Paired-end data:

| Sample_ID | input_R1 | input_R2 | ip_R1 | ip_R2 | Group_ID |
|-----------|----------|----------|-------|-------|----------|
| A | path/to/A.input.read1.fastq.gz | path/to/A.input.read2.fastq.gz | path/to/A.ip.read1.fastq.gz | path/to/A.ip.read2.fastq.gz | control |
| B | path/to/B.input.read1.fastq.gz | path/to/B.input.read2.fastq.gz | path/to/B.ip.read1.fastq.gz | path/to/B.ip.read2.fastq.gz | control |
| C | path/to/C.input.read1.fastq.gz | path/to/C.input.read2.fastq.gz | path/to/C.ip.read1.fastq.gz | path/to/C.ip.read2.fastq.gz | treated |
| D | path/to/D.input.read1.fastq.gz | path/to/D.input.read2.fastq.gz | path/to/D.ip.read1.fastq.gz | path/to/D.ip.read2.fastq.gz | treated |

Example for Single-end data:

| Sample_ID | input_R1 | input_R2 | ip_R1 | ip_R2 | Group_ID |
|-----------|----------|----------|-------|-------|----------|
| A | path/to/A.input.read1.fastq.gz | false | path/to/A.ip.read1.fastq.gz | false | control |
| B | path/to/B.input.read1.fastq.gz | false | path/to/B.ip.read1.fastq.gz | false | control |
| C | path/to/C.input.read1.fastq.gz | false | path/to/C.ip.read1.fastq.gz | false | treated |
| D | path/to/D.input.read1.fastq.gz | false | path/to/D.ip.read1.fastq.gz | false | treated |

Example for BAM data:

| Sample_ID | input_R1 | input_R2 | ip_R1 | ip_R2 | Group_ID |
|-----------|----------|----------|-------|-------|----------|
| A | path/to/A.input.bam | false | path/to/A.ip.bam | false | control |
| B | path/to/B.input.bam | false | path/to/B.ip.bam | false | control |
| C | path/to/C.input.bam | false | path/to/C.ip.bam | false | treated |
| D | path/to/D.input.bam | false | path/to/D.ip.bam | false | treated |

Notes:

  1. You can use either absolute or relative paths to the data in the designfile, but absolute paths are recommended.
  2. Note that "false" must be lowercase.
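Before launching, it can be useful to confirm the designfile has the expected shape. The helper below is not part of the pipeline (the function name is ours); it only checks that every row has exactly six tab-separated columns:

```shell
# Hypothetical helper: confirm every row of a designfile has exactly six
# tab-separated columns (Sample_ID, input_R1, input_R2, ip_R1, ip_R2, Group_ID).
check_designfile() {
  awk -F'\t' 'NF != 6 { printf "line %d: %d columns, expected 6\n", NR, NF; bad = 1 }
              END { exit bad }' "$1"
}
```

A non-zero exit status points at malformed rows, e.g. paths pasted with spaces instead of tabs.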

--comparefile

The comparefile is a plain-text file listing the comparisons to perform, one per line, as a "vs" between two groups.

--comparefile 'path/to/compare.txt'

Example:

Control1_vs_Treated1

Control2_vs_Treated2

Notes:

Experimental replicates are required for DESeq2.
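Since the comparefile is plain text, it can be written directly from the shell. A minimal sketch using the control/treated Group_ID values from the designfile examples above (the group names must match your own designfile):

```shell
# Write one comparison per line, in "<GroupA>_vs_<GroupB>" form.
printf '%s\n' 'control_vs_treated' > compare.txt
```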

Parameters

Please see parameter docs for the available parameters when running the pipeline.

To specify the parameters, you can:

  1. edit nextflow.config, for example:

      // Setting main parameters of analysis mode
      stranded = "no" // "yes" OR "no" OR "reverse"
      single_end = false
      gzip = true
      mapq_cutoff = 20 // [0-255], "255" means only keep uniquely mapping reads
      motiflength = "5,6,7,8"
      featurecount_minMQS = "0"
      aligners = "star" // "star" OR "bwa" OR "tophat2" OR "hisat2" OR "none"
      peak_threshold = "medium" // "low" OR "medium" OR "high"
      peakCalling_mode = "independence" // "group" OR "independence"
      peakMerged_mode = "rank" // "rank" OR "mspc" OR "macs2" OR "MATK" OR "metpeak" OR "meyer"
      expression_analysis_mode = "DESeq2" // "DESeq2" OR "edgeR" OR "none"
      methylation_analysis_mode = "QNB" // "MATK" OR "QNB" OR "Wilcox-test" OR "edgeR" OR "DESeq2"
  2. or specify them on the command line, for example:

    nextflow main.nf -c nextflow.config -profile docker --designfile designfile.tsv --comparefile compare.txt -resume --aligners star --fasta hg38_genome.fa --gtf gencode.v25.annotation.gtf --rRNA_fasta hg38_rRNA.fasta --outdir path/to/results --skip_createbedgraph --peakMerged_mode rank --star_index hg38/starindex --skip_meyer --skip_matk --methylation_analysis_mode Wilcox-test

Reference genomes

The minimum reference genome requirement is a FASTA file and a GTF file; all other files needed to run the pipeline can be generated from these. However, it is more storage- and compute-friendly to reuse reference genome files wherever possible. We recommend defining a local genome path or index path when running the pipeline, which saves download time, but we also bundle pipeline config files with paths to the Illumina iGenomes reference index files.

--genome (using iGenomes)

There are 31 different species supported in the iGenomes references. To run the pipeline, you must specify which to use with the --genome flag.

You can find the keys to specify the genomes in the iGenomes config file. Common genomes that are supported are:

  • Human
    • --genome GRCh38
  • Mouse
    • --genome GRCm38
  • Drosophila
    • --genome BDGP6
  • S. cerevisiae
    • --genome 'R64-1-1'
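For example, a run against the human iGenomes build might look like the following (a command sketch; the designfile and comparefile paths are placeholders):

```shell
nextflow run /path/to/MeRIPseqPipe -profile docker \
  --designfile designfile.tsv \
  --comparefile compare.txt \
  --genome GRCh38
```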

Running the pipeline

The typical command for running the pipeline is as follows:

nextflow run /path/to/MeRIPseqPipe -profile test,docker 

This will launch the pipeline with the test and docker configuration profiles.

Note that the pipeline will create the following files in your working directory:

work            # Directory containing the nextflow working files
results         # Finished results (configurable, see below)
.nextflow.log   # Log file from Nextflow
# Other nextflow hidden files, eg. history of pipeline runs and old logs.

Core Nextflow arguments

NB: These options are part of Nextflow and use a single hyphen (pipeline parameters use a double-hyphen).

-profile

Use this parameter to choose a configuration profile. Profiles can give configuration presets for different compute environments. Note that multiple profiles can be loaded, for example: -profile test,docker - the order of arguments is important!

If -profile is not specified, the pipeline runs locally and expects all software to be installed and available on the PATH.

  • conda
    • A generic configuration profile to be used with conda
    • Pulls most software from Bioconda
  • docker
    • A generic configuration profile to be used with Docker
    • Pulls software from Docker Hub
  • test
    • A profile with a complete configuration for automated testing
    • Includes links to test data so needs no other parameters

-resume

MeRIPseqPipe allows users to cancel the pipeline, reset parameters, and resume analysis from the last successful checkpoint by specifying -resume.

-c

Specify the path to a specific config file (in this pipeline, nextflow.config).

Job resources

Automatic resubmission

Each step in the pipeline has a default set of requirements for the number of CPUs, memory, and time. For most steps, if the job exits with error code 143 (exceeded requested resources), it is automatically resubmitted with higher requests (2× the original, then 3×). If it still fails after three attempts, the pipeline stops.

Custom resource requests

You can edit base.config to override the default resource requests, for example:

process {
  cpus = { check_max( 20, 'cpus' ) }
  memory = { check_max( 40.GB * task.attempt, 'memory' ) }
  time = { check_max( 240.h * task.attempt, 'time' ) }
}
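To change resources for a single step rather than globally, you can scope the override with a process selector. A sketch (the process name `PeakCalling` is illustrative; check the pipeline source for the actual process names):

```groovy
process {
  withName: 'PeakCalling' {
    cpus = { check_max( 8, 'cpus' ) }
    memory = { check_max( 64.GB * task.attempt, 'memory' ) }
  }
}
```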

Running in the background

Nextflow handles job submissions and supervises the running jobs. The Nextflow process must run until the pipeline is finished.

The Nextflow -bg flag launches Nextflow in the background, detached from your terminal so that the workflow does not stop if you log out of your session. The logs are saved to a file.

Alternatively, you can use screen/tmux or a similar tool to create a detached session that you can log back into later. Some HPC setups also allow you to run Nextflow within a cluster job submitted by your job scheduler (from where it submits further jobs).
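A sketch of launching in the background with output redirected to a file (the designfile and comparefile paths are placeholders):

```shell
nextflow run /path/to/MeRIPseqPipe -profile docker \
  --designfile designfile.tsv --comparefile compare.txt \
  -bg > pipeline.log
```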

Nextflow memory requirements

In some cases, the Nextflow Java virtual machine can start to request a large amount of memory. We recommend adding the following line to your environment (typically in ~/.bashrc or ~/.bash_profile) to limit this:

export NXF_OPTS='-Xms1g -Xmx4g'