-
Notifications
You must be signed in to change notification settings - Fork 10
Introduction to Workflows: Snakemake Part I & II May 12 & 14th, 2021
Wed, May 12, 2021, from 09:00am - 11:00pm PDT
Instructor: Dr. C Titus Brown
Moderator: Jose Sanchez
- Snakemake Wrappers
- Blog info on Snakemake Wrappers
- conda directive in snakemake
- An example of multiple conda env
- Tessa’s blog on snakemake with slurm
- Example Workflows
Learn to create reproducible and scalable data analyses using the Snakemake workflow management system. Snakemake is a text-based workflow system in which workflows are defined in terms of rules that connect input and output files. Snakemake offers many desired features such as flexibility to scale without modification, integration with package manager Conda and container engine Singularity, and compatibility with cloud computing.
This free two-part workshop series will cover the basics of Snakemake including setup, installation, building, and running an example workflow along with a discussion on advanced features such as customizing computational threads, use of config file, and use of arbitrary parameters.
💻 Open a web browser - e.g., Firefox, Chrome, or Safari. Open the binder by clicking this button:
We are part of the training and engagement team for the NIH Common Fund Data Ecosystem, a project supported by the NIH to increase data reuse and cloud computing for biomedical research.
Day 1:
- Introduction to workflows
- Introduction to snakemake
- Structure of the Snakefile
- Linking snakemake rules (decorating the Snakefile)
Day 2:
- Continue decorating the Snakefile
- Running lots of rules all at once
- Open Q&A time
If you have questions at any point,
- Drop them in the chat, or
- Direct messages to the moderator or helpers are welcome, or
- Unmute yourself and ask during the workshop
We're going to use the "raise hand" reaction in zoom to make sure people are on board during the hands-on activities.
- What is a workflow?
- Why use workflows?
- What and why Snakemake?
(Arrange/orientation to Rstudio panels; this is just a user interface, and is not critical for running snakemake!)
- Click on "Terminal"
- Shorten the command prompt to
$
by typing this:echo "PS1='\w $ '" >> .bashrc
- Open a new terminal to see the results
Link to tutorial with a slightly different binder experience
- This binder comes with Conda installed. To use conda on your local machine, you will need to install it (if you don't already have conda installed)
- Let's look at the contents of the
environment.yml
file which specifies all the necessary dependencies installed in the binder. Use the point and click interface.
Note that the base conda environment in the binder is defined by the environment.yml
specifications. On your local machine, you can create a separate conda environment on top of your base environment with the same specification:
# create a virtual environment snaketest
conda env create -n snaketest -f environment.yml
# activate the virtual environment
conda activate snaketest
-
Check if the setup was successful:
samtools --version
In this workshop, we will walk through the basic steps for creating a variant calling workflow with the Snakemake workflow management system.
Open the Snakefile by clicking on it. It will open up in your text editor tab. The Snakefile is a collection of rules. Each rule has this stucture:
rule rule_name:
shell:
# for single line commands
# command must be enclosed in single quotes, "
# multiline commands can be in triple quotes """
"command"
- Let's run a rule:
snakemake -j 2 -p map_reads
Did it run? Why or why not?
Now let's try another rule, the very first rule:
snakemake -j 2 -p download_data
In this case, the shell command downloads the raw read file from a public repository on osf.io. The -p
means show the command that you're running. The -j 2
tells snakemake to run up to two tasks at the same time - this will not become important until Friday, but we still have to give snakemake a number!
Did it work? Let's ls
to see what we downloaded, or look at the file tab in RStudio.
Next, let's run more rules sequentially:
snakemake -j 2 -p download_data
snakemake -j 2 -p download_genome
snakemake -j 2 -p uncompress_genome
snakemake -j 2 -p index_genome_bwa
snakemake -j 2 -p map_reads
Check the working directory again. The directory is populated by many output files including reference genome (.fa), genome index (.fa.sa, .fa.amb etc) and mapped reads (.sam) files.
It gets tedious to run each command individually, and we can do that already without Snakemake! What if we want to run all the commands at once?
By defining the inputs and outputs for each rule's command(s), Snakemake can figure out how the rules are linked together. The rule structure will now look something like this, where input:, output:, and shell: are Snakemake directives:
rule rule_name:
input:
# input file names must be enclosed in quotes
# multiple inputs should be separated by commas
# the new line for each input is optional
"input file 1",
"input file 2",
"input file 3"
output:
# output file names must be enclosed in quotes
# multiple outputs should be separated by commas
"output file 1",
"output file 2"
shell:
# for multi-line commands
# commands must be enclosed in triple quotes
"""
command_1
command_2
"""
Here, Snakemake interprets the input:
and output:
sections as Python code, and the shell:
section as the bash code that gets run on the command line.
Let's add inputs and outputs to the Snakefile!
(follow Titus' lead and delete all the output files from before)
- Add output to
rule download_data
and remove redundancy:
rule download_data:
output: "SRR2584857_1.fastq.gz"
shell:
"wget https://osf.io/4rdza/download -O {output}"
What practical impact does this have? Hint: try running download_data
twice with
snakemake -j 2 -p download_data
Now let's add an output: block to the download_genome
rule, too!
Question: why don't these have inputs?
Add an input and output to the uncompress_genome
rule
Hint: The uncompress step is to get rid of the .gz extension. Your filename should look identical but without the .gz
extension.
Let's continue decorating the Snakefile with inputs and outputs.
rule map_reads:
input:
"ecoli-rel606.fa",
"SRR2584857_1.fastq.gz",
"ecoli-rel606.fa.pac",
"ecoli-rel606.fa.ann",
"ecoli-rel606.fa.amb",
"ecoli-rel606.fa.bwt",
"ecoli-rel606.fa.sa"
output: "SRR2584857.sam"
shell:
"bwa mem -t 4 ecoli-rel606.fa SRR2584857_1.fastq.gz > {output}"
Discussion points for Friday:
- Adding inputs and outputs to all the rules.
- How are the rules linked?
- How to link the rules to run them together?
End of Day 1
For a recap of what you learned today please watch the 14 min vidlet linked to our tutorial page before the next session.
For next session, we would like you to pre-watch these two short (~16 min) vidlets.
Warning!
Be sure to save any work/notes you took in the binder to your computer. Any new files/changes are not available after the binder session is closed!
For example, select a file, click "More", click "Export":
See you on Friday!
# download data from https://osf.io/vzfc6/
rule download_data:
output: "SRR2584857_1.fastq.gz"
shell:
"wget https://osf.io/4rdza/download -O {output}"
# this rule downloads the genome file
rule download_genome:
output: "ecoli-rel606.fa.gz"
shell:
"wget https://osf.io/8sm92/download -O {output}"
rule uncompress_genome:
input: "ecoli-rel606.fa.gz"
output: "ecoli-rel606.fa"
shell:
"gunzip {input}"
rule index_genome_bwa:
input: "ecoli-rel606.fa"
output: "ecoli-rel606.fa.pac",
"ecoli-rel606.fa.ann",
"ecoli-rel606.fa.amb",
"ecoli-rel606.fa.bwt",
"ecoli-rel606.fa.sa"
shell:
"bwa index {input}"
rule map_reads:
input:
genome="ecoli-rel606.fa",
reads="SRR2584857_1.fastq.gz",
pac="ecoli-rel606.fa.pac",
ann="ecoli-rel606.fa.ann",
amb="ecoli-rel606.fa.amb",
bwt="ecoli-rel606.fa.bwt",
sa="ecoli-rel606.fa.sa"
output: sam="SRR2584857.sam"
shell:
"bwa mem -t 4 {input.genome} {input.reads} > {output.sam}"
rule index_genome_samtools:
shell:
"samtools faidx ecoli-rel606.fa"
rule samtools_import:
shell:
"samtools import ecoli-rel606.fa.fai SRR2584857.sam SRR2584857.bam"
rule samtools_sort:
shell:
"samtools sort SRR2584857.bam -o SRR2584857.sorted.bam"
rule samtools_index_sorted:
shell: "samtools index SRR2584857.sorted.bam"
rule samtools_mpileup:
shell:
"""samtools mpileup -u -t DP -f ecoli-rel606.fa SRR2584857.sorted.bam | \
bcftools call -mv -Ob -o - > variants.raw.bcf"""
rule make_vcf:
shell: "bcftools view variants.raw.bcf > variants.vcf"
# at end, run:
## samtools tview -p ecoli:4202391 SRR2584857.sorted.bam ecoli-rel606.fa
Snakemake also enables creation of a DAG i.e. directed acyclic graph. You can read more about the syntax information on the snakemake documentation.
To generate a png DAG file, we need Graphviz
, a graph visualization software. Since the workshop binder is preloaded with conda, we can create a new conda environment with the required packages.
If you are unfamiliar with conda or would like a refresher, see click here for our materials on Conda.
# initialize and reset the bash with conda
conda init bash
source .bashrc
#create a new environment and install Graphviz
conda create -y -n graphs graphviz
# activate conda environment
conda activate graphs
Graphviz
uses dot
to create directed graphs. At the end of Day 1 we have completed map_reads
rule. Let's try to create a DAG of all rules from Day 1.
We can specify any file with snakemake using the -s
or --snakemake
flag
--snakefile, -s The workflow definition in form of a snakefile.Usually, you should not need to specify this. By default, Snakemake will search for ‘Snakefile’, ‘snakefile’, ‘workflow/Snakefile’, ‘workflow/snakefile’ beneath the current working directory, in this order. Only if you definitely want a different layout, you need to use this parameter.
We have added the Snakemake modifications from Day 1 to the binder as day1_Snakefile.txt
which we will use to
snakemake -s day1_Snakefile.txt --dag SRR2584857.sam | dot -Tpng > dag_map_reads.png
You can now click on the dag_map_reads.png
from the Files/Plots/Packages pane to visualize the steps. The DAG contains a node for each job with the edges connecting them representing the dependencies.
You can run the above command to generate the DAG even without creating the actually input and output files from the rules. The difference between running the dag before and after running through all rules until map_reads
is distinguished by solid vs dashed lines for the nodes which indicate that the output is up-to-date.
Friday, May 14, 2021, from 09:00am - 11:00pm PDT
Instructor: Dr. C Titus Brown
Moderator: Jose Sanchez
Note that you can use the day1_Snakefile.txt
for this session but would have to appropriately update the snakemake commands by adding
snakemake -s day1_Snakefile.txt <rule_name> -j 2
Alternatively, you can update the existing Snakefile by copying all the rules up to map_reads
from Day 1 and you can use the snakemake syntax without specifying the snakefile.
Perhaps simplest:
cp day1_Snakefile.txt Snakefile
- reset your prompt
- copy our day 1 Snakefile over with
cp day1_Snakefile.txt Snakefile
- and then run
snakemake -j 2 map_reads
and everything should be "green" with five jobs run.
- Run the rules one at a time to figure out what the output files are. Check the file time stamps (
ls -lht
) to track more recent output files. The inputs are specified in the shell command section.
Snakemake Dry Run
You can do a dry run (--dry-run
or -n
) of the rule with
snakemake -n <rule name>
to check how Snakemake is interpreting the rule input(s), output(s), and shell command(s), without actually running the command(s) or creating any output(s).
- Let's discuss inputs and outputs for
rule map_reads
- What happens when you have more than one input?
- What to do when you have invisible input files (i.e. inputs that are not in the shell command)?
rule map_reads:
input:
genome = "ecoli-rel606.fa",
reads = "SRR2584857_1.fastq.gz",
idxfile = expand("ecoli-rel606.fa.{ext}", ext=['sa', 'amb', 'ann', 'pac', 'bwt'])
output: "SRR2584857.sam"
shell:
"bwa mem -t 4 {input.genome} {input.reads} > {output}"
- Once you've fixed the rules
index_genome_bwa
andmap_reads
, you should be able to run everything up to the ruleindex_genome_samtools
by running
snakemake -p index_genome_samtools -j 2
- Snakemake also has the option to delete all the inputs/outputs for a particular rule (including preceding rules), shown here by running the
index_genome_samtools
rule:
snakemake --delete-all-output index_genome_samtools -j 2
You don't actually need to use the rule names (this will be important later on!). Instead of rule names, you can specify the required output file in Snakemake which will trigger execution of all the upstream linked rules necessary to produce the file. So, the command below will also work to run the rule map_reads
, and you don't have to remember the rule name (which can be arbitrary).
snakemake -p SRR2584857.sam -j 2
We'll work through index_genome_samtools
and samtools_import
rules together.
Add an input and output to the samtools_sort
rule
(adventure goes here)
Once you've decorated the entire Snakefile, you should be able to run through the whole workflow using either the rule name:
snakemake -p make_vcf -j 2
or by using the file name:
snakemake -p variants.vcf -j 2
Recollect that Snakemake will execute the first rule in the Snakefile by default. We can use this feature by creating a default rule called all
at the top of the Snakefile:
rule all:
input: "variants.vcf"
Run it like this:
snakemake -p all -j 2
This rule runs through the entire workflow with a single command! This is much better than running each command one by one!
Let's look at the vcf file!
less variants.vcf
We can look at the alignment by running the alignment viewer in samtools:
samtools tview -p ecoli:4202391 SRR2584857.sorted.bam ecoli-rel606.fa
If you run samtools view as in the last comment in the Snakefile, you can use ‘q’ to get out of it (and arrow keys to navigate)
edit to make_vcf
rule:
shell: "bcftools view --include 'INFO/DP>=20' variants.raw.bcf > variants.vcf"
specifying temp output files:
rule samtools_index_sorted:
input: "SRR2584857.sorted.bam"
output: temp("SRR2584857.sorted.bam.bai")
shell: "samtools index {input}"
specifying protected file:
output: protected("variants.raw.bcf")
data = ['SRR_TITUS_1', 'SRR2584857_1']
rule first_rule_of_titus:
input:
# fill in the expected files from the 'data' list
expand("{s}.sorted.bam.bai", s=data),
expand("{s}.variants.vcf", s=data),
expand("{s}.variants.hidepth.vcf", s=data)
rule download_data:
output: "SRR2584857.fastq.gz"
shell:
"wget https://osf.io/4rdza/download -O {output}"
# this rule downloads the genome file
rule download_genome:
output: "ecoli-rel606.fa.gz"
shell:
"wget https://osf.io/8sm92/download -O {output}"
rule uncompress_genome:
input: "ecoli-rel606.fa.gz"
output: "ecoli-rel606.fa"
shell:
"gunzip {input}"
rule index_genome_bwa:
input: "ecoli-rel606.fa"
output: "ecoli-rel606.fa.pac",
"ecoli-rel606.fa.ann",
"ecoli-rel606.fa.amb",
"ecoli-rel606.fa.bwt",
"ecoli-rel606.fa.sa"
shell:
"bwa index {input}"
rule map_reads:
input:
genome="ecoli-rel606.fa",
reads="{sample}.fastq.gz",
pac="ecoli-rel606.fa.pac",
ann="ecoli-rel606.fa.ann",
amb="ecoli-rel606.fa.amb",
bwt="ecoli-rel606.fa.bwt",
sa="ecoli-rel606.fa.sa"
output: sam="{sample}.sam"
shell:
"bwa mem -t 4 {input.genome} {input.reads} > {output.sam}"
rule index_genome_samtools:
input: "ecoli-rel606.fa"
output: "ecoli-rel606.fa.fai"
shell:
"samtools faidx ecoli-rel606.fa"
rule samtools_import:
input:
fai = "ecoli-rel606.fa.fai",
sam = "{sample}.sam"
output: "{sample}.bam"
shell:
"samtools view -bt {input.fai} -o {output} {input.sam}"
# TODO: add 'input' and 'output' to samtools_sort
rule samtools_sort:
input: "{sample}.bam"
output: "{sample}.sorted.bam"
shell:
"samtools sort {input} -o {output}"
rule samtools_index_sorted:
input: "{sample}.sorted.bam"
output: temp("{sample}.sorted.bam.bai")
shell: "samtools index {input}"
rule samtools_mpileup:
input:
genome = "ecoli-rel606.fa",
sorted_bam = "{sample}.sorted.bam",
bai = "{sample}.sorted.bam.bai"
output: protected("{sample}.variants.raw.bcf")
shell:
"""samtools mpileup -u -t DP -f {input.genome} {input.sorted_bam} | \
bcftools call -mv -Ob -o - > {output}"""
rule make_vcf:
input: "{sample}.variants.raw.bcf"
output: "{sample}.variants.vcf"
shell: "bcftools view {input} > {output}"
rule make_highdepth_vcf:
input: "{sample}.variants.raw.bcf"
output: "{sample}.variants.hidepth.vcf"
shell: "bcftools view --include 'INFO/DP>=20' {input} > {output}"
# at end, run:
## samtools tview -p ecoli:4202391 SRR2584857.sorted.bam ecoli-rel606.fa
- Home
- Resources for Attendees
- Resources for Instructors
- Training Workshop Notes
-
HuBMAP Tools
-
R
-
RNA-Seq Concepts, Design and Workflows
-
RNA-Seq in the Cloud
-
Snakemake Part I & II
-
UNIX