This repository holds the Snakemake-based workflow for implementing the Molecular Ecology and Genetic Analysis Team’s Simple Microhaplotyping workflow. Hence the name, mega-simple-microhap-snakeflow. It was written and is maintained by Eric C. Anderson at NOAA’s Southwest Fisheries Science Center with input and assistance from Anthony Clemento and Ellen Campbell. This is a workflow to provide genotypes of individual fish (or other organisms) at microhaplotypes associated with different sets of amplicons. It is configurable to be able to work with different species, different marker sets, and different ways in which the marker-set reference sequences are specified.
Admittedly, the Snakemake code within this workflow is not entirely "mega-simple." There are a couple of features of the code/wildcarding that could be streamlined, and they might be at some point. Nonetheless, once all the configurations are in place for a given species and all of the different panels of amplicons associated with it, it is pretty simple to run.
If you just want to get the workflow and run the small test data set included with the repository on your own laptop or cluster, here are the steps:
-   Get this repository with something like:

    ``` sh
    git clone https://github.com/eriqande/mega-simple-microhap-snakeflow.git
    ```
-   Make sure that you have R installed and that you have successfully installed the packages `tidyverse` and `remotes`. (We have not found a way to use a conda-maintained R installation that works for our purposes here.) A minimal install command is shown just after these steps.

-   Make sure you have Miniconda installed.
-   Get a full install of a Snakemake conda environment by following the directions here.

-   Activate that Snakemake conda environment.
-   From within the `mega-simple-microhap-snakeflow` directory, give this command:

    ``` sh
    snakemake --config run_dir=.test/data --configfile config/Chinook/config.yaml --use-conda -np
    ```

    It should spit out a bunch of stuff that ends with something like this:

    ```
    Job counts:
            count   jobs
            1       all
            2       call_fullg_marker_sets_with_bcftools
            2       call_fullgex_remapped_markers_with_bcftools
            2       call_target_fasta_marker_sets_with_bcftools
            20      extract_reads_from_full_genomes
            33      flash_paired_ends
            20      flashed_fastqs_from_fullg_extracted_bams
            1       make_microhap_folder
            20      map_fullg_extracted_to_thinned_genomes
            20      map_to_full_genome
            26      map_to_target_fastas
            2       microhap_extract_fullgex_remapped
            2       microhap_extract_target_fastas
            151
    This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.
    ```
-   If that worked, you can do a full run with this:

    ``` sh
    snakemake --config run_dir=.test/data --configfile config/Chinook/config.yaml --use-conda --cores 1
    ```

    The test data set is small enough that a single core is sufficient. However, the first time you run it, you can expect to spend quite a bit of time (30 minutes to a couple of hours, depending on the speed of your computer) downloading the Otsh_v1.0 genome and indexing it with bwa.
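Regarding the R prerequisites above, a minimal sketch of that install step, run from within an R session (not a conda-managed R), is:

``` r
# install the two R packages used by the workflow's helper scripts
install.packages(c("tidyverse", "remotes"))
```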
This workflow was originally developed for paired-end data. If you have single-end data, you can set both the `fq1` and `fq2` values in each row of `samples.csv` to the path of the single-end file. Note: you must set `use_trimmomatic: false` in the config if you are using single-end data (for now).
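As an illustration only (real `samples.csv` files have additional columns, which are omitted here, and the file name below is made up), the single-end setup amounts to repeating the same FASTQ path in the `fq1` and `fq2` columns and disabling Trimmomatic in the config:

```
fq1,fq2
raw/S001_S1_L001_R1_001.fastq.gz,raw/S001_S1_L001_R1_001.fastq.gz
```

``` yaml
# in the config used for the run
use_trimmomatic: false
```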
No snakemake workflow description is complete without a quick DAG showing all the different rules. Here it is for the test data set obtained with:
``` sh
snakemake --config run_dir=.test/data \
          --configfile config/Chinook/config.yaml \
          --rulegraph | \
          dot -Tsvg > rulegraph.svg
```
If you are running sequencing data for a species that is already configured (meaning that the reference genomes and target regions are specified), then all you need to worry about is the sequencing and meta data input that you must provide. (This will be the case if you are using amplicons and specifications developed at the SWFSC.)
The basic idea of the workflow is that input data (both sequences and meta data) from any single MiSeq run (or, presumably a run on a different sequencing platform) are provided in a single directory with fairly strict formatting guidelines. The workflow then proceeds upon those data, and, in the process, populates that input directory with a number of additional directories containing outputs and some intermediate files.
For example, the directory `.test/data` in the repository is this sort of input directory. It is obviously a pared-down version of a typical sequencing run, which might include hundreds of individuals, but it serves our illustrative purposes here, nonetheless.
In order for the Snakemake-based workflow to run upon such a directory, it must have the following elements:

-   A directory `raw` that includes only the paired-end FASTQ files from the sequencing.
-   A file `samples.csv` (described below) that includes one line for each sample, giving the path to the FASTQ files that originated from the sample, additional identifiers for the sample, and the names of the amplicon panels with which the sample was amplified.
-   A file `units.csv` (also described below) that has a single line for each combination of a sample and an amplicon panel / marker set.
It is also good practice to have an additional file called `meta.csv` that includes additional meta data for each sample in `samples.csv`; however, this is not a requirement.
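For orientation, here is a rough sketch of how such an input run directory is laid out before the workflow runs. The directory and FASTQ names here are hypothetical; `.test/data` in the repository is a real, working example:

```
my_run_directory/
├── raw/                  # paired-end FASTQ files only
│   ├── T200001_S1_L001_R1_001.fastq.gz
│   ├── T200001_S1_L001_R2_001.fastq.gz
│   └── ...
├── SampleSheet.csv       # used to generate samples.csv and units.csv (see below)
├── samples.csv
├── units.csv
└── meta.csv              # optional
```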
In our workflow at the SWFSC, the `samples.csv` and `units.csv` files are generated from the sample sheet associated with the sequencing run using the included R script `preprocess/sample-sheet-processing-functions.R`. This is the recommended way to obtain `samples.csv` and `units.csv`, and it is the only way that we document here. Therefore, in practice, each user should plan to start with input data from the sequencer that includes:
-   The directory `raw` that includes only the paired-end FASTQ files from the sequencing.
-   A file named `SampleSheet.csv` that includes information about the samples that were included on the sequencing run.
The key to making sure all of this will run smoothly is to be strict about the formatting and contents of `SampleSheet.csv`, which are described in the next section.
The example `SampleSheet.csv` that comes with the small test data set in the repository can be most easily viewed on GitHub here.
As you can see, there are a bunch of header/preamble lines in it that are part of the standard MiSeq output. The functions in `preprocess/sample-sheet-processing-functions.R` find the start of the data section by finding the line that starts with `[Data]`. So, if your file doesn't have that line, it won't work!
As currently configured, the scripts in `preprocess/sample-sheet-processing-functions.R` are set up for our own conventions. Just go ahead and look at the functions to understand what is going on there. For our purposes at the SWFSC, we have these conventions:
-   Our NMFS_DNA_ID is the part before the first underscore in the `Sample_Plate` column. If your protocol has the sample ID in the `Sample_ID` column, then that is accommodated as described below.

-   The marker sets at which to process each individual (row) are given as a comma-separated series of marker-set names (whitespace around the commas is stripped) in the `Description` column.

-   The `Sample_Name` column must give the identifier of the individual that forms the beginning of the names of its FASTQ files in `raw`. In particular, if YYYYYYY is the entry for the individual in the `Sample_Name` column, then its paired-end FASTQ files stored in the `raw` directory must match the regular expressions:

    ```
    YYYYYYY_.*_L00[0-9]_R1_00[0-9].fastq.gz
    YYYYYYY_.*_L00[0-9]_R2_00[0-9].fastq.gz
    ```

    Entries in the `Sample_Name` column must not have any underscores in them. The letters before the first underscore in a FASTQ file name are used to match each entry in the `Sample_Name` column to its FASTQ files. (A hypothetical example of a data section following these conventions is shown after this list.)
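To make those conventions concrete, here is a small, entirely hypothetical sketch of a `SampleSheet.csv` data section. Only the relevant columns are shown (real sample sheets have more), and the values are made up:

```
[Data]
Sample_ID,Sample_Name,Sample_Plate,Description
1,T200001,T200001_Plate1,"LFAR, WRAP"
2,T200002,T200002_Plate1,WRAP
```

With rows like these, the NMFS_DNA_ID of the first individual would be parsed as T200001 (the part of `Sample_Plate` before the first underscore), it would be processed at the LFAR and WRAP marker sets, and its FASTQ files in `raw` would need names like `T200001_S1_L001_R1_001.fastq.gz` and `T200001_S1_L001_R2_001.fastq.gz`.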
In our workflows, `samples.csv` and `units.csv` are produced from the `SampleSheet.csv` file using the `create_samples_and_units()` function defined in `preprocess/sample-sheet-processing-functions.R`, like this, in R, with the working directory set to the top level of the workflow repository:
source("preprocess/sample-sheet-processing-functions.R")
create_samples_and_units(".test/data/SampleSheet.csv")
It is pretty straightforward. If you are trying to process a `SampleSheet.csv` that lives in a directory `mypath/mydir`, then you simply do:
source("preprocess/sample-sheet-processing-functions.R")
create_samples_and_units("mypath/mydir/SampleSheet.csv")
and that will create `samples.csv` and `units.csv` in the directory `mypath/mydir` alongside `SampleSheet.csv`.
If you have your sample ID in the `Sample_ID` column, then you can do it like this:
source("preprocess/sample-sheet-processing-functions.R")
create_samples_and_units(
"mypath/mydir/SampleSheet.csv",
NMFS_DNA_ID_from_Sample_ID = TRUE
)
If you are going to make `samples.csv` and `units.csv` yourself, here are some examples so you can see their structure and what you need to put in them: samples.csv and units.csv.
Here is an idea of how a production-run command line might look for five directories. After checking out a node with 20 cores…
``` sh
for i in data/200715_M02749_0092_000000000-CV34F \
         data/201012_M02749_0095_000000000-CWHDK \
         data/210129_M02749_0102_000000000-J33JD \
         data/200828_M02749_0093_000000000-CV2FP \
         data/201019_M02749_0096_000000000-CV34D; do
    snakemake --config run_dir=$i --configfile config/Chinook/config.yaml --use-conda --use-envmodules --cores 20
done
```
This is our current setup to simply call variants after running multiple directories. It just uses globbing to figure out which BAMs were produced; it does not parse the sample sheets of each run to figure out which individuals should have BAMs and request those. This is only implemented for the fullgex-remapped-to-thinned stuff at the moment, but I will get it going for the target_fastas soon, too.
Here is how to do it:
-   Establish a directory in the top level called `MULTI_RUN_RESULTS/some_directory_name`, where `some_directory_name` is some informative name.

-   Inside that directory make a file called `dirs.txt` that lists the paths to the run directories you want to include individuals from (if there are any for the marker sets you will request). The paths should be relative to the top level of the `mega-simple-microhap-snakeflow` directory. For example, after the above two steps you might have a directory `MULTI_RUN_RESULTS/5_early_runs` containing a `dirs.txt` file with the following contents:

    ```
    (snakemake) [node11: mega-simple-microhap-snakeflow]--% cat MULTI_RUN_RESULTS/5_early_runs/dirs.txt
    data/200715_M02749_0092_000000000-CV34F
    data/200828_M02749_0093_000000000-CV2FP
    data/201012_M02749_0095_000000000-CWHDK
    data/201019_M02749_0096_000000000-CV34D
    data/210129_M02749_0102_000000000-J33JD
    ```

    That is just 5 lines, each with a path to a previously run run directory.
-   Request the output file you want by replacing the wildcards, as appropriate, in the following output path:

    ```
    MULTI_RUN_RESULTS/{multi_run_dir}/{species_dir}/vcfs/{marker_set}/fullgex_remapped/{genome}/variants-from-multi-runs-bcftools.vcf
    ```

    In our example, if we wanted to do this for both WRAP and LFAR, the output targets we would request would be:

    ```
    MULTI_RUN_RESULTS/5_early_runs/Chinook/vcfs/{LFAR,WRAP}/fullgex_remapped/Otsh_v1.0/variants-from-multi-runs-bcftools.vcf
    ```

    Note the use of Unix brace expansion to get two paths out of that: one for LFAR and one for WRAP.
-   Request that target with Snakemake. The full command would look like:

    ``` sh
    snakemake --use-conda --cores 20 \
        MULTI_RUN_RESULTS/5_early_runs/Chinook/vcfs/{LFAR,WRAP}/fullgex_remapped/Otsh_v1.0/variants-from-multi-runs-bcftools.vcf
    ```
The obvious reason for doing multi-run variant calling is to have enough individuals from enough populations to ensure that you capture most of the interesting variants for forming microhaplotypes. Here is how you create a new VCF for microhaplot after doing the above multi-run variant calling, illustrated with some simple steps for the WRAP markers.
First, filter out sites with a lot of missing data:
``` sh
WRAP=MULTI_RUN_RESULTS/5_early_runs/Chinook/vcfs/WRAP/fullgex_remapped/Otsh_v1.0/variants-from-multi-runs-bcftools.vcf
(snakemake) [node11: mega-simple-microhap-snakeflow]--% bcftools view -i 'F_MISSING < 0.30' $WRAP | awk '!/^#/' | wc
55 37455 867160
(snakemake) [node11: mega-simple-microhap-snakeflow]--% bcftools view -i 'F_MISSING < 0.10' $WRAP | awk '!/^#/' | wc
53 36093 824120
(snakemake) [node11: mega-simple-microhap-snakeflow]--% bcftools view -i 'F_MISSING < 0.03' $WRAP | awk '!/^#/' | wc
53 36093 824120
# OK! Looks like 53 variants in the 24 regions/microhaps
# How many alternate alleles out of how many total alleles at each site?
(snakemake) [node11: mega-simple-microhap-snakeflow]--% bcftools view -i 'F_MISSING < 0.03' $WRAP | bcftools query -f '%CHROM\t%POS\t%AC/%AN\n'
NC_037104.1:55923357-55923657 151 583/1322
NC_037104.1:55966251-55966551 149 632/1322
NC_037104.1:55966251-55966551 151 641/1322
NC_037104.1:56061938-56062238 151 604/1322
NC_037104.1:56061938-56062238 182 1322/1322
NC_037104.1:56088878-56089178 151 452/1322
NC_037108.1:73538966-73539266 151 524/1318
NC_037108.1:73538966-73539266 160 57/1320
NC_037108.1:73538966-73539266 214 56/1320
NC_037108.1:73540716-73541016 151 526/1322
NC_037108.1:73543706-73544006 130 148/1322
NC_037108.1:73543706-73544006 133 17/1322
NC_037108.1:73543706-73544006 151 475/1322
NC_037108.1:73553140-73553440 151 475/1324
NC_037112.1:24500367-24500667 151 468/1320
NC_037112.1:24542569-24542869 151 455/1324
NC_037112.1:24542569-24542869 165 63/1324
NC_037112.1:24542569-24542869 172 891/1324
NC_037112.1:24593758-24594058 151 454/1320
NC_037112.1:24593758-24594058 181 141/1320
NC_037112.1:24609163-24609463 151 874/1322
NC_037112.1:24609163-24609463 188 447/1320
NC_037112.1:24618993-24619293 109 9/1322
NC_037112.1:24618993-24619293 131 455/1322
NC_037112.1:24618993-24619293 151 867/1322
NC_037112.1:24704405-24704705 151 454/1322
NC_037112.1:24704405-24704705 207 29/1322
NC_037112.1:24721041-24721341 141 450/1320
NC_037112.1:24721041-24721341 162 448/1320
NC_037112.1:24999768-25000068 125 455/1320
NC_037112.1:24999768-25000068 147 456/1320
NC_037112.1:24999768-25000068 155 455/1320
NC_037112.1:25015012-25015312 118 532/1322
NC_037112.1:25015012-25015312 151 453/1322
NC_037112.1:25015012-25015312 161 2/1322
NC_037112.1:25015012-25015312 163 801,6/1322
NC_037112.1:28278997-28279297 125 256/1326
NC_037112.1:28278997-28279297 151 508/1324
NC_037112.1:28296342-28296642 130 11/1324
NC_037112.1:28296342-28296642 149 467/1324
NC_037112.1:28296342-28296642 150 77/1308
NC_037112.1:28296342-28296642 151 482/1324
NC_037112.1:28296342-28296642 165 15/1324
NC_037112.1:28320487-28320787 126 53/1322
NC_037112.1:28320487-28320787 128 39/1322
NC_037112.1:28320487-28320787 137 831/1318
NC_037112.1:28320487-28320787 151 540/1322
NC_037112.1:28350510-28350810 116 22/1322
NC_037112.1:28350510-28350810 141 1170/1322
NC_037112.1:28350510-28350810 151 530/1322
NC_037112.1:28350510-28350810 163 363/1322
NC_037121.1:6243491-6243791 151 474/1324
NC_037121.1:6268440-6268740 151 448/1318
# So, some of them are quite rare, but let's go ahead and keep
# those in there...That is still fewer than two variants per
# amplicon, on average.
# Finally, let's make a VCF file with only one individual in it
# to use for our microhaplot VCF:
(snakemake) [node11: mega-simple-microhap-snakeflow]--% bcftools view -i 'F_MISSING < 0.03' $WRAP | bcftools view -s T170774 > config/Chinook/canonical_variation/WRAP-24-amplicons-53-variants.vcf
```
Now, to add that into the workflow, let’s request another set of canonical variation in the Chinook config:
``` yaml
WRAP:
  genome:
    Otsh_v1.0:
      regions: config/Chinook/regions/WRAP-Otsh_v1.0.txt
      microhap_variants:
        all_variants: config/Chinook/canonical_variation/WRAP-all-snps-round-1.vcf
        after_5_runs: config/Chinook/canonical_variation/WRAP-24-amplicons-53-variants.vcf  # <--- this line added
```
After that, whenever Snakemake is invoked on a run directory, it will see that there is a new set of microhap variants at which things should be run, if there are any WRAP fish in the run. Cool!
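For example, here is a sketch of how you might then regenerate outputs for one of the previously processed run directories, using the same style of per-run command shown earlier (adjust the run directory, cores, and cluster-related flags to your setup):

``` sh
# rerunning a run directory that contains WRAP fish should now also
# produce microhaplot outputs for the new "after_5_runs" variant set
snakemake --config run_dir=data/200828_M02749_0093_000000000-CV2FP \
    --configfile config/Chinook/config.yaml \
    --use-conda --cores 20
```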
If we just did 5 runs' worth of microhaplot for LFAR and WRAP, the results would be found in directories like this:
```
(snakemake) [node11: mega-simple-microhap-snakeflow]--% pwd
/home/eanderson/Documents/git-repos/mega-simple-microhap-snakeflow
(snakemake) [node11: mega-simple-microhap-snakeflow]--% ls data/*/Chinook/microhaplot/*after_5*
data/200715_M02749_0092_000000000-CV34F/Chinook/microhaplot/LFAR--fullgex_remapped_to_thinned--Otsh_v1.0--after_5_runs_posinfo.rds
data/200715_M02749_0092_000000000-CV34F/Chinook/microhaplot/LFAR--fullgex_remapped_to_thinned--Otsh_v1.0--after_5_runs.rds
data/200828_M02749_0093_000000000-CV2FP/Chinook/microhaplot/LFAR--fullgex_remapped_to_thinned--Otsh_v1.0--after_5_runs_posinfo.rds
data/200828_M02749_0093_000000000-CV2FP/Chinook/microhaplot/LFAR--fullgex_remapped_to_thinned--Otsh_v1.0--after_5_runs.rds
data/200828_M02749_0093_000000000-CV2FP/Chinook/microhaplot/WRAP--fullgex_remapped_to_thinned--Otsh_v1.0--after_5_runs_posinfo.rds
data/200828_M02749_0093_000000000-CV2FP/Chinook/microhaplot/WRAP--fullgex_remapped_to_thinned--Otsh_v1.0--after_5_runs.rds
data/201012_M02749_0095_000000000-CWHDK/Chinook/microhaplot/LFAR--fullgex_remapped_to_thinned--Otsh_v1.0--after_5_runs_posinfo.rds
data/201012_M02749_0095_000000000-CWHDK/Chinook/microhaplot/LFAR--fullgex_remapped_to_thinned--Otsh_v1.0--after_5_runs.rds
data/201019_M02749_0096_000000000-CV34D/Chinook/microhaplot/WRAP--fullgex_remapped_to_thinned--Otsh_v1.0--after_5_runs_posinfo.rds
data/201019_M02749_0096_000000000-CV34D/Chinook/microhaplot/WRAP--fullgex_remapped_to_thinned--Otsh_v1.0--after_5_runs.rds
data/210129_M02749_0102_000000000-J33JD/Chinook/microhaplot/LFAR--fullgex_remapped_to_thinned--Otsh_v1.0--after_5_runs_posinfo.rds
data/210129_M02749_0102_000000000-J33JD/Chinook/microhaplot/LFAR--fullgex_remapped_to_thinned--Otsh_v1.0--after_5_runs.rds
data/210129_M02749_0102_000000000-J33JD/Chinook/microhaplot/WRAP--fullgex_remapped_to_thinned--Otsh_v1.0--after_5_runs_posinfo.rds
data/210129_M02749_0102_000000000-J33JD/Chinook/microhaplot/WRAP--fullgex_remapped_to_thinned--Otsh_v1.0--after_5_runs.rds
```
They all have the same name, so it is the directory structure that distinguishes them. To get these all to your laptop, rsync with the `-R` (relative paths) option is your friend (and you can test with the `-n` dry-run option before doing it):
``` sh
rsync -avR eanderson@sedna.nwfsc2.noaa.gov:'/home/eanderson/Documents/git-repos/mega-simple-microhap-snakeflow/./data/*/Chinook/microhaplot/*-after_5_*' ./
```
Note the use of single quotes to avoid expanding the wildcards on the local machine, so that they are instead sent to the remote machine. Also note the `.` in `mega-simple-microhap-snakeflow/./data`: that tells rsync to reproduce on the destination only the part of the path to the right of that dot.
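For example, a dry-run version of the same transfer (nothing is copied, the files are just listed) simply adds `-n`:

``` sh
rsync -avRn eanderson@sedna.nwfsc2.noaa.gov:'/home/eanderson/Documents/git-repos/mega-simple-microhap-snakeflow/./data/*/Chinook/microhaplot/*-after_5_*' ./
```

If the listed paths look right, run the full `rsync -avR` command above without the `-n`.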
Cool! After running the above command, we have:
```
(base) /from_cluster/--% (master) tree .
.
└── data
├── 200715_M02749_0092_000000000-CV34F
│ └── Chinook
│ └── microhaplot
│ ├── LFAR--fullgex_remapped_to_thinned--Otsh_v1.0--after_5_runs.rds
│ └── LFAR--fullgex_remapped_to_thinned--Otsh_v1.0--after_5_runs_posinfo.rds
├── 200828_M02749_0093_000000000-CV2FP
│ └── Chinook
│ └── microhaplot
│ ├── LFAR--fullgex_remapped_to_thinned--Otsh_v1.0--after_5_runs.rds
│ ├── LFAR--fullgex_remapped_to_thinned--Otsh_v1.0--after_5_runs_posinfo.rds
│ ├── WRAP--fullgex_remapped_to_thinned--Otsh_v1.0--after_5_runs.rds
│ └── WRAP--fullgex_remapped_to_thinned--Otsh_v1.0--after_5_runs_posinfo.rds
├── 201012_M02749_0095_000000000-CWHDK
│ └── Chinook
│ └── microhaplot
│ ├── LFAR--fullgex_remapped_to_thinned--Otsh_v1.0--after_5_runs.rds
│ └── LFAR--fullgex_remapped_to_thinned--Otsh_v1.0--after_5_runs_posinfo.rds
├── 201019_M02749_0096_000000000-CV34D
│ └── Chinook
│ └── microhaplot
│ ├── WRAP--fullgex_remapped_to_thinned--Otsh_v1.0--after_5_runs.rds
│ └── WRAP--fullgex_remapped_to_thinned--Otsh_v1.0--after_5_runs_posinfo.rds
└── 210129_M02749_0102_000000000-J33JD
└── Chinook
└── microhaplot
├── LFAR--fullgex_remapped_to_thinned--Otsh_v1.0--after_5_runs.rds
├── LFAR--fullgex_remapped_to_thinned--Otsh_v1.0--after_5_runs_posinfo.rds
├── WRAP--fullgex_remapped_to_thinned--Otsh_v1.0--after_5_runs.rds
└── WRAP--fullgex_remapped_to_thinned--Otsh_v1.0--after_5_runs_posinfo.rds
16 directories, 14 files
```