# Minos for joint genotyping
This page describes how to use Minos to joint genotype a large number of samples.
The idea is that you have variant called each of your samples using your favourite method, and have a VCF file for each sample. The joint genotyping pipeline will genotype all samples at all sites/alleles found across all of the input VCF files. The output is one VCF per sample, where the sites and alleles are identical in all VCF files, and also one single "wide" VCF file that contains all samples and their calls. It can also optionally make a distance matrix of all the samples.
The joint genotyping pipeline is written in nextflow, so that must be installed. The pipeline also requires Minos and ivcfmerge - we recommend you use a singularity container instead of installing Minos and ivcfmerge.
For each sample, you will need a name, a VCF file of calls and a sorted indexed BAM file of reads. The VCF and BAM files must all be with respect to the same reference FASTA file.
Put the sample files in a tab-delimited "manifest" file, which must have the column names (in any order) `name`, `vcf`, `reads`.
For example:
```
name	reads	vcf
sample1	/path/to/reads1.bam	/path/to/sample1.vcf
sample2	/path/to/reads2.bam	/path/to/sample2.vcf
sample3	/path/to/reads3.bam	/path/to/sample3.vcf
```
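Before launching the pipeline, it can save time to check the manifest up front. The helper below is a hypothetical sketch, not part of minos: it only verifies that the required columns are present (in any order) and that sample names are unique.

```python
import csv

REQUIRED_COLUMNS = {"name", "vcf", "reads"}


def check_manifest(path):
    """Sanity-check a manifest TSV (hypothetical helper, not minos code).

    Verifies the required columns are present, in any order, and that
    sample names are unique. Returns the list of sample names.
    """
    with open(path, newline="") as f:
        reader = csv.DictReader(f, delimiter="\t")
        missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
        if missing:
            raise ValueError(f"manifest is missing columns: {sorted(missing)}")
        names = [row["name"] for row in reader]
    if len(names) != len(set(names)):
        raise ValueError("duplicate sample names in manifest")
    return names
```

You could extend this to also check that each `vcf` and `reads` path exists on disk before submitting a large run.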
The same rules apply to input VCF files as in single sample mode. Please see the Input VCF files section for details on which variants/alleles are used.
The nextflow pipeline file is in the Minos repository at `nextflow/regenotype.nf`. There is also a configuration file called `nextflow/config.nf`. The configuration file comes with two preset options, depending on the size of your input data; these are used with the option `-profile medium` or `-profile large`. The "medium" profile has been successfully used on around 400 samples. The "large" profile should be used for larger sample sizes; it is known to work on around 70,000 *M. tuberculosis* samples.
The nextflow command is:

```
nextflow run \
  -c nextflow/config.nf \
  -profile medium \
  nextflow/regenotype.nf \
  --ref_fasta <PATH/TO/REFERENCE.FASTA> \
  --manifest manifest.tsv \
  --outdir <PATH/TO/OUTPUT/DIRECTORY>
```
You will need to replace the value of `--ref_fasta` with the path to the reference FASTA file, and the value of `--outdir` with the name of the output directory. This directory is made by the pipeline and should not already exist.
It is beyond the scope of this documentation to provide help with how to use nextflow. But note that:

- `nextflow run` options have a single dash (eg `-c nextflow/config.nf`).
- Options to the actual pipeline have a double dash (eg `--manifest manifest.tsv`).
- If you are using singularity instead of having `minos` installed, then tell nextflow with the option `-with-singularity minos_container.simg`.
- Nextflow writes all intermediate files (there can be a large number!) in a "work directory", which is specified using the option `-w /foo/bar` and is created by nextflow if it does not exist. The default is `work` in your current working directory.
The options to the pipeline that you are most likely to change are listed below. The default values sometimes differ depending on whether you use the `medium` or `large` profile. Specifying an option on the command line overrides the value set by the profile.
- `--make_distance_matrix`. Add this option to make a distance matrix of all samples.
- `--max_variants_per_sample` (default 5000). If a sample's input VCF file has more than this many variants, the sample is excluded. The rationale is that a sample too far from the reference genome is probably contaminated.
- `--max_genome_proportion_per_sample` (default 0.05). This says (approximately) that if the sample is more than 5% away from the reference, then it is excluded. The actual test is: (total length of REF alleles) / (genome size) > 0.05.
- `--max_ref_allele_len` (default 50). Variants with a REF allele longer than this are excluded. This prevents combinatoric explosions caused by one sample having a long deletion and other samples having SNPs at the same positions.
- `--max_alleles_per_site` (default 500). After merging all input variants, when the final variant site is made for the graph mapping, this limits the total number of alleles at the site. If generating all combinations of variants at the site would produce more than the limit, then only the variant combinations actually seen in each sample are used instead. For example, if the reference is ATC and samples have the ALTs A, GTC, ATG, then combining the ALTs gives the alleles A, GTC, GTG, ATC, ATG. If the cutoff were 4 (fewer than the 5 alleles), then only A, GTC, ATG would be used.
- `--number_of_ref_chunks` (default medium=100, large=300). To save RAM, the reference is split into chunks. More chunks reduce RAM usage, at the cost of longer run time.
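The `--max_alleles_per_site` behaviour can be illustrated with a short sketch. This is not the actual minos code: it is a simplification that, as an assumption, treats same-length ALTs as position-wise mergeable SNP alleles and keeps length-changing ALTs (indels) as-is, which is enough to reproduce the ATC example above.

```python
from itertools import combinations


def combine_alleles(ref, alts, max_alleles):
    """Illustrative sketch of the --max_alleles_per_site limit
    (simplified; not the real minos implementation).

    SNP-type ALTs (same length as REF) are merged position-wise to
    enumerate all combinations; indels are kept unchanged. If the
    combined set exceeds max_alleles, fall back to only the alleles
    actually seen in the samples (plus the reference).
    """
    snps = [a for a in alts if len(a) == len(ref)]
    indels = [a for a in alts if len(a) != len(ref)]
    combined = set(indels)
    # every subset of SNP-type alts, applied together onto the reference
    for r in range(len(snps) + 1):
        for subset in combinations(snps, r):
            allele = list(ref)
            for alt in subset:
                for i, (ref_base, alt_base) in enumerate(zip(ref, alt)):
                    if ref_base != alt_base:
                        allele[i] = alt_base
            combined.add("".join(allele))
    if len(combined) > max_alleles:
        # too many combinations: use only the alleles actually observed
        return {ref} | set(alts)
    return combined
```

With `ref="ATC"` and ALTs `A`, `GTC`, `ATG`, this yields the five alleles from the example (ATC, A, GTC, ATG, GTG); with a cutoff of 4 it falls back to the reference plus the three observed ALTs.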
First, note that if the pipeline ran successfully, then the work directory (as specified by the option `-w` or `-work-dir`) made by nextflow can be completely deleted; it could contain a large number of files.
The output directory of the joint genotyping pipeline contains these files:
- `merged.vcf` - the final call set VCF file, containing all samples.
- `VCFs/` - a directory containing a VCF file for each sample.
- `Logs/` - a directory containing the Minos log file for each sample.
- `manifest.tsv`, `manifest.json` - the same information in TSV and JSON format, linking each sample name to its VCF and log file. This is described next.
Since the pipeline can run on tens of thousands of files, the VCF and Log files
for each sample are split into separate directories. This prevents having
a huge number of files in one directory and making things difficult for
the filesystem (eg running `ls`).
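A layout like this can be produced by numbering the samples and bucketing them into subdirectories. The function below is purely illustrative; the actual chunk size minos uses per subdirectory is an assumption here (the example paths only show files 0-153 in subdirectory `0`).

```python
def chunked_path(prefix, index, files_per_dir=1000):
    """Hypothetical illustration of a bucketed file layout (the real
    minos chunk size may differ): file number `index` goes into a
    numbered subdirectory so no single directory grows too large."""
    return f"{prefix}/{index // files_per_dir}/{index % files_per_dir}.vcf.gz"
```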
To get the VCF and/or log file for a sample, you can look it up in the `manifest.tsv` or `manifest.json` file. The `manifest.tsv` file looks like this:
```
sample	vcf_file	log_file
ERR046800	VCFs/0/0.vcf.gz	Logs/0/0.log.gz
ERR046753	VCFs/0/1.vcf.gz	Logs/0/1.log.gz
ERR047006	VCFs/0/2.vcf.gz	Logs/0/2.log.gz
... etc
```
and the JSON file looks like this:
```
{
    "ERR025833": {
        "log_file": "Logs/0/153.log.gz",
        "vcf_file": "VCFs/0/153.vcf.gz"
    },
    "ERR025834": {
        "log_file": "Logs/0/65.log.gz",
        "vcf_file": "VCFs/0/65.vcf.gz"
    },
    ... etc
}
```
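The JSON form is convenient to query from a script. The function below is a minimal sketch of such a lookup (the function name is our own; only the `vcf_file`/`log_file` keys come from the manifest format shown above):

```python
import json


def sample_files(manifest_path, sample_name):
    """Look up one sample in the pipeline's manifest.json and return
    its (vcf_file, log_file) paths, relative to the output directory."""
    with open(manifest_path) as f:
        manifest = json.load(f)
    entry = manifest[sample_name]
    return entry["vcf_file"], entry["log_file"]
```

For example, `sample_files("outdir/manifest.json", "ERR025833")` would return the `VCFs/0/153.vcf.gz` and `Logs/0/153.log.gz` paths from the snippet above.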