Minos for joint genotyping

This page describes how to use Minos to joint genotype a large number of samples.

Overview

The idea is that you have variant-called each of your samples using your favourite method, and have a VCF file for each sample. The joint genotyping pipeline genotypes all samples at all sites/alleles found across all of the input VCF files. The output is one VCF file per sample, where the sites and alleles are identical in every file, plus one single "wide" VCF file that contains all samples and their calls. It can also optionally make a distance matrix of all the samples.

The joint genotyping pipeline is written in nextflow, so that must be installed. The pipeline also requires Minos and ivcfmerge - we recommend you use a singularity container instead of installing Minos and ivcfmerge.

Input files

For each sample, you will need a name, a VCF file of calls, and a sorted, indexed BAM file of reads. The VCF and BAM files must all be with respect to the same reference FASTA file.

List the samples in a tab-delimited "manifest" file, which must have the column names name, vcf, and reads (in any order).

For example:

name	reads	vcf
sample1	/path/to/reads1.bam	/path/to/sample1.vcf
sample2	/path/to/reads2.bam	/path/to/sample2.vcf
sample3	/path/to/reads3.bam	/path/to/sample3.vcf

The same rules apply for input VCF files as in single sample mode. Please see the Input VCF files section for details on which variants/alleles are used.
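
If you build the manifest programmatically, a quick sanity check before launching the pipeline can save a failed run. The Python sketch below is one way to do this; it is not part of the pipeline, and the filename manifest.tsv is just an example:

import csv
import os
import sys

REQUIRED_COLUMNS = {"name", "vcf", "reads"}

def check_manifest(path):
    # Check the header has the required columns and that every vcf/reads file exists
    errors = []
    with open(path) as f:
        reader = csv.DictReader(f, delimiter="\t")
        missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
        if missing:
            return ["Missing column(s): " + ", ".join(sorted(missing))]
        for row in reader:
            for key in ("vcf", "reads"):
                if not os.path.exists(row[key]):
                    errors.append(f"Sample {row['name']}: {key} file not found: {row[key]}")
    return errors

if __name__ == "__main__":
    errors = check_manifest("manifest.tsv")
    if errors:
        sys.exit("\n".join(errors))
    print("Manifest looks OK")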

Run the pipeline

The nextflow pipeline file is in the Minos repository here: nextflow/regenotype.nf. There is also a configuration file called nextflow/config.nf. The configuration file comes with two preset profiles, chosen according to the size of your input data; select one with the option -profile medium or -profile large. The "medium" profile has been successfully used on around 400 samples. The "large" profile should be used for larger sample sizes - it is known to work on around 70,000 M. tuberculosis samples.

The nextflow command is:

nextflow run \
  -c nextflow/config.nf \
  -profile medium \
  nextflow/regenotype.nf \
  --ref_fasta <PATH/TO/REFERENCE.FASTA> \
  --manifest manifest.tsv \
  --outdir <PATH/TO/OUTPUT/DIRECTORY>

You will need to replace the value of --ref_fasta with the path to the reference FASTA file, and the value of --outdir with the name of the output directory - this directory is made by the pipeline and should not already exist.

It is beyond the scope of this documentation to provide help with how to use nextflow. But note that:

  • nextflow run options have a single dash (eg -c nextflow/config.nf).
  • Options to the actual pipeline have a double dash (eg --manifest manifest.tsv).
  • If you are using singularity instead of having minos installed, then tell nextflow with the option -with-singularity minos_container.simg.
  • Nextflow writes all intermediate files (there can be a large number!) in a "work directory", which is specified using the option -w /foo/bar and is created by nextflow if it does not exist. The default is work in your current working directory.

Pipeline options

The options to the pipeline that you are most likely to change are listed below. Some default values differ depending on whether you use the medium or large profile. Specifying an option on the command line overrides the value set by the profile.

  • --make_distance_matrix. Add this option to make a distance matrix of all samples.
  • --max_variants_per_sample (default 5000). If a sample's input VCF file has more than this many variants, the sample is excluded. The rationale is that a sample this far from the reference genome is probably contaminated.
  • --max_genome_proportion_per_sample (default 0.05). This says (approximately) that a sample is excluded if it differs from the reference over more than 5% of the genome. The actual test is: (total length of REF alleles) / (genome size) > 0.05.
  • --max_ref_allele_len (default 50). Variants with a REF allele length longer than this are excluded. This is to prevent combinatoric explosions caused by one sample having a long deletion, and other samples having SNPs at the same position.
  • --max_alleles_per_site (default 500). After merging all input variants, when the final variant site is made for the graph mapping, this is the limit on the total number of alleles at the site. If generating all combinations of variants at the site would produce more alleles than this limit, then only the variant combinations actually seen in each sample are used (see the sketch after this list). For example, if the reference is ATC, and samples have the ALTs A, GTC, ATG, then the combined alleles are ATC, A, GTC, ATG, GTG. If the cutoff were 4 (less than these 5 alleles), then only A, GTC and ATG would be used.
  • --number_of_ref_chunks (default medium=100, large=300). To save RAM, the reference is split into chunks. More chunks reduce RAM usage, at the cost of a longer run time.
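
To make the --max_alleles_per_site behaviour concrete, here is a small Python sketch of the decision logic, using the ATC example from the list above. It illustrates the idea only and is not the pipeline's actual code; the cutoff of 4 is hypothetical:

ref_allele = "ATC"
observed_alts = ["A", "GTC", "ATG"]                    # ALTs actually seen in the samples
combined_alleles = ["ATC", "A", "GTC", "ATG", "GTG"]   # all combinations, including the REF
max_alleles_per_site = 4                               # hypothetical cutoff for this example

if len(combined_alleles) > max_alleles_per_site:
    # Too many alleles: fall back to the ALTs actually seen in the samples
    used_alts = observed_alts
else:
    used_alts = [allele for allele in combined_alleles if allele != ref_allele]

print(used_alts)   # ['A', 'GTC', 'ATG']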

Output files

First, note that if the pipeline ran successfully then the work directory (as specified by the option -w or -work-dir) made by nextflow can be completely deleted - it could contain a large number of files.

The output directory of the joint genotyping pipeline contains these files:

  • merged.vcf - this is the final call set VCF file, containing all samples.
  • VCFs/ - a directory containing a VCF file for each sample.
  • Logs/ - a directory containing the Minos log file for each sample.
  • manifest.tsv, manifest.json - the same information in TSV and JSON format, linking each sample name to its VCF and log file. This is described next.

Since the pipeline can run on tens of thousands of samples, the VCF and log files are split across separate subdirectories. This avoids having a huge number of files in one directory, which can make things difficult for the filesystem (eg when running ls).

To get the VCF and/or log file for a sample, you can look it up in the manifest.tsv or manifest.json file. The manifest.tsv file looks like this:

sample	vcf_file	log_file
ERR046800	VCFs/0/0.vcf.gz	Logs/0/0.log.gz
ERR046753	VCFs/0/1.vcf.gz	Logs/0/1.log.gz
ERR047006	VCFs/0/2.vcf.gz	Logs/0/2.log.gz
... etc

and the JSON file looks like this:

{
  "ERR025833": {
    "log_file": "Logs/0/153.log.gz",
    "vcf_file": "VCFs/0/153.vcf.gz"
  },
  "ERR025834": {
    "log_file": "Logs/0/65.log.gz",
    "vcf_file": "VCFs/0/65.vcf.gz"
  },
  ... etc
}
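
For example, a small Python snippet along these lines could be used to find the output files for one sample (the sample name ERR025833 comes from the example above, and the output directory path is a placeholder):

import json
import os

outdir = "PATH/TO/OUTPUT/DIRECTORY"   # the directory that was given to --outdir
sample = "ERR025833"

with open(os.path.join(outdir, "manifest.json")) as f:
    manifest = json.load(f)

vcf_path = os.path.join(outdir, manifest[sample]["vcf_file"])
log_path = os.path.join(outdir, manifest[sample]["log_file"])
print(vcf_path, log_path)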