diff --git a/README.md b/README.md
index 29e3eee..31e57ce 100644
--- a/README.md
+++ b/README.md
@@ -20,9 +20,9 @@ BayesTyper can either be build from source or a static Linux x86_64 build can be
 ### Building BayesTyper ###
 #### Prerequisites ####
-* gcc (c++11 support required. Tested with gcc 4.8 and 4.9)
-* CMake (version 2.8.0 or higher)
-* Boost (tested with version 1.55.0 and 1.56.0)
+* gcc (c++11 support required)
+* [CMake](https://cmake.org) (version 2.8.0 or higher)
+* [Boost](http://www.boost.org) (tested with version 1.53.0, 1.56.0 and 1.59.0)
 #### Compilation ####
@@ -39,35 +39,46 @@ The BayesTyper package contains `bayesTyper`, which does the genotyping, and `ba
 1. Count k-mers
-    1. Run [KMC3](https://github.com/refresh-bio/KMC) on each sample: `kmc -k55 sample_1.fq sample_1`
+    1. Run [KMC3](https://github.com/refresh-bio/KMC) on each sample: `kmc -k55 -ci1 sample_1.fq sample_1`
+
         * This will output k-mer counts to `sample_1.kmc_pre` and `sample_1.kmc_suf`.
-        * For low coverage data (<20X), include singleton k-mers by adding `-ci1` to the `kmc3` commandline.
+
+    2. For each sample create a read k-mer bloom filter: `bayesTyperTools makeBloom -k sample_1`
+
+        * The resulting bloom filter (`sample_1.bloom`) and the KMC3 output (`sample_1.kmc_pre` and `sample_1.kmc_suf`) should be in the same directory with the same prefix.
 2. Prepare variant input
     **IMPORTANT:** The variant input **must** contain simple variants (SNPs and short indels). These can be obtained by first running a standard tool like GATK, Platypus or Freebayes and then combine these variants with structural variants calls and/or prior as desired. **At least 1 million simple variants are required**.
     1. If required, convert allele IDs (e.g. \) to sequence: `bayesTyperTools convertAlleleId -o sample_1_sv_calls_seq -v sample_1_sv_calls.vcf -g hg38.fa`
+
         * Currently \, \, \, \, \, \ are supported. The latter require a fasta file with the mobile element insertion sequences.
         * This step can be skipped if the variant sets does not include any allele IDs (e.g. GATK, Platypus and Freebayes output).
     2. Normalise variants using [Bcftools](https://samtools.github.io/bcftools/): `bcftools norm -o sample_1_gatk_norm.vcf -f hg38.fa sample_1_gatk.vcf`
     3. Combine variant sets: `bayesTyperTools combine -o bayesTyper_input -v gatk:sample_1_gatk_norm.vcf,gatk:sample_2_gatk_norm.vcf,gatk:sample_3_gatk_norm.vcf,varDB:SNP_dbSNP150common_SV_1000g_dbSNP150all_GDK_GoNL_GTEx_GRCh38.vcf`
+
         * The contig fields in the headers need to be identical between variant sets and the variants sorted in the same order as the fields.
 3. Genotype variants
-    **IMPORTANT:** If you want to run BayesTyper on more than 30 samples, you should run BayesTyper in batches of 30 samples or less but using the **full** set of variants (i.e. across all individuals)
+    **IMPORTANT:** If you want to run BayesTyper on more than 30 samples, you should run BayesTyper in batches of 30 samples or less but using the **full** set of variants (i.e. across all individuals).
     1. Prepare sample information: Create tsv file with one sample per row with columns \, \ and \ ([example](http://people.binf.ku.dk/~lassemaretty/bayesTyper/bt_samples_example.tsv))
     2. Run BayesTyper: `bayesTyper -o integrated_calls -s samples.tsv -v bayesTyper_input.vcf -g hg38.fa -p `
-        * Decoy sequences: BayesTyper can be provided with decoy sequences using '-d' to handle sequence similarities between genotyped regions and non-genotyped regions (e.g. the mitochondrial genome and unplaced contigs in the reference). Matching reference and decoy sequences are available for
+
+        * By default BayesTyper does not genotype variant alleles longer than 500,000 nts. If longer variants are of interest this can be changed using the option `--max-allele-length`, however at the cost of increased computation time and memory usage.
+        * BayesTyper can be provided with decoy sequences using `-d` to handle sequence similarities between genotyped regions and non-genotyped regions (e.g. the mitochondrial genome and unplaced contigs in the reference). Matching reference and decoy sequences are available for
+
             * GRCh37: [Reference](http://people.binf.ku.dk/~lassemaretty/bayesTyper/GRCh37/GRCh37_canon.fa) and [decoy](http://people.binf.ku.dk/~lassemaretty/bayesTyper/GRCh37/GRCh37_decoy.fa)
             * GRCh38: [Reference](http://people.binf.ku.dk/~lassemaretty/bayesTyper/GRCh38/GRCh38_canon.fa) and [decoy](http://people.binf.ku.dk/~lassemaretty/bayesTyper/GRCh38/GRCh38_decoy.fa)
-
+
+
 4. Filter output
     1. Run filtering: `bayesTyperTools filter -o integrated_calls_filtered -v integrated_calls.vcf -g hg38.fa --kmer-coverage-filename integrated_calls_kmer_coverage_estimates.txt`
+
         * By default only genotypes with high confidence (posterior probability >= 0.99) are kept. If low confident genotypes are needed in a downstream analyses this can be changed using the option `--min-genotype-posterior`.
 ## Variant databases ##
@@ -76,7 +87,7 @@ The BayesTyper package contains `bayesTyper`, which does the genotyping, and `ba
 ### Variant database sources ###
 #### GRCh37 ####
-|Source|Version|Filters*|Lifted|Reference|
+|Source|Version|Filters\*|Lifted|Reference|
 |------|-------|--------|------|---------|
 |dbSNP|150|No rare SNVs|No|[link](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC29783/)|
 |1000 Genomes Project (1KG)|Phase 3|No SNVs|No|[link](https://www.nature.com/nature/journal/v526/n7571/full/nature15394.html)||
@@ -85,7 +96,7 @@ The BayesTyper package contains `bayesTyper`, which does the genotyping, and `ba
 |GenomeDenmark (GDK)|v1.0|No SNVs|From GRCh38|[link](http://www.nature.com/nature/journal/vaop/ncurrent/full/nature23264.html)|
 #### GRCh38 ####
-|Source|Version|Filters*|Lifted|Reference|
+|Source|Version|Filters\*|Lifted|Reference|
 |------|-------|--------|------|---------|
 |dbSNP|150|No rare SNVs|No|[link](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC29783/)|
 |1000 Genomes Project (1KG)|Phase 3|No SNVs|No|[link](https://www.nature.com/nature/journal/v526/n7571/full/nature15394.html)||
@@ -93,17 +104,25 @@ The BayesTyper package contains `bayesTyper`, which does the genotyping, and `ba
 |Genotype-Tissue Expression (GTEx) Project|GTEx Analysis V6|No SNVs|From GRCh37|[link](http://www.nature.com/ng/journal/v49/n5/full/ng.3834.html)|
 |GenomeDenmark (GDK)|v1.0|No SNVs|No|[link](http://www.nature.com/nature/journal/vaop/ncurrent/full/nature23264.html)|
-*Reference and alternative alleles containing ambiguous nucleotides were removed from all variant sources.
+\*Reference and alternative alleles containing ambiguous nucleotides were removed from all variant sources.
+
+## Computational requirements ##
+|Number of samples|Coverage|Number of variants|Max allele length (nts)|Number of threads|Wall time (hours)\*|Max memory (GB)|
+|-----------------|--------|------------------|-----------------------|-----------------|------------------|---------------|
+|10|10x|14.6M|500,000|32|17|152|
+|10|30x|13.4M|500,000|32|19|148|
+|13|50x|11.7M|500,000|32|91|169|
+|13|50x|11.7M|10,000|32|42|129|
+|13|50x|61.1M|500,000|32|125|375|
+|10|13x|21.4M|500,000|32|16|159|
+|10|13x|64.4M|500,000|32|61|291|
-## Memory requirements ##
-|Variants|Coverage|Samples|Singletons removed|Threads|Memory (GB)|Time (wall-time hours)|
-|--------|--------|-------|-------------------|-------|-----------|----------------------|
-|15M|30X|10|Yes|32|235|26|
-|21M|~13X|10|No|32|280|20|
-|51M|~50X|13|Yes|32|430|107|
+\*All runs were done on a 64-bit Intel Xeon 2.30 GHz machine with 1TB of memory.
 ## Third-party ##
 Third-party software used by BayesTyper (distributed together with the BayesTyper source code).
 * [Edlib](https://github.com/Martinsos/edlib)
 * [Eigen](http://eigen.tuxfamily.org/index.php?title=Main_Page)
 * [KMC](https://github.com/refresh-bio/KMC)
+* [libbf](https://github.com/mavam/libbf)
+* [ntHash](https://github.com/bcgsc/ntHash)
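
The batching guidance in the genotyping step of this patch (runs of 30 samples or less, each against the full variant set) could be sketched as below. This is only an illustration and not part of BayesTyper: the helper `batch_samples`, the constant `MAX_BATCH_SIZE` and the sample names are hypothetical, and the sketch only groups sample names without writing sample files or invoking any tool.

```python
from typing import List, Sequence

# Upper bound per run, taken from the README guidance in this patch.
MAX_BATCH_SIZE = 30


def batch_samples(samples: Sequence[str], max_size: int = MAX_BATCH_SIZE) -> List[List[str]]:
    """Split sample names into consecutive batches of at most max_size (hypothetical helper)."""
    if max_size < 1:
        raise ValueError("max_size must be at least 1")
    return [list(samples[i:i + max_size]) for i in range(0, len(samples), max_size)]


if __name__ == "__main__":
    # 65 hypothetical sample names; real names would come from the samples file.
    samples = [f"sample_{i}" for i in range(1, 66)]
    for number, batch in enumerate(batch_samples(samples), start=1):
        # Every batch is genotyped against the same combined variant file.
        print(f"batch {number}: {len(batch)} samples, variants: bayesTyper_input.vcf")
```

Each batch would then get its own sample file and its own `bayesTyper` run, while the combined variant input stays identical across batches.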