Update README.md
Jonas Andreas Sibbesen authored Feb 26, 2018
1 parent 4f6ddfe commit 5ae8dcb
Showing 1 changed file with 36 additions and 17 deletions.
53 changes: 36 additions & 17 deletions README.md
@@ -20,9 +20,9 @@ BayesTyper can either be build from source or a static Linux x86_64 build can be
### Building BayesTyper ###

#### Prerequisites ####
* gcc (c++11 support required. Tested with gcc 4.8 and 4.9)
* CMake (version 2.8.0 or higher)
* Boost (tested with version 1.55.0 and 1.56.0)
* gcc (c++11 support required)
* [CMake](https://cmake.org) (version 2.8.0 or higher)
* [Boost](http://www.boost.org) (tested with version 1.53.0, 1.56.0 and 1.59.0)

#### Compilation ####

@@ -39,35 +39,46 @@ The BayesTyper package contains `bayesTyper`, which does the genotyping, and `ba

1. Count k-mers

1. Run [KMC3](https://github.com/refresh-bio/KMC) on each sample: `kmc -k55 sample_1.fq sample_1`
1. Run [KMC3](https://github.com/refresh-bio/KMC) on each sample: `kmc -k55 -ci1 sample_1.fq sample_1`

* This will output k-mer counts to `sample_1.kmc_pre` and `sample_1.kmc_suf`.
* For low coverage data (<20X), include singleton k-mers by adding `-ci1` to the `kmc3` commandline.

2. For each sample create a read k-mer bloom filter: `bayesTyperTools makeBloom -k sample_1`

* The resulting bloom filter (`sample_1.bloom`) and the KMC3 output (`sample_1.kmc_pre` and `sample_1.kmc_suf`) should be in the same directory with the same prefix.

2. Prepare variant input

**IMPORTANT:** The variant input **must** contain simple variants (SNPs and short indels). These can be obtained by first running a standard tool like GATK, Platypus or Freebayes and then combining the resulting variants with structural variant calls and/or prior variants as desired. **At least 1 million simple variants are required**.
1. If required, convert allele IDs (e.g. \<DEL\>) to sequence: `bayesTyperTools convertAlleleId -o sample_1_sv_calls_seq -v sample_1_sv_calls.vcf -g hg38.fa`

* Currently \<DEL\>, \<DUP\>, \<CN[digit(s)]\>, \<CNV\>, \<INV\> and \<INS:ME:[sequence name]\> are supported. The latter requires a FASTA file with the mobile element insertion sequences.
* This step can be skipped if the variant sets do not include any allele IDs (e.g. GATK, Platypus and Freebayes output).

2. Normalise variants using [Bcftools](https://samtools.github.io/bcftools/): `bcftools norm -o sample_1_gatk_norm.vcf -f hg38.fa sample_1_gatk.vcf`

3. Combine variant sets: `bayesTyperTools combine -o bayesTyper_input -v gatk:sample_1_gatk_norm.vcf,gatk:sample_2_gatk_norm.vcf,gatk:sample_3_gatk_norm.vcf,varDB:SNP_dbSNP150common_SV_1000g_dbSNP150all_GDK_GoNL_GTEx_GRCh38.vcf`

* The contig fields in the headers need to be identical between variant sets, and the variants need to be sorted in the same order as the contig fields.

3. Genotype variants

**IMPORTANT:** If you want to run BayesTyper on more than 30 samples, you should run BayesTyper in batches of 30 samples or less but using the **full** set of variants (i.e. across all individuals)
**IMPORTANT:** If you want to run BayesTyper on more than 30 samples, you should run BayesTyper in batches of 30 samples or less but using the **full** set of variants (i.e. across all individuals).
1. Prepare sample information: create a TSV file with one sample per row and the columns \<sample_id\>, \<sex\> and \<path_to_kmc3_output\> ([example](http://people.binf.ku.dk/~lassemaretty/bayesTyper/bt_samples_example.tsv))

2. Run BayesTyper: `bayesTyper -o integrated_calls -s samples.tsv -v bayesTyper_input.vcf -g hg38.fa -p <threads>`
* Decoy sequences: BayesTyper can be provided with decoy sequences using '-d' to handle sequence similarities between genotyped regions and non-genotyped regions (e.g. the mitochondrial genome and unplaced contigs in the reference). Matching reference and decoy sequences are available for

* By default BayesTyper does not genotype variant alleles longer than 500,000 nts. If longer variants are of interest, this limit can be changed using the option `--max-allele-length`, at the cost of increased computation time and memory usage.
* BayesTyper can be provided with decoy sequences using `-d` to handle sequence similarities between genotyped regions and non-genotyped regions (e.g. the mitochondrial genome and unplaced contigs in the reference). Matching reference and decoy sequences are available for:

* GRCh37: [Reference](http://people.binf.ku.dk/~lassemaretty/bayesTyper/GRCh37/GRCh37_canon.fa) and [decoy](http://people.binf.ku.dk/~lassemaretty/bayesTyper/GRCh37/GRCh37_decoy.fa)
* GRCh38: [Reference](http://people.binf.ku.dk/~lassemaretty/bayesTyper/GRCh38/GRCh38_canon.fa) and [decoy](http://people.binf.ku.dk/~lassemaretty/bayesTyper/GRCh38/GRCh38_decoy.fa)



4. Filter output

1. Run filtering: `bayesTyperTools filter -o integrated_calls_filtered -v integrated_calls.vcf -g hg38.fa --kmer-coverage-filename integrated_calls_kmer_coverage_estimates.txt`

* By default only genotypes with high confidence (posterior probability >= 0.99) are kept. If low-confidence genotypes are needed in downstream analyses, this threshold can be changed using the option `--min-genotype-posterior`.
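
The batching advice in step 3 (at most 30 samples per run, each run using the full variant set) can be sketched as below. This is a minimal illustration and not part of BayesTyper: the function names, the output file prefix and the sample values are hypothetical, and the `F`/`M` sex codes and the absence of a header row are assumptions that should be checked against the linked example file.

```python
import csv
from pathlib import Path

MAX_BATCH_SIZE = 30  # batch limit stated in the README


def split_into_batches(rows, max_size=MAX_BATCH_SIZE):
    """Split a list of sample rows into consecutive batches of at most max_size."""
    if max_size < 1:
        raise ValueError("max_size must be at least 1")
    return [rows[i:i + max_size] for i in range(0, len(rows), max_size)]


def write_batch_files(rows, out_dir, prefix="samples_batch"):
    """Write one tab-separated samples file per batch and return the file paths.

    Each row is [sample_id, sex, path_to_kmc3_output], the column order given
    in the README. No header row is written (assumption).
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for number, batch in enumerate(split_into_batches(rows), start=1):
        path = out_dir / f"{prefix}_{number}.tsv"
        with path.open("w", newline="") as handle:
            writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
            writer.writerows(batch)
        paths.append(path)
    return paths


if __name__ == "__main__":
    # Hypothetical cohort of 65 samples; the sex codes are placeholders.
    samples = [[f"sample_{i}", "F" if i % 2 else "M", f"kmc/sample_{i}"]
               for i in range(1, 66)]
    for path in write_batch_files(samples, "batches"):
        print(path)
```

Each resulting file would then be passed to a separate `bayesTyper` run via `-s`, with the same `-v bayesTyper_input.vcf` for every batch so that all batches are genotyped against the full set of variants.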

## Variant databases ##
@@ -76,7 +87,7 @@ The BayesTyper package contains `bayesTyper`, which does the genotyping, and `ba

### Variant database sources ###
#### GRCh37 ####
|Source|Version|Filters*|Lifted|Reference|
|Source|Version|Filters\*|Lifted|Reference|
|------|-------|--------|------|---------|
|dbSNP|150|No rare SNVs|No|[link](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC29783/)|
|1000 Genomes Project (1KG)|Phase 3|No SNVs|No|[link](https://www.nature.com/nature/journal/v526/n7571/full/nature15394.html)|
@@ -85,25 +96,33 @@ The BayesTyper package contains `bayesTyper`, which does the genotyping, and `ba
|GenomeDenmark (GDK)|v1.0|No SNVs|From GRCh38|[link](http://www.nature.com/nature/journal/vaop/ncurrent/full/nature23264.html)|

#### GRCh38 ####
|Source|Version|Filters*|Lifted|Reference|
|Source|Version|Filters\*|Lifted|Reference|
|------|-------|--------|------|---------|
|dbSNP|150|No rare SNVs|No|[link](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC29783/)|
|1000 Genomes Project (1KG)|Phase 3|No SNVs|No|[link](https://www.nature.com/nature/journal/v526/n7571/full/nature15394.html)|
|Genome of the Netherlands Project (GoNL)|Release 6|No SNVs|From GRCh37|[link](https://www.nature.com/articles/ncomms12989)|
|Genotype-Tissue Expression (GTEx) Project|GTEx Analysis V6|No SNVs|From GRCh37|[link](http://www.nature.com/ng/journal/v49/n5/full/ng.3834.html)|
|GenomeDenmark (GDK)|v1.0|No SNVs|No|[link](http://www.nature.com/nature/journal/vaop/ncurrent/full/nature23264.html)|

*Reference and alternative alleles containing ambiguous nucleotides were removed from all variant sources.
\*Reference and alternative alleles containing ambiguous nucleotides were removed from all variant sources.

## Computational requirements ##
|Number of samples|Coverage|Number of variants|Max allele length (nts)|Number of threads|Wall time (hours)\*|Max memory (GB)|
|-----------------|--------|------------------|-----------------------|-----------------|------------------|---------------|
|10|10x|14.6M|500,000|32|17|152|
|10|30x|13.4M|500,000|32|19|148|
|13|50x|11.7M|500,000|32|91|169|
|13|50x|11.7M|10,000|32|42|129|
|13|50x|61.1M|500,000|32|125|375|
|10|13x|21.4M|500,000|32|16|159|
|10|13x|64.4M|500,000|32|61|291|

## Memory requirements ##
|Variants|Coverage|Samples|Singletons removed|Threads|Memory (GB)|Time (wall-time hours)|
|--------|--------|-------|-------------------|-------|-----------|----------------------|
|15M|30X|10|Yes|32|235|26|
|21M|~13X|10|No|32|280|20|
|51M|~50X|13|Yes|32|430|107|
\*All runs were done on a 64-bit Intel Xeon 2.30 GHz machine with 1TB of memory.

## Third-party ##
Third-party software used by BayesTyper (distributed together with the BayesTyper source code).
* [Edlib](https://github.com/Martinsos/edlib)
* [Eigen](http://eigen.tuxfamily.org/index.php?title=Main_Page)
* [KMC](https://github.com/refresh-bio/KMC)
* [libbf](https://github.com/mavam/libbf)
* [ntHash](https://github.com/bcgsc/ntHash)
