DISCLAIMER: This project is under construction and valid results are therefore not guaranteed
- A process-ready BAM file (or folder containing them), e.g. pre-processed following GATK best practices
- GATK > 4.1.4.1
- Vardict Java > 1.8.2
- SAMtools > 1.9
- BCFtools > 1.11
- HTSlib > 1.11
- Python > 3.9
- R > 4.0
- LoFreq, vt (for post-processing), and SiNVICT with its prerequisites are pre-built, more info on this in 'Installation'
This is a pipeline made to reliably generate calls for somatic mutations in Low Variant Allele Frequencies (VAF) samples in specific regions, such as NGS data from cfDNA. This is done by analyzing .BAM
files with 4 different tools (VarDict
, LoFreq
, Mutect2
& SiNVICT
). The pipeline will output variants that are called with at least an x
amount of tools (this can be set from 1-4). Of course, the higher the number, the lower False Positive call rate, the higher the reliability of the call, but also the higher the chance you'll miss relevant somatic variants.
The general workflow in the pipeline is as follows:
.BAM
and .bed
files are copied to a temporary folder, where the processing happens. Ather that, the 4 tools will run in parralel, generating preliminary results which are also stored in the temporary folder. Since VarDict
and SiNVICT
don't output regular .vcf
files, this data first needs to be processed in order to compare the overlapping variants in the .vcf
files. This processing consists of sorting variants and generating .vcf
files, which is done with custom Python and R scripts. Then, all .vcf
files are decomposed, normalized, gunzipped and indexed. Finally, with all .vcf
files processed, the variants can be compared on overlapping variants. All variants called with x
or more tools will be saved. Also a venn diagram of mutation calls per tool is generated, together with histograms of amount of mutations with a certain VAF & Read Depth. VAF & Read Depth are calculated with the data from VarDict
, LoFreq
, Mutect2
, since SiNVICT
doensn't output this data. A folder contatining normal samples can be provided to generate a Panal of Normals (PoN) a.k.a. a blacklist. This can be used to filter out SNP's and/or technical artifacts (when the same library prep and sequencing methods are performed as in tumor samples). Finally, the remaining variants will be annotated using openCRAVAT. Finally, all mutations (if provided blacklisted- and non blacklisted) are annotated using OpenCravat. This is a wrapper around multiple well-known annotating tools, such as ClinVar, dbSNP, COSMIC, gnomAD and many more. All annotated mutations are saved in an excel file.
n.b. : A single .bam
file or a directory containing .bam
files can be given as arguments. When a directory is given, the process above will loop over all files, generating output folders for each .bam
file
1. Clone the repo
git clone https://github.com/nickveltmaat/SNVcaller
2. Set working directory to the repo
cd /path/to/SNVCaller
3. Create python virtual environment (env)
python3 -m venv ./env
4. Install needed packages in env with pip3
source ./env/bin/activate
pip3 install numpy
pip3 install pandas
pip3 install venn
pip3 install matplotlib
pip3 install pandas_bokeh
pip3 install glob
pip3 install xlrd
pip install open-cravat
oc module install-base
oc module ls -a -t annotator (this generartes a list of available annotators that can be downloaded)
oc module install clinvar cosmic dbsnp ... (see https://open-cravat.readthedocs.io/en/latest/1.-Installation-Instructions.html for more detailed instructions)
deactivate
5. Download and copy the pre-built tools to /path/to/SNVCaller/
and unzip
unzip ./tools.zip
Once all tools and pre-requisites are installed correctly, the pipeline can be called with:
bash ./SNVcaller.sh ARGUMENTS
Required arguments:
-I
Input: String --> example:/path/to/input.bam
Either one-file or directory-R
Reference: String --> example:/path/to/reference.fa
-L
Regions List: String --> example:/path/to/panel.bed
-D
minimum Read Depth: Int --> example:100
-V
minimum VAF: float [0-1] --> example:0.002
-C
minimum Calls: Int [1-4] --> example:2
-P
Panel of Normal: String --> example:/path/to/PoN/directory/
Optional
Output will be generated in /path/to/SNVcaller/output/name_of_.bam_file/