-
Notifications
You must be signed in to change notification settings - Fork 0
Variant data input
Our RaMeDiES statistical tools require separate, tab-delimited variant files for each patient in your dataset. Here we provide details about the specific input file format and scripts for processing your variant files for use with our framework.
The input files for all RaMeDiES tools have EIGHT required and SIX optional (but strongly suggested) tab-delimited columns. An uncommented header line must specify these exact column names, but they can be in any order. Other columns, including incorrectly named columns, and lines beginning with #
will be ignored. Examples of properly formatted input files can be found in the test/input/
directory in our repository.
column name | description | example value |
---|---|---|
ensembl_gene_id |
Ensembl canonical gene ID without version suffix | ENSG00000087884 |
chromosome |
chromosome name in Ensembl format | 11 |
1-indexed_location |
variant position according to genome build GRCh38/hg38 | 77872333 |
ref_allele |
reference allele of variant | A |
alt_allele |
alternate allele of variant | G |
consequence |
, - or & -delimited Ensembl VEP (Variant Effect Predictor) consequences which must be found in the VEP_cons_dict dictionary in cfg.py
|
intron_variant,3_prime_UTR_variant |
inherited_from |
variant inheritance; must be one of mom , dad , neither or an empty string (indicating a de novo) |
mom |
CADD-raw |
CADD v1.6 raw score (i.e., not Phred-scaled) | 1.80659 |
MAF |
❗ gnomAD popmax minor allele frequency, -1 indicates not present | 0.00420489 |
SpliceAI_acceptor-gain-score |
❗ SpliceAI acceptor gain delta score | 0.0 |
SpliceAI_acceptor-loss-score |
❗ SpliceAI acceptor loss delta score | 0.0 |
SpliceAI_donor-gain-score |
❗ SpliceAI donor gain delta score | 0.0 |
SpliceAI_donor-loss-score |
❗ SpliceAI donor loss delta score | 0.0 |
DenovoMutationRate |
❗ comma-separated Roulette mutation rate estimate and quality track | 8.16e-14,high |
❗
MAF
column is ignored if command-line parameter--MAF
is not specified❗
SpliceAI_*
columns are ignored if command-line parameter--variant_annots=C
(i.e., only consider coding variants) is specified❗
DenovoMutationRate
is ignored if command-line parameter--no_qual_track
is specified. However, we strongly recommend including these values as a quality control measure.
The genome-wide SNV/indel variants we use in our work (and have seen across cohorts) are in Variant Calling Format (VCF). These files are easily parseable and formatted according to well-documented specifications, and many tools exist for rapidly querying and filtering VCF files.
We understand that different institutions store and process variant-level data in different ways. You can choose to get your individual-level variant data into the correct format however you want! Here we detail the steps we took to produce formatted variant files from input VCF files.
bcftools
htslib
0️⃣ Generate a single VCF file containing genotypes for (at minimum) the proband, mother, and father.
We suggest jointly-calling your SNV/indel variants across family members in order to fill in homozygous reference (i.e., 0/0
) genotypes indicating where a variant is not present in a family member. There is a GATK implementation for joint-calling. We jointly called all variants across the entire UDN cohort using Sentieon tools.
If you have three separate VCFs, you can merge them into a single multi-sample VCF using bcftools merge, but variants that are truly not present in a family member or that simply weren't called due to quality and coverage issues will be indistinguishable and represented as ./.
.
1️⃣ Generate a tab-delimited file called trio_files.txt
with the following FOUR whitespace-delimited columns in order:
-
proband_id
: a UNIQUE (per-trio) identifier to be used as the output file prefix; cannot be a substring of any other identifier -
proband_id,maternal_id,paternal_id
: comma-separated list of sample IDs in this order that match (a subset of) sample IDs present in the starting VCF -
/full/path/to/starting/samples.vcf.gz
: family- or cohort-level joint-called VCF that has been block-gzipped (usingbgzip
or another implementation) and indexed (usingtabix
or another implementation, such that/full/path/to/starting/samples.vcf.gz.tbi
also exists) -
/path/to/trio/level/VCFs/proband_id
: output file DIRECTORY where output files will be written
- Starting input VCF (3) might be the same for all trios in your cohort (for instance, if you have a cohort-level jointly-called file).
- Output directory where all processed files will be stored (4) can also be the same for all trios in your cohort.
# reference FASTA file used to generate your VCF (needed for normalizing variants before annotation)
REF_FASTA=/path/to/reference/file/used/to/generate/input/vcf.fasta
# whitespace delimited mapping of chromosome identifiers (e.g., `chr1 1`)
MAP_CHROMS=/path/to/whitespace/delimited/file/from/ucsc_chromosome_ids/to/ensembl_chromsome_ids.txt
# whitespace-delimited file you just created with a single line for each trio in your cohort
TRIO_FILE=trio_files.txt
# best place to store (maybe large) temporary files
tmp_path=/tmp
# minimum read depth for variants you would like to consider (depends on the coverage depth of your samples)
READ_DEPTH=10
# maternally inherited variants:
hetmom="FORMAT/DP[0]>$READ_DEPTH && FORMAT/DP[1]>$READ_DEPTH && FORMAT/DP[2]>$READ_DEPTH && (FORMAT/AD[0:1])/(FORMAT/DP[0])>0.2 && (FORMAT/AD[0:1])/(FORMAT/DP[0])<0.8 && (FORMAT/AD[1:1])/(FORMAT/DP[1])>0.2 && (FORMAT/AD[1:1])/(FORMAT/DP[1])<0.8 && (FORMAT/AD[2:1]=0 || FORMAT/AD[2:1]=\".\")"
# paternally inherited variants:
hetdad="FORMAT/DP[0]>$READ_DEPTH && FORMAT/DP[1]>$READ_DEPTH && FORMAT/DP[2]>$READ_DEPTH && (FORMAT/AD[0:1])/(FORMAT/DP[0])>0.2 && (FORMAT/AD[0:1])/(FORMAT/DP[0])<0.8 && (FORMAT/AD[1:1]=0 || FORMAT/AD[1:1]=\".\") && (FORMAT/AD[2:1])/(FORMAT/DP[2])>0.2 && (FORMAT/AD[2:1])/(FORMAT/DP[2])<0.8"
# putative de novos:
denovo="FORMAT/DP[0]>$READ_DEPTH && FORMAT/DP[1]>$READ_DEPTH && FORMAT/DP[2]>$READ_DEPTH && (FORMAT/AD[0:1])/(FORMAT/DP[0])>0.2 && (FORMAT/AD[1:1]=0 || FORMAT/AD[1:1]=\".\") && (FORMAT/AD[2:1]=0 || FORMAT/AD[2:1]=\".\")"
while IFS=$'\t' read -r trio_id samples invcf outdir; do
echo "Trio ID: $trio_id"
echo "Samples: $samples"
echo "Input VCF: $invcf"
echo "Output Directory: $outdir"
# step 1: extract the TRIO bcf
echo "Step 1: creating trio BCF for quick querying: ${trio_id}"
mkdir -p ${outdir} || exit 1
bcftools view -Oz --samples=${samples} --trim-alt-alleles ${invcf} > ${tmp_path}/${trio_id}.vcf.gz || exit 1
tabix --force ${tmp_path}/${trio_id}.vcf.gz || exit 1
bcftoofs annotate -Ou --rename-chrs=${MAP_CHROMS} ${tmp_path}/${trio_id}.vcf.gz \
| bcftools norm -Ou -m -any -f ${REF_FASTA} > ${tmp_path}/${trio_id}.bcf || exit 1
# step 2: extract variants uniquely inherited from mom/dad or denovos:
echo "Step 2: extracting maternal variants: ${trio_id}"
bcftools view -Oz --include "$hetmom" ${tmp_path}/${trio_id}.bcf > ${outdir}/${trio_id}.hetmom.vcf.gz || exit 1
tabix --force ${outdir}/${trio_id}.hetmom.vcf.gz || exit 1
echo "Step 3: extracting paternal variants: ${trio_id}"
bcftools view -Oz --include "$hetdad" ${tmp_path}/${trio_id}.bcf > ${outdir}/${trio_id}.hetdad.vcf.gz || exit 1
tabix --force ${outdir}/${trio_id}.hetdad.vcf.gz || exit 1
echo "Step 4: extracting denovo variants: ${trio_id}"
bcftools view -Oz --include "$denovo" ${tmp_path}/${trio_id}.bcf > ${outdir}/${trio_id}.denovo.vcf.gz || exit 1
tabix --force ${outdir}/${trio_id}.denovo.vcf.gz || exit 1
rm ${tmp_path}/${trio_id}.vcf.gz ${tmp_path}/${trio_id}.vcf.gz.tbi ${tmp_path}/${trio_id}.bcf || exit 1
done < "$TRIO_FILE"
We have successfully used many annotation tools (and combinations of tools) in the past, including vcfanno, echtvar, slivar, bcftools annotate, Ensembl VEP, and others. Whatever you choose, the following annotations must be present in your final VCF:
- CADD v1.6 scores and VEP consequences are downloadable from https://cadd.gs.washington.edu/
- precomputed scores for the subset of indels present in gnomAD are available; you can and should manually annotate any remaining indels using the CADD web browser
- SpliceAI scores are downloadable from Illumina (after logging in), linked from https://github.com/Illumina/SpliceAI
- gnomAD allele frequencies are downloadable from https://gnomad.broadinstitute.org/downloads (but we suggest annotating using an optimized tool such as
slivar gnotate
) - Roulette mutation rates (and quality tracks) are downloadable from http://genetics.bwh.harvard.edu/downloads/Vova/Roulette/
Suppose you have your three VCF files per trio (from the previous step) that each look something like this:
> zless ${OUTDIR}/${trio_id}.hetmom.vcf.gz
##INFO=<ID=ConsDetail,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP">
##INFO=<ID=RawScore,Number=A,Type=Float,Description="Raw CADD score">
##INFO=<ID=PHRED,Number=A,Type=Float,Description="PHRED-like scaled CADD score">
##INFO=<ID=GeneID,Number=.,Type=String,Description="Ensembl Stable Gene ID">
##INFO=<ID=GeneName,Number=.,Type=String,Description="HGNC Gene Name">
##INFO=<ID=MAF,Number=A,Type=Float,Description="Maximum minor allele frequency found across gnomad v2 and v3 populations and TOPMed">
##INFO=<ID=DenovoMutationRate,Number=A,Type=Float,Description="De novo mutation rate from Roulette v5">
##INFO=<ID=DenovoQuality,Number=.,Type=String,Description="Low if in a poorly aligned/repetitve region, region with no gnomAD variants; high otherwise">
##INFO=<ID=SpliceAI_acc_gain,Number=.,Type=String,Description="SpliceAI predicted effect on splicing. Delta score for acceptor gain">
##INFO=<ID=SpliceAI_acc_loss,Number=.,Type=String,Description="SpliceAI predicted effect on splicing. Delta score for acceptor loss">
##INFO=<ID=SpliceAI_don_gain,Number=.,Type=String,Description="SpliceAI predicted effect on splicing. Delta score for donor gain">
##INFO=<ID=SpliceAI_don_loss,Number=.,Type=String,Description="SpliceAI predicted effect on splicing. Delta score for donor loss">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT PROBAND_ID MOM_ID DAD_ID
11 77872333 . A G . . ConsDetail=intron_variant,3_prime_UTR_variant;RawScore=1.80659;PHRED=17.78;GeneID=ENSG00000087884;GeneName=AAMDC;MAF=0.00420489;DenovoMutationRate=8.16e-14;DenovoQuality=high;SpliceAI_acc_gain=0.0;SpliceAI_acc_loss=0.0;SpliceAI_don_gain=0.0;SpliceAI_don_loss=0.0 GT:AD:DP 0/1:18,21:39 0/1:30,40:70 0/0:42,0:42
You can run the following bash script to generate a tab-delimited input file:
OUTFILE=${OUTDIR}/${trio_id}.processed_variants.tsv
HETMOM=${OUTDIR}/${trio_id}.hetmom.vcf.gz
HETDAD=${OUTDIR}/${trio_id}.hetdad.vcf.gz
DENOVO=${OUTDIR}/${trio_id}.denovo.vcf.gz
echo -e "ensembl_gene_id\tchromosome\t1-indexed_location\tref_allele\talt_allele\tconsequence\tinherited_from\tCADD-raw\tMAF\tSpliceAI_acceptor-gain-score\tSpliceAI_acceptor-loss-score\tSpliceAI_donor-gain-score\tSpliceAI_donor-loss-score\tDenovoMutationRate" > ${OUTFILE}
bcftools query -f '%INFO/GeneID\t%CHROM\t%POS\t%REF\t%ALT\t%INFO/ConsDetail\tmom\t%INFO/RawScore\t%INFO/MAF\t%INFO/SpliceAI_acc_gain\t%INFO/SpliceAI_acc_loss\t%INFO/SpliceAI_don_gain\t%INFO/SpliceAI_don_loss\t%INFO/DenovoMutationRate,%INFO/DenovoQuality' ${HETMOM} >> ${OUTFILE}
bcftools query -f '%INFO/GeneID\t%CHROM\t%POS\t%REF\t%ALT\t%INFO/ConsDetail\tdad\t%INFO/RawScore\t%INFO/MAF\t%INFO/SpliceAI_acc_gain\t%INFO/SpliceAI_acc_loss\t%INFO/SpliceAI_don_gain\t%INFO/SpliceAI_don_loss\t%INFO/DenovoMutationRate,%INFO/DenovoQuality' ${HETDAD} >> ${OUTFILE}
bcftools query -f '%INFO/GeneID\t%CHROM\t%POS\t%REF\t%ALT\t%INFO/ConsDetail\t\t%INFO/RawScore\t%INFO/MAF\t%INFO/SpliceAI_acc_gain\t%INFO/SpliceAI_acc_loss\t%INFO/SpliceAI_don_gain\t%INFO/SpliceAI_don_loss\t%INFO/DenovoMutationRate,%INFO/DenovoQuality' ${DENOVO} >> ${OUTFILE}
bgzip --force ${OUTFILE}
RaMeDiES | Last updated: February 2024