Variant data input

🌀 About

Our RaMeDiES statistical tools require separate, tab-delimited variant files for each patient in your dataset. Here we provide details about the specific input file format and scripts for processing your variant files for use with our framework.

🌀 Final variant file format

The input files for all RaMeDiES tools have EIGHT required and SIX optional (but strongly suggested) tab-delimited columns. An uncommented header line must specify these exact column names, but they can be in any order. Other columns, including incorrectly named columns, and lines beginning with # will be ignored. Examples of properly formatted input files can be found in the test/input/ directory in our repository.

column name	description	example value
`ensembl_gene_id`	Ensembl canonical gene ID without version suffix	`ENSG00000087884`
`chromosome`	chromosome name in Ensembl format	`11`
`1-indexed_location`	variant position according to genome build GRCh38/hg38	`77872333`
`ref_allele`	reference allele of variant	`A`
`alt_allele`	alternate allele of variant	`G`
`consequence`	`,`- or `&`-delimited Ensembl VEP (Variant Effect Predictor) consequences which must be found in the `VEP_cons_dict` dictionary in `cfg.py`	`intron_variant,3_prime_UTR_variant`
`inherited_from`	variant inheritance; must be one of `mom`, `dad`, `neither` or an empty string (indicating a de novo)	`mom`
`CADD-raw`	CADD v1.6 raw score (i.e., not Phred-scaled)	`1.80659`
`MAF`	❗ gnomAD popmax minor allele frequency, -1 indicates not present	`0.00420489`
`SpliceAI_acceptor-gain-score`	❗ SpliceAI acceptor gain delta score	`0.0`
`SpliceAI_acceptor-loss-score`	❗ SpliceAI acceptor loss delta score	`0.0`
`SpliceAI_donor-gain-score`	❗ SpliceAI donor gain delta score	`0.0`
`SpliceAI_donor-loss-score`	❗ SpliceAI donor loss delta score	`0.0`
`DenovoMutationRate`	❗ comma-separated Roulette mutation rate estimate and quality track	`8.16e-14,high`

❗ MAF column is ignored if command-line parameter --MAF is not specified

❗ SpliceAI_* columns are ignored if command-line parameter --variant_annots=C (i.e., only consider coding variants) is specified

❗ DenovoMutationRate is ignored if command-line parameter --no_qual_track is specified. However, we strongly recommend including these values as a quality control measure.

🌀 Processing Variant Call Format (VCF) files

The genome-wide SNV/indel variants we use in our work (and have seen across cohorts) are in Variant Calling Format (VCF). These files are easily parseable and formatted according to well-documented specifications, and many tools exist for rapidly querying and filtering VCF files.

We understand that different institutions store and process variant-level data in different ways. You can choose to get your individual-level variant data into the correct format however you want! Here we detail the steps we took to produce formatted variant files from input VCF files.

Prerequisites

bcftools
htslib

🌀 Step 1: Extract trio-level variants of interest with BCFtools

0️⃣ Generate a single VCF file containing genotypes for (at minimum) the proband, mother, and father.

We suggest jointly-calling your SNV/indel variants across family members in order to fill in homozygous reference (i.e., 0/0) genotypes indicating where a variant is not present in a family member. There is a GATK implementation for joint-calling. We jointly called all variants across the entire UDN cohort using Sentieon tools.

If you have three separate VCFs, you can merge them into a single multi-sample VCF using bcftools merge, but variants that are truly not present in a family member or that simply weren't called due to quality and coverage issues will be indistinguishable and represented as ./..

1️⃣ Generate a tab-delimited file called `trio_files.txt` with the following FOUR whitespace-delimited columns in order:

proband_id: a UNIQUE (per-trio) identifier to be used as the output file prefix; cannot be a substring of any other identifier
proband_id,maternal_id,paternal_id: comma-separated list of sample IDs in this order that match (a subset of) sample IDs present in the starting VCF
/full/path/to/starting/samples.vcf.gz: family- or cohort-level joint-called VCF that has been block-gzipped (using bgzip or another implementation) and indexed (using tabix or another implementation, such that /full/path/to/starting/samples.vcf.gz.tbi also exists)
/path/to/trio/level/VCFs/proband_id: output file DIRECTORY where output files will be written

Starting input VCF (3) might be the same for all trios in your cohort (for instance, if you have a cohort-level jointly-called file).

Output directory where all processed files will be stored (4) can also be the same for all trios in your cohort.

2️⃣ Set the following variables:

# reference FASTA file used to generate your VCF (needed for normalizing variants before annotation) 
REF_FASTA=/path/to/reference/file/used/to/generate/input/vcf.fasta  

# whitespace delimited mapping of chromosome identifiers (e.g., `chr1 1`)
MAP_CHROMS=/path/to/whitespace/delimited/file/from/ucsc_chromosome_ids/to/ensembl_chromsome_ids.txt 

# whitespace-delimited file you just created with a single line for each trio in your cohort
TRIO_FILE=trio_files.txt

# best place to store (maybe large) temporary files 
tmp_path=/tmp

# minimum read depth for variants you would like to consider (depends on the coverage depth of your samples)
READ_DEPTH=10

# maternally inherited variants: 
hetmom="FORMAT/DP[0]>$READ_DEPTH && FORMAT/DP[1]>$READ_DEPTH && FORMAT/DP[2]>$READ_DEPTH && (FORMAT/AD[0:1])/(FORMAT/DP[0])>0.2 && (FORMAT/AD[0:1])/(FORMAT/DP[0])<0.8 && (FORMAT/AD[1:1])/(FORMAT/DP[1])>0.2 && (FORMAT/AD[1:1])/(FORMAT/DP[1])<0.8 && (FORMAT/AD[2:1]=0 || FORMAT/AD[2:1]=\".\")"

# paternally inherited variants:
hetdad="FORMAT/DP[0]>$READ_DEPTH && FORMAT/DP[1]>$READ_DEPTH && FORMAT/DP[2]>$READ_DEPTH && (FORMAT/AD[0:1])/(FORMAT/DP[0])>0.2 && (FORMAT/AD[0:1])/(FORMAT/DP[0])<0.8 && (FORMAT/AD[1:1]=0 || FORMAT/AD[1:1]=\".\") && (FORMAT/AD[2:1])/(FORMAT/DP[2])>0.2 && (FORMAT/AD[2:1])/(FORMAT/DP[2])<0.8"

# putative de novos: 
denovo="FORMAT/DP[0]>$READ_DEPTH && FORMAT/DP[1]>$READ_DEPTH && FORMAT/DP[2]>$READ_DEPTH && (FORMAT/AD[0:1])/(FORMAT/DP[0])>0.2 && (FORMAT/AD[1:1]=0 || FORMAT/AD[1:1]=\".\") && (FORMAT/AD[2:1]=0 || FORMAT/AD[2:1]=\".\")"

3️⃣ Run the following. We suggest parallelizing this step.

while IFS=$'\t' read -r trio_id samples invcf outdir; do
   
    echo "Trio ID: $trio_id"
    echo "Samples: $samples"
    echo "Input VCF: $invcf"
    echo "Output Directory: $outdir"
    
    # step 1: extract the TRIO bcf
    echo "Step 1: creating trio BCF for quick querying: ${trio_id}"
    mkdir -p ${outdir} || exit 1
    bcftools view -Oz --samples=${samples} --trim-alt-alleles ${invcf} > ${tmp_path}/${trio_id}.vcf.gz || exit 1
    tabix --force ${tmp_path}/${trio_id}.vcf.gz || exit 1

    bcftoofs annotate -Ou --rename-chrs=${MAP_CHROMS} ${tmp_path}/${trio_id}.vcf.gz \
    | bcftools norm -Ou -m -any -f ${REF_FASTA} > ${tmp_path}/${trio_id}.bcf || exit 1

    # step 2: extract variants uniquely inherited from mom/dad or denovos:
    echo "Step 2: extracting maternal variants: ${trio_id}"
    bcftools view -Oz --include "$hetmom" ${tmp_path}/${trio_id}.bcf > ${outdir}/${trio_id}.hetmom.vcf.gz || exit 1
    tabix --force ${outdir}/${trio_id}.hetmom.vcf.gz || exit 1

    echo "Step 3: extracting paternal variants: ${trio_id}"
    bcftools view -Oz --include "$hetdad" ${tmp_path}/${trio_id}.bcf > ${outdir}/${trio_id}.hetdad.vcf.gz || exit 1
    tabix --force ${outdir}/${trio_id}.hetdad.vcf.gz || exit 1

    echo "Step 4: extracting denovo variants: ${trio_id}"
    bcftools view -Oz --include "$denovo" ${tmp_path}/${trio_id}.bcf > ${outdir}/${trio_id}.denovo.vcf.gz || exit 1
    tabix --force ${outdir}/${trio_id}.denovo.vcf.gz || exit 1

    rm ${tmp_path}/${trio_id}.vcf.gz ${tmp_path}/${trio_id}.vcf.gz.tbi ${tmp_path}/${trio_id}.bcf || exit 1

done < "$TRIO_FILE"

🌀 Step 2: Annotate VCFs however you want!

We have successfully used many annotation tools (and combinations of tools) in the past, including vcfanno, echtvar, slivar, bcftools annotate, Ensembl VEP, and others. Whatever you choose, the following annotations must be present in your final VCF:

CADD v1.6 scores and VEP consequences are downloadable from https://cadd.gs.washington.edu/
- precomputed scores for the subset of indels present in gnomAD are available; you can and should manually annotate any remaining indels using the CADD web browser
SpliceAI scores are downloadable from Illumina (after logging in), linked from https://github.com/Illumina/SpliceAI
gnomAD allele frequencies are downloadable from https://gnomad.broadinstitute.org/downloads (but we suggest annotating using an optimized tool such as slivar gnotate)
Roulette mutation rates (and quality tracks) are downloadable from http://genetics.bwh.harvard.edu/downloads/Vova/Roulette/

🌀 Step 3: Create final tab-delimited variant input files

Suppose you have your three VCF files per trio (from the previous step) that each look something like this:

> zless ${OUTDIR}/${trio_id}.hetmom.vcf.gz

##INFO=<ID=ConsDetail,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP">
##INFO=<ID=RawScore,Number=A,Type=Float,Description="Raw CADD score">
##INFO=<ID=PHRED,Number=A,Type=Float,Description="PHRED-like scaled CADD score">
##INFO=<ID=GeneID,Number=.,Type=String,Description="Ensembl Stable Gene ID">
##INFO=<ID=GeneName,Number=.,Type=String,Description="HGNC Gene Name">
##INFO=<ID=MAF,Number=A,Type=Float,Description="Maximum minor allele frequency found across gnomad v2 and v3 populations and TOPMed">
##INFO=<ID=DenovoMutationRate,Number=A,Type=Float,Description="De novo mutation rate from Roulette v5">
##INFO=<ID=DenovoQuality,Number=.,Type=String,Description="Low if in a poorly aligned/repetitve region, region with no gnomAD variants; high otherwise">
##INFO=<ID=SpliceAI_acc_gain,Number=.,Type=String,Description="SpliceAI predicted effect on splicing. Delta score for acceptor gain">
##INFO=<ID=SpliceAI_acc_loss,Number=.,Type=String,Description="SpliceAI predicted effect on splicing. Delta score for acceptor loss">
##INFO=<ID=SpliceAI_don_gain,Number=.,Type=String,Description="SpliceAI predicted effect on splicing. Delta score for donor gain">
##INFO=<ID=SpliceAI_don_loss,Number=.,Type=String,Description="SpliceAI predicted effect on splicing. Delta score for donor loss">
#CHROM  POS       ID         REF        ALT       QUAL      FILTER    INFO      FORMAT    PROBAND_ID  MOM_ID     DAD_ID
11      77872333  .          A          G         .         .         ConsDetail=intron_variant,3_prime_UTR_variant;RawScore=1.80659;PHRED=17.78;GeneID=ENSG00000087884;GeneName=AAMDC;MAF=0.00420489;DenovoMutationRate=8.16e-14;DenovoQuality=high;SpliceAI_acc_gain=0.0;SpliceAI_acc_loss=0.0;SpliceAI_don_gain=0.0;SpliceAI_don_loss=0.0   GT:AD:DP 0/1:18,21:39    0/1:30,40:70    0/0:42,0:42

You can run the following bash script to generate a tab-delimited input file:

OUTFILE=${OUTDIR}/${trio_id}.processed_variants.tsv
HETMOM=${OUTDIR}/${trio_id}.hetmom.vcf.gz
HETDAD=${OUTDIR}/${trio_id}.hetdad.vcf.gz
DENOVO=${OUTDIR}/${trio_id}.denovo.vcf.gz

echo -e "ensembl_gene_id\tchromosome\t1-indexed_location\tref_allele\talt_allele\tconsequence\tinherited_from\tCADD-raw\tMAF\tSpliceAI_acceptor-gain-score\tSpliceAI_acceptor-loss-score\tSpliceAI_donor-gain-score\tSpliceAI_donor-loss-score\tDenovoMutationRate" > ${OUTFILE}

bcftools query -f '%INFO/GeneID\t%CHROM\t%POS\t%REF\t%ALT\t%INFO/ConsDetail\tmom\t%INFO/RawScore\t%INFO/MAF\t%INFO/SpliceAI_acc_gain\t%INFO/SpliceAI_acc_loss\t%INFO/SpliceAI_don_gain\t%INFO/SpliceAI_don_loss\t%INFO/DenovoMutationRate,%INFO/DenovoQuality' ${HETMOM} >> ${OUTFILE}

bcftools query -f '%INFO/GeneID\t%CHROM\t%POS\t%REF\t%ALT\t%INFO/ConsDetail\tdad\t%INFO/RawScore\t%INFO/MAF\t%INFO/SpliceAI_acc_gain\t%INFO/SpliceAI_acc_loss\t%INFO/SpliceAI_don_gain\t%INFO/SpliceAI_don_loss\t%INFO/DenovoMutationRate,%INFO/DenovoQuality' ${HETDAD} >> ${OUTFILE}

bcftools query -f '%INFO/GeneID\t%CHROM\t%POS\t%REF\t%ALT\t%INFO/ConsDetail\t\t%INFO/RawScore\t%INFO/MAF\t%INFO/SpliceAI_acc_gain\t%INFO/SpliceAI_acc_loss\t%INFO/SpliceAI_don_gain\t%INFO/SpliceAI_don_loss\t%INFO/DenovoMutationRate,%INFO/DenovoQuality' ${DENOVO} >> ${OUTFILE}

bgzip --force ${OUTFILE}

RaMeDiES | Last updated: February 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Variant data input

🌀 About

🌀 Final variant file format

🌀 Processing Variant Call Format (VCF) files

Prerequisites

🌀 Step 1: Extract trio-level variants of interest with BCFtools

0️⃣ Generate a single VCF file containing genotypes for (at minimum) the proband, mother, and father.

1️⃣ Generate a tab-delimited file called `trio_files.txt` with the following FOUR whitespace-delimited columns in order:

2️⃣ Set the following variables:

3️⃣ Run the following. We suggest parallelizing this step.

🌀 Step 2: Annotate VCFs however you want!

🌀 Step 3: Create final tab-delimited variant input files

RaMeDiES Manual

Home

Enabling cross-cohort meta-analyses

Data preparation

Statistical models:

De novo recurrence

Comphet recurrence

Individual comphets

Pathway analysis

Clone this wiki locally

Variant data input

🌀 About

🌀 Final variant file format

🌀 Processing Variant Call Format (VCF) files

Prerequisites

🌀 Step 1: Extract trio-level variants of interest with BCFtools

0️⃣ Generate a single VCF file containing genotypes for (at minimum) the proband, mother, and father.

1️⃣ Generate a tab-delimited file called trio_files.txt with the following FOUR whitespace-delimited columns in order:

2️⃣ Set the following variables:

3️⃣ Run the following. We suggest parallelizing this step.

🌀 Step 2: Annotate VCFs however you want!

🌀 Step 3: Create final tab-delimited variant input files

RaMeDiES Manual

Home

Enabling cross-cohort meta-analyses

Data preparation

Statistical models:

De novo recurrence

Comphet recurrence

Individual comphets

Pathway analysis

Clone this wiki locally

1️⃣ Generate a tab-delimited file called `trio_files.txt` with the following FOUR whitespace-delimited columns in order: