Skip to content

Variant data input

Shilpa Nadimpalli Kobren edited this page Aug 14, 2024 · 12 revisions

🌀 About

Our RaMeDiES statistical tools require separate, tab-delimited variant files for each patient in your dataset. Here we provide details about the specific input file format and scripts for processing your variant files for use with our framework.

🌀 Final variant file format

The input files for all RaMeDiES tools have EIGHT required and SIX optional (but strongly suggested) tab-delimited columns. An uncommented header line must specify these exact column names, but they can be in any order. Other columns, including incorrectly named columns, and lines beginning with # will be ignored. Examples of properly formatted input files can be found in the test/input/ directory in our repository.

column name description example value
ensembl_gene_id Ensembl canonical gene ID without version suffix ENSG00000087884
chromosome chromosome name in Ensembl format 11
1-indexed_location variant position according to genome build GRCh38/hg38 77872333
ref_allele reference allele of variant A
alt_allele alternate allele of variant G
consequence ,- or &-delimited Ensembl VEP (Variant Effect Predictor) consequences which must be found in the VEP_cons_dict dictionary in cfg.py intron_variant,3_prime_UTR_variant
inherited_from variant inheritance; must be one of mom, dad, neither or an empty string (indicating a de novo) mom
CADD-raw CADD v1.6 raw score (i.e., not Phred-scaled) 1.80659
MAF ❗ gnomAD popmax minor allele frequency, -1 indicates not present 0.00420489
SpliceAI_acceptor-gain-score ❗ SpliceAI acceptor gain delta score 0.0
SpliceAI_acceptor-loss-score ❗ SpliceAI acceptor loss delta score 0.0
SpliceAI_donor-gain-score ❗ SpliceAI donor gain delta score 0.0
SpliceAI_donor-loss-score ❗ SpliceAI donor loss delta score 0.0
DenovoMutationRate ❗ comma-separated Roulette mutation rate estimate and quality track 8.16e-14,high

MAF column is ignored if command-line parameter --MAF is not specified

SpliceAI_* columns are ignored if command-line parameter --variant_annots=C (i.e., only consider coding variants) is specified

DenovoMutationRate is ignored if command-line parameter --no_qual_track is specified. However, we strongly recommend including these values as a quality control measure.

🌀 Processing Variant Call Format (VCF) files

The genome-wide SNV/indel variants we use in our work (and have seen across cohorts) are in Variant Calling Format (VCF). These files are easily parseable and formatted according to well-documented specifications, and many tools exist for rapidly querying and filtering VCF files.

We understand that different institutions store and process variant-level data in different ways. You can choose to get your individual-level variant data into the correct format however you want! Here we detail the steps we took to produce formatted variant files from input VCF files.

Prerequisites

  • bcftools
  • htslib

🌀 Step 1: Extract trio-level variants of interest with BCFtools

0️⃣ Generate a single VCF file containing genotypes for (at minimum) the proband, mother, and father.

We suggest jointly-calling your SNV/indel variants across family members in order to fill in homozygous reference (i.e., 0/0) genotypes indicating where a variant is not present in a family member. There is a GATK implementation for joint-calling. We jointly called all variants across the entire UDN cohort using Sentieon tools.

If you have three separate VCFs, you can merge them into a single multi-sample VCF using bcftools merge, but variants that are truly not present in a family member or that simply weren't called due to quality and coverage issues will be indistinguishable and represented as ./..

1️⃣ Generate a tab-delimited file called trio_files.txt with the following FOUR whitespace-delimited columns in order:

  1. proband_id: a UNIQUE (per-trio) identifier to be used as the output file prefix; cannot be a substring of any other identifier
  2. proband_id,maternal_id,paternal_id: comma-separated list of sample IDs in this order that match (a subset of) sample IDs present in the starting VCF
  3. /full/path/to/starting/samples.vcf.gz: family- or cohort-level joint-called VCF that has been block-gzipped (using bgzip or another implementation) and indexed (using tabix or another implementation, such that /full/path/to/starting/samples.vcf.gz.tbi also exists)
  4. /path/to/trio/level/VCFs/proband_id: output file DIRECTORY where output files will be written
  • Starting input VCF (3) might be the same for all trios in your cohort (for instance, if you have a cohort-level jointly-called file).
  • Output directory where all processed files will be stored (4) can also be the same for all trios in your cohort.

2️⃣ Set the following variables:

# reference FASTA file used to generate your VCF (needed for normalizing variants before annotation) 
REF_FASTA=/path/to/reference/file/used/to/generate/input/vcf.fasta  

# whitespace delimited mapping of chromosome identifiers (e.g., `chr1 1`)
MAP_CHROMS=/path/to/whitespace/delimited/file/from/ucsc_chromosome_ids/to/ensembl_chromsome_ids.txt 

# whitespace-delimited file you just created with a single line for each trio in your cohort
TRIO_FILE=trio_files.txt

# best place to store (maybe large) temporary files 
tmp_path=/tmp

# minimum read depth for variants you would like to consider (depends on the coverage depth of your samples)
READ_DEPTH=10

# maternally inherited variants: 
hetmom="FORMAT/DP[0]>$READ_DEPTH && FORMAT/DP[1]>$READ_DEPTH && FORMAT/DP[2]>$READ_DEPTH && (FORMAT/AD[0:1])/(FORMAT/DP[0])>0.2 && (FORMAT/AD[0:1])/(FORMAT/DP[0])<0.8 && (FORMAT/AD[1:1])/(FORMAT/DP[1])>0.2 && (FORMAT/AD[1:1])/(FORMAT/DP[1])<0.8 && (FORMAT/AD[2:1]=0 || FORMAT/AD[2:1]=\".\")"

# paternally inherited variants:
hetdad="FORMAT/DP[0]>$READ_DEPTH && FORMAT/DP[1]>$READ_DEPTH && FORMAT/DP[2]>$READ_DEPTH && (FORMAT/AD[0:1])/(FORMAT/DP[0])>0.2 && (FORMAT/AD[0:1])/(FORMAT/DP[0])<0.8 && (FORMAT/AD[1:1]=0 || FORMAT/AD[1:1]=\".\") && (FORMAT/AD[2:1])/(FORMAT/DP[2])>0.2 && (FORMAT/AD[2:1])/(FORMAT/DP[2])<0.8"

# putative de novos: 
denovo="FORMAT/DP[0]>$READ_DEPTH && FORMAT/DP[1]>$READ_DEPTH && FORMAT/DP[2]>$READ_DEPTH && (FORMAT/AD[0:1])/(FORMAT/DP[0])>0.2 && (FORMAT/AD[1:1]=0 || FORMAT/AD[1:1]=\".\") && (FORMAT/AD[2:1]=0 || FORMAT/AD[2:1]=\".\")"

3️⃣ Run the following. We suggest parallelizing this step.

while IFS=$'\t' read -r trio_id samples invcf outdir; do
   
    echo "Trio ID: $trio_id"
    echo "Samples: $samples"
    echo "Input VCF: $invcf"
    echo "Output Directory: $outdir"
    
    # step 1: extract the TRIO bcf
    echo "Step 1: creating trio BCF for quick querying: ${trio_id}"
    mkdir -p ${outdir} || exit 1
    bcftools view -Oz --samples=${samples} --trim-alt-alleles ${invcf} > ${tmp_path}/${trio_id}.vcf.gz || exit 1
    tabix --force ${tmp_path}/${trio_id}.vcf.gz || exit 1

    bcftoofs annotate -Ou --rename-chrs=${MAP_CHROMS} ${tmp_path}/${trio_id}.vcf.gz \
    | bcftools norm -Ou -m -any -f ${REF_FASTA} > ${tmp_path}/${trio_id}.bcf || exit 1

    # step 2: extract variants uniquely inherited from mom/dad or denovos:
    echo "Step 2: extracting maternal variants: ${trio_id}"
    bcftools view -Oz --include "$hetmom" ${tmp_path}/${trio_id}.bcf > ${outdir}/${trio_id}.hetmom.vcf.gz || exit 1
    tabix --force ${outdir}/${trio_id}.hetmom.vcf.gz || exit 1

    echo "Step 3: extracting paternal variants: ${trio_id}"
    bcftools view -Oz --include "$hetdad" ${tmp_path}/${trio_id}.bcf > ${outdir}/${trio_id}.hetdad.vcf.gz || exit 1
    tabix --force ${outdir}/${trio_id}.hetdad.vcf.gz || exit 1

    echo "Step 4: extracting denovo variants: ${trio_id}"
    bcftools view -Oz --include "$denovo" ${tmp_path}/${trio_id}.bcf > ${outdir}/${trio_id}.denovo.vcf.gz || exit 1
    tabix --force ${outdir}/${trio_id}.denovo.vcf.gz || exit 1

    rm ${tmp_path}/${trio_id}.vcf.gz ${tmp_path}/${trio_id}.vcf.gz.tbi ${tmp_path}/${trio_id}.bcf || exit 1

done < "$TRIO_FILE"

🌀 Step 2: Annotate VCFs however you want!

We have successfully used many annotation tools (and combinations of tools) in the past, including vcfanno, echtvar, slivar, bcftools annotate, Ensembl VEP, and others. Whatever you choose, the following annotations must be present in your final VCF:

🌀 Step 3: Create final tab-delimited variant input files

Suppose you have your three VCF files per trio (from the previous step) that each look something like this:

> zless ${OUTDIR}/${trio_id}.hetmom.vcf.gz

##INFO=<ID=ConsDetail,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP">
##INFO=<ID=RawScore,Number=A,Type=Float,Description="Raw CADD score">
##INFO=<ID=PHRED,Number=A,Type=Float,Description="PHRED-like scaled CADD score">
##INFO=<ID=GeneID,Number=.,Type=String,Description="Ensembl Stable Gene ID">
##INFO=<ID=GeneName,Number=.,Type=String,Description="HGNC Gene Name">
##INFO=<ID=MAF,Number=A,Type=Float,Description="Maximum minor allele frequency found across gnomad v2 and v3 populations and TOPMed">
##INFO=<ID=DenovoMutationRate,Number=A,Type=Float,Description="De novo mutation rate from Roulette v5">
##INFO=<ID=DenovoQuality,Number=.,Type=String,Description="Low if in a poorly aligned/repetitve region, region with no gnomAD variants; high otherwise">
##INFO=<ID=SpliceAI_acc_gain,Number=.,Type=String,Description="SpliceAI predicted effect on splicing. Delta score for acceptor gain">
##INFO=<ID=SpliceAI_acc_loss,Number=.,Type=String,Description="SpliceAI predicted effect on splicing. Delta score for acceptor loss">
##INFO=<ID=SpliceAI_don_gain,Number=.,Type=String,Description="SpliceAI predicted effect on splicing. Delta score for donor gain">
##INFO=<ID=SpliceAI_don_loss,Number=.,Type=String,Description="SpliceAI predicted effect on splicing. Delta score for donor loss">
#CHROM  POS       ID         REF        ALT       QUAL      FILTER    INFO      FORMAT    PROBAND_ID  MOM_ID     DAD_ID
11      77872333  .          A          G         .         .         ConsDetail=intron_variant,3_prime_UTR_variant;RawScore=1.80659;PHRED=17.78;GeneID=ENSG00000087884;GeneName=AAMDC;MAF=0.00420489;DenovoMutationRate=8.16e-14;DenovoQuality=high;SpliceAI_acc_gain=0.0;SpliceAI_acc_loss=0.0;SpliceAI_don_gain=0.0;SpliceAI_don_loss=0.0   GT:AD:DP 0/1:18,21:39    0/1:30,40:70    0/0:42,0:42

You can run the following bash script to generate a tab-delimited input file:

OUTFILE=${OUTDIR}/${trio_id}.processed_variants.tsv
HETMOM=${OUTDIR}/${trio_id}.hetmom.vcf.gz
HETDAD=${OUTDIR}/${trio_id}.hetdad.vcf.gz
DENOVO=${OUTDIR}/${trio_id}.denovo.vcf.gz

echo -e "ensembl_gene_id\tchromosome\t1-indexed_location\tref_allele\talt_allele\tconsequence\tinherited_from\tCADD-raw\tMAF\tSpliceAI_acceptor-gain-score\tSpliceAI_acceptor-loss-score\tSpliceAI_donor-gain-score\tSpliceAI_donor-loss-score\tDenovoMutationRate" > ${OUTFILE}

bcftools query -f '%INFO/GeneID\t%CHROM\t%POS\t%REF\t%ALT\t%INFO/ConsDetail\tmom\t%INFO/RawScore\t%INFO/MAF\t%INFO/SpliceAI_acc_gain\t%INFO/SpliceAI_acc_loss\t%INFO/SpliceAI_don_gain\t%INFO/SpliceAI_don_loss\t%INFO/DenovoMutationRate,%INFO/DenovoQuality' ${HETMOM} >> ${OUTFILE}

bcftools query -f '%INFO/GeneID\t%CHROM\t%POS\t%REF\t%ALT\t%INFO/ConsDetail\tdad\t%INFO/RawScore\t%INFO/MAF\t%INFO/SpliceAI_acc_gain\t%INFO/SpliceAI_acc_loss\t%INFO/SpliceAI_don_gain\t%INFO/SpliceAI_don_loss\t%INFO/DenovoMutationRate,%INFO/DenovoQuality' ${HETDAD} >> ${OUTFILE}

bcftools query -f '%INFO/GeneID\t%CHROM\t%POS\t%REF\t%ALT\t%INFO/ConsDetail\t\t%INFO/RawScore\t%INFO/MAF\t%INFO/SpliceAI_acc_gain\t%INFO/SpliceAI_acc_loss\t%INFO/SpliceAI_don_gain\t%INFO/SpliceAI_don_loss\t%INFO/DenovoMutationRate,%INFO/DenovoQuality' ${DENOVO} >> ${OUTFILE}

bgzip --force ${OUTFILE}