Skip to content

Variant data input

Shilpa Nadimpalli Kobren edited this page Feb 12, 2024 · 12 revisions

πŸŒ€ About

Our RaMeDiES statistical tools require separate, tab-delimited variant files for each patient in your dataset. Here we provide details about the specific input file format and scripts for processing your variant files for use with our framework.

πŸŒ€ Final variant file format

The input files for all RaMeDiES tools have EIGHT required and SIX optional (but strongly suggested) tab-delimited columns. An uncommented header line must specify these exact column names, but they can be in any order. Other columns, including incorrectly named columns, and lines beginning with # will be ignored.

column name description example value
ensembl_gene_id Ensembl canonical gene ID without version suffix ENSG00000087884
chromosome chromosome name in Ensembl format 11
1-indexed_location variant position according to genome build GRCh38/hg38 77872333
ref_allele reference allele of variant A
alt_allele alternate allele of variant G
consequence ,- or &-delimited Ensembl VEP (Variant Effect Predictor) consequences which must be found in the VEP_cons_dict dictionary in cfg.py intron_variant,3_prime_UTR_variant
inherited_from variant inheritance; must be one of mom, dad, neither or an empty string (indicating a de novo) mom
CADD-raw CADD v1.6 raw score (i.e., not Phred-scaled) 1.80659
MAF ❗ gnomAD popmax minor allele frequency, -1 indicates not present 0.00420489
SpliceAI_acceptor-gain-score ❗ SpliceAI acceptor gain delta score 0.0
SpliceAI_acceptor-loss-score ❗ SpliceAI acceptor loss delta score 0.0
SpliceAI_donor-gain-score ❗ SpliceAI donor gain delta score 0.0
SpliceAI_donor-loss-score ❗ SpliceAI donor loss delta score 0.0
DenovoMutationRate ❗ comma-separated Roulette mutation rate estimate and quality track 8.16e-14,high

❗ MAF column is ignored if command-line parameter --MAF is not specified

❗ SpliceAI_* columns are ignored if command-line parameter --variant_annots=C (i.e., only consider coding variants) is specified

❗ DenovoMutationRate is ignored if command-line parameter --no_qual_track is specified. However, we strongly recommend including these values as a quality control measure.

πŸŒ€ Processing Variant Call Format (VCF) files

The genome-wide SNV/indel variants we use in our work (and have seen across cohorts) are in Variant Calling Format (VCF). These files are easily parseable and formatted according to well-documented specifications, and many tools exist for rapidly querying and filtering VCF files.

We understand that different institutions store and process variant-level data in different ways. You can choose to get your individual-level variant data into the correct format however you want! Here we detail the steps we took to produce formatted variant files from input VCF files.

Prerequisites

  • bcftools
  • htslib

πŸŒ€ Step 1: Extract trio-level variants of interest with BCFtools

1️⃣ Generate a tab-delimited file called trio_files.txt with the following FOUR whitespace-delimited columns in order:

  1. proband_id: a UNIQUE (per-trio) identifier to be used as the output file prefix; cannot be a substring of any other identifier
  2. proband_id,maternal_id,paternal_id: comma-separated list of sample IDs in this order that match (a subset of) sample IDs present in the starting VCF
  3. /full/path/to/starting/samples.vcf.gz: family- or cohort-level joint-called VCF that has been block-gzipped (using bgzip or another implementation) and indexed (using tabix or another implementation, such that /full/path/to/starting/samples.vcf.gz.tbi also exists)
  4. /path/to/trio/level/VCFs/proband_id: output file DIRECTORY where output files will be written
  • Starting input VCF (3) might be the same for all trios in your cohort (for instance, if you have a cohort-level jointly-called file).
  • Output directory where all processed files will be stored (4) can also be the same for all trios in your cohort.

2️⃣ Set the following variables:

# reference FASTA file used to generate your VCF (needed for normalizing variants before annotation) 
REF_FASTA=/path/to/reference/file/used/to/generate/input/vcf.fasta  

# whitespace delimited mapping of chromosome identifiers (e.g., `chr1 1`)
MAP_CHROMS=/path/to/whitespace/delimited/file/from/ucsc_chromosome_ids/to/ensembl_chromsome_ids.txt 

# whitespace-delimited file you just created with a single line for each trio in your cohort
TRIO_FILE=trio_files.txt

# best place to store (maybe large) temporary files 
TMP_PATH=/tmp

# minimum read depth for variants you would like to consider (depends on the coverage depth of your samples)
READ_DEPTH=10

# variants that are het in child, het in mother, and absent in father
hetmom='FORMAT/DP[0]>$READ_DEPTH && \
FORMAT/DP[1]>$READ_DEPTH && \
FORMAT/DP[2]>$READ_DEPTH && \
(FORMAT/AD[0:1])/(FORMAT/DP[0])>0.2 && (FORMAT/AD[0:1])/(FORMAT/DP[0])<0.8 && \
(FORMAT/AD[1:1])/(FORMAT/DP[1])>0.2 && (FORMAT/AD[1:1])/(FORMAT/DP[1])<0.8 && \
(FORMAT/AD[2:1]=0 || FORMAT/AD[2:1]=".")'

# variants that are het in child, absent in mother, and het in father
hetdad='FORMAT/DP[0]>$READ_DEPTH && \
FORMAT/DP[1]>$READ_DEPTH && \
FORMAT/DP[2]>$READ_DEPTH && \
(FORMAT/AD[0:1])/(FORMAT/DP[0])>0.2 && (FORMAT/AD[0:1])/(FORMAT/DP[0])<0.8 && \
(FORMAT/AD[1:1]=0 || FORMAT/AD[1:1]=".") && \
(FORMAT/AD[2:1])/(FORMAT/DP[2])>0.2 && (FORMAT/AD[2:1])/(FORMAT/DP[2])<0.8'

# variants that are present in child and absent in both parents
denovo='FORMAT/DP[0]>$READ_DEPTH && \
FORMAT/DP[1]>$READ_DEPTH && \
FORMAT/DP[2]>$READ_DEPTH && \
(FORMAT/AD[0:1])/(FORMAT/DP[0])>0.2 && \
(FORMAT/AD[1:1]=0 || FORMAT/AD[1:1]=".") && \
(FORMAT/AD[2:1]=0 || FORMAT/AD[2:1]=".")'

3️⃣ Run the following. We suggest parallelizing this step.

# for each trio: 
trios=( $( cat ${TRIO_FILE} | awk '{print $1}' | sort | uniq ) )
for trio_id in ${trios[@]}; do

  SAMPLES=$( grep "^$trio_id" $TRIO_FILE | awk '{print $2}' )
  INVCF=$( grep "^$trio_id" $TRIO_FILE | awk '{print $3}' )
  OUTDIR=$( grep "^$trio_id" $TRIO_FILE | awk '{print $4}' )

  # extract a trio VCF with sample columns in the correct order
  bcftools view \
  --samples=$SAMPLES \
  -Oz \
  ${INVCF} \
  > ${TMP_PATH}/${trio_id}.trio.vcf.gz

  tabix --force ${TMP_PATH}/${trio_id}.trio.vcf.gz

  # normalize the VCF (for annotation) and rapid querying
  bcftools annotate \
  --rename-chrs=${MAP_CHROMS} \
  -Ou \
  ${TMP_PATH}/${trio_id}.trio.vcf.gz \
  | bcftools norm 
  -m \
  -any \
  -f ${REF_FASTA} \
  -Ou \
  > ${TMP_PATH}/${trio_id}.trio.bcf

  # extract variants uniquely inherited from mother: 
  bcftools view \
  -Oz \
  --include ${hetmom} \
  ${TMP_PATH}/${trio_id}.trio.bcf \
  > ${OUTDIR}/${trio_id}.hetmom.vcf.gz

  # extract variants uniquely inherited from the father: 
  bcftools view \
  -Oz \
  --include ${hetdad} \
  ${TMP_PATH}/${trio_id}.trio.bcf \
  > ${OUTDIR}/${trio_id}.hetdad.vcf.gz

  # extract putative de novo variants
  bcftools view \
  -Oz \
  --include ${denovo} \
  ${TMP_PATH}/${trio_id}.trio.bcf \
  > ${OUTDIR}/${trio_id}.denovo.vcf.gz

done

πŸŒ€ Step 2: Annotate VCFs however you want!

We have successfully used many annotation tools (and combinations of tools) in the past, including vcfanno, echtvar, slivar, bcftools annotate, Ensembl VEP, and others. Whatever you choose, the following annotations must be present in your final VCF:

πŸŒ€ Step 3: Create final tab-delimited variant input files

Suppose you have your three VCF files per trio (from the previous step) that each look something like this:

> zless ${OUTDIR}/${trio_id}.hetmom.vcf.gz
##INFO=<ID=ConsDetail,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP">
##INFO=<ID=RawScore,Number=A,Type=Float,Description="Raw CADD score">
##INFO=<ID=PHRED,Number=A,Type=Float,Description="PHRED-like scaled CADD score">
##INFO=<ID=GeneID,Number=.,Type=String,Description="Ensembl Stable Gene ID">
##INFO=<ID=GeneName,Number=.,Type=String,Description="HGNC Gene Name">
##INFO=<ID=MAF,Number=A,Type=Float,Description="Maximum minor allele frequency found across gnomad v2 and v3 populations and TOPMed">
##INFO=<ID=DenovoMutationRate,Number=A,Type=Float,Description="De novo mutation rate from Roulette v5">
##INFO=<ID=DenovoQuality,Number=.,Type=String,Description="Low if in a poorly aligned/repetitve region, region with no gnomAD variants; high otherwise">
##INFO=<ID=SpliceAI_acc_gain,Number=.,Type=String,Description="SpliceAI predicted effect on splicing. Delta score for acceptor gain">
##INFO=<ID=SpliceAI_acc_loss,Number=.,Type=String,Description="SpliceAI predicted effect on splicing. Delta score for acceptor loss">
##INFO=<ID=SpliceAI_don_gain,Number=.,Type=String,Description="SpliceAI predicted effect on splicing. Delta score for donor gain">
##INFO=<ID=SpliceAI_don_loss,Number=.,Type=String,Description="SpliceAI predicted effect on splicing. Delta score for donor loss">
#CHROM  POS       ID         REF        ALT       QUAL      FILTER    INFO      FORMAT    PROBAND_ID  MOM_ID     DAD_ID
11      77872333  .          A          G         .         .         ConsDetail=intron_variant,3_prime_UTR_variant;RawScore=1.80659;PHRED=17.78;GeneID=ENSG00000087884;GeneName=AAMDC;MAF=0.00420489;DenovoMutationRate=8.16e-14;DenovoQuality=high;SpliceAI_acc_gain=0.0;SpliceAI_acc_loss=0.0;SpliceAI_don_gain=0.0;SpliceAI_don_loss=0.0   GT:AD:DP 0/1:18,21:39    0/1:30,40:70    0/0:42,0:42

You can run the following Python script to generate a tab-delimited input file: