-
Notifications
You must be signed in to change notification settings - Fork 0
Variant data input
Our RaMeDiES statistical tools require separate, tab-delimited variant files for each patient in your dataset. Here we provide details about the specific input file format and scripts for processing your variant files for use with our framework.
The input files for all RaMeDiES tools have EIGHT required and SIX optional (but strongly suggested) tab-delimited columns. An uncommented header line must specify these exact column names, but they can be in any order. Other columns, including incorrectly named columns, and lines beginning with #
will be ignored.
column name | description | example value |
---|---|---|
ensembl_gene_id |
Ensembl canonical gene ID without version suffix | ENSG00000087884 |
chromosome |
chromosome name in Ensembl format | 11 |
1-indexed_location |
variant position according to genome build GRCh38/hg38 | 77872333 |
ref_allele |
reference allele of variant | A |
alt_allele |
alternate allele of variant | G |
consequence |
, - or & -delimited Ensembl VEP (Variant Effect Predictor) consequences which must be found in the VEP_cons_dict dictionary in cfg.py
|
intron_variant,3_prime_UTR_variant |
inherited_from |
variant inheritance; must be one of mom , dad , neither or an empty string (indicating a de novo) |
mom |
CADD-raw |
CADD v1.6 raw score (i.e., not Phred-scaled) | 1.80659 |
MAF |
β gnomAD popmax minor allele frequency, -1 indicates not present | 0.00420489 |
SpliceAI_acceptor-gain-score |
β SpliceAI acceptor gain delta score | 0.0 |
SpliceAI_acceptor-loss-score |
β SpliceAI acceptor loss delta score | 0.0 |
SpliceAI_donor-gain-score |
β SpliceAI donor gain delta score | 0.0 |
SpliceAI_donor-loss-score |
β SpliceAI donor loss delta score | 0.0 |
DenovoMutationRate |
β comma-separated Roulette mutation rate estimate and quality track | 8.16e-14,high |
β
MAF
column is ignored if command-line parameter--MAF
is not specifiedβ
SpliceAI_*
columns are ignored if command-line parameter--variant_annots=C
(i.e., only consider coding variants) is specifiedβ
DenovoMutationRate
is ignored if command-line parameter--no_qual_track
is specified. However, we strongly recommend including these values as a quality control measure.
The genome-wide SNV/indel variants we use in our work (and have seen across cohorts) are in Variant Calling Format (VCF). These files are easily parseable and formatted according to well-documented specifications, and many tools exist for rapidly querying and filtering VCF files.
We understand that different institutions store and process variant-level data in different ways. You can choose to get your individual-level variant data into the correct format however you want! Here we detail the steps we took to produce formatted variant files from input VCF files.
bcftools
htslib
1οΈβ£ Generate a tab-delimited file called trio_files.txt
with the following FOUR whitespace-delimited columns in order:
-
proband_id
: a UNIQUE (per-trio) identifier to be used as the output file prefix; cannot be a substring of any other identifier -
proband_id,maternal_id,paternal_id
: comma-separated list of sample IDs in this order that match (a subset of) sample IDs present in the starting VCF -
/full/path/to/starting/samples.vcf.gz
: family- or cohort-level joint-called VCF that has been block-gzipped (usingbgzip
or another implementation) and indexed (usingtabix
or another implementation, such that/full/path/to/starting/samples.vcf.gz.tbi
also exists) -
/path/to/trio/level/VCFs/proband_id
: output file DIRECTORY where output files will be written
- Starting input VCF (3) might be the same for all trios in your cohort (for instance, if you have a cohort-level jointly-called file).
- Output directory where all processed files will be stored (4) can also be the same for all trios in your cohort.
2οΈβ£ Set the following variables:
# reference FASTA file used to generate your VCF (needed for normalizing variants before annotation)
REF_FASTA=/path/to/reference/file/used/to/generate/input/vcf.fasta
# whitespace delimited mapping of chromosome identifiers (e.g., `chr1 1`)
MAP_CHROMS=/path/to/whitespace/delimited/file/from/ucsc_chromosome_ids/to/ensembl_chromsome_ids.txt
# whitespace-delimited file you just created with a single line for each trio in your cohort
TRIO_FILE=trio_files.txt
# best place to store (maybe large) temporary files
TMP_PATH=/tmp
# minimum read depth for variants you would like to consider (depends on the coverage depth of your samples)
READ_DEPTH=10
# variants that are het in child, het in mother, and absent in father
hetmom='FORMAT/DP[0]>$READ_DEPTH && \
FORMAT/DP[1]>$READ_DEPTH && \
FORMAT/DP[2]>$READ_DEPTH && \
(FORMAT/AD[0:1])/(FORMAT/DP[0])>0.2 && (FORMAT/AD[0:1])/(FORMAT/DP[0])<0.8 && \
(FORMAT/AD[1:1])/(FORMAT/DP[1])>0.2 && (FORMAT/AD[1:1])/(FORMAT/DP[1])<0.8 && \
(FORMAT/AD[2:1]=0 || FORMAT/AD[2:1]=".")'
# variants that are het in child, absent in mother, and het in father
hetdad='FORMAT/DP[0]>$READ_DEPTH && \
FORMAT/DP[1]>$READ_DEPTH && \
FORMAT/DP[2]>$READ_DEPTH && \
(FORMAT/AD[0:1])/(FORMAT/DP[0])>0.2 && (FORMAT/AD[0:1])/(FORMAT/DP[0])<0.8 && \
(FORMAT/AD[1:1]=0 || FORMAT/AD[1:1]=".") && \
(FORMAT/AD[2:1])/(FORMAT/DP[2])>0.2 && (FORMAT/AD[2:1])/(FORMAT/DP[2])<0.8'
# variants that are present in child and absent in both parents
denovo='FORMAT/DP[0]>$READ_DEPTH && \
FORMAT/DP[1]>$READ_DEPTH && \
FORMAT/DP[2]>$READ_DEPTH && \
(FORMAT/AD[0:1])/(FORMAT/DP[0])>0.2 && \
(FORMAT/AD[1:1]=0 || FORMAT/AD[1:1]=".") && \
(FORMAT/AD[2:1]=0 || FORMAT/AD[2:1]=".")'
3οΈβ£ Run the following. We suggest parallelizing this step.
# for each trio:
trios=( $( cat ${TRIO_FILE} | awk '{print $1}' | sort | uniq ) )
for trio_id in ${trios[@]}; do
SAMPLES=$( grep "^$trio_id" $TRIO_FILE | awk '{print $2}' )
INVCF=$( grep "^$trio_id" $TRIO_FILE | awk '{print $3}' )
OUTDIR=$( grep "^$trio_id" $TRIO_FILE | awk '{print $4}' )
# extract a trio VCF with sample columns in the correct order
bcftools view \
--samples=$SAMPLES \
-Oz \
${INVCF} \
> ${TMP_PATH}/${trio_id}.trio.vcf.gz
tabix --force ${TMP_PATH}/${trio_id}.trio.vcf.gz
# normalize the VCF (for annotation) and rapid querying
bcftools annotate \
--rename-chrs=${MAP_CHROMS} \
-Ou \
${TMP_PATH}/${trio_id}.trio.vcf.gz \
| bcftools norm
-m \
-any \
-f ${REF_FASTA} \
-Ou \
> ${TMP_PATH}/${trio_id}.trio.bcf
# extract variants uniquely inherited from mother:
bcftools view \
-Oz \
--include ${hetmom} \
${TMP_PATH}/${trio_id}.trio.bcf \
> ${OUTDIR}/${trio_id}.hetmom.vcf.gz
# extract variants uniquely inherited from the father:
bcftools view \
-Oz \
--include ${hetdad} \
${TMP_PATH}/${trio_id}.trio.bcf \
> ${OUTDIR}/${trio_id}.hetdad.vcf.gz
# extract putative de novo variants
bcftools view \
-Oz \
--include ${denovo} \
${TMP_PATH}/${trio_id}.trio.bcf \
> ${OUTDIR}/${trio_id}.denovo.vcf.gz
done
We have successfully used many annotation tools (and combinations of tools) in the past, including vcfanno, echtvar, slivar, bcftools annotate, Ensembl VEP, and others. Whatever you choose, the following annotations must be present in your final VCF:
- CADD v1.6 scores and VEP consequences are downloadable from https://cadd.gs.washington.edu/
- precomputed scores for the subset of indels present in gnomAD are available; you can and should manually annotate any remaining indels using the CADD web browser
- SpliceAI scores are downloadable from Illumina (after logging in), linked from https://github.com/Illumina/SpliceAI
- gnomAD allele frequencies are downloadable from https://gnomad.broadinstitute.org/downloads (but we suggest annotating using an optimized tool such as
slivar gnotate
) - Roulette mutation rates (and quality tracks) are downloadable from http://genetics.bwh.harvard.edu/downloads/Vova/Roulette/
Suppose you have your three VCF files per trio (from the previous step) that each look something like this:
> zless ${OUTDIR}/${trio_id}.hetmom.vcf.gz
##INFO=<ID=ConsDetail,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP">
##INFO=<ID=RawScore,Number=A,Type=Float,Description="Raw CADD score">
##INFO=<ID=PHRED,Number=A,Type=Float,Description="PHRED-like scaled CADD score">
##INFO=<ID=GeneID,Number=.,Type=String,Description="Ensembl Stable Gene ID">
##INFO=<ID=GeneName,Number=.,Type=String,Description="HGNC Gene Name">
##INFO=<ID=MAF,Number=A,Type=Float,Description="Maximum minor allele frequency found across gnomad v2 and v3 populations and TOPMed">
##INFO=<ID=DenovoMutationRate,Number=A,Type=Float,Description="De novo mutation rate from Roulette v5">
##INFO=<ID=DenovoQuality,Number=.,Type=String,Description="Low if in a poorly aligned/repetitve region, region with no gnomAD variants; high otherwise">
##INFO=<ID=SpliceAI_acc_gain,Number=.,Type=String,Description="SpliceAI predicted effect on splicing. Delta score for acceptor gain">
##INFO=<ID=SpliceAI_acc_loss,Number=.,Type=String,Description="SpliceAI predicted effect on splicing. Delta score for acceptor loss">
##INFO=<ID=SpliceAI_don_gain,Number=.,Type=String,Description="SpliceAI predicted effect on splicing. Delta score for donor gain">
##INFO=<ID=SpliceAI_don_loss,Number=.,Type=String,Description="SpliceAI predicted effect on splicing. Delta score for donor loss">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT PROBAND_ID MOM_ID DAD_ID
11 77872333 . A G . . ConsDetail=intron_variant,3_prime_UTR_variant;RawScore=1.80659;PHRED=17.78;GeneID=ENSG00000087884;GeneName=AAMDC;MAF=0.00420489;DenovoMutationRate=8.16e-14;DenovoQuality=high;SpliceAI_acc_gain=0.0;SpliceAI_acc_loss=0.0;SpliceAI_don_gain=0.0;SpliceAI_don_loss=0.0 GT:AD:DP 0/1:18,21:39 0/1:30,40:70 0/0:42,0:42
You can run the following Python script to generate a tab-delimited input file:
RaMeDiES | Last updated: February 2024