A collection of scripts to use as a starting point for running quality control on whole-exome sequencing datasets using Hail.
Throughout, we assume that the reference genome is GRCh38. If this isn't the reference used for calling in your dataset, switch to the appropriate reference by changing REFERENCE = 'GRCh38' in each of the Hail python scripts.
These scripts assume that you have access to a large number of cores to run them on. If not, you will need to split the data (e.g. by chromosome) and run the scripts in parallel. Examples of this procedure and intermediate plotting (as well as scripts for submitting jobs to an HPC) for very similar steps to these can be found in this repository.
A set of useful files is in the example_inputs folder:
- Low complexity region file in the expected format:
LCR-hs38.bed.gz
- Regions of high-LD (in build 38) in the expected format:
b38_high_ld.bed
- Target intervals file in the expected format:
ice_coding_v1_targets.interval_list.gz
- Padded target intervals file in the expected format:
ice_coding_v1_padded_targets.interval_list.gz
- Note that the header lines of interval_list files are prepended with @; these lines are ignored by Hail.
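If you need to read an interval_list file outside of Hail, skipping the @-prefixed header lines is straightforward. A minimal sketch in plain Python, assuming the chromosome, start, and end are the first three tab-separated fields:

```python
def read_interval_list(lines):
    """Parse interval_list lines, skipping @-prefixed header lines.

    Returns (chrom, start, end) tuples; interval_list coordinates are
    1-based and inclusive.
    """
    intervals = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("@"):
            continue  # header lines (e.g. @HD, @SQ) are ignored, as in Hail
        fields = line.split("\t")
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        intervals.append((chrom, start, end))
    return intervals
```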
Read in the joint-called VCF and write it to Hail MatrixTable format.
Inputs:
- Filepath to the joint-called .vcf to input:
${INPUT_VCF}
Outputs:
- Filepath to place the output MatrixTable:
${RAW_MT}
python 00_load_and_write_vcf_as_mt.py ${INPUT_VCF} ${RAW_MT}
- Remove sites with a large number of alleles (>6).
- Filter genotypes based on depth, likelihood, and allele balance. A genotype is set to missing if any of the conditions for its genotype class hold:
  - If homozygous reference, at least one of:
    - Genotype quality < 20
    - Depth < 10
  - If heterozygous, at least one of:
    - (Reference allele depth + alternative allele depth)/depth < 0.8
    - (Alternative allele depth)/depth < 0.2
    - Reference phred-scaled genotype posterior < 20
    - Depth < 10
  - If homozygous variant, at least one of:
    - (Alternative allele depth)/depth < 0.8
    - Reference phred-scaled genotype posterior < 20
    - Depth < 10
Inputs:
- Filepath to the MatrixTable output from step 0:
${RAW_MT}
Outputs:
- Filepath to place the output MatrixTable:
${MT}
- Filepath to place the output hard-calls MatrixTable:
${MT_HARDCALLS}
python 01_filterGT.py ${RAW_MT} ${MT} ${MT_HARDCALLS}
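The genotype filter above can be sketched in plain Python, independently of Hail. This is only an illustration of the decision logic: the numeric gt encoding and the pl_ref argument are conventions of this sketch, not the script's interface.

```python
def keep_genotype(gt, gq, dp, ad_ref, ad_alt, pl_ref):
    """Return True if the genotype passes the hard filters described above.

    gt: number of alternate alleles (0 = hom ref, 1 = het, 2 = hom var)
    gq: genotype quality; dp: depth
    ad_ref / ad_alt: reference / alternate allele depths
    pl_ref: phred-scaled likelihood of the hom-ref genotype (PL[0])
    """
    if dp < 10:
        return False  # the depth filter applies to every genotype class
    if gt == 0:  # homozygous reference
        return gq >= 20
    if gt == 1:  # heterozygous: check allele balance and hom-ref likelihood
        return ((ad_ref + ad_alt) / dp >= 0.8
                and ad_alt / dp >= 0.2
                and pl_ref >= 20)
    # homozygous variant
    return ad_alt / dp >= 0.8 and pl_ref >= 20
```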
Remove variants that either:
- Fall in a low complexity region
- Fail VQSR (note that if GATK was not used for calling, you will not be able to apply this filter).
- Fall outside of padded target intervals (we recommend 50bp padding)
- Filter out invariant sites
Inputs:
- Filepath to hard-calls MatrixTable from step 1:
${MT_HARDCALLS}
- Filepath to the target intervals file (see below for expected format):
${TARGET_INTERVALS}
- Filepath to the padded target intervals file:
${PADDED_TARGET_INTERVALS}
- Filepath to low-complexity regions intervals file:
${LCRs}
Outputs:
- Filepath to place initial QC metrics for downstream plotting:
${INITIAL_VARIANT_QC_FILE}
- Filepath to place the set of variants remaining after step 2 filtering:
${INITIAL_VARIANT_LIST}
python 02_prefilter_variants.py ${MT_HARDCALLS} ${TARGET_INTERVALS} ${PADDED_TARGET_INTERVALS} ${LCRs} ${INITIAL_VARIANT_QC_FILE} ${INITIAL_VARIANT_LIST}
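One detail worth noting when working with these files: .bed intervals are 0-based and half-open, while .interval_list intervals are 1-based and inclusive. A small sketch of the membership test for a 1-based variant position under each convention:

```python
def in_bed_interval(pos, start, end):
    """BED convention: 0-based, half-open [start, end); pos is 1-based."""
    return start < pos <= end

def in_interval_list(pos, start, end):
    """interval_list convention: 1-based, inclusive [start, end]."""
    return start <= pos <= end
```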
We have included an example target intervals file, a target intervals file with 50bp padding, and an LCRs file to show the expected format.
- Target intervals file:
example_inputs/ice_coding_v1_targets.interval_list.gz
- Padded target intervals file:
example_inputs/ice_coding_v1_padded_targets.interval_list.gz
- LCRs file:
example_inputs/LCR-hs38.bed.gz
In this step, we run the sample QC function in Hail to compute per-sample QC metrics.
Inputs:
- Filepath to hard-calls MatrixTable from step 1:
${MT_HARDCALLS}
- Filepath to the set of variants remaining after step 2 filtering:
${INITIAL_VARIANT_LIST}
- Filepath to the target intervals file (see above for expected format):
${TARGET_INTERVALS}
Outputs:
- Filepath to place initial sample QC metrics for downstream plotting:
${INITIAL_SAMPLE_QC_FILE}
python 03_0_initial_sample_qc.py ${MT_HARDCALLS} ${INITIAL_VARIANT_LIST} ${INITIAL_SAMPLE_QC_FILE}
We then plot the sample metrics using 03_1_initial_sample_qc_plot.r. This creates a series of plots and defines an empirical hard cutoff for a number of the metrics. Edit the chosen thresholds in utils/r_options.r, or redefine them at the start of the plotting script (03_1_initial_sample_qc_plot.r). When you are happy with the proposed cutoffs, run 03_2_initial_sample_filter.r.
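Conceptually, the filter applied at this step is just a set of hard cutoffs on the per-sample metrics. A minimal sketch in Python, with hypothetical metric names and threshold values for illustration only (the pipeline's actual thresholds live in utils/r_options.r):

```python
# Hypothetical metric names and cutoffs, purely for illustration; the
# pipeline's real values are set in utils/r_options.r.
THRESHOLDS = {
    "call_rate": ("min", 0.97),
    "mean_dp":   ("min", 20.0),
    "mean_gq":   ("min", 55.0),
}

def sample_passes(metrics, thresholds=THRESHOLDS):
    """Return True if every metric meets its (min/max) cutoff."""
    for name, (kind, cutoff) in thresholds.items():
        value = metrics[name]
        if kind == "min" and value < cutoff:
            return False
        if kind == "max" and value > cutoff:
            return False
    return True
```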
There are a few options here, depending on whether genotype data is available in addition to the WES data.
If genotype data is available, this first step takes the X chromosome and LD-prunes it to define a collection of pseudo-independent SNPs for subsequent F-statistic evaluation. We filter to the collection of samples with exome sequence data available to speed things up.
Here's a bash script to do this using plink (assuming that plink is installed and in your $PATH). It's just a wrapper around a plink command.
Inputs:
- Filepath to .bed for chromosome X:
${GENO_X_BED}
- Filepath to .bim for chromosome X:
${GENO_X_BIM}
- Filepath to .fam for chromosome X:
${GENO_X_FAM}
- Filepath to sample information (aka phenotype information) for the samples that are sequenced:
${SAMPLE_INFORMATION}
Outputs:
- Prefix for high-quality pruned common chromosome X variant plink files:
${PRUNED_X_PLINK}
bash 04_prune_genotyped_snps.sh ${GENO_X_BED} ${GENO_X_BIM} ${GENO_X_FAM} ${SAMPLE_INFORMATION} ${PRUNED_X_PLINK}
If only WES data is available, you'll need to create a set of high-quality SNPs and then LD-prune them. The first step is to spit out a high-quality set of variants to a set of plink files.
Inputs:
- Filepath to hard-calls MatrixTable from step 1:
${MT_HARDCALLS}
- Filepath to the list of samples remaining after 03_2_initial_sample_filter.r:
${SAMPLE_LIST_INITIAL_QC}
- Filepath to the set of variants remaining after step 2 filtering:
${INITIAL_VARIANT_LIST}
- Filepath to set of high-LD regions for removal:
${HIGH_LD_INTERVALS}
Outputs:
- Prefix for high-quality common variant plink files:
${PLINK_FILES}
python 04_0_export_plink.py ${MT_HARDCALLS} ${SAMPLE_LIST_INITIAL_QC} ${INITIAL_VARIANT_LIST} ${HIGH_LD_INTERVALS} ${PLINK_FILES}
Next, take these plink files and LD-prune them. The following bash script is a simple wrapper that calls plink a few times. Again, we're assuming that plink is installed and in your $PATH.
Inputs:
- Prefix for high-quality common variant plink files:
${PLINK_FILES}
bash 04_1_prune_sequencing_snps.sh ${PLINK_FILES}
The outputs of step 4 (both when genotype data is available and if only WES data is available) feed into steps 5 and 6.
There are a few options here, depending on whether genotype data is available or only WES data.
We provide an R script to assign 1000G ancestry labels using genotype data, using the UK Biobank data as an example.
Here we make use of the OADP projection to guard against PCs shrinking towards 0, though for the first four PCs, which are used downstream for superpopulation assignment, this is likely overkill as the shrinkage is not severe in the first few PCs.
- Combine the per-chromosome data for the samples we wish to project (UK Biobank data in this example) into a single plink file from R (as this is required by bigsnpr to perform the projections).
- Download the 1000G data, and project the UK Biobank samples onto the PC space defined by 1000G.
- Assign superpopulation labels to the 1000G data, based on the 1000G ped file.
- Create a random forest classifier using the randomForest library in R, and write out the results. We used a threshold of 0.99 to assign labels to UK Biobank samples. Samples that weren't clearly assigned to any cluster were labelled 'unsure'.
These steps are carried out in 05_estimate_superpopulation.r.
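The final labelling step amounts to thresholding the random forest's class probabilities. A plain-Python sketch of the 0.99 rule (the R script uses the randomForest library; this only illustrates the decision logic):

```python
def assign_superpopulation(class_probs, threshold=0.99):
    """Assign a superpopulation label from classifier class probabilities.

    class_probs: dict mapping superpopulation label (e.g. 'EUR', 'AFR')
    to the classifier's predicted probability. Samples not clearly
    assigned to any cluster are labelled 'unsure'.
    """
    label, prob = max(class_probs.items(), key=lambda kv: kv[1])
    return label if prob >= threshold else "unsure"
```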
Run as above, but filtering to the set of SNPs present in both the WES and 1000G data first.
This step should be run separately for each superpopulation (MAF differences across superpopulations can throw off the F statistic).
There are two options:
- Genotype data (i.e. imputed SNP chip data) is available
- Genotype data isn't available.
The code is the same, just applied to either the LD-pruned genotyping data (output of 04_prune_genotyped_snps.sh) or the LD-pruned sequencing data after filtering to high-quality variants (output of 04_1_prune_sequencing_snps.sh).
To do this, we read in the pruned sex-chromosome data, determine the F-statistic for each sample using the X chromosome, and check the number of non-missing allele counts on the Y.
Inputs:
- Filepath to hard-calls MatrixTable from step 1:
${MT_HARDCALLS}
- Filepath to the list of samples remaining after 03_2_initial_sample_filter.r:
${SAMPLE_LIST_INITIAL_QC}
- Filepath to the set of LD-pruned high-quality variants on the X (output from either 04_prune_genotyped_snps.sh or 04_1_prune_sequencing_snps.sh):
${PRUNED_CHRX_VARIANTS}
- Filepath to the sample table (phenotype information for the samples, saved as a Hail Table), which includes self-assigned sex or gender as one of the annotations, with annotation name reported_sex present:
${SAMPLE_TABLE}
Outputs:
- Filepath to output .tsv file used to plot imputed sex information in the next step:
${IMPUTESEX_TABLE}
- Filepath to output .tsv file used to count the mismatches at this step for a summary table:
${IMPUTESEX_FILE}
- Filepath to output .tsv file of the number of calls on the Y:
${Y_NCALLED}
python 06_0_impute_sex.py ${MT_HARDCALLS} ${SAMPLE_LIST_INITIAL_QC} ${PRUNED_CHRX_VARIANTS} ${SAMPLE_TABLE} ${IMPUTESEX_TABLE} ${IMPUTESEX_FILE} ${Y_NCALLED}
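For intuition, the X-chromosome F-statistic reported by Hail's impute_sex is an inbreeding coefficient: the excess of observed over expected homozygous calls. A plain-Python sketch of the computation (p is the alternate allele frequency at each site):

```python
def x_fstat(genotypes, alt_freqs):
    """Inbreeding coefficient F = (obs_hom - exp_hom) / (n - exp_hom).

    genotypes: per-site alternate allele counts (0, 1, or 2) for one sample
    alt_freqs: per-site alternate allele frequencies
    Males (one X) appear homozygous at almost every site, so F is near 1;
    females are expected to be near 0.
    """
    n = len(genotypes)
    observed_hom = sum(1 for g in genotypes if g in (0, 2))
    # Expected homozygosity per site under Hardy-Weinberg: 1 - 2p(1-p)
    expected_hom = sum(1.0 - 2.0 * p * (1.0 - p) for p in alt_freqs)
    return (observed_hom - expected_hom) / (n - expected_hom)
```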
Plot the output of 06_0_impute_sex.py. We plot the distribution of the F-statistic on the X, and define a cutoff for sex labelling. We also plot the X F-statistic against the number of calls on the Y chromosome. After adding genetically defined sex, we compare to the self-assigned sex in the phenotype file and remove mismatches.
Inputs:
- Filepath to .tsv file containing imputed sex information from the previous step (${IMPUTESEX_TABLE}):
--impute_sex_table
- Filepath to .tsv file containing the number of calls on the Y from the previous step (${Y_NCALLED}):
--y_ncalled
Outputs:
- Filepath to output .tsv file containing the samples to be removed due to sex swaps:
--sexcheck_list
Rscript 06_1_impute_sex_plot.r --impute_sex_table ${IMPUTESEX_TABLE} --y_ncalled ${Y_NCALLED}
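The mismatch check at the end of this step can be sketched as follows. The F-statistic cutoffs used here (0.2 and 0.8) are hypothetical placeholders for the cutoffs you choose from the plots:

```python
def impute_sex_label(f_stat, female_max=0.2, male_min=0.8):
    """Label a sample from its X F-statistic; cutoffs come from the plots."""
    if f_stat < female_max:
        return "Female"
    if f_stat > male_min:
        return "Male"
    return "Unknown"

def sex_mismatches(samples):
    """Return IDs whose imputed sex contradicts reported_sex.

    samples: iterable of (sample_id, f_stat, reported_sex) tuples.
    Samples labelled 'Unknown' are not flagged as mismatches here.
    """
    mismatched = []
    for sample_id, f_stat, reported in samples:
        imputed = impute_sex_label(f_stat)
        if imputed != "Unknown" and imputed != reported:
            mismatched.append(sample_id)
    return mismatched
```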
If you have a single homogeneous population, you can use IBD estimation via Hail's identity_by_descent method.
Inputs:
- Filepath to hard-calls MatrixTable from step 1:
${MT_HARDCALLS}
- Filepath to the list of samples remaining after 03_2_initial_sample_filter.r:
${SAMPLE_LIST_INITIAL_QC}
- Filepath to a collection of pruned variants to filter to before evaluating IBD (${PLINK_FILES}.keep.variant_list, output from 04_1_prune_sequencing_snps.sh):
${PRUNED_VARIANTS}
Outputs:
- Filepath to output .tsv to contain IBD information for plotting:
${IBD_OUTPUT}
- Filepath to output .tsv file containing the set of related samples to remove, determined using an (almost) maximal independent set:
${SAMPLE_LIST_RELATED}
python 07_0_ibd.py ${MT_HARDCALLS} ${SAMPLE_LIST_INITIAL_QC} ${PRUNED_VARIANTS} ${IBD_OUTPUT} ${SAMPLE_LIST_RELATED}
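For each sample pair, identity_by_descent reports the estimated IBD-sharing probabilities Z0, Z1, Z2 and PI_HAT = Z1/2 + Z2. A short sketch of how pairs are flagged as related against the PI_HAT threshold used in the next step:

```python
def pi_hat(z1, z2):
    """Proportion of the genome shared IBD: PI_HAT = Z1/2 + Z2."""
    return z1 / 2.0 + z2

def related_pairs(pairs, threshold=0.2):
    """Return pairs of sample IDs whose PI_HAT exceeds the threshold.

    pairs: iterable of (id_i, id_j, z1, z2) tuples.
    """
    return [(i, j) for i, j, z1, z2 in pairs if pi_hat(z1, z2) > threshold]
```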
Plot the resultant IBD estimates and colour code them:
Inputs:
- Filepath to the IBD file output from 07_0_ibd.py (${IBD_OUTPUT}):
--ibd_file
- PI HAT threshold to use for 'related' (default: 0.2):
--ibd_threshold
Rscript 07_1_ibd_plot.r --ibd_file ${IBD_OUTPUT} --ibd_threshold 0.2
If you have multiple superpopulations, you should use Hail's implementation of PC-relate (the pc_relate method) to identify related samples.
Inputs:
- Filepath to hard-calls MatrixTable from step 1:
${MT_HARDCALLS}
- Filepath to the list of samples remaining after 03_2_initial_sample_filter.r:
${SAMPLE_LIST_INITIAL_QC}
Outputs:
- Filepath to output .tsv to contain PC-relate information for plotting:
${PC_RELATE_OUTPUT}
- Filepath to output .tsv file containing the set of related samples to remove, determined using an (almost) maximal independent set:
${SAMPLE_LIST_RELATED}
python 07_0_pc_relate.py ${MT_HARDCALLS} ${SAMPLE_LIST_INITIAL_QC} ${PC_RELATE_OUTPUT} ${SAMPLE_LIST_RELATED}
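The '(almost) maximally independent set' mentioned above can be computed greedily: repeatedly drop the sample involved in the most remaining related pairs until no pairs remain. A plain-Python sketch of that greedy step:

```python
from collections import Counter

def samples_to_remove(related_pairs):
    """Greedily pick samples to drop so that no related pair survives.

    related_pairs: iterable of (id_i, id_j) sample-ID pairs.
    The samples kept form an (almost) maximal independent set of the
    relatedness graph.
    """
    pairs = list(related_pairs)
    removed = set()
    while pairs:
        degree = Counter()
        for i, j in pairs:
            degree[i] += 1
            degree[j] += 1
        # Drop the sample involved in the most remaining related pairs
        worst = max(degree, key=lambda s: degree[s])
        removed.add(worst)
        pairs = [(i, j) for i, j in pairs if worst not in (i, j)]
    return removed
```

In practice, Hail's maximal_independent_set method can do this directly on the table of related pairs; the sketch just illustrates the idea.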
I still have a few steps to include:
- final_variant_qc
- final_sample_qc
- creation of a final set of PCs to include in association analyses
- create and check a finalised curated MatrixTable
- spit out a .vcf, plink, or .bgen file from the MatrixTable for downstream analyses that have input format requirements