Core operations in human GWAS workloads #696
hammer started this conversation in Discourse import
(Posted by @eric-czech)
This is a summary of the "core" functionality we've identified in human GWAS pipelines, along with the scaling characteristics of each class of operations (for `n` variants, `m` samples):

- `O(nm)`
- `O(nmd)` — `d` indicates variant density, measured as the number of variants in fixed-size bp windows (1000 kbp is common)
- `O(n^2)` — LD matrices are very difficult to compute for imputed datasets
- `O(nm^2)` — the `m^2` factor was not a problem in the past, but biobanks are making it one now
- `O(nm)`
- `O(nmk)` — computing `k` principal components is `O(nmk)`; randomized approaches can reduce this further to `O(nm log(k))`
- `O(nm)`
- `O(nm^2)`
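To make the PCA entry concrete, here is a minimal numpy sketch of the randomized approach mentioned above (a Halko-style range finder; the function name, oversampling parameter, and toy data are illustrative assumptions, not any particular tool's API). The key idea is that the expensive passes over the `m x n` matrix involve only `k + p` columns rather than all of them:

```python
import numpy as np

def randomized_pcs(X, k, n_oversample=10, seed=0):
    """Top-k principal component scores of an (m samples x n variants)
    matrix via randomized range finding (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    # Sketch the column space with a random projection: the big matrix
    # is only ever multiplied against k + p columns, not fully decomposed.
    omega = rng.standard_normal((n, k + n_oversample))
    Y = X @ omega                      # (m, k+p) range sketch
    Q, _ = np.linalg.qr(Y)             # orthonormal basis for the sketch
    B = Q.T @ X                        # small (k+p, n) projected matrix
    _, s, Vt = np.linalg.svd(B, full_matrices=False)
    pcs = X @ Vt[:k].T                 # sample scores on the top-k components
    return pcs, s[:k]

# Toy genotype-like matrix: 100 samples x 1000 variants, column-centered.
X = np.random.default_rng(1).integers(0, 3, size=(100, 1000)).astype(float)
X -= X.mean(axis=0)
pcs, s = randomized_pcs(X, k=4)
print(pcs.shape)  # (100, 4)
```

In practice one would add power iterations for slowly decaying spectra, but even this bare version shows where the `O(nm log(k))`-flavored savings come from relative to a full decomposition.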
Notably, relatedness estimation / pruning is the most difficult of all operations above to scale because outside of using self-reported relatedness or external reference panels, it is the only one that is still superlinear in the number of variants or samples.
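To see why the `m^2` term is unavoidable here, consider a GCTA-style genetic relatedness matrix (a hedged numpy sketch; the function name and simulated data are assumptions for illustration): however the computation is organized, the output is an `m x m` matrix, so doubling the cohort quadruples both the compute and the memory of the result itself.

```python
import numpy as np

def grm(genotypes):
    """GCTA-style genetic relatedness matrix from an (m samples x n variants)
    dosage matrix: O(n * m^2) time, and O(m^2) memory for the output alone.
    Assumes monomorphic variants (p = 0 or 1) have been filtered out."""
    X = np.asarray(genotypes, dtype=float)
    p = X.mean(axis=0) / 2.0                      # per-variant allele frequency
    Z = (X - 2 * p) / np.sqrt(2 * p * (1 - p))    # standardized genotypes
    return (Z @ Z.T) / X.shape[1]                 # m x m relatedness estimates

# Simulated genotypes: 500 samples x 2000 variants, binomial(2, 0.3) dosages.
rng = np.random.default_rng(0)
G = rng.binomial(2, 0.3, size=(500, 2000))
K = grm(G)
print(K.shape)  # (500, 500)
```

This is exactly the biobank pain point: at 500k samples the output alone is a dense 500k x 500k matrix, which is why methods that avoid materializing it (or that use external reference panels or self-reported relatedness) become attractive.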
This list is based largely on the following: