Core operations in human GWAS workloads #696
hammer started this conversation in Discourse import
(Posted by @eric-czech)
This is a summary of the "core" functionality we've identified in human GWAS pipelines, along with the scaling characteristics of each class of operations (for `n` variants, `m` samples):

- `O(nm)`
- `O(nmd)` — `d` indicates variant density, measured as the number of variants in fixed-size bp windows (1000 kbp is common)
- `O(n^2)` — LD matrices are very difficult to compute for imputed datasets
- `O(nm^2)` — the `m^2` factor was not a problem in the past, but biobanks are making it one now
- `O(nm)`
- `O(nmk)` — computing `k` principal components is `O(nmk)`; randomized approaches can reduce this further to `O(nm log(k))`
- `O(nm)`
- `O(nm^2)`
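To make the PCA entry concrete, here is a minimal numpy sketch of the randomized approach mentioned above (a Halko-style range finder; the function name, oversampling parameter, and toy data are illustrative assumptions, not any particular tool's API). The key idea is that the expensive passes over the `m x n` matrix involve only `k + p` columns rather than all of them:

```python
import numpy as np

def randomized_pcs(X, k, n_oversample=10, seed=0):
    """Top-k principal component scores of an (m samples x n variants)
    matrix via randomized range finding (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    # Sketch the column space with a random projection: the big matrix
    # is only ever multiplied against k + p columns, not fully decomposed.
    omega = rng.standard_normal((n, k + n_oversample))
    Y = X @ omega                      # (m, k+p) range sketch
    Q, _ = np.linalg.qr(Y)             # orthonormal basis for the sketch
    B = Q.T @ X                        # small (k+p, n) projected matrix
    _, s, Vt = np.linalg.svd(B, full_matrices=False)
    pcs = X @ Vt[:k].T                 # sample scores on the top-k components
    return pcs, s[:k]

# Toy genotype-like matrix: 100 samples x 1000 variants, column-centered.
X = np.random.default_rng(1).integers(0, 3, size=(100, 1000)).astype(float)
X -= X.mean(axis=0)
pcs, s = randomized_pcs(X, k=4)
print(pcs.shape)  # (100, 4)
```

In practice one would add power iterations for slowly decaying spectra, but even this bare version shows where the `O(nm log(k))`-flavored savings come from relative to a full decomposition.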
Notably, relatedness estimation / pruning is the most difficult of all operations above to scale because outside of using self-reported relatedness or external reference panels, it is the only one that is still superlinear in the number of variants or samples.
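To see why the `m^2` term is unavoidable here, consider a GCTA-style genetic relatedness matrix (a hedged numpy sketch; the function name and simulated data are assumptions for illustration): however the computation is organized, the output is an `m x m` matrix, so doubling the cohort quadruples both the compute and the memory of the result itself.

```python
import numpy as np

def grm(genotypes):
    """GCTA-style genetic relatedness matrix from an (m samples x n variants)
    dosage matrix: O(n * m^2) time, and O(m^2) memory for the output alone.
    Assumes monomorphic variants (p = 0 or 1) have been filtered out."""
    X = np.asarray(genotypes, dtype=float)
    p = X.mean(axis=0) / 2.0                      # per-variant allele frequency
    Z = (X - 2 * p) / np.sqrt(2 * p * (1 - p))    # standardized genotypes
    return (Z @ Z.T) / X.shape[1]                 # m x m relatedness estimates

# Simulated genotypes: 500 samples x 2000 variants, binomial(2, 0.3) dosages.
rng = np.random.default_rng(0)
G = rng.binomial(2, 0.3, size=(500, 2000))
K = grm(G)
print(K.shape)  # (500, 500)
```

This is exactly the biobank pain point: at 500k samples the output alone is a dense 500k x 500k matrix, which is why methods that avoid materializing it (or that use external reference panels or self-reported relatedness) become attractive.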
This list is based largely on the following: