LD pruning and IBD #3662
Replies: 8 comments

Note: The posts below were exported from discuss.hail.is, a forum for asking questions about Hail which has since been deprecated.
-
(Mar 29, 2021 at 15:41) danking said:

Hi hailstorm, I'm sorry you're having trouble! I think we can fix this quickly.

- Are you running this on a laptop, on a private cluster, on an Amazon EMR cluster, or on a Google Dataproc cluster?
- How many cores are you using?
- Do you have a reason to set the block_size to 75? That severely hurts performance. I suggest leaving it at the default, 4096.
- Are you sure you need to run LD prune on all 500,000 variants? I suspect most relatedness methods do not need that many variants to provide good estimates. I believe many folks use only several thousand variants for relatedness.
- How much memory does the computer or cluster have? If a cluster, how much memory per executor or VM do you have?
- How do you start your job? Do you use …
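A minimal sketch of the two suggestions above, assuming a MatrixTable with a GT call field; the input path and the sampling fraction are illustrative, not from the thread:

```python
import hail as hl

hl.init()

# Illustrative path; substitute your own MatrixTable.
mt = hl.read_matrix_table('data/genotypes.mt')

# Optionally keep a random subset of variants (a few thousand is usually
# enough for relatedness estimates) instead of pruning all 500,000.
mt = mt.sample_rows(0.01, seed=42)

# Leave block_size at its default (4096) rather than setting it to 75.
pruned_variants = hl.ld_prune(mt.GT)
```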
-
(Mar 29, 2021 at 16:21) danking said:

How many total cores did the 3-node cluster have? Hail is unfortunately only moderately fast on a single computer. Hail is valuable because it can scale up to hundreds or thousands of cores. Most of our users use Google Dataproc or Amazon EMR to briefly access very large clusters. In the cloud, we pay per core-hour, so 1000 cores for one hour costs the same as 10 cores for 100 hours. If you'd like to try Hail on the cloud, we have introductory material in the docs.

Can you link to the post that sets block_size to 75? I'd like to fix that post.

The different executables do not affect performance, but they do affect how you set parameters. What is the output of:

echo $PYSPARK_SUBMIT_ARGS

This variable needs to specify how much memory is available to Hail. See this post for information on how to set that variable. Try setting both the executor memory and the driver memory to the total amount of RAM on your computer.
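One way to apply this advice from Python, as a hedged sketch; the 16g values are placeholders for the total RAM on your machine:

```python
import os

# Must be set before Hail/Spark is initialized; the memory values are illustrative.
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--driver-memory 16g --executor-memory 16g pyspark-shell'
)

import hail as hl
hl.init()
```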
-
(Apr 01, 2021 at 09:07) hailstorm said:

Thanks again Dan. I will try the options you have listed in the above thread. This GitHub post thread has a reference to block size and window size. Maybe a follow-up thread to experiment with lower figures will help others in the future?

github.com/hail-is/hail
-
(Apr 01, 2021 at 12:46) danking said:

Thanks! I've edited that post with a note. Please do share the results of the new options!
-
(Apr 03, 2021 at 07:21) hailstorm said:

Thanks Dan! I tried all three methods, IBD, KING, and PC-Relate, with the above sample set (1,000 samples × 500,000 SNPs) on a 10-node cluster (each node with 4 CPUs / 16 GB RAM).

The KING method estimates a kinship coefficient of 0.49 for some samples that are completely unrelated (validated using another program to find shared DNA segments).
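For readers following along, a rough sketch of how the three methods are invoked in Hail, assuming an already LD-pruned MatrixTable with a GT entry field; the path and parameter values (min_individual_maf, k) are assumptions, not the settings used in this thread:

```python
import hail as hl

hl.init()

# Assumed to be an LD-pruned MatrixTable with a GT entry field; path is illustrative.
mt = hl.read_matrix_table('data/pruned.mt')

# IBD: PLINK-style method of moments; assumes a single homogeneous population.
ibd = hl.identity_by_descent(mt)      # Table keyed by sample pair (i, j)

# KING: robust to population structure.
king = hl.king(mt.GT)                 # MatrixTable of pairwise kinship estimates (phi)

# PC-Relate: models ancestry with principal components.
rel = hl.pc_relate(mt.GT, min_individual_maf=0.01, k=10, statistics='kin')

ibd.show()
king.entries().show()
rel.show()
```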
-
(Apr 05, 2021 at 14:16) danking said:

That's quite concerning! I'd really like to get to the bottom of that, if you don't mind helping!

- Can you share any other information about the dataset?
- What is the ancestral structure of the samples? How many of the samples have recent admixture?
- Is it possible for me to get access to it (or some appropriately sanitized version) for technical development purposes?
- What percent of genotypes are missing?
- What kinships do IBD and pc-relate report for those sample pairs?

I'm not too surprised about the timings. IBD uses a very naive model (it assumes a single homogeneous population). Regarding PC-Relate, to what did you set …
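To answer the "what do IBD and pc-relate report for those sample pairs" question, one option is to filter each result table to the suspect pair. A hedged sketch, where the sample IDs, path, and parameter values are placeholders:

```python
import hail as hl

hl.init()

mt = hl.read_matrix_table('data/pruned.mt')   # illustrative path

# IBD: i and j are sample ID strings; the pair may appear in either order,
# so check both (A, B) and (B, A).
ibd = hl.identity_by_descent(mt)
ibd.filter(((ibd.i == 'sampleA') & (ibd.j == 'sampleB')) |
           ((ibd.i == 'sampleB') & (ibd.j == 'sampleA'))).show()

# PC-Relate: i and j are structs of the column key (assumed here to be `s`).
rel = hl.pc_relate(mt.GT, min_individual_maf=0.01, k=10, statistics='kin')
rel.filter(((rel.i.s == 'sampleA') & (rel.j.s == 'sampleB')) |
           ((rel.i.s == 'sampleB') & (rel.j.s == 'sampleA'))).show()
```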
-
(Apr 16, 2021 at 09:41) hailstorm said:

Thanks Dan! I figured it was easier to export the MatrixTable to VCF and then run the downstream analysis. I am sure this will become a constraint as the data grows in size, but I feel this is faster with the current dataset. Additionally, I have more control over which options are fed into KING or PC-Relate.
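For completeness, a minimal sketch of the export step described here, assuming the MatrixTable has already been LD-pruned; the paths are illustrative:

```python
import hail as hl

hl.init()

mt = hl.read_matrix_table('data/pruned.mt')   # illustrative path

# Write a block-gzipped VCF that external tools (e.g. KING) can read directly.
hl.export_vcf(mt, 'output/pruned.vcf.bgz')
```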
-
(Mar 28, 2021 at 10:04) hailstorm said:
Hi - I have not been successful in using LD prune. Below is the code I am using.
It takes a long time (hours) for a MatrixTable of ~1000 individuals having ~500,000 genotypes each.
I have tried various server configs, but that does not seem to help. Is there an example of LD pruning? My end goal is to find relatedness (KING) within samples.
Any help is appreciated.
Thanks!
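The code referenced above was not preserved in the export. As a rough illustration only (not the original poster's code), an LD-pruning call in Hail typically looks something like the following; the input path and the downstream KING step are assumptions:

```python
import hail as hl

hl.init()

# Illustrative input path.
mt = hl.read_matrix_table('data/genotypes.mt')

# LD-prune, leaving block_size at its default; returns a Table of variants to keep.
pruned_variants = hl.ld_prune(mt.GT)

# Keep only the pruned set of variants.
mt = mt.filter_rows(hl.is_defined(pruned_variants[mt.row_key]))

# Relatedness with KING on the pruned data.
kinship = hl.king(mt.GT)
kinship.entries().show()
```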