LD pruning and IBD #3662
Replies: 8 comments

Note: The posts below were exported from discuss.hail.is, a forum for asking questions about Hail which has since been deprecated.
-
(Mar 29, 2021 at 15:41) danking said:

Hi hailstorm, I'm sorry you're having trouble! I think we can fix this quickly.

- Are you running this on a laptop, on a private cluster, on an Amazon EMR cluster, or on a Google Dataproc cluster?
- How many cores are you using?
- Do you have a reason to set the block_size to 75? That severely hurts performance. I suggest leaving it at the default, 4096.
- Are you sure you need to run LD prune on all 500,000 variants? I suspect most relatedness methods do not need that many variants to provide good estimates. I believe many folks use only several thousand variants for relatedness.
- How much memory does the computer or cluster have? If a cluster, how much memory per executor or VM do you have?
- How do you start your job? Do you use …
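A minimal sketch of the two suggestions above, assuming a MatrixTable with a GT call field; the input path and the sampling fraction are illustrative, not from the thread:

```python
import hail as hl

hl.init()

# Illustrative path; substitute your own MatrixTable.
mt = hl.read_matrix_table('data/genotypes.mt')

# Optionally keep a random subset of variants (a few thousand is usually
# enough for relatedness estimates) instead of pruning all 500,000.
mt = mt.sample_rows(0.01, seed=42)

# Leave block_size at its default (4096) rather than setting it to 75.
pruned_variants = hl.ld_prune(mt.GT)
```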
-
(Mar 29, 2021 at 16:21) danking said:

How many total cores did the 3-node cluster have? Hail is unfortunately only moderately fast on a single computer. Hail is valuable because it can scale up to hundreds or thousands of cores. Most of our users use Google Dataproc or Amazon EMR to briefly access very large clusters. In the cloud, we pay per core-hour, so 1000 cores for one hour costs the same as 10 cores for 100 hours. If you'd like to try Hail on the cloud, we have introductory material in the docs.

Can you link to the post that sets block_size to 75? I'd like to fix that post.

The different executables do not affect performance, but they do affect how you set parameters. What is the output of:

echo $PYSPARK_SUBMIT_ARGS

This variable needs to specify how much memory is available to Hail. See this post for information on how to set that variable. Try setting both the executor memory and the driver memory to the total amount of RAM on your computer.
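One way to apply this advice from Python, as a hedged sketch; the 16g values are placeholders for the total RAM on your machine:

```python
import os

# Must be set before Hail/Spark is initialized; the memory values are illustrative.
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--driver-memory 16g --executor-memory 16g pyspark-shell'
)

import hail as hl
hl.init()
```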
-
(Apr 01, 2021 at 09:07) hailstorm said:

Thanks again Dan. I will try the options you have listed in the above thread. This GitHub post thread has a reference to block size and window size. Maybe a follow-up thread to experiment with lower figures will help others in the future?

github.com/hail-is/hail
-
(Apr 01, 2021 at 12:46) danking said:

Thanks! I've edited that post with a note. Please do share the results of the new options!
-
(Apr 03, 2021 at 07:21) hailstorm said:

Thanks Dan! I tried all three methods, IBD, KING, and PC-Relate, with the above sample set (1,000 samples × 500,000 SNPs) on a 10-node cluster (each node with 4 CPUs / 16 GB RAM).

The KING method estimates a kinship coefficient of 0.49 for some samples that are completely unrelated (validated using another program to find shared DNA segments).
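For readers following along, a rough sketch of how the three methods are invoked in Hail, assuming an already LD-pruned MatrixTable with a GT entry field; the path and parameter values (min_individual_maf, k) are assumptions, not the settings used in this thread:

```python
import hail as hl

hl.init()

# Assumed to be an LD-pruned MatrixTable with a GT entry field; path is illustrative.
mt = hl.read_matrix_table('data/pruned.mt')

# IBD: PLINK-style method of moments; assumes a single homogeneous population.
ibd = hl.identity_by_descent(mt)      # Table keyed by sample pair (i, j)

# KING: robust to population structure.
king = hl.king(mt.GT)                 # MatrixTable of pairwise kinship estimates (phi)

# PC-Relate: models ancestry with principal components.
rel = hl.pc_relate(mt.GT, min_individual_maf=0.01, k=10, statistics='kin')

ibd.show()
king.entries().show()
rel.show()
```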
-
(Apr 05, 2021 at 14:16) danking said:

That's quite concerning! I'd really like to get to the bottom of that, if you don't mind helping!

- Can you share any other information about the dataset?
- What is the ancestral structure of the samples? How many of the samples have recent admixture?
- Is it possible for me to get access to it (or some appropriately sanitized version) for technical development purposes?
- What percent of genotypes are missing?
- What kinships do IBD and pc-relate report for those sample pairs?

I'm not too surprised about the timings. IBD uses a very naive model (it assumes a single homogeneous population). Regarding PC-Relate, to what did you set …
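To answer the "what do IBD and pc-relate report for those sample pairs" question, one option is to filter each result table to the suspect pair. A hedged sketch, where the sample IDs, path, and parameter values are placeholders:

```python
import hail as hl

hl.init()

mt = hl.read_matrix_table('data/pruned.mt')   # illustrative path

# IBD: i and j are sample ID strings; the pair may appear in either order,
# so check both (A, B) and (B, A).
ibd = hl.identity_by_descent(mt)
ibd.filter(((ibd.i == 'sampleA') & (ibd.j == 'sampleB')) |
           ((ibd.i == 'sampleB') & (ibd.j == 'sampleA'))).show()

# PC-Relate: i and j are structs of the column key (assumed here to be `s`).
rel = hl.pc_relate(mt.GT, min_individual_maf=0.01, k=10, statistics='kin')
rel.filter(((rel.i.s == 'sampleA') & (rel.j.s == 'sampleB')) |
           ((rel.i.s == 'sampleB') & (rel.j.s == 'sampleA'))).show()
```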
-
(Apr 16, 2021 at 09:41) hailstorm said:

Thanks Dan! I figured it was easier to export the MatrixTable to VCF and then run the downstream analysis. I am sure this will become a constraint as the data grows in size, but I feel this is faster with the current dataset. Additionally, I have more control over which options are fed into KING or PC-Relate.
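For completeness, a minimal sketch of the export step described here, assuming the MatrixTable has already been LD-pruned; the paths are illustrative:

```python
import hail as hl

hl.init()

mt = hl.read_matrix_table('data/pruned.mt')   # illustrative path

# Write a block-gzipped VCF that external tools (e.g. KING) can read directly.
hl.export_vcf(mt, 'output/pruned.vcf.bgz')
```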
-
(Mar 28, 2021 at 10:04) hailstorm said:
Hi - I have not been successful in using LD prune. Below is the code I am using.
It takes a long time (hours) for a MatrixTable of ~1000 individuals having ~500,000 genotypes each.
I have tried various server configs, but that does not seem to help. Is there an example of LD pruning? My end goal is to find relatedness (KING) within samples.
Any help is appreciated.
Thanks!
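The code referenced above was not preserved in the export. As a rough illustration only (not the original poster's code), an LD-pruning call in Hail typically looks something like the following; the input path and the downstream KING step are assumptions:

```python
import hail as hl

hl.init()

# Illustrative input path.
mt = hl.read_matrix_table('data/genotypes.mt')

# LD-prune, leaving block_size at its default; returns a Table of variants to keep.
pruned_variants = hl.ld_prune(mt.GT)

# Keep only the pruned set of variants.
mt = mt.filter_rows(hl.is_defined(pruned_variants[mt.row_key]))

# Relatedness with KING on the pruned data.
kinship = hl.king(mt.GT)
kinship.entries().show()
```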