Loading large tree sequences with pyslim #303

santaci · 2022-09-28T10:25:36Z

Perhaps this is a naive question given the nature of ts files, but I was wondering if there was a way to load tree sequence files in a multi-threaded or parallel manner using pyslim?

Unfortunately, I have tree sequence files that I need to load several times if I want to work on them in parallel. Granted, they are about 5-6 GB each, but the loading alone takes 10-15 min. Is there a cleverer way to load these tree sequence files? I don't have issues with lighter ts files (e.g. ~1 GB), but the 5-6 GB ones seem to take much longer to load.

bhaller · 2022-09-28T12:38:24Z

Maintainer

Nowadays tskit.load() is preferred over pyslim.load(), as I understand it, and @petrelharp will of course know better but I think pyslim.load() might just be a pass-through to tskit. So this sounds like an issue that might be better filed against tskit? It sounds like your big tree sequence file might be a good test case for them to optimize for; it would be very useful if you could upload it somewhere so that it can be downloaded for testing and profiling. Of course it's a very large file, so it might be a bit difficult to find a place to host it; but if you don't have an obvious place to put it, your IT department should be able to lend a hand. I doubt the tskit folks will want to parallelize their load algorithm any time soon (that sort of work is tremendously complex, and a load time of 15 minutes is a bit painful but not the end of the world), but there may certainly be other ways to speed up the load that would be worth looking at. Tagging @benjeffery and @jeromekelleher.

11 replies

santaci Sep 28, 2022
Author

Sure, it was as follows:
real 2m28.050s
user 1m3.057s
sys 0m21.504s

I'll also try updating my pyslim and tskit in a different environment and give it a go as well.

jeromekelleher Sep 28, 2022
Maintainer

OK, so a lot of the time is waiting for IO, which makes sense for such large files. Still, 1 minute of CPU time seems excessive for loading. I wonder where the time is being spent.
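The split between CPU time and waiting can be read off the `time` output quoted above: `user + sys` is time spent on the CPU, and whatever remains of `real` was spent off the CPU, which for a single-threaded load is mostly waiting on IO. A minimal sketch of that arithmetic follows; the helper names are hypothetical and only the three timings come from the thread.

```python
import re


def parse_time_field(text):
    """Convert a shell `time` field such as '2m28.050s' to seconds.

    A decimal comma ('2m5,385s') is accepted too, since some locales print one.
    """
    match = re.fullmatch(r"(\d+)m(\d+(?:[.,]\d+)?)s", text.strip())
    if match is None:
        raise ValueError(f"unrecognised time field: {text!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds.replace(",", "."))


def cpu_and_wait(real, user, system):
    """Return (cpu_seconds, off_cpu_seconds) for one `time` report."""
    real_s = parse_time_field(real)
    cpu_s = parse_time_field(user) + parse_time_field(system)
    return cpu_s, real_s - cpu_s


# Timings posted in the thread for the first run.
cpu_s, wait_s = cpu_and_wait("2m28.050s", "1m3.057s", "0m21.504s")
print(f"CPU: {cpu_s:.1f} s, not on CPU: {wait_s:.1f} s")
# prints "CPU: 84.6 s, not on CPU: 63.5 s"
```

So roughly 63 of the 148 seconds were not spent computing, which matches the reading that IO is a large part of the load time, while the remaining ~85 seconds of CPU time is the part that looks excessive.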

petrelharp Sep 28, 2022
Maintainer

Wait up! pyslim.load() no longer exists (I think it produces a deprecation error), and it certainly did some slow stuff that was made obsolete by tskit. So, @santaci, could you update your pyslim and tskit, then try again? Have a look at this page for instructions, but there's very little to change.
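When retrying after the update, it may help to measure wall-clock and CPU time for the load call alone, rather than for the whole script. The sketch below is one way to do that with the standard library; `timed_load` and `fake_loader` are hypothetical names, and `"example.trees"` is a placeholder path. Passing `tskit.load` as the loader assumes tskit is installed.

```python
import time


def timed_load(path, loader):
    """Call loader(path) and return (result, wall_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()  # user + system CPU time of this process
    result = loader(path)
    cpu_s = time.process_time() - cpu_start
    wall_s = time.perf_counter() - wall_start
    return result, wall_s, cpu_s


# Intended use (assumes tskit is installed and the file exists):
#   import tskit
#   ts, wall_s, cpu_s = timed_load("example.trees", tskit.load)


def fake_loader(path):
    """Stand-in loader so the sketch runs without tskit or a real file."""
    return sum(range(100_000))


result, wall_s, cpu_s = timed_load("example.trees", fake_loader)
print(f"wall: {wall_s:.3f} s, cpu: {cpu_s:.3f} s")
```

A large gap between the two numbers points at IO, while similar numbers suggest the load itself is CPU-bound.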

benjeffery Sep 29, 2022
Maintainer

@santaci If you could share the file, that would be good. In the meantime, can you paste in the output of tskit info? Seeing the sizes of the tables would be interesting. I assume most of the time is spent building the various indexes that are created on load, but it still seems excessive.

santaci Sep 29, 2022
Author

@benjeffery I've posted here the output of tskit info using version 0.3.4:
sequence_length: 68937975.0
trees: 16447189
samples: 20016158
individuals: 10015389
nodes: 30489023
edges: 130099466
sites: 22599769
mutations: 22599769
migrations: 0
populations: 4
provenances: 5

I've also updated my tskit in a separate environment to version 0.5.2 and obtain the following tables with a slower run time:

real 2m5,385s
user 1m41,021s
sys 0m24,209s

I can share the tree sequence via our IT storage hosting here: https://sid.erda.dk/share_redirect/aNOb2oRrJm
Just be sure to rename it as *.trees file when downloading.
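
As a rough back-of-the-envelope check on the row counts reported by tskit info above, the size of the edge table can be estimated from the number of edges. This sketch assumes each edge row holds two 8-byte floats (left, right) and two 4-byte integer IDs (parent, child), and ignores metadata and indexes; that per-row layout is an assumption made here, not something stated in the thread.

```python
NUM_EDGES = 130_099_466  # "edges" from the tskit info output above

# Assumed per-row layout: left and right as 8-byte floats,
# parent and child as 4-byte integer IDs.
BYTES_PER_EDGE = 8 + 8 + 4 + 4

edge_bytes = NUM_EDGES * BYTES_PER_EDGE
print(f"estimated edge table size: {edge_bytes / 1e9:.2f} GB")
```

Under these assumptions the edge table alone accounts for about 3 GB, a substantial share of a 5-6 GB file.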
