Perhaps this is a naive question given the nature of ts files, but I was wondering if there was a way to load tree sequence files in a multi-threaded or parallel manner using pyslim?
Unfortunately, I have tree sequence files that I need to load several times if I want to work on them in parallel. Granted, they are about 5-6 GB each, but the loading alone takes anywhere from 10-15 min. Is there a more clever way to load these tree sequence files? I don't have issues with lighter ts files (e.g. ~1 GB), but these 5-6 GB files seem to take way longer to load.
Nowadays tskit.load() is preferred over pyslim.load(), as I understand it, and @petrelharp will of course know better but I think pyslim.load() might just be a pass-through to tskit. So this sounds like an issue that might be better filed against tskit? It sounds like your big tree sequence file might be a good test case for them to optimize for; it would be very useful if you could upload it somewhere so that it can be downloaded for testing and profiling. Of course it's a very large file, so it might be a bit difficult to find a place to host it; but if you don't have an obvious place to put it, your IT department should be able to lend a hand. I doubt the tskit folks will want to parallelize their load algorithm any time soon (that sort of work is tremendously complex, and a load time of 15 minutes is a bit painful but not the end of the world), but there may certainly be other ways to speed up the load that would be worth looking at. Tagging @benjeffery and @jeromekelleher.
OK, so a lot of the time is waiting for IO, which makes sense for such large files. Still, 1 minute of CPU time seems excessive for loading. I wonder where the time is being spent.
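To see how a load splits between CPU work and waiting, something like the following might help. It is only a sketch using the Python standard library: `time_call` is a hypothetical helper name, and it assumes that wall-clock time minus CPU time is a fair proxy for time spent waiting (on IO, for instance).

```python
import time


def time_call(func, *args, **kwargs):
    """Run func(*args, **kwargs) and return (result, wall_seconds, cpu_seconds).

    time_call is a hypothetical helper, not part of tskit or pyslim.
    """
    wall_start = time.perf_counter()   # wall-clock timer
    cpu_start = time.process_time()    # user + system CPU time of this process
    result = func(*args, **kwargs)
    cpu_seconds = time.process_time() - cpu_start
    wall_seconds = time.perf_counter() - wall_start
    return result, wall_seconds, cpu_seconds


def describe(wall_seconds, cpu_seconds):
    """Format the timings; the 'waiting' figure is wall time minus CPU time."""
    waiting = max(wall_seconds - cpu_seconds, 0.0)
    return f"wall={wall_seconds:.2f}s cpu={cpu_seconds:.2f}s waiting={waiting:.2f}s"
```

For the case in this thread it could be called as `ts, wall, cpu = time_call(tskit.load, "example.trees")`, where the file name is a placeholder, followed by `print(describe(wall, cpu))`.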
Wait up! pyslim.load() no longer exists (I think it produces a deprecation error), and it certainly did some slow stuff that was made obsolete by tskit. So, @santaci, could you update your pyslim and tskit, then try again? Have a look at this page for instructions, but there's very little to change.
@santaci If you could share the file, that would be good. In the meantime, can you paste in the output of tskit info? Seeing the sizes of the tables would be interesting. I assume most of the time is spent building the various indexes that are created on load, but it still seems excessive.
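If it is easier to do from Python, the row counts can also be collected from an already loaded tree sequence. This is a rough sketch, not a replacement for the tskit info output: `table_row_counts` and `format_counts` are hypothetical helper names, and the code assumes the object exposes the `num_nodes`, `num_edges`, `num_sites`, `num_mutations`, `num_individuals` and `num_populations` properties that a tskit tree sequence provides, as far as I know.

```python
# Table names whose row counts are read from "num_<name>" attributes.
# Assumption: the object passed in behaves like a tskit TreeSequence.
TABLE_NAMES = (
    "nodes",
    "edges",
    "sites",
    "mutations",
    "individuals",
    "populations",
)


def table_row_counts(ts):
    """Return a dict mapping each table name to its number of rows."""
    return {name: getattr(ts, "num_" + name) for name in TABLE_NAMES}


def format_counts(counts):
    """Render the counts one per line, largest table first."""
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return "\n".join(f"{name}: {rows}" for name, rows in ordered)
```

With tskit installed, `print(format_counts(table_row_counts(tskit.load("example.trees"))))` would print the counts; the file name here is a placeholder.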