Section - Efficiently working with large trees #8

Open

benjeffery opened this issue Aug 4, 2021 · 1 comment

Comments
@benjeffery
Member

For this section we wish to highlight how tskit allows efficient processing of large, single trees.

So far I have converted trees from http://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/UShER_SARS-CoV-2/ to tskit (thanks @jeromekelleher for starting code here); this gives a tree with 799,318 nodes, weighing in at 120MB (half of which is metadata).
Running map_mutations for each of the 27,754 sites takes an average of 19ms per site.
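
For reference, a minimal sketch of how that per-site timing could be reproduced with the Python API (the filename `sars-cov-2.trees` is a placeholder for wherever the converted tree sequence is saved):

```python
import time

import tskit

# Load the converted UShER tree; "sars-cov-2.trees" is a placeholder filename.
ts = tskit.load("sars-cov-2.trees")
tree = ts.first()  # the tree sequence holds a single tree spanning the genome

# Time Tree.map_mutations (parsimony placement) across every variant site.
start = time.perf_counter()
for variant in ts.variants():
    ancestral_state, mutations = tree.map_mutations(variant.genotypes, variant.alleles)
elapsed = time.perf_counter() - start
print(f"average: {elapsed / ts.num_sites * 1000:.1f} ms per site")
```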

Next steps:

  • Measure performance as a function of the number of nodes/samples in the tree?
  • Measure the performance of subsetting operations based on metadata (see the sketch after this list).
  • Possible comparison with UShER?
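
For the metadata subsetting point, a rough sketch, assuming the conversion stored node metadata with a JSON schema and that it includes a `strain` field (both assumptions; adjust to the actual schema):

```python
import tskit

ts = tskit.load("sars-cov-2.trees")  # placeholder filename, as above

# Pick sample nodes whose metadata matches a predicate. The "strain" key
# is an assumed field; substitute whatever the conversion actually stored.
keep = [
    u
    for u in ts.samples()
    if "England" in ts.node(u).metadata.get("strain", "")
]

# simplify() produces a new tree sequence containing just those samples.
subset = ts.simplify(samples=keep)
print(f"kept {subset.num_samples} of {ts.num_samples} samples")
```
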
@jeromekelleher
Member

I'm imagining this section as a narrative showing that we can do real-world things easily and efficiently with tskit, using the Python API. So, we say we can load the trees into memory (x ms), then identify identical samples. I guess a reasonable thing to aim for would be to duplicate some matUtils/UShER operations using the Python API and report the relative timings of these.
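
A rough sketch of what the load-and-find-identical-samples steps could look like (same placeholder filename as above; for very large sample counts a streaming pass over `ts.variants()` would be kinder on memory than `genotype_matrix()`):

```python
import time
from collections import defaultdict

import tskit

start = time.perf_counter()
ts = tskit.load("sars-cov-2.trees")  # placeholder filename, as above
print(f"loaded in {(time.perf_counter() - start) * 1000:.0f} ms")

# Group samples by their full genotype vector; samples sharing a key are
# identical at every variant site. genotype_matrix() is (sites, samples),
# so transpose to get one column per sample.
groups = defaultdict(list)
for sample, genotypes in zip(ts.samples(), ts.genotype_matrix().T):
    groups[genotypes.tobytes()].append(sample)

identical = [nodes for nodes in groups.values() if len(nodes) > 1]
print(f"{len(identical)} groups of identical samples")
```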
