
Memory usage with huge datasets #87

Open
maximilian-heeg opened this issue Aug 8, 2023 · 6 comments


@maximilian-heeg
Contributor

Hi,

Thank you so much for providing Baysor. I recently updated my installation to version 0.6.2, and it is running great with Julia 1.9.

In our lab, we have recently generated new (huge) spatial datasets with up to 250 million transcripts (using a 500-gene panel), and we were planning to use Baysor for cell segmentation. I expected this to require a lot of memory, so I did some benchmarking with smaller FOVs of the dataset (see below).

[Plot: memory usage vs. number of transcripts]

It seems that memory use scales linearly with the number of transcripts. Extrapolating this, I would estimate that our dataset with 250 million transcripts requires approximately 5-6 TB of memory (which, unfortunately, I don't even have on our HPC).
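For illustration, a linear fit over hypothetical benchmark points (illustrative values only; substitute your own measurements) gives an estimate in the same ballpark:

```python
import numpy as np

# Hypothetical (transcripts, peak memory in GB) benchmark points from small FOVs;
# these numbers are illustrative only -- replace them with your own measurements.
n_transcripts = np.array([1e6, 5e6, 10e6, 20e6])
peak_mem_gb = np.array([25.0, 120.0, 240.0, 470.0])

# Fit memory ~= slope * n_transcripts + intercept
slope, intercept = np.polyfit(n_transcripts, peak_mem_gb, 1)

# Extrapolate to the full dataset (~250 million transcripts)
full_n = 250e6
est_tb = (slope * full_n + intercept) / 1024
print(f"{slope * 1e6:.1f} GB per million transcripts -> ~{est_tb:.1f} TB estimated")
```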

Are there any solutions to this? Is there an easy way of creating smaller tiles and stitching them back together? I think that, with the increasing panel sizes and imaging areas of commercial solutions, this might soon become an important limitation for many users.

Any help/ideas/suggestions are greatly appreciated.

Max

@sebgoti
Copy link

sebgoti commented Aug 15, 2023

A somewhat unrelated question, but may I ask @maximilian-heeg how your lab is running Baysor on the HPC? I am trying to run it with Singularity to avoid installing anything at the HPC level, but so far without luck (even though the Docker container works). Thanks, and sorry for any spam in your issue!

@VPetukhov
Collaborator

@maximilian-heeg , thank you for this test! It's indeed a problem. We're working on memory optimizations for v0.7.0, and if it works as expected, it should drastically reduce memory usage (roughly 10-fold).

As for tiling, we also plan to add the graph-cut idea, but it's not there yet. So the only thing you can do at the moment is manually split the data by FOVs.
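Until built-in tiling exists, a rough workaround could look like the sketch below: cut the molecule table into overlapping spatial tiles and segment each one separately. The column names ("x", "y"), tile size, and overlap are assumptions, not part of Baysor itself, and cells crossing tile borders still have to be reconciled when stitching the per-tile results back together.

```python
import pandas as pd
import numpy as np

# Workaround sketch: split the molecule table into overlapping spatial tiles.
# Column names and tile/overlap sizes are assumptions -- adjust to your data.
df = pd.read_csv("transcripts.csv.gz")

tile_size = 2000.0   # hypothetical tile width/height, in the same units as x/y
overlap = 50.0       # margin so border cells appear in both neighboring tiles

x0, y0 = df["x"].min(), df["y"].min()
nx = int(np.ceil((df["x"].max() - x0) / tile_size))
ny = int(np.ceil((df["y"].max() - y0) / tile_size))

for i in range(nx):
    for j in range(ny):
        xmin, xmax = x0 + i * tile_size, x0 + (i + 1) * tile_size
        ymin, ymax = y0 + j * tile_size, y0 + (j + 1) * tile_size
        tile = df[
            df["x"].between(xmin - overlap, xmax + overlap)
            & df["y"].between(ymin - overlap, ymax + overlap)
        ]
        if len(tile) > 0:
            tile.to_csv(f"tile_{i}_{j}.csv.gz", index=False)
```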

@VPetukhov
Collaborator

@sebgoti , a short answer: I haven't tried Baysor with Singularity. We have our own lab servers, which are just large single machines, so no clusters. If you need input on your situation, I'd be happy to continue the discussion in a separate issue.

@maximilian-heeg
Contributor Author

@sebgoti I have tried to run the Docker container using Singularity, but that did not work for me on the HPC. I ended up installing juliaup in a conda environment and then building Baysor as described in the README. Good luck!

@VPetukhov Thank you so much for the answer and your work on this. For us, getting a good segmentation is currently the bottleneck in processing spatial data. I will try splitting the data into multiple FOVs.

@cbiagii

cbiagii commented Aug 28, 2023

@VPetukhov, do you mean splitting the data by FOVs using the fov_name column of the transcripts.csv.gz file?
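If the export does contain such an fov_name column, a minimal pandas sketch of that split could be (column and file names are taken from the comment above and may differ in other exports):

```python
import pandas as pd

# Split the transcript table into one file per FOV so each can be segmented
# separately; "fov_name" and "transcripts.csv.gz" follow the comment above.
df = pd.read_csv("transcripts.csv.gz")

for fov, sub in df.groupby("fov_name"):
    sub.to_csv(f"transcripts_fov_{fov}.csv.gz", index=False)
```

Note that cells lying on FOV borders will be cut in two; adding a coordinate margin, as in the tiling sketch earlier in the thread, mitigates this.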

@mjleone

mjleone commented Jan 31, 2024

@VPetukhov Hello, members of my lab and I are also very curious whether the new release is still in progress, and what the expected release date is, if you know. We are working with datasets of 10-25 million transcripts, and it is hard to use the current version with our resources.
