
How do you manage models in distributed Dask #416

Open
Padarn opened this issue Dec 31, 2021 · 2 comments

@Padarn

Padarn commented Dec 31, 2021

Hi, I really love using Dask as the backbone for this project, but I have a question:

If you use a GPU-enabled model for both feature extraction and feature matching, how will the Dask workers manage the GPU memory required for these tasks?

For example, with SuperPoint (https://github.com/borglab/gtsfm/blob/master/gtsfm/frontend/detector_descriptor/superpoint.py), it looks like this class will be initialized on all workers. Do you then run the risk of running out of GPU memory if your extractor and matcher models are quite large?

Thanks again for the exciting project

@johnwlambert
Collaborator

Hi @Padarn, thanks for your interest in our work.

Great question. Currently, when a GPU is available, we expect the user's hardware to meet minimum GPU RAM requirements: the GPU RAM must be sufficient to support inference with at least one model on one worker. For now, the user must anticipate the amount of GPU memory each worker will use when choosing the number of workers. In the future, we will automate this further.

However, all of these networks can also run on the CPU. We've specifically sought out and are using models with low RAM requirements (e.g., PatchmatchNet).
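For readers following along, below is a minimal sketch (not GTSFM's actual configuration) of one way to bound per-worker GPU usage with Dask's abstract worker resources. The resource name `"GPU"`, the `LocalCluster` settings, and the placeholder functions are all assumptions for illustration only.

```python
from dask.distributed import Client, LocalCluster

# Give each worker a single abstract "GPU" slot. The scheduler will then
# never place two tasks tagged with {"GPU": 1} on the same worker at once,
# even if the worker has multiple threads.
cluster = LocalCluster(n_workers=2, threads_per_worker=4, resources={"GPU": 1})
client = Client(cluster)

def extract_features(image):
    # Placeholder for a GPU detector/descriptor (e.g. SuperPoint) on one image.
    return f"features({image})"

def match_features(feats_i, feats_j):
    # Placeholder for a GPU matcher on a pair of feature sets.
    return f"matches({feats_i}, {feats_j})"

images = ["img0", "img1", "img2"]
pairs = [(0, 1), (1, 2)]

# Tag the GPU-heavy tasks with resources={"GPU": 1}; untagged (CPU-only)
# tasks remain free to run in the worker's other threads.
feats = [client.submit(extract_features, im, resources={"GPU": 1}) for im in images]
matches = [
    client.submit(match_features, feats[i], feats[j], resources={"GPU": 1})
    for i, j in pairs
]
print(client.gather(matches))
```

With one `"GPU"` unit per worker, an extraction task and a matching task can never run concurrently on the same worker, at the cost of serializing GPU work on that worker.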

@Padarn
Author

Padarn commented Jan 1, 2022

Hi @johnwlambert, thanks for your response.

I may be missing something about Dask, but how do you ensure that a single machine is not assigned tasks for more than one GPU model at the same time? Or do workers not execute tasks in parallel?

To clarify, the situation I am imagining is that a worker currently doing feature matching is assigned a feature extraction task (or more likely vice versa).
