DistributedManager with torch ProcessGroups #58

Closed
Tracked by #113
ankurmahesh opened this issue Jun 27, 2023 · 3 comments · Fixed by #92
Assignees
Labels
enhancement New feature or request external Issues/PR filed by people outside the team

Comments

@ankurmahesh

ankurmahesh commented Jun 27, 2023

I'd like to use the DistributedManager alongside the metrics implemented in ensemble_metrics to calculate ensemble means over different subsets of ranks. I noticed that some of this functionality (at least the part that creates different groups of processes) is already implemented in the DistributedManager.

However, this code is commented out. Is the reason that this functionality is not safe, as described on this page? I was thinking of performing this operation and using barriers to ensure that processes are synchronized.
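
For readers following along, the idea of creating groups of processes and synchronizing within one of them can be sketched with plain `torch.distributed`. This is only an illustrative sketch, not the commented-out DistributedManager code: the helper names `partition_ranks`, `build_subgroups`, and `sync_subgroup` are hypothetical, and it assumes `torch.distributed.init_process_group()` has already been called and that ranks are split into contiguous, equally sized groups.

```python
from typing import List


def partition_ranks(world_size: int, group_size: int) -> List[List[int]]:
    """Split ranks 0..world_size-1 into contiguous groups of group_size."""
    if group_size <= 0 or world_size % group_size != 0:
        raise ValueError("world_size must be a positive multiple of group_size")
    return [
        list(range(start, start + group_size))
        for start in range(0, world_size, group_size)
    ]


def build_subgroups(group_size: int):
    """Create one process group per partition and return this rank's group.

    Assumes the default process group is already initialized.
    """
    # Imported lazily so partition_ranks can be used without torch installed.
    import torch.distributed as dist

    my_rank = dist.get_rank()
    my_group = None
    for ranks in partition_ranks(dist.get_world_size(), group_size):
        # new_group is collective: every rank must call it for every group,
        # in the same order, even for groups it does not belong to.
        group = dist.new_group(ranks=ranks)
        if my_rank in ranks:
            my_group = group
    return my_group


def sync_subgroup(group) -> None:
    """Block until every rank in `group` has reached this point."""
    import torch.distributed as dist

    dist.barrier(group=group)
```

With 8 ranks and `group_size=4`, ranks 0 to 3 would share one group and ranks 4 to 7 the other. A barrier on a subgroup only synchronizes the members of that subgroup.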

[Image: Screen Shot 2023-06-26 at 5 55 50 PM]

@akshaysubr
Collaborator

@ankurmahesh Thanks for your interest in using the distributed utils in Modulus. You're right, there are existing utilities to create different groups to communicate data in specific patterns and to synchronize a subset of processes. This code is currently commented out for a very simple reason: we don't yet have a way to test it in our CI pipelines. We are planning to work on that and to add this functionality back into the DistributedManager in the very near future.

Would also be great to get some more info on your specific use case so we can prioritize this work appropriately.

@akshaysubr akshaysubr self-assigned this Jun 27, 2023
@NickGeneva NickGeneva added enhancement New feature or request ? - Needs Triage Need team to review and classify labels Jul 18, 2023
@akshaysubr
Collaborator

Just wanted to update that we are starting to work on this issue. We will be able to fully support this once #74 is resolved so we can test DistributedManager better.

@akshaysubr akshaysubr added 1 - On Deck To be worked on next and removed ? - Needs Triage Need team to review and classify labels Jul 21, 2023
@NickGeneva NickGeneva added the external Issues/PR filed by people outside the team label Jul 28, 2023
@akshaysubr akshaysubr added 2 - In Progress Currently a work in progress 4 - In Review Currently Under Review and removed 1 - On Deck To be worked on next 2 - In Progress Currently a work in progress labels Jul 29, 2023
@ankurmahesh
Author

Thank you for the update! I look forward to trying it out.

For my use case, I hope to use DistributedManager with EnsembleMean to take the ensemble mean over the outputs of different subgroups of processes.
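
The use case above, a mean taken separately within each subgroup of ranks, can be sketched as a sum all-reduce restricted to a process group followed by a division by the group size. This is a sketch in plain `torch.distributed`, not the Modulus `EnsembleMean` API. Both function names are hypothetical, and it assumes `group` was created with `torch.distributed.new_group` and that `local_tensor` is a floating-point tensor. The pure-Python reference function only illustrates the value each rank should hold afterwards.

```python
from typing import Dict, List, Sequence


def reference_subgroup_means(
    per_rank_values: Sequence[float], groups: List[List[int]]
) -> Dict[int, float]:
    """Pure-Python reference: the mean each rank should hold afterwards."""
    result = {}
    for ranks in groups:
        mean = sum(per_rank_values[r] for r in ranks) / len(ranks)
        for r in ranks:
            result[r] = mean
    return result


def subgroup_mean(local_tensor, group):
    """Average `local_tensor` over the ranks in `group` (hypothetical helper).

    Every rank in `group` must call this with a tensor of the same shape.
    """
    # Imported lazily so the reference function above needs no torch install.
    import torch.distributed as dist

    averaged = local_tensor.clone()  # leave the caller's tensor untouched
    # Sum across the members of this subgroup only.
    dist.all_reduce(averaged, op=dist.ReduceOp.SUM, group=group)
    # Divide by the number of ranks in the subgroup, not the global world size.
    averaged /= dist.get_world_size(group=group)
    return averaged
```

For example, with four ranks holding 1.0, 3.0, 10.0, and 20.0 and groups `[[0, 1], [2, 3]]`, ranks 0 and 1 would end up with 2.0 and ranks 2 and 3 with 15.0.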

3 participants