I'd like to use the DistributedManager alongside the metrics implemented in ensemble_metrics to calculate ensemble means over different subsets of ranks. I noticed that some of this functionality (at least the part that creates different groups of processes) is already implemented in the distributed manager.
However, this code is commented out. Is that because the functionality is not safe, as described on this page? I was thinking of performing this operation and using barriers to ensure that the processes are synchronized.
@ankurmahesh Thanks for your interest in using the distributed utils in Modulus. You're right, there are existing utilities to create different groups, both to communicate data in specific patterns and to synchronize a subset of processes. This is currently commented out for a very simple reason: we don't yet have a way to test this code in our CI pipelines. We are planning to work on that and add this functionality back into the DistributedManager in the very near future.
Would also be great to get some more info on your specific use case so we can prioritize this work appropriately.
Just wanted to update that we are starting to work on this issue. We will be able to fully support this once #74 is resolved so we can test DistributedManager better.
Thank you for the update! I look forward to trying it out.
For my use case, I hope to use DistributedManager to apply EnsembleMean to take the ensemble mean based on the output of different subgroups of processors.
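To make the use case concrete, here is a rough sketch of a per-subgroup mean written against plain `torch.distributed`, not the DistributedManager or EnsembleMean API. The helper names `partition_ranks`, `create_subgroup`, and `subgroup_mean` are hypothetical, and the sketch assumes the default process group is already initialized and that ranks are split into contiguous, equally sized subgroups.

```python
from typing import List


def partition_ranks(world_size: int, group_size: int) -> List[List[int]]:
    """Split ranks 0..world_size-1 into contiguous subgroups of equal size."""
    if group_size <= 0 or world_size <= 0 or world_size % group_size != 0:
        raise ValueError("world_size must be a positive multiple of group_size")
    return [
        list(range(start, start + group_size))
        for start in range(0, world_size, group_size)
    ]


def create_subgroup(group_size: int):
    """Create all subgroups and return the one containing the calling rank.

    Assumes torch.distributed.init_process_group() has already been called.
    """
    # Imported lazily so partition_ranks can be used without torch installed.
    import torch.distributed as dist

    rank = dist.get_rank()
    my_group = None
    # Every rank must call new_group for every subgroup, in the same order,
    # including the subgroups it does not belong to.
    for ranks in partition_ranks(dist.get_world_size(), group_size):
        group = dist.new_group(ranks=ranks)
        if rank in ranks:
            my_group = group
    return my_group


def subgroup_mean(tensor, group, group_size: int):
    """Average `tensor` over the ranks of `group` (hypothetical helper)."""
    import torch.distributed as dist

    result = tensor.clone()
    # Sum across the subgroup only, then divide by its size to get the mean.
    dist.all_reduce(result, op=dist.ReduceOp.SUM, group=group)
    return result / group_size
```

In this sketch the subgroup would be created once at startup with `create_subgroup(group_size)` and then reused for every call to `subgroup_mean`, since creating a group is itself a collective operation across all ranks.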