#12652: switch all-gather to worker initiated edm termination mode #14078

Draft
wants to merge 2 commits into
base: main
from

Conversation

SeanNijjar
Contributor

@SeanNijjar SeanNijjar commented Oct 22, 2024

Ticket

Link to Github Issue

Problem description

This change isn't strictly required. However, making this change will simplify some follow-up changes.

What's changed

This is a purely tactical/refactoring change that will simplify some follow-up all-gather changes. Namely:

  • We will be able to start deleting the explicit message counting/computation code on host
  • We will be able to migrate all-gather to the improved CCL infra (like the slicing infra that reduce scatter uses), commonizing it further
    • We'll be better set up to use the same tensor iterating code as reduce scatter
  • This will reduce the amount of code change needed to support merged sender/receiver workers for all-gather
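To make the difference between the two termination modes concrete, here is a minimal toy model. It is not the EDM implementation from this PR: `ToyChannel`, `TerminationMode`, and `run_worker` are hypothetical names invented for illustration, and the real kernels are assumed to be considerably more involved. The sketch only shows why a count-based mode forces the host to precompute an exact message count, while a worker-initiated mode does not.

```cpp
#include <cstdint>

// Hypothetical toy model for illustration only; NOT the real EDM code.
enum class TerminationMode { MessageCountReached, WorkerInitiated };

struct ToyChannel {
    TerminationMode mode = TerminationMode::WorkerInitiated;
    std::uint32_t expected_messages = 0;  // only consulted in MessageCountReached mode
    std::uint32_t forwarded = 0;          // messages forwarded so far
    bool worker_requested_stop = false;   // set by the worker in WorkerInitiated mode

    void forward_one() { ++forwarded; }
    void worker_signal_done() { worker_requested_stop = true; }

    // Count-based mode stops once the host-computed count is reached;
    // worker-initiated mode stops only when the worker says so.
    bool should_terminate() const {
        if (mode == TerminationMode::MessageCountReached) {
            return forwarded >= expected_messages;
        }
        return worker_requested_stop;
    }
};

// The worker sends its messages (until the channel stops accepting them),
// then signals completion. Returns how many messages were forwarded.
inline std::uint32_t run_worker(ToyChannel& ch, std::uint32_t messages_worker_sends) {
    for (std::uint32_t i = 0; i < messages_worker_sends && !ch.should_terminate(); ++i) {
        ch.forward_one();
    }
    ch.worker_signal_done();
    return ch.forwarded;
}
```

In this toy model, a worker-initiated channel terminates correctly for any number of messages, whereas a count-based channel whose `expected_messages` was computed too high never reaches its termination condition. That is the host-side bookkeeping the bullets above propose to delete.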

Checklist

@SeanNijjar SeanNijjar linked an issue Oct 22, 2024 that may be closed by this pull request
@@ -750,7 +738,7 @@ operation::ProgramWithCallbacks all_gather_multi_core_with_workers_helper(
auto &sender_edm_builder = is_buffer_in_clockwise_direction(b) ? clockwise_edm_builders.at(i) : counter_clockwise_edm_builders.at(i);
log_trace(tt::LogOp, "Adding sender EDM channel");
EriscDatamoverBuilder::ChannelBufferInterface const& sender_channel_buffer_info =
- sender_edm_builder.add_sender_channel(sender_worker_writer_semaphore_id, clockwise_link_buffer_num_messages_to_send.at(b), sender_worker_coords);
+ sender_edm_builder.add_sender_channel(sender_worker_writer_semaphore_id, 1, sender_worker_coords);
Contributor

@avoraTT avoraTT Oct 22, 2024

Is changing num_eth_messages_to_forward to 1 going to affect perf? Or is it just a semantic change?

Contributor Author

I'll update the code for readability. The value 1 just means the channel will actually forward data; the specific value itself doesn't matter, so the API should be cleaned up. Thanks for the question.
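As one possible shape for such a cleanup, the magic `1` could be replaced by a named flag. The sketch below is purely hypothetical: `ChannelForwarding`, `ToyChannelSpec`, and `ToyEdmBuilder` are invented for illustration and do not reproduce the real `EriscDatamoverBuilder::add_sender_channel` signature or whatever change the author eventually makes.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical names for illustration only; not the real builder API.
enum class ChannelForwarding : std::uint8_t { Disabled, Enabled };

struct ToyChannelSpec {
    std::uint32_t worker_semaphore_id;
    ChannelForwarding forwarding;
};

class ToyEdmBuilder {
public:
    // Registers a sender channel and returns its index.
    // The intent ("this channel forwards data") is stated explicitly
    // instead of being encoded as a nonzero message count.
    std::size_t add_sender_channel(std::uint32_t worker_semaphore_id,
                                   ChannelForwarding forwarding) {
        channels_.push_back(ToyChannelSpec{worker_semaphore_id, forwarding});
        return channels_.size() - 1;
    }

    bool forwards(std::size_t channel_index) const {
        return channels_.at(channel_index).forwarding == ChannelForwarding::Enabled;
    }

private:
    std::vector<ToyChannelSpec> channels_;
};
```

A call site would then read `builder.add_sender_channel(semaphore_id, ChannelForwarding::Enabled)`, which avoids the question of whether the literal value affects performance.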

@@ -302,6 +302,7 @@ operation::ProgramWithCallbacks all_gather_multi_core_with_workers_helper(
worker_defines["INTERLEAVED_MEM_LAYOUT"] = "1";
}

+ constexpr ttnn::ccl::EriscDataMoverTerminationMode edm_termination_mode = ttnn::ccl::EriscDataMoverTerminationMode::WORKER_INITIATED;
Contributor

Does all-gather not use EriscDataMoverTerminationMode::MESSAGE_COUNT_REACHED mode at all now?

Contributor Author

Correct. Both to simplify host code and to enable migration to fabric, we won't be able to use a message_count type of mode anymore.

Contributor

@avoraTT avoraTT left a comment

Looks good. Just added some questions for understanding.

@SeanNijjar
Contributor Author

Hit some regressions and this work is being paused, so I'll mark it as a draft.

@SeanNijjar SeanNijjar marked this pull request as draft October 24, 2024 00:04
Development

Successfully merging this pull request may close these issues.

Switch all-gather to worker initiated termination mode
5 participants