#5560: Add all_reduce op (experimental) (#13853)
Conversation
Force-pushed from a6c9d8c to 83e452a.
Clang-Tidy found issue(s) with the introduced code (1/6)

namespace ttnn {

void AllReduce::validate(const std::vector<Tensor>& input_tensors) const {
    for (auto const& t : input_tensors) {

variable name t is too short, expected at least 3 characters

void AllReduce::validate(const std::vector<Tensor>& input_tensors) const {
    for (auto const& t : input_tensors) {
        TT_FATAL(

avoid do-while loops

        TT_FATAL(
            t.get_legacy_shape()[this->scatter_dim] / this->ring_size > 0,
            "All reduce input tensor shape on dim {} must be divisible by ring size", this->scatter_dim);
        TT_FATAL(

avoid do-while loops

            t.get_legacy_shape()[this->scatter_dim] % this->ring_size == 0,
            "All reduce input tensor shape on dim {} must be divisible by ring size", this->scatter_dim);

        TT_FATAL(!t.is_sharded(), "Sharded tensors are not supported for all reduce currently");

avoid do-while loops

            t.get_legacy_shape()[this->scatter_dim] % this->ring_size == 0,
            "All reduce input tensor shape on dim {} must be divisible by ring size", this->scatter_dim);

        TT_FATAL(!t.is_sharded(), "Sharded tensors are not supported for all reduce currently");
boolean expression can be simplified by DeMorgan's theorem
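For context on this check: readability-simplify-boolean-expr suggests applying De Morgan's laws (or dropping a double negation) when a negated compound condition can be written more directly; here it most likely fires inside the TT_FATAL macro expansion. A minimal standalone illustration of the transformation, using made-up predicates rather than the PR's code:

#include <iostream>

// Flagged form: negation over a compound condition.
bool skip_verbose(bool sharded, bool tiled) { return !(sharded && !tiled); }

// De Morgan-simplified equivalent suggested by the check.
bool skip_simplified(bool sharded, bool tiled) { return !sharded || tiled; }

int main() {
    for (bool s : {false, true})
        for (bool t : {false, true})
            std::cout << (skip_verbose(s, t) == skip_simplified(s, t)) << '\n';  // prints 1 four times
}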
    }
}

std::vector<ttnn::SimpleShape> AllReduce::compute_output_shapes(const std::vector<Tensor>& input_tensors) const {

use a trailing return type for this function

Suggested change:
- std::vector<ttnn::SimpleShape> AllReduce::compute_output_shapes(const std::vector<Tensor>& input_tensors) const {
+ auto AllReduce::compute_output_shapes(const std::vector<Tensor>& input_tensors) const -> std::vector<ttnn::SimpleShape> {

std::vector<ttnn::SimpleShape> AllReduce::compute_output_shapes(const std::vector<Tensor>& input_tensors) const {
    auto shape = input_tensors[0].get_logical_shape();
    TT_FATAL(

avoid do-while loops

    TT_FATAL(
        shape[this->scatter_dim] % this->ring_size == 0,
        "The size of the scatter dimension must be a multiple of the ring size");
    shape[this->scatter_dim] /= this->ring_size;

narrowing conversion from uint32_t (aka unsigned int) to signed type int32_t (aka int) is implementation-defined
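For context on the narrowing warning: the divided extent is unsigned, but it evidently lands in a signed int32_t shape element, and converting an out-of-range unsigned value to a signed type was implementation-defined before C++20. A generic illustration of the flagged pattern and one common way to make the conversion explicit (standalone sketch, not the SimpleShape code):

#include <cstdint>
#include <iostream>

int main() {
    uint32_t dim_size = 4096;   // unsigned, like a shape extent
    uint32_t ring_size = 8;

    int32_t implicit_value = dim_size / ring_size;                     // flagged: implicit unsigned -> signed narrowing
    auto explicit_value = static_cast<int32_t>(dim_size / ring_size);  // explicit cast documents the intent

    std::cout << implicit_value << ' ' << explicit_value << '\n';      // 512 512
}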
        shape[this->scatter_dim] % this->ring_size == 0,
        "The size of the scatter dimension must be a multiple of the ring size");
    shape[this->scatter_dim] /= this->ring_size;
    return std::vector<ttnn::SimpleShape>(input_tensors.size(), shape);

avoid repeating the return type from the declaration; use a braced initializer list instead

    return std::vector<ttnn::SimpleShape>(input_tensors.size(), shape);
}

std::vector<Tensor> AllReduce::create_output_tensors(const std::vector<Tensor>& input_tensors) const {

use a trailing return type for this function

Suggested change:
- std::vector<Tensor> AllReduce::create_output_tensors(const std::vector<Tensor>& input_tensors) const {
+ auto AllReduce::create_output_tensors(const std::vector<Tensor>& input_tensors) const -> std::vector<Tensor> {
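For reference, the modernize-use-trailing-return-type check behind these suggestions only moves where the return type is spelled; the two forms below are equivalent (generic example, not PR code):

#include <cstddef>
#include <vector>

// Classic leading return type, flagged by the check.
std::vector<int> make_zeros(std::size_t count) { return std::vector<int>(count, 0); }

// Trailing return type, the spelling the check suggests.
auto make_zeros_trailing(std::size_t count) -> std::vector<int> { return std::vector<int>(count, 0); }

int main() { return make_zeros(3) == make_zeros_trailing(3) ? 0 : 1; }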
Clang-Tidy found issue(s) with the introduced code (2/6)

operation::ProgramWithCallbacks AllReduce::create_program(
    const std::vector<Tensor>& input_tensors, std::vector<Tensor>& output_tensors) const {

use a trailing return type for this function

Suggested change:
- operation::ProgramWithCallbacks AllReduce::create_program(
-     const std::vector<Tensor>& input_tensors, std::vector<Tensor>& output_tensors) const {
+ auto AllReduce::create_program(
+     const std::vector<Tensor>& input_tensors, std::vector<Tensor>& output_tensors) const -> operation::ProgramWithCallbacks {

        this->user_defined_num_buffers_per_channel);
}

static ttnn::operations::binary::BinaryOpType convert_reduce_type_to_eltwise_type(ttnn::operations::reduction::ReduceType reduce_op) {

use a trailing return type for this function

Suggested change:
- static ttnn::operations::binary::BinaryOpType convert_reduce_type_to_eltwise_type(ttnn::operations::reduction::ReduceType reduce_op) {
+ static auto convert_reduce_type_to_eltwise_type(ttnn::operations::reduction::ReduceType reduce_op) -> ttnn::operations::binary::BinaryOpType {

namespace operations{
namespace experimental{
namespace ccl{

nested namespaces can be concatenated

Suggested change:
- namespace operations{
- namespace experimental{
- namespace ccl{
+ namespace operations::experimental::ccl{

} // namespace ccl
} // namespace experimental
} // namespace operations

nested namespaces can be concatenated

Suggested change:
- } // namespace ccl
- } // namespace experimental
- } // namespace operations
+ } // namespace operations::experimental::ccl
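The concatenation the check asks for is just the C++17 nested-namespace syntax; both spellings below declare the same namespace (illustrative names):

// Pre-C++17 spelling, flagged by modernize-concat-nested-namespaces:
namespace outer { namespace middle { namespace inner {
inline int answer() { return 42; }
}}}  // namespace outer::middle::inner

// C++17 concatenated spelling suggested by the check:
namespace outer::middle::inner {
inline int answer_too() { return 42; }
}  // namespace outer::middle::inner

int main() { return outer::middle::inner::answer() == outer::middle::inner::answer_too() ? 0 : 1; }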
namespace operations{
namespace experimental{
namespace ccl{
Tensor all_reduce(

use a trailing return type for this function

Suggested change:
- Tensor all_reduce(
+ auto all_reduce(

    const MemoryConfig& output_mem_config,
    ttnn::ccl::Topology topology,
    const std::optional<size_t> user_defined_num_workers,
    const std::optional<size_t> user_defined_num_buffers_per_channel) {

use a trailing return type for this function

Suggested change:
-     const std::optional<size_t> user_defined_num_buffers_per_channel) {
+     const std::optional<size_t> user_defined_num_buffers_per_channel) -> Tensor {

namespace operations{
namespace experimental{
namespace ccl{
Tensor all_reduce(

function all_reduce has cognitive complexity of 32 (threshold 25)

    const std::optional<size_t> user_defined_num_workers,
    const std::optional<size_t> user_defined_num_buffers_per_channel) {
    ttnn::operations::binary::BinaryOpType binary_op_type = convert_reduce_type_to_eltwise_type(math_op);
    TT_FATAL(std::getenv("TT_METAL_SLOW_DISPATCH_MODE") == nullptr, "This op is only supported for Fast Dispatch");

avoid do-while loops

    const std::optional<size_t> user_defined_num_buffers_per_channel) {
    ttnn::operations::binary::BinaryOpType binary_op_type = convert_reduce_type_to_eltwise_type(math_op);
    TT_FATAL(std::getenv("TT_METAL_SLOW_DISPATCH_MODE") == nullptr, "This op is only supported for Fast Dispatch");
    TT_FATAL(topology == ttnn::ccl::Topology::Ring, "All Reduce op is currently supported only on Ring topology");

avoid do-while loops

        const std::vector<Tensor>& input_tensors,
        const std::vector<std::optional<const Tensor>>& optional_input_tensors,
        const std::vector<std::optional<Tensor>>& optional_output_tensors) mutable -> std::vector<Tensor> {
        TT_FATAL(input_tensors.size() >= 1, "All reduce op expects an input tensor but it received none");

avoid do-while loops
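A note on the recurring "avoid do-while loops" hits: they are almost certainly triggered by the expansion of the TT_FATAL assertion macro rather than by hand-written loops, since assertion-style macros are conventionally wrapped in do { ... } while (0) so that a call followed by a semicolon parses as a single statement. A generic sketch of that idiom (hypothetical macro; the real TT_FATAL definition may differ):

#include <cstdio>
#include <cstdlib>

// Hypothetical assert-like macro. The do/while(0) wrapper is what
// cppcoreguidelines-avoid-do-while ends up seeing at every call site.
#define MY_FATAL(cond, msg)                        \
    do {                                           \
        if (!(cond)) {                             \
            std::fprintf(stderr, "%s\n", (msg));   \
            std::abort();                          \
        }                                          \
    } while (0)

int main() {
    unsigned ring_size = 8;
    MY_FATAL(ring_size > 0, "ring size must be positive");  // expands to a do-while
    return 0;
}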
Clang-Tidy found issue(s) with the introduced code (3/6)

        const std::vector<Tensor>& input_tensors,
        const std::vector<std::optional<const Tensor>>& optional_input_tensors,
        const std::vector<std::optional<Tensor>>& optional_output_tensors) mutable -> std::vector<Tensor> {
        TT_FATAL(input_tensors.size() >= 1, "All reduce op expects an input tensor but it received none");

the empty method should be used to check for emptiness instead of size

Suggested change:
-         TT_FATAL(input_tensors.size() >= 1, "All reduce op expects an input tensor but it received none");
+         TT_FATAL(!input_tensors.empty(), "All reduce op expects an input tensor but it received none");
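For reference, the readability-container-size-empty rationale behind this suggestion: empty() states the intent directly and is O(1) for every standard container, while size() >= 1 re-derives the same fact. A small standalone illustration:

#include <cassert>
#include <vector>

int main() {
    std::vector<int> input_tensors{1, 2, 3};
    assert(input_tensors.size() >= 1);   // flagged spelling
    assert(!input_tensors.empty());      // preferred, equivalent check
}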
            break;
        }
    }
    TT_FATAL(receiver_device_id != std::nullopt || sender_device_id != std::nullopt, "Error in all reduce op setup");

avoid do-while loops

namespace operations{
namespace experimental{
namespace ccl{
Tensor all_reduce(

function ttnn::operations::experimental::ccl::all_reduce has a definition with different parameter names

namespace ttnn {

struct AllReduce {
    const ttnn::operations::binary::BinaryOpType binary_op_type;

member binary_op_type of type const ttnn::operations::binary::BinaryOpType is const qualified
struct AllReduce {
    const ttnn::operations::binary::BinaryOpType binary_op_type;
    const uint32_t scatter_dim;

member scatter_dim of type const uint32_t (aka const unsigned int) is const qualified

struct AllReduce {
    const ttnn::operations::binary::BinaryOpType binary_op_type;
    const uint32_t scatter_dim;
    const uint32_t num_links;

member num_links of type const uint32_t (aka const unsigned int) is const qualified

    const ttnn::operations::binary::BinaryOpType binary_op_type;
    const uint32_t scatter_dim;
    const uint32_t num_links;
    const uint32_t ring_size;

member ring_size of type const uint32_t (aka const unsigned int) is const qualified

    const uint32_t scatter_dim;
    const uint32_t num_links;
    const uint32_t ring_size;
    const uint32_t ring_index;

member ring_index of type const uint32_t (aka const unsigned int) is const qualified

    const uint32_t num_links;
    const uint32_t ring_size;
    const uint32_t ring_index;
    const std::optional<chip_id_t> receiver_device_id;

member receiver_device_id of type const std::optional<chip_id_t> (aka const optional<int>) is const qualified

    const uint32_t ring_size;
    const uint32_t ring_index;
    const std::optional<chip_id_t> receiver_device_id;
    const std::optional<chip_id_t> sender_device_id;

member sender_device_id of type const std::optional<chip_id_t> (aka const optional<int>) is const qualified
Clang-Tidy found issue(s) with the introduced code (4/6)

    const uint32_t ring_index;
    const std::optional<chip_id_t> receiver_device_id;
    const std::optional<chip_id_t> sender_device_id;
    const MemoryConfig output_mem_config;

member output_mem_config of type const MemoryConfig is const qualified

    const std::optional<chip_id_t> receiver_device_id;
    const std::optional<chip_id_t> sender_device_id;
    const MemoryConfig output_mem_config;
    const ttnn::ccl::Topology topology;

member topology of type const ttnn::ccl::Topology is const qualified

    const std::optional<chip_id_t> sender_device_id;
    const MemoryConfig output_mem_config;
    const ttnn::ccl::Topology topology;
    const std::optional<size_t> user_defined_num_workers;

member user_defined_num_workers of type const std::optional<size_t> (aka const optional<unsigned long>) is const qualified

    const MemoryConfig output_mem_config;
    const ttnn::ccl::Topology topology;
    const std::optional<size_t> user_defined_num_workers;
    const std::optional<size_t> user_defined_num_buffers_per_channel;

member user_defined_num_buffers_per_channel of type const std::optional<size_t> (aka const optional<unsigned long>) is const qualified
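The practical reason clang-tidy keeps flagging these const members: a const data member deletes copy and move assignment for the enclosing struct, which gets in the way of containers and of any code path that reassigns or moves the op descriptor. A minimal sketch of the pitfall (generic struct, not the real AllReduce):

#include <cstdint>
#include <vector>

struct WithConstMember {
    const uint32_t ring_size;  // const member: operator= is implicitly deleted
};

struct WithPlainMember {
    uint32_t ring_size;        // non-const member: the struct stays assignable and movable
};

int main() {
    WithPlainMember a{4}, b{8};
    a = b;                                  // fine
    // WithConstMember c{4}, d{8};
    // c = d;                               // would not compile: deleted copy assignment
    std::vector<WithPlainMember> descriptors;
    descriptors.push_back({4});             // assignable elements keep vector operations unsurprising
    return a.ring_size == 8 ? 0 : 1;
}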
    const std::optional<size_t> user_defined_num_buffers_per_channel;

    void validate(const std::vector<Tensor> &input_tensors) const;
    std::vector<ttnn::SimpleShape> compute_output_shapes(const std::vector<Tensor> &input_tensors) const;

use a trailing return type for this function

Suggested change:
-     std::vector<ttnn::SimpleShape> compute_output_shapes(const std::vector<Tensor> &input_tensors) const;
+     auto compute_output_shapes(const std::vector<Tensor> &input_tensors) const -> std::vector<ttnn::SimpleShape>;

    void validate(const std::vector<Tensor> &input_tensors) const;
    std::vector<ttnn::SimpleShape> compute_output_shapes(const std::vector<Tensor> &input_tensors) const;
    std::vector<Tensor> create_output_tensors(const std::vector<Tensor> &input_tensors) const;

use a trailing return type for this function

Suggested change:
-     std::vector<Tensor> create_output_tensors(const std::vector<Tensor> &input_tensors) const;
+     auto create_output_tensors(const std::vector<Tensor> &input_tensors) const -> std::vector<Tensor>;

    operation::ProgramWithCallbacks create_program(
        const std::vector<Tensor> &input_tensors, std::vector<Tensor> &output_tensors) const;

use a trailing return type for this function

Suggested change:
-     operation::ProgramWithCallbacks create_program(
-         const std::vector<Tensor> &input_tensors, std::vector<Tensor> &output_tensors) const;
+     auto create_program(
+         const std::vector<Tensor> &input_tensors, std::vector<Tensor> &output_tensors) const -> operation::ProgramWithCallbacks;

namespace operations{
namespace experimental{
namespace ccl{

nested namespaces can be concatenated

Suggested change:
- namespace operations{
- namespace experimental{
- namespace ccl{
+ namespace operations::experimental::ccl{

} // namespace ccl
} // namespace experimental
} // namespace operations

nested namespaces can be concatenated

Suggested change:
- } // namespace ccl
- } // namespace experimental
- } // namespace operations
+ } // namespace operations::experimental::ccl

namespace operations{
namespace experimental{
namespace ccl{
Tensor all_reduce(

use a trailing return type for this function

Suggested change:
- Tensor all_reduce(
+ auto all_reduce(
Clang-Tidy found issue(s) with the introduced code (5/6)

    const MemoryConfig& output_mem_config = operation::DEFAULT_OUTPUT_MEMORY_CONFIG,
    ttnn::ccl::Topology topology = ttnn::ccl::Topology::Ring,
    const std::optional<size_t> user_defined_num_workers = std::nullopt,
    const std::optional<size_t> user_defined_num_buffers_per_channel = std::nullopt);

use a trailing return type for this function

Suggested change:
-     const std::optional<size_t> user_defined_num_buffers_per_channel = std::nullopt);
+     const std::optional<size_t> user_defined_num_buffers_per_channel = std::nullopt) -> Tensor;

namespace ccl{
Tensor all_reduce(
    const Tensor &input_tensor,
    const uint32_t scatter_split_dim,

parameter scatter_split_dim is const-qualified in the function declaration; const-qualification of parameters only has an effect in function definitions

Suggested change:
-     const uint32_t scatter_split_dim,
+     uint32_t scatter_split_dim,

    const Tensor &input_tensor,
    const uint32_t scatter_split_dim,
    ttnn::operations::reduction::ReduceType reduce_op = ttnn::operations::reduction::ReduceType::Sum,
    const uint32_t num_links = 1,

parameter num_links is const-qualified in the function declaration; const-qualification of parameters only has an effect in function definitions

Suggested change:
-     const uint32_t num_links = 1,
+     uint32_t num_links = 1,

    const uint32_t num_links = 1,
    const MemoryConfig& output_mem_config = operation::DEFAULT_OUTPUT_MEMORY_CONFIG,
    ttnn::ccl::Topology topology = ttnn::ccl::Topology::Ring,
    const std::optional<size_t> user_defined_num_workers = std::nullopt,

parameter user_defined_num_workers is const-qualified in the function declaration; const-qualification of parameters only has an effect in function definitions

Suggested change:
-     const std::optional<size_t> user_defined_num_workers = std::nullopt,
+     std::optional<size_t> user_defined_num_workers = std::nullopt,

    const MemoryConfig& output_mem_config = operation::DEFAULT_OUTPUT_MEMORY_CONFIG,
    ttnn::ccl::Topology topology = ttnn::ccl::Topology::Ring,
    const std::optional<size_t> user_defined_num_workers = std::nullopt,
    const std::optional<size_t> user_defined_num_buffers_per_channel = std::nullopt);

parameter user_defined_num_buffers_per_channel is const-qualified in the function declaration; const-qualification of parameters only has an effect in function definitions

Suggested change:
-     const std::optional<size_t> user_defined_num_buffers_per_channel = std::nullopt);
+     std::optional<size_t> user_defined_num_buffers_per_channel = std::nullopt);
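The const-qualified-parameter warnings are about declarations only: top-level const on a by-value parameter is not part of the function's type, so it is meaningless in a header and only constrains the body inside the definition. A small illustration with a hypothetical function:

#include <cstdint>

// These two lines declare the exact same function; the const is ignored for
// type purposes, which is why clang-tidy asks to drop it from the declaration.
uint32_t doubled(uint32_t num_links);
// uint32_t doubled(const uint32_t num_links);  // redundant respelling of the line above

// In the definition the const does matter: it stops the body from modifying num_links.
uint32_t doubled(const uint32_t num_links) { return num_links * 2; }

int main() { return doubled(1) == 2 ? 0 : 1; }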
namespace operations {
namespace experimental {
namespace ccl {

nested namespaces can be concatenated

Suggested change:
- namespace operations {
- namespace experimental {
- namespace ccl {
+ namespace operations::experimental::ccl {

} // namespace ccl
} // namespace experimental
} // namespace operations

nested namespaces can be concatenated

Suggested change:
- } // namespace ccl
- } // namespace experimental
- } // namespace operations
+ } // namespace operations::experimental::ccl

namespace ccl {

struct ExecuteAllReduce {
    static ttnn::Tensor invoke(

use a trailing return type for this function

Suggested change:
-     static ttnn::Tensor invoke(
+     static auto invoke(

        const std::optional<ttnn::MemoryConfig>& memory_config = std::nullopt,
        ttnn::ccl::Topology topology = ttnn::ccl::Topology::Ring,
        const std::optional<size_t> num_workers = std::nullopt,
        const std::optional<size_t> num_buffers_per_channel = std::nullopt);

use a trailing return type for this function

Suggested change:
-         const std::optional<size_t> num_buffers_per_channel = std::nullopt);
+         const std::optional<size_t> num_buffers_per_channel = std::nullopt) -> ttnn::Tensor;

struct ExecuteAllReduce {
    static ttnn::Tensor invoke(
        const ttnn::Tensor& input_tensor,
        const uint32_t scatter_dim,

parameter scatter_dim is const-qualified in the function declaration; const-qualification of parameters only has an effect in function definitions

Suggested change:
-         const uint32_t scatter_dim,
+         uint32_t scatter_dim,
Clang-Tidy found issue(s) with the introduced code (6/6)

        const ttnn::Tensor& input_tensor,
        const uint32_t scatter_dim,
        ttnn::operations::reduction::ReduceType math_op,
        const uint32_t num_links = 1,

parameter num_links is const-qualified in the function declaration; const-qualification of parameters only has an effect in function definitions

Suggested change:
-         const uint32_t num_links = 1,
+         uint32_t num_links = 1,

        const uint32_t num_links = 1,
        const std::optional<ttnn::MemoryConfig>& memory_config = std::nullopt,
        ttnn::ccl::Topology topology = ttnn::ccl::Topology::Ring,
        const std::optional<size_t> num_workers = std::nullopt,

parameter num_workers is const-qualified in the function declaration; const-qualification of parameters only has an effect in function definitions

Suggested change:
-         const std::optional<size_t> num_workers = std::nullopt,
+         std::optional<size_t> num_workers = std::nullopt,

        const std::optional<ttnn::MemoryConfig>& memory_config = std::nullopt,
        ttnn::ccl::Topology topology = ttnn::ccl::Topology::Ring,
        const std::optional<size_t> num_workers = std::nullopt,
        const std::optional<size_t> num_buffers_per_channel = std::nullopt);

parameter num_buffers_per_channel is const-qualified in the function declaration; const-qualification of parameters only has an effect in function definitions

Suggested change:
-         const std::optional<size_t> num_buffers_per_channel = std::nullopt);
+         std::optional<size_t> num_buffers_per_channel = std::nullopt);
namespace ttnn::operations::experimental::ccl {

ttnn::Tensor ExecuteAllReduce::invoke(

use a trailing return type for this function

Suggested change:
- ttnn::Tensor ExecuteAllReduce::invoke(
+ auto ExecuteAllReduce::invoke(

    const std::optional<ttnn::MemoryConfig>& memory_config,
    ttnn::ccl::Topology topology,
    const std::optional<size_t> num_workers,
    const std::optional<size_t> num_buffers_per_channel) {

use a trailing return type for this function

Suggested change:
-     const std::optional<size_t> num_buffers_per_channel) {
+     const std::optional<size_t> num_buffers_per_channel) -> ttnn::Tensor {
Clang-Tidy found issue(s) with the introduced code (1/1)

} // namespace experimental
} // namespace operations
* tenstorrent#5560: Add all_reduce op
* tenstorrent#5560: Add shapes to the test
* tenstorrent#5560: Combine all_gather with launch_op
* #0: Update copyright
* tenstorrent#5560: all_gather + local reduce
* tenstorrent#5560: Allocate tensor properly
* tenstorrent#5560: Reduce gathered dim
* tenstorrent#5560: Add to frequent pipeline
* tenstorrent#5560: Add cases with NC dim
Tracking Issue: #5560
Adds the all_reduce op under the experimental folder.
Q: Is only sum supported as the reduce op?
A: Today, yes. In the future, no; we'd need to support min, max, and other reduction operations.
Q: What is the output shape after all_reduce? Is it equal to the shape on one device?
A: Yes. (For cases where, for whatever reason, one of the devices has less data, the output would be the size of the larger tensors, and the device with a smaller tensor would logically be padded with identity-value elements. Whether this is a legitimate use case for this operation, I am not sure at the moment.)
Q: After all reduce, will all devices have the same reduced tensor?
A: Yes
Q: Do we do reduce over the batch dimension only?
A: Not exactly. All-reduce is a rank-wise operation; that is, we are doing element-wise reductions of tensors across devices. For reference, we are following the semantics outlined here.
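As a concrete illustration of those semantics (a toy simulation over std::vector, not the ttnn API): every device contributes a same-shaped tensor, the reduction is applied element-wise across the device axis, and every device receives the identical reduced result.

#include <cstddef>
#include <iostream>
#include <vector>

// Toy all-reduce (sum): element-wise across devices, result replicated to each device.
std::vector<std::vector<float>> toy_all_reduce(const std::vector<std::vector<float>>& per_device) {
    std::vector<float> reduced(per_device.front().size(), 0.0f);
    for (const auto& shard : per_device)
        for (std::size_t i = 0; i < shard.size(); ++i)
            reduced[i] += shard[i];
    return std::vector<std::vector<float>>(per_device.size(), reduced);
}

int main() {
    auto out = toy_all_reduce({{1, 2}, {10, 20}, {100, 200}});
    for (const auto& device_tensor : out)
        std::cout << device_tensor[0] << ' ' << device_tensor[1] << '\n';  // "111 222" on every device
}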
Q: What is the memory footprint on one device? E.g., if we have 32 devices and call all_reduce on a tensor with volume 16k, how much DRAM will be allocated during this op?
A:
Today
Memory consumption is inefficient because we are providing the least optimized, functional implementation of all-reduce (composite all-gather + eltwise operation).
If # devices = n and input tensor size per device = m, then the total memory working set size (per chip) will be, today, (n + 1) * m.
That is, n * m is needed to store the intermediate all-gather results, and another m to store the final output. After the operation, the n * m can be returned to the allocator.
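Worked example with the numbers from the question, treating the 16k volume as the per-device tensor size m (in elements): with n = 32 devices, today's working set per chip is (n + 1) * m = 33 * 16K = 528K elements, of which 32 * 16K = 512K is the transient all-gather intermediate and 16K is the final output.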
Near future
We will enable the reduce_scatter + all-gather (composite) optimization, which will both reduce memory pressure and improve performance. This has 2 constraints:
All-gather and reduce-scatter must add support for non-uniform tensor sizes across chips.
The input tensor shape must have a dim with size > n.
(Technically it just needs a volume > n, but that would require a multi-dimension scatter/concat dim for reduce-scatter/all-gather, which is out of the picture for the foreseeable future.)
With this version, total memory pressure (per chip) would be m + m * (1/n), where m * (1/n) is the size of the intermediate tensor between reduce_scatter and all-gather.
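Continuing the same illustrative numbers (n = 32, m = 16K elements per chip), the composite reduce_scatter + all-gather version would need roughly m + m / n = 16K + 0.5K = 16.5K elements per chip, compared to 528K for today's all-gather + eltwise implementation.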
Long Term
We will support a proper non-composite implementation, in which the memory needed per chip will simply be m.
Checklist