
#5560: Add all_reduce op (experimental) #13853

Merged: 9 commits into main, Oct 21, 2024

Conversation

@Aswinmcw (Contributor) commented Oct 16, 2024

Tracking Issue : #5560

Add all_reduce op under experimental folder

Q: Is sum the only supported reduce op?

A: Today, yes. In the future, no: we'd need to support min, max, and other reduction operations.

Q: What is the output shape after all_reduce, is it equal to the shape on one device?

A: Yes. (For cases where, for whatever reason, one of the devices has less data, the output would be the size of the largest tensor, and the device with a smaller tensor would logically be padded with identity-value elements. Whether this is a legitimate use case for this operation, I am not sure at the moment.)

Q: After all reduce, will all devices have the same reduced tensor?

A: Yes

Q: Do we do reduce over the batch dimension only?

A: Not exactly. All-reduce is a rank-wise operation; that is, we are doing element-wise reductions of tensors across devices. For reference, we are following the semantics outlined here.
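A minimal sketch of these semantics (illustrative Python, not the ttnn API): each device holds a tensor of the same shape, and after all-reduce every device holds the same element-wise reduction of all of them.

```python
from functools import reduce

def all_reduce(per_device_tensors, op=lambda a, b: a + b):
    """Element-wise reduce across devices; every device receives the same result."""
    reduced = [reduce(op, elems) for elems in zip(*per_device_tensors)]
    return [list(reduced) for _ in per_device_tensors]  # one copy per device

# 3 "devices", each holding a rank-1 tensor of 4 elements
shards = [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400]]
outputs = all_reduce(shards)
assert all(out == [111, 222, 333, 444] for out in outputs)
```

Swapping `op` for `min` or `max` gives the other reductions mentioned above.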

Q: What is the memory footprint on one device? E.g., if we have 32 devices and call all-reduce on a tensor with volume 16k, how much DRAM will be allocated during this op?

A:

Today

Memory consumption is inefficient today because we are providing the least-optimized functional implementation of all-reduce (composite all-gather + eltwise operation).

If # devices = n and the input tensor size per device = m, then the total memory working-set size per chip is (n + 1) * m: n * m to store the intermediate all-gather result, plus another m for the final output. After the operation, the n * m can be returned to the allocator.
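The working-set formula above can be sketched numerically (illustrative helper, not part of the PR):

```python
def composite_all_reduce_working_set(n_devices, shard_bytes):
    """Peak per-chip working set of the composite (all-gather + eltwise) path."""
    gathered = n_devices * shard_bytes   # intermediate all-gather result: n * m
    output = shard_bytes                 # final reduced tensor: m
    return gathered + output             # total: (n + 1) * m

# 32 devices, 16 KiB shard per device -> 33 * 16 KiB = 528 KiB per chip
assert composite_all_reduce_working_set(32, 16 * 1024) == 33 * 16 * 1024
```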

Near future

We will enable the reduce_scatter + all-gather (composite) optimization, which will both reduce memory pressure and improve performance. This has 2 constraints:

1. All-gather and reduce-scatter must support non-uniform tensor sizes across chips.
2. The input tensor shape must have a dim of size > n. (Technically it only needs a volume > n, but that would require a multi-dimension scatter/concat dim for reduce-scatter/all-gather, which is out of the picture for the foreseeable future.)

With this version, the total memory pressure per chip would be m + m * (1/n), where m * (1/n) is the size of the intermediate tensor between reduce_scatter and all-gather.
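A sketch of why the composite path only needs an m/n intermediate (illustrative Python, with sizes chosen so the scatter dim divides evenly by n): reduce-scatter leaves each device with the fully reduced values for its 1/n slice, and all-gather then rebuilds the full reduced tensor on every device.

```python
def reduce_scatter(shards, n):
    """Each device ends up with the summed values for its 1/n slice."""
    size = len(shards[0]) // n
    return [
        [sum(s[i] for s in shards) for i in range(d * size, (d + 1) * size)]
        for d in range(n)
    ]

def all_gather(pieces):
    """Every device receives the concatenation of all pieces."""
    full = [x for piece in pieces for x in piece]
    return [list(full) for _ in pieces]

shards = [[1, 2, 3, 4], [10, 20, 30, 40]]       # n = 2 devices, m = 4 elements
pieces = reduce_scatter(shards, n=2)
assert [len(p) for p in pieces] == [2, 2]       # intermediate is m/n per chip
outputs = all_gather(pieces)
assert all(out == [11, 22, 33, 44] for out in outputs)
```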

Long Term

We will support a proper non-composite implementation in which the memory needed per chip will simply be m.

Checklist

@Aswinmcw force-pushed the Aswinmcw/all_reduce_op branch 2 times, most recently from a6c9d8c to 83e452a on October 16, 2024 08:01
@github-actions bot left a comment

⚠️ Clang-Tidy found issue(s) with the introduced code (1/6)

namespace ttnn {

void AllReduce::validate(const std::vector<Tensor>& input_tensors) const {
for (auto const& t : input_tensors) {

⚠️ readability-identifier-length ⚠️
variable name t is too short, expected at least 3 characters


void AllReduce::validate(const std::vector<Tensor>& input_tensors) const {
for (auto const& t : input_tensors) {
TT_FATAL(

⚠️ cppcoreguidelines-avoid-do-while ⚠️
avoid do-while loops

TT_FATAL(
t.get_legacy_shape()[this->scatter_dim] / this->ring_size > 0,
"All reduce input tensor shape on dim {} must be divisible by ring size", this->scatter_dim);
TT_FATAL(

⚠️ cppcoreguidelines-avoid-do-while ⚠️
avoid do-while loops

t.get_legacy_shape()[this->scatter_dim] % this->ring_size == 0,
"All reduce input tensor shape on dim {} must be divisible by ring size", this->scatter_dim);

TT_FATAL(!t.is_sharded(), "Sharded tensors are not supported for all reduce currently");

⚠️ cppcoreguidelines-avoid-do-while ⚠️
avoid do-while loops

t.get_legacy_shape()[this->scatter_dim] % this->ring_size == 0,
"All reduce input tensor shape on dim {} must be divisible by ring size", this->scatter_dim);

TT_FATAL(!t.is_sharded(), "Sharded tensors are not supported for all reduce currently");

⚠️ readability-simplify-boolean-expr ⚠️
boolean expression can be simplified by DeMorgan's theorem

}
}

std::vector<ttnn::SimpleShape> AllReduce::compute_output_shapes(const std::vector<Tensor>& input_tensors) const {

⚠️ modernize-use-trailing-return-type ⚠️
use a trailing return type for this function

Suggested change
std::vector<ttnn::SimpleShape> AllReduce::compute_output_shapes(const std::vector<Tensor>& input_tensors) const {
auto AllReduce::compute_output_shapes(const std::vector<Tensor>& input_tensors) const -> std::vector<ttnn::SimpleShape> {


std::vector<ttnn::SimpleShape> AllReduce::compute_output_shapes(const std::vector<Tensor>& input_tensors) const {
auto shape = input_tensors[0].get_logical_shape();
TT_FATAL(

⚠️ cppcoreguidelines-avoid-do-while ⚠️
avoid do-while loops

TT_FATAL(
shape[this->scatter_dim] % this->ring_size == 0,
"The size of the scatter dimension must be a multiple of the ring size");
shape[this->scatter_dim] /= this->ring_size;

⚠️ bugprone-narrowing-conversions ⚠️
narrowing conversion from uint32_t (aka unsigned int) to signed type int32_t (aka int) is implementation-defined

shape[this->scatter_dim] % this->ring_size == 0,
"The size of the scatter dimension must be a multiple of the ring size");
shape[this->scatter_dim] /= this->ring_size;
return std::vector<ttnn::SimpleShape>(input_tensors.size(), shape);

⚠️ modernize-return-braced-init-list ⚠️
avoid repeating the return type from the declaration; use a braced initializer list instead

return std::vector<ttnn::SimpleShape>(input_tensors.size(), shape);
}

std::vector<Tensor> AllReduce::create_output_tensors(const std::vector<Tensor>& input_tensors) const {

⚠️ modernize-use-trailing-return-type ⚠️
use a trailing return type for this function

Suggested change
std::vector<Tensor> AllReduce::create_output_tensors(const std::vector<Tensor>& input_tensors) const {
auto AllReduce::create_output_tensors(const std::vector<Tensor>& input_tensors) const -> std::vector<Tensor> {

@github-actions bot left a comment

⚠️ Clang-Tidy found issue(s) with the introduced code (2/6)

Comment on lines +42 to +33
operation::ProgramWithCallbacks AllReduce::create_program(
const std::vector<Tensor>& input_tensors, std::vector<Tensor>& output_tensors) const {

⚠️ modernize-use-trailing-return-type ⚠️
use a trailing return type for this function

Suggested change
operation::ProgramWithCallbacks AllReduce::create_program(
const std::vector<Tensor>& input_tensors, std::vector<Tensor>& output_tensors) const {
auto AllReduce::create_program(
const std::vector<Tensor>& input_tensors, std::vector<Tensor>& output_tensors) const -> operation::ProgramWithCallbacks {

this->user_defined_num_buffers_per_channel);
}

static ttnn::operations::binary::BinaryOpType convert_reduce_type_to_eltwise_type(ttnn::operations::reduction::ReduceType reduce_op) {

⚠️ modernize-use-trailing-return-type ⚠️
use a trailing return type for this function

Suggested change
static ttnn::operations::binary::BinaryOpType convert_reduce_type_to_eltwise_type(ttnn::operations::reduction::ReduceType reduce_op) {
static auto convert_reduce_type_to_eltwise_type(ttnn::operations::reduction::ReduceType reduce_op) -> ttnn::operations::binary::BinaryOpType {

Comment on lines +70 to +62
namespace operations{
namespace experimental{
namespace ccl{

⚠️ modernize-concat-nested-namespaces ⚠️
nested namespaces can be concatenated

Suggested change
namespace operations{
namespace experimental{
namespace ccl{
namespace operations::experimental::ccl{

Comment on lines +140 to +136
} // namespace ccl
} // namespace experimental
} // namespace operations

⚠️ modernize-concat-nested-namespaces ⚠️
nested namespaces can be concatenated

Suggested change
} // namespace ccl
} // namespace experimental
} // namespace operations
} // namespace operations::experimental::ccl

namespace operations{
namespace experimental{
namespace ccl{
Tensor all_reduce(

⚠️ modernize-use-trailing-return-type ⚠️
use a trailing return type for this function

Suggested change
Tensor all_reduce(
auto all_reduce(

const MemoryConfig& output_mem_config,
ttnn::ccl::Topology topology,
const std::optional<size_t> user_defined_num_workers,
const std::optional<size_t> user_defined_num_buffers_per_channel) {

⚠️ modernize-use-trailing-return-type ⚠️
use a trailing return type for this function

Suggested change
const std::optional<size_t> user_defined_num_buffers_per_channel) {
const std::optional<size_t> user_defined_num_buffers_per_channel) -> Tensor {

namespace operations{
namespace experimental{
namespace ccl{
Tensor all_reduce(

⚠️ readability-function-cognitive-complexity ⚠️
function all_reduce has cognitive complexity of 32 (threshold 25)

const std::optional<size_t> user_defined_num_workers,
const std::optional<size_t> user_defined_num_buffers_per_channel) {
ttnn::operations::binary::BinaryOpType binary_op_type = convert_reduce_type_to_eltwise_type(math_op);
TT_FATAL(std::getenv("TT_METAL_SLOW_DISPATCH_MODE") == nullptr, "This op is only supported for Fast Dispatch");

⚠️ cppcoreguidelines-avoid-do-while ⚠️
avoid do-while loops

const std::optional<size_t> user_defined_num_buffers_per_channel) {
ttnn::operations::binary::BinaryOpType binary_op_type = convert_reduce_type_to_eltwise_type(math_op);
TT_FATAL(std::getenv("TT_METAL_SLOW_DISPATCH_MODE") == nullptr, "This op is only supported for Fast Dispatch");
TT_FATAL(topology == ttnn::ccl::Topology::Ring, "All Reduce op is currently supported only on Ring topology");

⚠️ cppcoreguidelines-avoid-do-while ⚠️
avoid do-while loops

const std::vector<Tensor>& input_tensors,
const std::vector<std::optional<const Tensor>>& optional_input_tensors,
const std::vector<std::optional<Tensor>>& optional_output_tensors) mutable -> std::vector<Tensor> {
TT_FATAL(input_tensors.size() >= 1, "All reduce op expects an input tensor but it received none");

⚠️ cppcoreguidelines-avoid-do-while ⚠️
avoid do-while loops

@github-actions bot left a comment

⚠️ Clang-Tidy found issue(s) with the introduced code (3/6)

const std::vector<Tensor>& input_tensors,
const std::vector<std::optional<const Tensor>>& optional_input_tensors,
const std::vector<std::optional<Tensor>>& optional_output_tensors) mutable -> std::vector<Tensor> {
TT_FATAL(input_tensors.size() >= 1, "All reduce op expects an input tensor but it received none");

⚠️ readability-container-size-empty ⚠️
the empty method should be used to check for emptiness instead of size

Suggested change
TT_FATAL(input_tensors.size() >= 1, "All reduce op expects an input tensor but it received none");
TT_FATAL(!input_tensors.empty(), "All reduce op expects an input tensor but it received none");

break;
}
}
TT_FATAL(receiver_device_id != std::nullopt || sender_device_id != std::nullopt, "Error in all reduce op setup");

⚠️ cppcoreguidelines-avoid-do-while ⚠️
avoid do-while loops

namespace operations{
namespace experimental{
namespace ccl{
Tensor all_reduce(

⚠️ readability-inconsistent-declaration-parameter-name ⚠️
function ttnn::operations::experimental::ccl::all_reduce has a definition with different parameter names

namespace ttnn {

struct AllReduce {
const ttnn::operations::binary::BinaryOpType binary_op_type;

⚠️ cppcoreguidelines-avoid-const-or-ref-data-members ⚠️
member binary_op_type of type const ttnn::operations::binary::BinaryOpType is const qualified


struct AllReduce {
const ttnn::operations::binary::BinaryOpType binary_op_type;
const uint32_t scatter_dim;

⚠️ cppcoreguidelines-avoid-const-or-ref-data-members ⚠️
member scatter_dim of type const uint32_t (aka const unsigned int) is const qualified

struct AllReduce {
const ttnn::operations::binary::BinaryOpType binary_op_type;
const uint32_t scatter_dim;
const uint32_t num_links;

⚠️ cppcoreguidelines-avoid-const-or-ref-data-members ⚠️
member num_links of type const uint32_t (aka const unsigned int) is const qualified

const ttnn::operations::binary::BinaryOpType binary_op_type;
const uint32_t scatter_dim;
const uint32_t num_links;
const uint32_t ring_size;

⚠️ cppcoreguidelines-avoid-const-or-ref-data-members ⚠️
member ring_size of type const uint32_t (aka const unsigned int) is const qualified

const uint32_t scatter_dim;
const uint32_t num_links;
const uint32_t ring_size;
const uint32_t ring_index;

⚠️ cppcoreguidelines-avoid-const-or-ref-data-members ⚠️
member ring_index of type const uint32_t (aka const unsigned int) is const qualified

const uint32_t num_links;
const uint32_t ring_size;
const uint32_t ring_index;
const std::optional<chip_id_t> receiver_device_id;

⚠️ cppcoreguidelines-avoid-const-or-ref-data-members ⚠️
member receiver_device_id of type const std::optional<chip_id_t> (aka const optional<int>) is const qualified

const uint32_t ring_size;
const uint32_t ring_index;
const std::optional<chip_id_t> receiver_device_id;
const std::optional<chip_id_t> sender_device_id;

⚠️ cppcoreguidelines-avoid-const-or-ref-data-members ⚠️
member sender_device_id of type const std::optional<chip_id_t> (aka const optional<int>) is const qualified

@github-actions bot left a comment

⚠️ Clang-Tidy found issue(s) with the introduced code (4/6)

const uint32_t ring_index;
const std::optional<chip_id_t> receiver_device_id;
const std::optional<chip_id_t> sender_device_id;
const MemoryConfig output_mem_config;

⚠️ cppcoreguidelines-avoid-const-or-ref-data-members ⚠️
member output_mem_config of type const MemoryConfig is const qualified

const std::optional<chip_id_t> receiver_device_id;
const std::optional<chip_id_t> sender_device_id;
const MemoryConfig output_mem_config;
const ttnn::ccl::Topology topology;

⚠️ cppcoreguidelines-avoid-const-or-ref-data-members ⚠️
member topology of type const ttnn::ccl::Topology is const qualified

const std::optional<chip_id_t> sender_device_id;
const MemoryConfig output_mem_config;
const ttnn::ccl::Topology topology;
const std::optional<size_t> user_defined_num_workers;

⚠️ cppcoreguidelines-avoid-const-or-ref-data-members ⚠️
member user_defined_num_workers of type const std::optional<size_t> (aka const optional<unsigned long>) is const qualified

const MemoryConfig output_mem_config;
const ttnn::ccl::Topology topology;
const std::optional<size_t> user_defined_num_workers;
const std::optional<size_t> user_defined_num_buffers_per_channel;

⚠️ cppcoreguidelines-avoid-const-or-ref-data-members ⚠️
member user_defined_num_buffers_per_channel of type const std::optional<size_t> (aka const optional<unsigned long>) is const qualified

const std::optional<size_t> user_defined_num_buffers_per_channel;

void validate(const std::vector<Tensor> &input_tensors) const;
std::vector<ttnn::SimpleShape> compute_output_shapes(const std::vector<Tensor> &input_tensors) const;

⚠️ modernize-use-trailing-return-type ⚠️
use a trailing return type for this function

Suggested change
std::vector<ttnn::SimpleShape> compute_output_shapes(const std::vector<Tensor> &input_tensors) const;
auto compute_output_shapes(const std::vector<Tensor> &input_tensors) const -> std::vector<ttnn::SimpleShape>;


void validate(const std::vector<Tensor> &input_tensors) const;
std::vector<ttnn::SimpleShape> compute_output_shapes(const std::vector<Tensor> &input_tensors) const;
std::vector<Tensor> create_output_tensors(const std::vector<Tensor> &input_tensors) const;

⚠️ modernize-use-trailing-return-type ⚠️
use a trailing return type for this function

Suggested change
std::vector<Tensor> create_output_tensors(const std::vector<Tensor> &input_tensors) const;
auto create_output_tensors(const std::vector<Tensor> &input_tensors) const -> std::vector<Tensor>;

operation::ProgramWithCallbacks create_program(
const std::vector<Tensor> &input_tensors, std::vector<Tensor> &output_tensors) const;

⚠️ modernize-use-trailing-return-type ⚠️
use a trailing return type for this function

Suggested change
operation::ProgramWithCallbacks create_program(
const std::vector<Tensor> &input_tensors, std::vector<Tensor> &output_tensors) const;
auto create_program(
const std::vector<Tensor> &input_tensors, std::vector<Tensor> &output_tensors) const -> operation::ProgramWithCallbacks;

Comment on lines +35 to +36
namespace operations{
namespace experimental{
namespace ccl{

⚠️ modernize-concat-nested-namespaces ⚠️
nested namespaces can be concatenated

Suggested change
namespace operations{
namespace experimental{
namespace ccl{
namespace operations::experimental::ccl{

} // namespace ccl
} // namespace experimental
} // namespace operations

⚠️ modernize-concat-nested-namespaces ⚠️
nested namespaces can be concatenated

Suggested change
} // namespace ccl
} // namespace experimental
} // namespace operations
} // namespace operations::experimental::ccl

namespace operations{
namespace experimental{
namespace ccl{
Tensor all_reduce(

⚠️ modernize-use-trailing-return-type ⚠️
use a trailing return type for this function

Suggested change
Tensor all_reduce(
auto all_reduce(

@github-actions bot left a comment

⚠️ Clang-Tidy found issue(s) with the introduced code (5/6)

const MemoryConfig& output_mem_config = operation::DEFAULT_OUTPUT_MEMORY_CONFIG,
ttnn::ccl::Topology topology = ttnn::ccl::Topology::Ring,
const std::optional<size_t> user_defined_num_workers = std::nullopt,
const std::optional<size_t> user_defined_num_buffers_per_channel = std::nullopt);

⚠️ modernize-use-trailing-return-type ⚠️
use a trailing return type for this function

Suggested change
const std::optional<size_t> user_defined_num_buffers_per_channel = std::nullopt);
const std::optional<size_t> user_defined_num_buffers_per_channel = std::nullopt) -> Tensor;

namespace ccl{
Tensor all_reduce(
const Tensor &input_tensor,
const uint32_t scatter_split_dim,

⚠️ readability-avoid-const-params-in-decls ⚠️
parameter scatter_split_dim is const-qualified in the function declaration; const-qualification of parameters only has an effect in function definitions

Suggested change
const uint32_t scatter_split_dim,
uint32_t scatter_split_dim,

const Tensor &input_tensor,
const uint32_t scatter_split_dim,
ttnn::operations::reduction::ReduceType reduce_op = ttnn::operations::reduction::ReduceType::Sum,
const uint32_t num_links = 1,

⚠️ readability-avoid-const-params-in-decls ⚠️
parameter num_links is const-qualified in the function declaration; const-qualification of parameters only has an effect in function definitions

Suggested change
const uint32_t num_links = 1,
uint32_t num_links = 1,

const uint32_t num_links = 1,
const MemoryConfig& output_mem_config = operation::DEFAULT_OUTPUT_MEMORY_CONFIG,
ttnn::ccl::Topology topology = ttnn::ccl::Topology::Ring,
const std::optional<size_t> user_defined_num_workers = std::nullopt,

⚠️ readability-avoid-const-params-in-decls ⚠️
parameter user_defined_num_workers is const-qualified in the function declaration; const-qualification of parameters only has an effect in function definitions

Suggested change
const std::optional<size_t> user_defined_num_workers = std::nullopt,
std::optional<size_t> user_defined_num_workers = std::nullopt,

const MemoryConfig& output_mem_config = operation::DEFAULT_OUTPUT_MEMORY_CONFIG,
ttnn::ccl::Topology topology = ttnn::ccl::Topology::Ring,
const std::optional<size_t> user_defined_num_workers = std::nullopt,
const std::optional<size_t> user_defined_num_buffers_per_channel = std::nullopt);

⚠️ readability-avoid-const-params-in-decls ⚠️
parameter user_defined_num_buffers_per_channel is const-qualified in the function declaration; const-qualification of parameters only has an effect in function definitions

Suggested change
const std::optional<size_t> user_defined_num_buffers_per_channel = std::nullopt);
std::optional<size_t> user_defined_num_buffers_per_channel = std::nullopt);

Comment on lines +14 to +16
namespace operations {
namespace experimental {
namespace ccl {

⚠️ modernize-concat-nested-namespaces ⚠️
nested namespaces can be concatenated

Suggested change
namespace operations {
namespace experimental {
namespace ccl {
namespace operations::experimental::ccl {

Comment on lines +30 to +31
} // namespace ccl
} // namespace experimental
} // namespace operations

⚠️ modernize-concat-nested-namespaces ⚠️
nested namespaces can be concatenated

Suggested change
} // namespace ccl
} // namespace experimental
} // namespace operations
} // namespace operations::experimental::ccl

namespace ccl {

struct ExecuteAllReduce {
static ttnn::Tensor invoke(

⚠️ modernize-use-trailing-return-type ⚠️
use a trailing return type for this function

Suggested change
static ttnn::Tensor invoke(
static auto invoke(

const std::optional<ttnn::MemoryConfig>& memory_config = std::nullopt,
ttnn::ccl::Topology topology = ttnn::ccl::Topology::Ring,
const std::optional<size_t> num_workers = std::nullopt,
const std::optional<size_t> num_buffers_per_channel = std::nullopt);

⚠️ modernize-use-trailing-return-type ⚠️
use a trailing return type for this function

Suggested change
const std::optional<size_t> num_buffers_per_channel = std::nullopt);
const std::optional<size_t> num_buffers_per_channel = std::nullopt) -> ttnn::Tensor;

struct ExecuteAllReduce {
static ttnn::Tensor invoke(
const ttnn::Tensor& input_tensor,
const uint32_t scatter_dim,

⚠️ readability-avoid-const-params-in-decls ⚠️
parameter scatter_dim is const-qualified in the function declaration; const-qualification of parameters only has an effect in function definitions

Suggested change
const uint32_t scatter_dim,
uint32_t scatter_dim,

@github-actions bot left a comment

⚠️ Clang-Tidy found issue(s) with the introduced code (6/6)

const ttnn::Tensor& input_tensor,
const uint32_t scatter_dim,
ttnn::operations::reduction::ReduceType math_op,
const uint32_t num_links = 1,

⚠️ readability-avoid-const-params-in-decls ⚠️
parameter num_links is const-qualified in the function declaration; const-qualification of parameters only has an effect in function definitions

Suggested change
const uint32_t num_links = 1,
uint32_t num_links = 1,

const uint32_t num_links = 1,
const std::optional<ttnn::MemoryConfig>& memory_config = std::nullopt,
ttnn::ccl::Topology topology = ttnn::ccl::Topology::Ring,
const std::optional<size_t> num_workers = std::nullopt,

⚠️ readability-avoid-const-params-in-decls ⚠️
parameter num_workers is const-qualified in the function declaration; const-qualification of parameters only has an effect in function definitions

Suggested change
const std::optional<size_t> num_workers = std::nullopt,
std::optional<size_t> num_workers = std::nullopt,

const std::optional<ttnn::MemoryConfig>& memory_config = std::nullopt,
ttnn::ccl::Topology topology = ttnn::ccl::Topology::Ring,
const std::optional<size_t> num_workers = std::nullopt,
const std::optional<size_t> num_buffers_per_channel = std::nullopt);

⚠️ readability-avoid-const-params-in-decls ⚠️
parameter num_buffers_per_channel is const-qualified in the function declaration; const-qualification of parameters only has an effect in function definitions

Suggested change
const std::optional<size_t> num_buffers_per_channel = std::nullopt);
std::optional<size_t> num_buffers_per_channel = std::nullopt);


namespace ttnn::operations::experimental::ccl {

ttnn::Tensor ExecuteAllReduce::invoke(

⚠️ modernize-use-trailing-return-type ⚠️
use a trailing return type for this function

Suggested change
ttnn::Tensor ExecuteAllReduce::invoke(
auto ExecuteAllReduce::invoke(

const std::optional<ttnn::MemoryConfig>& memory_config,
ttnn::ccl::Topology topology,
const std::optional<size_t> num_workers,
const std::optional<size_t> num_buffers_per_channel) {

⚠️ modernize-use-trailing-return-type ⚠️
use a trailing return type for this function

Suggested change
const std::optional<size_t> num_buffers_per_channel) {
const std::optional<size_t> num_buffers_per_channel) -> ttnn::Tensor {

@github-actions bot left a comment

⚠️ Clang-Tidy found issue(s) with the introduced code (1/1)

Comment on lines +141 to +137
} // namespace experimental
} // namespace operations

@Aswinmcw Aswinmcw merged commit 115b033 into main Oct 21, 2024
145 of 146 checks passed
@Aswinmcw Aswinmcw deleted the Aswinmcw/all_reduce_op branch October 21, 2024 06:19
ct-clmsn pushed a commit to ct-clmsn/tt-metal that referenced this pull request Nov 12, 2024
* tenstorrent#5560: Add all_reduce op

* tenstorrent#5560: Add shapes to the test

* tenstorrent#5560: Combine all_gather with launch_op

* #0: Update copyright

* tenstorrent#5560: all_gather + local reduce

* tenstorrent#5560: Allocate tensor properly

* tenstorrent#5560: Reduce gathered dim

* tenstorrent#5560: Add to frequent pipeline

* tenstorrent#5560: Add cases with NC dim