#5560: Add all_reduce op (experimental) (#13853)
Conversation
Force-pushed from a6c9d8c to 83e452a.
Clang-Tidy found issue(s) with the introduced code (1/6)

namespace ttnn {

void AllReduce::validate(const std::vector<Tensor>& input_tensors) const {
    for (auto const& t : input_tensors) {

variable name t is too short, expected at least 3 characters

void AllReduce::validate(const std::vector<Tensor>& input_tensors) const {
    for (auto const& t : input_tensors) {
        TT_FATAL(

avoid do-while loops

        TT_FATAL(
            t.get_legacy_shape()[this->scatter_dim] / this->ring_size > 0,
            "All reduce input tensor shape on dim {} must be divisible by ring size", this->scatter_dim);
        TT_FATAL(

avoid do-while loops

            t.get_legacy_shape()[this->scatter_dim] % this->ring_size == 0,
            "All reduce input tensor shape on dim {} must be divisible by ring size", this->scatter_dim);

        TT_FATAL(!t.is_sharded(), "Sharded tensors are not supported for all reduce currently");

avoid do-while loops

            t.get_legacy_shape()[this->scatter_dim] % this->ring_size == 0,
            "All reduce input tensor shape on dim {} must be divisible by ring size", this->scatter_dim);

        TT_FATAL(!t.is_sharded(), "Sharded tensors are not supported for all reduce currently");
boolean expression can be simplified by DeMorgan's theorem
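For context on this check: readability-simplify-boolean-expr suggests applying De Morgan's laws (or dropping a double negation) when a negated compound condition can be written more directly; here it most likely fires inside the TT_FATAL macro expansion. A minimal standalone illustration of the transformation, using made-up predicates rather than the PR's code:

#include <iostream>

// Flagged form: negation over a compound condition.
bool skip_verbose(bool sharded, bool tiled) { return !(sharded && !tiled); }

// De Morgan-simplified equivalent suggested by the check.
bool skip_simplified(bool sharded, bool tiled) { return !sharded || tiled; }

int main() {
    for (bool s : {false, true})
        for (bool t : {false, true})
            std::cout << (skip_verbose(s, t) == skip_simplified(s, t)) << '\n';  // prints 1 four times
}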
    }
}

std::vector<ttnn::SimpleShape> AllReduce::compute_output_shapes(const std::vector<Tensor>& input_tensors) const {

use a trailing return type for this function

Suggested change:
- std::vector<ttnn::SimpleShape> AllReduce::compute_output_shapes(const std::vector<Tensor>& input_tensors) const {
+ auto AllReduce::compute_output_shapes(const std::vector<Tensor>& input_tensors) const -> std::vector<ttnn::SimpleShape> {

std::vector<ttnn::SimpleShape> AllReduce::compute_output_shapes(const std::vector<Tensor>& input_tensors) const {
    auto shape = input_tensors[0].get_logical_shape();
    TT_FATAL(

avoid do-while loops

    TT_FATAL(
        shape[this->scatter_dim] % this->ring_size == 0,
        "The size of the scatter dimension must be a multiple of the ring size");
    shape[this->scatter_dim] /= this->ring_size;

narrowing conversion from uint32_t (aka unsigned int) to signed type int32_t (aka int) is implementation-defined
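For context on the narrowing warning: the divided extent is unsigned, but it evidently lands in a signed int32_t shape element, and converting an out-of-range unsigned value to a signed type was implementation-defined before C++20. A generic illustration of the flagged pattern and one common way to make the conversion explicit (standalone sketch, not the SimpleShape code):

#include <cstdint>
#include <iostream>

int main() {
    uint32_t dim_size = 4096;   // unsigned, like a shape extent
    uint32_t ring_size = 8;

    int32_t implicit_value = dim_size / ring_size;                     // flagged: implicit unsigned -> signed narrowing
    auto explicit_value = static_cast<int32_t>(dim_size / ring_size);  // explicit cast documents the intent

    std::cout << implicit_value << ' ' << explicit_value << '\n';      // 512 512
}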
        shape[this->scatter_dim] % this->ring_size == 0,
        "The size of the scatter dimension must be a multiple of the ring size");
    shape[this->scatter_dim] /= this->ring_size;
    return std::vector<ttnn::SimpleShape>(input_tensors.size(), shape);

avoid repeating the return type from the declaration; use a braced initializer list instead

    return std::vector<ttnn::SimpleShape>(input_tensors.size(), shape);
}

std::vector<Tensor> AllReduce::create_output_tensors(const std::vector<Tensor>& input_tensors) const {

use a trailing return type for this function

Suggested change:
- std::vector<Tensor> AllReduce::create_output_tensors(const std::vector<Tensor>& input_tensors) const {
+ auto AllReduce::create_output_tensors(const std::vector<Tensor>& input_tensors) const -> std::vector<Tensor> {
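For reference, the modernize-use-trailing-return-type check behind these suggestions only moves where the return type is spelled; the two forms below are equivalent (generic example, not PR code):

#include <cstddef>
#include <vector>

// Classic leading return type, flagged by the check.
std::vector<int> make_zeros(std::size_t count) { return std::vector<int>(count, 0); }

// Trailing return type, the spelling the check suggests.
auto make_zeros_trailing(std::size_t count) -> std::vector<int> { return std::vector<int>(count, 0); }

int main() { return make_zeros(3) == make_zeros_trailing(3) ? 0 : 1; }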
Clang-Tidy found issue(s) with the introduced code (2/6)

operation::ProgramWithCallbacks AllReduce::create_program(
    const std::vector<Tensor>& input_tensors, std::vector<Tensor>& output_tensors) const {

use a trailing return type for this function

Suggested change:
- operation::ProgramWithCallbacks AllReduce::create_program(
-     const std::vector<Tensor>& input_tensors, std::vector<Tensor>& output_tensors) const {
+ auto AllReduce::create_program(
+     const std::vector<Tensor>& input_tensors, std::vector<Tensor>& output_tensors) const -> operation::ProgramWithCallbacks {

        this->user_defined_num_buffers_per_channel);
}

static ttnn::operations::binary::BinaryOpType convert_reduce_type_to_eltwise_type(ttnn::operations::reduction::ReduceType reduce_op) {

use a trailing return type for this function

Suggested change:
- static ttnn::operations::binary::BinaryOpType convert_reduce_type_to_eltwise_type(ttnn::operations::reduction::ReduceType reduce_op) {
+ static auto convert_reduce_type_to_eltwise_type(ttnn::operations::reduction::ReduceType reduce_op) -> ttnn::operations::binary::BinaryOpType {

namespace operations{
namespace experimental{
namespace ccl{

nested namespaces can be concatenated

Suggested change:
- namespace operations{
- namespace experimental{
- namespace ccl{
+ namespace operations::experimental::ccl{

} // namespace ccl
} // namespace experimental
} // namespace operations

nested namespaces can be concatenated

Suggested change:
- } // namespace ccl
- } // namespace experimental
- } // namespace operations
+ } // namespace operations::experimental::ccl
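The concatenation the check asks for is just the C++17 nested-namespace syntax; both spellings below declare the same namespace (illustrative names):

// Pre-C++17 spelling, flagged by modernize-concat-nested-namespaces:
namespace outer { namespace middle { namespace inner {
inline int answer() { return 42; }
}}}  // namespace outer::middle::inner

// C++17 concatenated spelling suggested by the check:
namespace outer::middle::inner {
inline int answer_too() { return 42; }
}  // namespace outer::middle::inner

int main() { return outer::middle::inner::answer() == outer::middle::inner::answer_too() ? 0 : 1; }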
namespace operations{
namespace experimental{
namespace ccl{
Tensor all_reduce(

use a trailing return type for this function

Suggested change:
- Tensor all_reduce(
+ auto all_reduce(

    const MemoryConfig& output_mem_config,
    ttnn::ccl::Topology topology,
    const std::optional<size_t> user_defined_num_workers,
    const std::optional<size_t> user_defined_num_buffers_per_channel) {

use a trailing return type for this function

Suggested change:
-     const std::optional<size_t> user_defined_num_buffers_per_channel) {
+     const std::optional<size_t> user_defined_num_buffers_per_channel) -> Tensor {

namespace operations{
namespace experimental{
namespace ccl{
Tensor all_reduce(

function all_reduce has cognitive complexity of 32 (threshold 25)

    const std::optional<size_t> user_defined_num_workers,
    const std::optional<size_t> user_defined_num_buffers_per_channel) {
    ttnn::operations::binary::BinaryOpType binary_op_type = convert_reduce_type_to_eltwise_type(math_op);
    TT_FATAL(std::getenv("TT_METAL_SLOW_DISPATCH_MODE") == nullptr, "This op is only supported for Fast Dispatch");

avoid do-while loops

    const std::optional<size_t> user_defined_num_buffers_per_channel) {
    ttnn::operations::binary::BinaryOpType binary_op_type = convert_reduce_type_to_eltwise_type(math_op);
    TT_FATAL(std::getenv("TT_METAL_SLOW_DISPATCH_MODE") == nullptr, "This op is only supported for Fast Dispatch");
    TT_FATAL(topology == ttnn::ccl::Topology::Ring, "All Reduce op is currently supported only on Ring topology");

avoid do-while loops

        const std::vector<Tensor>& input_tensors,
        const std::vector<std::optional<const Tensor>>& optional_input_tensors,
        const std::vector<std::optional<Tensor>>& optional_output_tensors) mutable -> std::vector<Tensor> {
        TT_FATAL(input_tensors.size() >= 1, "All reduce op expects an input tensor but it received none");

avoid do-while loops
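A note on the recurring "avoid do-while loops" hits: they are almost certainly triggered by the expansion of the TT_FATAL assertion macro rather than by hand-written loops, since assertion-style macros are conventionally wrapped in do { ... } while (0) so that a call followed by a semicolon parses as a single statement. A generic sketch of that idiom (hypothetical macro; the real TT_FATAL definition may differ):

#include <cstdio>
#include <cstdlib>

// Hypothetical assert-like macro. The do/while(0) wrapper is what
// cppcoreguidelines-avoid-do-while ends up seeing at every call site.
#define MY_FATAL(cond, msg)                        \
    do {                                           \
        if (!(cond)) {                             \
            std::fprintf(stderr, "%s\n", (msg));   \
            std::abort();                          \
        }                                          \
    } while (0)

int main() {
    unsigned ring_size = 8;
    MY_FATAL(ring_size > 0, "ring size must be positive");  // expands to a do-while
    return 0;
}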
Clang-Tidy found issue(s) with the introduced code (3/6)

        const std::vector<Tensor>& input_tensors,
        const std::vector<std::optional<const Tensor>>& optional_input_tensors,
        const std::vector<std::optional<Tensor>>& optional_output_tensors) mutable -> std::vector<Tensor> {
        TT_FATAL(input_tensors.size() >= 1, "All reduce op expects an input tensor but it received none");

the empty method should be used to check for emptiness instead of size

Suggested change:
-         TT_FATAL(input_tensors.size() >= 1, "All reduce op expects an input tensor but it received none");
+         TT_FATAL(!input_tensors.empty(), "All reduce op expects an input tensor but it received none");
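For reference, the readability-container-size-empty rationale behind this suggestion: empty() states the intent directly and is O(1) for every standard container, while size() >= 1 re-derives the same fact. A small standalone illustration:

#include <cassert>
#include <vector>

int main() {
    std::vector<int> input_tensors{1, 2, 3};
    assert(input_tensors.size() >= 1);   // flagged spelling
    assert(!input_tensors.empty());      // preferred, equivalent check
}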
            break;
        }
    }
    TT_FATAL(receiver_device_id != std::nullopt || sender_device_id != std::nullopt, "Error in all reduce op setup");

avoid do-while loops

namespace operations{
namespace experimental{
namespace ccl{
Tensor all_reduce(

function ttnn::operations::experimental::ccl::all_reduce has a definition with different parameter names

namespace ttnn {

struct AllReduce {
    const ttnn::operations::binary::BinaryOpType binary_op_type;

member binary_op_type of type const ttnn::operations::binary::BinaryOpType is const qualified
struct AllReduce {
    const ttnn::operations::binary::BinaryOpType binary_op_type;
    const uint32_t scatter_dim;

member scatter_dim of type const uint32_t (aka const unsigned int) is const qualified

struct AllReduce {
    const ttnn::operations::binary::BinaryOpType binary_op_type;
    const uint32_t scatter_dim;
    const uint32_t num_links;

member num_links of type const uint32_t (aka const unsigned int) is const qualified

    const ttnn::operations::binary::BinaryOpType binary_op_type;
    const uint32_t scatter_dim;
    const uint32_t num_links;
    const uint32_t ring_size;

member ring_size of type const uint32_t (aka const unsigned int) is const qualified

    const uint32_t scatter_dim;
    const uint32_t num_links;
    const uint32_t ring_size;
    const uint32_t ring_index;

member ring_index of type const uint32_t (aka const unsigned int) is const qualified

    const uint32_t num_links;
    const uint32_t ring_size;
    const uint32_t ring_index;
    const std::optional<chip_id_t> receiver_device_id;

member receiver_device_id of type const std::optional<chip_id_t> (aka const optional<int>) is const qualified

    const uint32_t ring_size;
    const uint32_t ring_index;
    const std::optional<chip_id_t> receiver_device_id;
    const std::optional<chip_id_t> sender_device_id;

member sender_device_id of type const std::optional<chip_id_t> (aka const optional<int>) is const qualified
Clang-Tidy found issue(s) with the introduced code (4/6)

    const uint32_t ring_index;
    const std::optional<chip_id_t> receiver_device_id;
    const std::optional<chip_id_t> sender_device_id;
    const MemoryConfig output_mem_config;

member output_mem_config of type const MemoryConfig is const qualified

    const std::optional<chip_id_t> receiver_device_id;
    const std::optional<chip_id_t> sender_device_id;
    const MemoryConfig output_mem_config;
    const ttnn::ccl::Topology topology;

member topology of type const ttnn::ccl::Topology is const qualified

    const std::optional<chip_id_t> sender_device_id;
    const MemoryConfig output_mem_config;
    const ttnn::ccl::Topology topology;
    const std::optional<size_t> user_defined_num_workers;

member user_defined_num_workers of type const std::optional<size_t> (aka const optional<unsigned long>) is const qualified

    const MemoryConfig output_mem_config;
    const ttnn::ccl::Topology topology;
    const std::optional<size_t> user_defined_num_workers;
    const std::optional<size_t> user_defined_num_buffers_per_channel;

member user_defined_num_buffers_per_channel of type const std::optional<size_t> (aka const optional<unsigned long>) is const qualified
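The practical reason clang-tidy keeps flagging these const members: a const data member deletes copy and move assignment for the enclosing struct, which gets in the way of containers and of any code path that reassigns or moves the op descriptor. A minimal sketch of the pitfall (generic struct, not the real AllReduce):

#include <cstdint>
#include <vector>

struct WithConstMember {
    const uint32_t ring_size;  // const member: operator= is implicitly deleted
};

struct WithPlainMember {
    uint32_t ring_size;        // non-const member: the struct stays assignable and movable
};

int main() {
    WithPlainMember a{4}, b{8};
    a = b;                                  // fine
    // WithConstMember c{4}, d{8};
    // c = d;                               // would not compile: deleted copy assignment
    std::vector<WithPlainMember> descriptors;
    descriptors.push_back({4});             // assignable elements keep vector operations unsurprising
    return a.ring_size == 8 ? 0 : 1;
}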
    const std::optional<size_t> user_defined_num_buffers_per_channel;

    void validate(const std::vector<Tensor> &input_tensors) const;
    std::vector<ttnn::SimpleShape> compute_output_shapes(const std::vector<Tensor> &input_tensors) const;

use a trailing return type for this function

Suggested change:
-     std::vector<ttnn::SimpleShape> compute_output_shapes(const std::vector<Tensor> &input_tensors) const;
+     auto compute_output_shapes(const std::vector<Tensor> &input_tensors) const -> std::vector<ttnn::SimpleShape>;

    void validate(const std::vector<Tensor> &input_tensors) const;
    std::vector<ttnn::SimpleShape> compute_output_shapes(const std::vector<Tensor> &input_tensors) const;
    std::vector<Tensor> create_output_tensors(const std::vector<Tensor> &input_tensors) const;

use a trailing return type for this function

Suggested change:
-     std::vector<Tensor> create_output_tensors(const std::vector<Tensor> &input_tensors) const;
+     auto create_output_tensors(const std::vector<Tensor> &input_tensors) const -> std::vector<Tensor>;

    operation::ProgramWithCallbacks create_program(
        const std::vector<Tensor> &input_tensors, std::vector<Tensor> &output_tensors) const;

use a trailing return type for this function

Suggested change:
-     operation::ProgramWithCallbacks create_program(
-         const std::vector<Tensor> &input_tensors, std::vector<Tensor> &output_tensors) const;
+     auto create_program(
+         const std::vector<Tensor> &input_tensors, std::vector<Tensor> &output_tensors) const -> operation::ProgramWithCallbacks;

namespace operations{
namespace experimental{
namespace ccl{

nested namespaces can be concatenated

Suggested change:
- namespace operations{
- namespace experimental{
- namespace ccl{
+ namespace operations::experimental::ccl{

} // namespace ccl
} // namespace experimental
} // namespace operations

nested namespaces can be concatenated

Suggested change:
- } // namespace ccl
- } // namespace experimental
- } // namespace operations
+ } // namespace operations::experimental::ccl

namespace operations{
namespace experimental{
namespace ccl{
Tensor all_reduce(

use a trailing return type for this function

Suggested change:
- Tensor all_reduce(
+ auto all_reduce(
Clang-Tidy found issue(s) with the introduced code (5/6)

    const MemoryConfig& output_mem_config = operation::DEFAULT_OUTPUT_MEMORY_CONFIG,
    ttnn::ccl::Topology topology = ttnn::ccl::Topology::Ring,
    const std::optional<size_t> user_defined_num_workers = std::nullopt,
    const std::optional<size_t> user_defined_num_buffers_per_channel = std::nullopt);

use a trailing return type for this function

Suggested change:
-     const std::optional<size_t> user_defined_num_buffers_per_channel = std::nullopt);
+     const std::optional<size_t> user_defined_num_buffers_per_channel = std::nullopt) -> Tensor;

namespace ccl{
Tensor all_reduce(
    const Tensor &input_tensor,
    const uint32_t scatter_split_dim,

parameter scatter_split_dim is const-qualified in the function declaration; const-qualification of parameters only has an effect in function definitions

Suggested change:
-     const uint32_t scatter_split_dim,
+     uint32_t scatter_split_dim,

    const Tensor &input_tensor,
    const uint32_t scatter_split_dim,
    ttnn::operations::reduction::ReduceType reduce_op = ttnn::operations::reduction::ReduceType::Sum,
    const uint32_t num_links = 1,

parameter num_links is const-qualified in the function declaration; const-qualification of parameters only has an effect in function definitions

Suggested change:
-     const uint32_t num_links = 1,
+     uint32_t num_links = 1,

    const uint32_t num_links = 1,
    const MemoryConfig& output_mem_config = operation::DEFAULT_OUTPUT_MEMORY_CONFIG,
    ttnn::ccl::Topology topology = ttnn::ccl::Topology::Ring,
    const std::optional<size_t> user_defined_num_workers = std::nullopt,

parameter user_defined_num_workers is const-qualified in the function declaration; const-qualification of parameters only has an effect in function definitions

Suggested change:
-     const std::optional<size_t> user_defined_num_workers = std::nullopt,
+     std::optional<size_t> user_defined_num_workers = std::nullopt,

    const MemoryConfig& output_mem_config = operation::DEFAULT_OUTPUT_MEMORY_CONFIG,
    ttnn::ccl::Topology topology = ttnn::ccl::Topology::Ring,
    const std::optional<size_t> user_defined_num_workers = std::nullopt,
    const std::optional<size_t> user_defined_num_buffers_per_channel = std::nullopt);

parameter user_defined_num_buffers_per_channel is const-qualified in the function declaration; const-qualification of parameters only has an effect in function definitions

Suggested change:
-     const std::optional<size_t> user_defined_num_buffers_per_channel = std::nullopt);
+     std::optional<size_t> user_defined_num_buffers_per_channel = std::nullopt);
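The const-qualified-parameter warnings are about declarations only: top-level const on a by-value parameter is not part of the function's type, so it is meaningless in a header and only constrains the body inside the definition. A small illustration with a hypothetical function:

#include <cstdint>

// These two lines declare the exact same function; the const is ignored for
// type purposes, which is why clang-tidy asks to drop it from the declaration.
uint32_t doubled(uint32_t num_links);
// uint32_t doubled(const uint32_t num_links);  // redundant respelling of the line above

// In the definition the const does matter: it stops the body from modifying num_links.
uint32_t doubled(const uint32_t num_links) { return num_links * 2; }

int main() { return doubled(1) == 2 ? 0 : 1; }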
namespace operations {
namespace experimental {
namespace ccl {

nested namespaces can be concatenated

Suggested change:
- namespace operations {
- namespace experimental {
- namespace ccl {
+ namespace operations::experimental::ccl {

} // namespace ccl
} // namespace experimental
} // namespace operations

nested namespaces can be concatenated

Suggested change:
- } // namespace ccl
- } // namespace experimental
- } // namespace operations
+ } // namespace operations::experimental::ccl

namespace ccl {

struct ExecuteAllReduce {
    static ttnn::Tensor invoke(

use a trailing return type for this function

Suggested change:
-     static ttnn::Tensor invoke(
+     static auto invoke(

        const std::optional<ttnn::MemoryConfig>& memory_config = std::nullopt,
        ttnn::ccl::Topology topology = ttnn::ccl::Topology::Ring,
        const std::optional<size_t> num_workers = std::nullopt,
        const std::optional<size_t> num_buffers_per_channel = std::nullopt);

use a trailing return type for this function

Suggested change:
-         const std::optional<size_t> num_buffers_per_channel = std::nullopt);
+         const std::optional<size_t> num_buffers_per_channel = std::nullopt) -> ttnn::Tensor;

struct ExecuteAllReduce {
    static ttnn::Tensor invoke(
        const ttnn::Tensor& input_tensor,
        const uint32_t scatter_dim,

parameter scatter_dim is const-qualified in the function declaration; const-qualification of parameters only has an effect in function definitions

Suggested change:
-         const uint32_t scatter_dim,
+         uint32_t scatter_dim,
Clang-Tidy found issue(s) with the introduced code (6/6)

        const ttnn::Tensor& input_tensor,
        const uint32_t scatter_dim,
        ttnn::operations::reduction::ReduceType math_op,
        const uint32_t num_links = 1,

parameter num_links is const-qualified in the function declaration; const-qualification of parameters only has an effect in function definitions

Suggested change:
-         const uint32_t num_links = 1,
+         uint32_t num_links = 1,

        const uint32_t num_links = 1,
        const std::optional<ttnn::MemoryConfig>& memory_config = std::nullopt,
        ttnn::ccl::Topology topology = ttnn::ccl::Topology::Ring,
        const std::optional<size_t> num_workers = std::nullopt,

parameter num_workers is const-qualified in the function declaration; const-qualification of parameters only has an effect in function definitions

Suggested change:
-         const std::optional<size_t> num_workers = std::nullopt,
+         std::optional<size_t> num_workers = std::nullopt,

        const std::optional<ttnn::MemoryConfig>& memory_config = std::nullopt,
        ttnn::ccl::Topology topology = ttnn::ccl::Topology::Ring,
        const std::optional<size_t> num_workers = std::nullopt,
        const std::optional<size_t> num_buffers_per_channel = std::nullopt);

parameter num_buffers_per_channel is const-qualified in the function declaration; const-qualification of parameters only has an effect in function definitions

Suggested change:
-         const std::optional<size_t> num_buffers_per_channel = std::nullopt);
+         std::optional<size_t> num_buffers_per_channel = std::nullopt);
namespace ttnn::operations::experimental::ccl {

ttnn::Tensor ExecuteAllReduce::invoke(

use a trailing return type for this function

Suggested change:
- ttnn::Tensor ExecuteAllReduce::invoke(
+ auto ExecuteAllReduce::invoke(

    const std::optional<ttnn::MemoryConfig>& memory_config,
    ttnn::ccl::Topology topology,
    const std::optional<size_t> num_workers,
    const std::optional<size_t> num_buffers_per_channel) {

use a trailing return type for this function

Suggested change:
-     const std::optional<size_t> num_buffers_per_channel) {
+     const std::optional<size_t> num_buffers_per_channel) -> ttnn::Tensor {
Clang-Tidy found issue(s) with the introduced code (1/1)

} // namespace experimental
} // namespace operations
* tenstorrent#5560: Add all_reduce op
* tenstorrent#5560: Add shapes to the test
* tenstorrent#5560: Combine all_gather with launch_op
* #0: Update copyright
* tenstorrent#5560: all_gather + local reduce
* tenstorrent#5560: Allocate tensor properly
* tenstorrent#5560: Reduce gathered dim
* tenstorrent#5560: Add to frequent pipeline
* tenstorrent#5560: Add cases with NC dim
Tracking Issue: #5560
Adds the all_reduce op under the experimental folder.
Q: Is only sum supported as the reduce op?
A: Today, yes. In the future, no; we'd need to support min, max, and other reduction operations.
Q: What is the output shape after all_reduce? Is it equal to the shape on one device?
A: Yes. (For cases where, for whatever reason, one of the devices has less data, the output would be the size of the larger tensors, and the device with a smaller tensor would logically be padded with identity-value elements. Whether this is a legitimate use case for this operation, I am not sure at the moment.)
Q: After all reduce, will all devices have the same reduced tensor?
A: Yes
Q: Do we do reduce over the batch dimension only?
A: Not exactly. All-reduce is a rank-wise operation; that is, we are doing element-wise reductions of tensors across devices. For reference, we are following the semantics outlined here.
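As a concrete illustration of those semantics (a toy simulation over std::vector, not the ttnn API): every device contributes a same-shaped tensor, the reduction is applied element-wise across the device axis, and every device receives the identical reduced result.

#include <cstddef>
#include <iostream>
#include <vector>

// Toy all-reduce (sum): element-wise across devices, result replicated to each device.
std::vector<std::vector<float>> toy_all_reduce(const std::vector<std::vector<float>>& per_device) {
    std::vector<float> reduced(per_device.front().size(), 0.0f);
    for (const auto& shard : per_device)
        for (std::size_t i = 0; i < shard.size(); ++i)
            reduced[i] += shard[i];
    return std::vector<std::vector<float>>(per_device.size(), reduced);
}

int main() {
    auto out = toy_all_reduce({{1, 2}, {10, 20}, {100, 200}});
    for (const auto& device_tensor : out)
        std::cout << device_tensor[0] << ' ' << device_tensor[1] << '\n';  // "111 222" on every device
}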
Q: What is the memory footprint on one device? E.g., if we have 32 devices and call all_reduce on a tensor with volume 16k, how much DRAM will be allocated during this op?
A:
Today
Memory consumption is inefficient because we are providing the least optimized, functional implementation of all-reduce (composite all-gather + eltwise operation).
If # devices = n and input tensor size per device = m, then the total memory working set size (per chip) will be, today, (n + 1) * m.
That is, n * m is needed to store the intermediate all-gather results, and another m to store the final output. After the operation, the n * m can be returned to the allocator.
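Worked example with the numbers from the question, treating the 16k volume as the per-device tensor size m (in elements): with n = 32 devices, today's working set per chip is (n + 1) * m = 33 * 16K = 528K elements, of which 32 * 16K = 512K is the transient all-gather intermediate and 16K is the final output.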
Near future
We will enable the reduce_scatter + all-gather (composite) optimization, which will both reduce memory pressure and improve performance. This has 2 constraints:
All-gather and reduce-scatter must add support for non-uniform tensor sizes across chips.
The input tensor shape must have a dim with size > n.
(Technically it just needs a volume > n, but that would require a multi-dimension scatter/concat dim for reduce-scatter/all-gather, which is out of the picture for the foreseeable future.)
With this version, total memory pressure (per chip) would be m + m * (1/n), where m * (1/n) is the size of the intermediate tensor between reduce_scatter and all-gather.
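Continuing the same illustrative numbers (n = 32, m = 16K elements per chip), the composite reduce_scatter + all-gather version would need roughly m + m / n = 16K + 0.5K = 16.5K elements per chip, compared to 528K for today's all-gather + eltwise implementation.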
Long Term
We will support a proper non-composite implementation, in which the memory needed per chip will simply be m.
Checklist