Commit
…3133) This commit adds the initial support for reduce scatter. However, only a few cases are functional. Future work will improve correctness across more cases.

======= Line Reduce Scatter Algorithm =======

The line reduce scatter algorithm sends the minimal amount of data over each link and out of each chip. All diagrams are for an example 4-chip line reduce scatter.

First, the operation fractures each input tensor:

      Input Tensors
     ----------------
     |    |    |    |
     |    |    |    |
     v    v    v    v
    |-|  |-|  |-|  |-|
    | |  | |  | |  | |
    | |  | |  | |  | |
    | |  | |  | |  | |
    | |  | |  | |  | |
    | |  | |  | |  | |
    | |  | |  | |  | |
    | |  | |  | |  | |
    |-|  |-|  |-|  |-|
Chip 0    1    2    3

          |
          | Fracture
          | Tensors
          v

      Input Tensors
     ----------------
     |    |    |    |
     |    |    |    |
     v    v    v    v
    |-|  |-|  |-|  |-|
    | |  | |  | |  | |
    |-|  |-|  |-|  |-|
    | |  | |  | |  | |
    |-|  |-|  |-|  |-|
    | |  | |  | |  | |
    |-|  |-|  |-|  |-|
    | |  | |  | |  | |
    |-|  |-|  |-|  |-|
Chip 0    1    2    3

After fracturing, the tensors are reduced and collapsed onto the diagonal across the chips, where the diagonal shows how the fractured chunks spatially map to the final outputs. For example, the first output is generated by reducing the top chunk of each input tensor. The reduction is performed by having each chip forward its input to its neighbour. Chips that are not at the end of the line reduce the incoming data with their local input before forwarding.

    |-|    |-|    |-|    |-|
    |#|<---| |<---| |<---| |
    |-|    |-|    |-|    |-|
    | |--->|#|<---| |<---| |
    |-|    |-|    |-|    |-|
    | |--->| |--->|#|<---| |
    |-|    |-|    |-|    |-|
    | |--->| |--->| |--->|#|
    |-|    |-|    |-|    |-|
Chip 0      1      2      3

However, note that each arrow heading out of a chip in a given direction shares Ethernet resources with all other arrows heading in the same direction from that chip, so there is inherent serialization here. For that reason, the chunks must be scheduled. The general scheduling strategy is to send the chunks that are furthest from the final reduce output first, and to step through chunks that are incrementally closer to the final output. Each direction from a chip can be processed independently. The diagram below is annotated with the "timesteps" at which each chunk is sent. Each timestep is marked relative to the chunk source.

    |-| t=0|-| t=0|-| t=0|-|
    |#|<---| |<---| |<---| |
    |-|t=2 |-| t=1|-| t=1|-|
    | |--->|#|<---| |<---| |
    |-|t=1 |-|t=1 |-| t=2|-|
    | |--->| |--->|#|<---| |
    |-|t=0 |-|t=0 |-|t=0 |-|
    | |--->| |--->| |--->|#|
    |-|    |-|    |-|    |-|
Chip 0      1      2      3

Finally, note that producing the final output requires a reduction of the partial results from both directions. Given that the two directions of the line execute completely independently, some sort of merge operation is required. At the time of this commit, the merge strategy is to designate a master and a slave reducer direction. We arbitrarily choose the 'right' or 'clockwise' direction as the master. The master direction writes its output to the output tensor, but note that this is only a partial output. The slave direction reads from the output tensor to merge it with the data from the producer chip. It reads from the output tensor based on credits passed from the master (implemented via semaphores).

    -------Input Tensor
    |     |
    |     |
    |     |    |---------|---------|
    |     |--->| Reader  | Sender  |---      From EDM
    |          | (master)| (master)|  |     --------->
    |          |         |         |  |
    |          |---------|---------|  |
    |                                 |
    |          |----- Output Tensor <-|
    |          |   ^
    |          |   |---------
    |          |             |
    |          |---------|---------|
    |--------->| Reader  | Sender  |      From EDM
               | (slave) | (slave) |     --------->
               |         |         |
               |---------|---------|
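The master/slave credit passing can be sketched as follows. This is only a schematic model of the semaphore-based handshake described above, written as host-side Python with a standard threading semaphore standing in for the device semaphore; it is not the kernel code or its API.

    # Schematic model of the master/slave merge via credit passing (not device code).
    # A Python Semaphore stands in for the device semaphore used to pass credits.
    from threading import Semaphore

    credits = Semaphore(0)

    def master_direction(partials, output_tensor):
        # The master (clockwise) direction writes its partial result for each chunk
        # into the output tensor, then releases a credit for that chunk.
        for i, partial in enumerate(partials):
            output_tensor[i] = partial      # partial output only
            credits.release()               # credit: chunk i may now be merged

    def slave_direction(partials, output_tensor):
        # The slave direction waits for a credit, reads the master's partial result
        # back from the output tensor, merges in its own partial result, and writes
        # the final value.
        for i, partial in enumerate(partials):
            credits.acquire()
            output_tensor[i] = output_tensor[i] + partial

Running master_direction and slave_direction on two threads against the same output_tensor models the intended ordering guarantee: the slave never reads a chunk before the master has written it.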
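Putting the pieces together, a minimal host-side golden model of the data movement above (fracturing, furthest-first scheduling, and per-chunk reduction from both directions) might look as follows. This is an illustrative sketch only, not the device implementation; the function names and the choice of split dimension are assumptions.

    # Illustrative host-side golden model of line reduce scatter (not the device code).
    # Each of the num_chips chips contributes one input tensor; output chunk c lands on chip c.
    import numpy as np

    def fracture(tensor, num_chips):
        # Split one input tensor into num_chips equal chunks along dim 0.
        return np.split(tensor, num_chips, axis=0)

    def send_schedule(chip, num_chips):
        # Per direction, chunks furthest from their final output are sent first,
        # matching the timestep annotations in the diagram above.
        left = list(range(0, chip))                   # e.g. chip 3 sends chunk 0, then 1, then 2 leftward
        right = list(range(num_chips - 1, chip, -1))  # e.g. chip 0 sends chunk 3, then 2, then 1 rightward
        return left, right

    def line_reduce_scatter(inputs):
        num_chips = len(inputs)
        chunks = [fracture(t, num_chips) for t in inputs]
        outputs = []
        for c in range(num_chips):
            # Chip c reduces its own chunk with the partial sums arriving from the
            # left (chips 0..c-1) and from the right (chips c+1..num_chips-1).
            from_left = sum(chunks[src][c] for src in range(0, c))
            from_right = sum(chunks[src][c] for src in range(c + 1, num_chips))
            outputs.append(chunks[c][c] + from_left + from_right)
        return outputs

For example, with four chips, send_schedule(0, 4) returns ([], [3, 2, 1]): chip 0 has nothing to send leftward, and sends the chunk destined for chip 3 at t=0, chip 2 at t=1, and chip 1 at t=2, matching the annotated diagram.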
As a part of the line reduce scatter implementation, new CCL components were added: ccl_send and CCL command generators/readers.

The ccl_send kernel was used to implement the starting ends of the line (i.e. the first senders). Although ccl_send provides more generic send capabilities than line reduce scatter currently requires, it was chosen because it is a basic building block for future CCL send/recv "operations" and higher level CCL programming models.

======= CCL Send (Kernel) =======

The ccl_send kernel acts like an interpreter of CCL commands. CCL commands are, so far, limited to sending a tensor slice from a tensor to the EDM. A command specifies some information about the tensor (shape, slice/view shape, view offset, etc.). ccl_send is capable of executing multiple commands back to back. In the context of line reduce scatter, ccl_send implements the separate sends of the fractured chunks on the left and right ends of the line. To do this for a line reduce scatter, we invoke n commands, where n = #chips in the line. Future commands will let an invoker specify this basic pattern as a single command. Looking at the timestep diagram above, for the tensors at the ends of the line, each timestep maps directly to a separate CCL command.

======= CCL Command Generators/Readers =======

To facilitate command generation, initial components have been added to let the host serialize commands for the ccl_send kernel. Correspondingly, command unpacking logic is specified for each command. This helps simplify command generation on the host. A schematic example of such a command and its host-side packing is sketched after the limitations list below.

Note that ccl_send as a standalone kernel and operation is experimental and has several limitations:
- Slice reads are currently constrained to page-aligned slices
- Host command generation doesn't properly support 4D shapes (although the kernel side will internally represent shapes as 4D)
- Only one command is currently supported (send tensor slice to EDM)
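To make the command flow concrete, the sketch below shows what a "send tensor slice" command and its host-side packing might look like. The struct name, field names, flat-int encoding, and the build_end_of_line_commands helper are all illustrative assumptions and do not reflect the actual ccl_send command format.

    # Illustrative sketch of a "send tensor slice to EDM" command and its host-side
    # serialization. Field names and encoding are hypothetical, not the real format.
    from dataclasses import dataclass, astuple

    @dataclass
    class SendTensorSliceCommand:
        tensor_shape: tuple   # full tensor shape (kernel side represents shapes as 4D)
        slice_shape: tuple    # shape of the slice/view to send
        slice_offset: tuple   # offset of the view into the full tensor

        def pack(self):
            # Host side: flatten the command into a stream of ints (e.g. runtime args).
            args = []
            for field in astuple(self):
                args.extend(field)
            return args

    def build_end_of_line_commands(num_chips, tensor_shape, chunk_dim=2):
        # For the ends of the line, one command is issued per fractured chunk
        # (n commands for an n-chip line), one per timestep.
        slice_shape = list(tensor_shape)
        slice_shape[chunk_dim] //= num_chips
        commands = []
        for timestep in range(num_chips):
            offset = [0, 0, 0, 0]
            offset[chunk_dim] = timestep * slice_shape[chunk_dim]
            commands.append(SendTensorSliceCommand(tuple(tensor_shape),
                                                   tuple(slice_shape),
                                                   tuple(offset)))
        return commands

In such a scheme, the kernel-side unpacking logic would simply mirror pack(), reading the same fields in the same order before executing the send.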