Support for new matmul1d op with gather_in0 #14964

Open — wants to merge 17 commits into base: main
Conversation

@avoraTT avoraTT commented Nov 12, 2024

Ticket

Problem description

Currently, matmul1d supports mcast_in0, for when the input is sharded across all the cores. However, in some cases (specifically, the matmuls used in the Llama models) this becomes a compute bottleneck, as each core must wait to receive each shard before it can process it.

To address this, a new option to matmul1d is proposed: gather_in0. With this option, the activations are gathered using a ring all-gather operation. Each core can start processing the local activation shard that is already available, and process the other shards as soon as they arrive. Essentially, this overlaps the time taken to gather the activations with the computation.
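As a rough mental model (not code from this PR), the order in which each core can process shards under the ring all-gather can be sketched in a few lines of Python; `gather_order` is a hypothetical helper name, assuming each core forwards shards to its next neighbor one hop per step:

```python
def gather_order(core: int, ring_size: int) -> list[int]:
    """Order in which `core` can process shards in the ring all-gather.

    Step 0 is the core's own (local) shard, available immediately;
    each later step is the shard forwarded from the previous core,
    so core c sees shards c, c-1, c-2, ... (mod ring_size).
    """
    return [(core - step) % ring_size for step in range(ring_size)]
```

For example, in a 4-core ring, core 0 can compute on shard 0 right away instead of idling until all shards have been multicast.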

For the FF1 matmul in Llama (M, K, N = 32, 2304, 3840), this new matmul1d takes 10 us (gather_in0, in1 sharded, with a hack to enable full dest in fp32_acc mode), compared to the 36 us (mcast_in0, in1 read from DRAM) measured before.

See issue for a diagram of how the inputs are gathered.

What's changed

This PR adds the following changes:

  1. A new gather_in0 flag, defaulted to false, in the MatmulMultiCoreReuseMultiCast1DProgramConfig header and pybind
  2. A new helper function in the matmul1d program factory, to specifically handle the case when gather_in0=True
  3. 3 new kernels
    i. a ring gather in0 kernel
    ii. an in1 kernel
    iii. a bmm kernel (close copy of the existing one)
  4. Validation in matmul_op.cpp for the new gather_in0 case
  5. test_matmul_1d_gathered.py to test the new matmul configuration

Caveats

  • Inputs MUST be sharded and on the same cores. For its intended use cases, the DRAM prefetcher will be used to distribute the weights across the cores, so this does not degrade performance.
  • This op does not support bias; it is not required in the Llama use case, and omitting it simplifies the implementation.
  • The existing bmm kernel is duplicated. Most of the structure remains the same, but there are changes to how the input CBs are read.
  • This op supports sharding on arbitrary cores (not necessarily rectangular) and uses the dynamic NOC functionality to maintain performance.

Remaining TODOs:

  • Test on Grayskull
  • Get official support for full dest when fp32_accum_mode = True (this is not necessary but leads to significant perf gains)
  • Create a PR that adds support for array inputs in CoreRangeSet, to retain ordering of arbitrary cores in a shard spec
  • Check perf for DRAM prefetcher core grid configuration

Checklist

  • Post commit CI passes
  • New/Existing tests provide coverage for changes

@avoraTT avoraTT added metal tt-metal issue LLMs on Metal labels Nov 12, 2024
@avoraTT avoraTT self-assigned this Nov 12, 2024
@@ -1659,8 +1989,8 @@ operation::ProgramWithCallbacks matmul_multi_core_reuse_mcast_1d_optimized_(

     if (fp32_dest_acc_en) {
         TT_FATAL(
-            out_subblock_h * out_subblock_w <= 4,
+            out_subblock_h * out_subblock_w <= 8,
             "Total number of tiles in a subblock must be less than 4 when in fp32_dest_acc mode");
Contributor:
Why change this to 8?

Isn't 16 cut in half because of half dest mode and in half again because of fp32?

If anything, this check could be removed, since

                TT_FATAL(
                    (program_config.out_subblock_w * program_config.out_subblock_h) <= available_reg_count,
                    "out_subblock_w {} times out_subblock_h {} needs to be at most {} to fit in hardware",
                    program_config.out_subblock_w,
                    program_config.out_subblock_h,
                    available_reg_count);

has been added to validate() in matmul_op.cpp
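For illustration only, the dest-register arithmetic implied by this question can be sketched in Python (`dest_tile_capacity` is a hypothetical helper, not a tt-metal API): start from 16 tiles of dest capacity at 16-bit precision, halve it in half-dest mode, and halve it again when accumulating in fp32:

```python
def dest_tile_capacity(half_dest: bool, fp32_acc: bool) -> int:
    # Hypothetical helper mirroring the reviewer's arithmetic:
    # 16 tiles of dest at 16-bit, halved when only half of dest is
    # available to math, halved again when tiles are 32-bit.
    tiles = 16
    if half_dest:
        tiles //= 2
    if fp32_acc:
        tiles //= 2
    return tiles
```

Under this arithmetic, half dest plus fp32 gives 4 tiles (the original bound), while full dest with fp32 gives the 8 used in the PR.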

Contributor Author (avoraTT):
So one of the remaining TODOs in the PR is getting support for full dest mode (see here). If it isn't merged before this PR, I can revert the hard-coded value.

In my testing for the shapes used in the Llama models, I have found that we can actually use 8 here, and it results in significant speedups since there's no reload.

Contributor:
If you enable full dest mode then that will need to be reflected in the config and get_dest_reg_count needs to return 8. You cannot use a constant.

Please either remove this test or make it do the same thing as in validate.

Contributor Author (avoraTT):
It seems that full dest mode support has already been added in main. However, the pybind for the compute kernel config has not been updated to allow users to enable it.

I've opened a PR here that updates the pybind, and I am currently waiting for confirmation from @amahmudTT that this change is correct.

.defines = mm_kernel_defines});

/* Create circular buffers */
uint32_t src0_cb_index = 0;
Contributor:
Please just use the cb constants instead of raw numbers in the assignments.

Comment on lines +28 to +95
# 32, 2304, 3840
(1, 32, 2304, 3840, ttnn.bfloat16, ttnn.bfloat4_b, ttnn.MathFidelity.LoFi, True, True, (8, 3)),
# 32, 2304, 3840
(3, 32, 2304, 3840, ttnn.bfloat16, ttnn.bfloat4_b, ttnn.MathFidelity.LoFi, True, True, (8, 3)),
# 32, 2304, 3840
(3, 32, 2304, 3840, ttnn.bfloat16, ttnn.bfloat8_b, ttnn.MathFidelity.LoFi, False, False, (8, 3)),
Contributor:
Has the exact arbitrary shard grid been tested (i.e., the cores placed near the DRAM banks)?

Contributor Author (avoraTT):
It has not been tested just yet. But I will do that and include results here!

Contributor Author (avoraTT):
Tested with the exact core grid, with cores near the DRAM banks. Results are 👍.

ttnn/cpp/ttnn/operations/matmul/device/matmul_op.cpp (outdated; resolved)
Comment on lines 86 to 92
#ifdef MATMUL_DRAM_SHARDED
const bool is_worker_core = get_arg_val<uint32_t>(0) == 1;
// if not worker core, skip
if (not is_worker_core) {
return;
}
#endif
Contributor:
Should this be removed, since it will never be triggered?

Contributor Author (avoraTT):
Yes, I can get rid of this. I can also get rid of all the other things in this bmm kernel that aren't being used (untilize out, fused bias, etc.). Is this fine @bbradelTT?

Contributor:
That's fine.

Please add appropriate checks in validate() to ensure that the inputs are not dram sharded and the other things are not being used.

Comment on lines +49 to +69
for (uint32_t shard_cnt = 0; shard_cnt < ring_size; shard_cnt++) {
    uint32_t curr_shard_write_addr = l1_write_addr_in0 + shard_size_bytes * shard_cnt;
    uint64_t remote_curr_shard_write_addr = get_noc_addr(next_core_noc_x, next_core_noc_y, curr_shard_write_addr, noc);
    uint32_t curr_shard_read_addr = shard_cnt == 0 ? local_shard_read_addr : l1_write_addr_in0 + shard_size_bytes * (shard_cnt - 1);

    // Wait for signal from previous core that data has been added to this core's in0
    noc_semaphore_wait_min(l1_signal_sem_addr, shard_cnt);

    // Send data to next core
    if (shard_cnt < ring_size - 1) { // Skip sending the last shard
        noc_async_write(curr_shard_read_addr, remote_curr_shard_write_addr, shard_size_bytes, noc);

        // Signal the next core that data is ready
        noc_semaphore_inc(remote_signal_semaphore_addr, 1, noc);
    }

    // Do stuff for matmul fusion here
    if (shard_cnt > 0) {
        cb_push_back(cb_id_in2, shard_size_in_tiles);
Contributor:
Should we have another issue tracking support for back-pressure (using a global CB)?
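The data movement in the kernel snippet above can be sanity-checked with a plain Python simulation (an illustrative model, not code from the PR): each step, every core forwards one shard to its next neighbor's in0 buffer, and shards land in arrival order, matching `curr_shard_write_addr = l1_write_addr_in0 + shard_size_bytes * shard_cnt`. After ring_size - 1 steps, every core holds all remote shards:

```python
def simulate_ring_gather(ring_size: int) -> list[list[int]]:
    """slots[c][s] models core c's in0 buffer at offset s * shard_size."""
    slots = [[None] * (ring_size - 1) for _ in range(ring_size)]
    for step in range(ring_size - 1):
        for core in range(ring_size):
            # shard_cnt == 0 forwards the local shard; later steps forward
            # the shard received into the previous slot.
            shard = core if step == 0 else slots[core][step - 1]
            slots[(core + 1) % ring_size][step] = shard
    return slots
```

For a 4-core ring, core 0 receives shards 3, 2, 1 in that order, and together with its local shard 0 ends up with the full activation.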

Comment on lines 25 to 54
uint32_t get_preferred_noc(const uint32_t src_x, const uint32_t dst_x, const tt_metal::Device* device) {
    /*
        NOC0: Preferred +x -> +y
        NOC1: Preferred -y -> -x
    */
    uint32_t MAX_X = device->grid_size().x;

    // Get the wrapped distances
    uint32_t dist_right = src_x < dst_x ? dst_x - src_x : MAX_X - src_x + dst_x;
    uint32_t dist_left = src_x < dst_x ? src_x + MAX_X - dst_x : src_x - dst_x;

    return dist_right < dist_left ? 0 : 1;
}
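For reference, the wrapped-distance heuristic above can be checked with a direct Python port (illustrative only; `grid_width` stands in for `device->grid_size().x`):

```python
def get_preferred_noc(src_x: int, dst_x: int, grid_width: int) -> int:
    """Pick NOC0 (prefers +x hops) when the rightward wrapped distance
    is shorter, else NOC1 (prefers -x hops); ties go to NOC1."""
    dist_right = dst_x - src_x if src_x < dst_x else grid_width - src_x + dst_x
    dist_left = src_x + grid_width - dst_x if src_x < dst_x else src_x - dst_x
    return 0 if dist_right < dist_left else 1
```

For example, on an 8-wide grid, going from x=1 to x=3 is 2 hops rightward versus 6 leftward, so NOC0 is chosen; the reverse trip prefers NOC1.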

Contributor:
It might be good to try it out with the intended core arrangement (cores placed near the DRAM banks).

Contributor @yugaoTT left a comment:
Hey @johanna-rock-tt, what is the largest layer needed to run? Would 32, 2304, 3840 be enough? Since we are buffering the full layer here, we need to make sure all layers pass.

@avoraTT avoraTT marked this pull request as ready for review November 26, 2024 12:49
3 participants