Releases: tenstorrent/tt-metal
v0.52.0-rc15
Note
If you are installing from a release, please refer to the README, INSTALLATION instructions, and any other documentation packaged with the release, not on the main branch. There may be differences between the latest main and the previous release.
The changelog will now follow, showing the changes from last release.
This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/10747521160
📦 Uncategorized
- #0: Remove run_operation from async_runtime.hpp
- PR: #11757
- #11640: Include simulation device in tt_cluster
- PR: #11766
- #11342: Replace tt_lib with ttnn function in experimental/functional
- PR: #11356
- #11649: update tt_lib with ttnn support for non working folder
- PR: #11654
- Perf dashboard and batching support for Mistral-7B and Llama3.1-8B
- PR: #11603
- Adding fix for llama CI failure caused by ttnn.experimental.tensor.typecast
- PR: #11765
- Fold sharded support
- PR: #11722
- #9450: add env flag to skip recompiling and reloading FW
- PR: #11681
- Move semaphores into kernel config ring buffer
- PR: #11764
- #10874: Enable test cases for concurrent instances in CCL all gather
- PR: #10885
- [Falcon7b] Remove hf reference files and import from transformers instead
- PR: #11758
- #11768: Fix watcher pause feature
- PR: #11780
- [Improvement] Added some graph names in the separate file
- PR: #11732
- Migrate CB configs into kernel config ring buffer
- PR: #11778
- #0: Feed more data to visualizer
- PR: #11400
- #11490: ttnn and tt_metal shapes are mixed
- PR: #11723
- Migrate sharded ops from TTL to TTNN
- PR: #11546
- #8865: Port ttnn ops to dispatch profiling infra
- PR: #11698
- #11700: update write_tensor with copy_host_to_device_tensor
- PR: #11701
- TTNN sweep low pic unit tests
- PR: #11775
- Add sweeps for ops: topk, frac, trunc, ceil to TTNN
- PR: #11771
- LLK Test Coverage Follow-up
- PR: #11715
- Llama3.1 70b Prefill - MLP and Attention
- PR: #11724
- #10866: Read profiler buffer with EnqueueReadBuffer in fast dispatch mode
- PR: #11781
- Lpremovic/0 expand llk ctest coverage
- PR: #11653
- #11313: Migrate layernorm_distributed to ttnn
- PR: #11696
- [Blackhole Bringup] Fixes for maxpool
- PR: #11761
- #11850: Remove Llama3.1-8B output matching to avoid blocking CI
- PR: #11851
- modify keys within device_info
- PR: #11852
- #0: remove extra arch-wormhole labels for single-card workflows
- PR: #11785
- #0: fix cloud-virtual-machine label
- PR: #11863
- #11564: added test for generating sample data with many different use cases to the visualizer
- PR: #11862
- #0: Remove llk_io.cc for WH and BH as well. GS was removed in 7b8e627
- PR: #11864
- #9527: Moving bcast to operations/data_movement
- PR: #11599
- #10332: Make ttnn::event_synchronize block only in the app thread
- PR: #11543
- #11554: Replace tt_lib in sweeps, integration_tests
- PR: #11556
- #11877: Make dispatch core order in the core descriptor match for E75 with 1 and 2 CQs
- PR: #11878
- #11845: fix worker ring direction assignment in reduce scatter
- PR: #11846
- FD Optimizations/Cleanup
- PR: #11872
- #11881: Add -Wno-vla-cxx-extension to CMake to fix build on clang18
- PR: #11882
- Revert "#11881: Add -Wno-vla-cxx-extension to CMake to fix build on clang18"
- PR: #11887
- #10163: Add backward support for remainder op
- PR: #9712
- Added ttnn.hypot_bw unit test
- PR: #11843
- #0: Add another codeowner for conv2d
- PR: #11849
- #11334: Remove unnecessary code for previous ci/cd csvs
- PR: #11898
- #0: Bump timeout for single-card perf tests to see if that helps with timeouts
- PR: #11893
- Removed "" graph_consts.hpp
- PR: #11904
- [Falcon7b] Re-enable decode perplexity test with seq len 2048
- PR: #11868
- [Falcon7b] Fix duplicate loading of rotary embeddings in prefill/decode
- PR: #11871
- [Falcon7b] Re-enable demo perf-mode tests on galaxy, update targets, prevent multinomial errors (during perf-mode) using nan-to-num
- PR: #11876
- [Blackhole Bringup] Add pack_untilize tests & fixes
- PR: #11875
- #0: Consolidate demo tests for single card and t3000 to use impls rather than copy
- PR: #11897
- Collection of small dprint/watcher changes
- PR: #11906
- #11917: disable test
- PR: #11918
- #11706: Use new Conv2D API in UNet Shallow
- PR: #11902
- #11925 Update ttnn.arange binding
- PR: #11926
- #0: Remove test include from packet_demux
- PR: #11924
- #7709: Fix exp like ops ttnn doc issues
- PR: #7879
- #11126: Resnet Demo with new conv API
- PR: #11770
- Added ttnn.argmax sweeps, API calls and unit tests
- PR: #11552
- #10515: For matmul corner case, if CBs don't fit, choose different program config
- PR: #11892
- [Mixtral8x7B] Increase demo max context length to 32k
- PR: #11777
- Added ttnn.topk unit test
- PR: #11935
- #0: (MINOR) Update to v0.52.0
- PR: #11946
- #11847: Add tt-smi reset command environment variable for sweeps
- PR: #11901
- #11000: Enable uint8 A2D and (un)pack reconfig
- PR: #11537
- #0: Do not use mount-cloud-weka label because we may no longer need it as cloud fixed it
- PR: #11941
- #0: fixed External Operation logging
- PR: #11958
- #0: Update matmul_multi_core_reuse to support mixed precision
- PR: #11947
- #11138: Move large global vars in prefetcher and dispatcher to the stack
- PR: #11922
- Enabling BH L1 data cache
- PR: #11909
- #0: Move Unary device operation to tmp
- PR: #11793
- Moved tracked methods out of tensor
- PR: #11921
- #11964: Only write branch if the repo is not detached
- PR: #11965
- #11622: add concat sweep
- PR: #11733
- #0: Refactor Python dynamic modules creation
- PR: #11798
- #0: Update resnet test infra to print total batch size for multi device
- PR: #11966
- #11930: Increase status checks
- PR: #11945
- Convs on BH
- PR: #11977
- #9630: assert out concat when concatenating along padded dimensions
- PR: #11869
- Use product codes for cards instead of arch for eager-package-main
- PR: #11976
- #11929: Move work_split_tilize
- PR: #11932
- #11693: Move DeviceModule bindings and replace ttnn.experimental APIs
- PR: #11820
- #11247: Remove in-place flag in binary operations
- PR: #11604
- #11591: Move hack delay from trisc.cc to trisck.cc before run_kernel
- PR: #11963
- #8865: Optimize softmax dispatch time
- PR: #11889
- #0: skip yolov4 failing sub_modules
- PR: #11959
- #11519: Restore path reservation for mms and convs
- PR: #11520
- #5337: Fix Mixtral total number of generated tokens in perf benchmark
- PR: #11994
- #11883: use fixed_string.size() instead of sizeof to ensure compatibility with newer versions of reflect
- PR: #11896
- #11559: Replace tt_lib in tests/ttnn files
- PR: #11822
- #11915: Add sweep vector tagging and related infra changes
- PR: #11970
- #0: fix fetch q write assert by using correct data offset for enqueue write buffer
- PR: #11983
- update conv path in CODEOWNERS:
- PR: #11978
- enable all enablable unit tests for convs with new api
- PR: #11981
- Fix size_t compilation failure
- PR: #12003
- Update perf and latest features for llm models (Aug 26)
- PR: #11905
- Split up n300 demo tests into functionality and performance
- PR: #11969
- #10718: Fix issue with negative pipeline queue times
- PR: #12010
- #11642: demux ttnn::typecast into ttnn::experimental::typecast on gra…
- PR: #11985
- #11569: Enable Conv2D WH unit tests for UNet shapes
- PR: #11589
- #11591: Fix race by making only unpacker zero out RISCV_DEBUG_REG_DBG_FEATURE_DISABLE at start of kernel
- PR: #12011
- Update CODEOWNERS
- PR: #12048
- Add missing include to graph_trace_utils.hpp
- PR: #12050
- #0: Always initialize l1_banking allocator even when size is 0
- PR: #12047
- update slack notification include workflow run
- PR: #12054
- #8868: Fixed conv for Stride>2
- PR: #11933
- #11430: Refactoring moreh_mean
- PR: #11776
- #11832: Remove tracking of writes per block and only track last block
- PR: #11999
- #11644: Migrate AutoFormat to TTNN Experimental
- PR: #11823
- Added ttnn.i0_bw unit test
- PR: #11891
- #11938: Refactoring moreh_bmm
- PR: #12000
- #11646: Replace ttnn.experimental.tensor.* in models/demos
- PR: #11943
- Add support for cur_pos tensor arg in sdpa decode
- PR: #11788
- #5659: Add Width Sharded support to Conv2d
- PR: #11582
- Remove noinline attribute from sdpa_decode compute kernel
- PR: #12060
- Updated sfpi compiler to address missing SFPNOP insertion
- PR: #12061
- Move compute kernel config to TTNN
- PR: #11801
- Add fold to resnet
- PR: #11940
- [BugFix] Fixed tensor::is_allocated.
- PR: #12071
- Revert "[BugFix] Fixed tensor::is_allocated."
- PR: #12082
- #8598: sinh fix
- PR: #12056
- #11646: Replace ttnn.experimental.tensor.* to ttnn.* in models/experimental, tests
- PR: #11821
- #10754: Add data-parallel support for UNet Shallow on N300
- PR: #12062
- #0: Fixed Conv2dConfig in broken tests
- PR: #12064
- #0: Falcon40b T3K demo mismatch tokens fixed
- PR: #12105
- #12069: Add catch and handling for device initialize exception, typic…
- PR: #12070
- Point metal to new UMD main branch
- PR: #12097
- Update CODEOWNERS
- PR: #12112
- #11993: Fix offset calculation for uneven shard in reshard fast path
- PR: #12083
- Update CODEOWNERS
- PR: #12114
- #12117: Refactor DeviceMesh->MeshDevice, DeviceGrid->MeshShape
- PR: #12118
- #11854: Move .umd that houses cluster descriptor to TT_METAL_HOME
- PR: #12113
- Fused AllGather+Matmul
- PR: #11760
- #12124: support moreh_nll_loss support large weight
- PR: #12126
- [Bugfix] Fixed is allocated
- PR: #12109
- #11990: Replace ttnn.experimental.tensor.* to ttnn.* in ttnn folder
- PR: #12005
- #11132 Run Post-Commit Python Tests agai...
v0.52.0-rc14
This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/10745849640
📦 Uncategorized
- #0: Remove run_operation from async_runtime.hpp
- PR: #11757
- #11640: Include simulation device in tt_cluster
- PR: #11766
- #11342: Replace tt_lib with ttnn function in experimental/functional
- PR: #11356
- #11649: update tt_lib with ttnn support for non working folder
- PR: #11654
- Perf dashboard and batching support for Mistral-7B and Llama3.1-8B
- PR: #11603
- Adding fix for llama CI failure caused by ttnn.experimental.tensor.typecast
- PR: #11765
- Fold sharded support
- PR: #11722
- #9450: add env flag to skip recompiling and reloading FW
- PR: #11681
- Move semaphores into kernel config ring buffer
- PR: #11764
- #10874: Enable test cases for concurrent instances in CCL all gather
- PR: #10885
- [Falcon7b] Remove hf reference files and import from transformers instead
- PR: #11758
- #11768: Fix watcher pause feature
- PR: #11780
- [Improvement] Added some graph names in the separate file
- PR: #11732
- Migrate CB configs into kernel config ring buffer
- PR: #11778
- #0: Feed more data to visualizer
- PR: #11400
- #11490: ttnn and tt_metal shapes are mixed
- PR: #11723
- Migrate sharded ops from TTL to TTNN
- PR: #11546
- #8865: Port ttnn ops to dispatch profiling infra
- PR: #11698
- #11700: update write_tensor with copy_host_to_device_tensor
- PR: #11701
- TTNN sweep low pic unit tests
- PR: #11775
- Add sweeps for ops: topk, frac, trunc, ceil to TTNN
- PR: #11771
- LLK Test Coverage Follow-up
- PR: #11715
- Llama3.1 70b Prefill - MLP and Attention
- PR: #11724
- #10866: Read profiler buffer with EnqueueReadBuffer in fast dispatch mode
- PR: #11781
- Lpremovic/0 expand llk ctest coverage
- PR: #11653
- #11313: Migrate layernorm_distributed to ttnn
- PR: #11696
- [Blackhole Bringup] Fixes for maxpool
- PR: #11761
- #11850: Remove Llama3.1-8B output matching to avoid blocking CI
- PR: #11851
- modify keys within device_info
- PR: #11852
- #0: remove extra arch-wormhole labels for single-card workflows
- PR: #11785
- #0: fix cloud-virtual-machine label
- PR: #11863
- #11564: added test for generating sample data with many different use cases to the visualizer
- PR: #11862
- #0: Remove llk_io.cc for WH and BH as well. GS was removed in 7b8e627
- PR: #11864
- #9527: Moving bcast to operations/data_movement
- PR: #11599
- #10332: Make ttnn::event_synchronize block only in the app thread
- PR: #11543
- #11554: Replace tt_lib in sweeps, integration_tests
- PR: #11556
- #11877: Make dispatch core order in the core descriptor match for E75 with 1 and 2 CQs
- PR: #11878
- #11845: fix worker ring direction assignment in reduce scatter
- PR: #11846
- FD Optimizations/Cleanup
- PR: #11872
- #11881: Add -Wno-vla-cxx-extension to CMake to fix build on clang18
- PR: #11882
- Revert "#11881: Add -Wno-vla-cxx-extension to CMake to fix build on clang18"
- PR: #11887
- #10163: Add backward support for remainder op
- PR: #9712
- Added ttnn.hypot_bw unit test
- PR: #11843
- #0: Add another codeowner for conv2d
- PR: #11849
- #11334: Remove unnecessary code for previous ci/cd csvs
- PR: #11898
- #0: Bump timeout for single-card perf tests to see if that helps with timeouts
- PR: #11893
- Removed "" graph_consts.hpp
- PR: #11904
- [Falcon7b] Re-enable decode perplexity test with seq len 2048
- PR: #11868
- [Falcon7b] Fix duplicate loading of rotary embeddings in prefill/decode
- PR: #11871
- [Falcon7b] Re-enable demo perf-mode tests on galaxy, update targets, prevent multinomial errors (during perf-mode) using nan-to-num
- PR: #11876
- [Blackhole Bringup] Add pack_untilize tests & fixes
- PR: #11875
- #0: Consolidate demo tests for single card and t3000 to use impls rather than copy
- PR: #11897
- Collection of small dprint/watcher changes
- PR: #11906
- #11917: disable test
- PR: #11918
- #11706: Use new Conv2D API in UNet Shallow
- PR: #11902
- #11925 Update ttnn.arange binding
- PR: #11926
- #0: Remove test include from packet_demux
- PR: #11924
- #7709: Fix exp like ops ttnn doc issues
- PR: #7879
- #11126: Resnet Demo with new conv API
- PR: #11770
- Added ttnn.argmax sweeps, API calls and unit tests
- PR: #11552
- #10515: For matmul corner case, if CBs don't fit, choose different program config
- PR: #11892
- [Mixtral8x7B] Increase demo max context length to 32k
- PR: #11777
- Added ttnn.topk unit test
- PR: #11935
- #0: (MINOR) Update to v0.52.0
- PR: #11946
- #11847: Add tt-smi reset command environment variable for sweeps
- PR: #11901
- #11000: Enable uint8 A2D and (un)pack reconfig
- PR: #11537
- #0: Do not use mount-cloud-weka label because we may no longer need it as cloud fixed it
- PR: #11941
- #0: fixed External Operation logging
- PR: #11958
- #0: Update matmul_multi_core_reuse to support mixed precision
- PR: #11947
- #11138: Move large global vars in prefetcher and dispatcher to the stack
- PR: #11922
- Enabling BH L1 data cache
- PR: #11909
- #0: Move Unary device operation to tmp
- PR: #11793
- Moved tracked methods out of tensor
- PR: #11921
- #11964: Only write branch if the repo is not detached
- PR: #11965
- #11622: add concat sweep
- PR: #11733
- #0: Refactor Python dynamic modules creation
- PR: #11798
- #0: Update resnet test infra to print total batch size for multi device
- PR: #11966
- #11930: Increase status checks
- PR: #11945
- Convs on BH
- PR: #11977
- #9630: assert out concat when concatenating along padded dimensions
- PR: #11869
- Use product codes for cards instead of arch for eager-package-main
- PR: #11976
- #11929: Move work_split_tilize
- PR: #11932
- #11693: Move DeviceModule bindings and replace ttnn.experimental APIs
- PR: #11820
- #11247: Remove in-place flag in binary operations
- PR: #11604
- #11591: Move hack delay from trisc.cc to trisck.cc before run_kernel
- PR: #11963
- #8865: Optimize softmax dispatch time
- PR: #11889
- #0: skip yolov4 failing sub_modules
- PR: #11959
- #11519: Restore path reservation for mms and convs
- PR: #11520
- #5337: Fix Mixtral total number of generated tokens in perf benchmark
- PR: #11994
- #11883: use fixed_string.size() instead of sizeof to ensure compatibility with newer versions of reflect
- PR: #11896
- #11559: Replace tt_lib in tests/ttnn files
- PR: #11822
- #11915: Add sweep vector tagging and related infra changes
- PR: #11970
- #0: fix fetch q write assert by using correct data offset for enqueue write buffer
- PR: #11983
- update conv path in CODEOWNERS:
- PR: #11978
- enable all enablable unit tests for convs with new api
- PR: #11981
- Fix size_t compilation failure
- PR: #12003
- Update perf and latest features for llm models (Aug 26)
- PR: #11905
- Split up n300 demo tests into functionality and performance
- PR: #11969
- #10718: Fix issue with negative pipeline queue times
- PR: #12010
- #11642: demux ttnn::typecast into ttnn::experimental::typecast on gra…
- PR: #11985
- #11569: Enable Conv2D WH unit tests for UNet shapes
- PR: #11589
- #11591: Fix race by making only unpacker zero out RISCV_DEBUG_REG_DBG_FEATURE_DISABLE at start of kernel
- PR: #12011
- Update CODEOWNERS
- PR: #12048
- Add missing include to graph_trace_utils.hpp
- PR: #12050
- #0: Always initialize l1_banking allocator even when size is 0
- PR: #12047
- update slack notification include workflow run
- PR: #12054
- #8868: Fixed conv for Stride>2
- PR: #11933
- #11430: Refactoring moreh_mean
- PR: #11776
- #11832: Remove tracking of writes per block and only track last block
- PR: #11999
- #11644: Migrate AutoFormat to TTNN Experimental
- PR: #11823
- Added ttnn.i0_bw unit test
- PR: #11891
- #11938: Refactoring moreh_bmm
- PR: #12000
- #11646: Replace ttnn.experimental.tensor.* in models/demos
- PR: #11943
- Add support for cur_pos tensor arg in sdpa decode
- PR: #11788
- #5659: Add Width Sharded support to Conv2d
- PR: #11582
- Remove noinline attribute from sdpa_decode compute kernel
- PR: #12060
- Updated sfpi compiler to address missing SFPNOP insertion
- PR: #12061
- Move compute kernel config to TTNN
- PR: #11801
- Add fold to resnet
- PR: #11940
- [BugFix] Fixed tensor::is_allocated.
- PR: #12071
- Revert "[BugFix] Fixed tensor::is_allocated."
- PR: #12082
- #8598: sinh fix
- PR: #12056
- #11646: Replace ttnn.experimental.tensor.* to ttnn.* in models/experimental, tests
- PR: #11821
- #10754: Add data-parallel support for UNet Shallow on N300
- PR: #12062
- #0: Fixed Conv2dConfig in broken tests
- PR: #12064
- #0: Falcon40b T3K demo mismatch tokens fixed
- PR: #12105
- #12069: Add catch and handling for device initialize exception, typic…
- PR: #12070
- Point metal to new UMD main branch
- PR: #12097
- Update CODEOWNERS
- PR: #12112
- #11993: Fix offset calculation for uneven shard in reshard fast path
- PR: #12083
- Update CODEOWNERS
- PR: #12114
- #12117: Refactor DeviceMesh->MeshDevice, DeviceGrid->MeshShape
- PR: #12118
- #11854: Move .umd that houses cluster descriptor to TT_METAL_HOME
- PR: #12113
- Fused AllGather+Matmul
- PR: #11760
- #12124: support moreh_nll_loss support large weight
- PR: #12126
- [Bugfix] Fixed is allocated
- PR: #12109
- #11990: Replace ttnn.experimental.tensor.* to ttnn.* in ttnn folder
- PR: #12005
- #11132 Run Post-Commit Python Tests agai...
v0.52.0-rc13
This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/10743792121
📦 Uncategorized
- #0: Remove run_operation from async_runtime.hpp
- PR: #11757
- #11640: Include simulation device in tt_cluster
- PR: #11766
- #11342: Replace tt_lib with ttnn function in experimental/functional
- PR: #11356
- #11649: update tt_lib with ttnn support for non working folder
- PR: #11654
- Perf dashboard and batching support for Mistral-7B and Llama3.1-8B
- PR: #11603
- Adding fix for llama CI failure caused by ttnn.experimental.tensor.typecast
- PR: #11765
- Fold sharded support
- PR: #11722
- #9450: add env flag to skip recompiling and reloading FW
- PR: #11681
- Move semaphores into kernel config ring buffer
- PR: #11764
- #10874: Enable test cases for concurrent instances in CCL all gather
- PR: #10885
- [Falcon7b] Remove hf reference files and import from transformers instead
- PR: #11758
- #11768: Fix watcher pause feature
- PR: #11780
- [Improvement] Added some graph names in the separate file
- PR: #11732
- Migrate CB configs into kernel config ring buffer
- PR: #11778
- #0: Feed more data to visualizer
- PR: #11400
- #11490: ttnn and tt_metal shapes are mixed
- PR: #11723
- Migrate sharded ops from TTL to TTNN
- PR: #11546
- #8865: Port ttnn ops to dispatch profiling infra
- PR: #11698
- #11700: update write_tensor with copy_host_to_device_tensor
- PR: #11701
- TTNN sweep low pic unit tests
- PR: #11775
- Add sweeps for ops: topk, frac, trunc, ceil to TTNN
- PR: #11771
- LLK Test Coverage Follow-up
- PR: #11715
- Llama3.1 70b Prefill - MLP and Attention
- PR: #11724
- #10866: Read profiler buffer with EnqueueReadBuffer in fast dispatch mode
- PR: #11781
- Lpremovic/0 expand llk ctest coverage
- PR: #11653
- #11313: Migrate layernorm_distributed to ttnn
- PR: #11696
- [Blackhole Bringup] Fixes for maxpool
- PR: #11761
- #11850: Remove Llama3.1-8B output matching to avoid blocking CI
- PR: #11851
- modify keys within device_info
- PR: #11852
- #0: remove extra arch-wormhole labels for single-card workflows
- PR: #11785
- #0: fix cloud-virtual-machine label
- PR: #11863
- #11564: added test for generating sample data with many different use cases to the visualizer
- PR: #11862
- #0: Remove llk_io.cc for WH and BH as well. GS was removed in 7b8e627
- PR: #11864
- #9527: Moving bcast to operations/data_movement
- PR: #11599
- #10332: Make ttnn::event_synchronize block only in the app thread
- PR: #11543
- #11554: Replace tt_lib in sweeps, integration_tests
- PR: #11556
- #11877: Make dispatch core order in the core descriptor match for E75 with 1 and 2 CQs
- PR: #11878
- #11845: fix worker ring direction assignment in reduce scatter
- PR: #11846
- FD Optimizations/Cleanup
- PR: #11872
- #11881: Add -Wno-vla-cxx-extension to CMake to fix build on clang18
- PR: #11882
- Revert "#11881: Add -Wno-vla-cxx-extension to CMake to fix build on clang18"
- PR: #11887
- #10163: Add backward support for remainder op
- PR: #9712
- Added ttnn.hypot_bw unit test
- PR: #11843
- #0: Add another codeowner for conv2d
- PR: #11849
- #11334: Remove unnecessary code for previous ci/cd csvs
- PR: #11898
- #0: Bump timeout for single-card perf tests to see if that helps with timeouts
- PR: #11893
- Removed "" graph_consts.hpp
- PR: #11904
- [Falcon7b] Re-enable decode perplexity test with seq len 2048
- PR: #11868
- [Falcon7b] Fix duplicate loading of rotary embeddings in prefill/decode
- PR: #11871
- [Falcon7b] Re-enable demo perf-mode tests on galaxy, update targets, prevent multinomial errors (during perf-mode) using nan-to-num
- PR: #11876
- [Blackhole Bringup] Add pack_untilize tests & fixes
- PR: #11875
- #0: Consolidate demo tests for single card and t3000 to use impls rather than copy
- PR: #11897
- Collection of small dprint/watcher changes
- PR: #11906
- #11917: disable test
- PR: #11918
- #11706: Use new Conv2D API in UNet Shallow
- PR: #11902
- #11925 Update ttnn.arange binding
- PR: #11926
- #0: Remove test include from packet_demux
- PR: #11924
- #7709: Fix exp like ops ttnn doc issues
- PR: #7879
- #11126: Resnet Demo with new conv API
- PR: #11770
- Added ttnn.argmax sweeps, API calls and unit tests
- PR: #11552
- #10515: For matmul corner case, if CBs don't fit, choose different program config
- PR: #11892
- [Mixtral8x7B] Increase demo max context length to 32k
- PR: #11777
- Added ttnn.topk unit test
- PR: #11935
- #0: (MINOR) Update to v0.52.0
- PR: #11946
- #11847: Add tt-smi reset command environment variable for sweeps
- PR: #11901
- #11000: Enable uint8 A2D and (un)pack reconfig
- PR: #11537
- #0: Do not use mount-cloud-weka label because we may no longer need it as cloud fixed it
- PR: #11941
- #0: fixed External Operation logging
- PR: #11958
- #0: Update matmul_multi_core_reuse to support mixed precision
- PR: #11947
- #11138: Move large global vars in prefetcher and dispatcher to the stack
- PR: #11922
- Enabling BH L1 data cache
- PR: #11909
- #0: Move Unary device operation to tmp
- PR: #11793
- Moved tracked methods out of tensor
- PR: #11921
- #11964: Only write branch if the repo is not detached
- PR: #11965
- #11622: add concat sweep
- PR: #11733
- #0: Refactor Python dynamic modules creation
- PR: #11798
- #0: Update resnet test infra to print total batch size for multi device
- PR: #11966
- #11930: Increase status checks
- PR: #11945
- Convs on BH
- PR: #11977
- #9630: assert out concat when concatenating along padded dimensions
- PR: #11869
- Use product codes for cards instead of arch for eager-package-main
- PR: #11976
- #11929: Move work_split_tilize
- PR: #11932
- #11693: Move DeviceModule bindings and replace ttnn.experimental APIs
- PR: #11820
- #11247: Remove in-place flag in binary operations
- PR: #11604
- #11591: Move hack delay from trisc.cc to trisck.cc before run_kernel
- PR: #11963
- #8865: Optimize softmax dispatch time
- PR: #11889
- #0: skip yolov4 failing sub_modules
- PR: #11959
- #11519: Restore path reservation for mms and convs
- PR: #11520
- #5337: Fix Mixtral total number of generated tokens in perf benchmark
- PR: #11994
- #11883: use fixed_string.size() instead of sizeof to ensure compatibility with newer versions of reflect
- PR: #11896
- #11559: Replace tt_lib in tests/ttnn files
- PR: #11822
- #11915: Add sweep vector tagging and related infra changes
- PR: #11970
- #0: fix fetch q write assert by using correct data offset for enqueue write buffer
- PR: #11983
- update conv path in CODEOWNERS:
- PR: #11978
- enable all enablable unit tests for convs with new api
- PR: #11981
- Fix size_t compilation failure
- PR: #12003
- Update perf and latest features for llm models (Aug 26)
- PR: #11905
- Split up n300 demo tests into functionality and performance
- PR: #11969
- #10718: Fix issue with negative pipeline queue times
- PR: #12010
- #11642: demux ttnn::typecast into ttnn::experimental::typecast on gra…
- PR: #11985
- #11569: Enable Conv2D WH unit tests for UNet shapes
- PR: #11589
- #11591: Fix race by making only unpacker zero out RISCV_DEBUG_REG_DBG_FEATURE_DISABLE at start of kernel
- PR: #12011
- Update CODEOWNERS
- PR: #12048
- Add missing include to graph_trace_utils.hpp
- PR: #12050
- #0: Always initialize l1_banking allocator even when size is 0
- PR: #12047
- update slack notification include workflow run
- PR: #12054
- #8868: Fixed conv for Stride>2
- PR: #11933
- #11430: Refactoring moreh_mean
- PR: #11776
- #11832: Remove tracking of writes per block and only track last block
- PR: #11999
- #11644: Migrate AutoFormat to TTNN Experimental
- PR: #11823
- Added ttnn.i0_bw unit test
- PR: #11891
- #11938: Refactoring moreh_bmm
- PR: #12000
- #11646: Replace ttnn.experimental.tensor.* in models/demos
- PR: #11943
- Add support for cur_pos tensor arg in sdpa decode
- PR: #11788
- #5659: Add Width Sharded support to Conv2d
- PR: #11582
- Remove noinline attribute from sdpa_decode compute kernel
- PR: #12060
- Updated sfpi compiler to address missing SFPNOP insertion
- PR: #12061
- Move compute kernel config to TTNN
- PR: #11801
- Add fold to resnet
- PR: #11940
- [BugFix] Fixed tensor::is_allocated.
- PR: #12071
- Revert "[BugFix] Fixed tensor::is_allocated."
- PR: #12082
- #8598: sinh fix
- PR: #12056
- #11646: Replace ttnn.experimental.tensor.* to ttnn.* in models/experimental, tests
- PR: #11821
- #10754: Add data-parallel support for UNet Shallow on N300
- PR: #12062
- #0: Fixed Conv2dConfig in broken tests
- PR: #12064
- #0: Falcon40b T3K demo mismatch tokens fixed
- PR: #12105
- #12069: Add catch and handling for device initialize exception, typic…
- PR: #12070
- Point metal to new UMD main branch
- PR: #12097
- Update CODEOWNERS
- PR: #12112
- #11993: Fix offset calculation for uneven shard in reshard fast path
- PR: #12083
- Update CODEOWNERS
- PR: #12114
- #12117: Refactor DeviceMesh->MeshDevice, DeviceGrid->MeshShape
- PR: #12118
- #11854: Move .umd that houses cluster descriptor to TT_METAL_HOME
- PR: #12113
- Fused AllGather+Matmul
- PR: #11760
- #12124: support moreh_nll_loss support large weight
- PR: #12126
- [Bugfix] Fixed is allocated
- PR: #12109
- #11990: Replace ttnn.experimental.tensor.* to ttnn.* in ttnn folder
- PR: #12005
- #11132 Run Post-Commit Python Tests agai...
v0.52.0-rc12
This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/10731919856
📦 Uncategorized
- #0: Remove run_operation from async_runtime.hpp
- PR: #11757
- #11640: Include simulation device in tt_cluster
- PR: #11766
- #11342: Replace tt_lib with ttnn function in experimental/functional
- PR: #11356
- #11649: update tt_lib with ttnn support for non working folder
- PR: #11654
- Perf dashboard and batching support for Mistral-7B and Llama3.1-8B
- PR: #11603
- Adding fix for llama CI failure caused by ttnn.experimental.tensor.typecast
- PR: #11765
- Fold sharded support
- PR: #11722
- #9450: add env flag to skip recompiling and reloading FW
- PR: #11681
- Move semaphores into kernel config ring buffer
- PR: #11764
- #10874: Enable test cases for concurrent instances in CCL all gather
- PR: #10885
- [Falcon7b] Remove hf reference files and import from transformers instead
- PR: #11758
- #11768: Fix watcher pause feature
- PR: #11780
- [Improvement] Added some graph names in the separate file
- PR: #11732
- Migrate CB configs into kernel config ring buffer
- PR: #11778
- #0: Feed more data to visualizer
- PR: #11400
- #11490: ttnn and tt_metal shapes are mixed
- PR: #11723
- Migrate sharded ops from TTL to TTNN
- PR: #11546
- #8865: Port ttnn ops to dispatch profiling infra
- PR: #11698
- #11700: update write_tensor with copy_host_to_device_tensor
- PR: #11701
- TTNN sweep low pic unit tests
- PR: #11775
- Add sweeps for ops: topk, frac, trunc, ceil to TTNN
- PR: #11771
- LLK Test Coverage Follow-up
- PR: #11715
- Llama3.1 70b Prefill - MLP and Attention
- PR: #11724
- #10866: Read profiler buffer with EnqueueReadBuffer in fast dispatch mode
- PR: #11781
- Lpremovic/0 expand llk ctest coverage
- PR: #11653
- #11313: Migrate layernorm_distributed to ttnn
- PR: #11696
- [Blackhole Bringup] Fixes for maxpool
- PR: #11761
- #11850: Remove Llama3.1-8B output matching to avoid blocking CI
- PR: #11851
- modify keys within device_info
- PR: #11852
- #0: remove extra arch-wormhole labels for single-card workflows
- PR: #11785
- #0: fix cloud-virtual-machine label
- PR: #11863
- #11564: added test for generating sample data with many different use cases to the visualizer
- PR: #11862
- #0: Remove llk_io.cc for WH and BH as well. GS was removed in 7b8e627
- PR: #11864
- #9527: Moving bcast to operations/data_movement
- PR: #11599
- #10332: Make ttnn::event_synchronize block only in the app thread
- PR: #11543
- #11554: Replace tt_lib in sweeps, integration_tests
- PR: #11556
- #11877: Make dispatch core order in the core descriptor match for E75 with 1 and 2 CQs
- PR: #11878
- #11845: fix worker ring direction assignment in reduce scatter
- PR: #11846
- FD Optimizations/Cleanup
- PR: #11872
- #11881: Add -Wno-vla-cxx-extension to CMake to fix build on clang18
- PR: #11882
- Revert "#11881: Add -Wno-vla-cxx-extension to CMake to fix build on clang18"
- PR: #11887
- #10163: Add backward support for remainder op
- PR: #9712
- Added ttnn.hypot_bw unit test
- PR: #11843
- #0: Add another codeowner for conv2d
- PR: #11849
- #11334: Remove unnecessary code for previous ci/cd csvs
- PR: #11898
- #0: Bump timeout for single-card perf tests to see if that helps with timeouts
- PR: #11893
- Removed "" graph_consts.hpp
- PR: #11904
- [Falcon7b] Re-enable decode perplexity test with seq len 2048
- PR: #11868
- [Falcon7b] Fix duplicate loading of rotary embeddings in prefill/decode
- PR: #11871
- [Falcon7b] Re-enable demo perf-mode tests on galaxy, update targets, prevent multinomial errors (during perf-mode) using nan-to-num
- PR: #11876
- [Blackhole Bringup] Add pack_untilize tests & fixes
- PR: #11875
- #0: Consolidate demo tests for single card and t3000 to use impls rather than copy
- PR: #11897
- Collection of small dprint/watcher changes
- PR: #11906
- #11917: disable test
- PR: #11918
- #11706: Use new Conv2D API in UNet Shallow
- PR: #11902
- #11925 Update ttnn.arange binding
- PR: #11926
- #0: Remove test include from packet_demux
- PR: #11924
- #7709: Fix exp like ops ttnn doc issues
- PR: #7879
- #11126: Resnet Demo with new conv API
- PR: #11770
- Added ttnn.argmax sweeps, API calls and unit tests
- PR: #11552
- #10515: For matmul corner case, if CBs don't fit, choose different program config
- PR: #11892
- [Mixtral8x7B] Increase demo max context length to 32k
- PR: #11777
- Added ttnn.topk unit test
- PR: #11935
- #0: (MINOR) Update to v0.52.0
- PR: #11946
- #11847: Add tt-smi reset command environment variable for sweeps
- PR: #11901
- #11000: Enable uint8 A2D and (un)pack reconfig
- PR: #11537
- #0: Do not use mount-cloud-weka label because we may no longer need it as cloud fixed it
- PR: #11941
- #0: fixed External Operation logging
- PR: #11958
- #0: Update matmul_multi_core_reuse to support mixed precision
- PR: #11947
- #11138: Move large global vars in prefetcher and dispatcher to the stack
- PR: #11922
- Enabling BH L1 data cache
- PR: #11909
- #0: Move Unary device operation to tmp
- PR: #11793
- Moved tracked methods out of tensor
- PR: #11921
- #11964: Only write branch is if the repo is not detached
- PR: #11965
- #11622: add concat sweep
- PR: #11733
- #0: Refactor Python dynamic modules creation
- PR: #11798
- #0: Update resnet test infra to print total batch size for multi device
- PR: #11966
- #11930: Increase status checks
- PR: #11945
- Convs on BH
- PR: #11977
- #9630: assert out concat when concatenating along padded dimensions
- PR: #11869
- Use product codes for cards instead of arch for eager-package-main
- PR: #11976
- #11929: Move work_split_tilize
- PR: #11932
- #11693: Move DeviceModule bindings and replace ttnn.experimental APIs
- PR: #11820
- #11247: Remove in-place flag in binary operations
- PR: #11604
- #11591: Move hack delay from trisc.cc to trisck.cc before run_kernel
- PR: #11963
- #8865: Optimize softmax dispatch time
- PR: #11889
- #0: skip yolov4 failing sub_modules
- PR: #11959
- #11519: Restore path reservation for mms and convs
- PR: #11520
- #5337: Fix Mixtral total number of generated tokens in perf benchmark
- PR: #11994
- #11883: use fixed_string.size() instead of sizeof to ensure compatibility with newer versions of reflect
- PR: #11896
- #11559: Replace tt_lib in tests/ttnn files
- PR: #11822
- #11915: Add sweep vector tagging and related infra changes
- PR: #11970
- #0: fix fetch q write assert by using correct data offset for enqueue write buffer
- PR: #11983
- update conv path in CODEOWNERS:
- PR: #11978
- enable all enablable unit tests for convs with new api
- PR: #11981
- Fix size_t compilation failure
- PR: #12003
- Update perf and latest features for llm models (Aug 26)
- PR: #11905
- Split up n300 demo tests into functionality and performance
- PR: #11969
- #10718: Fix issue with negative pipeline queue times
- PR: #12010
- #11642: demux ttnn::typecast into ttnn::experimental::typecast on gra…
- PR: #11985
- #11569: Enable Conv2D WH unit tests for UNet shapes
- PR: #11589
- #11591: Fix race by making only unpacker zero out RISCV_DEBUG_REG_DBG_FEATURE_DISABLE at start of kernel
- PR: #12011
- Update CODEOWNERS
- PR: #12048
- Add missing include to graph_trace_utils.hpp
- PR: #12050
- #0: Always initialize l1_banking allocator even when size is 0
- PR: #12047
- update slack notification include workflow run
- PR: #12054
- #8868: Fixed conv for Stride>2
- PR: #11933
- #11430: Refactoring moreh_mean
- PR: #11776
- #11832: Remove tracking of writes per block and only track last block
- PR: #11999
- #11644: Migrate AutoFormat to TTNN Experimental
- PR: #11823
- Added ttnn.i0_bw unit test
- PR: #11891
- #11938: Refactoring moreh_bmm
- PR: #12000
- #11646: Replace ttnn.experimental.tensor.* in models/demos
- PR: #11943
- Add support for cur_pos tensor arg in sdpa decode
- PR: #11788
- #5659: Add Width Sharded support to Conv2d
- PR: #11582
- Remove noinline attribute from sdpa_decode compute kernel
- PR: #12060
- Updated sfpi compiler to address missing SFPNOP insertion
- PR: #12061
- Move compute kernel config to TTNN
- PR: #11801
- Add fold to resnet
- PR: #11940
- [BugFix] Fixed tensor::is_allocated.
- PR: #12071
- Revert "[BugFix] Fixed tensor::is_allocated."
- PR: #12082
- #8598: sinh fix
- PR: #12056
- #11646: Replace ttnn.experimental.tensor.* to ttnn.* in models/experimental, tests
- PR: #11821
- #10754: Add data-parallel support for UNet Shallow on N300
- PR: #12062
- #0: Fixed Conv2dConfig in broken tests
- PR: #12064
- #0: Falcon40b T3K demo mismatch tokens fixed
- PR: #12105
- #12069: Add catch and handling for device initialize exception, typic…
- PR: #12070
- Point metal to new UMD main branch
- PR: #12097
- Update CODEOWNERS
- PR: #12112
- #11993: Fix offset calculation for uneven shard in reshard fast path
- PR: #12083
- Update CODEOWNERS
- PR: #12114
- #12117: Refactor DeviceMesh->MeshDevice, DeviceGrid->MeshShape
- PR: #12118
- #11854: Move .umd that houses cluster descriptor to TT_METAL_HOME
- PR: #12113
- Fused AllGather+Matmul
- PR: #11760
- #12124: support moreh_nll_loss support large weight
- PR: #12126
- [Bugfix] Fixed is allocated
- PR: #12109
- #11990: Replace ttnn.experimental.tensor.* to ttnn.* in ttnn folder
- PR: #12005
- #11132 Run Post-Commit Python Tests agai...
v0.52.0-rc11
This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/10730882573
📦 Uncategorized
- #0: Remove run_operation from async_runtime.hpp
- PR: #11757
- #11640: Include simulation device in tt_cluster
- PR: #11766
- #11342: Replace tt_lib with ttnn function in experimental/functional
- PR: #11356
- #11649: update tt_lib with ttnn support for non working folder
- PR: #11654
- Perf dashboard and batching support for Mistral-7B and Llama3.1-8B
- PR: #11603
- Adding fix for llama CI failure caused by ttnn.experimental.tensor.typecast
- PR: #11765
- Fold sharded support
- PR: #11722
- #9450: add env flag to skip recompiling and reloading FW
- PR: #11681
- Move semaphores into kernel config ring buffer
- PR: #11764
- #10874: Enable test cases for concurrent instances in CCL all gather
- PR: #10885
- [Falcon7b] Remove hf reference files and import from transformers instead
- PR: #11758
- #11768: Fix watcher pause feature
- PR: #11780
- [Improvement] Added some graph names in the separate file
- PR: #11732
- Migrate CB configs into kernel config ring buffer
- PR: #11778
- #0: Feed more data to visualizer
- PR: #11400
- #11490: ttnn and tt_metal shapes are mixed
- PR: #11723
- Migrate sharded ops from TTL to TTNN
- PR: #11546
- #8865: Port ttnn ops to dispatch profiling infra
- PR: #11698
- #11700: update write_tensor with copy_host_to_device_tensor
- PR: #11701
- TTNN sweep low pic unit tests
- PR: #11775
- Add sweeps for ops: topk, frac, trunc, ceil to TTNN
- PR: #11771
- LLK Test Coverage Follow-up
- PR: #11715
- Llama3.1 70b Prefill - MLP and Attention
- PR: #11724
- #10866: Read profiler buffer with EnqueueReadBuffer in fast dispatch mode
- PR: #11781
- Lpremovic/0 expand llk ctest coverage
- PR: #11653
- #11313: Migrate layernorm_distributed to ttnn
- PR: #11696
- [Blackhole Bringup] Fixes for maxpool
- PR: #11761
- #11850: Remove Llama3.1-8B output matching to avoid blocking CI
- PR: #11851
- modify keys within device_info
- PR: #11852
- #0: remove extra arch-wormhole labels for single-card workflows
- PR: #11785
- #0: fix cloud-virtual-machine label
- PR: #11863
- #11564: added test for generating sample data with many different use cases to the visualizer
- PR: #11862
- #0: Remove llk_io.cc for WH and BH as well. GS was removed in 7b8e627
- PR: #11864
- #9527: Moving bcast to operations/data_movement
- PR: #11599
- #10332: Make ttnn::event_synchronize block only in the app thread
- PR: #11543
- #11554: Replace tt_lib in sweeps, integration_tests
- PR: #11556
- #11877: Make dispatch core order in the core descriptor match for E75 with 1 and 2 CQs
- PR: #11878
- #11845: fix worker ring direction assignment in reduce scatter
- PR: #11846
- FD Optimizations/Cleanup
- PR: #11872
- #11881: Add -Wno-vla-cxx-extension to CMake to fix build on clang18
- PR: #11882
- Revert "#11881: Add -Wno-vla-cxx-extension to CMake to fix build on clang18"
- PR: #11887
- #10163: Add backward support for remainder op
- PR: #9712
- Added ttnn.hypot_bw unit test
- PR: #11843
- #0: Add another codeowner for conv2d
- PR: #11849
- #11334: Remove unnecessary code for previous ci/cd csvs
- PR: #11898
- #0: Bump timeout for single-card perf tests to see if that helps with timeouts
- PR: #11893
- Removed "" graph_consts.hpp
- PR: #11904
- [Falcon7b] Re-enable decode perplexity test with seq len 2048
- PR: #11868
- [Falcon7b] Fix duplicate loading of rotary embeddings in prefill/decode
- PR: #11871
- [Falcon7b] Re-enable demo perf-mode tests on galaxy, update targets, prevent multinomial errors (during perf-mode) using nan-to-num
- PR: #11876
- [Blackhole Bringup] Add pack_untilize tests & fixes
- PR: #11875
- #0: Consolidate demo tests for single card and t3000 to use impls rather than copy
- PR: #11897
- Collection of small dprint/watcher changes
- PR: #11906
- #11917: disable test
- PR: #11918
- #11706: Use new Conv2D API in UNet Shallow
- PR: #11902
- #11925 Update ttnn.arange binding
- PR: #11926
- #0: Remove test include from packet_demux
- PR: #11924
- #7709: Fix exp like ops ttnn doc issues
- PR: #7879
- #11126: Resnet Demo with new conv API
- PR: #11770
- Added ttnn.argmax sweeps, API calls and unit tests
- PR: #11552
- #10515: For matmul corner case, if CBs don't fit, choose different program config
- PR: #11892
- [Mixtral8x7B] Increase demo max context length to 32k
- PR: #11777
- Added ttnn.topk unit test
- PR: #11935
- #0: (MINOR) Update to v0.52.0
- PR: #11946
- #11847: Add tt-smi reset command environment variable for sweeps
- PR: #11901
- #11000: Enable uint8 A2D and (un)pack reconfig
- PR: #11537
- #0: Do not use mount-cloud-weka label because we may no longer need it as cloud fixed it
- PR: #11941
- #0: fixed External Operation logging
- PR: #11958
- #0: Update matmul_multi_core_reuse to support mixed precision
- PR: #11947
- #11138: Move large global vars in prefetcher and dispatcher to the stack
- PR: #11922
- Enabling BH L1 data cache
- PR: #11909
- #0: Move Unary device operation to tmp
- PR: #11793
- Moved tracked methods out of tensor
- PR: #11921
- #11964: Only write branch is if the repo is not detached
- PR: #11965
- #11622: add concat sweep
- PR: #11733
- #0: Refactor Python dynamic modules creation
- PR: #11798
- #0: Update resnet test infra to print total batch size for multi device
- PR: #11966
- #11930: Increase status checks
- PR: #11945
- Convs on BH
- PR: #11977
- #9630: assert out concat when concatenating along padded dimensions
- PR: #11869
- Use product codes for cards instead of arch for eager-package-main
- PR: #11976
- #11929: Move work_split_tilize
- PR: #11932
- #11693: Move DeviceModule bindings and replace ttnn.experimental APIs
- PR: #11820
- #11247: Remove in-place flag in binary operations
- PR: #11604
- #11591: Move hack delay from trisc.cc to trisck.cc before run_kernel
- PR: #11963
- #8865: Optimize softmax dispatch time
- PR: #11889
- #0: skip yolov4 failing sub_modules
- PR: #11959
- #11519: Restore path reservation for mms and convs
- PR: #11520
- #5337: Fix Mixtral total number of generated tokens in perf benchmark
- PR: #11994
- #11883: use fixed_string.size() instead of sizeof to ensure compatibility with newer versions of reflect
- PR: #11896
- #11559: Replace tt_lib in tests/ttnn files
- PR: #11822
- #11915: Add sweep vector tagging and related infra changes
- PR: #11970
- #0: fix fetch q write assert by using correct data offset for enqueue write buffer
- PR: #11983
- update conv path in CODEOWNERS:
- PR: #11978
- enable all enablable unit tests for convs with new api
- PR: #11981
- Fix size_t compilation failure
- PR: #12003
- Update perf and latest features for llm models (Aug 26)
- PR: #11905
- Split up n300 demo tests into functionality and performance
- PR: #11969
- #10718: Fix issue with negative pipeline queue times
- PR: #12010
- #11642: demux ttnn::typecast into ttnn::experimental::typecast on gra…
- PR: #11985
- #11569: Enable Conv2D WH unit tests for UNet shapes
- PR: #11589
- #11591: Fix race by making only unpacker zero out RISCV_DEBUG_REG_DBG_FEATURE_DISABLE at start of kernel
- PR: #12011
- Update CODEOWNERS
- PR: #12048
- Add missing include to graph_trace_utils.hpp
- PR: #12050
- #0: Always initialize l1_banking allocator even when size is 0
- PR: #12047
- update slack notification include workflow run
- PR: #12054
- #8868: Fixed conv for Stride>2
- PR: #11933
- #11430: Refactoring moreh_mean
- PR: #11776
- #11832: Remove tracking of writes per block and only track last block
- PR: #11999
- #11644: Migrate AutoFormat to TTNN Experimental
- PR: #11823
- Added ttnn.i0_bw unit test
- PR: #11891
- #11938: Refactoring moreh_bmm
- PR: #12000
- #11646: Replace ttnn.experimental.tensor.* in models/demos
- PR: #11943
- Add support for cur_pos tensor arg in sdpa decode
- PR: #11788
- #5659: Add Width Sharded support to Conv2d
- PR: #11582
- Remove noinline attribute from sdpa_decode compute kernel
- PR: #12060
- Updated sfpi compiler to address missing SFPNOP insertion
- PR: #12061
- Move compute kernel config to TTNN
- PR: #11801
- Add fold to resnet
- PR: #11940
- [BugFix] Fixed tensor::is_allocated.
- PR: #12071
- Revert "[BugFix] Fixed tensor::is_allocated."
- PR: #12082
- #8598: sinh fix
- PR: #12056
- #11646: Replace ttnn.experimental.tensor.* to ttnn.* in models/experimental, tests
- PR: #11821
- #10754: Add data-parallel support for UNet Shallow on N300
- PR: #12062
- #0: Fixed Conv2dConfig in broken tests
- PR: #12064
- #0: Falcon40b T3K demo mismatch tokens fixed
- PR: #12105
- #12069: Add catch and handling for device initialize exception, typic…
- PR: #12070
- Point metal to new UMD main branch
- PR: #12097
- Update CODEOWNERS
- PR: #12112
- #11993: Fix offset calculation for uneven shard in reshard fast path
- PR: #12083
- Update CODEOWNERS
- PR: #12114
- #12117: Refactor DeviceMesh->MeshDevice, DeviceGrid->MeshShape
- PR: #12118
- #11854: Move .umd that houses cluster descriptor to TT_METAL_HOME
- PR: #12113
- Fused AllGather+Matmul
- PR: #11760
- #12124: support moreh_nll_loss support large weight
- PR: #12126
- [Bugfix] Fixed is allocated
- PR: #12109
- #11990: Replace ttnn.experimental.tensor.* to ttnn.* in ttnn folder
- PR: #12005
- #11132 Run Post-Commit Python Tests agai...
v0.52.0-rc9
This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/10702489425
📦 Uncategorized
- #0: Remove run_operation from async_runtime.hpp
- PR: #11757
- #11640: Include simulation device in tt_cluster
- PR: #11766
- #11342: Replace tt_lib with ttnn function in experimental/functional
- PR: #11356
- #11649: update tt_lib with ttnn support for non working folder
- PR: #11654
- Perf dashboard and batching support for Mistral-7B and Llama3.1-8B
- PR: #11603
- Adding fix for llama CI failure caused by ttnn.experimental.tensor.typecast
- PR: #11765
- Fold sharded support
- PR: #11722
- #9450: add env flag to skip recompiling and reloading FW
- PR: #11681
- Move semaphores into kernel config ring buffer
- PR: #11764
- #10874: Enable test cases for concurrent instances in CCL all gather
- PR: #10885
- [Falcon7b] Remove hf reference files and import from transformers instead
- PR: #11758
- #11768: Fix watcher pause feature
- PR: #11780
- [Improvement] Added some graph names in the separate file
- PR: #11732
- Migrate CB configs into kernel config ring buffer
- PR: #11778
- #0: Feed more data to visualizer
- PR: #11400
- #11490: ttnn and tt_metal shapes are mixed
- PR: #11723
- Migrate sharded ops from TTL to TTNN
- PR: #11546
- #8865: Port ttnn ops to dispatch profiling infra
- PR: #11698
- #11700: update write_tensor with copy_host_to_device_tensor
- PR: #11701
- TTNN sweep low pic unit tests
- PR: #11775
- Add sweeps for ops: topk, frac, trunc, ceil to TTNN
- PR: #11771
- LLK Test Coverage Follow-up
- PR: #11715
- Llama3.1 70b Prefill - MLP and Attention
- PR: #11724
- #10866: Read profiler buffer with EnqueueReadBuffer in fast dispatch mode
- PR: #11781
- Lpremovic/0 expand llk ctest coverage
- PR: #11653
- #11313: Migrate layernorm_distributed to ttnn
- PR: #11696
- [Blackhole Bringup] Fixes for maxpool
- PR: #11761
- #11850: Remove Llama3.1-8B output matching to avoid blocking CI
- PR: #11851
- modify keys within device_info
- PR: #11852
- #0: remove extra arch-wormhole labels for single-card workflows
- PR: #11785
- #0: fix cloud-virtual-machine label
- PR: #11863
- #11564: added test for generating sample data with many different use cases to the visualizer
- PR: #11862
- #0: Remove llk_io.cc for WH and BH as well. GS was removed in 7b8e627
- PR: #11864
- #9527: Moving bcast to operations/data_movement
- PR: #11599
- #10332: Make ttnn::event_synchronize block only in the app thread
- PR: #11543
- #11554: Replace tt_lib in sweeps, integration_tests
- PR: #11556
- #11877: Make dispatch core order in the core descriptor match for E75 with 1 and 2 CQs
- PR: #11878
- #11845: fix worker ring direction assignment in reduce scatter
- PR: #11846
- FD Optimizations/Cleanup
- PR: #11872
- #11881: Add -Wno-vla-cxx-extension to CMake to fix build on clang18
- PR: #11882
- Revert "#11881: Add -Wno-vla-cxx-extension to CMake to fix build on clang18"
- PR: #11887
- #10163: Add backward support for remainder op
- PR: #9712
- Added ttnn.hypot_bw unit test
- PR: #11843
- #0: Add another codeowner for conv2d
- PR: #11849
- #11334: Remove unnecessary code for previous ci/cd csvs
- PR: #11898
- #0: Bump timeout for single-card perf tests to see if that helps with timeouts
- PR: #11893
- Removed "" graph_consts.hpp
- PR: #11904
- [Falcon7b] Re-enable decode perplexity test with seq len 2048
- PR: #11868
- [Falcon7b] Fix duplicate loading of rotary embeddings in prefill/decode
- PR: #11871
- [Falcon7b] Re-enable demo perf-mode tests on galaxy, update targets, prevent multinomial errors (during perf-mode) using nan-to-num
- PR: #11876
- [Blackhole Bringup] Add pack_untilize tests & fixes
- PR: #11875
- #0: Consolidate demo tests for single card and t3000 to use impls rather than copy
- PR: #11897
- Collection of small dprint/watcher changes
- PR: #11906
- #11917: disable test
- PR: #11918
- #11706: Use new Conv2D API in UNet Shallow
- PR: #11902
- #11925 Update ttnn.arange binding
- PR: #11926
- #0: Remove test include from packet_demux
- PR: #11924
- #7709: Fix exp like ops ttnn doc issues
- PR: #7879
- #11126: Resnet Demo with new conv API
- PR: #11770
- Added ttnn.argmax sweeps, API calls and unit tests
- PR: #11552
- #10515: For matmul corner case, if CBs don't fit, choose different program config
- PR: #11892
- [Mixtral8x7B] Increase demo max context length to 32k
- PR: #11777
- Added ttnn.topk unit test
- PR: #11935
- #0: (MINOR) Update to v0.52.0
- PR: #11946
- #11847: Add tt-smi reset command environment variable for sweeps
- PR: #11901
- #11000: Enable uint8 A2D and (un)pack reconfig
- PR: #11537
- #0: Do not use mount-cloud-weka label because we may no longer need it as cloud fixed it
- PR: #11941
- #0: fixed External Operation logging
- PR: #11958
- #0: Update matmul_multi_core_reuse to support mixed precision
- PR: #11947
- #11138: Move large global vars in prefetcher and dispatcher to the stack
- PR: #11922
- Enabling BH L1 data cache
- PR: #11909
- #0: Move Unary device operation to tmp
- PR: #11793
- Moved tracked methods out of tensor
- PR: #11921
- #11964: Only write branch is if the repo is not detached
- PR: #11965
- #11622: add concat sweep
- PR: #11733
- #0: Refactor Python dynamic modules creation
- PR: #11798
- #0: Update resnet test infra to print total batch size for multi device
- PR: #11966
- #11930: Increase status checks
- PR: #11945
- Convs on BH
- PR: #11977
- #9630: assert out concat when concatenating along padded dimensions
- PR: #11869
- Use product codes for cards instead of arch for eager-package-main
- PR: #11976
- #11929: Move work_split_tilize
- PR: #11932
- #11693: Move DeviceModule bindings and replace ttnn.experimental APIs
- PR: #11820
- #11247: Remove in-place flag in binary operations
- PR: #11604
- #11591: Move hack delay from trisc.cc to trisck.cc before run_kernel
- PR: #11963
- #8865: Optimize softmax dispatch time
- PR: #11889
- #0: skip yolov4 failing sub_modules
- PR: #11959
- #11519: Restore path reservation for mms and convs
- PR: #11520
- #5337: Fix Mixtral total number of generated tokens in perf benchmark
- PR: #11994
- #11883: use fixed_string.size() instead of sizeof to ensure compatibility with newer versions of reflect
- PR: #11896
- #11559: Replace tt_lib in tests/ttnn files
- PR: #11822
- #11915: Add sweep vector tagging and related infra changes
- PR: #11970
- #0: fix fetch q write assert by using correct data offset for enqueue write buffer
- PR: #11983
- update conv path in CODEOWNERS:
- PR: #11978
- enable all enablable unit tests for convs with new api
- PR: #11981
- Fix size_t compilation failure
- PR: #12003
- Update perf and latest features for llm models (Aug 26)
- PR: #11905
- Split up n300 demo tests into functionality and performance
- PR: #11969
- #10718: Fix issue with negative pipeline queue times
- PR: #12010
- #11642: demux ttnn::typecast into ttnn::experimental::typecast on gra…
- PR: #11985
- #11569: Enable Conv2D WH unit tests for UNet shapes
- PR: #11589
- #11591: Fix race by making only unpacker zero out RISCV_DEBUG_REG_DBG_FEATURE_DISABLE at start of kernel
- PR: #12011
- Update CODEOWNERS
- PR: #12048
- Add missing include to graph_trace_utils.hpp
- PR: #12050
- #0: Always initialize l1_banking allocator even when size is 0
- PR: #12047
- update slack notification include workflow run
- PR: #12054
- #8868: Fixed conv for Stride>2
- PR: #11933
- #11430: Refactoring moreh_mean
- PR: #11776
- #11832: Remove tracking of writes per block and only track last block
- PR: #11999
- #11644: Migrate AutoFormat to TTNN Experimental
- PR: #11823
- Added ttnn.i0_bw unit test
- PR: #11891
- #11938: Refactoring moreh_bmm
- PR: #12000
- #11646: Replace ttnn.experimental.tensor.* in models/demos
- PR: #11943
- Add support for cur_pos tensor arg in sdpa decode
- PR: #11788
- #5659: Add Width Sharded support to Conv2d
- PR: #11582
- Remove noinline attribute from sdpa_decode compute kernel
- PR: #12060
- Updated sfpi compiler to address missing SFPNOP insertion
- PR: #12061
- Move compute kernel config to TTNN
- PR: #11801
- Add fold to resnet
- PR: #11940
- [BugFix] Fixed tensor::is_allocated.
- PR: #12071
- Revert "[BugFix] Fixed tensor::is_allocated."
- PR: #12082
- #8598: sinh fix
- PR: #12056
- #11646: Replace ttnn.experimental.tensor.* to ttnn.* in models/experimental, tests
- PR: #11821
- #10754: Add data-parallel support for UNet Shallow on N300
- PR: #12062
- #0: Fixed Conv2dConfig in broken tests
- PR: #12064
- #0: Falcon40b T3K demo mismatch tokens fixed
- PR: #12105
- #12069: Add catch and handling for device initialize exception, typic…
- PR: #12070
- Point metal to new UMD main branch
- PR: #12097
- Update CODEOWNERS
- PR: #12112
- #11993: Fix offset calculation for uneven shard in reshard fast path
- PR: #12083
- Update CODEOWNERS
- PR: #12114
- #12117: Refactor DeviceMesh->MeshDevice, DeviceGrid->MeshShape
- PR: #12118
- #11854: Move .umd that houses cluster descriptor to TT_METAL_HOME
- PR: #12113
- Fused AllGather+Matmul
- PR: #11760
- #12124: support moreh_nll_loss support large weight
- PR: #12126
- [Bugfix] Fixed is allocated
- PR: #12109
- #11990: Replace ttnn.experimental.tensor.* to ttnn.* in ttnn folder
- PR: #12005
- #11132 Run Post-Commit Python Tests agai...
v0.52.0-rc8
This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/10693502206
📦 Uncategorized
- #0: Remove run_operation from async_runtime.hpp
- PR: #11757
- #11640: Include simulation device in tt_cluster
- PR: #11766
- #11342: Replace tt_lib with ttnn function in experimental/functional
- PR: #11356
- #11649: update tt_lib with ttnn support for non working folder
- PR: #11654
- Perf dashboard and batching support for Mistral-7B and Llama3.1-8B
- PR: #11603
- Adding fix for llama CI failure caused by ttnn.experimental.tensor.typecast
- PR: #11765
- Fold sharded support
- PR: #11722
- #9450: add env flag to skip recompiling and reloading FW
- PR: #11681
- Move semaphores into kernel config ring buffer
- PR: #11764
- #10874: Enable test cases for concurrent instances in CCL all gather
- PR: #10885
- [Falcon7b] Remove hf reference files and import from transformers instead
- PR: #11758
- #11768: Fix watcher pause feature
- PR: #11780
- [Improvement] Added some graph names in the separate file
- PR: #11732
- Migrate CB configs into kernel config ring buffer
- PR: #11778
- #0: Feed more data to visualizer
- PR: #11400
- #11490: ttnn and tt_metal shapes are mixed
- PR: #11723
- Migrate sharded ops from TTL to TTNN
- PR: #11546
- #8865: Port ttnn ops to dispatch profiling infra
- PR: #11698
- #11700: update write_tensor with copy_host_to_device_tensor
- PR: #11701
- TTNN sweep low pic unit tests
- PR: #11775
- Add sweeps for ops: topk, frac, trunc, ceil to TTNN
- PR: #11771
- LLK Test Coverage Follow-up
- PR: #11715
- Llama3.1 70b Prefill - MLP and Attention
- PR: #11724
- #10866: Read profiler buffer with EnqueueReadBuffer in fast dispatch mode
- PR: #11781
- Lpremovic/0 expand llk ctest coverage
- PR: #11653
- #11313: Migrate layernorm_distributed to ttnn
- PR: #11696
- [Blackhole Bringup] Fixes for maxpool
- PR: #11761
- #11850: Remove Llama3.1-8B output matching to avoid blocking CI
- PR: #11851
- modify keys within device_info
- PR: #11852
- #0: remove extra arch-wormhole labels for single-card workflows
- PR: #11785
- #0: fix cloud-virtual-machine label
- PR: #11863
- #11564: added test for generating sample data with many different use cases to the visualizer
- PR: #11862
- #0: Remove llk_io.cc for WH and BH as well. GS was removed in 7b8e627
- PR: #11864
- #9527: Moving bcast to operations/data_movement
- PR: #11599
- #10332: Make ttnn::event_synchronize block only in the app thread
- PR: #11543
- #11554: Replace tt_lib in sweeps, integration_tests
- PR: #11556
- #11877: Make dispatch core order in the core descriptor match for E75 with 1 and 2 CQs
- PR: #11878
- #11845: fix worker ring direction assignment in reduce scatter
- PR: #11846
- FD Optimizations/Cleanup
- PR: #11872
- #11881: Add -Wno-vla-cxx-extension to CMake to fix build on clang18
- PR: #11882
- Revert "#11881: Add -Wno-vla-cxx-extension to CMake to fix build on clang18"
- PR: #11887
- #10163: Add backward support for remainder op
- PR: #9712
- Added ttnn.hypot_bw unit test
- PR: #11843
- #0: Add another codeowner for conv2d
- PR: #11849
- #11334: Remove unnecessary code for previous ci/cd csvs
- PR: #11898
- #0: Bump timeout for single-card perf tests to see if that helps with timeouts
- PR: #11893
- Removed "" graph_consts.hpp
- PR: #11904
- [Falcon7b] Re-enable decode perplexity test with seq len 2048
- PR: #11868
- [Falcon7b] Fix duplicate loading of rotary embeddings in prefill/decode
- PR: #11871
- [Falcon7b] Re-enable demo perf-mode tests on galaxy, update targets, prevent multinomial errors (during perf-mode) using nan-to-num
- PR: #11876
- [Blackhole Bringup] Add pack_untilize tests & fixes
- PR: #11875
- #0: Consolidate demo tests for single card and t3000 to use impls rather than copy
- PR: #11897
- Collection of small dprint/watcher changes
- PR: #11906
- #11917: disable test
- PR: #11918
- #11706: Use new Conv2D API in UNet Shallow
- PR: #11902
- #11925 Update ttnn.arange binding
- PR: #11926
- #0: Remove test include from packet_demux
- PR: #11924
- #7709: Fix exp like ops ttnn doc issues
- PR: #7879
- #11126: Resnet Demo with new conv API
- PR: #11770
- Added ttnn.argmax sweeps, API calls and unit tests
- PR: #11552
- #10515: For matmul corner case, if CBs don't fit, choose different program config
- PR: #11892
- [Mixtral8x7B] Increase demo max context length to 32k
- PR: #11777
- Added ttnn.topk unit test
- PR: #11935
- #0: (MINOR) Update to v0.52.0
- PR: #11946
- #11847: Add tt-smi reset command environment variable for sweeps
- PR: #11901
- #11000: Enable uint8 A2D and (un)pack reconfig
- PR: #11537
- #0: Do not use mount-cloud-weka label because we may no longer need it as cloud fixed it
- PR: #11941
- #0: fixed External Operation logging
- PR: #11958
- #0: Update matmul_multi_core_reuse to support mixed precision
- PR: #11947
- #11138: Move large global vars in prefetcher and dispatcher to the stack
- PR: #11922
- Enabling BH L1 data cache
- PR: #11909
- #0: Move Unary device operation to tmp
- PR: #11793
- Moved tracked methods out of tensor
- PR: #11921
- #11964: Only write branch is if the repo is not detached
- PR: #11965
- #11622: add concat sweep
- PR: #11733
- #0: Refactor Python dynamic modules creation
- PR: #11798
- #0: Update resnet test infra to print total batch size for multi device
- PR: #11966
- #11930: Increase status checks
- PR: #11945
- Convs on BH
- PR: #11977
- #9630: assert out concat when concatenating along padded dimensions
- PR: #11869
- Use product codes for cards instead of arch for eager-package-main
- PR: #11976
- #11929: Move work_split_tilize
- PR: #11932
- #11693: Move DeviceModule bindings and replace ttnn.experimental APIs
- PR: #11820
- #11247: Remove in-place flag in binary operations
- PR: #11604
- #11591: Move hack delay from trisc.cc to trisck.cc before run_kernel
- PR: #11963
- #8865: Optimize softmax dispatch time
- PR: #11889
- #0: skip yolov4 failing sub_modules
- PR: #11959
- #11519: Restore path reservation for mms and convs
- PR: #11520
- #5337: Fix Mixtral total number of generated tokens in perf benchmark
- PR: #11994
- #11883: use fixed_string.size() instead of sizeof to ensure compatibility with newer versions of reflect
- PR: #11896
- #11559: Replace tt_lib in tests/ttnn files
- PR: #11822
- #11915: Add sweep vector tagging and related infra changes
- PR: #11970
- #0: fix fetch q write assert by using correct data offset for enqueue write buffer
- PR: #11983
- update conv path in CODEOWNERS:
- PR: #11978
- enable all enablable unit tests for convs with new api
- PR: #11981
- Fix size_t compilation failure
- PR: #12003
- Update perf and latest features for llm models (Aug 26)
- PR: #11905
- Split up n300 demo tests into functionality and performance
- PR: #11969
- #10718: Fix issue with negative pipeline queue times
- PR: #12010
- #11642: demux ttnn::typecast into ttnn::experimental::typecast on gra…
- PR: #11985
- #11569: Enable Conv2D WH unit tests for UNet shapes
- PR: #11589
- #11591: Fix race by making only unpacker zero out RISCV_DEBUG_REG_DBG_FEATURE_DISABLE at start of kernel
- PR: #12011
- Update CODEOWNERS
- PR: #12048
- Add missing include to graph_trace_utils.hpp
- PR: #12050
- #0: Always initialize l1_banking allocator even when size is 0
- PR: #12047
- update slack notification include workflow run
- PR: #12054
- #8868: Fixed conv for Stride>2
- PR: #11933
- #11430: Refactoring moreh_mean
- PR: #11776
- #11832: Remove tracking of writes per block and only track last block
- PR: #11999
- #11644: Migrate AutoFormat to TTNN Experimental
- PR: #11823
- Added ttnn.i0_bw unit test
- PR: #11891
- #11938: Refactoring moreh_bmm
- PR: #12000
- #11646: Replace ttnn.experimental.tensor.* in models/demos
- PR: #11943
- Add support for cur_pos tensor arg in sdpa decode
- PR: #11788
- #5659: Add Width Sharded support to Conv2d
- PR: #11582
- Remove noinline attribute from sdpa_decode compute kernel
- PR: #12060
- Updated sfpi compiler to address missing SFPNOP insertion
- PR: #12061
- Move compute kernel config to TTNN
- PR: #11801
- Add fold to resnet
- PR: #11940
- [BugFix] Fixed tensor::is_allocated.
- PR: #12071
- Revert "[BugFix] Fixed tensor::is_allocated."
- PR: #12082
- #8598: sinh fix
- PR: #12056
- #11646: Replace ttnn.experimental.tensor.* to ttnn.* in models/experimental, tests
- PR: #11821
- #10754: Add data-parallel support for UNet Shallow on N300
- PR: #12062
- #0: Fixed Conv2dConfig in broken tests
- PR: #12064
- #0: Falcon40b T3K demo mismatch tokens fixed
- PR: #12105
- #12069: Add catch and handling for device initialize exception, typic…
- PR: #12070
- Point metal to new UMD main branch
- PR: #12097
- Update CODEOWNERS
- PR: #12112
- #11993: Fix offset calculation for uneven shard in reshard fast path
- PR: #12083
- Update CODEOWNERS
- PR: #12114
- #12117: Refactor DeviceMesh->MeshDevice, DeviceGrid->MeshShape
- PR: #12118
- #11854: Move .umd that houses cluster descriptor to TT_METAL_HOME
- PR: #12113
- Fused AllGather+Matmul
- PR: #11760
- #12124: support moreh_nll_loss support large weight
- PR: #12126
- [Bugfix] Fixed is allocated
- PR: #12109
- #11990: Replace ttnn.experimental.tensor.* to ttnn.* in ttnn folder
- PR: #12005
- #11132 Run Post-Commit Python Tests agai...
v0.52.0-rc6
Note
If you are installing from a release, please refer to the README, INSTALLATION instructions, and any other documentation packaged with the release, not on the main branch. There may be differences between the latest main and the previous release.
The changelog will now follow, showing the changes from last release.
This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/10659227832
📦 Uncategorized
- #0: Remove run_operation from async_runtime.hpp
- PR: #11757
- #11640: Include simulation device in tt_cluster
- PR: #11766
- #11342: Replace tt_lib with ttnn function in experimental/functional
- PR: #11356
- #11649: update tt_lib with ttnn support for non working folder
- PR: #11654
- Perf dashboard and batching support for Mistral-7B and Llama3.1-8B
- PR: #11603
- Adding fix for llama CI failure caused by ttnn.experimental.tensor.typecast
- PR: #11765
- Fold sharded support
- PR: #11722
- #9450: add env flag to skip recompiling and reloading FW
- PR: #11681
- Move semaphores into kernel config ring buffer
- PR: #11764
- #10874: Enable test cases for concurrent instances in CCL all gather
- PR: #10885
- [Falcon7b] Remove hf reference files and import from transformers instead
- PR: #11758
- #11768: Fix watcher pause feature
- PR: #11780
- [Improvement] Added some graph names in the separate file
- PR: #11732
- Migrate CB configs into kernel config ring buffer
- PR: #11778
- #0: Feed more data to visualizer
- PR: #11400
- #11490: ttnn and tt_metal shapes are mixed
- PR: #11723
- Migrate sharded ops from TTL to TTNN
- PR: #11546
- #8865: Port ttnn ops to dispatch profiling infra
- PR: #11698
- #11700: update write_tensor with copy_host_to_device_tensor
- PR: #11701
- TTNN sweep low pic unit tests
- PR: #11775
- Add sweeps for ops: topk, frac, trunc, ceil to TTNN
- PR: #11771
- LLK Test Coverage Follow-up
- PR: #11715
- Llama3.1 70b Prefill - MLP and Attention
- PR: #11724
- #10866: Read profiler buffer with EnqueueReadBuffer in fast dispatch mode
- PR: #11781
- Lpremovic/0 expand llk ctest coverage
- PR: #11653
- #11313: Migrate layernorm_distributed to ttnn
- PR: #11696
- [Blackhole Bringup] Fixes for maxpool
- PR: #11761
- #11850: Remove Llama3.1-8B output matching to avoid blocking CI
- PR: #11851
- modify keys within device_info
- PR: #11852
- #0: remove extra arch-wormhole labels for single-card workflows
- PR: #11785
- #0: fix cloud-virtual-machine label
- PR: #11863
- #11564: added test for generating sample data with many different use cases to the visualizer
- PR: #11862
- #0: Remove llk_io.cc for WH and BH as well. GS was removed in 7b8e627
- PR: #11864
- #9527: Moving bcast to operations/data_movement
- PR: #11599
- #10332: Make ttnn::event_synchronize block only in the app thread
- PR: #11543
- #11554: Replace tt_lib in sweeps, integration_tests
- PR: #11556
- #11877: Make dispatch core order in the core descriptor match for E75 with 1 and 2 CQs
- PR: #11878
- #11845: fix worker ring direction assignment in reduce scatter
- PR: #11846
- FD Optimizations/Cleanup
- PR: #11872
- #11881: Add -Wno-vla-cxx-extension to CMake to fix build on clang18
- PR: #11882
- Revert "#11881: Add -Wno-vla-cxx-extension to CMake to fix build on clang18"
- PR: #11887
- #10163: Add backward support for remainder op
- PR: #9712
- Added ttnn.hypot_bw unit test
- PR: #11843
- #0: Add another codeowner for conv2d
- PR: #11849
- #11334: Remove unnecessary code for previous ci/cd csvs
- PR: #11898
- #0: Bump timeout for single-card perf tests to see if that helps with timeouts
- PR: #11893
- Removed "" graph_consts.hpp
- PR: #11904
- [Falcon7b] Re-enable decode perplexity test with seq len 2048
- PR: #11868
- [Falcon7b] Fix duplicate loading of rotary embeddings in prefill/decode
- PR: #11871
- [Falcon7b] Re-enable demo perf-mode tests on galaxy, update targets, prevent multinomial errors (during perf-mode) using nan-to-num
- PR: #11876
- [Blackhole Bringup] Add pack_untilize tests & fixes
- PR: #11875
- #0: Consolidate demo tests for single card and t3000 to use impls rather than copy
- PR: #11897
- Collection of small dprint/watcher changes
- PR: #11906
- #11917: disable test
- PR: #11918
- #11706: Use new Conv2D API in UNet Shallow
- PR: #11902
- #11925 Update ttnn.arange binding
- PR: #11926
- #0: Remove test include from packet_demux
- PR: #11924
- #7709: Fix exp like ops ttnn doc issues
- PR: #7879
- #11126: Resnet Demo with new conv API
- PR: #11770
- Added ttnn.argmax sweeps, API calls and unit tests
- PR: #11552
- #10515: For matmul corner case, if CBs don't fit, choose different program config
- PR: #11892
- [Mixtral8x7B] Increase demo max context length to 32k
- PR: #11777
- Added ttnn.topk unit test
- PR: #11935
- #0: (MINOR) Update to v0.52.0
- PR: #11946
- #11847: Add tt-smi reset command environment variable for sweeps
- PR: #11901
- #11000: Enable uint8 A2D and (un)pack reconfig
- PR: #11537
- #0: Do not use mount-cloud-weka label because we may no longer need it as cloud fixed it
- PR: #11941
- #0: fixed External Operation logging
- PR: #11958
- #0: Update matmul_multi_core_reuse to support mixed precision
- PR: #11947
- #11138: Move large global vars in prefetcher and dispatcher to the stack
- PR: #11922
- Enabling BH L1 data cache
- PR: #11909
- #0: Move Unary device operation to tmp
- PR: #11793
- Moved tracked methods out of tensor
- PR: #11921
- #11964: Only write branch is if the repo is not detached
- PR: #11965
- #11622: add concat sweep
- PR: #11733
- #0: Refactor Python dynamic modules creation
- PR: #11798
- #0: Update resnet test infra to print total batch size for multi device
- PR: #11966
- #11930: Increase status checks
- PR: #11945
- Convs on BH
- PR: #11977
- #9630: assert out concat when concatenating along padded dimensions
- PR: #11869
- Use product codes for cards instead of arch for eager-package-main
- PR: #11976
- #11929: Move work_split_tilize
- PR: #11932
- #11693: Move DeviceModule bindings and replace ttnn.experimental APIs
- PR: #11820
- #11247: Remove in-place flag in binary operations
- PR: #11604
- #11591: Move hack delay from trisc.cc to trisck.cc before run_kernel
- PR: #11963
- #8865: Optimize softmax dispatch time
- PR: #11889
- #0: skip yolov4 failing sub_modules
- PR: #11959
- #11519: Restore path reservation for mms and convs
- PR: #11520
- #5337: Fix Mixtral total number of generated tokens in perf benchmark
- PR: #11994
- #11883: use fixed_string.size() instead of sizeof to ensure compatibility with newer versions of reflect
- PR: #11896
- #11559: Replace tt_lib in tests/ttnn files
- PR: #11822
- #11915: Add sweep vector tagging and related infra changes
- PR: #11970
- #0: fix fetch q write assert by using correct data offset for enqueue write buffer
- PR: #11983
- update conv path in CODEOWNERS:
- PR: #11978
- enable all enablable unit tests for convs with new api
- PR: #11981
- Fix size_t compilation failure
- PR: #12003
- Update perf and latest features for llm models (Aug 26)
- PR: #11905
- Split up n300 demo tests into functionality and performance
- PR: #11969
- #10718: Fix issue with negative pipeline queue times
- PR: #12010
- #11642: demux ttnn::typecast into ttnn::experimental::typecast on gra…
- PR: #11985
- #11569: Enable Conv2D WH unit tests for UNet shapes
- PR: #11589
- #11591: Fix race by making only unpacker zero out RISCV_DEBUG_REG_DBG_FEATURE_DISABLE at start of kernel
- PR: #12011
- Update CODEOWNERS
- PR: #12048
- Add missing include to graph_trace_utils.hpp
- PR: #12050
- #0: Always initialize l1_banking allocator even when size is 0
- PR: #12047
- update slack notification include workflow run
- PR: #12054
- #8868: Fixed conv for Stride>2
- PR: #11933
- #11430: Refactoring moreh_mean
- PR: #11776
- #11832: Remove tracking of writes per block and only track last block
- PR: #11999
- #11644: Migrate AutoFormat to TTNN Experimental
- PR: #11823
- Added ttnn.i0_bw unit test
- PR: #11891
- #11938: Refactoring moreh_bmm
- PR: #12000
- #11646: Replace ttnn.experimental.tensor.* in models/demos
- PR: #11943
- Add support for cur_pos tensor arg in sdpa decode
- PR: #11788
- #5659: Add Width Sharded support to Conv2d
- PR: #11582
- Remove noinline attribute from sdpa_decode compute kernel
- PR: #12060
- Updated sfpi compiler to address missing SFPNOP insertion
- PR: #12061
- Move compute kernel config to TTNN
- PR: #11801
- Add fold to resnet
- PR: #11940
- [BugFix] Fixed tensor::is_allocated.
- PR: #12071
- Revert "[BugFix] Fixed tensor::is_allocated."
- PR: #12082
- #8598: sinh fix
- PR: #12056
- #11646: Replace ttnn.experimental.tensor.* to ttnn.* in models/experimental, tests
- PR: #11821
- #10754: Add data-parallel support for UNet Shallow on N300
- PR: #12062
- #0: Fixed Conv2dConfig in broken tests
- PR: #12064
- #0: Falcon40b T3K demo mismatch tokens fixed
- PR: #12105
- #12069: Add catch and handling for device initialize exception, typic…
- PR: #12070
- Point metal to new UMD main branch
- PR: #12097
- Update CODEOWNERS
- PR: #12112
- #11993: Fix offset calculation for uneven shard in reshard fast path
- PR: #12083
- Update CODEOWNERS
- PR: #12114
- #12117: Refactor DeviceMesh->MeshDevice, DeviceGrid->MeshShape
- PR: #12118
- #11854: Move .umd that houses cluster descriptor to TT_METAL_HOME
- PR: #12113
- Fused AllGather+Matmul
- PR: #11760
- #12124: support moreh_nll_loss support large weight
- PR: #12126
- [Bugfix] Fixed is allocated
- PR: #12109
v0.52.0-rc5
Note
If you are installing from a release, please refer to the README, INSTALLATION instructions, and any other documentation packaged with the release, not on the main branch. There may be differences between the latest main and the previous release.
The changelog will now follow, showing the changes from last release.
This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/10641311471
📦 Uncategorized
- #0: Remove run_operation from async_runtime.hpp
- PR: #11757
- #11640: Include simulation device in tt_cluster
- PR: #11766
- #11342: Replace tt_lib with ttnn function in experimental/functional
- PR: #11356
- #11649: update tt_lib with ttnn support for non working folder
- PR: #11654
- Perf dashboard and batching support for Mistral-7B and Llama3.1-8B
- PR: #11603
- Adding fix for llama CI failure caused by ttnn.experimental.tensor.typecast
- PR: #11765
- Fold sharded support
- PR: #11722
- #9450: add env flag to skip recompiling and reloading FW
- PR: #11681
- Move semaphores into kernel config ring buffer
- PR: #11764
- #10874: Enable test cases for concurrent instances in CCL all gather
- PR: #10885
- [Falcon7b] Remove hf reference files and import from transformers instead
- PR: #11758
- #11768: Fix watcher pause feature
- PR: #11780
- [Improvement] Added some graph names in the separate file
- PR: #11732
- Migrate CB configs into kernel config ring buffer
- PR: #11778
- #0: Feed more data to visualizer
- PR: #11400
- #11490: ttnn and tt_metal shapes are mixed
- PR: #11723
- Migrate sharded ops from TTL to TTNN
- PR: #11546
- #8865: Port ttnn ops to dispatch profiling infra
- PR: #11698
- #11700: update write_tensor with copy_host_to_device_tensor
- PR: #11701
- TTNN sweep low pic unit tests
- PR: #11775
- Add sweeps for ops: topk, frac, trunc, ceil to TTNN
- PR: #11771
- LLK Test Coverage Follow-up
- PR: #11715
- Llama3.1 70b Prefill - MLP and Attention
- PR: #11724
- #10866: Read profiler buffer with EnqueueReadBuffer in fast dispatch mode
- PR: #11781
- Lpremovic/0 expand llk ctest coverage
- PR: #11653
- #11313: Migrate layernorm_distributed to ttnn
- PR: #11696
- [Blackhole Bringup] Fixes for maxpool
- PR: #11761
- #11850: Remove Llama3.1-8B output matching to avoid blocking CI
- PR: #11851
- modify keys within device_info
- PR: #11852
- #0: remove extra arch-wormhole labels for single-card workflows
- PR: #11785
- #0: fix cloud-virtual-machine label
- PR: #11863
- #11564: added test for generating sample data with many different use cases to the visualizer
- PR: #11862
- #0: Remove llk_io.cc for WH and BH as well. GS was removed in 7b8e627
- PR: #11864
- #9527: Moving bcast to operations/data_movement
- PR: #11599
- #10332: Make ttnn::event_synchronize block only in the app thread
- PR: #11543
- #11554: Replace tt_lib in sweeps, integration_tests
- PR: #11556
- #11877: Make dispatch core order in the core descriptor match for E75 with 1 and 2 CQs
- PR: #11878
- #11845: fix worker ring direction assignment in reduce scatter
- PR: #11846
- FD Optimizations/Cleanup
- PR: #11872
- #11881: Add -Wno-vla-cxx-extension to CMake to fix build on clang18
- PR: #11882
- Revert "#11881: Add -Wno-vla-cxx-extension to CMake to fix build on clang18"
- PR: #11887
- #10163: Add backward support for remainder op
- PR: #9712
- Added ttnn.hypot_bw unit test
- PR: #11843
- #0: Add another codeowner for conv2d
- PR: #11849
- #11334: Remove unnecessary code for previous ci/cd csvs
- PR: #11898
- #0: Bump timeout for single-card perf tests to see if that helps with timeouts
- PR: #11893
- Removed "" graph_consts.hpp
- PR: #11904
- [Falcon7b] Re-enable decode perplexity test with seq len 2048
- PR: #11868
- [Falcon7b] Fix duplicate loading of rotary embeddings in prefill/decode
- PR: #11871
- [Falcon7b] Re-enable demo perf-mode tests on galaxy, update targets, prevent multinomial errors (during perf-mode) using nan-to-num
- PR: #11876
- [Blackhole Bringup] Add pack_untilize tests & fixes
- PR: #11875
- #0: Consolidate demo tests for single card and t3000 to use impls rather than copy
- PR: #11897
- Collection of small dprint/watcher changes
- PR: #11906
- #11917: disable test
- PR: #11918
- #11706: Use new Conv2D API in UNet Shallow
- PR: #11902
- #11925 Update ttnn.arange binding
- PR: #11926
- #0: Remove test include from packet_demux
- PR: #11924
- #7709: Fix exp like ops ttnn doc issues
- PR: #7879
- #11126: Resnet Demo with new conv API
- PR: #11770
- Added ttnn.argmax sweeps, API calls and unit tests
- PR: #11552
- #10515: For matmul corner case, if CBs don't fit, choose different program config
- PR: #11892
- [Mixtral8x7B] Increase demo max context length to 32k
- PR: #11777
- Added ttnn.topk unit test
- PR: #11935
- #0: (MINOR) Update to v0.52.0
- PR: #11946
- #11847: Add tt-smi reset command environment variable for sweeps
- PR: #11901
- #11000: Enable uint8 A2D and (un)pack reconfig
- PR: #11537
- #0: Do not use mount-cloud-weka label because we may no longer need it as cloud fixed it
- PR: #11941
- #0: fixed External Operation logging
- PR: #11958
- #0: Update matmul_multi_core_reuse to support mixed precision
- PR: #11947
- #11138: Move large global vars in prefetcher and dispatcher to the stack
- PR: #11922
- Enabling BH L1 data cache
- PR: #11909
- #0: Move Unary device operation to tmp
- PR: #11793
- Moved tracked methods out of tensor
- PR: #11921
- #11964: Only write branch is if the repo is not detached
- PR: #11965
- #11622: add concat sweep
- PR: #11733
- #0: Refactor Python dynamic modules creation
- PR: #11798
- #0: Update resnet test infra to print total batch size for multi device
- PR: #11966
- #11930: Increase status checks
- PR: #11945
- Convs on BH
- PR: #11977
- #9630: assert out concat when concatenating along padded dimensions
- PR: #11869
- Use product codes for cards instead of arch for eager-package-main
- PR: #11976
- #11929: Move work_split_tilize
- PR: #11932
- #11693: Move DeviceModule bindings and replace ttnn.experimental APIs
- PR: #11820
- #11247: Remove in-place flag in binary operations
- PR: #11604
- #11591: Move hack delay from trisc.cc to trisck.cc before run_kernel
- PR: #11963
- #8865: Optimize softmax dispatch time
- PR: #11889
- #0: skip yolov4 failing sub_modules
- PR: #11959
- #11519: Restore path reservation for mms and convs
- PR: #11520
- #5337: Fix Mixtral total number of generated tokens in perf benchmark
- PR: #11994
- #11883: use fixed_string.size() instead of sizeof to ensure compatibility with newer versions of reflect
- PR: #11896
- #11559: Replace tt_lib in tests/ttnn files
- PR: #11822
- #11915: Add sweep vector tagging and related infra changes
- PR: #11970
- #0: fix fetch q write assert by using correct data offset for enqueue write buffer
- PR: #11983
- update conv path in CODEOWNERS:
- PR: #11978
- enable all enablable unit tests for convs with new api
- PR: #11981
- Fix size_t compilation failure
- PR: #12003
- Update perf and latest features for llm models (Aug 26)
- PR: #11905
- Split up n300 demo tests into functionality and performance
- PR: #11969
- #10718: Fix issue with negative pipeline queue times
- PR: #12010
- #11642: demux ttnn::typecast into ttnn::experimental::typecast on gra…
- PR: #11985
- #11569: Enable Conv2D WH unit tests for UNet shapes
- PR: #11589
- #11591: Fix race by making only unpacker zero out RISCV_DEBUG_REG_DBG_FEATURE_DISABLE at start of kernel
- PR: #12011
- Update CODEOWNERS
- PR: #12048
- Add missing include to graph_trace_utils.hpp
- PR: #12050
- #0: Always initialize l1_banking allocator even when size is 0
- PR: #12047
- update slack notification include workflow run
- PR: #12054
- #8868: Fixed conv for Stride>2
- PR: #11933
- #11430: Refactoring moreh_mean
- PR: #11776
- #11832: Remove tracking of writes per block and only track last block
- PR: #11999
- #11644: Migrate AutoFormat to TTNN Experimental
- PR: #11823
- Added ttnn.i0_bw unit test
- PR: #11891
- #11938: Refactoring moreh_bmm
- PR: #12000
- #11646: Replace ttnn.experimental.tensor.* in models/demos
- PR: #11943
- Add support for cur_pos tensor arg in sdpa decode
- PR: #11788
- #5659: Add Width Sharded support to Conv2d
- PR: #11582
- Remove noinline attribute from sdpa_decode compute kernel
- PR: #12060
- Updated sfpi compiler to address missing SFPNOP insertion
- PR: #12061
- Move compute kernel config to TTNN
- PR: #11801
- Add fold to resnet
- PR: #11940
- [BugFix] Fixed tensor::is_allocated.
- PR: #12071
- Revert "[BugFix] Fixed tensor::is_allocated."
- PR: #12082
- #8598: sinh fix
- PR: #12056
- #11646: Replace ttnn.experimental.tensor.* to ttnn.* in models/experimental, tests
- PR: #11821
- #10754: Add data-parallel support for UNet Shallow on N300
- PR: #12062
- #0: Fixed Conv2dConfig in broken tests
- PR: #12064
- #0: Falcon40b T3K demo mismatch tokens fixed
- PR: #12105
- #12069: Add catch and handling for device initialize exception, typic…
- PR: #12070
- Point metal to new UMD main branch
- PR: #12097
- Update CODEOWNERS
- PR: #12112
- #11993: Fix offset calculation for uneven shard in reshard fast path
- PR: #12083
- Update CODEOWNERS
- PR: #12114
- #12117: Refactor DeviceMesh->MeshDevice, DeviceGrid->MeshShape
- PR: #12118
- #11854: Move .umd that houses cluster descriptor to TT_METAL_HOME
- PR: #12113
v0.52.0-rc4
Note
If you are installing from a release, please refer to the README, INSTALLATION instructions, and any other documentation packaged with the release, not on the main branch. There may be differences between the latest main and the previous release.
The changelog will now follow, showing the changes from last release.
This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/10632943556
📦 Uncategorized
- #0: Remove run_operation from async_runtime.hpp
- PR: #11757
- #11640: Include simulation device in tt_cluster
- PR: #11766
- #11342: Replace tt_lib with ttnn function in experimental/functional
- PR: #11356
- #11649: update tt_lib with ttnn support for non working folder
- PR: #11654
- Perf dashboard and batching support for Mistral-7B and Llama3.1-8B
- PR: #11603
- Adding fix for llama CI failure caused by ttnn.experimental.tensor.typecast
- PR: #11765
- Fold sharded support
- PR: #11722
- #9450: add env flag to skip recompiling and reloading FW
- PR: #11681
- Move semaphores into kernel config ring buffer
- PR: #11764
- #10874: Enable test cases for concurrent instances in CCL all gather
- PR: #10885
- [Falcon7b] Remove hf reference files and import from transformers instead
- PR: #11758
- #11768: Fix watcher pause feature
- PR: #11780
- [Improvement] Added some graph names in the separate file
- PR: #11732
- Migrate CB configs into kernel config ring buffer
- PR: #11778
- #0: Feed more data to visualizer
- PR: #11400
- #11490: ttnn and tt_metal shapes are mixed
- PR: #11723
- Migrate sharded ops from TTL to TTNN
- PR: #11546
- #8865: Port ttnn ops to dispatch profiling infra
- PR: #11698
- #11700: update write_tensor with copy_host_to_device_tensor
- PR: #11701
- TTNN sweep low pic unit tests
- PR: #11775
- Add sweeps for ops: topk, frac, trunc, ceil to TTNN
- PR: #11771
- LLK Test Coverage Follow-up
- PR: #11715
- Llama3.1 70b Prefill - MLP and Attention
- PR: #11724
- #10866: Read profiler buffer with EnqueueReadBuffer in fast dispatch mode
- PR: #11781
- Lpremovic/0 expand llk ctest coverage
- PR: #11653
- #11313: Migrate layernorm_distributed to ttnn
- PR: #11696
- [Blackhole Bringup] Fixes for maxpool
- PR: #11761
- #11850: Remove Llama3.1-8B output matching to avoid blocking CI
- PR: #11851
- modify keys within device_info
- PR: #11852
- #0: remove extra arch-wormhole labels for single-card workflows
- PR: #11785
- #0: fix cloud-virtual-machine label
- PR: #11863
- #11564: added test for generating sample data with many different use cases to the visualizer
- PR: #11862
- #0: Remove llk_io.cc for WH and BH as well. GS was removed in 7b8e627
- PR: #11864
- #9527: Moving bcast to operations/data_movement
- PR: #11599
- #10332: Make ttnn::event_synchronize block only in the app thread
- PR: #11543
- #11554: Replace tt_lib in sweeps, integration_tests
- PR: #11556
- #11877: Make dispatch core order in the core descriptor match for E75 with 1 and 2 CQs
- PR: #11878
- #11845: fix worker ring direction assignment in reduce scatter
- PR: #11846
- FD Optimizations/Cleanup
- PR: #11872
- #11881: Add -Wno-vla-cxx-extension to CMake to fix build on clang18
- PR: #11882
- Revert "#11881: Add -Wno-vla-cxx-extension to CMake to fix build on clang18"
- PR: #11887
- #10163: Add backward support for remainder op
- PR: #9712
- Added ttnn.hypot_bw unit test
- PR: #11843
- #0: Add another codeowner for conv2d
- PR: #11849
- #11334: Remove unnecessary code for previous ci/cd csvs
- PR: #11898
- #0: Bump timeout for single-card perf tests to see if that helps with timeouts
- PR: #11893
- Removed "" graph_consts.hpp
- PR: #11904
- [Falcon7b] Re-enable decode perplexity test with seq len 2048
- PR: #11868
- [Falcon7b] Fix duplicate loading of rotary embeddings in prefill/decode
- PR: #11871
- [Falcon7b] Re-enable demo perf-mode tests on galaxy, update targets, prevent multinomial errors (during perf-mode) using nan-to-num
- PR: #11876
- [Blackhole Bringup] Add pack_untilize tests & fixes
- PR: #11875
- #0: Consolidate demo tests for single card and t3000 to use impls rather than copy
- PR: #11897
- Collection of small dprint/watcher changes
- PR: #11906
- #11917: disable test
- PR: #11918
- #11706: Use new Conv2D API in UNet Shallow
- PR: #11902
- #11925 Update ttnn.arange binding
- PR: #11926
- #0: Remove test include from packet_demux
- PR: #11924
- #7709: Fix exp like ops ttnn doc issues
- PR: #7879
- #11126: Resnet Demo with new conv API
- PR: #11770
- Added ttnn.argmax sweeps, API calls and unit tests
- PR: #11552
- #10515: For matmul corner case, if CBs don't fit, choose different program config
- PR: #11892
- [Mixtral8x7B] Increase demo max context length to 32k
- PR: #11777
- Added ttnn.topk unit test
- PR: #11935
- #0: (MINOR) Update to v0.52.0
- PR: #11946
- #11847: Add tt-smi reset command environment variable for sweeps
- PR: #11901
- #11000: Enable uint8 A2D and (un)pack reconfig
- PR: #11537
- #0: Do not use mount-cloud-weka label because we may no longer need it as cloud fixed it
- PR: #11941
- #0: fixed External Operation logging
- PR: #11958
- #0: Update matmul_multi_core_reuse to support mixed precision
- PR: #11947
- #11138: Move large global vars in prefetcher and dispatcher to the stack
- PR: #11922
- Enabling BH L1 data cache
- PR: #11909
- #0: Move Unary device operation to tmp
- PR: #11793
- Moved tracked methods out of tensor
- PR: #11921
- #11964: Only write branch is if the repo is not detached
- PR: #11965
- #11622: add concat sweep
- PR: #11733
- #0: Refactor Python dynamic modules creation
- PR: #11798
- #0: Update resnet test infra to print total batch size for multi device
- PR: #11966
- #11930: Increase status checks
- PR: #11945
- Convs on BH
- PR: #11977
- #9630: assert out concat when concatenating along padded dimensions
- PR: #11869
- Use product codes for cards instead of arch for eager-package-main
- PR: #11976
- #11929: Move work_split_tilize
- PR: #11932
- #11693: Move DeviceModule bindings and replace ttnn.experimental APIs
- PR: #11820
- #11247: Remove in-place flag in binary operations
- PR: #11604
- #11591: Move hack delay from trisc.cc to trisck.cc before run_kernel
- PR: #11963
- #8865: Optimize softmax dispatch time
- PR: #11889
- #0: skip yolov4 failing sub_modules
- PR: #11959
- #11519: Restore path reservation for mms and convs
- PR: #11520
- #5337: Fix Mixtral total number of generated tokens in perf benchmark
- PR: #11994
- #11883: use fixed_string.size() instead of sizeof to ensure compatibility with newer versions of reflect
- PR: #11896
- #11559: Replace tt_lib in tests/ttnn files
- PR: #11822
- #11915: Add sweep vector tagging and related infra changes
- PR: #11970
- #0: fix fetch q write assert by using correct data offset for enqueue write buffer
- PR: #11983
- update conv path in CODEOWNERS:
- PR: #11978
- enable all enablable unit tests for convs with new api
- PR: #11981
- Fix size_t compilation failure
- PR: #12003
- Update perf and latest features for llm models (Aug 26)
- PR: #11905
- Split up n300 demo tests into functionality and performance
- PR: #11969
- #10718: Fix issue with negative pipeline queue times
- PR: #12010
- #11642: demux ttnn::typecast into ttnn::experimental::typecast on gra…
- PR: #11985
- #11569: Enable Conv2D WH unit tests for UNet shapes
- PR: #11589
- #11591: Fix race by making only unpacker zero out RISCV_DEBUG_REG_DBG_FEATURE_DISABLE at start of kernel
- PR: #12011
- Update CODEOWNERS
- PR: #12048
- Add missing include to graph_trace_utils.hpp
- PR: #12050
- #0: Always initialize l1_banking allocator even when size is 0
- PR: #12047
- update slack notification include workflow run
- PR: #12054
- #8868: Fixed conv for Stride>2
- PR: #11933
- #11430: Refactoring moreh_mean
- PR: #11776
- #11832: Remove tracking of writes per block and only track last block
- PR: #11999
- #11644: Migrate AutoFormat to TTNN Experimental
- PR: #11823
- Added ttnn.i0_bw unit test
- PR: #11891
- #11938: Refactoring moreh_bmm
- PR: #12000
- #11646: Replace ttnn.experimental.tensor.* in models/demos
- PR: #11943
- Add support for cur_pos tensor arg in sdpa decode
- PR: #11788
- #5659: Add Width Sharded support to Conv2d
- PR: #11582
- Remove noinline attribute from sdpa_decode compute kernel
- PR: #12060
- Updated sfpi compiler to address missing SFPNOP insertion
- PR: #12061
- Move compute kernel config to TTNN
- PR: #11801
- Add fold to resnet
- PR: #11940
- [BugFix] Fixed tensor::is_allocated.
- PR: #12071
- Revert "[BugFix] Fixed tensor::is_allocated."
- PR: #12082
- #8598: sinh fix
- PR: #12056
- #11646: Replace ttnn.experimental.tensor.* to ttnn.* in models/experimental, tests
- PR: #11821
- #10754: Add data-parallel support for UNet Shallow on N300
- PR: #12062