Releases: tenstorrent/tt-metal

v0.51.0-rc5

17 Jul 18:24
Pre-release

Note

If you are installing from a release, please refer to the README, INSTALLATION instructions, and any other documentation packaged with the release, rather than the documentation on the main branch. There may be differences between the latest main and the previous release.

The changelog follows, showing the changes since the last release.

This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/9978853510

📦 Uncategorized

  • Migrate Pad Device and All references
  • #0: Multi-CQ support for R-Chip
  • #10028: Remove skip and reduce test case for moreh_groupnorm test
  • #10005: Change input tensor parameter to optional in moreh_sum_backward
  • #10004: Revise bias tensor usage in moreh_linear_backward
  • #9663: support moreh_nll_loss_unreduced
  • #8865: Switch ported ops from tt_lib to ttnn for host dispatch time m…
  • #0: Update README.md grammar for idiomatic description of TT-NN
  • #9767: removed more no longer needed manually specified attributes for reflection
  • Add distributed layernorm kernel documentation
  • #10031: Fix -Werror=return-type error in composite_ops
  • #9492: update matmul path in CODEOWNERS
  • #9450: change silicon fixtures to session scope
  • Uplift UMD to grab support for configuring static TLBs and Hugepage for BH
  • #9441: add all typecasts to unit test
  • #9801: Add cb alignment fix for blackhole that was missed in rebase
  • #9973: Fix addrmod for reduce scalar, port over missing narrow tile c…
  • #10052: Add metal pack untilize test
  • Add ttnn matmul tests to TG unit tests
  • Add ssm_prefix_scan test coverage for N=16
  • Add PyBind to TTNN Slice (Formerly Referred to as Unpad in TT Lib)
  • #8450: Cleanup items pending from PR #9068
  • #10030: fix moreh_nll_loss hang
  • #7736: Remove unused reduce dim & type from reduce_init*
  • #9871: Update backward files
  • #9874: Move Unary Backward ops to TTNN
  • Update op_perf_results
  • #9962: Enable flags for profiler globals in jit build
  • Added prefill mode for mamba modules
  • Increase timeout for Mamba full model tests
  • Support multiple user indices in paged_update_cache
  • #10085: Make ttnn::Buffer deallocate execute without querying a potentially destroyed buffer instance
  • Pack runtime arguments across brisc/ncrisc/trisc
  • Llama Demo Refactor
  • #5424: Delegated sfpu reciprocal calls to wh_b0 submodule functions
  • #0: Move t3k demo tests to perf pipeline because it requires perf governor
  • #5424: Delegated sfpu reciprocal calls to gs submodule functions
  • Add trace and multi cq implementations/tests for WH Resnet
  • #0: (MINOR) Update to v0.51.0
  • #0: bump python3.8 venv versioning since apt repos updated
  • #10099: fix semaphores init for packet mux/demux
  • #10112: Drop hard pin for installation instructions for python3.8-venv in dependencies
  • Revert "#5424: Delegated sfpu reciprocal calls to wh_b0 submodule functions"
  • #0: Remove stray assert forcing single CQ on R-Chips
  • #9490: Replace tt_dnn op's usage in C++ with TTNN
  • #9874: Merge Next set of unary backward ops to TTNN
  • #10073: Move unary backward ops to TTNN
  • Unary backward op migration
  • #10087: update tt-umd submodule
  • #9959: Migrated pad to ttnn sweeps
  • Adding distributed layernorm to llama prefill
  • Add pytest xdist multiprocess to single-chip demo tests
  • Revert "Revert "#5424: Delegated sfpu reciprocal calls to wh_b0 submodule functions""
  • #10071: Move second set of Unary Backward ops to TTNN
  • #10083: added tt::stl::json::to_json and tt::stl::json::from_json
  • #10086: Add logic for splitting cmds that exceed the subcmd limit into separate cmds for semaphores
  • #5424: Delegated sqrt api call to thirdparty gs submodule sqrt call
  • #5424: Delegated sfpu api call to sqrt for wh to submodule sqrt call
  • #0: Fix galaxy eth dispatch init to only init the specified number of cqs (galaxy only supports single cq)
  • Fix undefined memory bug in ssm_prefix_scan
  • removed weight copies from DRAM to L1
  • fix syntax issues with test dispatch workflow
  • #9609: Reorganize libs into ttnn
  • #10165: Fix build error with g++-12
  • Adding support for dram sharded matmuls
  • #10076: Migrate Unary bw ops and replace tt_eager ops with ttnn ops
  • #10072: Move next set of Unary Backward ops to TTNN
  • #9082: ping individual falcon member since slack user group is not wo…
  • #8681: Add Floor, Trunc blocker ops
  • #9419: use memcpy to avoid mem misalignment
  • #10079: Move Unary Backward ops to TTNN
  • Migrate unary ops to TTNN
  • #9945: Skip SD for nightly FD, device perf tests, and single-card demos as it hangs on di/dt
  • #10045: use struct for matmul parameter passing and update doc string
  • #10045: remove use_1d_systolic_array from ttnn matmul
  • Ngrujic/profiling
  • #9319: Upload benchmark data for t3k falcon 7b tests
  • Aliu/build opt
  • #10107: Fix hangs w/ launch_msg size >32bytes
  • [CCL] Making buffer size dynamic to input slice
  • #7617: remove failing experimental model test
  • #7618: delete failing experimental model test
  • #0: fix prefill CI for mamba
  • Move Mamba tests to wh_b0_only_eth pipeline
  • #9747: Implement ttnn::tilize in C++
  • Aliu/prevent aho tanking
  • #10045: fix up missed parameter change in mamba block model
  • #9490: Added ttnn support for unary ops py file
  • #10101: [Blackhole Bringup] Revert Zeroacc to legacy behaviour
  • Update README.md
  • #0: Fix imports after tt_lib change
  • #10226: [Blackhole Bringup] Add new sfpu files
  • Suppress g++-12 build errors with -Wno flags
  • #0: Fix BH regression caused by unaligned L1_UNRESERVED_BASE
  • #10077: Migrate Unary comparison backward ops to TTNN with Overloading
  • #10175: Remove std::function and restructure ternary_bw
  • Falcon40b attn mask optimization
  • #10074: Move Unary backward ops to TTNN
  • Replace all TT Lib Unpad with TTNN Slice (see the slice sketch after this list)
  • #10082: Migrate unary bw ops to TTNN and remove std::function
  • #9715: Use build artifacts for profiler tests
  • #9021: adding resnet api into ci.
  • Update README.md
  • Move pad_on_host/unpad_on_host to host function in TTNN
  • #9874: Move polygamma_bw to TTNN
  • #5337: increase t3k frequent test timeout
  • Update falcon40b readme
  • #0: add layernorm rmsnorm pybind, move to ttnn
  • #0: Re-enable read cache in llama_model_optimized.
  • Update Mistral/Mixtral README files
  • #0: Update LLama2/3 readme with demo details
  • #0: resnet perf fix
  • Update Mamba README.md
  • OPT convs in RN50 to get better device perf
  • Increase timeout for N300 WH-only model pipeline
  • Prefill+Decode Demo Functional Implementation
  • [Falcon7b] Add wormhole demo perf mode and output verification tests
  • Update Falcon7/40b READMEs with details on model functionality and perf-mode
  • bump python 3.8 venv package version
  • Git bisect workflow on CI runners
  • #9613: scaffolding for weekly scheduled t3k perplexity tests
  • fix syntax issue with bisect script
  • #10231: Clean up t3k runs-on tags to minimum
  • #9490: Remove tt_eager unary ops and bindings
  • only build for arch that a dispatched workflow is running for
  • Allow overloading of job name with user-defined name for new dispatch workflows
  • #10242: Migrate unary bw ops with a generalized structure to TTNN
  • #10322: commented out failing t3k tests
  • #9491: Add structure for ternary ops in ttnn
  • Move downsample from tt_eager to ttnn
  • #10250: Migrate unary backward ops with a generalized structure to TTNN
  • #10280: Mistral README update
  • #9911: Add structure and migrate 20 composite unary ops
  • #0: fix rn50 block padding
  • #10300: get the correct operation id on subsequent run
  • #0: Move host tensor construction for halo into create_program to only happen on uncached runs
  • Flash decode v2
  • #9751: Restructure ttnn transformers to new folder structure
  • #10181: Disable test_reduce_h due to sporadic failures in slow dispatch
  • #10181: Disable test_reduce_h
  • Update README.md
…
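
The slice sketch referenced in the entry above: a minimal Python sketch of ttnn.slice, which replaces the old tt_lib unpad across this release series. It assumes a working ttnn build with one available device; the exclusive-end bounds convention is an assumption, not something these notes specify.

```python
# Minimal sketch of ttnn.slice (formerly tt_lib unpad).
# Assumes one available device; the [start, end) bounds convention
# is an assumption, not taken from these release notes.
import torch
import ttnn

device = ttnn.open_device(device_id=0)

host = torch.rand(1, 1, 64, 64, dtype=torch.bfloat16)
t = ttnn.from_torch(host, layout=ttnn.TILE_LAYOUT, device=device)

# Keep the top-left 32x32 tile of the last two dims (tile-aligned bounds).
sliced = ttnn.slice(t, (0, 0, 0, 0), (1, 1, 32, 32))
print(ttnn.to_torch(sliced).shape)  # expected: torch.Size([1, 1, 32, 32])

ttnn.close_device(device)
```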

v0.51.0-rc4

17 Jul 02:18
Pre-release

📦 Uncategorized

  • Migrate Pad Device and All references
  • #0: Multi-CQ support for R-Chip
  • #10028: Remove skip and reduce test case for moreh_groupnorm test
  • #10005: Change input tensor parameter to optional in moreh_sum_backward
  • #10004: Revise bias tensor usage in moreh_linear_backward
  • #9663: support moreh_nll_loss_unreduced
  • #8865: Switch ported ops from tt_lib to ttnn for host dispatch time m…
  • #0: Update README.md grammar for idiomatic description of TT-NN
  • #9767: removed more no longer needed manually specified attributes for reflection
  • Add distributed layernorm kernel documentation
  • #10031: Fix -Werror=return-type error in composite_ops
  • #9492: update matmul path in CODEOWNERS
  • #9450: change silicon fixtures to session scope
  • Uplift UMD to grab support for configuring static TLBs and Hugepage for BH
  • #9441: add all typecasts to unit test
  • #9801: Add cb alignment fix for blackhole that was missed in rebase
  • #9973: Fix addrmod for reduce scalar, port over missing narrow tile c…
  • #10052: Add metal pack untilize test
  • Add ttnn matmul tests to TG unit tests
  • Add ssm_prefix_scan test coverage for N=16
  • Add PyBind to TTNN Slice (Formerly Referred to as Unpad in TT Lib)
  • #8450: Cleanup items pending from PR #9068
  • #10030: fix moreh_nll_loss hang
  • #7736: Remove unused reduce dim & type from reduce_init*
  • #9871: Update backward files
  • #9874: Move Unary Backward ops to TTNN
  • Update op_perf_results
  • #9962: Enable flags for profiler globals in jit build
  • Added prefill mode for mamba modules
  • Increase timeout for Mamba full model tests
  • Support multiple user indices in paged_update_cache
  • #10085: Make ttnn::Buffer deallocate execute without querying a potentially destroyed buffer instance
  • Pack runtime arguments across brisc/ncrisc/trisc
  • Llama Demo Refactor
  • #5424: Delegated sfpu reciprocal calls to wh_b0 submodule functions
  • #0: Move t3k demo tests to perf pipeline because it requires perf governor
  • #5424: Delegated sfpu reciprocal calls to gs submodule functions
  • Add trace and multi cq implementations/tests for WH Resnet
  • #0: (MINOR) Update to v0.51.0
  • #0: bump python3.8 venv versioning since apt repos updated
  • #10099: fix semaphores init for packet mux/demux
  • #10112: Drop hard pin for installation instructions for python3.8-venv in dependencies
  • Revert "#5424: Delegated sfpu reciprocal calls to wh_b0 submodule functions"
  • #0: Remove stray assert forcing single CQ on R-Chips
  • #9490: Replace tt_dnn op's usage in C++ with TTNN
  • #9874: Merge Next set of unary backward ops to TTNN
  • #10073: Move unary backward ops to TTNN
  • Unary backward op migration
  • #10087: update tt-umd submodule
  • #9959: Migrated pad to ttnn sweeps
  • Adding distributed layernorm to llama prefill
  • Add pytest xdist multiprocess to single-chip demo tests
  • Revert "Revert "#5424: Delegated sfpu reciprocal calls to wh_b0 submodule functions""
  • #10071: Move second set of Unary Backward ops to TTNN
  • #10083: added tt::stl::json::to_json and tt::stl::json::from_json
  • #10086: Add logic for splitting cmds that exceed the subcmd limit into separate cmds for semaphores
  • #5424: Delegated sqrt api call to thirdparty gs submodule sqrt call
  • #5424: Delegated sfpu api call to sqrt for wh to submodule sqrt call
  • #0: Fix galaxy eth dispatch init to only init the specified number of cqs (galaxy only supports single cq)
  • Fix undefined memory bug in ssm_prefix_scan
  • removed weight copies from DRAM to L1
  • fix syntax issues with test dispatch workflow
  • #9609: Reorganize libs into ttnn
  • #10165: Fix build error with g++-12
  • Adding support for dram sharded matmuls
  • #10076: Migrate Unary bw ops and replace tt_eager ops with ttnn ops
  • #10072: Move next set of Unary Backward ops to TTNN
  • #9082: ping individual falcon member since slack user group is not wo…
  • #8681: Add Floor, Trunc blocker ops
  • #9419: use memcpy to avoid mem misalignment
  • #10079: Move Unary Backward ops to TTNN
  • Migrate unary ops to TTNN
  • #9945: Skip SD for nightly FD, device perf tests, and single-card demos as it hangs on di/dt
  • #10045: use struct for matmul parameter passing and update doc string
  • #10045: remove use_1d_systolic_array from ttnn matmul
  • Ngrujic/profiling
  • #9319: Upload benchmark data for t3k falcon 7b tests
  • Aliu/build opt
  • #10107: Fix hangs w/ launch_msg size >32bytes
  • [CCL] Making buffer size dynamic to input slice
  • #7617: remove failing experimental model test
  • #7618: delete failing experimental model test
  • #0: fix prefill CI for mamba
  • Move Mamba tests to wh_b0_only_eth pipeline
  • #9747: Implement ttnn::tilize in C++
  • Aliu/prevent aho tanking
  • #10045: fix up missed parameter change in mamba block model
  • #9490: Added ttnn support for unary ops py file
  • #10101: [Blackhole Bringup] Revert Zeroacc to legacy behaviour
  • Update README.md
  • #0: Fix imports after tt_lib change
  • #10226: [Blackhole Bringup] Add new sfpu files
  • Suppress g++-12 build errors with -Wno flags
  • #0: Fix BH regression caused by unaligned L1_UNRESERVED_BASE
  • #10077: Migrate Unary comparison backward ops to TTNN with Overloading
  • #10175: Remove std::function and restructure ternary_bw
  • Falcon40b attn mask optimization
  • #10074: Move Unary backward ops to TTNN
  • Replace all TT Lib Unpad with TTNN Slice
  • #10082: Migrate unary bw ops to TTNN and remove std::function
  • #9715: Use build artifacts for profiler tests
  • #9021: adding resnet api into ci.
  • Update README.md
  • Move pad_on_host/unpad_on_host to host function in TTNN
  • #9874: Move polygamma_bw to TTNN
  • #5337: increase t3k frequent test timeout
  • Update falcon40b readme
  • #0: add layernorm rmsnorm pybind, move to ttnn
  • #0: Re-enable read cache in llama_model_optimized.
  • Update Mistral/Mixtral README files
  • #0: Update LLama2/3 readme with demo details
  • #0: resnet perf fix
  • Update Mamba README.md
  • OPT convs in RN50 to get better device perf
  • Increase timeout for N300 WH-only model pipeline
  • Prefill+Decode Demo Functional Implementation
  • [Falcon7b] Add wormhole demo perf mode and output verification tests
  • Update Falcon7/40b READMEs with details on model functionality and perf-mode
  • bump python 3.8 venv package version
  • Git bisect workflow on CI runners
  • #9613: scaffolding for weekly scheduled t3k perplexity tests
  • fix syntax issue with bisect script
  • #10231: Clean up t3k runs-on tags to minimum
  • #9490: Remove tt_eager unary ops and bindings
  • only build for arch that a dispatched workflow is running for
  • Allow overloading of job name with user-defined name for new dispatch workflows
  • #10242: Migrate unary bw ops with a generalized structure to TTNN
  • #10322: commented out failing t3k tests
  • #9491: Add structure for ternary ops in ttnn
  • Move downsample from tt_eager to ttnn
  • #10250: Migrate unary backward ops with a generalized structure to TTNN
  • #10280: Mistral README update
  • #9911: Add structure and migrate 20 composite unary ops
  • #0: fix rn50 block padding
  • #10300: get the correct operation id on subsequent run
  • #0: Move host tensor construction for halo into create_program to only happen on uncached runs
  • Flash decode v2
  • #9751: Restructure ttnn transformers to new folder structure
  • #10181: Disable test_reduce_h due to sporadic failures in slow dispatch
  • #10181: Disable test_reduce_h
  • Update README.md
  • move groupnorm from ttlib to ttnn
  • Update README.md
  • Update README.md - missing footnote
  • #0: Update ttnn resnet 2cq bound due to variability

v0.51.0-rc3

16 Jul 02:20
e1835e2
Pre-release

📦 Uncategorized

  • Migrate Pad Device and All references
  • #0: Multi-CQ support for R-Chip
  • #10028: Remove skip and reduce test case for moreh_groupnorm test
  • #10005: Change input tensor parameter to optional in moreh_sum_backward
  • #10004: Revise bias tensor usage in moreh_linear_backward
  • #9663: support moreh_nll_loss_unreduced
  • #8865: Switch ported ops from tt_lib to ttnn for host dispatch time m…
  • #0: Update README.md grammar for idiomatic description of TT-NN
  • #9767: removed more no longer needed manually specified attributes for reflection
  • Add distributed layernorm kernel documentation
  • #10031: Fix -Werror=return-type error in composite_ops
  • #9492: update matmul path in CODEOWNERS
  • #9450: change silicon fixtures to session scope
  • Uplift UMD to grab support for configuring static TLBs and Hugepage for BH
  • #9441: add all typecasts to unit test
  • #9801: Add cb alignment fix for blackhole that was missed in rebase
  • #9973: Fix addrmod for reduce scalar, port over missing narrow tile c…
  • #10052: Add metal pack untilize test
  • Add ttnn matmul tests to TG unit tests
  • Add ssm_prefix_scan test coverage for N=16
  • Add PyBind to TTNN Slice (Formerly Referred to as Unpad in TT Lib)
  • #8450: Cleanup items pending from PR #9068
  • #10030: fix moreh_nll_loss hang
  • #7736: Remove unused reduce dim & type from reduce_init*
  • #9871: Update backward files
  • #9874: Move Unary Backward ops to TTNN
  • Update op_perf_results
  • #9962: Enable flags for profiler globals in jit build
  • Added prefill mode for mamba modules
  • Increase timeout for Mamba full model tests
  • Support multiple user indices in paged_update_cache
  • #10085: Make ttnn::Buffer deallocate execute without querying a potentially destroyed buffer instance
  • Pack runtime arguments across brisc/ncrisc/trisc
  • Llama Demo Refactor
  • #5424: Delegated sfpu reciprocal calls to wh_b0 submodule functions
  • #0: Move t3k demo tests to perf pipeline because it requires perf governor
  • #5424: Delegated sfpu reciprocal calls to gs submodule functions
  • Add trace and multi cq implementations/tests for WH Resnet
  • #0: (MINOR) Update to v0.51.0
  • #0: bump python3.8 venv versioning since apt repos updated
  • #10099: fix semaphores init for packet mux/demux
  • #10112: Drop hard pin for installation instructions for python3.8-venv in dependencies
  • Revert "#5424: Delegated sfpu reciprocal calls to wh_b0 submodule functions"
  • #0: Remove stray assert forcing single CQ on R-Chips
  • #9490: Replace tt_dnn op's usage in C++ with TTNN
  • #9874: Merge Next set of unary backward ops to TTNN
  • #10073: Move unary backward ops to TTNN
  • Unary backward op migration
  • #10087: update tt-umd submodule
  • #9959: Migrated pad to ttnn sweeps
  • Adding distributed layernorm to llama prefill
  • Add pytest xdist multiprocess to single-chip demo tests
  • Revert "Revert "#5424: Delegated sfpu reciprocal calls to wh_b0 submodule functions""
  • #10071: Move second set of Unary Backward ops to TTNN
  • #10083: added tt::stl::json::to_json and tt::stl::json::from_json
  • #10086: Add logic for splitting cmds that exceed the subcmd limit into separate cmds for semaphores
  • #5424: Delegated sqrt api call to thirdparty gs submodule sqrt call
  • #5424: Delegated sfpu api call to sqrt for wh to submodule sqrt call
  • #0: Fix galaxy eth dispatch init to only init the specified number of cqs (galaxy only supports single cq)
  • Fix undefined memory bug in ssm_prefix_scan
  • removed weight copies from DRAM to L1
  • fix syntax issues with test dispatch workflow
  • #9609: Reorganize libs into ttnn
  • #10165: Fix build error with g++-12
  • Adding support for dram sharded matmuls
  • #10076: Migrate Unary bw ops and replace tt_eager ops with ttnn ops
  • #10072: Move next set of Unary Backward ops to TTNN
  • #9082: ping individual falcon member since slack user group is not wo…
  • #8681: Add Floor, Trunc blocker ops
  • #9419: use memcpy to avoid mem misalignment
  • #10079: Move Unary Backward ops to TTNN
  • Migrate unary ops to TTNN
  • #9945: Skip SD for nightly FD, device perf tests, and single-card demos as it hangs on di/dt
  • #10045: use struct for matmul parameter passing and update doc string
  • #10045: remove use_1d_systolic_array from ttnn matmul
  • Ngrujic/profiling
  • #9319: Upload benchmark data for t3k falcon 7b tests
  • Aliu/build opt
  • #10107: Fix hangs w/ launch_msg size >32bytes
  • [CCL] Making buffer size dynamic to input slice
  • #7617: remove failing experimental model test
  • #7618: delete failing experimental model test
  • #0: fix prefill CI for mamba
  • Move Mamba tests to wh_b0_only_eth pipeline
  • #9747: Implement ttnn::tilize in C++
  • Aliu/prevent aho tanking
  • #10045: fix up missed parameter change in mamba block model
  • #9490: Added ttnn support for unary ops py file
  • #10101: [Blackhole Bringup] Revert Zeroacc to legacy behaviour
  • Update README.md
  • #0: Fix imports after tt_lib change
  • #10226: [Blackhole Bringup] Add new sfpu files
  • Suppress g++-12 build errors with -Wno flags
  • #0: Fix BH regression caused by unaligned L1_UNRESERVED_BASE
  • #10077: Migrate Unary comparison backward ops to TTNN with Overloading
  • #10175: Remove std::function and restructure ternary_bw
  • Falcon40b attn mask optimization
  • #10074: Move Unary backward ops to TTNN
  • Replace all TT Lib Unpad with TTNN Slice
  • #10082: Migrate unary bw ops to TTNN and remove std::function
  • #9715: Use build artifacts for profiler tests
  • #9021: adding resnet api into ci.
  • Update README.md
  • Move pad_on_host/unpad_on_host to host function in TTNN
  • #9874: Move polygamma_bw to TTNN
  • #5337: increase t3k frequent test timeout
  • Update falcon40b readme
  • #0: add layernorm rmsnorm pybind, move to ttnn
  • #0: Re-enable read cache in llama_model_optimized.
  • Update Mistral/Mixtral README files
  • #0: Update LLama2/3 readme with demo details
  • #0: resnet perf fix
  • Update Mamba README.md
  • OPT convs in RN50 to get better device perf
  • Increase timeout for N300 WH-only model pipeline
  • Prefill+Decode Demo Functional Implementation
  • [Falcon7b] Add wormhole demo perf mode and output verification tests
  • Update Falcon7/40b READMEs with details on model functionality and perf-mode
  • bump python 3.8 venv package version
  • Git bisect workflow on CI runners
  • #9613: scaffolding for weekly scheduled t3k perplexity tests
  • fix syntax issue with bisect script
  • #10231: Clean up t3k runs-on tags to minimum
  • #9490: Remove tt_eager unary ops and bindings
  • only build for arch that a dispatched workflow is running for

v0.51.0-rc2

15 Jul 02:19
Pre-release

📦 Uncategorized

  • Migrate Pad Device and All references
  • #0: Multi-CQ support for R-Chip
  • #10028: Remove skip and reduce test case for moreh_groupnorm test
  • #10005: Change input tensor parameter to optional in moreh_sum_backward
  • #10004: Revise bias tensor usage in moreh_linear_backward
  • #9663: support moreh_nll_loss_unreduced
  • #8865: Switch ported ops from tt_lib to ttnn for host dispatch time m…
  • #0: Update README.md grammar for idiomatic description of TT-NN
  • #9767: removed more no longer needed manually specified attributes for reflection
  • Add distributed layernorm kernel documentation
  • #10031: Fix -Werror=return-type error in composite_ops
  • #9492: update matmul path in CODEOWNERS
  • #9450: change silicon fixtures to session scope
  • Uplift UMD to grab support for configuring static TLBs and Hugepage for BH
  • #9441: add all typecasts to unit test
  • #9801: Add cb alignment fix for blackhole that was missed in rebase
  • #9973: Fix addrmod for reduce scalar, port over missing narrow tile c…
  • #10052: Add metal pack untilize test
  • Add ttnn matmul tests to TG unit tests
  • Add ssm_prefix_scan test coverage for N=16
  • Add PyBind to TTNN Slice (Formerly Referred to as Unpad in TT Lib)
  • #8450: Cleanup items pending from PR #9068
  • #10030: fix moreh_nll_loss hang
  • #7736: Remove unused reduce dim & type from reduce_init*
  • #9871: Update backward files
  • #9874: Move Unary Backward ops to TTNN
  • Update op_perf_results
  • #9962: Enable flags for profiler globals in jit build
  • Added prefill mode for mamba modules
  • Increase timeout for Mamba full model tests
  • Support multiple user indices in paged_update_cache
  • #10085: Make ttnn::Buffer deallocate execute without querying a potentially destroyed buffer instance
  • Pack runtime arguments across brisc/ncrisc/trisc
  • Llama Demo Refactor
  • #5424: Delegated sfpu reciprocal calls to wh_b0 submodule functions
  • #0: Move t3k demo tests to perf pipeline because it requires perf governor
  • #5424: Delegated sfpu reciprocal calls to gs submodule functions
  • Add trace and multi cq implementations/tests for WH Resnet
  • #0: (MINOR) Update to v0.51.0
  • #0: bump python3.8 venv versioning since apt repos updated
  • #10099: fix semaphores init for packet mux/demux
  • #10112: Drop hard pin for installation instructions for python3.8-venv in dependencies
  • Revert "#5424: Delegated sfpu reciprocal calls to wh_b0 submodule functions"
  • #0: Remove stray assert forcing single CQ on R-Chips
  • #9490: Replace tt_dnn op's usage in C++ with TTNN
  • #9874: Merge Next set of unary backward ops to TTNN
  • #10073: Move unary backward ops to TTNN
  • Unary backward op migration
  • #10087: update tt-umd submodule
  • #9959: Migrated pad to ttnn sweeps
  • Adding distributed layernorm to llama prefill
  • Add pytest xdist multiprocess to single-chip demo tests
  • Revert "Revert "#5424: Delegated sfpu reciprocal calls to wh_b0 submodule functions""
  • #10071: Move second set of Unary Backward ops to TTNN
  • #10083: added tt::stl::json::to_json and tt::stl::json::from_json
  • #10086: Add logic for splitting cmds that exceed the subcmd limit into separate cmds for semaphores
  • #5424: Delegated sqrt api call to thirdparty gs submodule sqrt call
  • #5424: Delegated sfpu api call to sqrt for wh to submodule sqrt call
  • #0: Fix galaxy eth dispatch init to only init the specified number of cqs (galaxy only supports single cq)
  • Fix undefined memory bug in ssm_prefix_scan
  • removed weight copies from DRAM to L1
  • fix syntax issues with test dispatch workflow
  • #9609: Reorganize libs into ttnn
  • #10165: Fix build error with g++-12
  • Adding support for dram sharded matmuls
  • #10076: Migrate Unary bw ops and replace tt_eager ops with ttnn ops
  • #10072: Move next set of Unary Backward ops to TTNN
  • #9082: ping individual falcon member since slack user group is not wo…
  • #8681: Add Floor, Trunc blocker ops
  • #9419: use memcpy to avoid mem misalignment
  • #10079: Move Unary Backward ops to TTNN
  • Migrate unary ops to TTNN
  • #9945: Skip SD for nightly FD, device perf tests, and single-card demos as it hangs on di/dt
  • #10045: use struct for matmul parameter passing and update doc string
  • #10045: remove use_1d_systolic_array from ttnn matmul
  • Ngrujic/profiling
  • #9319: Upload benchmark data for t3k falcon 7b tests
  • Aliu/build opt
  • #10107: Fix hangs w/ launch_msg size >32bytes
  • [CCL] Making buffer size dynamic to input slice
  • #7617: remove failing experimental model test
  • #7618: delete failing experimental model test
  • #0: fix prefill CI for mamba
  • Move Mamba tests to wh_b0_only_eth pipeline
  • #9747: Implement ttnn::tilize in C++
  • Aliu/prevent aho tanking
  • #10045: fix up missed parameter change in mamba block model
  • #9490: Added ttnn support for unary ops py file
  • #10101: [Blackhole Bringup] Revert Zeroacc to legacy behaviour
  • Update README.md
  • #0: Fix imports after tt_lib change
  • #10226: [Blackhole Bringup] Add new sfpu files
  • Suppress g++-12 build errors with -Wno flags
  • #0: Fix BH regression caused by unaligned L1_UNRESERVED_BASE
  • #10077: Migrate Unary comparison backward ops to TTNN with Overloading
  • #10175: Remove std::function and restructure ternary_bw
  • Falcon40b attn mask optimization
  • #10074: Move Unary backward ops to TTNN
  • Replace all TT Lib Unpad with TTNN Slice
  • #10082: Migrate unary bw ops to TTNN and remove std::function
  • #9715: Use build artifacts for profiler tests
  • #9021: adding resnet api into ci.
  • Update README.md

v0.51.0-rc1

11 Jul 02:01
07aacde
Pre-release

📦 Uncategorized

  • Migrate Pad Device and All references
  • #0: Multi-CQ support for R-Chip
  • #10028: Remove skip and reduce test case for moreh_groupnorm test
  • #10005: Change input tensor parameter to optional in moreh_sum_backward
  • #10004: Revise bias tensor usage in moreh_linear_backward
  • #9663: support moreh_nll_loss_unreduced
  • #8865: Switch ported ops from tt_lib to ttnn for host dispatch time m…
  • #0: Update README.md grammar for idiomatic description of TT-NN
  • #9767: removed more no longer needed manually specified attributes for reflection
  • Add distributed layernorm kernel documentation
  • #10031: Fix -Werror=return-type error in composite_ops
  • #9492: update matmul path in CODEOWNERS
  • #9450: change silicon fixtures to session scope
  • Uplift UMD to grab support for configuring static TLBs and Hugepage for BH
  • #9441: add all typecasts to unit test
  • #9801: Add cb alignment fix for blackhole that was missed in rebase
  • #9973: Fix addrmod for reduce scalar, port over missing narrow tile c…
  • #10052: Add metal pack untilize test
  • Add ttnn matmul tests to TG unit tests
  • Add ssm_prefix_scan test coverage for N=16
  • Add PyBind to TTNN Slice (Formerly Referred to as Unpad in TT Lib)
  • #8450: Cleanup items pending from PR #9068
  • #10030: fix moreh_nll_loss hang
  • #7736: Remove unused reduce dim & type from reduce_init*
  • #9871: Update backward files
  • #9874: Move Unary Backward ops to TTNN
  • Update op_perf_results
  • #9962: Enable flags for profiler globals in jit build
  • Added prefill mode for mamba modules
  • Increase timeout for Mamba full model tests
  • Support multiple user indices in paged_update_cache
  • #10085: Make ttnn::Buffer deallocate execute without querying a potentially destroyed buffer instance
  • Pack runtime arguments across brisc/ncrisc/trisc
  • Llama Demo Refactor
  • #5424: Delegated sfpu reciprocal calls to wh_b0 submodule functions
  • #0: Move t3k demo tests to perf pipeline because it requires perf governor
  • #5424: Delegated sfpu reciprocal calls to gs submodule functions
  • Add trace and multi cq implementations/tests for WH Resnet
  • #0: (MINOR) Update to v0.51.0

v0.50.0

10 Jul 22:04
f7c10a2

📦 Uncategorized

  • Fix issue with Mamba SSM A weight preprocessing
  • Make build key unique for mmio and remote devices with same harvest mask
  • #5337: Removed eth_dispatch yaml flag from mistral tests
  • New workflow for custom test dispatch on CI runners
  • #9312: Add single-header boost-ext/reflect library as dependency
  • Opt LayerNorm/RMSNorm with 2D reduce
  • Revert "#8630: support uint8 data type"
  • #0: Fix codeowners for metal bert
  • Revert "Revert "#8630: support uint8 data type""
  • #9642: fix matmul2d in1 sharded with batch>1
  • #0: add tile layout support for GN
  • FD2 packed binary commands
  • #9082: t3k demo with slack notifications for owners. split jobs
  • Rtawfik/issue 9142
  • #9688: Remove redundant left shift in DEBUG_SANITIZE_NOC_READ_TRANSACTION_FROM_STATE
  • #9500: Update eth_interface include in tt_cluster to not be hardcoded for WH
  • #9578: Add WITH_PYTHON_BINDINGS option to allow build w/o python
  • #9587: Update CB and worker Go signals to respect max sub cmd limit introduced by dispatch packed write local copy change
  • Add support for bfloat4 weights in Mamba
  • Use in-place binary operations in Mamba block
  • #5337: Relaxed Mistral expected compilation time in CI by 1 sec
  • Mo/9406 profiler build flags
  • Add support for single col/row/core output grid for matmul 2D
  • #9725: Set release candidate releases on GitHub to pre-release, not draft, to enable downstream users
  • add tagged docker image with releases
  • Rtawfik/issue 9164
  • #5562: resolve reduce scatter issues (nd hang and correctness)
  • Create benchmarking tools for saving run/measurement data (with Falcon7b example) and model-demo utilities for verifying tokens/perf
  • #0: Fix bug with var name in single-chip falcon7b demo tests
  • #9735: fix issues with including reflect library
  • #9527: Remove usage of bcast where multiply is used
  • Mchiou/9082 slack notification owners
  • #9681: set name attribute for ttnn operations when fast runtime m…
  • #9553: Add prefix scan op for Mamba prefill
  • #9628: Merge Binary backward ops from tt_eager to TTNN
  • Namhyeong kim/support fp32 dest acc in moreh adam
  • #0: Update t3k workflow timeouts (except freq pipeline)
  • Temporary update Mixtral perf times to pass CI
  • #9479: fix cpu core worker bug
  • #4858: add typecast fp32 <-> int32
  • #0: ViT demo fix
  • #9389: Add support for integer type in sum operation
  • Transfer llama2/3 from experimental to demo folder.
  • #9657: add topk multicore to support larger dimension sizes
  • #4858: add typecast bfp8_b
  • #9082: t3k model perf split tests with slack notifications, disabled cnn
  • #0: Add ttnn/cpp to packages to enable using ttnn kernels in tt_eager ops
  • #9741: Set stricter pytest timeouts
  • #9492: Change models matmul usage to ttnn
  • #9778: test prefetcher hanging with changes to test
  • #9490: TTNN eltwise/unary migration
  • Update timeout for falcon40b t3k demo test
  • #0: Remove extra t3k falcon40b matrix test group
  • #9044: Move dispatch core x y to be part of launch msg
  • Modify rot mat each iteration to avoid allocating 10k tensors upfront
  • Optimize bcast sharded op
  • Start using reflect library
  • #0: Properly delete source folders for wheel testing
  • #9479: Update Mixtral perf estimates
  • #0: Added github community issue workflow
  • #8729: Pytest multiprocess reset infrastructure
  • Enable switching between 1 and 2 cqs in the same process
  • Fixed failing tests for SD Conv tests for WH using new conv
  • #0: Switch org-membership check to an authenticated call
  • #0: Decrease num loops in trace stress tests
  • #9628: Support optional return tensor
  • #0: Use CV to wait for cq_reader in production mode. Remove enqueue_record_event for NB calls
  • #9628: Merge second set of binary backward op from tt_eager to TTNN
  • #0: Bump bert compile time threshold since it's been intermittently failing on ci
  • Mchiou/9792 t3k runner management
  • #0: Bump up Bert inference time due to instability on ci
  • #8865: For host dispatch time measuring increase failing reference t…
  • #9484: Add output_tensor queue_id to dependency ops
  • Adding the new op: Flash Decode!
  • #0: Add missing permissions to issue notification job
  • #9275: Fix Falcon7b demo failing to run by default on a Grayskull e75
  • #9801: Account for 64B BH PCIe alignment in cq cmd sizing
  • #0: Make prefetcher early exit after fetching/reading exec_buf
  • #8683: Add Unary bitwise AND, OR
  • Ngrujic/profiling
  • #9628: Merge third set of binary backward op from tt_eager to TTNN
  • #4858: add typecast uint32
  • Migrate Pad Host Code, Bindings, C++ Usages from TT Eager to TTNN
  • Support longer sequence lengths in ssm_prefix_scan
  • #9709: Add optional transpose_a and transpose_b to ttnn matmul and linear (see the sketch after this list)
  • #0: Only run batch 12 bert for GS profiling and tighten some bert/resnet thresholds
  • Asarje/resnet highres 20240624
  • #9492: replace falcon specific matmul calls
  • Extend ssm_eltwise_mul for num_users > 32
  • Update documentation for adding new ttnn operation
  • Extend ssm_1d_reduce for the batch>32
  • #0: rn50 fix add api
  • #9123: Add support for optional output tensors to run in the worker t…
  • #9861: support check_tensor helper_function
  • Fix syntax issues in custom test dispatch workflow
  • Add Mixtral accuracy tests and cleanup its other tests (CI-friendly)
  • #9876: Increase timeout on falcon7b perplexity tests.
  • #9492: Remove bmm/resnet_matmul from models
  • #9410: enable fp32 precision unpacking for interm. CBs
  • #9903: Fix conditional statements and indexing of y values in CoreRange::diff
  • #9860: fix test create device apis
  • #0: delete unused code
  • #9719: fixed l1 clear issue on nlp create qkv heads decode test case
  • Fixing typo in llama demo readme
  • #9892: Device only op report
  • #8704: define consts for registers that hold x-y coordinates and amount to shift address to get x-y coord
  • CODEOWNERS update
  • Abhullar/bh misc fix
  • Auto-register C++ ttnn operations in python
  • #9788: Remove TopK from TTLib and replace all references with the TTNN api
  • #0: add owners for resnet demo
  • 7-way split of eager tests
  • #9910: Improve Softplus kernel accuracy
  • #9818: Add cache check to op info V2
  • #0: update noc test bound
  • Fix branching bug in softplus kernel
  • propagate error upwards for tests in falcon 40b suite
  • #0: Fix falcon40b softmax import failure
  • #9755: move ttnn.concat to match the new file structure
  • #9837: Assign workers after performing ref count cleanup in async mode
  • #0: Make event_synchronize API safer
  • #0: Update buffer asserts to account for trace buffers
  • Clean up ttnn operation registration on python side
  • #9164: [Blackhole bringup] Add fix for unpack untilize
  • Aliu/no l1 clear
  • Restructure ttnn::permute to match the new standard format
  • #9815: Update host to pass packed write max unicast sub cmds to cq dispatch
  • Distributed layernorm op
  • #9831: re-enable test
  • #8835: cleaned up ttnn operation registration on C++ side
  • #9941: update dram/l1 to noc xy header to do the appropriate shift
  • #9336: Refactoring moreh layernorm
  • #9745: move unpad to slice ttnn cpp references
  • #9980: Update falcon updated outputs
  • Fix Main after Pad Merge
  • Update eltwise bcast unary ops to use memory_config and fix PCC issue for interleaved output
  • Update FD cmds to be PCIe aligned
  • Fix N150 product name to nebula_x1 even if it's unharvested.
  • #0: add a second codeowner for conv
  • #0: Get tt-metal to compile with gcc-12
  • #9492: Change to ttnn matmul in tests and tt_eager
  • #9441: add typecast uint16->uint32
  • Move ttnn::embedding to match new pybind structure and replace C++ ttlib embeddings usage with it
…
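
The sketch referenced in the #9709 entry above: a minimal example of the optional transpose flags on ttnn matmul. Only the transpose_a/transpose_b parameters come from the changelog entry; the rest is ordinary ttnn host code and assumes one available device.

```python
# Minimal sketch of the transpose flags added to ttnn.matmul (#9709).
import torch
import ttnn

device = ttnn.open_device(device_id=0)

a = ttnn.from_torch(torch.rand(1, 1, 64, 32, dtype=torch.bfloat16),
                    layout=ttnn.TILE_LAYOUT, device=device)
b = ttnn.from_torch(torch.rand(1, 1, 64, 32, dtype=torch.bfloat16),
                    layout=ttnn.TILE_LAYOUT, device=device)

# b is transposed on the fly, so the inner dims line up: (64, 32) x (32, 64).
out = ttnn.matmul(a, b, transpose_a=False, transpose_b=True)
print(ttnn.to_torch(out).shape)  # expected: torch.Size([1, 1, 64, 64])

ttnn.close_device(device)
```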

v0.49.0

12 Jun 14:05

📦 Uncategorized

  • #5044: Add optional output to addalpha
  • #9059: Fix matmul for single core grid
  • readme update
  • #0: (MINOR) Update to v0.49.0
  • #7586: Move common models for single-card nightly to ln model
  • Update Mamba README
  • TTLIB interval to sharded sweeps
  • #0: Update dataflow api comments
  • #9196: Merge new op: Fast reduce nc into main
  • #0: New resnet50 test skipped on WH since it's WIP
  • #9329: Restructure ttnn::argmax
  • #9323: Introduce template for new ttnn pull requests
  • #0: skip release build on GH runners, we already test it via build a…
  • Remove unused dependencies and fetch gtest via CPM
  • #8764: Part 3 of docs and model demos changes
  • Ngrujic/profiling
  • [Mistral-7B] Add flags for weight paths
  • Typecast int32->fp16b (see the sketch after this list)
  • #9258: Remove ARCH_NAME and TT_METAL_ENV from wheel testing
  • Implemented SD using new Conv API
  • #9258: Re-add wheel into release assets
  • #9361: Install Clang-17 and gdb 14.2
  • #7525: Re-skip demo batch 7 metal_BERT_large_11 on WH because it still hangs ND
  • #9206: add sfpu config reg init to llk sfpu inits
  • #9059: Avoid a couple of fatals in matmul
  • Add Galaxy support.
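
The sketch referenced in the typecast entry above: a minimal example of an on-device typecast. The ttnn.typecast entry point and the dtype spellings are assumptions based on the ttnn Python API, not something these notes pin down.

```python
# Minimal sketch of an on-device typecast (int32 -> bfloat16).
# ttnn.typecast and the dtype names are assumptions here.
import torch
import ttnn

device = ttnn.open_device(device_id=0)

ints = torch.arange(32 * 32, dtype=torch.int32).reshape(1, 1, 32, 32)
t = ttnn.from_torch(ints, layout=ttnn.TILE_LAYOUT, device=device)

cast = ttnn.typecast(t, ttnn.bfloat16)
print(ttnn.to_torch(cast).dtype)  # expected: torch.bfloat16

ttnn.close_device(device)
```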

v0.48.0

10 Jun 18:09

📦 Uncategorized

  • #7744: Add support for non-4D tensor in moreh_sum, moreh_sum_backward
  • #5544: Add output tensors parameter to moreh_nll_loss op
  • #5544: Add output tensors parameter to moreh_sgd op
  • #5544: Fix package build error
  • #5544: Add output tensors parameter to moreh_linear op
  • #5544: Prevent eager unit test failures
  • #7997: Support non-4D tensor in moreh_softmax
  • #7816: Bump SD perf target
  • #8098: Remove temp buffer copying when reading from hugepage to host buffer
  • #0: Specify DEBUG_STATUS as a string literal instead of multiple chars
  • #8212: Fix uneven shards for interleaved_to_sharded op
  • #0: Refactor unpad tile to modify rt args in place and remove dynamic…
  • #7838: Add support for non-4D tensor in moreh_linear OPs
  • #0: Use split_work_for_tilize in both tilize and untilize
  • #8131: resnet-50 fix for b20.
  • Add support for multiple parameters in EltwiseUnary
  • #7625: Enable multicore for tilize with padding by default
  • Trace Support
  • #0: Switch set runtime args assertion for if kernel was placed on core to TT_ASSERT
  • #7179: enabling test case. The issue was not reproducible on 8.12 dri…
  • #4625: Multicore runs for untilize with unpadding on interleaved tensors
  • #0: Cache program cmds, convert cb configs from write linear to write packed
  • #0: Make skip and xfail optional in defining sweep tests
  • Shwetank tt/bcast op
  • #8364: Disable implicit fallback for ttnn.pad
  • #8513: Add slack notifications to several more pipelines
  • #0: Update common RT args to use no stride flag for packed cmd.
  • #0: Option to write compile_commands.json from CMake
  • #8718: eltwise testing for bfloat8
  • Add support for bfloat8 input tensors in Mamba SSM block custom kernels
  • #8460: Enable Clang-17
  • #0: Remove overhead in calling functions wrapped in tensor_impl_wrapper
  • #0: Updating the perf threshold to incorporate the "Merge back uneven reshard" commit.
  • #6365: Add ttnn host tests
  • #6365: Revert "#6365: Add ttnn host tests (#8210)"
  • #4382: fix GH reported vulnerabilities
  • #0: bump C++ timeout limit to 45 minutes
  • update unpad doc for slice generality
  • Convert Falcon7b tt_lib ops and tensors to ttnn.experimental
  • #6365: Fix ttnn host wheel tests
  • Add git bisect script
  • #0: Move falcon40b ci unit tests to different pipeline
  • #8437: remove default matmul program config
  • #0: Add myself to ttnn codeowners
  • #0: Update README.md to include mention of TTNN_CONFIG_OVERRIDES
  • #0: Fix typos and add TTNN_CONFIG_OVERRIDES parameter descriptions to readme (see the sketch after this list)
  • #0: Add basic sanity checks during matmul program config creation
  • #8907: Sweep tests for tilize/untilize
  • #8902: Fixed program caching bug in nlp load slice op and added additional test cases for the op
  • #8917: Add sweep test for the fold op
  • #0: Properly support trivial single core case for 1D matmuls
  • #6343: updated test_perf with test for bloom causal_lm
  • #6343: Add functional_bloom test_demo
  • Update README.md
  • Enable optimised attention by default in falcon prefill.
  • Replace FreeList shared_ptr with local_shared_ptr
  • Add dummy_weights mode for mixtral tests
  • Refactor operation calls: Replace operation::run() with operation::launch_op()
  • Use HiFi2 to bump Falcon7b prefill PCC
  • #8902: add input and attn_mask del
  • #8930: Disable llama perf test
  • #0: Add third codeowner to matmul path
  • #0: Add create_venv.sh as environment option in installation instructions
  • #7083: Composite conv fix for relu called after matmul
  • #7525: Skip batch 7 metal BERT on WH B0 because it still hangs too often
  • #8871: Add initial infra/support for dram sharding
  • #8531: delete all makefiles
  • #0: Delete dead code from work_split.hpp
  • #8853: Uplift SFPI to latest w/ BH support
  • #8725: Warn user if kernel cache is enabled
  • #0: Minor test_prefetcher fixes
  • #5389: Move ttnn.repeat to c++
  • #8131: temp fix for PCC issue on W0.
  • Optimize e2e perf Falcon40b modifying layernorm
  • #0: Relax Falcon7b perf target
  • #0: Resolve segfault in llama async mode
  • Resnet Optimizations
  • Create Falcon7b perplexity test and utility functions for text-gen datasets
  • Revert "#8131: temp fix for PCC issue on W0."
  • bmm dram sharded opt
  • #8943: Clean up profiler python_env build flow
  • #8904: Add slack notifications for T3000 unit-tests
  • Add unet shallow functional, performance and demo test files
  • #8932: Multi-Device Mixtral Argmax Support
  • #8264: Worker thread optimizations:
  • TTNN tests for bf8 with mk tiled scalar
  • Ihamer/7468 inject noc delays
  • Support changed csv row orderings in Mixtral's op_perf_results.py
  • Correct merge issue in op_perf_results.py
  • #0: Add kernel groups to test_pgm_dispatch
  • #0: Add docs requirements to python env cache key because it can change the environment as well
  • #0: Add helper function to create CBs
  • #8973: Remove TT_METAL_ENV because we don't need it anymore
  • #5773: Move SD model to demo folder
  • #6938: Implement softplus as a single kernel
  • Model team/rotary embeddings llama
  • #8735: Fix hw/inc/blackhole files for compilation
  • Improve Mixtral perf with ttlib
  • Update README.md
  • #3712: fix old version of GN test
  • #0: Don't error on unused functions in compiler call
  • Revert " #8904: Add slack notifications for T3000 unit-tests"
  • Rtawfik/bh llk api
  • #0: Added interactive demo
  • Move Falcon7b before Mixtral in demo pipeline to workaround issue
  • #8112: Add support for ND tensors to matmul
  • #0: fix dram read benchmark
  • Fix bug in utility_functions::Profiler
  • Remove 1x1 matmul fallback on convolution and generalize convo…
  • #5389: Remove ttnn.split
  • #8767: decouple build folder name from build.cpp
  • #8735: Update common flags for BH build after sfpi module update
  • #8895: Fix ttnn.as_tensor(..) method for placing tensors on-device
  • #8539: Add cq_id to run_operation function args
  • #8632: Support fp32 dest acc en in moreh_sum and moreh_sum_backward
  • #5044: Add optional output tensor and remove autoformat in eltwise binary ops
  • #8895: Fix failing regression test in dump_tensor(...) API
  • More Resnet Optimizations
  • #4858: add typecast fp32 to uint32 op
  • #8995: refactoring moreh arange
  • #0: Add ccache option to build_metal.sh
  • Update Mixtral perf figures
  • #8349: Use BFP4_B for attention mask in falcon7b optimised prefill.
  • #0: Add CODEOWNERS for build_metal.sh
  • Rtawfik/add binary reuse metal
  • Update watcher.rst - use double backticks
  • Falcon40b tt_lib to ttnn.experimental
  • #0: fix dram sharded program cache
  • #7083: New halo fix for enabled program cache
  • #9051: Enable Llama model perf test
  • #8764: Single card WH demo tests
  • #8764: Various docs fixes for WH release
  • #0: Correct script locations for nightly single card
  • #8764: Use new device_l1_small_size fixture for SD demo interactive test
  • #9059: Update matmul test pcc
  • #0: Ensure weka mount is active for demo tests otherwise it won't run
  • #0: remove reserve to avoid bad alloc
  • #8764: Separate n150/n300 demo tests to not run BERT 11 on N150
  • Remove unnecessary llk sfpu param files
  • #9059: Add fallback for getting matmul program config
  • Add grouped convolution support
  • #8282: Support non-4d tensor and fp32_dest_acc_en for moreh nllloss backward
  • #8976: moreh_getitem receive signed integer index tensors
  • #9049: fix moreh_sgd callback and add callback test
  • #0: Remove argmax multi-device test due to segfault
  • #7724: Add prototype for autonomous streams for use in tunneller
  • #9036: GS & BH --> Combine llk param files using variable args
  • #0: optimize allgather for small tensor sizes
    ...
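
The sketch referenced in the TTNN_CONFIG_OVERRIDES entries above: the variable carries JSON and is read when ttnn is imported, so it must be set first. The specific keys shown are assumptions based on ttnn's CONFIG flags (compare the TTNN_ENABLE_LOGGING flag and CONFIG class mentioned elsewhere in these notes).

```python
# Minimal sketch: override ttnn configuration via the environment.
# Set the variable before importing ttnn; the keys are assumed flags.
import json
import os

os.environ["TTNN_CONFIG_OVERRIDES"] = json.dumps({
    "enable_fast_runtime_mode": False,  # assumed key: re-enable validation
    "enable_logging": True,             # assumed key: log each operation
})

import ttnn  # noqa: E402  # import after the override is in place

print(ttnn.CONFIG)  # assumed: the CONFIG object reflects the overrides
```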

v0.46.0

05 Apr 13:57

📦 Uncategorized

  • user-triggerable C++ post-commit suite
  • #6406: add missing position_ids/attention_mask to bert demo
  • #6282: Add AdamW
  • #6315: Fix dprint tests for T3000
  • FD2: prefetch stall, dispatch wait, linear read, delay and cleanup
  • #6609: update wording in demo section of main README.md
  • #6364: Autocomplete for pybinded types
  • Asarje/ttnn rn50 b20
  • FD2.0 Test - Fix l1 buffer not page-size aligned in after FD-on-eth changes to L1_UNRESERVED_BASE
  • #6593: Add resharding to Llama2 model when possible.
  • #6572: Fix ttnn.repeat_interleave example in documentation (see the sketch after this list)
  • #5780: Re-enable 100K enqueue program stress test on grayskull
  • Enable basic width sharding support in all-gather
  • Alex/metal/remove cb wait markers
  • #6657: Use sysmem manager cq size instead of recomputing it each time…
  • #0: (MINOR) Add Grayskull purchase link and update version to 0.46.0
  • #5063: add TopK API to metal
  • #5480: FD2.0 Test - Fix test_prefetcher for dram paged read test (-t 3) on whb0
  • Fix logit low pcc
  • Backward op - Fixed ldexp, hardsigmoid and asin
  • #6598: Fix softplus
  • Add support for BFP4_B tensor serialization
  • Eltwise mul for different batch size
  • #6575: Split docs into separate Metalium and nn docs
  • #0: Add two separate links for documentation (tt-metalium/ttnn) on README
  • #6361: Update ttnn repeat to use correct shapes when formatting output
  • #0: Sayonaraaaaaaa
  • FD2.0 Test fix test_prefetcher add_paged_dram_data_to_worker_data dropping start_page
  • #5785: Watcher ringbuffer implementation
  • Add FD 2.0 WriteHost Command
  • #0: Put back frequent api tests because I'm an idiot
  • Optimize All Gather Interleaved Worker send/receive
  • #0: changing all #include common/* to #include tt_metal/common/*
  • #6676: Fix issues related to unary lte and gte
  • #5817: Fix lerp
  • #6589: Fix for relu_bw
  • #6633: Backward test update
  • #0: Skip logit, logiteps test
  • #0: Testing CI fix
  • #5480: Update test_prefetcher to pass added hugepage args to dispatch kernel
  • Fix l1 acc, add whb0 optimized conv tests
  • Alignment fix for eth core kernels
  • Add data parallel (multi-chip) for Falcon7b (prefill/decode) model and corresponding tests
  • CQ_DISPATCH_CMD_WRITE_PAGED support in test_dispatcher and passing tests
  • #6647: disable failing ci cpp tests and reenable cpp pipeline on CI
  • Backward test updates
  • Ngrujic/check bugs
  • Add Llama matmul perf tests to main
  • TTLIB: removing working tests from broken
  • #6443: Update backward asin and addcdiv logic
  • #0: Fix output cb size calculation in reshard op for bfp8b
  • #0: use smart ptrs in allocator
  • Jvasilje docs 0322
  • DRAM based device profiler with Tracy support
  • #6553: Fix ttnn.reshape(..) handling for bfloat16, TILE_LAYOUT
  • PR: #6746
  • Add Llama2 demo to tt-metal docs
  • Mistral-7B WH demo
  • Revert "#0: Put back frequent api tests because I'm an idiot"
  • FP32 support
  • #0: Add back frequent api tests to run.sh
  • Bteng/watcher ci3
  • Remove cpuprof
  • logo update
  • #6184: sharded row major silu support.
  • #6443: Update div_bw and backward ops test file
  • #6705: Relax forcing of keyword argument in ttnn.open_device
  • Forward op tests
  • #6691: Allow blocking of inner dim within a core for sharded in0 for 2d and 1d systolic matmuls
  • #6662: Width Sharding support for eltwise OP
  • Stable diffusion python API level perf improvements
  • Add get_compute_kernel_config_args function
  • #0: Add fd-2/main triggers for pull_request and push for post-commit
  • #5480: FD2 refactor for pre/dis patch variants
  • #6654: Add perf tests for ttnn ResNet50
  • #5480: Fix fd gtest unit test test_write_host
  • #0: Set myself as setup.py owner
  • #6780: Add mistral7b to demos list in getting started
  • #4003: re-added TTNN_ENABLE_LOGGING as runtime flag
  • #0: Fix semaphore address gen bug
  • #6769: Disable program caching for failing Llama tests.
  • #5480: Fix zero sized write transaction request that could occur in write_linear_host
  • #6077: Fix unet pcc issues
  • Remove DstSync from llk api templates
  • FP32 Support
  • #6680: Reverting move op change
  • #6443: Update asinh and softsign backward
  • Backward tests with updated test modules
  • Ngrujic/check bugs 1
  • #6654: Moving init for self.compute_kernel_config
  • #6805: reproduce the bug with sharded split_query_key_value_and_split_heads
  • #6832: Account for tile-padding in softmax for mistral 7B
  • Enable support for uint32 format to be consumed by SFPU (issue #4624)
  • #4252: fix clang build error since std::log2 only constexpr in gcc
  • #4003: log, debug and add pre- and post- hooks only for top-level ttnn ops
  • #6823: Fix core count to not include dispatch cores in op report
  • #6197: Align pages for interleaved <-> sharded.
  • METALIUM_GUIDE
  • Bteng/watcher post commit
  • #6443: update backward test file for relational ops and concat op
  • Revert "Bteng/watcher post commit"
  • #6443: Update backward ops
  • Backward test updates
  • #0: Add the dim 0 support repeat backward
  • Update hard related test ops
  • #6757: Remove set_profiler_location
  • #6443: Update backward ops erfinv elu hypot cos sin
  • #6861: Enable Watcher/dprint tests on T3000 CI
  • Update Mistral perf regression for CI, until issue is resolved
  • Mamba/perf v1
  • #0: remove data movement ops related to silu in SD
  • #4003: added proper fallback for getitem of ttnn.Tensor. Slice the tensor only on the tile boundary but set the shape based on whatever the user provided
  • #4003: added proper fallbacks for every op that falls back to torch
  • #6731: add fix to LN width sharding
  • #5797: add back sweep test for ln
  • Integrate GroupNorm V2 to SD model
  • METALIUM_GUIDE.md updates
  • [Falcon7b] Fix bugs with inference throughput measurements in demo
  • #0: shallow unet add perf_mode
  • #6154: 2d matmul in0 height, in1 width sharding
  • #5249: Various Falcon40b test and demo cleanup
  • #0: fix incremental build
  • #0: remove upsample spill to DRAM
  • [Llama2 Prefill] Model Functionality completed
  • Watcher alignment checking for PCIe/DRAM <-> L1
  • #6920: fixed the error in whisper
  • Update METALIUM_GUIDE.md
  • #6644: save l1 buffers to data base
  • Update usage.rst
  • #6804: fix ttnn falcon7b demo regression + add to CI regressions
  • #6285: Add backward support for floor round and div_no_nan
  • [skip ci] Update INSTALLING.md
  • #6873: Add more test combinations to tt_lib sweeps add, add_unary, su…
  • Ngrujic/check bugs 3
  • #6882: Updated Mistral-7b perf estimate
  • #6850: Update install links in Sphinx docs to point directly to INSTALLING.md
  • #6619: Fix per op profiler sum
  • #6644: sync before calling print l1 buffers
  • Barsic/ttlib ops check
  • Barsic/ttlib params fix
  • #6962: Move cd tt-metal earlier in the command list of INSTALLING.md
  • #6819: Add support for CreateKernel absolute file paths
  • #6356: Remove half-half grid logic for bmms
  • #4003: added a flag to disable ttnn fallbacks. Don't throw an error w…
  • #0: Correct FW versions, tt-smi versions, and add note about tt-topology
  • #0: Capitalize tt to TT consistently for marketing
  • #0: Add myself as CODEOWNER for INSTALLING.md
  • #6644: ttnn visualizer
  • #6847: Allow disabling individual watcher features
  • #6889: Support printing/padding/tilizing multi-device tensors
  • #4003: removed ttnn.print_l1_buffers and consolidated all ttnn flags into a CONFIG class
  • #6217: tt_lib async mode support (single chip tensors supported)
  • Reshard With Ranges
  • #4003: updated buffer report to show...
…
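
The sketch referenced in the #6572 entry above: a minimal example of ttnn.repeat_interleave, mirroring torch.repeat_interleave semantics. It assumes a working ttnn build with one available device.

```python
# Minimal sketch of ttnn.repeat_interleave (#6572 fixed its docs example).
import torch
import ttnn

device = ttnn.open_device(device_id=0)

t = ttnn.from_torch(torch.rand(1, 1, 32, 32, dtype=torch.bfloat16),
                    layout=ttnn.TILE_LAYOUT, device=device)

# Repeat each row block twice along dim 2: (1, 1, 32, 32) -> (1, 1, 64, 32).
out = ttnn.repeat_interleave(t, repeats=2, dim=2)
print(ttnn.to_torch(out).shape)  # expected: torch.Size([1, 1, 64, 32])

ttnn.close_device(device)
```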

v0.45.0

22 Mar 18:03

🚀 Features

  • #6204: added support for num_users < 32 for update cache op.
  • #6247: Llama2 Galaxy MLP implementation

📦 Uncategorized

  • #4736: Add support for moreh_norm op
  • Fix moreh_layernorm rstd
  • #5508: Change test_moreh_layernorm.py for debugging
  • #4686: add infra for sharing global struct among ops
  • #5592: Fix pcc on Falcon 7b prefill by turning on l1 packer on MLP 4h-to-h matmul
  • Fix layernorm beta data format reconfig
  • Add linked support for in0 in1 mcast in matmul
  • #4957: optimizing construct_2d_padded_tensor_list
  • #4003: added ttnn.as_tensor and enabled support for caching torch tensor
  • Revert "#0: Fix for fail in asinh backward"
  • #5829: Use moreh_common.hpp for data movement kernels across moreh OPs
  • Barsic/ttnn ops
  • #6030: Update resnet performance metrics
  • #5876: pytest & c++ test logging cleanup
  • #0: Use both 2x2 and 2x4 machines on every scheduled run
  • Add single core matmul benchmark
  • #6079: Update FORCE_INLINE to be nop when watcher is enabled
  • #5980: Fix a hard-coded bounds check in dprint
  • #5389: merged ttl and ttnn tensor classes into one
  • Initial Performance Model
  • fix ci
  • TTNN RN50 :: on the road to match perf with TTLIB version
  • #4438: Optimized single-core fold op
  • #5589: Add repeat-interleave and addcmul sweeps
  • #6055: Add square backward support
  • #6057: Add backward support for lgamma
  • #6056: Add backward support for frac and trunc
  • #6066: Add support for backward log sigmoid
  • #6002: Add backward support for binary maximum
  • Ngrujic/improve conversion to bfloat8b in sweeps
  • #5829: Use moreh_common.hpp for compute kernels across moreh OPs
  • #0: Remove post-commit label from multi device pipeline because it's not actually post commit
  • Add pack l1 acc to resnet conv
  • #6144: Skip 512x512 cross attn 2d upblock for now in nightly because it hangs
  • #6061: Add tanhshrink, threshold, Unary EQ backward ops support
  • Width Sharded Concat for Unet
  • #5184: uncommenting various moreh test cases.
  • Fix compute kernel config arg for resnet50
  • Nsmith/untilize unit test
  • Revert "Revert "#5389: merged ttl and tensor classes into one""
  • #4438: Do not use the new fold op in Resnet tests
  • Remove corerangeset that does not work on wormhole
  • #6129: Expose kernel config attrs and use 4 dst tiles for fp32 configs
  • #5391: Add device perf
  • #0: Use multiplier for wormhole b0 mulsi3
  • #4003: removed ttnn.Tensor autoclass from tensor.rst
  • TTNN MultiDevice Support
  • build artifacts
  • #4947: Add noc alignment checks to watcher
  • Add ttnn multi-chip unit test for checking device shards
  • Nsmith/fix unet
  • #6043: Random program stress test of command queues
  • Logit and logiteps backward support
  • Backward support for log2
  • Add missing ttnn tests and disable broken tests until issues are fixed
  • Fix Events feature for FD1.3 (out-of-order event ids, events feature missing) #6093
  • #5873: make top-level post commit workflow re-useable
  • #5589: add groupnorm for ttnn sweeps
  • Ngrujic/ttnn sweeps 4
  • Add ethernet datamover (EDM) - a foundational ethernet transfer engine
  • #6116: Add backward support for softshrink
  • #0: Add verbose make logs to artifact and make nicer name on metal
  • #0: Only use 2x4 setup for multi-card WH CI as 2x2 does not provide us good feedback
  • #4809 dprint tensix regs
  • #4003: fixed bloom perf test
  • #6187: Conv bugfix
  • #0: concat RM support variable stick widths across inputs
  • TTNN RN50 on WHB0
  • #6084: Lower thresholds slightly after using proper configs for device resnet
  • Fast dispatch 2.0 proof of concept
  • #6218: add pytest for matmul 1d 2d
  • #6177: use is_tensor_storage_on_device so it works for MultiDeviceStorage
  • #6082: support workers + eth cores in one program
  • #6215: Rename TensorToMeshMapper/MeshToTensorComposer
  • #6164: Update test_noc_unicast_vs_multicast_to_single_core_latency to not use same cores for producer and consumer on WH
  • #6117: Add backward support for softplus
  • #6223: remove redundant call to context switch
  • Integrate EDM with all-gather.
  • #6136: Add backward support for unary LE and GE
  • #5398: fix unicast binaries
  • Barsic/ttnn ops 2
  • #5380: Add wormhole_b0 model perf tests, only falcon7b in ttlib for now
  • #5372: Updated README.md file for demo
  • #4003: updated ttnn.concat to have a registered fallback
  • Llama2 functional bringup
  • #5589: Add working BFLOAT8_B sweeps to working folder
  • FD2.0 rename HostQ->PrefetchQ, add multi-core capability, fix NOC coords
  • #0: bugfix in ttnn resnet caught by nightly
  • #0: fix tt_bisect build bug
  • Watcher Asserts
  • #6183: add unit test for sd matmul ops
  • #6254: Make program cache per device:
  • #5394: Add functional version of Mamba architecture
  • #6257: Add temporary convenience script for 800MHz / new eth reset dependent CI
  • #5661: Enable gtests for fast dispatch + R chip
  • Alex/metal/bmm large block untilize out
  • #5389: made tensor attributes public and use ttnn::Shape instead of tt::tt_metal::Shape for storing shape
  • Revert "#6183: add unit test for sd matmul ops"
  • #4003: print all of the L1 buffers using ttnn.print_l1_buffer_state
  • #4003: print all of the L1 buffers using ttnn.print_l1_buffers
  • #4438: Implement sharded multi-core fold op for Resnet50
  • #6149: disabled the check for comparing generated report with GOLDEN_L1_BUFFER_REPORT because on pipelines it looks different than when running locally
  • FD2.0 fixes+mcast support for write and packed_write
  • Shwetank tt/config
  • #0: Change order of device and use_program_cache fixture in remaining pytests
  • Softplus with beta and threshold param (see the sketch after this list)
  • Build tests during artifact creation
  • #6149: disabled test_print_l1_buffers_of_add_operation
  • #4003: updated ttnn.to_torch to work with bfloat8_b tensors that are not multiple of tile size without tile padding
  • #0: add to/from L1 reshard test
  • #0: Add back deleted shape assertions for interleaved concat
  • test errors flagged by watcher
  • #0: fix incremental build
  • Merge xuncai/llama-attention-galaxy to main: First version of llama-attention galaxy on emulated chips
  • #6329: Fixing a bug causing mismatch on indices
  • #6321: Test which sweeps read/write buffer and just checks that the e…
  • Support moreh_getitem forward
  • #6125: Update in0_block_w to be full shard width for sharded 2D systolic matmul
  • #6107: Add softsign, sign, unary ceil backward support
  • #6226: Add backward support for div
  • #6234: Add backward support for rdiv
  • #6236: Add backward support for fmod and remainder
  • #4003: added positional embeddings to bert and updated ttnn_sharded_optimized_bert to run with batch size of 12
  • Indexed Fill
  • #5589: remove dtype in gen function sweep tests where needed
  • #6347: Print built-in defines once only
  • #0: Add Mo as code owner on profiler code
  • #0: Simplify tt_lib.scripts package by adding a specific tt_eager/scripts directory and putting the production scripts in there, whereas development scripts will stay in /scripts
  • #0: Fixture reorder changes reverted for falcon_7b perf test
  • #5424: remove metal_ckernel_sfpu
  • #0: Update remaining tt_lib.program_cache calls to use device APIs
  • #6183: add unit test for sd matmul ops
  • #6289: fix dispatcher page calculation
  • #5924: Enable unet on wormhole_b0 changes
  • #6325: skip test_multi_device.py for grayskull arch
  • Alex/metal/pack untilize no repack
  • #6144: Not hanging on GS or WH with or without Watcher
  • Agrebenisan/swq hwq cardinality cleanup
  • #6146: Add backward support for conj
  • #0: bug fix UTWH div_up instead of div trunc for calculating CB sizes
  • Fix To/From Sharded Bug
  • #6206: Fix resharding page mapp...
…
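
The sketch referenced in the softplus entry above. softplus(x) = (1/beta) * log(1 + exp(beta * x)), with the identity used once beta * x exceeds threshold. The ttnn.softplus spelling is an assumption; at the time, this entry may have been exposed through tt_lib instead.

```python
# Minimal sketch of softplus with beta and threshold parameters.
# The ttnn.softplus entry point is an assumption here.
import torch
import ttnn

device = ttnn.open_device(device_id=0)

x = ttnn.from_torch(torch.randn(1, 1, 32, 32, dtype=torch.bfloat16),
                    layout=ttnn.TILE_LAYOUT, device=device)

y = ttnn.softplus(x, beta=1.0, threshold=20.0)
print(ttnn.to_torch(y)[0, 0, 0, :4])  # smooth approximation of relu(x)

ttnn.close_device(device)
```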