
v0.48.0

github-actions released this 10 Jun 18:09 · 3526 commits to main since this release

📦 Uncategorized

  • #7744: Add support for non-4D tensor in moreh_sum, moreh_sum_backward
  • #5544: Add output tensors parameter to moreh_nll_loss op
  • #5544: Add output tensors parameter to moreh_sgd op
  • #5544: Fix package build error
  • #5544: Add output tensors parameter to moreh_linear op
  • #5544: Prevent eager unit test failures
  • #7997: Support non-4D tensor in moreh_softmax
  • #7816: Bump SD perf target
  • #8098: Remove temp buffer copying when reading from hugepage to host buffer
  • #0: Specify DEBUG_STATUS as a string literal instead of multiple chars
  • #8212: Fix uneven shards for interleaved_to_sharded op
  • #0: Refactor unpad tile to modify rt args in place and remove dynamic…
  • #7838: Add support for non-4D tensor in moreh_linear OPs
  • #0: Use split_work_for_tilize in both tilize and untilize
  • #8131: resnet-50 fix for b20.
  • Add support for multiple parameters in EltwiseUnary
  • #7625: Enable multicore for tilize with padding by default
  • Trace Support
  • #0: Switch set runtime args assertion for if kernel was placed on core to TT_ASSERT
  • #7179: enabling test case. The issue was not reproducible on 8.12 dri…
  • #4625: Multicore runs for untilize with unpadding on interleaved tensors
  • #0: Cache program cmds, convert cb configs from write linear to write packed
  • #0: Make skip and xfail optional in defining sweep tests
  • Shwetank tt/bcast op
  • #8364: Disable implicit fallback for ttnn.pad
  • #8513: Add slack notifications to several more pipelines
  • #0: Update common RT args to use no stride flag for packed cmd.
  • #0: Option to write compile_commands.json from CMake
  • #8718: eltwise testing for bfloat8
  • Add support for bfloat8 input tensors in Mamba SSM block custom kernels
  • #8460: Enable Clang-17
  • #0: Remove overhead in calling functions wrapped in tensor_impl_wrapper
  • #0: Update the perf threshold to incorporate the "Merge back uneven reshard" commit
  • #6365: Add ttnn host tests
  • #6365: Revert "#6365: Add ttnn host tests (#8210)"
  • #4382: fix GH reported vulnerabilities
  • #0: bump C++ timeout limit to 45 minutes
  • update unpad doc for slice generality
  • Convert Falcon7b tt_lib ops and tensors to ttnn.experimental
  • #6365: Fix ttnn host wheel tests
  • Add git bisect script
  • #0: Move falcon40b ci unit tests to different pipeline
  • #8437: remove default matmul program config
  • #0: Add myself to ttnn codeowners
  • #0: Update README.md to include mention of TTNN_CONFIG_OVERRIDES (usage sketch after this list)
  • #0: Fix typos and add TTNN_CONFIG_OVERRIDES parameter descriptions to README
  • #0: Add basic sanity checks during matmul program config creation
  • #8907: Sweep tests for tilize/untilize
  • #8902: Fixed program caching bug in nlp load slice op and added additional test cases for the op
  • #8917: Add sweep test for the fold op
  • #0: Properly support trivial single core case for 1D matmuls
  • #6343: updated test_perf with test for bloom causal_lm
  • #6343: Add functional_bloom test_demo
  • Update README.md
  • Enable optimised attention by default in falcon prefill.
  • Replace FreeList shared_ptr with local_shared_ptr
  • Add dummy_weights mode for mixtral tests
  • Refactor operation calls: Replace operation::run() with operation::launch_op()
  • Use HiFi2 to bump Falcon7b prefill PCC
  • #8902: add input and attn_mask del
  • #8930: Disable llama perf test
  • #0: Add third codeowner to matmul path
  • #0: Add create_venv.sh as environment option in installation instructions
  • #7083: Composite conv fix for relu called after matmul
  • #7525: Skip batch 7 metal BERT on WH B0 because it still hangs too often
  • #8871: Add initial infra/support for dram sharding
  • #8531: delete all makefiles
  • #0: Delete dead code from work_split.hpp
  • #8853: Uplift SFPI to latest w/ BH support
  • #8725: Warn user if kernel cache is enabled
  • #0: Minor test_prefetcher fixes
  • #5389: Move ttnn.repeat to c++
  • #8131: temp fix for PCC issue on W0.
  • Optimize e2e perf Falcon40b modifying layernorm
  • #0: Relax Falcon7b perf target
  • #0: Resolve segfault in llama async mode
  • Resnet Optimizations
  • Create Falcon7b perplexity test and utility functions for text-gen datasets
  • Revert "#8131: temp fix for PCC issue on W0."
  • bmm dram sharded opt
  • #8943: Clean up profiler python_env build flow
  • #8904: Add slack notifications for T3000 unit-tests
  • Add unet shallow functional, performance and demo test files
  • #8932: Multi-Device Mixtral Argmax Support
  • #8264: Worker thread optimizations:
  • TTNN tests for bf8 with mk tiled scalar
  • Ihamer/7468 inject noc delays
  • Support changed csv row orderings in Mixtral's op_perf_results.py
  • Correct merge issue in op_perf_results.py
  • #0: Add kernel groups to test_pgm_dispatch
  • #0: Add docs requirements to python env cache key because it can change the environment as well
  • #0: Add helper function to create CBs
  • #8973: Remove TT_METAL_ENV because we don't need it anymore
  • #5773: Move SD model to demo folder
  • #6938: Implement softplus as a single kernel
  • Model team/rotary embeddings llama
  • #8735: Fix hw/inc/blackhole files for compilation
  • Improve Mixtral perf with ttlib
  • Update README.md
  • #3712: fix old version of GN test
  • #0: Don't error on unused functions in compiler call
  • Revert " #8904: Add slack notifications for T3000 unit-tests"
  • Rtawfik/bh llk api
  • #0: Added interactive demo
  • Move Falcon7b before Mixtral in demo pipeline to workaround issue
  • #8112: Add support for ND tensors to matmul
  • #0: fix dram read benchmark
  • Fix bug in utility_functions::Profiler
  • Remove 1x1 matmul fallback on convolution and generalize convo…
  • #5389: Remove ttnn.split
  • #8767: decouple build folder name from build.cpp
  • #8735: Update common flags for BH build after sfpi module update
  • #8895: Fix ttnn.as_tensor(..) method for placing tensors on-device
  • #8539: Add cq_id to run_operation function args
  • #8632: Support fp32 dest acc en in moreh_sum and moreh_sum_backward
  • #5044: Add optional output tensor and remove autoformat in eltwise binary ops
  • #8895: Fix failing regression test in dump_tensor(...) API
  • More Resnet Optimizations
  • #4858: add typecast fp32 to uint32 op
  • #8995: refactoring moreh arange
  • #0: Add ccache option to build_metal.sh
  • Update Mixtral perf figures
  • #8349: Use BFP4_B for attention mask in falcon7b optimised prefill.
  • #0: Add CODEOWNERS for build_metal.sh
  • Rtawfik/add binary reuse metal
  • Update watcher.rst - use double backticks
  • Falcon40b tt_lib to ttnn.experimental
  • #0: fix dram sharded program cache
  • #7083: New halo fix for enabled program cache
  • #9051: Enable Llama model perf test
  • #8764: Single card WH demo tests
  • #8764: Various docs fixes for WH release
  • #0: Correct script locations for nightly single card
  • #8764: Use new device_l1_small_size fixture for SD demo interactive test
  • #9059: Update matmul test pcc
  • #0: Ensure weka mount is active for demo tests; otherwise they won't run
  • #0: remove reserve to avoid bad alloc
  • #8764: Separate n150/n300 demo tests to not run BERT 11 on N150
  • Remove unnecessary llk sfpu param files
  • #9059: Add fallback for getting matmul program config
  • Add grouped convolution support
  • #8282: Support non-4d tensor and fp32_dest_acc_en for moreh nllloss backward
  • #8976: moreh_getitem receive signed integer index tensors
  • #9049: fix moreh_sgd callback and add callback test
  • #0: Remove argmax multi-device test due to segfault
  • #7724: Add prototype for autonomous streams for use in tunneller
  • #9036: GS & BH --> Combine llk param files using variable args
  • #0: optimize allgather for small tensor sizes
  • Enable weight caching for long running Mamba tests
  • #5389: removed early return from validate when enable_fast_runtime_mo…
  • Removed unnecessary ttnn.to_device() from Mixtral code
  • Add 2 cq implementation for Resnet
  • #9084: Rename dockerfile and add virtualenv installation
  • #0: Watcher interval to not include polling time
  • #0: Revert "#8264: Worker thread optimizations:"
  • #5389: disabled failing moreh tests
  • #5389: disabled failing moreh tests
  • #5389: disabled failing moreh tests
  • #0: Update Resnet perf numbers
  • Split dispatcher commands into packets; prefetcher relay_linear bug fix and test improvements
  • #6448: re-enable all-gather bidir for dim 0,1
  • #8890: Reduce size of pack_src|dst_format constexprs
  • #0: merge all kernels into one group
  • #7724: Disable a test to reduce runtime
  • ttnn multi-chip changes for galaxy support
  • #9026: Fix FD dispatcher wait on wrapped value
  • #0: Add back Async Mode optimizations
  • Add support for bfloat8 activations in Mamba
  • #9118: Fix moreh getitem, moreh nllloss validation error
  • Update ViT E2E number in README.md
  • #4858: enable typecast fp16b to uint16
  • #8540: Upgrade eltwise binary ops to support queue_id / output_tensor / uint output dtype (see the sketch after this list)
  • #9095: implement callback helper function
  • #5044: Add optional output to where op
  • #0: enable multi-device tensor support for moreh sum op
  • #5337: Mixtral dense matmul after all-gather
  • Update Mamba decode performance metrics
  • #8683: Add Unary right shift
  • Snijjar/issue 7724
  • #5044: add optional output to BW ops EQ, add, addalpha, mul
  • build UMD with same compiler used to compile metal and remove clang 6 as a dependency
  • #0: change silicon param to session scope
  • Mo/8223 fd2 dispatch core profiler support
  • #9006: single-core topk extension to include larger width and height
  • #9088: fix ttnn_falcon_7b single-device regression in decoder module
  • #7586: Create unstable branch of WH single card nightly FD
  • #9143: BH -> Remove unused reduce args
  • #8563: sweep split_query_key_value_and_split_heads, split and concat
  • #8407: Remove 1x1 matmul fallback on convolution and generalize convo…
  • #4252: Update to C++20
  • #9110: Move typecast to ttnn (see the sketch after this list)
  • Update TTNN sweeps - concatenate heads, embeddings
  • #9016: adjust nightly t3000 demo test pipeline to run Mon/Wed/Fri
  • #9088: fix ttnn_falcon_7b single-device regression in attention
  • #9167: sped up compute program hash
  • #9109: Add q_id to Eltwise binary EQ
  • #8662: add initial argmax op single core kernel implementation
  • #8424: Add new llk-wormhole-b0 commit: remove assert for fp32 zeroacc
  • #9059: adjust matmul parameters for rounding up in some scenarios
  • #5389: Move ttnn.repeat_interleave to c++
  • #9167: updated llama3 ops to use attribute_names + attribute_values instead of the attributes method
  • #8681: Add Floor , Trunc dependant ops
  • Fuse Mamba block residual projection with activation
  • #9167: sped up compute program hash
  • Add trace 2cq version of Resnet
  • #9167: changed program cache to use unique_any as the value type
  • #8683: Add Unary left shift
  • Mixtral: Add EoS token stop to demo
  • #0: Update Falcon7b CODEOWNERS
  • #8764: Part 2 fixes for docs for wormhole readiness
  • Correctly block for the current EP when blocking=true
  • Applying Llama2 Decode and Prefill Kernels to experimentals folder
  • #9198: Fix minor regression in some nightly tests due to small packet optimization
  • Fix softmax sharded program cache hit
  • #0: add support for in1 dram sharded matmul2d
  • #0: Fix repack_weights.py script for llama writing params.json contents using out_dir as a file
  • #8965: deallocate all buffers on device when closing
  • #0: Update noc_async_read/write docs to not specify only dram coords
  • #9137: clean target will now remove entire built folder
  • #9142: BH -> Fix pack api, add constant vector
  • Standardize llk sfpu inits
  • #0: Fix jupyterlab pinned to two different versions
  • #4858: add uint16 to fp16b typecast support
  • #0: pad subblock size, allow mixtral shapes to reach 240 GB/s
  • #7083: conv config cleanup in python and c++ changes
  • #0: Add option to validate program binaries on device before enqueuing program in debug mode
  • #7822: Fix conditionals for bmm multi core reuse optimized for when to update rt args
  • #8764: Set TTNN_CONFIG_OVERRIDES if it exists in the ttnn workflow
  • #9270: tracy linking error fix
  • #9200: Use project paths in CMake
  • #0: Make NUMA node based binding opt-in
  • #5337: Add extrapolation and skipping to op_perf_results
  • Update Mistral perf figures
  • Improve mistral perf test for 1024 seqlen and on-device profiling
  • Fix log message typo (importting -> importing)
  • #7586: Move current wh b0 only single-card nightly tests to the ln model
  • [Falcon7b] Add support for 2k kv-cache size for decode l1-sharded configuration
  • #0: Update Llama experimental readme
  • #8725: Update warning for persistent kernel cache
  • [Falcon7b] Add option to run huggingface model in perplexity test, and add perplexity test to demo ci
  • #0: Skip failing resnet tests
  • #8658: Migrate composite unary ops to C++
  • #5389: updated ShardSpec to use attribute_names + attribute_values instead of attributes
  • #8764: Run ttnn ipynb tutorials on N150/N300
  • #8837: Fix Resnet trace 2cq version to write inputs on cq 1
  • #753: Syncing device host times for tracy profiler
  • #8940: Get rid of source code directories in local environment to ensure that end to end environment is valid
  • Fix Mixtral ttnn.eq dtype
  • #8764: ttnn examples in ci
  • Binary dest accumulation
  • Move program configs out of runtime codepath.
  • #0: Fix import error for skipping ttnn resnet tests
  • #0: opt dram u-bench to 267GB/s
  • Add ttnn argmax op
  • #0: Cleanup bmm multi core reuse optimized ORTAs
  • #9080: Migrate pipeline owners
  • TTNN split removal fix
  • Update sweeps documentation
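
For reference, a minimal sketch of the TTNN_CONFIG_OVERRIDES usage mentioned in the README items above, assuming the variable is read as a JSON object of ttnn configuration fields; the enable_fast_runtime_mode key is inferred from the #5389 item in this release and should be treated as an assumption:

```python
import os

# Assumed behavior: ttnn reads TTNN_CONFIG_OVERRIDES as a JSON object of
# configuration fields when it is imported. The key below is inferred from
# the enable_fast_runtime_mode item in this release, not from documented API.
os.environ["TTNN_CONFIG_OVERRIDES"] = '{"enable_fast_runtime_mode": false}'

import ttnn  # the override is picked up when ttnn loads its configuration
```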
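The eltwise binary upgrade (#8540, building on #5044) adds a queue_id argument and an optional preallocated output tensor. A rough sketch of the call pattern, with keyword names inferred from the PR titles rather than from documented API:

```python
import torch
import ttnn

device = ttnn.open_device(device_id=0)

a = ttnn.from_torch(torch.rand(32, 32), dtype=ttnn.bfloat16,
                    layout=ttnn.TILE_LAYOUT, device=device)
b = ttnn.from_torch(torch.rand(32, 32), dtype=ttnn.bfloat16,
                    layout=ttnn.TILE_LAYOUT, device=device)

# Assumed signature per #5044/#8540: write the result into a preallocated
# tensor on a given command queue instead of allocating a new output.
out = ttnn.zeros_like(a)
ttnn.add(a, b, output_tensor=out, queue_id=0)

ttnn.close_device(device)
```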
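Similarly, a sketch of the typecast op migrated to ttnn (#9110), assuming it takes the target dtype as its second argument, matching the fp32-to-uint32 and fp16b-to-uint16 items above:

```python
import torch
import ttnn

device = ttnn.open_device(device_id=0)

x = ttnn.from_torch(torch.rand(32, 32), dtype=ttnn.float32,
                    layout=ttnn.TILE_LAYOUT, device=device)

# Assumed call pattern for the migrated op: cast fp32 -> uint32 on device,
# per the "#4858: add typecast fp32 to uint32 op" item in this release.
y = ttnn.typecast(x, ttnn.uint32)

ttnn.close_device(device)
```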