
v0.48.0

github-actions released this 10 Jun 18:09 · 3526 commits to main since this release

📦 Uncategorized

  • #7744: Add support for non-4D tensor in moreh_sum, moreh_sum_backward
  • #5544: Add output tensors parameter to moreh_nll_loss op
  • #5544: Add output tensors parameter to moreh_sgd op
  • #5544: Fix package build error
  • #5544: Add output tensors parameter to moreh_linear op
  • #5544: Prevent eager unit test failures
  • #7997: Support non-4D tensor in moreh_softmax
  • #7816: Bump SD perf target
  • #8098: Remove temp buffer copying when reading from hugepage to host buffer
  • #0: Specify DEBUG_STATUS as a string literal instead of multiple chars
  • #8212: Fix uneven shards for interleaved_to_sharded op
  • #0: Refactor unpad tile to modify rt args in place and remove dynamic…
  • #7838: Add support for non-4D tensor in moreh_linear OPs
  • #0: Use split_work_for_tilize in both tilize and untilize
  • #8131: resnet-50 fix for b20.
  • Add support for multiple parameters in EltwiseUnary
  • #7625: Enable multicore for tilize with padding by default
  • Trace Support
  • #0: Switch set runtime args assertion for if kernel was placed on core to TT_ASSERT
  • #7179: enabling test case. The issue was not reproducible on 8.12 dri…
  • #4625: Multicore runs for untilize with unpadding on interleaved tensors
  • #0: Cache program cmds, convert cb configs from write linear to write packed
  • #0: Make skip and xfail optional in defining sweep tests
  • Shwetank tt/bcast op
  • #8364: Disable implicit fallback for ttnn.pad
  • #8513: Add slack notifications to several more pipelines
  • #0: Update common RT args to use no stride flag for packed cmd.
  • #0: Option to write compile_commands.json from CMake
  • #8718: eltwise testing for bfloat8
  • Add support for bfloat8 input tensors in Mamba SSM block custom kernels
  • #8460: Enable Clang-17
  • #0: Remove overhead in calling functions wrapped in tensor_impl_wrapper
  • #0: Update the perf threshold to incorporate the "Merge back uneven reshard" commit
  • #6365: Add ttnn host tests
  • #6365: Revert "#6365: Add ttnn host tests (#8210)"
  • #4382: fix GH reported vulnerabilities
  • #0: bump C++ timeout limit to 45 minutes
  • update unpad doc for slice generality
  • Convert Falcon7b tt_lib ops and tensors to ttnn.experimental
  • #6365: Fix ttnn host wheel tests
  • Add git bisect script
  • #0: Move falcon40b ci unit tests to different pipeline
  • #8437: remove default matmul program config
  • #0: Add myself to ttnn codeowners
  • #0: Update README.md to include mention of TTNN_CONFIG_OVERRIDES (usage sketch after this list)
  • #0: Fix typos and add TTNN_CONFIG_OVERRIDES parameter descriptions to README
  • #0: Add basic sanity checks during matmul program config creation
  • #8907: Sweep tests for tilize/untilize
  • #8902: Fixed program caching bug in nlp load slice op and added additional test cases for the op
  • #8917: Add sweep test for the fold op
  • #0: Properly support trivial single core case for 1D matmuls
  • #6343: updated test_perf with test for bloom causal_lm
  • #6343: Add functional_bloom test_demo
  • Update README.md
  • Enable optimised attention by default in falcon prefill.
  • Replace FreeList shared_ptr with local_shared_ptr
  • Add dummy_weights mode for mixtral tests
  • Refactor operation calls: Replace operation::run() with operation::launch_op()
  • Use HiFi2 to bump Falcon7b prefill PCC
  • #8902: add input and attn_mask del
  • #8930: Disable llama perf test
  • #0: Add third codeowner to matmul path
  • #0: Add create_venv.sh as environment option in installation instructions
  • #7083: Composite conv fix for relu called after matmul
  • #7525: Skip batch 7 metal BERT on WH B0 because it still hangs too often
  • #8871: Add initial infra/support for dram sharding
  • #8531: delete all makefiles
  • #0: Delete dead code from work_split.hpp
  • #8853: Uplift SFPI to latest w/ BH support
  • #8725: Warn user if kernel cache is enabled
  • #0: Minor test_prefetcher fixes
  • #5389: Move ttnn.repeat to c++
  • #8131: temp fix for PCC issue on W0.
  • Optimize e2e perf Falcon40b modifying layernorm
  • #0: Relax Falcon7b perf target
  • #0: Resolve segfault in llama async mode
  • Resnet Optimizations
  • Create Falcon7b perplexity test and utility functions for text-gen datasets
  • Revert "#8131: temp fix for PCC issue on W0."
  • bmm dram sharded opt
  • #8943: Clean up profiler python_env build flow
  • #8904: Add slack notifications for T3000 unit-tests
  • Add unet shallow functional, performance and demo test files
  • #8932: Multi-Device Mixtral Argmax Support
  • #8264: Worker thread optimizations:
  • TTNN tests for bf8 with mk tiled scalar
  • Ihamer/7468 inject noc delays
  • Support changed csv row orderings in Mixtral's op_perf_results.py
  • Correct merge issue in op_perf_results.py
  • #0: Add kernel groups to test_pgm_dispatch
  • #0: Add docs requirements to python env cache key because it can change the environment as well
  • #0: Add helper function to create CBs
  • #8973: Remove TT_METAL_ENV because we don't need it anymore
  • #5773: Move SD model to demo folder
  • #6938: Implement softplus as a single kernel
  • Model team/rotary embeddings llama
  • #8735: Fix hw/inc/blackhole files for compilation
  • Improve Mixtral perf with ttlib
  • Update README.md
  • #3712: fix old version of GN test
  • #0: Don't error on unused functions in compiler call
  • Revert " #8904: Add slack notifications for T3000 unit-tests"
  • Rtawfik/bh llk api
  • #0: Added interactive demo
  • Move Falcon7b before Mixtral in demo pipeline to workaround issue
  • #8112: Add support for ND tensors to matmul
  • #0: fix dram read benchmark
  • Fix bug in utility_functions::Profiler
  • Remove 1x1 matmul fallback on convolution and generalize convo…
  • #5389: Remove ttnn.split
  • #8767: decouple build folder name from build.cpp
  • #8735: Update common flags for BH build after sfpi module update
  • #8895: Fix ttnn.as_tensor(..) method for placing tensors on-device
  • #8539: Add cq_id to run_operation function args
  • #8632: Support fp32 dest acc en in moreh_sum and moreh_sum_backward
  • #5044: Add optional output tensor and remove autoformat in eltwise binary ops
  • #8895: Fix failing regression test in dump_tensor(...) API
  • More Resnet Optimizations
  • #4858: add typecast fp32 to uint32 op
  • #8995: refactoring moreh arange
  • #0: Add ccache option to build_metal.sh
  • Update Mixtral perf figures
  • #8349: Use BFP4_B for attention mask in falcon7b optimised prefill.
  • #0: Add CODEOWNERS for build_metal.sh
  • Rtawfik/add binary reuse metal
  • Update watcher.rst - use double backticks
  • Falcon40b tt_lib to ttnn.experimental
  • #0: fix dram sharded program cache
  • #7083: New halo fix for enabled program cache
  • #9051: Enable Llama model perf test
  • #8764: Single card WH demo tests
  • #8764: Various docs fixes for WH release
  • #0: Correct script locations for nightly single card
  • #8764: Use new device_l1_small_size fixture for SD demo interactive test
  • #9059: Update matmul test pcc
  • #0: Ensure weka mount is active for demo tests; otherwise they won't run
  • #0: remove reserve to avoid bad alloc
  • #8764: Separate n150/n300 demo tests to not run BERT 11 on N150
  • Remove unnecessary llk sfpu param files
  • #9059: Add fallback for getting matmul program config
  • Add grouped convolution support
  • #8282: Support non-4d tensor and fp32_dest_acc_en for moreh nllloss backward
  • #8976: moreh_getitem receive signed integer index tensors
  • #9049: fix moreh_sgd callback and add callback test
  • #0: Remove argmax multi-device test due to segfault
  • #7724: Add prototype for autonomous streams for use in tunneller
  • #9036: GS & BH --> Combine llk param files using variable args
  • #0: optimize allgather for small tensor sizes
  • Enable weight caching for long running Mamba tests
  • #5389: removed early return from validate when enable_fast_runtime_mo…
  • Removed unnecessary ttnn.to_device() from Mixtral code
  • Add 2 cq implementation for Resnet
  • #9084: Rename dockerfile and add virtualenv installation
  • #0: Watcher interval to not include polling time
  • #0: Revert "#8264: Worker thread optimizations:"
  • #5389: disabled failing moreh tests
  • #5389: disabled failing moreh tests
  • #5389: disabled failing moreh tests
  • #0: Update Resnet perf numbers
  • Split dispatcher commands into packets; prefetcher relay_linear bug fix and test improvements
  • #6448: re-enable all-gather bidir for dim 0,1
  • #8890: Reduce size of pack_src|dst_format constexprs
  • #0: merge all kernels into one group
  • #7724: Disable a test to reduce runtime
  • ttnn multi-chip changes for galaxy support
  • #9026: Fix FD dispatcher wait on wrapped value
  • #0: Add back Async Mode optimizations
  • Add support for bfloat8 activations in Mamba
  • #9118: Fix moreh getitem, moreh nllloss validation error
  • Update ViT E2E number in README.md
  • #4858: enable typecast fp16b to uint16
  • #8540: Upgrade eltwise binary ops to support queue_id / output_tensor / uint output dtype (see the sketch after this list)
  • #9095: implement callback helper function
  • #5044: Add optional output to where op
  • #0: enable multi-device tensor support for moreh sum op
  • #5337: Mixtral dense matmul after all-gather
  • Update Mamba decode performance metrics
  • #8683: Add Unary right shift
  • Snijjar/issue 7724
  • #5044: add optional output to BW ops EQ, add, addalpha, mul
  • build UMD with same compiler used to compile metal and remove clang 6 as a dependency
  • #0: change silicon param to session scope
  • Mo/8223 fd2 dispatch core profiler support
  • #9006: single-core topk extension to include larger width and height
  • #9088: fix ttnn_falcon_7b single-device regression in decoder module
  • #7586: Create unstable branch of WH single card nightly FD
  • #9143: BH -> Remove unused reduce args
  • #8563: sweep split_query_key_value_and_split_heads, split and concat
  • #8407: Remove 1x1 matmul fallback on convolution and generalize convo…
  • #4252: Update to C++20
  • #9110: Move typecast to ttnn (see the sketch after this list)
  • Update TTNN sweeps - concatenate heads, embeddings
  • #9016: adjust nightly t3000 demo test pipeline to run Mon/Wed/Fri
  • #9088: fix ttnn_falcon_7b single-device regression in attention
  • #9167: sped up compute program hash
  • #9109: Add q_id to Eltwise binary EQ
  • #8662: add initial argmax op single core kernel implementation
  • #8424: Add new llk-wormhole-b0 commit: remove assert for fp32 zeroacc
  • #9059: adjust matmul parameters for rounding up in some scenarios
  • #5389: Move ttnn.repeat_interleave to c++
  • #9167: updated llama3 ops to use attribute_names + attribute_values instead of the attributes method
  • #8681: Add Floor , Trunc dependant ops
  • Fuse Mamba block residual projection with activation
  • #9167: sped up compute program hash
  • Add trace 2cq version of Resnet
  • #9167: changed program cache to use unique_any as the value type
  • #8683: Add Unary left shift
  • Mixtral: Add EoS token stop to demo
  • #0: Update Falcon7b CODEOWNERS
  • #8764: Part 2 fixes for docs for wormhole readiness
  • Correctly block for the current EP when blocking=true
  • Applying Llama2 Decode and Prefill Kernels to experimentals folder
  • #9198: Fix minor regression in some nightly tests due to small packet optimization
  • Fix softmax sharded program cache hit
  • #0: add support for in1 dram sharded matmul2d
  • #0: Fix repack_weights.py script for llama writing params.json contents using out_dir as a file
  • #8965: deallocate all buffers on device when closing
  • #0: Update noc_async_read/write docs to not specify only dram coords
  • #9137: clean target will now remove entire built folder
  • #9142: BH -> Fix pack api, add constant vector
  • Standardize llk sfpu inits
  • #0: Fix jupyterlab pinned to two different versions
  • #4858: add uint16 to fp16b typecast support
  • #0: pad subblock size, allow mixtral shapes to reach 240 GB/s
  • #7083: conv config cleanup in python and c++ changes
  • #0: Add option to validate program binaries on device before enqueuing program in debug mode
  • #7822: Fix conditionals for bmm multi core reuse optimized for when to update rt args
  • #8764: Set TTNN_CONFIG_OVERRIDES if it exists in the ttnn workflow
  • #9270: tracy linking error fix
  • #9200: Use project paths in CMake
  • #0: Make NUMA node based binding opt-in
  • #5337: Add extrapolation and skipping to op_perf_results
  • Update Mistral perf figures
  • Improve mistral perf test for 1024 seqlen and on-device profiling
  • Fix log message typo (importting -> importing)
  • #7586: Move current wh b0 only single-card nightly tests to the ln model
  • [Falcon7b] Add support for 2k kv-cache size for decode l1-sharded configuration
  • #0: Update Llama experimental readme
  • #8725: Update warning for persistent kernel cache
  • [Falcon7b] Add option to run huggingface model in perplexity test, and add perplexity test to demo ci
  • #0: Skip failing resnet tests
  • #8658: Migrate composite unary ops to C++
  • #5389: updated ShardSpec to use attribute_names + attribute_values instead of attributes
  • #8764: Run ttnn ipynb tutorials on N150/N300
  • #8837: Fix Resnet trace 2cq version to write inputs on cq 1
  • #753: Syncing device host times for tracy profiler
  • #8940: Get rid of source code directories in local environment to ensure that end to end environment is valid
  • Fix Mixtral ttnn.eq dtype
  • #8764: ttnn examples in ci
  • Binary dest accumulation
  • Move program configs out of runtime codepath.
  • #0: Fix import error for skipping ttnn resnet tests
  • #0: opt dram u-bench to 267GB/s
  • Add ttnn argmax op
  • #0: Cleanup bmm multi core reuse optimized ORTAs
  • #9080: Migrate pipeline owners
  • TTNN split removal fix
  • Update sweeps documentation
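
For reference, a minimal sketch of the TTNN_CONFIG_OVERRIDES usage mentioned in the README items above, assuming the variable is read as a JSON object of ttnn configuration fields; the enable_fast_runtime_mode key is inferred from the #5389 item in this release and should be treated as an assumption:

```python
import os

# Assumed behavior: ttnn reads TTNN_CONFIG_OVERRIDES as a JSON object of
# configuration fields when it is imported. The key below is inferred from
# the enable_fast_runtime_mode item in this release, not from documented API.
os.environ["TTNN_CONFIG_OVERRIDES"] = '{"enable_fast_runtime_mode": false}'

import ttnn  # the override is picked up when ttnn loads its configuration
```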
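The eltwise binary upgrade (#8540, building on #5044) adds a queue_id argument and an optional preallocated output tensor. A rough sketch of the call pattern, with keyword names inferred from the PR titles rather than from documented API:

```python
import torch
import ttnn

device = ttnn.open_device(device_id=0)

a = ttnn.from_torch(torch.rand(32, 32), dtype=ttnn.bfloat16,
                    layout=ttnn.TILE_LAYOUT, device=device)
b = ttnn.from_torch(torch.rand(32, 32), dtype=ttnn.bfloat16,
                    layout=ttnn.TILE_LAYOUT, device=device)

# Assumed signature per #5044/#8540: write the result into a preallocated
# tensor on a given command queue instead of allocating a new output.
out = ttnn.zeros_like(a)
ttnn.add(a, b, output_tensor=out, queue_id=0)

ttnn.close_device(device)
```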
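Similarly, a sketch of the typecast op migrated to ttnn (#9110), assuming it takes the target dtype as its second argument, matching the fp32-to-uint32 and fp16b-to-uint16 items above:

```python
import torch
import ttnn

device = ttnn.open_device(device_id=0)

x = ttnn.from_torch(torch.rand(32, 32), dtype=ttnn.float32,
                    layout=ttnn.TILE_LAYOUT, device=device)

# Assumed call pattern for the migrated op: cast fp32 -> uint32 on device,
# per the "#4858: add typecast fp32 to uint32 op" item in this release.
y = ttnn.typecast(x, ttnn.uint32)

ttnn.close_device(device)
```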