Releases
v0.45.0
🚀 Features
#6204 : added support for num_users < 32 for update cache op.
#6247 Llama2 Galaxy MLP implementation
📦 Uncategorized
#4736 : Add support for moreh_norm op
Fix moreh_layernorm rstd
#5508 : Change test_moreh_layernorm.py for debugging
#4686 : add infra for sharing global struct among ops
#5592 : Fix pcc on Falcon 7b prefill by turning on l1 packer on MLP 4h-to-h matmul
Fix layernorm beta data format reconfig
Add linked support for in0 in1 mcast in matmul
#4957 : optimizing construct_2d_padded_tensor_list
#4003 : added ttnn.as_tensor and enabled support for caching torch tensor
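A minimal sketch of the new caching path, assuming a file-backed cache via the `cache_file_name` argument (the path and shapes below are illustrative):

```python
import torch
import ttnn

device = ttnn.open_device(device_id=0)

torch_weight = torch.randn(32, 32)

# First call converts the torch tensor and writes the result to the cache
# file; subsequent calls with the same cache_file_name load it from disk
# instead of reconverting.
weight = ttnn.as_tensor(
    torch_weight,
    dtype=ttnn.bfloat16,
    layout=ttnn.TILE_LAYOUT,
    device=device,
    cache_file_name="weights/linear_weight",  # illustrative path
)

ttnn.close_device(device)
```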
Revert "#0: Fix for fail in asinh backward"
#5829 : Use moreh_common.hpp for data movement kernels across moreh OPs
Barsic/ttnn ops
#6030 : Update resnet performance metrics
#5876 : pytest & c++ test logging cleanup
#0: Use both 2x2 and 2x4 machines on every scheduled run
Add single core matmul benchmark
#6079 : Update FORCE_INLINE to be nop when watcher is enabled
#5980 : Fix a hard-coded bounds check in dprint
#5389 : merged ttl and ttnn tensor classes into one
Initial Performance Model
fix ci
TTNN RN50 :: on the road to match perf with TTLIB version
#4438 : Optimized single-core fold op
#5589 : Add repeat-interleave and addcmul sweeps
#6055 : Add square backward support
#6057 : Add backward support for lgamma
#6056 : Add backward support for frac and trunc
#6066 : Add support for backward log sigmoid
#6002 : Add backward support for binary maximum
Ngrujic/improve conversion to bfloat8b in sweeps
#5829 : Use moreh_common.hpp for compute kernels across moreh OPs
#0: Remove post-commit label from multi device pipeline because it's not actually post commit
Add pack l1 acc to resnet conv
#6144 : Skip 512x512 cross attn 2d upblock for now in nightly because it hangs
#6061 : Add tanhshrink, threshold, Unary EQ backward ops support
Width Sharded Concat for Unet
#5184 : uncommenting various moreh test cases
Fix compute kernel config arg for resnet50
Nsmith/untilize unit test
Revert "Revert "#5389 : merged ttl and tensor classes into one""
#4438 : Do not use the new fold op in Resnet tests
Remove corerangeset that does not work on wormhole
#6129 : Expose kernel config attrs and use 4 dst tiles for fp32 configs
#5391 : Add device perf
#0: Use multiplier for wormhole b0 mulsi3
#4003 : removed ttnn.Tensor autoclass from tensor.rst
TTNN MultiDevice Support
build artifacts
#4947 : Add noc alignment checks to watcher
Add ttnn multi-chip unit test for checking device shards
Nsmith/fix unet
#6043 : Random program stress test of command queues
Logit and logiteps backward support
Backward support for log2
Add missing ttnn tests and disable broken tests until issues are fixed
Fix Events feature for FD1.3 (out-of-order event ids, events feature missing) #6093
#5873 : make top-level post commit workflow re-useable
#5589 : add groupnorm for ttnn sweeps
Ngrujic/ttnn sweeps 4
Add ethernet datamover (EDM) - a foundational ethernet transfer engine
#6116 : Add backward support for softshrink
#0: Add verbose make logs to artifact and make nicer name on metal
#0: Only use 2x4 setup for multi-card WH CI as 2x2 does not provide us good feedback
#4809 dprint tensix regs
#4003 : fixed bloom perf test
#6187 : Conv bugfix
#0: concat RM support variable stick widths across inputs
TTNN RN50 on WHB0
#6084 : Lower thresholds slightly after using proper configs for device resnet
Fast dispatch 2.0 proof of concept
#6218 : add pytest for matmul 1d 2d
#6177 : use is_tensor_storage_on_device so it works for MultiDeviceStorage
#6082 : support workers + eth cores in one program
#6215 : Rename TensorToMeshMapper/MeshToTensorComposer
#6164 : Update test_noc_unicast_vs_multicast_to_single_core_latency to not use same cores for producer and consumer on WH
#6117 : Add backward support for softplus
#6223 : remove redundant call to context switch
Integrate EDM with all-gather.
#6136 : Add backward support for unary LE and GE
#5398 : fix unicast binaries
Barsic/ttnn ops 2
#5380 : Add wormhole_b0 model perf tests, only falcon7b in ttlib for now
#5372 : Updated README.md file for demo
#4003 : updated ttnn.concat to have a registered fallback
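For context, a hedged sketch of `ttnn.concat`; with a registered fallback, argument combinations the device kernels do not support are routed to the host/torch implementation rather than erroring (exact fallback behavior assumed):

```python
import torch
import ttnn

device = ttnn.open_device(device_id=0)

a = ttnn.from_torch(torch.randn(1, 1, 32, 32), layout=ttnn.TILE_LAYOUT, device=device)
b = ttnn.from_torch(torch.randn(1, 1, 32, 32), layout=ttnn.TILE_LAYOUT, device=device)

c = ttnn.concat([a, b], dim=3)  # -> shape (1, 1, 32, 64)

ttnn.close_device(device)
```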
Llama2 functional bringup
#5589 : Add working BFLOAT8_B sweeps to working folder
FD2.0 rename HostQ->PrefetchQ, add multi-core capability, fix NOC coords
#0: bugfix in ttnn resnet caught by nightly
#0: fix tt_bisect build bug
Watcher Asserts
#6183 : add unit test for sd matmul ops
#6254 : Make program cache per device
#5394 : Add functional version of Mamba architecture
#6257 : Add temporary convenience script for 800MHz / new eth reset dependent CI
#5661 : Enable gtests for fast dispatch + R chip
Alex/metal/bmm large block untilize out
#5389 : made tensor attributes public and use ttnn::Shape instead of tt::tt_metal::Shape for storing shape
Revert "#6183 : add unit test for sd matmul ops"
#4003 : print all of the L1 buffers using ttnn.print_l1_buffer_state
#4003 : print all of the L1 buffers using ttnn.print_l1_buffers
#4438 : Implement sharded multi-core fold op for Resnet50
#6149 : disabled the check for comparing generated report with GOLDEN_L1_BUFFER_REPORT because on pipelines it looks different than when running locally
FD2.0 fixes+mcast support for write and packed_write
Shwetank tt/config
#0: Change order of device and use_program_cache fixture in remaining pytests
Softplus with beta and threshold param
Build tests during artifact creation
#6149 : disabled test_print_l1_buffers_of_add_operation
#4003 : updated ttnn.to_torch to work with bfloat8_b tensors that are not multiple of tile size without tile padding
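A round-trip sketch for a `bfloat8_b` tensor whose inner dimensions are not multiples of the 32x32 tile (shapes are illustrative):

```python
import torch
import ttnn

device = ttnn.open_device(device_id=0)

t = torch.randn(1, 1, 30, 62)  # not tile-aligned
tt = ttnn.from_torch(t, dtype=ttnn.bfloat8_b, layout=ttnn.TILE_LAYOUT, device=device)

# With this change, to_torch strips the implicit tile padding and returns
# a tensor with the original logical shape.
back = ttnn.to_torch(tt)
assert back.shape == t.shape

ttnn.close_device(device)
```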
#0: add to/from L1 reshard test
#0: Add back deleted shape assertions for interleaved concat
test errors flagged by watcher
#0: fix incremental build
Merge xuncai/llama-attention-galaxy to main: First version of llama-attention galaxy on emulated chips
#6329 : Fixing a bug causing mismatch on indices
#6321 : Test which sweeps read/write buffer and just checks that the e…
Support moreh_getitem forward
#6125 : Update in0_block_w to be full shard width for sharded 2D systolic matmul
#6107 : Add softsign, sign, unary ceil backward support
#6226 : Add backward support for div
#6234 : Add backward support for rdiv
#6236 : Add backward support for fmod and remainder
#4003 : added positional embeddings to bert and updated ttnn_sharded_optimized_bert to run with batch size of 12
Indexed Fill
#5589 : remove dtype in gen function sweep tests where needed
#6347 : Print built-in defines once only
#0: Add Mo as code owner on profiler code
#0: Simplify tt_lib.scripts package by adding a specific tt_eager/scripts directory and putting the production scripts in there, whereas development scripts will stay in /scripts
#0: Fixture reorder changes reverted for falcon_7b perf test
#5424 : remove metal_ckernel_sfpu
#0: Update remaining tt_lib.program_cache calls to use device APIs
#6183 : add unit test for sd matmul ops
#6289 : fix dispatcher page calculation
#5924 : Enable unet on wormhole_b0 changes
#6325 : skip test_multi_device.py for grayskull arch
Alex/metal/pack untilize no repack
#6144 : Not hanging on GS or WH with or without Watcher
Agrebenisan/swq hwq cardinality cleanup
#6146 : Add backward support for conj
#0: bug fix UTWH div_up instead of div trunc for calculating CB sizes
Fix To/From Sharded Bug
#6206 : Fix resharding page mapping
#5733 : ttnn/cpp: run_operation for multi-device
#5589 : TTNN - l1 loss sweep and unit tests
Add Support to Allow Input Batch Offset for Update Cache when Users < 32
Npetrovic/ttnn bin ops
Use/dprint configuration registers
#5629 : Don't create new threads during CompileProgram, use tf to manage threadpool instead
Revert "Npetrovic/ttnn bin ops"
#6385 : Update ttnn.create_sharded_memory_config to correctly determine shard shape for height/width sharding
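A short sketch of the affected API, height-sharding a tensor across an 8x8 core grid (shape and grid are illustrative assumptions):

```python
import ttnn

# With this fix, the shard shape is derived correctly from the tensor shape,
# the core grid, and the chosen ShardStrategy.
memory_config = ttnn.create_sharded_memory_config(
    shape=(1, 1, 1024, 256),
    core_grid=ttnn.CoreGrid(y=8, x=8),
    strategy=ttnn.ShardStrategy.HEIGHT,
)
```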
TestPrintEthCores fix
#6266 : Refactored Llama 2 MLP & attention
Bteng/fdworkflow cleanup
Initial perf model for WH
#6363 : Fix so remote does not try direct write to completion queue
Add support for BFP4_b format
#6378 : Disable failing test for now
fix alignment issue for indexed fill reading in batch_ids
#4003 : added register_pre_operation_hook and register_post_operation_hook
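A usage sketch of the new hooks, assuming they act as context managers and receive the operation plus its arguments (semantics hedged):

```python
import torch
import ttnn

device = ttnn.open_device(device_id=0)

call_count = 0

def count_operations(operation, args, kwargs):
    global call_count
    call_count += 1

a = ttnn.from_torch(torch.randn(32, 32), layout=ttnn.TILE_LAYOUT, device=device)

with ttnn.register_pre_operation_hook(count_operations):
    ttnn.add(a, a)  # the hook fires before each registered operation

print(call_count)
ttnn.close_device(device)
```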
#6349 : Add missing asserts for concat op. Minor improvement to concat kernel setup code
#0: remove printf
add post-commit ttnn and model pipelines
re-direct to same internal yaml from top-level fd, ttnn, or model workflows
Bteng/ttnn model artifact dep
#4003 : remove inner ops from pre and post hooks
#5163 : Support optional output tensors in moreh groupnorm
#6424 : Split TestPrintEthCores into two kernels as workaround.
Support moreh arange row major output
#6284 : Add backward support for imag and real
#5163 : Change are_needed_outputs -> are_required_outputs
#5163 : Update MorehGroupNormBackwardGammaBetaGrad
Ngrujic/ttnn sweeps 1
#0: fix clang build
Update cache op optimizations
#6281 : Skip 2 Non-Deterministic failing Events tests for GS
Asarje/ttnn rn50 wh bfp8
#6453 : Add watcher asserts to perform CB bounds checking
#6313 Llama 2 Galaxy Decoder implementation
#5733 : ttnn multi-device cleanup memory management
#6436 : fix ttnn.to_layout() to correctly return RuntimeError
#4957 : split ttnn tests into 2 groups
#4957 : 3-way ttnn test split
#6410 : Encapsulate tensor attributes inside a shared_ptr
#5589 : TTNN mse loss sweeps
#6363 : observe max tensix slots in bidir tunneller
#6075 : add reshard support to the halo op
updates to bring post-commit pipeline time to < 30 minutes
#6123 : Add support for backward mvlgamma
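For reference, the gradient being implemented here, as a plain PyTorch sketch (not the device kernel): d/dx mvlgamma(x, p) = sum over i in [0, p) of digamma(x - i/2).

```python
import torch

def mvlgamma_backward_reference(grad_output, x, p):
    # Derivative of the multivariate log-gamma function of order p.
    grad = sum(torch.digamma(x - i / 2.0) for i in range(p))
    return grad_output * grad
```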
#6390 : L1 loss pcc issue
#6040 : enable bidirectional support for all-gather
#6496 : No longer gate upload release step on the frequent pipelines passing, and just let them run for convenience
TTNN sweeps: binary ops and fixes
#0: Tag name for eager - Package workflow, which is the impl of the main version, with appropriate qualifiers to not confuse ppl
fix for WH
#6414 : Ensure we run single and multicore/multi device sfpu tests. Lo…
FD2.0 CQ_DISPATCH_CMD_WRITE_PAGED initial implementation and tests
#6510 : Support to have enqueue write-only and read-only tests
integrate fd multiqueue post commit into post commit
#6513 : move multi-device files under tt-metal/impl/device
#0: ttnn-falcon: add packer_l1_acc to MLP module
Add new frequent pipeline for multi nebula CI
Non-zero indices op
Add native repeat op and RM concat
Add llama2_70b into multi-nebula frequent ci pipeline
#6493 : update backward softplus with beta and threshold param
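A plain PyTorch reference sketch (not the ttnn kernel) of softplus backward with the beta/threshold semantics of torch.nn.functional.softplus: where beta*x exceeds the threshold the forward is the identity, so the gradient is 1; elsewhere it is sigmoid(beta*x).

```python
import torch

def softplus_backward_reference(grad_output, x, beta=1.0, threshold=20.0):
    passthrough = (beta * x) > threshold
    grad = torch.sigmoid(beta * x)
    return grad_output * torch.where(passthrough, torch.ones_like(grad), grad)
```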
Jrock/falcon op tests
Jrock/falcon40b utility test update
Ngrujic/debug yaml based sweep tests
#6241 : Prefill on 8 chips
#6503 : Llama 2 Refactor All Test files, allow repro on any device
#5480 : Fix memory address hack in FD2 test
#5592 : Interleaved2ShardedPartialOp, Sharded2InterleavedPartialOp, Matmul1d height sharding + padding fixes
#0: Modify Bert Large Perf test to delete intermediates at the end of each iteration
Alex/metal/max pool dm perf
#6524 : clean up the to/from_device_mesh functions
#5075 : Watcher pause feature initial implementation
#6562 : Fix ttnn falcon7b by using arch-specific ComputeKernelConfig
#6374 : Fix to ensure that we never get an odd number of pages in our …
Aliu/erisc launch msg
#0: Remove temporary frequent pipeline api tests as that was meant to be a temporary stop gap for people wanting to add T3K tests until we got real CI for it
#0: Delete llama_old models and their tests because we have no need for them anymore in light of WH-only T3K llama
#4584 : Demo file for functional whisper
Ngrujic/ttnn sweeps
Silu op for Sharded layout
moreh getitem supports tilized input row major index
#6568 : Add lm-evaluation-harness support for Mamba reference model
Barsic/ttnn ops 3
Alex/metal/max pool remove init
#0: Fix Falcon40B tests for CI
FD2 test fixes
#6450 : compile fix for main
#6377 : Split perf models pipeline by arch and model collection type, as we need very specific ownership of models for Javelin
#6577 : Use CreateSemaphore api rather than hardcoded addresses in leg…
#5733 : fix multi-device to_host call
#6472 : reduce outstanding issue cmds
#5917 : Add test coverage for watcher kernel_id reporting
Unet Concat Optimization
#0: Properly declare the ttnn pybind dependency files for Make, as the previous one was trying to find them in the src directories, when they were really in the build
Fast Dispatch on Idle Ethernet Core
reduce timeout for post-commit pipelines to 45 minutes
#6462 : Upsample kernel opt
#3766 : Various fixes for Ubuntu 22.04 / Python 3.10