v0.45.0

github-actions released this on 22 Mar 18:03

🚀 Features

  • #6204: Added support for num_users < 32 in the update cache op.
  • #6247: Llama2 Galaxy MLP implementation

📦 Uncategorized

  • #4736: Add support for moreh_norm op
  • Fix moreh_layernorm rstd
  • #5508: Change test_moreh_layernorm.py for debugging
  • #4686: add infra for sharing global struct among ops
  • #5592: Fix pcc on Falcon 7b prefill by turning on l1 packer on MLP 4h-to-h matmul
  • Fix layernorm beta data format reconfig
  • Add linked support for in0 in1 mcast in matmul
  • #4957: optimizing construct_2d_padded_tensor_list
  • #4003: added ttnn.as_tensor and enabled support for caching torch tensor
  • Revert "#0: Fix for fail in asinh backward"
  • #5829: Use moreh_common.hpp for data movement kernels across moreh OPs
  • Barsic/ttnn ops
  • #6030: Update resnet performance metrics
  • #5876: pytest & c++ test logging cleanup
  • #0: Use both 2x2 and 2x4 machines on every scheduled run
  • Add single core matmul benchmark
  • #6079: Update FORCE_INLINE to be nop when watcher is enabled
  • #5980: Fix a hard-coded bounds check in dprint
  • #5389: merged ttl and ttnn tensor classes into one
  • Initial Performance Model
  • fix ci
  • TTNN RN50 :: on the road to match perf with TTLIB version
  • #4438: Optimized single-core fold op
  • #5589: Add repeat-interleave and addcmul sweeps
  • #6055: Add square backward support
  • #6057: Add backward support for lgamma
  • #6056: Add backward support for frac and trunc
  • #6066: Add support for backward log sigmoid
  • #6002: Add backward support for binary maximum
  • Ngrujic/improve conversion to bfloat8b in sweeps
  • #5829: Use moreh_common.hpp for compute kernels across moreh OPs
  • #0: Remove post-commit label from multi device pipeline because it's not actually post commit
  • Add pack l1 acc to resnet conv
  • #6144: Skip 512x512 cross attn 2d upblock for now in nightly because it hangs
  • #6061: Add tanhshrink, threshold, Unary EQ backward ops support
  • Width Sharded Concat for Unet
  • #5184: uncommenting various moreh test cases.
  • Fix compute kernel config arg for resnet50
  • Nsmith/untilize unit test
  • Revert "Revert "#5389: merged ttl and tensor classes into one""
  • #4438: Do not use the new fold op in Resnet tests
  • Remove corerangeset that does not work on wormhole
  • #6129: Expose kernel config attrs and use 4 dst tiles for fp32 configs
  • #5391: Add device perf
  • #0: Use multiplier for wormhole b0 mulsi3
  • #4003: removed ttnn.Tensor autoclass from tensor.rst
  • TTNN MultiDevice Support
  • build artifacts
  • #4947: Add noc alignment checks to watcher
  • Add ttnn multi-chip unit test for checking device shards
  • Nsmith/fix unet
  • #6043: Random program stress test of command queues
  • Logit and logiteps backward support
  • Backward support for log2
  • Add missing ttnn tests and disable broken tests until issues are fixed
  • Fix Events feature for FD1.3 (out-of-order event ids, events feature missing) #6093
  • #5873: make top-level post commit workflow re-useable
  • #5589: add groupnorm for ttnn sweeps
  • Ngrujic/ttnn sweeps 4
  • Add ethernet datamover (EDM) - a foundational ethernet transfer engine
  • #6116: Add backward support for softshrink
  • #0: Add verbose make logs to artifact and make nicer name on metal
  • #0: Only use 2x4 setup for multi-card WH CI as 2x2 does not provide us good feedback
  • #4809 dprint tensix regs
  • #4003: fixed bloom perf test
  • #6187: Conv bugfix
  • #0: concat RM: support variable stick widths across inputs
  • TTNN RN50 on WHB0
  • #6084: Lower thresholds slightly after using proper configs for device resnet
  • Fast dispatch 2.0 proof of concept
  • #6218: add pytest for matmul 1d 2d
  • #6177: use is_tensor_storage_on_device so it works for MultiDeviceStorage
  • #6082: support workers + eth cores in one program
  • #6215: Rename TensorToMeshMapper/MeshToTensorComposer
  • #6164: Update test_noc_unicast_vs_multicast_to_single_core_latency to not use same cores for producer and consumer on WH
  • #6117: Add backward support for softplus
  • #6223: remove redundant call to context switch
  • Integrate EDM with all-gather.
  • #6136: Add backward support for unary LE and GE
  • #5398: fix unicast binaries
  • Barsic/ttnn ops 2
  • #5380: Add wormhole_b0 model perf tests, only falcon7b in ttlib for now
  • #5372: Updated README.md file for demo
  • #4003: updated ttnn.concat to have a registered fallback
  • Llama2 functional bringup
  • #5589: Add working BFLOAT8_B sweeps to working folder
  • FD2.0 rename HostQ->PrefetchQ, add multi-core capability, fix NOC coords
  • #0: bugfix in ttnn resnet caught by nightly
  • #0: fix tt_bisect build bug
  • Watcher Asserts
  • #6183: add unit test for sd matmul ops
  • #6254: Make program cache per device
  • #5394: Add functional version of Mamba architecture
  • #6257: Add temporary convenience script for 800MHz / new eth reset dependent CI
  • #5661: Enable gtests for fast dispatch + R chip
  • Alex/metal/bmm large block untilize out
  • #5389: made tensor attributes public and use ttnn::Shape instead of tt::tt_metal::Shape for storing shape
  • Revert "#6183: add unit test for sd matmul ops"
  • #4003: print all of the L1 buffers using ttnn.print_l1_buffer_state
  • #4003: print all of the L1 buffers using ttnn.print_l1_buffers
  • #4438: Implement sharded multi-core fold op for Resnet50
  • #6149: disabled the check comparing the generated report with GOLDEN_L1_BUFFER_REPORT because on pipelines it looks different than when running locally
  • FD2.0 fixes+mcast support for write and packed_write
  • Shwetank tt/config
  • #0: Change order of device and use_program_cache fixture in remaining pytests
  • Softplus with beta and threshold param
  • Build tests during artifact creation
  • #6149: disabled test_print_l1_buffers_of_add_operation
  • #4003: updated ttnn.to_torch to work with bfloat8_b tensors that are not multiple of tile size without tile padding
  • #0: add to/from L1 reshard test
  • #0: Add back deleted shape assertions for interleaved concat
  • test errors flagged by watcher
  • #0: fix incremental build
  • Merge xuncai/llama-attention-galaxy to main: First version of llama-attention galaxy on emulated chips
  • #6329: Fixing a bug causing mismatch on indices
  • #6321: Test which sweeps read/write buffer and just checks that the e…
  • Support moreh_getitem forward
  • #6125: Update in0_block_w to be full shard width for sharded 2D systolic matmul
  • #6107: Add softsign, sign, unary ceil backward support
  • #6226: Add backward support for div
  • #6234: Add backward support for rdiv
  • #6236: Add backward support for fmod and remainder
  • #4003: added positional embeddings to bert and updated ttnn_sharded_optimized_bert to run with batch size of 12
  • Indexed Fill
  • #5589: remove dtype in gen function sweep tests where needed
  • #6347: Print built-in defines once only
  • #0: Add Mo as code owner on profiler code
  • #0: Simplify tt_lib.scripts package by adding a specific tt_eager/scripts directory and putting the production scripts in there, whereas development scripts will stay in /scripts
  • #0: Fixture reorder changes reverted for falcon_7b perf test
  • #5424: remove metal_ckernel_sfpu
  • #0: Update remaining tt_lib.program_cache calls to use device APIs
  • #6183: add unit test for sd matmul ops
  • #6289: fix dispatcher page calculation
  • #5924: Enable unet on wormhole_b0 changes
  • #6325: skip test_multi_device.py for grayskull arch
  • Alex/metal/pack untilize no repack
  • #6144: Not hanging on GS or WH with or without Watcher
  • Agrebenisan/swq hwq cardinality cleanup
  • #6146: Add backward support for conj
  • #0: Bug fix in UTWH: use div_up instead of truncating div for calculating CB sizes
  • Fix To/From Sharded Bug
  • #6206: Fix resharding page mapping
  • #5733: ttnn/cpp: run_operation for multi-device
  • #5589: TTNN - l1 loss sweep and unit tests
  • Add Support to Allow Input Batch Offset for Update Cache when Users < 32
  • Npetrovic/ttnn bin ops
  • Use/dprint configuration registers
  • #5629: Don't create new threads during CompileProgram, use tf to manage threadpool instead
  • Revert "Npetrovic/ttnn bin ops"
  • #6385: Update ttnn.create_sharded_memory_config to correctly determine shard shape for height/width sharding
  • TestPrintEthCores fix
  • #6266: Refactored Llama 2 MLP & attention
  • Bteng/fdworkflow cleanup
  • Initial perf model for WH
  • #6363: Fix so remote does not try direct write to completion queue
  • Add support for BFP4_b format
  • #6378: Disable failing test for now
  • fix alignment issue for indexed fill reading in batch_ids
  • #4003: added register_pre_operation_hook and register_post_operation_hook
  • #6349: Add missing asserts for concat op. Minor improvement to concat kernel setup code
  • #0: remove printf
  • add post-commit ttnn and model pipelines
  • re-direct to same internal yaml from top-level fd, ttnn, or model workflows
  • Bteng/ttnn model artifact dep
  • #4003: remove inner ops from pre and post hooks
  • #5163: Support optional output tensors in moreh groupnorm
  • #6424: Split TestPrintEthCores into two kernels as workaround.
  • Support moreh arange row major output
  • #6284: Add backward support for imag and real
  • #5163: Change are_needed_outputs -> are_required_outputs
  • #5163: Update MorehGroupNormBackwardGammaBetaGrad
  • Ngrujic/ttnn sweeps 1
  • #0: fix clang build
  • Update cache op optimizations
  • #6281: Skip 2 Non-Deterministic failing Events tests for GS
  • Asarje/ttnn rn50 wh bfp8
  • #6453: Add watcher asserts to perform CB bounds checking
  • #6313 Llama 2 Galaxy Decoder implementation
  • #5733: ttnn multi-device cleanup memory management
  • #6436: fix ttnn.to_layout() to correctly raise RuntimeError
  • #4957: split ttnn tests into 2 groups
  • #4957: 3-way ttnn test split
  • #6410: Encapsulate tensor attributes inside a shared_ptr
  • #5589: TTNN mse loss sweeps
  • #6363: observe max tensix slots in bidir tunneller
  • #6075: add reshard support to the halo op
  • updates to bring post-commit pipeline time to < 30 minutes
  • #6123: Add support for backward mvlgamma
  • #6390: L1 loss PCC issue
  • #6040: enable bidirectional support for all-gather
  • #6496: No longer gate upload release step on the frequent pipelines passing, and just let them run for convenience
  • TTNN sweeps: binary ops and fixes
  • #0: Tag name for eager - Package workflow, which is the implementation of the main version, with appropriate qualifiers to avoid confusion
  • fix for WH
  • #6414: Ensure we run single and multicore/multi device sfpu tests. Lo…
  • FD2.0 CQ_DISPATCH_CMD_WRITE_PAGED initial implementation and tests
  • #6510: Support to have enqueue write-only and read-only tests
  • integrate fd multiqueue post commit into post commit
  • #6513: move multi-device files under tt-metal/impl/device
  • #0: ttnn-falcon: add packer_l1_acc to MLP module
  • Add new frequent pipeline for multi nebula CI
  • Non-zero indices op
  • Add native repeat op and RM concat
  • Add llama2_70b into multi-nebula frequent ci pipeline
  • #6493: update backward softplus with beta and threshold param
  • Jrock/falcon op tests
  • Jrock/falcon40b utility test update
  • Ngrujic/debug yaml based sweep tests
  • #6241: Prefill on 8 chips
  • #6503: Llama 2 Refactor All Test files, allow repro on any device
  • #5480: Fix memory address hack in FD2 test
  • #5592: Interleaved2ShardedPartialOp, Sharded2InterleavedPartialOp, Matmul1d height sharding + padding fixes
  • #0: Modify Bert Large Perf test to delete intermediates at the end of each iteration
  • Alex/metal/max pool dm perf
  • #6524: clean up the to/from_device_mesh functions
  • #5075: Watcher pause feature initial implementation
  • #6562: Fix ttnn falcon7b by using arch-specific ComputeKernelConfig
  • #6374: Fix to ensure that we never get an odd number of pages in our …
  • Aliu/erisc launch msg
  • #0: Remove temporary frequent pipeline api tests as that was meant to be a temporary stop gap for people wanting to add T3K tests until we got real CI for it
  • #0: Delete llama_old models and their tests because we have no need for them anymore in light of WH-only T3K llama
  • #4584: Demo file for functional whisper
  • Ngrujic/ttnn sweeps
  • Silu op for Sharded layout
  • moreh getitem supports tilized input with row-major index
  • #6568: Add lm-evaluation-harness support for Mamba reference model
  • Barsic/ttnn ops 3
  • Alex/metal/max pool remove init
  • #0: Fix Falcon40B tests for CI
  • FD2 test fixes
  • #6450: compile fix for main
  • #6377: Split perf models pipeline by arch and model collection type, as we need very specific ownership of models for Javelin
  • #6577: Use CreateSemaphore api rather than hardcoded addresses in leg…
  • #5733: fix multi-device to_host call
  • #6472: reduce outstanding issue cmds
  • #5917: Add test coverage for watcher kernel_id reporting
  • Unet Concat Optimization
  • #0: Properly declare the ttnn pybind dependency files for Make, as the previous one was trying to find them in the src directories, when they were really in the build
  • Fast Dispatch on Idle Ethernet Core
  • reduce timeout for post-commit pipelines to 45 minutes
  • #6462: Upsample kernel opt
  • #3766: Various fixes for Ubuntu 22.04 / Python 3.10