Releases: tenstorrent/tt-metal

v0.34.0

13 Oct 15:22

Metal

API Changes

  • CreateDevice: device_id type has changed from int to chip_id_t
  • CreateCircularBuffer: The three previous variants, which differed only in taking a CoreCoord, CoreRange, or CoreRangeSet parameter, have been consolidated into a single user-facing CreateCircularBuffer function parameterized with std::variant<CoreCoord,CoreRange,CoreRangeSet>. It now accepts a CircularBufferConfig, which specifies size, data format, and page size per buffer index. The return type has changed from a CircularBuffer object to CircularBufferID (uintptr_t).
  • GetCircularBufferConfig: New function that retrieves a reference to the configuration of a CircularBuffer, which allows the configuration to be updated. Updates take effect on the next call to LaunchProgram.
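
The consolidation described above can be illustrated with a minimal, self-contained sketch. Everything in it is a hypothetical stand-in written for this note: the struct layouts, the Program type, and the functions with a "Sketch" suffix are assumptions, not the tt-metal definitions. It only shows how one function taking a std::variant can replace three overloads, and how handing back a reference to the stored config lets a caller update it in place.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <type_traits>
#include <variant>
#include <vector>

// Hypothetical stand-ins for illustration only; not the tt-metal types.
struct CoreCoord { std::size_t x, y; };
struct CoreRange { CoreCoord start, end; };              // inclusive rectangle
struct CoreRangeSet { std::vector<CoreRange> ranges; };
struct CircularBufferConfig { std::uint32_t total_size, page_size; };
using CircularBufferID = std::uintptr_t;

using CoreSpec = std::variant<CoreCoord, CoreRange, CoreRangeSet>;

// Normalise whichever alternative was passed into a CoreRangeSet.
CoreRangeSet to_range_set(const CoreSpec& spec) {
    return std::visit([](const auto& s) -> CoreRangeSet {
        using T = std::decay_t<decltype(s)>;
        if constexpr (std::is_same_v<T, CoreCoord>) {
            return CoreRangeSet{{CoreRange{s, s}}};      // single core
        } else if constexpr (std::is_same_v<T, CoreRange>) {
            return CoreRangeSet{{s}};                    // one rectangle
        } else {
            return s;                                    // already a set
        }
    }, spec);
}

// Count the cores covered by a spec (ranges assumed non-overlapping).
std::size_t count_cores(const CoreSpec& spec) {
    std::size_t n = 0;
    for (const CoreRange& r : to_range_set(spec).ranges) {
        assert(r.end.x >= r.start.x && r.end.y >= r.start.y);
        n += (r.end.x - r.start.x + 1) * (r.end.y - r.start.y + 1);
    }
    return n;
}

// Hypothetical program object that owns the circular buffer configs.
struct Program {
    std::map<CircularBufferID, CircularBufferConfig> configs;
    std::map<CircularBufferID, CoreRangeSet> cores;
    CircularBufferID next_id = 1;
};

// One entry point instead of three overloads; returns an ID, not an object.
CircularBufferID CreateCircularBufferSketch(Program& p, const CoreSpec& spec,
                                            const CircularBufferConfig& cfg) {
    const CircularBufferID id = p.next_id++;
    p.configs.emplace(id, cfg);
    p.cores.emplace(id, to_range_set(spec));
    return id;
}

// Returns a reference, so edits are visible to whatever reads the config next.
CircularBufferConfig& GetCircularBufferConfigSketch(Program& p, CircularBufferID id) {
    return p.configs.at(id);
}
```

In this sketch a caller can pass a CoreCoord, a CoreRange, or a CoreRangeSet to the same function, and a later write through GetCircularBufferConfigSketch changes the stored config without recreating the buffer.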

Tools - Profiler

Tracy Python Support: Profile Python-side code with Tracy. As with cProfile, the standard Python profiler module, all Python function calls are picked up by Tracy. TT's bound C++ calls are also picked up automatically. The entire Python script, or just selected parts of it, can be profiled at either function or line level.

Extra features

Runtime Compute Args: Arguments can be sent to Compute Kernels at runtime. The kernel uses the same get_arg_val<type>(<index>) API to retrieve them. The host uses the same tt_metal::SetRuntimeArgs(<program>, <compute_kernel_id>, <Core, CoreRange>, <vector of u32 runtime args>) call as for DataMovement Kernels.
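
To make the indexing model concrete, here is a host-only mock of reading typed values out of a vector of u32 runtime args. It is not the device implementation: MockArgs, mock_get_arg_val, and pack_float are hypothetical names invented for this sketch, and the assumption that every argument occupies exactly one 32-bit word is ours.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>
#include <vector>

// Hypothetical mock: the runtime args for one core, as the host would pack them.
struct MockArgs {
    std::vector<std::uint32_t> words;
};

// Mock of a get_arg_val<T>(index)-style accessor: reinterpret one u32 slot as T.
template <typename T>
T mock_get_arg_val(const MockArgs& args, std::size_t index) {
    static_assert(sizeof(T) == sizeof(std::uint32_t),
                  "this sketch assumes one 32-bit word per argument");
    static_assert(std::is_trivially_copyable_v<T>, "T must be bit-copyable");
    assert(index < args.words.size());
    T out;
    std::memcpy(&out, &args.words[index], sizeof(T));  // avoids aliasing issues
    return out;
}

// Host-side helper: store a float's bit pattern in a u32 slot.
std::uint32_t pack_float(float f) {
    std::uint32_t w;
    std::memcpy(&w, &f, sizeof(w));
    return w;
}
```

With this mock, a host would build the vector once (for example a tile count in slot 0 and a packed scale factor in slot 1), and the kernel side would read each slot back by index with the type it expects.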

Eager (Ops)

Notes not yet available.

Models

  • metal_BERT_large_15: The model implementation now uses the tt-DNN embedding operation, which executes on the GS device. Previously this model used the PyTorch embedding operation executing on the CPU.
  • Falcon7b: Added an end-to-end demo that runs on the GS device. The demo takes a text prompt and returns text generated by the model to complete the prompt. It works by pre-filling the cache with decoded input prompts and then running decode for all users in parallel.

v0.33.0

06 Oct 02:29

Metal

Wormhole

  • Basic bringup and tests running on WH B0
  • Harvesting functionality working on WH B0
  • Basic fast dispatch functionality working on WH B0

Host API changes

  • void StartDebugPrintServer(Device *device, const std::vector<CoreCoord> & cores) no longer callable
  • Device *CreateDevice no longer requires arch parameter
  • New wrapper around Buffer API so that users don't need to look inside buffer.hpp to figure out how to construct a buffer object: Buffer CreateBuffer(Device *device, std::uint64_t size, std::uint64_t page_size, const BufferType buffer_type)
  • LaunchKernels renamed to LaunchProgram(Device *device, Program &program) to match EnqueueProgram; the obsolete stagger_start parameter has been removed
  • void WriteRuntimeArgsToDevice(Device *device, const Program &program) moved to detail namespace
  • bool CompileProgram(Device *device, Program &program) moved to detail namespace
  • bool ConfigureDeviceWithProgram(Device *device, const Program &program) moved to detail namespace
  • bool InitializeDevice(Device *device) removed

Profiler

  • Bug fix on device side to support new FW init process in fast and slow dispatch.
  • RISC FW cleanup to avoid unnecessary function wrappers.

Watcher

  • Add more waypoints to watcher and add access methods to the SoC descriptor for, e.g., harvesting
  • Add some noc sanitization and checks
  • Some bug fixes: don't read registers during kernel run, don't include WH headers on GS, allow zero-length transactions

Feature: Runtime Compute Args

  • Arguments can be sent to Compute Kernels at runtime in the same way as DataMovement Kernels.
  • The kernel uses the same get_arg_val<type>(<index>) API to retrieve them.
  • The host uses the same tt_metal::SetRuntimeArgs(<program>, <compute_kernel_id>, <Core, CoreRange>, <vector of u32 runtime args>); call as for DataMovement Kernels.

Eager (Ops)

  • Added support for overriding runtime args and circular buffers
  • Added support for saving and loading tensors
  • Added support for uint32 tensor

Models

  • 5+% increase in BERT Large performance on bare metal machines.
  • 15+% increase in LLaMA 7B performance on bare metal machines.