v0.33.0

tt-rkim released this 06 Oct 02:29

Metal

Wormhole

  • Basic bringup and tests running on WH B0
  • Harvesting functionality working on WH B0
  • Basic fast dispatch functionality working on WH B0

Host API changes

  • void StartDebugPrintServer(Device *device, const std::vector<CoreCoord> & cores) no longer callable
  • Device *CreateDevice no longer requires the arch parameter
  • New wrapper around the Buffer API, so users no longer need to read buffer.hpp to work out how to construct a buffer object: Buffer CreateBuffer(Device *device, std::uint64_t size, std::uint64_t page_size, const BufferType buffer_type)
  • LaunchKernels renamed to LaunchProgram(Device *device, Program &program) to match EnqueueProgram; the obsolete stagger_start parameter has been removed
  • void WriteRuntimeArgsToDevice(Device *device, const Program &program) moved to detail namespace
  • bool CompileProgram(Device *device, Program &program) moved to detail namespace
  • bool ConfigureDeviceWithProgram(Device *device, const Program &program) moved to detail namespace
  • bool InitializeDevice(Device *device) removed
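
As a rough illustration of the CreateBuffer wrapper listed above, the sketch below mimics its call shape with stand-in types. Everything inside the sketch namespace (Device, BufferType, Buffer and its accessors) is a hypothetical placeholder written for this example, not the real tt_metal definitions; only the CreateBuffer parameter list is taken from the note above.

```cpp
#include <cassert>
#include <cstdint>

namespace sketch {

// Hypothetical stand-ins; the real types live in tt_metal and differ from these.
struct Device { int id; };
enum class BufferType { DRAM, L1 };

class Buffer {
public:
    Buffer(Device *device, std::uint64_t size, std::uint64_t page_size, BufferType buffer_type)
        : device_(device), size_(size), page_size_(page_size), buffer_type_(buffer_type) {}

    Device *device() const { return device_; }
    std::uint64_t size() const { return size_; }
    std::uint64_t page_size() const { return page_size_; }
    BufferType buffer_type() const { return buffer_type_; }
    // Assumes page_size is non-zero; this sketch does no validation.
    std::uint64_t num_pages() const { return size_ / page_size_; }

private:
    Device *device_;
    std::uint64_t size_;
    std::uint64_t page_size_;
    BufferType buffer_type_;
};

// Same parameter list as in the release note: callers go through one free
// function instead of reading the Buffer constructor directly.
Buffer CreateBuffer(Device *device, std::uint64_t size, std::uint64_t page_size,
                    const BufferType buffer_type) {
    return Buffer(device, size, page_size, buffer_type);
}

}  // namespace sketch
```

With these stand-ins, a caller would write something like sketch::CreateBuffer(&device, 4096, 1024, sketch::BufferType::DRAM) and get back a buffer object describing four pages.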

Profiler

  • Bug fix on device side to support new FW init process in fast and slow dispatch.
  • RISC FW cleanup to avoid unnecessary function wrappers.

Watcher

  • Add more waypoints to the watcher and add access methods to the SoC descriptor for, e.g., harvesting
  • Add some NoC sanitization and checks
  • Some bug fixes: don't read registers during a kernel run, don't include WH headers on GS, allow 0-length transactions

Feature: Runtime Compute Args

  • Arguments can be sent to Compute Kernels at runtime in the same way as to DataMovement Kernels.
  • The kernel uses the same get_arg_val<type>(<index>) API to retrieve them.
  • The host uses the same call as for DataMovement Kernels: tt_metal::SetRuntimeArgs( <program>, <compute_kernel_id>, <Core, CoreRange>, <vector of u32 runtime args>);
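
The host-to-kernel flow described above can be sketched with a small self-contained mock. Nothing in the mock namespace is the real tt_metal API: Program, CoreCoord, KernelContext and run_on_core are hypothetical stand-ins that only imitate the shape of SetRuntimeArgs on the host and get_arg_val<type>(<index>) in the kernel, and the single-core overload is the only one modelled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <map>
#include <tuple>
#include <utility>
#include <vector>

namespace mock {

// Hypothetical stand-ins for illustration only.
struct CoreCoord {
    std::size_t x;
    std::size_t y;
    bool operator<(const CoreCoord &o) const { return std::tie(x, y) < std::tie(o.x, o.y); }
};
using KernelID = std::uint32_t;

// A "program" that just remembers the u32 argument vector per (kernel, core).
struct Program {
    std::map<std::pair<KernelID, CoreCoord>, std::vector<std::uint32_t>> runtime_args;
};

// Host side: same argument order as the call named in the release note.
void SetRuntimeArgs(Program &program, KernelID kernel, const CoreCoord &core,
                    const std::vector<std::uint32_t> &args) {
    program.runtime_args[std::make_pair(kernel, core)] = args;
}

// Kernel side: typed read of the 32-bit argument slot at a given index.
struct KernelContext {
    const std::vector<std::uint32_t> *args;

    template <typename T>
    T get_arg_val(std::size_t index) const {
        static_assert(sizeof(T) == sizeof(std::uint32_t), "each argument slot is 32 bits");
        T value;
        std::memcpy(&value, &args->at(index), sizeof(T));
        return value;
    }
};

// A toy "compute kernel": reads two runtime args and multiplies them.
std::uint32_t compute_kernel(const KernelContext &ctx) {
    const auto num_tiles = ctx.get_arg_val<std::uint32_t>(0);
    const auto scale = ctx.get_arg_val<std::uint32_t>(1);
    return num_tiles * scale;
}

// Simulates launching the kernel on one core with the args stored for it.
std::uint32_t run_on_core(const Program &program, KernelID kernel, const CoreCoord &core) {
    const KernelContext ctx{&program.runtime_args.at(std::make_pair(kernel, core))};
    return compute_kernel(ctx);
}

}  // namespace mock
```

The point of the mock is that the compute kernel reads its arguments by index exactly as a data movement kernel would, so host code that already fills a vector of u32 values per core needs no new mechanism.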

Eager (Ops)

  • Added support for overriding runtime args and circular buffers
  • Added support for saving and loading tensors
  • Added support for uint32 tensor

Models

  • 5%+ increase in BERT Large performance on bare metal machines.
  • 15%+ increase in LLaMA 7B performance on bare metal machines.