v0.33.0
Metal
Wormhole
- Basic bringup and tests running on WH B0
- Harvesting functionality working on WH B0
- Basic fast dispatch functionality working on WH B0
Host API changes
- void StartDebugPrintServer(Device *device, const std::vector<CoreCoord> & cores) no longer callable
- Device *CreateDevice no longer requires arch parameter
- New wrapper around Buffer API so that users don't need to look inside buffer.hpp to figure out how to construct a buffer object: Buffer CreateBuffer(Device *device, std::uint64_t size, std::uint64_t page_size, const BufferType buffer_type)
- LaunchKernels renamed to LaunchProgram(Device *device, Program &program) to match EnqueueProgram and removed obsolete stagger_start parameter
- void WriteRuntimeArgsToDevice(Device *device, const Program &program) moved to detail namespace
- bool CompileProgram(Device *device, Program &program) moved to detail namespace
- bool ConfigureDeviceWithProgram(Device *device, const Program &program) moved to detail namespace
- bool InitializeDevice(Device *device) removed
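
As a rough illustration of how a call site might look after these changes, here is a minimal self-contained sketch. It does not include any real tt_metal header: the namespace tt_metal_sketch, the empty Device and Program structs, the Buffer fields, the BufferType enumerators, and the call_log helper are all hypothetical stand-ins. Only the function names and parameter lists are taken from the list above. The notes do not say whether the detail functions still need to be called explicitly before LaunchProgram, so the order shown is an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Stand-in declarations so the sketch compiles on its own. Only the function
// names and parameter lists come from the release notes; every type and
// function body below is a placeholder, not the real tt_metal code.
namespace tt_metal_sketch {

struct Device {};
struct Program {};
enum class BufferType { DRAM, L1 };  // assumed enumerators
struct Buffer {
    std::uint64_t size;
    std::uint64_t page_size;
    BufferType type;
};

// Hypothetical helper: records which entry points the call site used.
std::vector<std::string> call_log;

// New wrapper: construct a buffer without reading buffer.hpp.
Buffer CreateBuffer(Device *device, std::uint64_t size, std::uint64_t page_size,
                    const BufferType buffer_type) {
    (void)device;
    call_log.push_back("CreateBuffer");
    return Buffer{size, page_size, buffer_type};
}

// Formerly LaunchKernels; the stagger_start parameter is gone.
void LaunchProgram(Device *device, Program &program) {
    (void)device;
    (void)program;
    call_log.push_back("LaunchProgram");
}

// These three now live in the detail namespace.
namespace detail {
bool CompileProgram(Device *device, Program &program) {
    (void)device;
    (void)program;
    call_log.push_back("detail::CompileProgram");
    return true;
}
bool ConfigureDeviceWithProgram(Device *device, const Program &program) {
    (void)device;
    (void)program;
    call_log.push_back("detail::ConfigureDeviceWithProgram");
    return true;
}
void WriteRuntimeArgsToDevice(Device *device, const Program &program) {
    (void)device;
    (void)program;
    call_log.push_back("detail::WriteRuntimeArgsToDevice");
}
}  // namespace detail

}  // namespace tt_metal_sketch

// A migrated call site. The call order is an assumption for illustration.
std::vector<std::string> migrated_call_site() {
    using namespace tt_metal_sketch;
    call_log.clear();
    Device device;
    Program program;
    Buffer buf = CreateBuffer(&device, 4096, 1024, BufferType::DRAM);
    (void)buf;
    detail::CompileProgram(&device, program);
    detail::ConfigureDeviceWithProgram(&device, program);
    detail::WriteRuntimeArgsToDevice(&device, program);
    LaunchProgram(&device, program);  // was: LaunchKernels(...)
    return call_log;
}
```

In existing code the edits would be mechanical: qualify the three moved functions with detail::, rename LaunchKernels to LaunchProgram and drop its stagger_start argument, and remove calls to InitializeDevice.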
Profiler
- Bug fix on device side to support new FW init process in fast and slow dispatch.
- RISC FW cleanup to avoid unnecessary function wrappers.
Watcher
- Add more waypoints to watcher and add access methods to the SoC descriptor for, e.g., harvesting
- Add some noc sanitization and checks
- Some bug fixes: don't read registers during kernel run, don't include wh headers on gs, allow 0 length transactions
Feature: Runtime Compute Args
- Arguments can be sent to Compute Kernels at runtime in the same way as DataMovement Kernels.
- The kernel uses the same get_arg_val<type>(<index>) API to retrieve them.
- The host uses the same tt_metal::SetRuntimeArgs(<program>, <compute_kernel_id>, <Core, CoreRange>, <vector of u32 runtime args>); call that is used for DataMovement Kernels.
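
The host-to-kernel flow described above can be modeled with a small self-contained sketch. This is a toy model, not the tt_metal implementation: the Program, CoreCoord, KernelID, and KernelContext types are hypothetical stand-ins, and get_arg_val here takes an explicit context argument because the sketch has no real device memory to read from (the real kernel-side call takes only the index). The restriction to 32-bit argument types is an assumption based on the "vector of u32 runtime args" wording.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <map>
#include <tuple>
#include <utility>
#include <vector>

// Hypothetical stand-ins; NOT the real tt_metal types.
struct CoreCoord {
    std::uint32_t x;
    std::uint32_t y;
    bool operator<(const CoreCoord &o) const {
        return std::tie(x, y) < std::tie(o.x, o.y);
    }
};
using KernelID = std::uint32_t;

struct Program {
    // Runtime args stored per (kernel, core) pair.
    std::map<std::pair<KernelID, CoreCoord>, std::vector<std::uint32_t>> runtime_args;
};

// Host side: same call shape for compute and data movement kernels.
void SetRuntimeArgs(Program &program, KernelID kernel, const CoreCoord &core,
                    const std::vector<std::uint32_t> &args) {
    program.runtime_args[std::make_pair(kernel, core)] = args;
}

// Kernel side (modeled): a pointer to where this kernel's args were placed.
struct KernelContext {
    const std::uint32_t *arg_base;
};

// Read the index-th 32-bit slot and reinterpret it as T.
template <typename T>
T get_arg_val(const KernelContext &ctx, int index) {
    static_assert(sizeof(T) == sizeof(std::uint32_t),
                  "this sketch assumes 32-bit runtime args");
    T value;
    std::memcpy(&value, ctx.arg_base + index, sizeof(T));
    return value;
}

// A pretend compute kernel that receives two runtime args from the host.
std::uint32_t compute_kernel_demo() {
    Program program;
    const KernelID compute_kernel = 7;  // hypothetical kernel id
    const CoreCoord core{0, 0};

    // Host: send {num_blocks, tiles_per_block} to the compute kernel.
    SetRuntimeArgs(program, compute_kernel, core, {8, 4});

    // Kernel: retrieve the args by index, as a data movement kernel would.
    const std::vector<std::uint32_t> &args =
        program.runtime_args.at(std::make_pair(compute_kernel, core));
    KernelContext ctx{args.data()};
    const std::uint32_t num_blocks = get_arg_val<std::uint32_t>(ctx, 0);
    const std::uint32_t tiles_per_block = get_arg_val<std::uint32_t>(ctx, 1);
    return num_blocks * tiles_per_block;
}
```

The point of the model is that values such as loop bounds can now reach a compute kernel through runtime args indexed by position, the same path data movement kernels already used.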
Eager (Ops)
- Added support for overriding runtime args and circular buffers
- Added support for saving and loading tensors
- Added support for uint32 tensor
Models
- 5+% increase in BERT Large performance on bare metal machines.
- 15+% increase in LLaMA 7B performance on bare metal machines.