mo-tenstorrent changed the title from "Profiler mailbox buffers are getting corrupted in idle_eth kernels" to "Profiler mailbox buffers are getting corrupted in idle_eth dispatch kernels" on Nov 21, 2024
That, together with inconsistent use of the hal and device versions of get_dev_addr<profiler_msg_t *>, was the root cause. Cleaning all of that up and using the device version everywhere fixed the issue.
Essentially, some parts of the profiler code were looking at the active eth's profiler buffer address for an idle eth.
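The mismatch described above can be illustrated with a small sketch. This is not the tt-metal code: the core-type names, the address table, and the addresses themselves are hypothetical and exist only to show how an active eth lookup for an idle eth core lands on the wrong L1 location.

```python
from enum import Enum


class CoreType(Enum):
    ACTIVE_ETH = "active_eth"
    IDLE_ETH = "idle_eth"


# Hypothetical per-core-type profiler mailbox addresses (illustrative values only).
PROFILER_MAILBOX_ADDR = {
    CoreType.ACTIVE_ETH: 0x1000,
    CoreType.IDLE_ETH: 0x2000,
}


def profiler_addr(core_type: CoreType) -> int:
    """Single source of truth for the profiler buffer address of a core type."""
    return PROFILER_MAILBOX_ADDR[core_type]


def uses_correct_addr(core_type: CoreType, addr_used: int) -> bool:
    """Return True if a caller is using the address that matches the core type."""
    return addr_used == profiler_addr(core_type)


if __name__ == "__main__":
    # Buggy pattern: an idle eth core is handed the active eth address.
    wrong = profiler_addr(CoreType.ACTIVE_ETH)
    print(uses_correct_addr(CoreType.IDLE_ETH, wrong))
    # Fixed pattern: every caller goes through the same lookup for the real core type.
    right = profiler_addr(CoreType.IDLE_ETH)
    print(uses_correct_addr(CoreType.IDLE_ETH, right))
```

Routing every caller through one lookup, as the fix does with the device version of the address helper, removes the chance of two code paths disagreeing about where the buffer lives.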
The Models team observed that profiling eth dispatch cores caused segfaults.
The segfaults are caused by reading a bad DRAM buffer index from the profiler L1 control buffer.
It turns out that the profiler control buffer in the mailbox is getting corrupted in cq_prefetch and vc_th_tunneler.
This was verified by turning device-side profiling fully off, initializing the buffer from the host, and reading it back at the end of an eth dispatch run.
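The verification approach above (fill the buffer from the host, run, read back, compare) can be sketched as follows. This is a minimal illustration, not the actual test: the fill pattern, the buffer size, and the helper names are assumptions, and the device write and read steps are left out because the host API used for them is not shown in this issue.

```python
from typing import List, Tuple

# Hypothetical fill pattern; the value actually written from the host is not stated.
SENTINEL = 0xDEADBEEF


def make_pattern(num_words: int, sentinel: int = SENTINEL) -> List[int]:
    """Build the known pattern that the host writes before the run."""
    return [sentinel] * num_words


def find_corrupted_words(
    expected: List[int], actual: List[int]
) -> List[Tuple[int, int, int]]:
    """Return (index, expected, actual) for every word that changed."""
    if len(expected) != len(actual):
        raise ValueError("expected and actual buffers differ in length")
    return [
        (i, e, a) for i, (e, a) in enumerate(zip(expected, actual)) if e != a
    ]


if __name__ == "__main__":
    expected = make_pattern(8)
    # Stand-in for the read back at the end of the run: one word overwritten.
    readback = list(expected)
    readback[3] = 0x1234
    for index, want, got in find_corrupted_words(expected, readback):
        print(f"word {index}: expected {want:#x}, got {got:#x}")
```

With device-side profiling off, nothing should legitimately write to the buffer, so any reported word points to a kernel clobbering the mailbox region.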
In core 2,6 of device 3 we get the following read back:
According to watcher, that core is running cq_prefetch.
However, the buffer should read as follows:
repro steps:
15530_profiler_buffer_corruption
./build_metal.sh -p
export TT_METAL_DEVICE_PROFILER=1
export WH_ARCH_YAML=wormhole_b0_80_arch_eth_dispatch.yaml
export TT_METAL_DEVICE_PROFILER_DISPATCH=1
pytest tests/ttnn/tracy/test_profiler_sync.py::test_all_devices