Profiler mailbox buffers are getting corrupted in idle_eth dispatch kernels #15330

Closed
mo-tenstorrent opened this issue Nov 21, 2024 · 4 comments
Assignees
Labels
bug Something isn't working P1

Comments

@mo-tenstorrent
Contributor

mo-tenstorrent commented Nov 21, 2024

The Models team observed that profiling eth dispatch cores caused segfaults.

The segfaults are caused by reading a bad DRAM buffer index from the profiler L1 control buffer.

Turns out, the profiler control buffer in the mailbox is getting corrupted in cq_prefetch and vc_th_tunneler.

This was verified by turning device-side profiling fully off, initializing the buffer from the host, and reading it back at the end of an eth dispatch run.

On core 2,6 of device 3 we get the following readback:

device id:3, x:2, y:6, i:0, d:160
device id:3, x:2, y:6, i:1, d:0
device id:3, x:2, y:6, i:2, d:0
device id:3, x:2, y:6, i:3, d:0
device id:3, x:2, y:6, i:4, d:0
device id:3, x:2, y:6, i:5, d:0
device id:3, x:2, y:6, i:6, d:0
device id:3, x:2, y:6, i:7, d:0
device id:3, x:2, y:6, i:8, d:4
device id:3, x:2, y:6, i:9, d:16
device id:3, x:2, y:6, i:10, d:32
device id:3, x:2, y:6, i:11, d:32
device id:3, x:2, y:6, i:12, d:65799
device id:3, x:2, y:6, i:13, d:1
device id:3, x:2, y:6, i:14, d:98800
device id:3, x:2, y:6, i:15, d:0
device id:3, x:2, y:6, i:16, d:8
device id:3, x:2, y:6, i:17, d:0
device id:3, x:2, y:6, i:18, d:0
device id:3, x:2, y:6, i:19, d:0
device id:3, x:2, y:6, i:20, d:1001045187
device id:3, x:2, y:6, i:21, d:3171794203
device id:3, x:2, y:6, i:22, d:3137157588
device id:3, x:2, y:6, i:23, d:3169631398
device id:3, x:2, y:6, i:24, d:5
device id:3, x:2, y:6, i:25, d:16
device id:3, x:2, y:6, i:26, d:32
device id:3, x:2, y:6, i:27, d:32
device id:3, x:2, y:6, i:28, d:3
device id:3, x:2, y:6, i:29, d:0
device id:3, x:2, y:6, i:30, d:26880016
device id:3, x:2, y:6, i:31, d:26880016

According to watcher, that core is running cq_prefetch.

However, the buffer should read back as follows:

i:0, d:0
i:1, d:0
i:2, d:0
i:3, d:0
i:4, d:0
i:5, d:0
i:6, d:0
i:7, d:0
i:8, d:0
i:9, d:0
i:10, d:0
i:11, d:0
i:12, d:32
i:13, d:0
i:14, d:0
i:15, d:0
i:16, d:25
i:17, d:7
i:18, d:0
i:19, d:0
i:20, d:0
i:21, d:0
i:22, d:0
i:23, d:0
i:24, d:0
i:25, d:0
i:26, d:0
i:27, d:0
i:28, d:0
i:29, d:0
i:30, d:0
i:31, d:0
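
The comparison between the two dumps above can be automated. The following is a minimal sketch, not part of the profiler code: it assumes the read-back is available as text in the `device id:…, x:…, y:…, i:…, d:…` format printed in this issue, and the helper names `parse_readback` and `diff_buffer` are hypothetical. The expected values are the ones listed above (all zero except words 12, 16 and 17).

```python
import re

# Format of one read-back line, as printed in this issue.
LINE_RE = re.compile(r"device id:(\d+), x:(\d+), y:(\d+), i:(\d+), d:(\d+)")


def parse_readback(text):
    """Return {(device, x, y): {index: value}} for all matching lines."""
    cores = {}
    for match in LINE_RE.finditer(text):
        dev, x, y, i, d = map(int, match.groups())
        cores.setdefault((dev, x, y), {})[i] = d
    return cores


def diff_buffer(observed, expected):
    """List (index, expected, observed) for every word that differs."""
    return [
        (i, expected.get(i, 0), observed[i])
        for i in sorted(observed)
        if observed[i] != expected.get(i, 0)
    ]


# Expected contents from this issue: all zero except words 12, 16 and 17.
EXPECTED = {i: 0 for i in range(32)}
EXPECTED.update({12: 32, 16: 25, 17: 7})

# A few of the lines read back from core 2,6 of device 3.
SAMPLE = """\
device id:3, x:2, y:6, i:0, d:160
device id:3, x:2, y:6, i:1, d:0
device id:3, x:2, y:6, i:12, d:65799
device id:3, x:2, y:6, i:16, d:8
device id:3, x:2, y:6, i:17, d:0
"""

for core, words in parse_readback(SAMPLE).items():
    for i, want, got in diff_buffer(words, EXPECTED):
        print(f"device {core[0]} core {core[1]},{core[2]} word {i}: "
              f"expected {want}, got {got}")
```

Feeding the full 32-line dump through the same helpers would flag every corrupted word per core, which makes it easier to see which cores are affected when several devices are read back.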

Repro steps:

  1. check out 15530_profiler_buffer_corruption
  2. ./build_metal.sh -p
  3. export TT_METAL_DEVICE_PROFILER=1; export WH_ARCH_YAML=wormhole_b0_80_arch_eth_dispatch.yaml; export TT_METAL_DEVICE_PROFILER_DISPATCH=1
  4. pytest tests/ttnn/tracy/test_profiler_sync.py::test_all_devices
@mo-tenstorrent mo-tenstorrent added bug Something isn't working P1 labels Nov 21, 2024
@mo-tenstorrent mo-tenstorrent changed the title from "Profiler mailbox buffers are getting corrupted in idle_eth kernels" to "Profiler mailbox buffers are getting corrupted in idle_eth dispatch kernels" Nov 21, 2024
@mo-tenstorrent
Contributor Author

I have tried moving both the mailbox and the FW start, each by 4096, and the issue followed.

@jbaumanTT
Contributor

Any chance this could be fixed by #15335? Maybe we were accidentally grabbing data from the wrong cores.

@mo-tenstorrent
Contributor Author

That, together with inconsistent use of the hal and device versions of get_dev_addr<profiler_msg_t *>, was the root cause. Cleaning all that up and using the device version everywhere fixed the issue.

Essentially, some parts of the profiler code were looking at the active eth's profiler buffer address for an idle eth.

Moving this to the profiler board.
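
The class of bug described in this comment can be illustrated with a toy model. This is only a sketch under loud assumptions: the addresses, the `CoreType` enum and both lookup functions are hypothetical and are not the tt-metal `get_dev_addr` implementation. It only shows how two address lookups that disagree for idle eth cores make the host read a different L1 region than the one it initialized.

```python
from enum import Enum


class CoreType(Enum):
    TENSIX = "tensix"
    ACTIVE_ETH = "active_eth"
    IDLE_ETH = "idle_eth"


# HYPOTHETICAL addresses, for illustration only; not real tt-metal values.
PROFILER_MAILBOX_ADDR = {
    CoreType.TENSIX: 0x1000,
    CoreType.ACTIVE_ETH: 0x2000,
    CoreType.IDLE_ETH: 0x3000,
}


def consistent_addr(core_type):
    """One lookup keyed on the exact core type, used by every caller."""
    return PROFILER_MAILBOX_ADDR[core_type]


def inconsistent_addr(core_type):
    """Models the bug class: every ethernet core is treated as active."""
    if core_type in (CoreType.ACTIVE_ETH, CoreType.IDLE_ETH):
        return PROFILER_MAILBOX_ADDR[CoreType.ACTIVE_ETH]
    return PROFILER_MAILBOX_ADDR[core_type]


# The two lookups agree for active eth but not for idle eth, so a host that
# writes through one and reads through the other sees unrelated data.
for core_type in CoreType:
    a, b = consistent_addr(core_type), inconsistent_addr(core_type)
    status = "ok" if a == b else "MISMATCH"
    print(f"{core_type.value}: {a:#x} vs {b:#x} {status}")
```

Routing every host-side read and write through a single lookup, as the fix above does by using the device version everywhere, removes the possibility of the two paths disagreeing.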

@mo-tenstorrent mo-tenstorrent self-assigned this Nov 21, 2024
@mo-tenstorrent
Contributor Author

This was an issue with the host profiler code and how it dealt with idle_eth. The fix will come as part of #10234.
