[Bug Report] Frequent freezes during training #15349
Comments
@davorchap we are not convinced that it is related to di/dt. My expectation is that di/dt would happen anyway with a blocking EnqueueProgram. So this is a pretty annoying issue.
In the shared branch we forced matmul to use subblocks (1, 1) to minimize the possibility of di/dt. It didn't change the frequency of freezes.
I took a look at the branch, and it's fairly out of sync with the latest main. A fairly non-deterministic hang was exposed by this commit pushed on Nov 8, for which a workaround went in yesterday.
@tt-asaigal could you show a link to the workaround?
I've rebased to latest main and can confirm that freezing is not an issue now. Thanks @tt-asaigal!
Describe the bug
Running training with GPT-2S results in frequent freezes after ~300k samples.
Restarting training after a freeze also requires running:
tt-smi -r 0
Switching the blocking argument from false to true in the following call removes the issue:
tt::tt_metal::EnqueueProgram(queue, program, false);
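For readers unfamiliar with the flag, here is a minimal toy sketch of what a blocking versus non-blocking enqueue generally means in a command-queue style API. It is an illustration only: ToyQueue, enqueue, finish and completed_after_enqueue are hypothetical names invented for this sketch, it does not use or model the actual tt_metal implementation, and it says nothing about the cause of the freeze.

```cpp
#include <atomic>
#include <functional>
#include <future>
#include <utility>
#include <vector>

// Hypothetical stand-in for a command queue. NOT the tt_metal API.
class ToyQueue {
public:
    // Launches `work` on another thread. When `blocking` is true, the call
    // returns only after the work has finished; when false, it returns
    // immediately and the work may still be running.
    void enqueue(std::function<void()> work, bool blocking) {
        pending_.push_back(std::async(std::launch::async, std::move(work)));
        if (blocking) {
            pending_.back().wait();
        }
    }

    // Waits for all outstanding work to complete.
    void finish() {
        for (auto& f : pending_) {
            f.wait();
        }
        pending_.clear();
    }

private:
    std::vector<std::future<void>> pending_;
};

// Returns how many tasks were observed as complete immediately after a
// single enqueue call. With blocking == true this is always 1; with
// blocking == false it may be 0 or 1, depending on thread scheduling.
int completed_after_enqueue(bool blocking) {
    std::atomic<int> done{0};
    ToyQueue queue;
    queue.enqueue([&done] { done.fetch_add(1); }, blocking);
    int observed = done.load();
    queue.finish();  // make sure the task is done before `done` goes out of scope
    return observed;
}
```

In this toy model, a blocking enqueue serializes host and worker, so at most one piece of work is in flight at a time. That is one plausible reason a blocking call could hide a timing-dependent hang, but that reading is an assumption, not something established in this thread.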
To Reproduce
Expected behavior
No freezes or crashes.
Screenshots
Call stack of one thread:
Call stacks of all other threads:
Please complete the following environment information: