[Bug Report] Frequent freezes during training #15349

Closed
rfurko-tt opened this issue Nov 21, 2024 · 6 comments
Labels
bug Something isn't working P1 TT-train

Comments

@rfurko-tt
Contributor

rfurko-tt commented Nov 21, 2024

Describe the bug
Training GPT-2S freezes frequently after ~300k samples.
Restarting training after a freeze also requires a device reset with tt-smi -r 0.
Changing the blocking argument from false to true in the following call removes the issue: tt::tt_metal::EnqueueProgram(queue, program, false);
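
For readers unfamiliar with the flag, here is a rough sketch of what the blocking argument changes from the host's point of view. It does not use tt-metal at all: MockQueue, enqueue, finish, and outstanding_after are hypothetical names made up for illustration, and the mock runs work lazily on the host rather than on a device. It only models the ordering: a non-blocking enqueue returns with work still outstanding, while a blocking enqueue returns only after everything queued so far has completed.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>

// Hypothetical stand-in for a device command queue. This is NOT the tt-metal
// API; it only imitates the meaning of the `blocking` flag.
class MockQueue {
public:
    // Same argument order as the EnqueueProgram(queue, program, blocking)
    // call quoted in the report.
    void enqueue(std::function<void()> program, bool blocking) {
        pending_.push_back(std::move(program));
        if (blocking) {
            finish();  // host waits until everything queued so far has run
        }
    }

    // Drain the queue in submission order (a synchronization point).
    void finish() {
        while (!pending_.empty()) {
            pending_.front()();
            pending_.pop_front();
        }
    }

    std::size_t pending() const { return pending_.size(); }

private:
    std::deque<std::function<void()>> pending_;
};

// Enqueue `steps` dummy programs and report how many are still outstanding
// when the host gets control back after the last enqueue.
std::size_t outstanding_after(int steps, bool blocking) {
    MockQueue queue;
    int executed = 0;
    for (int i = 0; i < steps; ++i) {
        queue.enqueue([&executed] { ++executed; }, blocking);
    }
    const std::size_t left = queue.pending();
    queue.finish();
    assert(executed == steps);  // all work completes in both modes
    return left;
}
```

In this model, outstanding_after(4, true) leaves nothing outstanding, while outstanding_after(4, false) leaves all four programs queued until the final finish(). That serialization is why flipping the flag can hide a timing-dependent hang without fixing its cause.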

To Reproduce

  1. Branch: rfurko/kahan_summation
  2. Run tt-train/nano-gpt example
  3. Wait for 500-2000 steps
  4. Freeze

Expected behavior
No freezes or crashes.

Screenshots
Call stack of one thread:
Image
Call stacks of all other threads:
Image

Please complete the following environment information:

  • OS: Ubuntu 20.04
  • Version of software: rfurko/kahan_summation
@dmakoviichuk-tt
Contributor

@davorchap we are not convinced that this is related to di/dt. I would expect di/dt to happen anyway with a blocking EnqueueProgram, yet blocking makes the freezes go away. So this is a pretty annoying issue.
@tt-asaigal feel free to ping Roman if you need any help to reproduce this issue.

@rfurko-tt
Contributor Author

In the shared branch we forced matmul to use subblocks of (1, 1) to minimize the possibility of di/dt. It didn't change the frequency of the freezes.

@tt-asaigal
Contributor

I took a look at the branch, and it's fairly out of sync with latest main. A fairly non-deterministic hang was exposed by this commit pushed on Nov 8, for which a workaround went in yesterday.
The branch has the hanging commit but not the workaround. Would it be possible to rebase on latest main and try the workload again?

@dmakoviichuk-tt
Contributor

@tt-asaigal could you show a link to the workaround?

@tt-asaigal
Contributor

Please try cherry-picking these two commits, in this order:
9fff3ce
3036495

@rfurko-tt
Contributor Author

I've rebased to latest main and can confirm that freezing is not an issue now. Thanks @tt-asaigal!
