resource_tracker: There appear to be 45 leaked semaphore objects to clean up at shutdown #18
Comments
I encountered a similar issue while running the PyTorch GPT2 example on 8 Gaudi cards (AWS DL1 instance). Both 1.4.0 and 1.4.1 showed this error. The error we got from 1.4.0
Hi @JoeyTPChou, some follow up questions here:
Hi @greg-serochi, just want to let you know I didn't forget this issue. Since the issue is non-deterministic, my first epoch has reached ~3000 iterations this time and is still running. Let me reply to some of your comments:
1. Did you kill this process? The team is curious why it's listed as killed.
2. Are you just running the default commands from the GPT2 Model site? https://github.com/HabanaAI/Model-References/tree/master/PyTorch/nlp/GPT2
3. Does this happen with Single Card? I know this is not ideal, but we want to confirm if this is a DDP or Synapse SW issue?
4. Can you please provide the full dmesg log for this failure? However, I'd assumed what you posted today is the main error section.
@greg-serochi How can I send the dmesg and the log files generated from
We are taking this internally for further debug. Once we have a resolution, we will provide an update.
Hi Greg, thanks for the update. Do you think the fix will be included in the next release (1.5)?
We cannot comment on future fixes until they are released. Any update will be in our release notes and/or we'll update this thread.
We have made updates to our GPT2 model in the 1.5.0 version of our SynapseAI Software stack, which was released today.
Thanks for the great news! Has the 1.5.0 image also been released? Do we need a new AMI and Docker image?
The 1.5.0 content was released today. You should use Habana's Base AMI and Docker images based on 1.5.0.
Base AMI: https://aws.amazon.com/marketplace/search?searchTerms=habana%C2%AE
A solution has been posted. Since there has been no response, we assume the issue was solved. Closing this issue.
Please review this error
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
/usr/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 45 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
mpirun noticed that process rank 7 with PID 0 on node ip-<> exited on signal 9 (Killed).
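For anyone who hits the same `exited on signal 9 (Killed)` message: one common cause of an unexpected SIGKILL is the Linux out-of-memory killer. Nothing in this thread confirms that it was the cause here, but the dmesg log requested above is where such an event would show up. Below is a minimal sketch for filtering saved dmesg text for OOM-killer entries. The function name `find_oom_lines` and the sample log lines are made up for illustration and are not taken from this issue.

```python
import re

# Phrases the Linux kernel commonly logs when the OOM killer acts.
# Assumption: your kernel uses one of these wordings; older or newer
# kernels may phrase the messages differently.
OOM_PATTERN = re.compile(
    r"invoked oom-killer|oom-kill:|Out of memory: Killed process",
    re.IGNORECASE,
)


def find_oom_lines(dmesg_text):
    """Return the lines of dmesg_text that look like OOM-killer entries."""
    return [line for line in dmesg_text.splitlines() if OOM_PATTERN.search(line)]


# Illustrative input only; these lines are NOT from the failing run.
SAMPLE = """\
[1000.000001] python invoked oom-killer: gfp_mask=0x100cca, order=0
[1000.000002] Out of memory: Killed process 4242 (python) total-vm:1000kB
[1000.000003] eth0: link up
"""

for line in find_oom_lines(SAMPLE):
    print(line)
```

To use it on a real machine, save the kernel log with `dmesg > dmesg.log` and pass the file contents to `find_oom_lines`. An empty result only means no matching lines were found; it does not rule out other reasons for the kill.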