resource_tracker: There appear to be 45 leaked semaphore objects to clean up at shutdown #18

anti-machinee · 2022-05-17T10:58:09Z

Please review this error

Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
/usr/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 45 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
mpirun noticed that process rank 7 with PID 0 on node ip-<> exited on signal 9 (Killed).

JoeyTPChou · 2022-05-18T21:41:14Z

I encountered a similar issue while running the PyTorch GPT2 example on 8 Gaudi cards (AWS DL1 instance). Both 1.4.0 and 1.4.1 showed this error. This is the error we got from 1.4.0:

....
254 2022-04-28 22:31:43 | INFO | root | Reducer buckets have been rebuilt in this iteration.
255 2022-04-28 22:33:47 | INFO | train_inner | epoch 001:     10 / 16405 loss=18.858, ppl=475154, wps=37885.1, ups=0.07, wpb=524288, bsz=512, num_updates=10, lr=6.099e-06, gnorm=18.355, clip=100, train_wall=148, wall=212
256 2022-04-28 22:36:06 | INFO | train_inner | epoch 001:     20 / 16405 loss=16.001, ppl=65586.3, wps=37670.6, ups=0.07, wpb=524288, bsz=512, num_updates=20, lr=1.2098e-05, gnorm=4.887, clip=100, train_wall=139, wall=352
257 2022-04-28 22:38:25 | INFO | train_inner | epoch 001:     30 / 16405 loss=14.704, ppl=26695.2, wps=37699, ups=0.07, wpb=524288, bsz=512, num_updates=30, lr=1.8097e-05, gnorm=2.606, clip=100, train_wall=139, wall=491
258 /home/jenkins/workspace/cdsoftwarebuilder/create-binaries-from-sw-sources---bp-dt/repos/hcl/src/hcl_remote_device.cpp::69(onReceivedArt): The condition [ entry.seqNumber == (lastReceived + 1) ] failed. Illegal ART seqNumber, from rank (0), seqNumbe
259 terminate called after throwing an instance of 'c10::Error'
260   what():  Collective call returned error
261 Exception raised from operator() at /tmp/pip-req-build-_kqiu0aw/habana_frameworks/torch/core/hccl/ProcessGroupHCCL.cpp:647 (most recent call first):
262 frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6c (0x7f295d4eed2c in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
263 frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0xf5 (0x7f295d4cec4d in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
264 frame #2: <unknown function> + 0x29dc7 (0x7f28b36f2dc7 in /usr/local/lib/python3.8/dist-packages/habana_frameworks/torch/core/_hccl_C.so)
265 frame #3: <unknown function> + 0x2836a (0x7f28b36f136a in /usr/local/lib/python3.8/dist-packages/habana_frameworks/torch/core/_hccl_C.so)
266 frame #4: <unknown function> + 0xd6de4 (0x7f295d3a2de4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
267 frame #5: <unknown function> + 0x8609 (0x7f2969839609 in /lib/x86_64-linux-gnu/libpthread.so.0)
268 frame #6: clone + 0x43 (0x7f2969973163 in /lib/x86_64-linux-gnu/libc.so.6)
....
295 Traceback (most recent call last):
296   File "train.py", line 14, in <module>
297     cli_main()
298   File "/GPT2/fairseq_cli/train.py", line 537, in cli_main
299     distributed_utils.call_main(cfg, main)
300   File "/GPT2/fairseq/distributed/utils.py", line 369, in call_main
301     torch.multiprocessing.spawn(
302   File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 230, in spawn
303     return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
304   File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
305     while not context.join():
306   File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 130, in join
307     raise ProcessExitedException(
308 torch.multiprocessing.spawn.ProcessExitedException: process 5 terminated with signal SIGKILL
309 Couldn't import apex.normalization.fused_layer_norm.FusedLayerNorm, using torch.nn.LayerNorm
310 Couldn't import apex.normalization.fused_layer_norm.FusedLayerNorm, using torch.nn.LayerNorm
311 /usr/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 128 leaked semaphore objects to clean up at shutdown
312   warnings.warn('resource_tracker: There appear to be %d '
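
The "terminated with signal SIGKILL" line above matches how Python's standard library reports a child that was killed by a signal: the exit code is the negated signal number. The following is a minimal sketch of that convention, not the code from `torch.multiprocessing`; the helper names `describe_exit` and `run_and_kill` are hypothetical and exist only for illustration. It assumes a POSIX system.

```python
import signal
import subprocess
import sys


def describe_exit(exitcode):
    """Turn a child's exit code into a readable description (POSIX convention)."""
    if exitcode is None:
        return "still running"
    if exitcode < 0:
        # A negative code means the child was terminated by signal number -exitcode.
        return f"terminated by signal {signal.Signals(-exitcode).name}"
    return f"exited with code {exitcode}"


def run_and_kill():
    """Start a sleeping child process, send it SIGKILL, and return its exit code."""
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    child.kill()  # on POSIX this sends SIGKILL
    return child.wait()


if __name__ == "__main__":
    print(describe_exit(run_and_kill()))
```

The exit code alone says which signal arrived, not who sent it, so it cannot distinguish a manual `kill -9` from a kill issued by the system.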

greg-serochi · 2022-05-18T23:53:01Z

Hi @JoeyTPChou, some follow-up questions here:

1. Did you kill this process? The team is curious why it's listed as killed.
2. Are you just running the default commands from the GPT2 Model site? https://github.com/HabanaAI/Model-References/tree/master/PyTorch/nlp/GPT2
3. Does this happen with Single Card? I know this is not ideal, but we want to confirm if this is a DDP or Synapse SW issue.
4. Can you please provide the full dmesg log for this failure? However, I'd assumed what you posted today is the main error section.
5. We have a dedicated snapshot tool https://github.com/HabanaAI/Snapshot_For_Debug that you can run to capture the relevant log files.
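
One thing a dmesg log can help rule in or out is the Linux out-of-memory killer, which is a common source of a SIGKILL that nobody sent by hand. Nothing in this thread confirms that this is the cause here, so the following is only a sketch of how saved dmesg text might be scanned for such entries. The function name `find_oom_kills` and the sample line are hypothetical, and the regular expression assumes the usual kernel wording "Killed process <pid> (<name>)".

```python
import re

# Assumed kernel log wording for an OOM kill; exact text can vary by kernel version.
OOM_KILL_RE = re.compile(r"Killed process (?P<pid>\d+) \((?P<name>[^)]+)\)")


def find_oom_kills(dmesg_text):
    """Return (pid, process_name) pairs for OOM-kill entries found in dmesg text."""
    hits = []
    for line in dmesg_text.splitlines():
        match = OOM_KILL_RE.search(line)
        if match:
            hits.append((int(match.group("pid")), match.group("name")))
    return hits


if __name__ == "__main__":
    # Hypothetical sample line, not taken from the failing run.
    sample = "[ 1234.567890] Out of memory: Killed process 4242 (python3) total-vm:1000kB"
    print(find_oom_kills(sample))
```

If the scan returns nothing, the kill came from somewhere else and the remaining log files from the snapshot tool are the next place to look.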

JoeyTPChou · 2022-05-21T16:33:16Z

Hi @greg-serochi, I just want to let you know I haven't forgotten this issue. Since the issue is non-deterministic, my first epoch has reached ~3000 iterations this time and is still running. Let me reply to some of your comments:

1. Did you kill this process? The team is curious why it's listed as killed.
No, I didn't kill it. It was killed by the process.

2. Are you just running the default commands from the GPT2 Model site? https://github.com/HabanaAI/Model-References/tree/master/PyTorch/nlp/GPT2
Yes, I am. I used the PyTorch docker file and the example script to run on 8 Gaudi on a single DL1 instance.

3. Does this happen with Single Card? I know this is not ideal, but we want to confirm if this is a DDP or Synapse SW issue?
I ran them on 8 Gaudi on AWS.

4. Can you please provide the full dmesg log for this failure? However, I'd assumed what you posted today is the main error section.
5. We have a dedicated snapshot tool https://github.com/HabanaAI/Snapshot_For_Debug that you can run to capture the relevant log files
I will try them after this run gets killed.

JoeyTPChou · 2022-05-23T14:33:25Z

@greg-serochi How can I send the dmesg output and the log files generated by gather_info_docker.py?

greg-serochi · 2022-05-26T18:30:53Z

We are taking this internally for further debugging. Once we have a resolution, we will provide an update.

JoeyTPChou · 2022-05-26T20:27:05Z

Hi Greg, thanks for the update. Do you think the fix will be included in the next release (1.5)?

greg-serochi · 2022-05-26T20:50:37Z

We cannot comment on future fixes until they are released. Any update will be in our release notes and/or we'll update this thread.

greg-serochi · 2022-06-16T15:23:13Z

We have made updates to our GPT2 model in the 1.5.0 version of our SynapseAI Software stack, which was released today.

JoeyTPChou · 2022-06-16T15:50:33Z

Thanks for the great news! Has the 1.5.0 image also been released? Do we need a new AMI and Docker image?

greg-serochi · 2022-06-16T18:13:57Z

The 1.5.0 content was released today. You should use Habana's Base AMI and Docker images based on 1.5.0.

Base AMI: https://aws.amazon.com/marketplace/search?searchTerms=habana%C2%AE
Docker Images: https://gallery.ecr.aws/habanalabs/pytorch-installer

Alberto-Villarreal · 2024-11-20T01:23:18Z

A solution has been posted. Having received no further response, we assume the issue was solved. Closing this issue.

piotrbocian closed this as completed Nov 27, 2024
