Following the documented steps, I get a "misaligned address" error at the training step #27

Open
Jonyond-lin opened this issue Aug 9, 2023 · 1 comment

Comments

@Jonyond-lin

Hello, I have a question. I started training with the following command: python tools/train_net.py --config-file configs/DPText_DETR/Pretrain/R_50_poly.yaml --num-gpus 2. The full error output is below:

terminate called after throwing an instance of 'c10::CUDAError'                                                                                                                                   
  what():  CUDA error: misaligned address                                                                                                                                                         
Exception raised from create_event_internal at ../c10/cuda/CUDACachingAllocator.cpp:1055 (most recent call first):                                                                                
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7fb64ea53a22 in /home/cunjian/anaconda3/envs/DPText-DETR/lib/python3.8/site-packages/torch/lib/libc10.so)                 
frame #1: <unknown function> + 0x10aa3 (0x7fb64ee10aa3 in /home/cunjian/anaconda3/envs/DPText-DETR/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)                                          
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1a7 (0x7fb64ee12147 in /home/cunjian/anaconda3/envs/DPText-DETR/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)            
frame #3: c10::TensorImpl::release_resources() + 0x54 (0x7fb64ea3d5a4 in /home/cunjian/anaconda3/envs/DPText-DETR/lib/python3.8/site-packages/torch/lib/libc10.so)                                
frame #4: std::vector<c10d::Reducer::Bucket, std::allocator<c10d::Reducer::Bucket> >::~vector() + 0x2f9 (0x7fb6f47612e9 in /home/cunjian/anaconda3/envs/DPText-DETR/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #5: c10d::Reducer::~Reducer() + 0x276 (0x7fb6f4757d16 in /home/cunjian/anaconda3/envs/DPText-DETR/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: std::_Sp_counted_ptr<c10d::Reducer*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0x12 (0x7fb6f4786e32 in /home/cunjian/anaconda3/envs/DPText-DETR/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #7: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x46 (0x7fb6f3ef70f6 in /home/cunjian/anaconda3/envs/DPText-DETR/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #8: std::_Sp_counted_ptr<c10d::Logger*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0x1d (0x7fb6f478b47d in /home/cunjian/anaconda3/envs/DPText-DETR/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #9: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x46 (0x7fb6f3ef70f6 in /home/cunjian/anaconda3/envs/DPText-DETR/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0xd891ef (0x7fb6f47891ef in /home/cunjian/anaconda3/envs/DPText-DETR/lib/python3.8/site-packages/torch/lib/libtorch_python.so)                                    
frame #11: <unknown function> + 0x4ff8d0 (0x7fb6f3eff8d0 in /home/cunjian/anaconda3/envs/DPText-DETR/lib/python3.8/site-packages/torch/lib/libtorch_python.so)                                    
frame #12: <unknown function> + 0x500b3e (0x7fb6f3f00b3e in /home/cunjian/anaconda3/envs/DPText-DETR/lib/python3.8/site-packages/torch/lib/libtorch_python.so)                                    
frame #13: /home/cunjian/anaconda3/envs/DPText-DETR/bin/python() [0x4d3abe]                                                                                                                       
frame #14: /home/cunjian/anaconda3/envs/DPText-DETR/bin/python() [0x4f9606]                                                                                                                       
frame #15: /home/cunjian/anaconda3/envs/DPText-DETR/bin/python() [0x4d3abe]                                                                                                                       
frame #16: /home/cunjian/anaconda3/envs/DPText-DETR/bin/python() [0x4f9606]                                                                                                                       
frame #17: /home/cunjian/anaconda3/envs/DPText-DETR/bin/python() [0x4d3abe]                                                                                                                       
frame #18: /home/cunjian/anaconda3/envs/DPText-DETR/bin/python() [0x5a726b]                                                                                                                       

Traceback (most recent call last):                                                               
  File "tools/train_net.py", line 291, in <module>                                               
    launch(                                     
  File "/home/cunjian/anaconda3/envs/DPText-DETR/lib/python3.8/site-packages/detectron2/engine/launch.py", line 67, in launch
    mp.spawn(                                   
  File "/home/cunjian/anaconda3/envs/DPText-DETR/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')                                                                                                                  
  File "/home/cunjian/anaconda3/envs/DPText-DETR/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():                                                                    
  File "/home/cunjian/anaconda3/envs/DPText-DETR/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)                                                                                                                            
torch.multiprocessing.spawn.ProcessRaisedException:                                              

-- Process 0 terminated with the following error:                                                
Traceback (most recent call last):                                                               
  File "/home/cunjian/anaconda3/envs/DPText-DETR/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)                                
  File "/home/cunjian/anaconda3/envs/DPText-DETR/lib/python3.8/site-packages/detectron2/engine/launch.py", line 126, in _distributed_worker
    main_func(*args)                                                     
  File "/home/cunjian/anaconda3/envs/DPText-DETR/lib/python3.8/site-packages/detectron2/engine/defaults.py", line 494, in run_step
    self._trainer.run_step()                    
  File "/home/cunjian/anaconda3/envs/DPText-DETR/lib/python3.8/site-packages/detectron2/engine/train_loop.py", line 285, in run_step
    losses.backward()                           
  File "/home/cunjian/anaconda3/envs/DPText-DETR/lib/python3.8/site-packages/torch/_tensor.py", line 255, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)                                                                                                            
  File "/home/cunjian/anaconda3/envs/DPText-DETR/lib/python3.8/site-packages/torch/autograd/__init__.py", line 147, in backward
    Variable._execution_engine.run_backward(                                                     
RuntimeError: cuDNN error: CUDNN_STATUS_MAPPING_ERROR

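My PyTorch version matches the one required in your README. My CUDA toolkit version (as reported by nvcc -V) is 11.6, whereas yours is 11.1. I have read online that CUDA releases within the same major version are compatible with each other. Do I need to switch to 11.1? Alternatively, could you offer some debugging suggestions based on the error output? Many thanks!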
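A note on the version comparison: nvcc -V reports the locally installed toolkit (the one used when custom CUDA extensions are compiled), while a prebuilt PyTorch binary carries its own CUDA runtime, reported by torch.version.cuda. The two can differ, so it may help to include both in a report like this one. Below is a minimal sketch for collecting the PyTorch side; collect_versions is a hypothetical helper name, not something from this repository.

```python
import importlib


def collect_versions():
    """Collect the PyTorch-side versions to compare against a README's requirements.

    Hypothetical helper for illustration; not part of DPText-DETR.
    """
    info = {"torch": None, "torch_cuda": None, "cudnn": None, "gpu_available": False}
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        # PyTorch is not installed in this environment; return the empty report.
        return info
    info["torch"] = torch.__version__
    # CUDA version the PyTorch binary was built with (None for CPU-only builds).
    info["torch_cuda"] = torch.version.cuda
    # cuDNN version as an integer, or None if cuDNN is unavailable.
    info["cudnn"] = torch.backends.cudnn.version()
    info["gpu_available"] = torch.cuda.is_available()
    return info


if __name__ == "__main__":
    for name, value in collect_versions().items():
        print(f"{name}: {value}")
```

If torch_cuda differs from the nvcc -V output, any CUDA ops built locally were compiled with a different toolkit than the one PyTorch ships with, which is one possible (not confirmed) source of errors like the one above.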

@Jonyond-lin Jonyond-lin changed the title on Aug 9, 2023 to fix a typo (哪一部 → 哪一步)
@Jonyond-lin
Author

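I tried switching CUDA to 11.1, but the problem persists. I don't know what is going on; I would really appreciate it if someone could take a look. orz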
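Since CUDA kernels are launched asynchronously, an error such as "misaligned address" can surface at a later call than the kernel that actually failed. Setting the CUDA_LAUNCH_BLOCKING=1 environment variable makes launches synchronous, so the Python traceback is more likely to point at the failing operation. The sketch below builds such a run from the command quoted in this issue. build_debug_run is a hypothetical helper, and dropping to a single GPU is an assumption intended to take distributed training out of the picture, not a confirmed fix.

```python
import os
import sys


def build_debug_run(num_gpus=1):
    """Return (command, env) for a training run with synchronous CUDA launches.

    Hypothetical helper for illustration; the script path and config file are
    the ones quoted in this issue.
    """
    env = dict(os.environ)
    # Force synchronous kernel launches so errors are reported where they occur.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    command = [
        sys.executable,
        "tools/train_net.py",
        "--config-file",
        "configs/DPText_DETR/Pretrain/R_50_poly.yaml",
        "--num-gpus",
        str(num_gpus),
    ]
    return command, env


if __name__ == "__main__":
    command, env = build_debug_run(num_gpus=1)
    # Print the equivalent shell line instead of launching training directly.
    print("CUDA_LAUNCH_BLOCKING=1", " ".join(command))
```

To actually launch the run from the repository root, the pair can be passed to subprocess.run(command, env=env). Expect training to be noticeably slower in this mode; it is meant for locating the failing op only.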
