
Using the Tianchi dataset with a config file from work_configs, I changed the dataset paths and set the number of classes to 2; training keeps failing with RuntimeError: CUDA error: device-side assert triggered / terminate called after throwing an instance of 'c10::Error' what(): CUDA error: device-side assert triggered #5

Open
Viktorjc98 opened this issue Aug 3, 2022 · 8 comments
Labels
question Further information is requested

Comments

@Viktorjc98

Have you ever run into this problem?
Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.

/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [14592,0,0], thread: [21,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
[... the same assertion repeats for many more blocks and threads ...]
Traceback (most recent call last):
File "tools/train.py", line 180, in
main()
File "tools/train.py", line 176, in main
meta=meta)
File "/root/work/tamper/mmseg/apis/train.py", line 135, in train_segmentor
runner.run(data_loaders, cfg.workflow)
File "/root/.local/lib/python3.7/site-packages/mmcv/runner/iter_based_runner.py", line 134, in run
iter_runner(iter_loaders[i], **kwargs)
File "/root/.local/lib/python3.7/site-packages/mmcv/runner/iter_based_runner.py", line 61, in train
outputs = self.model.train_step(data_batch, self.optimizer, **kwargs)
File "/root/.local/lib/python3.7/site-packages/mmcv/parallel/data_parallel.py", line 75, in train_step
return self.module.train_step(*inputs[0], **kwargs[0])
File "/root/work/tamper/mmseg/models/segmentors/base.py", line 138, in train_step
losses = self(**data_batch)
File "/root/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/root/.local/lib/python3.7/site-packages/mmcv/runner/fp16_utils.py", line 128, in new_func
output = old_func(*new_args, **new_kwargs)
File "/root/work/tamper/mmseg/models/segmentors/base.py", line 108, in forward
return self.forward_train(img, img_metas, **kwargs)
File "/root/work/tamper/mmseg/models/segmentors/encoder_decoder.py", line 144, in forward_train
gt_semantic_seg)
File "/root/work/tamper/mmseg/models/segmentors/encoder_decoder.py", line 88, in _decode_head_forward_train
self.train_cfg)
File "/root/work/tamper/mmseg/models/decode_heads/decode_head.py", line 203, in forward_train
losses = self.losses(seg_logits, gt_semantic_seg)
File "/root/.local/lib/python3.7/site-packages/mmcv/runner/fp16_utils.py", line 214, in new_func
output = old_func(*new_args, **new_kwargs)
File "/root/work/tamper/mmseg/models/decode_heads/decode_head.py", line 240, in losses
seg_weight = self.sampler.sample(seg_logit, seg_label)
File "/root/work/tamper/mmseg/core/seg/sampler/ohem_pixel_sampler.py", line 56, in sample
sort_prob, sort_indices = seg_prob[valid_mask].sort()
RuntimeError: CUDA error: device-side assert triggered
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: device-side assert triggered
Exception raised from create_event_internal at /pytorch/c10/cuda/CUDACachingAllocator.cpp:687 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f6737b1f8b2 in /root/.local/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0xad2 (0x7f6737d71952 in /root/.local/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7f6737b0ab7d in /root/.local/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #3: <unknown function> + 0x5ff43a (0x7f66f16a043a in /root/.local/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #4: <unknown function> + 0x5ff4e6 (0x7f66f16a04e6 in /root/.local/lib/python3.7/site-packages/torch/lib/libtorch_python.so)

frame #22: __libc_start_main + 0xe7 (0x7f674b287bf7 in /lib/x86_64-linux-gnu/libc.so.6)

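For reference, this scatter/gather assertion during the loss computation usually means that some ground-truth label value falls outside [0, num_classes - 1] (for example, binary masks stored as 0/255 instead of 0/1 when num_classes is set to 2). A minimal diagnostic sketch, assuming single-channel PNG masks under a hypothetical annotation directory:

```python
import os

import numpy as np
from PIL import Image

# Hypothetical paths/values: adjust to the actual Tianchi dataset layout.
ann_dir = "data/tianchi/ann_dir/train"
num_classes = 2
ignore_index = 255  # mmseg-style ignore label, if the config uses one

bad = set()
for name in os.listdir(ann_dir):
    mask = np.array(Image.open(os.path.join(ann_dir, name)))
    values = np.unique(mask).tolist()
    # Any value >= num_classes that is not the ignore_index will trigger the
    # device-side assert inside the CUDA scatter/gather kernel.
    bad |= {v for v in values if v >= num_classes and v != ignore_index}

print("unexpected label values:", sorted(bad) or "none")
```

Any value reported here that is neither a valid class index nor the configured ignore_index has to be remapped before training.
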
@Junjue-Wang
Owner

I haven't run into it; it is probably the OHEM operation causing the out-of-bounds access.
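
If it helps to test that hypothesis, a minimal config-override sketch (following general MMSegmentation decode-head conventions, not this repo's exact base config) is to temporarily disable the pixel sampler and make sure 255-valued pixels are ignored:

```python
# Sketch only: field names follow standard MMSegmentation decode-head options;
# the surrounding base config is assumed, not taken from this repo.
model = dict(
    decode_head=dict(
        num_classes=2,
        ignore_index=255,  # keep 255-valued pixels out of the loss
        sampler=None,      # temporarily drop OHEMPixelSampler to isolate the issue
    )
)
```

If the assert disappears with the sampler removed, the OHEM indices were the problem; if it persists, the ground-truth label values themselves are out of range.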

@Viktorjc98
Author

Then which config file should I start training from?

@Viktorjc98
Author

(screenshot attached)

@Junjue-Wang
Owner

tamper

@Junjue-Wang added the "question" label on Aug 5, 2022
@Viktorjc98
Author

And for multi-model fusion, is it just training the models separately and then weighting their predictions at inference time? Is the multi-model prediction code included in this repo? Which file is it?

@Junjue-Wang
Owner

Yes, the models are trained separately and then fused; that part is probably not implemented here.
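
Since the fusion itself is not implemented in the repo, here is a minimal sketch of what "weighting the predictions" could look like; the probability maps and weights are hypothetical inputs, produced by whatever inference entry point each trained model uses:

```python
import numpy as np

def ensemble_probs(prob_maps, weights=None):
    """Weighted average of per-model class-probability maps.

    prob_maps: list of arrays shaped (num_classes, H, W), one per model,
               already softmax-normalized.
    weights:   optional per-model weights; defaults to a uniform average.
    """
    weights = np.ones(len(prob_maps)) if weights is None else np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    fused = sum(w * p for w, p in zip(weights, prob_maps))
    return fused.argmax(axis=0)  # final per-pixel class prediction
```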

@Junjue-Wang added and then removed the "question" label on Aug 7, 2022
@Martine-li

How was this problem resolved in the end?

@Penguin0321

For training, do we run the dist_train.sh script in the tools folder? Should its config argument be the path of the tamper folder under config, or the path of a specific .py file inside that folder?
