
Using the Tianchi dataset with a config file from work_configs, I changed the dataset paths and set the number of classes to 2; training keeps failing with RuntimeError: CUDA error: device-side assert triggered / terminate called after throwing an instance of 'c10::Error' what(): CUDA error: device-side assert triggered #5

Open
Viktorjc98 opened this issue Aug 3, 2022 · 8 comments
Labels
question Further information is requested

Comments

@Viktorjc98

Have you ever run into this problem?
Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.

/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [14592,0,0], thread: [21,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
[... the same assertion repeats for many more blocks and threads ...]
Traceback (most recent call last):
File "tools/train.py", line 180, in
main()
File "tools/train.py", line 176, in main
meta=meta)
File "/root/work/tamper/mmseg/apis/train.py", line 135, in train_segmentor
runner.run(data_loaders, cfg.workflow)
File "/root/.local/lib/python3.7/site-packages/mmcv/runner/iter_based_runner.py", line 134, in run
iter_runner(iter_loaders[i], **kwargs)
File "/root/.local/lib/python3.7/site-packages/mmcv/runner/iter_based_runner.py", line 61, in train
outputs = self.model.train_step(data_batch, self.optimizer, **kwargs)
File "/root/.local/lib/python3.7/site-packages/mmcv/parallel/data_parallel.py", line 75, in train_step
return self.module.train_step(*inputs[0], **kwargs[0])
File "/root/work/tamper/mmseg/models/segmentors/base.py", line 138, in train_step
losses = self(**data_batch)
File "/root/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/root/.local/lib/python3.7/site-packages/mmcv/runner/fp16_utils.py", line 128, in new_func
output = old_func(*new_args, **new_kwargs)
File "/root/work/tamper/mmseg/models/segmentors/base.py", line 108, in forward
return self.forward_train(img, img_metas, **kwargs)
File "/root/work/tamper/mmseg/models/segmentors/encoder_decoder.py", line 144, in forward_train
gt_semantic_seg)
File "/root/work/tamper/mmseg/models/segmentors/encoder_decoder.py", line 88, in _decode_head_forward_train
self.train_cfg)
File "/root/work/tamper/mmseg/models/decode_heads/decode_head.py", line 203, in forward_train
losses = self.losses(seg_logits, gt_semantic_seg)
File "/root/.local/lib/python3.7/site-packages/mmcv/runner/fp16_utils.py", line 214, in new_func
output = old_func(*new_args, **new_kwargs)
File "/root/work/tamper/mmseg/models/decode_heads/decode_head.py", line 240, in losses
seg_weight = self.sampler.sample(seg_logit, seg_label)
File "/root/work/tamper/mmseg/core/seg/sampler/ohem_pixel_sampler.py", line 56, in sample
sort_prob, sort_indices = seg_prob[valid_mask].sort()
RuntimeError: CUDA error: device-side assert triggered
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: device-side assert triggered
Exception raised from create_event_internal at /pytorch/c10/cuda/CUDACachingAllocator.cpp:687 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f6737b1f8b2 in /root/.local/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0xad2 (0x7f6737d71952 in /root/.local/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7f6737b0ab7d in /root/.local/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #3: <unknown function> + 0x5ff43a (0x7f66f16a043a in /root/.local/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #4: <unknown function> + 0x5ff4e6 (0x7f66f16a04e6 in /root/.local/lib/python3.7/site-packages/torch/lib/libtorch_python.so)

frame #22: __libc_start_main + 0xe7 (0x7f674b287bf7 in /lib/x86_64-linux-gnu/libc.so.6)

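For reference, this scatter/gather assertion during the loss computation usually means that some ground-truth label value falls outside [0, num_classes - 1] (for example, binary masks stored as 0/255 instead of 0/1 when num_classes is set to 2). A minimal diagnostic sketch, assuming single-channel PNG masks under a hypothetical annotation directory:

```python
import os

import numpy as np
from PIL import Image

# Hypothetical paths/values: adjust to the actual Tianchi dataset layout.
ann_dir = "data/tianchi/ann_dir/train"
num_classes = 2
ignore_index = 255  # mmseg-style ignore label, if the config uses one

bad = set()
for name in os.listdir(ann_dir):
    mask = np.array(Image.open(os.path.join(ann_dir, name)))
    values = np.unique(mask).tolist()
    # Any value >= num_classes that is not the ignore_index will trigger the
    # device-side assert inside the CUDA scatter/gather kernel.
    bad |= {v for v in values if v >= num_classes and v != ignore_index}

print("unexpected label values:", sorted(bad) or "none")
```

Any value reported here that is neither a valid class index nor the configured ignore_index has to be remapped before training.
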
@Junjue-Wang
Owner

I haven't run into it; it is probably the OHEM operation causing the out-of-bounds access.
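
If it helps to test that hypothesis, a minimal config-override sketch (following general MMSegmentation decode-head conventions, not this repo's exact base config) is to temporarily disable the pixel sampler and make sure 255-valued pixels are ignored:

```python
# Sketch only: field names follow standard MMSegmentation decode-head options;
# the surrounding base config is assumed, not taken from this repo.
model = dict(
    decode_head=dict(
        num_classes=2,
        ignore_index=255,  # keep 255-valued pixels out of the loss
        sampler=None,      # temporarily drop OHEMPixelSampler to isolate the issue
    )
)
```

If the assert disappears with the sampler removed, the OHEM indices were the problem; if it persists, the ground-truth label values themselves are out of range.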

@Viktorjc98
Author

Then which config file should I start training from?

@Viktorjc98
Author

(screenshot attached)

@Junjue-Wang
Owner

tamper

@Junjue-Wang added the "question" label on Aug 5, 2022
@Viktorjc98
Author

And for multi-model fusion, is it just training the models separately and then weighting their predictions at inference time? Is the multi-model prediction code included in this repo? Which file is it?

@Junjue-Wang
Owner

Yes, the models are trained separately and then fused; that part is probably not implemented here.
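
Since the fusion itself is not implemented in the repo, here is a minimal sketch of what "weighting the predictions" could look like; the probability maps and weights are hypothetical inputs, produced by whatever inference entry point each trained model uses:

```python
import numpy as np

def ensemble_probs(prob_maps, weights=None):
    """Weighted average of per-model class-probability maps.

    prob_maps: list of arrays shaped (num_classes, H, W), one per model,
               already softmax-normalized.
    weights:   optional per-model weights; defaults to a uniform average.
    """
    weights = np.ones(len(prob_maps)) if weights is None else np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    fused = sum(w * p for w, p in zip(weights, prob_maps))
    return fused.argmax(axis=0)  # final per-pixel class prediction
```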

@Junjue-Wang added and then removed the "question" label on Aug 7, 2022
@Martine-li

How was this problem resolved in the end?

@Penguin0321

For training, do we run the dist_train.sh script in the tools folder? Should its config argument be the path of the tamper folder under config, or the path of a specific .py file inside that folder?
