-
To my understanding, this error is caused by the code below: OpenCV can't read the image correctly.
But the path correctly points to a normal cropped image.
Environment: Ubuntu 16, CUDA 10.1, RTX 2080 Ti and TITAN Xp, 64 GB RAM, 6 GB swap
When the code gets here, it gets stuck. At the same time, nvidia-smi shows the processes on the 2080 Ti stopping one by one, until only 3 processes remain on the TITAN Xp. When the number of workers is set to 1, the DataLoader worker is killed, as in the feedback below.
About the dataset: About the training process: This is my first issue and I have tried to present the problems completely. Sorry for the inconvenience!
-
By the way, I can use a pretrained model from the model zoo to test on datasets like VOT2016.
-
About the training setting:
But another question occurred to me.
About the dataset:
-
It could be; have you checked if the path is correct? (One way to scan the cropped images for unreadable files is sketched after this reply.)
Have you checked #399?
It's hard to tell.
TODO
Please refer to the other issue.
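To illustrate the path check above: cv2.imread does not raise on a missing or corrupt file, it silently returns None, so a one-off scan of the cropped dataset can surface bad entries before training. A minimal sketch (the yt_bb/crop511 root is only an example taken from this thread; adjust it to your own layout):

```python
import os
import cv2

def find_unreadable_images(root_dir):
    """Walk a cropped-dataset directory and report images OpenCV cannot read."""
    bad = []
    for dirpath, _, filenames in os.walk(root_dir):
        for name in filenames:
            if not name.lower().endswith(('.jpg', '.jpeg', '.png')):
                continue
            path = os.path.join(dirpath, name)
            img = cv2.imread(path)
            if img is None:  # cv2.imread returns None instead of raising on failure
                bad.append(path)
    return bad

if __name__ == '__main__':
    # 'yt_bb/crop511' is just an example root from this thread; change it to your path.
    for p in find_unreadable_images('yt_bb/crop511'):
        print('unreadable:', p)
```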
-
I'm actually updating this repo to PyTorch 1.5/1.6, so it should be solved by then. For now I have no idea how to solve it.
It could help. Checking whether the image exists before loading it could be an excellent idea, but this may lead to some batches containing fewer images, and it may require some additional work to balance them.
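As a concrete illustration of that trade-off (the dataset class and collate function below are hypothetical, not code from this repo): if __getitem__ returns None for an unreadable image, the default collate will error out, and dropping the Nones yourself means some batches come out smaller than batch_size.

```python
import cv2
import torch
from torch.utils.data import Dataset, DataLoader

class SkipBrokenDataset(Dataset):
    """Hypothetical dataset that returns None for images OpenCV cannot read."""

    def __init__(self, image_paths):
        self.image_paths = image_paths

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, index):
        img = cv2.imread(self.image_paths[index])
        if img is None:          # missing or corrupt file
            return None          # caller must decide how to handle the gap
        return torch.from_numpy(img).permute(2, 0, 1).float()

def drop_none_collate(batch):
    """Drop failed samples; the resulting batch may be smaller than batch_size."""
    batch = [item for item in batch if item is not None]
    return torch.stack(batch) if batch else torch.empty(0)

# loader = DataLoader(SkipBrokenDataset(paths), batch_size=32,
#                     num_workers=4, collate_fn=drop_none_collate)
```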
-
Do you mean checking every img_path before cv.imread in __getitem__?
According to #399, I set num_threads = 1 in par_crop.py for COCO; the process is a little bit slow. Thanks for the reply.
-
Yes
I have no idea either, since I couldn't reproduce this error. So it's possible that some image path is incorrect.
Great, let us know how it works.
No worries, glad I can help.
-
Thanks for the quick reply. After I check the image_path, if there really is some problem, will simply deleting the broken paths from the .json help? That way paths leading to incorrect images can't be loaded. Or is there some pairing process I am missing? And I noticed that you said:
I don't quite understand; I thought that no matter how many images are in the dataset, the dataloader could work correctly.
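If the paths do turn out to be broken, one rough way to prune them from the annotation json might look like the sketch below (the file names and the assumption that the json is keyed by video folder are hypothetical; check the repo's actual format first, and note the caveat about pair sampling in the next reply).

```python
import json
import os

# Hypothetical layout: train.json keys are video folder names relative to crop511/.
CROP_ROOT = 'yt_bb/crop511'          # example root taken from this thread
SRC_JSON = 'yt_bb/train.json'        # hypothetical file name
DST_JSON = 'yt_bb/train.cleaned.json'

with open(SRC_JSON) as f:
    anno = json.load(f)

# Keep only entries whose cropped folder actually exists on disk.
cleaned = {video: tracks for video, tracks in anno.items()
           if os.path.isdir(os.path.join(CROP_ROOT, video))}

print(f'kept {len(cleaned)} / {len(anno)} videos')
with open(DST_JSON, 'w') as f:
    json.dump(cleaned, f)
```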
-
It could, but this may raise other issues, such as not being able to get the correct search image.
If you modify __getitem__ in the dataloader so that it skips an image that does not exist, the dataloader will not load an additional image to keep the batch size the same; it just returns whatever it reads.
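A common alternative pattern (a general sketch, not something this repo implements) is to resample a different index inside __getitem__ when a read fails, so every call still returns a valid sample and the batch size stays fixed. For a tracker this comes with exactly the caveat above: a naive resample can break the template/search pairing, so the replacement would need to be drawn so that it still yields a valid pair.

```python
import random
import cv2
from torch.utils.data import Dataset

class ResampleOnFailDataset(Dataset):
    """Hypothetical pattern: if an image fails to load, retry with another random
    index so every call returns a valid sample and batches stay full-sized."""

    def __init__(self, image_paths, max_retries=10):
        self.image_paths = image_paths
        self.max_retries = max_retries

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, index):
        for _ in range(self.max_retries):
            img = cv2.imread(self.image_paths[index])
            if img is not None:
                return img
            index = random.randrange(len(self.image_paths))  # pick a replacement sample
        raise RuntimeError('too many consecutive unreadable images')
```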
-
Do the training pairs from yt_bb come from one folder, like yt_bb/crop511/train0000/0/-0F2NokPzeQc? If so, will just deleting these folder paths from the json help? And there are some other questions. Really sorry for question after question; I'm the first person in my lab whose research direction is SOT, and there are not many people around who know about it.
-
It might help
Nope
It depends on whether you want to set this environment variable.
-
So, no matter how many datasets I use, all I need to change is DATASET.NAMES in siamrpn_alex_dwxcorr_16gpu/config.yaml?
As I said before, I find my training process getting slower and slower with every line logger.info shows.
I noticed that during training the memory usage is about 2000+ MiB / 11019 MiB for each GPU, but the Volatile GPU-Util is 0% most of the time. Is this status normal?
I wonder could this phenomenon cause my training process…
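On the 0% Volatile GPU-Util question: that usually means the GPUs are sitting idle waiting for data, which could also be related to the slowdown if disk or CPU-side preprocessing is the bottleneck. A generic way to check (not code from this repo) is to time how long each batch takes to come out of the DataLoader:

```python
import time

def profile_loader(loader, num_batches=50):
    """Rough check for a data-loading bottleneck: print how long each batch takes
    to arrive. Large or growing waits while GPU-Util sits at 0% suggest the GPUs
    are starving for data rather than doing any compute."""
    t_prev = time.time()
    for i, _batch in enumerate(loader):
        t_now = time.time()
        print(f'batch {i:03d}: waited {t_now - t_prev:.3f}s for data')
        if i + 1 >= num_batches:
            break
        # ... the training step would run here ...
        t_prev = time.time()
```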