Single-machine multi-GPU
```json
{
  "Python": "3.8.10",
  "torch": "1.8.1",
  "torchvision": "0.9.1",
  "dali": "1.2.0",
  "CUDA": "11.1",
  "cuDNN": 8005,
  "GPU": {
    "#0": { "name": "Quadro RTX 6000", "memory": "23.65GB" },
    "#1": { "name": "Quadro RTX 6000", "memory": "23.65GB" }
  },
  "Platform": {
    "system": "Linux",
    "node": "4029GP-TRT",
    "version": "#83~18.04.1-Ubuntu SMP Tue May 11 16:01:00 UTC 2021",
    "machine": "x86_64",
    "processor": "x86_64"
  }
}
```
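An environment summary like the one above can be gathered with a short snippet (a sketch; the GPU entries require CUDA, and the `info` dict layout here is illustrative):

```python
import platform

import torch

# Collect interpreter, library, and platform versions into one dict.
info = {
    "Python": platform.python_version(),
    "torch": torch.__version__,
    "CUDA": torch.version.cuda,               # None on CPU-only builds
    "cuDNN": torch.backends.cudnn.version(),  # None if cuDNN is unavailable
    "Platform": {
        "system": platform.system(),
        "node": platform.node(),
        "machine": platform.machine(),
        "processor": platform.processor(),
    },
}

# GPU properties are only queryable when CUDA is available.
if torch.cuda.is_available():
    info["GPU"] = {}
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        info["GPU"][f"#{i}"] = {
            "name": props.name,
            "memory": f"{props.total_memory / 2**30:.2f}GB",
        }

print(info["Python"], info["torch"])
```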
Batch size: 512, conv layers: 11, epochs: 5
Baseline: 276.980 s

| Time (s) | (no extra options) | +cudnn_benchmark | +AMP | +cudnn_benchmark +AMP |
|---|---|---|---|---|
| DP | 163.740 | 104.807 | 74.948 | 73.862 |
| DDP | 142.497 | 102.535 | 67.095 | 72.998 |
- DP: `torch.nn.DataParallel`
- AMP: `torch.cuda.amp`
- DDP: `torch.nn.parallel.DistributedDataParallel`
- cudnn_benchmark: `torch.backends.cudnn.benchmark = True`
- `pin_memory=True`
- `non_blocking=True`
- `optimizer.zero_grad(set_to_none=True)`
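Combined, the options in the list above amount to only a few lines in an ordinary training loop. A minimal single-process sketch (toy model and data standing in for the real ones; it falls back to CPU when CUDA is absent, where the AMP machinery becomes a no-op):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Let cuDNN autotune conv algorithms; pays off when input shapes are fixed.
torch.backends.cudnn.benchmark = True

use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")

# Toy conv net; 3x32x32 input -> Conv2d(3, 8, 3) gives 8x30x30 features.
model = nn.Sequential(
    nn.Conv2d(3, 8, 3), nn.ReLU(), nn.Flatten(), nn.Linear(8 * 30 * 30, 10)
).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)  # no-op when disabled

# Toy dataset; pin_memory keeps batches in page-locked host memory so
# host-to-device copies can run asynchronously.
data = TensorDataset(torch.randn(64, 3, 32, 32), torch.randint(0, 10, (64,)))
loader = DataLoader(data, batch_size=16, pin_memory=use_cuda)

for images, labels in loader:
    images = images.to(device, non_blocking=True)  # async copy from pinned memory
    labels = labels.to(device, non_blocking=True)
    optimizer.zero_grad(set_to_none=True)          # cheaper than zero-filling grads
    with torch.cuda.amp.autocast(enabled=use_cuda):  # fp16 forward where safe
        loss = criterion(model(images), labels)
    scaler.scale(loss).backward()  # scale loss to avoid fp16 gradient underflow
    scaler.step(optimizer)
    scaler.update()

print(float(loss))  # final batch loss
```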
```shell
# $1 is the epochs
./running.sh 5
# Or run the commands in the script directly.
```
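For reference, the DDP rows in the table above need a process group and a `DistributedSampler` so each rank sees its own data shard. The sketch below sets both up in a single CPU process with the gloo backend; real runs launch one process per GPU (e.g. via `torch.distributed.launch`) and typically use the nccl backend:

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

# Single-process stand-in: rank 0 of a world of size 1.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

# DDP wraps the model; gradients are all-reduced across ranks in backward().
model = DDP(torch.nn.Linear(8, 2))

dataset = TensorDataset(torch.randn(32, 8), torch.randint(0, 2, (32,)))
sampler = DistributedSampler(dataset, num_replicas=1, rank=0)  # one shard per rank
loader = DataLoader(dataset, batch_size=8, sampler=sampler)

criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
for epoch in range(1):
    sampler.set_epoch(epoch)  # reshuffle the shards each epoch
    for x, y in loader:
        optimizer.zero_grad(set_to_none=True)
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()

dist.destroy_process_group()
print(float(loss))
```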
Drop caches before I/O benchmark tests.

```shell
sync
# To free pagecache:
echo 1 > /proc/sys/vm/drop_caches
# To free reclaimable slab objects (includes dentries and inodes):
echo 2 > /proc/sys/vm/drop_caches
# To free slab objects and pagecache:
echo 3 > /proc/sys/vm/drop_caches
```
Batch size: 256/2, workers: 8 x 2
| | Speed | Bottleneck | +DALI/CPU | Bottleneck | +DALI/GPU | Bottleneck |
|---|---|---|---|---|---|---|
| HDD | ~25 MB/s | IO | ~40 MB/s | IO | ~40 MB/s | IO |
| SSD | ~230 MB/s | CPU | ~500 MB/s | CPU | ~600 MB/s | IO |
- SSD
- DALI: The NVIDIA Data Loading Library
- LMDB
```shell
# $1 is the script, $2 is the imagenet dataset path.
./loading.sh loading_faster.py '/datasets/ILSVRC2012/'
# Or run the commands in the script directly.
```
The average resolution of ImageNet images is 469x387, but they are usually cropped to 256x256 or 224x224 in the preprocessing step, so reading can be sped up by downscaling the images ahead of time. In the best case, the entire downscaled dataset even fits into memory.
```shell
# N: the max size of the smaller edge
python resize_imagenet.py --src </path/to/imagenet> --dst </path/to/imagenet/resized> --max-size N
```
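The resize script itself is not reproduced here; a minimal Pillow sketch of the smaller-edge downscaling it would perform (function name hypothetical) looks like this:

```python
from PIL import Image

def resize_smaller_edge(img, max_size):
    """Downscale so the smaller edge is at most max_size, keeping aspect ratio."""
    w, h = img.size
    smaller = min(w, h)
    if smaller <= max_size:
        return img  # already small enough; never upscale
    scale = max_size / smaller
    return img.resize((round(w * scale), round(h * scale)), Image.BILINEAR)

# 469x387 is the average ImageNet resolution mentioned above.
img = Image.new("RGB", (469, 387))
out = resize_smaller_edge(img, 256)
print(out.size)  # smaller edge becomes 256, aspect ratio preserved
```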
As reported in *Fixing the train-test resolution discrepancy*, you can use a smaller image size when training models.