Recently, transformers have been widely adopted for various computer vision tasks and show promising results due to their ability to effectively encode long-range spatial dependencies in an image. However, very few studies have explored transformers for self-supervised depth estimation. When replacing the CNN architecture with a transformer in self-supervised learning of depth, we encounter several problems, such as the multi-scale photometric loss function becoming problematic when used with transformers and an insufficient ability to capture local details. In this letter, we propose an attention-based decoder module, Pixel-Wise Skip Attention (PWSA), to enhance fine details in feature maps while keeping the global context from transformers. In addition, we propose utilizing a self-distillation loss with a single-scale photometric loss to alleviate the instability of transformer training by using correct training signals. We demonstrate that the proposed model performs accurate predictions on large objects and thin structures that require global context and local details. Our model achieves state-of-the-art performance among self-supervised monocular depth estimation methods on the KITTI and DDAD benchmarks.
You need a machine with recent Nvidia drivers and a GPU with at least 6GB of memory (more for the bigger models at higher resolution). We recommend using docker (see nvidia-docker2 instructions) for a reproducible environment. To set up your environment, type in a terminal (only tested on Ubuntu 18.04):
# if you want to use docker (recommended)
make docker-build
make docker-start-interactive
We will list below all commands as if run directly inside our container. To run any of the commands in a container, you can either start the container in interactive mode with make docker-start-interactive to land in a shell where you can type those commands, or you can do it in one step:
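For example, a sketch assuming the Makefile exposes a docker-run target with a COMMAND variable, as in the upstream PackNet-SfM Makefile (check your local Makefile for the exact target name):
make docker-run COMMAND="some-command"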
If you want to use features related to Weights & Biases (WANDB) (for experiment management/visualization), then you should create associated accounts and configure your shell with the following environment variables:
export WANDB_ENTITY="something"
export WANDB_API_KEY="something"
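# Optional (an assumption, not in the original instructions): the default config below also reads WANDB_PROJECT
export WANDB_PROJECT="something"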
To enable WANDB logging and AWS checkpoint syncing, you can then set the corresponding configuration parameters in configs/<your config>.yaml
(cf. configs/default_config.py for defaults and docs):
wandb:
    dry_run: True                                 # Wandb dry-run (not logging)
    name: ''                                      # Wandb run name
    project: os.environ.get("WANDB_PROJECT", "")  # Wandb project
    entity: os.environ.get("WANDB_ENTITY", "")    # Wandb entity
    tags: []                                      # Wandb tags
    dir: ''                                       # Wandb save folder
checkpoint:
    s3_path: ''       # s3 path for AWS model syncing
    s3_frequency: 1   # How often to s3 sync
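For instance, a minimal sketch of a config that actually uploads logs could look like the following (the run, project and entity names are placeholders; dry_run must be False for runs to be logged):
wandb:
    dry_run: False            # set to False so runs are actually logged
    name: 'transdssl-kitti'   # placeholder run name
    project: 'transdssl'      # placeholder project name
    entity: 'your-entity'     # your Wandb entity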
If you encounter out-of-memory issues, try a lower batch_size parameter in the config file.
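As a sketch, the batch size is typically set per dataset split; the exact key path below follows the PackNet-SfM default_config.py and is an assumption, so verify it against configs/default_config.py:
datasets:
    train:
        batch_size: 1   # lower this value if you run out of GPU memory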
NB: if you would rather not use docker, you can create a conda environment by following the steps in the Dockerfile, mixing conda and pip at your own risk...
- If you must use CUDA 11, install the following versions of torch and torchvision:
pip3 install --pre torch -f https://download.pytorch.org/whl/nightly/cu111/torch-1.11.0.dev20211017%2Bcu111-cp36-cp36m-linux_x86_64.whl -U
pip3 install --pre torchvision -f https://download.pytorch.org/whl/nightly/cu111/torchvision-0.12.0.dev20211017%2Bcu111-cp36-cp36m-linux_x86_64.whl -U
- After installing torch (or building the docker container), install horovod:
pip install horovod
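As a quick sanity check (a minimal sketch, not part of the original instructions), you can verify that torch sees your GPU and that horovod imports and initializes:
python3 -c "import torch; print(torch.__version__, torch.cuda.is_available())"
python3 -c "import horovod.torch as hvd; hvd.init(); print('horovod size:', hvd.size())"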
Datasets are assumed to be downloaded in /data/datasets/<dataset-name> (can be a symbolic link).
Dense Depth for Automated Driving (DDAD): a new dataset that leverages diverse logs from TRI's fleet of well-calibrated self-driving cars equipped with cameras and high-accuracy long-range LiDARs. Compared to existing benchmarks, DDAD enables much more accurate 360-degree depth evaluation at range; see the official DDAD repository for more info and instructions. You can also download DDAD directly via:
curl -s https://tri-ml-public.s3.amazonaws.com/github/DDAD/datasets/DDAD.tar | tar -xv -C /data/datasets/
The KITTI (raw) dataset used in our experiments can be downloaded from the KITTI website. For convenience, we provide the standard splits used for training and evaluation: eigen_zhou, eigen_train, eigen_val and eigen_test, as well as pre-computed ground-truth depth maps: original and improved. The full KITTI_raw dataset, as used in our experiments, can be directly downloaded here or with the following command:
# KITTI_raw
curl -s https://tri-ml-public.s3.amazonaws.com/github/packnet-sfm/datasets/KITTI_raw.tar | tar -xv -C /data/datasets/
- Pretrained weights can be downloaded from this GitHub repository.
Any training, including fine-tuning, can be done by passing either a .yaml config file or a .ckpt model checkpoint to scripts/train.py:
python3 scripts/train.py <config.yaml or checkpoint.ckpt>
If you pass a config file, training will start from scratch using the parameters in that config file. Example config files are in configs.
If you pass a .ckpt file instead, training will continue from the current checkpoint state.
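For example (the checkpoint path below is a placeholder, not a file guaranteed to exist in this repository):
# train from scratch with the parameters of a config file
python3 scripts/train.py configs/<your config>.yaml
# resume training from a saved checkpoint
python3 scripts/train.py /path/to/checkpoints/last.ckpt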
Note that it is also possible to define checkpoints within the config file itself. This can be done either individually for the depth and/or pose networks or by defining a checkpoint for the model itself, which includes all sub-networks (setting the model checkpoint will overwrite the depth and pose checkpoints). In this case, a new training session will start and the networks will be initialized with the model state in the .ckpt file(s). Below we list the locations in the config file where these checkpoints are defined:
checkpoint:
    # Folder where .ckpt files will be saved during training
    filepath: /path/to/where/checkpoints/will/be/saved
model:
    # Checkpoint for the model (depth + pose)
    checkpoint_path: /path/to/model.ckpt
    depth_net:
        # Checkpoint for the depth network
        checkpoint_path: /path/to/depth_net.ckpt
    pose_net:
        # Checkpoint for the pose network
        checkpoint_path: /path/to/pose_net.ckpt
Every aspect of the training configuration can be controlled by modifying the yaml config file. This includes the model configuration (self-supervised, semi-supervised, loss parameters, etc.), depth and pose network configuration (choice of architecture and different parameters), optimizers and schedulers (learning rates, weight decay, etc.), datasets (name, splits, depth types, etc.) and much more. For a comprehensive list please refer to configs/default_config.py.
Similar to the training case, to evaluate a trained model you need to provide a .ckpt checkpoint, followed optionally by a .yaml config file that overrides the configuration stored in the checkpoint:
python3 scripts/eval.py --checkpoint <checkpoint.ckpt> [--config <config.yaml>]
You can also directly run inference on a single image or folder:
python3 scripts/infer.py --checkpoint <checkpoint.ckpt> --input <image or folder> --output <image or folder> [--image_shape <input shape (h,w)>]
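A hypothetical invocation (the checkpoint and image paths are placeholders; --image_shape follows the (h,w) order shown above):
python3 scripts/infer.py --checkpoint /path/to/model.ckpt --input assets/sample.png --output outputs/sample_depth.png --image_shape 192 640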
Model | Abs.Rel. | Sqr.Rel | RMSE | RMSElog | d < 1.25 |
---|---|---|---|---|---|
ResNet18, Self-Supervised, 384x640, ImageNet → DDAD (D)* | 0.227 | 11.293 | 17.368 | 0.303 | 0.758 |
PackNet, Self-Supervised, 384x640, DDAD (D)* | 0.173 | 7.164 | 14.363 | 0.249 | 0.835 |
TransDSSL, Vel. Self-Supervised, 384x640, DDAD (D)* | 0.151 | 3.591 | 14.350 | 0.244 | 0.815 |
Model | Abs.Rel. | Sqr.Rel | RMSE | RMSElog | d < 1.25 |
---|---|---|---|---|---|
ResNet18, Self-Supervised, 192x640, ImageNet → KITTI (K) | 0.116 | 0.811 | 4.902 | 0.198 | 0.865 |
PackNet, Self-Supervised, 192x640, KITTI (K) | 0.111 | 0.800 | 4.576 | 0.189 | 0.880 |
TransDSSL, Vel. Self-Supervised, 192x640, KITTI (K) | 0.098 | 0.728 | 4.458 | 0.176 | 0.892 |
All experiments followed the Eigen et al. protocol, with Zhou et al.'s preprocessing to remove static training frames.
Our code is based on PackNet-SfM:
3D Packing for Self-Supervised Monocular Depth Estimation (CVPR 2020 oral)
Vitor Guizilini, Rares Ambrus, Sudeep Pillai, Allan Raventos and Adrien Gaidon, [paper], [video]
@inproceedings{packnet,
    author = {Vitor Guizilini and Rares Ambrus and Sudeep Pillai and Allan Raventos and Adrien Gaidon},
    title = {3D Packing for Self-Supervised Monocular Depth Estimation},
    booktitle = {IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
    primaryClass = {cs.CV},
    year = {2020},
}
@article{9851497,
    author = {Han, Daechan and Shin, Jeongmin and Kim, Namil and Hwang, Soonmin and Choi, Yukyung},
    journal = {IEEE Robotics and Automation Letters},
    title = {TransDSSL: Transformer Based Depth Estimation via Self-Supervised Learning},
    year = {2022},
    volume = {7},
    number = {4},
    pages = {10969-10976},
    doi = {10.1109/LRA.2022.3196781}
}