by Jinghuan Shang, Srijan Das and Michael S. Ryoo at NeurIPS 2022
We present 3DTRL, a plug-and-play layer for Transformers that uses 3D camera transformations to recover tokens in 3D space and learns viewpoint-agnostic representations. See our paper and project page for more details.
Quick link: [Usage] [Dataset] [Image Classification] [Action Recognition] [Video Alignment]
With 3DTRL, we can align videos captured from multiple viewpoints, even across ego-centric (first-person) and third-person views.
Third-person view | First-person view GT | Ours | DeiT+TCN
---|---|---|---
3DTRL recovers the pseudo-depth of images, producing semantically meaningful results.
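For intuition, the core idea can be sketched in a few lines of PyTorch: estimate a pseudo-depth per token and a camera pose per image, back-project the tokens into a shared 3D space, and re-encode the recovered 3D coordinates into the token embeddings. This is only an illustrative sketch; the module names, heads, and pinhole-camera parameters below are assumptions, not the actual code in `model/` and `backbone/`.

```python
import torch
import torch.nn as nn

class TokenRecovery3D(nn.Module):
    """Illustrative 3DTRL-style layer: recover per-token 3D positions and
    fold them back into the token embeddings. Names and shapes are assumptions."""
    def __init__(self, dim, fx=1.0, fy=1.0):
        super().__init__()
        # Pseudo-depth per token (kept positive with a softplus).
        self.depth_head = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, 1), nn.Softplus())
        # Per-image camera: axis-angle rotation (3) + translation (3).
        self.camera_head = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, 6))
        # Re-encode recovered 3D coordinates back into token space.
        self.pos_embed_3d = nn.Sequential(nn.Linear(3, dim), nn.GELU(), nn.Linear(dim, dim))
        self.fx, self.fy = fx, fy  # assumed pinhole focal lengths

    def forward(self, tokens, uv):
        # tokens: (B, N, dim) patch tokens; uv: (N, 2) normalized 2D patch-center coords.
        depth = self.depth_head(tokens)                      # (B, N, 1) pseudo-depth
        cam = self.camera_head(tokens.mean(dim=1))           # (B, 6) per-image camera params
        rot_vec, t = cam[:, :3], cam[:, 3:]

        # Back-project each token to camera coordinates with a pinhole model.
        x = (uv[None, :, 0:1] / self.fx) * depth
        y = (uv[None, :, 1:2] / self.fy) * depth
        pts_cam = torch.cat([x, y, depth], dim=-1)           # (B, N, 3)

        # Rotate + translate into a shared "world" frame (Rodrigues' formula).
        theta = rot_vec.norm(dim=-1, keepdim=True).clamp(min=1e-8)
        k = (rot_vec / theta)[:, None, :]                    # (B, 1, 3) unit rotation axis
        cos, sin = theta.cos()[:, None, :], theta.sin()[:, None, :]
        pts_world = (pts_cam * cos
                     + torch.cross(k.expand_as(pts_cam), pts_cam, dim=-1) * sin
                     + k * (k * pts_cam).sum(-1, keepdim=True) * (1 - cos)) + t[:, None, :]

        # Add the encoded 3D positions back onto the tokens.
        return tokens + self.pos_embed_3d(pts_world)
```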
├── _doc                            # images, gifs, etc. for the README
├── action_recognition              # all files related to action recognition; works standalone
│   ├── configs                     # config files for TimeSformer and +3DTRL
│   ├── timesformer
│   │   ├── datasets                # data pipeline for action recognition
│   │   └── models                  # definitions of TimeSformer and +3DTRL
│   └── script.sh                   # launch script for action recognition
├── backbone                        # modules used by 3DTRL (depth and camera estimators)
├── model                           # Transformer models with 3DTRL plug-in (ViT, Swin, TnT)
├── data_pipeline                   # dataset class for video alignment
├── i1k_configs                     # configuration files for ImageNet-1K training
├── 3dtrl_env.yml                   # conda env for image classification and video alignment
├── i1k.sh                          # launch script for ImageNet-1K jobs
├── imagenet_train.py               # entry point of ImageNet-1K training
├── imagenet_val.py                 # entry point of ImageNet-1K evaluation
├── multiview_video_alignment.py    # entry point of video alignment
└── utils.py                        # some utility functions
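As a hedged usage sketch only: the models in `model/` are intended as drop-in replacements for their plain backbones, so instantiating one and running an image through it should look roughly like the snippet below. The import path, class name, and constructor arguments are placeholders, not the repo's actual API; check `model/` for the real names.

```python
import torch
# Placeholder import; the real module and class names are defined in model/.
from model.vit_3dtrl import VisionTransformer3DTRL  # hypothetical name

model = VisionTransformer3DTRL(img_size=224, patch_size=16, num_classes=1000)  # hypothetical args
model.eval()
with torch.no_grad():
    logits = model(torch.randn(1, 3, 224, 224))  # one 224x224 RGB image
print(logits.shape)  # expected: torch.Size([1, 1000])
```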
Environment:
conda env create -f 3dtrl_env.yml
Run:
conda activate 3dtrl
bash i1k.sh num_gpu your_imagenet_dir
Credit: we build our image classification code on top of timm.
We release the First-Third Person View (FTPV) dataset (including the MC, Panda, Lift, and Can splits used in our paper) on Google Drive. Download and unzip it. Please consider citing our paper if you use the datasets. Note: the drive also includes the Pouring dataset introduced in the TCN paper. I had a hard time finding a valid source to download it while doing this research, so I am re-sharing it for your convenience. Please cite TCN if you use Pouring.
Environment:
conda env create -f 3dtrl_env.yml
Run:
conda activate 3dtrl
python multiview_video_alignment.py --data dataset_name [--model vit_3dtrl] [--train_videos num_video_used]
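For intuition on what the alignment evaluation measures (the script itself is the reference), a common way to score alignment between two views is to embed both videos frame by frame, match each frame in one view to its nearest neighbor in the other, and check how well the matched indices preserve temporal order, e.g. with Kendall's tau. A minimal, hedged sketch of that idea (the function name is an assumption, not the repo's API):

```python
import numpy as np
from scipy.stats import kendalltau

def alignment_score(emb_a, emb_b):
    """emb_a, emb_b: (T, D) per-frame embeddings of the same recording seen
    from two viewpoints. Returns Kendall's tau of the nearest-neighbor matching."""
    # Pairwise distances between frames of view A and view B.
    dists = np.linalg.norm(emb_a[:, None, :] - emb_b[None, :, :], axis=-1)
    nn_idx = dists.argmin(axis=1)  # for each A-frame, the closest B-frame
    tau, _ = kendalltau(np.arange(len(nn_idx)), nn_idx)
    return tau
```

Metrics of this flavor (Kendall's tau, alignment error) are standard for this task; `multiview_video_alignment.py` is the authoritative implementation of what is actually reported.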
Environment: we follow TimeSformer to set up the virtual environment. Then run:
cd action_recognition
bash script.sh your_config_file data_location log_location
If you find our research useful, please consider citing:
@inproceedings{3dtrl,
title={Learning Viewpoint-Agnostic Visual Representations by Recovering Tokens in 3D Space},
author={Jinghuan Shang and Srijan Das and Michael S Ryoo},
booktitle={Advances in Neural Information Processing Systems},
year={2022},
}