# SViTT: Temporal Learning of Sparse Video-Text Transformers

Yi Li<sup>1</sup>, Kyle Min<sup>2</sup>, Subarna Tripathi<sup>2</sup>, Nuno Vasconcelos<sup>1</sup>

<sup>1</sup>University of California, San Diego, <sup>2</sup>Intel Labs

Project page | Paper | 8-min video

This repository contains the PyTorch implementation of **SViTT**, a sparse multimodal transformer for video-language learning.
## Setup

```bash
conda env create -n svitt --file environment.yml
conda activate svitt
```
## Data

All datasets are expected under the `data/` directory with the following structure (other downstream datasets follow the same layout as MSRVTT):

```
data/
├── anno_pretrain/
│   └── webvid_train.json
├── anno_downstream/
│   ├── msrvtt_test1k.json
│   └── ...
├── webvid_videos/
│   └── *.mp4
├── msrvtt_videos/
│   └── *.mp4
└── ...
```
Raw videos should be downloaded from the websites of the respective datasets. Annotations for pre-training and downstream tasks are available in the Singularity repo; additional annotations for Charades and AGQA used in this work are available here.
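Before launching training, it can help to verify the layout above is in place. A minimal sketch, not part of the repo, using only the paths shown in the tree:

```python
from pathlib import Path

# Illustrative sanity check for the expected data/ layout (not part of the repo).
data_root = Path("data")
expected = [
    "anno_pretrain/webvid_train.json",
    "anno_downstream/msrvtt_test1k.json",
    "webvid_videos",
    "msrvtt_videos",
]
for rel in expected:
    path = data_root / rel
    status = "OK     " if path.exists() else "MISSING"
    print(f"{status} {path}")
```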
## Training and evaluation

We follow the same training and evaluation script structure as Singularity, with additional options for temporal modeling and sparse training.
### Pre-training

To train a 4-frame SViTT model on WebVid (use `arg=value` to override any arguments in `configs/pretrain_webvid.yaml`):

```bash
bash scripts/pretrain.sh pt_webvid webvid $GPUS local \
    video_input.num_frames=4 \
    output_dir=$OUTPUT_DIR
```
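The `key.subkey=value` overrides address nested entries in the YAML config. Purely as an illustration of the idea (this is not the repo's actual parser, which is inherited from Singularity), applying a dot-notation override could look like:

```python
import yaml  # PyYAML

def apply_override(cfg: dict, dotted_key: str, value) -> None:
    """Set a nested config entry, e.g. 'video_input.num_frames' -> 4."""
    *parents, leaf = dotted_key.split(".")
    node = cfg
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value

with open("configs/pretrain_webvid.yaml") as f:
    cfg = yaml.safe_load(f)
apply_override(cfg, "video_input.num_frames", 4)
print(cfg["video_input"]["num_frames"])  # 4
```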
To perform temporal sparse expansion to 8 frames:

```bash
bash scripts/pretrain.sh pt_webvid webvid $GPUS local \
    pretrained_path=$CKPT \
    video_input.num_frames=8 \
    vision_encoder_args.token_keep_rate=0.6 \
    output_dir=$OUTPUT_DIR
```
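Here, `token_keep_rate=0.6` means roughly 60% of visual tokens survive pruning. As a rough illustration of this kind of node sparsification (a sketch, not the repo's actual implementation, which lives inside the vision encoder), top-k token selection by a saliency score could look like:

```python
import torch

def prune_tokens(tokens: torch.Tensor, scores: torch.Tensor, keep_rate: float = 0.6):
    """Keep the top-k tokens per sample by saliency score.

    tokens: (B, N, D) visual tokens; scores: (B, N), e.g. [CLS] attention.
    Illustrative only.
    """
    B, N, D = tokens.shape
    k = max(1, int(N * keep_rate))
    idx = scores.topk(k, dim=1).indices        # (B, k)
    idx = idx.unsqueeze(-1).expand(-1, -1, D)  # (B, k, D)
    return tokens.gather(dim=1, index=idx)     # (B, k, D)

# Example: 8 frames x 196 patches = 1568 tokens, keep ~60%
tokens = torch.randn(2, 1568, 768)
scores = torch.rand(2, 1568)
print(prune_tokens(tokens, scores, keep_rate=0.6).shape)  # torch.Size([2, 940, 768])
```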
It is recommended to use the same sparsity parameters (`vision_encoder_args` and `joint_encoder_args`) as the pre-trained model, though you can also override them with different values.
### Text-to-video retrieval

To evaluate zero-shot text-to-video retrieval (MSRVTT, DiDeMo):

```bash
bash scripts/eval_ret.sh $DATASET $CKPT eval-ret-$DATASET local $GPUS
```
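Text-to-video retrieval is typically scored with Recall@K over a text-video similarity matrix. A minimal sketch of the standard metric (assumed here, not the repo's evaluation code), for the usual 1:1 setup where text `i` matches video `i`:

```python
import torch

def recall_at_k(sim: torch.Tensor, ks=(1, 5, 10)) -> dict:
    """sim: (num_texts, num_videos) similarity matrix, text i matching video i."""
    gt = torch.arange(sim.size(0))
    # Rank of the ground-truth video for each text query.
    ranks = (sim.argsort(dim=1, descending=True) == gt[:, None]).float().argmax(dim=1)
    return {f"R@{k}": (ranks < k).float().mean().item() * 100 for k in ks}

sim = torch.randn(1000, 1000)  # e.g. an MSRVTT 1k test split
print(recall_at_k(sim))
```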
To fine-tune text-to-video retrieval (Charades, SSv2):

```bash
bash scripts/train_ret.sh $DATASET $CKPT train-ret-$DATASET local $GPUS
```
### Video question answering

To fine-tune video question answering (MSRVTT-QA, ActivityNet-QA, AGQA):

```bash
bash scripts/train_qa.sh $DATASET $CKPT train-qa-$DATASET local $GPUS
```
## Acknowledgments

This project is built primarily on top of the awesome Singularity codebase. We also acknowledge the use of several other open-source repositories, including Frozen in Time, ALBEF, and 🤗 Transformers. This work was funded in part by NSF award IIS-2041009.
## Citation

If you find this repo useful, please cite our work. Thanks!

```bibtex
@inproceedings{li2023svitt,
  title={{SViTT}: Temporal Learning of Sparse Video-Text Transformers},
  author={Li, Yi and Min, Kyle and Tripathi, Subarna and Vasconcelos, Nuno},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={18919--18929},
  year={2023}
}
```