# SViTT: Temporal Learning of Sparse Video-Text Transformers

Yi Li<sup>1</sup>, Kyle Min<sup>2</sup>, Subarna Tripathi<sup>2</sup>, Nuno Vasconcelos<sup>1</sup>

<sup>1</sup>University of California, San Diego, <sup>2</sup>Intel Labs

Project page | Paper | 8-min video

This repository contains the PyTorch implementation of **SViTT**, a sparse multimodal transformer for video-language learning.
## Setup

```bash
conda env create -n svitt --file environment.yml
conda activate svitt
```
## Data

All datasets are expected under the `data/` directory with the following structure (other downstream datasets follow the same layout as MSRVTT):

```
data/
├── anno_pretrain/
│   └── webvid_train.json
├── anno_downstream/
│   ├── msrvtt_test1k.json
│   └── ...
├── webvid_videos/
│   └── *.mp4
├── msrvtt_videos/
│   └── *.mp4
└── ...
```
Raw videos should be downloaded from the websites of the respective datasets. Annotations for pre-training and downstream tasks are available in the Singularity repo; additional annotations for Charades and AGQA used in this work are available here.
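Before launching training, it can help to verify the layout above is in place. A minimal sketch, not part of the repo, using only the paths shown in the tree:

```python
from pathlib import Path

# Illustrative sanity check for the expected data/ layout (not part of the repo).
data_root = Path("data")
expected = [
    "anno_pretrain/webvid_train.json",
    "anno_downstream/msrvtt_test1k.json",
    "webvid_videos",
    "msrvtt_videos",
]
for rel in expected:
    path = data_root / rel
    status = "OK     " if path.exists() else "MISSING"
    print(f"{status} {path}")
```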
## Training and evaluation

We follow the same training and evaluation script structure as Singularity, with additional options for temporal modeling and sparse training.
### Pre-training

To train a 4-frame SViTT model on WebVid (use `arg=value` to override any arguments in `configs/pretrain_webvid.yaml`):

```bash
bash scripts/pretrain.sh pt_webvid webvid $GPUS local \
    video_input.num_frames=4 \
    output_dir=$OUTPUT_DIR
```
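The `key.subkey=value` overrides address nested entries in the YAML config. Purely as an illustration of the idea (this is not the repo's actual parser, which is inherited from Singularity), applying a dot-notation override could look like:

```python
import yaml  # PyYAML

def apply_override(cfg: dict, dotted_key: str, value) -> None:
    """Set a nested config entry, e.g. 'video_input.num_frames' -> 4."""
    *parents, leaf = dotted_key.split(".")
    node = cfg
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value

with open("configs/pretrain_webvid.yaml") as f:
    cfg = yaml.safe_load(f)
apply_override(cfg, "video_input.num_frames", 4)
print(cfg["video_input"]["num_frames"])  # 4
```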
To perform temporal sparse expansion to 8 frames:

```bash
bash scripts/pretrain.sh pt_webvid webvid $GPUS local \
    pretrained_path=$CKPT \
    video_input.num_frames=8 \
    vision_encoder_args.token_keep_rate=0.6 \
    output_dir=$OUTPUT_DIR
```
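Here, `token_keep_rate=0.6` means roughly 60% of visual tokens survive pruning. As a rough illustration of this kind of node sparsification (a sketch, not the repo's actual implementation, which lives inside the vision encoder), top-k token selection by a saliency score could look like:

```python
import torch

def prune_tokens(tokens: torch.Tensor, scores: torch.Tensor, keep_rate: float = 0.6):
    """Keep the top-k tokens per sample by saliency score.

    tokens: (B, N, D) visual tokens; scores: (B, N), e.g. [CLS] attention.
    Illustrative only.
    """
    B, N, D = tokens.shape
    k = max(1, int(N * keep_rate))
    idx = scores.topk(k, dim=1).indices        # (B, k)
    idx = idx.unsqueeze(-1).expand(-1, -1, D)  # (B, k, D)
    return tokens.gather(dim=1, index=idx)     # (B, k, D)

# Example: 8 frames x 196 patches = 1568 tokens, keep ~60%
tokens = torch.randn(2, 1568, 768)
scores = torch.rand(2, 1568)
print(prune_tokens(tokens, scores, keep_rate=0.6).shape)  # torch.Size([2, 940, 768])
```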
It is recommended to use the same sparsity parameters (`vision_encoder_args` and `joint_encoder_args`) as the pre-trained model, though you can also override them with different values.
### Text-to-video retrieval

To evaluate zero-shot text-to-video retrieval (MSRVTT, DiDeMo):

```bash
bash scripts/eval_ret.sh $DATASET $CKPT eval-ret-$DATASET local $GPUS
```
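Text-to-video retrieval is typically scored with Recall@K over a text-video similarity matrix. A minimal sketch of the standard metric (assumed here, not the repo's evaluation code), for the usual 1:1 setup where text `i` matches video `i`:

```python
import torch

def recall_at_k(sim: torch.Tensor, ks=(1, 5, 10)) -> dict:
    """sim: (num_texts, num_videos) similarity matrix, text i matching video i."""
    gt = torch.arange(sim.size(0))
    # Rank of the ground-truth video for each text query.
    ranks = (sim.argsort(dim=1, descending=True) == gt[:, None]).float().argmax(dim=1)
    return {f"R@{k}": (ranks < k).float().mean().item() * 100 for k in ks}

sim = torch.randn(1000, 1000)  # e.g. an MSRVTT 1k test split
print(recall_at_k(sim))
```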
To fine-tune text-to-video retrieval (Charades, SSv2):

```bash
bash scripts/train_ret.sh $DATASET $CKPT train-ret-$DATASET local $GPUS
```
### Video question answering

To fine-tune video question answering (MSRVTT-QA, ActivityNet-QA, AGQA):

```bash
bash scripts/train_qa.sh $DATASET $CKPT train-qa-$DATASET local $GPUS
```
## Acknowledgments

This project is built primarily on top of the awesome Singularity codebase. We also acknowledge the use of several other open-source repositories, including Frozen in Time, ALBEF, and 🤗 Transformers. This work was funded in part by NSF award IIS-2041009.
## Citation

If you find this repo useful, please cite our work. Thanks!

```bibtex
@inproceedings{li2023svitt,
  title={{SViTT}: Temporal Learning of Sparse Video-Text Transformers},
  author={Li, Yi and Min, Kyle and Tripathi, Subarna and Vasconcelos, Nuno},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={18919--18929},
  year={2023}
}
```