This repo contains source code for our NeurIPS 2023 paper:
Glance and Focus: Memory Prompting for Multi-Event Video Question Answering
Ziyi Bai, Ruiping Wang, Xilin Chen
✅ AGQA v2 Benchmark Supported
✅ EgoTaskQA Benchmark Supported
✅ NExT-QA Benchmark Supported
The project requires the following:
- PyTorch (version 1.9.0 or higher): The project was tested on PyTorch 1.11.0 with CUDA 11.3 support.
- Hardware: We have performed experiments on NVIDIA GeForce RTX 3090Ti with 24GB GPU memory. Similar or higher specifications are recommended for optimal performance.
- Python packages: Additional Python packages specified in the requirements.txt file are necessary. Instructions for installing them are given below.
Let's begin by creating and activating a Conda virtual environment:
conda create --name gfenv python=3.7
conda activate gfenv
Then, clone this repository and install the requirements.
$ git clone https://github.com/ByZ0e/Glance-Focus.git
$ cd Glance-Focus
$ pip install -r requirements.txt
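After installation, you can optionally run a quick sanity check to confirm that PyTorch and CUDA are visible. This is a minimal sketch; the exact versions depend on your requirements.txt:

```python
# Quick environment sanity check.
import torch

print("PyTorch version:", torch.__version__)          # tested with 1.11.0 (1.9.0+ required)
print("CUDA available:", torch.cuda.is_available())   # should be True on a GPU machine
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))      # e.g. NVIDIA GeForce RTX 3090 Ti
```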
You need to obtain the necessary datasets and features. You can choose one of the following options to do so:
You can download the dataset annotation files and features directly into the DEFAULT_DATASET_DIR.
We have uploaded all necessary files for running on the STAR benchmark. You can download them from Google Drive.
It should have the following structure:
├── /STAR/
│ ├── /txt_db/
│ │ ├── action_mapping.txt
│ │ ├── events.json
│ │ ├── test.jsonl
│ │ ├── train.jsonl
│ │ └── val.jsonl
│ ├── /vis_db/
│ │ ├── s3d.pth
│ │ └── strID2numID.json
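For reference, the text databases are plain JSON/JSONL files and the visual database is a PyTorch tensor file. The snippet below is only an illustrative sketch of how to inspect them; the internal layout of s3d.pth (e.g. a dict keyed by video ID) is an assumption, so check the actual file:

```python
# Illustrative sketch for inspecting the STAR databases (layout of s3d.pth is an assumption).
import json
import torch

base = "DEFAULT_DATASET_DIR/STAR"

# QA annotations: one JSON object per line.
with open(f"{base}/txt_db/train.jsonl") as f:
    train_qas = [json.loads(line) for line in f]
print(len(train_qas), "training QA pairs")

# Pre-extracted S3D video features saved with torch.save.
vis_db = torch.load(f"{base}/vis_db/s3d.pth", map_location="cpu")
print(type(vis_db))  # typically a dict mapping video IDs to feature tensors
```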
We have uploaded all necessary files for running on the AGQA benchmark. You can download them from Google Drive.
It should have the following structure:
├── /AGQA/
│ ├── /txt_db/
│ │ ├── action_mapping.txt
│ │ ├── charades.json
│ │ ├── test_balanced.txt
│ │ ├── train_balanced.txt
│ │ └── agqa_balanced_txt_vocab.json
│ ├── /vis_db/
│ │ └── frcnn.pth
We have uploaded all necessary text files for running on the EgoTaskQA benchmark. You can download them from Google Drive.
It should have the following structure:
├── /EgoTaskQA/
│ ├── /txt_db/
│ │ ├── hois_full_trans.json
│ │ ├── events.json
│ │ ├── formatted_train_qas_encode.json
│ │ ├── formatted_val_qas_encode.json
│ │ ├── formatted_test_qas_encode.json
│ │ └── answer_set.txt
We have uploaded all necessary text files for running on the NExT-QA benchmark. You can download them from Google Drive.
It should have the following structure:
├── /NExT-QA/
│ ├── /txt_db/
│ │ ├── train_mc.csv
│ │ ├── val_mc.csv
│ │ ├── test_mc.csv
│ │ └── vocab.json
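The NExT-QA annotation files are plain multiple-choice CSVs. A quick way to inspect them is sketched below (assuming pandas is installed; column names vary by release, so print them rather than hard-coding):

```python
# Illustrative sketch for inspecting the NExT-QA multiple-choice annotations.
import pandas as pd

base = "DEFAULT_DATASET_DIR/NExT-QA"
train_mc = pd.read_csv(f"{base}/txt_db/train_mc.csv")
print(train_mc.columns.tolist())  # question / answer-choice fields
print(train_mc.head())
```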
If you wish to reproduce the data preprocessing and video feature extraction procedures, follow the steps below:
- Download Raw Data
- STAR: Download it from the data providers.
- AGQA: Download it from the data providers.
- EgoTaskQA: Download it from the data providers.
- NExT-QA: Download it from the data providers.
- Data Preprocessing
- Please follow the data format in Option 1 to prepare the corresponding data.
- We also plan to upload the corresponding data processing code for each benchmark.
- Extract Video Features: We follow recent works to extract video features. Here is some reference code (see also the sketch after this list):
- S3D feature: Please refer to Just-Ask.
- C3D feature: Most of the benchmarks have provided this feature, please refer to the original benchmarks.
- CLIP feature: Please refer to MIST.
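As a concrete illustration of frame-level CLIP feature extraction, here is a minimal sketch using the Hugging Face transformers CLIP implementation. The checkpoint name, frame sampling, and output layout are our own assumptions for illustration, not the exact MIST pipeline:

```python
# Illustrative sketch of frame-level CLIP feature extraction (not the exact MIST pipeline).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# `frames` would be PIL images sampled uniformly from one video (paths are placeholders).
frames = [Image.open(p) for p in ["frame_000.jpg", "frame_016.jpg", "frame_032.jpg"]]

with torch.no_grad():
    inputs = processor(images=frames, return_tensors="pt")
    feats = model.get_image_features(**inputs)  # shape: (num_frames, 512) for the base model

torch.save({"video_0001": feats}, "clip_feats.pth")  # save in a vis_db-style dict
```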
With your environment set up and data ready, you can start training the model.
We support training under both unsupervised and supervised settings, since some VideoQA benchmarks, such as NExT-QA, do not provide event-level annotations.
- unsupervised setting
python train_glance_focus_uns.py --basedir expm/star --name gf_logs --device_id 0 --test_only 0 \
--qa_dataset star --base_data_dir $DEFAULT_DATASET_DIR \
--losses_type ['qa','cls','giou','cert']
- supervised setting
python train_glance_focus_sup.py --basedir expm/star --name gf_logs --device_id 0 --test_only 0 \
--qa_dataset star --base_data_dir $DEFAULT_DATASET_DIR \
--losses_type ['qa','cls','l1']
For the AGQA benchmark, some parameters are different.
Note that AGQA does not provide a validation set; you may split one from the training set yourself.
Also, remember to switch to the AGQA dataloader.
python train_glance_focus_sup.py --basedir expm/agqa --name gf_logs --device_id 0 --test_only 0 \
--qa_dataset agqa --base_data_dir $DEFAULT_DATASET_DIR \
--batch_size 128 --nepoch 20 --lr 5e-5 --i_test 9000 --i_weight 9000 --output_dim 157 \
--train_data_file_path '{}/txt_db/train_balanced.txt' \
--test_data_file_path '{}/txt_db/test_balanced.txt' \
--event_anno_path '{}/txt_db/charades.json' \
--app_feat_path '{}/vis_db/frcnn.pth' \
--feature_dim 512 \
--ans2label_path '{}/txt_db/agqa_balanced_txt_vocab.json' \
--losses_type ['qa','cls','l1']
For the EgoTaskQA and NExT-QA benchmarks, the data formats are slightly different from those of STAR and AGQA.
We release the corresponding data loaders we used.
You may still have to edit the data loading code; see the sketch below for orientation.
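A VideoQA data loader here essentially pairs per-question annotations with pre-extracted video features. The class below is a simplified, hypothetical sketch (field names such as video_id, question, and answer are assumptions), not the released loader; adapt it to the actual annotation format of your benchmark:

```python
# Hypothetical, simplified data loader sketch; field names are assumptions.
import json
import torch
from torch.utils.data import Dataset

class SimpleVideoQADataset(Dataset):
    """Pairs QA annotations with pre-extracted video features."""

    def __init__(self, anno_path, feat_path):
        with open(anno_path) as f:
            self.annos = json.load(f)            # adjust for .jsonl / .csv annotations
        self.feats = torch.load(feat_path, map_location="cpu")

    def __len__(self):
        return len(self.annos)

    def __getitem__(self, idx):
        anno = self.annos[idx]
        video_feat = self.feats[anno["video_id"]]   # assumed feature-dict layout
        return video_feat, anno["question"], anno["answer"]
```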
We provide a model checkpoint trained on the STAR dataset under the supervised setting. Download it from Google Drive and run:
python train_glance_focus_uns.py --device_id 0 --test_only 1 \
--qa_dataset star --base_data_dir $DEFAULT_DATASET_DIR \
--reload_model_path expm/star/gf_logs/ckpts_2024-01-17T10-30-46/model_3000.tar
We are grateful to Just-Ask, MIST, and ClipBERT, on which our code is built.
If you find our paper and/or code helpful, please consider citing:
@inproceedings{bai2023glance,
title={Glance and Focus: Memory Prompting for Multi-Event Video Question Answering},
author={Bai, Ziyi and Wang, Ruiping and Chen, Xilin},
booktitle={Thirty-seventh Conference on Neural Information Processing Systems},
year={2023}
}