Primary Repository for Video Summarization Project, by Ryan Rowe, Preston Jiang, and Joseph Zhong
We are building a video summarization system using deep neural networks.
- Pipeline Architecture: Architecture inspiration.
- TODO: TODO list.
- Getting Started: Quick-start guide.
- Organization: Project organization.
- Discussion: Questions to be answered.
- Reference: Reference materials.
We will be bootstrapping heavily from the Hierarchical Boundary-Aware Neural Encoder for Video Captioning implementation, and re-posing video-captioning as a video-summarization problem.
We can approach video-captioning using encoder-decoder recurrent neural architectures.
We can potentially even use a transformer decoder to post-process the decodings:
`frames -> low-dim encoding (encoder) -> decoder -> words -> BERT -> summarization`
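As a very rough sketch of this encoder-decoder idea (module names, layer sizes, and feature dimensions below are placeholders for illustration, not the Boundary-Aware architecture itself):

```python
import torch
import torch.nn as nn

class FrameEncoder(nn.Module):
    """Encodes a sequence of per-frame feature vectors into a low-dim video state."""
    def __init__(self, feat_dim=2048, hidden_dim=512):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden_dim, batch_first=True)

    def forward(self, frame_feats):           # (batch, num_frames, feat_dim)
        _, state = self.rnn(frame_feats)      # state: (1, batch, hidden_dim)
        return state

class CaptionDecoder(nn.Module):
    """Decodes the video state into a sequence of word logits."""
    def __init__(self, vocab_size, hidden_dim=512, embed_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens, state):          # tokens: (batch, seq_len)
        emb = self.embed(tokens)
        hidden, _ = self.rnn(emb, state)
        return self.out(hidden)                 # (batch, seq_len, vocab_size)

# The decoded captions could then be handed to a pretrained summarizer
# (e.g. a BERT-style model) for the final text summarization step.
```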
- Download MSR-VTT dataset with available captions
  - Download just the MP4 and VTT files; we should be able to process the buffered frames
- Load Data
  - Initialize Dataset loaders
  - Download MSRVTT if necessary
- Initialize Models
  - Possibly load existing weights
- Train Loop
  - Save weights, visualize if wanted
  - Execute epoch step
  - Evaluate loss and backprop
- Ship
- Add `VS_WORKSPACE` to point to this repository location.
- Preprocess captions by running `./src/data/caption.py`
  - We used the following parameters:
    - `threshold=3`
    - `max_words=30`
- Preprocess videos by running `./src/data/video.py`
  - We used the following parameters (see the sketch after this list):
    - `frequency=0.3`, or about downsampling to 5fps from 15fps video
    - `max_frames=100`, the maximum frame-sequence length for padding
- Run `./scripts/train.py`
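To make the video preprocessing parameters concrete, here is a minimal sketch of frame downsampling and padding, assuming a 15fps source video; it illustrates the `frequency` and `max_frames` semantics described above, not the actual `./src/data/video.py` implementation.

```python
import numpy as np

def downsample_and_pad(frames, frequency=0.3, max_frames=100):
    """Illustrative sketch (not the actual video.py implementation).

    frames: array of shape (num_frames, H, W, C) decoded from the source video.
    frequency: fraction of frames to keep, e.g. 0.3 of 15fps ~= 5fps.
    max_frames: fixed sequence length used for truncation/padding.
    """
    step = max(int(round(1.0 / frequency)), 1)   # keep roughly every (1/frequency)-th frame
    sampled = frames[::step][:max_frames]         # downsample, then truncate

    # Zero-pad up to max_frames so every video yields a fixed-length sequence.
    pad_len = max_frames - sampled.shape[0]
    if pad_len > 0:
        pad = np.zeros((pad_len,) + sampled.shape[1:], dtype=sampled.dtype)
        sampled = np.concatenate([sampled, pad], axis=0)
    return sampled
```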
./src
- scripts
  - train.py
    - Initializes the models and dataset loaders; calls `train_step` and `eval_step` from `train_test_utils`
  - demo.py
    - Live demo with either webcam or video input
- train
  - train_test_utils.py
    - Defines the per-epoch train and eval steps (see the sketch after this tree)
  - loss.py
    - Defines the loss function used by the train utils
- data
  - msrvtt.py
    - Defines the MSRVTT torch dataset.
    - Defines the MSRVTT dataloader.
  - video.py
    - Produces and caches encoded features for dataset videos.
  - caption.py
    - Preprocesses and caches captions for a dataset.
- model
  - object
  - rnn?
  - transformer
  - encoder
  - decoder
- utils
  - cmd_line.py
    - Auto argparse
  - utility.py
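A rough sketch of how `train_test_utils.py` could structure the per-epoch steps follows; it assumes the encoder/decoder interfaces from the pipeline sketch above and a standard cross-entropy criterion, and is illustrative rather than the repository's exact code.

```python
import torch

def train_step(encoder, decoder, loader, criterion, optimizer, device="cpu"):
    """One training epoch: forward pass, loss, backprop, parameter update."""
    encoder.train()
    decoder.train()
    total_loss = 0.0
    for frames, captions in loader:                 # batches from the MSRVTT dataloader
        frames, captions = frames.to(device), captions.to(device)
        optimizer.zero_grad()
        state = encoder(frames)
        logits = decoder(captions[:, :-1], state)   # teacher forcing on shifted captions
        loss = criterion(logits.reshape(-1, logits.size(-1)),
                         captions[:, 1:].reshape(-1))
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    return total_loss / max(len(loader), 1)

@torch.no_grad()
def eval_step(encoder, decoder, loader, criterion, device="cpu"):
    """One evaluation epoch: loss only, no gradient updates."""
    encoder.eval()
    decoder.eval()
    total_loss = 0.0
    for frames, captions in loader:
        frames, captions = frames.to(device), captions.to(device)
        state = encoder(frames)
        logits = decoder(captions[:, :-1], state)
        loss = criterion(logits.reshape(-1, logits.size(-1)),
                         captions[:, 1:].reshape(-1))
        total_loss += loss.item()
    return total_loss / max(len(loader), 1)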
./data
- datasets
  - MSRVTT.50
    - ...
  - MSRVTT.100
    - ...
  - MSRVTT
    - column
      - 0000.npy
      - ...
      - 9999.npy
    - raw
- weights
  - imagenet
    - model=resnet50
      - 0000
        - weights.pth
    - model=vgg16
      - 0000
        - weights.pth
  - sports1m
    - model=c3d
      - 0000
        - weights.pickle
  - MSRVTT.50
    - arg1=val1
      - ...
    - arg1=val2
      - arg2=val1
      - arg2=val2
        - lastArg=lastVal
          - 0000
          - 0001
            - 00_10.pth
            - 01_10.pth
            - 02_10.pth
            - weights.pth
          - ...
        - ...
      - ...
    - ...
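The `arg1=val1/arg2=val2/.../0000/weights.pth` layout above suggests keying cached weights by hyperparameter values. Below is a hypothetical helper (not part of the repository) that follows that convention; the function and argument names are placeholders.

```python
import os

def weights_dir(root, dataset, params, run_id=0):
    """Hypothetical helper illustrating the cache layout sketched above:
    ./data/weights/<dataset>/<arg1=val1>/<arg2=val2>/.../<run>/
    """
    # One nested directory per hyperparameter, sorted for a stable path.
    parts = [f"{key}={value}" for key, value in sorted(params.items())]
    return os.path.join(root, "weights", dataset, *parts, f"{run_id:04d}")

# Example: ./data/weights/MSRVTT.50/frequency=0.3/max_frames=100/0000
print(weights_dir("./data", "MSRVTT.50", {"frequency": 0.3, "max_frames": 100}))
```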
We will heavily bootstrap from the LipReading repository, taking inspiration from its combined vision-NLP pipeline.
Our needs for Video Summarization are similar: we will use the outputs of an existing, generalized object detector as inputs that tell our video summarizer which content is relevant to summarize.
- How often should we detect objects? Most objects will be in frame for at least several seconds.
  - We will get diminishing returns from ever-higher sampling rates (e.g., sampling at 30Hz vs. 40Hz).
  - The optimal sampling rate is most likely around ~0.5Hz; we can also downsample the video input to the object detector to save time and memory.
- Time series of `object-ids` and corresponding `bounding boxes` (see the sketch at the end of this list)
- How to incorporate additional semantic information?
  - (e.g., two people talking vs. kissing?)
  - We can probably incorporate action-detection pipelines here
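As a concrete (hypothetical) sketch of the object-detection time series discussed above, the sampled detections could be stored as something like the following; the field names and the ~0.5Hz sampling interval are assumptions, not a fixed design.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Detection:
    """A single detected object in one sampled frame."""
    object_id: int                              # class/category id from the object detector
    bbox: Tuple[float, float, float, float]     # (x_min, y_min, x_max, y_max) in pixels

@dataclass
class DetectionTimeStep:
    """All detections for one sampled timestamp."""
    timestamp_s: float                          # seconds from video start (sampled at ~0.5Hz)
    detections: List[Detection]

# A video then becomes a time series of detection steps, e.g. one step every 2s:
video_detections: List[DetectionTimeStep] = [
    DetectionTimeStep(0.0, [Detection(1, (34.0, 50.0, 210.0, 400.0))]),
    DetectionTimeStep(2.0, [Detection(1, (40.0, 52.0, 215.0, 398.0)),
                            Detection(17, (300.0, 120.0, 380.0, 240.0))]),
]
```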
- MSR-VTT dataset: http://ms-multimedia-challenge.com/static/resource/train_2017.zip
  - 10k videos (41.2 hrs)
  - 200k (clip, sentence) pairs
  - avg length: 10s each
  - Contains original video audio
    - Possible to extract YouTube transcriptions
- Is audio information important?
  - One option: include the transcript or transcribed audio from YouTube
  - However, audio information could also be misleading