The YouTube8M challenge is a multi-class classification problem, where we are asked to predict for each video, given video & frame level audio and frame RGB features, to which group of categories it belongs to. The number of classes is 3807 for our subset of data.
classifier.ipynb
shows the approach to solve the problem.
The main idea is to separate video level features
with frame level features
, and apply context gating (non linear learnable unit to model interdependencies between activations) [1] for video classification
mean rgb
and mean audio
are the video level features. We pass them through Dense
layers.
mean frame rgb
and mean frame audio
are the frame the frame level feautures. We pass them through Bi-LSTM
layers.
In the end we merge the outputs of video and frame level features into a dense layer and a sigmoid layer is used to predict the tag
for the video.
[1] Miech, Antoine, Ivan Laptev, and Josef Sivic. "Learnable pooling with Context Gating for video classification." arXiv preprint arXiv:1706.06905 (2017).