Implemented a multiresolution CNN for video classification on the Sports-1M dataset, based on the architecture given in [1]. The model uses two separate streams, ‘fovea’ and ‘context’, that learn features from inputs at different resolutions; their outputs are concatenated later in the network. Because each stream processes a smaller input than the full frame, training is faster while the important information in the frame is retained.
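The way a frame is split into the two stream inputs can be sketched as below. This is a minimal NumPy illustration, not code from this repository. It assumes the scheme described in [1], where the context stream receives the whole frame downsampled by a factor of two and the fovea stream receives the center region at the original resolution. The function name `split_streams` and the use of 2x2 average pooling for downsampling are assumptions made for the example.

```python
import numpy as np


def split_streams(frame):
    """Split one frame of shape (H, W, C) into fovea and context inputs.

    Hypothetical helper: H and W are assumed to be even so that the
    2x2 average pooling below divides the frame exactly.
    """
    h, w, c = frame.shape
    if h % 2 or w % 2:
        raise ValueError("frame height and width must be even")

    frame = frame.astype(np.float32)

    # Context stream: whole frame at half resolution (2x2 average pooling).
    context = frame.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

    # Fovea stream: center region of half the height and width,
    # kept at the original resolution.
    ch, cw = h // 2, w // 2
    top, left = (h - ch) // 2, (w - cw) // 2
    fovea = frame[top:top + ch, left:left + cw]

    return fovea, context


# Usage: a 170x170 crop yields two 85x85 inputs.
frame = np.zeros((170, 170, 3), dtype=np.uint8)
fovea, context = split_streams(frame)
print(fovea.shape, context.shape)
```

Both inputs have a quarter of the pixels of the original crop, which is where the speed-up comes from.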
Model architecture from [1]
- Images are resized to 200x200
- 170x170 crops are randomly sampled
- Crops are flipped horizontally with probability 0.5
- The mean is subtracted from each pixel
- Optimization: mini-batch size = 32, momentum = 0.9, weight decay = 0.0005, learning rate = 0.001
- Local Response Normalization layers are replaced by Batch Normalization layers
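The preprocessing and optimization settings listed above can be sketched as follows. This is an illustrative NumPy version under stated assumptions, not the repository's training code: the helper names (`resize_nearest`, `augment`, `sgd_momentum_step`) are hypothetical, nearest-neighbour resizing stands in for whatever interpolation the real pipeline uses, the `mean` argument is assumed to be a scalar or per-channel value, and the update rule is one common formulation of SGD with momentum and L2 weight decay (frameworks differ in the details).

```python
import numpy as np

RESIZE, CROP, FLIP_PROB = 200, 170, 0.5


def resize_nearest(img, size):
    """Resize an (H, W, C) image to (size, size, C) by nearest-neighbour sampling."""
    h, w = img.shape[:2]
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return img[rows[:, None], cols]


def augment(img, mean, rng):
    """Resize to 200x200, take a random 170x170 crop, flip with p=0.5, subtract the mean."""
    img = resize_nearest(img, RESIZE).astype(np.float32)
    top = rng.integers(0, RESIZE - CROP + 1)   # upper bound is exclusive
    left = rng.integers(0, RESIZE - CROP + 1)
    img = img[top:top + CROP, left:left + CROP]
    if rng.random() < FLIP_PROB:
        img = img[:, ::-1]                     # horizontal flip
    return img - mean


def sgd_momentum_step(w, grad, velocity,
                      lr=0.001, momentum=0.9, weight_decay=0.0005):
    """One SGD update with momentum; weight decay is added to the gradient."""
    grad = grad + weight_decay * w
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity


# Usage with a random stand-in frame and an assumed per-channel mean.
rng = np.random.default_rng(0)
frame = rng.integers(0, 256, size=(240, 320, 3)).astype(np.uint8)
sample = augment(frame, mean=np.float32(117.0), rng=rng)
print(sample.shape)
```

A mini-batch would then be built by stacking 32 such samples before each update.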
The sports video dataset can be downloaded from this link
Achieved a best validation accuracy of 65% with this implementation, which is comparable to the results reported in [1].
The sample video outputs can be seen here
[1] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar and L. Fei-Fei, "Large-Scale Video Classification with Convolutional Neural Networks," 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 2014, pp. 1725-1732, doi: 10.1109/CVPR.2014.223.