Deep Predictive Coding for Multimodal Representation Learning

Abstract

In machine learning parlance, common sense reasoning relates to the capacity to learn representations that disentangle the hidden factors behind spatiotemporal sensory data. In this work, we hypothesise that the predictive coding theory of perception and learning from the neuroscience literature may be a good candidate for implementing such common sense inductive biases. We build upon a previous deep learning implementation of predictive coding by Lotter et al. (2016) and extend its application to the challenging task of inferring abstract, everyday human actions such as cooking and diving. Furthermore, we propose a novel application of the same architecture to auditory data, and find that with a simple sensory substitution trick, the predictive coding model can learn useful representations. Our transfer learning experiments also demonstrate good generalisation of the learned representations on the UCF-101 action classification dataset.

Research questions

To investigate the design of machines that acquire common sense by observing the world, we capitalise on a deep learning implementation of the predictive coding model published by Lotter et al. (2016). Their deep predictive coding network was shown to learn representations that disentangle latent variables correlated to the movement of objects in synthetic and natural images. We extend their study to address the following questions:

Can unsupervised predictive coding models learn higher-level spatiotemporal concepts, namely quotidian activities such as driving or exercising?
Are predictive coding inductive biases general enough so that these models can also learn from auditory information?
What are the limitations of the deep predictive coding implementation with respect to the original neuroscience theory proposed by Friston and Kiebel (2009) and Rao and Ballard (1999)?
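
As background for the questions above, the network of Lotter et al. (2016) compares each prediction with its target and passes on the rectified positive and negative errors. The following is a minimal plain-Python sketch of that error computation for a single flattened layer. The function names `relu` and `error_units` are illustrative assumptions and are not taken from this repository's code.

```python
from typing import List


def relu(x: float) -> float:
    """Rectified linear unit: keep positive values, zero out the rest."""
    return x if x > 0.0 else 0.0


def error_units(target: List[float], prediction: List[float]) -> List[float]:
    """Concatenate rectified positive and negative prediction errors.

    Hypothetical helper: mirrors the error representation described by
    Lotter et al. (2016), E = [ReLU(A - A_hat); ReLU(A_hat - A)].
    """
    if len(target) != len(prediction):
        raise ValueError("target and prediction must have the same length")
    positive = [relu(a - a_hat) for a, a_hat in zip(target, prediction)]
    negative = [relu(a_hat - a) for a, a_hat in zip(target, prediction)]
    return positive + negative


# Toy example: three "pixels" of an observed frame and of its prediction.
frame = [0.2, 0.8, 0.5]
predicted = [0.1, 0.9, 0.5]
errors = error_units(frame, predicted)

# The mean of the error units is the quantity a predictive model would minimise.
mean_error = sum(errors) / len(errors)
```

Keeping the positive and negative errors in separate units means the error signal stays non-negative while still recording the direction of each mismatch.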

Contributions

Our main contributions are summarised as follows:

Based on a theoretical review of the free-energy principle (Friston, 2010), we analyse some of the architectural limitations of the deep learning implementation by Lotter et al. (2016), in particular regarding the inference of hidden causes via free energy minimisation.
We extend the work of Lotter et al. (2016) by using predictive coding representations to decode higher-level concepts that require the understanding of world dynamics. The learned representations are evaluated on small-scale tasks and on UCF-101 (Soomro et al., 2012), a popular action recognition benchmark.
We train the predictive coding model on a dataset about 60 times larger than the one used in previous work (Lotter et al., 2016) and show that the model continues to improve its future frame predictions, even when the training dataset includes a large number of unrelated classes.

Inspired by sensory substitution literature from neuroscience (Stiles and Shimojo, 2015), a novel application of the predictive coding model is proposed for unsupervised representation learning from audio data. Our results suggest that the different modalities provide complementary information that is useful for the action classification task.
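
This README does not spell out the sensory substitution step. One plausible reading, given here purely as an assumption, is that audio is rendered as a two-dimensional time-frequency image so that a model built for video frames can consume it. The sketch below illustrates that idea with a small magnitude spectrogram in plain Python. The names `magnitude_spectrum` and `audio_to_image`, and the window and hop sizes, are hypothetical and do not describe the project's actual pipeline.

```python
import cmath
import math
from typing import List


def magnitude_spectrum(window: List[float]) -> List[float]:
    """Magnitudes of the non-negative-frequency DFT bins of one window."""
    n = len(window)
    spectrum = []
    for k in range(n // 2 + 1):
        total = sum(
            window[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)
        )
        spectrum.append(abs(total))
    return spectrum


def audio_to_image(
    samples: List[float], window_size: int = 16, hop: int = 8
) -> List[List[float]]:
    """Turn a waveform into a spectrogram "image" with values in [0, 1].

    Assumption for illustration only: rows are frequency bins and
    columns are time steps, as in a conventional spectrogram.
    """
    if len(samples) < window_size:
        raise ValueError("need at least one full window of samples")
    columns = [
        magnitude_spectrum(samples[start:start + window_size])
        for start in range(0, len(samples) - window_size + 1, hop)
    ]
    peak = max(max(column) for column in columns) or 1.0  # avoid dividing by 0
    n_bins = len(columns[0])
    return [[column[k] / peak for column in columns] for k in range(n_bins)]


# Toy signal: a pure tone completing 2 cycles per 16-sample window.
tone = [math.sin(2 * math.pi * 2 * t / 16) for t in range(64)]
image = audio_to_image(tone)
```

For this toy tone the energy concentrates in a single frequency row, so the resulting "image" shows one bright horizontal line across time.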

Relevant documents

Project folders

datasets: includes scripts for downloading and preprocessing of the datasets used in the experiments, including the Moments in Time and UCF-101 datasets.
models/prednet: the primary model implementation for our study. The model code is adapted from the implementation provided by Lotter et al. (2016). The rest of the pipeline was reimplemented to fit our experimental needs.
models/classifier: implementation of simple SVM and LSTM classifiers used on top of predictive coding representations.

References

Friston

Friston, K., & Kiebel, S. (2009). Predictive coding under the free-energy principle. Philosophical Transactions of the Royal Society B: Biological Sciences, 364(1521), 1211-1221.

Friston_

Friston, K. (2010). The free-energy principle: a unified brain theory? Nature Reviews Neuroscience, 11(2):127.

Lotter

Lotter, W., Kreiman, G., & Cox, D. (2016). Deep predictive coding networks for video prediction and unsupervised learning. arXiv preprint arXiv:1605.08104.

Monfort

Monfort, M., Zhou, B., Bargal, S. A., Andonian, A., Yan, T., Ramakrishnan, K., ... & Oliva, A. (2018). Moments in Time Dataset: one million videos for event understanding. arXiv preprint arXiv:1801.03150.

Rao

Rao, R. P. and Ballard, D. H. (1999). Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects. Nature Neuroscience, 2(1):79.

Soomro

Soomro, K., Zamir, A. R., and Shah, M. (2012). UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402.

Stiles

Stiles, N. R. and Shimojo, S. (2015). Auditory sensory substitution is intuitive and automatic with texture stimuli. Scientific Reports, 5:15628.

thefonseca/predictive-coding
