This repository contains the code to reproduce the results presented in the paper "Self-Supervised Bernoulli Autoencoders for Semi-Supervised Hashing".
We investigate the robustness of hashing methods based on variational autoencoders to the lack of supervision, focusing on two semi-supervised approaches currently in use. In addition, we propose a novel supervision approach in which the model uses its own predictions of the label distribution to implement the pairwise objective. Compared to the best baseline, this procedure yields similar performance in fully-supervised settings but significantly improves the results when labelled data is scarce.
The code is organised in four different scripts, one per dataset. Specifically, the script test_model_[data].py considers the dataset data and takes as input the following parameters (an example invocation is shown after the list):
- M is the index of the model considered. In particular, we compare three semi-supervised methods based on variational autoencoders: (M=1) VDSH-S is a variational autoencoder proposed in [1] that employs Gaussian latent variables, unsupervised learning and pointwise supervision; (M=2) PHS-GS is a variational autoencoder proposed in [2] that assumes Bernoulli latent variables, unsupervised learning, and both pointwise and pairwise supervision; and (M=3) SSB-VAE is our proposed method based on Bernoulli latent variables, unsupervised learning, pointwise supervision and self-supervision.
- p is the level (percentage) of supervision used when training the semi-supervised autoencoder.
- a, b and g are the hyperparameters associated with the different components of the semi-supervised loss. In particular, a is the coefficient of the pointwise component, g is associated with the pairwise component, and b is the weight of the KL divergence in the unsupervised loss.
- r is the number of experiments to perform for a given set of parameters. This is used to compute an average performance over multiple initialisations of the same neural network. Note that the results reported in the paper are computed by averaging r=5 experiments.
- l is the size of the latent sub-space generated by the encoder. This also corresponds to the number of bits of the generated hash codes.
- o is the file where the results are stored.
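For instance, a single run could look as follows. The flag names mirror the parameter letters above but are assumptions; check each script's argument parser for the exact spelling:

```bash
# Hypothetical invocation: SSB-VAE (M=3), 50% supervision, 16-bit codes, 5 repetitions.
python test_model_20news.py -M 3 -p 0.5 -a 10.0 -b 0.01 -g 100.0 -r 5 -l 16 -o results_20news.csv
```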
The script utils.py imports the needed packages and provides the custom routines for performance evaluation.
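As an illustration of what such evaluation routines typically compute for hashing, below is a minimal sketch of precision@k for retrieval ranked by Hamming distance; the function is hypothetical, not the repository's actual implementation:

```python
import numpy as np

def precision_at_k(query_codes, db_codes, query_labels, db_labels, k=100):
    """Mean precision@k of Hamming-ranked retrieval (illustrative sketch)."""
    precisions = []
    for code, label in zip(query_codes, query_labels):
        # Hamming distance between this query hash and every database hash.
        dists = np.count_nonzero(db_codes != code, axis=1)
        top_k = np.argsort(dists)[:k]
        # A retrieved item counts as relevant if it shares the query's label.
        precisions.append(np.mean(db_labels[top_k] == label))
    return float(np.mean(precisions))
```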
The script base_networks.py contains the custom routines to define all the components of the neural networks.
The script supervised_BAE.py defines the three types of autoencoders (VDSH, PHS, SSB-VAE).
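For intuition on the Bernoulli latent layer that PHS and SSB-VAE build on, here is a minimal TensorFlow 2 sketch of straight-through Bernoulli sampling. This is an illustrative stand-in, not the gradient estimator actually used in the repository (for example, [2] relies on a self-control estimator):

```python
import tensorflow as tf

def sample_bernoulli_st(logits):
    """Binary latent samples with a straight-through gradient (sketch)."""
    probs = tf.sigmoid(logits)
    # Forward pass: hard 0/1 samples drawn from Bernoulli(probs).
    hard = tf.cast(tf.random.uniform(tf.shape(probs)) < probs, probs.dtype)
    # Backward pass: gradients flow through probs, bypassing the sampling step.
    return probs + tf.stop_gradient(hard - probs)
```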
The *.sh files allow running all the experiments reported in the paper. In particular, test_all_[data]-[n]bits.sh computes r times the predictions of the three methods (VDSH, PHS, SSB-VAE), given a dataset (data) and a number of bits n, for supervision levels p = 0.1, 0.2, ..., 0.9, 1.0 (a sketch of this loop structure follows).
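A minimal sketch of how such a runner could be structured, reusing the hypothetical flag names from the example above:

```bash
# Illustrative structure of a test_all_*.sh runner (flag names are assumptions).
for p in 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0; do
  for M in 1 2 3; do
    python test_model_20news.py -M $M -p $p -r 5 -l 16 -o results_20news.csv
  done
done
```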
The script post_processing.py collects all the results produced by the *.sh files and computes the tables reported in the paper.
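A minimal sketch of what such a collection step can look like with pandas; the file pattern and column names are assumptions, not the repository's actual schema:

```python
import glob
import pandas as pd

# Gather every per-run csv produced via the -o parameter (pattern is hypothetical).
frames = [pd.read_csv(path) for path in glob.glob("results_*.csv")]
results = pd.concat(frames, ignore_index=True)

# Average over the r repetitions for each method and supervision level
# (column names are hypothetical).
summary = results.groupby(["method", "supervision"]).mean(numeric_only=True)
summary.to_csv("summary.csv")
```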
The code requires Python 3.7 and Tensorflow 2.1.
In order to obtain the results reported in the paper, it is necessary to execute all the *.sh files as follows:
```bash
# run all *.sh files
./test_all_20news-16bits.sh
./test_all_20news-32bits.sh
./test_all_snippets-16bits.sh
./test_all_snippets-32bits.sh
./test_all_TMC-16bits.sh
./test_all_TMC-32bits.sh
./test_all_cifar-16bits.sh
./test_all_cifar-32bits.sh
```
At the end of the computation, the csv files containing the results are generated according to the -o parameter. Finally, the script post_processing.py collects all the csv files and saves a new csv with the same format as the two tables reported in the paper.
[1] S. Chaidaroon and Y. Fang. "Variational Deep Semantic Hashing for Text Documents". Proc. SIGIR. 2017, pp. 75–84.
[2] S. Z. Dadaneh et al. "Pairwise Supervised Hashing with Bernoulli Variational Auto-Encoder and Self-Control Gradient Estimator". Proc. UAI. 2020.