This project proposes a novel image-text fusion model that uses the feature-level fusion technique (early fusion) to classify the flavour of dishes. Since no existing flavour-classification dataset includes both visual and textual representations, data gathering, labelling, and processing were carried out to build a diverse dataset of food images, corresponding text, and flavour classes (sweet, salty, bitter, sour, umami). The images were generated by a latent diffusion model, which produced high-quality, detailed images of each dish using its name as the prompt.

After data generation, two labelling techniques were explored to assign each dish a flavour class. The first method mapped each flavour class to the ingredients that typically impart that flavour, and assigned each dish the class with the largest number of matching ingredients. The second method built a similar mapping over dish names, but proved less accurate. To make the most of both methods, they were combined under certain conditions to produce the final class label for each data sample (a sketch of one plausible combination is given below).

After preprocessing, three classification models were implemented: a text-only classifier based on BERT combined with a BiLSTM network; an image classifier that applies transfer learning to the InceptionResNet architecture; and an image-text fusion model that combines the two using the early fusion technique (sketched at the end of this section). The three models were trained on the gathered data and their performance was compared. As expected, the early fusion model performed significantly better than the individual models, with an accuracy of 70.88%.
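The exact combination conditions are not spelled out above, so the following is a minimal Python sketch of one plausible version of the two labelling methods and a simple fallback-style combination. The flavour-ingredient mapping, the dish-name keywords, and the tie-breaking rule are all illustrative assumptions, not the project's actual tables or logic.

    from collections import Counter

    # Hypothetical flavour-ingredient mapping (illustrative entries only).
    FLAVOUR_INGREDIENTS = {
        "sweet": {"sugar", "honey", "caramel", "chocolate"},
        "salty": {"salt", "soy sauce", "bacon", "anchovy"},
        "bitter": {"coffee", "kale", "grapefruit", "cocoa"},
        "sour": {"lemon", "lime", "vinegar", "yogurt"},
        "umami": {"mushroom", "parmesan", "tomato", "miso"},
    }

    # Hypothetical dish-name keywords for the second, less accurate method.
    FLAVOUR_KEYWORDS = {
        "sweet": {"cake", "pie", "pudding"},
        "salty": {"fries", "chips"},
        "bitter": {"espresso"},
        "sour": {"pickled"},
        "umami": {"ramen", "broth"},
    }

    def label_by_ingredients(ingredients):
        # First method: count matching ingredients per class, take the majority vote.
        votes = Counter({
            flavour: sum(ing in members for ing in ingredients)
            for flavour, members in FLAVOUR_INGREDIENTS.items()
        })
        flavour, count = votes.most_common(1)[0]
        return flavour if count > 0 else None

    def label_by_name(dish_name):
        # Second method: match keywords against the words of the dish name.
        words = set(dish_name.lower().split())
        for flavour, keywords in FLAVOUR_KEYWORDS.items():
            if words & keywords:
                return flavour
        return None

    def assign_label(dish_name, ingredients):
        # One plausible combination: trust the ingredient vote, fall back to the name.
        return label_by_ingredients(ingredients) or label_by_name(dish_name)

For example, assign_label("chocolate cake", ["flour", "sugar", "chocolate"]) returns "sweet" via the ingredient vote, with the dish-name keywords used only when no ingredient matches.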
Parts of the preprocessing code for BERT and for the classification models were adapted from [1], a paper that proposed similar fusion models.
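For concreteness, here is a minimal sketch of the early (feature-level) fusion architecture described above, assuming TensorFlow/Keras and the Hugging Face Transformers library. The frozen backbones, sequence length, layer sizes, and classifier head are illustrative assumptions rather than the project's exact configuration.

    import tensorflow as tf
    from transformers import TFBertModel

    MAX_LEN = 64      # assumed maximum token length for the dish text
    NUM_CLASSES = 5   # sweet, salty, bitter, sour, umami

    # Text branch: BERT encoder followed by a BiLSTM over the token sequence.
    input_ids = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32, name="input_ids")
    attention_mask = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32, name="attention_mask")
    bert = TFBertModel.from_pretrained("bert-base-uncased")
    bert.trainable = False  # assumed frozen, as in typical transfer learning
    seq_out = bert(input_ids, attention_mask=attention_mask).last_hidden_state
    text_feat = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128))(seq_out)

    # Image branch: InceptionResNetV2 pretrained on ImageNet, global-average pooled.
    image_in = tf.keras.Input(shape=(299, 299, 3), name="image")
    preprocessed = tf.keras.applications.inception_resnet_v2.preprocess_input(image_in)
    cnn = tf.keras.applications.InceptionResNetV2(include_top=False, pooling="avg")
    cnn.trainable = False  # assumed frozen
    img_feat = cnn(preprocessed)

    # Early fusion: concatenate the two feature vectors before any classifier
    # layers, so the dense head can learn cross-modal interactions.
    fused = tf.keras.layers.Concatenate()([text_feat, img_feat])
    fused = tf.keras.layers.Dense(256, activation="relu")(fused)
    output = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")(fused)

    model = tf.keras.Model(inputs=[input_ids, attention_mask, image_in], outputs=output)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

In this sketch, removing either branch before the concatenation yields the corresponding unimodal baseline, which makes the comparison between the three models straightforward.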