Welcome to the Image Captioning repository! Image captioning is an intriguing challenge that demands the generation of descriptive and precise sentences to encapsulate the content of an image. This repository serves as a comprehensive exploration into various image captioning methodologies, ranging from fundamental character and word-based approaches to advanced techniques utilizing attention mechanisms and transformer architectures.
These are some captions generated by our best model.

The first approach we tried uses a CNN as an encoder: the output of its pooler layer is fed to an RNN, which then generates words/characters one at a time based on the previous token and the hidden state.
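As a rough sketch of this encoder-decoder setup (the ResNet-18 backbone, module names, and hyperparameters below are illustrative assumptions, not the exact configuration used in this repository):

```python
import torch
import torch.nn as nn
import torchvision.models as models

class CNNEncoder(nn.Module):
    """Encode an image into a single vector using the CNN's pooled output."""
    def __init__(self, embed_dim: int, freeze: bool = False):
        super().__init__()
        resnet = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])  # drop the classification head
        if freeze:
            for p in self.backbone.parameters():
                p.requires_grad = False
        self.proj = nn.Linear(resnet.fc.in_features, embed_dim)

    def forward(self, images):                     # images: (B, 3, H, W)
        feats = self.backbone(images).flatten(1)   # (B, 512) pooled features
        return self.proj(feats)                    # (B, embed_dim)

class RNNDecoder(nn.Module):
    """Generate the caption token by token, conditioned on the image feature."""
    def __init__(self, vocab_size: int, embed_dim: int, hidden_dim: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image_feat, captions):       # captions: (B, T) token ids
        tokens = self.embed(captions)                                 # (B, T, embed_dim)
        # Feed the image feature as the first step of the sequence.
        inputs = torch.cat([image_feat.unsqueeze(1), tokens], dim=1)
        out, _ = self.rnn(inputs)
        return self.fc(out)                                           # (B, T+1, vocab_size)
```

At inference time the same decoder is unrolled step by step: the token predicted at step t is embedded and fed back as the input for step t+1.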
We provide an ablation study testing different backbones, freezing the CNN or not, using 1 or 5 captions as ground truth, etc.
| Backbone | Element | Num captions | Encoder freeze | BLEU-1 | BLEU-2 | ROUGE-L | METEOR |
|---|---|---|---|---|---|---|---|
| ResNet | char | 1 | no | 0.20 | 0.11 | 0.24 | 0.12 |
| ResNet | char | 1 | yes | 0.11 | 0.06 | 0.26 | 0.11 |
| ResNet | char | 1 | until epoch 50 | 0.21 | 0.13 | 0.31 | 0.15 |
| ResNet | word | 1 | no | 0.54 | 0.30 | 0.36 | 0.32 |
| ResNet | word | 5 | no | 0.54 | 0.28 | 0.36 | 0.22 |
| ResNet | word | 1 | yes | 0.54 | 0.31 | 0.37 | 0.32 |
| ResNet | word | 1 | until epoch 50 | 0.55 | 0.31 | 0.38 | 0.31 |
| EfficientNet | word | 1 | until epoch 50 | 0.50 | 0.25 | 0.33 | 0.28 |
| ConvNeXt | word | 1 | until epoch 50 | 0.58 | 0.34 | 0.39 | 0.34 |
Attention can be applied over the input, the hidden state, or the output vector. We show results for these three options.
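For reference, a minimal sketch of Bahdanau-style additive attention, where the decoder hidden state scores each encoder region and the context is their weighted sum (dimension names and module structure are illustrative assumptions, not the repository's exact implementation):

```python
import torch
import torch.nn as nn

class BahdanauAttention(nn.Module):
    """Additive attention: score each encoder region against the decoder hidden state."""
    def __init__(self, enc_dim: int, hid_dim: int, attn_dim: int):
        super().__init__()
        self.W_enc = nn.Linear(enc_dim, attn_dim)
        self.W_hid = nn.Linear(hid_dim, attn_dim)
        self.v = nn.Linear(attn_dim, 1)

    def forward(self, enc_feats, hidden):
        # enc_feats: (B, N, enc_dim) spatial CNN features, hidden: (B, hid_dim)
        scores = self.v(torch.tanh(self.W_enc(enc_feats) + self.W_hid(hidden).unsqueeze(1)))
        alpha = torch.softmax(scores, dim=1)          # (B, N, 1) attention weights
        context = (alpha * enc_feats).sum(dim=1)      # (B, enc_dim) weighted context vector
        return context, alpha.squeeze(-1)
```

The returned attention weights are what the attention maps below visualize: one weight per spatial region for each generated token.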
Results
| Attention | Applied over | BLEU-1 | BLEU-2 | ROUGE-L | METEOR |
|---|---|---|---|---|---|
| Cross Attention | Input | 54.85 | 32.27 | 37.71 | 33.49 |
| Cross Attention | Hidden | 55.97 | 31.48 | 38.21 | 32.39 |
| Cross Attention | Output | 51.74 | 29.68 | 36.64 | 27.81 |
| Bahdanau Attention | Input | 57.19 | 34.4 | 39.3 | 34.81 |
The best model in this case was a 4-layer LSTM with Bahdanau attention over the input, using ConvNeXt as the starting backbone, finetuning the CNN for 60 epochs, and 15% dropout.
Attention Maps of the best model
Results
| Architecture | BLEU-1 | BLEU-2 | ROUGE-L | METEOR |
|---|---|---|---|---|
| CNN + Transformer Decoder | 63.23 | 43.2 | 44.66 | 44.02 |
| ViT + Transformer Decoder | 65.06 | 45.17 | 45.67 | 44.62 |
The ViT model is the best model we obtained, showing how strong this architecture is for image captioning. It comprises a Transformer decoder with 5 layers and 12 heads, and uses teacher forcing with 5 captions during training. It leverages pretrained word2vec embeddings and ViT as its backbone; the backbone is finetuned after epoch 20, and dropout is set to 10%.
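A minimal sketch of this kind of ViT encoder / Transformer decoder captioner, assuming torchvision's `vit_b_16`. Note that pulling the patch-token sequence out of torchvision's ViT relies on internal attributes (`_process_input`, `class_token`, `encoder`), so this is illustrative rather than the repository's actual code:

```python
import torch
import torch.nn as nn
import torchvision.models as models

class ViTCaptioner(nn.Module):
    """ViT encoder producing patch tokens; Transformer decoder cross-attends over them."""
    def __init__(self, vocab_size: int, d_model: int = 768, n_layers: int = 5, n_heads: int = 12):
        super().__init__()
        self.vit = models.vit_b_16(weights=models.ViT_B_16_Weights.DEFAULT)
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.fc = nn.Linear(d_model, vocab_size)

    def encode(self, images):
        # images: (B, 3, 224, 224); reuse torchvision's patch-embedding pipeline.
        x = self.vit._process_input(images)                      # (B, N, 768) patch tokens
        cls = self.vit.class_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1)
        return self.vit.encoder(x)                               # (B, N+1, 768) encoded tokens

    def forward(self, images, captions):
        memory = self.encode(images)
        tgt = self.embed(captions)                               # (B, T, d_model)
        mask = nn.Transformer.generate_square_subsequent_mask(captions.size(1)).to(captions.device)
        out = self.decoder(tgt, memory, tgt_mask=mask)           # causal mask enables teacher forcing
        return self.fc(out)                                      # (B, T, vocab_size)
```

Here `d_model=768` matches the ViT-B hidden size, so the patch tokens can serve directly as decoder memory without an extra projection.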
Visualizing ViT attention over input words
Information can be extracted from a trained FasterRCNN in different ways: using its CNN as a backbone, using the ROI features, or using the predicted bounding boxes. We decided to use the bounding boxes: we crop the detected objects, pass the crops through a new CNN, and use the resulting features as input to any type of decoder; here we show results for LSTM and Transformer decoders.
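A minimal sketch of this region-cropping idea, where a pretrained FasterRCNN proposes boxes and a separate ResNet re-encodes each crop into a region token for the decoder (the score threshold, crop size, region limit, and module names are illustrative assumptions):

```python
import torch
import torch.nn as nn
import torchvision.models as models
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.transforms.functional import resized_crop

class RegionFeatureExtractor(nn.Module):
    """Detect objects, crop each box, and re-encode the crops with a separate CNN."""
    def __init__(self, feat_dim: int = 256, max_regions: int = 10, score_thresh: float = 0.5):
        super().__init__()
        self.detector = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()  # kept frozen
        resnet = models.resnet18(weights="DEFAULT")
        self.cnn = nn.Sequential(*list(resnet.children())[:-1])            # crop encoder
        self.proj = nn.Linear(512, feat_dim)
        self.max_regions = max_regions
        self.score_thresh = score_thresh

    @torch.no_grad()
    def detect(self, image):                       # image: (3, H, W) in [0, 1]
        out = self.detector([image])[0]
        keep = out["scores"] > self.score_thresh
        return out["boxes"][keep][: self.max_regions]

    def forward(self, image):
        boxes = self.detect(image)
        crops = []
        for x1, y1, x2, y2 in boxes.round().int().tolist():
            crops.append(resized_crop(image, y1, x1, max(y2 - y1, 1), max(x2 - x1, 1), [224, 224]))
        if not crops:                              # no detections: fall back to the whole image
            crops = [resized_crop(image, 0, 0, image.shape[1], image.shape[2], [224, 224])]
        feats = self.cnn(torch.stack(crops)).flatten(1)   # (R, 512)
        return self.proj(feats)                            # (R, feat_dim) region tokens
```

The resulting sequence of region features plays the same role as the patch tokens in the ViT model: it is the memory the LSTM or Transformer decoder attends over while generating the caption.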
Results
| Architecture | BLEU-1 | BLEU-2 | ROUGE-L | METEOR |
|---|---|---|---|---|
| FasterRCNN + LSTM | 53.64 | 30.29 | 36.45 | 32.16 |
| FasterRCNN + Transformer Decoder | 56.20 | 29.82 | 37.29 | 22.79 |
These results point to a promising avenue for image captioning. Feeding regions extracted from a FasterRCNN into an LSTM or Transformer decoder shows the value of detailed, object-level understanding for caption generation. While these approaches yield solid results, they have not surpassed the performance of the ViT model, which suggests that region-based approaches offer valuable signal but need further exploration and refinement to fully capitalize on their potential.