PyTorch implementation of the paper "Beyond Narrative Description: Generating Poetry from Images" by B. Liu et al., 2018.
Feel free to star the project or open an issue!
This project introduces poem generation from images. The implementation is inspired by the research paper "Beyond Narrative Description: Generating Poetry from Images" by Bei Liu et al., published in 2018 at Microsoft.
An official TensorFlow implementation is available in the Microsoft repository. This repository reorganizes the implementation from "Neural Poetry Generation with Visual Inspiration" by Zhaoyang Li et al. and builds a model architecture similar to the one of Bei Liu et al., in PyTorch.
To use this project, clone the repository from the command line with:
$ git clone https://github.com/arthurdjn/img2poem-pytorch
Then, navigate to the project root:
$ cd img2poem-pytorch
To train the models, you will need to download the datasets used in this project.
The datasets used are:
- PoemUniMDatasetMasked: a dataset of poems only,
- PoemMuliMDatasetMasked: a dataset of paired poems and images,
- PoeticEmbeddedDataset: a dataset to align poems and images,
- ImageSentimentDataset: a dataset of images and polarities.
To download a dataset, use the download() method, defined for all datasets. It will download the poems and images into a root folder.
For example, you can use:
from img2poem.datasets import ImageSentimentDataset
dataset = ImageSentimentDataset.download(root='.data')
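Once downloaded, a dataset can be wrapped in a standard PyTorch DataLoader for training. The snippet below is only a sketch, assuming the dataset classes implement the usual __len__ / __getitem__ interface:

from torch.utils.data import DataLoader
from img2poem.datasets import ImageSentimentDataset

# Download (or reuse) the data under the `.data` root folder.
dataset = ImageSentimentDataset.download(root='.data')

# Standard PyTorch DataLoader on top of the downloaded dataset.
loader = DataLoader(dataset, batch_size=64, shuffle=True)

for batch in loader:
    # The batch content depends on the dataset (here, images and polarities).
    break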
The architecture is decomposed into two parts:
- Encoder, used to extract poeticness from an image,
- Decoder, used to generate a poem from a poetic space.
The encoder is made of three CNNs, used to extract scene, object, and sentiment information. These visual features are then aligned with their paired poems in a poetic space with the help of a BERT model.
The decoder then works with a discriminator that evaluates the poeticness of a generated poem.
The visual encoder is made of three CNNs.
The object classifier is the vanilla ResNet50 from TorchVision. More info here.
The scene classifier is a ResNet50 model fine-tuned on the Places365 dataset. You can find the weights on the MIT platform here.
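For reference, the object and scene backbones can be loaded roughly as follows. This is a sketch: it assumes the resnet50_places365.pth.tar checkpoint from the MIT Places365 release, whose state dict keys are prefixed with module.:

import torch
import torchvision

# Object classifier: vanilla ResNet50 pretrained on ImageNet.
object_model = torchvision.models.resnet50(pretrained=True)

# Scene classifier: ResNet50 with a 365-way head, filled with the Places365 weights.
scene_model = torchvision.models.resnet50(num_classes=365)
checkpoint = torch.load('resnet50_places365.pth.tar', map_location='cpu')
state_dict = {k.replace('module.', ''): v for k, v in checkpoint['state_dict'].items()}
scene_model.load_state_dict(state_dict)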
To train the visual sentiment classifier, use the ImageSentimentDataset with the ResNet50Sentiment model.
You can use the script scripts/train_resnet50.py to fine-tune the model:
$ python scripts/train_resnet50.py
0. Hyper params...
------------------------
Batch size: 64
Learning Rate: 5e-05
Split ratio: 0.9
------------------------
1. Loading the dataset...
Loading: 100%|█████████████████████████████████| 15613/15613 [01:16<00:00, 203.41it/s]
2. Building the model...
done
3. Training...
Epoch 1/100
Training: 100%|██████████| 199/199 [01:18<00:00, 2.55it/s, train loss=0.030669]
Evaluation: 100%|██████████| 199/199 [00:24<00:00, 8.26it/s, eval loss=0.030008]
Training: loss=0.025023
Evaluation: loss=0.024733
Eval loss decreased (inf --> 0.024733).
→ Saving model...
Epoch 2/100
Training: 100%|██████████| 199/199 [01:17<00:00, 2.57it/s, train loss=0.030093]
Evaluation: 100%|██████████| 199/199 [00:24<00:00, 8.27it/s, eval loss=0.027973]
Training: loss=0.024398
Evaluation: loss=0.024037
Eval loss decreased (0.024733 --> 0.024037).
→ Saving model...
Epoch 3/100
Training: 100%|██████████| 199/199 [01:17<00:00, 2.57it/s, train loss=0.029633]
Evaluation: 100%|██████████| 199/199 [00:24<00:00, 8.28it/s, eval loss=0.029494]
Training: loss=0.023714
Evaluation: loss=0.023400
Eval loss decreased (0.024037 --> 0.023400).
→ Saving model...
...
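Internally, this boils down to the usual fine-tuning recipe: replace the classification head of a pretrained ResNet50 and train it on the sentiment labels. The sketch below is illustrative only; the number of polarity classes, the loss, and the optimizer are assumptions, not the exact content of scripts/train_resnet50.py:

import torch
import torch.nn as nn
import torchvision

# Pretrained backbone with a new head for sentiment polarities.
num_polarities = 5  # assumed; adjust to the dataset's label set
model = torchvision.models.resnet50(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, num_polarities)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)  # matches the learning rate above

def train_step(images, polarities):
    """One optimization step on a batch of images and polarity labels."""
    optimizer.zero_grad()
    loss = criterion(model(images), polarities)
    loss.backward()
    optimizer.step()
    return loss.item()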
To align visual features to a poetic space, the paired poem & image dataset is used (a.k.a. multim_poem.json).
Images and poems are both embedded:
- the poems are embedded through a BERT model into a feature vector,
- and the images are embedded by concatenating the outputs of the visual models (object, scene, and sentiment) into a feature vector of the same dimension.
To measure how well the poem and image feature tensors are aligned, I used the ranking loss described in the original paper by Bei Liu et al. and in the implementation by Zhaoyang Li et al.
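A minimal sketch of such a pairwise ranking loss is given below; the margin value and the use of normalized dot-product similarity are assumptions, see the paper and the original implementation for the exact formulation:

import torch
import torch.nn.functional as F

def ranking_loss(img_features, poem_features, margin=0.2):
    """Hinge-based ranking loss: a matched image/poem pair should score
    higher than any mismatched pair by at least `margin`."""
    img = F.normalize(img_features, dim=1)
    poem = F.normalize(poem_features, dim=1)
    scores = img @ poem.t()                 # (B, B) similarity matrix
    positives = scores.diag().unsqueeze(1)  # matched pairs on the diagonal

    # Penalize mismatched pairs that come too close to the matched ones.
    cost_poem = (margin + scores - positives).clamp(min=0)     # image vs. wrong poems
    cost_img = (margin + scores - positives.t()).clamp(min=0)  # poem vs. wrong images
    mask = torch.eye(scores.size(0), dtype=torch.bool)
    return cost_poem.masked_fill(mask, 0).sum() + cost_img.masked_fill(mask, 0).sum()

Both embeddings must share the same dimension, so that the similarity matrix between a batch of images and their paired poems can be computed directly.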
The generator is a recurrent decoder. As explained in the original paper, I used GRU cells to generate a sentence from a feature tensor of the poetic space.
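Below is a minimal sketch of such a decoder, assuming the poetic feature is used as the initial hidden state of the GRU and that training uses teacher forcing (the dimensions and vocabulary size are illustrative):

import torch
import torch.nn as nn

class PoemDecoder(nn.Module):
    """GRU decoder that generates a token sequence from a poetic feature."""

    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, features, tokens):
        hidden = features.unsqueeze(0)     # poetic feature as initial hidden state (1, B, H)
        embedded = self.embedding(tokens)  # teacher forcing on the ground-truth tokens
        outputs, _ = self.gru(embedded, hidden)
        return self.fc(outputs)            # (B, T, vocab_size) logits

decoder = PoemDecoder(vocab_size=10000)
features = torch.randn(4, 512)             # poetic features for a batch of 4 images
tokens = torch.randint(0, 10000, (4, 20))  # target poem tokens
logits = decoder(features, tokens)         # (4, 20, 10000)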
The discriminator is a module that classifies a sequence as real, unpaired, or generated (cf. the original paper).
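A sketch of what such a discriminator could look like, here a GRU encoder with a 3-way classification head (the exact architecture in the paper may differ):

import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Classifies a token sequence as real, unpaired, or generated."""

    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512, num_classes=3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, tokens):
        embedded = self.embedding(tokens)
        _, hidden = self.gru(embedded)     # final hidden state summarizes the poem
        return self.fc(hidden.squeeze(0))  # (B, 3) class logits

discriminator = Discriminator(vocab_size=10000)
tokens = torch.randint(0, 10000, (4, 20))
logits = discriminator(tokens)             # (4, 3)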
W.I.P