We added a fairseq implementation of the shortcut transformer (see the fairseq directory), which can be used according to the fairseq documentation for version 0.10.2. To use the shortcut transformer without feature fusion, set --arch to shortcut_transformer; to use it with feature fusion, set --arch to shortcut_transformer_with_feature_fusion. This reimplementation is comparable to the original with respect to its gains over the base transformer model. To use it, merge the fairseq/fairseq directory provided here into your local fairseq repository.
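As a minimal illustration, a training run with the fairseq 0.10.2 CLI might look like the sketch below. The data-bin directory, checkpoint path and hyper-parameter values are placeholders rather than the settings used in the paper; consult the fairseq documentation and the paper for the actual configuration.

```bash
# Minimal sketch: training the shortcut transformer with feature fusion via the
# fairseq 0.10.2 CLI. Paths and hyper-parameters are placeholders, not the
# paper's settings. Use --arch shortcut_transformer to disable feature fusion.
fairseq-train data-bin/wmt_en_de \
    --arch shortcut_transformer_with_feature_fusion \
    --task translation \
    --optimizer adam --adam-betas '(0.9, 0.98)' \
    --lr 0.0007 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --max-tokens 4096 \
    --save-dir checkpoints/shortcut_transformer_ff
```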
The TensorFlow implementation requires:
- Python 3.6
- TensorFlow 1.8
This repository contains code for reproducing the lexical shortcut experiments described in 'Widening the Representation Bottleneck in Neural Machine Translation with Lexical Shortcuts'. Please refer to the paper for hyper-parameter settings, the training and evaluation datasets used, and the primary findings.
Scripts used to conduct the experiments described in the paper are provided in the 'scripts' directory. Their functionality is as follows:
- preprocess.sh: Used to pre-process the training, development and test corpora used in our experiments. The development and test corpora first have to be converted to plain text, e.g. with input-from-sgm.perl from the Moses toolkit (an example conversion is sketched after this list). Adjust as needed for different language pairs.
- train.sh: Used to train the translation models. To replicate different experiments, select the appropriate values for the --model_type and --shortcut_type flags (e.g. --model_type lexical_shortcuts_transformer and --shortcut_type lexical_plus_feature_fusion for a transformer variant equipped with lexical shortcuts and feature fusion); see the nmt.py file for the available options and the training sketch after this list. Adding the --embiggen_model flag to the training script enables the transformer-BIG configuration. To use transformer-SMALL, adjust the relevant hyper-parameter values directly in the training script.
- test.sh: Used to obtain the test-BLEU scores reported in the paper for each trained model (see the evaluation sketch after this list). Passing --use_sacrebleu returns the (more conservative) sacreBLEU score, whereas omitting the flag returns scores from the script used to calculate validation-BLEU during training (based on multi-bleu-detok.py). The latter is roughly comparable to the BLEU calculation method employed in 'Attention Is All You Need' (Vaswani et al., 2017).
- train_classifier.sh: Used to train the diagnostic lexical classifiers employed in the probing studies. Enabling --probe_encoder gives the classifier access to the hidden states of the encoder, while omitting the flag trains it on decoder states. --probe_layer denotes the ID of the encoder/decoder layer accessed by the classifier (1 being the lowest, 6 the top-most). See the probing sketch after this list.
- test_classifier.sh: Used to obtain the accuracy of trained classifiers on a withheld test set.
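For the SGM-formatted development and test sets, the conversion to plain text mentioned for preprocess.sh can be done with Moses' input-from-sgm.perl (typically found under scripts/ems/support in the Moses toolkit). In the sketch below, the Moses path and the newstest file names are placeholders.

```bash
# Sketch: converting SGM-formatted dev/test corpora to plain text before running
# preprocess.sh. The Moses path and the newstest file names are placeholders.
MOSES=/path/to/mosesdecoder
$MOSES/scripts/ems/support/input-from-sgm.perl < newstest2017-ende-src.en.sgm > newstest2017.en
$MOSES/scripts/ems/support/input-from-sgm.perl < newstest2017-ende-ref.de.sgm > newstest2017.de
```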
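To make the training flag combinations concrete, the following sketch selects the lexical-shortcuts + feature-fusion variant in its transformer-BIG configuration. It assumes train.sh ultimately invokes nmt.py; all remaining arguments (data paths, vocabularies, training schedule) stay as configured in the script.

```bash
# Sketch: flag combination for the lexical-shortcuts + feature-fusion variant in
# the transformer-BIG configuration (assumes train.sh invokes nmt.py; remaining
# options such as data paths and vocabularies stay as configured in train.sh).
python nmt.py \
    --model_type lexical_shortcuts_transformer \
    --shortcut_type lexical_plus_feature_fusion \
    --embiggen_model
```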
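Evaluation with the more conservative sacreBLEU metric might then look as follows, assuming test.sh forwards its command-line flags to the evaluation code (otherwise, set the flag inside the script); omitting --use_sacrebleu falls back to the multi-bleu-detok-based score.

```bash
# Sketch: scoring a trained model with sacreBLEU (assumes test.sh forwards its
# command-line arguments to the evaluation code; model and test-set paths are
# expected to be configured inside the script).
bash test.sh --use_sacrebleu
```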
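Finally, a probing run might look like the sketch below, again assuming the classifier scripts forward their command-line flags. Here the classifier reads the hidden states of encoder layer 4 and is then scored on the withheld test set.

```bash
# Sketch: training a diagnostic lexical classifier on encoder layer 4 and scoring
# it on the withheld test set (assumes the scripts forward command-line flags;
# model and data paths are expected to be configured inside the scripts).
bash train_classifier.sh --probe_encoder --probe_layer 4
bash test_classifier.sh
```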
If you find this work useful, please consider citing the accompanying paper:
@article{emelin2019widening,
  title={Widening the Representation Bottleneck in Neural Machine Translation with Lexical Shortcuts},
  author={Emelin, Denis and Titov, Ivan and Sennrich, Rico},
  journal={arXiv preprint arXiv:1906.12284},
  year={2019}
}