This is the official implementation for the paper: True Bilingual NMT.
The machine used for training models in this paper has the following properties:
- vCPU: 8
- RAM: 52 GB
- GPUs: 4 (NVIDIA Tesla T4)
- OS: Ubuntu (20.04)
- Bootable disk: 100 GB
- Additional disk: 500 GB
To connect to the machine, you need to follow these steps:
CUDA 11.3 was installed using the following commands:
$ wget$
$ sudo mv /etc/apt/preferences.d/cuda-repository-pin-600
$ wget$ cuda-repo-ubuntu2004-11-3-local_11.3.0-465.19.01-1_amd64.deb
$ sudo dpkg -i cuda-repo-ubuntu2004-11-3-local_11.3.0-465.19.01-1_amd64.deb
$ sudo apt-key add /var/cuda-repo-ubuntu2004-11-3-local/
$ sudo apt-get update
$ sudo apt-get -y install cuda
To speed up multi-GPU communication, install the NVIDIA Collective Communications Library (NCCL) by following these steps:
- Go to the NVIDIA NCCL home page and complete the short survey.
- Download the NCCL deb package that matches your OS and CUDA version. For example, mine is: nccl-local-repo-ubuntu2004-2.11.4-cuda11.4_1.0-1_amd64.deb
- Install it using the following command:
$ sudo dpkg -i nccl-local-repo-ubuntu2004-2.11.4-cuda11.4_1.0-1_amd64.deb
- To make sure it was installed successfully, run the following command:
$ python
>>> import torch
>>> torch.cuda.nccl.version()
(2, 10, 3) # you should get something similar
For faster training, we need to install NVIDIA's apex library:
$ git clone
$ cd apex
$ pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
In this project, we use Python 3.8 and PyTorch 1.10.
Before installing the dependencies, we set up a virtual environment with virtualenv, which can be installed using:
$ sudo apt-get install python3-virtualenv
Then, we created a virtual environment called py38 using the following command:
$ virtualenv py38 -p python3.8
To activate the virtualenv, use the following command (you have to be inside the directory that contains py38):
$ source py38/bin/activate
To deactivate a virtualenv, use the following command:
$ deactivate
PyTorch 1.10 can be installed using the following command:
$ pip install torch==1.10.0+cu113 -f
If everything was installed correctly, the following should work with no warnings or errors:
$ python
>>> import torch
>>> torch.cuda.is_available()
>>> torch.version.cuda
'11.3' # should match the installed CUDA version
>>> torch.cuda.device_count()
>>> torch.cuda.get_device_name(0)
'NVIDIA Tesla T4'
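The same checks can be collected into a small helper. This is a minimal sketch rather than part of the original setup: the function name `check_cuda_setup` is hypothetical, and the default expected values (CUDA 11.3 and 4 GPUs) are assumptions taken from the machine described at the top of this document. It takes plain values instead of importing torch, so it can be run on a machine without a GPU.

```python
def check_cuda_setup(is_available, cuda_version, gpu_count,
                     expected_cuda="11.3", expected_gpus=4):
    """Return a list of problems; an empty list means every check passed.

    Hypothetical helper. The arguments are meant to come from
    torch.cuda.is_available(), torch.version.cuda and torch.cuda.device_count().
    """
    problems = []
    if not is_available:
        problems.append("CUDA is not available to PyTorch")
    if cuda_version != expected_cuda:
        problems.append(
            f"PyTorch was built for CUDA {cuda_version}, expected {expected_cuda}"
        )
    if gpu_count != expected_gpus:
        problems.append(f"found {gpu_count} GPU(s), expected {expected_gpus}")
    return problems
```

Inside the activated virtualenv, it could be called as `check_cuda_setup(torch.cuda.is_available(), torch.version.cuda, torch.cuda.device_count())` after `import torch`.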
You can install FastBPE for subword tokenization using the following steps:
- Clone the GitHub repository:
$ git clone
- Install needed dependencies:
$ sudo apt-get install python3.8-dev
$ pip install Cython
- Install FastBPE python API:
$ cd fastBPE
$ python setup.py install
- To make sure everything was installed correctly, try importing it like so:
$ python
>>> import fastBPE
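To illustrate what subword tokenization does, here is a simplified pure-Python sketch of applying BPE merges to a single word. It is not FastBPE's API or implementation: the function name `apply_bpe` and the toy merge list are made up for illustration, and the end-of-word marker that real BPE implementations use is left out for brevity. The `@@` continuation marker follows the common convention for BPE-encoded text.

```python
def apply_bpe(word, merges):
    """Split `word` into subword units using an ordered list of merge pairs.

    Simplified illustration: earlier pairs in `merges` have higher priority,
    and no end-of-word marker is used.
    """
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        # Pick the adjacent pair with the highest priority (lowest rank).
        best = min(pairs, key=lambda pair: ranks.get(pair, float("inf")))
        if best not in ranks:
            break  # no learned merge applies any more
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    # Mark every unit except the last one as "continued".
    return "@@ ".join(symbols)


toy_merges = [("l", "o"), ("lo", "w"), ("e", "r")]  # made-up merge table
print(apply_bpe("lower", toy_merges))  # low@@ er
```

Removing BPE later (as `eval-bleu-remove-bpe` does during validation) amounts to deleting the `@@ ` markers again.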
You can install PyArrow using pip like so:
$ pip install pyarrow
The following are the steps needed to install the tools used in this project
Tool for shuffling big files. You can install it using the following commands:
git clone
You just need to clone it:
$ git clone
You just need to clone it:
$ git clone
To install fairseq and develop locally, follow these steps:
- Clone the repository:
git clone
- Install it:
cd fairseq
pip install --editable ./
python setup.py build_ext --inplace
To install TensorBoard without TensorFlow, follow these steps:
Install TensorBoard using the following command:
$ pip install tensorboard
To access TensorBoard from your local machine, you need to follow these steps:
- Connect to the server using the following command:
$ ssh -L 16006: gcp1
- Go to the project directory:
$ cd /mnt/disks/adisk/true_nmt
- Activate virtualenv:
$ source py38/bin/activate
- Run TensorBoard:
$ tensorboard --logdir=logs
- Now, open this URL:
in your browser.
In this step, we use the fairseq pre-trained English-French translation model, so there is no need to download the data or train a model ourselves. Download the pre-trained model by following these steps:
- Download the file:
- Unzip the file:
$ tar -xvf wmt14.en-fr.joined-dict.transformer.tar.bz2
- Rename it to a more convenient name:
$ mv wmt14.en-fr.joined-dict.transformer wmt14.en_fr
- You can use it directly now, see
script for details.
To prepare the dataset for training/testing, follow these steps:
- Download the dataset for French-English benchmark. Running the following command
will create a new directory called
- At the end, the data stats were as follows:
Stage | train | valid | test |
Before Cleaning | 40842333 | 16573 | 3003 |
After Cleaning | 35789717 | 15259 | 3003 |
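To put these counts in perspective, the share of sentence pairs removed by cleaning can be computed directly from the numbers above. The snippet below is only a small worked calculation; the names `BEFORE`, `AFTER` and `removed_fraction` are ours.

```python
# Sentence-pair counts copied from the stats above.
BEFORE = {"train": 40842333, "valid": 16573, "test": 3003}
AFTER = {"train": 35789717, "valid": 15259, "test": 3003}


def removed_fraction(before, after):
    """Fraction of sentence pairs dropped by cleaning, per split."""
    return {split: (before[split] - after[split]) / before[split] for split in before}


for split, fraction in removed_fraction(BEFORE, AFTER).items():
    print(f"{split}: {fraction:.2%} removed")
```

Cleaning removes roughly 12.4% of the training pairs and 7.9% of the validation pairs, while the test set is left untouched.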
The data preprocessing steps depend on the model task, as follows:
1. Download
2. Preprocess & normalize
3. Tokenize
4. Encode BPE
5. Clean
6. Binarize

1. Copy from unidirectional until (5)
2. Combine and add tags to the start of the sentences
3. Append our tags to the vocabulary
4. Binarize

1. Copy from unidirectional until (3)
2. Generate CSW
3. Combine and add tags to the start of the sentences
4. Encode BPE
5. Clean
6. Shuffle
7. Binarize
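The "combine and add tags" and "append our tags to the vocabulary" steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual script: the tag strings (`<2en>`, `<2fr>`), the function names, the choice to tag only the source side, and the count written for each new vocabulary entry are all hypothetical. The only format assumed is a fairseq-style plain-text dictionary with one `symbol count` pair per line.

```python
def add_tag(sentences, tag):
    """Prepend a tag token to the start of every sentence."""
    return [f"{tag} {sentence}" for sentence in sentences]


def combine_directions(fr_sentences, en_sentences):
    """Build one bidirectional corpus: fr->en pairs followed by en->fr pairs.

    Assumption: the tag names the target language and goes on the source side.
    """
    sources = add_tag(fr_sentences, "<2en>") + add_tag(en_sentences, "<2fr>")
    targets = list(en_sentences) + list(fr_sentences)
    return sources, targets


def append_tags_to_vocab(vocab_lines, tags, count=1):
    """Append tags that are missing from a 'symbol count' dictionary."""
    known = {line.split(" ")[0] for line in vocab_lines if line.strip()}
    return list(vocab_lines) + [f"{tag} {count}" for tag in tags if tag not in known]
```

Adding the tags to the vocabulary before binarizing keeps them from being mapped to the unknown token.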
To train the model, follow these steps:
- Running the following command will train a transformer-base model on the
binarized data:
- The training will take a while, so you can check the progress using the
following commands:
- Show logs of training:
tail -f [PATH]/training.log # e.g: tail -f checkpoints/transformer_base/fr_en/logs/training.log
- Show the CPU stats:
- Show the GPU stats:
watch -n 1 nvidia-smi
- Show tensorboard logs.
The following table explains every flag used for training; all flags can be found here and here:
Flag | Description | Used Value |
arch | The model architecture that will be trained. This is the total list of architectures that can be used. | transformer |
save-dir | The directory to save the model checkpoints. | - |
tensorboard-logdir | The directory to save TensorBoard logs. | - |
optimizer | The optimizer that will be used. | Adam |
adam-betas | β1 and β2 that will be used with Adam. Also, Adam has a weight-decay of 0.0001. | '(0.9, 0.98)' |
lr | Learning Rate | 5e-4 |
lr-scheduler | The function of time-step that will change the learning rate while training to get better performance. This scheduler has 4000 warmup-updates. | inverse_sqrt |
dropout | The dropout probability. | 0.1 |
criterion | The loss function. | label_smoothed_cross_entropy |
label-smoothing | The label smoothing uncertainty. | 0.1 |
max-tokens | Maximum number of tokens in a batch. This is used instead of `batch-size`. Also, `max-tokens-valid=max-tokens` unless specified otherwise. | 4096 |
num-workers | The number of subprocesses used for data loading. | 8 |
validate-interval-updates | The number of updates between validation runs. | 6000 |
task | The task your model is training on. Possible choices can be found here. | translation |
eval-bleu | Uses BLEU as the evaluation metric. This argument is only available when `task=translation`. | |
eval-bleu-args | The arguments passed to the BLEU metric. All possible arguments and their default values can be found here. This argument is only available when `task=translation`. | '{"beam": 5, "max_len_a": 1.2, "max_len_b": 10}' |
eval-bleu-detok | The algorithm/framework used to detokenize the validation set. This argument is only available when `task=translation`. | moses |
eval-bleu-remove-bpe | Removes BPE when validating. This argument is only available when `task=translation`. | |
eval-bleu-print-samples | Prints one sample per batch when validating. This argument is only available when `task=translation`. | |
best-checkpoint-metric | The metric that will be used for evaluating the best model to be saved. | bleu |
maximize-best-checkpoint-metric | Select the largest metric value for saving `best` checkpoint. | |
max-epoch | The maximum number of training epochs. | 50 |
patience | Stop training if valid performance doesn’t improve for N validations. | 10 |
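To make the interaction between `lr`, `lr-scheduler` and the 4000 warmup updates concrete, here is a minimal sketch of an inverse square root schedule. It is our own re-implementation for illustration, not fairseq's code: the function name and the assumption that warmup starts from a learning rate of 0 are ours. The learning rate rises linearly to the peak during warmup and then decays proportionally to the inverse square root of the update number.

```python
def inverse_sqrt_lr(num_updates, peak_lr=5e-4, warmup_updates=4000, warmup_init_lr=0.0):
    """Learning rate after `num_updates` updates.

    Assumption: warmup starts at `warmup_init_lr` (0 here) and is linear.
    """
    if num_updates < warmup_updates:
        # Linear warmup from warmup_init_lr to peak_lr.
        return warmup_init_lr + num_updates * (peak_lr - warmup_init_lr) / warmup_updates
    # After warmup: peak_lr * sqrt(warmup_updates / num_updates).
    return peak_lr * (warmup_updates / num_updates) ** 0.5


for step in (0, 2000, 4000, 16000):
    print(step, inverse_sqrt_lr(step))
```

With the values from the table, the learning rate peaks at 5e-4 after 4000 updates and has halved to 2.5e-4 by update 16000.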
During validation, the logger will print a few values under the following acronyms:
- nll_loss: Negative log-likelihood loss.
- ppl: perplexity (the lower, the better).
- wps: Words Per Second.
- ups: Updates Per Second.
- wpb: Words Per Batch.
- bsz: batch size.
- num_updates: number of updates since the start of training.
- lr: learning rate.
- gnorm: L2 norm of the gradients.
- gb_free: GPU memory free.
- wall: total time spent training, validating, saving checkpoints (so far).
- train_wall: time taken for one training step
- oom: number of times the training was stopped because of OOM.
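The relation between `nll_loss` and `ppl` can be written in one line. The sketch below assumes the logged `nll_loss` is expressed in base 2 (bits per token), which is our understanding of how fairseq reports it; the function name is ours.

```python
def perplexity(nll_loss, base=2.0):
    """Perplexity implied by a per-token negative log-likelihood in the given base."""
    return base ** nll_loss


print(perplexity(2.0))  # 4.0
```

Under that assumption, lowering `nll_loss` by 1 halves the perplexity.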
To evaluate your model on real data, you can use the following steps:
- Run the
file like so:bash #for bidirectional model
During scoring, the logger will print a few lines, each prefixed by a certain character. Here is what they mean:
- S: the source sentence the model has to translate.
- T: the target, or reference, for the source sentence.
- H: the tokenized hypothesis translation (i.e., the tokens generated by the model), along with its score.
- D: the detokenized hypothesis translation (i.e., the generated sentence after reversing the word tokenization), along with its score.
- P: the positional score of each token position in the hypothesis.
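To post-process these lines, they can be grouped by sentence id. The following is a minimal sketch, assuming each line has the form `<prefix>-<id>`, then a tab, then tab-separated fields (a score followed by the text for H and D, space-separated scores for P). The function name and the sample lines in the usage note are made up for illustration.

```python
from collections import defaultdict


def parse_generate_output(lines):
    """Group S/T/H/D/P lines by sentence id; other log lines are skipped."""
    records = defaultdict(dict)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        prefix, _, sent_id = fields[0].partition("-")
        if prefix not in {"S", "T", "H", "D", "P"} or not sent_id.isdigit():
            continue  # not a scoring line
        entry = records[int(sent_id)]
        if prefix in {"S", "T"} and len(fields) >= 2:
            entry[prefix] = fields[1]
        elif prefix in {"H", "D"} and len(fields) >= 3:
            entry[prefix] = {"score": float(fields[1]), "text": fields[2]}
        elif prefix == "P" and len(fields) >= 2:
            entry[prefix] = [float(score) for score in fields[1].split()]
    return dict(records)
```

For example, feeding it the made-up lines `"S-0\tbonjour"`, `"T-0\thello"` and `"H-0\t-0.25\thello"` yields a single record for sentence 0 with the source, the reference, and the hypothesis with its score.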
To get the ID of the running fairseq-train process, run the following command:
ps aux | grep 'fairseq-train'
After getting the id, kill the process by running the following command:
kill -9 <id> # where <id> is the id of the process
If that doesn't work, try the following command. Note that it kills every process whose command line mentions python, including unrelated ones:
for pid in $(ps -ef | awk '/python/ {print $2}'); do kill -9 $pid; done
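A more selective alternative is to pick out only the matching PIDs from the `ps aux` output. The sketch below is a hypothetical helper, not part of the project: it assumes the usual 11-column `ps aux` layout (PID in the second column, command in the last) and skips the `grep` process itself, which would otherwise match too.

```python
def find_pids(ps_lines, needle="fairseq-train"):
    """Return the PIDs of `ps aux` lines whose command contains `needle`."""
    pids = []
    for line in ps_lines:
        # ps aux prints 11 columns; everything after the 10th split is the command.
        columns = line.split(None, 10)
        if len(columns) < 11 or not columns[1].isdigit():
            continue  # header or malformed line
        command = columns[10]
        if needle in command and not command.startswith("grep"):
            pids.append(int(columns[1]))
    return pids
```

Each returned PID can then be passed to `os.kill(pid, signal.SIGTERM)`, falling back to `signal.SIGKILL` only if the process does not exit.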
To stop the cronjob, just comment out its line in the crontab file. You can open the crontab by running the following command:
sudo crontab -e