This is the official implementation for the paper: True Bilingual NMT.
The machine used for training models in this paper has the following properties:
- vCPU: 8
- RAM: 52 GB
- GPUs: 4 (NVIDIA Tesla T4)
- OS: Ubuntu (20.04)
- Boot disk: 100 GB
- Additional disk: 500 GB
To connect to the machine, you need to follow these steps:
TODO...
CUDA 11.3 was installed using the following commands:
$ wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
$ sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
$ wget https://developer.download.nvidia.com/compute/cuda/11.3.0/local_installers/cuda-repo-ubuntu2004-11-3-local_11.3.0-465.19.01-1_amd64.deb
$ sudo dpkg -i cuda-repo-ubuntu2004-11-3-local_11.3.0-465.19.01-1_amd64.deb
$ sudo apt-key add /var/cuda-repo-ubuntu2004-11-3-local/7fa2af80.pub
$ sudo apt-get update
$ sudo apt-get -y install cuda
For faster multi-GPU communication, you need to install the NVIDIA Collective Communication Library (NCCL) by following these steps:
- Go to the NVIDIA NCCL home page and complete the short survey.
- Download the NCCL deb package. The package name should be nccl-local-repo-<os_distribution>-<cuda_version>-<architecture>.deb. For example, mine is: nccl-local-repo-ubuntu2004-2.11.4-cuda11.4_1.0-1_amd64.deb
- Install it using the following command:
$ sudo dpkg -i nccl-local-repo-ubuntu2004-2.11.4-cuda11.4_1.0-1_amd64.deb
- To make sure it was installed successfully, run the following command:
$ python
>>> import torch
>>> torch.cuda.nccl.version()
(2, 10, 3)  # you should get something similar
For faster training, we need to install NVIDIA's apex library:
$ git clone https://github.com/NVIDIA/apex
$ cd apex
$ pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
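As a quick sanity check of the apex build (our own addition, not part of NVIDIA's instructions), the package should import cleanly:
```
$ python
>>> import apex  # should import without errors if the build succeeded
```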
In this project, we are using Python 3.8 and PyTorch 1.10. First, let's install the virtualenv tool:
- Before installing the dependencies, we created a virtual environment using virtualenv, which can be installed using:
$ sudo apt-get install python3-virtualenv
- Then, we created a virtual environment called py38 using the following command:
$ virtualenv py38 -p python3.8
- To activate the virtualenv, use the following command (you have to be inside the true-nmt directory):
$ source py38/bin/activate
- To deactivate the virtualenv, use the following command:
$ deactivate
PyTorch 1.10 can be installed using the following command:
$ pip install torch==1.10.0+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html
If everything was installed correctly, the following should work with no warnings or errors:
$ python
>>> import torch
>>> torch.cuda.is_available()
True
>>> torch.version.cuda
'11.3'  # should match the installed CUDA version
>>> torch.cuda.device_count()
4
>>> torch.cuda.get_device_name(0)
'NVIDIA Tesla T4'
You can install fastBPE for sub-word tokenization using the following steps:
- Clone the GitHub repository:
$ git clone https://github.com/glample/fastBPE.git
- Install the needed dependencies:
$ sudo apt-get install python3.8-dev
$ pip install Cython
- Install the fastBPE Python API:
$ cd fastBPE
$ python setup.py install
- To make sure everything was installed correctly, try importing it like so:
$ python
>>> import fastBPE
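Once it is installed, the Python API can be used to apply learned BPE codes to tokenized text. The paths below are placeholders for the codes/vocabulary files produced during preprocessing, not files shipped with this repo:
```python
import fastBPE

# Placeholder paths: point these at the BPE codes/vocab learned on your training data.
bpe = fastBPE.fastBPE("bpe.codes", "vocab.en")
print(bpe.apply(["This is an example sentence ."]))
# The output is the same sentence split into BPE sub-word units (joined with "@@").
```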
You can install PyArrow using pip like so:
$ pip install pyarrow
The following are the steps needed to install the other tools used in this project.
terashuf is a tool for shuffling big files. You can install it using the following commands:
$ git clone https://github.com/alexandres/terashuf.git
$ cd terashuf
$ make
For Moses (mosesdecoder), you just need to clone it:
$ git clone https://github.com/moses-smt/mosesdecoder.git
For subword-nmt, you just need to clone it:
$ git clone https://github.com/rsennrich/subword-nmt.git
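subword-nmt also ships a small Python API (usable if you pip-install it or add the clone to your PYTHONPATH). As a minimal sketch, with a placeholder path for the learned codes file:
```python
from subword_nmt.apply_bpe import BPE

# Placeholder path: use the BPE codes file learned on the training data.
with open("bpe.codes", encoding="utf-8") as codes_file:
    bpe = BPE(codes_file)

print(bpe.process_line("This is an example sentence ."))
```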
To install fairseq and develop locally, follow these steps:
- Clone the repository:
$ git clone https://github.com/pytorch/fairseq
- Install it:
$ cd fairseq
$ pip install --editable ./
$ python setup.py build_ext --inplace
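As a quick sanity check of the editable install (our own addition, not from the fairseq docs):
```
$ python
>>> import fairseq
>>> fairseq.__version__
'1.0.0a0'  # the exact version string depends on the commit you cloned
```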
To install TensorBoard without TensorFlow, follow these steps:
- Install TensorBoard using the following command:
$ pip install tensorboard
To access TensorBoard from your local machine, you need to follow these steps:
- Connect to the server using the following command:
$ ssh -L 16006:127.0.0.1:6006 gcp1
- Go to the true_nmt directory:
$ cd /mnt/disks/adisk/true_nmt
- Activate virtualenv:
$ source py38/bin/activate
- Run TensorBoard:
$ tensorboard --logdir=logs
- Now, open this URL in your browser: http://127.0.0.1:16006/
In this step, we are going to use the fairseq pre-trained English-French translation model, so we don't need to download the data or train the model ourselves. We just need to download the pre-trained model by following these steps:
- Download the file:
$ wget https://dl.fbaipublicfiles.com/fairseq/models/wmt14.en-fr.joined-dict.transformer.tar.bz2
- Extract the archive:
$ tar -xvf wmt14.en-fr.joined-dict.transformer.tar.bz2
- Rename it to a more convenient name:
$ mv wmt14.en-fr.joined-dict.transformer wmt14.en_fr
- You can use it directly now; see the nmt.py script for details, or the sketch below.
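For reference, the extracted checkpoint can also be loaded through fairseq's Python API. This is a sketch rather than the repo's own code path: the file names (model.pt, bpecodes) and the moses/subword_nmt settings follow fairseq's usual layout for this archive and may need adjusting; nmt.py remains the script to follow in this repo.
```python
from fairseq.models.transformer import TransformerModel

# Load the extracted checkpoint directory (renamed to wmt14.en_fr above).
# File names follow fairseq's usual archive layout; adjust if yours differ.
en2fr = TransformerModel.from_pretrained(
    "wmt14.en_fr",
    checkpoint_file="model.pt",
    tokenizer="moses",
    bpe="subword_nmt",
    bpe_codes="wmt14.en_fr/bpecodes",
)

print(en2fr.translate("Machine learning is great!"))
```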
To prepare the dataset for training/testing, follow these steps:
- Download the dataset for the French-English benchmark. Running the following command will create a new directory called wmt14_en_fr:
$ bash prepare-wmt14en2fr.sh
- At the end, the data stats were as follows:

Split | train | valid | test |
---|---|---|---|
Before Cleaning | 40842333 | 16573 | 3003 |
After Cleaning | 35789717 | 15259 | 3003 |
The data preprocessing steps, according to the model task, are as follows:
Unidirectional:
1. Download
2. Preprocess & normalize
3. Tokenize
4. Encode BPE
5. Clean
6. Binarize
Bidirectional:
1. Copy from unidirectional until (5)
2. Combine and add tags to the start of the sentences (see the sketch after these lists)
3. Append our tags to the vocabulary
4. Binarize
CSW:
1. Copy from unidirectional until (3)
2. Generate CSW
3. Combine and add tags to the start of the sentences
4. Encode BPE
5. Clean
6. Shuffle
7. Binarize
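The exact tagging logic lives in this repo's scripts; as an illustration of the "combine and add tags to the start of the sentences" step, here is a minimal sketch. The tag strings (<2en>, <2fr>) and file names are placeholders, not the exact ones used by the repo:
```python
# Minimal sketch: tag strings and file names below are placeholders.
def add_direction_tag(line: str, target_lang: str) -> str:
    """Prepend a target-language tag so a single model can translate both directions."""
    return f"<2{target_lang}> {line.strip()}\n"

with open("train.fr", encoding="utf-8") as fr_file, \
        open("train.en", encoding="utf-8") as en_file, \
        open("train.bidir.src", "w", encoding="utf-8") as out_src, \
        open("train.bidir.tgt", "w", encoding="utf-8") as out_tgt:
    for fr_line, en_line in zip(fr_file, en_file):
        # fr -> en direction
        out_src.write(add_direction_tag(fr_line, "en"))
        out_tgt.write(en_line.strip() + "\n")
        # en -> fr direction
        out_src.write(add_direction_tag(en_line, "fr"))
        out_tgt.write(fr_line.strip() + "\n")
```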
To train the model, follow these steps:
- Running the following command will train a transformer-base model on the
binarized data:
bash train.sh
- The training will take a while, so you can check the progress using the
following commands:
- Show logs of training:
tail -f [PATH]/training.log # e.g: tail -f checkpoints/transformer_base/fr_en/logs/training.log
- Show the CPU stats:
htop
- Show the GPU stats:
watch -n 1 nvidia-smi
- Show TensorBoard logs (see the TensorBoard section above).
NOTE:
The following table explains every flag used for training; the full list of flags is documented in the fairseq CLI documentation:
Flag | Description | Used Value |
---|---|---|
arch | The model architecture that will be trained. See the fairseq documentation for the full list of available architectures. | transformer |
save-dir | The directory to save the model checkpoints. | - |
tensorboard-logdir | The directory to save TensorBoard logs. | - |
optimizer | The optimizer that will be used. | Adam |
adam-betas | β1 and β2 that will be used with Adam. Also, Adam has a weight-decay of 0.0001. | '(0.9, 0.98)' |
lr | Learning Rate | 5e-4 |
lr-scheduler | The function of the time step that changes the learning rate during training to get better performance. This scheduler uses 4000 warmup-updates (see the sketch after this table). | inverse_sqrt |
dropout | The dropout probability. | 0.1 |
criterion | The loss function. | label_smoothed_cross_entropy |
label-smoothing | The label smoothing uncertainty. | 0.1 |
max-tokens | Maximum number of tokens used for training. This is used instead of `batch-size`. Also, `max-tokens-valid=max-tokens` if not specified otherwise. | 4096 |
num-workers | The total number of parallel processes that will be running while training. | 8 |
validate-interval-updates | Run validation every N training updates. | 6000 |
task | The task your model is training on. Possible choices can be found in the fairseq documentation. | translation |
eval-bleu | Uses BLEU as the evaluation metric. This argument is only available when `task=translation`. | |
eval-bleu-args | The arguments that will be used with the BLEU metric. All possible arguments and their default values can be found in the fairseq documentation. Only available when `task=translation`. | '{"beam": 5, "max_len_a": 1.2, "max_len_b": 10}' |
eval-bleu-detok | The algorithm/framework that will be used to detokenize the validation set. Only available when `task=translation`. | moses |
eval-bleu-remove-bpe | Removes BPE when validating. Only available when `task=translation`. | |
eval-bleu-print-samples | Prints one sample per batch when validating. Only available when `task=translation`. | |
best-checkpoint-metric | The metric that will be used for evaluating the best model to be saved. | bleu |
maximize-best-checkpoint-metric | Select the largest metric value for saving `best` checkpoint. | |
max-epoch | The maximum number of training epochs. | 50 |
patience | Stop training if valid performance doesn’t improve for N validations. | 10 |
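To make the lr, lr-scheduler, and warmup-updates rows concrete, here is a small illustrative re-implementation of an inverse-square-root schedule (not fairseq's actual code; the warmup_init_lr value is an assumption for the sketch):
```python
def inverse_sqrt_lr(num_updates: int, peak_lr: float = 5e-4,
                    warmup_updates: int = 4000, warmup_init_lr: float = 1e-7) -> float:
    """Linear warmup to peak_lr, then decay proportional to 1/sqrt(num_updates)."""
    if num_updates < warmup_updates:
        # Linear warmup phase.
        return warmup_init_lr + (peak_lr - warmup_init_lr) * num_updates / warmup_updates
    # Decay phase: peak_lr * sqrt(warmup_updates / num_updates)
    return peak_lr * (warmup_updates / num_updates) ** 0.5

for step in (100, 4000, 16000, 64000):
    print(step, f"{inverse_sqrt_lr(step):.2e}")
```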
During validation, the logger will print a few values under the following acronyms:
- nll_loss: Negative log-likelihood loss.
- ppl: perplexity (the lower, the better; see the note after this list).
- wps: Words Per Second.
- ups: Updates Per Second.
- wpb: Words Per Batch.
- bsz: batch size.
- num_updates: number of updates since the start of training.
- lr: learning rate.
- gnorm: L2 norm of the gradients.
- gb_free: GPU memory free.
- wall: total time spent training, validating, saving checkpoints (so far).
- train_wall: time taken for one training step.
- oom: number of times the training was stopped because of OOM.
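As a note of our own: fairseq logs nll_loss in base 2, so ppl is simply 2 raised to nll_loss. For example:
```python
# ppl is derived from nll_loss; fairseq logs losses in base 2.
nll_loss = 2.31        # example value, as it might appear in a validation log
ppl = 2 ** nll_loss
print(f"{ppl:.2f}")    # ~4.96
```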
To evaluate your model on real data, you can use the following steps:
- Run the score_*.sh file like so:
$ bash score_bidirectional.sh  # for the bidirectional model
During scoring, the logger will print a few lines prepended by certain characters. Here is their meaning:
- S: the source sentence the model has to translate.
- T: the target, i.e., the reference translation for the source sentence.
- H: the tokenized hypothesis translation (i.e., the tokens generated by the model), along with its score.
- D: the detokenized hypothesis translation (i.e., the sentence generated by the model with the applied word tokenization reversed), along with its score.
- P: the positional scores, i.e., the model's per-token log-probabilities for the hypothesis (including the end-of-sentence token).
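If you need to post-process these logs, here is a minimal sketch for pulling the detokenized hypotheses (D lines) out of a generation log. The file name generate.out is a placeholder for wherever you redirected the score_*.sh output:
```python
# Extract detokenized hypotheses (D lines) from a fairseq-generate style log.
hypotheses = {}
with open("generate.out", encoding="utf-8") as log:
    for line in log:
        if line.startswith("D-"):
            # Line format: D-<id><TAB><score><TAB><detokenized hypothesis>
            ident, _score, text = line.rstrip("\n").split("\t", 2)
            hypotheses[int(ident[2:])] = text

for ident in sorted(hypotheses):
    print(ident, hypotheses[ident])
```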
To stop a running training job, follow these steps:
- To get the id of the running fairseq-train process, run the following command:
$ ps aux | grep 'fairseq-train'
- After getting the id, kill the process by running the following command:
$ kill -9 <id>  # where <id> is the id of the process
- If that didn't work, try running the following command:
$ for pid in $(ps -ef | awk '/python/ {print $2}'); do kill -9 $pid; done
- To stop the cronjob, just comment out the corresponding line in the crontab file. You can edit the crontab by running the following command:
$ sudo crontab -e