This repo forks THUNLP-MT/Mask-Align and adapts it to translate the NewsQA reading comprehension dataset to Spanish (see the NewsQA-es repo for more info). Mask-Align is an algorithm that aligns translations at the token level, which allows us to locate each English answer span inside the translated Spanish text.
First, clone this repo:
git clone https://github.com/pln-fing-udelar/Mask-Align
cd Mask-Align/
Then, using Conda, run:
conda env create
conda activate mask-align
Finally, run the following script to download the Europarl Spanish-English parallel corpus and generate vocabularies for English and Spanish:
./scripts/train-mask-align/learn_vocabulary.sh
We need a trained Mask-Align model to align translations between English and Spanish. To download the pretrained model, run the following commands:
mkdir -p spanish-output/output
wget -O spanish-output/output/model-1.pt https://github.com/pln-fing-udelar/Mask-Align/releases/download/pretrained-model/model-1.pt
Alternatively, follow these steps to train it yourself.
Run the following script to do some preprocessing, tokenize sentences, and split the corpus:
./scripts/train-mask-align/preprocess_europarl.sh
Note you need a computer with a CUDA-capable GPU to train the model.
- In the config file thualign/configs/user/spanish.config, specify the location of the following files:
  - corpus.32k.es.shuf
  - corpus.32k.en.shuf
  - validation.32k.es
  - validation.32k.en
  - test.32k.es
  - test.32k.en
  - es.vocab
  - en.vocab
- In device_list, specify the number of GPUs.
- In batch_size, choose the highest value that doesn't make the training stop due to lack of memory (try different numbers).
- The value of update_cycle must be 36000 / batch_size.
- Run:
  ./thualign/bin/train.sh -s spanish
- The model checkpoints are saved under spanish-output/output/. The latest one has the highest number in the filename. When testing or generating, the codebase selects the best or the latest model automatically.
- Run:
  ./thualign/bin/test.sh -s spanish -gvt
- The alignments are generated in spanish-output/output/test/alignments.txt.
- To see the alignments in an interactive way with a web browser, run:
  ./thualign/scripts/visualize.py spanish-output/output/test/alignment_vizdata.pt
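For the training configuration above, the update_cycle rule (36000 / batch_size) can be checked with a few lines of shell. This is only a sketch: compute_update_cycle is a hypothetical helper that is not part of this repo, and the requirement that batch_size divide 36000 exactly is an assumption made here so that the result is an integer.

```shell
#!/usr/bin/env bash
# Sketch: derive update_cycle from a chosen batch_size using the rule
# update_cycle = 36000 / batch_size from the training steps.
# compute_update_cycle is a hypothetical helper, not part of Mask-Align.
compute_update_cycle() {
  local batch_size="$1"
  local target=36000  # constant taken from the rule above
  # Assumption: only accept positive values that divide 36000 exactly.
  if [ "$batch_size" -le 0 ] || [ $((target % batch_size)) -ne 0 ]; then
    echo "batch_size must be a positive divisor of $target" >&2
    return 1
  fi
  echo $((target / batch_size))
}

compute_update_cycle 9000   # prints 4
compute_update_cycle 4500   # prints 8
```

Once a batch_size fits in GPU memory, the printed value is what goes into update_cycle in thualign/configs/user/spanish.config.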
A threshold is used in the Mask-Align algorithm to decide which words are included in the alignment of the answers and which ones aren't. To calculate the optimal value for this threshold you can run the following script:
./scripts/calculate-threshold-and-em/calculate_threshold.sh
Also, if you want to test the alignment of answers in the NewsQA dataset, run the following script to compute the Exact Match measure over some annotated examples:
./scripts/calculate-threshold-and-em/calculate_em.sh
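To make the Exact Match measure concrete: it is the percentage of predicted answers that are identical to the annotated answer. The sketch below illustrates the idea only. exact_match is a hypothetical helper, and the input format (two plain-text files with one answer per line, in the same order, without tab characters) is an assumption; calculate_em.sh may read its data differently.

```shell
#!/usr/bin/env bash
# Sketch: Exact Match as the percentage of lines that are identical
# in two line-aligned files. Hypothetical helper, not the repo's script.
exact_match() {
  # $1: predicted answers, $2: annotated answers (one per line)
  paste "$1" "$2" | LC_ALL=C awk -F '\t' '
    # Appending "" forces a string comparison, so "1.0" and "1" differ.
    { total++; if (($1 "") == ($2 "")) hits++ }
    END { if (total > 0) printf "%.2f\n", 100 * hits / total }'
}

# Tiny demo with made-up answers: 2 of the 3 lines match.
tmp=$(mktemp -d)
printf 'el presidente\nen 2009\nMadrid\n' > "$tmp/pred.txt"
printf 'el presidente\nen 2008\nMadrid\n' > "$tmp/gold.txt"
exact_match "$tmp/pred.txt" "$tmp/gold.txt"   # prints 66.67
rm -rf "$tmp"
```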
Run the following script to generate the answer alignments for the NewsQA-es dataset. You should have a trained Mask-Align model and have the corpus-es/newsqa.csv file.
./scripts/generate-alignments/generate_alignments.sh
The following 4 files are going to be generated under corpus-es/:
- output-indexes.txt: the indexes of the answers in Spanish.
- output-answers.txt: the answers in Spanish (in plain text).
- output-sentences.txt: the sentences in Spanish (not tokenized).
- newsqa-es.csv: a new version of newsqa_filtered.csv which has the columns with the answers in Spanish.