This repository accompanies our paper "Coursera Corpus Mining and Multistage Fine-Tuning for Improving Lectures Translation".
It contains both the dataset and the source code described in the paper.
Keywords: Japanese-English parallel dataset, educational domain machine translation, lectures translation, multistage fine-tuning
| | #lines | #docs | Description |
|---|---|---|---|
| Test | 2068 | 50 | Human-validated |
| Dev | 555 | 16 | Human-validated |
| Train | 50543 | 818 | Automatically aligned, high quality |

Table 1: English-Japanese parallel dataset in the educational domain.
| | #lines | #docs | Description |
|---|---|---|---|
| Test | 2009 | 90 | Human-validated |
| Dev | 865 | 34 | Human-validated |
| Train | 40074 | 997 | Automatically aligned, high quality |

Table 2: English-Chinese parallel dataset in the educational domain.
The dataset contains high-quality English-Japanese parallel sentences and documents from Coursera. Please refer to our paper for details.
Update: We updated the English-Japanese dataset, which now contains more sentences, and added a new English-Chinese dataset.
The repository also contains the source code described in the paper:
- Crawling multi-language subtitle documents from Coursera using youtube-dl.
- Extracting subtitle files of the desired language pair, data normalization, and data cleaning.
- Using machine translation and sentence embeddings combined with dynamic programming (DP) to extract parallel sentence pairs from comparable document pairs.
- Multistage fine-tuning techniques that leverage out-of-domain and in-domain data to train an MT system for lecture translation.
| | Ja->En | En->Ja |
|---|---|---|
| Coursera dataset only | 6.2 | 6.4 |
| Combined with OOD datasets | 27.5 | 18.5 |

| | Zh->En | En->Zh |
|---|---|---|
| Coursera dataset only | 14.8 | 14.5 |
| Combined with OOD datasets | 29.5 | 29.1 |

Table 3: BLEU scores obtained using only the Coursera dataset versus combining it with out-of-domain (OOD) datasets through multistage fine-tuning: ASPEC and TED Talks for Japanese-English, and News Commentary and TED Talks for Chinese-English. Please refer to our paper for details.
Please cite our papers if you use our code or dataset:
@inproceedings{song-etal-2020-coursera,
title = "{C}oursera Corpus Mining and Multistage Fine-Tuning for Improving Lectures Translation",
author = "Song, Haiyue and
Dabre, Raj and
Fujita, Atsushi and
Kurohashi, Sadao",
booktitle = "Proceedings of the 12th Language Resources and Evaluation Conference",
month = may,
year = "2020",
address = "Marseille, France",
publisher = "European Language Resources Association",
url = "https://www.aclweb.org/anthology/2020.lrec-1.449",
pages = "3640--3649",
language = "English",
ISBN = "979-10-95546-34-4",
}
@article{HaiyueSong2024,
title={Bilingual Corpus Mining and Multistage Fine-tuning for Improving Machine Translation of Lecture Transcripts},
author={Haiyue Song and Raj Dabre and Chenhui Chu and Atsushi Fujita and Sadao Kurohashi},
journal={Journal of Information Processing},
volume={32},
number={ },
pages={628-640},
year={2024},
doi={10.2197/ipsjjip.32.628}
}
If you have any questions, please contact song@nlp.ist.i.kyoto-u.ac.jp