Coursera Corpus Mining and Multistage Fine-Tuning for Improving Lectures Translation

shyyhs/CourseraParallelCorpusMining


Overview

This repository accompanies our paper, Coursera Corpus Mining and Multistage Fine-Tuning for Improving Lectures Translation.

It contains both the dataset and all source code described in the paper.

Keywords: Japanese-English parallel dataset, educational domain machine translation, lectures translation, multistage fine-tuning

Dataset

| Split | #lines | #docs | Description |
|-------|--------|-------|-------------|
| Test  | 2068   | 50    | Human-validated |
| Dev   | 555    | 16    | Human-validated |
| Train | 50543  | 818   | Automatically aligned, high quality |

Table 1: English-Japanese parallel dataset in the educational domain.

| Split | #lines | #docs | Description |
|-------|--------|-------|-------------|
| Test  | 2009   | 90    | Human-validated |
| Dev   | 865    | 34    | Human-validated |
| Train | 40074  | 997   | Automatically aligned, high quality |

Table 2: English-Chinese parallel dataset in the educational domain.

The dataset contains high-quality English-Japanese parallel sentences and documents mined from Coursera. Please refer to our paper for details.

Update: The English-Japanese dataset has been updated and now contains more sentences. We have also added a new English-Chinese dataset.

Source code

The repository also contains the source code described in the paper:

  1. Crawling multilingual subtitle documents from Coursera using youtube-dl.
  2. Extracting subtitle files for the desired language pair, followed by data normalization and data cleaning.
  3. Using machine translation and sentence embeddings combined with dynamic programming (DP) to extract parallel sentence pairs from comparable document pairs.
  4. Multistage fine-tuning techniques that leverage out-of-domain and in-domain data to train an MT system for lecture-domain translation.
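To make step 3 more concrete, here is a minimal sketch of monotonic sentence alignment by dynamic programming over a similarity matrix. It is an illustration under assumptions, not the code shipped in this repository: the function names `cosine` and `align_monotonic`, the threshold of 0.5, and the toy scores are all hypothetical. In practice the similarity scores would come from machine translation output and/or sentence embeddings, as described in the paper.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two embedding vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def align_monotonic(
    sim: List[List[float]], threshold: float = 0.5
) -> List[Tuple[int, int]]:
    """Pick order-preserving 1-to-1 sentence pairs that maximize total similarity.

    sim[i][j] is the similarity between source sentence i and target sentence j.
    Pairs scoring below `threshold` (an assumed cutoff) are never aligned.
    """
    n = len(sim)
    m = len(sim[0]) if n else 0
    # best[i][j]: best total score using the first i source and first j target sentences.
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score = sim[i - 1][j - 1]
            match = best[i - 1][j - 1] + score if score >= threshold else float("-inf")
            best[i][j] = max(best[i - 1][j], best[i][j - 1], match)

    # Backtrack from the bottom-right corner to recover the chosen pairs.
    pairs: List[Tuple[int, int]] = []
    i, j = n, m
    while i > 0 and j > 0:
        score = sim[i - 1][j - 1]
        if score >= threshold and best[i][j] == best[i - 1][j - 1] + score:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif best[i][j] == best[i - 1][j]:
            i -= 1  # source sentence i-1 is left unaligned
        else:
            j -= 1  # target sentence j-1 is left unaligned
    pairs.reverse()
    return pairs


# Toy similarity matrix: 3 source sentences x 3 target sentences.
toy_sim = [
    [0.9, 0.2, 0.1],
    [0.1, 0.3, 0.2],  # no confident counterpart for this source sentence
    [0.2, 0.8, 0.3],
]
print(align_monotonic(toy_sim))  # [(0, 0), (2, 1)]
```

The monotonicity constraint reflects the assumption that subtitles of the same lecture appear in roughly the same order in both languages; sentences without a sufficiently similar counterpart are simply skipped.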

Experiment results

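| Training data              | Ja->En | En->Ja |
|----------------------------|--------|--------|
| Coursera dataset only      | 6.2    | 6.4    |
| Combined with OOD datasets | 27.5   | 18.5   |

| Training data              | Zh->En | En->Zh |
|----------------------------|--------|--------|
| Coursera dataset only      | 14.8   | 14.5   |
| Combined with OOD datasets | 29.5   | 29.1   |

Table 3: BLEU scores when training on the Coursera dataset only versus combining it with out-of-domain (OOD) datasets through multistage fine-tuning. The OOD datasets are ASPEC and TED Talks for Japanese-English, and News Commentary and TED Talks for Chinese-English. Please refer to our paper for details.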
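As a rough illustration of how out-of-domain and in-domain data can be combined in stages, the sketch below builds one possible three-stage schedule: out-of-domain data first, then a mixture with the in-domain data oversampled, then in-domain data only. This ordering, the `oversample` factor, and every name here (`build_stages`, `run_schedule`, `log_only_train`) are assumptions made for illustration; the schedule actually used is described in the paper. The training step is a hypothetical callback, so any MT toolkit could be plugged in.

```python
import random
from typing import Callable, List, Optional, Tuple

Pair = Tuple[str, str]          # (source sentence, target sentence)
Stage = Tuple[str, List[Pair]]  # (stage name, training pairs)


def build_stages(
    ood: List[Pair], ind: List[Pair], oversample: int = 2, seed: int = 0
) -> List[Stage]:
    """Build an assumed three-stage schedule from OOD and in-domain pairs."""
    rng = random.Random(seed)
    # Repeat the (smaller) in-domain set so it is not drowned out by OOD data.
    mixed = list(ood) + list(ind) * oversample
    rng.shuffle(mixed)
    return [
        ("stage1_out_of_domain", list(ood)),
        ("stage2_mixed", mixed),
        ("stage3_in_domain", list(ind)),
    ]


def run_schedule(
    stages: List[Stage],
    train_fn: Callable[[Optional[list], List[Pair], str], list],
    params: Optional[list] = None,
) -> list:
    """Train stage by stage, initializing each stage from the previous result."""
    for name, data in stages:
        params = train_fn(params, data, name)
    return params


def log_only_train(params: Optional[list], data: List[Pair], name: str) -> list:
    """Hypothetical stand-in for a real trainer: it only records what it saw."""
    history = list(params or [])
    history.append((name, len(data)))
    return history


ood_pairs = [("ood src %d" % i, "ood tgt %d" % i) for i in range(4)]
ind_pairs = [("lecture src", "lecture tgt")]
stages = build_stages(ood_pairs, ind_pairs, oversample=2)
print(run_schedule(stages, log_only_train))
```

Replacing `log_only_train` with a function that resumes training from the checkpoint in `params` would turn this into a real pipeline; the key point is only that each stage starts from the model produced by the previous one.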

Reference

Please cite our papers if you use our code or dataset:

@inproceedings{song-etal-2020-coursera,
    title = "{C}oursera Corpus Mining and Multistage Fine-Tuning for Improving Lectures Translation",
    author = "Song, Haiyue  and
      Dabre, Raj  and
      Fujita, Atsushi  and
      Kurohashi, Sadao",
    booktitle = "Proceedings of the 12th Language Resources and Evaluation Conference",
    month = may,
    year = "2020",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://www.aclweb.org/anthology/2020.lrec-1.449",
    pages = "3640--3649",
    language = "English",
    ISBN = "979-10-95546-34-4",
}
@article{HaiyueSong2024,
  title={Bilingual Corpus Mining and Multistage Fine-tuning for Improving Machine Translation of Lecture Transcripts},
  author={Haiyue Song and Raj Dabre and Chenhui Chu and Atsushi Fujita and Sadao Kurohashi},
  journal={Journal of Information Processing},
  volume={32},
  number={ },
  pages={628-640},
  year={2024},
  doi={10.2197/ipsjjip.32.628}
}

Contact

If you have any questions, please contact song@nlp.ist.i.kyoto-u.ac.jp
