There are a wide variety of techniques to employ when trying to create a new machine translation model for a low-resource language or improve an existing baseline. The applicability of these techniques generally depends on the availability of parallel and monolingual corpora for the target language, and on the availability of parallel corpora for related languages/domains.
Scenario #1 - The data you have is super noisy (e.g., scraped from the web), and you aren't sure which sentence pairs are "good"
Papers:
- Low-Resource Corpus Filtering using Multilingual Sentence Embeddings
- Findings of the WMT 2019 Shared Task on Parallel Corpus Filtering for Low-Resource Conditions
Resources/ examples:
- Implementation - fast_align: creates word alignments that can be used to score sentence pairs
- Implementation - zipporah: parallel corpus cleaner
- Implementation - bicleaner: parallel corpus cleaner
- Implementation - LASER: Language-Agnostic SEntence Representations
Scenario #2 - You don't have any parallel data for the source-target language pair, you only have monolingual target data
Papers:
- Phrase-Based & Neural Unsupervised Machine Translation
- Word Translation Without Parallel Data
- Unsupervised Statistical Machine Translation
Resources/ examples:
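No implementations are listed here yet; as a small illustration of one building block from "Word Translation Without Parallel Data", below is a sketch of the Procrustes step that aligns two monolingual embedding spaces given a seed dictionary. In the fully unsupervised setting the seed comes from adversarial training rather than a bilingual lexicon; the data here is a random toy example.

```python
import numpy as np

def procrustes(X, Y):
    """Orthogonal map W minimizing ||X @ W.T - Y||_F.

    X: (n, d) source-language embeddings of seed dictionary words
    Y: (n, d) target-language embeddings of their translations
    Closed form: W = U @ Vt, where U, S, Vt = SVD(Y.T @ X).
    """
    U, _, Vt = np.linalg.svd(Y.T @ X)
    return U @ Vt

# Toy example: 5 seed pairs in a 4-dimensional embedding space.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))
Y = rng.normal(size=(5, 4))
W = procrustes(X, Y)
mapped = X @ W.T  # source embeddings projected into the target space
# Translation = nearest target-word neighbor of each mapped vector.
```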
Scenario #3 - You only have a small amount of parallel data for the source-target language pair, but you have lots of parallel data for a related source-target language pair
Papers:
- Rapid Adaptation of Neural Machine Translation to New Languages
- Neural Machine Translation with Pivot Languages
- Transfer Learning across Low-Resource, Related Languages for Neural Machine Translation
- Transfer Learning for Low-Resource Neural Machine Translation
- Trivial Transfer Learning for Low-Resource Neural Machine Translation
- Pivot-based Transfer Learning for Neural Machine Translation between Non-English Languages
Resources/ examples:
- Implementation - rapid adaptation methods (Neubig)
- Video - rapid adaptation methods (Neubig)
- Implementation - transfer learning for low resource languages (Zoph)
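As a rough sketch of the parent-child transfer idea, the snippet below continues training a parent model (trained on a related high-resource pair) on a tiny child-language corpus. The parent checkpoint, the child sentence pair, and the hyperparameters are all placeholder assumptions, and it uses Hugging Face transformers rather than the authors' original codebases.

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Placeholder parent: a French-English model standing in for any related high-resource pair.
parent_ckpt = "Helsinki-NLP/opus-mt-fr-en"
tok = AutoTokenizer.from_pretrained(parent_ckpt)
model = AutoModelForSeq2SeqLM.from_pretrained(parent_ckpt)

# Tiny hypothetical child corpus (e.g., Haitian Creole -> English).
child_pairs = [("chen an ap dòmi", "the dog is sleeping")]

opt = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.train()
for src, tgt in child_pairs:  # in practice: batches, epochs, dev-set early stopping
    batch = tok(src, text_target=tgt, return_tensors="pt")
    loss = model(**batch).loss
    loss.backward()
    opt.step()
    opt.zero_grad()
```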
Scenario #4 - You only have a small amount of parallel data for the source-target language pair, but you have lots of monolingual data for the target and/or source language
Papers:
- Improving Neural Machine Translation Models with Monolingual Data
- Iterative Back-Translation for Neural Machine Translation
- Generalizing Back-Translation in Neural Machine Translation
- Improving Back-Translation with Uncertainty-based Confidence Estimation
- Neural Machine Translation of Low-Resource and Similar Languages with Backtranslation
Resources/ examples:
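No implementations are listed here yet; below is a minimal sketch of the core back-translation loop from Sennrich et al.: a reverse (target-to-source) model translates monolingual target sentences into synthetic source sentences, and the resulting synthetic pairs are added to the real parallel data. The checkpoint name is a placeholder; any target-to-source model works.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Placeholder reverse model: if training fr->en, back-translate with an en->fr model.
reverse_ckpt = "Helsinki-NLP/opus-mt-en-fr"
tok = AutoTokenizer.from_pretrained(reverse_ckpt)
reverse_model = AutoModelForSeq2SeqLM.from_pretrained(reverse_ckpt)

def back_translate(mono_tgt_sents):
    """Turn monolingual target sentences into synthetic (source, target) pairs."""
    batch = tok(mono_tgt_sents, return_tensors="pt", padding=True)
    out = reverse_model.generate(**batch, max_new_tokens=64)
    synthetic_src = tok.batch_decode(out, skip_special_tokens=True)
    return list(zip(synthetic_src, mono_tgt_sents))

mono = ["The dog is sleeping.", "It is raining."]
synthetic_pairs = back_translate(mono)
# Training data = real parallel pairs + synthetic_pairs (often tagged or down-weighted).
```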
Scenario #5 - You have a small amount of parallel data for the source-target language pair, but you also have a lot of parallel data for other language pairs
Papers:
- Massively Multilingual Neural Machine Translation in the Wild: Findings and Challenges
- Multilingual Neural Machine Translation With Soft Decoupled Encoding
- Meta-Learning for Low-Resource Neural Machine Translation
- Effective Cross-lingual Transfer of Neural Machine Translation Models without Shared Vocabularies
Resources/ examples:
- Video - Meta-learning for low resource MT
- Blog - Exploring Massively Multilingual, Massive Neural Machine Translation
- Blog - Zero-Shot Translation with Google’s Multilingual Neural Machine Translation System
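The simplest mechanism behind Google's multilingual/zero-shot system is to train one model on all language pairs at once, prepending an artificial token to each source sentence that says which target language to produce. Here is a sketch of that preprocessing; the `<2xx>` token format follows the paper, everything else is illustrative.

```python
def tag_for_target(src_sentence: str, tgt_lang: str) -> str:
    # Prepend an artificial target-language token, as in Google's multilingual NMT.
    return f"<2{tgt_lang}> {src_sentence}"

corpus = [
    ("Hello, how are you?", "es"),
    ("Hello, how are you?", "sw"),
]
tagged = [tag_for_target(src, lang) for src, lang in corpus]
# -> ["<2es> Hello, how are you?", "<2sw> Hello, how are you?"]
# A single model trained this way can often translate directions never seen
# paired in training (zero-shot), via shared representations.
```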
Scenario #6 - You don't have any data for the source-target language pair, not even monolingual data, but you have a linguist or a speaker
Papers:
- Apertium: a free/open-source platform for rule-based machine translation (Machine Translation 24(1), pp. 1–18)
Resources/ examples:
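Nothing is listed here yet; the toy sketch below shows the Apertium-style pipeline in miniature: a bilingual lexicon lookup followed by a hand-written structural transfer rule. The lexicon entries and the Swahili noun-adjective example are hypothetical, and a real Apertium module would use its own dictionary and transfer-rule formats rather than Python.

```python
# Toy Apertium-style pipeline: morphological lookup, then structural transfer.
LEXICON = {  # hypothetical Swahili -> English entries with parts of speech
    "mti": ("tree", "NOUN"),
    "mkubwa": ("big", "ADJ"),
    "unaanguka": ("is falling", "VERB"),
}

def translate(words):
    analyses = [LEXICON.get(w, (w, "UNK")) for w in words]
    # Structural transfer rule: Swahili NOUN ADJ -> English ADJ NOUN.
    out, i = [], 0
    while i < len(analyses):
        if (i + 1 < len(analyses)
                and analyses[i][1] == "NOUN" and analyses[i + 1][1] == "ADJ"):
            out += [analyses[i + 1][0], analyses[i][0]]
            i += 2
        else:
            out.append(analyses[i][0])
            i += 1
    return " ".join(out)

print(translate(["mti", "mkubwa", "unaanguka"]))  # -> "big tree is falling"
```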
General papers and resources about African languages or African language MT:
- Towards Neural Machine Translation for African Languages
- A Focus on Neural Machine Translation for African Languages
- Parallel Corpora for bi-lingual English-Ethiopian Languages Statistical Machine Translation
- cocohub.cc, tools for crowdsourcing parallel corpora
- Bitextor tool for mining parallel corpora from websites
- CommonCrawl split by language. If the language isn't supported by CLD2, build a language model and ask the maintainers to run a perplexity filter on CommonCrawl (see the KenLM sketch below for the filtering step).
- OLAC
- Glottolog
- Ethnologue (free up to a certain number of page views; many universities have subscriptions if you happen to be affiliated with one)
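For the perplexity-filtering idea mentioned above, here is a sketch using the kenlm Python bindings: train an n-gram language model on clean monolingual text in your language, then drop crawled lines the model finds implausible. The model path, file name, and threshold are placeholders; in practice you would pick the threshold by inspecting the score distribution.

```python
import kenlm  # assumes the kenlm Python bindings are installed

# Placeholder: an n-gram LM trained (e.g., with KenLM's lmplz) on clean text.
lm = kenlm.Model("my_language.arpa")

def keep_line(line, max_perplexity=1000.0):
    """Keep a crawled line only if its LM perplexity is below the threshold."""
    return lm.perplexity(line) < max_perplexity

with open("commoncrawl_candidates.txt") as f:
    kept = [line.strip() for line in f if keep_line(line.strip())]
```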