Investigate word-based filtering for CJK #899

Open
Tracked by #425
eu9ene opened this issue Oct 23, 2024 · 1 comment
Labels
language-coverage Issues related to covering specific languages

Comments

@eu9ene
Collaborator

eu9ene commented Oct 23, 2024

Nikolay:
Length filtering. Chinese sentences are normally written as one continuous string of characters with no spaces between words, so traditional whitespace-based length filtering doesn't work. Furthermore, since a single word can consist of 1-4 Chinese characters, there is no hard-and-fast rule for converting character counts into word counts. The usual approach is to split the Chinese text into words with a Chinese tokenizer (such as jieba https://github.com/fxsjy/jieba#jieba-1 ). We can then safely apply the filtering here:

firefox-translations-training/pipeline/clean/tools/clean_parallel.py

Line 93 in 3b3f33b
ratio_len = src_len / float(trg_len)

Most papers recommend discarding lines where the ratio of English to Chinese words, or of Chinese to English words, exceeds 1.3.

Afterwards, the text should be de-segmented again and prepared for training.

For Japanese, a Japanese tokenizer should be used in place of jieba.
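
A minimal sketch of what the word-based ratio check could look like, assuming jieba for the Chinese side. The names `keep_pair`, `whitespace_tokenize`, `jieba_tokenize` and the `max_ratio` parameter are hypothetical and are not taken from clean_parallel.py; the 1.3 threshold is the value mentioned above, applied symmetrically in both directions.

```python
from typing import Callable, List

Tokenizer = Callable[[str], List[str]]


def whitespace_tokenize(text: str) -> List[str]:
    # Works for space-delimited languages such as English.
    return text.split()


def jieba_tokenize(text: str) -> List[str]:
    # Imported lazily so the rest of the sketch runs without jieba installed.
    import jieba

    # jieba.lcut returns a list of tokens; drop whitespace-only tokens.
    return [tok for tok in jieba.lcut(text) if tok.strip()]


def keep_pair(
    src: str,
    trg: str,
    src_tokenize: Tokenizer = whitespace_tokenize,
    trg_tokenize: Tokenizer = whitespace_tokenize,
    max_ratio: float = 1.3,  # assumption: threshold quoted in this issue
) -> bool:
    """Return True if the word-count ratio is within max_ratio either way."""
    src_len = len(src_tokenize(src))
    trg_len = len(trg_tokenize(trg))
    if src_len == 0 or trg_len == 0:
        return False
    ratio_len = src_len / float(trg_len)
    return 1.0 / max_ratio <= ratio_len <= max_ratio
```

For a zh-en pair this would be called as `keep_pair(zh_line, en_line, src_tokenize=jieba_tokenize)`. Since the tokens are only counted here, the original unsegmented line can be written out unchanged, which might make a separate de-segmentation step unnecessary for this particular filter. A Japanese tokenizer could be passed in the same way.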

@eu9ene eu9ene added the language-coverage Issues related to covering specific languages label Oct 23, 2024
This was referenced Oct 23, 2024
@gregtatum
Member

This is another case where the ICU segmenter could be useful; see #860.

Screenshot of the ICU segmenter splitting Chinese text at word granularity using the Intl.Segmenter API.
