Investigate word-based filtering for CJK #899

eu9ene · 2024-10-23T21:47:44Z

Nikolay:
Length filtering. Chinese sentences are normally written as one continuous string of characters, so traditional length filtering, which counts words, doesn't work. Furthermore, because one word can consist of 1-4 Chinese characters, there is no hard-and-fast rule for converting a character count into a word count. The usual approach is to use a Chinese tokenizer (such as jieba https://github.com/fxsjy/jieba#jieba-1 ) to split the Chinese text into words. We can then safely apply the filtering here:

firefox-translations-training/pipeline/clean/tools/clean_parallel.py

Line 93 in 3b3f33b
ratio_len = src_len / float(trg_len)

Most papers recommend discarding lines where the ratio of English to Chinese words, or of Chinese to English words, exceeds 1.3.

Afterwards, the text should be de-segmented again and prepared for training.

For Japanese, a Japanese tokenizer should be used in place of jieba.
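A minimal sketch of the word-based ratio check described above, not the actual `clean_parallel.py` code. The names `keep_pair`, `english_tokens`, `make_chinese_tokenizer` and `MAX_RATIO` are hypothetical; only `jieba.lcut` is a real jieba call. It assumes jieba may not be installed and falls back to one token per character in that case, which overcounts words and is only there to keep the sketch runnable.

```python
from typing import Callable, List

Tokenizer = Callable[[str], List[str]]

# Threshold quoted in the description above; treat it as a tunable assumption.
MAX_RATIO = 1.3


def english_tokens(text: str) -> List[str]:
    """Whitespace tokenization, as used for space-delimited languages."""
    return text.split()


def make_chinese_tokenizer() -> Tokenizer:
    """Return a jieba-based tokenizer, or a per-character fallback."""
    try:
        import jieba  # third-party: pip install jieba
    except ImportError:
        # Fallback (assumption): every non-space character counts as a word.
        return lambda text: [ch for ch in text if not ch.isspace()]
    # jieba.lcut returns a list of segments; drop whitespace-only ones.
    return lambda text: [tok for tok in jieba.lcut(text) if tok.strip()]


def keep_pair(
    src: str,
    trg: str,
    src_tokenize: Tokenizer,
    trg_tokenize: Tokenizer,
    max_ratio: float = MAX_RATIO,
) -> bool:
    """Keep the pair only if neither side has more than max_ratio times
    as many words as the other."""
    src_len = len(src_tokenize(src))
    trg_len = len(trg_tokenize(trg))
    if src_len == 0 or trg_len == 0:
        return False  # an empty side cannot form a valid ratio
    ratio_len = src_len / float(trg_len)
    return 1.0 / max_ratio <= ratio_len <= max_ratio


if __name__ == "__main__":
    zh_tokens = make_chinese_tokenizer()
    print(keep_pair("I love Beijing", "我爱北京", english_tokens, zh_tokens))
```

Because the tokens are used only for counting and the original strings are never modified, a filter written this way would not need a separate de-segmentation step. Whether that fits the existing pipeline is an open question for this investigation.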

gregtatum · 2024-10-29T15:52:35Z

This is another case where the ICU segmenter could be useful, see #860

Screenshot of the ICU segmenter segmenting Chinese text, using the Intl.Segmenter API at word granularity.

eu9ene mentioned this issue Oct 23, 2024

[meta] Train harder to segment languages, like CJK languages #425

Open

eu9ene added the language-coverage label (Issues related to covering specific languages) Oct 23, 2024

This was referenced Oct 23, 2024

Support CJK in OpusCleaner #742

Closed

Adjust data cleaning for CJK #900

Merged
