Nikolay:
Length filtering. Since Chinese sentences normally come as one continuous string of characters, traditional length filtering doesn't work. Furthermore, since a single word can consist of 1-4 Chinese characters, there is no hard-and-fast conversion rule. What people normally do is use a Chinese tokenizer (such as jieba, https://github.com/fxsjy/jieba#jieba-1) to split the Chinese text into words. We can then safely apply the filtering here:
`firefox-translations-training/pipeline/clean/tools/clean_parallel.py`, line 93 at 3b3f33b:

```python
ratio_len = src_len / float(trg_len)
```
Most papers recommend discarding lines where the ratio of English to Chinese (or Chinese to English) words is greater than 1.3.
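A minimal sketch of how this could look, assuming jieba is installed; the function name `keep_pair` and the `MAX_RATIO` constant are illustrative, not from the pipeline:

```python
import jieba

MAX_RATIO = 1.3  # threshold suggested above; illustrative constant

def keep_pair(src: str, trg: str) -> bool:
    """Return True if an English/Chinese pair passes the length-ratio filter."""
    src_len = len(src.split())      # English: whitespace tokenization
    trg_len = len(jieba.lcut(trg))  # Chinese: segment into words with jieba
    if src_len == 0 or trg_len == 0:
        return False
    ratio_len = src_len / float(trg_len)
    # discard the pair if the ratio exceeds the threshold in either direction
    return ratio_len <= MAX_RATIO and 1.0 / ratio_len <= MAX_RATIO
```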
Afterwards, the text should be de-segmented again and prepared for training.
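De-segmentation for Chinese just means removing the separators the tokenizer introduced; a sketch, assuming the segmented sentence is stored as a space-joined string:

```python
def desegment(segmented: str) -> str:
    # Chinese is written without spaces between words, so joining the
    # tokens back with no separator restores the original sentence.
    return "".join(segmented.split())
```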
A Japanese tokenizer should be used in place of jieba for Japanese.
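One possible choice (an assumption on my part, not something the pipeline ships with) is fugashi, a MeCab wrapper:

```python
from fugashi import Tagger  # e.g. pip install 'fugashi[unidic-lite]'

tagger = Tagger()

def tokenize_ja(text: str) -> list[str]:
    # fugashi wraps MeCab; each parsed node exposes the word form via .surface
    return [word.surface for word in tagger(text)]
```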