Support for RTL languages #24
-
Hi Wissam! It would be wonderful for Ecco to support RTL languages. If I break down what needs to be tweaked for the display, most of it would be in eccojs.
A solution might be to add the "ـ" (tatweel) character to turn "ال" into "الـ" when we detect that a token should link up with the next token. But this seems to greatly expand the problem scope.
Btw, how common are subword tokens in the Arabic tokenizers? I haven't experimented with them enough yet. If tokens are mostly whole words, this could be a smaller issue than indicated above.
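To make the tatweel idea concrete, here is a minimal Python sketch (eccojs itself is JavaScript, so this only illustrates the logic). It rests on assumptions that are not from Ecco or eccojs: the function name `add_tatweel` is hypothetical, a token is treated as continuing the current word when the next token starts with an Arabic letter rather than a space, and the set of letters that never connect forward is an approximation.

```python
TATWEEL = "\u0640"  # the "ـ" elongation character

# Assumption: approximate set of Arabic letters that never connect
# to the letter that follows them.
NON_FORWARD_JOINING = set("اأإآءدذرزوؤة")


def is_arabic_letter(ch: str) -> bool:
    """Rough check covering the basic Arabic letter range."""
    return "\u0621" <= ch <= "\u064A"


def add_tatweel(tokens: list[str]) -> list[str]:
    """Append a tatweel to tokens whose last letter should join the next token.

    Hypothetical helper: a token is considered mid-word when the next
    token begins with an Arabic letter (no leading space).
    """
    out = []
    for i, tok in enumerate(tokens):
        nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
        continues = bool(tok) and bool(nxt) and is_arabic_letter(nxt[0])
        joins = (
            continues
            and is_arabic_letter(tok[-1])
            and tok[-1] != TATWEEL
            and tok[-1] not in NON_FORWARD_JOINING
        )
        out.append(tok + TATWEEL if joins else tok)
    return out
```

With this sketch, `add_tatweel(["ال", "كتاب"])` turns the first token into "الـ", while a token ending in a letter such as "ر" is left unchanged because that letter does not connect forward.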
-
I have created an issue in eccojs to track this: jalammar/eccojs#2
I suspect the character issue is related to BPE. "لبلورة" becomes the two tokens "لب�" and "�ورة" because BPE split the word partway through the bytes that make up the middle لام. I found an issue discussing this: huggingface/tokenizers#508
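The effect can be reproduced without any tokenizer. The sketch below is not the BPE algorithm; it simply cuts the UTF-8 bytes of the word at a hand-picked offset (an assumption chosen to land inside the second لام) and decodes each half on its own.

```python
word = "لبلورة"
raw = word.encode("utf-8")  # each Arabic letter here takes 2 bytes

# Assumed split point: byte 5 falls between the two bytes of the second "ل".
left_bytes, right_bytes = raw[:5], raw[5:]

# Decoding each piece separately turns the orphaned byte into U+FFFD.
left = left_bytes.decode("utf-8", errors="replace")
right = right_bytes.decode("utf-8", errors="replace")
print(left, right)

# Concatenating the bytes before decoding recovers the original word.
restored = (left_bytes + right_bytes).decode("utf-8")
```

This suggests the pieces are only broken when decoded one token at a time; decoding the combined bytes gives back the intact word.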
-
Hey Jay,
Thank you for this great library.
I was trying to make Ecco work for Arabic with aragpt2. I had to bump the transformers version to 4.2.0 and add my own custom GPT-2 classes, but I couldn't figure out how to correctly display the text in a right-to-left direction on Colab.
This is the link to my fork:
https://github.com/WissamAntoun/ecco-aragpt2
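For reference, one generic way to get right-to-left rendering in a notebook is to wrap the output in an element with the HTML `dir="rtl"` attribute. The sketch below is not Ecco's or eccojs's API; the function name `rtl_html` and the `token` class name are hypothetical.

```python
from html import escape


def rtl_html(tokens: list[str]) -> str:
    """Wrap tokens in a container that the browser lays out right to left.

    Hypothetical helper for illustration only.
    """
    spans = "".join(
        f'<span class="token">{escape(tok)}</span>' for tok in tokens
    )
    # dir="rtl" sets the base text direction for everything inside the div.
    return f'<div dir="rtl" style="text-align: right;">{spans}</div>'
```

In Colab or Jupyter, the returned string could be passed to `IPython.display.HTML` to render it.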