Hi, I have been trying to get SeqIO to work with Hugging Face tokenizers for a bit, but I have been running into trouble with non-T5-based tokenizers. Specifically, it seems that, because they are not SentencePiece tokenizers, tokenizers for models such as GPT-2 are incompatible with SeqIO's `SentencePieceVocabulary`, as they only have the vocab files:
Hey, I'm about to implement this in the near future (and hopefully make a pull request). Specifically for the GPT-2 tokenizer, but it doesn't really matter.
Are there any pitfalls I should look out for?
Is there a currently supported way to use these tokenizers with SeqIO? Or would I need to make my own vocab class?
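For context, a custom vocab class would essentially be an adapter from the Hugging Face tokenizer API to SeqIO's `Vocabulary` interface. Below is a minimal, hypothetical sketch of that shape: the class name `HFVocabulary` and the toy whitespace tokenizer are illustrative inventions (not part of either library) so the example is self-contained. A real implementation would subclass `seqio.Vocabulary`, wrap e.g. `transformers.GPT2Tokenizer`, and also implement the TensorFlow-graph methods (`_encode_tf`/`_decode_tf`), which are omitted here.

```python
class ToyTokenizer:
    """Stand-in for a Hugging Face tokenizer (naive whitespace splitting)."""

    def __init__(self, vocab):
        self.token_to_id = {tok: i for i, tok in enumerate(vocab)}
        self.id_to_token = {i: tok for tok, i in self.token_to_id.items()}
        self.unk_token_id = len(vocab)       # id assigned to unknown tokens
        self.eos_token_id = len(vocab) + 1   # id marking end of sequence

    def encode(self, text):
        return [self.token_to_id.get(t, self.unk_token_id)
                for t in text.split()]

    def decode(self, ids):
        return " ".join(self.id_to_token.get(i, "<unk>") for i in ids)


class HFVocabulary:
    """Hypothetical adapter exposing (a subset of) seqio.Vocabulary's
    interface — _encode, _decode, eos_id, unk_id, _base_vocab_size —
    over a Hugging Face-style tokenizer object."""

    def __init__(self, tokenizer):
        self._tokenizer = tokenizer

    @property
    def eos_id(self):
        return self._tokenizer.eos_token_id

    @property
    def unk_id(self):
        return self._tokenizer.unk_token_id

    @property
    def _base_vocab_size(self):
        # Largest id + 1; a real wrapper would use tokenizer.vocab_size
        # plus any added special tokens.
        return self._tokenizer.eos_token_id + 1

    def _encode(self, s):
        return self._tokenizer.encode(s)

    def _decode(self, ids):
        return self._tokenizer.decode(ids)


vocab = HFVocabulary(ToyTokenizer(["hello", "world"]))
print(vocab._encode("hello world"))  # [0, 1]
print(vocab._decode([0, 1]))         # hello world
```

The main design question is the TF side: SeqIO calls the `_encode_tf`/`_decode_tf` variants inside `tf.data` pipelines, so the Hugging Face tokenizer would need to be wrapped in something like `tf.py_function`, which is likely where most of the pitfalls live.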