Flutter Tokenizer for NLP models
Make sure to call init before using the tokenizer:
await FTokenizer.init();
and to call dispose when you are done with it:
FTokenizer.dispose();
If you use FTokenizer inside an Isolate, make sure to call await FTokenizer.init(); at the start of the Isolate and FTokenizer.dispose(); before closing it.
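The Isolate lifecycle described above could be sketched roughly as follows. This is only an illustration: the import path and the function name runTokenizerJob are assumptions, not taken from the package, and the tokenization call itself is left as a placeholder comment. Only FTokenizer.init() and FTokenizer.dispose() come from this README. Isolate.run requires Dart 2.19 or later.

```dart
import 'dart:isolate';

// Assumption: this import path is a guess at the package's library name.
// Adjust it to match the dependency declared in your pubspec.yaml.
import 'package:f_tokenizer/f_tokenizer.dart';

/// Hypothetical helper that runs tokenizer work in a background Isolate.
Future<void> runTokenizerJob() {
  return Isolate.run(() async {
    // Initialize at the very start of the Isolate.
    await FTokenizer.init();
    try {
      // Placeholder: perform the tokenization work here.
    } finally {
      // Always dispose before the Isolate closes, even if the work throws.
      FTokenizer.dispose();
    }
  });
}
```

Wrapping the work in try/finally is a design choice of this sketch: it ensures dispose runs even when the tokenization step fails.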
FTokenizer uses rust_tokenizer. See the rust_tokenizer description:
Rust-tokenizer is a drop-in replacement for the tokenization methods from the Transformers library. It includes a broad range of tokenizers for state-of-the-art transformers architectures, including:
Sentence Piece (unigram model)
Sentence Piece (BPE model)
BERT
ALBERT
DistilBERT
RoBERTa
GPT
GPT2
ProphetNet
CTRL
Pegasus
MBart50
M2M100
NLLB
DeBERTa
DeBERTa (v2)
The wordpiece-based tokenizers include both single-threaded and multi-threaded processing. The Byte-Pair-Encoding tokenizers favor the use of a shared cache and are only available as single-threaded tokenizers.
Using the tokenizers requires manually downloading the files they need (vocabulary or merge files). These can be found in the Transformers library.