Training Data of Tokenizer #27

zheedong · 2024-02-29T04:20:46Z

Thanks for your great work.

In the paper, you say that the training data of the tokenizer is 'CC3M, Unsplash, LAION-COCO, MS-COCO'. Did you use all four datasets in full, or did you apply some filtering? What is the total amount of data used for tokenizer training?

Also, did you use the same training dataset in stage 1 and stage 2 of tokenizer training?

geyuying · 2024-02-29T11:50:09Z

Yes, we use all of 'CC3M, Unsplash, LAION-COCO, MS-COCO' to train the tokenizer in both stage 1 and stage 2. The total amount of training data is almost 500M.

zheedong · 2024-03-04T04:24:26Z

I looked at your code, but I cannot find the configs for the training dataset. Can you give me more details about it? How many epochs did you train for?
