Thanks for your great work.

In the paper, you say that the training data for the tokenizer is 'CC3M, Unsplash, LAION-COCO, MS-COCO'. Did you use the full combination of those four datasets, or did you apply some filtering? What is the total amount of training data used in tokenizer training?

And did you use the same training dataset in stage 1 and stage 2 of tokenizer training?
Yes, we used the full combination of CC3M, Unsplash, LAION-COCO, and MS-COCO to train the tokenizer, in both stage 1 and stage 2. The total amount of training data is almost 500M samples.
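For anyone reproducing this setup, a minimal sketch of pooling the four sources into one training set, assuming PyTorch-style map datasets. The per-source wrapper classes (`CC3MDataset`, etc.) are hypothetical placeholders; the thread does not describe the actual loaders or any filtering.

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader

# Hypothetical per-source wrappers, each yielding (image, caption) pairs.
# How each source is loaded/filtered is not specified in the thread.
from my_data import CC3MDataset, UnsplashDataset, LAIONCOCODataset, MSCOCODataset

# Concatenate all four sources into a single pool (~500M samples total,
# per the author's reply). The same pool is reused for stage 1 and stage 2
# of tokenizer training.
tokenizer_train_set = ConcatDataset([
    CC3MDataset(),        # ~3M image-text pairs
    UnsplashDataset(),
    LAIONCOCODataset(),   # by far the largest source in this mix
    MSCOCODataset(),
])

loader = DataLoader(
    tokenizer_train_set,
    batch_size=256,
    shuffle=True,       # shuffling mixes samples across the four sources
    num_workers=8,
)
```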