Training Data of Tokenizer #27

Open
zheedong opened this issue Feb 29, 2024 · 2 comments
Comments

@zheedong

zheedong commented Feb 29, 2024

Thanks for your great work.

In the paper, you say that the training data for the tokenizer is 'CC3M, Unsplash, LAION-COCO, MS-COCO'. Did you use all four datasets in full, or did you apply some filtering? What is the total amount of training data used for tokenizer training?

Also, did you use the same training data in stage 1 and stage 2 of tokenizer training?

@geyuying
Collaborator

Yes, we use all of 'CC3M, Unsplash, LAION-COCO, MS-COCO' to train the tokenizer in both stage 1 and stage 2. The total amount of training data is almost 500M.

@zheedong
Author

zheedong commented Mar 4, 2024

I looked at your code, but I cannot find any configs for the training dataset. Can you share more details about it? How many epochs did you train for?
