Add web corpus and pre-trained models #6
This is super-large; we need a new store for it: http://commoncrawl.org/2017/02/january-2017-crawl-archive-now-available/ (these are the January 2017 dump sizes).
What does "super-large" mean? Can you be more specific? EDIT: If I'm reading the article correctly, we seem to need 8.97 TiB for the 57,800 files in WET (plaintext) format. Is that right?
@piskvorky Not quite: 8.97 TiB for 57,800 *compressed* WET files. Moreover, that figure is for a year-old dump (the current dump is bigger). By "super large" I mean significantly more than what we currently host. Besides needing a different repository for "super large" files, we would also have to implement resumable downloads (which is rather difficult).
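For reference, a minimal sketch of what resumable downloads could look like, using HTTP Range requests (the URL below is only an example of the Common Crawl file layout, and none of this is the implementation used by gensim-data):

```python
# Sketch: resume a partial download via an HTTP Range request.
import os
import requests

def resumable_download(url, dest, chunk_size=1 << 20):
    """Download `url` to `dest`, resuming from a partial file if one exists."""
    offset = os.path.getsize(dest) if os.path.exists(dest) else 0
    headers = {"Range": "bytes=%d-" % offset} if offset else {}
    with requests.get(url, headers=headers, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        # 206 = the server honoured the Range header and sent the remainder;
        # 200 = the server ignored it, so start the file from scratch.
        mode = "ab" if resp.status_code == 206 else "wb"
        with open(dest, mode) as fout:
            for chunk in resp.iter_content(chunk_size=chunk_size):
                fout.write(chunk)

# Example call (placeholder path from the January 2017 crawl):
# resumable_download(
#     "https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2017-04/wet.paths.gz",
#     "wet.paths.gz",
# )
```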
OK, this one seems to be a challenge :-) Maybe subsample?
@piskvorky Maybe that's a good idea, but what size should we choose for the sample, and how do we mark explicitly that it is a "sample"? Perhaps a "sample" prefix in the dataset name?
Yes. Size: probably a few GBs of bz2 plaintext or JSON. |
E.g. from Amazon's official Common Crawl dataset: https://aws.amazon.com/public-datasets/common-crawl/
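As a rough illustration of how such a sample could be carved out of the dump, here is a sketch that streams a couple of compressed WET files and writes the first few GB of plaintext into a single bz2 file. The input file names, output name, and size cap are placeholders, and the WARC-header skipping is simplistic; a real version would use a proper WARC parsing library:

```python
# Sketch: build a small "sample" dataset from Common Crawl WET files.
import bz2
import gzip

MAX_CHARS = 2 * 1024 ** 3  # stop after ~2 GB of uncompressed text (bz2 output is smaller)

def wet_plaintext(path):
    """Yield plaintext lines from a .wet.gz file, skipping WARC header lines."""
    in_headers = False
    with gzip.open(path, "rt", encoding="utf-8", errors="ignore") as fin:
        for line in fin:
            if line.startswith("WARC/1.0"):
                in_headers = True
                continue
            if in_headers:
                if not line.strip():  # blank line ends the WARC header block
                    in_headers = False
                continue
            yield line

written = 0
wet_files = ["example-00000.warc.wet.gz", "example-00001.warc.wet.gz"]  # placeholders
with bz2.open("webcorpus-sample.txt.bz2", "wt", encoding="utf-8") as fout:
    for wet_file in wet_files:
        for line in wet_plaintext(wet_file):
            written += fout.write(line)
            if written > MAX_CHARS:
                break
        if written > MAX_CHARS:
            break
```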
By the way, the "official" pre-trained GloVe vectors were trained on this. It would be interesting to compare them to other models trained on the same dataset (the "official" word2vec was trained on Google News, a different corpus, using completely different preprocessing, so it is not directly comparable).
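For the comparison itself, a sketch of what it could look like with gensim. The file names are assumed to be the standard GloVe Common Crawl and Google News word2vec downloads already present locally; `glove2word2vec` is only needed because the GloVe text format lacks the word2vec header line:

```python
# Sketch: compare pre-trained GloVe (Common Crawl) vectors with Google News
# word2vec vectors on the same query words.  File names are placeholders.
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec

# GloVe's text format has no "<vocab_size> <dim>" header, so convert it first.
glove2word2vec("glove.42B.300d.txt", "glove.42B.300d.w2v.txt")
glove_vectors = KeyedVectors.load_word2vec_format("glove.42B.300d.w2v.txt")

# Google News word2vec vectors (different corpus and preprocessing,
# so the comparison is only indicative).
w2v_vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

for word in ["king", "web", "crawl"]:
    print(word)
    print("  glove   :", glove_vectors.most_similar(word, topn=3))
    print("  word2vec:", w2v_vectors.most_similar(word, topn=3))
```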