
Add web corpus and pre-trained models #6

Open
piskvorky opened this issue Dec 2, 2017 · 6 comments

@piskvorky
Owner

E.g. from Amazon's official Common Crawl dataset: https://aws.amazon.com/public-datasets/common-crawl/

By the way, the "official" pre-trained GloVe vectors were trained on this corpus. It would be interesting to compare them to other models trained on the same dataset. (The "official" word2vec vectors were trained on Google News, a different corpus, using completely different preprocessing, so they're not directly comparable.)

@menshikh-iv
Contributor

This dump is super-large; we'd need a new storage backend for it: http://commoncrawl.org/2017/02/january-2017-crawl-archive-now-available/ (that page lists the size of the January 2017 dump).

@piskvorky
Owner Author

piskvorky commented Dec 18, 2017

What does "super-large" mean, can you be more specific?

EDIT: If I'm reading the article correctly, we seem to need 8.97 TiB for the 57800 files in WET (plaintext) format. Is that right?

@menshikh-iv
Contributor

@piskvorky not quite: it's 8.97 TiB for the 57,800 *compressed* WET files. And those figures are for a dump that is now a year old (current dumps are bigger).
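For a sense of scale, a quick back-of-the-envelope calculation from the figures quoted above (8.97 TiB across 57,800 compressed WET files) gives the average per-file size; these are rough estimates from the announcement, not exact values from the Common Crawl index:

```python
# Rough arithmetic on the January 2017 WET dump figures quoted above.
total_tib = 8.97
num_files = 57_800

total_mib = total_tib * 1024 * 1024      # TiB -> MiB
avg_file_mib = total_mib / num_files     # average compressed WET file size

print(f"~{avg_file_mib:.0f} MiB per compressed WET file on average")
```

So each compressed WET file is on the order of 160 MiB; the problem is purely the number of files, not any single one.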

By "super large" I mean significantly more than the current Wikipedia dump: we can add something up to ~10 GB, but anything bigger is really problematic.

Besides needing a different storage backend for "super large" files, we would also have to implement resumable downloads, which is rather difficult.

@piskvorky
Owner Author

OK, this one seems to be a challenge :-)

Maybe subsample?

@menshikh-iv
Contributor

@piskvorky that could be a good idea, but what size should we choose for the sample, and how do we mark explicitly that it is a sample? Perhaps a "sample" prefix in the dataset name?

@piskvorky
Owner Author

piskvorky commented Dec 20, 2017

Yes. Size: probably a few GBs of bz2-compressed plaintext or JSON.
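A subsample like that could be produced by randomly keeping a small fraction of documents until a size budget is hit. The sketch below is just one possible approach (the helper name, the keep-probability, and the budget are all assumptions, not an agreed design):

```python
import bz2
import random


def subsample_corpus(docs, out_path, budget_bytes, keep_prob=0.01, seed=42):
    """Write a random subsample of `docs` (an iterable of plaintext strings)
    to a bz2 file, keeping each document with probability `keep_prob` and
    stopping once roughly `budget_bytes` of uncompressed text is written.

    Returns the number of uncompressed bytes written.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    written = 0
    with bz2.open(out_path, "wt", encoding="utf-8") as out:
        for doc in docs:
            if rng.random() < keep_prob:
                out.write(doc + "\n")
                written += len(doc) + 1
                if written >= budget_bytes:
                    break
    return written
```

Fixing the seed makes the subsample reproducible, and the dataset name could carry the "sample" prefix suggested above so users can't mistake it for the full corpus.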
