Add web corpus and pre-trained models #6
This is super-large; we need a new store for it: http://commoncrawl.org/2017/02/january-2017-crawl-archive-now-available/ (these are the January 2017 dump sizes).
What does "super-large" mean? Can you be more specific? EDIT: If I'm reading the article correctly, we seem to need 8.97 TiB for the 57,800 files in WET (plaintext) format. Is that right?
@piskvorky Not quite: 8.97 TiB for 57,800 *compressed* WET files. Moreover, that figure is for a year-old dump (the current dump is bigger). By "super large" I mean significantly more than what we currently host. Besides needing a different repository for "super large" files, we would also have to implement resumable downloads (which is rather difficult).
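For reference, a minimal sketch of what resumable downloads could look like, using HTTP Range requests (the URL below is only an example of the Common Crawl file layout, and none of this is the implementation used by gensim-data):

```python
# Sketch: resume a partial download via an HTTP Range request.
import os
import requests

def resumable_download(url, dest, chunk_size=1 << 20):
    """Download `url` to `dest`, resuming from a partial file if one exists."""
    offset = os.path.getsize(dest) if os.path.exists(dest) else 0
    headers = {"Range": "bytes=%d-" % offset} if offset else {}
    with requests.get(url, headers=headers, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        # 206 = the server honoured the Range header and sent the remainder;
        # 200 = the server ignored it, so start the file from scratch.
        mode = "ab" if resp.status_code == 206 else "wb"
        with open(dest, mode) as fout:
            for chunk in resp.iter_content(chunk_size=chunk_size):
                fout.write(chunk)

# Example call (placeholder path from the January 2017 crawl):
# resumable_download(
#     "https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2017-04/wet.paths.gz",
#     "wet.paths.gz",
# )
```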
OK, this one seems to be a challenge :-) Maybe subsample?
@piskvorky Maybe that's a good idea, but what size should we choose for the sample, and how do we mark explicitly that it is a "sample"? Perhaps a "sample" prefix in the dataset name?
Yes. Size: probably a few GBs of bz2 plaintext or JSON. |
E.g. from Amazon's official Common Crawl dataset: https://aws.amazon.com/public-datasets/common-crawl/
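As a rough illustration of how such a sample could be carved out of the dump, here is a sketch that streams a couple of compressed WET files and writes the first few GB of plaintext into a single bz2 file. The input file names, output name, and size cap are placeholders, and the WARC-header skipping is simplistic; a real version would use a proper WARC parsing library:

```python
# Sketch: build a small "sample" dataset from Common Crawl WET files.
import bz2
import gzip

MAX_CHARS = 2 * 1024 ** 3  # stop after ~2 GB of uncompressed text (bz2 output is smaller)

def wet_plaintext(path):
    """Yield plaintext lines from a .wet.gz file, skipping WARC header lines."""
    in_headers = False
    with gzip.open(path, "rt", encoding="utf-8", errors="ignore") as fin:
        for line in fin:
            if line.startswith("WARC/1.0"):
                in_headers = True
                continue
            if in_headers:
                if not line.strip():  # blank line ends the WARC header block
                    in_headers = False
                continue
            yield line

written = 0
wet_files = ["example-00000.warc.wet.gz", "example-00001.warc.wet.gz"]  # placeholders
with bz2.open("webcorpus-sample.txt.bz2", "wt", encoding="utf-8") as fout:
    for wet_file in wet_files:
        for line in wet_plaintext(wet_file):
            written += fout.write(line)
            if written > MAX_CHARS:
                break
        if written > MAX_CHARS:
            break
```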
By the way, the "official" pre-trained GloVe vectors were trained on this. It would be interesting to compare them to other models trained on the same dataset (the "official" word2vec was trained on Google News, a different corpus, using completely different preprocessing, so it is not directly comparable).
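For the comparison itself, a sketch of what it could look like with gensim. The file names are assumed to be the standard GloVe Common Crawl and Google News word2vec downloads already present locally; `glove2word2vec` is only needed because the GloVe text format lacks the word2vec header line:

```python
# Sketch: compare pre-trained GloVe (Common Crawl) vectors with Google News
# word2vec vectors on the same query words.  File names are placeholders.
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec

# GloVe's text format has no "<vocab_size> <dim>" header, so convert it first.
glove2word2vec("glove.42B.300d.txt", "glove.42B.300d.w2v.txt")
glove_vectors = KeyedVectors.load_word2vec_format("glove.42B.300d.w2v.txt")

# Google News word2vec vectors (different corpus and preprocessing,
# so the comparison is only indicative).
w2v_vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

for word in ["king", "web", "crawl"]:
    print(word)
    print("  glove   :", glove_vectors.most_similar(word, topn=3))
    print("  word2vec:", w2v_vectors.most_similar(word, topn=3))
```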