Scrape lists.whatwg.org #270

foolip · 2019-12-18T15:05:11Z

No description provided.

This is the verbatim result of running the following:

```
wget --execute robots=off --mirror --page-requisites http://lists.whatwg.org/
wget --execute robots=off --mirror --page-requisites http://lists.whatwg.org/pipermail/commit-watchers-whatwg.org/
wget --execute robots=off --mirror --page-requisites http://lists.whatwg.org/pipermail/help-whatwg.org/
wget --execute robots=off --mirror --page-requisites http://lists.whatwg.org/pipermail/implementors-whatwg.org/
wget --execute robots=off --mirror --page-requisites http://lists.whatwg.org/pipermail/whatwg-whatwg.org/
```

The same commands were run in reverse order to confirm the same result, ruling out order dependence and silent network errors.
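One way to confirm that two runs produced the same result is a recursive diff of the two mirror directories. This is a minimal sketch, assuming each run was saved into its own directory; the `same_tree` helper and the directory names are hypothetical and not part of this PR.

```shell
# Hypothetical helper (not from the PR): succeeds only if the two directory
# trees contain the same file names with identical contents.
same_tree() {
    diff -r "$1" "$2" >/dev/null 2>&1
}

# Possible usage (directory names are assumptions):
#   same_tree run-forward/lists.whatwg.org run-reverse/lists.whatwg.org \
#       && echo "mirrors match"
```

A non-zero exit status means at least one file differs or exists in only one tree; running `diff -r` without the redirect lists which ones.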

The htdig.cgi URLs were crawled because other mails point to them, but every one of them returns 'Sought (htdig) archive file not found'.

Command used:

> find lists.whatwg.org -name '*.txt.gz' | xargs zcat | sed -n -e 's/^URL: <\(.*\)>$/\1/p' | xargs wget --force-directories

The command was run twice to confirm the same results.
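The same pipeline could be split so that the extracted URL list can be inspected before anything is downloaded. This is only a sketch: it assumes the archives contain lines of the form `URL: <...>`, and the `extract_urls` function name and the `urls.txt` file are hypothetical.

```shell
# Hypothetical helper: read decompressed archive text on stdin and print the
# target of every line of the form "URL: <...>", one URL per line.
extract_urls() {
    sed -n -e 's/^URL: <\(.*\)>$/\1/p'
}

# Possible usage (the file name urls.txt is an assumption):
#   find lists.whatwg.org -name '*.txt.gz' -exec zcat {} + | extract_urls | sort -u > urls.txt
#   wget --force-directories --input-file=urls.txt
```

Using `find -exec` rather than `xargs` sidesteps problems with unusual file names, and `sort -u` avoids requesting the same attachment more than once.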

foolip · 2019-12-18T15:11:41Z

@domenic r?

This might not be complete, but it's a good start from which to check if there's more that could be scraped that wget didn't catch.

foolip · 2019-12-18T15:24:45Z

Merging this would deploy it to marquee, as in it would be there on disk, but it wouldn't be served anywhere. But it would allow me to experiment before the DNS change.

foolip · 2019-12-18T15:36:50Z

I should also note that this makes the whole repo much bigger. whatwg/misc-server#107 is about fixing this, but if this repo would outlive moving static resources to a storage bucket, then it might not be a good idea to increase its size. If so, create a new repo just for this?

domenic

This looks to be about 3x larger than forums, which already exists and I am not particularly bothered by. So I think it's reasonable to keep it here. (We may want to remove the listing of all files from the rsync in the deploy script though? I dunno.)

We could probably remove admin.cgi/*

foolip · 2019-12-18T18:06:05Z

Some rough projections before it's too late... this PR adds ~10k emails, i.e. files named 0*.html, adding about 100MB. From whatwg/meta#153 (comment) I'd guess we'd add at least 40k more, so maybe 400MB more.
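The projection can be checked with a little shell arithmetic. This sketch uses only the rough figures quoted above (~10k emails, ~100MB, at least 40k more emails); the variable names are illustrative, and decimal units (1MB = 1000KB) are an assumption.

```shell
emails_now=10000     # ~10k emails added by this PR
size_now_mb=100      # ~100MB added by this PR
emails_more=40000    # guessed lower bound on additional emails

# Average size per email in KB, then the projected additional size in MB.
kb_per_email=$(( size_now_mb * 1000 / emails_now ))
extra_mb=$(( emails_more * kb_per_email / 1000 ))

echo "about ${kb_per_email}KB per email, about ${extra_mb}MB more"
```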

I dunno, maybe this repo is fine, but I think I'll do this work in a separate repo first at least until we need to deploy it the first time.

@domenic or do you think even blowing up the repo 10x doesn't matter?

domenic · 2019-12-18T18:08:15Z

I'm mostly just trying to compare to my existing experience with medium-sized repos like whatwg/html (with full history) and what we already have here. Anything similar to those seems OK.

foolip · 2019-12-18T19:06:57Z

OK, I'll scrape web.archive.org and see how big the result is, then it's easier to decide.

foolip · 2019-12-18T19:59:11Z

I'll work on this in https://github.com/foolip/whatwg.org/tree/scrape-lists, and will delete admin.cgi.

annevk · 2019-12-21T09:03:18Z

If we have all those email bodies, can we do some kind of automated search on W3C's infra to get the new permanent locations?

foolip · 2020-01-16T11:57:35Z

I think we could, but the tricky part is determining what the original lists.whatwg.org URLs for those bodies should be, since those are the URLs we'd be trying to revive or redirect.

foolip added 3 commits December 17, 2019 13:35

Remove htdig.cgi

bbded95

These were crawled because other mails point to the URLs, but they all say 'Sought (htdig) archive file not found'.

Scrape additional attachments from lists.whatwg.org

6a4bfeb

Command used:

> find lists.whatwg.org -name '*.txt.gz' | xargs zcat | sed -n -e 's/^URL: <\(.*\)>$/\1/p' | xargs wget --force-directories

The command was run twice to confirm the same results.

domenic approved these changes Dec 18, 2019


foolip closed this Dec 18, 2019

foolip deleted the scrape-lists branch December 18, 2019 19:09
