Scrape lists.whatwg.org #270
This is the verbatim result of running the following:

```
wget --execute robots=off --mirror --page-requisites http://lists.whatwg.org/
wget --execute robots=off --mirror --page-requisites http://lists.whatwg.org/pipermail/commit-watchers-whatwg.org/
wget --execute robots=off --mirror --page-requisites http://lists.whatwg.org/pipermail/help-whatwg.org/
wget --execute robots=off --mirror --page-requisites http://lists.whatwg.org/pipermail/implementors-whatwg.org/
wget --execute robots=off --mirror --page-requisites http://lists.whatwg.org/pipermail/whatwg-whatwg.org/
```

The same commands were run in reverse order to confirm the same result, ruling out order dependence and silent network errors.
These were crawled because other mails point to the URLs, but they all say 'Sought (htdig) archive file not found'.
Command used:

> find lists.whatwg.org -name '*.txt.gz' | xargs zcat | sed -n -e 's/^URL: <\(.*\)>$/\1/p' | xargs wget --force-directories

The command was run twice to confirm the same results.
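As a rough sketch of how that pipeline could be inspected before fetching anything, the URL extraction step can be pulled out into a small function and run as a dry run. The function name `extract_urls` and the `sort -u` de-duplication are assumptions added for illustration; they are not part of the command that was actually run.

```shell
#!/bin/sh
# Hypothetical helper: print the unique URLs found on "URL: <...>" lines
# of the text read from stdin. The sed expression is the one from the
# command above; sort -u is an added assumption to drop duplicates.
extract_urls() {
    sed -n -e 's/^URL: <\(.*\)>$/\1/p' | sort -u
}

# Dry run: count what would be fetched, without calling wget.
# Only runs if the mirrored directory is present.
if [ -d lists.whatwg.org ]; then
    find lists.whatwg.org -name '*.txt.gz' -print0 \
        | xargs -0 zcat \
        | extract_urls \
        | wc -l
fi
```

Piping the output of `extract_urls` into `xargs wget --force-directories` instead of `wc -l` would presumably fetch the same set of URLs as the original command, each one only once.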
@domenic r? This might not be complete, but it's a good start from which to check if there's more that could be scraped that wget didn't catch.
Merging this would deploy it to marquee, as in it would be there on disk, but it wouldn't be served anywhere. But it would allow me to experiment before the DNS change.
I should also note that this makes the whole repo much bigger. whatwg/misc-server#107 is about fixing this, but if this repo would outlive moving static resources to a storage bucket, then it might not be a good idea to increase its size. If so, create a new repo just for this?
This looks to be about 3x larger than forums, which already exists and I am not particularly bothered by. So I think it's reasonable to keep it here. (We may want to remove the listing of all files from the rsync in the deploy script though? I dunno.)
We could probably remove admin.cgi/*
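One way to see what removing `admin.cgi/*` would affect is to list the matching files first. This is a minimal sketch that assumes `admin.cgi` appears as a directory component somewhere under the mirrored `lists.whatwg.org` tree; the function name `list_admin_cgi` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: list regular files that sit under any admin.cgi/
# directory below the given root, sorted for stable output.
list_admin_cgi() {
    find "$1" -path '*/admin.cgi/*' -type f | sort
}

# Report how many files would be affected, if the mirror is present.
if [ -d lists.whatwg.org ]; then
    list_admin_cgi lists.whatwg.org | wc -l
fi
```

Reviewing the list before deleting anything keeps the removal limited to the scraped admin pages.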
Some rough projections before it's too late... this PR adds ~10k emails, i.e. files named I dunno, maybe this repo is fine, but I think I'll do this work in a separate repo first at least until we need to deploy it the first time. @domenic or do you think even blowing up the repo 10x doesn't matter?
I'm mostly just trying to compare to my existing experience with medium-sized repos like whatwg/html (with full history) and what we already have here. Anything similar to those seems OK.
OK, I'll scrape web.archive.org and see how big the result is, then it's easier to decide.
I'll work on this in https://github.com/foolip/whatwg.org/tree/scrape-lists, will delete admin.cgi.
If we have all those email bodies, can we do some kind of automated search on W3C's infra to get the new permanent locations?
I think we could, but the tricky part is determining what the original lists.whatwg.org URLs for those bodies should be, since those are the URLs we'd be trying to revive or redirect.
No description provided.