Scrape lists.whatwg.org #270

Closed
wants to merge 3 commits into from

Member

foolip commented Dec 18, 2019

No description provided.

This is the verbatim result of running the following:
```
wget --execute robots=off --mirror --page-requisites http://lists.whatwg.org/
wget --execute robots=off --mirror --page-requisites http://lists.whatwg.org/pipermail/commit-watchers-whatwg.org/
wget --execute robots=off --mirror --page-requisites http://lists.whatwg.org/pipermail/help-whatwg.org/
wget --execute robots=off --mirror --page-requisites http://lists.whatwg.org/pipermail/implementors-whatwg.org/
wget --execute robots=off --mirror --page-requisites http://lists.whatwg.org/pipermail/whatwg-whatwg.org/
```
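The five invocations differ only in the URL, so they can be generated from a list of archive names. The following is a minimal sketch, not the script that was actually used: the function name `mirror_commands` is hypothetical, and it only prints the commands rather than running them, so the crawl can be reviewed first.

```shell
# Print (do not run) the wget commands listed above.
# The flags and URLs are copied from the commands above;
# the function name is made up for this sketch.
mirror_commands() {
  base="http://lists.whatwg.org"
  flags="--execute robots=off --mirror --page-requisites"
  # The site root first, then each Pipermail archive.
  echo "wget $flags $base/"
  for list in commit-watchers help implementors whatwg; do
    echo "wget $flags $base/pipermail/${list}-whatwg.org/"
  done
}

mirror_commands
```

Piping the output to `sh` would run the crawl in the listed order.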

The same commands were run in reverse order to confirm the same
result, ruling out order dependence and silent network errors.

These were crawled because other mails point to the URLs, but they all
say 'Sought (htdig) archive file not found'.

Command used:

> find lists.whatwg.org -name '*.txt.gz' | xargs zcat | sed -n -e 's/^URL: <\(.*\)>$/\1/p' | xargs wget --force-directories

The command was run twice to confirm the same results.
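To make the `sed` stage of that pipeline easier to follow, here is a small sketch that isolates it. The expression is the one from the command above; the wrapper name `extract_urls` and the sample input are assumptions for illustration only.

```shell
# Keep only lines that consist entirely of "URL: <...>" and print
# the part between the angle brackets. Same sed expression as in the
# pipeline above; the function name is hypothetical.
extract_urls() {
  sed -n -e 's/^URL: <\(.*\)>$/\1/p'
}

# Made-up sample input: only the second line matches.
printf '%s\n' \
  'Some message text' \
  'URL: <http://lists.whatwg.org/example/attachment.txt>' |
  extract_urls
```

Lines where `URL:` does not start the line are dropped, so URLs mentioned mid-sentence in a mail body are not fetched.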
Member Author

foolip commented Dec 18, 2019

@domenic r?

This might not be complete, but it's a good starting point for checking whether there's more to scrape that wget didn't catch.

Member Author

foolip commented Dec 18, 2019

Merging this would deploy it to marquee, in the sense that the files would be on disk but not served anywhere. It would still allow me to experiment before the DNS change.

Member Author

foolip commented Dec 18, 2019

I should also note that this makes the whole repo much bigger. whatwg/misc-server#107 is about fixing this, but if this repo would outlive moving static resources to a storage bucket, then it might not be a good idea to increase its size. If so, create a new repo just for this?

Member

domenic left a comment

This looks to be about 3x larger than forums, which already exists and I am not particularly bothered by. So I think it's reasonable to keep it here. (We may want to remove the listing of all files from the rsync in the deploy script though? I dunno.)

We could probably remove admin.cgi/*
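A cautious way to act on this is to list what would be removed before deleting anything. The sketch below assumes the scraped tree lives under a directory such as `lists.whatwg.org` and that the crawled paths are literally named `admin.cgi`; both are assumptions, and the function name `list_admin_cgi` is hypothetical.

```shell
# List every path named admin.cgi under the given root without
# descending into it, so the result can be reviewed before removal.
# $1 = root of the scraped tree (assumed layout, see above).
list_admin_cgi() {
  find "$1" -name 'admin.cgi' -prune -print
}

# Only run against the tree if it exists in the current directory.
if [ -d lists.whatwg.org ]; then
  list_admin_cgi lists.whatwg.org
fi
```

Once the listing looks right, each printed path could be removed with `rm -r`.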

Member Author

foolip commented Dec 18, 2019

Some rough projections before it's too late... this PR adds ~10k emails, i.e. files named 0*.html, adding about 100MB. From whatwg/meta#153 (comment) I'd guess we'd add at least 40k more, so maybe 400MB more.
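The projection is a simple linear scaling: about 100MB for about 10k emails is roughly 10KB per email, so 40k more emails come to about 400MB. As a sketch of that arithmetic (the function name `project_mb` is made up here, and it assumes the average size per email stays the same):

```shell
# Scale an observed size linearly to a larger number of emails.
# $1 = MB observed, $2 = emails observed, $3 = additional emails expected
project_mb() {
  echo $(( $1 * $3 / $2 ))
}

project_mb 100 10000 40000   # prints 400
```

This uses integer arithmetic, which is precise enough for a rough estimate like this one.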

I dunno, maybe this repo is fine, but I think I'll do this work in a separate repo first at least until we need to deploy it the first time.

@domenic or do you think even blowing up the repo 10x doesn't matter?

Member

domenic commented Dec 18, 2019

I'm mostly just trying to compare to my existing experience with medium-sized repos like whatwg/html (with full history) and what we already have here. Anything similar to those seems OK.

Member Author

foolip commented Dec 18, 2019

OK, I'll scrape web.archive.org and see how big the result is, then it's easier to decide.

foolip closed this Dec 18, 2019
foolip deleted the scrape-lists branch December 18, 2019 19:09

Member Author

foolip commented Dec 18, 2019

I'll work on this in https://github.com/foolip/whatwg.org/tree/scrape-lists, and will delete admin.cgi.

Member

annevk commented Dec 21, 2019

If we have all those email bodies, can we do some kind of automated search on W3C's infra to get the new permanent locations?

Member Author

foolip commented Jan 16, 2020

I think we could, but the tricky part is determining what the original lists.whatwg.org URLs for those bodies should be, since those are the URLs we'd be trying to revive or redirect.

3 participants