Move ever-growing *.spec.whatwg.org storage off of the VM disk #107
Comments
To clarify, request forwarding is a backend matter and does not involve redirects?
DigitalOcean Spaces doesn't support serving a website from it directly, but this is tracked in https://ideas.digitalocean.com/ideas/DO-I-318. The smallest change that would work is to let nginx continue to handle redirects, and to proxy requests that don't redirect to an internal Spaces endpoint. Spaces wouldn't itself ever respond with a redirect, at least not until https://ideas.digitalocean.com/ideas/DO-I-318 is fixed. For all of the static sites, I think our requirements come down to keeping every end-user-visible URL working exactly as it does today, redirects included.
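A minimal sketch of that split, assuming a hypothetical bucket name and endpoint (none of this is the actual configuration):

```nginx
server {
    server_name html.spec.whatwg.org;

    # Redirects keep living in nginx; this rule is purely illustrative.
    location = /index { return 301 /; }

    # Everything that doesn't redirect is proxied to the bucket.
    # The bucket name and region are made up for this sketch.
    location / {
        proxy_set_header Host whatwg-html.ams3.digitaloceanspaces.com;
        proxy_ssl_server_name on;
        proxy_pass https://whatwg-html.ams3.digitaloceanspaces.com;
    }
}
```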
The most elaborate redirect rules are in https://github.com/whatwg/misc-server/blob/master/debian/marquee/nginx/sites/whatwg.org.conf.
Sorry, to restate my question: will our end-user-visible response URLs remain unchanged?
Yes, of course; I'd just rule out any solution that doesn't give full control of the URL layout :)
Numbers in whatwg/meta#161 (comment) suggest that everything would easily fit in a Git repo, but you can't serve a website from a repo, so that doesn't solve everything here.
Hijacking this issue to drop some notes about using a CDN, which isn't the same problem as running out of disk space...

Some numbers based on using …

I'm not sure about our numbers; I'm pretty sure they're the compressed size, but we're not using 30 × 872 GiB ≈ 26 TiB of transfer per month, more like 4-5 TiB. So this analysis is probably all wrong :)
It looks like https://www.digitalocean.com/products/app-platform/ could be something to look into for this. From a cursory look, it seems more like App Engine, in that it supports Node.js and other languages as well as static content, and you don't manage the servers yourself.
I have looked into using DigitalOcean Spaces with nginx in front, proxying requests through to the bucket. The main problem this runs into is that an S3-like storage bucket is just a set of named objects whose names are paths; it's not a file system. The following can't be done in the usual way and need some other solution:

- Serving index.html when a "directory" URL with a trailing slash is requested.
- Redirecting "directory" URLs without a trailing slash to add the slash.
I think that if the first problem could be solved, then the second can be done with a rewrite rule, along the lines of the sketch below.
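One way nginx in front could paper over both problems; this is a sketch under assumptions (a made-up bucket endpoint, and a file-extension heuristic for detecting "directories"), not a tested configuration:

```nginx
location / {
    # Second problem: add the trailing slash. Treating any path
    # without a dot as a "directory" is a heuristic, not reliable.
    rewrite ^([^.]*[^/])$ $1/ permanent;

    # First problem: map "directory" URLs onto the index.html
    # objects stored in the bucket, since it has no real directories.
    rewrite ^(.*)/$ $1/index.html break;

    # Hypothetical bucket endpoint.
    proxy_set_header Host whatwg-html.ams3.digitaloceanspaces.com;
    proxy_pass https://whatwg-html.ams3.digitaloceanspaces.com;
}
```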
It looks like DigitalOcean Spaces is maybe particularly bad at this: S3 has a whole "website hosting mode", see e.g. their docs on index.html files, whereas https://www.digitalocean.com/community/questions/spaces-set-index-html-as-default-landing-page seems to have seen no activity. Maybe using S3 (which we already do for PR preview) would be the right way to go here?
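For comparison, enabling that mode on S3 is a one-liner with the AWS CLI (the bucket name is made up for this sketch):

```sh
# Serve index.html as the index document for "directory" requests.
aws s3 website s3://whatwg-html/ --index-document index.html
```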
Hmm, I hadn't considered just using AWS S3, but that would probably solve most of this. What's not great about it is that we'd depend on both DigitalOcean and S3 being healthy at all times. What mystifies me is that neither S3 nor Spaces seems to have a way to set a redirect.
S3 has a complicated system: https://docs.aws.amazon.com/AmazonS3/latest/userguide/how-to-page-redirect.html. It is a bit mystifying why they don't allow something simpler. E.g. the most flexible option, the JSON routing rules, is capped at 50. And the per-object redirect doesn't seem to let you choose the status code.
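For reference, one of those JSON routing rules looks roughly like this (shape per the linked docs, applied with `aws s3api put-bucket-website`; the prefixes are placeholders):

```json
{
  "RoutingRules": [
    {
      "Condition": { "KeyPrefixEquals": "old-prefix/" },
      "Redirect": {
        "ReplaceKeyPrefixWith": "new-prefix/",
        "HttpRedirectCode": "301"
      }
    }
  ]
}
```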
Probably a bad idea to diversify even further, but there's also Netlify, which has very straightforward redirect support via a `_redirects` file.
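Netlify's `_redirects` format is one rule per line: source path, target, and status code (these paths are made up):

```
/old-page              /new-page            301
/commit-snapshots/*    /snapshots/:splat    302
```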
If we could put objects in the bucket which the nginx front end turns into a redirect that adds the slash, then I think we'd be set. (We'd also need to generate file listings, but that could be a deploy step; not too hard, I think.)

@domenic, do you know if S3, when hosting a static web site, will redirect "directories" with no trailing slash to add a slash?

One option we could look into is "deprecating" URLs with a trailing slash and writing redirect rules for the ones we currently have. But I don't love having to muck around with our URLs just because we're changing the storage solution.
From https://docs.aws.amazon.com/AmazonS3/latest/userguide/IndexDocumentSupport.html:
So, it sounds like it will 302 redirect them. That appears to be similar to what we have today (e.g. …).
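In other words, the exchange would look something like this (a sketch with a made-up bucket endpoint):

```
GET /multipage HTTP/1.1
Host: whatwg-html.s3-website-us-east-1.amazonaws.com

HTTP/1.1 302 Found
Location: /multipage/
```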
https://github.com/aws-samples/amazon-cloudfront-secure-static-site looks fairly promising for this.
I won't be able to make time for WHATWG infra work this year, so here's a brain dump.

The /var/www/html.spec.whatwg.org/ directory on marquee is 29 GB; that's the biggest problem in any migration. As a Git repository it's 6 GB, so that rules out any solution of the shape "put everything in Git and deploy on every commit". That's unfortunate, because there are many options of that shape.

A solution would take the shape of a storage bucket which deploys write into, and a frontend/CDN that just serves from that bucket. The hard part is preserving all of our redirects; I've seen no storage bucket with built-in redirect support that's expressive enough. (S3 has some support, but not enough.) We would need something like https://developers.cloudflare.com/rules/url-forwarding/bulk-redirects/reference/csv-file-format/, I think.

This problem ought to be easy for someone who has experience maintaining large websites and migrating between hosting... if they were meticulous about preserving redirects. That's all.
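That CSV format is, in essence, one redirect per line: source URL, target URL, and status code (the exact column set is in the linked reference; these URLs are placeholders):

```
https://whatwg.org/old-page,https://whatwg.org/new-page,301
https://whatwg.org/specs/web-apps/,https://html.spec.whatwg.org/,301
```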
This week marquee, which hosts all static whatwg.org sites, grew its disk usage past 80% of its 30 GB and triggered an alert. I've increased the size to 50 GB for now.
The constant increase is because of commit snapshots. We could compress on disk or deduplicate more, but usage would still grow slowly and indefinitely. We shouldn't store these files on a fixed-size block device, but in an object store where there is no fixed upper limit.
DigitalOcean Spaces is a solution we could use, by letting nginx forward requests to it.
However, if all requests still hit nginx, we wouldn't be making full use of a solution like this. Spaces has a CDN feature with certificate handling, but it requires control over the DNS and is thus blocked by #75.