-
Notifications
You must be signed in to change notification settings - Fork 42
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Two 200 before the 304 #138
Comments
I assume you mean the following: JIT issues two requests to the remote server and waits for the responses before managing to send a 304 response. Right? If yes, do you mean that this has been introduced in 2.0? Or is it the same in 1.7? BTW: I just discovered a bug with JIT 1.7 when I tried to test the reponses using cURL. (I don't know for sure if JIT 2.0 has the same behaviour.) The "If-Modified-Since" logic is not working properly. I only get a 304 response if the value of the
I wouldn't care about this, because HEAD requests are standardized, AFAIK, and should work with any HTTP server. As you said, you can't rely on getting any cache headers from the remote server, (but that is the same with 200 responses). Apart from that, none of your solutions sounds good to me. "Randomly poking" a remote server is not a clear solution, and it scales terribly. If you have some thousand external images in the cache, you will generate a lot of requests/traffic even if you don't need to serve a single images to a client (because it is used in old, archived articles, for example). This might mean less traffic than with the current behaviour, but it might as well be the other way 'round, depending on the percentage (and frequency) of cached images being actually served by JIT. More questions:
If the latter is the case, there may be reasons for this (for example: servers not sending cache headers, as you mentioned above).
Agreed. I once tested putting all my image files to Amazon AWS and serving them via JIT. Yes, this was significantly slower, you could see that in your browser when loading a page (without browser cache, of course). Using CloudFront made it a lot faster and smoother (but still a bit slower than with local files). At that time I also thought about speeding things up with external images, but I couldn't find a proper solution. I think that checking if a file still exists is the least JIT should do even with external images. So I decided to go for local images (and a bigger hard drive on the server)... (The best idea I had was delivering the cache file first, then checking for the external file. This would solve any speed issues, but it would mean that you deliver a cache file for at least one more time if the original file has been removed. That felt wrong to me.) About ETags: I personally don't use them at all, I remove all of them (in Apache):
Last question: Does JIT really use the remote files's ETag and cache headers? (I don't know, because I even reset image cache headers in Apache.) Sorry for posting a lot of questions and no solutions. But it's a delicate matter (and I love JIT). |
Same here !
You are right, but this is not what I've met: The request are those handle on the server where jit runs. I do not know if it's new to 2.0 (but it really looks like it)
I know! That the origin of my work on that extension haha!
Totally.
Right now, it's when there is no cache. When a cache file is created, it lasts forever, without ever checking if the url got modified. I was trying to solve #136 and stumble upon this for external images.
Indeed. But how can you choose to check if you cache is still valid ?
Not at all. Only the Last Modified value is used. And then we calculate a etag based on that value and the filename. |
Sorry, I still don't get it. Do you mean that JIT itself sends 3 answers to a single request??? Can you explain in simple words what is going on (on "Remote server", "JIT server" and "Client")?
So this has been broken in JIT 2? I am rather sure that JIT 1.7 sends a HEAD request to the remote server before delivering the cache file. (I vaguely remember that I myself implemented this a long time ago.)
By sending a request to the remote server (using "If-Modifed since"), see above.
Ah, I see. It's unclear to me why ETags have been implemented at all. They don't add any value in this scenario. |
Haha! Forget the remote server. The remote server always send 200 responses since we never sent it any cache headers.
Yes!
Yeah good idea, but that was never implemented. We do not store the original date from the remote server's response. Could we use the cache file's time ?
Indeed! We should only use the date, parse it into a timestamp and check if it's earlier than, not strictly equals. |
Let's assume that the JIT response to the client is:
Ah, I see. It should be different, of course. If it is the same client, it will send
You are right. We only check if the file is still available under the given URL. (Which is probably sufficient for services like Flickr, for example, which have unique filenames.) My idea would make things more complicated, so I am not sure if it's a good one. :-)
Yep. |
Cool. 2.0 does not :( But only for external images. The big question is: how can we send the right Date the first time ? And all other times too and be able to invalidate it... |
I knew you and I would have fun with this when I posted it yesterday night ;) |
:-)
AFAIK, you don't need to send any date with a 304 response. (As far as I see, JIT sets the header anyway, but either my Apache or my nginx frontend server seem to remove it if the status code is 304.) Anyway, for external images, JIT 1.7 will always do a HEAD request to the remote server; this is the important line: $last_modified = strtotime(Image::getHttpHeaderFieldValue($image_path, 'Last-Modified')); A bit later: if($_SERVER['HTTP_IF_MODIFIED_SINCE'] == $last_modified_gmt || str_replace('"', NULL, stripslashes($_SERVER['HTTP_IF_NONE_MATCH'])) == $etag){
... |
Apache does it. nginx most likely also does it. It's in the spec: https://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html#sec10.3.5 This link is also in our codebase ;)
Indeed. That's what I am trying to remove while not breaking things as I currently did. By caching the file on the spot, we could avoid using the data from the remote server but we must still be able to avoid having an infinite cache. A proxy would only use the max-age won't it ? |
Aha, so we get to the point, finally! :-)) This functionality has been implemented on purpose in order to no longer deliver a cache file if the original (external) file no longer exists. I think it is a bad idea to avoid the HEAD request, because in this case you will serve the cache file (at least for a while) although the original file has been removed from the interwebs. |
I totally agree with you. But right now, this HEAD request causes problems since it loads the network card with request under high load. It's funny because there was lots of code to avoid serving bad cache for external resources, but we would gladly sent bad cache for our local files ;) |
I'd rather send invalid cache file and be able to support more requests |
AFAIK, we don't serve a cache file if the original local file has been removed (which is the topic here).
I understand that this might be desired in certain systems. But not always. So what about making it optional? Just adding an option like "Don't check if external images still exist". The simplest implementation would deliver the cache file forever. A more advanced — like mentioned above — might deliver the cache file first, then do the HEAD request (and remove the cache file if the original file has gone away). |
No I was referring to #136 which states that we serve the cache even if the file changed.
Let me bring random back! Set the configurable factor to one, the HEAD request is always made, bad cache is never used. That would be the default since it's the old behaviour. Set it to 0.5 or even 0.1 to lessen the chances of issuing a HEAD request to validate the cache but may send an invalid response. Otherwise, we need to hit the Database or other storage to expire the cache on a url basis. Like I said, performance is key, so I think this is a no-go. |
Like you, I am not fond of involving the database. So yes, your proposal sounds like a solution. (Documentation is king, of course.) |
Great! |
The way we handled HEAD request was 1. Different from the GET request, since it lacked some checks 2. Way to general for our needs This commit refactors the old things with a single brand new function that will 1. Make sure the request is a 200 2. Return only cache related headers, since that's the goal Re #136 Cc #138 Cc #130 Cc #137
This commit cleans up the code and fixes a bunch of issues. * Cache validation There was a bug, #137, which would made the cache unusable with external images. This has been fixed. The function also makes sure to validate that the original file, either remote or local, has not changed since the last modification of the cache file itself. If the cache file is no longer valid, it gets deleted. This also fixes #136. * Last modified value There was a problem describe in #138 where a second request to a url with proper cache heading would still send a 200, because the Last-Modified value was read from a different source. This commit changes this by doing the following. 1. Always use a local mtime: either the original local file or our tempfile created from the http request. 2. If cache is enabled, always use the cache file mtime, even when the request creates the cache file. This insure repeatable results for all subsequent requests. This brings a side effect that is needed: before, we used to touch the cache file when found valid and we must need to stop this to be able to get repeatable results. --- Throughout the process, values are "exported" into the parameter array when certain condition are met. This insure that the values sent to the user will be the right one. Sorry for the big commit, I did my best to split things up :) Fixes #137 Fixes #136 Fixes #138
@michael-e Do not rush anything, just a reminder. I am not in a rush to release this, I'll wait for you. |
This commits adds a new hidden configuration value, `cache_invalidation_odds` which allows developer to control how often cache is validated. This is particularly usefull on stressed servers, and even better if serving remote images, since it can lead to a 60% request time diminution. The default behaviour is not changed and cache validation will always occur. It also updates the x-jit debug header to specify the result of the cache validation Fixes #138
@michael-e again, no rush, just wanted to let you know that my random thingny works really great. I'll deploy it in production tomorrow 👍 |
Cool. Tell us how it works in production! |
Still doing extensive testing, and it looks promising 👍 |
I have only done a short test (looking at the debug headers), and it seems to work "as advertised". But I guess in this case your production environment is a much better test! |
When fetching an external image, jit now issues 2 HTTP 200 responses before sending the 304. Here's why:
I wanted to be able to server the cache from the external server #130 and #136
The two request are shaped like this:
I can't seem to figure out how we can get rid of one or the other, but I think we should never do both, as it's the way right now. Should we:
The more I think about it, the more I want performance. The simplest thing would be to number 4 including the following:
Never cache external images that do not send proper cache headers ?See Proposition: Never cache external images that do not send proper cache headers #139The text was updated successfully, but these errors were encountered: