Two 200 before the 304 #138

nitriques · 2016-07-21T03:28:09Z

When fetching an external image, jit now issues 2 HTTP 200 responses before sending the 304. Here's why:

The modified time and etag are calculated with a 2nd http request to get the headers (I wonder why, more on that later)
I've added things to use the filemtime of the cache file, which then creates new last modified and etag (so our logic fails)
You then get a proper 304.

I wanted to be able to server the cache from the external server #130 and #136

The two request are shaped like this:

The first one requests the whole file, skipping the headers
The second one only request the headers. Based on libcurl's documentation, doing so sends a HEAD http request and should (if the remote server's smart enough) only sends headers...

I can't seem to figure out how we can get rid of one or the other, but I think we should never do both, as it's the way right now. Should we:

Do the HEAD request always, to make sure the cache is valid ? What about servers that do not return cache headers ?
Store this info in the DB, to be able to regenerate the original etag ?
Give up ? I mean, this behavior occurs only for the first person to hit the url, since all others will get the cache and never see the original response ?
Change our we cache external images ?

The more I think about it, the more I want performance. The simplest thing would be to number 4 including the following:

~~Never cache external images that do not send proper cache headers ?~~ See Proposition: Never cache external images that do not send proper cache headers #139
Always respond with the cache file's mtime and etag and hide the one from the origin ?
Randomly poke the remote server to validate the cache ?

michael-e · 2016-07-21T11:07:21Z

jit now issues 2 HTTP 200 responses before sending the 304

I assume you mean the following: JIT issues two requests to the remote server and waits for the responses before managing to send a 304 response. Right?

If yes, do you mean that this has been introduced in 2.0? Or is it the same in 1.7?

BTW: I just discovered a bug with JIT 1.7 when I tried to test the reponses using cURL. (I don't know for sure if JIT 2.0 has the same behaviour.) The "If-Modified-Since" logic is not working properly. I only get a 304 response if the value of the If-Modified-Since header matches the Last-Modified response header exactly. Now if I add some time to the If-Modified-Since value (asking for a later point in time), it should still be a 304, but it is not. I get a 200 response instead.

The second one only request the headers. Based on libcurl's documentation, doing so sends a HEAD http request and should (if the remote server's smart enough) only sends headers...

I wouldn't care about this, because HEAD requests are standardized, AFAIK, and should work with any HTTP server. As you said, you can't rely on getting any cache headers from the remote server, (but that is the same with 200 responses).

Apart from that, none of your solutions sounds good to me. "Randomly poking" a remote server is not a clear solution, and it scales terribly. If you have some thousand external images in the cache, you will generate a lot of requests/traffic even if you don't need to serve a single images to a client (because it is used in old, archived articles, for example). This might mean less traffic than with the current behaviour, but it might as well be the other way 'round, depending on the percentage (and frequency) of cached images being actually served by JIT.

More questions:

Are we sure that JIT sends two requests to the remote server?
Does this only occur once or with every request?

If the latter is the case, there may be reasons for this (for example: servers not sending cache headers, as you mentioned above).

The more I think about it, the more I want performance.

Agreed. I once tested putting all my image files to Amazon AWS and serving them via JIT. Yes, this was significantly slower, you could see that in your browser when loading a page (without browser cache, of course). Using CloudFront made it a lot faster and smoother (but still a bit slower than with local files). At that time I also thought about speeding things up with external images, but I couldn't find a proper solution. I think that checking if a file still exists is the least JIT should do even with external images. So I decided to go for local images (and a bigger hard drive on the server)...

(The best idea I had was delivering the cache file first, then checking for the external file. This would solve any speed issues, but it would mean that you deliver a cache file for at least one more time if the original file has been removed. That felt wrong to me.)

About ETags: I personally don't use them at all, I remove all of them (in Apache):

If you're not taking advantage of the flexible validation model that ETags provide, it's better to just remove the ETag altogether. The Last-Modified header validates based on the component's timestamp.
—https://developer.yahoo.com/blogs/ydn/high-performance-sites-rule-13-configure-etags-7211.html

Last question: Does JIT really use the remote files's ETag and cache headers? (I don't know, because I even reset image cache headers in Apache.)

Sorry for posting a lot of questions and no solutions. But it's a delicate matter (and I love JIT).

nitriques · 2016-07-21T14:45:04Z

But it's a delicate matter (and I love JIT).

Same here !

I assume you mean the following: JIT issues two requests to the remote server and waits for the responses before managing to send a 304 response. Right?

You are right, but this is not what I've met: The request are those handle on the server where jit runs. I do not know if it's new to 2.0 (but it really looks like it)

The "If-Modified-Since" logic is not working properly.

I know! That the origin of my work on that extension haha!

Are we sure that JIT sends two requests to the remote server?

Totally.

Does this only occur once or with every request?

Right now, it's when there is no cache. When a cache file is created, it lasts forever, without ever checking if the url got modified. I was trying to solve #136 and stumble upon this for external images.

"Randomly poking" a remote server is not a clear solution, and it scales terribly.

Indeed. But how can you choose to check if you cache is still valid ?

Does JIT really use the remote files's ETag and cache headers?

Not at all. Only the Last Modified value is used. And then we calculate a etag based on that value and the filename.

michael-e · 2016-07-21T14:58:52Z

You are right, but this is not what I've met: The request are those handle on the server where jit runs. I do not know if it's new to 2.0 (but it really looks like it)

Sorry, I still don't get it. Do you mean that JIT itself sends 3 answers to a single request??? Can you explain in simple words what is going on (on "Remote server", "JIT server" and "Client")?

When a cache file is created, it lasts forever, without ever checking if the url got modified.

So this has been broken in JIT 2? I am rather sure that JIT 1.7 sends a HEAD request to the remote server before delivering the cache file. (I vaguely remember that I myself implemented this a long time ago.)

"Randomly poking" a remote server is not a clear solution, and it scales terribly.

Indeed. But how can you choose to check if you cache is still valid ?

By sending a request to the remote server (using "If-Modifed since"), see above.

Does JIT really use the remote files's ETag and cache headers?

Not at all. Only the Last Modified value is used. And then we calculate a etag based on that value and the filename.

Ah, I see. It's unclear to me why ETags have been implemented at all. They don't add any value in this scenario.

nitriques · 2016-07-21T15:40:48Z

Can you explain in simple words

Haha!

Forget the remote server. The remote server always send 200 responses since we never sent it any cache headers.

First request: The server requests the external image, cache it and serve it as a 200.
Second request: The server gets the cache headers from the client, is not able to use it since the cache has a different last modified value than the one send in request number 1. This also returns a 200 (which I am trying to transform into a 304).
Third request: The 304 is properly sent.

So this has been broken in JIT 2? I am rather sure that JIT 1.7 sends a HEAD request to the remote server before delivering the cache file.

Yes!

By sending a request to the remote server (using "If-Modifed since")

Yeah good idea, but that was never implemented. We do not store the original date from the remote server's response. Could we use the cache file's time ?

Ah, I see. It's unclear to me why ETags have been implemented at all. They don't add any value in this scenario.

Indeed! We should only use the date, parse it into a timestamp and check if it's earlier than, not strictly equals.

michael-e · 2016-07-21T17:14:55Z

First request: The server requests the external image, cache it and serve it as a 200.

Let's assume that the JIT response to the client is: 200 and Last-Modified: Tue, 19 Jul 2016 10:43:29 GMT

Second request: The server gets the cache headers from the client, is not able to use it since the cache has a different last modified value than the one send in request number 1. This also returns a 200 (which I am trying to transform into a 304).

Ah, I see. It should be different, of course. If it is the same client, it will send If-Modified-Since: Tue, 19 Jul 2016 10:43:29 GMT. In this case JIT should send a 304 repsonse. (JIT 1.7 does this, according to my tests.)

By sending a request to the remote server (using "If-Modifed since")

Yeah good idea, but that was never implemented. We do not store the original date from the remote server's response. Could we use the cache file's time ?

You are right. We only check if the file is still available under the given URL. (Which is probably sufficient for services like Flickr, for example, which have unique filenames.) My idea would make things more complicated, so I am not sure if it's a good one. :-)

Indeed! We should only use the date, parse it into a timestamp and check if it's earlier than, not strictly equals.

Yep.

nitriques · 2016-07-21T17:35:49Z

JIT 1.7 does this, according to my tests.

Cool. 2.0 does not :( But only for external images. The big question is: how can we send the right Date the first time ? And all other times too and be able to invalidate it...

nitriques · 2016-07-21T17:36:11Z

I knew you and I would have fun with this when I posted it yesterday night ;)

michael-e · 2016-07-21T17:57:41Z

I knew you and I would have fun with this when I posted it yesterday night ;)

:-)

The big question is: how can we send the right Date the first time ? And all other times too and be able to invalidate it...

AFAIK, you don't need to send any date with a 304 response. (As far as I see, JIT sets the header anyway, but either my Apache or my nginx frontend server seem to remove it if the status code is 304.)

Anyway, for external images, JIT 1.7 will always do a HEAD request to the remote server; this is the important line:

$last_modified = strtotime(Image::getHttpHeaderFieldValue($image_path, 'Last-Modified'));

A bit later:

if($_SERVER['HTTP_IF_MODIFIED_SINCE'] == $last_modified_gmt || str_replace('"', NULL, stripslashes($_SERVER['HTTP_IF_NONE_MATCH'])) == $etag){
    ...

nitriques · 2016-07-21T18:14:33Z

my Apache or my nginx frontend server seem to remove it if the status code is 304.

Apache does it. nginx most likely also does it. It's in the spec: https://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html#sec10.3.5

This link is also in our codebase ;)

For external images, JIT 1.7 will always do a HEAD request

Indeed. That's what I am trying to remove while not breaking things as I currently did.

By caching the file on the spot, we could avoid using the data from the remote server but we must still be able to avoid having an infinite cache. A proxy would only use the max-age won't it ?

michael-e · 2016-07-21T18:37:29Z

That's what I am trying to remove while not breaking things as I currently did.

Aha, so we get to the point, finally! :-))

This functionality has been implemented on purpose in order to no longer deliver a cache file if the original (external) file no longer exists. I think it is a bad idea to avoid the HEAD request, because in this case you will serve the cache file (at least for a while) although the original file has been removed from the interwebs.

nitriques · 2016-07-21T18:48:55Z

I totally agree with you. But right now, this HEAD request causes problems since it loads the network card with request under high load.

It's funny because there was lots of code to avoid serving bad cache for external resources, but we would gladly sent bad cache for our local files ;)

nitriques · 2016-07-21T18:49:16Z

I'd rather send invalid cache file and be able to support more requests

michael-e · 2016-07-21T18:55:55Z

but we would gladly sent bad cache for our local files ;)

AFAIK, we don't serve a cache file if the original local file has been removed (which is the topic here).

I'd rather send invalid cache file and be able to support more requests

I understand that this might be desired in certain systems. But not always. So what about making it optional? Just adding an option like "Don't check if external images still exist". The simplest implementation would deliver the cache file forever. A more advanced — like mentioned above — might deliver the cache file first, then do the HEAD request (and remove the cache file if the original file has gone away).

nitriques · 2016-07-21T21:49:30Z

AFAIK, we don't serve a cache file if the original local file has been removed (which is the topic here).

No I was referring to #136 which states that we serve the cache even if the file changed.

So what about making it optional

Let me bring random back! Set the configurable factor to one, the HEAD request is always made, bad cache is never used. That would be the default since it's the old behaviour. Set it to 0.5 or even 0.1 to lessen the chances of issuing a HEAD request to validate the cache but may send an invalid response.

Otherwise, we need to hit the Database or other storage to expire the cache on a url basis. Like I said, performance is key, so I think this is a no-go.

michael-e · 2016-07-22T10:55:00Z

Like you, I am not fond of involving the database. So yes, your proposal sounds like a solution. (Documentation is king, of course.)

nitriques · 2016-07-22T14:25:16Z

Great!

The way we handled HEAD request was 1. Different from the GET request, since it lacked some checks 2. Way to general for our needs This commit refactors the old things with a single brand new function that will 1. Make sure the request is a 200 2. Return only cache related headers, since that's the goal Re #136 Cc #138 Cc #130 Cc #137

This commit cleans up the code and fixes a bunch of issues. * Cache validation There was a bug, #137, which would made the cache unusable with external images. This has been fixed. The function also makes sure to validate that the original file, either remote or local, has not changed since the last modification of the cache file itself. If the cache file is no longer valid, it gets deleted. This also fixes #136. * Last modified value There was a problem describe in #138 where a second request to a url with proper cache heading would still send a 200, because the Last-Modified value was read from a different source. This commit changes this by doing the following. 1. Always use a local mtime: either the original local file or our tempfile created from the http request. 2. If cache is enabled, always use the cache file mtime, even when the request creates the cache file. This insure repeatable results for all subsequent requests. This brings a side effect that is needed: before, we used to touch the cache file when found valid and we must need to stop this to be able to get repeatable results. --- Throughout the process, values are "exported" into the parameter array when certain condition are met. This insure that the values sent to the user will be the right one. Sorry for the big commit, I did my best to split things up :) Fixes #137 Fixes #136 Fixes #138

nitriques · 2016-07-28T04:12:54Z

@michael-e Do not rush anything, just a reminder. I am not in a rush to release this, I'll wait for you.

This commits adds a new hidden configuration value, `cache_invalidation_odds` which allows developer to control how often cache is validated. This is particularly usefull on stressed servers, and even better if serving remote images, since it can lead to a 60% request time diminution. The default behaviour is not changed and cache validation will always occur. It also updates the x-jit debug header to specify the result of the cache validation Fixes #138

nitriques · 2016-08-02T03:27:36Z

@michael-e again, no rush, just wanted to let you know that my random thingny works really great. I'll deploy it in production tomorrow 👍

michael-e · 2016-08-02T20:20:33Z

Cool. Tell us how it works in production!

nitriques · 2016-08-02T23:12:23Z

Still doing extensive testing, and it looks promising 👍

michael-e · 2016-08-13T20:26:09Z

I have only done a short test (looking at the debug headers), and it seems to work "as advertised". But I guess in this case your production environment is a much better test!

nitriques added the Bug label Jul 21, 2016

nitriques added this to the 2.1.0 milestone Jul 21, 2016

nitriques self-assigned this Jul 21, 2016

nitriques added the Fixed in integration label Aug 2, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Two 200 before the 304 #138

Two 200 before the 304 #138

nitriques commented Jul 21, 2016 •

edited

Loading

michael-e commented Jul 21, 2016 •

edited

Loading

nitriques commented Jul 21, 2016

michael-e commented Jul 21, 2016

nitriques commented Jul 21, 2016

michael-e commented Jul 21, 2016 •

edited

Loading

nitriques commented Jul 21, 2016

nitriques commented Jul 21, 2016

michael-e commented Jul 21, 2016 •

edited

Loading

nitriques commented Jul 21, 2016

michael-e commented Jul 21, 2016

nitriques commented Jul 21, 2016

nitriques commented Jul 21, 2016

michael-e commented Jul 21, 2016 •

edited

Loading

nitriques commented Jul 21, 2016

michael-e commented Jul 22, 2016

nitriques commented Jul 22, 2016

nitriques commented Jul 28, 2016

nitriques commented Aug 2, 2016

michael-e commented Aug 2, 2016

nitriques commented Aug 2, 2016

michael-e commented Aug 13, 2016

Two 200 before the 304 #138

Two 200 before the 304 #138

Comments

nitriques commented Jul 21, 2016 • edited Loading

michael-e commented Jul 21, 2016 • edited Loading

nitriques commented Jul 21, 2016

michael-e commented Jul 21, 2016

nitriques commented Jul 21, 2016

michael-e commented Jul 21, 2016 • edited Loading

nitriques commented Jul 21, 2016

nitriques commented Jul 21, 2016

michael-e commented Jul 21, 2016 • edited Loading

nitriques commented Jul 21, 2016

michael-e commented Jul 21, 2016

nitriques commented Jul 21, 2016

nitriques commented Jul 21, 2016

michael-e commented Jul 21, 2016 • edited Loading

nitriques commented Jul 21, 2016

michael-e commented Jul 22, 2016

nitriques commented Jul 22, 2016

nitriques commented Jul 28, 2016

nitriques commented Aug 2, 2016

michael-e commented Aug 2, 2016

nitriques commented Aug 2, 2016

michael-e commented Aug 13, 2016

nitriques commented Jul 21, 2016 •

edited

Loading

michael-e commented Jul 21, 2016 •

edited

Loading

michael-e commented Jul 21, 2016 •

edited

Loading

michael-e commented Jul 21, 2016 •

edited

Loading

michael-e commented Jul 21, 2016 •

edited

Loading