'Corrupt deflate stream' error on some websites (curl and Firefox are fine) #84

Open
Shnatsel opened this issue Feb 5, 2021 · 16 comments

@Shnatsel

Shnatsel commented Feb 5, 2021

Some websites, such as hajime.us, fail to load using attohttpc: Io Error: corrupt deflate stream. They load fine using Firefox and the curl command-line tool.

Tested using this code. Test tool output from all affected websites: attohttpc-deflate-corrupt-stream.tar.gz

40 websites out of the top million from the Feb 3 Tranco list are affected.

I suspect this is an issue with the underlying DEFLATE implementation, but assistance in isolating the failure (e.g. dumping the DEFLATE stream so I could report a bug against miniz_oxide) would be appreciated.

@adamreichold
Contributor

If I understand things correctly, you should be able to get the compressed response using allow_compression(false) while manually inserting the relevant header using header("Accept-Encoding", "gzip, deflate").

@adamreichold
Contributor

I tried http://landolts.com and I get a corrupt deflate stream error with the rust_backend as well as the miniz-sys features of the flate2 crate. Using the zlib feature does not yield a decompression error but an unexpected EOF instead.

In all cases, the actual body seems to be decompressed completely. I therefore wonder whether the content length reported by the server is correct...

@adamreichold
Contributor

Here is the body of the above request attached: body.gz

It came with the headers:

server: "nginx"
date: "Fri, 05 Feb 2021 20:28:22 GMT"
content-type: "text/html; charset=UTF-8"
content-length: "6068"
connection: "close"
x-powered-by: "PHP/7.0.0p1"
set-cookie: "PHPSESSID=ac27hsia4s1obmtvrk6jetrf40; path=/"
set-cookie: "mobile=false; path=/"
set-cookie: "user-agent=330cf4ec2a9149ebd093962feb701e34; path=/"
expires: "Mon, 26 Jul 1997 05:00:00 GMT"
cache-control: "no-store, no-cache, must-revalidate"
cache-control: "post-check=0, pre-check=0"
pragma: "no-cache"
last-modified: "Fri, 05 Feb 2021 20:28:22 GMT"
content-encoding: "gzip"
vary: "Accept-Encoding"

gzip does not seem to like it either:

> zcat body.gz
...
gzip: body.gz: unexpected end of file

but that also suggests that the unexpected EOF I got using the zlib feature is just its way of saying corrupt deflate stream...

@adamreichold
Contributor

From reading into cURL's source, my initial guess would be that its handling of expected but ignored trailer bytes in https://github.com/curl/curl/blob/ecb13416e316fc1c781f865d2bb7e74462ef793b/lib/content_encoding.c#L135 might make the difference...

@adamreichold
Contributor

And cURL seems to ignore an error condition which I do not understand yet: https://github.com/curl/curl/blob/ecb13416e316fc1c781f865d2bb7e74462ef793b/lib/content_encoding.c#L221

@sbstp
Owner

sbstp commented Feb 6, 2021

I wonder if we could bypass this issue by simply sending Accept-Encoding: gzip instead of Accept-Encoding: gzip, deflate. In practice I think deflate is almost never used by websites. And chances are that gzip is supported if deflate also is. I think reqwest only supports gzip as well.

Worst case, gzip is not supported and the content is sent in plain.

@sbstp
Owner

sbstp commented Feb 6, 2021

I think we might be running into these kinds of errors.

@Shnatsel
Author

Shnatsel commented Feb 6, 2021

I've also seen 20 "invalid gzip header" errors in the top 1M. Here's the data: invalid-gzip-header.tar.gz

That does sound like the issues with "deflate" encoding that the article talks about.

@adamreichold
Contributor

> I wonder if we could bypass this issue by simply sending Accept-Encoding: gzip instead of Accept-Encoding: gzip, deflate.

What I do not yet understand is how this relates to my tests against http://landolts.com: the server is nginx, i.e. not a Microsoft implementation, and the headers indicate that the result is gzip-encoded. Do you think the header is incorrect and this is a deflate stream nonetheless?

The error message comes from https://github.com/rust-lang/flate2-rs/blob/90d9e5ed866742ce8b3946d156830e300d1e5aab/src/zio.rs#L152 and this code is generic w.r.t. gzip or deflate headers, so I don't think it refers to the actual format in use.
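One way to check whether a body labelled gzip actually starts like a gzip member, or like the zlib wrapper that some servers send as "deflate", is to inspect its first two bytes. The sketch below is only a heuristic based on the header layouts in RFC 1952 (gzip) and RFC 1950 (zlib). The names Container and sniff_container are made up for this illustration and are not part of attohttpc or flate2.

```rust
/// Container formats that an HTTP body labelled "gzip" or "deflate" might use.
/// Hypothetical type, for illustration only.
#[derive(Debug, PartialEq, Eq)]
pub enum Container {
    /// RFC 1952 gzip member: starts with the magic bytes 0x1f 0x8b.
    Gzip,
    /// RFC 1950 zlib wrapper: CM = 8, CINFO <= 7, header checksum divisible by 31.
    Zlib,
    /// Anything else: possibly a raw RFC 1951 DEFLATE stream, possibly not compressed.
    Unknown,
}

/// Guesses the container format from the first two bytes of `body`.
/// This is a heuristic: a raw DEFLATE stream has no magic bytes and could,
/// by chance, also satisfy the zlib header check.
pub fn sniff_container(body: &[u8]) -> Container {
    if body.len() < 2 {
        return Container::Unknown;
    }
    let (first, second) = (body[0], body[1]);

    if first == 0x1f && second == 0x8b {
        return Container::Gzip;
    }

    // zlib: low nibble of CMF is the compression method (8 = deflate),
    // high nibble is CINFO (at most 7), and CMF * 256 + FLG is a multiple of 31.
    let method_is_deflate = (first & 0x0f) == 8;
    let window_ok = (first >> 4) <= 7;
    let check_ok = ((u16::from(first) << 8) | u16::from(second)) % 31 == 0;
    if method_is_deflate && window_ok && check_ok {
        return Container::Zlib;
    }

    Container::Unknown
}
```

Applied to the first bytes of a saved body such as body.gz, this would show whether the content-encoding header agrees with the container that was actually sent.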

@sbstp
Owner

sbstp commented Feb 6, 2021

I tried playing with the Accept-Encoding header that we send to landolts.com, and the error occurs when gzip is among the accepted encodings, but not when only deflate or identity is. So it seems like their server configuration might be broken, the gzip they are sending is not really gzip.

@adamreichold
Contributor

> the gzip they are sending is not really gzip.

While I agree in principle, the observation that both cURL and Firefox are able to handle this suggests there are workarounds. In particular, even we and flate2 basically decompress everything and only fail at EOF. Judging from the cURL code, there is quite a bit of variability in how gzip is implemented in the wild.
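Since the payload is decoded completely and the failure only shows up at the very end, one conceivable client-side workaround is to keep the bytes already produced when the decoder fails late. Below is a minimal sketch using only std::io, assuming the decoder is exposed as a Read. The function read_lenient and its tolerate predicate are hypothetical names, and which errors to tolerate is a policy decision left to the caller.

```rust
use std::io::{self, ErrorKind, Read};

/// Reads `reader` to the end. If reading fails after at least one byte was
/// produced and `tolerate` accepts the error, the bytes read so far are
/// returned together with the error instead of being discarded.
/// Hypothetical helper, not part of attohttpc or flate2.
pub fn read_lenient<R, F>(mut reader: R, tolerate: F) -> io::Result<(Vec<u8>, Option<io::Error>)>
where
    R: Read,
    F: Fn(&io::Error) -> bool,
{
    let mut out = Vec::new();
    let mut buf = [0u8; 8192];
    loop {
        match reader.read(&mut buf) {
            // Clean end of stream: nothing was tolerated.
            Ok(0) => return Ok((out, None)),
            Ok(n) => out.extend_from_slice(&buf[..n]),
            // Interrupted reads are retried, as in `Read::read_to_end`.
            Err(e) if e.kind() == ErrorKind::Interrupted => continue,
            // Late failure that the caller is willing to accept.
            Err(e) if !out.is_empty() && tolerate(&e) => return Ok((out, Some(e))),
            // Everything else stays a hard error.
            Err(e) => return Err(e),
        }
    }
}
```

A caller could pass a predicate such as `|e| e.kind() == ErrorKind::UnexpectedEof` and log the returned error. Tolerating errors this way also hides truly truncated responses, so it is a trade-off rather than a fix.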

@sbstp
Owner

sbstp commented Feb 6, 2021

For what it's worth, I did a test with reqwest, and it seems like it also has this problem. It would be neat to get to the bottom of this and fix it across the ecosystem.

use reqwest::blocking::Client;

fn main() -> Result<(), reqwest::Error> {
    let client = Client::new();
    let req = client
        .get("http://landolts.com")
        .header("Accept-Encoding", "gzip")
        .build()?;
    println!("{:?}", req.headers());
    let resp = client.execute(req)?;
    println!("{}", resp.text()?);
    Ok(())
}

Output:

{"accept-encoding": "gzip"}
Error: reqwest::Error { kind: Decode, source: Custom { kind: UnexpectedEof, error: "unexpected end of file" } }

@sbstp
Owner

sbstp commented Feb 6, 2021

I think we might be able to find some information in this Stack Overflow answer by Mark Adler.

@Shnatsel
Author

Shnatsel commented Feb 6, 2021

I have the same test code implemented for 4 clients and growing in https://github.com/Shnatsel/rust-http-clients-smoke-test, it might come in handy for comparing behavior between clients.

@adamreichold
Contributor

adamreichold commented Feb 6, 2021

My current guess is that flate2 expects the stream to end as described in https://github.com/curl/curl/blob/ecb13416e316fc1c781f865d2bb7e74462ef793b/lib/content_encoding.c#L344, i.e. with a CRC and a size field, whereas cURL tries to read the trailer, but only errs if there is extra data, not if part of the trailer is missing: https://github.com/curl/curl/blob/ecb13416e316fc1c781f865d2bb7e74462ef793b/lib/content_encoding.c#L135

Admittedly, I am not very confident in my reading of the cURL code. But at least missing CRC and size information would explain why the body is completely decompressed and only then an error is raised. It would also make sense to e.g. give flate2 a flag that makes its processing more lenient w.r.t. this redundant information.
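As a sketch of what such a lenient mode could check: per RFC 1952, a gzip member ends with a CRC-32 of the uncompressed data followed by its length modulo 2^32, both little-endian. The code below distinguishes a missing or short trailer from one that is present but wrong. The names crc32, Trailer and check_trailer are hypothetical, and the function assumes the caller already knows which bytes follow the final DEFLATE block, which in practice only the decompressor can tell.

```rust
/// CRC-32 as used by gzip (reflected polynomial 0xEDB88320), computed bit by bit.
/// Slow but dependency-free; real code would use a table-driven implementation.
pub fn crc32(data: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &byte in data {
        crc ^= u32::from(byte);
        for _ in 0..8 {
            // All ones if the lowest bit is set, all zeros otherwise.
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

/// Result of comparing a gzip trailer against the decompressed data.
/// Hypothetical type, for illustration only.
#[derive(Debug, PartialEq, Eq)]
pub enum Trailer {
    /// All 8 trailer bytes are present and match.
    Valid,
    /// Fewer than 8 trailer bytes were received.
    Missing,
    /// The trailer is complete but the CRC or the length differs.
    Mismatch,
}

/// `trailer` holds whatever bytes followed the final DEFLATE block.
pub fn check_trailer(decompressed: &[u8], trailer: &[u8]) -> Trailer {
    if trailer.len() < 8 {
        return Trailer::Missing;
    }
    let stored_crc = u32::from_le_bytes([trailer[0], trailer[1], trailer[2], trailer[3]]);
    let stored_len = u32::from_le_bytes([trailer[4], trailer[5], trailer[6], trailer[7]]);
    // ISIZE is the uncompressed length modulo 2^32, hence the truncating cast.
    let actual_len = decompressed.len() as u32;
    if stored_crc == crc32(decompressed) && stored_len == actual_len {
        Trailer::Valid
    } else {
        Trailer::Mismatch
    }
}
```

A lenient client could accept Trailer::Missing while still rejecting Trailer::Mismatch. Whether that matches what cURL does is the guess from this comment, not something verified here.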

@sbstp
Owner

sbstp commented Feb 7, 2021

Golang's http library has this issue as well. Looks like curl is one of the few places that figured it out.
