Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Interop runner counts one connection attempt too many when parsing logs #125

Open
huitema opened this issue Jul 27, 2020 · 4 comments
Open

Comments

@huitema
Copy link
Contributor

huitema commented Jul 27, 2020

Looking at the logs of this handshake loss trial, I find that the test failed because Expected 50 handshakes. Got: 51. Yet, the client and server logs only mention 50 connections, and the client downloaded the expected 50 files.

The test of the number of handshakes is probably mistaking a failed attempt for a successful one, leading to a false negative in the result. False negatives are unhelpful. They waste time when people investigate, and they push people to just dismiss test results without investigating them further.

@marten-seemann
Copy link
Collaborator

Here's how the number of handshakes is counted: https://github.com/marten-seemann/quic-interop-runner/blob/56fcba0285f063d8e0f4f8d068335d2c85712f5f/testcases.py#L163-L168

Why is this leading to overcounting in this case?

@huitema
Copy link
Contributor Author

huitema commented Jul 27, 2020

In the C1 case, it happens if a client's initial is corrupted in a way that changes the CID. The server will reply to both the correct Initial and the corrupted Initial. In the L1 case, it happens if a server times out one connection attempt and starts a new one later. This is indeed what happened here:

  1. The client started a connection attempt with ICID 6e89084ca8a26853
  2. The server allocated the SCID 6dbee6fcc50be74b to the context and sent a first flight
  3. The server never got any packet back from the client and closed the context after 15 seconds
  4. The server then received a repeated Initial from the client, sequence number 7, same ICID
  5. The server allocated the SCID 3f9e5813a5f2167b to that new connection
  6. That second connection completed correctly.

I confirmed that scenario by checking the client side qlog and the corresponding server side qlog. (I am creating an issue in Picoquic because the two connections where logged under the same ICID-derived file ID and the second file erased the first one -- I should really derive the file ID from the SCID.)

If you counted the handshake with the client side PCAP, you would could 50 connections because none of the Server Initial with the first SCID made it to the client.

In fact, there is a remarkable set of circumstance that caused the client to not receive any of the server first flight, and the server to not receive any of the repeated client initial with sequences 1 to 6. Given a 30% loss rate, this kind of loss sequence is supposed to have a probability of about 0.002%, but it seems to happen much more frequently than that.

@marten-seemann
Copy link
Collaborator

In the C1 case, it happens if a client's initial is corrupted in a way that changes the CID. The server will reply to both the correct Initial and the corrupted Initial.

That shouldn't happen though. If the CID is corrupted, the AEAD tag won't be correct. Or am I missing something here?

@huitema
Copy link
Contributor Author

huitema commented Jul 27, 2020

Yes, you are right about C1. But the L1 case was pretty clear once I managed to identify the logs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants