
Potential issue when using inside a K8s cluster with dynamic scaling of server instances #16

Open
dhardtke opened this issue Aug 9, 2023 · 3 comments


dhardtke commented Aug 9, 2023

Hi!

We are very grateful that we were able to migrate from the Redis adapter to this adapter and get rid of one dependency, since MongoDB is our DB anyway.

However, we are currently investigating an issue where our backends behave very strangely after a so-called rolling update in our K8s cluster (new backend instances are spawned, and the old ones are shut down once the new ones are ready): afterwards, some backends are unable to deliver any socket messages to clients connected to other backends.
Unfortunately, for the past couple of days it has also been happening without any deployments.

What we observed
It seems like the backends are sending heartbeat signals (though we could not see them in the DB collection, because our capped collection was quite small: 1 MB), so with 6 backends every socket message requires the 5 other backends to respond.
However, even after tweaking the requestsTimeout, we still see "timeout reached: only 0 responses received out of 5" (or 4 out of 5) in our logs. And today we noticed that the message kept showing up even though the backends themselves were all running perfectly fine.

Questions

  • Does anyone have any recommendations, or is it simply not possible to use this adapter in a K8s cluster with dynamic scaling, where a node can go offline at any minute and new ones can come up?
  • Is it an issue if the capped collection size is reached due to large socket objects or a large number of connected sockets? (See the sketch below.)
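
For context, we create the capped collection roughly like this (database and collection names here are illustrative; the size is the 1 MB mentioned above):

```js
import { MongoClient } from "mongodb";

// db/collection names are illustrative; the size matches our current 1 MB
const client = new MongoClient("mongodb://localhost:27017/?replicaSet=rs0");
await client.connect();

await client.db("mydb").createCollection("socket.io-adapter-events", {
  capped: true,
  size: 1e6, // 1 MB
});
```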

I am not sure there is an issue in the adapter, so this is not just a potential bug report but also a request for a pointer in the right direction. We (3 senior devs) have been speculating and pondering over this all day and have not come up with a solution (yet).
Thank you :)

darrachequesne (Member) commented

Hi! I think the problem is that during the rolling upgrade, the server that sends a request sees more servers than expected, so the request eventually times out. See here.

You could play with the values of heartbeatInterval (5s by default) and heartbeatTimeout (10s) to reduce the size of the window, so that the deleted pods are removed more quickly. The failure window will still exist though, so you will certainly need to retry the request.
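
For reference, these options are passed when creating the adapter. A minimal sketch, assuming the usual @socket.io/mongo-adapter setup (connection string, database and collection names, and the values themselves are just examples):

```js
import { Server } from "socket.io";
import { createAdapter } from "@socket.io/mongo-adapter";
import { MongoClient } from "mongodb";

// connection details and names below are illustrative
const mongoClient = new MongoClient("mongodb://localhost:27017/?replicaSet=rs0");
await mongoClient.connect();
const collection = mongoClient.db("mydb").collection("socket.io-adapter-events");

const io = new Server({
  adapter: createAdapter(collection, {
    heartbeatInterval: 2000, // default: 5000 ms — how often each server announces itself
    heartbeatTimeout: 5000,  // default: 10000 ms — after which a silent server is considered gone
  }),
});

io.listen(3000);
```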

Now, should the retry mechanism be implemented in the library itself? I'm open to discussing that.
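
In the meantime, a retry can be implemented in userland. A rough sketch, assuming the timed-out request surfaces as a rejected promise (as the "timeout reached" message in the logs suggests); retry count and delay are arbitrary illustration values:

```js
// retry wrapper around an adapter-backed request such as io.fetchSockets()
async function fetchSocketsWithRetry(io, retries = 3, delayMs = 1000) {
  for (let attempt = 1; attempt <= retries; attempt++) {
    try {
      return await io.fetchSockets();
    } catch (err) {
      if (attempt === retries) throw err;
      console.warn(`fetchSockets() attempt ${attempt} failed (${err.message}), retrying...`);
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
```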

dhardtke (Author) commented

Thank you. We are already running the backends with a heartbeatInterval of 2.5s and heartbeatTimeout of 5s. It still happens, though, and our primary concern is that it also happens during regular operation of the backends (i.e., even when no rolling upgrade occurs).

And what's even stranger is that the backends do not recover from this and are unable to keep any socket connection open; we then have to restart them manually.

From my investigation, when such a timeout happens, the other backends do answer correctly, but the requesting backend does not receive the inserted document in time (even with a capped collection size of 1 GB).

darrachequesne (Member) commented

> And what's even stranger is that the backends do not recover from this and are unable to keep any socket connection open; we then have to restart them manually.

That's weird indeed, as the adapter part should be rather independent of the connection handling. I will try to reproduce the issue locally.

Do you know how many documents are inserted per second? What is the state of the change stream, according to the logs?
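
If it helps, a rough way to measure the insert rate is to watch the collection directly. A sketch, assuming direct access to the adapter's collection (database and collection names are illustrative):

```js
import { MongoClient } from "mongodb";

// connection details and names below are illustrative
const client = new MongoClient("mongodb://localhost:27017/?replicaSet=rs0");
await client.connect();
const collection = client.db("mydb").collection("socket.io-adapter-events");

// count insert events seen by a change stream, printed once per second
let inserts = 0;
const changeStream = collection.watch([{ $match: { operationType: "insert" } }]);
changeStream.on("change", () => inserts++);
changeStream.on("error", (err) => console.error("change stream error:", err.message));

setInterval(() => {
  console.log(`${inserts} documents inserted in the last second`);
  inserts = 0;
}, 1000);
```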
