Use network loops to read/write packets in isolated goroutines #151
Conversation
Thanks @vishnureddy17. The new approach is somewhat similar to that used in the v3 Client (albeit this client waits for errors almost every time it sends to …). Could you please add some detail as to the benefits that you believe this change delivers? Moving the reads and writes off to separate goroutines does have some potential benefits but, in my opinion, comes at the cost of significantly increased complexity. Just to be clear - I'm not saying that this is not the right approach; however, at this point I'm not convinced the benefits outweigh the negatives. Personally I find:
easier to read/reason about than:
Based on experience in the v3 client, this approach makes it very easy to introduce races, deadlocks and leaking goroutines (I have put a lot of time into finding/resolving these in the v3 client). For example, consider what happens (with your current implementation) when:
(Note: this is theoretical; I have not put a huge amount of time into confirming it can happen, but it looks like a realistic scenario.) Edit: here is another one (much more likely to be an issue).
These examples may seem convoluted, but I hit plenty of similar problems in the v3 client that caused production issues (and these are very hard to track down, particularly when they are user reported and you cannot duplicate them!). The above leak is not really an issue if the program exits at that point; however, when used with something like … Anyway, I'm keen to discuss - I will add a few other comments to your PR (thanks for the submission by the way - it's great to see activity on this package picking up again!).
This is in addition to my general comments.
c.publishPackets = make(chan *packets.Publish)
go c.incoming()
go c.PingHandler.Start(c.Conn, 30*time.Second)
fakeConnect(c)
This triggers a warning when running with the race detector (go test --race). This is not an issue for live use but needs to be resolved so there are no alerts when testing. To fix, swap this with the line below.
I think the tests need to be changed so that we aren't relying on manually initializing the client within the tests. I'd like to rewrite them so we just set up the test server to respond to CONNECT packets and call Client.connect() to initialize everything.
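For concreteness, a rough sketch of the test shape being described follows; newTestServer and its methods are hypothetical stand-ins for whatever the fake broker ends up exposing, while NewClient, ClientConfig and Connect are the existing public API.

```go
// Hypothetical test: the fake broker replies to CONNECT, and connecting the
// client is enough to start the network loops for the rest of the test.
func TestConnectStartsNetworkLoops(t *testing.T) {
	ts := newTestServer(t) // hypothetical fake broker on the other end of a net.Pipe
	ts.RespondToConnect(0) // hypothetical: answer CONNECT with a successful CONNACK
	defer ts.Stop()

	c := NewClient(ClientConfig{Conn: ts.ClientConn()})
	ca, err := c.Connect(context.Background(), &Connect{ClientID: "test", KeepAlive: 30})
	if err != nil {
		t.Fatalf("Connect failed: %v", err)
	}
	if ca.ReasonCode != 0 {
		t.Fatalf("unexpected CONNACK reason code: %d", ca.ReasonCode)
	}
	// The read/write loops are now running; individual tests exercise them from here.
}
```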
"we aren't relying on manually initializing the client" - the way this currently works is generally pretty convenient (especially being able to tell the 'broker' exactly what response to send). The V3 client tests rely upon an external broker which works OK but it complicates things (need to start a broker, ensure the relevant port is free etc).
I have actually been working on a new TEST server that extends the current approach (it's more like a real broker but only supports a single connection). This is required because autopaho is currently not well tested and I'm working on a persistent session option (which will need to be thoroughly tested!).
}
}

func (c *Client) writeLoop() {
If we take this approach then I believe this function should range over c.outgoingPackets (with the channel being closed in c.stop()). The rationale for this is that it's too easy to leak goroutines when you have a channel with nothing processing its output (i.e. c.stop is closed and then something tries to send to c.outgoingPackets).
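For concreteness, a minimal sketch of the ranged writeLoop being suggested (the field names follow the PR, but the error handling is only a placeholder):

```go
// Sketch only: writeLoop exits exactly when c.stop() closes c.outgoingPackets,
// so a send can never be left blocking on a channel with no reader.
func (c *Client) writeLoop() {
	defer c.workers.Done()
	defer c.debug.Println("writeLoop worker returned")
	for p := range c.outgoingPackets {
		if _, err := p.WriteTo(c.Conn); err != nil {
			// Placeholder: surface the error through the client's normal error
			// handling; keep ranging so senders are never left blocked.
			c.debug.Println("write error:", err)
		}
	}
}
```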
I agree. I think a tricky part of this is avoiding write operations causing a panic. We could use mutexes as is being done elsewhere, but I want to think about a solution that doesn't use mutexes. I have an idea in mind for this that I might try.
It's possible to avoid mutexes (mostly) by being very careful about shutdown order. However, this is tricky and, unless we are very careful with future PRs, it's easy to create issues later. This is the main reason that I believe we need to be sure this is the right approach before adopting it.
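To illustrate the kind of ordering being described, a rough sketch follows; every name here is an assumption (notably a separate senders WaitGroup for API callers, distinct from the loop workers):

```go
// Hypothetical shutdown sequence: each channel is only closed after everything
// that could still send on it has finished, so no mutex is needed around the sends.
func (c *Client) shutdown() {
	close(c.stop)            // 1. public methods see this and refuse new work
	c.senders.Wait()         // 2. wait for in-flight Publish/Subscribe/etc. calls (assumed WaitGroup)
	close(c.outgoingPackets) // 3. now safe: writeLoop drains the channel and returns
	c.Conn.Close()           // 4. unblocks packets.ReadPacket; readLoop returns and closes incomingPackets
	c.workers.Wait()         // 5. finally wait for the network loops and routers to exit
}
```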
@@ -551,6 +596,73 @@ func (c *Client) close() {
	c.debug.Println("acks tracker reset")
}

func (c *Client) readLoop() {
As this is the only thing writing to c.incomingPackets, it should close the channel when it exits (this allows anything listening on that channel to shut down cleanly). Other approaches tend to lead to things happening in an unexpected order, with the listeners exiting before the sender (leading to a goroutine leak, i.e. at c.incomingPackets <- packet).
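As an illustration of why that helps, a listener can then be as simple as the sketch below (the body of incoming() and the route() call are stand-ins, not the PR's actual code):

```go
// Because readLoop (the only sender) closes c.incomingPackets when it exits,
// every listener that ranges over the channel unwinds by itself; there is no
// extra stop channel to select on and no goroutine left blocked on a receive.
func (c *Client) incoming() {
	defer c.workers.Done()
	defer c.debug.Println("incoming worker returned")
	for cp := range c.incomingPackets {
		c.route(cp) // placeholder for whatever dispatches the packet
	}
}
```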
Agree
}
c.debug.Printf("received %s%s\n", packet.PacketType(), packetIdString)

incoming <- packet
Goroutine leak here: when c.stop is closed, the outer for exits, so incoming <- packet will block (as nothing is receiving on incoming).
I don't really see the benefit of using the extra goroutines here; packets.ReadPacket(c.Conn) will return an error when c.Conn is closed (so readLoop() can be simplified to a few lines).
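A sketch of what that simplification might look like (the shape, not the exact code, is the point; error propagation beyond logging is omitted):

```go
// packets.ReadPacket returns an error as soon as c.Conn is closed, so the loop
// needs no select on c.stop and no helper goroutine to unblock it.
func (c *Client) readLoop() {
	defer c.workers.Done()
	defer close(c.incomingPackets) // readLoop is the only sender, so it owns the close
	for {
		recv, err := packets.ReadPacket(c.Conn)
		if err != nil {
			c.debug.Println("readLoop exiting:", err)
			return
		}
		c.incomingPackets <- recv
	}
}
```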
agree
@@ -39,6 +39,7 @@ type (
	Unpack(*bytes.Buffer) error
	Buffers() net.Buffers
	WriteTo(io.Writer) (int64, error)
	ToControlPacket() *ControlPacket
Adding this couples the interface to this specific implementation (prior to this change it was generic, only utilising the standard library). While it's not currently a goal, I suspect that at some point in the future we may want to add a very basic v3 implementation, and this change would complicate that (because it would limit the ability to implement Packet in a different package).
I believe that this is probably why the functions were implemented as they were (a bit of duplication keeps things loosely coupled).
We can make this change but do need to consider the potential future impact.
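To make the coupling concern concrete, here is a hedged sketch of a hypothetical v3 packets package (all names invented for illustration):

```go
// A hypothetical v3 package can satisfy the original interface using only the
// standard library, because Unpack/Buffers/WriteTo mention nothing from this
// repository. Adding ToControlPacket() *ControlPacket would force it to import
// and construct this package's concrete ControlPacket type.
package v3packets

import (
	"bytes"
	"io"
	"net"
)

// Pingreq is an invented example; 0xC0 0x00 is the MQTT PINGREQ encoding.
type Pingreq struct{}

func (p *Pingreq) Unpack(*bytes.Buffer) error { return nil }

func (p *Pingreq) Buffers() net.Buffers { return net.Buffers{{0xC0, 0x00}} }

func (p *Pingreq) WriteTo(w io.Writer) (int64, error) {
	n, err := w.Write([]byte{0xC0, 0x00})
	return int64(n), err
}

// No ToControlPacket here: implementing it would require importing the v5
// packets package, which is exactly the coupling being discussed.
```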
@@ -321,7 +355,7 @@ func (c *Client) Connect(ctx context.Context, cp *Connect) (*Connack, error) {
	c.workers.Add(1)
	go func() {
		defer c.workers.Done()
-		defer c.debug.Println("returning from ack tracker routine")
+		defer c.debug.Println("ack tracker worker returned")
This function will end up with a race (because sending the queued acks requires that writeLoop() is running, but that also stops when c.stop is closed).
Thanks for taking a careful look at this. This is all excellent feedback! I think a part of the difficulty here is that I'm trying to keep this pull request small and not change everything at once. However, this is one step in a sequence of changes I'm thinking of. My goal here was to make the fewest changes possible to have reads and writes in their own goroutines. Clearly, there are still issues. How about I continue working on my fork to see where I can take this? The end result will be a big change, but I'll do my best to make each commit understandable in isolation. I could submit a PR later with a more complete picture, and we may need to move the discussion off of GitHub once we get to that point. It's fine if it doesn't work out in the end, but I'd like to at least prove these ideas to myself.
I agree. I don't think error handling should actually be done in those goroutines. I'd like to see errors being dealt with more consistently in the client, and I see error handling being done this way:
I don't think errors should be passed between worker goroutines. They should just be passed up to the client, which takes care of any necessary cleanup. I did not make that change because I was trying to make the minimal changes needed to communicate the network loop idea.
The two main benefits I see are:
It may be worth raising an issue to discuss your longer-term plan? I've used similar approaches successfully but have also been burnt due to this approach making it really easy to introduce deadlocks etc. So a fork may be the way to go (but it's worth collaborating to ensure it's a worthwhile direction to take). Note that I am currently working on session persistence offline (I may make a test release public on my repo, but will not push to this repo until it's fully working and tested).
I can see some benefits re point 1. However, most of the time when we are sending data we do need to be aware of any issues at that point in the code so appropriate action can be taken (e.g. returning an error to the caller). I believe this reduces the benefit of your approach (and a simpler technique might be a custom …).
Ref point 2 - the ordering requirements are mainly around PUBLISH and the relevant ACKs (MQTT-4.6.0-2 to MQTT-4.6.0-4). I believe the requirement exists (I asked the spec authors about this :-) ) to enable transactional logging of messages (e.g. run INSERT on PUBLISH and COMMIT on PUBREL). So my personal opinion is that, while it would be nice to follow the spec, it's not worth adding heaps of complexity to do so (I have not seen a broker that will error on out-of-order ACKs and …).
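A minimal sketch of the transactional-logging idea mentioned above (the store type, table and keys are all invented for illustration; it assumes the QoS 2 ACKs arrive in spec order):

```go
import "database/sql"

// msgStore illustrates why ordered ACKs matter for this pattern: the INSERT
// happens when the PUBLISH arrives and the transaction is only committed when
// the matching PUBREL arrives.
type msgStore struct {
	db  *sql.DB
	txs map[uint16]*sql.Tx // open transactions keyed by packet identifier
}

func (s *msgStore) onPublish(pid uint16, payload []byte) error {
	tx, err := s.db.Begin()
	if err != nil {
		return err
	}
	if _, err := tx.Exec("INSERT INTO messages (pid, payload) VALUES (?, ?)", pid, payload); err != nil {
		tx.Rollback()
		return err
	}
	s.txs[pid] = tx
	return nil
}

func (s *msgStore) onPubrel(pid uint16) error {
	tx, ok := s.txs[pid]
	if !ok {
		return nil // nothing pending for this packet identifier
	}
	delete(s.txs, pid)
	return tx.Commit()
}
```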
Another thing: right now, there seem to be cases where packet reads and writes can block each other. It might be worth trying to avoid that.
I'm going to work on it offline to see where I can take it before I start discussing it on GitHub; the ideas I have are a bit nebulous right now, and I think getting something concrete will clarify things. Your feedback has been super helpful. And I see what you mean about how this approach can make it easy to introduce deadlocks, races, and goroutine leaks.
This is great! If you decide to make a release on your repo, I'd love to check it out. Looking forward to seeing what comes out of this :)
In my testing, Azure Event Grid Namespaces (which is currently in public preview) disconnects clients that send out-of-order ACKs.
I've been working on this, and I have something working, including in-memory persistence. However, it only supports QoS 1 for the time being. I don't think it's quite ready to bring into this repo, but I thought I'd post it here in case anyone is curious about the direction I'm thinking of. https://github.com/vishnureddy17/paho.golang/tree/persistence-network-loops
Thanks @vishnureddy17 - I also hope to have a solution (quite a different direction that caters for QoS 2) ready in a few days, so it will be interesting to compare approaches (I've tried a lot of approaches but feel I have something workable). Will be keeping things outside the …
To be clear, what I've done does not preclude QoS 2, I just chose to omit it for now.
Yes, definitely. Good thing this is still in beta. Looking forward to seeing what you have!
My attempt is in this branch - I believe it's almost ready; it passes the tests I've put in place so far and seems to run OK with the docker test (to which I've added a disconnect so I can check for lost messages). It does need a further review and clean-up, but I thought it was better to make it public so we can discuss the different approaches and decide the best way forward. Note that the readme contains a rationale for the decisions made (and info on todos, breaking changes, etc.). You will note that this is a major change - it decouples the state (including Mids and store) from … I'll add a comment to issue #25 with links for those watching that issue.
Hi @vishnureddy17, … Matt
@MattBrittan, I'm going to go ahead and close this PR for now. I think the state of the repo has diverged so far from this PR at this point that it's not worth keeping around. I'm busy with other work right now, but in the future I might look to contribute some of the ideas in this PR separately.
Here's a proposal to make the client (client.go) read and write packets in isolated goroutines. I think this makes things easier to work with and makes errors on the connection easier to deal with.
Thanks @BertKleewein for the inspiration!