-
Notifications
You must be signed in to change notification settings - Fork 534
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Disconnect() can hang waiting on c.workers.Wait() #554
Comments
It looks like your issue is here:
router.go:210 is:
So one of your handlers is not returning. As per common problems in the readme:
I'm really not sure that the library can handle this any differently - |
I see, thanks for the pointer! It turns out that the handler is signaling on a channel that is insufficiently deep and blocking as a result. To elaborate further on the scenario, I have an adapter around MQTT that may be passed arbitrary handler functions and I would like to be able to recover the client by stopping and bringing up new instances of it. Let's say I wrap the handler to include a cancellation context controlled by the adapter. When I cancel the context and call Disconnect, what resources, if any, would be leaked in the mqtt lib by this approach? On a related note, while the handler seen by the mqtt lib does complete in this case (cancellation forces progress) it looks like router.go:210 always calls |
None if it shuts down properly (i.e. cancelling your context also terminates any callbacks). If the callback is blocked then go-routines are leaked (along with any resources they hold). Due to the way this library was written (over a long period by quite a few people) its not particularly easy to modify and keeping track of the various go-routines and what blocks what is difficult. This is one of the reasons that the mqtt v5 client is a total (incompatible) rewrite. I rewrote a large portion of this library a couple of years ago (simplifying, documenting, and removing a range of potential deadlocks) but was constrained by the need to avoid breaking existing clients.
See issue #459 for a discussion on this point (there was also a discussion on the Paho mailing list). This package was originally designed on the basis that the ACK is a protocol level thing (so once the message has been received the ACK should be sent regardless of any issues the user may have processing the message). As per the comments on that issue I'm open to seeing this change implemented but it's not something I'm really interested in doing myself (have no use-case). |
Thanks @MattBrittan, this was incredibly helpful. I'll keep an eye on eclipse-paho/paho.golang#53 and the v5 betas. |
Related to #550, but the hang is not at the previously noted race on
<-c.commsStopped
but prior to that on line 672 atc.workers.Wait()
: https://github.com/eclipse/paho.mqtt.golang/blob/88d53342c440d7e9fb3b2927188e6bd8645dce8c/client.go#L670-L680The repro also does not involve attempting to reuse a client via
Disconnect
/Reconnect
:SetCleanSession(False)
(not salient, hang reproes either way)Connect
the client and callSubscribe
with a QoS level 1Disconnect
on the client as part of teardownIs there an idiomatic way of releasing the client and any resources it holds other than calling
Disconnect
on it?It looks like sometimes after
c.stop
is signaled as part ofDisconnect
, the goroutine responsible for flushing any acks doesn't end up getting toc.workers.Done()
:https://github.com/eclipse/paho.mqtt.golang/blob/88d53342c440d7e9fb3b2927188e6bd8645dce8c/client.go#L579-L586
Relevant stacks I'm basing that hypothesis on highlighted below:
The text was updated successfully, but these errors were encountered: