EventHub receiver.listenForMessage slows down significantly #240

Open
ChiefAlexander opened this issue Oct 22, 2021 · 4 comments

@ChiefAlexander

Expected Behavior

Receive messages consistently

Actual Behavior

After a few seconds of receiving messages, the latency increases drastically and doesn't come down

I have attached a screenshot from Jaeger showing the increase in function times from an average of ~40ms to an average of ~600ms after processing less than 200 messages from EventHub. This all happens within a few seconds of startup.
[Jaeger screenshots: before / after]

I have locally patched the AMQP library to try to dig into the issue and identify where this slowdown is occurring, but was unable to reach any conclusions. If this is an issue within the AMQP library I am happy to open another issue there, but I wanted to start here to see if anyone had any ideas.

Environment

  • OS: gcr.io/distroless/static:nonroot
  • Go version: 1.17.1
  • Version of Library:
    github.com/Azure/azure-event-hubs-go/v3 v3.3.16
    github.com/Azure/go-amqp v0.16.1

Workaround

Utilizing the WebSocket connection option allowed us to have consistent message delivery. After switching, our latency within the functions maintains a stable execution time of ~40ms, which 10x'd our data throughput.
[Chart: collected data rate]
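For anyone else wanting to try this, here is a minimal sketch of the change. The EVENTHUB_CONNECTION_STRING environment variable name is just illustrative; the relevant piece is the HubWithWebSocketConnection option from azure-event-hubs-go/v3:

```go
package main

import (
	"os"

	eventhub "github.com/Azure/azure-event-hubs-go/v3"
)

func main() {
	// Connection string pulled from the environment (variable name is illustrative).
	connStr := os.Getenv("EVENTHUB_CONNECTION_STRING")

	// HubWithWebSocketConnection tunnels AMQP over WebSockets (port 443)
	// instead of using plain AMQP over TCP (port 5671).
	hub, err := eventhub.NewHubFromConnectionString(connStr,
		eventhub.HubWithWebSocketConnection())
	if err != nil {
		panic(err)
	}
	_ = hub // call hub.Receive(...) as usual from here
}
```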

@ChiefAlexander
Author

This is also probably related to #122 based on some of the comments.

@richardpark-msft
Member

richardpark-msft commented Dec 2, 2021

Hi @ChiefAlexander,

I've written a stress program which ran for 24 hours without showing this issue, so I think I'm missing some key component. Do you have a second to look at what I have and see if it matches what you're doing? Or even better, if you can create a small sample that replicates the problem you're seeing, I could start with that.

Some things that might be different:

  • I'm only writing to/reading from a single partition
  • I'm not using any of the checkpoint/leasing code - just purely reading from an offset, continually (roughly the pattern sketched after the link below)
  • I'm using an alpine base image (I imagine this is similar to what you're using, but yours is probably even leaner)

stress program
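For reference, the read pattern is roughly the following (a sketch, not the stress program itself; the partition ID "0" is just an example):

```go
package stress

import (
	"context"
	"fmt"

	eventhub "github.com/Azure/azure-event-hubs-go/v3"
)

// receiveFromPartition reads continually from a single partition with no
// checkpoint/lease machinery, starting from the latest offset.
func receiveFromPartition(ctx context.Context, hub *eventhub.Hub) error {
	handler := func(ctx context.Context, event *eventhub.Event) error {
		fmt.Printf("received %d bytes\n", len(event.Data))
		return nil
	}

	listener, err := hub.Receive(ctx, "0", handler,
		eventhub.ReceiveWithLatestOffset())
	if err != nil {
		return err
	}
	defer listener.Close(ctx)

	// Block until the caller cancels; the handler keeps firing as events
	// arrive on the partition.
	<-ctx.Done()
	return ctx.Err()
}
```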

@ChiefAlexander
Author

I will try to replicate what we are seeing with your stress program.

A few notes:

  • We are reading from multiple partitions in some cases (4 partitions). I believe that when I was running the above tests it was only reading from one of those partitions.
  • We are using a checkpoint but not a lease. We have actually written our own persister that writes to Firestore (the rough shape is sketched below). However, while testing we confirmed that this occurs both when not using a checkpoint at all and when using the default persister included with this library.
  • We were able to replicate this both in a container and running locally (macOS)

I think we could argue about which base image is leaner 😄 . I don't think in this case the underlying OS actually matters based on our own testing.
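To give a sense of the persister shape: this is not our Firestore code, just an in-memory stand-in implementing the persist.CheckpointPersister interface, wired in through HubWithOffsetPersistence (if I recall the option name correctly):

```go
package checkpoint

import (
	"fmt"
	"sync"

	eventhub "github.com/Azure/azure-event-hubs-go/v3"
	"github.com/Azure/azure-event-hubs-go/v3/persist"
)

// mapPersister stands in for our Firestore-backed persister; it keeps
// checkpoints in memory, keyed the same way the library keys them.
type mapPersister struct {
	mu     sync.Mutex
	points map[string]persist.Checkpoint
}

func key(namespace, name, consumerGroup, partitionID string) string {
	return fmt.Sprintf("%s/%s/%s/%s", namespace, name, consumerGroup, partitionID)
}

func (p *mapPersister) Write(namespace, name, consumerGroup, partitionID string, checkpoint persist.Checkpoint) error {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.points[key(namespace, name, consumerGroup, partitionID)] = checkpoint
	return nil
}

func (p *mapPersister) Read(namespace, name, consumerGroup, partitionID string) (persist.Checkpoint, error) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if cp, ok := p.points[key(namespace, name, consumerGroup, partitionID)]; ok {
		return cp, nil
	}
	// Nothing stored yet: start from the beginning of the stream.
	return persist.NewCheckpointFromStartOfStream(), nil
}

// newHub wires the custom persister into the hub.
func newHub(connStr string) (*eventhub.Hub, error) {
	p := &mapPersister{points: map[string]persist.Checkpoint{}}
	return eventhub.NewHubFromConnectionString(connStr,
		eventhub.HubWithOffsetPersistence(p))
}
```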

@slamgundam

Can I ask where I can switch the connection option to WebSocket?
