HTTP timeouts not respected in flush #75
The timeout is on a per-request basis, not a per-flush basis.
If the timeouts are not specific to flush, then is there currently no control over how long a flush will hang? What is the current expectation? I'm seeing retries growing out of control into several minutes with default settings. Is this how you'd expect flush to behave? If this is expected, then I would think most users would have to disable retries, as a several-minute hang on a function call is never ideal. You mention disabling retries by using 0 as the input. How specifically would this work? What would my control mechanism be for the single timeout? For example, if I want it to try only once for 10 seconds, what exact parameters would I choose?
I'm on holiday at the moment. I'll be able to look into this in more detail once I'm back, if it's indeed a bug. Hopefully the pointers below will help out in the meantime.

**Context**

**Retry timeout**

For full reference on retries you can read the logic here: https://github.com/questdb/c-questdb-client/blob/94b3890e30a5adf10555c61172aaaf8222a8a4a4/questdb-rs/src/ingress/http.rs#L220

The short of it is that the retry timeout controls for how long new retry attempts may be started. Each HTTP retry is a separate request.

**Request timeout**

The duration of the request itself is governed by the parameters documented here: https://py-questdb-client.readthedocs.io/en/latest/conf.html#http-request

If this logic is not quite what you want you'd need to write your own loop, but it should work, and if you're encountering issues I'd appreciate details so I can replicate them with a test case and fix anything that isn't quite right.

**Answers to your questions**
There's no direct parameter for this. It's a function of the request timeout parameters (`request_timeout` and `request_min_throughput`), the size of the buffer being flushed, and the retry timeout.
That shouldn't be the case.
The request timeout parameters are designed to "scale" for both large and small requests. It is not unheard of for someone to want to flush several hundred MiB from a dataframe at once; such an operation needs a bigger timeout than sending a few hundred rows. The `request_min_throughput` parameter is what extends the request timeout in proportion to the amount of data being sent.
You should get this by setting the request timeout to 10 seconds; a sketch of the relevant configuration keys follows below. That would give you two requests of 10 seconds each, or set the retry timeout to 0 for a single attempt. Either way, whether this fixes your issue or not, let me know; timeouts can be fiddly and feedback is most appreciated. Specifically, I'd be curious to know what your buffer size is.
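For concreteness, here is a minimal sketch of such a configuration using the documented configuration-string keys (`request_timeout`, `request_min_throughput`, `retry_timeout`); the address and the exact values are placeholders rather than the settings originally posted:

```python
from questdb.ingress import Sender, TimestampNanos

# Sketch only: the address is a placeholder and the values are illustrative.
# Timeouts are in milliseconds.
two_attempts = (
    "http::addr=localhost:9000;"
    "request_timeout=10000;"      # aims to cap each request at ~10 s
    "request_min_throughput=0;"   # 0 disables the buffer-size-based extension
    "retry_timeout=10000;"        # keep retrying for up to 10 s after a failure
)

single_attempt = (
    "http::addr=localhost:9000;"
    "request_timeout=10000;"
    "request_min_throughput=0;"
    "retry_timeout=0;"            # 0 disables retries: one request per flush
)

with Sender.from_conf(single_attempt) as sender:
    sender.row("my_table", columns={"value": 1.0}, at=TimestampNanos.now())
    sender.flush()                # intended to be bounded by request_timeout
```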
I very much appreciate the detailed response. Based on this information, I'm fairly confident there must be a bug then. In my specific use case the buffer grows by 9895 bytes per second, very precisely, but I cleared the entire buffer if it grew past 10,000,000 bytes. I've tried many settings before reaching out, and your suggestion had essentially already been attempted. I specifically tested:
With this and the growing buffer I saw the flush keep failing, and each flush would individually take more and more time. I saw over 800 seconds for a single flush with this exact configuration. It would make sense that the growing buffer would cause an increase in time, but I thought the timeouts would bound it. I also do agree that in some situations it is very feasible for the full transaction to take a very long time for a large buffer. I guess I really mean that it is never ideal to not be able to control this: right now I have no way to keep the flush function from taking as long as 13 minutes.

Thank you for taking time away from your holiday! I don't want to keep you from taking time off. With auto_flush and this behaving oddly, though, maybe another teammate could join in? Although I realize these things take time. Maybe it would be better to go back to the more stable previous client until these items are worked out?
ILP/HTTP is new in 2.x.x. Instead of downgrading you can temporarily swap over to TCP (a configuration sketch follows below); the functionality there is largely unchanged. Note that TCP does not do retries or requests. I'll take a look at this next week, possibly in the second half. Unless you have production urgency, I recommend just waiting for me to investigate further. I'll come back to you in case I have more questions.
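For illustration, a sketch of what the swap to TCP could look like with the standard configuration-string format (host and port are placeholders):

```python
from questdb.ingress import Sender, TimestampNanos

# Sketch: same client, TCP transport instead of HTTP. TCP has no per-request
# timeouts or retries; data is written to the socket as it is flushed.
with Sender.from_conf("tcp::addr=localhost:9009;") as sender:
    sender.row("my_table", columns={"value": 1.0}, at=TimestampNanos.now())
    sender.flush()
```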
That's true; I'll probably go back to TCP like you suggested, on the same client version. I appreciate your assistance with this. Looking forward to hearing more.
Noting the relevant Slack thread for posterity: https://questdb.slack.com/archives/C1NFJEER0/p1711771491609659
I've instrumented the Rust code (which implements the logic). For example, setting up a sender like so:

```rust
let mut sender = server
    .lsb_http()
    .request_timeout(Duration::from_millis(50))?
    .request_min_throughput(0)?
    .retry_timeout(Duration::from_millis(100))?
    .build()?;
```

against a server that never accepts the incoming HTTP connection logs the following:
The params correctly drive expected behaviour:
I'll now check the Python wrapper (which this issue is raised against) and see if the error is somewhere in the bindings. So far I have yet to replicate your issue.
Thanks for looking into this. I'm not sure how important these items are for reproducing the problem, but when I encountered it I was also facing a high-latency, high-packet-loss connection, as well as an increasing buffer size. Maybe these two conditions could be replicated as well when testing specifically with the Python client. (I was able to reproduce the bad connection myself by connecting the client to a very distant WiFi AP.) Lastly, I tested with request_min_throughput set to 0 and at its default, so maybe try both variations of that as well. Looking forward to hearing more!
I believe this is simply a consequence (side effect) of #73. Version 2.0.0 contains a bug where the
If you take a look at the code I provided, that would not explain the issue. I no longer have a row call on the sender anywhere in my code. Because of that error (#73) that I discovered, I now have to maintain my own buffer, and I call flush manually with the buffer I build independently. So flush can only be called on the interval I define, in this case once every 10 seconds. I believe the code I provided will help clear this up.
Specifically, I timed one explicit flush call with all defaults hanging for over 800 seconds. I included the exact code and log printout in the first comment.
To be precise: this exact line took over 800 seconds with a default sender:
If this is not supposed to be possible then there is a bug, because that is exactly what happened. The logs above show how I proved that.
I still have no luck reproducing this.
Yeah, I can do that. I'll try to provide it today if possible, thanks.
Hi @amunra, I've got a very minimal script that showcases the issue. For reference, I did update to 2.0.2 before testing and tested on multiple computers with identical results. If you run the script exactly as written below with no change, you will see that the time aligns with the comment next to each sender object. I know the IP is not a valid one; leave it as is to reproduce the problem. It occurs when there is no connection or a bad connection to the database.
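As a rough approximation of the test described (not the original script), assuming the documented `Sender.from_conf` configuration strings, with the placeholder address 192.0.2.1 standing in for the invalid IP mentioned above:

```python
# Rough sketch of the kind of test described above (not the original script).
# 192.0.2.1 is a placeholder, non-routable documentation address; the labels
# describe the configurations, not measured results.
import time

from questdb.ingress import Buffer, IngressError, Sender, TimestampNanos

configs = {
    "tiny timeouts (1 ms each)": (
        "http::addr=192.0.2.1:9000;"
        "auth_timeout=1;request_timeout=1;retry_timeout=1;"
    ),
    "default timeouts": "http::addr=192.0.2.1:9000;",
}

for label, conf in configs.items():
    buf = Buffer()
    buf.row("test_table", columns={"value": 1.0}, at=TimestampNanos.now())
    with Sender.from_conf(conf) as sender:
        start = time.monotonic()
        try:
            sender.flush(buf)
        except IngressError:
            pass  # the point is how long flush blocks, not the error itself
        print(f"{label}: flush returned after {time.monotonic() - start:.1f} s")
```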
The intention is to be able to control the timeout in some reasonable way. This is a single row that is taking up to 60 seconds. In our design, the buffer is in a list that locks on flush, so a flush that blocks for 60 seconds when the expected time is roughly 0.1 seconds is beyond a usable range. The conditions under which it took 822 seconds (the original post) involved an extremely high-latency connection. There's no way I can effectively reproduce that, so I won't be able to give you code that reproduces the extremeness of the effect. I believe that under those conditions some responses are being received from the database, and in some way that is resetting an internal timer, which extends the 30-60 seconds that you see in the test above. Let me know if this information makes sense or if I need to provide more context. Thanks for the assistance.
I appreciate the thoroughness, thanks for this! I'll take a look next week.
@amunra, checking in to see if there has been any progress on this?
Hi Nick, I've tried out your snippet. The first result (30 seconds with all three Sender timeouts set to 1 ms) is explained by the default 30-second connect timeout on the underlying HTTP client. The other two examples are, I think, due to the same issue, the only difference being that our default timeouts allow a second connection attempt, adding another 30 seconds.
Understood. Getting minimal reproducible code for the extended delays I described proved difficult for me; I believe it had to do with high-latency connections. Somehow, getting some response seemed to exacerbate the problem, and recreating that scenario with code alone was difficult. But even with only part of the issue identified, that's still great news. At least getting that part taken care of will make the library much more predictable and therefore usable.
Using the questdb Python client version 2.0.0, it seems that the default HTTP timeouts are not respected when the network connection degrades. The minimal reproducible code is as follows:
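As an approximation of the setup described (not the original code): a shared buffer guarded by a lock, flushed on a fixed interval, with both the lock wait and the flush timed. The address, table name, and thresholds are placeholders based on details mentioned elsewhere in the thread.

```python
# Approximate reconstruction of the described setup, not the original code.
# The address, interval and thresholds are placeholders.
import threading
import time

from questdb.ingress import Buffer, IngressError, Sender

buffer = Buffer()
buffer_lock = threading.Lock()
# A separate producer thread (not shown) appends rows to `buffer` under
# `buffer_lock` at roughly 9895 bytes per second.

with Sender.from_conf("http::addr=10.0.0.5:9000;") as sender:  # default timeouts
    while True:
        time.sleep(10)  # flush once every 10 seconds
        t0 = time.monotonic()
        with buffer_lock:
            lock_s = time.monotonic() - t0
            if len(buffer) > 10_000_000:
                buffer.clear()  # drop the backlog if it grows too large
            t1 = time.monotonic()
            try:
                sender.flush(buffer)
            except IngressError as exc:
                print(f"flush failed: {exc}")
            flush_s = time.monotonic() - t1
        print(f"lock wait: {lock_s:.3f} s, flush: {flush_s:.3f} s")
```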
For reference, there is another loop that feeds this buffer using the lock, but this is irrelevant to the problem I'm having, and I've printed the lock acquire time to rule it out as an issue. The results show a growing flush time; for example, this is an exact output from my logs:
You can see the lock is taking almost no time, so that's not the problem, but we're seeing a growing flush time, seen here at 822 seconds. Here are the ping stats during this time for reference:
The expectation would be that, with the default HTTP configuration, the flush never takes longer than 10 seconds no matter what. Currently the flush time keeps increasing. It may be growing in conjunction with the growing buffer, but that's difficult to know without knowledge of how the internals of flush work. It also seems that once the connection degrades and the buffer starts growing, all consecutive flush attempts simply fail. If the buffer gets small enough and the connection gets slightly better, the flush will finally succeed as normal.