
candump: Prevent file I/O blocking socket reading #381

Open
Ipiano opened this issue Oct 3, 2022 · 9 comments

Comments

Ipiano commented Oct 3, 2022

tl;dr - I have some changes, but we're not sure if they're appropriate to upstream because they're a bit specific to our issues. Please advise

I'm working with some systems that are particularly I/O-bound, and we're seeing a lot of dropped CAN messages in the logs we're capturing with candump.

We run candump on startup, logging to a file. Every hour, we kill it, rotate the file, and start it again. We've opted to use this approach over something like logrotate because an external tool can't guarantee we won't rotate in the middle of a write to the file (unless the kernel can?)

Regardless of whether this is an appropriate rotation strategy, we're seeing a lot of randomly dropped messages in the middle of logs. We've determined that this is because we occasionally saturate the disk I/O on the system: candump blocks for a couple of seconds while writing to the file, can't read from the socket, and the socket buffer overflows and drops frames.

We're looking to solve this by spawning a new thread and doing all logging from there, buffering internally in an auto-expanding circular queue. This approach fixes the dropped-frames issue, but it currently puts no bound on the amount of memory candump will use at runtime. In practice it's bounded by how long disk I/O blocks; once the buffer has expanded to a size that accommodates our average bus load over a fairly long stall, it's pretty stable.
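As a sketch of what we mean by an auto-expanding circular queue (the names and the doubling growth policy here are our own illustration, not anything in can-utils):

```c
#include <stdlib.h>
#include <string.h>

/* Minimal auto-expanding circular queue of fixed-size records.
 * Doubles its capacity when full, so a push never blocks on I/O. */
struct ring {
    unsigned char *buf;
    size_t rec_size;   /* size of one record (e.g. a frame + metadata) */
    size_t cap;        /* capacity in records */
    size_t head;       /* next slot to pop  */
    size_t tail;       /* next slot to push */
    size_t count;      /* records currently queued */
};

int ring_init(struct ring *r, size_t rec_size, size_t cap)
{
    r->buf = malloc(rec_size * cap);
    if (!r->buf)
        return -1;
    r->rec_size = rec_size;
    r->cap = cap;
    r->head = r->tail = r->count = 0;
    return 0;
}

int ring_grow(struct ring *r)
{
    size_t new_cap = r->cap * 2;
    unsigned char *nb = malloc(r->rec_size * new_cap);

    if (!nb)
        return -1;
    /* Unwrap the old contents into the new buffer, oldest first. */
    for (size_t i = 0; i < r->count; i++)
        memcpy(nb + i * r->rec_size,
               r->buf + ((r->head + i) % r->cap) * r->rec_size,
               r->rec_size);
    free(r->buf);
    r->buf = nb;
    r->cap = new_cap;
    r->head = 0;
    r->tail = r->count;
    return 0;
}

int ring_push(struct ring *r, const void *rec)
{
    if (r->count == r->cap && ring_grow(r) < 0)
        return -1;                 /* out of memory: caller must drop */
    memcpy(r->buf + r->tail * r->rec_size, rec, r->rec_size);
    r->tail = (r->tail + 1) % r->cap;
    r->count++;
    return 0;
}

int ring_pop(struct ring *r, void *rec)
{
    if (r->count == 0)
        return -1;
    memcpy(rec, r->buf + r->head * r->rec_size, r->rec_size);
    r->head = (r->head + 1) % r->cap;
    r->count--;
    return 0;
}
```

In the threaded design, the socket-reading thread pushes under a mutex and the writer thread pops and performs the blocking write(), so a disk stall grows the queue instead of stalling the socket read.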

At the same time, we've moved file rotation into candump on that side thread; this guarantees we won't drop any frames between ending one log and starting a new one. We could also accomplish this by starting the new log before stopping the old one, generating a brief period of overlap.

So my question is really just: should we submit a PR for some or all of these changes? I could see arguments against pulling in pthread as a dependency or making candump a heavier-weight tool. Adding file rotation internally also breaks the single-tool, single-function philosophy, but I don't know how strictly this project holds to that.

hartkopp (Member) commented Oct 3, 2022

Hi @Ipiano ,
we had a similar request and discussion here: #268 (comment)
But this did not go further, as it wasn't reworked and rebased as suggested.
I would avoid introducing the pthread dependency if possible.
Would you please take a look at the earlier approach to see whether it would also work for your use case?
If so, we could probably start from there.

Ipiano (Author) commented Oct 4, 2022

Unfortunately, I think our solution will need to introduce pthread to do logging off the main thread. The larger issue for us is that if file I/O blocks for an extended period, it will prevent reading from the socket and messages are dropped. The potential to lose messages when rotating logs with something like logrotate is secondary.

We will evaluate starting with #268, finishing that, and then adding the logging thread in a fork from there.

hartkopp (Member) commented Oct 4, 2022

Thanks! Looking forward to your updated patch.
Have you ever tried the -r <size> option to increase the socket receive buffer size?
Blocking file I/O of several seconds seems really long. Are you writing to slow SD cards there?
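For context, as we understand it `-r <size>` boils down to a `setsockopt(SO_RCVBUF)` call on the socket before reading starts. A minimal stand-alone sketch of that call (using a plain UDP socket here so it runs anywhere; the call is the same on a CAN_RAW socket):

```c
#include <sys/socket.h>
#include <unistd.h>

/* Ask the kernel for a larger per-socket receive buffer and report
 * the size actually granted. The kernel doubles the requested value
 * (to account for bookkeeping overhead) and caps the request at
 * net.core.rmem_max; SO_RCVBUFFORCE (root only) can exceed that cap.
 * Returns the effective size in bytes, or -1 on error. */
int set_rcvbuf(int fd, int requested)
{
    int actual = 0;
    socklen_t len = sizeof(actual);

    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF,
                   &requested, sizeof(requested)) < 0)
        return -1;
    if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &actual, &len) < 0)
        return -1;
    return actual;
}
```

A larger buffer only buys time proportional to its size divided by the bus load, so it helps only if the worst-case stall is bounded, which is exactly the point debated below.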

Ipiano (Author) commented Oct 4, 2022

I suspect we could "solve" the issue by increasing the socket buffer size, but an auto-resizing queue is preferable because we don't know how long the I/O will end up blocking.

No, we're writing to the disk on custom hardware that we produce. It's a known issue with our hardware + BSP combination.

olerem (Contributor) commented Oct 5, 2022

Hm, did I miss something? What is an "auto-resizing queue"? As far as I know, we never wait until the queue is full before reading from it.

olerem (Contributor) commented Oct 5, 2022

With pthreads you are moving the queue into user space, introducing more complexity and more code. If the I/O stall gets even longer, you will start to drop packets in user space, or in both user space and the kernel. In the kernel-only case you have diagnostics at a single level: the kernel interface's drop counters. With a second queue in the application, you will need to introduce extra diagnostics.

marckleinebudde (Member) commented

Using (and increasing) the existing socket buffer between the kernel and user space is preferable to introducing another buffer.

That said, if you do introduce a user-space buffer, an auto-resizing buffer without bounds is IMHO a bad idea; better to use a sensibly sized buffer as the default and add a command line switch to set the size.

Instead of using threads, you can switch both file descriptors into non-blocking mode and drive them from an epoll() loop.
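The shape of that approach might look roughly like the following sketch. It assumes both descriptors honor O_NONBLOCK (see the caveat after the code); a pipe stands in for the CAN socket in the test so the helpers stay self-contained:

```c
#include <fcntl.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Switch an fd into non-blocking mode so a stalled write() returns
 * EAGAIN instead of hanging the process. Returns 0 on success. */
int make_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);

    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Create an epoll instance watching fd for readability. Returns the
 * epoll fd, or -1 on error. A full candump loop would also register
 * the log fd for EPOLLOUT and drain a user-space queue whenever the
 * disk becomes writable again. */
int epoll_watch_read(int fd)
{
    int epfd = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };

    if (epfd < 0)
        return -1;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) < 0)
        return -1;
    return epfd;
}

/* Wait up to timeout_ms for one event.
 * Returns 1 if an fd is ready, 0 on timeout, -1 on error. */
int wait_ready(int epfd, int timeout_ms)
{
    struct epoll_event ev;

    return epoll_wait(epfd, &ev, 1, timeout_ms);
}
```

One caveat, which Ipiano raises below as well: on regular files O_NONBLOCK is generally ignored, and a write() can still block in the kernel under writeback pressure, so this pattern needs to be verified on the target storage before dropping the thread idea.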

Ipiano (Author) commented Oct 5, 2022

The goal here is to avoid dropping any messages. Messages are dropped when we can't write to the file at the same speed that data arrives on the socket. Dropped messages are a huge issue for us because our embedded systems don't write log files, so the CAN dumps are the only data we have for diagnosing a lot of issues.

If we increase the size of the socket queue at startup, it probably helps in most cases, but we don't currently have an upper bound on how long disk I/O can block, so there would likely still be some drops. Maybe we could find a number that's sufficient for 90% of cases and call it a day. This is already supported and would require no changes.

If we auto-resize the existing socket buffer, I'm not aware of a way to know it needs resizing until after we've already dropped a message; in that case a few messages would be dropped at the start of each log, but it would (probably) eventually stabilize until a reboot.

If we use a user-space buffer, we can resize it as needed to guarantee no message is dropped, because we can discover that the buffer is undersized before a message has to be dropped.

I agree that not having an upper bound on it is probably not a great solution; but if we're able to define an upper bound on how long a system might block up, then we can solve this with option 1: just resizing the socket buffer.
Does this potentially indicate other issues (with the hardware, OS, and organizational culture) that probably also need to be solved? Yes. But this is the one we're trying to solve right now.

Assuming a non-blocking file descriptor would return EAGAIN in this situation rather than just hanging anyway, yes, that works and we can avoid pthread. I need to check the behavior there.

Regardless, I think the answer to my initial question is "no, we don't want your 'fix' for your janky-ass hardware issues in candump", and I'm fine with that being the answer; it's what I initially expected when I asked.

marckleinebudde (Member) commented

From my point of view a sensibly sized buffer should be the default. Add a command line switch to override the size, and an additional switch to create the bottomless buffer you need. Add extra warnings to the help output, like "use at your own risk, may eat all your memory". :)
