Add accumulated usage #155
Conversation
I've been dogfooding this on my server and ran into a performance issue over time. Because I'm remembering every connection ever established, the number of items to sort before displaying them onscreen balloons quickly:
[src/display/ui_state.rs:199] self.processes.len() = 22
[src/display/ui_state.rs:200] self.remote_addresses.len() = 7808
[src/display/ui_state.rs:201] self.connections.len() = 93394
I've now implemented a pruning step in the UIState update, so I end up with something much more manageable:
[src/display/ui_state.rs:199] self.processes.len() = 22
[src/display/ui_state.rs:200] self.remote_addresses.len() = 1000
[src/display/ui_state.rs:201] self.connections.len() = 1000
Here only the 1000 biggest values are retained. This fixes the performance leak, and it will run nicely for a long time without eating up CPU cycles. Again, I've left some comments in the code and would love your feedback on those suggestions :)
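(For readers following along, the pruning step amounts to something like the sketch below. This is illustrative only; the map type and names are assumptions rather than the PR's actual code, with MAX_BANDWIDTH_ITEMS taken from the discussion.)

```rust
use std::collections::HashMap;

// Illustrative cap; the PR discussion calls this MAX_BANDWIDTH_ITEMS.
const MAX_BANDWIDTH_ITEMS: usize = 1000;

/// Keep only the entries with the largest accumulated byte counts.
fn prune(connections: &mut HashMap<String, u128>) {
    if connections.len() <= MAX_BANDWIDTH_ITEMS {
        return;
    }
    let mut entries: Vec<(String, u128)> = connections.drain().collect();
    // Sort by accumulated bytes, descending, then drop everything past the cap.
    entries.sort_unstable_by_key(|(_, bytes)| std::cmp::Reverse(*bytes));
    entries.truncate(MAX_BANDWIDTH_ITEMS);
    *connections = entries.into_iter().collect();
}
```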
Hey @TheLostLambda - really great work! I am very happy to see this change coming along and can't wait to see this merged and released.
Aside from the comments in the code, I mainly have one issue with the design, and that's with the addition of MAX_BANDWIDTH_ITEMS. I definitely understand the performance implications and am happy you caught it early (it was something I was honestly planning to look at when I saw the PR :) ).
The problems I see with the current implementation:
- The total bandwidth we'd be showing on the top line would be a little inaccurate right now. It would include bandwidth from items we'd already removed from the list. While this isn't terrible (the list is 1000 items long, after all) I can imagine if people track these things, it might get a little confusing.
- The cross-referencing between tables (eg. number of connections per process) is affected by this and becomes inaccurate.
- This also creates an issue where new items will have a hard time getting into the list as time goes on (they start with small accumulated totals, so they get pruned before they can catch up to long-lived entries).
I have two solutions in mind for this (and would be happy to hear other suggestions and/or your thoughts about these):
- Instead of removing items from the list, we could have a "Utilization < [count]" field in each table, and we could update it whenever we need to remove something from the list.
- We could re-count the total utilization in real time on every tick. This might be easier to implement, but I feel it's less "holistic" than the first solution.
Let me know what you think, and if there's anything else I can do to make this happen!
Thank you for all of the excellent feedback! I'll look into addressing a couple of things today and then finishing things off in the days to come :) Before I dive too deep though, could you elaborate a bit on this?
Another option would be to use a Binary Heap. We could build up the data in a binary heap (it's roughly O(1) for adding values) and then pop values off in sorted order (but the whole structure does not need to be sorted!). That way we could take only 100 or so items, but still remember everything we have seen. We could also use the purpose-built Taking 100 sorted elements from a Vec of 2,500,000 elements:
Strictly speaking, unless we delete items, we'll hit a performance wall eventually (memory or CPU). This is also only addressing the sorting step of the performance problem (which I certainly think is the limiting step, but there might be other lurking bottlenecks). Maybe a compromise is to increase
As for this, I agree it's not ideal, but I would rather this be the case than have the total ignore the smaller connections. I think that by recomputing it, we are no longer reporting an accurate measure of data sent / received (even if we do maintain internal consistency).
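(A minimal sketch of the binary-heap idea from the comment above: build the heap cheaply once, then pop only as many items as we actually display. Illustrative only; not code from the PR.)

```rust
use std::collections::BinaryHeap;

/// Return the `n` largest values without sorting the whole collection.
/// Building the heap is O(len) and each pop is O(log len), so displaying
/// the top 100 items stays cheap no matter how many entries we remember.
fn top_n(values: Vec<u64>, n: usize) -> Vec<u64> {
    let mut heap = BinaryHeap::from(values);
    let mut largest = Vec::with_capacity(n);
    while largest.len() < n {
        match heap.pop() {
            Some(value) => largest.push(value), // popped in descending order
            None => break,
        }
    }
    largest
}
```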
Unfortunately, I think there is a deeply nested issue that might be blocking this functionality. I've been doing some testing, and it looks like Bandwhich is missing a lot of data. It seems to be somewhat speed-dependent. If I download a 10MB file at 100KBps, then bandwhich totals up to ~6MB, but at full speed (~6.5MBps) it only comes out to 1.6MB. Downloading a 1GB file at 25MBps reports only 185MB in bandwhich, with or without the divide_by. Furthermore, even the traditional rate mode seems broken 😢 A rate of more than 20MBps reports under 5MBps in bandwhich (even in the officially packaged version of Bandwhich). This is all on Linux. Do you have any idea what might be causing this?
My suspicion is that it might be (related to) this: #136
Hmm, okay. I might be able to take a look into it, but I'm not sure when I'd get around to it. What sort of setup are you running where it works properly?
That would be great. I hope it doesn't block your current work here? If so, how can I help? I'm using Arch Linux on my ThinkPad T450. I get accurate reporting when downloading a large file with one connection or when downloading from multiple connections (eg. downloading a Linux distribution through a torrent - I use Debian).
Curious, I'm also on Arch Linux and don't get accurate reporting in either case (on my desktop). I'll try on my ThinkPad as well, I suppose. And no, I think work on this can continue as long as the underlying issue is eventually tracked down :) What did you think about the performance compromise I mentioned back in this comment? EDIT: No luck on my Arch Linux P52s ThinkPad either... Still inaccurate reporting :(
I'll take a closer look at your comments and address everything in the next few hours. I wanted to answer what I could quickly to get stuff out of the way for you :)
Hey, so regarding the I definitely hear what you're saying about all of these solutions hitting some sort of bottleneck in the end (even if only after running for several weeks). I think both of the proposed solutions are great, but definitely in the "least bad thing" category. :) Let's reach for them if we decide that's the best way to go. I'll try to explain my original idea better: I propose we still use the
I quite like the idea of that! That seems like a good way to impose a maximum limit on the number of items in the maps and still have a list that sums to the total! I'll go about implementing that soon. I think it's quite a good compromise :) In that vein, would you mind if I swapped out the BTreeMap for an ordinary HashMap? Technically it should be a little quicker. I may also explore raising the
Sounds great. Feel free to make any changes you see fit and we can discuss their ups and downs when you're ready (or before, if you'd like the input, ofc).
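(An illustrative sketch of the compromise agreed above: cap the number of entries, but fold the bandwidth of pruned entries into a running "drift" counter so the displayed total still adds up. The constant and field names follow the discussion; the function itself is an assumption, not the PR's actual code.)

```rust
use std::collections::HashMap;

const MAX_BANDWIDTH_ITEMS: usize = 10_000; // illustrative cap from the discussion

/// Prune the map down to the cap, remembering how much bandwidth the
/// removed entries carried so the on-screen total still accounts for it.
fn prune_with_drift(map: &mut HashMap<String, u128>, drift: &mut u128) {
    if map.len() <= MAX_BANDWIDTH_ITEMS {
        return;
    }
    let mut entries: Vec<(String, u128)> = map.drain().collect();
    entries.sort_unstable_by_key(|(_, bytes)| std::cmp::Reverse(*bytes));
    // Everything past the cap gets folded into the drift counter.
    for (_, removed_bytes) in entries.split_off(MAX_BANDWIDTH_ITEMS) {
        *drift += removed_bytes;
    }
    *map = entries.into_iter().collect();
}
```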
Alright, added a bit more polish, but I had a question! How should the deleted entry category actually be shown in the tables? What do we show for connection count (process and address tabs) or the process name (for the connections tab)? Currently, if you run this version of the code and redirect stderr: Also, I don't presume that we want this row to show unless items have actually been removed, which means it will only be included if more than 10,000 items are already in the table. Does raw mode also read these tables? If not, what's the point of a row that no one will ever see? On some other notes, I've moved to HashMaps and am only sorting the vectors once (before, the maps would need to be sorted every time they were trimmed and again when they were displayed). A consequence of this is that all sorting operations / code (as well as the details of what map type is used) are internal to UIState now. This means it's easy to change out the sorting or internal data structures without touching any other files. Let me know what you think and what we should put in these tables! Hope you are well!
Good point about the extra row... I think it's mostly there in order to support the total bandwidth. About the connections: hoi... tbh I'm starting to remember all the complexity involved in the feature. :) So - we could either not display them (and then change the title of the column to "active connections" or "active processes") or keep track of them in a different struct when we are in cumulative mode. Makes sense? Or do you have a different idea?
Currently they exist in special fields of the
pub processes_drift: Bandwidth,
pub remote_addresses_drift: Bandwidth,
pub connections_drift: Bandwidth,
I'm not sure where they would best fit in the GUI, to be honest, but there is a struct with this data when cumulative mode is running (though I might be misunderstanding what you are going for).
Ah, sorry. I think I was trying to solve a different problem. Been a long day :)
No worries, I know how you feel :) I don't think it would be very useful in raw mode either. Should I keep the "drift tracking" in the code in case we want to use that value in the future? Or should I cut it?
I haven't yet gone over the code, but will the total bandwidth calculation be correct without it if we're removing excess connections?
I believe so, yes. It just adds up over time. It's not calculated from the entries or anything.
Awesome - so let's remove them. I prefer not having things in the code unless we directly use them.
Sounds good! I'll push something in a couple of minutes with them cut out.
Alright! This should be some pretty final code! Could I get some help updating the tests? I couldn't manage to get them working, and they broke after I removed the word "Rate" from the interface.
Hey @TheLostLambda - I'm going to take a look at this in the next few days. About the tests, we use cargo insta for the snapshots. If you install it, you can do EDIT: oops, forgot to link :) https://docs.rs/insta/0.8.1/insta/
No matter what I try, I can't get things to pass. Most of the tests are fine, needing just the removal of "Rate", but in the last handful of tests, while all of the lines are always present, they are presented in a random order. Maybe you could take a look when you get the chance?
Off the top of my head, this seems like something isn't sorted (eg. when entries have the same amount of bandwidth, they're not also sorted by name) - but that's just a guess. I can take a closer look (and review everything else) over the weekend.
Yep, it looks like I was just being a bit silly! Moving from BTreeMaps to HashMaps means that things are no longer sorted by key, and the test packets: I've coerced the tests into passing by changing Everything should pass now :) Good catch!
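(A sketch of the ordering suggestion a couple of comments up: sort by bandwidth and break ties by name, so entries with equal bandwidth always come out in the same order for the snapshot tests. Illustrative only, not the PR's actual fix.)

```rust
/// Sort entries by bandwidth (descending), breaking ties alphabetically
/// so snapshot tests always see a stable order.
fn sort_for_display(entries: &mut Vec<(String, u128)>) {
    entries.sort_unstable_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
}
```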
Hey @TheLostLambda - this is some fantastic work. I went over everything and the changes seem solid. For me, the only thing missing before we can merge is a test for this. I don't think it should be too involved to add one - just pass the flag and see that the traffic is accumulated. What do you think? Also, it's fair to note some issues this introduces:
Thanks for giving it a look! I can try to put together a test for it today :) As for the pause and connection issues, they were bugging me too. I tried to accumulate connections, but that would also require some major reworking, as I'll try to add some tests today and let you know how it goes!
Sorry for the delay @imsnif! I've duplicated all of the It seems to work well and the totals add up from snapshot to snapshot.
let (terminal_events, terminal_draw_events, backend) = test_backend_factory(190, 50);
let os_input = os_input_output(network_frames, 3);
let mut opts = opts_ui();
I think these still use total_utilization: false, don't they?
Initially, opts_ui() does have total_utilization set to false, but I'm setting it to true in the line immediately after. Adding another "opts building" function was a bit messy, so I opted for that. I could define an opts_ui_total() function if you'd rather.
The new snapshots from these tests do show totals as opposed to rates ("B" not "Bps"). Perhaps I'm missing something?
Oh man, I managed to totally miss that line! My bad, sorry. :)
Haha, no worries :)
Hey @TheLostLambda - I think everything looks great. I'd be happy to go ahead and merge this. Or is there anything still open that I'm missing? (I'll open issues for the above two caveats we talked about after the release.)
I think that it should be ready for merging :) I've been running it for a couple of days without issues. Perhaps we could add an issue for displaying the elapsed time when you are in total mode? I can just check how long the process has been running with
@TheLostLambda - thank you very much for this! I'm excited about releasing this feature. :) I like your idea about the elapsed time (when in total_utilization mode). Would you like to open an issue about it?
Sure! I'll make that issue now :) Thanks for making it so easy to contribute to this awesome repository!
Closes #147
This works pretty well, but the existing code wasn't really designed for accumulated usage like this, so I've left in some comments where I'd appreciate some others' opinions on refactoring matters.
This feature really adds a lot to bandwhich for me! I have a metered connection on a server, and this PR makes it possible to track down which processes use the most data over a longer period of time! This is a draft PR at the moment, and I'd be hesitant to merge before we have the chance to address some of the comments I left behind :)
Additionally, the test snapshots have not been updated yet.