
The Tribler core becomes unresponsive when running with a large metadata store #5578

Closed
devos50 opened this issue Sep 23, 2020 · 21 comments

@devos50
Contributor

devos50 commented Sep 23, 2020

@synctext noticed that when running Tribler with a large metadata store, Tribler becomes slow and occasionally unresponsive. For example, search queries can take several minutes to return results. I have tested Tribler with his state directory, but was unable to reproduce this on my Mac.

We could try to integrate this state directory with the application tester, and monitor the response times of requests.

I can share the state directory on request.

@qstokkink
Contributor

I have my own separate 1.6 GB metadata.db and I can definitely see the core being blocked after a search, though nothing on the order of minutes.

[screenshot]

As I'm running on an SSD, my best guess is that I/O is happening on the main thread and this is tied to the speed of the storage medium.
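For reference, here is a standalone Python sketch (not Tribler code) of one way to check that guess: measure how late a scheduled asyncio wake-up fires on the core's event loop; a large gap means something blocked the main thread, e.g. synchronous SQLite I/O. The interval and warning threshold are arbitrary.

import asyncio
import logging
import time

# Standalone sketch (not Tribler code): measure how late a scheduled
# wake-up fires; a large gap means the main thread was blocked.
async def monitor_loop_lag(interval=0.5, warn_after=0.1):
    while True:
        start = time.monotonic()
        await asyncio.sleep(interval)
        lag = time.monotonic() - start - interval
        if lag > warn_after:
            logging.warning("main thread blocked for ~%.2f s", lag)

# asyncio's built-in debug mode reports slow callbacks as well:
#   loop.set_debug(True)
#   loop.slow_callback_duration = 0.1  # seconds

if __name__ == "__main__":
    logging.basicConfig(level=logging.WARNING)
    asyncio.run(monitor_loop_lag())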

@ichorid
Contributor

ichorid commented Sep 23, 2020

@synctext, we need your laptop to reproduce this.

@qstokkink
Contributor

qstokkink commented Sep 24, 2020

Reproduced by copying metadata.db to an external thumbdrive: absolute flatline (the following image IS ANIMATED, you just have to wait 40 seconds):

[animated screenshot: triblerperf]

@qstokkink
Contributor

qstokkink commented Sep 24, 2020

In-depth reproduction:

  1. Copy your metadata.db to your external thumbdrive (I put mine in the /media/quinten/USB DISK/tribler_metadata/ folder). My thumbdrive is USB 2.0 and has a nice 20 MBps read and write speed.
  2. Apply the following patch (with your respective folder of course):
diff --git a/src/tribler-core/tribler_core/session.py b/src/tribler-core/tribler_core/session.py
index dcafe99aa..ca5161843 100644
--- a/src/tribler-core/tribler_core/session.py
+++ b/src/tribler-core/tribler_core/session.py
@@ -415,7 +415,8 @@ class Session(TaskManager):
             channels_dir = self.config.get_chant_channels_dir()
             metadata_db_name = 'metadata.db' if not self.config.get_chant_testnet() else 'metadata_testnet.db'
             database_path = self.config.get_state_dir() / 'sqlite' / metadata_db_name
-            self.mds = MetadataStore(database_path, channels_dir, self.trustchain_keypair)
+            import pathlib
+            self.mds = MetadataStore(pathlib.Path('/media/quinten/USB DISK/tribler_metadata/metadata.db'), channels_dir, self.trustchain_keypair)
 
         # IPv8
         if self.config.get_ipv8_enabled():
  3. Run Tribler and open the IPv8/health panel. Maybe also do some keyword searches to completely decimate the main thread.
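To separate raw storage speed from Tribler's own code, one could also time a query directly against the copied metadata.db. A rough standalone sketch; the path is the one from step 1, and the queried table and column are assumptions, so inspect the schema output first and adjust:

import sqlite3
import time

# Standalone sketch: time a search directly against the copied metadata.db.
# The table/column used in the query are assumptions; check the schema first.
DB_PATH = '/media/quinten/USB DISK/tribler_metadata/metadata.db'

conn = sqlite3.connect(DB_PATH)
print('tables:', [row[0] for row in
                  conn.execute("SELECT name FROM sqlite_master WHERE type='table'")])

start = time.monotonic()
count = conn.execute("SELECT count(*) FROM ChannelNode WHERE title LIKE ?",
                     ('%ubuntu%',)).fetchone()[0]  # table/column are assumptions
print('matches: %d in %.2f s' % (count, time.monotonic() - start))
conn.close()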

@ichorid
Contributor

ichorid commented Sep 24, 2020

🤦
I had so hoped that we were done with slow thumbdrives...

@ichorid
Contributor

ichorid commented Sep 24, 2020

The only way to guarantee that the metadata store will work even on very slow 🐌 media is to implement an asynchronous priority queue. We already have a ticket for this: #4320
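A sketch of the idea only (not #4320's actual design and not Tribler code): callers enqueue database jobs with a priority, and a single consumer runs them on a worker thread, so slow storage never blocks the asyncio main thread.

import asyncio
import itertools

# Sketch of an asynchronous priority queue for DB work (idea only).
class PriorityDBQueue:
    def __init__(self):
        self._queue = asyncio.PriorityQueue()
        self._counter = itertools.count()  # tie-breaker for equal priorities

    async def run(self, priority, func, *args):
        """Enqueue a blocking DB call and await its result (lower number = higher priority)."""
        future = asyncio.get_running_loop().create_future()
        await self._queue.put((priority, next(self._counter), func, args, future))
        return await future

    async def worker(self):
        loop = asyncio.get_running_loop()
        while True:
            _priority, _seq, func, args, future = await self._queue.get()
            try:
                # run_in_executor keeps the blocking DB call off the main thread
                future.set_result(await loop.run_in_executor(None, func, *args))
            except Exception as exc:
                future.set_exception(exc)
            finally:
                self._queue.task_done()

With something like this, a user-facing search could be enqueued with priority 0 while background gossip processing uses a larger number, so interactive queries are served first.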

@ichorid
Contributor

ichorid commented Sep 24, 2020

@synctext, do you really run Tribler on a thumb drive? If you do, what's your opinion: should we keep this as a blocker, and thus prioritize making Tribler run on extremely slow storage media?

@synctext
Member

No, my Linux box has SSD storage with a fast i7.
(My home test Mac uses a USB stick for mass storage.)

@ichorid
Contributor

ichorid commented Sep 24, 2020

@synctext, if you can really reproduce it every time, the only way I can debug it is to connect remotely to the machine that shows the problem.

@devos50
Contributor Author

devos50 commented Sep 24, 2020

Also related to #5208

@qstokkink
Contributor

This still smells like an I/O issue to me. Even SSDs can have bad performance; see for instance https://haydenjames.io/linux-server-performance-disk-io-slowing-application/:

On this server, I was able to perform a quick benchmark of the SSD after stopping services and noticed that disk performance was extremely poor. The results: 1073741824 bytes (1.1 GB) copied, 46.0156 s, 23.3 MB/s

@synctext, could you provide some evidence (for example by following the tutorial on the linked page) that your disk is still OK?

@synctext
Member

It's not my laptop, it's Tribler :-)
I'll do more detailed measurements on Monday, but note that I was also seeding several channels & swarms as additional system load.

sudo dd if=/dev/zero of=/tmp/test2.img bs=1G count=1 oflag=dsync
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 3.13891 s, 342 MB/s

@qstokkink
Contributor

@synctext That's one big write; I'm more interested in the performance of many small ones (for instance, many small commits to an SQL database).

sudo dd if=/dev/zero of=/tmp/test2.img bs=512 count=1000 oflag=dsync

If that doesn't bring your disk down to roughly 20 MB/s or worse, it's in the clear.
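To make the SQL angle concrete, here is a rough illustration (not Tribler code): every SQLite commit forces a sync to disk, so per-row commits behave like dd with oflag=dsync, while a single transaction batches the writes.

import os
import sqlite3
import tempfile
import time

# Rough illustration (not Tribler code): per-row commits force a sync each
# time, a single transaction batches them.
def bench(batched, rows=1000):
    with tempfile.TemporaryDirectory() as tmpdir:
        conn = sqlite3.connect(os.path.join(tmpdir, 'bench.db'))
        conn.execute('CREATE TABLE t (x INTEGER)')
        start = time.monotonic()
        if batched:
            with conn:  # one transaction, one commit
                conn.executemany('INSERT INTO t VALUES (?)', ((i,) for i in range(rows)))
        else:
            for i in range(rows):
                conn.execute('INSERT INTO t VALUES (?)', (i,))
                conn.commit()  # one sync per row
        conn.close()
        return time.monotonic() - start

print('per-row commits:    %.2f s' % bench(batched=False))
print('single transaction: %.2f s' % bench(batched=True))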

@synctext
Member

Ahhh, good to check. My SSD has quite bad performance for small fragments. Almost like hard-disk seek times.

XPS13:~/GITHUB>sudo dd if=/dev/zero of=/tmp/test2.img bs=4k count=1000 oflag=dsync
4096000 bytes (4.1 MB, 3.9 MiB) copied, 3.22493 s, 1.3 MB/s
XPS13:~/GITHUB>sudo dd if=/dev/zero of=/tmp/test2.img bs=2k count=1000 oflag=dsync
2048000 bytes (2.0 MB, 2.0 MiB) copied, 3.26632 s, 627 kB/s
XPS13:~/GITHUB>sudo dd if=/dev/zero of=/tmp/test2.img bs=1k count=1000 oflag=dsync
1024000 bytes (1.0 MB, 1000 KiB) copied, 3.17648 s, 322 kB/s
XPS13:~/GITHUB>sudo dd if=/dev/zero of=/tmp/test2.img bs=512 count=1000 oflag=dsync
512000 bytes (512 kB, 500 KiB) copied, 3.24638 s, 158 kB/s
XPS13:~/GITHUB>sudo dd if=/dev/zero of=/tmp/test2.img bs=256 count=1000 oflag=dsync
256000 bytes (256 kB, 250 KiB) copied, 3.11922 s, 82.1 kB/s

@qstokkink
Contributor

@synctext Well, there's the problem: that's even slower than my thumbdrive. How to improve that... I don't know :)

@qstokkink
Contributor

Ok, let's try to get a satisfactory resolution for this issue.

On the software mitigation front:

On the hardware front, one (or more) of the following may be at play:

  • The TRIM configuration.
  • The RAID configuration.
  • The IO scheduler configuration.
  • The RAM paging (swap) configuration.
  • A (nearly) full disk.
  • Other?

On the software front, these are very invasive changes that need proper testing, and they have (rightfully) already been postponed to 7.6. The possible issues with the hardware configuration are local to your machine, @synctext, and are independent of the Tribler version. Given the current state of affairs, I would say that this issue should no longer be a blocker for 7.5.3 and should be moved to 7.6.

@synctext
Member

Agreed. This 3-minute blocking behaviour has not been reproduced or easily traced to a bug, so let's leave this for 7.6 to investigate further.

My Ubuntu box has decent performance, just not if you force dd to use synchronized I/O. Further improving our async IO is obviously on our ToDo list, once we have a dashboard and PopularityCommunity progress.

XPS13:~/GITHUB>sudo dd if=/dev/zero of=/tmp/test2.img bs=256 count=1000
256000 bytes (256 kB, 250 KiB) copied, 0.00495174 s, 51.7 MB/s
XPS13:~/GITHUB>sudo dd if=/dev/zero of=/tmp/test2.img bs=256 count=1000 oflag=dsync
256000 bytes (256 kB, 250 KiB) copied, 3.11922 s, 82.1 kB/s

From the web docs: specifying the oflag=dsync flag on dd dramatically slows down the write speed to the output file, because it uses synchronized I/O for data; for the output file, this forces a physical write of the output data on each write.
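The same effect can be shown from Python: an os.fsync() after every small write is the programmatic equivalent of dd's oflag=dsync. A small standalone sketch (not Tribler code):

import os
import tempfile
import time

# Standalone sketch: fsync after every write mimics dd's oflag=dsync; the
# same 256-byte blocks are far faster when the kernel may batch them.
def write_many(sync, block_size=256, count=1000):
    fd, path = tempfile.mkstemp()
    data = b'\0' * block_size
    start = time.monotonic()
    for _ in range(count):
        os.write(fd, data)
        if sync:
            os.fsync(fd)  # force a physical write, like oflag=dsync
    os.close(fd)
    elapsed = time.monotonic() - start
    os.unlink(path)
    return block_size * count / 1024 / elapsed  # KiB/s

print('buffered: %8.1f KiB/s' % write_many(sync=False))
print('dsync:    %8.1f KiB/s' % write_many(sync=True))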

@synctext
Member

We could try to integrate this state directory with the application tester, and monitor the response times of requests.

Great item for the ToDo list, @devos50. So, the maximum stress test (a rough timing-probe sketch follows the list):

  1. a big metadata.db of a few GByte
  2. joining numerous channels (25+)
  3. seeding some swarms (25+)
  4. doing a keyword search on a popular keyword with 1000+ possible matches
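A minimal sketch of what such a timing probe could look like: repeatedly issue a keyword search against the core's REST API and record how long each request takes. The base URL, endpoint path, and lack of authentication below are placeholders (not the actual Tribler REST API) and would need to be adapted to the application tester's configuration:

import time
import urllib.parse
import urllib.request

# Minimal sketch of a response-time probe for the application tester.
# BASE_URL and SEARCH_PATH are placeholders, not the real Tribler REST API.
BASE_URL = 'http://localhost:8085'            # placeholder port
SEARCH_PATH = '/search?txt_filter='           # placeholder endpoint

def timed_search(keyword, timeout=300):
    url = BASE_URL + SEARCH_PATH + urllib.parse.quote(keyword)
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=timeout) as response:
        response.read()
    return time.monotonic() - start

for keyword in ('ubuntu', 'linux', 'debian'):
    print('%s -> %.2f s' % (keyword, timed_search(keyword)))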

@ichorid
Contributor

ichorid commented Sep 28, 2021

@kozlovsky basically solved this with some SQL magic.

@ichorid ichorid closed this as completed Sep 28, 2021
@devos50
Contributor Author

devos50 commented Sep 28, 2021

Has this been tested with the notorious metadata.db of @synctext?

@ichorid
Contributor

ichorid commented Sep 28, 2021

Has this been tested with the notorious metadata.db of @synctext?

Yes, I've tested it with an even bigger DB. Of course, there is no real solution to this unless we embrace the Channels 3.0 architecture (#4677).
