
The Tribler core becomes unresponsive when running with a large metadata store #5578

Closed
devos50 opened this issue Sep 23, 2020 · 21 comments

@devos50
Contributor

devos50 commented Sep 23, 2020

@synctext noticed that when running Tribler with a large metadata store, Tribler becomes slow and occasionally unresponsive. For example, search queries can take several minutes to return results. I have tested Tribler with his state directory, but was unable to reproduce this on my Mac.

We could try to integrate this state directory with the application tester, and monitor the response times of requests.

I can share the state directory on request.

@qstokkink
Contributor

I have my own separate 1.6 GB metadata.db and I can definitely see the core being blocked after a search, though nothing on the order of minutes.

[screenshot]

As I'm running on an SSD, my best guess is that I/O is happening on the main thread and this is tied to the speed of the storage medium.
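For reference, here is a standalone Python sketch (not Tribler code) of one way to check that guess: measure how late a scheduled asyncio wake-up fires on the core's event loop; a large gap means something blocked the main thread, e.g. synchronous SQLite I/O. The interval and warning threshold are arbitrary.

import asyncio
import logging
import time

# Standalone sketch (not Tribler code): measure how late a scheduled
# wake-up fires; a large gap means the main thread was blocked.
async def monitor_loop_lag(interval=0.5, warn_after=0.1):
    while True:
        start = time.monotonic()
        await asyncio.sleep(interval)
        lag = time.monotonic() - start - interval
        if lag > warn_after:
            logging.warning("main thread blocked for ~%.2f s", lag)

# asyncio's built-in debug mode reports slow callbacks as well:
#   loop.set_debug(True)
#   loop.slow_callback_duration = 0.1  # seconds

if __name__ == "__main__":
    logging.basicConfig(level=logging.WARNING)
    asyncio.run(monitor_loop_lag())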

@ichorid
Contributor

ichorid commented Sep 23, 2020

@synctext, we need your laptop to reproduce this.

@qstokkink
Contributor

qstokkink commented Sep 24, 2020

Reproduced by copying metadata.db to an external thumbdrive: absolute flatline (the following image IS ANIMATED, you just have to wait 40 seconds):

[animated screenshot: triblerperf]

@qstokkink
Contributor

qstokkink commented Sep 24, 2020

In-depth reproduction:

  1. Copy your metadata.db to your external thumbdrive (I put mine in the /media/quinten/USB DISK/tribler_metadata/ folder). My thumbdrive is USB 2.0 and has a nice 20 MBps read and write speed.
  2. Apply the following patch (with your respective folder of course):
diff --git a/src/tribler-core/tribler_core/session.py b/src/tribler-core/tribler_core/session.py
index dcafe99aa..ca5161843 100644
--- a/src/tribler-core/tribler_core/session.py
+++ b/src/tribler-core/tribler_core/session.py
@@ -415,7 +415,8 @@ class Session(TaskManager):
             channels_dir = self.config.get_chant_channels_dir()
             metadata_db_name = 'metadata.db' if not self.config.get_chant_testnet() else 'metadata_testnet.db'
             database_path = self.config.get_state_dir() / 'sqlite' / metadata_db_name
-            self.mds = MetadataStore(database_path, channels_dir, self.trustchain_keypair)
+            import pathlib
+            self.mds = MetadataStore(pathlib.Path('/media/quinten/USB DISK/tribler_metadata/metadata.db'), channels_dir, self.trustchain_keypair)
 
         # IPv8
         if self.config.get_ipv8_enabled():
  3. Run Tribler and open the IPv8/health panel. Maybe also do some keyword searches to completely decimate the main thread.
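To separate raw storage speed from Tribler's own code, one could also time a query directly against the copied metadata.db. A rough standalone sketch; the path is the one from step 1, and the queried table and column are assumptions, so inspect the schema output first and adjust:

import sqlite3
import time

# Standalone sketch: time a search directly against the copied metadata.db.
# The table/column used in the query are assumptions; check the schema first.
DB_PATH = '/media/quinten/USB DISK/tribler_metadata/metadata.db'

conn = sqlite3.connect(DB_PATH)
print('tables:', [row[0] for row in
                  conn.execute("SELECT name FROM sqlite_master WHERE type='table'")])

start = time.monotonic()
count = conn.execute("SELECT count(*) FROM ChannelNode WHERE title LIKE ?",
                     ('%ubuntu%',)).fetchone()[0]  # table/column are assumptions
print('matches: %d in %.2f s' % (count, time.monotonic() - start))
conn.close()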

@ichorid
Contributor

ichorid commented Sep 24, 2020

🤦
I had so hoped that we were done with slow thumbdrives...

@ichorid
Contributor

ichorid commented Sep 24, 2020

The only way to guarantee that the metadata store will work even on very slow 🐌 media is to implement an asynchronous priority queue. We already have a ticket for this: #4320
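A sketch of the idea only (not #4320's actual design and not Tribler code): callers enqueue database jobs with a priority, and a single consumer runs them on a worker thread, so slow storage never blocks the asyncio main thread.

import asyncio
import itertools

# Sketch of an asynchronous priority queue for DB work (idea only).
class PriorityDBQueue:
    def __init__(self):
        self._queue = asyncio.PriorityQueue()
        self._counter = itertools.count()  # tie-breaker for equal priorities

    async def run(self, priority, func, *args):
        """Enqueue a blocking DB call and await its result (lower number = higher priority)."""
        future = asyncio.get_running_loop().create_future()
        await self._queue.put((priority, next(self._counter), func, args, future))
        return await future

    async def worker(self):
        loop = asyncio.get_running_loop()
        while True:
            _priority, _seq, func, args, future = await self._queue.get()
            try:
                # run_in_executor keeps the blocking DB call off the main thread
                future.set_result(await loop.run_in_executor(None, func, *args))
            except Exception as exc:
                future.set_exception(exc)
            finally:
                self._queue.task_done()

With something like this, a user-facing search could be enqueued with priority 0 while background gossip processing uses a larger number, so interactive queries are served first.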

@ichorid
Contributor

ichorid commented Sep 24, 2020

@synctext, do you really run Tribler on a thumb drive? If you do, what's your opinion: should we keep this as a blocker, and thus prioritize making Tribler run on extremely slow storage media?

@synctext
Member

No, my Linux box has SSD storage with a fast i7.
(My home test Mac uses a USB stick for mass storage.)

@ichorid
Contributor

ichorid commented Sep 24, 2020

@synctext, if you can really reproduce it every time, the only way I can debug it is to connect remotely to the machine that shows the problem.

@devos50
Contributor Author

devos50 commented Sep 24, 2020

Also related to #5208

@qstokkink
Contributor

This still smells like an I/O issue to me. Even SSDs can have bad performance; see for instance https://haydenjames.io/linux-server-performance-disk-io-slowing-application/:

On this server, I was able to perform a quick benchmark of the SSD after stopping services and noticed that disk performance was extremely poor. The results: 1073741824 bytes (1.1 GB) copied, 46.0156 s, 23.3 MB/s

@synctext, could you provide some evidence (for example by following the tutorial on the linked page) that your disk is still OK?

@synctext
Member

It's not my laptop, it's Tribler :-)
I'll do more detailed measurements on Monday, but note that I was also seeding several channels & swarms as additional system load.

sudo dd if=/dev/zero of=/tmp/test2.img bs=1G count=1 oflag=dsync
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 3.13891 s, 342 MB/s

@qstokkink
Contributor

@synctext That's one big write; I'm more interested in the performance of many small ones (for instance, many small commits to an SQL database).

sudo dd if=/dev/zero of=/tmp/test2.img bs=512 count=1000 oflag=dsync

If that doesn't bring your disk down to roughly 20 MB/s or worse, it's in the clear.
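To make the SQL angle concrete, here is a rough illustration (not Tribler code): every SQLite commit forces a sync to disk, so per-row commits behave like dd with oflag=dsync, while a single transaction batches the writes.

import os
import sqlite3
import tempfile
import time

# Rough illustration (not Tribler code): per-row commits force a sync each
# time, a single transaction batches them.
def bench(batched, rows=1000):
    with tempfile.TemporaryDirectory() as tmpdir:
        conn = sqlite3.connect(os.path.join(tmpdir, 'bench.db'))
        conn.execute('CREATE TABLE t (x INTEGER)')
        start = time.monotonic()
        if batched:
            with conn:  # one transaction, one commit
                conn.executemany('INSERT INTO t VALUES (?)', ((i,) for i in range(rows)))
        else:
            for i in range(rows):
                conn.execute('INSERT INTO t VALUES (?)', (i,))
                conn.commit()  # one sync per row
        conn.close()
        return time.monotonic() - start

print('per-row commits:    %.2f s' % bench(batched=False))
print('single transaction: %.2f s' % bench(batched=True))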

@synctext
Member

Ahhh, good to check. My SSD has quite bad performance for small fragments. Almost like hard-disk seek times.

XPS13:~/GITHUB>sudo dd if=/dev/zero of=/tmp/test2.img bs=4k count=1000 oflag=dsync
4096000 bytes (4.1 MB, 3.9 MiB) copied, 3.22493 s, 1.3 MB/s
XPS13:~/GITHUB>sudo dd if=/dev/zero of=/tmp/test2.img bs=2k count=1000 oflag=dsync
2048000 bytes (2.0 MB, 2.0 MiB) copied, 3.26632 s, 627 kB/s
XPS13:~/GITHUB>sudo dd if=/dev/zero of=/tmp/test2.img bs=1k count=1000 oflag=dsync
1024000 bytes (1.0 MB, 1000 KiB) copied, 3.17648 s, 322 kB/s
XPS13:~/GITHUB>sudo dd if=/dev/zero of=/tmp/test2.img bs=512 count=1000 oflag=dsync
512000 bytes (512 kB, 500 KiB) copied, 3.24638 s, 158 kB/s
XPS13:~/GITHUB>sudo dd if=/dev/zero of=/tmp/test2.img bs=256 count=1000 oflag=dsync
256000 bytes (256 kB, 250 KiB) copied, 3.11922 s, 82.1 kB/s

@qstokkink
Contributor

@synctext Well, there's the problem: that's even slower than my thumbdrive. How to improve that... I don't know :)

@qstokkink
Contributor

Ok, let's try to get a satisfactory resolution for this issue.

On the software mitigation front:

On the hardware front, one (or more) of the following may be at play:

  • The TRIM configuration.
  • The RAID configuration.
  • The IO scheduler configuration.
  • The RAM paging (swap) configuration.
  • A (nearly) full disk.
  • Other?

On the software front, these are very invasive changes that need proper testing, and they have (rightfully) already been postponed to 7.6. The possible issues with the hardware configuration are local to your machine, @synctext, and are independent of the Tribler version. Given the current state of affairs, I would say that this issue should no longer be a blocker for 7.5.3 and should be moved to 7.6.

@synctext
Member

Agreed. This 3-minute blocking behaviour has not been reproduced or easily traced to a bug, so let's leave this for 7.6 to investigate further.

My Ubuntu box has decent performance, just not if you force dd to use synchronized I/O. Further improving our async IO is obviously on our ToDo list, once we have a dashboard and PopularityCommunity progress.

XPS13:~/GITHUB>sudo dd if=/dev/zero of=/tmp/test2.img bs=256 count=1000
256000 bytes (256 kB, 250 KiB) copied, 0.00495174 s, 51.7 MB/s
XPS13:~/GITHUB>sudo dd if=/dev/zero of=/tmp/test2.img bs=256 count=1000 oflag=dsync
256000 bytes (256 kB, 250 KiB) copied, 3.11922 s, 82.1 kB/s

From the web docs: specifying the oflag=dsync flag on dd dramatically slows down the write speed to the output file, because it uses synchronized I/O for data; for the output file, this forces a physical write of the output data on each write.
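The same effect can be shown from Python: an os.fsync() after every small write is the programmatic equivalent of dd's oflag=dsync. A small standalone sketch (not Tribler code):

import os
import tempfile
import time

# Standalone sketch: fsync after every write mimics dd's oflag=dsync; the
# same 256-byte blocks are far faster when the kernel may batch them.
def write_many(sync, block_size=256, count=1000):
    fd, path = tempfile.mkstemp()
    data = b'\0' * block_size
    start = time.monotonic()
    for _ in range(count):
        os.write(fd, data)
        if sync:
            os.fsync(fd)  # force a physical write, like oflag=dsync
    os.close(fd)
    elapsed = time.monotonic() - start
    os.unlink(path)
    return block_size * count / 1024 / elapsed  # KiB/s

print('buffered: %8.1f KiB/s' % write_many(sync=False))
print('dsync:    %8.1f KiB/s' % write_many(sync=True))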

@synctext
Member

We could try to integrate this state directory with the application tester, and monitor the response times of requests.

Great item for the ToDo list, @devos50. So, the maximum stress test (a rough timing-probe sketch follows the list):

  1. a big metadata.db of a few GByte
  2. joining numerous channels (25+)
  3. seeding some swarms (25+)
  4. doing a keyword search on a popular keyword with 1000+ possible matches
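A minimal sketch of what such a timing probe could look like: repeatedly issue a keyword search against the core's REST API and record how long each request takes. The base URL, endpoint path, and lack of authentication below are placeholders (not the actual Tribler REST API) and would need to be adapted to the application tester's configuration:

import time
import urllib.parse
import urllib.request

# Minimal sketch of a response-time probe for the application tester.
# BASE_URL and SEARCH_PATH are placeholders, not the real Tribler REST API.
BASE_URL = 'http://localhost:8085'            # placeholder port
SEARCH_PATH = '/search?txt_filter='           # placeholder endpoint

def timed_search(keyword, timeout=300):
    url = BASE_URL + SEARCH_PATH + urllib.parse.quote(keyword)
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=timeout) as response:
        response.read()
    return time.monotonic() - start

for keyword in ('ubuntu', 'linux', 'debian'):
    print('%s -> %.2f s' % (keyword, timed_search(keyword)))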

@ichorid
Contributor

ichorid commented Sep 28, 2021

@kozlovsky basically solved this with some SQL magic.

@ichorid ichorid closed this as completed Sep 28, 2021
@devos50
Contributor Author

devos50 commented Sep 28, 2021

Has this been tested with the notorious metadata.db of @synctext?

@ichorid
Contributor

ichorid commented Sep 28, 2021

Has this been tested with the notorious metadata.db of @synctext?

Yes, I've tested it with an even bigger DB. Of course, there is no real solution to this unless we embrace the Channels 3.0 architecture (#4677).
