Scheduler Settings #1826
Firstly of note, there are some changes from the standard Erlang VM defaults in Riak:
The first three items are changed from default, I believe, due to issues with scheduler collapse in Riak on earlier OTP versions. When using the leveled backend, for Riak 3.0.10, it makes sense to revert these non-default settings, as almost all I/O now uses the inbuilt dirty I/O NIFs. It is expected that scheduler collapse should no longer be an issue. Likewise, the async threads are now redundant when using leveled. Caution is required if the leveldb or bitcask backends are used, as they do not necessarily use "dirty" schedulers; likewise if standard (non-tictac) kv_index_hashtree AAE is used (which will use a leveldb backend). When testing with Erlang defaults, not Basho defaults, on a leveled/tictacaae system, our standard 24-hour test saw a 1.2% throughput improvement.
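For orientation, the scheduler and async-thread configuration a node is actually running with can be checked from a remote console. This is a minimal sketch using standard `erlang:system_info/1` calls; the emulator flags noted in the comments are the usual `erl` switches, included only for reference.

```erlang
%% Minimal sketch: inspect the scheduler/async-thread configuration the
%% VM is actually running with.  All calls are standard erlang:system_info/1;
%% the flag names in comments (+S, +SDcpu, +SDio, +A, +sbt) are the usual
%% erl switches, shown for orientation only.
-module(sched_info).
-export([print/0]).

print() ->
    io:format("standard schedulers (+S): ~p online of ~p~n",
              [erlang:system_info(schedulers_online),
               erlang:system_info(schedulers)]),
    io:format("dirty CPU schedulers (+SDcpu): ~p online of ~p~n",
              [erlang:system_info(dirty_cpu_schedulers_online),
               erlang:system_info(dirty_cpu_schedulers)]),
    io:format("dirty IO schedulers (+SDio): ~p~n",
              [erlang:system_info(dirty_io_schedulers)]),
    io:format("async threads (+A): ~p~n",
              [erlang:system_info(thread_pool_size)]),
    io:format("scheduler bind type (+sbt): ~p~n",
              [erlang:system_info(scheduler_bind_type)]).
```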
By default Riak will start with one "standard" scheduler per vCPU, one "dirty CPU" scheduler per vCPU, 64 async threads and ten "dirty IO" schedulers. This means there are many more schedulers than there are vCPUs. Two different changes to this have initially been tried:

1 - Use a lower percentage of online schedulers (100%/75% for standard schedulers, 50%/25% for dirty CPU schedulers, 4 async threads). This reduces the ratio of online schedulers to vCPUs, and brings it closer to 1:1.
2 - Bind the schedulers to CPU cores.

The comparative throughput between tests with these different configurations can be seen here:

On a log scale, to show the delta at the end of the test (when the test is a more realistic representation of performance in a loaded cluster):

The test is a 24-hour test which is split into two phases. The first phase has a relatively higher proportion of PUTs, the second a relatively higher proportion of GETs. The number of deletes and 2i queries remains constant. The updates are a mix of 2KB inserts and 8KB+ inserts/updates. The test is conducted on an 8-node cluster, with each node having 96GB RAM, a series of spinning disks in RAID 10, 2 hex-core CPUs (with each core showing as 2 vCPUs), and a flash-backed write cache. The backend is leveled; flushing to disk is left under the control of the operating system. The basho_bench config file is:
The x-axis of the chart shows the accumulated updates at this point of the test. The y-axis shows the transactions per second (GET, PUT, 2i query and DELETE combined). Each point represents a throughput reading measured over a 10s period; measurements are taken every 10s in the test. At the 250M update point, the relative throughput improvements when compared to the same Riak test with Basho default settings are:
This is a very large delta, much bigger than any single throughput improvement that has been delivered recently on Riak.
However, is this throughput improvement likely to be true across other hardware configurations? Is it likely to exist with different test loads? It would be useful to dig deeper into why throughput improves under these configurations, to understand if this is a general improvement, or one specific to this configuration.
The test is designed to be primarily CPU limited at first; then, when the load switches to be more GET-biased, it is expected to be CPU heavy but primarily constrained by disk I/O. Looking at total CPU used across the nodes (sys, wait and user), comparing the three test runs:

So the indication is that with a reduced scheduler count more work is being done, but with fewer or similar CPU cycles. With binding of schedulers, more work is being done, but with more use of CPU (although throughput per CPU cycle is still better than with the default settings towards the end of the test).

One obvious difference in how the CPU is being used is the number of context switches:

Either binding schedulers to CPU cores, or reducing the number of schedulers, makes a dramatic difference to the number of context switches required. But what is the cost of a context switch? This seems a little unclear, as although there is some data available, it is difficult to know if this data is relevant to modern OS/CPU architectures - in particular the efficiency savings made related to (non-)flushing of TLBs, and how those savings are impacted by Meltdown mitigations. It might be reasonable to assume there are two basic costs: the direct cost of performing the switch itself, and the indirect cost of losing warm CPU caches after the switch.
Perhaps there are other costs related to TLB shootdowns, in particular costs if a scheduler is moved between CPU cores to get access to CPU time rather than being switched in and out on the same core. The "cost of a context switch" is almost certainly a non-trivial question to answer, and will depend on a lot of factors.
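The raw rate of context switching can at least be measured directly when comparing runs. Below is a minimal sketch (Linux-only, and assumed rather than anything used in these tests) that samples the OS-wide counter in /proc/stat from within the Erlang node:

```erlang
%% Minimal sketch (Linux only, assumed): sample the OS-wide context switch
%% counter from /proc/stat over an interval, to compare runs with
%% different scheduler settings.
-module(os_ctxt_sample).
-export([per_second/1]).

%% Average number of context switches per second over IntervalSeconds.
per_second(IntervalSeconds) ->
    C0 = read_ctxt(),
    timer:sleep(IntervalSeconds * 1000),
    C1 = read_ctxt(),
    (C1 - C0) / IntervalSeconds.

%% The "ctxt" line in /proc/stat is the total number of context switches
%% across all CPUs since boot.
read_ctxt() ->
    {ok, Bin} = file:read_file("/proc/stat"),
    Lines = binary:split(Bin, <<"\n">>, [global]),
    [CtxtLine | _] =
        [L || L <- Lines,
              binary:part(L, 0, min(5, byte_size(L))) =:= <<"ctxt ">>],
    [_, Count | _] = binary:split(CtxtLine, <<" ">>, [global]),
    binary_to_integer(Count).
```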
As well as increased throughput, one of the biggest headline gains in performance across the different tests is the improvement in GET latency:

There are multiple different parts of the GET process where speed has improved. The slot_fetch_time is the time within an SST file in the penciller to read the HEAD of an object out of a block. The block must be read from disk (which will almost certainly be a read from the VFS page cache) and go through a binary_to_term conversion, which will include a zlib decompression. The slot_fetch is the key part of the HEAD stage of the GET process.

The actual GET (which will occur on 1 vnode, not 3) has two parts in the cdb file. The first part is an index lookup, which requires a calculation of the DJ Bernstein hash of the key/SQN, and then a positioned file read of the integers at that point in the index (which will almost certainly be in the page cache). The average time for the index fetch (here in microseconds) varies significantly between the configurations. The final part is the actual reading of the object from the disk (which may or may not be in the page cache). This again varies in line with the headline latency changes (measured here in milliseconds):
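For reference, the DJ Bernstein hash mentioned above is the classic cdb-style hash. A minimal Erlang sketch is shown below; the module and function names are illustrative only, and leveled's actual implementation may differ in detail:

```erlang
%% Illustrative sketch of the classic DJ Bernstein cdb-style hash used
%% for the key/SQN index lookup described above.  Names here are not
%% leveled's actual API.
-module(cdb_hash_sketch).
-export([hash/1, slot/2]).

%% Classic cdb hash: start at 5381, then for each byte
%% h = ((h bsl 5) + h) bxor Byte, kept to 32 bits.
hash(Bin) when is_binary(Bin) ->
    hash(Bin, 5381).

hash(<<>>, H) ->
    H;
hash(<<C, Rest/binary>>, H) ->
    hash(Rest, (((H bsl 5) + H) bxor C) band 16#FFFFFFFF).

%% In the standard cdb layout the low 8 bits of the hash select one of
%% 256 index tables, and the remaining bits (modulo the table size)
%% select the slot to read within that table - the positioned index
%% read described above.
slot(Hash, SlotsInTable) ->
    {Hash band 255, (Hash bsr 8) rem SlotsInTable}.
```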
All of these processes involve both some CPU-related activity and some interaction with the virtual file system (and in some cases possibly the underlying disk). If we compare these timings to the timings of larger CPU-only tasks, a difference emerges.

Firstly, looking at the time needed to create the slot_hashlist when writing a new SST file. This is a CPU-only activity, but a background task, i.e. not directly related to any end-user latency. The timings of these in milliseconds:

Intriguingly, once into the latter part of the test there is no improvement in the time for this task between reduced and default scheduler counts. However, there is a clear improvement when the schedulers are bound to CPUs.

Secondly, looking at the hashtree computation (milliseconds) in the leveled Inker (an infrequent background process), we can see a small gain, but only through scheduler binding:
On the write side, when an object is PUT there is a process to update the Bookie's memory, which is a CPU-only process. Timings here in microseconds:

This shows the greatest latency improvement with scheduler binding. The second, and slower, phase is the writing of the object within the Inker, which includes some CPU work but also an interaction with the page cache. Timings here again in microseconds:

Now, with the I/O interaction, the big improvement is related to the reduction in scheduler counts.
Overall there seems to be a pattern. Pure CPU work is generally made faster by binding schedulers to CPUs. Work that interacts with the VFS is made faster/more efficient by reducing the count of schedulers.
But why would reducing the scheduler count have this impact? If we look at the actual underlying volume of data being written to and read from the disk, it would be reasonable to expect that this will vary between the test runs in line with the throughput. Looking at write KB per second (from disk):

The alignment between write volume and throughput appears to be roughly present, as expected. However, looking at read KB per second (from disk):

Now the alignment doesn't appear to exist, especially with reduced scheduler counts. With reduced scheduler counts there must be more reading from the VFS (as there is more throughput), but this is achieved with less reading from the actual disk. This is also true (but to a lesser extent) with scheduler binding. This would imply that the VFS page cache is being used much more in these cases. But how are scheduler changes making the VFS page cache more efficient? Looking at the reported memory deployed by the VFS page cache, there is not an obvious difference here:

So the apparent improved VFS page cache efficiency with reduced schedulers does not have an obvious or easy explanation.
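The page cache memory figure referred to here is the kind of number reported in /proc/meminfo. A minimal sketch (Linux-only, assumed, and not part of the original test harness) for sampling it from the node:

```erlang
%% Minimal sketch (Linux only, assumed): read the "Cached" figure from
%% /proc/meminfo, i.e. the memory deployed by the VFS page cache
%% referred to above (value reported in kB).
-module(page_cache_sample).
-export([cached_kb/0]).

cached_kb() ->
    {ok, Bin} = file:read_file("/proc/meminfo"),
    Lines = binary:split(Bin, <<"\n">>, [global]),
    [CachedLine | _] =
        [L || L <- Lines,
              binary:part(L, 0, min(7, byte_size(L))) =:= <<"Cached:">>],
    %% Line format is e.g. <<"Cached:        12345678 kB">>
    [_, Rest] = binary:split(CachedLine, <<":">>),
    [KB | _] = [B || B <- binary:split(Rest, <<" ">>, [global]), B =/= <<>>],
    binary_to_integer(KB).
```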
In summary, there is a performance test where we improve throughput by 22.5% by reducing the scheduler count, and get an improvement of 15.8% by binding schedulers to CPUs. The reason why seems relatively obvious and expected for scheduler binding: reduced context switching improves the efficiency of CPU-centric tasks. The reason why seems strange and mysterious for the reduced scheduler count. Yes, there are some CPU efficiency improvements potentially related to reduced context switching, but the biggest improvement appears to be in the efficiency of the VFS page cache on both reads and writes.
Scheduler binding is a known way to improve performance in Erlang systems, and is advertised as such within the Erlang documentation. In RabbitMQ, it is now the default to use scheduler binding. However, the side effects associated with other activity on the same node can be severe. In RabbitMQ this is mitigated by stipulating that nodes used for RabbitMQ are dedicated to that purpose. The same mitigation could be stipulated for Riak. However, even if no application workloads co-exist on the same node, all Riak nodes will have operational software (e.g. monitoring, security, backups etc.). That operational software may have a limited throughput when working correctly, but may also have error conditions where it works in unexpected ways. Overall, the potential side effects of scheduler binding seem out of step with the primary aim of Riak to be reliable in exceptional circumstances (as a priority over reduced latency in the median case).
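For reference, scheduler binding is enabled with the +sbt emulator flag (for example `+sbt db` for the default bind strategy), and whether binding is actually in effect can be confirmed at runtime. A minimal sketch:

```erlang
%% Minimal sketch: confirm at runtime whether schedulers are bound and,
%% if so, how.  Binding itself is enabled with the +sbt emulator flag
%% (for example "+sbt db"); with no binding requested, the bind type is
%% typically 'unbound'.
-module(bind_check).
-export([report/0]).

report() ->
    #{bind_type => erlang:system_info(scheduler_bind_type),
      bindings  => erlang:system_info(scheduler_bindings),
      topology  => erlang:system_info(cpu_topology)}.
```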
Reducing scheduler counts appears to be a safer way to improve performance, but without a full understanding of why it is improving performance it doesn't seem correct to make this a default setting. This is especially true as what might be an improvement with a fully beam-based backend like leveled may not be true with NIF-based backends like eleveldb/bitcask (where dirty scheduler improvements have not been made).
One interesting aside: if the throughput improvements for scheduler binding and scheduler reductions are not tightly correlated, what would happen if the two changes were combined? More throughput improvements? See below for the log-scale throughput chart, this time with a green line showing the combination of the two changes:

In the initial part of the test there is an improvement over and above what can be achieved through either setting. However, towards the end of the test the combined setting has lower throughput than either individual setting. At the 250M update mark, the throughput improvement is just 4.7% more than the default. This is hard to explain, but it is noticeable that the I/O and page-cache-related improvements appear to reverse when reduced schedulers are combined with bound schedulers. The throughput towards the end of the test is much more important than at the beginning - most Riak clusters spend most of their time processing requests with a large set of data already present. So combining the two changes does not seem to be as good as making either one of the changes.
One thing of potential interest to emerge in this thread is the … The reasons described by @seancribbs for changing this are related specifically to containerised environments, but perhaps it is worth adding this to the … It was noticeable during testing of the combined binding/reduced setup that the "Other" count was the dominant form of busyness on the standard schedulers:
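The per-scheduler busyness breakdown referred to above is the kind of output produced by OTP's microstate accounting. A minimal sketch, assuming the runtime_tools application is available on the node:

```erlang
%% Minimal sketch: capture a per-thread breakdown of busyness using OTP
%% microstate accounting (runtime_tools).  msacc:start/1 collects for the
%% given number of milliseconds; msacc:print/0 then shows time per thread
%% (scheduler, dirty_cpu_scheduler, dirty_io_scheduler, ...) and per
%% state, including the "other" state discussed above.
-module(sched_busyness).
-export([capture/0]).

capture() ->
    msacc:start(10000),   %% collect for 10 seconds, then stop
    msacc:print().
```

The printed output breaks time down per thread type and per state, with "other" covering time not attributed to the more specific states.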
A further note on the combination of reduced scheduler counts and binding. When this combined setting is in place, the CPU-bound activity gets faster than with just the normal number of schedulers bound. However, any improvement related to interaction with the page cache is entirely negated. Hence we see in the curve that throughput is improved while the cluster is primarily CPU bound, but worse when the cluster is more disk bound.
The tests have been re-run with some subtle changes to the load, and over a longer test (48 hours). Towards the end of this test, disk busyness is almost totally dominant in determining throughput. There are three variants from defaults tested:
Looking at the results with different VM/scheduler settings, improvements in throughput in the mid-part of the test are seen with all variants:

What we can see is that all variants show improvements in response times, especially GET:

What is noticeable is the difference in CPU utilisation between the settings. The test is run on a cluster of 12-core (24 vCPU) servers. Disabling the busy wait gives throughput improvements whilst running 3.4 fewer vCPUs on average compared to VM defaults. This is a significant improvement in efficiency:
Note RabbitMQ has gained improvement through scheduler binding - rabbitmq/rabbitmq-server#612. However, what is the impact of this change if there are other processes running on the same machine? (At the NHS there was an incident related to scheduler binding on RabbitMQ, although in this case RabbitMQ was not being run on a dedicated machine in line with Rabbit guidance.) Riak guidance is also to run Riak on dedicated machines, but even if the machine is "dedicated" there will still be other processes present for operational reasons (e.g. monitoring), and it cannot be guaranteed that such processes will always behave as expected. So I'm wary of scheduler binding as a default approach. My preference would be to make disabling busy-wait the recommended change, primarily because of the CPU efficiency improvements we see. However, there exists a specific warning in the docs that this flag may not be available in future releases.
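For reference, the busy-wait behaviour referred to here is controlled by emulator flags. A minimal vm.args-style sketch of disabling it (flag names should be checked against the OTP release in use, and the erl documentation carries the warning mentioned above that +sbwt may be changed or removed without notice):

```
# Sketch only: disable scheduler busy waiting for standard, dirty CPU
# and dirty IO schedulers.  Check these flags against the erl docs for
# the OTP release in use before applying.
+sbwt none
+sbwtdcpu none
+sbwtdio none
```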
As an offshoot of #1820 ... a thread related to testing different scheduler settings