Tuesday, November 19, 2013

MySQL Cluster run-time environment: Part 3: Configuration recommendations

Binding threads to CPUs in the MySQL Cluster data nodes can have great benefits. So what about hyperthreading, should we use all CPU threads, or only 1 CPU thread per CPU core? The recommendation differs actually. For the most part it is beneficial to use hyperthreading. In most thread types it gives about 40% higher performance with 2 CPUs using hyperthreading compared to 1 CPU not using hyperthreading. There are a few cases where it might be beneficial to not use hyperthreading though.

The first example is for LDM threads. Using hyperthreading means we will increase the number of partitions of a table by a factor of two. In many cases this isn't beneficial, particularly when the number of LDM threads is high. I tried with using 24 LDM threads with hyperthreading on 12 CPU cores and compared it to 12 LDM threads on 12 CPU cores. This case didn't benefit from using hyperthreading of LDM threads. However if the number of LDM threads is low it would probably still pay off, so going from 2 to 4 LDM threads is still beneficial, probably also going from 4 to 8. But going from 8 to 16 is less likely to be beneficial.

I have tested send and recv threads with and without hyperthreading, my benchmarks have always improved by using hyperthreading. It has always been better to use hyperthreading. The same conclusion I've seen with tc threads. Obviously if the main thread or the rep thread for some reason becomes the major bottleneck, then it makes sense to remove use of hyperthreading here.

Avoiding use of hyperthreading can be simply done by not configuring any threads to use the second CPU on each of the CPU cores we want to avoid hyperthreading on. As an example we will configure a machine with one data node, the machine have 4 sockets with 24 cores and 48 CPU threads. The CPUs 0-5 represent the CPU threads on socket 0 and core 0-5 and thread 0, CPUs 24-29 represents socket 0 and core 0-5 and thread 1. So if we want to configure with 12 LDM threads not using hyperthreading here we could use the config:


In this configuration the LDM threads will use CPUs 0-11 which covers all cores on socket 0 and 1. No other thread is configured to use any CPU thread in those CPU sockets. The OS might still decide to use some of them, but we have left a number of empty CPUs that the OS hopefully discovers as idle and uses those to schedule OS activities. We can actually do even better than this, there is a boot config variable in Linux whereby one can specify which CPUs the OS is allowed to use. Similarly there is a similar config variable for irqbalance to ensure interrupts are not scheduled on any CPU used by the MySQL Cluster data node. The tc, send and recv thread are scheduled on any the CPU threads on socket 2 and the main, rep, wd, and io threads are using 2 cores on socket 3. The OS and other processes aren't blocked from using other CPUs, but will most likely be scheduled on any of those free CPUs with no activity on them.

The first question that one starts in specifying the MySQL Cluster run-time environment would be to decide on how many ldm threads one should use. This is dependent on how many CPUs that are accessible to the data nodes. So assuming we have access to 24 CPU threads with hyperthreading. In this case it would be natural to start by using the number of ldm threads set to 6 and not use hyperthreading. A natural initial start is to use half of the available CPU cores for ldm threads. Next one assigns about a quarter of the ldm CPU resources to tc threads, in this case we land at 3 tc threads. Next one assigns a similar number of CPUs to send and recv threads. Then one assigns the rest of the CPUs to main, rep, io and wd threads. This should give a fair amount of resources available also to the OS.

After this initial assignment it is a good idea to run some benchmark which is close to the workload of your application to see whether the config works well. For this test run one should use cpubind for all thread types to ensure that we know how much CPU resources each thread type consumes (can be easily derived looking at top with per-CPU load mode, using cpubind). First check whether ldm threads is a bottleneck, if it is then check if it is possible to increase to the next level. In this example this would mean going to 8 ldm threads then using 2/3 of the CPU resources. If this isn't possible then just make sure that the rest of the thread types have a fair amount of CPU resources to avoid any unneeded bottlenecks.

If the bottleneck isn't the ldm thread, then assign more resources to this thread type and in most cases by removing resources from non-ldm threads. There could be cases where less than half of the CPU resources are needed by the ldm threads, but I would deem those as very unusual. Given that the actual database processing is done in the ldm threads, it would be an exceptional case if other threads consume more than half of the resources.

Always remember to update the NoOfFragmentLogParts variable if changing the number of ldm threads.

After a few trials we have most likely found a decent configuration. After finding this configuration we can also consider where to use cpuset and if any threads should use the realtime or spintime variables.

So next question is when to use cpubind, cpuset, realtime and spintime.

Both cpubind and cpuset is about locking threads to an individual CPU or a set of CPUs. We could actually consider even using no cpubind and cpuset as a form of CPU locking. We are locking the threads to the set of available CPUs in the OS. Given that we might be running on a virtual OS this might actually already be a subset of the existing subset of the existing CPUs. To make any configuration of CPUs using cpubind/cpuset one has to have knowledge of how cpu ids maps to CPU sockets and CPU cores.

So the default configuration not using any cpubind/cpuset is to allow the OS to schedule the thread onto any of the available CPUs. The OS scheduler is optimised towards an interactive environment where processes need to react to human interaction. It does also a good job of server environments where a fairly high number of threads compete for a small number of CPUs. It does also do a decent job of handling server environments where there are only a handful of threads competing for a number of CPUs.

The type of threads that the normal OS schedulers have most problems to handle are long-running threads that consume a lot of CPU resources. In particular threads that run more or less constantly. What happens is that the OS scheduler eventually downgrades their priority such that other processes are given a chance to execute, since the process still wants to execute the OS now searches for a free CPU to use. This is often successful, the problem is however that this means that we migrate the thread onto a new CPU. This happens many hundreds of times per second for those busy threads. The effect is that the thread comes to a new CPU where it has no data or instructions cached in the CPU caches. So this means that a migrated thread will spend quite a lot of time to warm up CPU caches before it can run as efficiently as it could before the migration. In addition this requires more bandwidth on the memory bus which in some cases can become a bottleneck.

In the above case it is better to stay with the same CPU even if a new job is scheduled, this is the case if we can be sure that the new process isn't one more long-running thread. So this means that in order to optimise the MySQL Cluster run-time environment we need to be in control over all usages of the CPU on the machine we're using. In many cases we colocate data nodes and MySQL Server processes. We can control placement of MySQL Server processes using numactl or taskset that can be applied when starting the process or using pid when the process is already started. This ensures that the MySQL Server process is never scheduled outside the set of CPUs we gave it access to through the taskset/numactl process. This is how I control the environment when running any MySQL Cluster benchmark. Similarly I also control the benchmark processes (sysbench/dbt2 client processes/flexAsynch...).

In this manner I am certain that no application process is using the CPUs provided for execution of the data node threads. So the only threads that will execute on the CPUs are either OS kernel threads or interrupt handling. Even this can be controlled by using a special boot option isolcpus and list the cpu numbers that the OS is allowed to use. When setting this variable only the listed cpus are using a normal scheduler handling with the possibility to migrate CPUs. Usage of the rest of the CPUs can only be invoked if the application uses a locking call such as controlled by cpuset/cpubind or taskset/numactl. The final thing to control is the execution of interrupts. This is normally handled by the irqbalance process and this can be configured to avoid a bitmap specified by the irqbalance configuration variable IRQBALANCE_BANNED_CPUS. So it is possible to create a completely compartmentalized load with interrupts on certain CPUs, OS and other applications on another set of CPUs, a set of CPUs for data node, a set of CPUs for MySQL Server and a set of CPUs for any other consuming application process. Providing this is a combination of boot options, irqbalance configuration, MySQL Cluster configuration and finally using taskset/numactl on certain processes.

So when to use cpubind and when to cpuset. cpubind should mainly be used on threads that can consume up to 100% of the CPUs. This would mainly be the ldm threads normally, but can definitely be the case also for other threads dependent on the workload and the configuration. So one way of configuring MySQL Cluster environments is to use cpubind for ldm threads and put the tc, send and recv threads in a common cpuset. The nice thing with this configuration is that we can run with more threads than really necessary, so e.g. we can run with 4 tc threads, 4 send threads and 4 recv threads and put these into a cpuset consisting of e.g. 8 CPUs. In this manner the  OS can easily handle if there is a certain thread types that requires more resources for a while.

So configuring a MySQL Cluster run-time environment is about making use of static scheduling for the most critical resources and to use the flexibility of the OS scheduling to handle the less critical resources. A good design philosophy for all environments like this is to design the run-time environment with a well-known bottleneck. For MySQL Cluster data nodes we recommend to always make the ldm threads the bottleneck. The nice thing about this is that it makes it easier to understand how to handle overload and reason around it. As an example if the ldm thread are overloaded and tc threads have available resources we can ensure that the tc threads handle sending error messages about overload without even contacting the ldm threads. This can be achieved by decreasing available resources in ldm threads through configurations at least for scan operations.

Hopefully these recommendations will help you find the optimal configuration and still a safe configuration. Default configurations will work for most installations, but to get the last 10-100% performance out of the system one might need to dive a bit deeper into the configuration of the MySQL Cluster run-time environment.

Next configuration item to consider is the realtime setting. This can now be set on each thread type. Traffic queries normally execute arriving in the recv thread, sent to the tc or the ldm threads and then the reply is sent through the send thread. main thread is mainly involved in the meta-data operation which rarely are time-critical. rep threads can have a high load, but they are not part of any critical paths except for asynchronous replication to other clusters. The io thread is only involved in time-critical operations if disk data is used in MySQL Cluster. The wd threads are different, obviously it is important that the watchdog thread gets an opportunity to execute every now and then. So if other threads are using realtime it's a good idea to use realtime also on the wd thread type. recv, send and tc threads are usually doing small jobs and thus realtime scheduling might be beneficial for those threads to cut the response time. For ldm threads that execute close to 100% of the time it's debatable whether realtime is such a good idea. The OS cannot handle realtime threads executing on realtime priority for a long time, so we have implemented protection for this in the data nodes. This ensures that we decrease the priority to normal user priority even for realtime threads if they execute for too long. It's rarely any benefits of throughput to use the realtime configuration for threads. It's mainly intended to enable less variation on response time.

Another use case for realtime is when there is a mix of data node threads and other application threads where the application threads have lower priority. In this case we ensure that the data node threads gets prioritised access to CPUs before the other application threads gets access.

The final configuration item to consider is the spintime. This means that the data node thread will execute for a bit longer before entering sleep mode. So if the recv thread or any other thread sends a new message to the thread in this spintime we decrease the wake up time. The only case where this can increase throughput is if the spinning thread is the bottleneck of the data node and the spinning doesn't steal resources from any other critical thread. The main usage of spintime would be as a tool to improve response time in an environment with plentiful of CPU resources. It is important to consider that the use of spintime will increase use of CPU resources for those threads it is set on. It only applies to the thread types ldm, tc, main, rep, send and recv threads.  One should in most cases avoid mixing spintime and realtime settings on the same thread.


Christian Ehmig said...

Hi Mikael,

I found your 3 posts to be a very good & detailed read regarding "thread management" with ndbmtd. Is there any way to inspect a data node which threads are the current bottleneck and should be tuned?

We have a write heavy 4 node cluster and experience unpredictable response times for SELECT queries querying exactly the same dataset over and over (varying from 10ms to 1.5s) - so we plan to either scale up (add cpus) or scale out (add more nodes) - this depends on how we could tune the current situation with ThreadConfig since we still use MaxNoOfExecutionThreads.


Stefan Auweiler said...

Hi Mikael,

where is the benefit of configuring 12 thraeds over a set of 12 CPUs vs. binding theses 12 thraeds to the 12 CPUs?

In your example, you are proposing this for tc, send and rcv:

Wouldn't binding make more sense here, as I'd be able to monitor, which thraed loads the system?


Additionally, we would not take the disadvantage from restarting a thraed on a new CPU, after OS might have moved it...

Thanks Stefan

Mikael Ronstrom said...

The benefits of cpubind as you notice is that you get full control of the thread where it executes and can even use tools like top to check how much load the thread uses.
cpuset has a different advantage in that it makes it possible for the OS to move the thread around a bit. This can be advantageous if the CPUs are used for other purposes as well (e.g. interrupt, other programs). It can also be helpful if we want to use more than one thread per CPU.
So in the example 18,19,42,43 are used for all IO threads (can be a few of those), for the main thread, the rep thread and the watchdog thread (also includes a few TCP connect threads).
There is a balance between giving all responsibility to the OS (no binding at all) and keeping all responsibility to yourself (using cpubind) where cpuset can be an interesting way of meeting this balance.
What is appropriate for your workload depends on the need of monitoring, other applications running on the same machine.
In many cases cpuset and cpubind gives very similar results but cpuset can make the behaviour more reliable when other applications cause CPU overload.