Monday, September 30, 2013

MySQL 5.7 improves CPU scaling further


As shown in previous blogs and technical papers about MySQL 5.6, MySQL 5.6 improved scalability from 36 CPU threads all the way to 60 CPU threads on a machine where each socket has 6 cores and 12 CPU threads.

Now that we have released the MySQL 5.7.2 DMR it's interesting to note that we have improved scaling yet another step. I performed a number of simple test runs to see how our scalability has improved from MySQL 5.6.15 to the MySQL 5.7.2 DMR. What the numbers clearly show is that we have increased our scalability from 60 CPU threads to 72 CPU threads. Where we previously leveled off going from 48 CPU threads to 60 CPU threads, we now level off going from 60 CPU threads to 72 CPU threads, and the maximum performance is now found at 72 CPU threads compared to 60 CPU threads in MySQL 5.6.

Here we have the graph for scalability improvements of Sysbench RO.
Here is the graph for scalability improvements of Sysbench RW.
The reason is a series of improvements, both in the MySQL Server and in the InnoDB parts. One important change is the improvement of the index locking, which improves write scalability since updating indexes is now done with fewer concurrency problems. For read-only workloads, and primary key reads in particular, there have been great improvements in the area of MDL locking; in this area the throughput possible in MySQL 5.7 is almost double that of MySQL 5.6.

In my work on scalability I have started using the perf tool found in modern Linux releases. It's an awesome tool: to get a complete performance analysis I can simply start perf record and specify the CPU and PID that I want to track. I can track timed events, first-level cache misses, last-level cache misses, various forms of TLB misses, branch prediction misses and much more. This should prove useful in finding improvements for single-threaded workloads as well. I have already done such analysis and improvements of the MySQL Cluster data node code and seen some stunning results. It has taken some time to get to understand the perf tool, however.
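
As an illustration, an invocation along those lines might look as follows. The CPU number and process ID are placeholders and the events listed are the standard perf hardware events, so treat this as a sketch rather than the exact commands used here:

  # Sample mysqld (PID 12345) while it executes on CPU 12
  perf record -e cycles -C 12 -p 12345 -- sleep 30
  perf report

  # Count cache, TLB and branch prediction behaviour for the same process
  perf stat -e cycles,instructions,cache-misses,LLC-load-misses,dTLB-load-misses,branch-misses -p 12345 -- sleep 30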

The main obstacle with this tool is that the reporting sometimes points to assembler instructions that are not where the real issues are. The problem is that the reporting is quite inaccurate; the reported instruction can sometimes be several dozens of instructions away from the real instruction where the bottleneck resides. In the literature this problem is referred to as skid: when an event occurs, the reported instruction is the next instruction to restart execution from after handling the event. Since a processor can have hundreds of instructions in flight at one point in time, the skid (the number of instructions between the instruction that caused the event and the reported instruction) can be hundreds of instructions. This makes it much more difficult to use the results from the perf tool. I read some Linux discussions about this tool and it seems that doing something about this skid problem is not a priority for the Linux kernel developers. So one needs to remember that the perf output is mostly an indication of where the problem resides and no more.
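
One partial mitigation that I have seen discussed, and which I mention only as a hedged pointer since it depends on hardware support for precise sampling (PEBS on Intel processors), is to add the :p or :pp modifier to the sampled event, for example:

  perf record -e cycles:pp -C 12 -p 12345 -- sleep 30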

Improvements of scalability in MySQL Cluster 7.3

There are two major achievements in the development of MySQL Cluster 7.3 that have led to significant improvements of scalability. The first is the move to MySQL 5.6; this means that we've removed the infamous LOCK_open bottleneck that limited scalability in MySQL Cluster 7.2 (which is based on MySQL 5.5) and in all earlier versions. The other is the improved scalability of the NDB API. For sysbench we are able to process more than three times as many transactions per NDB API connection. In 7.3 it is also possible to do receive processing in separate threads that can be locked to specific CPUs, which also enables higher throughput.
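
As an illustration, the number of NDB API connections used by a single MySQL Server is controlled by the connection pool option; a my.cnf fragment might look like the one below, where the pool size of 4 is just a placeholder and not the value used in these runs. Note that each connection in the pool needs its own [mysqld]/[api] slot in the cluster configuration.

  [mysqld]
  ndbcluster
  ndb-cluster-connection-pool=4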

When testing MySQL Cluster 7.3 vs MySQL Cluster 7.2 using sysbench, one can overcome the scalability limitation in the NDB API by simply using many more NDB API connections in MySQL Cluster 7.2. However, the limitation imposed by LOCK_open cannot be overcome; for sysbench this means that we can scale to about 40 CPU threads, and beyond that there is no additional gain from having more CPUs. When running at such high load it's actually a challenge to handle the load in the data nodes as well. What we discover is that the main bottleneck lies in the local data management threads (LDM threads). It turns out that for this particular type of thread it does not pay off to use hyperthreading. The best results are achieved by using 16 LDM threads that are not using hyperthreading. The problem is that when we use hyperthreading we also have to increase the number of LDM threads to achieve the same performance; with sysbench this means more partitions and thus more work to process. So this is a rare case for MySQL where it doesn't pay off to use hyperthreading. Using this setup for MySQL Cluster we reach 7096 TPS for Sysbench RW and 9371 TPS for Sysbench RO with MySQL Cluster 7.2.
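
For reference, a data node thread layout along these lines can be expressed with the ThreadConfig parameter in config.ini. The CPU numbers below are purely illustrative; the point is that the 16 LDM threads are bound to 16 distinct cores rather than to hyperthread siblings:

  [ndbd default]
  ThreadConfig=ldm={count=16,cpubind=1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16},tc={count=4,cpubind=17,18,19,20},recv={count=2,cpubind=21,22},send={count=2,cpubind=23,24},main={cpubind=0},rep={cpubind=0},io={cpubind=0}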

So when we take this configuration to MySQL Cluster 7.3 we can easily increase the performance, since the LOCK_open bottleneck is now removed. Actually, with the machine I have access to (a standard x86/Linux box with 8 sockets and a total of 96 CPU threads on 48 cores) I can no longer get the MySQL Server to become a bottleneck for this type of standard sysbench test. For point selects I can still reach this bottleneck, since MDL locking still has some limitations in MySQL 5.6. As can be seen from the MySQL 5.7.2 DMR, this is also a problem that is going away, and then there will be even fewer obstacles to scaling further.

So with MySQL Cluster 7.3 we are actually limited by the hardware of a 96 CPU thread box and not by any software limitation. With this box we are able to reach 8932 TPS on Sysbench RW and 11327 TPS on Sysbench RO. At this load we're using 8 CPU threads for the benchmark, 20 cores for the data nodes and the remaining 48 CPU threads for the MySQL Server. So in order to increase throughput on this machine we simply have to make better use of the HW resources at hand. With MySQL Cluster 7.3 we've made it possible to scale data nodes all the way to 32 LDM threads, so if we had access to a sufficiently big box we would be able to scale MySQL Cluster 7.3 performance even further. It's likely that we are able to scale it to about 70-80 CPU threads for the MySQL Server and this would require about 30 cores for the data nodes.
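
As a sketch of how such a CPU split can be arranged (the CPU ranges are placeholders): the data node threads are bound via ThreadConfig as above, while the MySQL Server and the sysbench client can be pinned from the shell, for example:

  # Illustrative only: give mysqld 48 CPU threads of its own
  taskset -c 40-87 mysqld --defaults-file=my.cnf &
  # the sysbench client is then started under taskset -c 88-95 in the same way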

So what we see is that we're quickly reaching a point where MySQL Cluster scalability can go beyond what most customers need, and it is thus very important to also start focusing on the area of CPU efficiency. To understand where the load is spent we make good use of the perf tool in Linux, which can pinpoint problems in the code with branch prediction, tight loops and cache misses, and give us an idea of how to use software prefetching to improve code efficiency. Given that we program in C++, the compiler can often be assisted by introducing local variables, but one has to take care that those local variables are not spilled to the stack, in which case they only harm the efficiency of the program. Our initial experiments with increasing efficiency in the MySQL Cluster data nodes using this approach have been very successful.
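
To make the idea concrete, here is a minimal sketch, not actual MySQL Cluster code, of the two techniques mentioned: caching a frequently accessed value in a local variable so that the compiler can keep it in a register, and issuing a software prefetch (the GCC/Clang builtin) for data that will be touched a few iterations later. The struct and the loop are invented for illustration only.

  #include <cstddef>

  struct Row
  {
    unsigned key;
    unsigned value;
  };

  // Sum the values of a set of rows reached through an index array,
  // prefetching rows a few iterations ahead of their use.
  unsigned long sum_values(const Row* rows, const unsigned* index, size_t count)
  {
    unsigned long sum = 0;                 // local accumulator, kept in a register
    const size_t prefetch_distance = 8;    // tune for the workload at hand
    for (size_t i = 0; i < count; i++)
    {
      if (i + prefetch_distance < count)
      {
        // Hint to the CPU that this row will be read soon.
        __builtin_prefetch(&rows[index[i + prefetch_distance]], 0, 1);
      }
      sum += rows[index[i]].value;
    }
    return sum;
  }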

NOTE: All the above benchmarks were done on one machine using a single MySQL Server and a single data node. Obviously MySQL Cluster is also capable of scaling in the number of nodes in addition to the scaling on one machine. In this benchmark we have however focused on how far we can scale MySQL Cluster on a single machine.