Tuesday, April 10, 2012

MySQL team increases scalability by >50% for Sysbench OLTP RO in MySQL 5.6 labs release April 2012

A MySQL team focused on performance recently met in an internal meeting to discuss and work on MySQL scalability issues. We gathered specialists on InnoDB and all aspects of its performance, including scalability and adaptive flushing. We also had participants from MySQL support, to help us understand what our customers need, and a number of generic specialists on computer performance and in particular the performance of the MySQL software.

The fruit of this meeting can be seen in the MySQL 5.6 labs release April 2012 released today. We have a very interesting new solution to the adaptive flushing problem. We also made a significant breakthrough in MySQL scalability. On one of our lab machines we were able to increase performance of the Sysbench OLTP RO test case by more than 50% by working together to find the issues and then quickly coming up with solutions to them. In one particular test case we were even able to improve MySQL performance by 6x with these scalability fixes.

In this blog I will provide some details on what we have done to improve the scalability of the MySQL Server on large servers.

MySQL has now reached a state where the solutions to scalability are no longer only a matter of protected regions and their related mutexes, read-write locks or atomic variables. MySQL scalability is also affected by the type of scalability issues normally found in high-performance computing. When developing MySQL Cluster 7.2 and its scalability enhancements we encountered the same type of problems as we discovered in MySQL 5.6, so I'll describe these types of issues here.

In a modern server there are three levels of CPUs: there are CPU threads, there are CPU cores and there are CPU sockets. A typical high-end server of today can have 4 CPU sockets, 32 CPU cores and 64 CPU threads. Different vendors name these building blocks slightly differently, but from a SW point of view it's sufficient to consider these 3 levels.

One issue with a multi-core SW architecture is the fact that all these 64 CPU threads share the same memory. In order to work with this memory they work with cachelines. Cacheline sizes can vary, but at the highest cache level it's usually 64 bytes on x86 servers. So in order to read or write a particular memory area, it's necessary to read and write entire cachelines.
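
To see where the 64-byte figure comes from on a given box, the cacheline size can be queried directly. This is a small standalone sketch assuming a Linux server with glibc; the _SC_LEVEL1_DCACHE_LINESIZE name is a glibc extension and has nothing to do with the MySQL code itself:

#include <cstdio>
#include <unistd.h>

int main() {
  /* _SC_LEVEL1_DCACHE_LINESIZE is a glibc extension reporting the L1 data
     cache line size; on typical x86 servers this prints 64. */
  long line_size = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
  if (line_size <= 0)
    line_size = 64;  /* fall back to the common x86 value */
  std::printf("cacheline size: %ld bytes\n", line_size);
  return 0;
}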

The fact that cachelines are bigger than the variables that the SW works on means that the same cacheline often ends up holding several unrelated variables. When different CPUs frequently update such unrelated variables that happen to share a cacheline, this is called false sharing. It's usually not a problem, but on servers with multiple sockets it can become an issue if the data is updated very frequently.
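
The usual way to avoid false sharing is to pad or align frequently updated variables so that each one owns a full cacheline. The following is a minimal standalone sketch of that idea in C++11; the names and the 64-byte padding constant are illustrative, not taken from the MySQL source:

#include <atomic>
#include <thread>

/* Each counter gets its own 64-byte cacheline. Without the alignas padding,
   several of these "independent" counters would typically end up on one
   cacheline, and threads updating different counters would still steal the
   cacheline from each other -- the false sharing described above. */
struct PaddedCounter {
  alignas(64) std::atomic<long> value{0};
};

static PaddedCounter counters[4];

static void worker(int idx) {
  for (long n = 0; n < 10000000; n++)
    counters[idx].value.fetch_add(1, std::memory_order_relaxed);
}

int main() {
  std::thread t[4];
  for (int i = 0; i < 4; i++) t[i] = std::thread(worker, i);
  for (int i = 0; i < 4; i++) t[i].join();
  return 0;
}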

Reading a cacheline from many CPU threads simultaneously is not a problem, since each CPU thread gets its own copy of the cacheline and thus there is no real limit to how fast we can read read-only cachelines. The problem arises when we have a cacheline which is frequently updated. Only one CPU at a time can update a cacheline, and all other CPUs that want to update it have to wait in a queue until the cacheline is ready for them. Given that this occurs on specific cachelines, it's very hard to nail down where this waiting occurs.

So in any scalable SW architecture it's important to avoid cachelines that are updated more frequently than around one million times per second. Beyond one million updates of a cacheline per second, the wait times on the cacheline can quickly become significant to application performance.
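
One common way to stay below such a limit is to split a hot global counter into per-thread slices that are only aggregated when a reader actually needs the total. The sketch below only illustrates the idea, with made-up names and a fixed thread count; it is not a description of the actual changes in MySQL 5.6:

#include <atomic>

/* Instead of one global counter that every thread updates -- a single
   cacheline absorbing all the updates -- each thread gets its own
   cacheline-aligned slice, and the slices are only summed when a reader
   needs the total. Each individual cacheline now sees only a fraction of
   the overall update rate. */
const int kMaxThreads = 64;

struct alignas(64) CounterSlice {
  std::atomic<long> value{0};
};

static CounterSlice g_slices[kMaxThreads];

void increment(int thread_id) {
  /* Hot path: touches only this thread's own cacheline. */
  g_slices[thread_id].value.fetch_add(1, std::memory_order_relaxed);
}

long read_total() {
  /* Cold path: readers pay the cost of walking all slices. */
  long total = 0;
  for (int i = 0; i < kMaxThreads; i++)
    total += g_slices[i].value.load(std::memory_order_relaxed);
  return total;
}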

One example where we found such a case was in MySQL Cluster, where we had a bitmap that was updated for each message sent over the network. In 7.1 this was never an issue because there were no cases where we sent millions of messages on a node in the cluster. However, in MySQL Cluster 7.2 we scaled up performance by more than 4x for each data node and we hit this wall. At first it was a mystery where the problem came from, but after some more analysis we discovered this bitmap update. The solution was very simple: we made sure that we only updated the bitmap in case it actually changed. This solved the issue and we could move on and reach almost 5M messages sent per second in one data node, up from previously being limited to 1.8M messages per second.
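
In code this kind of fix boils down to a read-before-write check: only issue the store when the value really changes. Here is a small sketch of that pattern with hypothetical names; it is not the actual MySQL Cluster code:

#include <atomic>

/* The bitmap word is read first, and the store -- which would pull the
   cacheline into exclusive state on this CPU -- is only issued when a bit
   really needs to flip. */
void set_bitmap_bit(std::atomic<unsigned long> &bitmap_word,
                    unsigned long bit_mask) {
  unsigned long current = bitmap_word.load(std::memory_order_relaxed);
  if ((current & bit_mask) == 0) {
    /* Rare case: the bit was not yet set, so we pay for the write once. */
    bitmap_word.fetch_or(bit_mask, std::memory_order_relaxed);
  }
  /* Common case: the bit is already set, the cacheline stays in shared
     state, and millions of "updates" per second become plain reads. */
}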

The time it takes to update a cacheline depends on the number of CPU sockets used; with multiple sockets it becomes even more important to ensure that we avoid updating a cacheline too often. So the CPU scalability issues that we've solved in MySQL 5.6 mainly affect users running the MySQL Server across multiple CPU sockets; users that ensure that the MySQL Server is only executed on one CPU socket are less affected.

There are also many other interesting tidbits in the MySQL 5.6.5 DMR and the MySQL 5.6 labs release April 2012, which you can digest in a number of other technical blogs.