Wednesday, October 03, 2012

Scalability improvements in MySQL 5.6

At MySQL Connect last weekend, MySQL 5.6 was released as a Release Candidate. It contains a large set of improvements over MySQL 5.5. In this blog I will focus on what has changed in the area of scalability.

There are four main areas of scalability improvement in MySQL 5.6:
1) Splitting the InnoDB kernel mutex
2) The G5 discovery
3) Splitting the LOCK_open mutex
4) Binlog group commit

In InnoDB, as in so many other products, there was an infamous kernel mutex. It was a catch-all mutex used to protect various data structures from concurrent use. For an implementer it was easy enough to grab the kernel mutex as a way to achieve safe concurrency. The problem, of course, is that this hurts scalability. The kernel mutex was a major bottleneck in many benchmarks, and even more importantly, it made it very difficult to resolve other scalability issues.

Early in the MySQL 5.6 development, an effort in the InnoDB development team led by Sunny Bains took the kernel mutex and split it into around 10 new mutexes. Effectively the kernel mutex had been used for about 10 different things; after the split, each new mutex protects only one logical entity. This naturally had a great impact on scalability in the MySQL 5.6 release, but it will also make future development easier. Since each new mutex protects only one logical entity, it is much easier to see how to improve the handling of each of those 10 new mutexes when they become scalability issues. The new mutex with the most contention is the lock mutex, which mainly protects the metadata of tables.
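To make the split concrete, here is a minimal C++ sketch of the principle. This is not the InnoDB source; apart from the lock mutex mentioned above, the mutex names and operations are made up purely for illustration.

    #include <mutex>

    // Before the split: every subsystem serialises on the same catch-all mutex.
    struct KernelMutexBefore {
      std::mutex kernel_mutex;

      void add_row_lock() { std::lock_guard<std::mutex> g(kernel_mutex); /* ... */ }
      void begin_trx()    { std::lock_guard<std::mutex> g(kernel_mutex); /* ... */ }
      void update_stats() { std::lock_guard<std::mutex> g(kernel_mutex); /* ... */ }
    };

    // After the split: one mutex per logical entity, so lock management,
    // transaction management and statistics no longer block each other.
    struct KernelMutexAfter {
      std::mutex lock_mutex;   // the lock subsystem (the most contended one, as noted above)
      std::mutex trx_mutex;    // transaction lists (illustrative name)
      std::mutex stats_mutex;  // statistics (illustrative name)

      void add_row_lock() { std::lock_guard<std::mutex> g(lock_mutex);  /* ... */ }
      void begin_trx()    { std::lock_guard<std::mutex> g(trx_mutex);   /* ... */ }
      void update_stats() { std::lock_guard<std::mutex> g(stats_mutex); /* ... */ }
    };

The point is simply that a connection updating statistics no longer has to wait for a connection taking row locks, and each of the new mutexes can be attacked separately when it shows up as a bottleneck.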

The G5 discovery refers to a name we gave a patch internally. After splitting the kernel mutex we made several experiments to see which of the remaining mutexes would make most sense to fix next. Some experimental patches were developed that entirely removed a number of mutexes (obviously not a fully working MySQL version, but sufficiently working for a simple benchmark). The puzzle we met was that whatever we removed, we had no success in improving performance. We looked at this problem using various tools without finding the root cause. Finally we used oprofile and the new Linux perf tool to see which code lines we spent time in, and we eventually discovered the issue.

The problem was that every time a row was read in InnoDB we updated a statistics counter. In the case of Sysbench we have a number of scans where each scan hits 100 rows, so for each Sysbench transaction we read around 410 rows. Modern 4-socket servers are fast, but if one tries to update the same variable from all CPUs at once, the cache line this variable resides in will bounce back and forth between the CPUs in the system. The systems we used for testing could handle between 2 and 4 million updates per second of the same variable when the updates came from CPUs residing in different sockets. So effectively this statistics counter kept the number of rows read in a MySQL Server down to 2-4 million reads per second, or, converted to Sysbench transactions, we could not deliver more than 5,000-10,000 TPS. This variable has no major effect in MySQL 5.5, since that release has other scalability issues that hide this one. But in MySQL 5.6 we solved so many scalability issues that this variable became a major bottleneck.

A simple solution would have been to just remove this line, but we developed a more proper solution that we also applied in many other similar areas where this could potentially become an issue. Finding and resolving this issue was a team effort made possible by the InnoDB team, the MySQL Server team and the MySQL performance experts. The development resources available in the Oracle MySQL organisation make it possible to continue scaling MySQL towards new heights.
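The counter problem, and one common way around it, can be sketched in a few lines of C++. This is not the actual MySQL fix, just an illustration of why a single shared counter caps throughput and how spreading the counter over several cache lines avoids the bouncing; all names here are hypothetical.

    #include <atomic>
    #include <cstddef>
    #include <cstdint>
    #include <functional>
    #include <thread>

    // Problematic pattern: one global counter touched on every row read.
    // All cores write the same cache line, so the line bounces between the
    // sockets and throughput is capped by cache-coherency traffic.
    std::atomic<std::uint64_t> rows_read_global{0};

    void count_row_read_slow() {
      rows_read_global.fetch_add(1, std::memory_order_relaxed);
    }

    // One way around it: shard the counter so that concurrent updates normally
    // hit different cache lines, and only sum the shards when the statistic is
    // actually reported.
    constexpr std::size_t kShards = 64;

    struct alignas(64) PaddedCounter {      // 64 = typical cache line size
      std::atomic<std::uint64_t> value{0};
    };

    PaddedCounter rows_read_sharded[kShards];

    void count_row_read_fast() {
      // A hash of the thread id picks a shard; a real implementation might use
      // a per-thread or per-CPU slot instead.
      std::size_t shard =
          std::hash<std::thread::id>()(std::this_thread::get_id()) % kShards;
      rows_read_sharded[shard].value.fetch_add(1, std::memory_order_relaxed);
    }

    std::uint64_t rows_read_total() {
      std::uint64_t sum = 0;
      for (const PaddedCounter &c : rows_read_sharded) sum += c.value.load();
      return sum;
    }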

Another sort of kernel mutex is the LOCK_open mutex in MySQL. This mutex was originally there to protect the metadata of a table within the MySQL Server, but it was used for many other things as well. This meant that splitting it required a number of reengineering projects before we were ready to perform the actual split. Early in the MySQL 5.6 development the ground work for the split was finalised through the work of the MySQL runtime team. The idea for the actual split came in a development meeting in Trondheim, where Dmitry Lenev and I quickly settled on protecting TABLE objects separately through an array of mutexes and keeping LOCK_open only for creating new TABLE objects and other activities around table metadata changes.
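In principle the partitioning looks something like the following C++ sketch. LOCK_open is the real mutex discussed here, but the partition count, the hashing and the helper functions are purely illustrative and not the server's actual code.

    #include <cstddef>
    #include <functional>
    #include <mutex>
    #include <string>

    constexpr std::size_t kTablePartitions = 16;   // illustrative size

    std::mutex LOCK_open;                          // kept, but far less contended
    std::mutex table_partition[kTablePartitions];  // protects cached TABLE objects

    // A hash of the table name selects which partition mutex protects its
    // cached TABLE objects.
    std::mutex &mutex_for_table(const std::string &db_and_table) {
      return table_partition[std::hash<std::string>()(db_and_table) % kTablePartitions];
    }

    void use_cached_table(const std::string &db_and_table) {
      // Ordinary statements only touch the partition mutex of their table...
      std::lock_guard<std::mutex> g(mutex_for_table(db_and_table));
      // ... work on the cached TABLE object ...
    }

    void create_new_table_object(const std::string &db_and_table) {
      // ... while creating new TABLE objects and other metadata changes still
      // go through LOCK_open.
      std::lock_guard<std::mutex> g(LOCK_open);
      (void)db_and_table;
      // ... build and register the new TABLE object ...
    }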

This idea eventually became part of the MySQL 5.6 release. Standing on the shoulders of the G5 discovery and the split of the kernel mutex, the split of the LOCK_open mutex made for a very significant jump in performance. In a number of benchmarks we were able to improve performance by 70% and more based on the LOCK_open split alone.

What we have discovered while benchmarking MySQL 5.6 is that standard Sysbench has a limiting factor: it only uses 1 table, which means that any protection of this single table becomes part of the scalability picture when running Sysbench. Since we assume that most, if not all, applications don't direct all their queries towards one table, we also started benchmarking MySQL 5.6 using 2 or more tables in Sysbench. In this case we avoid bottlenecks related to the use of only a single table.

In MySQL 5.6 we also decided to spend serious effort on improving replication scalability. We were happy with the scalability improvements in MySQL 5.5, but saw a need to move more attention to the replication area. We set out with a modest goal of scaling replication to 16 CPUs. We actually managed to scale replication on the server side as far as the MySQL Server itself, to 48 CPUs, and we managed this even with sync_binlog set to 1! So in MySQL 5.6 the binlog writing flies. The slave side also saw major scalability improvements in MySQL 5.6.

The problem with writing the binlog was that we had only one mutex protecting those writes and all their phases. Among other things, this meant that the same mutex also protected the file writes to the binlog and even the syncing of those writes to disk.

In order to solve this issue, Mats Kindahl and I experimented with a number of approaches to discover what worked best. We eventually came up with a sort of queueing system in which writes to the binlog are divided into phases. The first phase happens as part of transaction execution (as it does in MySQL 5.5), where binlog writes are gathered in a data structure owned by the connection, so there is no concurrency issue at all in this phase. The next step is to write this data into the file buffers maintained by the operating system through write system calls. The step after that is the sync of the writes to disk. The last step is the commit that happens in the storage engines. Each of those steps is handled similarly, but with minor differences.
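Roughly, the data flow looks like this (a small C++ sketch with hypothetical names; the server's own types and identifiers are different):

    #include <string>
    #include <utility>
    #include <vector>

    // Per-connection binlog cache: events are gathered here during transaction
    // execution, so this first phase needs no shared locking at all.
    struct BinlogCache {
      std::vector<std::string> events;
      void append(std::string event) { events.push_back(std::move(event)); }
    };

    // The remaining steps touch shared state and form a pipeline of stages.
    enum class BinlogStage {
      WRITE,   // write() the gathered caches into the OS file buffers of the binlog
      SYNC,    // fsync() the binlog file to disk
      COMMIT   // commit the transactions in the storage engines
    };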

Each step has two phases, a queue phase and an execution phase, protected by separate locks. Whenever a connection arrives at a step it grabs the queue mutex and places itself in the queue. If it is the first to arrive in the queue, it also tries to grab the execution mutex after releasing the queue mutex. This means that while it is waiting for the execution mutex, more connections can place their data in the queue; they will then wait for the execution owner to signal them when the work is done (this could happen after more than one step has been executed). When the execution mutex is acquired, its owner grabs the queue mutex, takes the whole queue, sets the queue to empty and releases the queue mutex. After this it executes all the tasks in the queue.
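The mechanics of one such step can be sketched in C++ as below. This is only a sketch of the description above, not the server source: the Item type and all names are invented, it shows a single step in isolation, and the real implementation is considerably more involved, for instance a leader can carry its batch through more than one step before signalling.

    #include <condition_variable>
    #include <mutex>
    #include <vector>

    // One work item queued by a connection; the leader marks it done.
    struct Item {
      int payload = 0;   // stands in for the connection's binlog data
      bool done = false;
    };

    // One step of the pipeline: a queue protected by one mutex and an
    // execution phase protected by another. The first connection to enter an
    // empty queue becomes the leader and executes the work of everyone queued
    // behind it.
    struct Step {
      std::mutex queue_mutex;
      std::mutex exec_mutex;
      std::mutex done_mutex;
      std::condition_variable done_cond;
      std::vector<Item *> queue;

      void process(Item &item) {
        bool leader;
        {
          std::lock_guard<std::mutex> q(queue_mutex);
          leader = queue.empty();
          queue.push_back(&item);
        }

        if (!leader) {
          // Followers simply wait until a leader has executed their item.
          std::unique_lock<std::mutex> d(done_mutex);
          done_cond.wait(d, [&] { return item.done; });
          return;
        }

        // The leader serialises the execution phase of this step.
        std::lock_guard<std::mutex> e(exec_mutex);

        // Take the whole queue in one go and leave it empty, so new arrivals
        // can form the next batch while this one is being executed.
        std::vector<Item *> batch;
        {
          std::lock_guard<std::mutex> q(queue_mutex);
          batch.swap(queue);
        }

        for (Item *i : batch) {
          // write()/fsync()/engine commit here, depending on which step this is.
          (void)i->payload;
        }

        // Signal the followers whose work was just done.
        {
          std::lock_guard<std::mutex> d(done_mutex);
          for (Item *i : batch) i->done = true;
        }
        done_cond.notify_all();
      }
    };

Because the leader empties the queue while holding the queue mutex, the next connection to arrive finds an empty queue and becomes the leader of the following batch.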

The final commit step can be parallelised if desired, so that multiple connections can commit simultaneously. The only limitation is when a hot backup is running at the same time.

What is interesting with this architecture is that it adapts to wherever the bottleneck is. For some workloads the bottleneck is in the commit phase, and in this case it makes sense to parallelise that part, which is possible. For others the sync phase is the bottleneck, and for yet other workloads it is the writing to the file buffers. Wherever the bottleneck resides, this architecture makes the best of the situation and scales MySQL 5.6 replication to new heights.

So with this impressive set of new improvements in MySQL 5.6, what is left to do? As usual, when one develops a new feature or performance improvement, one also finds a ton of other things one wants to improve. So the list of ideas is not empty, and there are some pretty interesting scalability improvements in line for MySQL 5.7 as well. We currently have access to machines with 96 CPU threads, and bigger machines are likely to appear; we also cooperate with other parts of Oracle where even larger systems are possible. So our aim continues to be keeping MySQL scalable on commodity servers. Naturally we have noted that many of our customers are sharding their data sets to handle even higher workloads. A long career in distributed systems has taught me that proper partitioning of data sets is extremely important for achieving scalability at the network level. But it is still extremely important to make each node in the distributed system as large as possible: it decreases maintenance activities and minimises sharding issues by minimising the number of shards. We are pursuing this effort both in the MySQL Cluster area and in the MySQL area where InnoDB is used as the storage engine.
