Wednesday, September 22, 2010

How to speed up Sysbench on MySQL Cluster by 14x

The time period up to the 2010 MySQL Users conference was as usual packed with hard work. The last two conferences have been very focused on getting scalability of the MySQL Server and InnoDB improved. This year had a nasty surprise after the conference in the form of ash cloud from an icelandic volcano. When you're tired from hard work and just want to go home and get some rest then the concept of staying at a hotel room with absolutely no idea of when one could return home is no fun at all. So when I finally returned home I was happy that summer was close by, vacation days were available (swedes have a lot of vacation :)).

Now the summer is gone and I am rested up again, ready to take on new challenges and we've had some really interesting meet-ups in the MySQL team to discuss future developments. The renewed energy also is sufficient to write up some of the stories from the work I did during the summer :)

During the summer I had the opportunity to also get some work done on scalability of the MySQL Cluster product as well. Given that I once was the founder of this product it was nice to return again and check where it stands in scalability terms.

The objective was to compare MySQL Cluster to the Memory engine. The result of the exercise was almost obvious from the start. The memory engine having a table lock will have very limited scalability on any workload that contains writes. It will however have very good scalability on read-only workloads as this isn't limited by the table lock since readers don't contend each other. The Cluster engine should have good and fairly even results on read and write workloads.

Much to my surprise the early results showed a completely different story. The Memory engine gave me a performance of about 1-2 tps to start with. The early results of MySQL Cluster was also very dismaying. I covered the Memory engine in a previous blog, so in this blog I will focus on the MySQL Cluster benchmarks.

So the simple task of benchmarking as usual turned into some debugging of where the performance problems comes from.

In the first experiment I used the default configuration of the Fedora Linux OS, I also used the default set-up of the MySQL Cluster storage engine. It turned out that there are huge possibilities in adapting those defaults.

First the Fedora has a feature called cpuspeed. By default this feature is activated. The feature provides power management by scaling down the CPU frequency on an entire socket. The problem is that when you run the MySQL Server with few threads, it doesn't react to the workload and scales down frequency although there is a lot of work to do. So for the MySQL Server in general this means about half the throughput on up to 16 threads. However the impact on MySQL Cluster is even worse. The performance drops severely on all thread counts. Most likely this impact comes from the very short execution times of the NDB data nodes. It's possible that this small execution times is too short to even reach the radar of the power management tools in Linux.

So a simple sudo /etc/init.d/cpuspeed stop generated a major jump in performance of the MySQL Cluster in a simple Sysbench benchmark (all the benchmarks discussed here used 1 data node and all things running on one machine unless otherwise stated).

The next simple step was to add the MySQL Cluster configuration parameter MaxNoOfExecutionThreads to the configuration scripts and set this to the maximum which is 8. This means that one thread will handle receive on the sockets, one thread will handle transaction coordination and four threads will handle local database handling. There will also be a couple of other threads which are of no importance to a benchmark.

These two configuration together added about ~3.5x in increased performance. Off to a good start, but still performance isn't at all where I want it to be.

In NDB there is a major scalability bottleneck in the mutex protecting all socket traffic. It's the NDB API's variant of the big kernel mutex. There is however one method of decreasing the impact of this mutex by turning the MySQL Server into several client nodes from an NDB perspective. This is done by adding the --ndb-cluster-connection-pool parameter when starting the MySQL Server. We achieved the best performance when setting this to 8, in a bigger cluster it would probably make more sense to set it to 2 or 3 since this resolves most of the scalability issues without using up so much nodes in the NDB cluster.

Changing this from 1 to 8 added another ~2x in performance. So now the performance is up a decent 8x from the first experiments. No wonder I was dismayed by the early results.

However the story isn't done yet :)

MySQL Cluster has another feature whereby the various threads can be locked to CPUs. By using this feature we can achieve two things, the first is that the NDB data nodes doesn't share CPUs with the MySQL Server. This has some obvious benefits from CPU cache point of view for both node types. We can also avoid that the data node threads are moved from CPU to CPU which is greatly advantegous in busy workloads. So we locked the data nodes to 6 cores. The configuration variable we used to achieve this is LockExecuteThreadToCpu which is set to a comma separated list of CPU ids.

I also locked the MySQL Server and Sysbench to different set of CPUs using the taskset program available in Linux.

Using this locking of NDB data node threads to CPUs achieved another 80% boost in performance. So the final result gave us a decent 14x performance improvement.

So in summary things that matters greatly to performance of MySQL Cluster for Sysbench with a single data node.

1) Ensure the Linux cpuspeed isn't activated

2) Make sure to set MaxNoOfExecutionThreads to 8

3) Make sure the --ndb-cluster-configuration-pool parameter to the MySQL Server using around 8 nodes per MySQL Server

4) Lock NDB Data node threads to CPUs by using the LockExecuteThreadToCpu.

5) Lock MySQL Server and Sysbench processes to different sets of CPUs from NDB Data nodes and each other.

Doing this experiments also generated a lot of interesting ideas on how to improve things even further.

How to get Sysbench on Memory engine to perform

I had the opportunity to test the Memory engine during the summer. What I expected to be a very simple exercise in running the Sysbench benchmark turned out to be a lot more difficult than I expected.

My first experiments with the Memory engine gave very dismaying results. I got 0-2 TPS which is ridiculously bad. So I couldn't really think this was proper results, so I started searching for problems in my benchmarking environment. Eventually I started setting up a normal MySQL client session to the MySQL Server while the benchmark was running and issued some of the queries in the benchmark by hand and I was surprised to see some simple queries take seconds.

EXPLAIN came to my rescue. EXPLAIN showed that the range scans in Sysbench was in fact turned into full table scans. Now this was surprising given that the primary key index in most engines is always ordered, so a range scan should normally be translated to a simple ordered index scan.

What I found is that the Memory engine has a different default compared to the other storage engines. When we designed the NDB handler we decided that any SQL users expects to have an ordered index when they create an index on a table. Since primary key indexes are always hash-based, this meant that in NDB the default primary key index is actually two indexes, one primary hash index and one secondary ordered index on the same fields.

Not so in the Memory engine. The memory engine also uses a hash-based index by default. So when you create a new index on a Memory engine and specify no type, the index wil become a hash index. Hash indexes are not very good at range queries :) so thus the surprise to me when benchmarking the Memory engine in Sysbench.

Fixing this issue required some code changes in the sysbench test. It required the proposed fix from Mark Callaghan to add secondary indexes to sysbench. This was however not sufficient since this patch only added an index by adding KEY xid (id) and didn't specifically specify this index had to be an ordered index. So I changed this to KEY xid (id) USING BTREE and off the performance went.

The Memory engine is still more or less a single threaded engine for any write workload like Sysbench RW which limits performance very much.

However the Sysbench Readonly benchmark for the Memory engine was interesting since it used no limiting locks (concurrent readers don't contend with each other). I decided to see how much scalability was achievable for a storage engine without any concurrency issues internally.

I found that performance of the Memory engine was limited to 8% more than the InnoDB engine in the same benchmark. So my interpretation of this result is that for readonly workloads, the main limiting factor on scalability is the MySQL Server scalability issues and not the InnoDB ones.

So going forward when working on further improving the scalability of the MySQL Server parts there is a perfect benchmark that can be used to see how scalable the MySQL Server part is by using the Memory engine.

I plan on also adding a feature to sysbench making it possible to use more than one Sysbench table to see how much added scalability we get when there are several tables involved in the query mix. Sysbench is a a very syntectic benchmark in this manner that it only uses one table.

Only using one table provokes more bottlenecks than is found in most normal workloads, e.g. the LOCK_open in the MySQL Server, the meta data locks introduced in MySQL 5.5, the index mutex in InnoDB and even some unbalance to the usage of multiple buffer pools come from only using one table in the benchmark.

Saturday, September 18, 2010

Multiple Buffer Pools in MySQL 5.5

In our work to improve MySQL scalability we tested many opportunities to scale the MySQL Server and the InnoDB storage engine. The InnoDB buffer pool was often one of the hottest contention points in the server. This is very natural since every access to a data page, UNDO page, index page uses the buffer pool and even more so when those pages are read from disk and written to disk.

As usual there were two ways to split the buffer pool mutex, one way is to split it functionally into different mutexes protecting different parts. There have been experiments in particular on splitting out the buffer pool page hash table, the flush list. Other parts that have been broken out in experiments are the LRU list, the free list and other data structures internally in the buffer pool. Additionally it is as usual possible to split the buffer pool into multiple buffer pools. Interestingly one can also combine using multiple buffer pools with splitting the buffer pool mutex into smaller parts. The advantage of using multiple buffer pools is that it is very rare that it is necessary to grab multiple mutexes for the buffer pool operation which quickly becomes the case when splitting the buffer pool into multiple mutex protection areas.

After working on scalability improvements in MySQL and InnoDB I noted that all the discussion was around how to split the buffer pool mutex and no dicsussion centered around how to make multiple buffer pools out of the buffer pool. I decided to investigate how difficult it would be to make this change. I quickly realised that it needed a thorugh walk through of the code. It required a code check that required checking about 150 methods and their interaction. This sounds like a very big task, but fortunately the InnoDB code is well structured and have fairly simple dependencies between its methods. After this walk through of the buffer pool code one quickly found that there were 3 different ways of getting hold of the buffer pool, one method was to calculate it using the space id and page id. This is the normal method in most methods used in the external buffer pool interface. However there were numerous occasions where we only had access to the block or page data structure and it would be a bit useless to recalculate the hash value in every method that needed access to the buffer pool data structure. So it was decided to leave a reference to the buffer pool in every page data structure. There were also a few occasions where one needed to access all buffer pools.

The analysis proved that most of the accesses to the buffer pool was completely independent of other accesses to the buffer pool for other pages. InnoDB uses read-ahead and neighbour writes in the IO operations that are started from the buffer pool. These always operate on an extent of 64 pages. Thus it made sense to map the pages of 64 pages into one buffer pool to avoid having to operate on multiple buffer pools on every IO operation.

With these design ideas there were only a few occasions where it was necessary to operate on all buffer pools. One such operation was when the log required knowledge of the page with the oldest LSN of the buffer pool. Now this operation requires looping over all buffer pools and checking the minimum LSN of each buffer pool instance. This is a fairly rare operation so isn't a scalability issue.

The other operation with requirement to loop over all pages needed a bit more care, this operation is the background operation flushing buffer pool pages to disk. A couple of problems needs consideration here. First it is necessary to flush pages regularly from all buffer pool instances, secondly it's still important to flush neighbours. Given that many disks are fairly slow, it can be problematic to spread the load in this manner to many buffer pools. This is an important consideration when deciding how many buffer pool instances to configure.

The default number of buffer pool is one and for most small configurations with less than 8 cores it's mostly a good idea not to increase this value. If you have an installation that uses 8 cores or more one should also pay attention to the disk subsystem that is used. Given that InnoDB often writes up to 64 neighbours in each operation and that the flushing should happen each second, it makes sense to have a disk subsystem capable of having 500 IO operations per second to use 8 buffer pool instances. This can be set in the innodb_io_capacity configuration variable. One SSD drive should be capable of handling this, two fast hard drives or 3 slow ones.

In our experiments we have mostly used 8 buffer pools, more buffer pools can be useful at times. The main problem with many buffer pools is related to the IO operations. It is important to have a balanced IO load in the MySQL server.

Our analysis of using multiple buffer pool instances have shown some interesting facts. First the accesses to the buffer pools is in no way evenly spread out. This is not surprising given that e.g. the root page of an index is a very frequently accessed page. So using sysbench with only one table, there will obviously be much more accesses to certain buffer pool instances. Our experiments shows that in sysbench using 8 buffer pools, the hottest buffer pool receives about one third of all accesses. Given that sysbench is a worst case scenario for the multiple buffer pool case, this means that most applications that tend to use more tables and more indexes should have a much more even load on the buffer pools.

So how much does multiple buffer pools improve the scalability of the MySQL Server. The answer is as usual dependent on application, OS, HW and so forth. But some general ideas can be found from our experiments. In sysbench using a load which is entirely held in main memory, so the disk is only used for flushing data pages and logging, in this system the multiple buffer pools can provide up to 10% improvement of the throughput in the system. In dbStress, the benchmark Dimitri uses, we have seen all the way up to 30% improvement. The reason here is most likely that dbStress uses more tables and have avoided many other bottlenecks in the MySQL Server and thus the buffer pool was a worse bottleneck in dbStress compared to sysbench. From the code it is also easy to see that the more IO operations the buffer pool performs, the more the buffer pool mutex will be acquired and also often held for a longer time. One such example is the search for a free page on the LRU list every time a read is performed into the buffer pool from the disk.

Furthermore the use of multiple buffer pool opens up for many more improvements and also it doesn't remove the possibility to split the buffer pool mutex even more.

Another manner of displaying the importance of using multiple buffer pools is the mutex statistics on the buffer pool mutex. With one buffer pool the buffer pool had about 750k accesses per second in a sysbench test where the MySQL Server had access to 16 cores. 50% of those accesses met a mutex already held, so it's obvious that the InnoDB mutex subsystem is very well aligned with the buffer pool mutex which have very short duration which makes spinning waiting for it very fruitful. Anyways a mutex which is held 50% of the time makes the buffer pool mutex a limiting factor of the MySQL Server. Quite a few threads will often spend time in the queue waiting for the buffer pool mutex. So splitting the buffer pool into 8 instances even in sysbench means that the hottest buffer pool receives about one third of the 750k accesses so should be held about 17% of the time. Our later experiments shows that the hottest buffer pool mutexes are now held up to about 14-15% of the time. So the theory matches the real world fairly well. This means that the buffer pool is still a major factor in the MySQL Scalability equation but is now more on par with the other bottlenecks in the MySQL Server.

The development project of multiple buffer pools happened at a time when the MySQL and InnoDB teams could start working together. I was impressed by the willingness to cooperate and the competence in the InnoDB team that made it possible to introduce multiple buffer pools into MySQL 5.5. Our cooperation has continued since then and this has led to improvements in productivity on both parts. So for you as a MySQL user this spells good times going forward.

Split log_sys mutex in MySQL 5.5

One important bottleneck in the MySQL Server is the log_sys mutex in InnoDB. Experiments using mutex statistics showed that this mutex was accessed about 250k times per second and that about 75% of those accesses had to queue up to get the mutex. One particular nuisance is that while holding the log_sys mutex it is necessary to grab the buffer pool mutex to put the changed pages to the start of the flush list indicating it is now the youngest dirty page in the buffer pool (this happens as part of the mini commit functionality in InnoDB). To some extent this contention point is decreased by splitting out the buffer flush list from the buffer pool mutex.

We found a simple improvement of this particular problem. The simple solution is to introduce a new mutex, log_flush_order mutex, this mutex is acquired while still holding the log_sys mutex, as soon as it is acquired we can release the log_sys mutex. This gives us the property that the log_sys mutex is available for other operations such as starting a new log write while we still serialise the input of the dirty pages into the buffer pool flush list.

As can be easily seen this solution decrease the hold time of the log_sys mutex while not decreasing the frequency it is acquired.
In our experiments we saw that this very simple solution improved a Sysbench RW test by a few percent.