Wednesday, February 15, 2012

Scalability enhancements of MySQL Cluster 7.2

For MySQL Cluster 7.2 we have worked on increasing the number of CPUs that can be used to execute the workload on the data nodes.

A MySQL Cluster 7.2 data node is functionally divided into 7 thread types.

1) The Local Data Manager threads (ldm)
2) The Transaction Coordinator threads (tc)
3) The Asynchronous Replication threads (rep)
4) The Schema Management threads (main)
5) Network receiver threads (recv)
6) Network send threads (send)
7) IO threads

In the LDM domain we have increased the maximum number of threads from 4 to 16. The LDM contains the actual data, so with 16 threads the data becomes more partitioned (as usual, this is all automatic in MySQL Cluster). Each LDM thread maintains its own set of data partitions, index partitions and its own REDO log. The number of LDM parts per data node isn't dynamically changeable, but it is possible to map more than one part onto each LDM thread, which provides some flexibility in changing the number of LDM threads.

The TC domain contains the state of ongoing transactions. This means that every new transaction can easily be assigned to a new TC thread, which makes it easy to scale the number of TC threads. We've found that in most cases 1 TC thread per 2 LDM threads is sufficient, and in many cases even 1 TC thread per 4 LDM threads. In rare cases with very high update loads, 3-4 TC threads per 4 LDM threads can be required. In MySQL Cluster 7.1 only 1 TC thread was possible; with MySQL Cluster 7.2 it's possible to have up to 16 TC threads. The TC domain also contains the control parts that handle the Adaptive Query Localization introduced in MySQL Cluster 7.2.

The Asynchronous Replication functionality was already separated into its own thread in MySQL Cluster 7.1 and remains so.

The Schema Management domain has been separated from the TC domain to make it possible to scale the number of TC threads. This thread normally has very little load, and we've seen no need to scale this domain beyond one thread.

The Network receiver domain was bound to 1 thread in MySQL Cluster 7.1. With the increased thread counts in MySQL Cluster 7.2 it is also necessary to scale up the number of receive threads. We have partitioned the work such that each receive thread takes care of one or more of the sockets used to communicate with other nodes in MySQL Cluster; there is one socket per node. In MySQL Cluster 7.2 we can now scale the number of receive threads up to 8.

The Network send domain is a new thread type. In MySQL Cluster 7.1 sending was done directly from the other threads. Direct send can give slightly lower latency, but for the highest throughput we now provide the possibility of sending from separate send threads. There can be up to 8 send threads, and it is also possible to continue without send threads. We have also been able to improve the overload handling when send threads are used.

The final thread type, the IO threads, remains the same in 7.2 as in 7.1. This means that we can have multiple IO threads, either one thread per open file or a fixed number of IO threads that handle the IO traffic. Except when disk-data compression is used, the IO threads have a very light load.

The threads can be configured in two ways. The first is simply to set MaxNoOfExecutionThreads to the number of threads we want the data node to run. The second is to use the new ThreadConfig variable, which makes it possible to set both how many threads of each type to use and which CPUs to bind those threads to.
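As a sketch, the two alternatives look roughly like this in config.ini (the thread counts and CPU numbers are illustrative; check the exact ThreadConfig syntax against the documentation for your 7.2 version):

```ini
[ndbd default]
# Alternative 1: let the data node distribute threads itself
MaxNoOfExecutionThreads=8

# Alternative 2: specify each thread type and its CPU binding
# (use instead of MaxNoOfExecutionThreads, not together with it)
# ThreadConfig=ldm={count=4,cpubind=1-4},tc={count=2,cpubind=5,6},recv={count=1,cpubind=7},send={count=1,cpubind=8},rep={count=1,cpubind=9},main={count=1,cpubind=10}
```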

We're very proud of these scalability enhancements, which make it possible to use more than 5x as much CPU per data node as in MySQL Cluster 7.1. In addition we've removed a number of bottlenecks, making it possible to scale per-data-node performance by even more than 5x. Comparing our benchmark numbers, we achieved 2.1M reads per second with 7.1 and 17.6M reads per second with 7.2, both on 8-data-node set-ups using the same HW. That is more than 8X better performance per node for MySQL Cluster 7.2 compared to 7.1.

The flexible configuration means that it is possible to optimise the data node set-up for anything from a single CPU up to a 48-CPU box. Colocating the MySQL Server and a single data node, we can even make good use of a 64-CPU or even an 80-CPU box. As usual, it is still possible to have multiple data nodes per machine, but this is now required only for very large machines with 4 CPU sockets or more.

As an example, a good way to use a 24-CPU box is to run 8 ldm threads, 4 tc threads, 3 recv threads, 3 send threads and 1 rep thread for asynchronous replication. Each of these threads should be bound to a CPU. The main thread (schema management domain) and the IO threads can in most installations be bound to the same CPU. Thus we have bound threads to 20 different CPUs. We should also protect these 20 CPUs from interrupts by setting the IRQBALANCE_BANNED_CPUS variable in /etc/sysconfig/irqbalance to 0xFFFFF0, which bans IRQ balancing on CPUs 4-23 and leaves CPUs 0-3 for interrupt handling. Running MySQL Cluster generates a lot of interrupt processing and OS kernel processing, and it's a good idea to keep a few CPUs free so that this activity doesn't conflict with the MySQL Cluster threads. When booting the Linux kernel one can even provide the option isolcpus=4-23 in grub.conf. This means that the Linux scheduler won't use these CPUs for any task; only by using CPU-affinity syscalls can a process be made to run on them. Using this together with binding the MySQL Cluster threads and banning IRQ processing on these CPUs should provide a very stable performance environment for a MySQL Cluster data node.
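A minimal sketch of this 24-CPU layout, assuming the 20 data-node threads are pinned to CPUs 4-23 to match the 0xFFFFF0 interrupt mask (the exact cpubind values and syntax are illustrative; verify against the ThreadConfig documentation):

```ini
[ndbd default]
# 8 ldm, 4 tc, 3 recv, 3 send, 1 rep, 1 main on CPUs 4-23;
# CPUs 0-3 are left for the OS, IO threads and interrupts
ThreadConfig=ldm={count=8,cpubind=4-11},tc={count=4,cpubind=12-15},recv={count=3,cpubind=16-18},send={count=3,cpubind=19-21},rep={count=1,cpubind=22},main={count=1,cpubind=23}
```

```ini
# /etc/sysconfig/irqbalance: keep IRQs off the 20 bound CPUs (bits 4-23)
IRQBALANCE_BANNED_CPUS=0xFFFFF0
```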

On a 32-CPU box one can increase the number of ldm threads to 12, increase the tc threads to 6 and provide 2 more CPUs for the OS and interrupts. The number of send and receive threads should in most cases still be sufficient.

On a 40-CPU box one can use 16 ldm threads and 8 tc threads, and increase the send and receive threads to 4 each.

On a 48-CPU box it is possible to add more safety margin: use 12 tc threads instead, give 2 more CPUs to the OS and interrupts, avoid placing the IO threads and the main thread on the same CPU, and add 1 more receive thread.
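Under the same assumptions, the scaled-up layouts above translate into thread counts roughly as follows (CPU bindings omitted; the counts follow the text, everything else is illustrative):

```ini
# 32 CPUs: ldm=12, tc=6,  recv=3, send=3, rep=1, main=1
# 40 CPUs: ldm=16, tc=8,  recv=4, send=4, rep=1, main=1
# 48 CPUs: ldm=16, tc=12, recv=5, send=4, rep=1, main=1
ThreadConfig=ldm={count=16},tc={count=12},recv={count=5},send={count=4},rep={count=1},main={count=1}
```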

You can learn more about all of the enhancements in MySQL Cluster 7.2 from our Developer Zone article.

1.05BN QPM using MySQL Cluster 7.2

We've passed another performance milestone using MySQL Cluster. In 2002 we passed the limit of 1M reads per second. Now we've passed the milestone of 1B reads per minute. We achieved 1.05BN reads per minute (about 17.5M reads per second) on an 8-node cluster using MySQL Cluster 7.2.5.

The benchmark used the latest MySQL Cluster version, available on launchpad and to be released as MySQL Cluster 7.2.5. The benchmark program is flexAsynch. Each read is a transaction consisting of a read of an entire row of 25 attributes, each 4 bytes in size. flexAsynch uses the asynchronous feature of the NDB API, which enables one thread to send off multiple transactions in parallel. This works similarly to Node.js: callbacks are registered that report back when a transaction has completed.

In a later blog post I will discuss how an application can be designed to make optimal use of the asynchronous API, enabling it to reach these kinds of performance numbers for key-lookup operations.

The benchmark was executed on a set of 2-socket servers with X5670 processors, an Infiniband interconnect and 48GB of memory per machine. There were 8 data nodes in the cluster, one per machine. There were 10 machines running flexAsynch (1 process per machine).

We reported a similar benchmark a year ago. Back then, an 8-data-node set-up delivered 2.1M reads per second (so the new result is an 8X improvement), and a 16-data-node set-up on 8 machines reached 4.3M reads per second (a 4X improvement). This dramatic improvement is possible because we have made each data node use the available hardware threads more effectively. Previously each data node exploited about 6 CPUs with 8 threads; in this benchmark we used about 14 CPUs with 24 threads. We've also removed a number of bottlenecks, and the threads are configured in such a manner as to handle many different loads well.

We have also improved the update numbers, from 1.3M updates per second in an 8-node set-up previously to 1.9M updates per second in a similar set-up now.

The scripts required to run the benchmark are available on our benchmark page. The configuration file used to drive the benchmark is in the examples directory of the dbt2-0.37.50 tarball.

You can learn more about all of the enhancements in MySQL Cluster 7.2 from our Developer Zone article.