Wednesday, February 15, 2012

Scalability enhancements of MySQL Cluster 7.2

For MySQL Cluster 7.2 we have worked on increasing the number of CPUs that can be used to execute the workload on the data nodes.

A MySQL Cluster 7.2 data node is functionally divided into 7 thread types.

1) The Local Data Manager threads (ldm)
2) The Transaction Coordinator threads (tc)
3) The Asynchronous Replication threads (rep)
4) The Schema Management threads (main)
5) Network receiver threads (recv)
6) Network send threads (send)
7) IO threads

In the LDM domain we have increased the maximum number of threads from 4 to 16. The LDM contains the actual data, this means that when using 16 threads the data becomes more partitioned (as usual this is all automatic in MySQL Cluster). Each LDM thread maintains its own set of data partitions, index partitions and its own REDO log. The number of LDM parts per data node isn't dynamically changeable. It's however possible to map more than one part onto each LDM thread which provides some flexibility in changing the number of LDM threads.

The TC domain contains the state of ongoing transactions. This means that every new transaction can easily be assigned to a new TC thread. This provides for easy flexibility of the number of TC threads to grow to a large number of threads. We've found that in most cases 1 TC thread per 2 LQH thread is sufficient and in many cases even 1 TC thread per 4 LQH threads. In rare cases with very high update loads it can be required to have 3-4 TC threads per 4 LQH threads. In MySQL Cluster 7.1 it was only possible to have 1 TC thread, now with MySQL Cluster 7.2 it's possible to have up to 16 TC threads. The TC domain also contains the control parts that handle the Adaptive Query Localization introduced in MySQL Cluster 7.2.

The Asynchronous Replication functionality is already in MySQL Cluster 7.1 separated into its own thread and remains so.

The Schema Management domain has been separated from the TC domain to make it necessary to scale the number of TC threads. This thread normally have very little load and we've seen no requirements on scaling this domain to more than one thread.

The Network receiver domain was bound to 1 thread on MySQL Cluster 7.1. With the increase of threads in MySQL Cluster 7.2 it is also necessary to scale up the number of receive threads. We have partitioned such that each receive thread takes care of one or more sockets to communicate with other nodes in MySQL Cluster. There is one socket per node to communicate with other nodes. In MySQL Cluster 7.2 we can now scale the number of receive threads up to 8 threads.

The Network send domain is a new thread type. In MySQL Cluster 7.1 sending was done from the other threads directly. This can provide slightly lower latency, but for highest throughput we have provided the possibility to perform send from separate send threads. There can be up to 8 send threads. It is also possible to continue without using send threads. We have also been able to improve the overload handling when using send threads.

The final thread type, the IO threads remains the same in 7.2 as in 7.1. This means that we can have multiple IO threads, either one thread per open file, or a fixed number of IO threads that handles the IO traffic. Except when using compression on disk, the IO threads have very light load.

The configuration of threads can be handled in two ways. The first is to simply set MaxNoOfExecutionThreads to an appropriate number that specifies how many threads we want to run in the data node. The second manner is to use the new ThreadConfig variable where it is possible to set both how many threads of each type to use and also to set the CPUs to bind those threads to.

We're very proud of those scalability enhancements that have made it possible to scale CPU usage per data node to more than 5x of what is possible in MySQL Cluster 7.1. In addition we've removed a number of bottlenecks making it possible to scale per data node performance by even more than 5x. Comparing our benchmark numbers for MySQL Cluster 7.2 we can see that we achieved 2.1M reads per second on 7.1 and we've achieved 17.6M reads per second, both on 8 data node set-ups using the same HW. Thus more than 8X better performance per node for MySQL Cluster 7.2 compared to 7.1.

The flexible configuration means that it is possible to optimise data node set-up to use anything from a single CPU up to a 48 CPU box. Colocating the MySQL Server and a single data node we can even make nice use of a 64 CPU box or even an 80 CPU box. As usual it is also still possible to have multiple data nodes per machine, but this is now required only for very large machines with 4 CPU sockets and more.

As an example of how we can make best use of a 24 CPU box is to use 8 ldm threads, 4 tc threads, 3 recv threads, 3 send threads and 1 rep thread for asynchronous replication. Each of those threads should be bound to a CPU. The main thread (schema management domain) and the IO threads should be possible to bind to the same CPU in most installations. Thus we have bound threads to 20 different CPUs. We should also protect these 20 CPUs from interrupts by using the IRQBALANCE_BANNED_CPUS config variable in /etc/sysconfig/irqbalance and setting it to 0xFFFFF0. Running MySQL Cluster generates a lot of interrupt processing and also OS kernel processing and it's a good idea to stay away from a number of CPUs to ensure that this activity won't conflict with the MySQL Cluster threads. When booting a Linux kernel one can even provide an option isolcpus=0-19 in grub.conf. This means that the Linux scheduler won't use these CPUs for any task. Only by using CPU affinity syscalls can a process be made to run on those CPUs. Using this together with binding MySQL Cluster threads and banning CPUs IRQ processing on these tasks should provide for a very stable performance environment of a MySQL Cluster data node.

On a 32 CPU box one upgrades the number of ldm threads to 12, increase tc threads to 6 and provide 2 more CPUs for the OS and interrupts. The number of send and receive threads should in most cases still be sufficient.

On a 40 CPU box one can go 16 ldm threads, 8 tc threads and increment send and receive threads to 4.

On a 48 CPU box it is possible to use more safety margins and use 12 tc threads instead, 2 more CPUs for the OS and interrupts and avoid using IO threads and main thread on same CPU and finally add 1 more receive thread.

You can learn more about all of the enhancements in MySQL Cluster 7.2 from our Developer Zone article.

1.05BN QPM using MySQL Cluster 7.2

We've passed another performance milestone using MySQL Cluster. In 2002 we passed the limit of 1M reads per second. Now we've passed the milestone of 1B reads per minute. We achieved 1.05BN reads per minute on an 8-node cluster using MySQL Cluster 7.2.5.

The benchmark used the latest MySQL Cluster version, available on launchpad and to be released as MySQL Cluster 7.2.5. The benchmark program is flexAsynch. Each read is a transaction consisting of a read of an entire row consisting of 25 attributes, each 4 bytes in size. flexAsynch uses the asynchronous feature of the NDB API which enables one thread to send off multiple transactions in parallel. This is handled similarly to how Node.js works with callbacks registered that reports back when a transaction is completed.

I will in a later blog discuss how an application can be designed to make optimal use of the asynchronous API to enable the application to reach these kind of performance numbers for key lookup operations.

The benchmark was executed on a set of 2-socket servers using X5670 with Infiniband interconnect and 48GB of memory per machine. There were 8 data nodes in the cluster and each machine had 1 data node placed in it. There were 10 machines running flexAsynch (1 process per machine).

We reported a similar benchmark a year ago, the results from an 8 data node set-up was 2.1M reads per second (8X improvement) and using 8 machines the set-up was 16 data nodes where we reached 4.3M reads per second (4X improvement). This dramatic improvement is possible since we have made each data node make more effective use of the available hardware threads. Previously each data node exploited about 6 CPUs with 8 threads, in this benchmark we used about 14 CPUs with 24 threads. We've also removed a number of bottlenecks and the configuration of the threads is done in such a manner as to handle many different loads well.

We have also improved the update numbers from 1.3M updates per second in an 8-node setup to now reaching 1.9M updates per second in a similar set-up.

The scripts required to run the benchmark are available on our benchmark page. We've placed the configuration file used to drive the benchmark in the dbt2-0.37.50 tarball in the examples directory.

You can learn more about all of the enhancements in MySQL Cluster 7.2 from our Developer Zone article.

Tuesday, November 01, 2011

MySQL Thread Pool: Storage Engines

I got a question from the NDB folks that are currently adapting MySQL Cluster to
MySQL 5.5 about whether any special developments are needed to adapt the NDB
storage engine for use with the thread pool. Then I realised there are more
people out there that write storage engines that want to know how to optimise
their storage engines for the thread pool.

So first of all any storage engine will work with the thread pool as they are today
without any modifications. It is however possible to improve the performance of the
MySQL Server when using the thread pool by adapting the storage engine to the
thread pool APIs.

The new API that has been added to the MySQL 5.5 server is the thd_wait interface.
This interface makes it possible for storage engines to report to a thread pool
plugin before starting a wait and after finishing a wait.

As an example, we have adapted the InnoDB storage engine by adding the thd_wait
interface calls around row locks in InnoDB and before file IO due to misses in
the InnoDB buffer pool. The InnoDB code have also been changed to make those
callbacks as part of the implementation of the --innodb-thread-concurrency and
when waiting for flushes of the buffer pool as part of checkpoints and other
activities where writes are required to ensure proper operation of InnoDB.

The NDB storage engine has very different reasons for the waits, the NDB storage engine
implements the actual data management in the NDB data nodes (these nodes runs in
separate processes separate from the MySQL Server), thus the only reason for waits
in the MySQL Server is when we're waiting for packets to return from the NDB data nodes.

Most third-party storage engines probably fit fairly well with InnoDB and/or NDB in how
they are integrated with the thread pool plugin. So there are storage engines that
perform all the work inside the MySQL Server. The more advanced such engines are likely
to also have a buffer pool and thus should consider calling the thd_wait interface
when doing IO, these storage engines are also likely to acquire row locks or some
similar level of data lock that sometimes will require an extended wait. There are
also other storage engines that are distributed in nature such as NDB, these
storage engines will want to make the callbacks to the new thread pool API when
waiting for responses on the network.

For storage engines that implement some data structure similar to the THD object in the
MySQL Server, there is one additional thing to consider. When using a thread pool it
makes sense to consider pooling such objects given that the thread pool
will pool threads. As an example, we have such an object called Ndb in the NDB API that
has the potential to be pooled. The benefits of pooling such objects are that it
means less time to create them, less memory usage and thus fewer CPU cache misses
due to their usage.

The thd_wait interface is really simple. It contains two calls thd_wait_begin and
thd_wait_end. Both calls have the THD object as the first parameter. Often the THD
object isn't known in the storage engine code when needed. In this case one simply
uses NULL as the THD object. The thd_wait interface can even handle the case where
the thd_wait interface is used from threads that are private to the storage engine.
The thread pool will discover that there is no THD object attached to the thread
and ignore the call.

The thd_wait_begin call also have a second parameter that specifies the type of
wait that will show up in the thread pool information schema tables. There
will be statistics on waits per type. There are currently 10 wait types.

To see an example of usage of this interface, search for thd_wait in the InnoDB
storage engine source code in the MySQL 5.5 community server.

The MyISAM storage engine does not use this API because MyISAM relies on the
MySQL Server for locking. Also, MyISAM assumes that the OS takes care of caching
of pages. This means that there is a very high probability that writes to the
file system are handled directly in the file system cache without involving any
long waits.

What is effect of not modifying a storage engine to implement the thd_wait
interface? The thread pool operates by trying to always have one thread active
per thread group. If the active thread is blocked and the thread pool is informed
of the block, then the thread pool can start another thread to ensure that the
thread group is being efficiently used. If the storage engine is not modified to
implement the thd_wait interface, the thread pool is not informed of the block.
In this case, the thread group will be blocked for a while until the wait is
completed or until the query is defined as stalled. The throughput of the system
can to some extent be handled in those cases by increasing the number of thread
groups.

So implementing the thd_wait interface means better throughput and also less
variance of the throughput and waiting times.

To use these interfaces in a file, include two header files (the thd_wait interface is
part of the plugin APIs in the MySQL 5.5 community and commercial servers).

#include "mysql/plugin.h"
#include "mysql/service_thd_wait.h"

Below is the most important information in these header files.

typedef enum _thd_wait_type_e {
THD_WAIT_SLEEP= 1,
THD_WAIT_DISKIO= 2,
THD_WAIT_ROW_LOCK= 3,
THD_WAIT_GLOBAL_LOCK= 4,
THD_WAIT_META_DATA_LOCK= 5,
THD_WAIT_TABLE_LOCK= 6,
THD_WAIT_USER_LOCK= 7,
THD_WAIT_BINLOG= 8,
THD_WAIT_GROUP_COMMIT= 9,
THD_WAIT_SYNC= 10,
THD_WAIT_LAST= 11
} thd_wait_type;
void thd_wait_begin(MYSQL_THD thd, int wait_type);
void thd_wait_end(MYSQL_THD thd);

THD_WAIT_SLEEP: For uninterrupted sleeps.
THD_WAIT_DISKIO: For file IO operations that are very likely to cause an actual
disk read.
THD_WAIT_ROW_LOCK: For row locks/page locks in the storage engine.
THD_WAIT_GLOBAL_LOCK: For global locks such as the global read lock in the MySQL
Server.
THD_WAIT_TABLE_LOCK: When waiting for a table lock.
THD_WAIT_META_DATA_LOCK: For waiting on a meta data lock which isn't a table lock.
THD_WAIT_USER_LOCK: For some type of special lock.
THD_WAIT_BINLOG: When waiting for the replication binlog.
THD_WAIT_SYNC: When waiting for an fsync operation.

It's quite likely we will introduce more wait types, such as the wait on the network.

Thursday, October 27, 2011

MySQL Thread Pool: Summary

A number of blogs have been written with the intent of describing how
the thread pool manages to solve the requirements of the thread pool.

These blogs are:
MySQL Thread Pool: Problem Definition
MySQL Thread Pool: Scalability Solution
MySQL Thread Pool: Limiting number of concurrent statement executions
Automated benchmark tool for DBT2, Sysbench and flexAsynch
MySQL Thread Pool: Limiting number of concurrent transactions
MySQL Thread Pool: When to use?
MySQL Thread Pool vs. Connection Pool
MySQL Thread Pool: Optimal configuration
MySQL Thread Pool: Benchmarking

There are some interesting discussions in the comments on the scalability solution blog
and on the blog about limiting number of concurrent statement executions
and finally also on the blog about when to use.

These discussions are around when to use it, what other features might be worth
considering and some remarks on the type of benchmarks that could be used to
evaluate solutions.

The requirements we had on the thread pool solution and the solutions were:
1) Split threads into groups individually handled to avoid making the
solution a problem in itself, aim is to manage one active thread per
group.

Solution:
Connections are put into a thread group at connect time by round robin.
Configurable number of thread groups. This ensures that the thread pool
itself isn't a scalability hog.

2) Wait for execution of a query until the MySQL Server has sufficient
CPU and memory resources to execute it.

Solution:
Each thread group tries to keep the number of executing queries to one or
zero. If a query is already executing in the thread group, put connection
in wait queue.

3) Prioritize queries on connections that have an ongoing transaction.

Solution:
Put waiting connections in high priority queue when a transaction is
already started on the connection.

4) Avoid deadlocks when queries are stalled or execute for a long time.

Solution:
Allow another query to execute when the executing query in the thread
group is declared as stalled (after a configurable time).