We had the opportunity to use a fair amount of machines to run a benchmark to see what throughput MySQL Cluster can achieve on a bit bigger clusters. The benchmark we use is a benchmark we developed for internal testing many years ago and shows very well the performance aspects of MySQL Cluster as discussed in some previous blogs of mine.
The benchmark is called flexAsynch, it's part of an internal series of benchmark we call the flex-series of benchmarks. It's first member was flexBench, this benchmark consisted of the following simple set of operations. First create the table with the set of attributes and the size of the attributes as specified by the startup options. Next step is to create a set of threads as specified by the startup options. Next step is that each thread will execute a number of transactions, the number which is configurable and each transaction can also run one or more operations as configured (one operation is either an insert of one record, update of one record, read of one record or delete of one record). The flexBench benchmark always starts by doing a set of inserts, then reading those, updating each record, reading it again and finally deleting all records. The flexBench benchmark also consisted of a verify phase such that we could also verify that the cluster actually read and updated the records as they should.
The flexAsynch benchmark was a further development of this benchmark, flexBench uses the synchronous NDB API, where each transaction is sent and executed per thread. This means that we can have as many outstanding transactions to the cluster as we have threads. flexAsynch uses the asynchronous NDB API, this API provides the possibility to define multiple transactions and send and execute those all at once. This means that we can get a tremendous parallelism in the application using this API. The manner in which MySQL Cluster is designed, it is actually no more expensive to update 10 records in 10 different transactions compared to updating 10 records in 1 transaction using this API. Jonas Oreland showed in his blog post how one API process using this API can handle 1 million operations per second.
The main limitation to how many operations can be executed per second is the processing in the data nodes of MySQL Cluster for this benchmark. Thus we wanted to see how well the cluster scales for this benchmark as we add more and more data nodes.
A data node in MySQL Cluster operates best when threads are locked to CPUs as shown in a previous blog of mine. Currently the main threads that operates in a data nodes is the thread handling local database operations (there are up to four of those threads), the thread doing the transaction synchronisation and finally the thread handling receive of messages on sockets connected to other data nodes or API nodes. Thus to achieve best operation one needs at least 6 CPUs to execute a data node. Personally I often configure 8 CPUs to allow for the other threads to perform their action without inhibiting query performance. Other threads are handling replication, file system interaction and cluster control.
To our disposal when running this benchmark we had access machines with dual Intel Xeon 5670 @2.93 GHz. This means 12 CPUs per socket. One thing to consider when running a benchmark like this is that the networking is an important part of the infrastructure. We had access to an Infiniband network here and used IP-over-Infiniband as communication media. It's most likely even better to use the Sockets Direct Protocol (SDP) but we had limited time to set things up and the bandwidth of IPoIB was quite sufficient. This made it possible to have more than one data node per machine.
In order to run flexAsynch on bigger clusters we also needed to handle multiple instances of flexAsynch running in parallel. In order to handle this I changed flexAsynch a little bit to enable one process to only create a table or only delete a table. I also made it possible to run the flexAsynch doing only inserts, only reads or only updates. To make it easier to get proper numbers I used a set of timers for read and update benchmarks. The first timer specified the warmup time, thus operations were executed but not counted since we're still in the phase where multiple APIs are starting up. The next timer specifies the actual time to execute the benchmark and finally a third timer specifies the cooldown time where again transactions are run but nor counted since not all APIs start and stop at exactly the same time. Using this manner we will get accurate numbers of read and update operations. For inserts we don't use timers and thus the insert numbers are less accurate.
The results of those benchmarks will be posted in blogs soon coming out.