When comparing MySQL Cluster 7.3 with MySQL Cluster 7.2 using sysbench, one can overcome the scalability limitation of the NDB API in MySQL Cluster 7.2 by simply using many more NDB API connections. The limitation imposed by LOCK_open, however, cannot be overcome. For sysbench this means that we can scale to about 40 CPU threads, and beyond that there is no additional gain from having more CPUs. When running with such a high load it is actually a challenge to handle the load in the data nodes as well. What we discover is that the main bottleneck lies in the local data management threads (LDM threads). It turns out that for this particular type of thread it does not pay off to use hyperthreading, so the best results are achieved by using 16 LDM threads that do not use hyperthreading. The problem is that when we add hyperthreading we also have to increase the number of LDM threads to achieve the same performance, and with sysbench this means more partitions and therefore more work to process. So this is a rare case for MySQL where it doesn't pay off to use hyperthreading. Using this setup we reach 7096 TPS for Sysbench RW and 9371 TPS for Sysbench RO with MySQL Cluster 7.2.
When we take this configuration to MySQL Cluster 7.3 we can easily increase the performance, since the LOCK_open bottleneck is now removed. In fact, with the machine that I have access to (a standard x86/Linux box with 8 sockets and a total of 96 CPU threads on 48 cores) I can no longer get the MySQL Server to become a bottleneck for this type of standard sysbench test. For point selects I can still reach a MySQL Server bottleneck, since MDL locking still has some limitations in MySQL 5.6. As can be seen from the MySQL 5.7.2 DMR, this is also a problem that is going away, and then there will be even fewer obstacles to scaling further.
So with MySQL Cluster 7.3 we are actually limited by the hardware of a 96 CPU thread box and not by any software limitation. With this box we are able to reach 8932 TPS on Sysbench RW and 11327 TPS on Sysbench RO. At this load we're using 8 CPU threads for the benchmark, 20 cores for the data nodes and the remaining 48 CPU threads for the MySQL Server. So in order to increase throughput on this machine we simply have to make better use of the hardware resources at hand. With MySQL Cluster 7.3 we've made it possible to scale data nodes all the way to 32 LDM threads, so if we had access to a sufficiently big box we would be able to scale MySQL Cluster 7.3 performance even further. It's likely that we could scale to about 70-80 CPU threads for the MySQL Server, and this would require about 30 cores for the data nodes.
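To make the bookkeeping above easy to check, here is a minimal sketch of the CPU budget and the 7.2 to 7.3 throughput gain. The struct and function names are hypothetical and exist only for this illustration. The factor of two CPU threads per core is taken from the 96 CPU threads on 48 cores of the test machine.

```cpp
#include <cassert>

// Hypothetical names for illustration only; the figures plugged in come
// from the benchmark description in the text.
struct CpuBudget {
    int benchmark_threads;  // CPU threads running the sysbench driver
    int data_node_cores;    // full cores (both hyperthreads) for the data nodes
    int server_threads;     // CPU threads left for the MySQL Server
};

// The test box exposes 96 CPU threads on 48 cores, i.e. 2 per core.
constexpr int kThreadsPerCore = 2;

// Total CPU threads consumed by a given split of the machine.
constexpr int total_cpu_threads(const CpuBudget& b) {
    return b.benchmark_threads
         + b.data_node_cores * kThreadsPerCore
         + b.server_threads;
}

// Relative throughput gain in percent when going from `before` to `after` TPS.
double gain_percent(double before, double after) {
    assert(before > 0.0);
    return (after - before) / before * 100.0;
}
```

With the figures from the text, `total_cpu_threads(CpuBudget{8, 20, 48})` gives 96, which matches the box. `gain_percent(7096, 8932)` and `gain_percent(9371, 11327)` come out at roughly 26% for Sysbench RW and 21% for Sysbench RO. Under the assumption that the benchmark driver would still need 8 CPU threads, the projected 70-80 server threads plus 30 data node cores would call for roughly 138-148 CPU threads.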
So what we see is that we're quickly reaching a point where MySQL Cluster scalability can go beyond what most customers need, and it is thus very important to also start focusing on CPU efficiency. To understand where the load is spent we make good use of the perf tool in Linux, which can pinpoint problems in the code with branch prediction, tight loops and cache misses, and give us an idea of where software prefetching can improve code efficiency. Given that we program in C++, the compiler can often be assisted by introducing local variables, but one has to take care that those local variables are not spilled to the stack, in which case they only harm the efficiency of the program. Our initial experiments with increasing efficiency in the MySQL Cluster data nodes using this approach have been very successful.
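As an illustration of the two techniques mentioned above, here is a minimal sketch. It is not code from the MySQL Cluster data nodes. `Row`, `ScanState` and both functions are hypothetical, the prefetch distance is a guess that would have to be tuned with perf, and `__builtin_prefetch` is a GCC/Clang builtin, so it is guarded by a compiler check.

```cpp
#include <cassert>
#include <cstdint>

struct Row {
    std::uint32_t key;
    std::uint32_t value;
};

// Hypothetical scan state, for illustration only.
struct ScanState {
    const Row* const* rows = nullptr;  // pointers to rows scattered in memory
    std::uint32_t count = 0;           // number of row pointers
    std::uint32_t total = 0;           // running sum of row values
};

// Baseline: the store to out[i] has the same type as s.count and s.total,
// so the compiler must assume it may modify them and typically reloads
// them from memory on every iteration.
void scan_baseline(ScanState& s, std::uint32_t* out) {
    assert(out != nullptr || s.count == 0);
    for (std::uint32_t i = 0; i < s.count; ++i) {
        out[i] = s.rows[i]->value;
        s.total += out[i];
    }
}

// Tuned: copy the loop state into locals so it can stay in registers, and
// prefetch a row a few iterations ahead to hide the cache miss.
// Only valid if the caller guarantees that `out` does not overlap `s`.
void scan_tuned(ScanState& s, std::uint32_t* out) {
    assert(out != nullptr || s.count == 0);
    const Row* const* const rows = s.rows;
    const std::uint32_t count = s.count;
    std::uint32_t total = s.total;
    constexpr std::uint32_t kAhead = 4;  // assumed prefetch distance, tune with perf

    for (std::uint32_t i = 0; i < count; ++i) {
        if (count - i > kAhead) {
#if defined(__GNUC__) || defined(__clang__)
            // Arguments: address, 0 = read access, 3 = high temporal locality.
            __builtin_prefetch(rows[i + kAhead], 0, 3);
#endif
        }
        const std::uint32_t v = rows[i]->value;
        out[i] = v;
        total += v;
    }
    s.total = total;  // write the result back once, after the loop
}
```

Both functions produce the same output as long as `out` does not overlap the scan state. Only three locals are introduced here, since each additional local raises register pressure and with it the risk of the spilling described above. Whether the tuned variant is actually faster depends on the compiler and the hardware, so it should be verified with perf rather than assumed.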
NOTE: All the above benchmarks were done on one machine using a single MySQL Server and a single data node. Obviously MySQL Cluster is also capable of scaling in the number of nodes in addition to the scaling on one machine. In this benchmark we have however focused on how far we can scale MySQL Cluster on a single machine.