Friday, May 11, 2012

History of MySQL Cluster architecture development

With the release of MySQL Cluster 7.2.5, each data node is capable of using up to 51 threads in total. This is a remarkable feat for a product that only a few years ago was purely single-threaded. This article tries to explain what it is in the MySQL Cluster architecture that makes it scale so well to new HW.

The MySQL Cluster architecture is based on the design used in the world's best-selling telecom switch, the AXE switch, which is used in all mobile networks delivered by Ericsson; MySQL Cluster also originates from Ericsson. The architecture of the AXE switch was developed in reaction to an older switch that had quality issues caused by too many global variables and too little modularity. The new architecture was inspired by how HW is designed: in HW design, each module can only communicate with other HW modules using signals (mostly electrical). The idea of the AXE architecture was to apply the same principle to software. The basic concepts of this architecture are blocks and signals. Blocks are the software modules (similar to HW modules). Each block is self-contained and holds all the data and code it needs; there is no data shared with other blocks. So the only way to get access to data in another block is to send a signal to that block requesting the data.
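
To make the block-and-signal concept concrete, here is a minimal sketch of such a model in C++. It is an illustration of the principle only, not the actual NDB source; the names Block, Signal, Scheduler and the gsn field are invented for the example.

    #include <cstdint>
    #include <deque>
    #include <map>
    #include <vector>

    // A signal is the only way information moves between blocks.
    struct Signal {
      uint32_t senderBlock;           // block to reply to
      uint32_t gsn;                   // which kind of request this is
      std::vector<uint32_t> data;     // the signal payload
    };

    // A block owns its data privately; other blocks never touch it directly.
    class Block {
    public:
      virtual ~Block() = default;
      virtual void executeSignal(const Signal& sig) = 0;  // all work happens here
      std::deque<Signal> inbox;       // signals waiting to be executed
    };

    // The scheduler delivers signals and drives execution.
    class Scheduler {
      std::map<uint32_t, Block*> blocks;   // block id -> block instance
    public:
      void registerBlock(uint32_t id, Block* b) { blocks[id] = b; }
      void sendSignal(uint32_t destBlock, const Signal& sig) {
        blocks.at(destBlock)->inbox.push_back(sig);   // asynchronous delivery
      }
      void runOnce() {                // execute one pending signal per block
        for (auto& kv : blocks) {
          if (!kv.second->inbox.empty()) {
            Signal s = kv.second->inbox.front();
            kv.second->inbox.pop_front();
            kv.second->executeSignal(s);
          }
        }
      }
    };

Because a block can only be reached through its inbox, it does not matter whether the sender runs in the same thread, in another thread or on another node; only the delivery mechanism changes.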

When NDB Cluster, the storage engine of MySQL Cluster, was designed, the AXE architecture was used, so the NDB DBMS software contains a set of blocks (currently a little more than 20) that communicate with each other using signals.

This is the first reason why it is so easy to make MySQL Cluster scale to many CPUs. Since blocks only communicate with each other through signals, it is trivial to move blocks between threads, unless they use immediate signals that depend on returning to the sender block in the same state as when the signal was sent. NDB has a number of blocks that cannot be split apart in this way: LQH (Local Query Handler, which handles the communication between data storage and indexes and handles scans), TUP (data storage), ACC (hash index), TUX (ordered index) and some parts of TSMAN, PGMAN and LGMAN (tablespace, page and log management for disk data). There is however no such dependency between TC and the Local Data Manager (LDM) blocks. The handling of asynchronous replication also lives in a separate block that can easily be moved to any thread, and a number of schema management blocks have been separated into their own threads using some special locking techniques.

Thus, already from the functional division, we get a separation into the LDM domain, the TC domain, the SPJ domain, the schema management domain and the asynchronous replication domain. Currently TC and SPJ are colocated, although that is not really necessary, and TC and SPJ are extremely easy to scale since each transaction is independent of every other. This gives us four thread domains, each of which can be placed into its own set of threads.
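
The functional division can be pictured as a simple mapping from block to thread domain, as in the sketch below. The grouping follows the description above; ASYNC_REP is a placeholder name since the replication block is not named here, and this is an illustration rather than real configuration code.

    #include <string>

    enum class ThreadDomain { LDM, TC_AND_SPJ, SCHEMA_MGMT, ASYNC_REPLICATION };

    ThreadDomain domainOf(const std::string& block) {
      // Blocks that must stay together in the LDM domain.
      if (block == "LQH" || block == "TUP" || block == "ACC" || block == "TUX" ||
          block == "TSMAN" || block == "PGMAN" || block == "LGMAN")
        return ThreadDomain::LDM;
      // TC and SPJ are colocated today, although they could be separated.
      if (block == "TC" || block == "SPJ")
        return ThreadDomain::TC_AND_SPJ;
      // Placeholder for the block handling asynchronous replication.
      if (block == "ASYNC_REP")
        return ThreadDomain::ASYNC_REPLICATION;
      // Everything else is treated as schema management in this sketch.
      return ThreadDomain::SCHEMA_MGMT;
    }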

In MySQL Cluster 6.3 and earlier versions everything was done in one single thread, and this configuration is still supported since it has nice real-time characteristics when everything is placed in one thread. In MySQL Cluster 7.0 the LDM blocks were separated from the rest of the blocks into their own threads. In addition, the schema management blocks together with TC and SPJ were separated into one domain, and asynchronous replication was separated into its own thread domain. Finally, given that all of these threads require communication with the outside world, we also created a separate receive thread that receives data on the TCP sockets, converts the data into prepackaged signals that the blocks can execute, and puts them onto a queue to the thread that will execute them. We spent considerable effort on making the communication between threads extremely efficient: it is entirely lock-free and uses memory barriers and careful ordering of reads and writes. On x86 machines this part is partly implemented in assembler to make efficient use of the optimal instructions.
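
The kind of lock-free, single-producer/single-consumer queue that can pass signals between two threads can be sketched as below, relying only on atomic loads and stores with acquire/release barriers. This is a simplified illustration of the technique, not the actual ndbmtd code; the names and the fixed-size signal payload are assumptions for the example.

    #include <atomic>
    #include <cstddef>
    #include <cstdint>

    struct Signal { uint32_t theData[25]; };   // prepacked payload, size arbitrary here

    template <size_t N>                  // N must be a power of two
    class SpscSignalQueue {
      Signal buf[N];
      std::atomic<size_t> head{0};       // written only by the consumer
      std::atomic<size_t> tail{0};       // written only by the producer
    public:
      bool push(const Signal& s) {       // called by the sending thread
        size_t t = tail.load(std::memory_order_relaxed);
        size_t h = head.load(std::memory_order_acquire);
        if (t - h == N) return false;    // queue is full
        buf[t % N] = s;
        // Release barrier: the payload is visible before the new tail value.
        tail.store(t + 1, std::memory_order_release);
        return true;
      }
      bool pop(Signal& out) {            // called by the executing thread
        size_t h = head.load(std::memory_order_relaxed);
        size_t t = tail.load(std::memory_order_acquire);
        if (h == t) return false;        // queue is empty
        out = buf[h % N];
        // Release barrier: the slot is reusable only after the copy is done.
        head.store(h + 1, std::memory_order_release);
        return true;
      }
    };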

The LDM domain uses yet one more technique to distribute its blocks onto several threads. It makes use of the partitioning that NDB Cluster already supports; the REDO log in NDB Cluster was also already separated into 4 log parts, so it was straightforward to create 4 LDM instances where each block is replicated in a different thread. Each LDM instance takes care of one part of the data stored in the data node. From the outside the protocol is still the same, although the address of a block now also contains a thread id in addition to the node id and block id. The block reference already contained enough space for this purpose, so no change to the block code was required.
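
As a rough illustration of this addressing, the sketch below packs a node id, block id and instance (thread) id into a single 32-bit block reference, and picks the LDM instance owning a partition with a simple modulo. The field widths and names are assumptions made for the example; the real layout in NDB differs in its details.

    #include <cstdint>

    using BlockReference = uint32_t;

    // Hypothetical field widths for the sketch.
    constexpr uint32_t BLOCK_BITS = 10;
    constexpr uint32_t INST_BITS  = 6;

    BlockReference makeRef(uint32_t nodeId, uint32_t blockId, uint32_t instanceId) {
      return (nodeId << (BLOCK_BITS + INST_BITS)) | (blockId << INST_BITS) | instanceId;
    }

    uint32_t nodeOf(BlockReference ref)     { return ref >> (BLOCK_BITS + INST_BITS); }
    uint32_t blockOf(BlockReference ref)    { return (ref >> INST_BITS) & ((1u << BLOCK_BITS) - 1); }
    uint32_t instanceOf(BlockReference ref) { return ref & ((1u << INST_BITS) - 1); }

    // Each LDM instance owns a subset of the table partitions (fragments);
    // a simple way to pick the owning instance is a modulo on the fragment id.
    uint32_t ldmInstanceFor(uint32_t fragmentId, uint32_t numLdmInstances) {
      return fragmentId % numLdmInstances;
    }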

In MySQL Cluster 7.2 we have advanced the threading yet another step. The first step was to separate the schema management domain from the TC and SPJ domain. Then, given that TC can be partitioned to handle different transactions using a simple round-robin allocation scheme, we also split the TC domain into multiple threads. In principle there is no limit to how many TC threads we could use; we pick a number appropriate to the CPU requirements of the TC domain. We selected 16 as the maximum since we have found no scenario where more TC threads are required. We also extended the number of LDM instances to a maximum of 16, by extending the possible number of log parts to 16.
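
The round-robin scheme is simple enough to show in a few lines. The sketch below, with invented names, spreads new transactions over the available TC instances using an atomic counter; it illustrates the idea rather than the actual code.

    #include <atomic>
    #include <cstdint>

    class TcThreadSelector {
      std::atomic<uint64_t> counter{0};
      uint32_t numTcThreads;
    public:
      explicit TcThreadSelector(uint32_t n) : numTcThreads(n) {}
      // Each new transaction is handed to the next TC instance in turn;
      // this works because transactions are independent of each other.
      uint32_t nextTcThread() {
        return static_cast<uint32_t>(counter.fetch_add(1, std::memory_order_relaxed) % numTcThreads);
      }
    };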

The next step was to create multiple receive threads. This was relatively easy as well, done by partitioning the sockets among the threads (one socket is used to communicate with each other node in the cluster).
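
A sketch of that partitioning, with invented names: since one socket corresponds to one peer node, a static assignment of nodes to receive threads can be as simple as a modulo on the node id.

    #include <cstdint>

    // Each peer node's socket is always handled by the same receive thread.
    uint32_t recvThreadForNode(uint32_t peerNodeId, uint32_t numRecvThreads) {
      return peerNodeId % numRecvThreads;
    }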

Another step we took was to separate the send calls from the block threads. Special send threads can thus be used; this increases latency slightly but improves throughput, and it also minimizes the risk of bottlenecks occurring in the data nodes. We set the maximum number of send threads to 8, and similarly for receive threads. Normally 3 threads should suffice even for the biggest configurations, to some extent depending on the efficiency of the OS read and write syscalls.
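
The sketch below illustrates the idea of offloading the send calls: block threads only mark a node as having pending data, and a dedicated send thread performs the actual socket writes. For brevity it uses a plain mutex and condition variable, so it is an illustration of the design choice rather than the real implementation, and all names are invented for the example.

    #include <condition_variable>
    #include <cstdint>
    #include <mutex>
    #include <queue>

    class SendThread {
      std::mutex m;
      std::condition_variable cv;
      std::queue<uint32_t> pendingNodes;   // nodes with data waiting to be sent
    public:
      // Called by block threads after appending data to a node's send buffer.
      void notifyPending(uint32_t nodeId) {
        { std::lock_guard<std::mutex> g(m); pendingNodes.push(nodeId); }
        cv.notify_one();
      }
      // The send thread loop: pick a node and flush its buffer to the socket.
      void run() {
        for (;;) {
          std::unique_lock<std::mutex> lk(m);
          cv.wait(lk, [this] { return !pendingNodes.empty(); });
          uint32_t nodeId = pendingNodes.front();
          pendingNodes.pop();
          lk.unlock();
          flushToSocket(nodeId);           // hypothetical helper doing the write()
        }
      }
    private:
      void flushToSocket(uint32_t /*nodeId*/) { /* write() the node's send buffer */ }
    };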

Thus what we can see is that the choice of a block and signal architecture has made it possible for us, with very small means, to bring data node scalability from 1 CPU, onwards to 8, and now onwards to at least 40, and even more can be achieved. MySQL Cluster is thus based on a SW architecture which can easily accommodate itself to new HW as it is developed. It has also had the concept of efficient communication in its genes from the first codeline. The first prototype of NDB Cluster, developed in 1994, used two SPARCstations interconnected with Dolphin SCI technology. The current benchmarks we've executed used InfiniBand 56Gb/s interconnects where the latency of communication between different machines is extremely short.

The next generation of HW is likely to revolutionize handling of memory. MySQL Cluster is well prepared for this as well.

Last, a small historical anecdote. In the 1990s the HW vendors had a race to deliver GHz CPUs. The x86 CPU sold in early 1990 was the 80486, which ran at 25MHz. In 1999 the race was over when the first GHz processor was delivered. We released MySQL Cluster 7.2 two months ago running at 17.6MHz, that is, 17.6 million queries per second. We're soon announcing the next major improvement to this number. So will the 2010s be the decade of a race over which DBMS can first deliver GHz query execution? MySQL Cluster is well prepared for such a challenge.

4 comments:

Mohammad Salman said...

Hi Mikael,

Thanks for the great job. Can you please tell me:

1. Is there still a limitation of a maximum of 64 nodes?

2. Is the data size on the node machine limited to its RAM size?

Thanks,

-Mohammad Salman

Mikael Ronstrom said...

There is still a limitation of a maximum of 48 data nodes and a total of 255 nodes (API nodes, data nodes and management server nodes).

The data size can be bigger than RAM size by defining non-indexed attributes to reside on disk.

Mohammad Salman said...

Any plans to remove the limitations on the number of nodes? What is the reason for these limitations?

How do I go about "defining non-indexed attributes to reside on disk"? Is there a URL that I can check?

Thanks.

Mikael Ronstrom said...

There are various reasons for the node limits; like everything else, they can be changed when there is enough demand for it.

To define a non-indexed attribute as residing on disk, use STORAGE DISK when defining the attribute in the CREATE TABLE statement; for more details, read the documentation.