Monday, November 25, 2013

How to make an efficient scalable key lookup engine out of MySQL Cluster

MySQL Cluster has all the ingredients needed to build a very scalable and extremely efficient key lookup engine for the Cloud. As shown in earlier entries of this blog, we have been able to scale MySQL Cluster 7.2 to handle 72 million key lookups per second, or 4.3 billion key lookups per minute. That benchmark was actually limited by the fact that an NDB API node could not handle more than about 300k lookups per second, so with a maximum of 255 nodes we reached around 72 million lookups per second in total. In MySQL Cluster 7.3 this limitation has been removed, and we have also enabled scaling to even bigger data nodes, so it should now be possible to reach even higher numbers.

The aim of this post, however, is not to present any new benchmark results, but rather to describe how the benchmark program works and how its architecture can be used to design an efficient, scalable key lookup data store.

To obtain the best possible performance we want the data nodes to operate as efficiently as possible. This is done by ensuring that a connection to a data node sends many key lookups bundled together. Operating on individual key lookups is of course possible, but as usual it is more efficient to operate on bigger batches than on one key lookup at a time. To achieve this we use a concept we call the Executor Thread. An Executor Thread only executes key lookups aimed at one particular data node, so the number of Executor Threads is a multiple of the number of data nodes (there can be more than one thread per data node if necessary). The Executor Thread receives key lookups from an internal queue handled by the application program (in our case the flexAsynch benchmark program).

The key lookups are prepared by the Definer Threads. A Definer Thread receives a key lookup aimed at any data node and calculates the receiving data node for this key lookup (there are API calls in the NDB API to handle this). Based on this calculation the Definer Thread puts the key lookup in the queue of the proper Executor Thread.
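As an illustration, here is a minimal sketch (not taken from flexAsynch) of how a Definer Thread could determine which data node owns the primary replica of a key, using the hinted transaction start in the NDB API; the helper name node_for_key is made up for this example:

```cpp
#include <NdbApi.hpp>

// Hypothetical helper for a Definer Thread: find the data node that holds
// the primary replica of a given primary key value (error handling omitted).
static Uint32 node_for_key(Ndb* ndb,
                           const NdbDictionary::Table* tab,
                           const char* key_data, Uint32 key_len)
{
  // A transaction started with a key hint is placed on the data node
  // that stores the primary replica of the key's partition.
  NdbTransaction* trans = ndb->startTransaction(tab, key_data, key_len);
  Uint32 node_id = trans->getConnectedNodeId();
  ndb->closeTransaction(trans);
  // The Definer Thread uses node_id to pick the Executor Thread queue
  // that is dedicated to this data node.
  return node_id;
}
```

Starting a transaction just to find the node is the simplest way to illustrate the idea; the NDB API also provides Ndb::computeHash for calculating the distribution hash of a key without starting a transaction.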

The architecture in front of the Definer Threads depends on the application. In the figure provided here we show one possible architecture, where one receive thread receives a flow of messages from somewhere; to process those messages we need to interpret the packets, which could entail one or more key lookups. In the figure we assume there is one key lookup per message and that the Executor Thread can format the reply packet to the sender based on the information in the internal key lookup request.
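As a concrete illustration, such an internal key lookup request could carry roughly the following information; the struct and its field names are hypothetical and only meant to show what the Executor Thread needs in order to execute the lookup and reply to the sender:

```cpp
#include <NdbApi.hpp>
#include <atomic>

// Hypothetical internal representation of one key lookup request as it
// travels from a Definer Thread to an Executor Thread and back.
struct KeyLookupRequest
{
  // What to look up.
  const NdbDictionary::Table* table;
  char key[8];                 // primary key value (fixed-size key in this sketch)
  Uint32 key_len;

  // How to reply: enough information for the Executor Thread (or a reply
  // thread) to format a response packet to the original sender.
  int reply_socket;
  Uint64 correlation_id;

  // How to signal completion back to the Definer Thread.
  std::atomic<int>* done;      // batch counter decremented when the lookup finishes

  // Result of the lookup, filled in when the data node has executed it.
  char result[128];            // room for the record read (~100 bytes in the benchmark)
  int error_code;              // 0 on success
};
```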



So the important part of the architecture is the Executor Thread, which handles messages to one data node based on an internal data structure that defines one key lookup and how to process the response (this thread should do as little work as possible so that it can focus on communication with the data node). There should also be Definer Threads that prepare the key lookup requests and put each request in the queue of the proper Executor Thread. The Definer Threads could also do other things, and there can be few or many of them in the architecture.
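The queue between the Definer Threads and an Executor Thread can be as simple as a mutex-protected batch queue. Here is a minimal sketch in plain C++ (again not the flexAsynch code), reusing the hypothetical KeyLookupRequest from above:

```cpp
#include <condition_variable>
#include <mutex>
#include <vector>

// Hypothetical per-Executor-Thread queue: many Definer Threads push
// requests, one Executor Thread drains the whole batch at once.
class RequestQueue
{
public:
  void push(KeyLookupRequest* req)
  {
    std::lock_guard<std::mutex> guard(m_mutex);
    m_requests.push_back(req);
    m_cond.notify_one();
  }

  // Wait until at least one request is queued, then hand over the
  // whole batch so the Executor Thread can send it in one go.
  std::vector<KeyLookupRequest*> pop_batch()
  {
    std::unique_lock<std::mutex> lock(m_mutex);
    m_cond.wait(lock, [this] { return !m_requests.empty(); });
    std::vector<KeyLookupRequest*> batch;
    batch.swap(m_requests);
    return batch;
  }

private:
  std::mutex m_mutex;
  std::condition_variable m_cond;
  std::vector<KeyLookupRequest*> m_requests;
};
```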

So how does flexAsynch work? In this case we don't have any input traffic; we generate the key lookups in the Definer Threads. The Definer Thread has a very simple mode of operation. It starts by preparing a batch of key lookups aimed at any data node. For each of those key lookups it puts the request in the queue of the proper Executor Thread. After placing all requests in the queues, it waits for all of the operations to complete. Once all requests have received their responses from the Executor Threads, it simply continues with the next batch.
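In rough terms, a Definer Thread loop in a flexAsynch-style benchmark could then look as follows. This is a sketch built on the hypothetical helpers above, not the actual flexAsynch code; it simply generates the keys itself and busy-waits on a batch counter:

```cpp
#include <atomic>
#include <cstring>
#include <thread>
#include <vector>

// Hypothetical Definer Thread main loop: build a batch of key lookups,
// route each one to the Executor Thread that talks to the owning data node,
// then wait for the whole batch to complete before starting the next one.
void definer_thread(Ndb* ndb, const NdbDictionary::Table* tab,
                    RequestQueue* queues, size_t num_queues,  // one queue per Executor Thread
                    int batch_size)
{
  std::vector<KeyLookupRequest> batch(batch_size);
  Uint32 key_counter = 0;                        // trivial key generator for this sketch
  for (;;)
  {
    std::atomic<int> outstanding(batch_size);
    for (int i = 0; i < batch_size; i++)
    {
      KeyLookupRequest* req = &batch[i];
      Uint32 key = key_counter++;
      std::memcpy(req->key, &key, sizeof(key));
      req->key_len = sizeof(key);
      req->table = tab;
      req->done = &outstanding;
      // Calculate the receiving data node and queue the request on the
      // Executor Thread for that node (simplified node-to-queue mapping).
      Uint32 node_id = node_for_key(ndb, tab, req->key, req->key_len);
      queues[node_id % num_queues].push(req);
    }
    // Wait until the Executor Threads have answered every request in the batch.
    while (outstanding.load() > 0)
      std::this_thread::yield();
    // ...then continue with the next batch.
  }
}
```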

The operation of the Executor Thread is also very simple. It picks up the current set of key lookups waiting in its queue and prepares them for execution through the NDB API. It then sends off all of the operations to the data node. When the data node has executed all of the operations, the Executor Thread reports the results back to the Definer Threads, updates some benchmark statistics and then continues with the next batch of key lookups.
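An Executor Thread can use the asynchronous part of the NDB API to prepare all lookups in a batch, send them in one round trip and poll for the responses. A sketch of such a loop, again building on the hypothetical types above and assuming made-up column names PK and DATA, could look like this:

```cpp
#include <NdbApi.hpp>
#include <vector>

// Callback invoked by the NDB API when one asynchronous transaction completes.
// The "object" pointer is the KeyLookupRequest passed to executeAsynchPrepare.
static void lookup_done(int result, NdbTransaction* trans, void* object)
{
  KeyLookupRequest* req = static_cast<KeyLookupRequest*>(object);
  req->error_code = (result == 0) ? 0 : trans->getNdbError().code;
  req->done->fetch_sub(1);        // tell the Definer Thread this lookup is finished
}

// Hypothetical Executor Thread main loop: drain the queue, prepare all key
// lookups as asynchronous NDB transactions, send them to the data node in
// one go, poll for the responses and then continue with the next batch.
void executor_thread(Ndb* ndb, RequestQueue* my_queue)
{
  for (;;)
  {
    std::vector<KeyLookupRequest*> batch = my_queue->pop_batch();
    std::vector<NdbTransaction*> transactions;
    transactions.reserve(batch.size());

    for (KeyLookupRequest* req : batch)
    {
      // The hinted start places the transaction on this thread's data node.
      NdbTransaction* trans =
        ndb->startTransaction(req->table, req->key, req->key_len);
      NdbOperation* op = trans->getNdbOperation(req->table);
      op->readTuple(NdbOperation::LM_CommittedRead);
      op->equal("PK", req->key);            // hypothetical key column name
      op->getValue("DATA", req->result);    // hypothetical data column name
      trans->executeAsynchPrepare(NdbTransaction::Commit, lookup_done, req);
      transactions.push_back(trans);
    }
    // Send all prepared transactions and wait for all of them to complete.
    ndb->sendPollNdb(3000, (int)batch.size());
    for (NdbTransaction* trans : transactions)
      ndb->closeTransaction(trans);
    // Benchmark statistics would be updated here before the next batch.
  }
}
```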

So building an efficient key lookup data store is not difficult at all. To make it scale one can then add up to 48 data nodes per cluster, each capable of handling more than 5 million key lookups per second for records of around 100 bytes in size; that alone adds up to well over 200 million key lookups per second in one cluster. A cluster can contain a total of 255 nodes. Obviously it is also straightforward to run more than one cluster to scale even further.

The benchmark code can be found in storage/ndb/test/ndbapi/flexAsynch.cpp; the interesting code is in the NEW module in that file (the file also contains a lot of legacy code for older variants of the flexAsynch benchmark).
