Tuesday, April 12, 2011

MySQL Cluster: Designed for high-scale SQL and NoSQL web applications

As shown in a number of blogs, the MySQL Cluster software already provides the type of features found in many NoSQL products. It has an extremely efficient API through which it is possible to drive millions of reads and writes per second against a Cluster. It partitions its data in a way similar to shards in the NoSQL world, and it supports both high availability of those partitions and repartitioning of the data when new nodes are added to the Cluster. Advanced replication solutions, providing replication both inside a Cluster and between Clusters, make it possible to use MySQL Cluster in a very large number of replication configurations, even scaling across multiple global data centers.

Finally, MySQL Cluster makes it possible to keep your data in relational tables while still using NoSQL-like APIs, with support for on-line changes of partitioning and for adding new fields to tables while reads and writes continue against those tables. With MySQL Cluster you can use the MySQL APIs, the NDB API, Cluster/J, JPA and the LDAP API, and even more APIs are in the works and will soon be announced.

Most web data is accessed heavily to generate web pages, where the access pattern is mostly simple queries, but very many of them. Most web data also needs to be analysed in order to make intelligent business decisions based on it. A prototype of parallel query for MySQL Cluster was demonstrated at the MySQL Users Conference 2010, and tools such as this will make it possible to analyse data in MySQL Cluster efficiently as well. Thus MySQL Cluster is a very efficient tool for working with many sorts of web data while retaining ACID compliance and a rich set of tools, expertise and best practices.

MySQL Cluster running 2.46M updates per second!

In a previous blog post we showed how MySQL Cluster achieved 6.82M reads per second, which is a high number. It is equally interesting, however, to see how efficient MySQL Cluster is at executing updating transactions. Here we were able to push through the 1M transactions per second wall, past 2M, and all the way up to 2.46M transactions per second.

Monday, April 11, 2011

MySQL Cluster doing 6.82M reads per second

We ran a number of tests to see how many reads per second we could get from MySQL Cluster. We used a modified version of flexAsynch (as described in a previous blog), where each record read was 100 bytes in size.

With a cluster of 4 data nodes running on 2 machines we were able to process 1.15M reads per second. With a cluster of 8 data nodes on 4 machines we processed 2.13M reads per second. With 16 data nodes on 8 machines we reached 4.33M reads per second, and finally, with a cluster of 32 data nodes distributed over 16 machines, we executed 6.82M reads per second. The tests were run on MySQL Cluster 7.1, and we're confident that similar numbers can be achieved with MySQL Cluster 7.0 as well as with the new beta version, MySQL Cluster 7.2.

This benchmark gives you a good idea of what can be achieved with direct use of the NDB API, as well as with other APIs built on top of it such as Cluster/J, mod-ndb and NDB-memcached.

MySQL Cluster Benchmark

We had the opportunity to use a fair number of machines to run a benchmark and see what throughput MySQL Cluster can achieve on somewhat bigger clusters. The benchmark we used was developed for internal testing many years ago, and it shows very well the performance aspects of MySQL Cluster discussed in some previous blogs of mine.

The benchmark is called flexAsynch. It is part of an internal series of benchmarks we call the flex series. Its first member was flexBench, which consists of the following simple set of operations. First, create a table with the number and size of attributes specified by the startup options. Next, create a set of threads, also as specified by the startup options. Each thread then executes a configurable number of transactions, and each transaction in turn runs one or more operations as configured (one operation is either an insert of one record, an update of one record, a read of one record or a delete of one record). flexBench always starts by doing a set of inserts, then reads those records, updates each record, reads them again and finally deletes all records. flexBench also includes a verify phase, so that we could check that the cluster actually read and updated the records as it should.

The flexAsynch benchmark is a further development of flexBench. flexBench uses the synchronous NDB API, where each transaction is sent and executed per thread; this means we can have as many outstanding transactions towards the cluster as we have threads. flexAsynch instead uses the asynchronous NDB API, which makes it possible to define multiple transactions and then send and execute them all at once. This gives tremendous parallelism in the application. Given the manner in which MySQL Cluster is designed, it is actually no more expensive to update 10 records in 10 different transactions than to update 10 records in 1 transaction using this API. Jonas Oreland showed in his blog post how one API process using this API can handle 1 million operations per second.
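To make the asynchronous model concrete, here is a minimal C++ sketch of what a flexAsynch-style read batch looks like against the NDB API. The table T1 with an integer primary key PK and a value column VAL is made up for the example, and error handling is omitted:

    #include <NdbApi.hpp>

    /* Called once per completed transaction. */
    static void callback(int result, NdbTransaction *trans, void *ndbPtr)
    {
      ((Ndb *) ndbPtr)->closeTransaction(trans);
    }

    void read_batch(Ndb *ndb, const NdbDictionary::Table *tab,
                    Uint32 first_key, int batch_size)
    {
      for (int i = 0; i < batch_size; i++)
      {
        NdbTransaction *trans = ndb->startTransaction();
        NdbOperation *op = trans->getNdbOperation(tab);
        op->readTuple(NdbOperation::LM_CommittedRead);
        op->equal("PK", first_key + (Uint32) i);  /* one record per transaction */
        op->getValue("VAL", NULL);
        /* Only defined so far -- nothing has been sent to the data nodes. */
        trans->executeAsynchPrepare(NdbTransaction::Commit, callback, ndb);
      }
      /* One send/poll for the whole batch: all the transactions travel in
         as few TCP/IP packets as possible. */
      ndb->sendPollNdb(3000, batch_size);
    }

Whether the batch holds 10 transactions or 1000, the data nodes see one stream of requests, which is exactly why 10 single-record transactions cost about the same as one 10-record transaction here.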

The main limitation on how many operations can be executed per second in this benchmark is the processing in the data nodes of MySQL Cluster. Thus we wanted to see how well the cluster scales on this benchmark as we add more and more data nodes.

A data node in MySQL Cluster operates best when threads are locked to CPUs, as shown in a previous blog of mine. Currently the main threads operating in a data node are the threads handling local database operations (there can be up to four of those), the thread doing transaction synchronisation and finally the thread handling receive of messages on the sockets connected to other data nodes or API nodes. Thus, to operate at its best, a data node needs at least 6 CPUs. Personally I often configure 8 CPUs to allow the remaining threads to perform their work without inhibiting query performance. Those remaining threads handle replication, file system interaction and cluster control.
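As an illustration, a data node section of config.ini along these lines binds the threads of a multi-threaded data node (ndbmtd) to dedicated CPUs; the CPU ids below are just a sketch for an 8-CPU budget per node, not a recommendation for any particular machine:

    [ndbd default]
    MaxNoOfExecutionThreads=8
    # Local database, transaction synchronisation and receive threads:
    LockExecuteThreadToCPU=0,1,2,3,4,5,6
    # Maintenance threads (file system interaction and friends):
    LockMaintThreadsToCPU=7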

At our disposal for this benchmark we had machines with dual Intel Xeon X5670 processors @ 2.93 GHz, which means 12 logical CPUs per socket (6 cores with hyperthreading). One thing to consider when running a benchmark like this is that the network is an important part of the infrastructure. We had access to an InfiniBand network here and used IP-over-InfiniBand (IPoIB) as the communication medium. It would most likely have been even better to use the Sockets Direct Protocol (SDP), but we had limited time to set things up and the bandwidth of IPoIB was quite sufficient. The ample CPU and network resources made it possible to run more than one data node per machine.

In order to run flexAsynch on bigger clusters we also needed to handle multiple instances of flexAsynch running in parallel. To handle this I changed flexAsynch slightly so that one process can only create a table or only delete a table. I also made it possible to run flexAsynch doing only inserts, only reads or only updates. To make it easier to get proper numbers I used a set of timers for the read and update benchmarks. The first timer specifies the warmup time: operations are executed but not counted, since we are still in the phase where multiple API instances are starting up. The next timer specifies the actual time to execute the benchmark, and finally a third timer specifies the cooldown time, during which transactions are again run but not counted, since not all API instances start and stop at exactly the same time. In this manner we get accurate numbers for read and update operations. For inserts we don't use timers, so the insert numbers are less accurate.
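The counting idea is simple; the sketch below (made-up names, not the actual flexAsynch code) shows how only operations completing inside the middle window contribute to the reported rate:

    #include <time.h>

    enum Phase { WARMUP, MEASURE, COOLDOWN };

    struct BenchTimers
    {
      time_t start;
      int warmup_s, measure_s;
      Phase phase(time_t now) const
      {
        time_t t = now - start;
        if (t < warmup_s) return WARMUP;
        if (t < warmup_s + measure_s) return MEASURE;
        return COOLDOWN;
      }
    };

    /* In the benchmark loop, per completed batch of operations:
         if (timers.phase(time(0)) == MEASURE) counted_ops += batch_size;
       The reported rate is counted_ops / measure_s, unaffected by API
       instances that are still starting up or have already stopped. */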

The results of those benchmarks will be posted in blogs coming out soon.

MySQL Cluster and Sharding

Sharding is here defined as the ability to partition the data into partitions defined by a condition on a set of fields. This ability is central to how MySQL Cluster works. Within a Cluster we automatically partition each table into fragments (shards in the internet world). By default there is a fixed number of fragments per node. As mentioned, we also use replication inside a Cluster, and this replication happens per fragment. We define the number of replicas we want in the Cluster, and the MySQL Cluster software then maintains this number of replicas of each fragment. These fragment replicas are all kept in synch. Thus for MySQL Cluster the sharding is automatic and happens inside the Cluster, even on commodity hardware.
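Conceptually, the routing of a row to its fragment is nothing more exotic than hashing the partition key; the real implementation is more involved (it hashes the distribution key), but the sketch below shows the idea:

    /* The same key always hashes to the same fragment (shard), so the
       client API can address the node holding the row directly. */
    Uint32 fragment_for(Uint32 partition_key_hash, Uint32 num_fragments)
    {
      return partition_key_hash % num_fragments;
    }

Each such fragment then has as many synchronous replicas as the NoOfReplicas configuration parameter asks for, maintained entirely by the data nodes themselves; clients never do the routing by hand.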

One of the defining features of MySQL Cluster is that it keeps the fragments up and running at all times and restores them after a Cluster crash. MySQL Cluster also supports adding nodes while the Cluster is operational, which means that we can add nodes to a running Cluster and repartition the tables during normal operation. This is part of standard MySQL Cluster and is used by many users and customers to increase the size of their production clusters.

MySQL Cluster API, the NDB API

As mentioned in a previous blog, the programming API on the client side is a very important part of the performance of MySQL Cluster. Every API that is used to access the Data Server in MySQL Cluster goes through the NDB API. For example, the NDB API is used in the NDB storage handler to make data residing in MySQL Cluster accessible from the MySQL APIs.

The basis of the good performance of the programming API is the ability to batch operations in various ways. In early MySQL Cluster history, the MySQL Storage Engine API had very few interfaces that could handle multiple records at a time. As we progressed, the Storage Engine API gained several interfaces that can handle multiple records at a time. There is even development work, presented at the UC 2010, where the Storage Engine API can now push entire queries down to the storage engine, even join queries. This has also been presented in a recent webinar with our engineers.

The NDB API uses a model where one first defines the operations to issue towards the database. The calls that build an operation do not interact with the actual database; the message is sent to the data node only after the execute method has been called. The NDB API is designed to handle batching of operations on two levels. The first level is batching inside one thread: one can open several transactions in parallel within the same thread and execute them in parallel with one execute call. In addition, several threads can work in parallel, and every one of those threads can also execute multiple transactions in parallel.
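A small sketch of the define-then-execute model with the synchronous API (again with made-up table and column names): ten update operations are defined up front, and nothing reaches the data nodes until the execute call:

    void update_ten(Ndb *ndb, const NdbDictionary::Table *tab)
    {
      NdbTransaction *trans = ndb->startTransaction();
      for (Uint32 key = 0; key < 10; key++)
      {
        NdbOperation *op = trans->getNdbOperation(tab);
        op->updateTuple();
        op->equal("PK", key);                 /* defined, not yet sent */
        op->setValue("VAL", (Uint32) 0);
      }
      trans->execute(NdbTransaction::Commit); /* all ten sent together */
      ndb->closeTransaction(trans);
    }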

So the possibilities for parallelism using the NDB API are tremendous. Much of the cost of accessing a database is paid in the networking, so by using parallel transactions inside a thread (the asynchronous NDB API) and by using the multithreaded capabilities of the NDB API, it is possible to decrease the networking cost greatly by making the TCP/IP packets larger. Roughly, the cost of sending a TCP/IP packet is Fixed_cost + #Bytes * Byte_cost. The fixed cost has in the past been about the same as the cost of sending 60 bytes. This extra cost of small messages has to be paid both on the server side and on the client side. Thus it pays off very well to send larger messages. As the message size goes towards 1 kByte, the extra cost drops into the range of 6-7% compared to infinitely large messages, whereas a 200-byte message has about 30% added cost.
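A back-of-the-envelope check of those numbers, counting the fixed cost as roughly 60 bytes' worth of sending:

    cost(n)     = Fixed_cost + n * Byte_cost,  with Fixed_cost ~ 60 * Byte_cost
    overhead(n) = Fixed_cost / (n * Byte_cost) = 60 / n

    overhead(200)  = 60 / 200  = 30%
    overhead(1000) = 60 / 1000 = 6%

which matches the 30% figure for 200-byte messages and the 6-7% range for messages approaching 1 kByte.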

An additional benefit of batching is that there are fewer context switches, since several messages can be handled in parallel without any context switch in between.

You can learn more about performance optimization of your own applications by reading this whitepaper.

Saturday, April 09, 2011

MySQL Cluster performance aspects

MySQL Cluster was designed for high performance from the very beginning. To achieve high performance one has to understand many aspects of computing; the protocol, for example, is very important. In the original design work in 1994 we had a master's thesis student build a prototype using a protocol based on BER encoding and other standard parts of many telecom protocols. After seeing the code of this prototype, which was several thousand lines just to handle the protocol, I realised that this type of protocol would simply cost too much on both the client side and the server side. Prototypes like this in early design work are extremely useful, since it would have been very difficult to change the protocol once we had started down the path of developing the Data Server.

Based on this work we instead opted for a protocol where almost everything is of fixed size and entirely based on sending 32-bit words. We didn't want a protocol that transferred bytes, to avoid the extra computational complexity this would require. So the NDB protocol used for query processing has a message called TCKEYREQ; this message has about 10 32-bit words describing various fixed parameters such as TableId, ConnectionId, PartitionId and so forth. There is also a 32-bit word containing a set of bits used to interpret the message. Reading this protocol can actually be done while completely avoiding branches, since the bits can be used to address the proper words in the message through some arithmetic. The only branching needed is in taking care of keys and the actual query information, which is of variable size.
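To illustrate the branch-free style (the word layout and flag bit below are invented for the example, not the actual TCKEYREQ signal format): a flag bit is turned into an array index, so no conditional jump is needed to pick up an optional word:

    typedef unsigned int Uint32;

    struct Decoded { Uint32 tableId; Uint32 partitionId; };

    Decoded decode(const Uint32 *msg)
    {
      Decoded d;
      Uint32 flags = msg[0];
      d.tableId    = msg[1];
      /* Bit 0 of the flags word says whether an explicit partition id
         word follows. The multiply masks the value to 0 when absent --
         index arithmetic instead of an if-statement. */
      Uint32 hasPartId = flags & 1;
      d.partitionId    = msg[1 + hasPartId] * hasPartId;
      return d;
    }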

The next important component of performance is the execution model. The MySQL Cluster data nodes use an execution model which is extremely well suited to modern CPUs. The data nodes use a set of threads, where each thread implements its own little OS with a scheduler. All communication inside a data node is based on messages. From a software point of view, the code that receives internal messages is exactly the same as the code handling messages arriving over the network. When sending a message, it is the address that defines the type of message. The address contains three parts: the node id, the thread id and the module id (the block number in the code). If a message is sent to a module with the same node id and thread id as the sending thread, it is an internal message and is delivered by putting it in the local message buffer. If the node id is the same but the thread id differs, the message is sent to another thread; this communication between threads is extremely efficient, based on shared memory and using the most efficient mechanisms the HW and the OS provide. Finally, when the node id differs, the message is sent as a network packet to another data node or client node. There is a TCP/IP link between all nodes (a fully connected mesh), and the data node uses mechanisms to ensure that the packets sent contain as many messages as possible without sacrificing latency (the user can tune the acceptable latency through a config parameter).
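The routing decision itself is a three-way choice on the address, along these lines (invented names; the real code lives inside the NDB kernel):

    enum Route { LOCAL_BUFFER, INTER_THREAD, NETWORK };

    struct Address { Uint32 nodeId; Uint32 threadId; Uint32 moduleId; };

    Route route(const Address &dst, Uint32 myNodeId, Uint32 myThreadId)
    {
      if (dst.nodeId != myNodeId)
        return NETWORK;       /* packed with other messages into one send */
      if (dst.threadId != myThreadId)
        return INTER_THREAD;  /* shared-memory queue to the other thread */
      return LOCAL_BUFFER;    /* appended to this thread's own job buffer */
    }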

Given this model, a thread can be actively executing thousands of queries without any need for context switches. This is one reason why MySQL Cluster benefits greatly when threads are locked to certain CPU cores and there is no contention from other programs for these cores. The data nodes have their own local OS and thus work extremely efficiently when the OS scheduler stays out of the way.

This particular execution model, where each thread of execution runs until it decides to send a message (the unit of execution is always the execution of one message), was very popular in the 70s because of its efficiency. It was later replaced by the time-sharing model, given the latter's simplicity. When designing MySQL Cluster we decided that for a Data Server meant to handle millions of queries per second, the efficiency of execution matters more than the simplicity of the design. Another great benefit of this execution model is that as the load on the Data Server increases, throughput also grows. This is because the execution threads execute for longer before they look at the sockets for incoming traffic, so more messages are gathered on each poll and the cost per message byte decreases. The same happens on the send side: as the number of messages executed per round grows, more data is sent on each send call, again decreasing the cost of each sent message byte.

The design is extremely modular even though it uses a more complex execution model. Each module can only communicate with other modules using messages, and the modules share no data. Thus if an error occurs in a module, it is either due to a bug in that module or due to bad input data to the module. To debug a data node we trace every important branch and every message executed together with its data. This means that if a crash occurs we have very detailed information about how it happened, including the last thousand or so branches taken in the code and a few thousand of the last messages executed in the data node.
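The tracing can be as cheap as appending an id to a ring buffer; this is a sketch of the idea with invented names, not the kernel's actual trace macros:

    struct TraceBuffer
    {
      static const Uint32 SIZE = 1024;   /* keeps the last 1024 events */
      Uint32 events[SIZE];
      Uint32 pos;
      void record(Uint32 eventId) { events[pos++ % SIZE] = eventId; }
    };

    /* Every important branch and every executed message records an
       event id; on a crash the buffer is dumped, giving the exact
       path that led to the failure. */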

The final aspect of performance is the actual implementation of the database algorithms. Covering this in one blog post is obviously not possible, but it encompasses an efficient design of the index data structures (we implement a hash index and an ordered index); an efficient implementation of the actual record storage (including the ability to handle variable-sized data, to handle NULLable fields in a storage-efficient manner, and even to add fields to a table by using dynamic fields which are NULL when not present in a record); an efficient model for recovery; and finally an efficient model for transaction handling. In all of those aspects MySQL Cluster has added innovation to the world of databases, with a particular focus on performance.
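As one concrete illustration of the record storage ideas (this layout is illustrative only, not NDB's actual record format):

    struct RecordHeader
    {
      Uint32 nullBits;     /* one bit per NULLable field -- no per-field
                              byte overhead for NULL handling */
      Uint16 varPartLen;   /* total length of the variable-sized part */
      Uint16 numDynFields; /* dynamic fields added after this record was
                              written are absent and read back as NULL */
    };
    /* Fixed-size columns follow the header at known offsets; variable-
       sized columns are stored as (offset, length) into the var part. */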

There is actually one more important part of the performance of MySQL Cluster and this is the programming API on the client side. I will discuss this in my next blog.

Friday, April 08, 2011

MySQL Cluster - NoSQL access with some SQL

As someone noted in a blog, the NDB API is a NoSQL API that was designed 15 years ago. When I wrote my Ph.D. thesis (which is the design document that NDB Cluster is based on) I called it Design and Modelling of a Parallel Data Server for Telecom Applications. The important term here is Data Server: it was never intended as a pure SQL DBMS, but rather for any data storage need. The requirements on this Data Server are also written up in my thesis, for those who care to read it, and included HLRs (the telecom database used to keep track of your mobile phone), News-on-Demand, Multimedia Email, Event Data Services (logging of events in telco and web applications, used for charging, billing and understanding the customers) and a genealogy application.

MySQL Cluster has been very successful in the telecom space, and chances are very high that a MySQL Cluster solution is involved whenever you place a mobile phone call. Many ISPs also use MySQL Cluster to handle DNS lookups, authentication and many other internet services. As an example, the ISP I use every day, and through which I post this blog message, uses MySQL Cluster for this type of service, so I invoke the services of MySQL Cluster every time I access the web from home. In addition, we have seen MySQL Cluster adopted for eCommerce, session management, content delivery, user profile management and on-line gaming applications.

MySQL Cluster was from the very start designed to handle many other applications in the web space as well. Today's internet environment contains quite a few different APIs for handling web data, and MySQL Cluster already has a plethora of APIs that can be used to access the basic Data Server. MySQL Cluster can be used with every API that can access a MySQL Server. In addition we have the Cluster/J API, a low-level Java API with similar characteristics to the NDB API, and based on Cluster/J we have a standard JPA interface to MySQL Cluster. We even have an LDAP interface, which means that the same data can be accessed through LDAP, SQL, Cluster/J, JPA, the NDB API and many other interfaces built upon these, of which I am sure I don't know every one. Another interesting interface is mod-ndb, which makes it possible to query MySQL Cluster through a REST API and get results back as JSON.

We are, however, not satisfied with the current set of APIs for MySQL Cluster, so we will be adding even more as we go, making the Data Server capabilities available from even more environments, including additional APIs commonly used in the web space. Stay tuned for Tomas Ulin's keynote at the UC and Collaborate next week.