Tuesday, September 28, 2021

Memory Management in RonDB

 Most of the memory allocated in RonDB is handled by the global memory manager. Exceptions are architecture objects and some fixed size data structures. In this presentation we will focus on the parts handled by the global memory manager.

In the global memory manager we have 13 different memory regions as shown in the figure below:



- DataMemory

- DiskPageBufferCache

- RedoBuffer

- UndoBuffer

- JobBuffer

- SendBuffers

- BackupSchemaMemory

- TransactionMemory

- ReplicationMemory

- SchemaMemory

- SchemaTransactionMemory

- QueryMemory

- DiskOperationRecords

One could divide those regions into a set of qualities. We have a set of regions that are fixed in size, another set of regions are critical and cannot handle failure to allocate memory, a set of regions have no natural upper limit and are unlimited in size, there is also a set of regions that are flexible in size that can work together to achieve the best use of memory. We can also divide regions based on whether the memory is short term or long term. Each region can belong to multiple categories.

To handle these qualities of the regions we have priorities on each memory region, this priority can be affected by the amount of memory that the resource has allocated.

Fixed regions have a fixed size, this is used for database objects, the Redo log Buffer, the Undo log buffer, the DataMemory and the DiskPageBufferCache (the page cache for disk pages). There is code to ensure that we queue up when those resources are no longer available. DataMemory is a bit special and we will describe it separately below.

Critical regions are regions where a request to allocate memory would cause a crash. This relates to the job buffer which is used for internal messages inside a node, it also relates to send buffers which are used for messages to other nodes. DataMemory is a critical region during recovery, if we fail to allocate memory for database objects during recovery we would not be able to recover the database. Thus DataMemory is a critical region in the startup phase, but not during normal operation. DiskOperationRecords are also a critical resource since otherwise we cannot maintain the disk data columns. Finally we also treat BackupSchemaMemory as critical since not being able to perform a backup would make it very hard to manage RonDB.

Unlimited regions have no natural upper limit, thus as long as memory is available at the right priority level, the memory region can continue to grow. The regions in this category is BackupSchemaMemory, QueryMemory and SchemaTransactionMemory. QueryMemory is memory used to handle complex SQL queries such as large join queries. SchemaTransactionMemory can grow indefinitely, but the meta data operations try avoid growing too big.

Flexible regions are regions that can grow indefinitely but that have to set limits on its own growth to ensure that other flexible regions are also allowed to grow. Thus one flexible resource isn't allowed to grab all the shared memory resources. There are limits to how much memory a resource can grab before its priority is significantly lowered.

Flexible regions are TransactionMemory, ReplicationMemory, SchemaMemory, QueryMemory, SchemaTransactionMemory, SendBuffers, BackupSchemaMemory, DiskOperationRecords, 

Finally we have short term versus long term memory regions. A short term memory region allocation is of smaller signifance compared to a long term memory region. In particular this relates to SchemaMemory. SchemaMemory contains metadata about tables, indexes, columns, triggers, foreign keys and so forth. This memory once allocated will stay for a very long time. Thus if we allow it to grow too much into the shared memory we will not have space to handle large transactions that require TransactionMemory.

Each region has a reserved space, a maximum space and a priority. In some cases a region can also have a limit where its priority is lowered.

4% of the shared global memory is only accessible to the highest priority regions plus half of the reserved space for job buffers and communication buffers.

10% of the shared global memory is only available to high prio requesters. The remainder of the shared global memory is accessible to all memory regions that are allowed to allocate from the shared global memory.

The actual limits might change over time as we learn more about how to adapt the memory allocations.

Most regions have access also to a shared global memory. It will first use its reserved memory and if there is shared global memory available it can allocate from this as well.

The most important ones are DataMemory and DiskPageBufferMemory. Any row stored in memory and all indexes in RonDB are stored in the DataMemory. The DiskPageBufferMemorycontains the page cache for data stored on disk. To ensure that we can always handlerecovery, DataMemory is fixed in size and since recovery can sometimes grow the data size a bit. We don't allow the DataMemory to be filled beyond 95% in normal operation. In recovery it can use the full DataMemory size. Those extra 5% memory resources are also reserved for critical operations such as growing the cluster with more nodes and reorganising the data inside RonDB. The DiskPageBufferCache is fixed in size, operations towards the disk is queued by using DiskOperationRecords.

Critical regions which have higher priority to get memory compared to the rest of the regions. These are job buffers used for sending messages between modules inside a data node, send buffers used for sending messages between nodes in the cluster, the meta data required for handling backup operations and finally operation records to access disk data.

These regions will be able to allocate memory even when all other regions will fail to allocate memory. Failure to access memory for those regions would lead to failure of the data node or failure to backup the data which are not events that are acceptable in a DBMS.

We have 2 more regions that are fixed in size, the Redo log buffer and the Undo log buffer (the Undo log is only used for operations on disk pages). Those allocate memory at startup and use that memory, there is some functionality to handle overload on those buffers by queueing operations when those buffers are full.

The remaining 4 regions we will go through in detail.

The first one is TransactionMemory. This memory region is used for all sorts of operations such as transaction records, scan records, key operation records and many more records used to handle the queries issued towards RonDB.

The TransactionMemory region have a reserved space, but it can grow up to 50% of the shared global memory beyond that. It can even grow beyond that, but in this case it only has access to the lowest priority region of the shared global memory. Failure to allocate memory in this region leads to aborted transactions.

The second region in this category is SchemaMemory. This region contains a lot of meta data objects representing tables, fragments, fragment replicas, columns, and triggers. These are long-term objects that will be there long-term. Thus we want this region to be flexible in size, but we don't want it grow such that it diminishes the possibility to execute queries towards region. Thus we calculate a reserved part and allow this part to grow into at most 20% of the shared memory region in addition to its reserved region. This region cannot access the higher priority memory regions of the shared global memory.

Failure to allocate SchemaMemory causes meta data operations to be aborted.

Next region in this category is ReplicationMemory. These are memory structures used to represent replication towards other clusters supporting Global Replication. It can also be used to replicate changes from RonDB to other systems such as ElasticSearch. The memory in this region is of temporary nature with memory buffers used to store the changes that are being replicated. The meta data of the replication is stored in the SchemaMemory region.

This region has a reserved space, but it can also grow to use up to 30% of the shared global memory. After that it will only have access to the lower priority regions of the shared global memory.

Failure to allocate memory in this region lead to failed replication. Thus replication have to be set up again. This is a fairly critical error, but it is something that can be handled.

The final region in this category is QueryMemory. This memory has no reserved space, it can use the shared global lower priority regions. This memory is used to handle complex SQL queries. Failure to allocate memory in this region will lead to complex queries being aborted.

This blog presents the memory management architecture in RonDB that is currently in a branch called schema_mem_21102, this branch is intended for RonDB 21.10.2, but could also be postponed to RonDB 22.04. The main difference in RonDB 21.04 is that the SchemaMemory and ReplicationMemory are fixed in size and cannot use the shared global memory. The BackupSchemaMemory is also introduced in this branch. It was currently part of the TransactionMemory.

In the next blog on this topic I will discuss how one configures the automatic memory in RonDB.

Friday, September 24, 2021

Automatic Memory Management in RonDB

RonDB has now grown up to the same level of memory management as you find in expensive commercial DBMSs like Oracle, IBM DB2 and Microsoft SQL Server.

Today I made the last development steps in this large project. This project started with a prototype effort by Jonas Oreland already in 2013 after being discussed for a long time before that. After he left for Google the project was taken over by Mauritz Sundell that implemented the first steps for operational records in the transaction manager.

Last year I added the rest of the operational records in NDB. Today I completed the programming of the final step in RonDB. This last step meant moving around 30 more internal data structures towards using the global memory manager. These memory structures are used to represent meta data about tables, fragments, fragment replicas, triggers and global replication objects.

One interesting part that is contained in this work is a malloc-like implementation that interacts with all record-level data structures that is already in RonDB to handle linked list, hash tables and so forth for internal data structures.

So after more than 5 years it feels like a major step forward in the development of RonDB.

What does this mean for a user of RonDB? It means that the user won't have to bother much with memory management configuration. If RonDB is started in a cloud VM, it will simply use all memory in the VM and ensure that the memory is handled as a global resource that can be used by all parts of RonDB. This feature is exactly existing already in RonDB 21.04. What this new step means is that the memory management is even more flexible, there is no need to allocate more memory than needed for meta data objects (and vice versa if more memory is needed, it is likely to be accessible).

Thus memory can be used for other purposes as well. Thus the end result is that more memory is made available in all parts of RonDB, both to store data in it and to perform more parallel transactions and more query handling.

Another important step is that this step opens up for many new developments to handle larger objects in various parts of RonDB.

In later blogs we will describe how the memory management in RonDB works. This new development will either appear in RonDB 21.10 or in RonDB 22.04.


Friday, August 13, 2021

How to achieve AlwaysOn

When discussing how to achieve High Availability most DBMS focus on handling it via replication. Most of the focus has thus been focused on various replication algorithms.

However truly achieving AlwaysOn availability requires more than just a clever replication algorithm.

RonDB is based on NDB Cluster, NDB has been able to prove in practice that it can deliver capabilities that makes it possible to build systems with less than 30 seconds of downtime per year.

So what is required to achieve this type of availability?

  1. Replication
  2. Instant Failover
  3. Global Replication
  4. Failfast Software Architecture
  5. Modular Software Architecture
  6. Advanced Crash Analysis
  7. Managed software

Thus a clever replication algorithm is only 1 of 7 very important parts to achieve the highest possible level of availability. Managed software is one of the addition that RonDB does to NDB Cluster. This won't be discussed in this blog.

Instant Failover means that the cluster must handle failover immediately. This is the reason why RonDB implements a Shared Nothing DBMS architecture. Other HA DBMS such as Oracle and MySQL InnoDB Cluster and Galera Cluster relies on replaying the logs at failover to catch up. Before this catch up has happened the failover hasn't completed. In RonDB every updating transaction updates both data and logs as part of the changing transaction, thus at failover we only need to update the distribution information.

In a DBMS updating information about node state is required to be a transaction itself. This transaction takes less than one millisecond to perform in a cluster. Thus in RonDB the time it takes to failover is dependent on the time it takes to discover that the node has failed. In most cases the reason for the failure is a software failure and this usually leads to dropped network connections which are discovered within microseconds. Thus most failovers are handled within milliseconds and the cluster is repaired and ready to handle all transactions again.

The hardest failure to discover are the silent failures, this can happen e.g. when the power on a server is broken. In this case the time it takes is dependent on the time configured for heartbeat messages. How low this time can be set is dependent on the operating system and how much one can depend on that it sends a message in a highly loaded system. Usually this time is a few seconds.

But even with replication and instant failover we still have to handle failures caused by things like power breaks, thunderstorms and many more problems that cause an entire cluster to fail. A DBMS cluster is usually located within a confined space to achieve low latency on database transactions.

To handle this we need to handle failover from one RonDB cluster to another RonDB cluster. This is achieved in RonDB by using asynchronous replication from one cluster to another. This second RonDB cluster needs to physically separated from the other cluster to ensure higher independence of failures.

Actually having global replication implemented also means that one can handle complex software changes such as if your application does a massive rewrite of the data model in your application.

Ok, are we done now, is this sufficient to get a DBMS cluster which is AlwaysOn.

Nope, more is needed. After implementing these features it is also required to be able to quickly find the bugs and be able to support your customers when they hit issues.

The nice thing with this architecture is that a software failure will most of the time not cause anything more than a few aborted transactions which the application layer should be able to handle.

However in order to build an AlwaysOn architecture one has to be able to quickly get rid of bugs as well.

When NDB Cluster joined MySQL two different software architectures met each other. MySQL was a standalone DBMS, this meant that when it failed the database was no longer available. Thus MySQL strived to avoid crashes since that meant that the customer no longer could access its data.

With NDB Cluster the idea was that there would always be another node available to take over if we fail. Thus NDB, and thus also RonDB implements a Failfast Software Architecture. In RonDB this is implemented using a macro in the RonDB called ndbrequire, this is similar how most software uses assert. However ndbrequire stays in the code also when we run in production code.

Thus every transaction that is performed in RonDB causes thousands error checks to be checked. If one of those ndbrequire's returns false we will immediately fail the node. Thus RonDB will never proceed when we have an indication that we have reached a disallowed state. This ensures that the likelihood of a software failure leading to data being incorrect is minimised.

However crashing solves only the problem as a short-term solution. In order to solve the problem for real we also have to fix the bug. To be able to fix bugs in a complex DBMS requires a modular software architecture. RonDB software architecture is based on experiences from AXE, this is a switch developed in the 1970s at Ericsson.

The predecessor of AXE at Ericsson was AKE, this was the first electronic switch developed at Ericsson. It was built as one big piece of code without clear boundaries between the code parts. When this software reached sizes of millions of lines of code it became very hard to maintain the software.

Thus when AXE was developed in a joint project between Ericsson and Telia (a swedish telco operator) the engineers needed to find a new software architecture that was more modular.

The engineers had lots of experiences of designing hardware as well. In hardware the only path to communicate between two integrated circuits is by using signals on an electrical wire. Since this made it possible to design complex hardware with small amount of failures, the engineers reasoned that this architecture should work as a software architecture as well.

Thus the AXE software architecture used blocks instead of integrated circuits and signals instead of electrical signals. In modern software language these would have been called modules and messages most likely.

A block owns its own data, it cannot peek at other blocks data, the only manner to communicate between blocks is by using signals that send messages from one block to another block.

RonDB is designed like this with 23 blocks that implements different parts of the RonDB software architecture. The method to communicate between blocks is mainly through signals. These blocks are implemented as large C++ classes.

This software architecture leads to a modular architecture that makes it easy to find bugs. If a state is wrong in a block it can either be caused by code in the block, or by a signal sent to the block.

In RonDB signals can be sent between blocks in the same thread, to blocks in another thread in the same node and they can be sent to a thread in another node in the cluster.

In order to be able to find the problem in the software we want access to a number of things. The most important feature to discover is to discover the code path that led to the crash.

In order to find this RonDB software contains a macro called jam (Jump Address Memory). This means that we can track a few thousand of the last jumps before the crash. The code is filled with those jam macros. This is obviously an extra overhead that makes RonDB a bit slower, but to deliver the best availability is even more important than being fast.

Just watch Formula 1, the winner of Formula 1 over a season will never be a car that fails every now and then, the car must be both fast and reliable. Thus in RonDB reliability has priority over speed even though we mainly talk about the performance of RonDB.

Now this isn't enough, the jam only tracks jumps in the software, but it doesn't provide any information about which signals that led to the crash. This is also important. In RonDB each thread will track a few thousand of the last signals executed by the thread before the crash. Each signal will carry a signal id that makes it possible to follow signals being sent also between threads within RonDB.

Let's take an example of how useful this information is. Lately we had an issue in the NDB forum where a user complained that he hadn't been able to produce any backups the last couple of months since one of the nodes in the cluster failed each time the backup was taken.

In the forum the point in the code was described in the error log together with a stack trace of which code we executed while crashing. However this information wasn't sufficient to find the software bug.

I asked for the trace information that includes both the jam's and the signal logs of all the threads in the crashed node.

Using this information one could quickly discover how the fault occurred. It would only happen in high-load situations and required very tricky races to occur, thus the failure wasn't seen by most users. However with the trace information it was fairly straightforward to find what caused the issue and based on this information a work-around to the problem was found as well as a fix of the software bug. The user could again be comfortable by being able to produce backups.

Thursday, August 12, 2021

RonDB and Docker Compose

After publishing the Docker container for RonDB I got a suggestion to simplify it further by using Docker Compose. After a quick learning using Google I came up with a Docker Compose configuration file that will start the entire RonDB cluster and stop it using a single command.

First of all I had to consider networking. I decided that using an external network was the best solution. This makes it easy to launch an application that uses RonDB as a back-end database. Thus I presume that an external network has been created with the following command before using Docker Compose to start RonDB:

docker network create mynet --subnet=192.168.0.0/16

The docker-compose.yml is available on GitHub at

https://github.com/logicalclocks/rondb-docker

In the file rondb/21.04/docker-compose.yml for RonDB 21.04 and in rondb/21.10/docker-compose.yml for RonDB 21.10. Link to docker-compose.yml

To start a RonDB cluster now run this command from a directory where you have placed docker-compose.yml.

docker-compose up -d

After about 1 minute the cluster should be up and running and you can access it using:

docker exec -it compose_test_my1_1 mysql -uroot -p

password: password

The MySQL Server is available at port 3306 on IP 192.168.0.10 using the mynet subnet

When you want to stop the RonDB cluster use the command:

docker-compose stop

Docker Compose creates normal Docker containers that can be viewed using docker ps and docker logs commands as usual.

RonDB and Docker

There was a request to be able to test RonDB using Docker. This is now working.
These commands will set up a RonDB cluster on your local machine that can be used to test RonDB:

Step 1: Download the Docker containers for RonDB

docker pull mronstro/rondb

Step 2: Create a Docker subnet

docker network create mynet --subnet=192.168.0.0/16

Step3: Start the RonDB management server

docker run -d \
  --net=mynet \
  -v /path/datadir:/var/lib/rondb \
  -ip 192.168.0.2 \
  -name mgmt1 \
  mronstro/rondb ndb_mgmd --ndb-nodeid=65

Step 4: Start the first RonDB data node

docker run -d \
  --net=mynet \
  -v /path/datadir:/var/lib/rondb \
  -ip 192.168.0.4 \
  -name ndbd1 \
  mronstro/rondb ndbmtd --ndb-nodeid=1

Step 5: Start the second RonDB data node

docker run -d \
  --net=mynet \
  -v /path/datadir:/var/lib/rondb \
  -ip 192.168.0.5 \
  -name ndbd2 \
  mronstro/rondb ndbmtd --ndb-nodeid=2

Step 6: Check that the cluster has started and is working

This step isn't required, but just to show that the cluster is
up and running, start the RonDB management client and issue the
show command.

docker exec -it mgmt1 ndb_mgm
ndb_mgm> show

This should hopefully show a starting cluster and after about
half a minute the cluster should be started.

Step 7: Start a MySQL Server

Note that the MySQL Server uses /var/lib/mysql as datadir internally
whereas the RonDB management server and data node uses
/var/lib/rondb.

docker run -d \
  --net=mynet \
  -v /path/datadir:/var/lib/mysql \
  -e MYSQL_ROOT_PASSWORD=your_password \
  -ip 192.168.0.10 \
  -name mysqld1 \
  mronstro/rondb mysqld --ndb-cluster-connection-pool-nodeids=67

Step 8: Start a MySQL client

docker exec -it mysqld1 mysql -uroot -p
Password: your_password

Now you are connected to a MySQL client that can issue SQL commands
towards the RonDB cluster. Below is a very simple example of such
commands:

mysql> CREATE DATABASE TEST;
mysql> USE TEST;
mysql> CREATE TABLE t1 (a int primary key) engine=ndb;
mysql> INSERT INTO t1 VALUES (1),(2);
mysql> SELECT * FROM t1;

I tested this on my development machine using Mac OS X. To succeed with the setup
my Docker setup required at least 8 GByte of memory. RonDB is optimised for use
in VMs in the cloud where a minimum of 8 GByte of memory is available for the
data node VMs. Since the default configuration of Docker will presumably mainly
be used for simple tests I decided to decrease the size of the RonDB data nodes
such that they fit in 3 GBytes of memory. It is definitely possible to run
RonDB in an even smaller environment, but I think that the default should at least
be able to load at least 1 GByte of data and a fair amount of tables into RonDB.

RonDB and Docker is documented at https://docs.rondb.com/rondb_docker/

The RonDB documentation has also been improved at the same time.

The GitHub tree for the Docker containers can be found at:

The GitHub tree is based on the MySQL Docker tree at GitHub.

The Docker Hub is found at:

Thursday, July 29, 2021

Improvements of DBT2 benchmark in RonDB 21.10.1

In the development of RonDB 21.10.1 we have had some focus on improving the performance of the DBT2 benchmark for RonDB.  Actually NDB Cluster already had very good performance for DBT2. However this performance relies on a thread configuration that uses a lot of LDM threads and this means that tables will have very many partitions.

For an application like DBT2 (open source variant of TPC-C) this is not an issue since it is a very scalable application. But most real applications are not as scalable as DBT2 when the number of table partitions increases.

In RonDB we have focused on decreasing the number of table partitions. Thus in RonDB the number of partitions are independent of the number of LDM threads. In DBT2 most of the load are generated towards one of the tables, this means that only a subset of the LDM threads are used in executing DBT2. Even more most of the load is directed towards the primary replicas.

In RonDB 21.10.1 we improved the placement of the primary replicas such that more LDM threads were active in executing the queries. This improved DBT2 performance by about 20%.

Already in RonDB 21.04 we have introduced query threads that can be used for reads using Committed Reads. This makes application using Committed Reads scale very well such as the Online Feature Store in used by Hopsworks. However DBT2 uses a very small number of Committed Reads, most reads are using reads that lock rows. To handle this we modified RonDB 21.10 to allow also locked reads to use query threads.

Query thread already have an efficient scheduling of read queries towards LDM threads and query threads, thus ensuring that all CPUs used for LDM threads and query threads are efficiently used. With the ability to schedule locked read operations towards query threads we automatically make more efficient use of the CPU resources in the DBT2 benchmark. This improvement gives 50% better DBT2 performance for RonDB.

Another feature we made use of in the DBT2 benchmark is the ndb_import tool. Thus the load phase for DBT2 is using the ndb_import tool. This provides a very efficient parallel load tool. Both RonDB 21.04 and RonDB 21.10 contains improvements of the ndb_import tool to enable DBT2 to use it as a load tool.

Finally in RonDB 21.10.1 we also removed the index statistics mutex in the NDB storage engine as a bottleneck. This improves Sysbench throughput by about 10% at high load. We haven't measured how much it impacts the DBT2 performance.

New RonDB releases

It is in the middle of the summer, but we found some time to prepare a new RonDB release. Today we are proud to release new RonDB versions.

https://www.logicalclocks.com/blog/new-rondb-release-21-10-1

RonDB is a stable distribution of NDB Cluster, a key-value store with SQL capabilities. It is based on a release of MySQL, an SQL database server.

RonDB 21.04.1 is the second release of the stable version of RonDB. It contains 3 new features and 18 bug fixes. We plan to support RonDB 21.04 until 2024.

RonDB 21.10.1 is the first beta version of RonDB 21.10. It contains 4 new features that improves throughput of the DBT2 benchmark by 70% compared to RonDB 21.04.1.

Detailed release notes are available in the RonDB documentation.

The new features in RonDB 21.04.1 are:

  1. Support for primary keys using Longvarchar in ClusterJ, the native Java API for RonDB
  2. Support for autoincrement in the ndb_import tool
  3. Killing ndbmtd now uses a graceful shutdown avoiding unnecessary abort
The new features in RonDB 21.10.1 are:
  1. Improved placement of primary replicas
  2. Removing index statistics mutex as a bottleneck in MySQL Server
  3. More flexibility in thread configuration
  4. Use query threads also for Reads which locks the row
Work on RonDB 21.04.2 is already ongoing and is mainly focused on backporting bug fixes from MySQL 8.0.24 through 8.0.26 that are deemed safe and important enough for a back port. The branch used for development of RonDB 21.04 is called 21.04.

Work on RonDB 21.10.2 has already started where we integrated changes from MySQL 8.0.26. The branch used for RonDB 21.10 is called 21.10.1.

There is ongoing work on RonDB to improve use of memory resources. This includes making schema memory use the global memory resources. It also introduces some common malloc functions using global memory resources. This development is a base for many future RonDB improvements that will make it easier to develop new features in RonDB. This work is found in the branch schema_memory_21102 currently.

Here is the GitHub tree for RonDB.

The flexible thread configuration was used for some research on thread pipelines presented in this blog.

Tuesday, May 25, 2021

HA vs AlwaysOn

 In the 1990s I spent a few years studying requirements on databases used in 3G telecom networks. The main requirement was centered around three keywords, Latency, Throughput and Availability. In this blog post I will focus on Availability.


If a telecom database is down it means that no phone calls can be made, internet connections will not work and your app on your smartphone will cease to work. So more or less impacting each and everyone's life immediately.


The same requirements on databases now also start to appear in AI applications such as online Fraud detection, self-driving cars, smartphone apps.


Availability is measured in percent and for telecom databases the requirement is to reach 99.9999% availability. One often calls this Class 6 availability where 6 is the number of nines in the availability percentage.


Almost every database today promises High Availability, this has led to inflation in the term. Most databases that promise HA today reach Class 4 availability, thus 99.99% availability. Class 4 availability means that 50 minutes each year your system is down and unavailable for transactions. For many this is sufficient, but imagine relying on such a system for self-driving cars, or phone networks or modern AI applications used to make online decisions for e.g. hospitals. Thus I sometimes use the term AlwaysOn to refer to a higher availability reaching Class 5 and Class 6 and beyond.


To analyse Availability we need to look at the reasons for unavailability. There are the obvious ones like hardware and software failure. The main solution for those is to use replication. However as I will show below, most replication solutions of today incur downtime on every failure. Thus it is necessary to analyse the replication solution in more detail to see what happens at a failure. Does the database immediately deliver the same latency and throughput after a failure as required by the application?


There are also a lot of planned outages that can happen in many HA systems. We could have downtime when a node goes down before the cluster has been reorganised again, we can have downtime at Software upgrades, at Hardware upgrades, at schema reorganisations. Major upgrades involving lots of systems may cause significant downtime.


Thus to be called AlwaysOn a Database must have zero downtime solutions for at least the following events:

  1. Hardware Failure

  2. Software Failure

  3. Software Upgrade

  4. Hardware Upgrade

  5. Schema Reorganisation

  6. Online Scaling

  7. Major Upgrades

  8. Major disasters


Most of these problems require replication as the solution. However the replication must be at 2 levels. There must be a local replication that makes it possible to sustain the Latency requirements of your application even in the presence of failures. In addition there is a need to handle replication on a global level that makes it possible to survive major upgrades and major disasters.


Finally, to handle schema reorganisations and scaling the system without downtime requires careful design of algorithms in your database.


Today replication is used in almost every database. However the replication scheme in most databases leads to significant downtime for AlwaysOn applications at each failure and upgrade.


To handle a SW/HW failure and SW/HW upgrade without any downtime requires that the replicas are ready to start serving requests within one millisecond after discovering the configuration change required by the failure/upgrade.


Most replication solutions today rely on eventual consistency, thus they are by design unavailable even in normal operation and even more so when the primary replica fails. Many other replication solutions only ship the log records to the backup replica, thus at failure the backup replica must execute the log before it can serve all requests. Again downtime for the application which makes it impossible to meet the requirements of AlwaysOn applications.


To reach AlwaysOn availability it is necessary to actually perform all write operations on the backup replicas as well as on the primary replica. This ensures that the data is immediately available in the presence of failures or other configuration changes.


This has consequences for the latency, many databases take shortcuts when performing writes and doesn’t ensure that the backup replicas are immediately available at failures.


Thus obviously if you want your application to reach AlwaysOn availability, you need to ensure that your database software delivers a solution that doesn't involve log replay at failure (thus all shared disk solutions and eventual consistency solutions are immediately discarded).


As part of my Ph.D research in the 1990s I concluded those things. All those requirements was fed into the development of NDB Cluster, NDB Cluster was the base for MySQL Cluster (nowadays often referred to as MySQL NDB Cluster) and today Logical Clocks AB develops RonDB a distribution of NDB Cluster.


To reach AlwaysOn it is necessary that your database software can deliver solutions to all 8 problems listed above without any downtime (not even 10 seconds). Actually the limit on the availability is the time it takes to discover a failure, for the most part this is immediate since most failures are SW failures, but for some HW failures it is based on a heartbeat mechanism that can lead to a few seconds of downtime.


Thus to reach AlwaysOn availability it is necessary that your database software can handle failures with zero downtime, it is necessary that your database software can handle the algorithms for online schema management and online scaling of your database, it is necessary that your database software supports global replication to also be able to handle major upgrades and major disasters where entire regions are affected. Finally you need to have an organisation that operates your systems in a reliable manner.


RonDB is designed to handle all the requirements on the database software which have been proven in production sites reaching Class 6 availability for more than 15 years. Logical Clocks is building the competence to operate RonDB for customers at these availability levels.

Wednesday, May 19, 2021

5.5M Key Lookups per second on 16 VCPU VMs

 As introduced in a previous blog RonDB enables us to easily execute benchmarks on RonDB using the Sysbench benchmark.


In this blog I will present some results where the RonDB cluster had 2 data nodes, each using a r5.4xlarge VM in AWS that has 16 VCPUs and 128 GB memory. The Sysbench test uses SQL to access RonDB.


In this particular test case we wanted to test the Key-Lookup performance using SQL. Key-Lookup performance is essential in the RonDB use case as an online Feature Store in Hopsworks.


In this case we use the following settings in the Sysbench configuration file (see RonDB documentation for more details on how to set up those tests):


SB_USE_IN_STATEMENT=”100”

SYSBENCH_TEST=”oltp_ro”

SB_SIMPLE_RANGES=”0”

SB_ORDER_RANGES=”0”

SB_SUM_RANGES=”0”

SB_DISTINCT_RANGES=”0”


These settings means that each transaction has 10 SQL queries, each of those queries fetch 100 rows using a primary key lookup. Thus each transaction performs 1000 key lookups.


We will look at two things here, how many key lookups can be delivered per second and what is the latency of each of those batched SELECT statements.


In the image above we see how the number of key value lookups grows extremely quickly, even with only 2 threads per MySQL Server we reach beyond 1M key lookups per second. The total throughput at 32 threads is 5.56M key lookups per second.


Even more interesting is the response time of each of those SELECT statements. Remember that each of those SELECT fetch 100 rows using a primary key lookup. Using 4 threads per MySQL Server we respond within 1.0 millisecond with a throughput of 2.17M key lookups per second. At 12 threads we deliver around 4.5M key lookups with a latency around 1.5 millisecond per SELECT query. At 16 threads we deliver 5.27M key lookups at a response time of 2.28 milliseconds.


Thus as can be seen from these numbers, even using SQL, RonDB can provide astonishing key lookup performance. Obviously using C++, Java or JavaScript key-value APIs we expect to perform a lot better yet and in particular it will use much less CPU on the client side.




Monday, May 10, 2021

Research on Thread Pipelines using RonDB

 

Introduction

In my previous two blogs on this topic I first introduced the concept of automatic thread configuration and the thread model we use in RonDB. After receiving some questions on the topic I dived a bit deeper into explaining the RonDB thread model and its thread pipeline and compared it to another similar concept called batch pipelines.


Since then I read up a bit more on the research in this area with a focus on implementations in other key-value stores. Some researchers argue that a model where one handles the request immediately is superior to a model using a thread pipeline.

RonDB Software Architecture

RonDB has its roots in MySQL NDB Cluster which was developed at Ericsson in the 1990s. The execution engine in NDB has always used asynchronous programming. NDB has its roots in AXE, a telecom switch that Ericsson developed in the 1970s and still develops actually. The development model in AXE used two concepts called blocks and signals. This architecture was inherited from using a HW architecture as base for the software architecture.


The idea is that the blocks are independent units that only communicate with other blocks using signals. This is the model that all hardware is built with due to physical constraints. The reasoning in AXE was that this model is beneficial also to develop large software systems. Development of blocks can be made independent of each other as long as they handle the signalling protocols correctly.


In the 1990s CPUs were single-threaded, sometimes you had 2 CPUs on the motherboard. So a single-threaded model was good enough. When moving to a multi-threaded environment it was necessary to find a way to break up the single-threaded model into a multi-threaded model. In NDB this used two approaches, the first approach was to simply divide the blocks into execution modules. These execution modules are true to the block concept, thus they only communicate with each other through asynchronous signals. Blocks within an execution module are allowed to communicate with synchronous signals and can for optimisation purposes use direct method calls.

RonDB execution modules


In NDB the execution modules are the following:

  1. Transaction Coordinator (tc)

  2. Local Data Manager (ldm)

  3. Main execution module (main)

  4. Replication module (rep)

  5. Network receive (recv)

  6. Network send (send)


These execution modules are now natural to divide into threads. Each execution module can potentially be an independent thread.


However to scale RonDB to larger systems it is also necessary to split a few of those execution modules. The tc module can easily be scaled by allowing each tc module to handle its own set of transactions. This makes for a simple round-robin model to select the tc module to handle a transaction.


The ldm module can be split by partitioning the data and let each ldm module handle a subset of the data. This makes for an easy way to route the data requests to the correct ldm module.


The network receive and the network send modules can be split by partitioning on sockets. Thus each receive socket is handled by one recv module and each send socket is handled by one send module. Each recv and send module can handle multiple sockets.


Actually the send module is handled a bit using its own model where other modules can either handle the send themselves or offload it to the send module. This is not the topic for this blog, so I won’t go into more details about send modules.


One problem with this design is that partitioning the data makes perfect sense for key-value lookups that access data using a primary key. However RonDB is a key-value database with SQL capabilities, thus it is important to also perform well for more complex queries even complex join queries accessing a multitude of tables.


In the partitioned model a range scan not using the partition key has to scan all partitions. This is both beneficial and inefficient. It allows for automatic parallelisation of the range scan, but the lookup in the index tree has to be done once in each partition which increases the cost of the range scan. The solution would be to have some amount of partitioning, but to limit it.

RonDB Query threads


To alleviate this problem a new execution module was invented which is used in RonDB. This is the query execution module. This is in reality the same module as the ldm module, but it is only allowed to handle Committed Read queries. Thus all locked reads and all writes are still routed to the ldm module.

Experiment setup

Now to the point of this description. As can be seen from the above description we have a set of execution modules where each execution module can be assigned to an independent thread. This thread can participate in the thread pipeline previously described and shown in the image above.


The important thing here is that we are actually at liberty to place the execution modules into threads in any manner we like. This opens up the floor to use RonDB as a vehicle to research the benefits and deficiencies of thread pipelines.


The code used in these experiments is found in the hopsworks_2525 branch in the RonDB tree.


In this branch of RonDB it is possible to colocate the tc threads and recv execution modules. Thus each recv thread has both a tc execution module as well as a recv execution module. The assignment of the tc execution module will always be done such that the recv execution module and the tc execution module is placed in the same recv thread.


This makes it possible to create a RonDB data node where we only have separate recv threads and ldm/query threads. However I took another step and even made it possible to run RonDB with only receive threads. In this case each recv thread contains a recv, tc, ldm, and query execution module. In this case we will always select the local query module to execute the query if possible.


The above model makes it possible to experiment with the various settings.


To test this I used a micro-benchmark using variants of the Sysbench test presented here and here. The tests were executed on a P620 Lenovo workstation with a AMD Threadripper processor based on the AMD Zen 2 architecture.


To analyze the tests and to try to understand the reasons for the performance differences in the setups I used a set of perf scripts that can be found in the dbt2 benchmark tree.


The setup I am going to report here used two different setups of RonDB. The setups are not using the automated thread configuration, but rather use the ThreadConfig configuration variable that makes it possible to set up a large variety of thread configurations and experiment with those. I will add the autobench.conf used in those benchmarks at the end of the blog.


The RonDB data nodes used 4 threads that were locked to 4 CPUs using 2 CPU cores in both experiments. One experiment used 4 receive threads that try to do all processing on one thread, the second experiment used 2 receive threads that handle both the receive execution module and the tc execution module and then 1 ldm thread and 1 query thread.


We will call the set up with separated ldm and recv threads the LDM setup and the other setup will be called the Receive setup in this document.

Key-value lookups

The first experiment uses a variant where Sysbench only performs batched primary key lookups with 100 lookups in each SQL statement.


In this case the model LDM setup is able to handle 1.34M lookups per second. The Receive setup is able to handle 1.05M lookups. Thus the thread pipeline achieves almost 30% better throughput in this experiment.


Given that both experiments are executed within RonDB using exactly the same code, from the same RonDB version, it is safe to assume that the execution uses the same amount of code, the same amount of instructions are executed to perform a key-value lookup.


So what differs when looking into the CPU statistics. What is clear is that modern CPUs are very hard to analyze given that they are extremely good at parallelising tasks. 


The two things that stand out are the IPC achieved, the CPU usage and the clock frequency used. The ldm thread in the LDM setup achieves an IPC of 1.40, a frequency of 3.96 GHz and 99.8% CPU usage. The query thread has an IPC of 1.30, a frequency of 3.95 GHz and a CPU usage of 99.9%. The receive threads in the LDM setup have an IPC of 1.26, a frequency of 3.886 and a CPU usage of 66.4%.


The Receive setup threads have an IPC of 1.05, a frequency of 3.468 GHz and a 77.6% CPU usage.


Thus the performance difference comes from a combination of a higher IPC in the LDM setup, but also actually it runs at 500MHz higher clock frequency. The total CPU usage is slightly higher in the LDM setup. Those factors combined account for the 30% difference.


So trying to understand those numbers one can see that the LDM setup has a higher hit rate in the L1 data cache and slightly higher hit rate in the L1 instruction cache. There is also a slightly higher branch prediction miss rate in the Receive setup. However nothing revolutionary is seen here. The L2 miss rate is very similar in both experiments. The other difference is that the Receive setup sees a 10x higher rate iTLB misses.


The only real difference between the two cases is that the Receive setup will see more code. Thus it is expected that it will have to pay for that with more instruction cache misses, but interestingly this also triggers changes in the CPU frequency.


Thus our conclusion here is that if one can achieve a setup where the thread uses a limited amount of code to achieve its purpose, the performance will be much higher.


In principle the conclusion I do is that if we can decrease the code size we touch in a thread, this will directly affect the performance.


Using some number crunching we see that executing one key-value lookup in the ldm thread uses about 6000-7000 x86  assembler instructions. Thus the payback of decreasing the code size is almost logarithmic. So reusing the same code gives great payback in throughput of your program.


To summarize executing about 10k instructions in a single thread costs almost 30% more than

splitting up the execution in 2 threads where one thread takes care of 4k instructions and the other takes care of 6k instructions.


It isn’t expected that this gain is independent of the code size. It is likely that when the code size fits well into the L1 instruction cache, that brings about a large performance enhancement. Thus if the code size of each of threads in the thread pipeline is too big for the L1 instruction cache, it is not expected that the gain of the thread pipeline is very large. So to optimize performance one needs to keep the code size aligned with the size of the instruction cache. This means that another important factor impacting the performance is the choice of compiler flags, this choice plays together with the software architecture.

Range Scans

The next test I singled out only included range scans from Sysbench. This had two effects, first the code executed was different, in the ldm/query thread this led to even higher IPC and even higher frequency of the CPU. The IPC increased to 2.25 and the frequency increased by 2%. However another effect was that a lot more of the processing moved to the ldm/query threads, thus the receive thread was much less involved. This had the effect that the Receive setup actually had higher throughput in this case.

Outcome of tests

What we have shown with those tests is that it is beneficial from an efficiency point of view to use a thread pipeline. It increases the IPC and surprisingly also increases the frequency the CPU core operates in. This can increase the throughput up to at least 30% and even more than that in some tests executed.


At the same time the workload can cause the load to be unbalanced, this can cause some CPUs to be a bit unused. This can cause up to 30-40% loss of performance in some workloads and no loss at all and even a small gain in other workloads.


Thus the outcome is that whether or not using a thread pipeline or not is dependent on the workload. We have concluded that thread pipelines definitely provide an advantage, but also that this advantage can be wiped out by imbalances in the load on the various threads in the pipeline.

Adaptive solution

So actually when I started this exercise I expected to find out which model would work best. I wasn’t really expecting to come up with a new model as the solution. However the results made it clear that thread pipelines should be used, but that one needs to find ways of using the free CPU cycles in the LDM setup.


Given the flexibility of the query threads we have an obvious candidate for this flexibility. It is possible to execute some queries directly from the receive threads instead of passing them on to the ldm/query threads. The introduction of query threads makes it possible to be more flexible in allocating the execution of those queries.


So the conclusion is that a combination of the Receive setup and the LDM setup is the winning concept. However this introduces a new technical challenge. It is necessary to implement an adaptive algorithm that decides when to execute queries locally in the receive thread and when to execute them using the ldm/query threads.


This is in line with the development of RonDB the last couple of years. The static solutions are rarely optimal, it is often necessary to introduce adaptive algorithms that use CPU load statistics and queue size statistics for the threads in the mix.


In a sense this could be viewed as a query optimizer for key-value databases. SQL queries are usually complex and require careful selection of the execution plan for a single query. In a key-value database millions of queries per second flow through the database and by adapting the execution model to the current workload we achieve query optimisation of large flows of small simple queries.


autobench.conf

#

# Software definition

#

USE_RONDB="no"

USE_BINARY_MYSQL_TARBALL="no"

TARBALL_DIR="/home/mikael/tarballs"

MYSQL_BIN_INSTALL_DIR="/home/mikael/mysql"

BENCHMARK_TO_RUN="sysbench"

MYSQL_VERSION="rondb-gpl-21.10.0"

DBT2_VERSION="dbt2-0.37.50.17"

SYSBENCH_VERSION="sysbench-0.4.12.17"

MYSQL_BASE="8.0"

TASKSET="numactl"

#

# Storage definition

#

DATA_DIR_BASE="/home/mikael/ndb"

#

# MySQL Server definition

#

SERVER_HOST="127.0.0.1"

SERVER_PORT="3316"

ENGINE="ndb"

MYSQL_USER="root"

#

# NDB node definitions

#

NDB_EXTRA_CONNECTIONS="32"

NDB_MULTI_CONNECTION="4"

NDB_TOTAL_MEMORY="25200M"

NDB_MGMD_NODES="127.0.0.1"

NDBD_NODES="127.0.0.1"

NDB_FRAGMENT_LOG_FILE_SIZE="1G"

NDB_NO_OF_FRAGMENT_LOG_FILES="16"

NDB_SPIN_METHOD="StaticSpinning"

NDB_MAX_NO_OF_CONCURRENT_OPERATIONS="262144"

NDB_AUTOMATIC_THREAD_CONFIG="no"

NDB_THREAD_CONFIG="ldm={count=1,cpubind=0},query={count=1,cpubind=1},tc={count=0,cpubind=2},recv={count=2,cpubind=32,33},send={count=0,cpubind=35},main={count=0,cpubind=36},io={count=1,cpubind=33}"

#NDB_THREAD_CONFIG="ldm={count=0,cpubind=0},query={count=0,cpubind=3},tc={count=0,cpubind=2},recv={count=4,cpubind=0,1,32,33},send={count=0,cpubind=35},main={count=0,cpubind=36},io={count=1,cpubind=33}"

#

# Benchmark definition

#

SYSBENCH_TEST="oltp_rw"

MAX_TIME="20000"

#SB_POINT_SELECTS="0"

SB_NUM_TABLES="1"

#SB_SUM_RANGES="0"

#SB_DISTINCT_RANGES="0"

#SB_ORDER_RANGES="0"

#SB_SIMPLE_RANGES="0"

#SB_USE_IN_STATEMENT="100"

THREAD_COUNTS_TO_RUN="48"

SERVER_BIND="0"

SERVER_CPUS="8-27,40-59"

BENCHMARK_BIND="0"

BENCHMARK_CPUS="28-31,60-63"