
Friday, August 13, 2021

How to achieve AlwaysOn

When discussing how to achieve High Availability, most DBMSs focus on handling it via replication. Most of the attention has therefore gone to various replication algorithms.

However, truly achieving AlwaysOn availability requires more than just a clever replication algorithm.

RonDB is based on NDB Cluster. NDB has proven in practice that it can deliver capabilities that make it possible to build systems with less than 30 seconds of downtime per year.

So what is required to achieve this type of availability?

  1. Replication
  2. Instant Failover
  3. Global Replication
  4. Failfast Software Architecture
  5. Modular Software Architecture
  6. Advanced Crash Analysis
  7. Managed software

Thus a clever replication algorithm is only one of seven important parts needed to achieve the highest possible level of availability. Managed software is one of the additions that RonDB makes on top of NDB Cluster; it won't be discussed in this blog.

Instant Failover means that the cluster must handle failover immediately. This is the reason why RonDB implements a Shared Nothing DBMS architecture. Other HA DBMSs such as Oracle, MySQL InnoDB Cluster and Galera Cluster rely on replaying the logs at failover to catch up; before this catch-up has happened, the failover isn't complete. In RonDB every updating transaction updates both data and logs as part of the transaction itself, thus at failover we only need to update the distribution information.

In a DBMS, updating information about node state must itself be a transaction. This transaction takes less than one millisecond to perform in a cluster. Thus in RonDB the failover time is dominated by the time it takes to discover that the node has failed. In most cases the reason for the failure is a software failure, which usually leads to dropped network connections that are discovered within microseconds. Thus most failovers are handled within milliseconds, after which the cluster is repaired and ready to handle all transactions again.
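
To make the idea concrete, here is a minimal sketch (in C++, not the actual RonDB code; DistributionInfo, PartitionReplicas and handleNodeFailure are made-up names for illustration) of why failover only needs to touch distribution metadata: every partition already has a fully up-to-date backup replica, so handling a node failure is a tiny metadata change rather than a log replay.

  #include <cstdint>
  #include <unordered_map>

  // Hypothetical illustration of "failover = update the distribution information".
  // Each partition has a primary and a backup replica that are both fully up to
  // date, so a node failure only requires promoting backups for the affected
  // partitions; no log replay is needed before serving traffic again.
  struct PartitionReplicas {
    uint32_t primaryNode;
    uint32_t backupNode;
  };

  class DistributionInfo {
   public:
    // Called once the surviving nodes have agreed that failedNode is gone.
    // In RonDB this change is itself a small transaction that completes in
    // well under a millisecond.
    void handleNodeFailure(uint32_t failedNode) {
      for (auto &entry : partitions_) {
        PartitionReplicas &replicas = entry.second;
        if (replicas.primaryNode == failedNode) {
          // The backup already holds all committed data and logs,
          // so it can take over as primary immediately.
          replicas.primaryNode = replicas.backupNode;
        }
      }
    }

   private:
    std::unordered_map<uint32_t, PartitionReplicas> partitions_;  // partition id -> replicas
  };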

The hardest failures to discover are the silent ones, for example when the power to a server is cut. In this case the discovery time depends on the configured heartbeat interval. How low this interval can be set depends on the operating system and on how reliably it can be expected to send heartbeat messages in a highly loaded system. Usually this time is a few seconds.

But even with replication and instant failover we still have to handle failures caused by power outages, thunderstorms and many other problems that can cause an entire cluster to fail. A DBMS cluster is usually located within a confined space to achieve low latency on database transactions.

To handle this we need to support failover from one RonDB cluster to another. This is achieved in RonDB by using asynchronous replication between clusters. The second RonDB cluster needs to be physically separated from the first to make their failures as independent as possible.

Having global replication in place also means that one can handle complex software changes, for example when your application does a massive rewrite of its data model.

Ok, are we done now? Is this sufficient to get a DBMS cluster which is AlwaysOn?

Nope, more is needed. After implementing these features you also need to be able to quickly find bugs and to support your customers when they hit issues.

The nice thing about this architecture is that a software failure will most of the time not cause anything more than a few aborted transactions, which the application layer should be able to handle.

However, in order to build an AlwaysOn architecture, one has to be able to quickly get rid of bugs as well.

When NDB Cluster joined MySQL, two different software architectures met. MySQL was a standalone DBMS; when it failed, the database was no longer available. Thus MySQL strove to avoid crashes, since a crash meant that the customer could no longer access their data.

With NDB Cluster the idea was that there would always be another node available to take over if a node fails. Thus NDB, and therefore also RonDB, implements a Failfast Software Architecture. In RonDB this is implemented using a macro called ndbrequire, similar to how most software uses assert. However, unlike assert, ndbrequire stays in the code when we run in production.

Thus every transaction performed in RonDB triggers thousands of error checks. If one of those ndbrequire checks returns false we immediately fail the node. Thus RonDB never proceeds when there is an indication that it has reached a disallowed state. This minimises the likelihood of a software failure leading to incorrect data.
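
As an illustration, a fail-fast check in the spirit of ndbrequire can be sketched roughly like this (a simplified sketch; my_require and commitOperation are hypothetical names, and the real ndbrequire also records extensive crash information before stopping the node):

  #include <cstdio>
  #include <cstdlib>

  // Simplified sketch of a fail-fast check in the spirit of ndbrequire.
  // Unlike assert(), it is never compiled out: a violated invariant always
  // stops the node immediately instead of letting it continue in a
  // potentially corrupt state, and another node takes over.
  #define my_require(cond)                                          \
    do {                                                            \
      if (!(cond)) {                                                \
        std::fprintf(stderr, "Invariant violated: %s at %s:%d\n",   \
                     #cond, __FILE__, __LINE__);                    \
        std::abort();                                               \
      }                                                             \
    } while (0)

  // Hypothetical usage: refuse to proceed from a disallowed state.
  void commitOperation(int operationState) {
    my_require(operationState == 1 /* PREPARED */);
    // ... perform the commit ...
  }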

However, crashing only solves the problem in the short term. To solve the problem for real we also have to fix the bug, and fixing bugs in a complex DBMS requires a modular software architecture. The RonDB software architecture is based on experiences from AXE, a telephone switch developed at Ericsson in the 1970s.

The predecessor of AXE at Ericsson was AKE, the first electronic switch developed at Ericsson. It was built as one big piece of code without clear boundaries between its parts. When this software reached millions of lines of code it became very hard to maintain.

Thus when AXE was developed in a joint project between Ericsson and Telia (a Swedish telecom operator), the engineers needed to find a new, more modular software architecture.

The engineers also had a lot of experience designing hardware. In hardware the only way for two integrated circuits to communicate is by sending signals on an electrical wire. Since this made it possible to design complex hardware with few failures, the engineers reasoned that the same architecture should work for software as well.

Thus the AXE software architecture used blocks instead of integrated circuits and signals instead of electrical signals. In modern software terminology these would most likely have been called modules and messages.

A block owns its own data and cannot peek at other blocks' data; the only way to communicate between blocks is by sending signals that carry messages from one block to another.

RonDB is designed like this, with 23 blocks that implement different parts of the RonDB software architecture. Blocks communicate mainly through signals and are implemented as large C++ classes.

This leads to a modular architecture that makes it easy to find bugs. If the state of a block is wrong, it can only have been caused by code in the block itself or by a signal sent to the block.

In RonDB signals can be sent between blocks in the same thread, to blocks in another thread in the same node, or to a thread in another node in the cluster.
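
A toy version of the block-and-signal pattern could look roughly like this (an illustration only; Block, Signal and BlockThread are made-up names, and the real RonDB blocks carry far more machinery):

  #include <cstdint>
  #include <deque>
  #include <vector>

  // Toy sketch of the block-and-signal pattern: a block owns its own data
  // and reacts only to signals; no block reads another block's state directly.
  struct Signal {
    uint32_t signalId;       // globally increasing id, used to trace flows
    uint32_t receiverBlock;  // which block should execute this signal
    uint32_t type;           // what the receiver should do
    std::vector<uint32_t> data;
  };

  class Block {
   public:
    explicit Block(uint32_t blockNo) : blockNo_(blockNo) {}
    virtual ~Block() = default;
    uint32_t blockNo() const { return blockNo_; }
    // The only way into a block: execute a signal addressed to it.
    virtual void executeSignal(const Signal &signal) = 0;
   private:
    uint32_t blockNo_;
  };

  // A thread owns a set of blocks and a signal queue; sending a signal to a
  // block in another thread (or node) would enqueue it there instead.
  class BlockThread {
   public:
    void addBlock(Block *block) { blocks_.push_back(block); }
    void sendSignal(const Signal &signal) { queue_.push_back(signal); }
    void runOnce() {
      while (!queue_.empty()) {
        Signal signal = queue_.front();
        queue_.pop_front();
        for (Block *block : blocks_) {
          if (block->blockNo() == signal.receiverBlock) {
            block->executeSignal(signal);
          }
        }
      }
    }
   private:
    std::vector<Block *> blocks_;
    std::deque<Signal> queue_;
  };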

In order to find a problem in the software we want access to a number of things. The most important is to discover the code path that led to the crash.

To provide this, the RonDB software contains a macro called jam (Jump Address Memory). It lets us track a few thousand of the last jumps taken before the crash, and the code is filled with these jam macros. This obviously adds overhead that makes RonDB a bit slower, but delivering the best availability is even more important than being fast.
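
Conceptually such a trace can be kept in a small per-thread ring buffer, roughly like the sketch below (simplified; my_jam and JamBuffer are hypothetical names, not the real jam implementation):

  #include <cstdint>

  // Simplified sketch of a jam-style trace: every executed my_jam() records the
  // source line in a fixed-size per-thread ring buffer, so the last few thousand
  // code points visited before a crash can be read out of the crash dump.
  constexpr uint32_t JAM_BUFFER_SIZE = 4096;

  struct JamBuffer {
    uint32_t lines[JAM_BUFFER_SIZE] = {};
    uint32_t nextIndex = 0;
  };

  thread_local JamBuffer g_jamBuffer;

  inline void jamLine(uint32_t line) {
    g_jamBuffer.lines[g_jamBuffer.nextIndex] = line;
    g_jamBuffer.nextIndex = (g_jamBuffer.nextIndex + 1) % JAM_BUFFER_SIZE;
  }

  // Sprinkled throughout the code, typically at the start of every branch.
  #define my_jam() jamLine(__LINE__)

  void exampleBranch(bool fastPath) {
    if (fastPath) {
      my_jam();
      // ... handle the fast path ...
    } else {
      my_jam();
      // ... handle the slow path ...
    }
  }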

Just watch Formula 1: the season winner will never be a car that fails every now and then; the car must be both fast and reliable. Thus in RonDB reliability has priority over speed, even though we mainly talk about the performance of RonDB.

But this isn't enough: jam only tracks jumps in the software and provides no information about which signals led to the crash, and that is also important. In RonDB each thread therefore tracks a few thousand of the last signals it executed before the crash. Each signal carries a signal id that makes it possible to follow signals also between threads within RonDB.

Let's take an example of how useful this information is. Recently we had an issue in the NDB forum where a user complained that he hadn't been able to produce any backups for the last couple of months, since one of the nodes in the cluster failed each time a backup was taken.

In the forum post the error log described the point in the code, together with a stack trace of the code executed while crashing. However, this information wasn't sufficient to find the software bug.

I asked for the trace information that includes both the jam entries and the signal logs of all the threads in the crashed node.

Using this information one could quickly discover how the fault occurred. It would only happen in high-load situations and required very tricky races, so most users never saw the failure. With the trace information, however, it was fairly straightforward to find what caused the issue, and based on this a workaround was found as well as a fix for the software bug. The user could again rely on being able to produce backups.

Tuesday, May 25, 2021

HA vs AlwaysOn

In the 1990s I spent a few years studying the requirements on databases used in 3G telecom networks. The main requirements were centered around three keywords: Latency, Throughput and Availability. In this blog post I will focus on Availability.


If a telecom database is down, no phone calls can be made, internet connections will not work and the apps on your smartphone will cease to work. It thus more or less immediately impacts everyone's life.


The same database requirements are now also starting to appear in AI applications such as online fraud detection, self-driving cars and smartphone apps.


Availability is measured in percent, and for telecom databases the requirement is to reach 99.9999% availability. This is often called Class 6 availability, where 6 is the number of nines in the availability percentage.


Almost every database today promises High Availability; this has led to inflation in the term. Most databases that promise HA today reach Class 4 availability, thus 99.99% availability. Class 4 availability means that your system is down and unavailable for transactions for around 50 minutes each year. For many this is sufficient, but imagine relying on such a system for self-driving cars, phone networks or modern AI applications used to make online decisions, e.g. in hospitals. Thus I sometimes use the term AlwaysOn to refer to higher availability, reaching Class 5, Class 6 and beyond.
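
The arithmetic behind these classes is simple, assuming a 365-day year of 525,600 minutes:

  \[ \text{downtime per year} = (1 - \text{availability}) \times 525\,600\ \text{minutes} \]
  \[ \text{Class 4: } (1 - 0.9999) \times 525\,600 \approx 53\ \text{minutes}, \qquad \text{Class 6: } (1 - 0.999999) \times 525\,600 \approx 32\ \text{seconds} \]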


To analyse Availability we need to look at the reasons for unavailability. There are the obvious ones like hardware and software failure, and the main solution for those is replication. However, as I will show below, most of today's replication solutions incur downtime on every failure. Thus it is necessary to analyse the replication solution in more detail to see what happens at a failure. Does the database immediately deliver the same latency and throughput after a failure as required by the application?


There are also many other outages that can happen in HA systems. We can have downtime from when a node goes down until the cluster has been reorganised, and downtime at software upgrades, hardware upgrades and schema reorganisations. Major upgrades involving many systems may cause significant downtime.


Thus to be called AlwaysOn a Database must have zero downtime solutions for at least the following events:

  1. Hardware Failure

  2. Software Failure

  3. Software Upgrade

  4. Hardware Upgrade

  5. Schema Reorganisation

  6. Online Scaling

  7. Major Upgrades

  8. Major disasters


Most of these problems require replication as the solution. However, the replication must be at two levels. There must be local replication that makes it possible to sustain the latency requirements of your application even in the presence of failures. In addition, there must be replication at a global level that makes it possible to survive major upgrades and major disasters.


Finally, handling schema reorganisations and scaling the system without downtime requires careful design of the algorithms in your database.


Today replication is used in almost every database. However, the replication scheme used in most databases leads to significant downtime for AlwaysOn applications at each failure and upgrade.


Handling a SW/HW failure or a SW/HW upgrade without any downtime requires that the replicas are ready to start serving requests within a millisecond of discovering the configuration change required by the failure or upgrade.


Most replication solutions today rely on eventual consistency; thus by design they are unavailable even in normal operation, and even more so when the primary replica fails. Many other replication solutions only ship the log records to the backup replica, so at failure the backup replica must replay the log before it can serve all requests. Again this means downtime for the application, which makes it impossible to meet the requirements of AlwaysOn applications.


To reach AlwaysOn availability it is necessary to actually perform all write operations on the backup replicas as well as on the primary replica. This ensures that the data is immediately available in the presence of failures or other configuration changes.
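
In rough pseudocode terms the difference between shipping logs and actually applying every write on the backups can be sketched like this (an illustration of the principle only, with made-up names; it is not RonDB's actual commit protocol):

  #include <vector>

  // Illustration of the principle only (made-up names, not RonDB's protocol):
  // what it takes for a backup replica to be immediately usable at failover.
  struct Operation { int key; int value; };

  struct Replica {
    // Applies the operation to the replica's data structures and its log in one
    // step, so the replica can serve requests the moment it becomes primary.
    void applyDataAndLog(const Operation &op) { /* ... */ }
    // Only appends to the log; the data must be rebuilt by replaying the log
    // before this replica can serve all requests after a failover.
    void appendLogOnly(const Operation &op) { /* ... */ }
    void replayLog() { /* ... a potentially long catch-up phase ... */ }
  };

  // AlwaysOn style: every replica applies the write before the commit is
  // acknowledged, so a failover needs no log replay.
  void commitApplyEverywhere(const Operation &op, std::vector<Replica> &replicas) {
    for (Replica &replica : replicas) {
      replica.applyDataAndLog(op);
    }
    // ... acknowledge the commit to the client ...
  }

  // Log-shipping style: the backups only hold logs, so at a primary failure a
  // backup must first run replayLog() before it can serve all requests.
  void commitShipLogs(const Operation &op, Replica &primary,
                      std::vector<Replica> &backups) {
    primary.applyDataAndLog(op);
    for (Replica &backup : backups) {
      backup.appendLogOnly(op);
    }
    // ... acknowledge the commit; failover now implies a log replay ...
  }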


This has consequences for latency; many databases take shortcuts when performing writes and don't ensure that the backup replicas are immediately available at failures.


Thus, obviously, if you want your application to reach AlwaysOn availability, you need to ensure that your database software delivers a solution that doesn't involve log replay at failure (thus all shared disk and eventual consistency solutions are immediately discarded).


These were the conclusions of my Ph.D. research in the 1990s. All those requirements were fed into the development of NDB Cluster. NDB Cluster became the base for MySQL Cluster (nowadays often referred to as MySQL NDB Cluster), and today Logical Clocks AB develops RonDB, a distribution of NDB Cluster.


To reach AlwaysOn it is necessary that your database software delivers solutions to all 8 problems listed above without any downtime (not even 10 seconds). In practice the limit on availability is the time it takes to discover a failure. For the most part this is immediate, since most failures are SW failures, but some HW failures are only detected by a heartbeat mechanism, which can lead to a few seconds of downtime.


Thus, to reach AlwaysOn availability your database software must handle failures with zero downtime, it must implement the algorithms for online schema management and online scaling, and it must support global replication to handle major upgrades and major disasters where entire regions are affected. Finally, you need an organisation that operates your systems in a reliable manner.


RonDB is designed to handle all these requirements on the database software, and this has been proven in production sites reaching Class 6 availability for more than 15 years. Logical Clocks is building the competence to operate RonDB for customers at these availability levels.