Mikael Ronstrom: HA vs AlwaysOn

In the 1990s I spent a few years studying requirements on databases used in 3G telecom networks. The main requirement was centered around three keywords, Latency, Throughput and Availability. In this blog post I will focus on Availability.

If a telecom database is down it means that no phone calls can be made, internet connections will not work and your app on your smartphone will cease to work. So more or less impacting each and everyone's life immediately.

The same requirements on databases now also start to appear in AI applications such as online Fraud detection, self-driving cars, smartphone apps.

Availability is measured in percent and for telecom databases the requirement is to reach 99.9999% availability. One often calls this Class 6 availability where 6 is the number of nines in the availability percentage.

Almost every database today promises High Availability, this has led to inflation in the term. Most databases that promise HA today reach Class 4 availability, thus 99.99% availability. Class 4 availability means that 50 minutes each year your system is down and unavailable for transactions. For many this is sufficient, but imagine relying on such a system for self-driving cars, or phone networks or modern AI applications used to make online decisions for e.g. hospitals. Thus I sometimes use the term AlwaysOn to refer to a higher availability reaching Class 5 and Class 6 and beyond.

To analyse Availability we need to look at the reasons for unavailability. There are the obvious ones like hardware and software failure. The main solution for those is to use replication. However as I will show below, most replication solutions of today incur downtime on every failure. Thus it is necessary to analyse the replication solution in more detail to see what happens at a failure. Does the database immediately deliver the same latency and throughput after a failure as required by the application?

There are also a lot of planned outages that can happen in many HA systems. We could have downtime when a node goes down before the cluster has been reorganised again, we can have downtime at Software upgrades, at Hardware upgrades, at schema reorganisations. Major upgrades involving lots of systems may cause significant downtime.

Thus to be called AlwaysOn a Database must have zero downtime solutions for at least the following events:

Hardware Failure
Software Failure
Software Upgrade
Hardware Upgrade
Schema Reorganisation
Online Scaling
Major Upgrades
Major disasters

Most of these problems require replication as the solution. However the replication must be at 2 levels. There must be a local replication that makes it possible to sustain the Latency requirements of your application even in the presence of failures. In addition there is a need to handle replication on a global level that makes it possible to survive major upgrades and major disasters.

Finally, to handle schema reorganisations and scaling the system without downtime requires careful design of algorithms in your database.

Today replication is used in almost every database. However the replication scheme in most databases leads to significant downtime for AlwaysOn applications at each failure and upgrade.

To handle a SW/HW failure and SW/HW upgrade without any downtime requires that the replicas are ready to start serving requests within one millisecond after discovering the configuration change required by the failure/upgrade.

Most replication solutions today rely on eventual consistency, thus they are by design unavailable even in normal operation and even more so when the primary replica fails. Many other replication solutions only ship the log records to the backup replica, thus at failure the backup replica must execute the log before it can serve all requests. Again downtime for the application which makes it impossible to meet the requirements of AlwaysOn applications.

To reach AlwaysOn availability it is necessary to actually perform all write operations on the backup replicas as well as on the primary replica. This ensures that the data is immediately available in the presence of failures or other configuration changes.

This has consequences for the latency, many databases take shortcuts when performing writes and doesn’t ensure that the backup replicas are immediately available at failures.

Thus obviously if you want your application to reach AlwaysOn availability, you need to ensure that your database software delivers a solution that doesn't involve log replay at failure (thus all shared disk solutions and eventual consistency solutions are immediately discarded).

As part of my Ph.D research in the 1990s I concluded those things. All those requirements was fed into the development of NDB Cluster, NDB Cluster was the base for MySQL Cluster (nowadays often referred to as MySQL NDB Cluster) and today Logical Clocks AB develops RonDB a distribution of NDB Cluster.

To reach AlwaysOn it is necessary that your database software can deliver solutions to all 8 problems listed above without any downtime (not even 10 seconds). Actually the limit on the availability is the time it takes to discover a failure, for the most part this is immediate since most failures are SW failures, but for some HW failures it is based on a heartbeat mechanism that can lead to a few seconds of downtime.

Thus to reach AlwaysOn availability it is necessary that your database software can handle failures with zero downtime, it is necessary that your database software can handle the algorithms for online schema management and online scaling of your database, it is necessary that your database software supports global replication to also be able to handle major upgrades and major disasters where entire regions are affected. Finally you need to have an organisation that operates your systems in a reliable manner.

RonDB is designed to handle all the requirements on the database software which have been proven in production sites reaching Class 6 availability for more than 15 years. Logical Clocks is building the competence to operate RonDB for customers at these availability levels.

Mikael Ronstrom

Tuesday, May 25, 2021

HA vs AlwaysOn

No comments:

Post a Comment