My colleague Zhao Song presented a walkthrough of the evolution of DBMSs and how it relates to Google Spanner, Aurora, PolarDB and MySQL NDB Cluster.
I had some interesting discussions with him on the topic, so it makes sense to return to the 1990s when I designed NDB Cluster and look at how the requirements for a Telco DBMS shaped its recovery algorithms.
A Telco DBMS is a DBMS that operates in a Telco environment. It is involved in every interaction between smartphones and the Telco network, such as call setup, location updates, SMS and Mobile Data. If the DBMS is down, there is no service for smartphones. Obviously there is no time of day or night when it is acceptable to be down, so even a few seconds of downtime must be avoided.
Thus in the design of NDB Cluster I had to take into account the following events:
- DBMS Software Upgrade
- Application Software Upgrade
- SW Failure in DBMS Node
- SW Failure in Application Service
- HW Failure in DBMS Node
- HW Failure in Application Service
- SW Failure in DBMS Cluster
- Region Failure
It was clear that the design had to be a distributed DBMS. In Telcos it was not uncommon to build HW-redundant solutions with a single node but redundant HW. However, this approach obviously has difficulties with SW failures, and it requires very specialised HW that costs hundreds of millions of dollars to develop. Today this solution is very rarely used.
One of the first design decisions was to choose between a disk-based DBMS and an in-memory DBMS. This was settled by the latency requirement: transactions involving tens of rows had to complete within around 10 milliseconds, which was not really possible with the hard drives of those days. Even today, with the introduction of SSDs and NVMe drives, there is still a latency penalty of at least 3x when using disk drives compared to an in-memory DBMS.
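A back-of-the-envelope calculation illustrates the point; the latencies below are assumed, rough figures for illustration, not measurements.

```cpp
#include <cstdio>

// Rough latency budget, using illustrative numbers only: a 1990s hard drive
// needed roughly 10 ms per random read (seek + rotation), while an in-memory
// row lookup is on the order of a few microseconds.
int main() {
  const double budget_ms      = 10.0;   // Telco requirement per transaction
  const int    rows_touched   = 20;     // "tens of rows"
  const double hdd_read_ms    = 10.0;   // assumed random I/O latency, 1990s HDD
  const double memory_read_ms = 0.005;  // assumed in-memory row access (5 us)

  double disk_based = rows_touched * hdd_read_ms;     // ~200 ms, 20x over budget
  double in_memory  = rows_touched * memory_read_ms;  // ~0.1 ms, well within budget

  std::printf("budget: %.1f ms, disk-based: %.1f ms, in-memory: %.2f ms\n",
              budget_ms, disk_based, in_memory);
  return 0;
}
```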
If we play with the thought of using a Shared Disk DBMS on modern HW, we still have a problem. The Shared Disk approach requires a storage solution that is itself a Shared Nothing solution. In addition, a Shared Disk DBMS commits by writing the REDO log to the Shared Disk. This means that at recovery we need to replay part of the REDO log before a node can take over from a failed node. Since the latest state of some disk pages is only available in the Shared Disk, we cannot serve any transactions on these pages until we have replayed the REDO log. This used to take around 30 seconds; it is shorter now, but still not good enough for the Telco requirements.
Thus we settled on a Shared Nothing DBMS solution using in-memory tables. The next problem is how to handle replication in a Shared Nothing architecture. Replication sends REDO log records, or something similar, to the backup replicas. Now one has a choice: either apply the REDO log records immediately, or only write them to the REDO log and apply them later.
Applying them later means that we will suffer downtime if the backup replica is forced to take over as primary replica. Thus we have to apply the REDO log records immediately as part of transaction execution. This means we are able to take over within milliseconds after a node failure.
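Below is a minimal sketch of the eager-apply idea, assuming a primary that pushes a redo record to each backup and only commits once the backups have applied it. The names (RedoRecord, Replica, apply_redo) are illustrative, not NDB internals.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical redo record: not the NDB wire format, just the idea that the
// same change is applied on the backup replica as part of the transaction.
struct RedoRecord {
  uint64_t    transaction_id;
  std::string key;
  std::string new_value;
};

class Replica {
 public:
  // Eager apply: the backup updates its in-memory table immediately, so it can
  // take over as primary within milliseconds without replaying any log.
  void apply_redo(const RedoRecord& rec) {
    table_[rec.key] = rec.new_value;
    redo_log_.push_back(rec);  // still logged, for cluster recovery
  }
 private:
  std::unordered_map<std::string, std::string> table_;
  std::vector<RedoRecord> redo_log_;
};

// Primary-side commit path (sketch): the transaction only commits after every
// backup replica has applied, not merely logged, the change.
bool commit(const RedoRecord& rec, std::vector<Replica*>& backups) {
  for (Replica* backup : backups) {
    backup->apply_redo(rec);  // in reality an asynchronous network round trip
  }
  return true;                // all replicas now hold the committed state
}
```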
Failures can happen in two ways. Most SW failures are discovered by the other nodes in the cluster immediately, so such node failures are detected very quickly. HW failures in particular, however, can lead to silent failures; here one needs some sort of I-am-alive protocol (heartbeat in NDB). The discovery time in this case depends on the real-time properties of both the operating system and the DBMS.
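A minimal sketch of such a failure detector follows; the heartbeat interval and the number of allowed misses are illustrative values, not NDB's actual configuration.

```cpp
#include <chrono>
#include <cstdio>

using Clock = std::chrono::steady_clock;

// Minimal failure detector: a node is suspected dead after missing a number of
// heartbeat intervals. Discovery time is roughly interval * allowed_misses,
// which is why the real-time behaviour of the OS and the DBMS bounds how small
// the interval can safely be made.
class HeartbeatMonitor {
 public:
  HeartbeatMonitor(std::chrono::milliseconds interval, int allowed_misses)
      : interval_(interval), allowed_misses_(allowed_misses),
        last_seen_(Clock::now()) {}

  // Called whenever an I-am-alive message arrives from the monitored node.
  void on_heartbeat() { last_seen_ = Clock::now(); }

  // Called periodically; returns true once the node should be treated as failed.
  bool is_failed() const {
    auto silence = Clock::now() - last_seen_;
    return silence > interval_ * allowed_misses_;
  }

 private:
  std::chrono::milliseconds interval_;
  int allowed_misses_;
  Clock::time_point last_seen_;
};

int main() {
  // Illustrative settings: 100 ms heartbeats, failure declared after 4 misses,
  // so a silent HW failure is discovered in roughly 400 ms.
  HeartbeatMonitor monitor(std::chrono::milliseconds(100), 4);
  std::printf("failed yet? %s\n", monitor.is_failed() ? "yes" : "no");
  return 0;
}
```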
Now transaction execution can be done either using a replication protocol such as Paxos, where a global order of transactions is maintained, or through a non-blocking 2PC protocol. Both must handle failure of the coordinator through a leader-selection algorithm and take care of the ongoing transactions affected by it.
The benefit of non-blocking 2PC is that it can handle millions of concurrent transactions, since the coordinator role can be taken by any node in the cluster; there is no central point limiting transaction throughput. To be non-blocking, the 2PC protocol must handle failed coordinators by finishing their ongoing transactions using a take-over protocol. To handle cluster recovery, an epoch transaction regularly creates consistent recovery points. This epoch transaction can also be used to replicate to other regions, even supporting Active-Active replication using various Conflict Detection Algorithms.
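The following is a very reduced sketch of the take-over idea, assuming a new coordinator that collects the participants' transaction states and decides the outcome from them; it is not the actual NDB protocol, and none of the names correspond to real NDB data structures.

```cpp
#include <cstdint>
#include <vector>

// Sketch of why the protocol is non-blocking: participants keep enough
// per-transaction state that a surviving node can finish (commit or abort)
// transactions whose coordinator failed, instead of blocking until it returns.
enum class TxState { Active, Prepared, Committed, Aborted };

struct ParticipantView {
  uint32_t node_id;  // participant that reported this state
  TxState  state;    // its local state for the transaction being taken over
};

// The node taking over as coordinator collects the participants' states and
// decides the outcome: if any participant already committed, the decision was
// commit; if all are prepared, it is safe to commit; otherwise abort.
TxState decide_outcome(const std::vector<ParticipantView>& views) {
  bool all_prepared = true;
  for (const ParticipantView& v : views) {
    if (v.state == TxState::Committed) return TxState::Committed;
    if (v.state != TxState::Prepared)  all_prepared = false;
  }
  return all_prepared ? TxState::Committed : TxState::Aborted;
}
```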
So the conclusion of how to design a DBMS for Telco requirements is:
- Use an in-memory DBMS
- Use a Shared Nothing DBMS
- Apply the changes on both the primary replica and the backup replica as part of the transaction
- Use non-blocking 2PC for higher concurrency of write transactions
- Implement Heartbeat protocol to discover silent failures in both APIs and DBMS nodes
- Implement Take-over protocols for each distributed protocol, especially Leader-Selection
- Implement Software Upgrade mechanisms in both APIs and DBMS nodes
- Implement Failure Handling of APIs and DBMS nodes
- Support Online Schema Changes
- Support Regional Replication
The above implementation makes it possible to run a DBMS with Class 6 availability (less than 30 seconds of downtime per year). This means that all SW, HW and regional failures, including the catastrophic ones, are accounted for within these 30 seconds per year.
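As a quick sanity check of the Class 6 figure: six nines of availability means an unavailable fraction of 10^-6 of the year, which is on the order of 30 seconds.

```cpp
#include <cstdio>

// Class 6 (six nines, 99.9999%) availability expressed as downtime per year:
// the unavailable fraction is 1e-6 of the whole year.
int main() {
  const double seconds_per_year = 365.25 * 24 * 3600;  // ~31.6 million seconds
  const double downtime = seconds_per_year * 1e-6;     // on the order of 30 s/year
  std::printf("allowed downtime: %.1f seconds/year\n", downtime);
  return 0;
}
```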
MySQL NDB Cluster has been used at this level for more than 20 years and continues to serve billions of people with a highly available service.
At Hopsworks, MySQL NDB Cluster was selected as the platform for building a highly available real-time AI platform. To make MySQL NDB Cluster accessible for anyone to use, we forked it and call it RonDB. RonDB has brought many improvements in ease of use and scalable reads, and a managed service that makes it easy to install and manage RonDB. We have also added a set of new interfaces: a REST server that handles batches of lookups, both generic database lookups and feature store lookups; RonSQL, which handles optimised aggregation queries that are very common in AI applications; and finally an experimental Redis interface called Rondis.
Check out rondb.com for more information and to try it out. If you want to read the latest scalability benchmark, go directly here. If you want a walkthrough of the challenges in running a highly scalable benchmark, you can find it here.
Happy reading!