Friday, February 21, 2020

Original NDB Cluster Requirements

NDB Cluster was originally developed for Network DataBases in the telecom
network. I worked in an EU project between 1991 and 1995 that focused on a
pre-standardisation effort on UMTS, which was later standardised under the
term 3G. I worked in the part of the project that focused on simulating the
network traffic in such a 3G network, and I paid particular attention to the
requirements this traffic created on a network database in the telecom
network.

During the same time period I also dived deeply into the research literature
on DBMS implementation.

The following requirements from the 3G studies emerged as the most important:

1) Class 5 Availability (99.999% uptime, roughly less than 5 minutes of unavailability per year)
2) High Write Scalability as well as High Read Scalability
3) Predictable latency down to milliseconds
4) Efficient API
5) Failover in crash scenarios within seconds or even subseconds with a real-time OS

In another blog on the influences that led to the use of an asynchronous
programming model in NDB Cluster I derive the following requirements on the
software architecture:

1) Fail-fast architecture (implemented through the ndbrequire macro in NDB)
2) Asynchronous programming (provides a lot of tracing information in crashes)
3) Highly modular SW architecture
4) JAM macros to track the SW execution path in crash events (a small sketch of these mechanisms follows below)
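
To make the fail-fast and JAM tracing ideas a bit more concrete, here is a
minimal C++ sketch of how such mechanisms can look. The names (ndbrequire_sketch,
jamBuffer) and the ring buffer layout are a simplification for this blog, not
the actual NDB source code.

  #include <cstdio>
  #include <cstdlib>
  #include <cstdint>

  // Simplified JAM-style trace: a small ring buffer recording the source line
  // of every traced point, so a crash dump shows the recently executed code path.
  static const uint32_t JAM_SIZE = 1024;
  static uint32_t jamBuffer[JAM_SIZE];
  static uint32_t jamIndex = 0;

  #define jam() (jamBuffer[jamIndex++ & (JAM_SIZE - 1)] = (uint32_t)__LINE__)

  // Simplified fail-fast check: if the invariant does not hold, dump the trace
  // and abort immediately instead of continuing in an inconsistent state.
  #define ndbrequire_sketch(cond)                                    \
    do {                                                             \
      if (!(cond)) {                                                 \
        fprintf(stderr, "Invariant failed at %s:%d\n",               \
                __FILE__, __LINE__);                                 \
        for (uint32_t i = 0; i < JAM_SIZE; i++)                      \
          if (jamBuffer[i] != 0)                                     \
            fprintf(stderr, "jam: line %u\n", jamBuffer[i]);         \
        abort();                                                     \
      }                                                              \
    } while (0)

  int main() {
    jam();
    int liveReplicas = 2;
    jam();
    ndbrequire_sketch(liveReplicas > 0);  // holds, execution continues
    ndbrequire_sketch(liveReplicas > 4);  // fails fast and dumps the jam trace
    return 0;
  }

The real macros are of course tied into the signal execution engine and the
crash dump handling, but the principle is the same: never continue on a broken
invariant, and always know the code path that led there.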

In another blog I present the influences that led to NDB Cluster using a
shared-nothing model.

One important requirement that NDB Cluster is fairly unique in addressing is high
write scalability. Most DBMSs solve this by grouping large numbers of small
transactions together (group commit) to make commits more efficient, since most
DBMSs have a very high cost of committing a transaction.

Modern replication protocols have actually made this even worse. As an example,
in most modern replication protocols all transactions have to commit in a serial
fashion. This makes commit handling a major bottleneck in many modern DBMSs and
often limits their transaction rates to tens of thousands of commits per second.

NDB Cluster went another path and essentially commits every single row change
separately from any other row change. Thus the cost of executing 1000 transactions
with 1000 operations per transaction is exactly the same as the cost of executing
1 million single-row transactions.

To achieve the grouping we used the fact that we are working in an asynchronous
environment. Thus we used several levels of piggybacking of messages. One of the
most important points here is that a single socket is used to transport many
thousands of simultaneous database transactions. With NDB Cluster 8.0.20 we use
multiple sockets between data nodes, and this scales another 10-20x, ensuring
that HW limitations are the bottleneck and not the NDB software.
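
As an illustration of the piggybacking idea, here is a small simplified C++
sketch (invented names, not the actual NDB transporter code) where signals
belonging to many different transactions are appended to one shared send buffer
and flushed with a single write call, so the cost of the socket is amortised
over all concurrent transactions.

  #include <cstdint>
  #include <vector>
  #include <unistd.h>

  // Hypothetical signal header: which transaction and signal type the payload
  // belongs to. The real transporter protocol is more elaborate.
  struct SignalHeader {
    uint32_t transactionId;
    uint32_t signalType;
    uint32_t payloadWords;
  };

  class SendBuffer {
  public:
    // Append one signal; signals from many transactions share the same buffer.
    void append(const SignalHeader& hdr, const uint32_t* payload) {
      const uint8_t* h = reinterpret_cast<const uint8_t*>(&hdr);
      buf.insert(buf.end(), h, h + sizeof(hdr));
      const uint8_t* p = reinterpret_cast<const uint8_t*>(payload);
      buf.insert(buf.end(), p, p + hdr.payloadWords * sizeof(uint32_t));
    }

    // Flush everything accumulated so far as one write on the socket,
    // piggybacking all pending signals on a single system call.
    ssize_t flush(int socketFd) {
      ssize_t sent = ::write(socketFd, buf.data(), buf.size());
      buf.clear();
      return sent;
    }

  private:
    std::vector<uint8_t> buf;
  };

  int main() {
    SendBuffer sendBuffer;
    uint32_t payload[2] = {42, 7};
    // Signals from two different transactions piggybacked in one buffer.
    sendBuffer.append({1001, 10, 2}, payload);
    sendBuffer.append({2002, 10, 2}, payload);
    sendBuffer.flush(STDOUT_FILENO);  // stands in for the data node socket
    return 0;
  }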

The asynchronous programming model ensures that we can handle thousands of
operations each millisecond and that switching from working on one transaction to
another is a matter of tens to hundreds of nanoseconds. In addition we can handle
these transactions independently in a number of different data nodes and even
within different threads within the same data node. Thus we can handle tens of
millions of transactions per second even within a single data node.
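
One way to picture the execution model is that each thread runs a loop that pops
the next signal from an in-memory queue and executes it against the transaction
record it addresses, so moving from one transaction to another is just a lookup
of another small in-memory record. The following is a much simplified sketch
with invented names, not the actual NDB scheduler.

  #include <cstdint>
  #include <cstdio>
  #include <queue>
  #include <unordered_map>

  // Minimal transaction state, kept entirely in memory.
  struct TransactionRecord {
    uint32_t pendingOperations = 0;
    bool committed = false;
  };

  // A signal addresses one transaction and asks for one small unit of work.
  struct Signal {
    enum Type { ADD_OPERATION, OPERATION_DONE, COMMIT } type;
    uint32_t transactionId;
  };

  // One such loop runs per thread; signals for thousands of transactions are
  // interleaved, and "switching" transaction is just a hash table lookup.
  void runSignals(std::queue<Signal>& signals,
                  std::unordered_map<uint32_t, TransactionRecord>& transactions) {
    while (!signals.empty()) {
      Signal sig = signals.front();
      signals.pop();
      TransactionRecord& trans = transactions[sig.transactionId];
      switch (sig.type) {
        case Signal::ADD_OPERATION:  trans.pendingOperations++; break;
        case Signal::OPERATION_DONE: trans.pendingOperations--; break;
        case Signal::COMMIT:
          if (trans.pendingOperations == 0) trans.committed = true;
          break;
      }
    }
  }

  int main() {
    std::queue<Signal> signals;
    std::unordered_map<uint32_t, TransactionRecord> transactions;
    // Signals from two independent transactions interleaved in one queue.
    signals.push({Signal::ADD_OPERATION, 1});
    signals.push({Signal::ADD_OPERATION, 2});
    signals.push({Signal::OPERATION_DONE, 1});
    signals.push({Signal::COMMIT, 1});
    signals.push({Signal::OPERATION_DONE, 2});
    signals.push({Signal::COMMIT, 2});
    runSignals(signals, transactions);
    printf("transaction 1 committed: %d, transaction 2 committed: %d\n",
           transactions[1].committed, transactions[2].committed);
    return 0;
  }

Since all the state a signal needs lives in ordinary memory structures, there is
no context switch involved in moving between transactions, which is what keeps
the switch down in the nanosecond range.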

The protocol we used for this is a variant of the two-phase commit protocol with
some optimisations based on the linear two-phase commit protocol. However, the
requirement of Class 5 Availability meant that we had to solve the blocking
problem of the two-phase commit protocol. We solved this by recreating the state
of failed transaction coordinators in a surviving node as part of failover
handling. This means that we will never be blocked by a failure as long as there
are still enough nodes to keep the cluster operational.
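
A rough sketch of the takeover idea (again simplified, with invented names; the
real protocol has more states and more messages): when a transaction coordinator
fails, a surviving node asks every participant which in-flight transactions it
holds and in which state, decides an outcome for each transaction and drives all
participants to that outcome, so no participant remains blocked.

  #include <cstdint>
  #include <cstdio>
  #include <map>
  #include <vector>

  // State a participant keeps for its part of an in-flight transaction.
  enum class OpState { PREPARED, COMMITTED, ABORTED };

  struct Participant {
    int nodeId;
    std::map<uint32_t, OpState> transactions;  // transactionId -> local state
  };

  // Coordinator takeover: collect the surviving participants' states and
  // complete every transaction. Simplified rule: if any participant already
  // committed, the decision was commit, otherwise the transaction is aborted.
  void takeOver(std::vector<Participant>& participants) {
    std::map<uint32_t, bool> decision;  // transactionId -> commit?
    for (const Participant& p : participants) {
      for (const auto& entry : p.transactions) {
        if (entry.second == OpState::COMMITTED) {
          decision[entry.first] = true;
        } else {
          decision.emplace(entry.first, false);
        }
      }
    }
    // Drive every participant to the decided outcome so nobody stays blocked.
    for (Participant& p : participants) {
      for (auto& entry : p.transactions) {
        entry.second = decision[entry.first] ? OpState::COMMITTED
                                             : OpState::ABORTED;
      }
    }
    for (const auto& d : decision) {
      printf("transaction %u -> %s\n", d.first, d.second ? "commit" : "abort");
    }
  }

  int main() {
    // Two surviving participants; transaction 1 had already started committing
    // when the coordinator failed, transaction 2 was only prepared.
    std::vector<Participant> participants = {
      {2, {{1, OpState::COMMITTED}, {2, OpState::PREPARED}}},
      {3, {{1, OpState::PREPARED},  {2, OpState::PREPARED}}}
    };
    takeOver(participants);
    return 0;
  }

The key point is that the information needed to finish a transaction is never
held only by its coordinator, which is what removes the blocking behaviour of
classic two-phase commit.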
