Tuesday, February 25, 2020

Use Cases for MySQL NDB Cluster 8.0

In this blog I will go through a number of popular applications that use
NDB Cluster 8.0 and also describe how these applications have developed
over the years.

There is a presentation at slideshare.net accompanying this blog.

The first major NDB development project was to build a prototype of a
number portability application together with a Swedish telecom provider.
The aim of this prototype was to show how one could build advanced
telecom applications and manage them through standardised interfaces.
The challenge here was that telecom applications have stringent
requirements, both on uptime and on latency. If the database access
took too long, the telecom application would suffer from
abandoned calls and other problems. Obviously the uptime of the
database part had to be at least as high as the uptime of the
telecom switches. Actually even higher, since a single telecom database
is typically used by many telecom switches.

In the prototype setup NDB Cluster was running on 2 SPARC computers
that were interconnected using a low latency SCI interconnect from
Dolphin. The SPARC computers were also connected through Ethernet to
the AXE switch, reaching the central computer in the AXE switch
through a regional processor. This demo was developed
in 1997 and 1998 and concluded with a successful demonstration.

In 1999 a new development project started up within a startup arm
of Ericsson. At that time the financial market was very hot, and
instant access to stock quotes was seen as a major business benefit
(it still is).

We worked together with a Swedish financial company to develop an
application that had two interfaces towards
NDB Cluster. One was the feed from the stock exchange, where stock
orders were fed into the database. This required low latency writes into
NDB Cluster and also very high update rates.

The second interface provided real-time stock quotes to users and
other financial applications. This version was a single-node
database service.

We delivered a prototype of this service that worked like a charm in
2001. At this point, however, the stock markets plunged, and the financial
markets were no longer the right place for a first application for NDB
Cluster.

Thus we refocused the development of NDB Cluster back towards the
telecom market. This meant that we focused heavily on completing the work
on handling node failures of all sorts. We developed test programs that
ran thousands of node failure scenarios every day. We worked with a
number of prospective customers in 2002 and 2003 and developed a number
of new versions of NDB Cluster.

The first customer that adopted NDB Cluster in a production environment
was Bredbandsbolaget. We worked together with them in 2003 and 2004 and
assisted them in developing their applications. Bredbandsbolaget was and
is an internet service provider, so the applications they used NDB
Cluster for were things like a DNS service, a DHCP service and so forth.

We worked closely with them, we even had offices in the same building and
on the same floor, so we interacted on a daily basis. This meant that the
application and NDB Cluster were developed together and were a perfect
fit for each other. This application is still operational and has been
so since 2004. I even had Bredbandsbolaget as my own internet service
provider for 10 years. So I was not only developing NDB Cluster, I was
also one of its first users.

In 2003 NDB Cluster development was acquired by MySQL and we changed the
name to MySQL Cluster. Nowadays there are other clustering products within
the MySQL area, so to distinguish NDB Cluster I sometimes use
MySQL NDB Cluster and sometimes simply NDB Cluster. However the product
name is still MySQL Cluster.

After Bredbandsbolaget we got a number of new large customers in the telecom
area. Many of those telecom customers have used LDAP as the application
protocol to access their data, since there was some standardisation in the
telecom sector around LDAP. To support this there is a storage backend
that makes it possible to access NDB Cluster from OpenLDAP. One example of
such a telecom application is the Juniper SBR Carrier system, which combines
SQL access, LDAP access, HTTP access and RADIUS access towards NDB Cluster.
NDB is used in this application as a Session State database.

All sorts of telecom applications remain a very important use case for
NDB Cluster. One interesting area of development in the telecom space is
5G and IoT, which will expand the application space for telecom substantially,
reaching into self-driving cars, smart cities and many more
interesting applications that require ultra high availability coupled with
high write scalability and predictable low latency access.

Coming back to financial applications, these remain an important use case
for NDB Cluster. High write scalability, ultra high availability and
predictable low latency access to data are again the keywords that
drive the choice of NDB Cluster in this application area.

The financial markets also add one more dimension to the NDB use cases.
Given that NDB can handle large amounts of payments, payment checks,
white lists, black lists and so forth, it is also possible to use
the data in NDB Cluster for real-time analysis.

Thus NDB Cluster 8.0 has also focused significantly on delivering more
capabilities in the area of complex queries. We have seen
many substantial improvements in this area.

More and more of our users work with standard SQL interfaces towards
NDB Cluster, and we have worked very hard to ensure that these provide
low latency access patterns. All the traditional interfaces towards
MySQL also work with NDB Cluster. Thus NDB can be accessed from
any programming language that can access MySQL.
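
As a minimal sketch of this, the following Java snippet accesses NDB
through the ordinary MySQL JDBC driver (mysql-connector-j assumed on the
classpath). Host, credentials and the stock_quote table are hypothetical
placeholders; the only NDB-specific part is the ENGINE=NDBCLUSTER clause.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class NdbOverSql {
    public static void main(String[] args) throws Exception {
        // Connect to any MySQL Server attached to the cluster; host,
        // port, schema and credentials are placeholders.
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:mysql://localhost:3306/test", "user", "password");
             Statement stmt = conn.createStatement()) {
            // The table is distributed and replicated by NDB automatically;
            // the application only asks for the NDBCLUSTER storage engine.
            stmt.executeUpdate("CREATE TABLE IF NOT EXISTS stock_quote (" +
                               "  symbol VARCHAR(8) PRIMARY KEY," +
                               "  price DOUBLE) ENGINE=NDBCLUSTER");
            stmt.executeUpdate("REPLACE INTO stock_quote VALUES ('MYSQ', 80.00)");
            try (ResultSet rs = stmt.executeQuery(
                     "SELECT price FROM stock_quote WHERE symbol = 'MYSQ'")) {
                while (rs.next()) {
                    System.out.println("price = " + rs.getDouble(1));
                }
            }
        }
    }
}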

However, many financial applications are written in Java. From Java
we have an NDB API called ClusterJ. This API uses a Data Object Model
that makes it very easy to use. In many ways it can be easier to
work with ClusterJ than with SQL in object-oriented
applications.
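
To illustrate the Data Object Model, here is a minimal ClusterJ sketch.
The connect string and the stock_quote table are hypothetical placeholders;
the annotated interface and the Session calls are the standard ClusterJ way
of mapping a domain object to an NDB table.

import java.util.Properties;
import com.mysql.clusterj.ClusterJHelper;
import com.mysql.clusterj.Session;
import com.mysql.clusterj.SessionFactory;
import com.mysql.clusterj.annotation.PersistenceCapable;
import com.mysql.clusterj.annotation.PrimaryKey;

public class ClusterJQuoteDemo {
    // The domain object is a plain annotated interface that maps to an
    // existing NDB table (here the hypothetical stock_quote table).
    @PersistenceCapable(table = "stock_quote")
    public interface StockQuote {
        @PrimaryKey
        String getSymbol();
        void setSymbol(String symbol);
        double getPrice();
        void setPrice(double price);
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder connect string of the NDB management server.
        props.put("com.mysql.clusterj.connectstring", "localhost:1186");
        props.put("com.mysql.clusterj.database", "test");
        SessionFactory factory = ClusterJHelper.getSessionFactory(props);
        Session session = factory.getSession();

        // Insert and read back a row without writing any SQL.
        StockQuote quote = session.newInstance(StockQuote.class);
        quote.setSymbol("MYSQ");
        quote.setPrice(80.00);
        session.persist(quote);

        StockQuote found = session.find(StockQuote.class, "MYSQ");
        System.out.println(found.getSymbol() + " = " + found.getPrice());
        session.close();
    }
}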

The next application category that recognized that NDB Cluster had a
very good fit for them was the Computer Gaming industry. There are a
number of applications within Computer Gaming where NDB Cluster has
a good fit. User profile management is one area, where it is important
to always be up and running such that users can join and leave the
games at any time. Game state is another area that requires very high
write scalability. Most of these applications use the SQL interface,
and many use fairly complex SQL queries and thus benefit
greatly from our improvements to parallel queries in NDB Cluster 8.0.

An interesting application that was developed at the SICS research
institute in Stockholm is HopsFS. This implements a file system in the
Hadoop world, based on Hadoop HDFS, with the file system metadata stored
in NDB Cluster. It scales to millions of file operations per second.

This means that NDB Cluster 8.0 is already used in many important
AI applications as the platform for a distributed file system.

In NDB Cluster 8.0 we have improved things such that write scalability is
even higher, also when the writes are large in volume. NDB Cluster 8.0
scales to updates measured in GBytes per second even in a 2-node
cluster, and in a larger cluster one can reach almost hundreds of
GBytes per second.

Thus NDB Cluster 8.0 is a very efficient tool to implement modern
key-value stores, distributed file systems and other highly
scalable applications.

NDB Cluster 8.0 is also a perfect tool to use in building many of the
modern applications that form the base of the cloud. This is
one more active area of development for MySQL NDB Cluster.

Obviously all sorts of Web applications are also a good fit for NDB
Cluster. This is particularly true with the developments in NDB Cluster
7.6 and 8.0, where we improved the latency of simple queries and
implemented a shared memory transporter that makes it very
efficient to set up small clusters with low latency access to all data.

For web applications we also have a NodeJS API that can access
NDB Cluster directly without going through a MySQL Server.

In a keynote in 2015, GE showed some templates for how to set up
NDB Cluster in GE applications for the health-care industry. More on
architectures for NDB Cluster in a later blog.

Friday, February 21, 2020

Requirements on NDB Cluster 8.0

In this blog I am going to go through the most important requirements that
NDB Cluster 8.0 is based on. I am also going to list a number of consequences
these requirements have for the product and what it supports.

On slideshare.net I uploaded a presentation of the NDB Cluster 8.0
requirements. In this blog and several accompanying ones I am going to present
the reasoning that these requirements led to in terms of software architecture,
data structures and so forth.

The requirements on NDB Cluster 8.0 are the following:

1) Unavailability of less than 30 seconds per year (Class 6 Availability)
2) Predictable latency
3) Transparent Distribution and Replication
4) Write and Read Scalability
5) Highest availability even with 2 replicas
6) Support SQL, LDAP, File System interface, ...
7) Mixed OLTP and OLAP for real-time data analysis
8) Follow HW development for CPUs, networks, disks and memories
9) Follow HW development in Cloud Setups

The original requirement on NDB Cluster was to only support Class 5 Availability.
Telecom providers have continued to support ever higher numbers of subscribers per
telecom database, thus driving the requirement up to Class 6 Availability.
NDB Cluster has a more than 15 years proven track record of handling Class 6
Availability.

The requirement on predictable latency means that we need to be able to handle
transactions involving around twenty operations within 10 milliseconds even when
the cluster is working at high load.

To make sure that application development is easy, we opted for a model where
distribution and replication are transparent to the application code. This means that
NDB Cluster is one of very few DBMSs that support auto-sharding.

High Write Scalability has been a major requirement in NDB from day one.
NDB Cluster can handle tens of millions of transactions per second, whereas most
competing DBMS products that are based on replication protocols can only handle
tens of thousands of transactions per second.

We used an arbitration model to avoid the requirement of 3 replicas. With
NDB Cluster 8.0 we fully support 3 and 4 replicas as well, but even with 2 replicas
we get the same availability that competing products based on replication protocols
require 3 replicas for.

The original requirements on NDB didn't include a SQL interface. An efficient
API was much more important for telecom applications. However, when meeting
customers of a DBMS it was obvious that an SQL interface was needed.
So this requirement was added in the early 2000s. However, most early users of
NDB Cluster still opted for a more direct API, and this means that NDB Cluster
today has LDAP interfaces through OpenLDAP, a file system interface through
HopsFS and a lot of products that use the NDB API (C++), ClusterJ (Java) and
the NDB NodeJS API.

The model of development for NDB makes it possible to also handle complex queries
in an efficient manner. Thus in the development of NDB Cluster 8.0 we added the
requirement to better support OLAP use cases on the OLTP data that is stored in
NDB Cluster. We have already made very significant improvements in this area by
supporting parallelised filters and, to a great extent, parallelisation of join
processing in the NDB data nodes. This is an active development area for the coming
generations of NDB Cluster.

NDB Cluster started its development in the 1990s. Already in this development we
could foresee some of the HW development that was going to happen. The product
has been able to scale as HW has become more and more scalable. Today this means that
each node in NDB Cluster can scale to 64 cores, data nodes can scale to 16 TB of
memory and at least 100 TB of disk data, and the cluster can benefit greatly from
higher and higher bandwidth on the network.

Finally, modern deployments often happen in cloud environments. Clouds are based
on an availability model with regions, availability domains and failure domains.
Thus the NDB Cluster software needs to be able to make efficient use of
locality in these HW configurations.

Original NDB Cluster Requirements

NDB Cluster was originally developed for Network DataBases in the telecom
network. I worked in an EU project between 1991 and 1995 that focused on
developing a pre-standardisation effort on UMTS, which later became standardised
under the term 3G. I worked in a part of the project where we focused on
simulating the network traffic in such a 3G network. I was focusing my attention
especially on the requirements that this created on a network database
in the telecom network.

In the same time period I also dived deeply into the research literature on DBMS
implementation.

The following requirements from the 3G studies emerged as the most important:

1) Class 5 Availability (less than 5 minutes of unavailability per year)
2) High Write Scalability as well as High Read Scalability
3) Predictable latency down to milliseconds
4) Efficient API
5) Failover in crash scenarios within seconds or even subseconds with a real-time OS

In another blog on the influences leading to the use of an asynchronous programming
model in NDB Cluster we derive the following requirements on the software
architecture.

1) Fail-fast architecture (implemented through ndbrequire macro in NDB)
2) Asynchronous programming (provides much tracing information in crashes)
3) Highly modular SW architecture
4) JAM macros to track SW in crash events

In another blog I present the influences leading to NDB Cluster using a shared
nothing model.

One important requirement that NDB Cluster is fairly unique in addressing is high
write scalability. Most DBMSs solve this by grouping large numbers of
small transactions together to make commits more efficient. This means that most
DBMSs have a very high cost of committing a transaction.

Modern replication protocols have actually made this even worse. As an example, in
most modern replication protocols all transactions have to commit in a serial fashion.
This means that commit handling is a major bottleneck in many modern DBMSs.
Often this limits their transaction rates to tens of thousands of commits per second.

NDB Cluster went another path and essentially commits every single row change
separately from any other row change. Thus the cost of executing 1000 transactions
with 1000 operations per transaction is exactly the same as the cost of executing
1 million single-row transactions.

To achieve this grouping of messages we used the fact that we are working in an
asynchronous environment. Thus we use several levels of piggybacking of messages.
One of the most important things here is that one socket is used to transport many
thousands of simultaneous database transactions. With NDB Cluster 8.0.20 we use
multiple sockets between data nodes, and this scales another 10-20x to ensure that
HW limitations are the bottleneck and not the NDB software.
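
Here is a toy Java sketch of the piggybacking idea, with invented names
(NDB's real send buffers are considerably more elaborate): signals from many
concurrent transactions accumulate in a send buffer and leave the node in a
single socket write.

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Toy sketch of message piggybacking: signals from many concurrent
// transactions are buffered and sent over one socket in a single write.
public class SendBufferDemo {
    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();

    void queueSignal(int transactionId, String payload) {
        byte[] bytes = (transactionId + ":" + payload + "\n")
                .getBytes(StandardCharsets.UTF_8);
        buffer.write(bytes, 0, bytes.length); // piggyback onto pending data
    }

    void flush(OutputStream socket) throws IOException {
        socket.write(buffer.toByteArray());   // one send for many transactions
        buffer.reset();
    }

    public static void main(String[] args) throws IOException {
        SendBufferDemo sb = new SendBufferDemo();
        for (int tx = 1; tx <= 1000; tx++) {
            sb.queueSignal(tx, "COMMIT");     // 1000 transactions share the link
        }
        ByteArrayOutputStream fakeSocket = new ByteArrayOutputStream();
        sb.flush(fakeSocket);                 // a single write carries them all
        System.out.println("flushed " + fakeSocket.size() + " bytes in one send");
    }
}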

The asynchronous programming model ensures that we can handle thousands of
operations each millisecond and that changing from working on one transaction to
another is a matter of tens to hundreds of nanoseconds. In addition, we can handle
these transactions independently in a number of different data nodes and even
within different threads within the same data node. Thus we can handle tens of
millions of transactions per second even within a single data node.

The protocol we used for this is a variant of the two-phase commit protocol with
some optimisations based on the linear two-phase commit protocol. However, the
requirement on Class 5 Availability meant that we had to solve the blocking part
of the two-phase commit protocol. We solved this by recreating the state of the
failed transaction coordinator in a surviving node as part of failover handling.
This means that we will never be blocked by a failure as long as there is still a
sufficient number of nodes to keep the cluster operational.
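
To make the non-blocking idea concrete, here is a toy sketch in Java (all
names invented, not NDB's actual code): a coordinator prepares a transaction
on two participants and then fails before announcing its decision; a take-over
coordinator rebuilds the transaction state from the surviving participants and
completes the commit, so nobody blocks.

import java.util.List;

// Toy illustration of non-blocking two-phase commit via coordinator
// take-over. This is an invented sketch, not NDB's implementation.
public class TakeoverDemo {
    enum TxState { ACTIVE, PREPARED, COMMITTED }

    static class Participant {
        final String name;
        TxState state = TxState.ACTIVE;
        Participant(String name) { this.name = name; }
    }

    public static void main(String[] args) {
        List<Participant> participants =
            List.of(new Participant("node-2"), new Participant("node-3"));

        // Phase 1: the coordinator (say node-1) asks everyone to prepare.
        participants.forEach(p -> p.state = TxState.PREPARED);

        // The coordinator now crashes before sending the commit decision.
        // In classic two-phase commit the participants would block here.
        // In the NDB model a surviving node takes over, rebuilds the
        // coordinator state by querying the participants, and finishes.
        boolean allPrepared =
            participants.stream().allMatch(p -> p.state == TxState.PREPARED);
        if (allPrepared) {
            // Phase 2, driven by the take-over coordinator.
            participants.forEach(p -> p.state = TxState.COMMITTED);
        }
        participants.forEach(p -> System.out.println(p.name + ": " + p.state));
    }
}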

Influences leading to NDB Cluster using a Shared Nothing Model

The requirements on Class 5 Availability and immediate failover had two important
consequences for NDB Cluster. The first is that we wanted a fail-fast architecture.
Thus as soon as we see any kind of inconsistency in our internal data structures, we
immediately fail and rely on the failover and recovery mechanisms to make the failure
almost unnoticeable. The second is that we opted for a shared nothing model where all
replicas are able to take over immediately.
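
To illustrate the fail-fast idea, here is a tiny Java sketch (invented names;
NDB implements this with the ndbrequire macro in its C++ code): any violated
internal invariant halts the process immediately, so that recovery, rather than
a node with corrupt state, handles the situation.

// Toy analogue of a fail-fast invariant check.
public class FailFast {
    static void require(boolean invariant, String message) {
        if (!invariant) {
            System.err.println("Invariant violated: " + message);
            // Halt immediately: a crashed node is recovered within seconds,
            // while a node running on corrupt state could do real damage.
            Runtime.getRuntime().halt(1);
        }
    }

    public static void main(String[] args) {
        int freeSlots = -1; // simulate a corrupted internal counter
        require(freeSlots >= 0, "free list counter must be non-negative");
        System.out.println("never reached");
    }
}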

The shared disk model requires replay of the REDO log before failover is completed,
and while this can be made fast, it is not immediate. In addition, as one quickly
understands, the shared disk model relies on an underlying shared nothing storage
service. The shared disk implementation can never be more available than the
underlying shared nothing storage service.

Thus it is actually possible to build a shared disk DBMS on top of NDB Cluster.

The most important research paper influencing the shared nothing model used in NDB
is the paper presented at VLDB 1992 called "Dynamic Data Distribution in a
Shared-Nothing Multiprocessor Data Store".

Obviously it was required to fully understand the ARIES model that was presented,
also in 1992, by a team at IBM. However, NDB Cluster actually chose a very different
model, since we wanted to achieve a logical REDO log coupled with a checkpoint
model that has actually changed a few times in NDB Cluster.