Tuesday, February 25, 2020

Use Cases for MySQL NDB Cluster 8.0

In this blog I will go through a number of popular applications that use
NDB Cluster 8.0 and also how these applications have developed over the

There is a presentation at slideshare.net accompanying this blog.

The first major NDB development project was to build a prototype of a
number portability application together with a swedish telecom provider.
The aim of this prototype was to show how one could build advanced
telecom applications and manage it through standardised interfaces.
The challenge here was that the telecom applications have stringent
requirement, both on uptime and on latency. If the database access
took too long time the telecom applications would suffer from
abandoned calls and other problems. Obviously the uptime of the
database part had to be at least as high as the uptime of the
telecom switches. Actually even higher since many telecom databases
are used by many telecom switches.

In the prototype setup NDB Cluster was running in 2 SPARC computers
that was interconnected using a low latency SCI interconnect from
Dolphin, the SPARC computer was also connected to the AXE switch
through Ethernet that connected to the central computer in the
AXE switch through a regional processor. This demo was developed
in 1997 and 1998 and concluded with a successful demo.

In 1999 a new development project started up within a startup arm
of Ericsson. In 1999 the financial market was very hot and to have
instant access to stock quotes was seen as a major business benefit
(it still is).

We worked together with a swedish financial company and together with
them we developed an application that had two interfaces towards
NDB Cluster. One was the feed from the stock exchange where stock
order was fed into the database. This required low latency writes into
NDB Cluster and also very high update rates.

The second interface provided real-time stock quotes to users and
other financial applications. This version was a single-node
database service.

We delivered a prototype of this service that worked like a charm in
2001. At this point however the stock markets plunged and the financial
markets was no longer the right place for a first application for NDB

Thus we refocused the development of NDB Cluster back towards the
telecom market. This meant that we focused heavily on completing the work
on handling node failures of all sorts. We developed test programs that
ran thousands of node failures of all sorts every day. We worked with a
number of prospective customers in 2002 and 2003 and developed a number
of new versions of NDB Cluster.

The first customer that adopted NDB Cluster in a production environment
was Bredbandsbolaget, we worked together with them in 2003 and 2004 and
assisted them in developing their applications. Bredbandsbolaget was and
is an internet service provider. Thus the applications they used NDB
Cluster in was things like a DNS service, a DHCP service and so forth.

We worked close with them, we even had offices in the same house and on
the same floor, so we interacted on a daily basis. This meant that the
application and NDB Cluster was developed together and had a perfect
fit for each other. This application is still operational and have been
so since 2004. I even had Bredbandsbolaget as my own internet service
provider for 10 years. So I was not only developing NDB Cluster, I was
also one of its first users.

In 2003 NDB Cluster development was acquired by MySQL and we changed the
name to MySQL Cluster. Nowadays there are other clustering products within
the MySQL area, so to distinguish NDB Cluster I sometimes use
MySQL NDB Cluster and sometimes simply NDB Cluster. However the product
name is still MySQL Cluster.

After Bredbandsbolaget we got a number of new large customers in the telecom
area. Many of those telecom customers have used LDAP as the application
protocol to access their data since there was some standardisation in the
telecom sector around LDAP. To assist this there is a storage engine
to access NDB Cluster from OpenLDAP. One example of such a telecom
application is Juniper SBR Carrier System that has a combination of
SQL access, LDAP access, HTTP access, RADIUS acces towards NDB Cluster.
NDB is used in this application as a Session State database.

All sorts of telecom applications remains a very important use case for
NDB Cluster. One interesting area of development in the telecom space is
5G and IoT that will expand the application space for telecom substantially
and will expand also into self-driving cars, smart cities and many more
interesting applications that require ultra high availability coupled with
high write scalability and predictable low latency access.

Coming back to financial applications this remains an important use case
for NDB Cluster. High write scalability, ultra high availability and
predictable and low latency access to data is again the keywords that
drives the choice of NDB Cluster in this application area.

The financial markets also add one more dimension to the NDB use cases.
Given that NDB can handle large amounts of payment, payment checks,
white lists, black lists and so forth, it is also possible to use
the data in NDB Cluster for real-time analysis of the data.

Thus NDB Cluster 8.0 have focused significantly on delivering more
capabilities in the area of complex queries as well. We have seen
many substantial improvements in this area.

More and more of our users work with standard SQL interfaces towards
NDB Cluster and we worked very hard on ensuring that this provides
low latency access patterns. All the traditional interfaces towards
MySQL will also work with NDB Cluster. Thus NDB can be accessed from
all programming languages that one can use to access MySQL from.

However many financial applications are written in Java. From Java
we have a NDB API called ClusterJ. This API uses a Data Object Model
that makes it very easy to use. In many ways it can be easier to
work with ClusterJ compared to working with SQL in object-oriented

The next application category that recognized that NDB Cluster had a
very good fit for them was the Computer Gaming industry. There is a
number of applications within Computer Gaming where NDB Cluster has
a good fit. User profile management is one area where it is important
to always be up and running such that users can join and leave the
games at any time. Game state is another area that requires very high
write scalability. Most of these applications use the SQL interface
and many applications use fairly complex SQL queries and thus benefit
greatly from our improvements of parallel queries in NDB Cluster 8.0.

An interesting application that was developed at the SICS research
institute in Stockholm is HopsFS. This implements a file system in the
Hadoop world based on Hadoop HDFS. It scales to millions of
file operations per second.

This means that NDB Cluster 8.0 is already used in many important
AI applications as the platform for a distributed file system.

In NDB Cluster 8.0 we have improved such that write scalability is
even higher also when the writes are large in volume. NDB Cluster 8.0
scales to updates measured in GBytes per second even in a 2-node
cluster and in a larger cluster one can reach almost hundreds of
GBytes per second.

Thus NDB Cluster 8.0 is a very efficient tool to implement modern
key-value stores, distributed file systems and other highly
scalable applications.

NDB Cluster 8.0 is also a perfect tool to use in building many of the
modern applications that is the base of cloud applications. This is
one more active area of development for MySQL NDB Cluster.

Obviously all sorts of Web applications is also a good fit for NDB
Cluster. This is particularly true with developments in NDB Cluster
7.6 and 8.0 where we improved latency of simple queries and
implemented a shared memory transporter that makes it very
efficient to setup small clusters with low latency access to all data.

For web applications we also have a NodeJS API that can access
NDB Cluster directly without going through a MySQL Server.

In a keynote in 2015 GE showed some templates for how to setup
NDB Cluster in GE applications for the health-care industry. More on
architectures for NDB Cluster in a later blog.

Friday, February 21, 2020

Requirements on NDB Cluster 8.0

In this blog I am going to go through the most important requirements that
NDB Cluster 8.0 is based on. I am going to also list a number of consequences
these requirements have on the product and what it supports.

One slideshare.net I uploaded a presentation of the NDB Cluster 8.0
requirements. In this blog and several accompanying I am going to present the
reasoning that these requirements led to in terms of software architecture, data
structures and so forth.

The requirements on NDB Cluster 8.0 is the following:

1) Unavailability of less than 30 seconds per year (Class 6 Availability)
2) Predictable latency
3) Transparent Distribution and Replication
4) Write and Read Scalability
5) Highest availability even with 2 replicas
6) Support SQL, LDAP, File System interface, ...
7) Mixed OLTP and OLAP for real-time data analysis
8) Follow HW development for CPUs, networks, disks and memories
9) Follow HW development in Cloud Setups

The original requirements of NDB Cluster was to only support Class 5 Availability.
Telecom providers have continued supporting even higher number of subscribers per
telecom database and thus driving the requirements to even be Class 6 Availability.
NDB Cluster have more than 15 years proven track record of handling Class 6

The requirements on predictable latency means that we need to be able to handle
transactions involving around twenty operations within 10 milliseconds even when
the cluster is working at a high load.

To make sure that application development is easy we opted for a model where
distribution and replication is transparent from the application code. This means that
NDB Cluster is one of very few DBMSs that support auto-sharding requirements.

High Write Scalability has been a major requirement in NDB from day one.
NDB Cluster can handle tens of million transactions per second, most competing
DBMS products that are based on replication protocols can only handle
tens of thousands of transactions per second.

We used an arbitration model to avoid the requirements of 3 replicas, with
NDB Cluster 8.0 we fully support 3 and 4 replicas as well, but even with 2 replicas
we get the same availability as competing products based on replication protocols
require 3 replicas for.

The original requirements on NDB didn't include a SQL interface. An efficient
API was much more important for telecom applications. However when meeting
customers of a DBMS it was obvious that an SQL interface was needed.
So this requirement was added in the early 2000s. However most early users of
NDB Cluster still opted for a more direct API and this means that NDB Cluster
today have LDAP interfaces through OpenLDAP, file system interface through
HopsFS and a lot of products that use the NDB API (C++), ClusterJ (Java) and
an NDB NodeJS API.

The model of development for NDB makes it possible to also handle complex queries
in an efficient manner. Thus in development of NDB Cluster 8.0 we added the
requirement to better support also OLAP use cases of the OLTP data that is stored in
NDB Cluster. We have already made very significant improvements in this area by
supporting parallelised filters and to a great extent parallelisation of join processing
in the NDB Data Nodes. This is an active development area for the coming
generations of NDB Cluster.

NDB Cluster started its development in the 1990s. Already in this development we
could foresee some of the HW development that was going to happen. The product
has been able to scale as HW have been more and more scalable. Today this means that
each node in NDB Cluster can scale to 64 cores, data nodes can scale to 16 TB of
memory and at least 100 TB of disk data and can benefit greatly from higher and
higher bandwidth on the network.

Finally modern deployments often happen in cloud environments. Clouds are based
on an availability model with regions, availability domains and failure domains.
Thus NDB Cluster software needs to make it possible to make efficient use of
locality in the HW configurations.

Original NDB Cluster Requirements

NDB Cluster was originally developed for Network DataBases in the telecom
network. I worked in a EU project between 1991 and 1995 that focused on
developing a pre-standardisation effort on UMTS that later became standardised
under the term 3G. I worked in a part of the project where we focused on
simulating the network traffic in such a 3G network. I was focusing my attention
especially on the requirements that this created on a network database
in the telecom network.

In the same time period I also dived deeply into research literatures about DBMS

The following requirements from the 3G studies emerged as the most important:

1) Class 5 Availability (less than 5 minutes of unavailability per year)
2) High Write Scalability as well as High Read Scalability
3) Predictable latency down to milliseconds
4) Efficient API
5) Failover in crash scenarios within seconds or even subseconds with a real-time OS

In another blog on the influences leading to the use of an asynchronous programming
model in NDB Cluster we derive the following requirements on the software

1) Fail-fast architecture (implemented through ndbrequire macro in NDB)
2) Asynchronous programming (provides much tracing information in crashes)
3) Highly modular SW architecture
4) JAM macros to track SW in crash events

In another blog I present the influences leading to NDB Cluster using a shared
nothing model.

One important requirement that NDB Cluster is fairly unique in addressing is high
write scalability. Most DBMSs solves this by grouping together large amounts of
small transactions to make commits more efficient. This means that most DBMSs
have a very high cost of committing a transaction.

Modern replicated protocols actually have even made this worse. As an example in
most modern replicated protocols all transactions have to commit in a serial fashion.
This means that commit handling is a major bottleneck in many modern DBMSs.
Often this limits their transaction rates to tens of thousands commits per second.

NDB Cluster went another path and essentially commits every single row change
separate from any other row change. Thus the cost of executing 1000 transactions
with 1000 operations per transaction is exactly the same as the cost of executing
1 million single row transactions.

To achieve the grouping we used the fact that we are working in an asynchronous
environment. Thus we used several levels of piggybacking of messages. One of the
most important things here is that one socket is used to transport many thousands of
simultaneous database transactions. With NDB Cluster 8.0.20 we use multiple sockets
between data nodes and this scales another 10-20x to ensure that HW limitations is
the bottleneck and not the NDB software.

The asynchronous programming model ensures that we can handle thousands of
operations each millisecond and that changing from working on one transaction to
another is a matter of tens to hundreds of nanoseconds. In addition we can handle
these transactions independently in a number of different data nodes and even
within different threads within the same data node. Thus we can handle tens of millions
transactions per second even within a single data node.

The protocol we used for this is a variant of the two-phase commit protocol with
some optimisations based on the linear two-phase commit protocol. However the
requirements on Class 5 Availability meant that we had to solve the blocking part
of the two-phase commit protocol. We solved this by recreating the state of the
failed transaction coordinators in a surviving node as part of failover handling.
This meant that we will never be blocked by a failure as long as there is still a
sufficient amount of nodes to keep the cluster operational.

Influences leading to NDB Cluster using a Shared Nothing Model

The requirements on Class 5 availability and immediate failover had two important
consequences for NDB Cluster. The first is that we wanted a fail-fast architecture.
Thus as soon as we have any kind of inconsistency in our internal data structures we
immediately fail and rely on the failover and recovery mechanisms to make the failure
almost unnoticable. The second is that we opted for a shared nothing model where all
replicas are able to take over immediately.

The shared disk model requires replay of the REDO log before failover is completed
and this can be made fast, but not immediate. In addition as one quickly understands
with the shared disk model is that it relies on an underlying shared nothing storage
service. The shared disk implementation can never be more available than the
underlying shared nothing storage service.

Thus it is actually possible to build a shared disk DBMS on top of NDB Cluster.

The most important research paper influencing the shared nothing model used in NDB
is the paper presented at VLDB 1992 called "Dynamic Data Distribution in a
Shared-Nothing Multiprocessor Data Store".

Obviously it was required to fully understand the ARIES model that was presented
also in 1992 by a team at IBM. However NDB Cluster actually choose a very different
model since we wanted to achieve a logical REDO log coupled with a checkpoint
model that actually changed a few times in NDB Cluster.

Influences leading to Asynchronous Programming Model in NDB Cluster

A number of developments was especially important in influencing the development
of NDB Cluster. I was working at Ericsson, so when I didn't work on DBMS research
I was deeply involved in prototyping the next generation telecom switches. I was the
lead architect in a project that we called AXE VM. AXE was the cash cow of Ericsson
in those days. It used an in-house developed CPU called APZ. I was involved in some
considerations into how to develop a new generation of the next generation APZ in the
early 1990s. However I felt that the decided architecture didn't make use of modern
ideas on CPU development. This opened for the possibility to use a commercial CPU
to build a virtual machine for APZ. The next APZ project opted for a development
based on the ideas from AXE VM at the end of the 1990s. I did however at this time
focus my full attention to development of NDB Cluster.

One interesting thing about the AXE is that was the last single CPU telecom switch on
the market. The reason that the AXE architecture was so successful was due to the
concept of blocks and signals.

The idea with blocks came from inheriting ideas from HW development for SW
development. The idea is that each block is self-contained in that it contains all the
software and data for its operation. The only way to communicate between blocks is
through signals. More modern names on blocks and signals are modules and
messages. Thus AXE was entirely built on a message passing architecture.
However to make the blocks truly independent of each other it is important to only
communicate using asynchronous signals. As soon as synchronous signals are used
between blocks, these blocks are no longer independent of each other.

I became a very strong proponent of the AXE architecture, in my mind I saw that the
asynchronous model gave a 10x improvement of performance in a large distributed
system. The block and signal model constitutes a higher entrance fee to SW
development, but actually it provides large benefits when scaling the software for new

One good example of this is when I worked on scaling MySQL towards higher
CPU core counts between 2008 and 2012. I worked on both improving scalability of
NDB Cluster and the MySQL Server. The block and signal model made it possible to
scale the NDB data nodes with an almost lock-free model. There are very few
bottlenecks in NDB data nodes for scaling to higher number of CPUs.
The main ones that still existed have been extensively improved in NDB Cluster 8.0.20.

Thus it is no big surprise that NDB Cluster was originally based on AXE VM. This
heritage gave us some very important parts that enabled quick bug fixing of
NDB Cluster. All the asynchronous messages goes through a job buffer. This means
that in a crash we can print the last few thousand messages that have been executed in
each thread in the crashed data node. In addition we also use a concept called
Jump Address Memory (jam). This is implemented in our code as macros that write
the line number and file number into memory such that we can track exactly how we
came to the crash point in the software.

So NDB Cluster comes from marrying the requirements on a network database for
3G networks with the AXE model that was developed in Ericsson in the 1970s.
As can be seen this model is still going strong given that NDB Cluster is able to deliver
the best performance, highest availability of any DBMS for telecom applications,
financial applications, key-value stores and even distributed file systems.

Thus listing the most important requirements we have on the software
engineering model:

1) Fail-fast architecture (implemented through ndbrequire macro in NDB)
2) Asynchronous programming (provides much tracing information in crashes)
3) Highly modular SW architecture
4) JAM macros to track SW in crash events

Thursday, February 13, 2020

NDB Cluster, the World's Fastest Key-Value Store

Using numbers produced already with MySQL Cluster 7.6.10 we have
shown that NDB Cluster is the world's fastest Key-Value store using
the Yahoo Cloud Serving Benchmark (YCSB) Workload A.

Presentation at Slideshare.net.

We reached 1.4M operations using 2 Data Nodes and 2.8M operations
using a 4 Data Node setup. All this using a standard JDBC driver.
Obviously using a specialised ClusterJ client will improve performance
further. These benchmarks was executed by Bernd Ocklin.

The benchmark was executed in the Oracle Cloud. Each Data Node used
a Bare Metal Server using DenseIO which have 52 CPU cores with
8 NVMe drives.

The MySQL Servers and Benchmark clients was executed on Bare Metal
servers with 2 MySQL Server per server (1 MySQL Server per CPU socket).
These Bare Metal servers contained 36 CPU cores each.

All servers used Oracle Linux 7.

YCSB Workload A means that 50% of the operations are reads that read the
full rows (1 kByte in size) and 50% perform updates of one of the fields
(100 bytes in size).

Oracle Cloud contains 3 different levels of domains. The first level is that
servers are placed in different Failure Domains within the same Availability
Domains. This means essentially that servers are not relying on the same
switches and power supplies. But they can still be in the same building.

The second level is Availability Domains that are in the same region, but each
Availability Domain is failing independently of the other Availability Domains
in the same region.

The third level is regions that are separated by long distances as well.

Most applications of NDB Cluster relies on a model that would use 2 or
more NDB Clusters in different regions, but each cluster contained inside
an Availability Domain. Next global replication between the NDB Clusters
is used for fail-over when one region or availability domain fails.

With Oracle Cloud one can also setup a cluster to have Data Nodes in
different Availability Domain. This increases the availability of the
cluster at the expense of higher latency for write operations. NDB Cluster
have configuration options to ensure that one always performs local
reads on either the same server or at least in the same Availability/Failure

The Oracle Cloud have the most competitive real-time characteristics of
the enterprise clouds. Our experience is that the Oracle Cloud provides
2-4x better latency compared to other cloud vendors. Thus the Oracle
Cloud is perfectly suitable for NDB Cluster.

The DenseIO Bare Metal Servers or DenseIO VMs are suitable for
use for NDB Data Nodes or a NDB Data Nodes colocated with
MySQL Server. These servers have excellent CPU combined with
25G Ethernet links and extremely high performing NVMe drives.

This benchmark reported here stores the table as In-Memory tables.
We will later report on some benchmarks where we use a slightly
modified YCSB benchmark to show numbers when we instead use
Disk-based tables with much heavier update loads.

The Oracle Cloud contains a number of variants of Bare Metal servers
and VMs that are suitable for MySQL Servers and applications.

In NDB Cluster the MySQL Servers are actually stateless since all
the state is in the NDB Data Node. The only exception to this rule
is the MySQL Servers used for replication to another cluster that
requires disk storage for the MySQL binlog.

So usually a standard server can be setup without any special extra
disks for MySQL Servers and clients.

In the presentation we show the following important results.

The latency of DBMS operations is independent of the data size. We
get the same latency when Data set have 300M rows as when there are
600M rows.

We show that scaling to 8 Data Nodes with 4 Data Nodes in each Availability
Domains scales from 4 Data Nodes in the same Availability Domain. But
the extra latency increases the latency and this also some loss in throughput.
Still we reach 3.7M operations per second for this 8-node setup.

We show that an important decision for the cluster setup is the number of
LDM threads. These are the threads doing the actual database work. We get
best scalability when going for the maximum number of LDM threads which
is 32. Using 32 LDM threads can increase latency at low number of clients,
but when the clients increase the 32 LDM setup will scale much longer than
the 16 LDM thread setup.

In MySQL Cluster 8.0.20 we have made more efforts to improve scaling to
many LDM threads. So we expect the performance of large installations to
scale even further in 8.0.20.

The benchmark report above gives very detailed numbers of latency in various
situations. As can seen there we can handle 1.3M operations per second with
latency of reads below 1 ms and updates having latency below 2 ms!

Finally the benchmark report also shows the impact of various NUMA settings
on performance and latency. It is shown that Interlaced NUMA settings have
a slight performance disadvantage, but since it means that we get access to the
full DRAM and the full machine, it is definitely a good idea to use this setting.
In NDB Cluster this is the default setting.

The YCSB benchmark shows NDB Cluster in its home turf with enormous
throughput of key operations, both read and write, with predictable and low

Couple with the high availability features that have been proven in the field
with more than 15 years of continous operations with better than Class 6
availability we feel confident to claim that NDB Cluster is the World's
Fastest and Most Available Key-Value Store!

The YCSB benchmark is a standard benchmark, so any competing solution
is free to challenge our claim. We used a standard YCSB client of version
0.15.0 using a standard MySQL JDBC driver.

NDB Cluster supports full SQL through the MySQL Server, it can push joins
down to the NDB Data Nodes for parallel query filtering and joining.
NDB Cluster supports sharding transparently and complex SQL queries
executes cross-shard joins which most competing Key-Value stores don't

One interesting example using NDB Cluster as a Key-Value Store is HopsFS
that implements a hierarchical file system based on Hadoop HDFS. It has been
shown to scale to 1.6M file operations per second and small files can be stored
in the NDB Cluster for low latency access to small files.

Monday, February 10, 2020

Benchmarking a 5 TB Data Node in NDB Cluster

Through the courtesy of Intel I have access to a machine with 6 TB of Intel
Optane DC Persistent Memory. This is memory that can be used both as
persistent memory in App Direct Mode or simply used as a very large
DRAM in Memory Mode.

Slides for a presentation of this is available at slideshare.net.

This memory can be bigger than DRAM, but has some different characteristics
compared to DRAM. Due to this different characteristics all accesses to this
memory goes through a cache and here the cache is the entire DRAM in the

In the test machine there was a 768 GB DRAM acting as a cache for the
6 TB of persistent memory. When a miss happens in the DRAM cache
one has to go towards the persistent memory instead. The persistent memory
has higher latency and lower throughput. Thus it is important as a programmer
to ensure that your product can work with this new memory.

What one can expect performance-wise is that performance will be similar to
using DRAM as long as the working set is smaller than DRAM. As the working
set grows one expects the performance to drop a bit, but not in a very significant

We tested NDB Cluster using the DBT2 benchmark which is based on the
standard TPC-C benchmark but uses zero latency between transactions in
the benchmark client.

This benchmark has two phases, the first phase loads the data from 32 threads
where each threads loads one warehouse at a time. Each warehouse contains
almost 500.000 rows in a number of tables.

The second phase executes the benchmark where a number of threads execute
transactions in parallel towards the database using 5 different transactions.

The result is based on how many new order transactions can be processed per
minute. Each such transaction report requires more than 50 SQL statements to be
executed where the majority is UPDATE's and SELECT FOR UPDATE.

Through experiments using the same machines with only DRAM it was
verified that performance running a benchmark with a working set smaller
than DRAM size the performance was within a few percent's margin the

Next we performed benchmarks comparing results when running in a database
of almost 5 TB in size and comparing it to a benchmark that executed only on
warehouses that fit in the DRAM cache.

Our findings was that latency of DBT2 transactions increased by 10-12% when
using the full data set of the machine. However the benchmark was limited by
the CPUs available to run the MySQL Server and thus the throughput was
the same.

NDB Cluster worked like a charm during these tests. We found a minor issue in
the local checkpoint processing where we prefetched some cache lines that
wasn't going to be used. This had a negative performance effect, in particular
when loading. This is fixed in MySQL Cluster 8.0.20.

This benchmark proves two things. First that MySQL Cluster 8.0 works fine
with Intel Optane DC Persistent Memory in Memory Mode. Second it proves
that NDB can work with very large memories, here we tested with up to
more than 5 TB of data in a single data node. The configuration parameter
for DataMemory supports settings up to 16 TB. Beyond 16 TB there are some
constants in the checkpoint processing that would require tweaking. The
current product is designed to work very well up to 16 TB and even work
with even larger memories.

Thus with support for up to 144 data nodes and thus 72 node groups we can
support up to more than 1 PB of in-memory data. On top of this one can also
use disk data of even bigger sizes making it possible to handle multiple
PBs of data in one NDB Cluster.