Wednesday, August 12, 2020

Setting up NVMe drives on Oracle Cloud for NDB Cluster

In a blog post on the 16th of January I showed graphs of CPU usage,
network usage and disk usage. The benchmark that was running was a
modified variant of YCSB (Yahoo Cloud Serving Benchmark), based on
YCSB version 0.15.0.


In this blog post I will describe the setup of the NVMe drives for this

benchmark using DenseIO machines in the Oracle Cloud.


Oracle Cloud has a variety of machines available. In this benchmark we wanted

to show NDB with a database size of around 20 TByte of user data in a replicated

setup.


There are numerous ways to connect disk drives to Oracle Cloud machines. One
option is block storage, where the actual storage resides on separate storage
servers and all disk traffic goes over the network. The other option is to use
DenseIO machines that have 8 local NVMe drives with a total of 51.2 TByte of
disk space.


In this benchmark we opted for the DenseIO machines, given that they can handle
more than twice the disk load compared to block storage. Block storage is
limited by its 25 Gb per second Ethernet connection, and in addition Oracle
Cloud puts limits on the number of IOPS allowed for block storage. So in our
case the DenseIO machines were the obvious choice.


We used the Bare Metal variant called BM.DenseIO2.52.


This DenseIO machine has the following HW setup. It has 2 CPU sockets, each
containing an Intel Xeon Platinum 8167 with 26 CPU cores, for a total of
52 CPU cores and 104 CPUs (in my terminology, CPUs are contained in CPU cores,
which are contained in CPU sockets, which are contained in a computer/server).
It is equipped with 768 GByte of memory. In addition it has 8 NVMe drives,
each with 6.4 TByte of disk space. Each such drive is capable of around
1-2 GByte/sec read/write speed depending on the workload.


An NDB data node requires disks for the following things:

  1. Checkpoints of in-memory data
  2. REDO log for in-memory data and disk data records
  3. UNDO log for disk data records
  4. Tablespaces for disk data records


The first 3 use cases are write-only in normal operation and read-only during
recovery. Thus they mainly require disk bandwidth for sequential writes, issued
in sizes that use the disk bandwidth optimally. The in-memory data is in this
case a very small part of the disk load.


We strive to reach beyond 1 GByte per second insert and update rates for each
data node. For insert loads the main load comes from checkpointing the
tablespaces for disk data records and writing the REDO log for disk data
records.


For update loads there will also be a fair amount of UNDO log records to write.
Thus the worst case for the first three parts is when performing updates, since
then we have to write each record to both the REDO log and the UNDO log.
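
A rough back-of-the-envelope (a sketch using the numbers in this post, and
ignoring checkpoint and header overhead) shows why updates are the worst case
for the sequential write load:

```shell
#!/bin/sh
# Rough estimate of the sequential log write load at the target update
# rate; values are from the text, overhead is ignored.
update_mb_s=1024                    # ~1 GByte/s of updated record data
redo_mb_s=$update_mb_s              # every update is written to the REDO log
undo_mb_s=$update_mb_s              # ...and, for updates, to the UNDO log
seq_mb_s=$((redo_mb_s + undo_mb_s))
echo "sequential-log-writes=${seq_mb_s}MB/s"
```

At around 2 GByte per second this is still within what 2 NVMe drives can
sustain, which is why 2 drives suffice for the sequential part below.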


So now we come to the question of how to set up the 8 NVMe drives. OCI provides
these 8 devices as raw devices, so we need to decide how to set them up to get
proper file systems.


One approach would be to simply create one file system spanning all 8 drives.
Given that the first 3 use cases are completely sequential in nature, while the
tablespace sees lots of small reads and writes, we opted to split into at least
2 file systems.


At most the sequential part will generate 2-3 GByte per second of disk writes and

thus 2 of the 8 NVMe drives will handle this nicely.


The commands to create this file system are the following:

#Make sure the mdadm tool is installed

sudo yum install mdadm -y


#Create a RAID0 device using the last 2 NVMe devices with chunk size of 256 kByte since

#most writes are fairly large sequential writes

sudo mdadm --create /dev/md0 --chunk=256 --raid-devices=2 \

           --level=raid0 /dev/nvme6n1 /dev/nvme7n1


#Create an XFS file system on this device

#Block size is 4 kByte; these are NVMe devices, so there is no reason for a smaller block size
#sunit and swidth are given in units of 512-byte sectors: sunit=512 means 256 kByte, aligned to the chunk size
#There are many parallel writes going on anyway, so no single write needs to be split up further
#With 2 disks the stripe width is 2 * 256 kByte, thus swidth=1024 (512 kByte)

sudo mkfs.xfs -b size=4096 -d sunit=512,swidth=1024 /dev/md0
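
The sunit and swidth numbers deserve a note: mkfs.xfs interprets them in
512-byte sectors, not bytes. A small sketch of the conversion from the mdadm
chunk size used above:

```shell
#!/bin/sh
# mkfs.xfs takes sunit/swidth in 512-byte sectors, so the 256 kByte
# mdadm chunk and the 2-disk stripe convert as follows:
chunk_kb=256                        # mdadm --chunk value in kByte
ndisks=2                            # drives in the RAID0 array
sunit=$((chunk_kb * 1024 / 512))    # 256 kByte -> 512 sectors
swidth=$((sunit * ndisks))          # full stripe -> 1024 sectors
echo "sunit=$sunit swidth=$swidth"
```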


#Use /ndb as the directory for the sequential file system

sudo mkdir -p /ndb


#Mount the file system on this directory, no need to keep times on files and directories

sudo mount /dev/md0 /ndb -o noatime,nodiratime


#Ensure that the Oracle Cloud user and group opc owns this directory

sudo chown opc /ndb

sudo chgrp opc /ndb


Now we have a directory /ndb that we can use for the NDB data directory.
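
One caveat: a mount made with the mount command does not survive a reboot. To
make it persistent, an /etc/fstab entry along these lines can be added (a
sketch; replace the placeholder with the UUID that `sudo blkid /dev/md0`
reports, which is more robust than the /dev/md0 name):

```
#Example /etc/fstab line for the /ndb file system (UUID is a placeholder)
UUID=<uuid-from-blkid>  /ndb  xfs  noatime,nodiratime,nofail  0  0
```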


The next step is to set up the file systems for the tablespace data for disk
records. One variant is to follow the above scheme with a smaller chunk size
and stripe unit.


This was also my first attempt, but it had some issues. I created one file
system across 6 drives of 6.4 TByte each and a tablespace file of around
35 TByte. Everything went well while loading the data, and the benchmark
continued to run well for a few minutes. But after some time of running the
benchmark, performance dropped to half.


The problem is that NVMe devices and SSD devices need some free space to prepare

for new writes. At the disk write speeds NDB achieves in this benchmark,

the NVMe devices simply couldn't keep up with preparing free space to write into.

Thus when the file system was close to full the performance started dropping.

It stabilised at around half the workload it could handle with a non-full tablespace.


So what to do? Google, as usual, found the answer. The solution was to use the
parted tool to keep 40% of the disk space out of the file system, and thus
always available for new writes. This gave the NVMe devices sufficient time to
prepare new writes, even in benchmarks that ran for many hours with a
consistent load of many GBytes per second of disk writes.


Obviously, the more disk space that is kept out of the file system, the better
the disk bandwidth one gets, but also the less disk space is available to the
application. In this case I was running a benchmark and wanted optimal
performance, and thus used only 60% of the disk space for the file system.


Using the full disk space cut performance in half; probably most of the
performance would still be available with 80% of the disk used for the file
system.
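
A quick sanity check of the disk budget this leaves (a sketch with integer
GByte arithmetic, using the sizes from this post):

```shell
#!/bin/sh
# Disk budget after overprovisioning: 6 tablespace drives of 6.4 TByte,
# of which 60% is partitioned for the file system.
drive_gb=6400
usable_gb=$((drive_gb * 60 / 100))   # per-drive file system size
total_gb=$((usable_gb * 6))          # across the 6 tablespace drives
datafile_gb=$((3200 * 6))            # six 3200G tablespace data files
echo "per-drive=${usable_gb}G total=${total_gb}G datafiles=${datafile_gb}G"
```

The six 3200G data files created further down thus fit within the roughly
23 TByte of partitioned space, with some headroom to spare.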


Here I also decided to skip RAID over the devices. Instead I created 6
different file systems. This works with MySQL Cluster 8.0.20, where we spread
the use of the different tablespace files on a round-robin basis.


So here are the commands to create those 6 file systems.

#Use the parted tool to make only 60% of each drive available to the file system

sudo parted -a opt --script /dev/nvme0n1 mklabel gpt mkpart primary 0% 60%

sudo parted -a opt --script /dev/nvme1n1 mklabel gpt mkpart primary 0% 60%

sudo parted -a opt --script /dev/nvme2n1 mklabel gpt mkpart primary 0% 60%

sudo parted -a opt --script /dev/nvme3n1 mklabel gpt mkpart primary 0% 60%

sudo parted -a opt --script /dev/nvme4n1 mklabel gpt mkpart primary 0% 60%

sudo parted -a opt --script /dev/nvme5n1 mklabel gpt mkpart primary 0% 60%


#Create 6 file systems, each with 4 kByte blocks and a 256 kByte stripe unit

sudo mkfs.xfs -b size=4096 -d sunit=512,swidth=512 /dev/nvme0n1p1

sudo mkfs.xfs -b size=4096 -d sunit=512,swidth=512 /dev/nvme1n1p1

sudo mkfs.xfs -b size=4096 -d sunit=512,swidth=512 /dev/nvme2n1p1

sudo mkfs.xfs -b size=4096 -d sunit=512,swidth=512 /dev/nvme3n1p1

sudo mkfs.xfs -b size=4096 -d sunit=512,swidth=512 /dev/nvme4n1p1

sudo mkfs.xfs -b size=4096 -d sunit=512,swidth=512 /dev/nvme5n1p1


#Create the 6 directories for the tablespace files

sudo mkdir -p /ndb_data1

sudo mkdir -p /ndb_data2

sudo mkdir -p /ndb_data3

sudo mkdir -p /ndb_data4

sudo mkdir -p /ndb_data5

sudo mkdir -p /ndb_data6


#Mount those 6 file systems

sudo mount /dev/nvme0n1p1 /ndb_data1 -o noatime,nodiratime

sudo mount /dev/nvme1n1p1 /ndb_data2 -o noatime,nodiratime

sudo mount /dev/nvme2n1p1 /ndb_data3 -o noatime,nodiratime

sudo mount /dev/nvme3n1p1 /ndb_data4 -o noatime,nodiratime

sudo mount /dev/nvme4n1p1 /ndb_data5 -o noatime,nodiratime

sudo mount /dev/nvme5n1p1 /ndb_data6 -o noatime,nodiratime


#Move ownership of the file systems to the Oracle Cloud user

sudo chown opc /ndb_data1

sudo chgrp opc /ndb_data1

sudo chown opc /ndb_data2

sudo chgrp opc /ndb_data2

sudo chown opc /ndb_data3

sudo chgrp opc /ndb_data3

sudo chown opc /ndb_data4

sudo chgrp opc /ndb_data4

sudo chown opc /ndb_data5

sudo chgrp opc /ndb_data5

sudo chown opc /ndb_data6

sudo chgrp opc /ndb_data6
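
The six-fold repetition above can equivalently be scripted as a loop. A sketch
(assuming the same device names as above; the commands are printed through a
run wrapper so the list can be inspected first, drop the echo to execute):

```shell
#!/bin/sh
# Loop form of the per-drive partition/mkfs/mkdir/mount/chown commands.
run() { echo "$@"; }   # replace 'echo "$@"' with '"$@"' to actually execute
for i in 0 1 2 3 4 5; do
    dev=/dev/nvme${i}n1
    dir=/ndb_data$((i + 1))
    run sudo parted -a opt --script $dev mklabel gpt mkpart primary 0% 60%
    run sudo mkfs.xfs -b size=4096 -d sunit=512,swidth=512 ${dev}p1
    run sudo mkdir -p $dir
    run sudo mount ${dev}p1 $dir -o noatime,nodiratime
    run sudo chown opc:opc $dir
done
```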


When NDB has been started, the following commands are used to create the
tablespace on these file systems. These commands will take some time, since
NDB initialises the files to ensure that the disk space is really allocated,
avoiding that we run out of disk space later.


#First create the UNDO log file group

#We set the size to 512G, we have plenty of disk space for logs, so no need

#to use a very small file. We allocate 4GByte of memory for the UNDO

#log buffer, the machine has 768 GByte of memory, and the memory

#is thus abundant and there is no need to save on UNDO buffer memory.

CREATE LOGFILE GROUP lg1

ADD UNDOFILE 'undofile.dat'

INITIAL_SIZE 512G

UNDO_BUFFER_SIZE 4G

ENGINE NDB;


#Next create the tablespace and the first data file

#We set the size to more than 3 TByte of usable space

CREATE TABLESPACE ts1

ADD DATAFILE '/ndb_data1/datafile.dat'

USE LOGFILE GROUP lg1

INITIAL_SIZE 3200G

ENGINE NDB;


#Now add the remaining 5 data files

#Each of the same size

ALTER TABLESPACE ts1

ADD DATAFILE '/ndb_data2/datafile.dat'

INITIAL_SIZE 3200G

ENGINE NDB;


ALTER TABLESPACE ts1

ADD DATAFILE '/ndb_data3/datafile.dat'

INITIAL_SIZE 3200G

ENGINE NDB;


ALTER TABLESPACE ts1

ADD DATAFILE '/ndb_data4/datafile.dat'

INITIAL_SIZE 3200G

ENGINE NDB;


ALTER TABLESPACE ts1

ADD DATAFILE '/ndb_data5/datafile.dat'

INITIAL_SIZE 3200G

ENGINE NDB;


ALTER TABLESPACE ts1

ADD DATAFILE '/ndb_data6/datafile.dat'

INITIAL_SIZE 3200G

ENGINE NDB;
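
Once all files have been added, a quick way to verify the result is to query
INFORMATION_SCHEMA.FILES, which lists NDB disk data files (a sketch; we expect
the undo file plus the 6 data files to show up):

```sql
-- List the NDB disk data files created above
SELECT FILE_NAME, FILE_TYPE, TABLESPACE_NAME, LOGFILE_GROUP_NAME
  FROM INFORMATION_SCHEMA.FILES
 WHERE ENGINE = 'ndbcluster';
```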


In the next blog post we will show how to write the configuration file for this

NDB Cluster.


This blog post showed a highly optimised setup for a key-value store that can
store 20 TByte of user data in a 2-node setup and handle more than 1 GByte per
second of inserts and around 1 GByte per second of updates.


From the graphs in the blog post one can see that performance is a function of the

performance of the NVMe drives and the available network bandwidth.

The CPU usage is never more than 20% of the available CPU power.

Tuesday, February 25, 2020

Use Cases for MySQL NDB Cluster 8.0

In this blog I will go through a number of popular applications that use
NDB Cluster 8.0 and also how these applications have developed over the
years.

There is a presentation at slideshare.net accompanying this blog.

The first major NDB development project was to build a prototype of a
number portability application together with a Swedish telecom provider.
The aim of this prototype was to show how one could build advanced
telecom applications and manage them through standardised interfaces.
The challenge here was that telecom applications have stringent
requirements, both on uptime and on latency. If the database access
took too long, the telecom applications would suffer from
abandoned calls and other problems. Obviously the uptime of the
database part had to be at least as high as the uptime of the
telecom switches. Actually even higher, since many telecom databases
are used by many telecom switches.

In the prototype setup NDB Cluster ran on 2 SPARC computers that were
interconnected using a low latency SCI interconnect from Dolphin. The
SPARC computers were also connected through Ethernet to the central
computer in the AXE switch via a regional processor. This demo was
developed in 1997 and 1998 and concluded successfully.

In 1999 a new development project started within a startup arm of
Ericsson. At the time the financial market was very hot, and instant
access to stock quotes was seen as a major business benefit
(it still is).

We worked together with a Swedish financial company, and together with
them we developed an application that had two interfaces towards
NDB Cluster. One was the feed from the stock exchange, through which
stock orders were fed into the database. This required low latency
writes into NDB Cluster and also very high update rates.

The second interface provided real-time stock quotes to users and
other financial applications. This version was a single-node
database service.

We delivered a prototype of this service that worked like a charm in
2001. At this point, however, the stock markets plunged, and the
financial markets were no longer the right place for a first NDB
Cluster application.

Thus we refocused the development of NDB Cluster back towards the
telecom market. This meant that we focused heavily on completing the work
on handling node failures of all sorts. We developed test programs that
ran thousands of node failures of all sorts every day. We worked with a
number of prospective customers in 2002 and 2003 and developed a number
of new versions of NDB Cluster.

The first customer to adopt NDB Cluster in a production environment
was Bredbandsbolaget. We worked together with them in 2003 and 2004 and
assisted them in developing their applications. Bredbandsbolaget was and
is an internet service provider, so the applications they used NDB
Cluster for were things like a DNS service, a DHCP service and so forth.

We worked closely with them; we even had offices in the same house and
on the same floor, so we interacted on a daily basis. This meant that
the application and NDB Cluster were developed together and were a
perfect fit for each other. This application is still operational and
has been since 2004. I even had Bredbandsbolaget as my own internet
service provider for 10 years. So I was not only developing NDB
Cluster, I was also one of its first users.

In 2003 NDB Cluster development was acquired by MySQL and we changed the
name to MySQL Cluster. Nowadays there are other clustering products within
the MySQL area, so to distinguish NDB Cluster I sometimes use
MySQL NDB Cluster and sometimes simply NDB Cluster. However the product
name is still MySQL Cluster.

After Bredbandsbolaget we got a number of new large customers in the telecom
area. Many of those telecom customers have used LDAP as the application
protocol to access their data since there was some standardisation in the
telecom sector around LDAP. To assist this there is a storage engine
to access NDB Cluster from OpenLDAP. One example of such a telecom
application is Juniper SBR Carrier System that has a combination of
SQL access, LDAP access, HTTP access and RADIUS access towards NDB Cluster.
NDB is used in this application as a Session State database.

All sorts of telecom applications remain a very important use case for
NDB Cluster. One interesting area of development in the telecom space is
5G and IoT, which will expand the telecom application space substantially,
into self-driving cars, smart cities and many more interesting
applications that require ultra high availability coupled with
high write scalability and predictable low latency access.

Coming back to financial applications, these remain an important use
case for NDB Cluster. High write scalability, ultra high availability
and predictable, low latency access to data are again the keywords that
drive the choice of NDB Cluster in this application area.

The financial markets also add one more dimension to the NDB use cases.
Given that NDB can handle large amounts of payments, payment checks,
white lists, black lists and so forth, it is also possible to use
the data in NDB Cluster for real-time analysis.

Thus NDB Cluster 8.0 has also focused significantly on delivering more
capabilities in the area of complex queries, and we have seen
many substantial improvements in this area.

More and more of our users work with standard SQL interfaces towards
NDB Cluster, and we have worked very hard to ensure that this provides
low latency access patterns. All the traditional interfaces towards
MySQL also work with NDB Cluster. Thus NDB can be accessed from
every programming language that can be used to access MySQL.

However, many financial applications are written in Java. From Java
we have an NDB API called ClusterJ. This API uses a Data Object Model
that makes it very easy to use. In many ways it can be easier to
work with ClusterJ than with SQL in object-oriented
applications.

The next application category to recognize that NDB Cluster was a
very good fit for them was the Computer Gaming industry. There are a
number of applications within Computer Gaming where NDB Cluster has
a good fit. User profile management is one area, where it is important
to always be up and running so that users can join and leave the
games at any time. Game state is another area that requires very high
write scalability. Most of these applications use the SQL interface,
and many use fairly complex SQL queries and thus benefit
greatly from our improvements to parallel queries in NDB Cluster 8.0.

An interesting application developed at the SICS research
institute in Stockholm is HopsFS. It implements a file system in the
Hadoop world based on Hadoop HDFS, and it scales to millions of
file operations per second.

This means that NDB Cluster 8.0 is already used in many important
AI applications as the platform for a distributed file system.

In NDB Cluster 8.0 we have improved write scalability even further,
also when the writes are large in volume. NDB Cluster 8.0
scales to updates measured in GBytes per second even in a 2-node
cluster, and in a larger cluster one can approach hundreds of
GBytes per second.

Thus NDB Cluster 8.0 is a very efficient tool to implement modern
key-value stores, distributed file systems and other highly
scalable applications.

NDB Cluster 8.0 is also a perfect tool for building many of the
modern applications that form the base of cloud applications. This is
one more active area of development for MySQL NDB Cluster.

Obviously, all sorts of web applications are also a good fit for NDB
Cluster. This is particularly true with the developments in NDB Cluster
7.6 and 8.0, where we improved the latency of simple queries and
implemented a shared memory transporter that makes it very
efficient to set up small clusters with low latency access to all data.

For web applications we also have a NodeJS API that can access
NDB Cluster directly without going through a MySQL Server.

In a keynote in 2015, GE showed some templates for how to set up
NDB Cluster in GE applications for the health-care industry. More on
architectures for NDB Cluster in a later blog.

Friday, February 21, 2020

Requirements on NDB Cluster 8.0

In this blog I am going to go through the most important requirements that
NDB Cluster 8.0 is based on. I am also going to list a number of consequences
these requirements have on the product and what it supports.

On slideshare.net I uploaded a presentation of the NDB Cluster 8.0
requirements. In this blog and several accompanying ones I am going to present
the reasoning these requirements led to in terms of software architecture, data
structures and so forth.

The requirements on NDB Cluster 8.0 are the following:

1) Unavailability of less than 30 seconds per year (Class 6 Availability)
2) Predictable latency
3) Transparent Distribution and Replication
4) Write and Read Scalability
5) Highest availability even with 2 replicas
6) Support SQL, LDAP, File System interface, ...
7) Mixed OLTP and OLAP for real-time data analysis
8) Follow HW development for CPUs, networks, disks and memories
9) Follow HW development in Cloud Setups

The original requirement on NDB Cluster was to support only Class 5
Availability. Telecom providers have continued to support ever higher numbers
of subscribers per telecom database, driving the requirement up to Class 6
Availability. NDB Cluster has a proven track record of more than 15 years of
handling Class 6 Availability.

The requirement on predictable latency means that we need to be able to handle
transactions involving around twenty operations within 10 milliseconds, even
when the cluster is working at a high load.

To make sure that application development is easy, we opted for a model where
distribution and replication are transparent to the application code. This
means that NDB Cluster is one of very few DBMSs that support auto-sharding.

High Write Scalability has been a major requirement in NDB from day one.
NDB Cluster can handle tens of millions of transactions per second; most
competing DBMS products based on replication protocols can only handle
tens of thousands of transactions per second.

We used an arbitration model to avoid the requirement of 3 replicas. With
NDB Cluster 8.0 we fully support 3 and 4 replicas as well, but even with 2
replicas we get the same availability that competing products based on
replication protocols require 3 replicas for.

The original requirements on NDB didn't include a SQL interface. An efficient
API was much more important for telecom applications. However, when meeting
customers of a DBMS it was obvious that an SQL interface was needed,
so this requirement was added in the early 2000s. Most early users of
NDB Cluster still opted for a more direct API, and this means that NDB Cluster
today has LDAP interfaces through OpenLDAP, a file system interface through
HopsFS and a lot of products that use the NDB API (C++), ClusterJ (Java) and
the NDB NodeJS API.

The model of development for NDB also makes it possible to handle complex
queries in an efficient manner. Thus in the development of NDB Cluster 8.0 we
added the requirement to better support OLAP use cases on the OLTP data stored
in NDB Cluster. We have already made very significant improvements in this
area by supporting parallelised filters and, to a great extent, parallelised
join processing in the NDB data nodes. This is an active development area for
the coming generations of NDB Cluster.

NDB Cluster started its development in the 1990s. Already then we
could foresee some of the HW development that was going to happen. The product
has been able to scale as HW has become more and more scalable. Today each
node in NDB Cluster can scale to 64 cores, data nodes can scale to 16 TB of
memory and at least 100 TB of disk data, and they benefit greatly from higher
and higher bandwidth on the network.

Finally, modern deployments often happen in cloud environments. Clouds are
based on an availability model with regions, availability domains and failure
domains. Thus the NDB Cluster software needs to make efficient use of
locality in the HW configurations.

Original NDB Cluster Requirements

NDB Cluster was originally developed for Network DataBases in the telecom
network. I worked in an EU project between 1991 and 1995 that focused on
a pre-standardisation effort on UMTS, which was later standardised
under the term 3G. I worked in a part of the project where we focused on
simulating the network traffic in such a 3G network, paying special
attention to the requirements this created on a network database
in the telecom network.

During the same time period I also dived deeply into the research literature
on DBMS implementation.

The following requirements from the 3G studies emerged as the most important:

1) Class 5 Availability (less than 5 minutes of unavailability per year)
2) High Write Scalability as well as High Read Scalability
3) Predictable latency down to milliseconds
4) Efficient API
5) Failover in crash scenarios within seconds or even subseconds with a real-time OS

In another blog, on the influences leading to the use of an asynchronous
programming model in NDB Cluster, we derive the following requirements on
the software architecture:

1) Fail-fast architecture (implemented through ndbrequire macro in NDB)
2) Asynchronous programming (provides much tracing information in crashes)
3) Highly modular SW architecture
4) JAM macros to track SW in crash events

In another blog I present the influences leading to NDB Cluster using a shared
nothing model.

One important requirement that NDB Cluster is fairly unique in addressing is
high write scalability. Most DBMSs solve this by grouping large numbers of
small transactions together to make commits more efficient. This means that
most DBMSs have a very high cost per committed transaction.

Modern replication protocols have actually made this even worse. As an
example, in most modern replication protocols all transactions have to commit
in a serial fashion. This means that commit handling is a major bottleneck in
many modern DBMSs, often limiting their transaction rates to tens of
thousands of commits per second.

NDB Cluster went another path and essentially commits every single row change
separately from any other row change. Thus the cost of executing 1000
transactions with 1000 operations per transaction is exactly the same as the
cost of executing 1 million single row transactions.

To achieve the grouping we used the fact that we are working in an
asynchronous environment. Thus we used several levels of piggybacking of
messages. One of the most important things here is that one socket is used to
transport many thousands of simultaneous database transactions. With NDB
Cluster 8.0.20 we use multiple sockets between data nodes, and this scales
another 10-20x to ensure that HW limitations are the bottleneck and not the
NDB software.

The asynchronous programming model ensures that we can handle thousands of
operations each millisecond and that changing from working on one transaction to
another is a matter of tens to hundreds of nanoseconds. In addition we can handle
these transactions independently in a number of different data nodes and even
within different threads within the same data node. Thus we can handle tens of
millions of transactions per second even within a single data node.

The protocol we used for this is a variant of the two-phase commit protocol with
some optimisations based on the linear two-phase commit protocol. However the
requirements on Class 5 Availability meant that we had to solve the blocking part
of the two-phase commit protocol. We solved this by recreating the state of the
failed transaction coordinators in a surviving node as part of failover handling.
This means that we are never blocked by a failure as long as there is still a
sufficient number of nodes to keep the cluster operational.