Monday, October 19, 2020

New things coming in MySQL Cluster 8.0.22

In version 8.0.22 we have added a number of important new features. The first one is
the capability to work with encrypted backups: you can create encrypted backups,
restore encrypted backups, and also encrypt backups that were originally created
without encryption.
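
As a rough sketch of how this is used (the exact option names should be verified against
the 8.0.22 reference manual; the backup id, path and password below are made up), an
encrypted backup can be taken from the management client and restored with ndb_restore:

#Take an encrypted backup from the ndb_mgm management client
ndb_mgm -e "START BACKUP ENCRYPT PASSWORD='MySecretPassword'"

#Restore it by giving ndb_restore the same password
ndb_restore --nodeid=1 --backupid=1 --backup-path=/ndb/BACKUP/BACKUP-1 \
            --decrypt --backup-password='MySecretPassword' --restore-data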


The second is that we have added support for IPv6 addresses. If it is necessary to mix
IPv4 and IPv6 addresses in a cluster, it is important to be careful and follow the rules
laid out in the documentation.


Performance of replication of BLOB columns has been significantly improved, and
similarly for SELECT statements that use BLOB columns. This has been achieved by
decreasing the number of round trips used to perform operations involving BLOB tables,
which significantly decreases the latency of operations involving BLOBs.


Also, in 8.0.21 an important new feature was added that made some complex queries
significantly faster: the ability to push antijoin and semijoin execution down to the
NDB data nodes. One example is Q22 in TPC-H, shown below. This query executes 3.5x
faster due to this change, and there are likely queries that will benefit even more
from it.

select

        cntrycode,

        count(*) as numcust,

        sum(c_acctbal) as totacctbal

from

        (

                select

                        substr(c_phone from 1 for 2) as cntrycode,

                        c_acctbal

                from

                        customer

                where

                        substr(c_phone from 1 for 2) in

                                ('13', '31', '23', '29', '30', '18', '17')

                        and c_acctbal > (

                                select

                                        avg(c_acctbal)

                                from

                                        customer

                                where

                                        c_acctbal > 0.00

                                        and substr(c_phone from 1 for 2) in

                                                ('13', '31', '23', '29', '30', '18', '17')

                        )

                        and not exists (

                                select

                                        *

                                from

                                        orders

                                where

                                        o_custkey = c_custkey

                        )

        ) as custsale

group by

        cntrycode

order by

        cntrycode;
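
To check whether a join was actually pushed down, the EXPLAIN output marks the
participating tables in its Extra column (to my recollection with text such as
"Child of ... in pushed join"), and the MySQL Server keeps counters for pushed joins.
A minimal sketch:

#Counters showing how many joins were defined and executed as pushed joins
SHOW GLOBAL STATUS LIKE 'Ndb_pushed%';

#Join pushdown is controlled by this system variable (ON by default)
SET SESSION ndb_join_pushdown = ON;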


Wednesday, October 07, 2020

DBT2 benchmarks with NDB Cluster

This blog post refers to the uploaded slides.


DBT2 is based on the standard TPC-C benchmark. This benchmark is a mix of read and
write activity and is much more write intensive than many other benchmarks. As a
result, NDB performance in previous versions only scaled to nodes with around 6 LDM
threads. With the introduction of multiple sockets for communication between nodes we
have been able to remove this limitation.


Thus DBT2 serves as a good tool to verify that MySQL Cluster 8.0.20 has improved

its capability to scale writes.


For these tests we had access to a set of bare metal servers in the Oracle Cloud (OCI).
The data nodes used DenseIO2 bare metal servers with 52 CPU cores, 768 GByte of memory
and 8 NVMe drives. We had 6 such servers, spread over 3 different availability domains.


The queries were executed by MySQL Servers; we used 15 bare metal servers with 36 CPU
cores each to run the MySQL Servers. These servers were spread such that 5 servers were
located in each availability domain.


The actual benchmark driver and client were executed on their own bare metal server.


DBT2 has five different transaction types. Each transaction is complex, consisting of
around 30-35 SQL queries on average. Only NewOrder transactions are counted in the
results, and these constitute about 45% of the transactions. This means that if the
report says 45,000 TPM (transactions per minute), about 100,000 transactions were
actually executed.


DBT2 doesn't follow the standard TPC-C since each terminal continues immediately,
without waiting, before it starts a new transaction. This makes it possible to use DBT2
to test an in-memory DBMS as well. Interestingly, with persistent memory a server can
have 6 TB of memory, which means that more than 50,000 warehouses can be used and thus
results beyond 500,000 TPM can be handled even with an in-memory setup.


Each SQL query is created in the DBT2 driver and sent to the DBT2 client, which sends
it on to a MySQL Server, which in turn accesses a set of NDB data nodes to perform the
actual query.


Since the servers are located in different availability domains the main cause of

latency comes from the communication between availability domains. When executing

DBT2 in a local network one can achieve latency down to 5-10 ms for a transaction

that contains around 30 SQL statements. In this environment we get latencies of

around 30 ms. Most of this time is spent communicating from the DBT2 client to the

MySQL Servers when they are located in a different availability domain.


NDB has a configuration option that ensures that any reads will always stay local

in the availability domain of the MySQL Server it was sent to. Updates need to

update all replicas and this might require communication over availability domains.
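
To the best of my understanding the option referred to is LocationDomainId, where each
data node, MySQL Server and management server is tagged with the availability domain it
runs in, so that reads prefer a replica in the same domain. A hedged config.ini sketch
with made-up node ids and host names:

[ndbd]
NodeId=1
HostName=datanode-ad1.example.com
LocationDomainId=1

[ndbd]
NodeId=2
HostName=datanode-ad2.example.com
LocationDomainId=2

[mysqld]
NodeId=51
HostName=mysqld-ad1.example.com
LocationDomainId=1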


We have selected 3 different configurations to execute DBT2 in.


The first configuration is shown in slide 5. It uses 2 data nodes, both located in the
same availability domain. They are however located in different failure domains and
thus only share the electricity network. We use 10 MySQL Servers from 2 different
availability domains.


The second configuration is a setup where we still have 2 replicas, but now with one
"shard" (node group) per availability domain; both replicas of a node group are thus
still in the same availability domain. This means that we use all 6 data node servers.
We also use all 15 MySQL Servers in 3 different availability domains.


In the third configuration we use 3 replicas, where the replicas are in different
availability domains. In this case we get 2 node groups using the 6 data nodes. There
are still 15 MySQL Servers in 3 different availability domains.


The first configuration reaches almost 1.4M TPM at 2560 connections. The second
configuration with 3 node groups scales almost linearly and reaches a bit beyond 4M TPM
with 12000 connections.


Finally the last configuration reaches around 2.5M TPM at 8400 connections.


Thus we can see that performance is fairly constant per node group. The latency

differs and the number of connections required to reach optimal throughput

differs a bit.


It is very clear that the multi-socket solution has improved the scalability of writes,
thus ensuring that NDB can reach very high numbers using DBT2.


In this benchmark we showed NDB scaling to more than 1000 CPUs across the MySQL Servers
and the NDB data nodes. MySQL NDB Cluster 8.0 can scale to 144 data nodes and thus can
scale far beyond those numbers.

TPC-H Improvements in NDB Cluster

This blog post is based on the uploaded slides.


We have focused a lot of energy on improving our capability to execute complex queries
in an efficient manner. We took the first step of these improvements in MySQL Cluster
7.6, but even more improvements have been made in MySQL Cluster 8.0.


There are four types of improvements that we have made. One is that we have added

the option to use a shared memory transporter between the MySQL Server and the

NDB data nodes when they are located on the same machine.
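
If I recall correctly, the shared memory transporter can also be configured through an
explicit [shm] section in config.ini for a colocated data node and MySQL Server; a
hedged sketch with made-up node ids (verify the section and parameter names against the
manual; buffer sizes and spin times are left at their defaults):

[shm]
NodeId1=1
NodeId2=51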


The second improvement is the adaptive CPU spinning described in blog post 1.


The third improvement is that we made Read Backup the default; this means that in
complex queries any replica can be used for reading.


Finally we have added a number of algorithmic improvements; these mainly focus on
enabling more parts of the queries to be pushed down into the NDB data nodes.
We described the improvements in 8.0.20 in blog post 2.


Slide 4 shows a 2-node setup optimised for optimal query latency.


In slide 5 we show the improvements in TPC-H comparing MySQL Cluster 8.0.20 vs
MySQL Cluster 7.6.12. 15 out of 22 queries have improved by more than 100%, and most of
those by a lot more than 100%. 3 queries have improved between 50% and 100%. Only 4
queries are unaffected by these improvements.


In slide 6 we see the latency improvements that come from colocating the MySQL Server
and the NDB data nodes using a shared memory transporter. 14 out of 22 queries improve
in this scenario, some of them by almost 100% and even above 100%.


Slide 7 shows the improvements from using LatencyOptimisedSpinning as SpinMethod, as
described in blog post 3. 13 out of 22 queries see an improvement in this scenario,
most of them around 15-20% and a few improving by up to 60%.


Slide 8 shows the sum of the improvements of colocation and CPU spinning.


Slide 9 shows that some queries can benefit from more LDMs, but some also get a
negative impact from this. Thus we conclude that there is a limit to how many fragments
we should split tables into for optimal latency. This is an important thing to consider
in any distributed DBMS.

Tuesday, October 06, 2020

YCSB Disk Data Benchmark with NDB Cluster

As mentioned in blog post 1 we have improved the capability to handle very large write
bandwidth, and as described in blog post 2 we improved the checkpointing of disk data
columns, both in MySQL Cluster 8.0.20.


We wanted to verify that these changes were successful. To do this we selected the YCSB
benchmark. This is a very simple benchmark; it contains one table that in our variant
has 2 columns. The first column is the primary key and the second column contains the
payload data, stored in a VARCHAR(29500) column that resides on disk.
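
A minimal sketch of such a table (the table and column names are made up here, the
tablespace ts1 is created as shown in the NVMe setup blog further down, and a
single-byte character set keeps the VARCHAR(29500) within MySQL's row-size limit):

CREATE TABLE usertable (
  ycsb_key VARCHAR(255) NOT NULL PRIMARY KEY,
  payload VARCHAR(29500) STORAGE DISK
) ENGINE=NDB TABLESPACE ts1 STORAGE DISK DEFAULT CHARSET=latin1;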


The first part of the benchmark fills the database. The database is mostly stored in a
tablespace that is set up in accordance with blog post 3. This means that we had a
tablespace size of 20 TBytes. We loaded 600M rows, thus creating a database size of
18 TByte.


The load phase inserted 44,500 rows per second. This means that we loaded 1.25 GByte
per second into the database. The bottleneck in both the load phase and the benchmark
run phase was mainly the NVMe drives, but in some cases the 25G Ethernet also became
the bottleneck. The CPUs were never loaded more than 20% and thus never became a
bottleneck.


From this we can conclude that the setup and use of the NVMe drives

is the most important part of achieving extreme write rates for

use cases where NDB Cluster is used for file write loads.


The cluster setup used 2 data nodes; each data node was a bare metal server in the
Oracle Cloud (OCI) with 8 NVMe drives (2 used for logs and checkpoints and 6 used for
the tablespace). The servers had 52 CPU cores each. Instead of setting up a RAID on the
6 NVMe drives we opted for one file system per NVMe drive and added one data file per
NVMe drive to NDB. This meant that NDB handled the spread of writes over the different
data files, so no complex RAID solution was required. However, to get the best possible
performance it was necessary to use SSD overprovisioning.


The disk, CPU and network usage during this benchmark can be found in blog post 4. The
mean latency of the transactions was a bit more than 2 milliseconds, where reads took a
bit more than 1 ms and writes around 4-5 milliseconds.


The actual benchmark consisted of 50% reads and 50% writes. Here we achieved almost
70,000 transactions per second. This meant that we read 1 GByte per second in parallel
with writing 1 GByte per second.

Parallel execution of Outer/Semi joins in NDB Cluster

In MySQL Cluster 8.0.20 we added one more query type that can be pushed

down to NDB data nodes. These queries are outer join queries and semi-join

queries using the FirstMatch strategy.


These queries can be pushed if conditions can be pushed on those parts of

the query that involve the outer join using a concept in the MySQL Server

called join nests.


Pushed queries means that the join execution is pushed down to the NDB data nodes. This
means that the rows to process for the second table in the join are sent directly from
the LDM threads to the join processor that will start the select on the second table.
The join can contain up to 32 tables. Many filters can be pushed as well. It is also
possible to push parts of a join query. We are actively working on supporting more and
more query variants for push down to NDB data nodes.


The main benefit of pushing the queries down to the data nodes is that it enables the
queries to be parallelised, since both the accesses to the data owning parts (LDM
threads) and the join processors (tc threads) can execute in parallel on multiple nodes
and on multiple threads in each node. This means that queries can return their results
much quicker.


Some queries can become more than 50 times faster using this approach

compared to when queries are executed entirely in the MySQL Server.


In our development of these new pushdown join features we measure the

result of executing queries in TPC-H. In TPC-H there is one outer join

query that looks like this:


select

        c_count,

        count(*) as custdist

from

        (

                select

                        c_custkey,

                        count(o_orderkey) as c_count

                from

                        customer left outer join orders on

                                c_custkey = o_custkey

                                and o_comment not like '%special%requests%'

                group by

                        c_custkey

        ) as c_orders

group by

        c_count

order by

        custdist desc,

        c_count desc;


This query benefits greatly from the push of outer joins, from the new CPU spinning and
from using the shared memory transporter between the data node and the MySQL Server.
Combined, these 3 things improve the latency of this query (Q13) in TPC-H by a factor
of 10x when compared to executing it in MySQL Cluster 7.6.


Another query that benefits greatly from these changes is Q4 in TPC-H.


select

        o_orderpriority,

        count(*) as order_count

from

        orders

where

        o_orderdate >=  '1993-07-01'

        and o_orderdate < date_add( '1993-07-01', interval '3' month)

        and exists (

                select

                        *

                from

                        lineitem

                where

                        l_orderkey = o_orderkey

                        and l_commitdate < l_receiptdate

        )

group by

        o_orderpriority

order by

        o_orderpriority;


This query executes more than 50 times faster in MySQL Cluster 8.0.20 compared to
MySQL Cluster 7.6, and this benefit comes strictly from changing the execution
algorithm.

Adaptive CPU spinning in NDB Cluster

CPU spinning is something that has been around in NDB for a long time. This CPU
spinning has been static, which means that each time a thread wakes up it will spin
waiting for incoming signals. Thus if the spin time is set to e.g. 500 microseconds we
will spin 5% of the time even when there is no activity, since each thread wakes up
every 10 ms to perform timer based activities.


In the new spinning approach we wanted to achieve the following.


- Power efficient CPU spinning
  NDB Cluster data nodes can execute queries very efficiently, which leads to efficient
  power usage. However, CPU spinning requires careful control to ensure that one
  doesn't waste power while spinning.
- No spinning unless there is something to gain from it
  Static spinning will always spin, independent of whether it makes sense. With the new
  approach we gather statistics about the usefulness of spinning. This means that when
  the thread wakes up again we collect statistics on how long it slept. With this
  information we are able to calculate the gains from spinning. We also calculate the
  cost of going to sleep and waking up again (this cost is usually higher in virtual
  machines compared to bare metal servers).


One important part of CPU spinning is what we do while spinning. In x86 and ARM CPUs
there exist CPU instructions for spinning. While spinning we will spend around one
microsecond in the spinning state before checking for input again. Going to sleep and
waking up again can consume between 10 and 50 microseconds dependent on the OS, the
hardware and whether we are running in a VM, in Docker and so forth. These special
assembler instructions ensure that the CPU doesn't use so much power, and on
hyperthreaded CPUs the other hyperthread(s) get almost exclusive access to the CPU core
while spinning is performed.


The adaptive part of our new CPU spinning is that we keep track of the workload and its
ability to benefit from CPU spinning. We do so through a number of configuration
parameters. However, for ease of use there are only 4 variants that one can configure.


The configuration parameter SpinMethod is used to set the variant of CPU spinning to
use. The following settings are available (a configuration sketch follows the list):


- CostBasedSpinning
This is the default setting. It means that we only spin when we actually benefit from
it: if we spend 10 microseconds spinning we gain 10 microseconds from spinning by
avoiding the cost of going to sleep and waking up. This is a conservative setting that
ensures good power efficiency and avoids issues in situations where the data nodes
execute in a shared environment.


- LatencyOptimisedSpinning
This setting is a very good trade off between best latency and best power efficiency.
Here we spin if 10 microseconds of spinning can gain 1-2 microseconds of avoided costs
of going to sleep and waking up. This is recommended in cases where the data nodes
execute in an exclusive environment, e.g. in their own virtual machine, Docker
container or even exclusively using the entire server. This provides a very distinct
advantage in query latency, as we will show in a later blog.


- DatabaseMachineSpinning
This setting provides the optimal latency at the expense of decreased power efficiency.
It is intended for use cases where latency is of utmost importance and one is willing
to sacrifice some extra power to achieve this.


- StaticSpinning
This disables the adaptive spinning and makes NDB Cluster backwards compatible.
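
As mentioned above, a minimal config.ini sketch that selects the latency-optimised
variant for all data nodes (the replica count is only there to make the fragment
complete):

[ndbd default]
NoOfReplicas=2
SpinMethod=LatencyOptimisedSpinning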

Adaptive checkpoints of Disk Data columns in NDB Cluster

In MySQL Cluster 8.0.20 we made it possible to write to disk data columns

at considerably higher write rates than before. In MySQL Cluster 7.6 we

introduced adaptive checkpointing of in-memory rows to ensure that restarts

can be faster.


In order to handle very high write rates (up to more than 1 GByte per second)

on disk data columns in NDB it is necessary to control write rates more

efficiently.


To get the most efficient use of disk bandwidth it is important to ensure that

write rates are kept fairly constant such that we constantly make use of the

disk bandwidth. If the load on the disks goes up and down too fast we will not

be able to make use of all the available bandwidth in the disks.


With modern NVMe devices found in e.g. the Oracle Cloud one can achieve 10-20 GBytes
per second of reads and writes to a set of 8 NVMe devices using specialised disk
benchmarks. Achieving this with a DBMS is not as easy, but we were able to see NDB use
7 GBytes per second of reads and writes in some YCSB benchmarks.


This represents one more step in the direction of adaptive algorithms in NDB

Cluster to ensure that users always will get the optimal behaviour of their

hardware when using NDB.


Some previous steps are:

- Adaptive send assistance
  This feature, introduced in 7.5, made it possible for all threads to assist the send
  threads in sending messages on sockets. The main input to the adaptive algorithm is
  the amount of CPU time spent in each thread on normal execution, sleeping and
  assisting the send threads.
- Adaptive in-memory checkpoint speed
  This feature adapts the speed of checkpoints based on the amount of write activity
  in NDB Cluster.
- Adaptive CPU spinning
  This was introduced in 8.0.20 and is presented in a separate blog.

Multi-socket feature in NDB Cluster

In MySQL Cluster 8.0.20 a new feature was added that enables higher write rates in NDB
Cluster. In previous releases the communication between nodes in the same node group
was handled by one socket. This single socket was able to handle about 200-300 MBytes
per second of write traffic between the replicas inside one node group. Thus in
clusters with many node groups the total write rate can be much higher.


For most applications this doesn't represent any problem at all. However, there are
certain applications that require vast amounts of writes.


One such application is when NDB is used to store file data. File writes can easily

go beyond 200-300 MBytes per second. Another application is when NDB is used to

store data from measurements, stock data or other applications that require millions of

writes per second.


The solution to this problem is to simply use multiple sockets when communicating

between nodes in the same cluster. This removes the bottleneck on write rates

between nodes in the same node group. The solution is applied by default since we

haven't seen any negative effects of using it. The number of sockets will be

derived from the number of LDM threads used in the data nodes.
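
For completeness, my understanding is that the number of transporters between nodes in
the same node group can also be set explicitly through the NodeGroupTransporters data
node parameter added in 8.0.20; treat the exact name and default as assumptions to
check against the manual:

[ndbd default]
#Number of sockets between data nodes in the same node group;
#0 (the default) lets NDB derive the number from the thread configuration
NodeGroupTransporters=0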


We have verified the feature through the use of YCSB where the data is stored in

a disk data column. This is a good representation of writes of large amounts of

file data to NDB.


The second type of verification is through DBT2, an open source variant of TPC-C

where each benchmark thread starts a new operation immediately after completing

the previous transaction. This benchmark was previously limited by the single

socket such that scaling beyond 6 LDM threads didn't improve numbers at all.

With this change we can see scaling to full bare metal servers in the Oracle Cloud

with 52 CPU cores.


Reports of those benchmarks will be in separate blogs.

Wednesday, August 12, 2020

Setting up NVMe drives on Oracle Cloud for NDB Cluster

In a blog post from the 16th of January I showed some graphs of CPU usage, network
usage and disk usage. The benchmark that was running was a modified variant of YCSB
(Yahoo Cloud Serving Benchmark), based on version 0.15.0 of YCSB.


In this blog post I will describe the setup of the NVMe drives for this

benchmark using DenseIO machines in the Oracle Cloud.


Oracle Cloud has a variety of machines available. In this benchmark we wanted

to show NDB with a database size of around 20 TByte of user data in a replicated

setup.


There are numerous ways to connect disk drives to Oracle Cloud machines. One way is to
use block storage. In this case the actual storage is on separate storage servers and
all disk traffic goes over the network. The other option is to use DenseIO machines
that have 8 NVMe drives with a total of 51.2 TByte of disk space.


In this benchmark we opted for the DenseIO machines given that they can handle more
than twice the disk load compared to block storage. Block storage is limited by the
25 Gb per second Ethernet connection, and in addition the Oracle Cloud has limits on
the number of IOPS that are allowed for block storage. So in our case the DenseIO
machines were the obvious candidate.


We used the Bare Metal variant called BM.DenseIO2.52.


This DenseIO machine has the following HW setup. It has 2 CPU sockets that each contain
an Intel Xeon Platinum 8167 with 26 CPU cores, for a total of 52 CPU cores with 104
CPUs (my terminology is that CPUs are contained in CPU cores that are contained in CPU
sockets that are contained in a computer/server). It is equipped with 768 GB of memory.
In addition it has 8 NVMe drives, each with 6.4 TByte of disk space. Each such drive is
capable of achieving around 1-2 GB/sec read/write speed dependent on the workload.


An NDB data node requires disks for the following things:

  1. Checkpoints of in-memory data
  2. REDO log for in-memory data and disk data records
  3. UNDO log for disk data records
  4. Tablespaces for disk data records


The first 3 use cases are all write-only in normal operation and read-only in recovery.
Thus they mainly require bandwidth for sequential writes of sizes that make optimal use
of the disk bandwidth. The in-memory data is in this case a very small part of the disk
load.


We strive to reach beyond 1 GByte per second insert and update rates for each data
node. For insert loads the main load comes from checkpointing the tablespaces for disk
data records and writing the REDO log for disk data records.


For update loads there will also be a fair amount of UNDO log records to write. Thus

the worst case for the first three parts is when performing updates since then we have

to write the record to both the REDO log and to the UNDO log.


So now we come to the point of how to set up the 8 NVMe drives. OCI provides those 8
devices as bare devices, so we need to decide how to set them up to get a proper file
system.


One approach would be to simply create one file system with all 8 drives. Given that
the first 3 use cases are completely sequential in nature and the tablespace is making
lots of small reads and writes, we opted to split at least into 2 file systems.


At most the sequential part will generate 2-3 GByte per second of disk writes and

thus 2 of the 8 NVMe drives will handle this nicely.


The commands to create this file system are the following:

#Make sure the mdadm tool is installed

sudo yum install mdadm -y


#Create a RAID0 device using the last 2 NVMe devices with chunk size of 256 kByte since

#most writes are fairly large sequential writes

sudo mdadm --create /dev/md0 --chunk=256 --raid-devices=2 \

           --level=raid0 /dev/nvme6n1 /dev/nvme7n1


#Create an XFS file system on this device

#Block size is 4kB; these are NVMe devices, so there is no reason to use a smaller block size
#The stripe unit is 256 kByte, aligned to the chunk size; there are many parallel writes
#going on, so no need to split individual writes further
#With 2 disks our stripe width is 2 * 256 kByte, thus 512 kByte

sudo mkfs.xfs -b size=4096 -d sunit=512,swidth=1024 /dev/md0


#Use /ndb as the directory for the sequential file system

sudo mkdir -p /ndb


#Mount the file system on this directory, no need to keep times on files and directories

sudo mount /dev/md0 /ndb -o noatime,nodiratime


#Ensure that the Oracle Cloud user and group opc owns this directory

sudo chown opc /ndb

sudo chgrp opc /ndb


Now we have a directory /ndb that we can use for the NDB data directory.


The next step is to setup the file system for the tablespace data for disk records.

One variant to do this is to follow the above scheme with a smaller chunk size and

stripe unit.


This was also what I did as my first attempt. This had some issues. It created a file
system of 6 * 6.4 TByte, and I created a tablespace of around 35 TByte. Everything went
well while loading the data, and it continued to run well during the benchmark for a
few minutes. But after some time of running the benchmark, performance dropped to half.


The problem is that NVMe devices and SSD devices need some free space to prepare

for new writes. At the disk write speeds NDB achieves in this benchmark,

the NVMe devices simply couldn't keep up with preparing free space to write into.

Thus when the file system was close to full the performance started dropping.

It stabilised at around half the workload it could handle with a non-full tablespace.


So what to do? Google as usual found the answer. The solution was to use the tool
parted to ensure that 40% of the disk space is not usable by the file system and thus
always available for new writes. This gave the NVMe devices a sufficient amount of time
to prepare new writes, even in benchmarks that ran for many hours with a consistent
load of many GBytes per second of disk writes.


Obviously the more disk space that is removed from file system usage, the better disk
bandwidth one gets, but there is also less disk space available for the application. In
this case I was running a benchmark and wanted optimal performance, and thus used 60%
of the disk space for the file system.


Using the full disk space cut performance in half; probably most of the performance
would also be available with 80% used for the file system.


Here I also decided to skip using RAID over the devices. Instead I created 6 different
file systems. This works well in MySQL Cluster 8.0.20, where we spread the use of the
different tablespace files on a round robin basis.


So here are the commands to create those 6 file systems.

#Use the parted tool to only allow 60% of the usage for the file system

sudo parted -a opt --script /dev/nvme0n1 mklabel gpt mkpart primary 0% 60%

sudo parted -a opt --script /dev/nvme1n1 mklabel gpt mkpart primary 0% 60%

sudo parted -a opt --script /dev/nvme2n1 mklabel gpt mkpart primary 0% 60%

sudo parted -a opt --script /dev/nvme3n1 mklabel gpt mkpart primary 0% 60%

sudo parted -a opt --script /dev/nvme4n1 mklabel gpt mkpart primary 0% 60%

sudo parted -a opt --script /dev/nvme5n1 mklabel gpt mkpart primary 0% 60%


#Create 6 file systems with each 4kB blocks and stripe size 256 kByte

sudo mkfs.xfs -b size=4096 -d sunit=512,swidth=512 /dev/nvme0n1p1

sudo mkfs.xfs -b size=4096 -d sunit=512,swidth=512 /dev/nvme1n1p1

sudo mkfs.xfs -b size=4096 -d sunit=512,swidth=512 /dev/nvme2n1p1

sudo mkfs.xfs -b size=4096 -d sunit=512,swidth=512 /dev/nvme3n1p1

sudo mkfs.xfs -b size=4096 -d sunit=512,swidth=512 /dev/nvme4n1p1

sudo mkfs.xfs -b size=4096 -d sunit=512,swidth=512 /dev/nvme5n1p1


#Create the 6 directories for the tablespace files

sudo mkdir -p /ndb_data1

sudo mkdir -p /ndb_data2

sudo mkdir -p /ndb_data3

sudo mkdir -p /ndb_data4

sudo mkdir -p /ndb_data5

sudo mkdir -p /ndb_data6


#Mount those 6 file systems

sudo mount /dev/nvme0n1p1 /ndb_data1 -o noatime,nodiratime

sudo mount /dev/nvme1n1p1 /ndb_data2 -o noatime,nodiratime

sudo mount /dev/nvme2n1p1 /ndb_data3 -o noatime,nodiratime

sudo mount /dev/nvme3n1p1 /ndb_data4 -o noatime,nodiratime

sudo mount /dev/nvme4n1p1 /ndb_data5 -o noatime,nodiratime

sudo mount /dev/nvme5n1p1 /ndb_data6 -o noatime,nodiratime


#Move ownership of the file systems to the Oracle Cloud user

sudo chown opc /ndb_data1

sudo chgrp opc /ndb_data1

sudo chown opc /ndb_data2

sudo chgrp opc /ndb_data2

sudo chown opc /ndb_data3

sudo chgrp opc /ndb_data3

sudo chown opc /ndb_data4

sudo chgrp opc /ndb_data4

sudo chown opc /ndb_data5

sudo chgrp opc /ndb_data5

sudo chown opc /ndb_data6
sudo chgrp opc /ndb_data6


When NDB has been started, the following commands are used to create the tablespace on
these file systems. These commands will take some time since NDB initialises the files
to ensure that the disk space is really allocated, so that we avoid running out of disk
space later.


#First create the UNDO log file group

#We set the size to 512G; we have plenty of disk space for logs, so no need
#to use a very small file. We allocate 4 GByte of memory for the UNDO
#log buffer; the machine has 768 GByte of memory, so memory
#is abundant and there is no need to save on UNDO buffer memory.

CREATE LOGFILE GROUP lg1

ADD UNDOFILE 'undofile.dat'

INITIAL_SIZE 512G

UNDO_BUFFER_SIZE 4G

ENGINE NDB;


#Next create the tablespace and the first data file

#We set the size to more than 3 TByte of usable space

CREATE TABLESPACE ts1

ADD DATAFILE '/ndb_data1/datafile.dat'

USE LOGFILE GROUP lg1

INITIAL_SIZE 3200G

ENGINE NDB;


#Now add the remaining 5 data files

#Each of the same size

ALTER TABLESPACE ts1

ADD DATAFILE '/ndb_data2/datafile.dat'

INITIAL_SIZE 3200G

ENGINE NDB;


ALTER TABLESPACE ts1

ADD DATAFILE '/ndb_data3/datafile.dat'

INITIAL_SIZE 3200G

ENGINE NDB;


ALTER TABLESPACE ts1

ADD DATAFILE '/ndb_data4/datafile.dat'

INITIAL_SIZE 3200G

ENGINE NDB;


ALTER TABLESPACE ts1

ADD DATAFILE '/ndb_data5/datafile.dat'

INITIAL_SIZE 3200G

ENGINE NDB;


ALTER TABLESPACE ts1

ADD DATAFILE '/ndb_data6/datafile.dat'

INITIAL_SIZE 3200G

ENGINE NDB;


In the next blog post we will show how to write the configuration file for this

NDB Cluster.


This blog post showed a highly optimised setup for a key-value store that can store
20 TB of data in a 2-node setup and handle more than 1 GByte per second insert rate and
around 1 GByte per second update rate.


From the graphs in the blog post one can see that performance is a function of the

performance of the NVMe drives and the available network bandwidth.

The CPU usage is never more than 20% of the available CPU power.

Tuesday, February 25, 2020

Use Cases for MySQL NDB Cluster 8.0

In this blog I will go through a number of popular applications that use
NDB Cluster 8.0 and also how these applications have developed over the
years.

There is a presentation at slideshare.net accompanying this blog.

The first major NDB development project was to build a prototype of a
number portability application together with a Swedish telecom provider.
The aim of this prototype was to show how one could build advanced
telecom applications and manage them through standardised interfaces.
The challenge here was that telecom applications have stringent
requirements, both on uptime and on latency. If the database access
took too long the telecom applications would suffer from
abandoned calls and other problems. Obviously the uptime of the
database part had to be at least as high as the uptime of the
telecom switches. Actually even higher, since many telecom databases
are used by many telecom switches.

In the prototype setup NDB Cluster was running on 2 SPARC computers
that were interconnected using a low latency SCI interconnect from
Dolphin. The SPARC computers were also connected to the AXE switch
through Ethernet, which connected to the central computer in the
AXE switch through a regional processor. This demo was developed
in 1997 and 1998 and concluded with a successful demo.

In 1999 a new development project started up within a startup arm
of Ericsson. In 1999 the financial market was very hot and to have
instant access to stock quotes was seen as a major business benefit
(it still is).

We worked together with a Swedish financial company and together with
them we developed an application that had two interfaces towards
NDB Cluster. One was the feed from the stock exchange where stock
orders were fed into the database. This required low latency writes into
NDB Cluster and also very high update rates.

The second interface provided real-time stock quotes to users and
other financial applications. This version was a single-node
database service.

We delivered a prototype of this service that worked like a charm in
2001. At this point however the stock markets plunged, and the financial
markets were no longer the right place for a first application of NDB
Cluster.

Thus we refocused the development of NDB Cluster back towards the
telecom market. This meant that we focused heavily on completing the work
on handling node failures of all sorts. We developed test programs that
ran thousands of node failures of all sorts every day. We worked with a
number of prospective customers in 2002 and 2003 and developed a number
of new versions of NDB Cluster.

The first customer that adopted NDB Cluster in a production environment
was Bredbandsbolaget; we worked together with them in 2003 and 2004 and
assisted them in developing their applications. Bredbandsbolaget was and
is an internet service provider, so the applications they used NDB
Cluster for were things like a DNS service, a DHCP service and so forth.

We worked closely with them, we even had offices in the same house and on
the same floor, so we interacted on a daily basis. This meant that the
application and NDB Cluster were developed together and fit each other
perfectly. This application is still operational and has been
so since 2004. I even had Bredbandsbolaget as my own internet service
provider for 10 years. So I was not only developing NDB Cluster, I was
also one of its first users.

In 2003 NDB Cluster development was acquired by MySQL and we changed the
name to MySQL Cluster. Nowadays there are other clustering products within
the MySQL area, so to distinguish NDB Cluster I sometimes use
MySQL NDB Cluster and sometimes simply NDB Cluster. However the product
name is still MySQL Cluster.

After Bredbandsbolaget we got a number of new large customers in the telecom
area. Many of those telecom customers have used LDAP as the application
protocol to access their data, since there was some standardisation in the
telecom sector around LDAP. To assist this there is a storage engine
to access NDB Cluster from OpenLDAP. One example of such a telecom
application is the Juniper SBR Carrier System, which combines
SQL access, LDAP access, HTTP access and RADIUS access towards NDB Cluster.
NDB is used in this application as a Session State database.

All sorts of telecom applications remain a very important use case for
NDB Cluster. One interesting area of development in the telecom space is
5G and IoT, which will expand the application space for telecom substantially
and will also expand into self-driving cars, smart cities and many more
interesting applications that require ultra high availability coupled with
high write scalability and predictable low latency access.

Coming back to financial applications, this remains an important use case
for NDB Cluster. High write scalability, ultra high availability and
predictable low latency access to data are again the keywords that
drive the choice of NDB Cluster in this application area.

The financial markets also add one more dimension to the NDB use cases.
Given that NDB can handle large amounts of payments, payment checks,
white lists, black lists and so forth, it is also possible to use
the data in NDB Cluster for real-time analysis.

Thus NDB Cluster 8.0 has also focused significantly on delivering more
capabilities in the area of complex queries. We have seen
many substantial improvements in this area.

More and more of our users work with standard SQL interfaces towards
NDB Cluster, and we have worked very hard on ensuring that this provides
low latency access patterns. All the traditional interfaces towards
MySQL also work with NDB Cluster. Thus NDB can be accessed from
all programming languages that can be used to access MySQL.

However, many financial applications are written in Java. From Java
we have an NDB API called ClusterJ. This API uses a Data Object Model
that makes it very easy to use. In many ways it can be easier to
work with ClusterJ compared to working with SQL in object-oriented
applications.

The next application category that recognized that NDB Cluster had a
very good fit for them was the computer gaming industry. There are a
number of applications within computer gaming where NDB Cluster has
a good fit. User profile management is one area where it is important
to always be up and running such that users can join and leave the
games at any time. Game state is another area that requires very high
write scalability. Most of these applications use the SQL interface,
and many applications use fairly complex SQL queries and thus benefit
greatly from our improvements of parallel queries in NDB Cluster 8.0.

An interesting application that was developed at the SICS research
institute in Stockholm is HopsFS. This implements a file system in the
Hadoop world based on Hadoop HDFS. It scales to millions of
file operations per second.

This means that NDB Cluster 8.0 is already used in many important
AI applications as the platform for a distributed file system.

In NDB Cluster 8.0 we have improved write scalability even further, also
when the writes are large in volume. NDB Cluster 8.0
scales to updates measured in GBytes per second even in a 2-node
cluster, and in a larger cluster one can reach almost hundreds of
GBytes per second.

Thus NDB Cluster 8.0 is a very efficient tool to implement modern
key-value stores, distributed file systems and other highly
scalable applications.

NDB Cluster 8.0 is also a perfect tool to use in building many of the
modern applications that are the base of cloud applications. This is
one more active area of development for MySQL NDB Cluster.

Obviously all sorts of web applications are also a good fit for NDB
Cluster. This is particularly true with the developments in NDB Cluster
7.6 and 8.0, where we improved the latency of simple queries and
implemented a shared memory transporter that makes it very
efficient to set up small clusters with low latency access to all data.

For web applications we also have a NodeJS API that can access
NDB Cluster directly without going through a MySQL Server.

In a keynote in 2015 GE showed some templates for how to setup
NDB Cluster in GE applications for the health-care industry. More on
architectures for NDB Cluster in a later blog.

Friday, February 21, 2020

Requirements on NDB Cluster 8.0

In this blog I am going to go through the most important requirements that
NDB Cluster 8.0 is based on. I am also going to list a number of consequences
these requirements have on the product and what it supports.

On slideshare.net I uploaded a presentation of the NDB Cluster 8.0
requirements. In this blog and several accompanying blogs I am going to present the
reasoning that these requirements led to in terms of software architecture, data
structures and so forth.

The requirements on NDB Cluster 8.0 are the following:

1) Unavailability of less than 30 seconds per year (Class 6 Availability)
2) Predictable latency
3) Transparent Distribution and Replication
4) Write and Read Scalability
5) Highest availability even with 2 replicas
6) Support SQL, LDAP, File System interface, ...
7) Mixed OLTP and OLAP for real-time data analysis
8) Follow HW development for CPUs, networks, disks and memories
9) Follow HW development in Cloud Setups

The original requirement on NDB Cluster was to only support Class 5 Availability.
Telecom providers have continued supporting ever higher numbers of subscribers per
telecom database, thus driving the requirement up to Class 6 Availability.
NDB Cluster has a more than 15 year proven track record of handling Class 6
Availability.

The requirements on predictable latency means that we need to be able to handle
transactions involving around twenty operations within 10 milliseconds even when
the cluster is working at a high load.

To make sure that application development is easy we opted for a model where
distribution and replication are transparent to the application code. This means that
NDB Cluster is one of very few DBMSs that support auto-sharding.

High Write Scalability has been a major requirement in NDB from day one.
NDB Cluster can handle tens of million transactions per second, most competing
DBMS products that are based on replication protocols can only handle
tens of thousands of transactions per second.

We used an arbitration model to avoid the requirement of 3 replicas; with
NDB Cluster 8.0 we fully support 3 and 4 replicas as well, but even with 2 replicas
we get the same availability for which competing products based on replication
protocols require 3 replicas.

The original requirements on NDB didn't include a SQL interface. An efficient
API was much more important for telecom applications. However, when meeting
customers of a DBMS it was obvious that an SQL interface was needed,
so this requirement was added in the early 2000s. However, most early users of
NDB Cluster still opted for a more direct API, and this means that NDB Cluster
today has LDAP interfaces through OpenLDAP, a file system interface through
HopsFS and a lot of products that use the NDB API (C++), ClusterJ (Java) and
the NDB NodeJS API.

The model of development for NDB makes it possible to also handle complex queries
in an efficient manner. Thus in development of NDB Cluster 8.0 we added the
requirement to better support also OLAP use cases of the OLTP data that is stored in
NDB Cluster. We have already made very significant improvements in this area by
supporting parallelised filters and to a great extent parallelisation of join processing
in the NDB Data Nodes. This is an active development area for the coming
generations of NDB Cluster.

NDB Cluster started its development in the 1990s. Already in this development we
could foresee some of the HW development that was going to happen. The product
has been able to scale as HW has become more and more scalable. Today this means that
each node in NDB Cluster can scale to 64 cores, data nodes can scale to 16 TB of
memory and at least 100 TB of disk data, and can benefit greatly from higher and
higher bandwidth on the network.

Finally modern deployments often happen in cloud environments. Clouds are based
on an availability model with regions, availability domains and failure domains.
Thus NDB Cluster software needs to make it possible to make efficient use of
locality in the HW configurations.

Original NDB Cluster Requirements

NDB Cluster was originally developed for Network DataBases in the telecom
network. I worked in an EU project between 1991 and 1995 that focused on
developing a pre-standardisation effort on UMTS, which later became standardised
under the term 3G. I worked in a part of the project where we focused on
simulating the network traffic in such a 3G network. I was focusing my attention
especially on the requirements this created on a network database
in the telecom network.

In the same time period I also dived deeply into the research literature about DBMS
implementation.

The following requirements from the 3G studies emerged as the most important:

1) Class 5 Availability (less than 5 minutes of unavailability per year)
2) High Write Scalability as well as High Read Scalability
3) Predictable latency down to milliseconds
4) Efficient API
5) Failover in crash scenarios within seconds or even subseconds with a real-time OS

In another blog on the influences leading to the use of an asynchronous programming
model in NDB Cluster we derive the following requirements on the software
architecture.

1) Fail-fast architecture (implemented through ndbrequire macro in NDB)
2) Asynchronous programming (provides much tracing information in crashes)
3) Highly modular SW architecture
4) JAM macros to track SW in crash events

In another blog I present the influences leading to NDB Cluster using a shared
nothing model.

One important requirement that NDB Cluster is fairly unique in addressing is high
write scalability. Most DBMSs solves this by grouping together large amounts of
small transactions to make commits more efficient. This means that most DBMSs
have a very high cost of committing a transaction.

Modern replication protocols have actually made this even worse. As an example, in
most modern replication protocols all transactions have to commit in a serial fashion.
This means that commit handling is a major bottleneck in many modern DBMSs.
Often this limits their transaction rates to tens of thousands of commits per second.

NDB Cluster went another path and essentially commits every single row change
separately from any other row change. Thus the cost of executing 1000 transactions
with 1000 operations per transaction is exactly the same as the cost of executing
1 million single row transactions.

To achieve the grouping we used the fact that we are working in an asynchronous
environment. Thus we used several levels of piggybacking of messages. One of the
most important things here is that one socket is used to transport many thousands of
simultaneous database transactions. With NDB Cluster 8.0.20 we use multiple sockets
between data nodes, and this scales another 10-20x to ensure that HW limitations are
the bottleneck and not the NDB software.

The asynchronous programming model ensures that we can handle thousands of
operations each millisecond and that switching from working on one transaction to
another is a matter of tens to hundreds of nanoseconds. In addition we can handle
these transactions independently in a number of different data nodes and even
within different threads within the same data node. Thus we can handle tens of millions
of transactions per second even within a single data node.

The protocol we used for this is a variant of the two-phase commit protocol with
some optimisations based on the linear two-phase commit protocol. However the
requirements on Class 5 Availability meant that we had to solve the blocking part
of the two-phase commit protocol. We solved this by recreating the state of the
failed transaction coordinators in a surviving node as part of failover handling.
This meant that we will never be blocked by a failure as long as there is still a
sufficient amount of nodes to keep the cluster operational.

Influences leading to NDB Cluster using a Shared Nothing Model

The requirements on Class 5 availability and immediate failover had two important
consequences for NDB Cluster. The first is that we wanted a fail-fast architecture.
Thus, as soon as we see any kind of inconsistency in our internal data structures we
immediately fail and rely on the failover and recovery mechanisms to make the failure
almost unnoticeable. The second is that we opted for a shared nothing model where all
replicas are able to take over immediately.

The shared disk model requires replay of the REDO log before failover is completed
and this can be made fast, but not immediate. In addition as one quickly understands
with the shared disk model is that it relies on an underlying shared nothing storage
service. The shared disk implementation can never be more available than the
underlying shared nothing storage service.

Thus it is actually possible to build a shared disk DBMS on top of NDB Cluster.

The most important research paper influencing the shared nothing model used in NDB
is the paper presented at VLDB 1992 called "Dynamic Data Distribution in a
Shared-Nothing Multiprocessor Data Store".

Obviously it was required to fully understand the ARIES model that was presented,
also in 1992, by a team at IBM. However NDB Cluster actually chose a very different
model since we wanted to achieve a logical REDO log coupled with a checkpoint
model that has actually changed a few times in NDB Cluster.

Influences leading to Asynchronous Programming Model in NDB Cluster

A number of developments were especially important in influencing the development
of NDB Cluster. I was working at Ericsson, so when I didn't work on DBMS research
I was deeply involved in prototyping the next generation telecom switches. I was the
lead architect in a project that we called AXE VM. AXE was the cash cow of Ericsson
in those days. It used an in-house developed CPU called APZ. I was involved in some
considerations on how to develop a new generation of the APZ in the
early 1990s. However I felt that the decided architecture didn't make use of modern
ideas on CPU development. This opened up the possibility of using a commercial CPU
to build a virtual machine for APZ. The next APZ project opted for a development
based on the ideas from AXE VM at the end of the 1990s. I did however at this time
focus my full attention on the development of NDB Cluster.

One interesting thing about the AXE is that it was the last single-CPU telecom switch on
the market. The reason that the AXE architecture was so successful was the
concept of blocks and signals.

The idea with blocks came from inheriting ideas from HW development for SW
development. The idea is that each block is self-contained in that it contains all the
software and data for its operation. The only way to communicate between blocks is
through signals. More modern names on blocks and signals are modules and
messages. Thus AXE was entirely built on a message passing architecture.
However to make the blocks truly independent of each other it is important to only
communicate using asynchronous signals. As soon as synchronous signals are used
between blocks, these blocks are no longer independent of each other.

I became a very strong proponent of the AXE architecture; in my mind I saw that the
asynchronous model gave a 10x improvement in performance in a large distributed
system. The block and signal model constitutes a higher entrance fee to SW
development, but it actually provides large benefits when scaling the software for new
requirements.

One good example of this is when I worked on scaling MySQL towards higher
CPU core counts between 2008 and 2012. I worked on both improving scalability of
NDB Cluster and the MySQL Server. The block and signal model made it possible to
scale the NDB data nodes with an almost lock-free model. There are very few
bottlenecks in NDB data nodes for scaling to higher number of CPUs.
The main ones that still existed have been extensively improved in NDB Cluster 8.0.20.

Thus it is no big surprise that NDB Cluster was originally based on AXE VM. This
heritage gave us some very important parts that enable quick bug fixing of
NDB Cluster. All the asynchronous messages go through a job buffer. This means
that in a crash we can print the last few thousand messages that were executed in
each thread of the crashed data node. In addition we also use a concept called
Jump Address Memory (jam). This is implemented in our code as macros that write
the line number and file number into memory such that we can track exactly how we
came to the crash point in the software.

So NDB Cluster comes from marrying the requirements on a network database for
3G networks with the AXE model that was developed in Ericsson in the 1970s.
As can be seen, this model is still going strong given that NDB Cluster is able to deliver
the best performance and highest availability of any DBMS for telecom applications,
financial applications, key-value stores and even distributed file systems.

Thus listing the most important requirements we have on the software
engineering model:

1) Fail-fast architecture (implemented through ndbrequire macro in NDB)
2) Asynchronous programming (provides much tracing information in crashes)
3) Highly modular SW architecture
4) JAM macros to track SW in crash events

Thursday, February 13, 2020

NDB Cluster, the World's Fastest Key-Value Store

Using numbers produced already with MySQL Cluster 7.6.10 we have
shown that NDB Cluster is the world's fastest Key-Value store using
the Yahoo Cloud Serving Benchmark (YCSB) Workload A.

Presentation at Slideshare.net.

We reached 1.4M operations per second using 2 Data Nodes and 2.8M operations per second
using a 4 Data Node setup. All this using a standard JDBC driver.
Obviously using a specialised ClusterJ client would improve performance
further. These benchmarks were executed by Bernd Ocklin.

The benchmark was executed in the Oracle Cloud. Each Data Node used
a DenseIO Bare Metal Server, which has 52 CPU cores and
8 NVMe drives.

The MySQL Servers and benchmark clients were executed on Bare Metal
servers with 2 MySQL Servers per server (1 MySQL Server per CPU socket).
These Bare Metal servers contained 36 CPU cores each.

All servers used Oracle Linux 7.

YCSB Workload A means that 50% of the operations are reads that read the
full rows (1 kByte in size) and 50% perform updates of one of the fields
(100 bytes in size).

Oracle Cloud contains 3 different levels of domains. The first level is that
servers are placed in different Failure Domains within the same Availability
Domain. This means essentially that servers do not rely on the same
switches and power supplies. But they can still be in the same building.

The second level is Availability Domains that are in the same region, but each
Availability Domain is failing independently of the other Availability Domains
in the same region.

The third level is regions that are separated by long distances as well.

Most applications of NDB Cluster rely on a model that uses 2 or
more NDB Clusters in different regions, with each cluster contained inside
an Availability Domain. Global replication between the NDB Clusters is then
used for fail-over when one region or Availability Domain fails.

With Oracle Cloud one can also set up a cluster with Data Nodes in
different Availability Domains. This increases the availability of the
cluster at the expense of higher latency for write operations. NDB Cluster
has configuration options to ensure that one always performs local
reads, either on the same server or at least in the same Availability/Failure
Domain.

The Oracle Cloud has the most competitive real-time characteristics of
the enterprise clouds. Our experience is that the Oracle Cloud provides
2-4x better latency compared to other cloud vendors. Thus the Oracle
Cloud is perfectly suitable for NDB Cluster.

The DenseIO Bare Metal Servers or DenseIO VMs are suitable for
NDB Data Nodes or NDB Data Nodes colocated with
MySQL Servers. These servers have excellent CPUs combined with
25G Ethernet links and extremely high-performing NVMe drives.

The benchmark reported here stores the tables as in-memory tables.
We will later report on some benchmarks where we use a slightly
modified YCSB benchmark to show numbers when we instead use
disk-based tables with much heavier update loads.

The Oracle Cloud contains a number of variants of Bare Metal servers
and VMs that are suitable for MySQL Servers and applications.

In NDB Cluster the MySQL Servers are actually stateless since all
the state is in the NDB Data Nodes. The only exception to this rule
is MySQL Servers used for replication to another cluster, which
require disk storage for the MySQL binlog.

So usually a standard server can be set up without any special extra
disks for MySQL Servers and clients.

In the presentation we show the following important results.

The latency of DBMS operations is independent of the data size. We
get the same latency when the data set has 300M rows as when it has
600M rows.

We show scaling from 4 Data Nodes in a single Availability Domain to 8 Data Nodes
with 4 Data Nodes in each of two Availability Domains. The extra distance between
Availability Domains increases the latency, and this also causes some loss in
throughput. Still, we reach 3.7M operations per second for this 8-node setup.

We show that an important decision for the cluster setup is the number of
LDM threads. These are the threads doing the actual database work. We get the
best scalability when going for the maximum number of LDM threads, which
is 32. Using 32 LDM threads can increase latency at a low number of clients,
but as the number of clients grows, the 32 LDM setup will scale much further than
the 16 LDM thread setup.

In MySQL Cluster 8.0.20 we have made more efforts to improve scaling to
many LDM threads. So we expect the performance of large installations to
scale even further in 8.0.20.

The benchmark report above gives very detailed latency numbers for various
situations. As can be seen there, we can handle 1.3M operations per second with
read latency below 1 ms and update latency below 2 ms!

Finally the benchmark report also shows the impact of various NUMA settings
on performance and latency. It is shown that interleaved NUMA settings have
a slight performance disadvantage, but since this means that we get access to the
full DRAM and the full machine, it is definitely a good idea to use this setting.
In NDB Cluster this is the default setting.

The YCSB benchmark shows NDB Cluster on its home turf, with enormous
throughput of key operations, both read and write, with predictable and low
latency.

Coupled with the high availability features that have been proven in the field
with more than 15 years of continuous operation with better than Class 6
availability, we feel confident claiming that NDB Cluster is the World's
Fastest and Most Available Key-Value Store!

The YCSB benchmark is a standard benchmark, so any competing solution
is free to challenge our claim. We used a standard YCSB client, version
0.15.0, with a standard MySQL JDBC driver.

NDB Cluster supports full SQL through the MySQL Server, and it can push joins
down to the NDB Data Nodes for parallel query filtering and joining.
NDB Cluster supports sharding transparently, and complex SQL queries
execute cross-shard joins, which most competing Key-Value stores don't
support.

One interesting example of using NDB Cluster as a Key-Value Store is HopsFS,
which implements a hierarchical file system based on Hadoop HDFS. It has been
shown to scale to 1.6M file operations per second, and small files can be stored
in NDB Cluster for low-latency access.

Monday, February 10, 2020

Benchmarking a 5 TB Data Node in NDB Cluster

Through the courtesy of Intel I have access to a machine with 6 TB of Intel
Optane DC Persistent Memory. This is memory that can be used either as
persistent memory in App Direct Mode or simply as a very large
DRAM in Memory Mode.

Slides for a presentation of this are available at slideshare.net.

This memory can be bigger than DRAM, but has somewhat different characteristics
compared to DRAM. Due to these different characteristics, all accesses to this
memory go through a cache, and here the cache is the entire DRAM in the
machine.

In the test machine there was 768 GB of DRAM acting as a cache for the
6 TB of persistent memory. When a miss happens in the DRAM cache
one has to go to the persistent memory instead. The persistent memory
has higher latency and lower throughput. Thus it is important as a programmer
to ensure that your product can work with this new memory.

What one can expect performance-wise is that performance will be similar to
using DRAM as long as the working set is smaller than DRAM. As the working
set grows one expects the performance to drop a bit, but not in a very significant
manner.
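
A simple way to reason about this is the classical average access time formula.
If $h$ is the hit rate in the DRAM cache, the expected access time is roughly

$$ t_{access} \approx h \cdot t_{DRAM} + (1 - h) \cdot t_{PMem} $$

so as long as the working set fits in the 768 GB DRAM cache ($h$ close to 1) the
behaviour is DRAM-like, and the penalty grows gradually as the miss rate increases.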

We tested NDB Cluster using the DBT2 benchmark which is based on the
standard TPC-C benchmark but uses zero latency between transactions in
the benchmark client.

This benchmark has two phases. The first phase loads the data from 32 threads,
where each thread loads one warehouse at a time. Each warehouse contains
almost 500,000 rows spread over a number of tables.

The second phase executes the benchmark, where a number of threads execute
transactions in parallel towards the database using 5 different transaction types.

The result is based on how many new order transactions can be processed per
minute. Each such transaction requires more than 50 SQL statements to be
executed, where the majority are UPDATEs and SELECT FOR UPDATEs.

Through experiments using the same machine with only DRAM it was
verified that, when running a benchmark with a working set smaller
than the DRAM size, performance was the same within a margin of a few
percent.

Next we performed benchmarks comparing results when running against a database
of almost 5 TB in size with results from a benchmark that executed only on
warehouses that fit in the DRAM cache.

Our finding was that the latency of DBT2 transactions increased by 10-12% when
using the full data set of the machine. However, the benchmark was limited by
the CPUs available to run the MySQL Servers and thus the throughput was
the same.

NDB Cluster worked like a charm during these tests. We found a minor issue in
the local checkpoint processing where we prefetched some cache lines that
weren't going to be used. This had a negative performance effect, in particular
during loading. This is fixed in MySQL Cluster 8.0.20.

This benchmark proves two things. First, that MySQL Cluster 8.0 works fine
with Intel Optane DC Persistent Memory in Memory Mode. Second, it proves
that NDB can work with very large memories; here we tested with
more than 5 TB of data in a single data node. The configuration parameter
DataMemory supports settings up to 16 TB. Beyond 16 TB there are some
constants in the checkpoint processing that would require tweaking. The
current product is designed to work very well up to 16 TB and can work
with even larger memories.

Thus, with support for up to 144 data nodes and thus 72 node groups, we can
support more than 1 PB of in-memory data. On top of this one can also
use disk data of even bigger sizes, making it possible to handle multiple
PBs of data in one NDB Cluster.
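
As a rough check of that number, assuming 2 replicas and the 16 TB DataMemory
limit per data node:

$$ \frac{144 \text{ data nodes}}{2 \text{ replicas}} = 72 \text{ node groups}, \qquad 72 \times 16\,\mathrm{TB} = 1152\,\mathrm{TB} \approx 1.1\,\mathrm{PB} $$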

Thursday, January 16, 2020

Preview of upcoming NDB Cluster benchmark


Just a fun image from running a benchmark in the Oracle Cloud. The image above
shows 6 hours of a benchmark run in a data node on a Bare Metal Server: first creating
the disk data tablespaces, next loading the data and finally running the benchmark.

During loading the network was loaded to 1.8 GByte per second and the disks were
writing 4 GByte per second. During the benchmark run the disks were writing 5 GByte per
second in addition to reading 1.5 GByte per second.

All this while the CPUs were never loaded to more than 20 percent. There are many
interesting things to consider when running benchmarks against modern disk drives.
Bottlenecks can appear in CPUs, in disk drives, in networks, and of course it is
possible to create bottlenecks in your software. But we are pretty satisfied with the
run above in that we're close to the physical limits of both network and disk drives.

Wednesday, January 15, 2020

Support 3-4 replicas in NDB Cluster 8.0

One thing that we announced in the GA release of MySQL Cluster 8.0 is
that we now support using 3 and 4 replicas. Actually the NDB software
was designed to support 1 through 4 replicas from the very get-go in the
1990s. The reason it has not been supported is simply that we haven't
had any regular testing effort attached to it. Thus we haven't felt that
we could support such an important feature without having proper testing
of it.

What we did in the development of MySQL Cluster 8.0 is that we have
added a number of new configurations with both 3 replicas and with
4 replicas. These configurations will now be used in our daily testing
of NDB.

In the process of working on this we found an issue with arbitration
already present in MySQL Cluster 7.6. This issue was fixed in 7.6.
So the way a node failure is handled in 7.6 and onwards
is the following (a small code sketch of the decision follows the list):

1) If all nodes in one node group are down then the cluster will fail.
2) If 1) is false AND at least one node group has all nodes still up
then the cluster will survive.
3) If 1) AND 2) are false AND a majority of the nodes are still up
then the cluster will survive.
4) If 1) AND 2) AND 3) are false AND a minority of the nodes are
still up then the cluster will fail.
5) If 1) AND 2) AND 3) AND 4) are false then exactly half the nodes are
still up and arbitration is required to decide whether the cluster
will stay up.
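
To make the decision order concrete, here is a minimal sketch in C++ of the rules
above. This is not the actual NDB source; the struct and the ask_arbitrator()
helper are hypothetical and only illustrate the logic.

#include <cstdint>

struct ClusterState {
  uint32_t alive_before_failure;   // nodes alive before the failure
  uint32_t alive_after_failure;    // nodes still alive after the failure
  bool any_node_group_fully_down;  // rule 1
  bool any_node_group_fully_up;    // rule 2
};

// Hypothetical call that contacts the arbitrator (a management or API node);
// the arbitrator answers yes to the first request only.
bool ask_arbitrator();

bool cluster_survives(const ClusterState& s) {
  if (s.any_node_group_fully_down)                          // rule 1
    return false;
  if (s.any_node_group_fully_up)                            // rule 2
    return true;
  // Majority is judged against the number of nodes alive before the failure.
  if (2 * s.alive_after_failure > s.alive_before_failure)   // rule 3
    return true;
  if (2 * s.alive_after_failure < s.alive_before_failure)   // rule 4
    return false;
  return ask_arbitrator();                                   // rule 5: exactly half left
}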

Arbitration means that we contact a predefined management node or API node
(selected dynamically by the cluster based on the configuration). This
node will say yes to the first node requesting arbitration and will
respond no to any further requests. After that a new arbitrator has to
be selected.

When deciding if a majority of the nodes are up we use the previous
number of alive nodes and compare it with the number of nodes still
up after the failure.

This means that a cluster with 4 replicas can survive a crash of
2 replicas simultaneously. After this crash has been handled we
can survive one more crash, and thus a 4 replica cluster can remain
operational even with just 1 replica left.

Our testing of 3-4 replicas found 4 bugs that have all
been fixed in MySQL Cluster 8.0. So using 3-4 replicas in 7.6
means that one can still hit any of those bugs.

Properties of a 3 replica cluster

Updates in a 3 replica cluster take a bit more time. NDB employs a linear
commit protocol across the replicas; this means that each additional replica
adds linearly to the latency of the transaction. So one should expect at least
50% longer latency for write operations and commits when using 3 replicas
compared to when using 2 replicas.
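
As a rough back-of-the-envelope model, if each replica in the linear commit chain
contributes a latency $t$, then the write latency with $R$ replicas grows as

$$ T_{write}(R) \approx R \cdot t \qquad\Rightarrow\qquad \frac{T_{write}(3)}{T_{write}(2)} \approx \frac{3}{2} $$

which matches the roughly 50% longer write latency expected with 3 replicas. This is
a simplified model; actual numbers depend on the workload.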

3 replicas means 50% more CPU bandwidth available for reads when all nodes
are up.

Previous blogs I wrote already in June 2018 using 7.6 show some performance numbers
when using 3 replicas in combination with the Read Backup feature introduced
in 7.5. These blogs also show the use of LocationDomainId to ensure that reads
in cloud environments are sent to replicas within the same
Availability Domain/Zone.

16 TB data node instances in NDB Cluster 8.0

As you probably noticed MySQL Cluster 8.0 was released yesterday as a GA release.
One important thing that we changed in the GA release is that we made it possible to
set DataMemory to up to 16 TB.

We are currently working with Intel on machines that can handle 6 TB of memory,
and using these machines we have already been able to load more than 5 TB of user
data in the DBT2 benchmark. We will publish more details on the results of those
benchmark investigations later.

Given that we have also improved support for disk data so much in MySQL Cluster 8.0,
it is perfectly sensible to store tens of TB of data in each data node and
even up to 100 TB. So this means that a cluster with 144 data nodes would be able to
store all the way up to 5 PB of data even with 3 replicas.
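
A quick check of that figure, assuming roughly 100 TB per data node and 3 replicas:

$$ \frac{144 \text{ data nodes}}{3 \text{ replicas}} = 48 \text{ node groups}, \qquad 48 \times 100\,\mathrm{TB} \approx 4.8\,\mathrm{PB} $$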

The changes that made this possible were actually already done in MySQL Cluster 7.6
as part of the development of Partial LCP, which was a key feature in 7.6.
Now that machines with more than 1 TB of memory become common, it is
important for MySQL Cluster to support even larger memories.

Partial LCP was developed to handle up to 16 TB of DataMemory
efficiently. It will still work with an even bigger DataMemory, but supporting larger
memories efficiently would require some minor changes in the product.