Tuesday, October 06, 2020

Adaptive CPU spinning in NDB Cluster

 CPU spinning is something that has been around in NDB for a long time. This CPU

spinning has been static which means that each time a thread wakes up it will spin

waiting for incoming signals. Thus if spin time is set to e.g. 500 microsecond we will

spin 5% of the time even when there is no activity since each thread wakes up each

10 ms to perform timer based activities.


In the new spinning approach we wanted to achieve the following.


- Power efficient CPU spinning

  NDB Cluster data nodes can execute queries very efficiently, this leads to efficient

  power usage. However CPU spinning requires careful control to ensure that one

  doesn't  waste power while spinning.

- No spinning unless there is something to gain from it

  Static spinning will always spin independent of whether it makes sense.

  With the new approach we gather statistics about the usefulness of spinning. This

  means that when the thread goes to sleep we will collect statistics on the sleep

  time when the thread wakes up again. With this information we are able to calculate

  the gains from spinning. We also calculate the cost of going to sleep and waking up

  again (this cost is higher usually in virtual machines compared to bare metal servers).


One important part of CPU spinning is what we do while spinning. In x86 and ARM

CPUs there exists CPU instructions for spinning. While spinning we will spend around

one microsecond in the spinning state before checking for input again. Going to sleep

and waking up again can consume between 10 and 50 microseconds dependent on

OS, hardware and whether we are running in a VM, in Docker and so forth. These

special assembler instructions will ensure that the CPU doesn't use so much power

and in hyperthreaded CPUs the other hyperthread(s) will get almost exclusive access

to the CPU core while spinning is performed.


The adaptive part of our new CPU spinning is that we will keep track of the workload

and its ability to benefit from CPU spinning. We do so through a number of

configuration parameters. However for ease of use we have only 4 variants that one

can configure.


The configuration parameter SpinMethod is used to set the variant of CPU spinning to

use. The following settings are available:


- CostBasedSpinning

This is the default setting. This means that we will only spin when we actually benefit

from it. This means that if we spend 10 microseconds spinning we will gain 10

microseconds from spinning in avoiding the cost of going to sleep and waking up.

This is a conservative setting that ensures good power efficiency and avoids issues

in situations where the data nodes executes in a shared environment.


- LatencyOptimisedSpinning

This setting is a very good trade off between best latency and best power efficiency.

Here we will spin if 10 microseconds of spinning can gain 1-2 microseconds of

avoided costs of going to sleep and waking up. This is recommended to use in cases

where the data nodes executes in an exclusive environment, e.g. in its own virtual

machine, Docker container or even exclusively using the entire server. This provides

a very distinct advantage in latency of queries as we will show in a later blog.


- DatabaseMachineSpinning

This setting provides the optimal latency at the expense of decreased power

efficiency. This setting is intended for use cases where latency is of outmost important

and one is willing to sacrifice some extra power to achieve this.


-StaticSpinning

This disables the adaptive spinning and makes NDB Cluster backwards compatible.

Adaptive checkpoints of Disk Data columns in NDB Cluster

 In MySQL Cluster 8.0.20 we made it possible to write to disk data columns

at considerably higher write rates than before. In MySQL Cluster 7.6 we

introduced adaptive checkpointing of in-memory rows to ensure that restarts

can be faster.


In order to handle very high write rates (up to more than 1 GByte per second)

on disk data columns in NDB it is necessary to control write rates more

efficiently.


To get the most efficient use of disk bandwidth it is important to ensure that

write rates are kept fairly constant such that we constantly make use of the

disk bandwidth. If the load on the disks goes up and down too fast we will not

be able to make use of all the available bandwidth in the disks.


With modern NVMe devices found in e.g. the Oracle Cloud one can achieve 10-20

GByte of reads and writes to a set of 8 NVMe devices using specialised disk

benchmarks. Achieving this with a DBMS is not as easy, but we were able to

see NDB use 7 GBytes per second of reads and writes in some YCSB benchmarks.


This represents one more step in the direction of adaptive algorithms in NDB

Cluster to ensure that users always will get the optimal behaviour of their

hardware when using NDB.


Some previous steps are:

- Adaptive send assistance

  This feature was introduced in 7.5 that made it possible for all threads

  to assist the send threads to send messages on sockets. The main input to

  the adaptive algorithm is the amount of CPU time spent in each thread for

  normal execution, sleeping, assisting send threads.

- Adaptive in-memory checkpoint speed

  This feature adapts the speed of checkpoints based on the amount of write

  activity in NDB Cluster.

- Adaptive CPU spinning

  This is introduced in 8.0.20 and will be presented in separate blog.

Multi-socket feature in NDB Cluster

 In MySQL Cluster 8.0.20 a new feature was added that enables higher write rates

in NDB Cluster. In previous releases the communication between nodes in the same

node group was handled by one socket. This single socket was able to handle about

200-300 MBytes of write traffic between the replicas inside one node group. Thus

in clusters with many node groups the write rates can be much higher.


For most applications this doesn't represent any problem at all. However there are

certain applications that require wast amount of writes.


One such application is when NDB is used to store file data. File writes can easily

go beyond 200-300 MBytes per second. Another application is when NDB is used to

store data from measurements, stock data or other applications that require millions of

writes per second.


The solution to this problem is to simply use multiple sockets when communicating

between nodes in the same cluster. This removes the bottleneck on write rates

between nodes in the same node group. The solution is applied by default since we

haven't seen any negative effects of using it. The number of sockets will be

derived from the number of LDM threads used in the data nodes.


We have verified the feature through the use of YCSB where the data is stored in

a disk data column. This is a good representation of writes of large amounts of

file data to NDB.


The second type of verification is through DBT2, an open source variant of TPC-C

where each benchmark thread starts a new operation immediately after completing

the previous transaction. This benchmark was previously limited by the single

socket such that scaling beyond 6 LDM threads didn't improve numbers at all.

With this change we can see scaling to full bare metal servers in the Oracle Cloud

with 52 CPU cores.


Reports of those benchmarks will be in separate blogs.

Wednesday, August 12, 2020

Setting up NVMe drives on Oracle Cloud for NDB Cluster

 In a blog post I posted on the 16th of January I showed some graphs of CPU

usage, network usage and disk usage. The benchmark that was running was

a modified variant of YCSB (Yahoo Cloud Serving Benchmark) based on version

0.15.0 of YCSB.


In this blog post I will describe the setup of the NVMe drives for this

benchmark using DenseIO machines in the Oracle Cloud.


Oracle Cloud has a variety of machines available. In this benchmark we wanted

to show NDB with a database size of around 20 TByte of user data in a replicated

setup.


There are numerous ways to connect disk drives to Oracle Cloud machines. One

manner is to use block storage. In this case the actual storage is on separate storage

servers and all disk traffic requires using the network. The other option is to

use DenseIO machines that has 8 NVMe drives with a total of 51.2 TByte of disk

space.


In this benchmark we opted for the DenseIO machines given that they can handle more

than twice the disk load compared to the block storage. Block storage is limited by

the Ethernet connection for block storage at 25 Gb per second. In addition the Oracle

Cloud has limits on the amount of IOPS that are allowed for block storage. So in our

case the DenseIO machines was the obvious candidate.


We used the Bare Metal variant called BM.DenseIO2.52.


This DenseIO machine has the following HW setup. It has 2 CPU sockets that each

contain an Intel Xeon Platinum 8167 that each have 26 CPU cores for a total of

52 CPU cores with 104 CPUs (my terminology is CPUs are contained in CPU cores

that are contained in CPU sockets that are contained in a computer/server).

It is equipped with 768 GB of memory. In addition it has 8 NVMe drives,

each with 6.4 TByte of disk space. Each such disk is capable of achieving around

1-2 GB/sec read/write speed dependent on workload.


An NDB data node requires disks for the following things:

  1. Checkpoints of in-memory data
  2. REDO log for in-memory data and disk data records
  3. UNDO log for disk data records
  4. Tablespaces for disk data records


The first 3 use cases are all write-only in normal operation and read-only in recovery.

Thus they mainly require disk bandwidth to write sequential writes of sizes that make

use of the disk bandwidth in an optimal manner. The in-memory data is in this case a

very small part of the disk load.


We strive to reach beyond 1 GByte per second insert and update rates for each data

node. For insert loads the the main load happens in checkpoint the tablespaces for

disk data records and writing the REDO log for disk data records.


For update loads there will also be a fair amount of UNDO log records to write. Thus

the worst case for the first three parts is when performing updates since then we have

to write the record to both the REDO log and to the UNDO log.


So now we come to the point of how to setup the 8 NVMe drives. The OCI provides

those 8 devices as bare devices and so we need to decide how to set those drives

up to get a proper filesystem.


One approach would be to simply create one file system with all 8 drives. Given that

the first 3 use cases are completely sequential in nature and the tablespace is making

lots of small read and writes, we opted to split at least into 2 file systems.


At most the sequential part will generate 2-3 GByte per second of disk writes and

thus 2 of the 8 NVMe drives will handle this nicely.


The commands to create this file system is the following:

#Make sure the mdadm tool is installed

sudo yum install mdadm -y


#Create a RAID0 device using the last 2 NVMe devices with chunk size of 256 kByte since

#most writes are fairly large sequential writes

sudo mdadm --create /dev/md0 --chunk=256 --raid-devices=2 \

           --level=raid0 /dev/nvme6n1 /dev/nvme7n1


#Create an XFS file system on this device

#Block size is 4kB since this is NVMe devices that has no reason to have smaller block size

#The stripe unit 256 kByte aligned to the chunk size, there are many parallel writes going on,

#so no need to parallelise

#With 2 disks our stripe width is 2 * 256 kByte thus 512 kB

sudo mkfs.xfs -b size=4096 -d sunit=512,swidth=1024 /dev/md0


#Use /ndb as the directory for the sequential file system

sudo mkdir -p /ndb


#Mount the file system on this directory, no need to keep times on files and directories

sudo mount /dev/md0 /ndb -o noatime,nodiratime


#Ensure that the Oracle Cloud user and group opc owns this directory

sudo chown opc /ndb

sudo chgrp opc /ndb


Now we have a directory /ndb that we can use for the NDB data directory


The next step is to setup the file system for the tablespace data for disk records.

One variant to do this is to follow the above scheme with a smaller chunk size and

stripe unit.


This was also what I did as my first attempt. This had some issues. It created a

tablespace of 6 * 6.4 TByte. I created a tablespace file of around 35 TByte. Everything

went well in loading the data, it continued to run will during the benchmark for a few

minutes. But after some time of running the benchmark performance dropped to half.


The problem is that NVMe devices and SSD devices need some free space to prepare

for new writes. At the disk write speeds NDB achieves in this benchmark,

the NVMe devices simply couldn't keep up with preparing free space to write into.

Thus when the file system was close to full the performance started dropping.

It stabilised at around half the workload it could handle with a non-full tablespace.


So what to do? Google as usual found the answer. The solution was to use the tool

parted to ensure that 40% of the disk space is not usable by the file system and thus

always available for new writes. This gave the NVMe devices sufficient amount of

time to prepare new writes even at benchmarks that ran for many hours with

consistent load of many GBytes per second of disk writes.


Obviously the more disk space that is removed from file system usage, the better disk

bandwidth one gets, but also there is less disk space available for the application.

In this case I was running a benchmark and wanted the optimal performance and then

using 60% of the disk space for the file system.


Using the full disk space cut performance in half, probably most of the performance

would be available also with 80% available for file system usage.


Here I also decided to skip using the RAID of the devices. Instead I created 6 different

file systems. This will work on MySQL Cluster 8.0.20 where we will spread the use of

the different tablespace files on a round robin basis.


So here are the commands to create those 6 file systems.

#Use the parted tool to only allow 60% of the usage for the file system

sudo parted -a opt --script /dev/nvme0n1 mklabel gpt mkpart primary 0% 60%

sudo parted -a opt --script /dev/nvme1n1 mklabel gpt mkpart primary 0% 60%

sudo parted -a opt --script /dev/nvme2n1 mklabel gpt mkpart primary 0% 60%

sudo parted -a opt --script /dev/nvme3n1 mklabel gpt mkpart primary 0% 60%

sudo parted -a opt --script /dev/nvme4n1 mklabel gpt mkpart primary 0% 60%

sudo parted -a opt --script /dev/nvme5n1 mklabel gpt mkpart primary 0% 60%


#Create 6 file systems with each 4kB blocks and stripe size 256 kByte

sudo mkfs.xfs -b size=4096 -d sunit=512,swidth=512 /dev/nvme0n1p1

sudo mkfs.xfs -b size=4096 -d sunit=512,swidth=512 /dev/nvme1n1p1

sudo mkfs.xfs -b size=4096 -d sunit=512,swidth=512 /dev/nvme2n1p1

sudo mkfs.xfs -b size=4096 -d sunit=512,swidth=512 /dev/nvme3n1p1

sudo mkfs.xfs -b size=4096 -d sunit=512,swidth=512 /dev/nvme4n1p1

sudo mkfs.xfs -b size=4096 -d sunit=512,swidth=512 /dev/nvme5n1p1


#Create the 6 directories for the tablespace files

sudo mkdir -p /ndb_data1

sudo mkdir -p /ndb_data2

sudo mkdir -p /ndb_data3

sudo mkdir -p /ndb_data4

sudo mkdir -p /ndb_data5

sudo mkdir -p /ndb_data6


#Mount those 6 file systems

sudo mount /dev/nvme0n1p1 /ndb_data1 -o noatime,nodiratime

sudo mount /dev/nvme1n1p1 /ndb_data2 -o noatime,nodiratime

sudo mount /dev/nvme2n1p1 /ndb_data3 -o noatime,nodiratime

sudo mount /dev/nvme3n1p1 /ndb_data4 -o noatime,nodiratime

sudo mount /dev/nvme4n1p1 /ndb_data5 -o noatime,nodiratime

sudo mount /dev/nvme5n1p1 /ndb_data6 -o noatime,nodiratime


#Move ownership of the file systems to the Oracle Cloud user

sudo chown opc /ndb_data1

sudo chgrp opc /ndb_data1

sudo chown opc /ndb_data2

sudo chgrp opc /ndb_data2

sudo chown opc /ndb_data3

sudo chgrp opc /ndb_data3

sudo chown opc /ndb_data4

sudo chgrp opc /ndb_data4

sudo chown opc /ndb_data5

sudo chgrp opc /ndb_data5

sudo chown opc /ndb_data6


When NDB has been started the following commands are used to create the

tablespace on these file system. These commands will take some time since

NDB will initialise the file system to ensure that the disk space is

really allocated to avoid that we run out of disk space.


#First create the UNDO log file group

#We set the size to 512G, we have plenty of disk space for logs, so no need

#to use a very small file. We allocate 4GByte of memory for the UNDO

#log buffer, the machine have 768 GByte of memory, and the memory

#is thus abundant and there is no need to save on UNDO buffer memory.

CREATE LOGFILE GROUP lg1

ADD UNDOFILE 'undofile.dat'

INITIAL_SIZE 512G

UNDO_BUFFER_SIZE 4G

ENGINE NDB;


#Next create the tablespace and the first data file

#We set the size to more than 3 TByte of usable space

CREATE TABLESPACE ts1

ADD DATAFILE '/ndb_data1/datafile.dat'

USE LOGFILE GROUP lg1

INITIAL_SIZE 3200G

ENGINE NDB;


#Now add the remaining 5 data files

#Each of the same size

ALTER TABLESPACE ts1

ADD DATAFILE '/ndb_data2/datafile.dat'

INITIAL_SIZE 3200G

ENGINE NDB;


ALTER TABLESPACE ts1

ADD DATAFILE '/ndb_data3/datafile.dat'

INITIAL_SIZE 3200G

ENGINE NDB;


ALTER TABLESPACE ts1

ADD DATAFILE '/ndb_data4/datafile.dat'

INITIAL_SIZE 3200G

ENGINE NDB;


ALTER TABLESPACE ts1

ADD DATAFILE '/ndb_data5/datafile.dat'

INITIAL_SIZE 3200G

ENGINE NDB;


ALTER TABLESPACE ts1

ADD DATAFILE '/ndb_data6/datafile.dat'

INITIAL_SIZE 3200G

ENGINE NDB;


In the next blog post we will show how to write the configuration file for this

NDB Cluster.


This blog post showed the setup of a highly optimised setup for a key-value store

that can store 20 TB in a 2-node setup that can handle more than 1 GByte insert

rate and also around 1 GByte update speed.


From the graphs in the blog post one can see that performance is a function of the

performance of the NVMe drives and the available network bandwidth.

The CPU usage is never more than 20% of the available CPU power.